The Setup

In the style of an interview on usesthis.com, I want to give you some info about the hardware and software I'm currently using for my projects. I'll also tell you how to deploy some of that software.

What hardware do I use?

I have two machines, a desktop and a server, connected via Gigabit Ethernet. Each machine is dedicated to a particular set of tasks that I selected based on its configuration. The server is optimized for stability and runs database services 24/7, while the desktop is used for data analysis and manipulation. The server sports a "Haswell" Xeon E3 v3 processor and has 32GB of ECC RAM. OS, apps, and fresh data are kept on SSDs. In addition, I maintain a large RAID-Z2 array for archival purposes. The desktop is a Hackintosh built around an "Ivy Bridge" quad-core Core i7. It also has 32GB of RAM, non-ECC though.

The desktop rig from the inside.

And what software?

Most of the software I use is open-source. The server runs Fedora 22, the desktop Mac OS X (which, notably, is not open source). One thing I particularly like about OS X is its built-in transparent memory compression. It comes in quite handy at times, even with 32GB of RAM. Fedora, on the other hand, has a convenient web UI called Cockpit that gives me a lot of control over system services and Docker containers.

Oh yeah, did I mention that I use Docker? Docker is a container-based virtualization solution, and its public registry offers thousands of pre-packaged applications for easy deployment. It's a neat and fun way to check out new (and potentially unstable) software plus its dependencies without messing up your base OS installation.

Cockpit shows me which Docker containers are running and what resources they are consuming.

MongoDB

For my current project, I rely on MongoDB as my data warehouse. MongoDB is a NoSQL database, and, unlike relational databases, it is built around the philosophy of dynamic schemas, which allows me to make up and change the way I store data as I go along. It couldn't be any other way, though, because MongoDB has a serious restriction in that it doesn't support joins.

Let me tell you quickly how I deployed a MongoDB 3.0 instance on my server. I followed the Performance Best Practices for MongoDB white paper where I could. My setup is simple and doesn't make use of replication or sharding.

Step 1: Prepare the server.

Luckily, most of my working datasets (and their indexes) fit in RAM. However, since I wanted to cut down the latency associated with some of my more write-heavy operations, I decided to put everything on SSDs. (Note that, even though SSDs alleviate the I/O bottleneck of spinning hard drives, they remain much slower than RAM.) So the first step is to figure out the partitioning and the file system. I use the mature and I/O-optimized XFS, mounted at /srv/docker/mongodb.

MongoDB works best with transparent hugepages turned off. The default for Fedora is to have them enabled. To disable them permanently, I created two files: the script /usr/lib/systemd/scripts/transparent-hugepages.sh and the service file /usr/lib/systemd/system/transparent-hugepages.service. Their contents, respectively, are:

#!/bin/sh

if /usr/bin/test -f /sys/kernel/mm/transparent_hugepage/khugepaged/defrag; then
  /usr/bin/echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
fi

if /usr/bin/test -f /sys/kernel/mm/transparent_hugepage/defrag; then
  /usr/bin/echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi

if /usr/bin/test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  /usr/bin/echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi

and

[Unit]
Description=Turn off transparent hugepages

[Service]
Type=oneshot
ExecStart=/usr/lib/systemd/scripts/transparent-hugepages.sh

[Install]
WantedBy=multi-user.target

As root, I then typed

[root@server ~]# chmod 755 /usr/lib/systemd/scripts/transparent-hugepages.sh
[root@server ~]# systemctl enable transparent-hugepages
[root@server ~]# systemctl start transparent-hugepages

to start the service.
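
To double-check that the service did its job, the active settings can be read back from sysfs. The kernel marks the active choice with square brackets, as in "always madvise [never]". The following is only a sketch of such a check; the function names are made up for illustration, and read_thp_settings assumes it runs on the Linux server itself.

```python
import re

THP_DIR = "/sys/kernel/mm/transparent_hugepage"


def active_setting(text):
    """Return the bracketed (active) value from a line such as 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_settings(base=THP_DIR):
    """Read the active 'enabled' and 'defrag' settings (Linux hosts only)."""
    settings = {}
    for name in ("enabled", "defrag"):
        with open("{}/{}".format(base, name)) as handle:
            settings[name] = active_setting(handle.read())
    return settings
```

If everything went well, read_thp_settings() should report "never" for both entries.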

Finally, I created a firewall rule for MongoDB. On Fedora, custom firewalld service definitions are stored as XML files in /etc/firewalld/services. There, I created a file called mongodb.xml reading:

<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>MongoDB 1</short>
  <description>MongoDB for Kaggle's Acquire Valued Shoppers Challenge</description>
  <port protocol="tcp" port="27017"/>
</service>

I then added the rule to the active zone and reloaded the firewall:

[root@server ~]# firewall-cmd --permanent --zone=FedoraServer --add-service=mongodb
[root@server ~]# firewall-cmd --reload
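
Whether the rule actually lets traffic through can be tested from the desktop once the MongoDB container from Step 3 is running. Here is a minimal sketch that simply tries to open a TCP connection; the helper name port_is_open is my own invention, and the host name "server" in the usage note is an assumption about how the machine is reachable on the network.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, filtered, unresolvable, or timed out: treat all as "not reachable".
        return False
    conn.close()
    return True
```

Calling port_is_open("server", 27017) should then return True. A False result means that either the firewall or the container still needs attention.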

That is all in terms of preparations.

Step 2: Get the Docker image.

There's a popular and well-maintained Docker image with MongoDB 3.0. I got it with:

[root@server ~]# docker pull docker.io/mongo:latest

You can find more information about the mongo repository on its Docker Hub page. It is based on the Debian Wheezy image.

Step 3: Create and start a Docker container over the MongoDB image.

Then, I ran:

[root@server ~]# docker run -d \
  --name mongodb_acquire_valued_shoppers_challenge \
  --net=host \
  -p 27017:27017 \
  -v /srv/docker/mongodb/acquire_valued_shoppers_challenge:/data/db \
  --privileged=false \
  mongo --storageEngine=wiredTiger

A couple of things happened here:

  1. I sent the container to the background.

  2. I gave the container a name that is not a nerdy joke.

  3. I gave the container full access to the host's network interface, barring the right to reconfigure it.

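  4. I mapped port 27017 on the Docker host to port 27017 in the container. (With --net=host, the container already shares the host's network stack, so Docker ignores published ports and this flag is redundant.)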

  5. I mapped a data directory on the Docker host, /srv/docker/mongodb/acquire_valued_shoppers_challenge, to a directory inside the container. Beforehand, that folder's SELinux context was changed to unconfined_u:object_r:svirt_sandbox_file_t:s0.

  6. Otherwise, however, I isolated the container from others and from the host.

  7. I selected the mongo image.

  8. I activated MongoDB's all-new WiredTiger storage engine that -- thanks to a combination of deep algorithm improvements and native data compression -- is more I/O efficient than the legacy MMAPv1 engine.

However, I did not configure any security measures within the container -- no database users, no SSL certificates, nothing. Of course, you may find this outrageously unsafe, but it's deployed on a private network, after all.
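
To confirm that the container really runs WiredTiger, one can ask the server itself: since MongoDB 3.0, the reply to the serverStatus command contains a storageEngine section with the engine's name. The following is a rough sketch using PyMongo, not a polished tool. The function names are hypothetical, and running_engine assumes that PyMongo is installed and that the server is reachable without authentication, as in my setup.

```python
def storage_engine_name(server_status):
    """Extract the storage engine name from a serverStatus reply (a dict)."""
    return server_status.get("storageEngine", {}).get("name")


def running_engine(host="localhost", port=27017):
    """Ask a live mongod which storage engine it runs."""
    # Imported lazily so that the helper above also works without PyMongo.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    try:
        return storage_engine_name(client.admin.command("serverStatus"))
    finally:
        client.close()
```

For the container started above, running_engine should return "wiredTiger"; "mmapv1" would mean the --storageEngine option did not take effect.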

Python, Pandas, and PyMongo

Now for the desktop. At its core, my data science software stack consists of Python, IPython Notebook, Numpy, Matplotlib, SciPy, Pandas, and the PyMongo MongoDB driver. With the MacPorts package management system, I got these up and running with

[tscholak@client ~]$ sudo xcode-select --install
[tscholak@client ~]$ sudo port selfupdate
[tscholak@client ~]$ sudo port install python27 py27-numpy py27-matplotlib py27-scipy py27-pandas py27-pymongo py27-ipython +notebook
[tscholak@client ~]$ sudo port select --set python python27
[tscholak@client ~]$ sudo port select --set ipython ipython27
[tscholak@client ~]$ export PATH=/opt/local/bin:/opt/local/sbin:$PATH
[tscholak@client ~]$ ipython notebook

The last command starts the notebook server and opens the Jupyter web UI in the default web browser, e.g. Safari.
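
To give an idea of how these pieces fit together, here is a small sketch that pulls a few fields from a MongoDB collection into a Pandas DataFrame via PyMongo. Treat it as an illustration under assumptions: the function names are made up, the database, collection, and field names in the usage note are placeholders, and the server is assumed to listen on the default port without authentication.

```python
def projection_for(fields):
    """Build a projection that returns only the given fields and drops _id."""
    projection = {"_id": 0}
    for field in fields:
        projection[field] = 1
    return projection


def load_frame(host, database, collection, fields, query=None):
    """Load matching documents into a DataFrame with one column per field."""
    # Imported lazily; both packages come from the MacPorts install above.
    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient(host, 27017)
    try:
        cursor = client[database][collection].find(query or {}, projection_for(fields))
        # list() drains the cursor, so the whole result must fit in memory.
        return pd.DataFrame(list(cursor), columns=list(fields))
    finally:
        client.close()
```

A call could look like load_frame("server", "shoppers", "transactions", ["id", "chain", "purchaseamount"]). Restricting the projection to the fields that are actually needed keeps both the network transfer and the DataFrame small.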

Hydrogen & Atom

Jupyter notebooks are great for interactive exploratory analyses. They are convenient, flexible, and give you instant gratification. However, I find they are not the best choice for hacking together complex algorithms and batch jobs. That's why I started using Hydrogen, a plugin for GitHub's Atom editor. The user experience is similar to what Light Table and Juno offer for the Julia language. Hydrogen connects to the IPython Jupyter kernel to execute any Python code and display its output directly within Atom. Since Jupyter is language agnostic, Hydrogen also works with other languages, e.g. R. Its only inconvenience is that, for it to work, you have to start Atom from the command line, e.g. via

[tscholak@client ~]$ atom awesome_algorithm.py

It's quite annoying when you find out right after clicking on the Atom icon that you forgot about this... Again...

Hydrogen runs your code directly in Atom using any Jupyter kernels you have installed. Output is displayed in a small overlay window, as can be seen above for a Pandas DataFrame object formatted as a table.

MongoHub

MongoHub is a graphical front-end for MongoDB that runs natively on OS X. It's stable and capable, but not a revelation in user experience.

MongoHub in action. Here I browse through some query results.

Monary

Monary is an alternative MongoDB driver for Python that is slowly but steadily advancing. Whereas PyMongo is implemented largely in Python, Monary is a Python wrapper for the super-fast MongoDB C driver. The biggest advantage of Monary is speed; its biggest disadvantage is its still-incomplete feature set. For instance, it doesn't do any updates, and aggregation lacks support for certain keyword arguments. It's also not as well tested as PyMongo. That being said, I use Monary quite often and haven't observed any serious problems so far.

You can get Monary like this:

$ sudo port install mercurial libbson mongo-c-driver
$ hg clone https://bitbucket.org/djcbeach/monary
$ cd monary
$ python setup.py install
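
A typical Monary query names the fields to fetch together with a type for each and gets back one NumPy array per field. The sketch below wraps that pattern. It follows the query call as shown in Monary's README, but the function names as well as the database, collection, and field names in the usage note are placeholders of my own, so take it as an illustration rather than a reference.

```python
def columns_by_name(fields, arrays):
    """Pair each field name with its column, in the order they were requested."""
    if len(fields) != len(arrays):
        raise ValueError("expected one array per field")
    return dict(zip(fields, arrays))


def query_columns(host, database, collection, fields, types, spec=None):
    """Fetch the given fields as NumPy arrays via Monary."""
    # Imported lazily so that the helper above also works without Monary.
    from monary import Monary

    with Monary(host) as client:
        arrays = client.query(database, collection, spec or {}, fields, types)
    return columns_by_name(fields, arrays)
```

For example, query_columns("server", "shoppers", "transactions", ["chain", "purchaseamount"], ["int32", "float64"]) would return a dict of two arrays that can be passed straight to pandas.DataFrame.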

What Would Be My Dream Setup?

I think it qualifies as a self-evident truth that you can never have enough RAM. This may change one day with the arrival of HP's "The Machine". Until then, however, the more RAM the better. LGA 115x motherboards max out at $32 \mathrm{GB}$. That's why both my desktop and my server will never run with more than that. If I had had the budget for it, I would have gone for an LGA 2011 board instead. These usually come with eight DIMM slots that each accept modules of up to $64 \mathrm{GB}$, for a total of $512 \mathrm{GB}$ of RAM.