Bayesian Machine Learning -- Reasoning under Uncertainty with Edward

People are faced daily with situations involving uncertain outcomes. However, in these situations, we often make impulsive decisions, or risky decisions that are based solely on our past experience. These situations show what truly terrible decision makers we can be. What is true for us in daily life is also true for executives who have to make decisions that are critical to their business. Businesses have a lot of data, but what they typically lack are tools to use that data for rational decision making.

Torsten Scholak on Bayesianism, Talk, and Edward


If you are reading this, you have probably heard the claim that the battery performance of the new iPhones differs markedly depending on whether their system on chip (SoC) is manufactured by TSMC or by Samsung. Early reports indicated that the TSMC chip allows for a dramatically better battery life than its Samsung counterpart. Supposedly, this is due to TSMC's slight advantage in the semiconductor device fabrication process, which gives the TSMC chip a competitive edge -- although both iPhone chips are based on the same Apple A9 design and despite the fact that Samsung uses a smaller 14 nm process.

Torsten Scholak on Bayesianism, Apple, and Chipgate

A Dockerized Login Server for Docker Services

I needed a way to securely access my MongoDB instance running in one of my Docker containers from the stupendous wilds of the Internet. This had to overcome the fact that the server with the MongoDB instance is behind a firewall with NAT.

Torsten Scholak on Docker, MongoDB, and FreeIPA

A $K$-Means Odyssey

In this article, I tell you how I

  1. implemented two versions of the tf-idf statistic,

  2. applied these statistics to a certain consumer-item transaction dataset,

  3. created the spherical mini-batch $k$-means implementation that you can find on my GitHub repository,

  4. deployed spherical $k$-means clustering on the consumer-item data, and

  5. trained models with different $k$.

I call this an odyssey, because, as I was exploring, coding, and waiting for results, it occurred to me that I was lost on a journey to nowhere, and that, ultimately, all this wouldn't help me with my project (for reasons explained here). Yet, I was compelled to write about this futile adventure, because someone -- maybe you! -- might find it useful.
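The steps listed above can be sketched in a few lines of NumPy. This is only a minimal illustration and not the implementation from the GitHub repository: it uses one plain tf-idf variant (the article implements two), it runs full-batch rather than mini-batch updates, and the function names `tfidf`, `normalize_rows`, and `spherical_kmeans` are made up for this example.

```python
import numpy as np


def tfidf(counts):
    """Weight a (consumers x items) count matrix by a plain tf-idf variant."""
    counts = np.asarray(counts, dtype=float)
    n_rows = counts.shape[0]
    # Document frequency: the number of consumers that bought each item.
    df = np.count_nonzero(counts, axis=0)
    idf = np.log(n_rows / np.maximum(df, 1))
    return counts * idf


def normalize_rows(x):
    """Scale every row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def spherical_kmeans(x, k, n_iter=20, seed=0):
    """Full-batch spherical k-means: cluster unit vectors by cosine similarity."""
    rng = np.random.default_rng(seed)
    x = normalize_rows(np.asarray(x, dtype=float))
    # Initialize the centroids with k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the centroid with the largest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.sum(axis=0)
        # Project the updated centroids back onto the unit sphere.
        centroids = normalize_rows(centroids)
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids
```

Training models with different $k$ then amounts to calling `spherical_kmeans(tfidf(counts), k)` in a loop over the values of $k$ of interest.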

Torsten Scholak on big data, Python, MongoDB, and market segmentation

Big-Data Work-Flow with MongoDB

For my current project, I need to be able to routinely handle tens of gigabytes of data. After a failed attempt to create a convenient work-flow with HDF5 on a single machine, I decided to store everything in a MongoDB database on my server and to have my desktop retrieve only the data that I need. I'm happy to report that I have finally concocted a system that works.

Torsten Scholak on big data, Python, and MongoDB

The Setup

In the style of an interview, I want to give you some info about the hardware and software I'm currently using for my projects. I'll also tell you how to deploy some of that software.

Torsten Scholak on update, Python, MongoDB, and Docker

Communities and Markets

I have been fairly quiet as of late. This is not because nothing has happened, but rather because I have been (and still am) busy working on a data science project in which I try something new and exciting. Recently, I was out of town for a couple of days traveling the Maritimes, and I found some time to reflect and to start writing a series of articles about this particular project. This is the first.

In my new project, I am studying the segmentation of large markets by identifying communities of consumers with similar shopping habits. This is an effort to improve recommendation systems and to allow for better business opportunities.

Torsten Scholak on big data, graphs, community detection, and market segmentation

Unfold with unfoldr

unfoldr is the second program of my complex systems software suite that I open-source. (histogramr was the first one.) For those of you who study the spectral features of complex systems such as random networks, I think unfoldr will prove quite useful. Given the eigenvalues of an ensemble of random matrices, unfoldr calculates the nearest-neighbor level spacings of the unfolded spectrum, either as a whole or for slices of it. You can specify how you want to cut the spectrum into slices (linearly or logarithmically), and unfoldr will calculate the level spacings for each slice individually. With this you can study how the level spacing statistics change with energy and whether or not there is a phase transition in the spectrum.
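To make the idea of unfolding concrete, here is a minimal NumPy sketch. It is not unfoldr's code or interface: the function name `unfolded_spacings` is hypothetical, the unfolding uses the simplest possible choice (the ensemble-averaged empirical level staircase), and slicing of the spectrum is left out.

```python
import numpy as np


def unfolded_spacings(spectra):
    """Nearest-neighbor spacings of the unfolded spectra.

    spectra: array of shape (n_realizations, n_levels), one row of
    eigenvalues per random matrix of the ensemble.
    """
    spectra = np.sort(np.asarray(spectra, dtype=float), axis=1)
    n_realizations = spectra.shape[0]
    pooled = np.sort(spectra.ravel())
    # Ensemble-averaged staircase: the mean number of levels <= E per
    # realization. Mapping each level through it makes the mean level
    # density (and hence the mean spacing) approximately one.
    unfolded = np.searchsorted(pooled, spectra, side="right") / n_realizations
    return np.diff(unfolded, axis=1)
```

The returned array has one row of spacings per realization; a histogram of its entries gives the level spacing distribution of the ensemble.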

Torsten Scholak on code and physics

Meet histogramr

histogramr is a piece of software that produces multivariate histograms from numerical data. I started working on histogramr during my PhD and have been using and improving it ever since. It has been instrumental in many of my scientific achievements (Scholak et al. 2010; Scholak et al. 2011; Scholak, Wellens, and Buchleitner 2011; Scholak, Wellens, and Buchleitner 2011; Scholak et al. 2011; Zech et al. 2013; Scholak, Wellens, and Buchleitner 2014), because it has allowed for the statistical analysis of extremely large scientific data sets.
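For readers unfamiliar with multivariate histograms, the following sketch shows the general technique with NumPy's `histogramdd`, accumulating counts chunk by chunk so that the full data set never has to fit in memory. It does not reproduce histogramr's interface; the function name `accumulate_histogram` and its parameters are assumptions made for this illustration.

```python
import numpy as np


def accumulate_histogram(chunks, bins, ranges):
    """Sum multivariate histograms over an iterable of data chunks.

    chunks: iterable of arrays of shape (n_samples, n_dims)
    bins:   number of bins per dimension, e.g. (10, 10)
    ranges: fixed (low, high) pair per dimension, so that all chunks
            share the same bin edges
    """
    total, edges = None, None
    for chunk in chunks:
        counts, edges = np.histogramdd(chunk, bins=bins, range=ranges)
        total = counts if total is None else total + counts
    return total, edges
```

Because the bin edges are fixed in advance, the chunked result equals the histogram of the concatenated data; samples outside the given ranges are not counted.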

Torsten Scholak on code and big data

Jekyll and GitHub Pages

By the time you read this, I will have tossed my old, obsolete WordPress website in favor of this one, a modern Jekyll-powered GitHub page. I have started from scratch. The old site didn't have enough content worth rescuing. Unfortunately, I never found the ambition or the time to write blog posts. Of course, this is going to change now.

Torsten Scholak on update, guide, and jekyll

Finding the right fonts

Kasper, the theme that I am using for this web site, comes with three fonts: Open Sans, Merriweather, and Inconsolata. I did not like these very much, so I set out to find suitable replacements.

Torsten Scholak on update

New blog!

So much still to do, so little to share (at this point). But that will change soon.

Torsten Scholak on update