If you read this, you probably heard the claim that the battery performance of the new iPhones is markedly different depending on whether its system on chip (SoC) is manufactured by TSMC or by Samsung. Early reports indicated that the TSMC chip allows for a dramatically better battery lifetime than its Samsung counterpart. Supposedly, this is due to TSMC's slight advantage in the semiconductor device fabrication process, which gives the TSMC chip a competitive edge -- although both iPhone chips are based on the same Apple A9 design and despite the fact that Samsung uses a smaller 14 nm process.

Whatever the technical explanations are for the discrepancy, when I first heard about this, I was more concerned with the statistical significance of the findings. The amount of reliable measurement data seemed to be far too small. In other words, I was not convinced that the alleged superiority of the TSMC chip over the Samsung chip was not just the result of pure chance. Moreover, I speculated that -- after the news had spread and people began to obsess over this issue -- some people secretly wanted there to be a difference, and newer test results were thus most likely filtered and biased.

Stats to The Rescue!

In order to put the claims on sound statistical footing, someone had to gather a lot of annotated battery performance data (i.e. reproducible battery benchmark data labelled by chip type) and use proper statistical tools to calculate the odds that the TSMC chip is indeed better than the Samsung one. I was determined to do exactly that, yet I found myself without access to such data. I resigned and turned to twitter to proclaim my bitterness loudly to the world.

To my astonishment, John Poole from Primate Labs answered my call and provided some Geekbench battery data for the iPhone 6S. That was Friday. That first batch of data was not separated by chip type, unfortunately. At that point in time, Geekbench did not collect that information. However, on the same day, John pushed an update for Geekbench that would remedy that issue. On Monday, he sent me another dataset with new data, this time labelled by chip type. A huge thanks to him!

So, here is what I did with that precious Geekbench data:

  • First, I defined a Bayesian model -- a mixture of Gaussians in this case, and then
  • I inferred the model's parameters from the data -- i.e. the location and the variance of the Gaussians.

To improve the inference, I used all data -- not only the labelled data in the second batch, but also the unlabelled data of the first batch. Predicting labels for the unlabelled data points was icing on the cake.

I will first reveal the results and then comment on the methodology.


Have a look at the figure below:

The image shows violin plots for the location of the Gaussians (in Geekbench's proprietary battery score units).

The statistical model I defined expresses my belief that the battery score observations are distributed according to a mixture of two distinct components, one for each iPhone chip. In particular, I assume that the observations of each chip follow one of two distinct Gaussian distributions, where the location and the variance of these Gaussians are believed to differ with the chip type. The plot shows the statistical distributions of the locations of the Gaussians that I inferred from the Geekbench data. The depicted violin shapes are kernel density plots. They reflect the degree of belief in a particular value of the locations. The violins illustrate that, given the observed data, there is remaining uncertainty with respect to these locations. However, despite this uncertainty, the locations of the two chip classes are well separated in value, and the TSMC chip clearly has the better average performance (Samsung chips are marked as N71AP in the iPhone 6S, whereas TSMC models are named N71mAP).

The discrepancy between the chips is made explicit by the following figure:

Plotted is the probability that the lift of the iPhone 6S with TSMC chip relative to its Samsung-powered counterpart is at least as large as the value on the horizontal axis.

The figure can be understood in the following way: A lift of, say, 30% of N71mAP over N71AP means that the N71mAP model has a 30% better average battery score than the N71AP. The lift is on the figure's horizontal axis. The probability of that lift (or a higher one) is measured on the vertical axis. According to the plot, there is an almost unit probability that the TSMC chip (N71mAP) is at least 30% better on average than the Samsung chip (N71AP). This probability goes down sharply for higher lift values.

These results present strong evidence that the Geekbench battery performance of the TSMC chip is on average much better than that of the Samsung chip.

I am unable to say if this performance discrepancy is relevant for daily usage. Apple claims it's not. So do several news websites that tested the iPhone 6S under "real world" conditions. However, as with the earlier battery performance claims from last week, these new studies suffer from the same small sample bias, and should be taken with a grain of salt.

It is true that Geekbench's battery test is a synthetic benchmark. It drains the battery by putting a lot of stress on the SoC -- CPU and GPU alike. It is designed to measure battery life under heavy load, e.g. while playing graphics-intensive games. Of course, one might argue that this is not representative of typical real-world usage. In fact, for the most common tasks like browsing the web and even light graphics usage, the screen has the highest battery consumption, and thus differences in battery lifetime between models with TSMC and Samsung chip will be small. This is in agreement with those new studies mentioned above. Notwithstanding, for the usage scenario tested by Geekbench (e.g. lots of games, little browsing), my analysis provides compelling evidence that the devices with TSMC chip are indeed much better. So, if this is your usage scenario, then you better make sure you get an iPhone with a SoC manufactured by TSMC.

Statistical Model

My model is a simple Bayesian Gaussian mixture model. I assume that the observed data is a sequence of random numbers drawn from a mixture of two normal distributions. Specifically, I consider the following model:

\begin{align} \left. X_n \middle| y_n, \left\{\mu_k, \sigma_k\right\} \right. & \sim \mathcal{N}\left(\mu_{y_n}, \sigma_{y_n}^2\right), \\ \left. y_n \middle| \boldsymbol{\pi} \right. & \sim \mathcal{C}\left(\boldsymbol{\pi}\right), \\ \left. \boldsymbol{\pi} \middle| \alpha \right. & \sim \mathcal{D}\left(\alpha, \ldots, \alpha\right), \\ \mu_k & \sim \mathcal{U}\left(-\infty, \infty\right), \\ \sigma_k^{-1} & \sim \mathcal{U}\left(0, \infty\right), \end{align}

where $k = 1, \ldots, K$. The number $K$ is the number of classes, and it is fix. Since there are two chip types, I set $K = 2$. The observations $X_n$, $n = 1, \ldots, N$, are conditionally independent. Each one is believed to be drawn from a normal distribution with mean $\mu_{y_n}$ and variance $\sigma_{y_n}^2$. Furthermore, each data point is a member of a single category $y_n \in \{1, \ldots, K\}$. These indicator variables are distributed according to a categorical distribution, $\mathcal{C}\left(\boldsymbol{\pi}\right)$, where the parameter $\boldsymbol{\pi}$ is a $K$-simplex: it is a vector of $K$ probabilities $\pi_k$ with $0 \le \pi_k \le 1$ that sum up to one, i.e. $\sum_{k=1}^K \pi_k = 1$. The simplex is given a symmetric Dirichlet prior with concentration parameter $\alpha > 0$. I set $\alpha$ to $\frac{1}{2}$. Lastly, I put independent priors on $\mu_k$ and $\sigma_k$. I did this because I think that the location of the mean should be uninformative of the variance (and vice versa). The priors are furthermore flat. Letting $\sigma_k^{-1}$ be uniformly distributed on the positive real axis implements a Jeffreys prior for $\sigma$ (Kass and Wasserman 1996).

For Bayesian analyses, it is common to compute the posterior distribution by means of MCMC sampling. I used PyStan to explore the parameter space, which is why I had to write down my entire model in the Stan programming language (Stan Modeling Language Users Guide and Reference Manual, Version 2.8.0 2015). I coded it as follows:

data {
    // training data
    int<lower=1> K;             // number of classes
    int<lower=1> D;             // number of features
    int<lower=0> N;             // number of labelled examples
    int<lower=1,upper=K> y[N];  // class for labelled example
    vector[D] x[N];             // features for labelled example
    // test data
    int<lower=0> Np;            // number of unlabelled examples
    vector[D] xp[Np];           // features for unlabelled example
    // hyperparameters
    vector<lower=0>[K] alpha;   // class prior
parameters {
    simplex[K] pi;                   // class prevalence
    vector[D] mu[K];                 // locations of features
    vector<lower=0>[D] invsigma[K];  // inverse scales of features
transformed parameters {
    vector<lower=0>[D] sigma[K];     // scales of features
    vector[K] z[Np];
    for (k in 1:K)
        sigma[k] <- rep_vector(1, D) ./ invsigma[k];
    for (np in 1:Np)
        for (k in 1:K)
            z[np, k] <- normal_log(xp[np], mu[k], sigma[k])
                        + categorical_log(k, pi);
model {
    // priors
    pi ~ dirichlet(alpha);
    for (n in 1:N)
        y[n] ~ categorical(pi);
    // likelihoods
    for (n in 1:N)
        x[n] ~ normal(mu[y[n]], sigma[y[n]]);
                                     // diagonal covariance matrix
    for (np in 1:Np)
generated quantities {
    vector[K] sm[Np];
    int<lower=1,upper=K> yp[Np];
    for (np in 1:Np) {
        sm[np] <- softmax(z[np]);
        yp[np] <- categorical_rng(sm[np]);

Note that the model is defined such that it can be used in cases with more than one feature, but only if the naive independence assumption between the features applies.

The above implementation allows me to use both labelled and unlabelled data for the parameter inference. To this end, I assume that the labels of the unlabelled dataset are missing completely at random. A conceptually very similar implementation is described in section 13.3 of version 2.8.0 of the Stan manual (2015). The nice thing about this idea is that all data -- not only the labelled, but also the unlabelled data -- contributes to the parameter inference. Since I had only 29 labelled data points to work with, this really helped to improve the results.

In the end, I added the generated quantities block in order to draw values for the missing labels yp. In other words, I can predict the manufacturer of an iPhone's SoC based on its battery test score. What I created can perhaps be described as a semi-supervised version of the common naive Bayes classifier.


This analysis would not have been possible without the help of John Poole from Primate Labs, the software company behind Geekbench!


  1. Stan Modeling Language Users Guide and Reference Manual, Version 2.8.0, 2015. Available at:
      date-added = {2015-10-14 06:17:46 +0000},
      date-modified = {2015-10-14 06:18:05 +0000},
      title = {Stan Modeling Language Users Guide and Reference Manual, Version 2.8.0},
      url = {},
      year = {2015},
      bdsk-url-1 = {}
  2. Kass, Robert E, and Larry Wasserman, 1996, β€œThe Selection of Prior Distributions by Formal Rules,” Journal of the American Statistical Association 91 (Taylor & Francis Group), 1343–1370.
      author = {Kass, Robert E and Wasserman, Larry},
      date-added = {2015-10-14 05:42:46 +0000},
      date-modified = {2015-10-14 05:43:15 +0000},
      journal = {Journal of the American Statistical Association},
      number = {435},
      pages = {1343--1370},
      publisher = {Taylor \& Francis Group},
      title = {The selection of prior distributions by formal rules},
      volume = {91},
      year = {1996}