Model comparison in practice

Data set download

[2]:

import numpy as np
import polars as pl

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

Loading BokehJS ...

When comparing models, posterior predictive checks are a must. As discussed in the previous lesson, we can use information criteria to assess relative predictive effectiveness of models. In order to understand this lesson, you will need have carefully read and understand the previous lesson on the theory behind model comparison.

The key quantity we compute for more comparison is the expected log pointwise predictive density, or elpd. There were a few key ideas and assumptions in using elpd.

The elpd is an approximation of the difference in the Kullback-Leibler divergence between the posterior predictive distribution and the true generative distribution.
For our set of measurements \(y = (y_1, y_2, \ldots y_N)\), where each \(y_i\) may be multidimensional (as you would have, for example, of beak length/beak depth measurements for a single finch), we assume that \(y_i\)’s are independently distributed, both in the model and in the true generative distribution.

With these assumptions, we can approximately compute the elpd using the Watanabe-Akaike information criterion (WAIC) or leave-one-out cross validation (LOO). See this paper by Vehtari, Gelman, and Gabry (arXiv version) for more details about the implementation. As described in that paper, LOO, when computed using Pareto-smoothed importance sampling, is the preferred method for computing an approximate elpd.

Importantly, the (approximate) elpd by itself is not terribly useful in assessing a model. The elpd of one prospective model needs to be compared to another. For this comparison, we can compute Akaike weights. This is the most straightforward calculation of relative weights of respective models, and perhaps easiest to understand. However, it may not be the best way to assess the predictive capabilities of a model, especially in situations where the true generative model is not known (which is often the case for us as scientists). As we think about generative models, and we are not sure which model best generates observed data, it is useful to think about model averaging if our aim is to be predictive. The idea here is that we do not know which model generates data. Instead, we try to find a combination of models that spans all of the models we are considering, that best generate the data. The respective weights of the models give their contributions to this combination of models. As a scientist, I tend to shy away from model averaging; I am rather seeking to understand how nature generates the observations I see, and nature is not trying to predict, nor average models. However, taking a model averaging approach with an eye for optimizing predictive performance leads to more robust estimates of model weights, as outlined in this paper by Yao and coworkers, which describes a technique known as stacking.

In this tutorial, we will demonstrate how to calculate model weights both by using Akaike weights (and variants thereof), and stacking. Conveniently, this may be done approximately directly from samples out of the posterior distributions. The ArviZ package provides much of the functionality we need.

An example model comparison

To show how we can do an model comparison, we will again consider the data set from Singer and coworkers, where they performed RNA FISH to determine the copy numbers of RNA transcripts of specific genes in individual cells in their samples. The data set can be downloaded here.

We will work with the Rex1 gene. In previous lessons, we considered two models. First, a model in which transcript counts are generated from a single Negative Binomial distribution. Second, a mixture model in which the transcript counts are generated from two Negative Binomial distributions. Note that this is a purely academic exercise, since the first model will fail posterior predictive checks quite spectacularly. We would never really come to this point where we needed to do a model comparison, but we proceed to demonstrate how it is done in practice.

Computing the WAIC and LOO

We will start with the WAIC. We use the scale="deviance" kwarg to get the WAIC. By default, az.waic() will return the estimate for the elpd and not the traditionally used \(-2\mathrm{elpd}_\mathrm{WAIC}\), which is why we use the scale="deviance" kwarg.

[8]:

az.waic(samples, scale="deviance")

[8]:

Computed from 4000 posterior samples and 279 observations log-likelihood matrix.

              Estimate       SE
deviance_waic  3281.21    17.32
p_waic            1.92        -

The output gives an estimate for the WAIC, and also the contribution of \(p_\mathrm{waic}\). It also gives an estimate of the standard error in the WAIC.

Let’s now compute the LOO.

[9]:

single_loo = az.loo(samples, scale="deviance")

single_loo

[9]:

Computed from 4000 posterior samples and 279 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  3281.22    17.32
p_loo            1.93        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      279  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

We see that the LOO and WAIC give almost identical results (as they should). The Pareto k diagnostic is also good, meaning that the LOO calculation does not have obvious pathologies. Remember, though, that LOO has better performance across a wider variety of models.

Calculations with the mixture model

Now, let’s do the same calculation for the mixture model. In this case, we do not have a built-in distribution to use a _logpmf function. Fortunately, the log_mix() function in Stan accomplishes exactly what we need, as it did when we added it to target in the model specification.

data {
  int<lower=0> N;
  array[N] int<lower=0> n;
}


parameters {
  vector<lower=0>[2] alpha;
  vector<lower=0>[2] b;
  real<lower=0, upper=1> w;
}


transformed parameters {
  vector[2] beta_ = 1.0 ./ b;
}


model {
  // Priors
  alpha ~ lognormal(0.0, 2.3);
  b ~ lognormal(4.6, 2.3);
  w ~ beta(1.0, 1.0);

  // Likelihood
  for (n_val in n) {
    target += log_mix(
      w,
      neg_binomial_lpmf(n_val | alpha[1], beta_[1]),
      neg_binomial_lpmf(n_val | alpha[2], beta_[2])
    );
  }
}


generated quantities {
  array[N] int n_ppc;
  array[N] real log_lik;

  // Posterior predictive checks
  for (i in 1:N) {
    if (uniform_rng(0.0, 1.0) < w) {
      n_ppc[i] = neg_binomial_rng(alpha[1], beta_[1]);
    }
    else {
      n_ppc[i] = neg_binomial_rng(alpha[2], beta_[2]);
    }
  }

  // Pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = log_mix(
      w,
      neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
      neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
  }
}

Let’s compile the model and get some samples.

[10]:

with bebi103.stan.disable_logging():
    sm_mix = cmdstanpy.CmdStanModel(stan_file="neg_binom_mix.stan")
    samples_mix = sm_mix.sample(data=data, seed=3252)

samples_mix = az.from_cmdstanpy(
    posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)

We’ll do our usual diagnostic checks and make a corner plot.

[11]:

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_mix,
        parameters=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
        xtick_label_orientation=np.pi / 4,
    )
)

ESS for parameter alpha[0] is 10.004040383981042.
tail-ESS for parameter alpha[0] is 318.7069921736581.
ESS for parameter alpha[1] is 9.836302265033044.
tail-ESS for parameter alpha[1] is 128.2393686900295.
ESS for parameter b[0] is 6.432640073574931.
tail-ESS for parameter b[0] is 126.12692598049735.
ESS for parameter b[1] is 6.210628341260978.
tail-ESS for parameter b[1] is 183.7239430399982.
ESS for parameter beta_[0] is 6.432641423716465.
tail-ESS for parameter beta_[0] is 126.12692598048358.
ESS for parameter beta_[1] is 6.210629331032064.
tail-ESS for parameter beta_[1] is 183.72394304000517.
ESS for parameter w is 6.1460975400222795.
tail-ESS for parameter w is 170.8943375375399.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.

Rhat for parameter alpha[0] is 1.3240338139575638.
Rhat for parameter alpha[1] is 1.3286803456589167.
Rhat for parameter b[0] is 1.6654700530731357.
Rhat for parameter b[1] is 1.72055885067414.
Rhat for parameter beta_[0] is 1.665470122994841.
Rhat for parameter beta_[1] is 1.7205589366050218.
Rhat for parameter w is 1.7356984105623074.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

Oof! We failed the ESS and Rhat diagnostics. Recall that sampling is tricky because of label switching. To protect against label switching, we will initialize the walker near where chain 1 sampled.

[12]:

# Compute mean of parameters for chain 1
inits = {
    "alpha": [
        float(samples_mix.posterior.sel(chain=1, alpha_dim_0=0).alpha.mean()),
        float(samples_mix.posterior.sel(chain=1, alpha_dim_0=1).alpha.mean()),
    ],
    "b": [
        float(samples_mix.posterior.sel(chain=1, b_dim_0=0).b.mean()),
        float(samples_mix.posterior.sel(chain=1, b_dim_0=0).b.mean()),
    ],
    "w": float(samples_mix.posterior.sel(chain=1).w.mean()),
}

# Take a look
inits

[12]:

{'alpha': [3.2774578420000005, 5.23979613],
 'b': [6.64700481, 6.64700481],
 'w': 0.16986432480000002}

Now we’ll use this as our initial walker positions.

[13]:

# Sample
with bebi103.stan.disable_logging():
    samples_mix = sm_mix.sample(data, inits=inits)

# Convert to ArviZ
samples_mix = az.from_cmdstanpy(
    posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_mix,
        parameters=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
        xtick_label_orientation=np.pi / 4,
    )
)


Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

Everything looks good! We can do a quick posterior predictive check.

[14]:

n_ppc = samples_mix.posterior_predictive.n_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "n_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        n_ppc,
        name="n_ppc",
        data=df["Rex1"].to_numpy(),
        x_axis_label="mRNA copy number",
    )
)

Much nicer! The model allows for plenty of flexibility to allow for the observed bimodal behavior. Let’s proceed to compute the LOO and WAIC, starting with the WAIC.

[15]:

az.waic(samples_mix, scale="deviance")

[15]:

Computed from 4000 posterior samples and 279 observations log-likelihood matrix.

              Estimate       SE
deviance_waic  3191.86    21.72
p_waic            5.77        -

And the LOO.

[16]:

mix_loo = az.loo(samples_mix, scale="deviance")

mix_loo

[16]:

Computed from 4000 posterior samples and 279 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  3191.88    21.72
p_loo            5.78        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      279  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

As expected, we get almost the same value for the two information criteria. Let’s take a quick look at the difference of the LOO’s between the mixture model and the single Negative Binomial model.

[17]:

mix_loo.elpd_loo - single_loo.elpd_loo

[17]:

-89.33459299576725

Remember that for historical reasons,

\begin{align} \text{WAIC} \approx -2\,\text{elpd}, \\[1em] \text{LOO} \approx -2\,\text{elpd}. \\[1em] \end{align}

The bigger the elpd is, the smaller the Kullback-Leibler divergence is, so the better the model is. Thus, a bigger elpd means a smaller WAIC or LOO. So, the smaller the WAIC or LOO is, the closer the model is to the true generative model. This WAIC and LOO are smaller for the mixture model than for the single Negative Binomial model, so it is a better model.

Computing the weights

We can directly compute the Akaike weights from the values of the LOO, using

\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}

[18]:

d_loo = mix_loo.elpd_loo - single_loo.elpd_loo
w_single = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_mix = 1 - w_single

print('           Mixture model weight:', w_mix)
print('Single Neg. Binom. model weight:', w_single)

           Mixture model weight: 1.0
Single Neg. Binom. model weight: 3.99245113565338e-20

In agreement with our posterior predictive checks, the mixture model is far more predictive than the single negative binomial model.

As I mentioned above, ArviZ offers more a sophisticated means of computing the weights using stacking. The results tend to be less extreme (and therefore more conservative) that directly computing the Akaike weights. We can use the az.compare() function to do the calculation. We will do it using the LOO (WAIC is default, so we use the ic kwarg). The first input is a dictionary containing the MCMC samples, where the keys of the dictionary are the names of the models.

[19]:

az.compare({"single": samples, "mix": samples_mix}, ic="loo", scale="deviance")

[19]:

	rank	elpd_loo	p_loo	elpd_diff	weight	se	dse	warning	scale
mix	0	3191.884444	5.778852	0.000000	0.983497	21.718306	0.000000	False	deviance
single	1	3281.219037	1.927518	89.334593	0.016503	17.315398	17.565504	False	deviance

The mixture model is still dominant. (Again, we are not going into the details of the stacking calculation, but you can read about it in this paper by Yao and coworkers.)

[20]:

bebi103.stan.clean_cmdstan()

Computing environment

[21]:

%load_ext watermark
%watermark -v -p numpy,polars,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
polars    : 1.2.1
cmdstanpy : 1.2.4
arviz     : 0.18.0
bokeh     : 3.4.1
bebi103   : 0.1.21
jupyterlab: 4.0.13

cmdstan   : 2.35.0