Model comparison in practice
[2]:
import numpy as np
import pandas as pd
import cmdstanpy
import arviz as az
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
/Users/bois/miniconda3/envs/bebi103_build/lib/python3.11/site-packages/dask/dataframe/_pyarrow_compat.py:23: UserWarning: You are using pyarrow version 11.0.0 which is known to be insecure. See https://www.cve.org/CVERecord?id=CVE-2023-47248 for further details. Please upgrade to pyarrow>=14.0.1 or install pyarrow-hotfix to patch your current version.
warnings.warn(
When comparing models, posterior predictive checks are a must. As discussed in the previous lesson, we can use information criteria to assess relative predictive effectiveness of models. In order to understand this lesson, you will need have carefully read and understand the previous lesson on the theory behind model comparison.
The key quantity we compute for more comparison is the expected log pointwise predictive density, or elpd. There were a few key ideas and assumptions in using elpd.
The elpd is an approximation of the difference in the Kullback-Leibler divergence between the posterior predictive distribution and the true generative distribution.
For our set of measurements \(y = (y_1, y_2, \ldots y_N)\), where each \(y_i\) may be multidimensional (as you would have, for example, of beak length/beak depth measurements for a single finch), we assume that \(y_i\)’s are independently distributed, both in the model and in the true generative distribution.
With these assumptions, we can approximately compute the elpd using the Watanabe-Akaike information criterion (WAIC) or leave-one-out cross validation (LOO). See this paper by Vehtari, Gelman, and Gabry (arXiv version) for more details about the implementation. As described in that paper, LOO, when computed using Pareto-smoothed importance sampling, is the preferred method for computing an approximate elpd.
Importantly, the (approximate) elpd by itself is not terribly useful in assessing a model. The elpd of one prospective model needs to be compared to another. For this comparison, we can compute Akaike weights. This is the most straightforward calculation of relative weights of respective models, and perhaps easiest to understand. However, it may not be the best way to assess the predictive capabilities of a model, especially in situations where the true generative model is not known (which is often the case for us as scientists). As we think about generative models, and we are not sure which model best generates observed data, it is useful to think about model averaging if our aim is to be predictive. The idea here is that we do not know which model generates data. Instead, we try to find a combination of models that spans all of the models we are considering, that best generate the data. The respective weights of the models give their contributions to this combination of models. As a scientist, I tend to shy away from model averaging; I am rather seeking to understand how nature generates the observations I see, and nature is not trying to predict, nor average models. However, taking a model averaging approach with an eye for optimizing predictive performance leads to more robust estimates of model weights, as outlined in this paper by Yao and coworkers, which describes a technique known as stacking.
In this tutorial, we will demonstrate how to calculate model weights both by using Akaike weights (and variants thereof), and stacking. Conveniently, this may be done approximately directly from samples out of the posterior distributions. The ArviZ package provides much of the functionality we need.
An example model comparison
To show how we can do an model comparison, we will again consider the data set from Singer and coworkers, where they performed RNA FISH to determine the copy numbers of RNA transcripts of specific genes in individual cells in their samples. The data set can be downloaded here.
We will work with the Rex1 gene. In previous lessons, we considered two models. First, a model in which transcript counts are generated from a single Negative Binomial distribution. Second, a mixture model in which the transcript counts are generated from two Negative Binomial distributions. Note that this is a purely academic exercise, since the first model will fail posterior predictive checks quite spectacularly. We would never really come to this point where we needed to do a model comparison, but we proceed to demonstrate how it is done in practice.
Computing the pointwise log likelihood
Recalling from lecture, the elpd is the the logarithm of the posterior predictive distribution, \(f(\tilde{y}_i\mid y)\), averaged over the true generative distribution \(f_t(\tilde{y}_i)\).
\begin{align} \text{elpd} = \sum_{i=1}^N\int\mathrm{d}\tilde{y}_i\,\,f_t(\tilde{y}_i)\,\ln f(\tilde{y}_i\mid y). \end{align}
When generating our samples, we therefore also need to compute samples of the value of the log likelihood. To do this, for each set of parameters \(\theta\) that we sample out of the posterior, we compute the pointwise log likelihood of the data set, using the parameters \(\theta\). The Stan code below includes these log likelihood samples as well as posterior predictive checks for the single Negative Binomial model. Read the code carefully.
data {
int<lower=0> N;
array[N] int<lower=0> n;
}
parameters {
real log10_alpha;
real log10_b;
}
transformed parameters {
real alpha = 10^log10_alpha;
real b = 10^log10_b;
real beta_ = 1.0 / b;
}
model {
// Priors
log10_alpha ~ normal(0, 1);
log10_b ~ normal(2, 1);
// Likelihood
n ~ neg_binomial(alpha, beta_);
}
generated quantities {
array[N] int n_ppc;
array[N] real log_lik;
// Draw posterior predictive data set
for (i in 1:N) {
n_ppc[i] = neg_binomial_rng(alpha, beta_);
}
// Compute pointwise log likelihood
for (i in 1:N) {
log_lik[i] = neg_binomial_lpmf(n[i] | alpha, beta_);
}
}
In the array log_lik
, I store the pointwise log likelihood. That is, for each measurement (in this case for each \(n_i\)), I compute the log likelihood for that data point using the parameters (in this case alpha
and beta_
) that I sampled out of the posterior. Conveniently, Stan’s distributions all have a function that ends in _lpdf
that compute the log probability density function for the distribution (with _lpmf
for discrete distributions that computes the log
probability mass function for the distribution).
Let’s sample out of this generative model, keeping samples of posterior predictive data sets and pointwise log likelihoods.
[3]:
# Load data and make data dictionary
df = pd.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment="#")
data = {"N": len(df), "n": df["Rex1"].values.astype(int)}
with bebi103.stan.disable_logging():
# Compile the model
sm = cmdstanpy.CmdStanModel(stan_file="neg_binom.stan")
# Perform sampling
samples = sm.sample(data=data)
When we convert the samples to an ArviZ object, we need to specify the posterior predictive and log likelihood.
[4]:
# Convert to ArviZ object
samples = az.from_cmdstanpy(
posterior=samples, posterior_predictive="n_ppc", log_likelihood="log_lik"
)
Now, we will check the diagnostics (as we always should) and make a corner plot.
[5]:
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)
# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples, parameters=['alpha', 'b']))
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Everything looks good. Actually, it doesn’t. The sampling looks good, but we should do posterior predictive checks.
[6]:
n_ppc = samples.posterior_predictive.n_ppc.stack(
{"sample": ("chain", "draw")}
).transpose("sample", "n_ppc_dim_0")
bokeh.io.show(
bebi103.viz.predictive_ecdf(
n_ppc,
name="n_ppc",
data=df["Rex1"].values,
x_axis_label="mRNA copy number",
)
)
We have clearly failed the posterior predictive checks. We can stop here, but we will continue to compute the WAIC and LOO for illustrative purposes. As we do that, let’s take a quick look at the output so we can see how the log likelihood samples are organized. The log likelihood is stored in ArviZ InferenceData
objects in the log_likelihood
attribute.
[7]:
samples.log_likelihood
[7]:
<xarray.Dataset> Dimensions: (chain: 4, draw: 1000, log_lik_dim_0: 279) Coordinates: * chain (chain) int64 0 1 2 3 * draw (draw) int64 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999 * log_lik_dim_0 (log_lik_dim_0) int64 0 1 2 3 4 5 ... 273 274 275 276 277 278 Data variables: log_lik (chain, draw, log_lik_dim_0) float64 -5.53 -5.972 ... -6.953 Attributes: created_at: 2024-02-01T21:51:04.108606 arviz_version: 0.17.0 inference_library: cmdstanpy inference_library_version: 1.2.0
The log likelihood is three dimensional, one dimension each for chain and draw, and then the final dimension is for the pointwise log-likelihood estimates.
Given that the log likelihood is stored in the ArviZ object, we can directly use the samples to compute the WAIC and LOO.
Computing the WAIC and LOO
We will start with the WAIC. We use the scale="deviance"
kwarg to get the WAIC. By default, az.waic()
will return the estimate for the elpd and not the traditionally used \(-2\mathrm{elpd}_\mathrm{WAIC}\), which is why we use the scale="deviance"
kwarg.
[8]:
az.waic(samples, scale="deviance")
[8]:
Computed from 4000 posterior samples and 279 observations log-likelihood matrix.
Estimate SE
deviance_waic 3281.44 17.28
p_waic 2.04 -
The output gives an estimate for the WAIC, and also the contribution of \(p_\mathrm{waic}\). It also gives an estimate of the standard error in the WAIC.
Let’s now compute the LOO.
[9]:
single_loo = az.loo(samples, scale="deviance")
single_loo
[9]:
Computed from 4000 posterior samples and 279 observations log-likelihood matrix.
Estimate SE
deviance_loo 3281.45 17.28
p_loo 2.04 -
------
Pareto k diagnostic values:
Count Pct.
(-Inf, 0.5] (good) 279 100.0%
(0.5, 0.7] (ok) 0 0.0%
(0.7, 1] (bad) 0 0.0%
(1, Inf) (very bad) 0 0.0%
We see that the LOO and WAIC give almost identical results (as they should). The Pareto k diagnostic is also good, meaning that the LOO calculation does not have obvious pathologies. Remember, though, that LOO has better performance across a wider variety of models.
Calculations with the mixture model
Now, let’s do the same calculation for the mixture model. In this case, we do not have a built-in distribution to use a _logpmf
function. Fortunately, the log_mix()
function in Stan accomplishes exactly what we need, as it did when we added it to target
in the model specification.
data {
int<lower=0> N;
array[N] int<lower=0> n;
}
parameters {
vector<lower=0>[2] alpha;
vector<lower=0>[2] b;
real<lower=0, upper=1> w;
}
transformed parameters {
vector[2] beta_ = 1.0 ./ b;
}
model {
// Priors
alpha ~ lognormal(0.0, 2.3);
b ~ lognormal(4.6, 2.3);
w ~ beta(1.0, 1.0);
// Likelihood
for (n_val in n) {
target += log_mix(
w,
neg_binomial_lpmf(n_val | alpha[1], beta_[1]),
neg_binomial_lpmf(n_val | alpha[2], beta_[2])
);
}
}
generated quantities {
array[N] int n_ppc;
array[N] real log_lik;
// Posterior predictive checks
for (i in 1:N) {
if (uniform_rng(0.0, 1.0) < w) {
n_ppc[i] = neg_binomial_rng(alpha[1], beta_[1]);
}
else {
n_ppc[i] = neg_binomial_rng(alpha[2], beta_[2]);
}
}
// Pointwise log likelihood
for (i in 1:N) {
log_lik[i] = log_mix(
w,
neg_binomial_lpmf(n[i] | alpha[1], beta_[1]),
neg_binomial_lpmf(n[i] | alpha[2], beta_[2]));
}
}
Let’s compile the model and get some samples.
[10]:
with bebi103.stan.disable_logging():
sm_mix = cmdstanpy.CmdStanModel(stan_file="neg_binom_mix.stan")
samples_mix = sm_mix.sample(data=data, seed=3252)
samples_mix = az.from_cmdstanpy(
posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)
We’ll do our usual diagnostic checks and make a corner plot.
[11]:
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)
# Make corner plot
bokeh.io.show(
bebi103.viz.corner(
samples_mix,
parameters=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
xtick_label_orientation=np.pi / 4,
)
)
ESS for parameter alpha[0] is 11.288214890989488.
tail-ESS for parameter alpha[0] is 30.741700971541686.
ESS for parameter alpha[1] is 14.927706967062717.
ESS for parameter b[0] is 7.261021660730003.
tail-ESS for parameter b[0] is 36.24951441172062.
ESS for parameter b[1] is 7.508176956465391.
tail-ESS for parameter b[1] is 45.51440474005312.
ESS for parameter w is 7.240249308310153.
tail-ESS for parameter w is 31.504119431969524.
ESS for parameter beta_[0] is 7.261011495230468.
tail-ESS for parameter beta_[0] is 36.249514411720966.
ESS for parameter beta_[1] is 7.508176685094074.
tail-ESS for parameter beta_[1] is 45.514404740052576.
ESS or tail-ESS below 100 per chain indicates that expectation values
computed from samples are unlikely to be good approximations of the
true expectation values.
Rhat for parameter alpha[0] is 1.2726060100400034.
Rhat for parameter alpha[1] is 1.1940342985931218.
Rhat for parameter b[0] is 1.5314590580911727.
Rhat for parameter b[1] is 1.501652442205019.
Rhat for parameter w is 1.531795690467478.
Rhat for parameter beta_[0] is 1.5314591389659473.
Rhat for parameter beta_[1] is 1.5016524586402822.
Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Oof! We failed the ESS and Rhat diagnostics. Recall that sampling is tricky because of label switching. To protect against label switching, we will initialize the walker near where chain 1 sampled.
[12]:
# Compute mean of parameters for chain 1
inits = {
"alpha": [
float(samples_mix.posterior.sel(chain=1, alpha_dim_0=0).alpha.mean()),
float(samples_mix.posterior.sel(chain=1, alpha_dim_0=1).alpha.mean()),
],
"b": [
float(samples_mix.posterior.sel(chain=1, b_dim_0=0).b.mean()),
float(samples_mix.posterior.sel(chain=1, b_dim_0=0).b.mean()),
],
"w": float(samples_mix.posterior.sel(chain=1).w.mean()),
}
# Take a look
inits
[12]:
{'alpha': [3.1764846319999998, 5.31663406],
'b': [6.643601560000001, 6.643601560000001],
'w': 0.17168086700000001}
Now we’ll use this as our initial walker positions.
[13]:
# Sample
with bebi103.stan.disable_logging():
samples_mix = sm_mix.sample(data, inits=inits)
# Convert to ArviZ
samples_mix = az.from_cmdstanpy(
posterior=samples_mix, posterior_predictive="n_ppc", log_likelihood="log_lik"
)
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)
# Make corner plot
bokeh.io.show(
bebi103.viz.corner(
samples_mix,
parameters=["alpha[0]", "alpha[1]", "b[0]", "b[1]", "w"],
xtick_label_orientation=np.pi / 4,
)
)
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
Everything looks good! We can do a quick posterior predictive check.
[14]:
n_ppc = samples_mix.posterior_predictive.n_ppc.stack(
{"sample": ("chain", "draw")}
).transpose("sample", "n_ppc_dim_0")
bokeh.io.show(
bebi103.viz.predictive_ecdf(
n_ppc,
name="n_ppc",
data=df["Rex1"].values,
x_axis_label="mRNA copy number",
)
)
Much nicer! The model allows for plenty of flexibility to allow for the observed bimodal behavior. Let’s proceed to compute the LOO and WAIC, starting with the WAIC.
[15]:
az.waic(samples_mix, scale="deviance")
[15]:
Computed from 4000 posterior samples and 279 observations log-likelihood matrix.
Estimate SE
deviance_waic 3191.49 21.76
p_waic 5.57 -
And the LOO.
[16]:
mix_loo = az.loo(samples_mix, scale="deviance")
mix_loo
[16]:
Computed from 4000 posterior samples and 279 observations log-likelihood matrix.
Estimate SE
deviance_loo 3191.52 21.76
p_loo 5.59 -
------
Pareto k diagnostic values:
Count Pct.
(-Inf, 0.5] (good) 279 100.0%
(0.5, 0.7] (ok) 0 0.0%
(0.7, 1] (bad) 0 0.0%
(1, Inf) (very bad) 0 0.0%
As expected, we get almost the same value for the two information criteria. Let’s take a quick look at the difference of the LOO’s between the mixture model and the single Negative Binomial model.
[17]:
mix_loo.elpd_loo - single_loo.elpd_loo
[17]:
-89.92969330711185
Remember that for historical reasons,
\begin{align} \text{WAIC} \approx -2\,\text{elpd}, \\[1em] \text{LOO} \approx -2\,\text{elpd}. \\[1em] \end{align}
The bigger the elpd is, the smaller the Kullback-Leibler divergence is, so the better the model is. Thus, a bigger elpd means a smaller WAIC or LOO. So, the smaller the WAIC or LOO is, the closer the model is to the true generative model. This WAIC and LOO are smaller for the mixture model than for the single Negative Binomial model, so it is a better model.
Computing the weights
We can directly compute the Akaike weights from the values of the LOO, using
\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}
[18]:
d_loo = mix_loo.elpd_loo - single_loo.elpd_loo
w_single = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_mix = 1 - w_single
print(' Mixture model weight:', w_mix)
print('Single Neg. Binom. model weight:', w_single)
Mixture model weight: 1.0
Single Neg. Binom. model weight: 2.964935286246496e-20
In agreement with our posterior predictive checks, the mixture model is far more predictive than the single negative binomial model.
As I mentioned above, ArviZ offers more a sophisticated means of computing the weights using stacking. The results tend to be less extreme (and therefore more conservative) that directly computing the Akaike weights. We can use the az.compare()
function to do the calculation. We will do it using the LOO (WAIC is default, so we use the ic
kwarg). The first input is a dictionary containing the MCMC samples, where the keys of the dictionary are the names of the models.
[19]:
az.compare({"single": samples, "mix": samples_mix}, ic="loo", scale="deviance")
/Users/bois/miniconda3/envs/bebi103_build/lib/python3.11/site-packages/arviz/stats/stats.py:309: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'False' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_comp.loc[val] = (
/Users/bois/miniconda3/envs/bebi103_build/lib/python3.11/site-packages/arviz/stats/stats.py:309: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'deviance' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
df_comp.loc[val] = (
[19]:
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
mix | 0 | 3191.517389 | 5.587162 | 0.000000 | 0.987199 | 21.757238 | 0.000000 | False | deviance |
single | 1 | 3281.447082 | 2.042356 | 89.929693 | 0.012801 | 17.283447 | 17.480505 | False | deviance |
The mixture model is still dominant. (Again, we are not going into the details of the stacking calculation, but you can read about it in this paper by Yao and coworkers.)
[20]:
bebi103.stan.clean_cmdstan()
Computing environment
[21]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.11.5
IPython version : 8.15.0
numpy : 1.26.3
pandas : 2.1.4
cmdstanpy : 1.2.0
arviz : 0.17.0
bokeh : 3.3.0
bebi103 : 0.1.19
jupyterlab: 4.0.10
cmdstan : 2.33.1