25. Variational Bayesian inference

Heidi Klumpe contributed significantly to this lesson.


[2]:
import io
import os

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting
bokeh.io.output_notebook()

Large data sets, models with many parameters, and hierarchical models, among other modeling situations, can require computationally expensive sampling with MCMC. We can speed up inference if we instead use an analytical approximation of the posterior and then sample from it. Variational inference (VI) is a technique that adopts this approach. It is sometimes referred to as variational Bayes, a term used interchangeably with variational inference in the Stan documentation.

Importantly, MCMC samples asymptotically approach the true posterior density, but VI does not. VI will only give information about an approximation of the posterior, and that approximation may not be good.

In this lesson, we will go over the basic concepts behind VI and its implementation in Stan. For more detail, I recommend this review of VI by David Blei and coworkers.

The main ideas of variational inference

For VI, we want to approximate a posterior distribution \(g(\theta \mid y)\) with an approximate distribution \(Q(\theta)\). We saw in lessons on model comparison that the Kullback-Leibler divergence is used to quantify the dissimilarity between two probability distributions (note that it is not symmetric in the two distributions, and therefore cannot be considered a distance). So, we want the Kullback-Leibler divergence to be as small as possible. That is, we wish to find a \(Q\) such that the K-L divergence

\begin{align} D_{KL} \left(Q\middle\| g \right) = \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{g(\theta \mid y)} \end{align}

is as close to zero as possible. (The K-L divergence is zero if \(Q = g\), and positive otherwise.) Note that we can replace the sum with an integral for continuous \(\theta\). Because the full posterior appears in this expression, to evaluate the K-L divergence directly we would have to compute the evidence, \(f(y) = \int d\theta\, \pi(\theta, y)\), where \(\pi(\theta, y)\) is the joint distribution of parameters and data. This integral is generally intractable. Instead, we would like the objective function of our optimization to be in terms of distributions that are more easily expressed, so we modify this expression. To proceed, we write the posterior in terms of the joint probability,

\begin{align} g(\theta \mid y) = \frac{\pi(\theta, y)}{f(y)}. \end{align}

We can substitute this expression into the above expression for the K-L divergence and rearrange using properties of logarithms.

\begin{align} D_{KL} \left(Q\middle\| g \right) &= \sum_\theta Q(\theta) \, \ln \frac{Q(\theta)}{\frac{\pi(\theta, y)}{f(y)}}\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \sum_\theta Q(\theta) \, \ln f(y)\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y)\sum_\theta Q(\theta) \\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y). \end{align}

Note that we effectively replaced the posterior (the conditional distribution \(g(\theta \mid y)\)) with the product of the likelihood and prior, the joint distribution \(\pi(\theta, y) = f(y \mid \theta) g(\theta)\).

We further rearrange to get

\begin{align} \ln f(y) &= D_{KL} \left(Q\middle\| g \right) - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)} \\ &= D_{KL} \left(Q\middle\| g \right) + \mathcal{L}\left(Q(\theta)\right), \end{align}

where

\begin{align} \mathcal{L}\left(Q(\theta)\right) = - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)}. \end{align}

The quantity \(\mathcal{L}(Q)\) is the evidence lower bound (ELBO). The name comes from the fact that the log evidence is guaranteed to be at least as big as \(\mathcal{L}(Q)\), since the K-L divergence is nonnegative. Further, since the evidence does not depend on \(Q\), the negative ELBO is equal to the Kullback-Leibler divergence up to an additive constant. So, choosing the \(Q\) that maximizes the ELBO is equivalent to finding the \(Q\) that minimizes the K-L divergence.

The ELBO is also sometimes called the “negative variational free energy.” This is because its negative has the form of a free energy, an expected “energy” minus an entropy:

\begin{align} \mathcal{L}(Q) &= -\sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)} \\[1em] &= \sum_\theta Q(\theta) \left[ \ln \pi(\theta, y) - \ln {Q(\theta)} \right]\\[1em] &= \sum_\theta Q(\theta) \ln \pi(\theta, y) - \sum_\theta Q(\theta)\ln Q(\theta). \end{align}

The first term is the expected value of the log joint probability under \(Q\), which plays the role of a (negative) “energy,” and the second term is the Shannon entropy of \(Q\).

For \(Q(\theta)\) to be a useful approximation of the posterior, we should choose \(Q(\theta)\) such that \(\mathcal{L}(Q)\) is easy to compute and optimize, but not so simple that \(Q(\theta)\) very poorly approximates the posterior.
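
To make these ideas concrete, here is a minimal sketch of VI for a hypothetical one-dimensional problem, not part of this lesson's analysis: we posit \(Q(\theta) = \text{Norm}(\mu, \sigma)\), compute the ELBO above by numerical quadrature against a made-up skewed log joint, and maximize it with SciPy. The toy target, the quadrature limits, and the optimizer are all illustrative assumptions.

# A minimal sketch of VI in one dimension (hypothetical toy problem).
# The ELBO integrand below is Q(θ) [ln π(θ, y) - ln Q(θ)], as derived above.
import numpy as np
import scipy.integrate
import scipy.optimize
import scipy.stats


def log_joint(theta):
    """Made-up unnormalized log joint ln π(θ, y); skewed, so not Gaussian."""
    return scipy.stats.gumbel_r.logpdf(theta)


def neg_elbo(params):
    """Negative ELBO for Q(θ) = Norm(µ, σ), computed by quadrature."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)

    def integrand(theta):
        q = scipy.stats.norm.pdf(theta, mu, sigma)
        log_q = scipy.stats.norm.logpdf(theta, mu, sigma)
        return q * (log_joint(theta) - log_q)

    elbo, _ = scipy.integrate.quad(integrand, -10, 30)
    return -elbo


# Maximize the ELBO (minimize its negative) over µ and ln σ
res = scipy.optimize.minimize(neg_elbo, np.array([0.0, 0.0]), method="Nelder-Mead")
mu_best, sigma_best = res.x[0], np.exp(res.x[1])

# Independent draws out of the approximate posterior; no MCMC needed
rng = np.random.default_rng(3252)
theta_samples = rng.normal(mu_best, sigma_best, size=4000)

The best Normal approximation can capture the location and scale of the skewed target, but not its asymmetry; that is the price of restricting \(Q\) to a simple family.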

Choosing Q(θ)

We can choose any family of functions to approximate our posterior, but it is particularly useful to choose the mean-field variational family. This assumes the parameters \(\theta\) of our statistical model are independent under the approximation, with each governed by its own variational parameters. Stated mathematically,

\begin{align} Q(\theta) = \prod_i q_i (\theta_i ; \phi_i), \end{align}

where the \(q_i\) are the factors of the approximation, each an independent function of only one parameter \(\theta_i\), and the \(\phi_i\) are the latent variational parameters that parametrize each \(q_i\). A posterior distribution in general does not factorize this way, which is what makes this an approximation.

To arrive at our final approximation, we need expressions for each \(q_i\). For a given \(q_i\), we want to estimate the posterior as a function of \(\theta_i\) only. To do this, we replace all \(\theta_{j\neq i}\) terms in the posterior with their expected values, i.e. \(\theta_j^\mathrm{mean} = \int d\theta_j\, \theta_j\, q_j(\theta_j)\). This is why this is considered a mean field approach: we approximate the effect of the other variables as a “mean field,” rather than using their true values. Note also that each mean is a constant, for which there is an analytical expression for most probability distributions.

But now we see our final hurdle. To fully specify \(q_i\), we need information about all the other partitions \(q_{j\neq i}\), since the mean of that distribution may appear in the expression for \(q_i\). Thus, to complete the optimization, we iteratively update \(\phi_i\) for each parameter \(\theta_i\), updating one variable at a time. We stop when we arrive at a solution (i.e. a set of \(q_i\) and \(\phi_i\)) that maximizes the ELBO.
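
As a concrete illustration of this coordinate-ascent iteration, the sketch below runs the classic mean-field updates for a toy bivariate Gaussian “posterior” with known mean and precision matrix (the standard textbook example, e.g. Bishop's Pattern Recognition and Machine Learning, Sec. 10.1.2). In that case each factor \(q_i\) is Normal with variance \(1/\Lambda_{ii}\), and its mean is updated using the current mean of the other factor. The numerical values are made up for illustration.

# Mean-field coordinate-ascent updates for a toy bivariate Gaussian target
# with mean mu and precision matrix Lam (illustrative values). Each factor
# q_i is Normal; its mean is updated holding the other factor's mean fixed.
import numpy as np

mu = np.array([1.0, 2.0])                  # mean of the toy target
Lam = np.array([[2.0, 1.2], [1.2, 2.0]])   # precision matrix of the toy target

m = np.zeros(2)                            # variational means, initial guess
for _ in range(50):                        # iterate the updates to convergence
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

s = 1.0 / np.sqrt(np.diag(Lam))            # variational standard deviations

The variational means converge to the true means, but the variances \(1/\Lambda_{ii}\) are smaller than the true marginal variances and all correlation information is lost, a feature of mean-field approximations we will see again below.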

Summary of VI algorithm

  1. Our goal is to find a distribution \(Q(\theta)\) that maximizes the ELBO, which is equivalent to minimizing the dissimilarity between the posterior \(g(\theta \mid y)\) and an approximate distribution \(Q(\theta)\).

  2. Posit that \(Q(\theta) = \prod_i q_i(\theta_i; \phi_i)\). Note that you can derive what the functional form for \(q_i\) should be (e.g. Normal, Gamma, etc.), but Stan’s VI algorithm will do it for you. This defines the “restricted class” from which you find your best \(Q(\theta)\).

  3. Vary \(\phi_i\) to maximize the ELBO and find the best \(Q(\theta)\). This gives the final form of approximation of the posterior.

  4. Draw samples out of the approximation of the posterior. This does not need MCMC; independent samples may be directly drawn using a random number generator and transforms.

Automatic Differentiation Variational Inference and Stan

The algorithm Stan uses for VI is called Automatic Differentiation Variational Inference (ADVI). The technique was developed by Alp Kucukelbir and coworkers and was published in 2017.

I will not go into the details of the ADVI algorithm here, but will give a few highlights that are important to understand while using it. First, any parameters that must be positive or are otherwise constrained are transformed such that they may take on any real value. We will call the transformed parameters \(\zeta\). The ADVI algorithm is clever in its choice of transform, and you should see the Kucukelbir, et al. paper for details. Next, we choose a family of distributions for \(Q(\theta)\). One choice is a Gaussian mean field family where

\begin{align} &Q(\zeta) = \prod_i q_i(\zeta_i),\\[1em] &\zeta_i \sim \text{Norm}(\mu_i, \sigma_i). \end{align}

The ADVI algorithm then finds the values of the \(\mu_i\)’s and \(\sigma_i\)’s that maximize the ELBO using a stochastic gradient ascent algorithm. Note that when we choose the Gaussian mean field family, we lose information about the covariance of parameter values.

Stan’s implementation of ADVI also allows specification of a full rank Gaussian,

\begin{align} &\zeta \sim \text{MultiNorm}(\boldsymbol{\mu}, \mathsf{\Sigma}). \end{align}

There are now many more parameters to adjust in order to maximize the ELBO (all of the entries of \(\mathsf{\Sigma}\) instead of just the diagonal ones in the mean field case), and the optimization is more difficult than with the mean field family. The benefit is that the extra complexity (and therefore flexibility) allows a better approximation of the posterior, and more covariance information about the parameters is retained.

After the optimal parameters are found for the optimizing distributions, samples of \(\zeta\) are drawn out of these Gaussian distributions and these samples are transformed back into \(\theta\).
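
As a small illustration of that last step, the sketch below draws Gaussian samples of a transformed parameter \(\zeta = \ln \sigma\) and maps them back to the positive parameter \(\sigma\). The variational parameter values are made up.

# Back-transformation step: a positive parameter σ is represented as
# ζ = ln σ on the real line; Gaussian draws of ζ are exponentiated to give
# draws of σ. The variational parameters below are illustrative only.
import numpy as np

rng = np.random.default_rng(3252)

mu_zeta, sigma_zeta = 1.2, 0.3                     # optimized variational parameters for ζ
zeta = rng.normal(mu_zeta, sigma_zeta, size=4000)  # draws of ζ, unconstrained
sigma = np.exp(zeta)                               # draws of σ > 0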

Examples

We will now apply ADVI to two examples we have already studied with MCMC. We will use the same Stan models we used previously. Using ADVI is as simple as using the same Stan model as you would with MCMC, but drawing variational samples instead of MCMC samples.

Example 1: Spindle size as a function of volume

We will first consider a model for microtubule spindle size based on Matt Good’s experiments. We considered this model in lesson 11. Here is a quick plot of the data.

[3]:
# Load in data frame (data_path is assumed to point to the directory
# containing the data set; define it for your setup)
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

# Pull out numpy arrays
ell = df["Spindle Length (um)"].values
d = df["Droplet Diameter (um)"].values

# Make a plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label="droplet diameter (µm)",
    y_axis_label="spindle length (µm)",
    x_range=[0, 250],
    y_range=[0, 50],
)

p.circle(
    source=df,
    x="Droplet Diameter (um)",
    y="Spindle Length (um)",
    alpha=0.3,
)

bokeh.io.show(p)

The generative model we used for this data set is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &\sigma_i = \sigma_0\,\mu_i,\\[1em] &l_i \mid d_i, \gamma, \phi, \sigma_0 \sim \text{Norm}(\mu_i, \sigma_i) \;\forall i. \end{align}

The corresponding Stan code is

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / cbrt(1 + denom_ratio);
  }
}


data {
  int N;
  int N_ppc;
  array[N] real d;
  array[N_ppc] real d_ppc;
  array[N] real ell;
}


parameters {
  real<lower=0> phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma_0;
}


transformed parameters {
  array[N] real mu;
  array[N] real sigma;

  for (i in 1:N) {
    mu[i] = ell_theor(d[i], phi, gamma_);
    sigma[i] = sigma_0 * mu[i];
  }
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(mu, sigma);
}


generated quantities {
  array[N_ppc] real ell_ppc;

  for (i in 1:N_ppc) {
    real mu_ppc = ell_theor(d_ppc[i], phi, gamma_);
    ell_ppc[i] = normal_rng(mu_ppc, sigma_0 * mu_ppc);
  }
}

Let’s compile the model. The compilation for ADVI is exactly the same as for MCMC.

[4]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="spindle.stan")

Now, we can perform ADVI. To do so, we use sm.variational() instead of sm.sample(). We specify that we want to use a Gaussian mean field approximation using the algorithm="meanfield" kwarg. Stan will automatically do the optimization. Remember, though, for general optimization problems, there are no guarantees that we can find a global optimum.

Finally, we use the draws=4000 kwarg to specify how many samples to draw out of the approximate posterior.

[5]:
N_ppc = 200
d_ppc = np.linspace(0.1, 250, N_ppc)

d = df["Droplet Diameter (um)"].values
ell = df["Spindle Length (um)"].values

data = dict(N=len(df), d=d, ell=ell, N_ppc=N_ppc, d_ppc=d_ppc)

with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data,
        algorithm="meanfield",
        draws=4000,
        seed=3252,
    )

Unfortunately, ArviZ does not have support for VI results, so we need to extract the information about the VI calculation directly from the object returned by sm.variational(). First, we can look at Stan’s output pertaining to its optimization calculation for maximizing the ELBO. To do this, we will pull out the stdout_files and print their contents.

[6]:
for fname in samples_vi.runset._stdout_files:
    with open(fname, "r") as f:
        print(f.read())
method = variational
  variational
    algorithm = meanfield (Default)
      meanfield
    iter = 10000 (Default)
    grad_samples = 1 (Default)
    elbo_samples = 100 (Default)
    eta = 1.000000 (Default)
    adapt
      engaged = 1 (Default)
      iter = 50 (Default)
    tol_rel_obj = 0.010000 (Default)
    eval_elbo = 100 (Default)
    output_samples = 4000
id = 1 (Default)
data
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpk8298dbe/nqkhtet6.json
init = 2 (Default)
random
  seed = 3252
output
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpk8298dbe/spindlecdpka3qa/spindle-20240220003452.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  save_cmdstan_config = 0 (Default)
num_threads = 1 (Default)

------------------------------------------------------------
EXPERIMENTAL ALGORITHM:
  This procedure has not been thoroughly tested and may be unstable
  or buggy. The interface is subject to change.
------------------------------------------------------------



Gradient evaluation took 0.00035 seconds
1000 transitions using 10 leapfrog steps per transition would take 3.5 seconds.
Adjust your expectations accordingly!


Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
Iteration: 100 / 250 [ 40%]  (Adaptation)
Iteration: 150 / 250 [ 60%]  (Adaptation)
Iteration: 200 / 250 [ 80%]  (Adaptation)
Success! Found best value [eta = 1] earlier than expected.

Begin stochastic gradient ascent.
  iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes
   100       -16953.664             1.000            1.000
   200       -26130.759             0.676            1.000
   300       -23178.757             0.493            0.351
   400        -4616.726             1.375            1.000
   500        -3575.492             1.158            0.351
   600        -3390.535             0.974            0.351
   700        -3084.661             0.849            0.291
   800        -1877.438             0.823            0.351
   900        -1845.521             0.734            0.291
  1000        -1847.926             0.661            0.291
  1100        -1841.887             0.561            0.127   MAY BE DIVERGING... INSPECT ELBO
  1200        -1846.126             0.526            0.099   MAY BE DIVERGING... INSPECT ELBO
  1300        -1841.222             0.514            0.055   MAY BE DIVERGING... INSPECT ELBO
  1400        -1844.497             0.112            0.017
  1500        -1844.251             0.083            0.003   MEDIAN ELBO CONVERGED

Drawing a sample of size 4000 from the approximate posterior...
COMPLETED.

Stochastic gradient ascent was used to maximize the ELBO, and we got the message that the optimization converged. Stan then proceeded to draw 4000 samples out of the approximate posterior.

Now, we can obtain the draws using the samples_vi.variational_sample_pd object, which has the samples as a data frame.

[7]:
df_vi = samples_vi.variational_sample_pd

# Take a look
df_vi.head()
[7]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0.0 -1833.69 -1.499310 37.4004 0.890305 0.112656 22.2885 22.9894 23.7256 24.6603 ... 38.9634 34.9517 33.7124 38.6103 42.0959 37.1982 44.5397 44.5302 40.5810 37.9226
1 0.0 -1838.96 -0.313062 37.2800 0.866539 0.112550 21.8002 22.4966 23.2294 24.1624 ... 34.6073 35.7513 30.6759 38.5644 40.7878 38.9494 34.2358 40.0993 34.2573 44.3059
2 0.0 -1833.97 -0.563498 37.3965 0.883906 0.114579 22.1614 22.8616 23.5975 24.5327 ... 39.0093 42.3316 34.3370 43.7898 44.6077 37.3891 36.7614 33.7377 44.4946 43.3266
3 0.0 -1853.56 -4.220800 37.1215 0.839312 0.117270 21.2285 21.9183 22.6460 23.5749 ... 38.0863 38.5711 36.8697 36.2959 39.7949 33.8898 38.4848 40.9590 44.5866 36.2442
4 0.0 -1832.93 -1.922950 37.7253 0.876273 0.116066 22.0482 22.7528 23.4944 24.4385 ... 42.3030 32.0449 33.8408 29.3432 37.8117 48.4723 37.3007 28.4747 37.3459 36.6706

5 rows × 1546 columns

The lp__ column is always zero (it is a legacy feature that is never used). The log_p__ and log_g__ columns are values of the log posterior and log approximate posterior, respectively, so we do not really use those. The remaining columns are samples of the parameter values and posterior predictive checks.

We can also directly get the values of \(\phi\), \(\gamma\), and \(\sigma_0\) that resulted in the maximal ELBO.

[8]:
samples_vi.variational_params_pd
[8]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0.0 0.0 0.0 37.3255 0.872579 0.114267 21.9268 22.6247 23.3587 24.2927 ... 36.9062 29.4505 35.2221 35.2612 34.0087 37.0151 36.9185 44.0853 39.615 49.3834

1 rows × 1546 columns

Let’s take a quick look at the posterior predictive checks to make sure the likelihood parametrized by draws out of the approximate posterior can generate the data.

[9]:
ell_ppc = df_vi[df_vi.columns[df_vi.columns.str.contains("ell_ppc")]].values

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        d_ppc,
        data=(d, ell),
        x_axis_label="droplet diameter (µm)",
        y_axis_label="spindle length (µm)",
    )
)

This looks ok, but remember that we are looking at how the likelihood can generate data using parameters sampled from the approximate posterior. Now, let’s look at the samples out of the approximate posterior.

[10]:
parameters = ["phi", "gamma_", "sigma_0"]

bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

Notice that the approximate joint distribution for each pair of parameters shows no tilt; the parameters are uncorrelated under the approximation. This is because the covariance between parameters is neglected in the Gaussian mean field approximation of the posterior.

To compare, let’s look at a corner plot from the parameters obtained using MCMC.

[11]:
with bebi103.stan.disable_logging():
    samples = az.from_cmdstanpy(sm.sample(data=data))

bokeh.io.show(bebi103.viz.corner(samples, parameters=parameters))

The full posterior clearly shows a strong covariance between \(\gamma\) and \(\phi\). For a clearer comparison, let’s compare the ECDFs of the marginal distributions for \(\gamma\) and \(\phi\) obtained from MCMC and from VI.

[12]:
def plot_marginal_ecdfs():
    p_phi = iqplot.ecdf(
        samples.posterior.phi.values.flatten(), legend_label="MCMC", x_axis_label="φ"
    )
    p_phi = iqplot.ecdf(
        df_vi.phi.values,
        line_kwargs=dict(line_color="orange"),
        p=p_phi,
        legend_label="ADVI",
    )

    p_gamma = iqplot.ecdf(samples.posterior.gamma_.values.flatten(), x_axis_label="γ")
    p_gamma = iqplot.ecdf(
        df_vi.gamma_.values,
        line_kwargs=dict(line_color="orange"),
        p=p_gamma,
    )

    return p_phi, p_gamma

bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

We see that there are differences in the marginal distributions between the approximate and exact posteriors. This will in general be the case; the approximate posterior based on VI will never be exact (unless of course the problem is set up such that the posterior is in fact Gaussian), and the approximation may introduce substantial differences.

To attempt to do better, capturing the covariance between \(\phi\) and \(\gamma\), and hopefully also getting the marginal distributions for each closer to the true posterior, let’s use a full rank Gaussian family in our approximation.

[13]:
# Take samples using full rank Gaussian approximation
with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=3251
    )

# Convert to data frame
df_vi = samples_vi.variational_sample_pd

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

This seems to have at least captured the covariance between \(\phi\) and \(\gamma\). Let’s check the marginal distributions for each, again compared to MCMC.

[14]:
bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

The ADVI results are much closer to those from full MCMC. Importantly, since there is stochasticity in the ELBO maximization procedure, we will not get the same results each time. Let’s try again with a different seed.

[15]:
# Take samples using full rank Gaussian approximation
with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=88812
    )

# Convert to data frame
df_vi = samples_vi.variational_sample_pd

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

In this case, we do not see covariance between \(\phi\) and \(\gamma\), but we do see that each covaries with \(\sigma_0\). We could still pass posterior predictive checks; the approximate posterior could generate the observed data, but of course the approximate posterior is not the model. Nonetheless, the marginal distributions at least give reasonable values for the range of values the parameters may take.
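
To back up that claim, we could repeat the posterior predictive plot from before, now using the full rank draws (the code mirrors the earlier check).

ell_ppc = df_vi[df_vi.columns[df_vi.columns.str.contains("ell_ppc")]].values

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        d_ppc,
        data=(d, ell),
        x_axis_label="droplet diameter (µm)",
        y_axis_label="spindle length (µm)",
    )
)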

If we were doing a more careful analysis with ADVI, we might run it multiple times, checking the ELBO value of each, and finding the best approximation of the posterior.
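
A sketch of how that might look is below. The helper scrapes the final ELBO from the console output, whose format is shown above; treat the parsing and the choice of seeds as assumptions to adapt rather than a definitive recipe.

# Run ADVI several times with different seeds and keep the run with the
# largest final ELBO. The ELBO is parsed from the console output (format as
# printed above); adjust the parsing for other CmdStan versions.
import numpy as np


def final_elbo(vb):
    """Last ELBO value reported in an ADVI run's console output."""
    elbo = np.nan
    for fname in vb.runset._stdout_files:
        with open(fname, "r") as f:
            for line in f:
                tokens = line.split()
                if len(tokens) >= 2 and tokens[0].isdigit():
                    try:
                        elbo = float(tokens[1])
                    except ValueError:
                        pass
    return elbo


best_elbo = -np.inf
best_vb = None
with bebi103.stan.disable_logging():
    for seed in (3251, 3252, 3253, 3254):
        vb = sm.variational(
            data=data,
            algorithm="fullrank",
            draws=4000,
            seed=seed,
            require_converged=False,  # keep going even if a run fails to converge
        )
        elbo = final_elbo(vb)
        if elbo > best_elbo:
            best_elbo, best_vb = elbo, vb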

Example 2: A multilevel hierarchical model

For our next example, we will use the contrived multilevel hierarchical model we considered in lesson 20. First, let’s “load” in the contrived data and plot it as a reminder.

[16]:
data_str = "".join(
    [
        "day,batch,colony,y\nm,1,1,11.40\nm,1,1,10.54\n",
        "m,1,1,12.17\nm,1,1,12.41\nm,1,1,9.97\nm,1,2,10.76\n",
        "m,1,2,9.16\nm,1,3,9.50\nm,2,1,9.34\nm,2,1,10.14\n",
        "m,2,2,10.72\nm,2,2,10.63\nm,3,1,11.37\nm,3,1,10.51\n",
        "m,4,1,11.06\nm,4,1,10.68\nm,4,1,12.58\nm,4,2,11.21\n",
        "m,4,2,11.07\nm,4,2,10.74\nm,4,2,11.68\nm,4,3,10.65\n",
        "m,4,3,9.06\nw,1,1,10.40\nw,1,2,10.75\nw,1,2,11.42\n",
        "w,1,2,10.42\nw,1,2,9.18\nw,1,2,10.69\nw,1,2,9.37\n",
        "w,1,2,11.32\nw,2,1,9.90\nw,2,1,10.53\nw,2,1,10.76\n",
        "w,3,1,11.08\nw,3,1,9.27\nw,3,1,12.01\nw,3,1,12.20\n",
        "w,3,1,11.23\nw,3,1,10.96\nr,1,1,9.73\nr,1,2,11.25\n",
        "r,1,2,9.99\nr,1,2,10.12\nr,1,3,9.65\nr,1,3,10.18\nr,1,4,12.70\n",
    ]
)

data_str = (
    data_str.replace("m", "monday").replace("w", "wednesday").replace("r", "thursday")
)
df = pd.read_csv(io.StringIO(data_str))

bokeh.io.show(
    iqplot.strip(
        df,
        q="y",
        cats=["day", "batch"],
        color_column="colony",
        marker_kwargs=dict(alpha=0.6),
    )
)

The statistical model we used is

\begin{align} &\theta \sim \text{Norm}(10, 3) \\[1em] &\tau \sim \text{HalfNorm}(5) \\[1em] &\theta_1 \sim \text{Norm}(\theta, \tau) \\[1em] &\theta_2 \sim \text{Norm}(\theta_1, \tau) \\[1em] &\theta_3 \sim \text{Norm}(\theta_2, \tau) \\[1em] &\sigma \sim \text{HalfNorm}(5) \\[1em] &y\sim \text{Norm}(\theta_3, \sigma). \end{align}

We coded this up as a Stan model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;
  int J_2;
  int J_3;

  // Index arrays to keep track of hierarchical structure
  array[J_2] int index_1;
  array[J_3] int index_2;
  array[N] int index_3;

  // The measurements
  array[N] real y;
}


parameters {
  // Hyperparameters level 0
  real theta;

  // How hyperparameters vary
  real<lower=0> tau;

  // Hyperparameters level 1
  vector[J_1] theta_1;

  // Hyperparameters level 2
  vector[J_2] theta_2;

  // Parameters
  vector[J_3] theta_3;
  real<lower=0> sigma;
}


model {
  theta ~ normal(10, 3);
  sigma ~ normal(0, 5);
  tau ~ normal(0, 5);

  theta_1 ~ normal(theta, tau);
  theta_2 ~ normal(theta_1[index_1], tau);
  theta_3 ~ normal(theta_2[index_2], tau);

  y ~ normal(theta_3[index_3], sigma);
}

Before we can sample or perform ADVI, we need to encode the data for Stan to use.

[17]:
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols=["day", "batch", "colony"], data_cols="y"
)

Now, we’ll compile the model and draw samples both with MCMC and with ADVI.

[18]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="hier_lognorm.stan")
    samples = az.from_cmdstanpy(sm.sample(data=data, adapt_delta=0.99, show_progress=False))
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=32520, require_converged=False
    )

df_vi = samples_vi.variational_sample_pd

First, we’ll make a corner plot for the full MCMC samples. We will only show the hyperparameters for ease of display.

[19]:
bokeh.io.show(
    bebi103.viz.corner(samples, parameters=["theta", "sigma", "tau"])
)

We see a strong funnel shape with small values of \(\tau\) being heavily sampled. There is a strong tendency toward pooling.

Now, let’s look at the corner plot of the variational samples.

[20]:
bokeh.io.show(bebi103.viz.corner(df_vi, parameters=["theta", "sigma", "tau"]))

Oof! This does not look much like the posterior! The parameter \(\theta\) is underestimated and \(\tau\) is overestimated. Let’s compare the marginal ECDFs.

[21]:
p_theta = iqplot.ecdf(
    samples.posterior.theta.values.flatten(),
    legend_label="MCMC",
    x_axis_label="θ",
    frame_height=150,
)
p_theta = iqplot.ecdf(
    df_vi.theta.values,
    line_kwargs=dict(line_color='orange'),
    p=p_theta,
    legend_label="ADVI",
)

p_sigma = iqplot.ecdf(
    samples.posterior.sigma.values.flatten(), x_axis_label="σ", frame_height=150,
)
p_sigma = iqplot.ecdf(
    df_vi.sigma.values,
    line_kwargs=dict(line_color='orange'),
    p=p_sigma,
)

p_tau = iqplot.ecdf(
    samples.posterior.tau.values.flatten(), x_axis_label="τ", frame_height=150,
)
p_tau = iqplot.ecdf(
    df_vi.tau.values,
    line_kwargs=dict(line_color='orange'),
    p=p_tau,
)

bokeh.io.show(bokeh.layouts.gridplot([p_theta, p_sigma, p_tau], ncols=1))

The VI results are a serious departure from full MCMC. In general, ADVI performs poorly when there is complex model structure like we have here in this hierarchical model. There are custom methods for ADVI inference with hierarchical models, for example as described in this paper by Ranganath and coworkers, but these are not implemented in Stan.

In general, my advice is to employ VI only when you really need the speed and your model is not too complex.

[22]:
bebi103.stan.clean_cmdstan()

Computing environment

[23]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

numpy     : 1.26.2
pandas    : 2.1.4
cmdstanpy : 1.2.0
arviz     : 0.17.0
bokeh     : 3.3.0
iqplot    : 0.3.5
bebi103   : 0.1.20
jupyterlab: 4.0.10

cmdstan   : 2.34.0