25. Variational Bayesian inference

Heidi Klumpe contributed significantly to this lesson.


[2]:
import io
import os

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting
bokeh.io.output_notebook()

Large data sets, models with many parameters, and hierarchical models, among other modeling situations, can require computationally expensive sampling with MCMC. We can speed up inference if we instead use an analytical approximation of the posterior and then sample from it. Variational inference (VI) is a technique that adopts this approach. It is sometimes referred to as variational Bayes, a term used interchangeably with variational inference in the Stan documentation.

Importantly, MCMC samples asymptotically approach the true posterior density, but VI does not. VI will only give information about an approximation of the posterior, and that approximation may not be good.

In this lesson, we will go over the basic concepts behind VI and its implementation in Stan. For more detail, I recommend this review of VI by David Blei and coworkers.

The main ideas of variational inference

For VI, we want to approximate a posterior distribution \(g(\theta \mid y)\) with an approximate distribution \(Q(\theta)\). We saw in lessons on model comparison that the Kullback-Leibler divergence is used to quantify the dissimilarity between two probability distributions (note that it is not symmetric in the two distributions, and therefore cannot be considered a distance). So, we want the Kullback-Leibler divergence to be as small as possible. That is, we wish to find a \(Q\) such that the K-L divergence

\begin{align} D_{KL} \left(Q\middle\| g \right) = \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{g(\theta \mid y)} \end{align}

is as close to zero as possible. (The K-L divergence is zero if \(Q = g\), and positive otherwise.) Note that we can replace the sum with an integral for continuous \(\theta\). Because the full posterior appears in this expression, to evaluate the K-L divergence directly we would have to compute the evidence, \(f(y) = \int d\theta\, \pi(\theta, y)\), where \(\pi(\theta, y)\) is the joint distribution of parameters and data. This integral is generally intractable. Instead, we would like the objective function of our optimization to be in terms of distributions that are more easily expressed, so we modify this expression. To proceed, we write the posterior in terms of the joint probability,

\begin{align} g(\theta \mid y) = \frac{\pi(\theta, y)}{f(y)}. \end{align}

We can substitute this expression into the above expression for the K-L divergence and rearrange using properties of logarithms.

\begin{align} D_{KL} \left(Q\middle\| g \right) &= \sum_\theta Q(\theta) \, \ln \frac{Q(\theta)}{\frac{\pi(\theta, y)}{f(y)}}\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \sum_\theta Q(\theta) \, \ln f(y)\\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y)\sum_\theta Q(\theta) \\[1em] &= \sum_\theta Q(\theta)\, \ln \frac{Q(\theta)}{\pi(\theta, y)} + \ln f(y). \end{align}

Note that we effectively replaced the posterior (the conditional distribution \(g(\theta \mid y)\)) with the product of the likelihood and prior, the joint distribution \(\pi(\theta, y) = f(y \mid \theta) g(\theta)\).

We further rearrange to get

\begin{align} \ln f(y) &= D_{KL} \left(Q\middle\| g \right) - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)} \\ &= D_{KL} \left(Q\middle\| g \right) + \mathcal{L}\left(Q(\theta)\right), \end{align}

where

\begin{align} \mathcal{L}\left(Q(\theta)\right) = - \sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)}. \end{align}

The quantity \(\mathcal{L}(Q)\) is the evidence lower bound (ELBO). The name comes from the fact that the log evidence is guaranteed to be at least as big as \(\mathcal{L}(Q)\), since the K-L divergence is nonnegative. Further, since the evidence does not depend on \(Q\), the negative ELBO is equal to the Kullback-Leibler divergence up to an additive constant. So, choosing the \(Q\) that maximizes the ELBO is equivalent to finding the \(Q\) that minimizes the K-L divergence.

The ELBO is also sometimes called the “negative variational free energy.” This is because its negative has the form of a free energy, an expected “energy” minus an entropy:

\begin{align} \mathcal{L}(Q) &= -\sum_\theta Q(\theta) \ln \frac{Q(\theta)}{\pi(\theta, y)} \\[1em] &= \sum_\theta Q(\theta) \left[ \ln \pi(\theta, y) - \ln {Q(\theta)} \right]\\[1em] &= \sum_\theta Q(\theta) \ln \pi(\theta, y) - \sum_\theta Q(\theta)\ln Q(\theta). \end{align}

The first term is the expected value of the log joint probability under \(Q\), which plays the role of a (negative) “energy,” and the second term is the Shannon entropy of \(Q\).

For \(Q(\theta)\) to be a useful approximation of the posterior, we should choose \(Q(\theta)\) such that \(\mathcal{L}(Q)\) is easy to compute and optimize, but not so simple that \(Q(\theta)\) very poorly approximates the posterior.
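
To make these ideas concrete, here is a minimal sketch of VI for a hypothetical one-dimensional problem, not part of this lesson's analysis: we posit \(Q(\theta) = \text{Norm}(\mu, \sigma)\), compute the ELBO above by numerical quadrature against a made-up skewed log joint, and maximize it with SciPy. The toy target, the quadrature limits, and the optimizer are all illustrative assumptions.

# A minimal sketch of VI in one dimension (hypothetical toy problem).
# The ELBO integrand below is Q(θ) [ln π(θ, y) - ln Q(θ)], as derived above.
import numpy as np
import scipy.integrate
import scipy.optimize
import scipy.stats


def log_joint(theta):
    """Made-up unnormalized log joint ln π(θ, y); skewed, so not Gaussian."""
    return scipy.stats.gumbel_r.logpdf(theta)


def neg_elbo(params):
    """Negative ELBO for Q(θ) = Norm(µ, σ), computed by quadrature."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)

    def integrand(theta):
        q = scipy.stats.norm.pdf(theta, mu, sigma)
        log_q = scipy.stats.norm.logpdf(theta, mu, sigma)
        return q * (log_joint(theta) - log_q)

    elbo, _ = scipy.integrate.quad(integrand, -10, 30)
    return -elbo


# Maximize the ELBO (minimize its negative) over µ and ln σ
res = scipy.optimize.minimize(neg_elbo, np.array([0.0, 0.0]), method="Nelder-Mead")
mu_best, sigma_best = res.x[0], np.exp(res.x[1])

# Independent draws out of the approximate posterior; no MCMC needed
rng = np.random.default_rng(3252)
theta_samples = rng.normal(mu_best, sigma_best, size=4000)

The best Normal approximation can capture the location and scale of the skewed target, but not its asymmetry; that is the price of restricting \(Q\) to a simple family.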

Choosing Q(θ)

We can choose any family of functions to approximate our posterior, but it is particularly useful to choose the mean-field variational family. This assumes the parameters \(\theta\) of our statistical model are independent under the approximation, with each governed by its own variational parameters. Stated mathematically,

\begin{align} Q(\theta) = \prod_i q_i (\theta_i ; \phi_i), \end{align}

where the \(q_i\) are the factors of the approximation, each an independent function of only one parameter \(\theta_i\), and the \(\phi_i\) are the latent variational parameters that parametrize each \(q_i\). A posterior distribution in general does not factorize this way, which is what makes this an approximation.

To arrive at our final approximation, we need expressions for each \(q_i\). For a given \(q_i\), we want to estimate the posterior as a function of \(\theta_i\) only. To do this, we replace all \(\theta_{j\neq i}\) terms in the posterior with their expected values, i.e. \(\theta_j^\mathrm{mean} = \int d\theta_j\, \theta_j\, q_j(\theta_j)\). This is why this is considered a mean field approach: we approximate the effect of the other variables as a “mean field,” rather than using their true values. Note also that each mean is a constant, for which there is an analytical expression for most probability distributions.

But now we see our final hurdle. To fully specify \(q_i\), we need information about all the other partitions \(q_{j\neq i}\), since the mean of that distribution may appear in the expression for \(q_i\). Thus, to complete the optimization, we iteratively update \(\phi_i\) for each parameter \(\theta_i\), updating one variable at a time. We stop when we arrive at a solution (i.e. a set of \(q_i\) and \(\phi_i\)) that maximizes the ELBO.
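
As a concrete illustration of this coordinate-ascent iteration, the sketch below runs the classic mean-field updates for a toy bivariate Gaussian “posterior” with known mean and precision matrix (the standard textbook example, e.g. Bishop's Pattern Recognition and Machine Learning, Sec. 10.1.2). In that case each factor \(q_i\) is Normal with variance \(1/\Lambda_{ii}\), and its mean is updated using the current mean of the other factor. The numerical values are made up for illustration.

# Mean-field coordinate-ascent updates for a toy bivariate Gaussian target
# with mean mu and precision matrix Lam (illustrative values). Each factor
# q_i is Normal; its mean is updated holding the other factor's mean fixed.
import numpy as np

mu = np.array([1.0, 2.0])                  # mean of the toy target
Lam = np.array([[2.0, 1.2], [1.2, 2.0]])   # precision matrix of the toy target

m = np.zeros(2)                            # variational means, initial guess
for _ in range(50):                        # iterate the updates to convergence
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

s = 1.0 / np.sqrt(np.diag(Lam))            # variational standard deviations

The variational means converge to the true means, but the variances \(1/\Lambda_{ii}\) are smaller than the true marginal variances and all correlation information is lost, a feature of mean-field approximations we will see again below.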

Summary of VI algorithm

  1. Our goal is to find a distribution \(Q(\theta)\) that maximizes the ELBO, which is equivalent to minimizing the dissimilarity between the posterior \(g(\theta \mid y)\) and an approximate distribution \(Q(\theta)\).

  2. Posit that \(Q(\theta) = \prod_i q_i(\theta_i; \phi_i)\). Note that you can derive what the functional form for \(q_i\) should be (e.g. Normal, Gamma, etc.), but Stan’s VI algorithm will do it for you. This defines the “restricted class” from which you find your best \(Q(\theta)\).

  3. Vary \(\phi_i\) to maximize the ELBO and find the best \(Q(\theta)\). This gives the final form of approximation of the posterior.

  4. Draw samples out of the approximation of the posterior. This does not need MCMC; independent samples may be directly drawn using a random number generator and transforms.

Automatic Differentiation Variational Inference and Stan

The algorithm Stan uses for VI is called Automatic Differentiation Variational Inference (ADVI). The technique was developed by Alp Kucukelbir and coworkers and was published in 2017.

I will not go into the details of the ADVI algorithm here, but will give a few highlights that are important to understand while using it. First, any parameters that must be positive or are otherwise constrained are transformed such that they may take on any real value. We will call the transformed parameters \(\zeta\). The ADVI algorithm is clever in its choice of transform, and you should see the Kucukelbir, et al. paper for details. Next, we choose a family of distributions for \(Q(\theta)\). One choice is a Gaussian mean field family where

\begin{align} &Q(\zeta) = \prod_i q_i(\zeta_i),\\[1em] &\zeta_i \sim \text{Norm}(\mu_i, \sigma_i). \end{align}

The ADVI algorithm then finds the values of the \(\mu_i\)’s and \(\sigma_i\)’s that maximize the ELBO using a stochastic gradient ascent algorithm. Note that when we choose the Gaussian mean field family, we lose information about the covariance of parameter values.

Stan’s implementation of ADVI also allows specification of a full rank Gaussian,

\begin{align} &\zeta \sim \text{MultiNorm}(\boldsymbol{\mu}, \mathsf{\Sigma}). \end{align}

There are now many more parameters to adjust in order to maximize the ELBO (all of the entries of \(\mathsf{\Sigma}\) instead of just the diagonal ones in the mean field case), and the optimization is more difficult than with the mean field family. The benefit is that the extra complexity (and therefore flexibility) allows a better approximation of the posterior, and more covariance information about the parameters is retained.

After the optimal parameters are found for the optimizing distributions, samples of \(\zeta\) are drawn out of these Gaussian distributions and these samples are transformed back into \(\theta\).
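
As a small illustration of that last step, the sketch below draws Gaussian samples of a transformed parameter \(\zeta = \ln \sigma\) and maps them back to the positive parameter \(\sigma\). The variational parameter values are made up.

# Back-transformation step: a positive parameter σ is represented as
# ζ = ln σ on the real line; Gaussian draws of ζ are exponentiated to give
# draws of σ. The variational parameters below are illustrative only.
import numpy as np

rng = np.random.default_rng(3252)

mu_zeta, sigma_zeta = 1.2, 0.3                     # optimized variational parameters for ζ
zeta = rng.normal(mu_zeta, sigma_zeta, size=4000)  # draws of ζ, unconstrained
sigma = np.exp(zeta)                               # draws of σ > 0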

Examples

We will now apply ADVI to two examples we have already studied with MCMC. We will use the same Stan models we used previously. Using ADVI is as simple as using the same Stan model as you would with MCMC, but drawing variational samples instead of MCMC samples.

Example 1: Spindle size as a function of volume

We will first consider a model for microtubule spindle size based on Matt Good’s experiments. We considered this model in lesson 11. Here is a quick plot of the data.

[3]:
# Load in data frame (data_path is assumed to point to the directory
# containing the data set; define it for your setup)
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")

# Pull out numpy arrays
ell = df["Spindle Length (um)"].values
d = df["Droplet Diameter (um)"].values

# Make a plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label="droplet diameter (µm)",
    y_axis_label="spindle length (µm)",
    x_range=[0, 250],
    y_range=[0, 50],
)

p.circle(
    source=df,
    x="Droplet Diameter (um)",
    y="Spindle Length (um)",
    alpha=0.3,
)

bokeh.io.show(p)

The generative model we used for this data set is

\begin{align} &\phi \sim \text{LogNorm}(\ln 20, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma_0 \sim \text{Gamma}(2, 10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &\sigma_i = \sigma_0\,\mu_i,\\[1em] &l_i \mid d_i, \gamma, \phi, \sigma_0 \sim \text{Norm}(\mu_i, \sigma_i) \;\forall i. \end{align}

The corresponding Stan code is

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / cbrt(1 + denom_ratio);
  }
}


data {
  int N;
  int N_ppc;
  array[N] real d;
  array[N_ppc] real d_ppc;
  array[N] real ell;
}


parameters {
  real<lower=0> phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma_0;
}


transformed parameters {
  array[N] real mu;
  array[N] real sigma;

  for (i in 1:N) {
    mu[i] = ell_theor(d[i], phi, gamma_);
    sigma[i] = sigma_0 * mu[i];
  }
}


model {
  phi ~ lognormal(log(20.0), 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma_0 ~ gamma(2.0, 10.0);

  ell ~ normal(mu, sigma);
}


generated quantities {
  array[N_ppc] real ell_ppc;

  for (i in 1:N_ppc) {
    real mu_ppc = ell_theor(d_ppc[i], phi, gamma_);
    ell_ppc[i] = normal_rng(mu_ppc, sigma_0 * mu_ppc);
  }
}

Let’s compile the model. The compilation for ADVI is exactly the same as for MCMC.

[4]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="spindle.stan")

Now, we can perform ADVI. To do so, we use sm.variational() instead of sm.sample(). We specify that we want to use a Gaussian mean field approximation using the algorithm="meanfield" kwarg. Stan will automatically do the optimization. Remember, though, for general optimization problems, there are no guarantees that we can find a global optimum.

Finally, we use the draws=4000 kwarg to specify how many samples to draw out of the approximate posterior.

[5]:
N_ppc = 200
d_ppc = np.linspace(0.1, 250, N_ppc)

d = df["Droplet Diameter (um)"].values
ell = df["Spindle Length (um)"].values

data = dict(N=len(df), d=d, ell=ell, N_ppc=N_ppc, d_ppc=d_ppc)

with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data,
        algorithm="meanfield",
        draws=4000,
        seed=3252,
    )

Unfortunately, ArviZ does not have support for VI results, so we need to extract the information about the VI calculation directly from the object returned by sm.variational(). First, we can look at Stan’s output pertaining to its optimization calculation for maximizing the ELBO. To do this, we will pull out the stdout_files and print their contents.

[6]:
for fname in samples_vi.runset._stdout_files:
    with open(fname, "r") as f:
        print(f.read())
method = variational
  variational
    algorithm = meanfield (Default)
      meanfield
    iter = 10000 (Default)
    grad_samples = 1 (Default)
    elbo_samples = 100 (Default)
    eta = 1.000000 (Default)
    adapt
      engaged = 1 (Default)
      iter = 50 (Default)
    tol_rel_obj = 0.010000 (Default)
    eval_elbo = 100 (Default)
    output_samples = 4000
id = 1 (Default)
data
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpk8298dbe/nqkhtet6.json
init = 2 (Default)
random
  seed = 3252
output
  file = /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmpk8298dbe/spindlecdpka3qa/spindle-20240220003452.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  save_cmdstan_config = 0 (Default)
num_threads = 1 (Default)

------------------------------------------------------------
EXPERIMENTAL ALGORITHM:
  This procedure has not been thoroughly tested and may be unstable
  or buggy. The interface is subject to change.
------------------------------------------------------------



Gradient evaluation took 0.00035 seconds
1000 transitions using 10 leapfrog steps per transition would take 3.5 seconds.
Adjust your expectations accordingly!


Begin eta adaptation.
Iteration:   1 / 250 [  0%]  (Adaptation)
Iteration:  50 / 250 [ 20%]  (Adaptation)
Iteration: 100 / 250 [ 40%]  (Adaptation)
Iteration: 150 / 250 [ 60%]  (Adaptation)
Iteration: 200 / 250 [ 80%]  (Adaptation)
Success! Found best value [eta = 1] earlier than expected.

Begin stochastic gradient ascent.
  iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes
   100       -16953.664             1.000            1.000
   200       -26130.759             0.676            1.000
   300       -23178.757             0.493            0.351
   400        -4616.726             1.375            1.000
   500        -3575.492             1.158            0.351
   600        -3390.535             0.974            0.351
   700        -3084.661             0.849            0.291
   800        -1877.438             0.823            0.351
   900        -1845.521             0.734            0.291
  1000        -1847.926             0.661            0.291
  1100        -1841.887             0.561            0.127   MAY BE DIVERGING... INSPECT ELBO
  1200        -1846.126             0.526            0.099   MAY BE DIVERGING... INSPECT ELBO
  1300        -1841.222             0.514            0.055   MAY BE DIVERGING... INSPECT ELBO
  1400        -1844.497             0.112            0.017
  1500        -1844.251             0.083            0.003   MEDIAN ELBO CONVERGED

Drawing a sample of size 4000 from the approximate posterior...
COMPLETED.

Stochastic gradient ascent was used to maximize the ELBO, and we got the message that the optimization converged. Stan then proceeded to draw 4000 samples out of the approximate posterior.

Now, we can obtain the draws using the samples_vi.variational_sample_pd object, which has the samples as a data frame.

[7]:
df_vi = samples_vi.variational_sample_pd

# Take a look
df_vi.head()
[7]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0.0 -1833.69 -1.499310 37.4004 0.890305 0.112656 22.2885 22.9894 23.7256 24.6603 ... 38.9634 34.9517 33.7124 38.6103 42.0959 37.1982 44.5397 44.5302 40.5810 37.9226
1 0.0 -1838.96 -0.313062 37.2800 0.866539 0.112550 21.8002 22.4966 23.2294 24.1624 ... 34.6073 35.7513 30.6759 38.5644 40.7878 38.9494 34.2358 40.0993 34.2573 44.3059
2 0.0 -1833.97 -0.563498 37.3965 0.883906 0.114579 22.1614 22.8616 23.5975 24.5327 ... 39.0093 42.3316 34.3370 43.7898 44.6077 37.3891 36.7614 33.7377 44.4946 43.3266
3 0.0 -1853.56 -4.220800 37.1215 0.839312 0.117270 21.2285 21.9183 22.6460 23.5749 ... 38.0863 38.5711 36.8697 36.2959 39.7949 33.8898 38.4848 40.9590 44.5866 36.2442
4 0.0 -1832.93 -1.922950 37.7253 0.876273 0.116066 22.0482 22.7528 23.4944 24.4385 ... 42.3030 32.0449 33.8408 29.3432 37.8117 48.4723 37.3007 28.4747 37.3459 36.6706

5 rows × 1546 columns

The lp__ column is always zero (it is a legacy feature that is never used). The log_p__ and log_g__ columns are values of the log posterior and log approximate posterior, respectively, so we do not really use those. The remaining columns are samples of the parameter values and posterior predictive checks.

We can also directly get the values of \(\phi\), \(\gamma\), and \(\sigma_0\) that resulted in the maximal ELBO.

[8]:
samples_vi.variational_params_pd
[8]:
lp__ log_p__ log_g__ phi gamma_ sigma_0 mu[1] mu[2] mu[3] mu[4] ... ell_ppc[191] ell_ppc[192] ell_ppc[193] ell_ppc[194] ell_ppc[195] ell_ppc[196] ell_ppc[197] ell_ppc[198] ell_ppc[199] ell_ppc[200]
0 0.0 0.0 0.0 37.3255 0.872579 0.114267 21.9268 22.6247 23.3587 24.2927 ... 36.9062 29.4505 35.2221 35.2612 34.0087 37.0151 36.9185 44.0853 39.615 49.3834

1 rows × 1546 columns

Let’s take a quick look at the posterior predictive checks to make sure the likelihood parametrized by draws out of the approximate posterior can generate the data.

[9]:
ell_ppc = df_vi[df_vi.columns[df_vi.columns.str.contains("ell_ppc")]].values

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        d_ppc,
        data=(d, ell),
        x_axis_label="droplet diameter (µm)",
        y_axis_label="spindle length (µm)",
    )
)

This looks ok, but remember that we are looking at how the likelihood can generate data using parameters sampled from the approximate posterior. Now, let’s look at the samples out of the approximate posterior.

[10]:
parameters = ["phi", "gamma_", "sigma_0"]

bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

Notice that the approximate joint distribution for each pair of parameters shows no tilt; the parameters are uncorrelated under the approximation. This is because the covariance between parameters is neglected in the Gaussian mean field approximation of the posterior.

To compare, let’s look at a corner plot from the parameters obtained using MCMC.

[11]:
with bebi103.stan.disable_logging():
    samples = az.from_cmdstanpy(sm.sample(data=data))

bokeh.io.show(bebi103.viz.corner(samples, parameters=parameters))

The full posterior clearly shows a strong covariance between \(\gamma\) and \(\phi\). For a clearer comparison, let’s compare the ECDFs of the marginal distributions for \(\gamma\) and \(\phi\) obtained from MCMC and from VI.

[12]:
def plot_marginal_ecdfs():
    p_phi = iqplot.ecdf(
        samples.posterior.phi.values.flatten(), legend_label="MCMC", x_axis_label="φ"
    )
    p_phi = iqplot.ecdf(
        df_vi.phi.values,
        line_kwargs=dict(line_color="orange"),
        p=p_phi,
        legend_label="ADVI",
    )

    p_gamma = iqplot.ecdf(samples.posterior.gamma_.values.flatten(), x_axis_label="γ")
    p_gamma = iqplot.ecdf(
        df_vi.gamma_.values,
        line_kwargs=dict(line_color="orange"),
        p=p_gamma,
    )

    return p_phi, p_gamma

bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

We see that there are differences in the marginal distributions between the approximate and exact posteriors. This will in general be the case; the approximate posterior based on VI will never be exact (unless of course the problem is set up such that the posterior is in fact Gaussian), and the approximation may introduce substantial differences.

To attempt to do better, capturing the covariance between \(\phi\) and \(\gamma\), and hopefully also getting the marginal distributions for each closer to the true posterior, let’s use a full rank Gaussian family in our approximation.

[13]:
# Take samples using full rank Gaussian approximation
with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=3251
    )

# Convert to data frame
df_vi = samples_vi.variational_sample_pd

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

This seems to have at least captured the covariance between \(\phi\) and \(\gamma\). Let’s check the marginal distributions for each, again compared to MCMC.

[14]:
bokeh.io.show(bokeh.layouts.column(*plot_marginal_ecdfs()))

The ADVI results are much closer to those from full MCMC. Importantly, since there is stochasticity in the ELBO maximization procedure, we will not get the same results each time. Let’s try again with a different seed.

[15]:
# Take samples using full rank Gaussian approximation
with bebi103.stan.disable_logging():
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=88812
    )

# Convert to data frame
df_vi = samples_vi.variational_sample_pd

# Show corner plot
bokeh.io.show(bebi103.viz.corner(df_vi[parameters], parameters=parameters))

In this case, we do not see covariance between \(\phi\) and \(\gamma\), but we do see that each covaries with \(\sigma_0\). We could still pass posterior predictive checks; the approximate posterior could generate the observed data, but of course the approximate posterior is not the model. Nonetheless, the marginal distributions at least give reasonable values for the range of values the parameters may take.
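
To back up that claim, we could repeat the posterior predictive plot from before, now using the full rank draws (the code mirrors the earlier check).

ell_ppc = df_vi[df_vi.columns[df_vi.columns.str.contains("ell_ppc")]].values

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        d_ppc,
        data=(d, ell),
        x_axis_label="droplet diameter (µm)",
        y_axis_label="spindle length (µm)",
    )
)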

If we were doing a more careful analysis with ADVI, we might run it multiple times, checking the ELBO value of each, and finding the best approximation of the posterior.
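
A sketch of how that might look is below. The helper scrapes the final ELBO from the console output, whose format is shown above; treat the parsing and the choice of seeds as assumptions to adapt rather than a definitive recipe.

# Run ADVI several times with different seeds and keep the run with the
# largest final ELBO. The ELBO is parsed from the console output (format as
# printed above); adjust the parsing for other CmdStan versions.
import numpy as np


def final_elbo(vb):
    """Last ELBO value reported in an ADVI run's console output."""
    elbo = np.nan
    for fname in vb.runset._stdout_files:
        with open(fname, "r") as f:
            for line in f:
                tokens = line.split()
                if len(tokens) >= 2 and tokens[0].isdigit():
                    try:
                        elbo = float(tokens[1])
                    except ValueError:
                        pass
    return elbo


best_elbo = -np.inf
best_vb = None
with bebi103.stan.disable_logging():
    for seed in (3251, 3252, 3253, 3254):
        vb = sm.variational(
            data=data,
            algorithm="fullrank",
            draws=4000,
            seed=seed,
            require_converged=False,  # keep going even if a run fails to converge
        )
        elbo = final_elbo(vb)
        if elbo > best_elbo:
            best_elbo, best_vb = elbo, vb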

Example 2: A multilevel hierarchical model

For our next example, we will use the contrived multilevel hierarchical model we considered in lesson 20. First, let’s “load” in the contrived data and plot it as a reminder.

[16]:
data_str = "".join(
    [
        "day,batch,colony,y\nm,1,1,11.40\nm,1,1,10.54\n",
        "m,1,1,12.17\nm,1,1,12.41\nm,1,1,9.97\nm,1,2,10.76\n",
        "m,1,2,9.16\nm,1,3,9.50\nm,2,1,9.34\nm,2,1,10.14\n",
        "m,2,2,10.72\nm,2,2,10.63\nm,3,1,11.37\nm,3,1,10.51\n",
        "m,4,1,11.06\nm,4,1,10.68\nm,4,1,12.58\nm,4,2,11.21\n",
        "m,4,2,11.07\nm,4,2,10.74\nm,4,2,11.68\nm,4,3,10.65\n",
        "m,4,3,9.06\nw,1,1,10.40\nw,1,2,10.75\nw,1,2,11.42\n",
        "w,1,2,10.42\nw,1,2,9.18\nw,1,2,10.69\nw,1,2,9.37\n",
        "w,1,2,11.32\nw,2,1,9.90\nw,2,1,10.53\nw,2,1,10.76\n",
        "w,3,1,11.08\nw,3,1,9.27\nw,3,1,12.01\nw,3,1,12.20\n",
        "w,3,1,11.23\nw,3,1,10.96\nr,1,1,9.73\nr,1,2,11.25\n",
        "r,1,2,9.99\nr,1,2,10.12\nr,1,3,9.65\nr,1,3,10.18\nr,1,4,12.70\n",
    ]
)

data_str = (
    data_str.replace("m", "monday").replace("w", "wednesday").replace("r", "thursday")
)
df = pd.read_csv(io.StringIO(data_str))

bokeh.io.show(
    iqplot.strip(
        df,
        q="y",
        cats=["day", "batch"],
        color_column="colony",
        marker_kwargs=dict(alpha=0.6),
    )
)

The statistical model we used is

\begin{align} &\theta \sim \text{Norm}(10, 3) \\[1em] &\tau \sim \text{HalfNorm}(5) \\[1em] &\theta_1 \sim \text{Norm}(\theta, \tau) \\[1em] &\theta_2 \sim \text{Norm}(\theta_1, \tau) \\[1em] &\theta_3 \sim \text{Norm}(\theta_2, \tau) \\[1em] &\sigma \sim \text{HalfNorm}(5) \\[1em] &y\sim \text{Norm}(\theta_3, \sigma). \end{align}

We coded this up as a Stan model.

data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;
  int J_2;
  int J_3;

  // Index arrays to keep track of hierarchical structure
  array[J_2] int index_1;
  array[J_3] int index_2;
  array[N] int index_3;

  // The measurements
  array[N] real y;
}


parameters {
  // Hyperparameters level 0
  real theta;

  // How hyperparameters vary
  real<lower=0> tau;

  // Hyperparameters level 1
  vector[J_1] theta_1;

  // Hyperparameters level 2
  vector[J_2] theta_2;

  // Parameters
  vector[J_3] theta_3;
  real<lower=0> sigma;
}


model {
  theta ~ normal(10, 3);
  sigma ~ normal(0, 5);
  tau ~ normal(0, 5);

  theta_1 ~ normal(theta, tau);
  theta_2 ~ normal(theta_1[index_1], tau);
  theta_3 ~ normal(theta_2[index_2], tau);

  y ~ normal(theta_3[index_3], sigma);
}

Before we can sample or perform ADVI, we need to encode the data for Stan to use.

[17]:
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols=["day", "batch", "colony"], data_cols="y"
)

Now, we’ll compile the model and draw samples both with MCMC and with ADVI.

[18]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="hier_lognorm.stan")
    samples = az.from_cmdstanpy(sm.sample(data=data, adapt_delta=0.99, show_progress=False))
    samples_vi = sm.variational(
        data=data, algorithm="fullrank", draws=4000, seed=32520, require_converged=False
    )

df_vi = samples_vi.variational_sample_pd

First, we’ll make a corner plot for the full MCMC samples. We will only show the hyperparameters for ease of display.

[19]:
bokeh.io.show(
    bebi103.viz.corner(samples, parameters=["theta", "sigma", "tau"])
)

We see a strong funnel shape with small values of \(\tau\) being heavily sampled. There is a strong tendency toward pooling.

Now, let’s look at the corner plot of the variational samples.

[20]:
bokeh.io.show(bebi103.viz.corner(df_vi, parameters=["theta", "sigma", "tau"]))

Oof! This does not look much like the posterior! The parameter \(\theta\) is underestimated and \(\tau\) is overestimated. Let’s compare the marginal ECDFs.

[21]:
p_theta = iqplot.ecdf(
    samples.posterior.theta.values.flatten(),
    legend_label="MCMC",
    x_axis_label="θ",
    frame_height=150,
)
p_theta = iqplot.ecdf(
    df_vi.theta.values,
    line_kwargs=dict(line_color='orange'),
    p=p_theta,
    legend_label="ADVI",
)

p_sigma = iqplot.ecdf(
    samples.posterior.sigma.values.flatten(), x_axis_label="σ", frame_height=150,
)
p_sigma = iqplot.ecdf(
    df_vi.sigma.values,
    line_kwargs=dict(line_color='orange'),
    p=p_sigma,
)

p_tau = iqplot.ecdf(
    samples.posterior.tau.values.flatten(), x_axis_label="τ", frame_height=150,
)
p_tau = iqplot.ecdf(
    df_vi.tau.values,
    line_kwargs=dict(line_color='orange'),
    p=p_tau,
)

bokeh.io.show(bokeh.layouts.gridplot([p_theta, p_sigma, p_tau], ncols=1))

The VI results are a serious departure from full MCMC. In general, ADVI performs poorly when there is complex model structure like we have here in this hierarchical model. There are custom methods for ADVI inference with hierarchical models, for example as described in this paper by Ranganath and coworkers, but these are not implemented in Stan.

In general, my advice is to employ VI only when you really need the speed and your model is not too complex.

[22]:
bebi103.stan.clean_cmdstan()

Computing environment

[23]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

numpy     : 1.26.2
pandas    : 2.1.4
cmdstanpy : 1.2.0
arviz     : 0.17.0
bokeh     : 3.3.0
iqplot    : 0.3.5
bebi103   : 0.1.20
jupyterlab: 4.0.10

cmdstan   : 2.34.0