19  Model building with prior predictive checks

Open in Google Colab | Download notebook


# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import polars as pl
import scipy.stats as st

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
bokeh.io.output_notebook()

When we do generative modeling, it is important to understand what kind of data might be generated from our model before looking at the data. You may have been struggling to come up with good priors for your models. I find that this struggle is a symptom of the fact that we tend to think about what kind of data we might see in an experiment rather than about how the parameters of the generative process that produces the data might be distributed.

In this lesson, we discuss how prior predictive checks can help build appropriate priors for a model. The procedure is used to check that the generative model can actually generate reasonable data sets.

19.1 Building a generative model

When considering how to build the model, we should think about the data generation process. We generally specify the likelihood first, and then use it to identify what the parameters are, which tells us what we need to specify for the prior. Indeed, the prior can often only be understood in the context of the likelihood.

Note, though, that the split between the likelihood and prior need not even be explicit. When we study hierarchical models, we will see that it is sometimes not obvious how to unambiguously define the likelihood and prior. Rather, in order to do Bayesian inference, we really only need to specify the joint distribution, \(\pi(y,\theta)\), which is the product of the likelihood and prior.

\[\begin{align} g(\theta \mid y) = \frac{f(y\mid \theta)\,g(\theta)}{f(y)} = \frac{\pi(y, \theta)}{f(y)}. \end{align} \]

After defining the joint distribution, either by defining a likelihood and a prior separately (which is what we usually do) or by directly defining the joint, we should check to make sure that making draws of data sets \(y\) out of it gives reasonable results. In other words, we need to ask if the data generation process we have prescribed actually produces data that obey physical constraints and match our intuition. The procedure of prior predictive checks enables this. We use the joint distribution to generate parameter sets and data sets and then check if the results make sense. We will learn about doing this using both Numpy and Stan in this lesson.

19.2 The experiment

We will do a very simple experiment in which we take a sample of retinal tissue and expose it to a constant light source. We measure the spiking activity of a single retinal ganglion cell (RGC) over a time interval. We will consider two models, one with Poissonian spiking and one with Inverse Gaussian-distributed spiking. We will build generative models for each, starting with the Poissonian model. We will use prior predictive checks to home in on the priors.

In order to do the prior predictive checks, we will assume we measure spiking for 30 seconds.

19.3 Model 1: Spikes are Poissonian

For our first model, we assume that spikes arise from a Poisson process, such that the interspike intervals are exponentially distributed.

19.3.1 The likelihood

Let \(y\) be the set of observed interspike intervals. We model the interspike intervals as i.i.d. Exponentially distributed with rate parameter \(\beta\). Our likelihood is then

\[\begin{align} y_i \sim \text{Expon}(\beta) \;\forall i. \end{align} \]

This makes clear that the parameter we need to estimate is \(\beta\), so we will need to provide a prior for it.
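
As an aside on parametrization that will matter when we code up draws: NumPy's `rng.exponential()` takes the *scale* parameter \(\tau = 1/\beta\), not the rate \(\beta\). A quick sketch (the rate value here is hypothetical, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical rate for illustration: beta = 0.1 spikes/ms, i.e., tau = 10 ms
beta = 0.1

# NumPy parametrizes the Exponential by its scale tau = 1/beta, not the rate
samples = rng.exponential(scale=1 / beta, size=100_000)

# The sample mean should be close to tau = 1/beta = 10 ms
print(samples.mean())
```

Forgetting this and passing the rate directly is a common source of draws that are off by orders of magnitude.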

19.3.2 The prior

With our likelihood specified, we need to give a prior for \(\beta\), \(g(\beta)\). For me, it is easier to think about a time scale than a rate, so I will think about priors of \(\tau = 1/\beta\). So, I ask, “What are reasonable values of waiting times for spikes?” I know that refractory times can be a few milliseconds, and we may have very sluggish spiking, so I want a prior that has probability mass between, say, 1 ms and 1 s. I could choose a Log Normal prior for \(\tau\), since my prior knowledge spans about three orders of magnitude. The Log Normal distribution also has the added benefit that \(\tau\), and therefore \(\beta\), is guaranteed to be positive, which we need to properly parametrize the likelihood.

I will use the trick that if I want 95% of probability mass of a Normally distributed parameter to lie between two bounds, I choose a location parameter as the midpoint between the bounds and the scale parameter as a quarter of the distance between the two bounds. In this case, I want \(\log_{10} \tau\) to lie between 0 and 3, where \(\tau\) is in units of milliseconds. So, I choose the following prior.

\[\begin{align} &\log_{10} \tau \sim \text{Normal}(1.5, 0.75),\\[1em] &\beta = 10^{- \log_{10} \tau}. \end{align} \]

I have committed the egregious sin of taking the logarithm of a quantity with units, but we understand that \(\tau\) is in units of milliseconds and \(\beta\) is in units of inverse milliseconds.
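
We can quickly verify the rule of thumb with SciPy: with the location at the midpoint of the bounds and the scale at a quarter of the distance between them, about 95% of the probability mass lies between the bounds (this is just the familiar \(\pm 2\sigma\) rule).

```python
import scipy.stats as st

# Probability that log10(tau) ~ Normal(1.5, 0.75) lies between 0 and 3;
# the bounds are two scale parameters (2 * 0.75 = 1.5) from the location
mass = st.norm.cdf(3, loc=1.5, scale=0.75) - st.norm.cdf(0, loc=1.5, scale=0.75)
print(mass)  # ≈ 0.954
```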

So, we now have a complete model; the likelihood and prior, and therefore the joint distribution, are specified. Here it is (all units are ms):

\[\begin{align} &\log_{10} \tau \sim \text{Normal}(1.5, 0.75),\\[1em] &\beta = 10^{- \log_{10} \tau},\\[1em] &y_i \sim \text{Expon}(\beta) \;\forall i. \end{align} \]

19.3.3 Prior predictive checks

Let us now generate samples out of this generative model. This procedure is known as prior predictive checking. We first generate parameter values drawing out of the prior distribution for \(\beta\). We then use those parameter values to generate a data set using the likelihood. We repeat this over and over again to see what kind of data sets we might expect out of our generative model. Each of these generated data sets is called a prior predictive sample. We can do this efficiently using Numpy’s random number generators.

# Time of recording
T = 30_000  # ms

# Instantiate random number generator
rng = np.random.default_rng()

# Number of prior predictive check samples
n_ppc_samples = 1000

# Draw parameters out of the prior
log_tau = rng.normal(1.5, 0.75, size=n_ppc_samples)
tau = 10**log_tau
beta = 1 / tau

# Draw data sets out of the likelihood for each set of prior params
def draw_spikes(beta, T):
    tau = 1 / beta

    t = rng.exponential(tau)
    spikes = []
    while t < T:
        spikes.append(t)
        isi = rng.exponential(tau)
        t += isi

    return np.array(spikes)


spikes = [draw_spikes(b, T) for b in beta]
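
As an aside, for a homogeneous Poisson process there is a vectorizable alternative to the while-loop above: the number of spikes in an interval of duration \(T\) is Poisson distributed with mean \(\beta T\), and given that count, the spike times are i.i.d. Uniform on \([0, T]\). A sketch of this equivalent sampling strategy (the function name, seed, and rate are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=3252)


def draw_spikes_poisson(beta, T):
    """Draw spike times on [0, T] for a Poisson process with rate beta.

    The number of spikes in the interval is Poisson(beta * T); given the
    count, the spike times are i.i.d. Uniform on [0, T].
    """
    n = rng.poisson(beta * T)
    return np.sort(rng.uniform(0, T, size=n))


# Hypothetical rate for illustration: tau = 100 ms over a 30 s recording
spikes_alt = draw_spikes_poisson(1 / 100, 30_000)
```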

There are many ways we could visualize the results. One informative plot is to make an ECDF of the ISIs for each of the data sets we generated.

p = None
for spike_vals in spikes[::20]:
    p = iqplot.ecdf(
        np.diff(spike_vals),
        p=p,
        line_kwargs=dict(line_alpha=0.5, line_width=1),
        x_axis_label="ISI (ms)",
    )

bokeh.io.show(p)
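
Beyond overlaying ECDFs, it can help to summarize each prior predictive data set with a single number, such as the total spike count in the 30-second recording. A minimal sketch, redrawing from the prior so the snippet is self-contained (the seed is arbitrary), and using the Poisson-count property of the model noted above:

```python
import numpy as np

rng = np.random.default_rng(seed=3252)
T = 30_000  # ms

# Redraw parameters out of the prior
log_tau = rng.normal(1.5, 0.75, size=1000)
beta = 10.0 ** (-log_tau)

# For Exponentially distributed ISIs, the spike count in time T is Poisson(beta * T)
n_spikes = rng.poisson(beta * T)

# Summaries: typical spike count and fraction of recordings with no spikes
print(np.median(n_spikes))
print(np.mean(n_spikes == 0))
```

If the typical counts look physiologically implausible, or a sizable fraction of recordings come out empty, that is a signal to revisit the prior.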