15. Collector’s box of distributions

Probability distributions are the building blocks of any generative model. In building a model, we usually start with the likelihood and prescribe a distribution that generates the data, \(f(y\mid \theta)\). The likelihood makes clear for which parameters \(\theta\) we need priors, \(g(\theta)\). We will ultimately characterize the posterior distribution of the parameters, \(g(\theta\mid y)\).

Check out the Distribution Explorer

Sometimes, we need to derive a distribution from first principles. But often, we can use off-the-shelf named distributions in our modeling. It is therefore very valuable to be familiar with these distributions. You should therefore read and interact with the content of the Distribution Explorer.

Choosing distributions

Each of the named distributions has a story. Oftentimes the process of choosing a distribution for a model amounts to matching the story of the data generative process to that of a distribution. To help pare down which distributions to consider, we answer the following questions about the variable (either data or parameter) of interest.

  1. Is the variable a discrete quantity of a continuous one? Only discrete distributions describe discrete variables and only continuous distributions describe continuous variables.

  2. Is the variable univariate or multivariate?

  3. On what domain is the variable defined?

When we answer the questions, though, we typically answer them approximately. As an example, we can consider repeated measurements of the length of C. elegans egg lengths. We might measure these by taking microscope images and then counting the number of pixels along the long axis of the egg. We can then convert the pixel count to a length using the known interpixel distance of the optical setup.

  1. Is the egg length discrete or continuous? Strictly speaking, it is discrete. We are using a digital camera and we have precision to an integer number of pixels. This is generally the case with any digital measurement, which most of our measurements are. Nonetheless, we can approximate the measurement as continuous with little error. When we can approximate a discrete quantity as continuous, we usually want to go ahead and do that, as it simplifies the modeling and calculations.

  2. Is egg length univariate or multivariate? It is univariate.

  3. On what domain is egg length defined? Egg length is defined on the set of positive real numbers. We might then choose a likelihood defined on that domain. However, a Normal distribution would also be a good choice. It matches the story of how egg length might be generated (the sum of many processes, none of which has a giant variance), and its light tails mean that negative egg lengths are extremely unlikely (though not entirely impossible). So, a univariate Normal distribution is a good (approximate) distribution to use as a likelihood for egg length.

As another example, imagine we have a likelihood that is a bivariate Normal and we want to come up with a prior distribution for the \(2 \times 2\) covariance matrix.

  1. Is it discrete or continuous? It is continuous.

  2. Is the variable univariate or multivariate? The covariance matrix is multivariate. Note that it is often the case that we approximate multivariate variables as being a set if univariate variables. So, instead of coming up with a prior for a covariance matrix, we could come up with a prior for each of the three entries (three because the fourth is determined by symmetry). We can write the covariance matrix as \(\mathsf{\Sigma} = \mathsf{S} \cdot \mathsf{C} \cdot \mathsf{S}\), where \(S_{11} = \sigma_1\), \(S_{22} = \sigma_2\), \(S_{12} = S_{21} = 0\), \(C_{11} = C_{22} = 1\), and \(C_{12} = C_{21} = \rho\). We could then assume that \(\sigma_1\), \(\sigma_2\), and \(\rho\) are all independent and we can come up with priors for each.

  3. On what domain is the covariance matrix defined? The covariance matrix is positive definite, which defined the domain. We would therefore need to choose a distribution defined on that domain, and the LKJ distribution works for that purpose. Alternatively, we could define a distribution for \(\sigma_1\), \(\sigma_2\), and \(\rho\) separately. The domains of \(\sigma_1\) and \(\sigma_2\) are positive real numbers. We could use, e.g., Half-Normal priors for these two parameters. The parameter \(\rho\) is defined in the interval \([-1, 1]\). None of the named distributions in the Distribution Explorer are defined on this domain, but the Beta distribution is defined on the domain [0, 1]. We can use a linear transformation to get a distribution for \(\rho\).

\[\begin{split}\begin{align} &\xi \sim \text{Beta}(\alpha, \beta)\\ &\rho = 2\xi - 1. \end{align}\end{split}\]

Starting simple and adding flexibility

Imagine you build a generative model based on your domain knowledge and you find it can not quite capture all of the structure the observed data as may be evident from posterior predictive checks. You then wish to update your model. How should you choose a new likelihood?

In some cases, you need to rethink your domain knowledge. Maybe you can do some more theoretical work and come up with a new mathematical model (with your full generative model featuring random deviations of this mathematical model). Or, as is often the case, the main underpinnings of your theoretical model are exactly what you want, but you made some approximations in your derivations along the way that were inappropriate. You may also have neglected some effects. In these cases, your first model was an approximation or limit of the next model you may try.

Perhaps you do not really have an underlying theoretical model, but are simply modeling the data acquisition with a simple probability distribution. If that simple distribution misses some of the features of the data, you should select a new distribution for which the simple distribution is a limit or special case. This way, you bring more flexibility into the model, but still retain your initial inclinations based on your domain knowledge.

We can consider three examples.

  1. You model time between events (e.g., bursts of gene expression) as a Poisson process using an Exponential distribution. The CDF of your observed times has an inflection point, and the Exponential distribution cannot capture this. You therefore update your model to be have a Gamma likelihood. The Exponential distribution is a special case of the Gamma distribution. If the marginal posterior distribution of the parameter \(\alpha\) of the Gamma distribution has a lot of probability mass close to one, you know the Exponential model is reasonable. But if \(\alpha\) departs away from one, you know the Exponential distribution is inadequate.

  2. You make repeated measurements and use a Normal likelihood. You discover that the data contain too many extreme measurements to be commensurate with a Normal distribution. You can therefore use a Student-t distribution. The Normal distribution is the \(\nu\to\infty\) limit of the Student-t distribution. Therefore, if the posterior has very little probability mass for small \(\nu\), the Normal distribution may be adequate, but small \(\nu\) is indicative of heavy tails.

  3. Consider a set of measurements \(\{x_1, x_2, \ldots\}\). Each measurement has associated with is some measurement error such that \(x_i\sim\text{Norm}(\mu_i, s)\). Furthermore, there is a natural variability from measurement to measurement such that \(\mu_i \sim \text{Norm}(\mu, \sigma)\). You will show in a subsequent homework that in the limit where the natural variability is much greater than the measurement error (\(s\ll \sigma\)), the model become non-hierarchical with \(x_i \sim \text{Norm}(\mu, \sigma)\).

It is general good practice in model building to start simple and then sequentially build your models to capture the complexity of the data such that each more complex model has the simpler model preceding it as a special case or limit. This helps preserve domain knowledge and keep you from wandering into ad hoc statistical modeling. It also helps in model assessment to see if the added complexity is really necessary.

Priors for variables with support on the set of positive real numbers

We often have to choose priors with support on the set of positive real numbers. In practice, I find there are three classes of distributions I use for this.

  1. Half-distributions. Listed in order from lightest to heaviest tail, these are the Half-Normal, Half-Student-t and Half-Cauchy distributions. There is more mass around around zero in these distribution than the Gamma/Inv-Gamma distributions, but these can also have very heavy tails, as is the case with the Half-Cauchy.

  2. Gamma/Inv-Gamma distirbutions. The Gamma distribution and Inverse Gamma distributions are defined on the domain of positive real numbers. The Gamma distribution put more probability mass near zero than the Inverse Gamma. The Inverse Gamma has a much heavier tail. So, if you want to keep the probability mass away from zero and toward larger values, use an Inverse Gamma. For the opposite, use a Gamma.

  3. Log-Normal. I find the Log-Normal distribution to be very useful when I have order-of-magnitude estimates of the maximum and minimum values I might expect a parameter to have. This is the “bet-the-farm” approach to prior building. Consider for example a single molecule experiment where we are assessing the speed of a kinesin motor. I am sure that the kinesin motor walks slower than a 0.1 meters per second and faster than a nanometer per second. That is, I know a priori that kinesin’s walking speed is between \(10^{-3}\) µm/s and \(10^{5}\) µm/s. So, I might choose a prior such that the base ten logarithm of the kinesin speed is between –3 and 5. If I choose to have about 95% of the probability mass between these two values, I can choose a Normal distribution with a location parameter of 1 and a scale parameter of \((5 - (-3)) / 2 / 2 = 2\). Thus, \(\log_{10} v \sim \text{Norm}(1, 2)\). This is equivalent to a Log-Normal prior for \(v\). In practice, I almost always specify a Log-Normal prior as a Normal prior on the logarithm of the parameter and then exponentiate. Here is how that could look in Stan.

parameters {
  real log10_v;

transformed parameters {
  real v = 10^log10_v;

model {
  log10_v ~ normal(1, 2);