Choosing a hierarchical prior


Choosing a hierarchical prior is not always as straightforward as choosing the priors we are used to considering, because we have to specify both the hyperprior, \(g(\phi)\), and all conditional priors, \(g(\theta\mid \phi)\).

Exchangeability

The conditional probability, \(g(\theta\mid \phi)\), can take any reasonable form. If we have no reason to believe that we can distinguish any one \(\theta_i\) from another prior to the experiment, then the label “\(i\)” applied to the experiment may be exchanged with the label of any other experiment. That is, \(g(\theta_1, \theta_2, \ldots, \theta_k \mid \phi)\) is invariant to permutations of the indices. Parameters behaving this way are said to be exchangeable. A common (simple) exchangeable distribution is

\begin{align} g(\theta\mid \phi) = \prod_{i=1}^k g(\theta_i\mid \phi), \end{align}

which means that each of the parameters is an independent sample out of a distribution \(g(\theta_i\mid \phi)\), which we often take to be the same for all \(i\). This is reasonable to do in the worm reversal example.
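To see the permutation invariance concretely, here is a small sketch. The specific conditional density is an assumption for illustration only (a Beta distribution with a fixed, made-up concentration); any common i.i.d. conditional gives the same invariance.

```python
# Sketch: an i.i.d. conditional prior is exchangeable. The Beta conditional
# and the kappa value here are illustrative assumptions, not part of the model yet.
import numpy as np
import scipy.stats

def joint_prior(theta, phi, kappa=10.0):
    """Joint density prod_i g(theta_i | phi) for an assumed Beta conditional."""
    alpha, beta = phi * kappa, (1 - phi) * kappa
    return np.prod(scipy.stats.beta.pdf(theta, alpha, beta))

rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 0.9, size=5)

# Permuting the labels i leaves the joint density unchanged
print(np.isclose(joint_prior(theta, phi=0.3),
                 joint_prior(rng.permutation(theta), phi=0.3)))  # True
```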

In all of the hierarchical models we will work with, we will assume exchangeability. In situations where the indices of the experiment contain real information, meaning that the prior is no longer invariant to permutations of the indices, we lose exchangeability. An example would be if we did one set of experiments on one instrument and another set of experiments on another instrument. If we suspect these different instruments may have a real effect on the measured data, we need to explicitly model the differences. This is an example of a factor model, which features more nuance than the hierarchical models we will consider. We will not go into factor models in this course (but we definitely would if we had more time! Add it to the long list of beautiful advanced topics….). Of course, we may choose to ignore differences between the two instruments in our modeling and recover exchangeability.

Choice of the conditional distribution

We need to specify our prior, which for this hierarchical model means that we have to specify the conditional distribution, \(g(\theta_i\mid \phi)\), as well as \(g(\phi)\). We could assume a Beta prior for \(\phi\); the one we chose in our original nonhierarchical model would be a good choice.

\begin{align} \phi \sim \text{Beta}(1.1, 1.1). \end{align}

For the conditional distribution \(g(\theta_i\mid \phi)\), we might also assume it is Beta-distributed. This necessitates another parameter because the Beta distribution has two parameters.

The Beta distribution is typically written as

\begin{align} g(\theta\mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}, \end{align}
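As a quick sanity check, the density written above agrees with the implementation in `scipy.stats.beta`; the particular values of \(\alpha\) and \(\beta\) below are arbitrary.

```python
# Check: the written Beta density matches scipy.stats.beta.pdf
import numpy as np
from scipy.special import gamma
import scipy.stats

def beta_pdf(theta, alpha, beta):
    """The Beta density written out with Gamma functions, as in the text."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * theta**(alpha - 1) * (1 - theta)**(beta - 1)

theta = np.linspace(0.01, 0.99, 99)
print(np.allclose(beta_pdf(theta, 2.5, 3.5),
                  scipy.stats.beta.pdf(theta, 2.5, 3.5)))  # True
```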

where it is parametrized by positive constants \(\alpha\) and \(\beta\). The Beta distribution has mean and concentration, respectively, of

\begin{align} \phi &= \frac{\alpha}{\alpha + \beta}, \\[1em] \kappa &= \alpha + \beta. \end{align}

The concentration \(\kappa\) is a measure of how sharp the distribution is. The bigger \(\kappa\) is, the more sharply peaked the distribution is. Since we would like to parametrize our Beta distribution with its mean \(\phi\), we could use \(\kappa\) as our other parameter. So, our expression for the posterior is
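We can see the sharpening numerically: for fixed mean \(\phi\), the variance of the Beta distribution is \(\phi(1-\phi)/(\kappa+1)\), which shrinks as \(\kappa\) grows. The values of \(\phi\) and \(\kappa\) below are arbitrary.

```python
# Larger kappa -> sharper distribution: for fixed mean phi, the variance
# of Beta(phi*kappa, (1-phi)*kappa) is phi*(1-phi)/(kappa+1).
import scipy.stats

phi = 0.3
for kappa in (2.0, 20.0, 200.0):
    alpha, beta = phi * kappa, (1 - phi) * kappa
    print(kappa, scipy.stats.beta.var(alpha, beta))  # variance falls as kappa rises
```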

\begin{align} g(\theta, \phi, \kappa \mid n, N) = \frac{f(n,N\mid \theta)\,\left( \prod_{i=1}^k g(\theta_i\mid \phi, \kappa)\right)\,g(\phi, \kappa)}{f(n, N)}. \end{align}

We are left to specify the hyperprior \(g(\phi, \kappa)\). We will take \(\phi\) to come from a Beta distribution and \(\kappa\) to come from a weakly informative Half-Normal. Note that to switch from a parametrization using \(\phi\) and \(\kappa\) to one using \(\alpha\) and \(\beta\), we can use

\begin{align} &\alpha = \phi \kappa\\[1em] &\beta = (1-\phi)\kappa. \end{align}
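Putting the pieces together, here is a sketch of the unnormalized log posterior from the expression above, using this \(\phi\)-\(\kappa\) parametrization. The hyperprior \(g(\phi, \kappa)\) is passed in as a function since it is specified separately; the normalization \(f(n, N)\) is a constant and is dropped. The data and the flat hyperprior in the usage example are hypothetical, for illustration only.

```python
# Sketch: unnormalized log posterior for the hierarchical model,
# parametrized by (phi, kappa) via alpha = phi*kappa, beta = (1-phi)*kappa.
import numpy as np
import scipy.stats

def log_posterior(theta, phi, kappa, n, N, log_hyperprior):
    alpha, beta = phi * kappa, (1 - phi) * kappa
    lp = log_hyperprior(phi, kappa)                            # log g(phi, kappa)
    lp += np.sum(scipy.stats.beta.logpdf(theta, alpha, beta))  # sum_i log g(theta_i | phi, kappa)
    lp += np.sum(scipy.stats.binom.logpmf(n, N, theta))        # log f(n, N | theta)
    return lp

# Hypothetical data and a flat hyperprior, for illustration only
n, N = np.array([9, 12]), np.array([40, 40])
flat = lambda phi, kappa: 0.0
print(log_posterior(np.array([0.22, 0.3]), 0.26, 10.0, n, N, flat))
```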

With all of this, we can now put together our model.

\begin{align} &\phi \sim \text{Beta}(1.1, 1.1), \\[1em] &\kappa \sim \text{HalfNorm}(0, 1000), \\[1em] &\alpha = \phi \kappa, \\[1em] &\beta = (1-\phi)\kappa,\\[1em] &\theta_i \sim \text{Beta}(\alpha, \beta) \;\;\forall i,\\[1em] &n_i \sim \text{Binom}(N_i, \theta_i)\;\;\forall i. \end{align}
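To make the generative structure concrete, here is a sketch of drawing a single prior predictive sample from the complete model, top to bottom. The trial counts \(N_i\) are hypothetical, for illustration only.

```python
# Prior predictive sampling from the complete hierarchical model above.
import numpy as np

rng = np.random.default_rng(3252)
N = np.array([35, 35, 34, 40])                 # hypothetical trial counts N_i

phi = rng.beta(1.1, 1.1)                       # phi ~ Beta(1.1, 1.1)
kappa = np.abs(rng.normal(0.0, 1000.0))        # kappa ~ HalfNorm(0, 1000)
alpha, beta = phi * kappa, (1 - phi) * kappa   # reparametrize
theta = rng.beta(alpha, beta, size=len(N))     # theta_i ~ Beta(alpha, beta)
n = rng.binomial(N, theta)                     # n_i ~ Binom(N_i, theta_i)

print(n)                                       # one draw of prior predictive counts
```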

This is a complete specification of a hierarchical model.