Generating a transition kernel: The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm covers a widely used class of algorithms for MCMC sampling. I will first state the algorithm here, and then show that it satisfies the necessary conditions for the walkers to sample out of the target posterior distribution.

The algorithm/kernel

Say our walker is at position \(\theta_i\) in parameter space.

  1. We randomly choose a candidate position \(\theta'\) to step to next from an arbitrary proposal distribution \(K(\theta'\mid \theta_i)\).

  2. We compute the Metropolis ratio,

    \[\begin{aligned} r = \frac{P(\theta')\,K(\theta_i\mid \theta')} {P(\theta_i)\,K(\theta'\mid \theta_i)}. \end{aligned}\]
  3. If \(r \ge 1\), accept the step and set \(\theta_{i+1} = \theta'\). Otherwise, accept the step with probability \(r\). If we do not accept the step, set \(\theta_{i+1} = \theta_i\).

Steps 2 and 3, together with the proposal distribution of step 1, define the transition kernel \(T(\theta_{i+1}\mid \theta_i)\). We can define the acceptance probability of the proposal step as

\[\begin{aligned} \alpha(\theta_{i+1}\mid \theta_i) = \min(1, r) = \min\left(1, \frac{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})} {P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}\right). \end{aligned}\]

Then, the transition kernel is

\[\begin{aligned} T(\theta_{i+1}\mid \theta_i) = \alpha(\theta_{i+1}\mid \theta_i)\,K(\theta_{i+1}\mid \theta_i). \end{aligned}\]
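
To make the algorithm concrete, below is a minimal sketch of a Metropolis-Hastings sampler in Python. It assumes a symmetric Gaussian proposal, so that the \(K\) terms cancel in the Metropolis ratio, and it works with log probabilities for numerical stability. The function name and signature are my own, for illustration; this is not the implementation used by any particular package.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps, sigma=0.1, rng=None):
    """Metropolis-Hastings with a symmetric Gaussian proposal.

    log_post : callable returning log P(theta) up to an additive constant
    theta0   : 1-D NumPy array, starting position of the walker
    sigma    : scale of the Gaussian proposal
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.copy(theta0)
    lp = log_post(theta)
    samples = np.empty((n_steps, len(theta)))
    n_accept = 0

    for i in range(n_steps):
        # Step 1: propose θ' from a symmetric proposal, K(θ'|θ) = K(θ|θ').
        theta_prime = theta + sigma * rng.normal(size=len(theta))
        lp_prime = log_post(theta_prime)

        # Steps 2-3: with symmetric K, the ratio reduces to r = P(θ')/P(θ).
        # Accept with probability min(1, r), computed in log space.
        if np.log(rng.uniform()) < lp_prime - lp:
            theta, lp = theta_prime, lp_prime
            n_accept += 1
        samples[i] = theta  # on rejection, θ_{i+1} = θ_i

    return samples, n_accept / n_steps
```

As a usage example, `metropolis_hastings(lambda t: -0.5 * np.sum(t**2), np.zeros(1), 10_000)` samples a standard Normal and also reports the acceptance rate, which is useful for tuning (more on this below).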

Detailed balance

This algorithm seems kind of nuts! How on earth does this work? To investigate this, we consider the joint probability, \(P(\theta_{i+1}, \theta_i)\), that the walker is at \(\theta_i\) and \(\theta_{i+1}\) at sequential steps. We can write this in terms of the transition kernel,

\[\begin{split}\begin{aligned} P(\theta_{i+1}, \theta_i) &= P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) \nonumber \\[1em] &= P(\theta_i)\,\alpha(\theta_{i+1}\mid \theta_i)\,K(\theta_{i+1}\mid \theta_i) \nonumber \\[1em] &= P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)\,\min\left(1, \frac{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})} {P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}\right) \nonumber \\[1em] &= \min\left[P(\theta_i)\,K(\theta_{i+1}\mid \theta_i), P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\right] \nonumber \\[1em] &= P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\,\min\left(1,\frac{P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})}\right) \nonumber \\[1em] &= P(\theta_{i+1})\, \alpha(\theta_i\mid \theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\, \nonumber \\[1em] &= P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}). \end{aligned}\end{split}\]

Thus, we have

\[\begin{aligned} P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) = P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}). \end{aligned}\]

This says that the rate of transition from \(\theta_i\) to \(\theta_{i+1}\) is equal to the rate of transition from \(\theta_{i+1}\) to \(\theta_i\). In this case, the transition kernel is said to satisfy detailed balance.

Any transition kernel that satisfies detailed balance has \(P(\theta)\) as an invariant distribution. This is easily shown.

\[\begin{split}\begin{aligned} \int \mathrm{d}\theta_i \,P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) &= \int \mathrm{d}\theta_i\,P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}) \nonumber \\[1em] &= P(\theta_{i+1})\left[\int \mathrm{d}\theta_i\, T(\theta_i\mid \theta_{i+1})\right] \nonumber \\[1em] &= P(\theta_{i+1}), \end{aligned}\end{split}\]

since the bracketed term is unity because the transition kernel is a probability density. [1]
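
As a sanity check, we can verify both detailed balance and invariance numerically for a hypothetical three-state discrete target of my own construction, using a deliberately non-symmetric proposal matrix.

```python
import numpy as np

# Hypothetical three-state target and a non-symmetric proposal matrix,
# K[i, j] = K(θ_j | θ_i); each row sums to one.
P = np.array([0.2, 0.3, 0.5])
K = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Metropolis-Hastings transition matrix, T[i, j] = α(θ_j | θ_i) K(θ_j | θ_i)
# off the diagonal; rejected proposals pile up on the diagonal.
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            r = P[j] * K[j, i] / (P[i] * K[i, j])
            T[i, j] = min(1.0, r) * K[i, j]
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: P(θ_i) T(θ_j | θ_i) = P(θ_j) T(θ_i | θ_j) for all i, j.
print(np.allclose(P[:, None] * T, (P[:, None] * T).T))  # True

# Invariance: summing over θ_i recovers P(θ_j).
print(np.allclose(P @ T, P))  # True
```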

Note that all transition kernels that satisfy detailed balance have an invariant distribution. (If the chain is ergodic, this is a limiting distribution.) But not all kernels that have an invariant distribution satisfy detailed balance. So, detailed balance is a sufficient, but not necessary, condition for a transition kernel to have an invariant distribution.

Choosing the transition kernel

There is an art to choosing the transition kernel. The original Metropolis algorithm (1953) took a symmetric proposal distribution, \(K(\theta_{i+1}\mid \theta_i) = K(\theta_i\mid \theta_{i+1})\), so that the \(K\) terms cancel in the Metropolis ratio. As a rule of thumb, you want to choose a proposal distribution such that you get an acceptance rate of about 0.4. If you accept nearly every step, the proposal steps are too small and the walker just wanders around, so it takes a long time to reach the limiting distribution. If you reject too many steps, the walker rarely moves, and it again takes a long time to reach the limiting distribution. There are tricks to “tune” the walkers to achieve the target acceptance rate, as in the sketch below.
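
One crude version of such tuning, sketched here using the hypothetical `metropolis_hastings` function above, adjusts the proposal scale during burn-in until the acceptance rate lands near the target; real samplers use more careful adaptation schemes.

```python
import numpy as np

# Hypothetical target: a standard Normal, log P(θ) up to an additive constant.
log_post = lambda theta: -0.5 * np.sum(theta**2)

sigma = 0.1
for _ in range(20):
    _, acc_rate = metropolis_hastings(log_post, np.zeros(1), n_steps=1000, sigma=sigma)
    if 0.3 < acc_rate < 0.5:
        break  # close enough to the ~0.4 target
    # High acceptance → steps too timid → widen the proposal; low → shrink it.
    sigma *= max(acc_rate, 0.05) / 0.4
```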

Gibbs sampling, a special case of a Metropolis-Hastings sampler, is a more modern approach to defining transition kernels. It was very popular for many years, though we will not go into the details here. These days, most samplers use Hamiltonian Monte Carlo (HMC). Stan uses a highly optimized HMC sampler, which gives significant performance improvements for important subclasses of problems. To my knowledge and in my opinion, Stan is the current state-of-the-art in MCMC sampling software. (I cannot overemphasize how much of a performance improvement Stan’s sampler offers; it allows sampling of posteriors that are intractable with a more naive implementation of Metropolis-Hastings.)

Importantly, the HMC sampler can only handle continuous variables; it cannot sample discrete variables. Depending on your problem, this could be a limitation, but in most applications we are interested in estimating continuous parameters. There are also ways around sampling discrete variables, such as explicitly marginalizing over them, so we rarely need to sample them anyway.


[1] In the event that we are dealing with discrete parameters \(\theta\), the integral is replaced with a sum and the transition kernel is a probability, not a probability density. The same results still hold, and \(P(\theta)\) is an invariant distribution.
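
For concreteness, the discrete analogue of the invariance calculation is

\[\begin{aligned} \sum_{\theta_i} P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) = P(\theta_{i+1}) \sum_{\theta_i} T(\theta_i\mid \theta_{i+1}) = P(\theta_{i+1}), \end{aligned}\]

since the transition probabilities out of \(\theta_{i+1}\) sum to unity.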