Calculating derivatives from data with GPs
[2]:
import os
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import cmdstanpy
import arviz as az
import bebi103
import bokeh.io
bokeh.io.output_notebook()
Imagine doing an experiment in bacterial growth. You can measure the optical density (OD) over time. The growth rate is given by the derivative of the log OD over time. The problem is that you cannot directly observe a derivative.
There are various ways to smooth data so that you can then compute derivatives from the smoothed curve. Gaussian processes provide a convenient way to go about this. We use a GP prior and Normal likelihood to obtain a nonparametric curve defined by a GP posterior. However, instead of directly differentiating the nonparametric curve using, e.g., finite differences, we can actually infer characteristics about its derivative because the derivative of a GP is also a GP! This is remarkably convenient, as explained in this nice paper by Peter Swain and coworkers. Specifically, if we have a Normal likelihood and a GP prior, we know the posterior distribution is also a GP. Therefore, we can get the posterior distribution for derivatives directly from the posterior distribution for nonparametric functions.
In this lesson, we will put this to use. We will use a data set from our own Tom Röschinger. I will not go into detail about the experiment, but will rather extract a single bacterial growth curve out of the data set for wild type E. coli growth measured by optical density (OD).
[3]:
# Load in data set
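# data_path is assumed to point to the directory containing the data set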
df = pd.read_csv(
os.path.join(data_path, "roeschinger_growth_rate_data.csv"), comment="#"
)
# Extract time and OD600 for growth curve of interest
inds = (
(df["tetracycline_conc_µg_per_mL"] == 0)
& (df["promoter"] == "MG1655")
& (df["well"] == "A01")
)
t = df.loc[inds, "time_min"].values / 60
od600 = df.loc[inds, "OD600"].values
# Plot
p = bokeh.plotting.figure(
frame_height=250,
frame_width=350,
x_axis_label='time (hr)',
y_axis_label='OD'
)
p.circle(t, od600)
bokeh.io.show(p)
We see what looks like an exponential growth phase that then reaches stationary phase with a very slow die-off. We seek to measure the rate of growth over time, which is given by the derivative of the nonparametric function we are using to describe this growth process.
Before proceeding, we will do the usual setup of centered and scaled data and the points at which we want to evaluate the nonparametric curve. For the growth curves, we will only evaluate the curves between the first and last time points and not extrapolate beyond them.
[4]:
t_scaled = (t - t.mean()) / t.std()
od600_scaled = (od600 - od600.mean()) / od600.std()
# Sample at 250 points
Nstar = 250
# Set up xstar
xstar = np.linspace(t_scaled.min(), t_scaled.max(), Nstar)
Derivatives of GPs
Applying linear operators to Gaussian processes results in Gaussian processes. Since differentiation is a linear operation, the result of differentiating a GP is a GP. In what follows, we will state results without proof, but you can reference Solak, et al. and Swain, et al. for details. We will limit ourselves to cases where the x-variables are scalar (like time is in the present example); the analysis generalizes to multidimensional \(\mathbf{x}\), but the notation gets more cumbersome, and we often encounter scalar x-variables in practice.
Recall that if I have a Normal likelihood and a GP prior with kernel \(k(x, x';\theta_k)\), then the posterior distribution for the nonparametric function \(f(x)\) is
\begin{align} f(x) \mid \mathbf{y}, \mathbf{x} \sim \text{GP}(m(x), k(x, x') + \sigma^2_x \delta_{x,x'}). \end{align}
We can sample values of \(f(x)\) at specific points \(\mathbf{x}_*\) according to
\begin{align} \mathbf{f}_* \mid \mathbf{x}_*, \mathbf{x}, \mathbf{y} \sim \text{MultiNorm}(\mathbf{m}_*, \mathsf{\Sigma}_*), \end{align}
where
\begin{align} &\mathbf{m}_* = \mathsf{K}_*^\mathsf{T}\cdot\mathsf{K}_\mathbf{y}^{-1}\cdot \mathbf{y},\\[1em] &\mathsf{\Sigma}_* = \mathsf{K}_{**} - \mathsf{K}_*^\mathsf{T}\cdot\mathsf{K}_\mathbf{y}^{-1}\cdot\mathsf{K}_*. \end{align}
The covariance matrices \(\mathsf{K}\) are defined such that \(K_{ij}(x, z) = k(x_i, z_j;\theta_k)\). Specifically,
\begin{align} &\mathsf{K}_\mathbf{y} = K(\mathbf{x}, \mathbf{x}) + \text{diag}(\boldsymbol{\sigma}^2),\\[1em] &\mathsf{K}_* = K(\mathbf{x}, \mathbf{x}_*),\\[1em] &\mathsf{K}_{**} = K(\mathbf{x}_*, \mathbf{x}_*). \end{align}
Let \(\mathbf{g}_*\) be the derivatives of \(f(x)\) evaluated at positions \(\mathbf{x}_*\). In order to compute these, we do exactly the same procedure as above, but with covariance matrices computed by differentiating the kernel.
Let \(\partial_1 \mathsf{K}(\mathbf{x}_1, \mathbf{x}_2)\) be the covariance matrix constructed using a kernel given by the derivative of the kernel with respect to the first variable,
\begin{align} \frac{\partial}{\partial x_1}\,k(x_1, x_2;\theta_k). \end{align}
Let \(\partial_1 \partial_2 \mathsf{K}(\mathbf{x}_1, \mathbf{x}_2)\) be the covariance matrix constructed using a kernel given by the mixed second derivative of the kernel,
\begin{align} \frac{\partial^2}{\partial x_1\,\partial x_2}\,k(x_1, x_2;\theta_k). \end{align}
As an example, for a squared exponential kernel,
\begin{align} k_\mathrm{SE}(x_1, x_2; \alpha, \rho) = \alpha^2\,\exp\left[-\frac{(x_1-x_2)^2}{2\rho^2}\right], \end{align}
we have
\begin{align} &\frac{\partial}{\partial x_1}\,k_\mathrm{SE}(x_1, x_2;\alpha, \rho) = -\frac{\alpha^2}{\rho^2}\,(x_1 - x_2)\,\exp\left[-\frac{(x_1-x_2)^2}{2\rho^2}\right],\\[1em] &\frac{\partial^2}{\partial x_1\,\partial x_2}\,k_\mathrm{SE}(x_1, x_2;\alpha, \rho) = \frac{\alpha^2}{\rho^2}\left(1 - \frac{(x_1 - x_2)^2}{\rho^2}\right)\exp\left[-\frac{(x_1-x_2)^2}{2\rho^2}\right]. \end{align}
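As a quick sanity check of these expressions (not part of the analysis that follows), we can compare them against centered finite differences of the kernel. The helper function names below are just for this check, and the test values are arbitrary.
import numpy as np

def k_se(x1, x2, alpha, rho):
    """Squared exponential kernel."""
    return alpha**2 * np.exp(-((x1 - x2) ** 2) / (2 * rho**2))

def d1_k_se(x1, x2, alpha, rho):
    """SE kernel differentiated with respect to the first variable."""
    return -(alpha**2 / rho**2) * (x1 - x2) * np.exp(-((x1 - x2) ** 2) / (2 * rho**2))

def d1_d2_k_se(x1, x2, alpha, rho):
    """Mixed second derivative of the SE kernel."""
    return (
        (alpha**2 / rho**2)
        * (1 - (x1 - x2) ** 2 / rho**2)
        * np.exp(-((x1 - x2) ** 2) / (2 * rho**2))
    )

# Compare to centered finite differences at an arbitrary test point
x1, x2, alpha, rho, h = 0.3, -0.7, 1.5, 0.6, 1e-5
fd_d1 = (k_se(x1 + h, x2, alpha, rho) - k_se(x1 - h, x2, alpha, rho)) / (2 * h)
fd_d1_d2 = (d1_k_se(x1, x2 + h, alpha, rho) - d1_k_se(x1, x2 - h, alpha, rho)) / (2 * h)

print(np.isclose(fd_d1, d1_k_se(x1, x2, alpha, rho)))
print(np.isclose(fd_d1_d2, d1_d2_k_se(x1, x2, alpha, rho)))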
For notational convenience, we define \(\partial_1\mathsf{K}_* = \partial_1 \mathsf{K}(\mathbf{x}, \mathbf{x}_*)\) and \(\partial_1\partial_2\mathsf{K}_{**} = \partial_1 \partial_2 \mathsf{K}(\mathbf{x}_*, \mathbf{x}_*)\).
With all of these definitions in place, we can write the expressions for \(\mathbf{g}_*\) and its corresponding covariance matrix \(\mathsf{\Sigma}_*^\mathbf{g}\).
\begin{align} &\mathbf{g}_* = -(\partial_1 \mathsf{K}_*)^\mathsf{T}\cdot\mathsf{K}_\mathbf{y}^{-1}\cdot \mathbf{y},\\[1em] &\mathsf{\Sigma}_*^\mathbf{g} = \partial_1\partial_2\mathsf{K}_{**} - (\partial_1\mathsf{K}_{*})^\mathsf{T}\cdot\mathsf{K}_\mathbf{y}^{-1}\cdot\partial_1\mathsf{K}_{*}. \end{align}
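To make the linear algebra concrete, here is a minimal NumPy sketch of these two expressions for the SE kernel, written out directly rather than using the packaged implementation we will rely on below (bebi103.gp.posterior_mean_cov()). The function name and its inputs (centered and scaled x and y, evaluation points xstar, and parameter values sigma, alpha, rho) are for illustration only.
def se_deriv_posterior(x, y, xstar, sigma, alpha, rho):
    """Posterior mean g* and covariance Σ*g of df/dx for an SE kernel."""
    # Pairwise differences for building covariance matrices
    dx_xx = np.subtract.outer(x, x)
    dx_xs = np.subtract.outer(x, xstar)
    dx_ss = np.subtract.outer(xstar, xstar)

    # K_y, ∂₁K_*, and ∂₁∂₂K_** from the kernel derivatives above
    Ky = alpha**2 * np.exp(-dx_xx**2 / (2 * rho**2)) + sigma**2 * np.eye(len(x))
    d1_Kstar = -(alpha**2 / rho**2) * dx_xs * np.exp(-dx_xs**2 / (2 * rho**2))
    d1_d2_Kstarstar = (
        (alpha**2 / rho**2)
        * (1 - dx_ss**2 / rho**2)
        * np.exp(-dx_ss**2 / (2 * rho**2))
    )

    # Solve linear systems instead of explicitly inverting K_y
    gstar = -d1_Kstar.T @ np.linalg.solve(Ky, y)
    Sigma_g_star = d1_d2_Kstarstar - d1_Kstar.T @ np.linalg.solve(Ky, d1_Kstar)

    return gstar, Sigma_g_star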
Derivatives of GPs in practice using optimization
Let’s now put these results to use. For the growth curve, we will assume the following model for the centered and scaled data.
\begin{align} &\alpha \sim \text{HalfNorm}(0, 2),\\[1em] &\rho \sim \text{InvGamma}(2, 10), \\[1em] &\sigma \sim \text{HalfNorm}(0, 1),\\[1em] &f(t) \sim \text{GP}(0, k_\mathrm{SE}(t, t'; \alpha, \rho)),\\[1em] &\text{OD}_i \sim \text{Norm}(f(t_i), \sigma)\;\;\forall i. \end{align}
We will first find the MAP values of the parameters \(\alpha\), \(\rho\), and \(\sigma\) using the optimization techniques we have already learned. We start by coding up the model.
[5]:
def log_prior(sigma, alpha, rho):
"""Log prior for homoscedastic error and GP hyperparameters"""
    # Require sigma > 1e-6 (a lower bound) to protect against
    # a singular covariance matrix
if sigma <= 1e-6 or alpha <= 0 or rho <= 0:
return -np.inf
lp = st.halfnorm.logpdf(sigma, 0, 1)
lp += st.halfnorm.logpdf(alpha, 0, 2)
lp += st.invgamma.logpdf(rho, 2, loc=0, scale=10)
return lp
def log_likelihood(x, y, sigma, alpha, rho):
"""Normal log likelihood for centered and scaled GP."""
diag_to_add = sigma ** 2 * np.eye(len(x))
Ky = bebi103.gp.cov_exp_quad(x, alpha=alpha, rho=rho) + diag_to_add
return st.multivariate_normal.logpdf(y, np.zeros_like(x), Ky)
def log_posterior(params, x, y):
"""Log posterior for GP with normal likelihood."""
sigma, alpha, rho = params
lp = log_prior(sigma, alpha, rho)
if lp == -np.inf:
return lp
return lp + log_likelihood(x, y, sigma, alpha, rho)
def neg_log_posterior(params, x, y):
return -log_posterior(params, x, y)
We can now perform the optimization to get the parameter values.
[6]:
# Initial guess
params_0 = np.ones(3)
# Perform optimization
args = (t_scaled, od600_scaled)
res = scipy.optimize.minimize(neg_log_posterior, params_0, args=args, method="powell")
# Extract results
sigma, alpha, rho = res.x
# Show results
print("α = {0:.4f}, ρ = {1:.4f}, σ = {2:.4f}".format(alpha, rho, sigma))
α = 1.1420, ρ = 0.5917, σ = 0.0118
/Users/bois/miniconda3/envs/bebi103_build/lib/python3.11/site-packages/scipy/optimize/_optimize.py:2489: RuntimeWarning: invalid value encountered in scalar multiply
tmp2 = (x - v) * (fx - fw)
With the parameter values in hand, we can perform the above calculations with augmented covariance matrices. This is implemented in the bebi103.gp.posterior_mean_cov() function using the include_deriv=True kwarg.
[7]:
mstar, Sigmastar, gstar, Sigma_g_star = bebi103.gp.posterior_mean_cov(
    t_scaled, od600_scaled, xstar, sigma=sigma, rho=rho, alpha=alpha, include_deriv=True
)
Now that we have the augmented \(\mathbf{m}_*\) and \(\mathsf{\Sigma}_*\), we can put them to use. First, we’ll plot the nonparametric function with the data.
[8]:
# Bound of credible interval
high = mstar + 1.96 * np.sqrt(np.diag(Sigmastar))
low = mstar - 1.96 * np.sqrt(np.diag(Sigmastar))
# Uncenter/scale
tstar = t.std() * xstar + t.mean()
od600star = od600.std() * mstar + od600.mean()
low = od600.std() * low + od600.mean()
high = od600.std() * high + od600.mean()
# Make plot
p = bokeh.plotting.figure(
frame_height=250,
frame_width=350,
x_axis_label='time (hr)',
y_axis_label='OD'
)
# Data and function
p.line(tstar, od600star, line_width=2, color="orange")
p = bebi103.viz.fill_between(
tstar,
high,
tstar,
low,
show_line=False,
patch_kwargs=dict(color="orange", alpha=0.3),
p=p,
)
p.circle(t, od600)
bokeh.io.show(p)
If you zoom in on the above plot, the credible interval is clearer. Now, let’s plot the derivative! When we uncenter and unscale the derivative, we need to multiply by the standard deviation of the y-values of the observed data and divide by that of the x-values (in this case time), as the chain rule calculation below shows.
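To see why, note that the original and scaled variables are related by \(\text{OD} = \sigma_\text{OD}\,f_\text{scaled} + \mu_\text{OD}\) and \(t = \sigma_t\,t_\text{scaled} + \mu_t\), where the \(\sigma\)'s and \(\mu\)'s here denote sample standard deviations and means. The chain rule then gives
\begin{align} \frac{\mathrm{d}\,\text{OD}}{\mathrm{d}t} = \sigma_\text{OD}\,\frac{\mathrm{d}f_\text{scaled}}{\mathrm{d}t_\text{scaled}}\,\frac{\mathrm{d}t_\text{scaled}}{\mathrm{d}t} = \frac{\sigma_\text{OD}}{\sigma_t}\,\frac{\mathrm{d}f_\text{scaled}}{\mathrm{d}t_\text{scaled}}. \end{align}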
[9]:
# Bound of credible interval for derivative
deriv_high = gstar + 1.96 * np.sqrt(np.diag(Sigma_g_star))
deriv_low = gstar - 1.96 * np.sqrt(np.diag(Sigma_g_star))
# Uncenter/scale
deriv_star = gstar * od600.std() / t.std()
deriv_low *= od600.std() / t.std()
deriv_high *= od600.std() / t.std()
# Make plot
p = bokeh.plotting.figure(
frame_height=250,
frame_width=350,
x_axis_label="time (hr)",
y_axis_label="d OD / dt (1/hr)",
)
p.line(tstar, deriv_star, line_width=2)
p = bebi103.viz.fill_between(
tstar,
deriv_high,
tstar,
deriv_low,
show_line=False,
patch_kwargs=dict(alpha=0.3),
p=p,
)
bokeh.io.show(p)
We now have a measure of the derivative of the OD, but we need the derivative of the log of the OD to get what we commonly refer to as the growth rate. To compute the value of \(\sigma\) for the growth rate, we use an approximation for the variance of the ratio of Gaussian distributed random variables. If \(z = x/y\) and \(x\) and \(y\) are both Gaussian distributed, then
\begin{align} \frac{\sigma^2_z}{\mu_z^2} \approx \frac{\sigma^2_x}{\mu_x^2} + \frac{\sigma^2_y}{\mu_y^2}. \end{align}
I will not derive this result here, since I favor sampling anyway (which involves no such approximation), and we will do that shortly.
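If you would rather not take the approximation on faith, a quick Monte Carlo check does the trick. The values below are arbitrary toy numbers, chosen so the denominator stays well away from zero (a condition the approximation needs).
# Quick Monte Carlo check of the ratio-variance approximation (toy values)
rng = np.random.default_rng(3252)

mu_x, sigma_x = 2.0, 0.1
mu_y, sigma_y = 5.0, 0.2

x_samples = rng.normal(mu_x, sigma_x, size=1_000_000)
y_samples = rng.normal(mu_y, sigma_y, size=1_000_000)

# Approximate variance from the formula above
approx_var = (mu_x / mu_y) ** 2 * (sigma_x**2 / mu_x**2 + sigma_y**2 / mu_y**2)

print(np.var(x_samples / y_samples), approx_var)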
[10]:
# Compute growth rate
growth_rate = deriv_star / od600star
# Compute sigma for growth rate
sigma_growth_rate = np.sqrt(
(deriv_star / od600star) ** 2
* (
np.diag(Sigmastar) * od600.std() ** 2 / od600star ** 2
+ np.diag(Sigma_g_star) * (od600.std() / t.std()) ** 2 / deriv_star ** 2
)
)
# Compute bounds
gr_high = growth_rate + 1.96 * sigma_growth_rate
gr_low = growth_rate - 1.96 * sigma_growth_rate
Now we’re ready to plot!
[11]:
# Make plot
p = bokeh.plotting.figure(
frame_height=250,
frame_width=350,
x_axis_label='time (hr)',
y_axis_label='growth rate (1/hr)'
)
p.line(tstar, growth_rate, line_width=2)
p = bebi103.viz.fill_between(
tstar,
gr_high,
tstar,
gr_low,
show_line=False,
patch_kwargs=dict(alpha=0.3),
p=p,
)
bokeh.io.show(p)
Sampling derivatives with Stan
As we have already seen, sampling out of the GP posterior allows us to properly consider all values of the hyperparameters weighted by their posterior probability density. We can adjust the code from the last part of this lesson to include the derivative calculations outlined above. The code below is long, but should be self-explanatory. I use the same functions for computing the GP posterior \(\mathbf{m}_*\) and \(\mathsf{\Sigma}_*\) values as we have been using. I then have functions for computing the derivative kernels and then the covariance matrices from the derivative kernels. The data, transformed data, parameters, and model blocks are the same as before. In the generated quantities block, I add the extra calculations of the values of the derivatives \(\mathbf{g}_*\).
functions {
vector gp_posterior_mstar(vector y, matrix Ly, matrix Kstar) {
/*
* Obtain posterior mstar for a model with a Normal likelihood and GP prior
* for a given Cholesky decomposition, Ly, of the matrix Ky, and K*.
*/
// Get sizes
int N = size(y);
int Nstar = cols(Kstar);
// Compute xi = inv(Ky) . y, which is solution xi to Ky . xi = y.
vector[N] z = mdivide_left_tri_low(Ly, y);
vector[N] xi = mdivide_right_tri_low(z', Ly)';
// Compute mean vector mstar
vector[Nstar] mstar = Kstar' * xi;
return mstar;
}
matrix gp_posterior_sigmastar_cholesky(
vector y,
matrix Ly,
matrix Kstar,
matrix Kstarstar,
real delta) {
/*
* Obtain posterior Σ* for a model with a Normal likelihood and GP prior.
*/
// Get sizes
int N = size(y);
int Nstar = cols(Kstar);
// Compute Xi = inv(Ky) . Kstar, which is the solution Xi to Ky . Xi = Kstar.
matrix[N, Nstar] Z = mdivide_left_tri_low(Ly, Kstar);
matrix[N, Nstar] Xi = mdivide_right_tri_low(Z', Ly)';
    // Compute Sigma_star (plus a small number on the diagonal to ensure pos. def.)
matrix[Nstar, Nstar] Sigmastar = Kstarstar - Kstar' * Xi
+ diag_matrix(rep_vector(delta, Nstar));
// Compute and return Cholesky decomposition
matrix[Nstar, Nstar] Lstar = cholesky_decompose(Sigmastar);
return Lstar;
}
real d1_se_kernel(real x1, real x2, real alpha, real rho) {
/*
* SE kernel differentiated by the first variable.
*/
real x_diff = x1 - x2;
return -(alpha / rho)^2 * x_diff * exp(-x_diff^2 / 2.0 / rho^2);
}
real d1_d2_se_kernel(real x1, real x2, real alpha, real rho) {
/*
     * Mixed second derivative of the SE kernel.
*/
real rho2 = rho^2;
real term1 = x1 - x2 + rho;
real term2 = x2 - x1 + rho;
return (alpha / rho2)^2 * term1 * term2 * exp(-(x1 - x2)^2 / 2.0 / rho2);
}
  matrix d1_cov_exp_quad(array[] real x1, array[] real x2, real alpha, real rho) {
/*
* Derivative of SE covariance matrix with respect to first variable.
*/
int m = size(x1);
int n = size(x2);
matrix[m, n] d1_K;
for (i in 1:m) {
for (j in 1:n) {
d1_K[i, j] = d1_se_kernel(x1[i], x2[j], alpha, rho);
}
}
return d1_K;
}
  matrix d1_d2_cov_exp_quad(array[] real x1, array[] real x2, real alpha, real rho) {
    /*
     * Mixed second derivative of the SE covariance matrix.
     */
int m = size(x1);
int n = size(x2);
matrix[m, n] d1_d2_K;
for (i in 1:m) {
for (j in 1:n) {
d1_d2_K[i, j] = d1_d2_se_kernel(x1[i], x2[j], alpha, rho);
}
}
return d1_d2_K;
}
}
data {
int<lower=1> N;
array[N] real x;
vector[N] y;
int<lower=1> Nstar;
array[Nstar] real xstar;
}
transformed data {
real delta = 1e-8;
}
parameters {
real<lower=0> alpha;
real<lower=0> rho;
real<lower=0> sigma;
}
model {
alpha ~ normal(0.0, 2.0);
rho ~ inv_gamma(0.5, 2.0);
sigma ~ normal(0.0, 1.0);
matrix[N, N] Ky = gp_exp_quad_cov(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] Ly = cholesky_decompose(Ky);
y ~ multi_normal_cholesky(rep_vector(0, N), Ly);
}
generated quantities {
vector[Nstar] fstar;
vector[Nstar] dfstar;
array[Nstar] real y_ppc;
{
// Build covariance matrices
    matrix[N, N] Ky = gp_exp_quad_cov(x, alpha, rho)
+ diag_matrix(rep_vector(square(sigma), N));
matrix[N, N] Ly = cholesky_decompose(Ky);
    matrix[N, Nstar] Kstar = gp_exp_quad_cov(x, xstar, alpha, rho);
    matrix[Nstar, Nstar] Kstarstar = gp_exp_quad_cov(xstar, xstar, alpha, rho);
// Derivative covariance matrices
matrix[N, Nstar] d1_Kstar = d1_cov_exp_quad(x, xstar, alpha, rho);
matrix[Nstar, Nstar] d1_d2_Kstarstar = d1_d2_cov_exp_quad(xstar, xstar, alpha, rho);
// Obtain m* and Sigma*
vector[Nstar] mstar = gp_posterior_mstar(y, Ly, Kstar);
matrix[Nstar, Nstar] Lstar = gp_posterior_sigmastar_cholesky(
y, Ly, Kstar, Kstarstar, delta);
// Obtain g* and Sigmag*
vector[Nstar] gstar = -gp_posterior_mstar(y, Ly, d1_Kstar);
matrix[Nstar, Nstar] L_g_star = gp_posterior_sigmastar_cholesky(
y, Ly, d1_Kstar, d1_d2_Kstarstar, delta);
// Sample nonparametric function
fstar = multi_normal_cholesky_rng(mstar, Lstar);
// Sample derivative of nonparametric function
dfstar = multi_normal_cholesky_rng(gstar, L_g_star);
// Posterior predictive check
y_ppc = normal_rng(fstar, sigma);
}
}
Let’s give it a go!
[12]:
# Create data dictionary for input
data = dict(N=len(t), x=t_scaled, y=od600_scaled, Nstar=Nstar, xstar=xstar)
# Compile and sample!
with bebi103.stan.disable_logging():
sm = cmdstanpy.CmdStanModel(stan_file="gp_growth_curve.stan")
samples = sm.sample(data=data)
# Convert to ArviZ
samples = az.from_cmdstanpy(samples, posterior_predictive=["fstar", "dfstar", "y_ppc"])
# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)
Effective sample size looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.
[12]:
0
Great! Let’s start by checking out the hyperparameters.
[13]:
bokeh.io.show(bebi103.viz.corner(samples))
It looks good, and \(\sigma\) is small, as we saw in our MAP calculation. Now, let’s plot a posterior predictive check. To do so, we need to uncenter and unscale.
[14]:
y_ppc = (
samples.posterior_predictive["y_ppc"]
.stack({"sample": ("chain", "draw")})
.transpose("sample", "y_ppc_dim_0")
)
# Uncenter and unscale
od600_ppc = od600.std() * y_ppc + od600.mean()
tstar = t.std() * xstar + t.mean()
# Make the plot
bokeh.io.show(
bebi103.viz.predictive_regression(
od600_ppc,
tstar,
data=np.stack((t, od600)).transpose(),
x_axis_label="time (hr)",
y_axis_label="OD600",
)
)
There is not much noise in the data, so the nonparametric curve from this GP goes right through the data points.
Now, we can look at the derivatives. We have no data to compare these with, so we just plot the posterior predictive samples of the derivative of \(f(t)\).
[15]:
dfstar_scaled = (
samples.posterior_predictive["dfstar"]
.stack({"sample": ("chain", "draw")})
.transpose("sample", "dfstar_dim_0")
)
# Uncenter and unscale
dfstar = dfstar_scaled * od600.std() / t.std()
# Plot posterior predictive
bokeh.io.show(
bebi103.viz.predictive_regression(
dfstar, tstar, x_axis_label="time (hr)", y_axis_label="d OD / dt (1/hr)"
)
)
We can also compute the growth rate from the samples and plot the result.
[16]:
# Grab fstar
fstar_scaled = (
samples.posterior_predictive["fstar"]
.stack({"sample": ("chain", "draw")})
.transpose("sample", "fstar_dim_0")
)
# Uncenter and unscale
fstar = od600.std() * fstar_scaled + od600.mean()
# Sample of growth rate
growth_rate = dfstar.values / fstar.values
# Make plot
bokeh.io.show(
bebi103.viz.predictive_regression(
growth_rate, tstar, x_axis_label="time (hr)", y_axis_label="growth rate (1/hr)"
)
)
It is apparently very hard to know the growth rate early on in the growth curve. We can also see that, due to the highly variable growth rate, these bacteria are not really experiencing purely exponential growth.
[17]:
bebi103.stan.clean_cmdstan()
Computing environment
[18]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.11.5
IPython version : 8.15.0
numpy : 1.26.2
scipy : 1.11.4
pandas : 2.1.4
cmdstanpy : 1.2.0
arviz : 0.17.0
bokeh : 3.3.0
bebi103 : 0.1.20
jupyterlab: 4.0.10
cmdstan : 2.34.0