Gaussian process hyperparameters by optimization
[2]:
import os

import numpy as np
import polars as pl
import scipy.optimize
import scipy.stats as st
import cmdstanpy
import arviz as az
import bebi103
import bokeh.io
bokeh.io.output_notebook()
In the previous lesson we introduced Gaussian processes (GPs) with particular emphasis on the following model, where the data consist of centered-and-scaled measurements $(x_i, y_i)$:

\begin{align}
&f \sim \text{GP}(0, k(x, x'; \theta_k)),\\[0.5em]
&y_i \mid f, x_i, \sigma \sim \text{Norm}(f(x_i), \sigma) \;\;\forall i.
\end{align}

We learned that we can marginalize out the latent function $f$ to obtain

\begin{align}
\mathbf{y} \mid \mathbf{x}, \sigma, \theta_k \sim \text{MultiNorm}(\mathbf{0}, \mathsf{K}_y),
\end{align}

where $\mathsf{K}_y = \mathsf{K} + \sigma^2 \mathsf{I}$, and $\mathsf{K}$ is the covariance matrix whose entries are given by the kernel, $K_{ij} = k(x_i, x_j; \theta_k)$.
To start with, we will use the same data set as last time from Wolfenden and Snider, which can be downloaded here.
[3]:
# data_path is assumed to point to the directory containing the data set
df = pl.read_csv(os.path.join(data_path, 'wolfenden_arrhenius.csv'))

# Compute T in Kelvin and k in 1/s from the tabulated columns
df = df.with_columns(
    (1000 / pl.col('1000/T (1/K)')).alias('T (K)'),
    (pl.col('ln k (1/s)').exp()).alias('k (1/s)'),
)

T = df["T (K)"].to_numpy()
k = df["k (1/s)"].to_numpy()
While we’re at it, we should center and scale and also set up the points at which we want to evaluate the functions generated by the posterior Gaussian process.
[4]:
k_scaled = (k - k.mean()) / k.std()
T_scaled = (T - T.mean()) / T.std()
# Sample at 250 points
Nstar = 250
# Set up xstar
T_range = T_scaled.max() - T_scaled.min()
xstar = np.linspace(
    T_scaled.min() - 0.05 * T_range, T_scaled.max() + 0.05 * T_range, Nstar
)
Priors for hyperparameters
Before we can proceed with inference, we need to first stipulate which kernel we will use and the hyperpriors for its hyperparameters. We will use a squared exponential kernel,

\begin{align}
k(x, x') = \alpha^2 \exp\left(-\frac{(x - x')^2}{2\rho^2}\right),
\end{align}

which has hyperparameters $\alpha$, the marginal standard deviation, and $\rho$, the length scale.

In choosing the priors for these hyperparameters, it is important to use domain knowledge about what kinds of functions might describe the data (remembering that the data are centered and scaled!). Specifically, because the data are scaled to have unit variance, the nonparametric function should not vary by much more than order one, so a weakly informative Half-Normal prior is appropriate for the marginal standard deviation,

\begin{align}
\alpha \sim \text{HalfNorm}(0, 2).
\end{align}

Choosing a prior for the length scale $\rho$ requires a bit more thought. Length scales much shorter than the spacing between data points cannot be informed by the data, and length scales much longer than the range of the data give essentially featureless functions. An Inverse-Gamma prior penalizes both of these extremes, so I will choose

\begin{align}
\rho \sim \text{InvGamma}(2, 10).
\end{align}

Finally, I will choose a Half-Normal prior for the homoscedastic measurement error,

\begin{align}
\sigma \sim \text{HalfNorm}(0, 1),
\end{align}

where again the unit scale reflects the fact that the data are centered and scaled.
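As a quick check on the length scale prior (a plot I am adding here as a sketch; it reuses the imports from the setup cell and the same InvGamma(2, 10) parametrization that appears in the log_prior function below), we can plot its PDF and verify that it puts very little probability mass on length scales near zero while still allowing long length scales.
[ ]:
# Length scale values at which to evaluate the prior PDF
rho_plot = np.linspace(0.01, 30, 400)

# PDF of the InvGamma(2, 10) prior used for the length scale
rho_pdf = st.invgamma.pdf(rho_plot, 2, loc=0, scale=10)

p_rho = bokeh.plotting.figure(
    frame_height=200, frame_width=350, x_axis_label="ρ", y_axis_label="g(ρ)"
)
p_rho.line(rho_plot, rho_pdf, line_width=2)

bokeh.io.show(p_rho)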
Computing the MAP with SciPy
As we did in earlier lessons, we can use SciPy to compute the maximum a posteriori (MAP) values of the parameters $\sigma$, $\alpha$, and $\rho$. First, we code up the log prior.
[5]:
def log_prior(sigma, alpha, rho):
    """Log prior for homoscedastic error and GP hyperparameters"""
    # Choose 1e-6 as lower bound for sigma to protect against
    # a singular covariance matrix
    if sigma <= 1e-6 or alpha <= 0 or rho <= 0:
        return -np.inf

    lp = st.halfnorm.logpdf(sigma, 0, 1)
    lp += st.halfnorm.logpdf(alpha, 0, 2)
    lp += st.invgamma.logpdf(rho, 2, loc=0, scale=10)

    return lp
Next, we define the log likelihood. To compute it, we first need to compute the covariance matrix $\mathsf{K}$ using the bebi103.gp.cov_exp_quad() function. We then add $\sigma^2$ to the diagonal to obtain $\mathsf{K}_y$, and the log likelihood is that of a multivariate Normal distribution with zero mean and covariance $\mathsf{K}_y$.
[6]:
def log_likelihood(x, y, sigma, alpha, rho):
    """Normal log likelihood for centered and scaled GP."""
    diag_to_add = sigma ** 2 * np.eye(len(x))
    Ky = bebi103.gp.cov_exp_quad(x, alpha=alpha, rho=rho) + diag_to_add

    return st.multivariate_normal.logpdf(y, np.zeros_like(x), Ky)
Finally, we code up the log posterior and negative log posterior to enable optimization.
[7]:
def log_posterior(params, x, y):
    """Log posterior for GP with normal likelihood."""
    sigma, alpha, rho = params

    lp = log_prior(sigma, alpha, rho)
    if lp == -np.inf:
        return lp

    return lp + log_likelihood(x, y, sigma, alpha, rho)


def neg_log_posterior(params, x, y):
    return -log_posterior(params, x, y)
We are now ready to optimize. We provide an initial guess for the parameters and use Powell's method, which does not require gradient information.
[8]:
# Initial guess
params_0 = np.ones(3)
# Perform optimization
args = (T_scaled, k_scaled)
res = scipy.optimize.minimize(neg_log_posterior, params_0, args=args, method="powell")
# Extract results
sigma, alpha, rho = res.x
# Show results
print("α = {0:.4f}, ρ = {1:.4f}, σ = {2:.4f}".format(alpha, rho, sigma))
α = 1.5887, ρ = 2.5791, σ = 0.0886
/Users/bois/miniconda3/envs/bebi103_build/lib/python3.12/site-packages/scipy/optimize/_optimize.py:2472: RuntimeWarning: invalid value encountered in scalar multiply
tmp2 = (x - v) * (fx - fw)
Excellent! We can now make a plot of the nonparametric functions coming from the posterior GP with our optimal hyperparameters. First, we compute the posterior mean $\mathbf{m}_*$ and covariance $\mathsf{\Sigma}_*$ of the nonparametric function at the points xstar using the bebi103.gp.posterior_mean_cov() function.
[9]:
mstar, Sigmastar = bebi103.gp.posterior_mean_cov(
    T_scaled, k_scaled, xstar, sigma=sigma, rho=rho, alpha=alpha
)
Finally, to make the plot, we compute the bounds of the 95% credible interval and then uncenter and unscale for plotting.
[10]:
# Bounds of the 95% credible interval
high = mstar + 1.96 * np.sqrt(np.diag(Sigmastar))
low = mstar - 1.96 * np.sqrt(np.diag(Sigmastar))
# Uncenter/scale
Tstar = T.std() * xstar + T.mean()
kstar = k.std() * mstar + k.mean()
low = k.std() * low + k.mean()
high = k.std() * high + k.mean()
# Make plot
p = bokeh.plotting.figure(
    frame_height=250, frame_width=350, x_axis_label="T (K)", y_axis_label="k (1/s)"
)

p.line(Tstar, kstar, line_width=2, color="orange")

p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="orange", alpha=0.3),
    p=p,
)
p.scatter(source=df.to_dict(), x="T (K)", y="k (1/s)")
bokeh.io.show(p)
Again, remember that the line in the plot is the most probable function for the MAP values of the hyperparameters, and the shaded region is the 95% credible interval for the nonparametric function at those same MAP values; the uncertainty in the hyperparameters themselves is not represented.
Posterior predictive samples
To sample out of the posterior predictive distribution, we draw a nonparametric function from the posterior GP (a multivariate Normal with mean $\mathbf{m}_*$ and covariance $\mathsf{\Sigma}_*$) and then draw data points from a Normal distribution centered on that function with scale $\sigma$. We uncenter and unscale the result so we can overlay it on the plot.
[11]:
# Random number generator for sampling
rng = np.random.default_rng()


def draw_gp_ppc(mstar, Sigmastar, sigma, y_mean, y_std):
    """Draw a GP posterior predictive sample of centered
    and scaled data that is then uncentered and unscaled."""
    # Draw nonparametric function
    f = rng.multivariate_normal(mstar, Sigmastar)

    # Draw data
    y = rng.normal(f, sigma)

    # Uncenter and unscale
    return y_std * y + y_mean


k_ppc = draw_gp_ppc(mstar, Sigmastar, sigma, k.mean(), k.std())

# Overlay new data on plot
p.scatter(Tstar, k_ppc, color='tomato', alpha=0.5, size=3)
bokeh.io.show(p)
We can generate many of these samples and then make a posterior predictive plot. Note, though, that this is not a full posterior predictive plot. Rather, we are fixing the hyperparameters at their MAP values instead of drawing them from the posterior.
[12]:
k_ppc = np.array(
    [draw_gp_ppc(mstar, Sigmastar, sigma, k.mean(), k.std()) for _ in range(1000)]
)

bokeh.io.show(
    bebi103.viz.predictive_regression(k_ppc, Tstar, data=np.stack((T, k)).transpose())
)
The posterior predictive plot looks good!
Obtaining MAP hyperparameters by optimizing with Stan
While we have not discussed it thus far, Stan does have functionality to find MAP parameter values by optimization. It may be easier to code your model in Stan and then use its optimizer; this is often the case for GP-based models. The Stan code for the model we have been considering is
data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}


parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real<lower=0> sigma;
}


model {
  alpha ~ normal(0.0, 2.0);
  rho ~ inv_gamma(0.5, 2.0);
  sigma ~ normal(0.0, 1.0);

  matrix[N, N] Ky = gp_exp_quad_cov(x, alpha, rho)
                    + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] Ly = cholesky_decompose(Ky);

  y ~ multi_normal_cholesky(rep_vector(0, N), Ly);
}
Note that we use the Cholesky decomposition, as it is more numerically stable. Note also that the construction of the covariance matrix and its Cholesky decomposition are in the model block instead of transformed parameters because we neither need nor want them stored as output.
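To see the idea behind this, recall that if $\mathsf{K}_y = \mathsf{L}_y \mathsf{L}_y^\mathsf{T}$, then $\mathsf{L}_y \mathbf{z}$ is a draw from $\text{MultiNorm}(\mathbf{0}, \mathsf{K}_y)$ when $\mathbf{z}$ is a vector of standard Normal draws, and the log density can be evaluated using triangular solves with $\mathsf{L}_y$ instead of inverting $\mathsf{K}_y$. Here is a minimal NumPy sketch of the sampling identity (my illustration, not part of the Stan model; the small covariance matrix is made up for the demonstration):
[ ]:
# Check that Ly @ z, with z standard Normal, has covariance Ky
rng_chol = np.random.default_rng(3252)

# A small squared-exponential-like covariance matrix with a noise diagonal
x_demo = np.linspace(0, 1, 5)
Ky_demo = np.exp(-(x_demo[:, None] - x_demo[None, :])**2 / 0.1) + 0.01 * np.eye(5)

# Cholesky factor: Ky_demo = Ly_demo @ Ly_demo.T
Ly_demo = np.linalg.cholesky(Ky_demo)

# Many draws of Ly @ z; their sample covariance approaches Ky_demo
draws = Ly_demo @ rng_chol.standard_normal((5, 100_000))
print(np.allclose(np.cov(draws), Ky_demo, atol=0.05))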
To find the MAP parameters, we simply compile the model and then use the optimize() method.
[13]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="gp_kinetics_no_ppc.stan")

    data = dict(N=len(T_scaled), x=T_scaled, y=k_scaled)
    res = sm.optimize(data=data)
The resulting object makes the optimal parameters available as a dictionary, a Numpy array, or a Pandas data frame. In this case, the dictionary is easiest to use.
[14]:
res.optimized_params_dict
[14]:
OrderedDict([('lp__', 7.79106),
             ('alpha', 1.40338),
             ('rho', 2.25657),
             ('sigma', 0.0893158)])
The “parameter” lp__
is the value of the log posterior at the MAP. The remaining entries are the MAP parameter values, which you can see are close to what we got using SciPy, but differ slightly. This is because Stan does transformations of variables to handle constraints, and with the new transformed variables, the tolerance may be hit with different parameter values. We can verify that it gives the same result by again plotting the resulting nonparametric function.
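To get a feel for what such a transformation does, here is a sketch (my illustration, not Stan's internal code; it reuses neg_log_posterior and args from the cells above) that optimizes over the logarithms of the parameters, which is essentially how Stan handles positivity constraints, and then transforms back:
[ ]:
def neg_log_posterior_unconstrained(log_params, x, y):
    """Negative log posterior as a function of log-transformed parameters.
    Illustrative only; no Jacobian adjustment, consistent with MAP estimation."""
    return neg_log_posterior(np.exp(log_params), x, y)


# Optimize in the unconstrained space, starting from log(1) = 0
res_unc = scipy.optimize.minimize(
    neg_log_posterior_unconstrained, np.zeros(3), args=args, method="powell"
)

# Transform back to the constrained parameters
sigma_u, alpha_u, rho_u = np.exp(res_unc.x)
print("α = {0:.4f}, ρ = {1:.4f}, σ = {2:.4f}".format(alpha_u, rho_u, sigma_u))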
[15]:
# Plot SciPy result
p = bokeh.plotting.figure(
    frame_height=250, frame_width=350, x_axis_label="T (K)", y_axis_label="k (1/s)"
)
p.line(Tstar, kstar, line_width=2, color="orange")
p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="orange", alpha=0.3),
    p=p,
)
# Plot Stan result
sigma, alpha, rho = res.optimized_params_pd[["sigma", "alpha", "rho"]].values.flatten()

mstar, Sigmastar = bebi103.gp.posterior_mean_cov(
    T_scaled, k_scaled, xstar, sigma=sigma, rho=rho, alpha=alpha
)
high = mstar + 1.96 * np.sqrt(np.diag(Sigmastar))
low = mstar - 1.96 * np.sqrt(np.diag(Sigmastar))
Tstar = T.std() * xstar + T.mean()
kstar = k.std() * mstar + k.mean()
low = k.std() * low + k.mean()
high = k.std() * high + k.mean()
p.line(Tstar, kstar, line_width=2, color="tomato")
p = bebi103.viz.fill_between(
    Tstar,
    high,
    Tstar,
    low,
    show_line=False,
    patch_kwargs=dict(color="tomato", alpha=0.3),
    p=p,
)
# Plot data
p.scatter(source=df.to_dict(), x="T (K)", y="k (1/s)")
bokeh.io.show(p)
The results almost exactly overlap.
[16]:
bebi103.stan.clean_cmdstan()
Computing environment
[17]:
%load_ext watermark
%watermark -v -p numpy,scipy,polars,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.12.8
IPython version : 8.27.0
numpy : 1.26.4
scipy : 1.14.1
polars : 1.17.1
cmdstanpy : 1.2.5
arviz : 0.20.0
bokeh : 3.6.0
bebi103 : 0.1.25
jupyterlab: 4.2.6
cmdstan : 2.36.0