“Hello, world” —Stan
[2]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st
import cmdstanpy
import arviz as az
import iqplot
import bebi103
import colorcet
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
When getting familiar with a new programming language, we often write a “Hello, world” program. This is a simple, often minimal, program that demonstrates some of the basic syntax of the language. Python’s “Hello, world” program is:
[3]:
print("Hello, world.")
Hello, world.
Here, we introduce Stan, and write a “Hello, world” program for it.
Before we do, we note that you may run Stan on your own machine if you have managed to get Stan and CmdStanPy installed. Otherwise, you can use AWS with the bebi103 Amazon Machine Image, available in the Oregon region. If you wish, you may also use Google Colab, though you will be limited in how many cores you can use and how long you can use them.
Basics of Stan programs
This is our first introduction to Stan, a probabilistic programming language that we will use for much of our statistical modeling. Stan is a separate language. It is not Python. It has a command line interface and interfaces for R, Python, Julia, Matlab, and Stata.
We will be using one of the two Python interfaces, CmdStanPy. PyStan is another popular interface. Remember, though, that Stan is a separate language, and any Stan program you write works across all of these interfaces.
Before we dive in and write our first Stan program to draw samples out of the Normal distribution, I want to tell you a few things about Stan. Briefly, Stan works as follows when using the CmdStanPy interface.
1. A user writes a model using the Stan language. This is usually stored in a .stan text file.
2. The model is compiled in two steps. First, Stan translates the model in the .stan file into C++ code. Then, that C++ code is compiled into machine code.
3. Once the machine code is built, the user can, via the CmdStanPy interface, sample out of the distribution defined by the model and perform other calculations (such as optimization and variational inference) with the model.
4. The results from the sampling are written to disk as CSV and txt files. As demonstrated below, we conveniently access these files using ArviZ, so we do not directly interact with them. (A sketch of this workflow in code follows this list.)
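In code, the round trip looks something like this minimal sketch, which we will walk through step by step in this lesson (the file name my_model.stan and the sampling arguments are placeholders):

import cmdstanpy
import arviz as az

# Compile: translate the Stan program to C++ and build an executable
sm = cmdstanpy.CmdStanModel(stan_file='my_model.stan')

# Sample out of the distribution defined by the model using MCMC
samples = sm.sample(chains=4, iter_sampling=1000)

# Read the CSV files Stan wrote to disk into an InferenceData object
samples = az.from_cmdstanpy(samples)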
We will learn the Stan language structure and syntax as we go along. To start with, a Stan program consists of seven sections, called blocks. They are, in order (a schematic skeleton showing all seven blocks in place follows this list):

- functions: Any user-defined functions that can be used in other blocks.
- data: Any inputs from the user. Most commonly, these are the measured data themselves. You can also put user-adjustable parameters in this block, but nothing you intend to sample.
- transformed data: Any transformations that need to be done on the data.
- parameters: The parameters of the model. Stan will give you samples of the variables described in this block. These are the \(\theta\) in the posterior \(g(\theta\mid y)\).
- transformed parameters: Any transformations that need to be done on the parameters.
- model: Specification of the generative model. The sampler will sample the parameters \(\theta\) out of this model.
- generated quantities: Any other quantities you want to calculate with each sample.
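To see how the blocks fit together, here is a schematic skeleton with all seven blocks in their required order. This is only a sketch: the blocks are deliberately left empty, and the file name skeleton.stan is just a placeholder. I write the skeleton as a Python string here, which also shows one way of creating a .stan file from within a notebook.

# A schematic Stan program with all seven blocks, in their required order.
# The empty blocks are placeholders, not a useful model.
stan_skeleton = """
functions {
}
data {
}
transformed data {
}
parameters {
}
transformed parameters {
}
model {
}
generated quantities {
}
"""

# One way to write the skeleton to a .stan file from within the notebook
with open('skeleton.stan', 'w') as f:
    f.write(stan_skeleton)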
Not all blocks need to be in a Stan program, but they must be in this order. Some other important points to keep in mind as we venture into Stan:
- The Stan documentation will be a very good friend of yours, both the user’s guide and the reference manual.
- The index origin of Stan is 1, not 0 as in Python.
- Stan is strongly statically typed, which means that you need to declare the data type of a variable explicitly before using it.
- All Stan commands must end with a semicolon.
- Blocks of code are separated using curly braces.
- Stan programs are stored outside of your notebook in a .stan file. These are text files, which you can prepare with your favorite text editor, including the one included in JupyterLab.
Say hi, Stan
With this groundwork laid, let’s just go ahead and write our “Hello, world” Stan program to generate samples out of a standard Normal distribution (with zero mean and unit variance). (Note that this is not sampling out of a posterior.) Here is the code, which I have stored in the file hello_world.stan.
parameters {
  real x;
}

model {
  x ~ normal(0, 1);
}
Note that there are two blocks in this particular Stan program, the parameters block and the model block. These are two of the seven possible blocks; we will explore the others later in the lesson, after we complete our “Hello, world” program.

In the parameters block, we have the names and types of the parameters we want to obtain samples for. In this case, we want to obtain samples of a real number we will call x.

In the model block, we have our statistical model. The syntax is similar to how we would write the model on paper. We specify that x, the parameter we want to get samples of, is Normally distributed with location parameter zero and scale parameter one.
Now that we have our code (which I have stored in a file named hello_world.stan), we can use CmdStanPy to compile it and get a CmdStanModel, a Python object that gives us convenient access to the compiled Stan executable using Python syntax.
[4]:
sm = cmdstanpy.CmdStanModel(stan_file='hello_world.stan')
10:42:25 - cmdstanpy - INFO - compiling stan file /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/hello_world.stan to exe file /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/hello_world
10:42:34 - cmdstanpy - INFO - compiled model executable: /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/hello_world
Now that we have the Stan model, stored as the variable sm, we can collect samples from it using the sm.sample() method. We pass in the number of chains, that is, the number of Markov chains to use in sampling. We can also pass in the number of sampling iterations to do. We’ll use four chains, each taking 1000 samples. Let’s do it!
[5]:
samples = sm.sample(
chains=4,
iter_sampling=1000,
)
10:42:35 - cmdstanpy - INFO - CmdStan start processing
10:42:35 - cmdstanpy - INFO - CmdStan done processing.
Notice that CmdStanPy conveniently gave us progress bars for the sampling. We can turn those off using the show_progress=False kwarg of sm.sample().
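For example, a quieter version of the sampling call above would look like this sketch.

samples = sm.sample(
    chains=4,
    iter_sampling=1000,
    show_progress=False,
)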
Parsing output with ArviZ
At this point, Stan did its job and acquired the samples. So, it said “hello, world.”
Let’s take a look at the samples. They are stored as a CmdStanMCMC instance.
[6]:
samples
[6]:
CmdStanMCMC: model=hello_world chains=4['method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
csv_files:
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_1.csv
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_2.csv
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_3.csv
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_4.csv
output_files:
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_0-stdout.txt
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_1-stdout.txt
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_2-stdout.txt
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/tmp3duzui3t/hello_world1oqwnzv3/hello_world-20240105104235_3-stdout.txt
The object returned by CmdStanPy points to the CSV and text files that Stan generated while running. We can load them into a more convenient format using ArviZ (pronounced like “RVs,” the abbreviation for “recreational vehicles” or “random variables”).
[7]:
samples = az.from_cmdstanpy(samples)
# Take a look
samples
[7]:
posterior:

<xarray.Dataset>
Dimensions:  (chain: 4, draw: 1000)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
    x        (chain, draw) float64 -0.8778 -0.7971 0.7178 ... 0.2911 0.4784
Attributes:
    created_at:                 2024-01-05T18:42:35.379080
    arviz_version:              0.17.0
    inference_library:          cmdstanpy
    inference_library_version:  1.2.0

sample_stats:

<xarray.Dataset>
Dimensions:          (chain: 4, draw: 1000)
Coordinates:
  * chain            (chain) int64 0 1 2 3
  * draw             (draw) int64 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
Data variables:
    lp               (chain, draw) float64 -0.3853 -0.3177 ... -0.04236 -0.1145
    acceptance_rate  (chain, draw) float64 0.6603 1.0 0.9287 ... 0.9879 0.9833
    step_size        (chain, draw) float64 1.05 1.05 1.05 ... 0.9408 0.9408
    tree_depth       (chain, draw) int64 1 1 1 1 1 1 2 1 1 ... 2 2 1 1 2 1 2 1 1
    n_steps          (chain, draw) int64 3 1 3 1 3 1 3 1 1 ... 3 3 3 3 7 1 3 3 1
    diverging        (chain, draw) bool False False False ... False False False
    energy           (chain, draw) float64 2.378 0.4265 1.021 ... 0.1513 0.1152
Attributes:
    created_at:                 2024-01-05T18:42:35.392548
    arviz_version:              0.17.0
    inference_library:          cmdstanpy
    inference_library_version:  1.2.0
We used ArviZ to convert the data type to an ArviZ InferenceData data type. This has two groups: posterior, which contains the samples, and sample_stats, which gives information about the sampling. (Note that ArviZ named the group “posterior,” which it does by default, even though these samples are out of a standard Normal distribution and not out of a posterior distribution for some model we may have built.) We’ll start by looking at the samples themselves. Since the samples were taken using the model block, they are assumed to be samples out of a posterior distribution, and are therefore present in the samples.posterior group.
[8]:
samples.posterior
[8]:
<xarray.Dataset>
Dimensions:  (chain: 4, draw: 1000)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
    x        (chain, draw) float64 -0.8778 -0.7971 0.7178 ... 0.2911 0.4784
Attributes:
    created_at:                 2024-01-05T18:42:35.379080
    arviz_version:              0.17.0
    inference_library:          cmdstanpy
    inference_library_version:  1.2.0
This is a new, interesting data type: an xarray Dataset. The xarray package is a very powerful package for data analysis. The two main data types we will use are xarray DataArrays and xarray Datasets. You can think of a DataArray like a Pandas data frame, except that the data need not be structured in a two-dimensional table like a data frame is. A Dataset is a collection of DataArrays and associated attributes. Interestingly, if multiple DataArrays in a Dataset have the same indexes, you can index multiple arrays at the same time. Essentially, you can think of xarray structures as Pandas data frames that can be arbitrarily multidimensional.
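As a quick illustration using the samples object from above (a sketch; sel() and to_dataframe() are standard xarray methods):

# Index the whole Dataset at once; this selects chain 0 for every variable
ds_chain0 = samples.posterior.sel(chain=0)

# View the whole posterior group as a tidy Pandas data frame
df = samples.posterior.to_dataframe()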
If we want to access the samples of x, we do so like this.
[9]:
samples.posterior['x']
[9]:
<xarray.DataArray 'x' (chain: 4, draw: 1000)>
array([[-0.877815 , -0.797133 ,  0.717814 , ...,  2.1653   , -0.961177 , -0.69788  ],
       [-1.06055  ,  0.498281 ,  0.74102  , ..., -0.648949 , -0.648949 , -1.17533  ],
       [ 0.345216 , -0.520231 ,  0.770906 , ..., -1.83455  ,  0.0302054, -0.143349 ],
       [-0.28473  ,  0.182923 ,  0.156933 , ..., -0.296204 ,  0.291055 ,  0.478448 ]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
We see that this is a two-dimensional array, with the first index (the rows) being the chain and the second index (the columns) being the draw, of which there are 1000 for each chain. We can put all of our draws together by converting the DataArray to a Numpy array using the .values attribute and then raveling the Numpy array, and then plot an ECDF. The ECDF should look like that of a Normal distribution with location parameter zero and scale parameter one.
[10]:
bokeh.io.show(
iqplot.ecdf(
samples.posterior['x'].values.ravel()
)
)
Indeed it does! We have just verified that Stan properly said, “Hello, world.”
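If you want a more direct visual check than recognizing the shape by eye, you can overlay the theoretical standard Normal CDF on the ECDF. Here is a sketch, using scipy.stats for the theoretical CDF (the styling choices are arbitrary):

# ECDF of the samples with the standard Normal CDF overlaid
p = iqplot.ecdf(samples.posterior['x'].values.ravel())
x = np.linspace(-4, 4, 400)
p.line(x, st.norm.cdf(x), color='orange', line_width=2)
bokeh.io.show(p)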
Direct sampling
Stan can also draw samples out of probability distributions without using MCMC, just as Numpy and Scipy can. For a generic posterior, we use MCMC, but for many named distributions we can directly sample.
Let’s draw 300 random numbers from a Normal distribution with location parameter zero and scale parameter one using Numpy and Scipy.
[11]:
rng = np.random.default_rng()
np_samples = rng.normal(0, 1, size=300)
sp_samples = st.norm.rvs(0, 1, size=300)
# Plot samples
p = iqplot.ecdf(
np_samples,
style='staircase',
palette=[colorcet.b_glasbey_category10[0]],
)
p = iqplot.ecdf(
sp_samples,
style='staircase',
palette=[colorcet.b_glasbey_category10[1]],
p=p,
)
bokeh.io.show(p)
To generate random draws from a standard Normal distribution without using Markov chain Monte Carlo, we use the following Stan code.
generated quantities {
  real x;
  x = normal_rng(0, 1);
}
Let’s compile it, and then comment on the code.
[12]:
sm_rng = cmdstanpy.CmdStanModel(stan_file='norm_rng.stan')
10:42:35 - cmdstanpy - INFO - compiling stan file /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/norm_rng.stan to exe file /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/norm_rng
10:42:43 - cmdstanpy - INFO - compiled model executable: /Users/bois/Dropbox/git/bebi103_course/2024/b/content/lessons/08/norm_rng
There is just one block in this particular Stan code, the generated quantities block. In the generated quantities block, we have code that tells Stan what to generate for each set of parameters it encounters while doing Markov chain Monte Carlo. Here, we are not performing Markov chain Monte Carlo, so we do the “sampling” in fixed parameter mode when we call sm_rng.sample() by setting the fixed_param kwarg to True.
[13]:
# Draw samples
stan_samples = sm_rng.sample(
chains=1,
iter_sampling=300,
fixed_param=True,
)
10:42:43 - cmdstanpy - INFO - CmdStan start processing
10:42:43 - cmdstanpy - INFO - CmdStan done processing.
To convert this sampling object to a Numpy array, we can first convert it to an ArviZ InferenceData instance and then extract the Numpy array. Note that we will define the samples as coming from a “posterior,” even though it is not a posterior, since that’s the default for ArviZ.
[14]:
# Convert to ArviZ InferenceData
stan_samples = az.from_cmdstanpy(
posterior=stan_samples
)
# Extract Numpy array
stan_samples = stan_samples.posterior['x'].values.flatten()
Now, we can add the ECDF of these samples to the plot of Numpy and Scipy samples.
[15]:
p = iqplot.ecdf(
stan_samples,
style='staircase',
palette=[colorcet.b_glasbey_category10[2]],
p=p,
)
bokeh.io.show(p)
As we expect, sampling with Stan gives the same results.
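If you also want a rough quantitative check to go with the visual one, you could compare the Stan draws with the Numpy draws using a two-sample Kolmogorov-Smirnov test from scipy.stats. This is just a sketch; with only 300 draws per data set, it is a coarse check.

# Two-sample KS test comparing Stan draws with Numpy draws
ks_result = st.ks_2samp(stan_samples, np_samples)
print(ks_result.statistic, ks_result.pvalue)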
Why are we using that?
Yes, sampling using MCMC is the novel capability that Stan offers, and though we used it here to sample out of a trivial distribution (a standard Normal), we can use it to sample out of very complex distributions. But with respect to the direct sampling we just did, you might be thinking, “Sampling using Stan was so much harder than with Numpy! Why are we doing that?” The answer is that for more complicated models, and for doing things like prior predictive checks and posterior predictive checks, using Stan for all modeling is more convenient.
Recalling also last term’s course, here is a breakdown of when we will use the respective samplers.
- We will use Numpy for sampling techniques in frequentist-based inference, that is, for things like computing confidence intervals and p-values using resampling methods.
- We will use scipy.stats when plotting distributions and when using optimization methods in Bayesian inference.
- We will occasionally use Numpy for prior predictive checks and posterior predictive checks (defined in coming lessons).
- We will use Stan for everything else. This includes all Bayesian modeling that does not use optimization (and even some that does).
Displaying your Stan code
When you are working on assignments, your Stan models are written as separate files. It is instructive to display the Stan code in the Jupyter notebook. This is easily accomplished for any CmdStanPy model using the code() method.
[16]:
print(sm.code())
parameters {
  real x;
}

model {
  x ~ normal(0, 1);
}
You should do this in your notebooks so the code is visible.
Saving samples
While your samples are saved in CSV and text files by Stan, it is convenient to save the sampling information in a format that can immediately be read into an ArviZ InferenceData object. The NetCDF format is useful for this. ArviZ enables saving as NetCDF as follows.
[17]:
samples.to_netcdf('stan_hello_world.nc')
[17]:
'stan_hello_world.nc'
The call returns, as a string, the name of the file to which the NetCDF data were written. The samples can be read back from the NetCDF file using az.from_netcdf().
[18]:
samples = az.from_netcdf('stan_hello_world.nc')
Cleaning up the shrapnel
When using Stan, CmdStanPy leaves a lot of files on your file system.
- Your Stan model is translated into C++, and the result is stored in a .hpp file.
- The .hpp file is compiled into an object file (a .o file).
- The .o file is used to build an executable.
All of these files are deposited in your present working directory; they add clutter and can get annoying for version control purposes. To clean them up after you are finished running your models, you can run the function below.
[19]:
bebi103.stan.clean_cmdstan()
When doing sampling, the results are stored in a /var/ directory in various CSV and text files. We never work with these directly, but rather read them into RAM in a convenient az.InferenceData object using ArviZ. When exiting your session, CmdStanPy deletes all of these CSV files, etc., unless you specify a directory in which to store the results in your call to sm.sample() using the output_dir kwarg.
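For example, to keep the sampling results around in a directory of your choosing, the call might look like this sketch (the directory name is a placeholder):

samples = sm.sample(
    chains=4,
    iter_sampling=1000,
    output_dir='./stan_output',
)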
Computing environment
[20]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,cmdstanpy,arviz,iqplot,bebi103,bokeh,colorcet,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.11.5
IPython version : 8.15.0
numpy : 1.26.2
pandas : 2.1.4
scipy : 1.11.4
cmdstanpy : 1.2.0
arviz : 0.17.0
iqplot : 0.3.5
bebi103 : 0.1.19
bokeh : 3.3.0
colorcet : 3.0.1
jupyterlab: 4.0.10
cmdstan : 2.33.1