Using Google Colab

Google Colab provides a convenient, free platform for running Jupyter notebooks in the cloud.

To use Google Colab, you must have a Google account. Caltech students and employees have an account through Caltech’s G Suite. Many of you may also have a personal Google account, usually set up for things like Gmail, YouTube, etc. For your work in this class, use your Caltech account. This will facilitate collaboration with your teammates in the course, as well as with course staff.

Many of you probably use your personal Google account on your machine, so it can get annoying to log in and out of it. A trick that I find useful is to use one browser, e.g., Safari or Microsoft Edge, for your personal use, web browsing, etc., and a different browser for your scientific work, including the work in this class. Google Colab is best tested on Chrome, Firefox, and Safari (in fact, JupyterLab, which you will use on your own machine, only supports these three browsers).

Once you have either logged out of all of your personal accounts or have a different browser open, you can launch a Colab notebook by simply navigating to https://colab.research.google.com/. Alternatively, you can click the “Launch in Colab” badge at the top right of this page, and you will launch this notebook in Colab. That badge will appear in the top right of all pages in the course content generated from notebooks.

Watchouts when using Colab

If you do run a notebook in Colab, you are doing your computing on one of Google’s computers via a virtual machine. You get two CPU cores and limited (about 12 GB, but it varies) RAM. You can also get GPUs and TPUs (Google’s tensor processing units), but we will not use those in this course. The computing resources should be enough for all of our calculations this term, but due to the limitation to two cores and some of the watchouts below, Colab may not be the best platform to use.
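
To see what a given session actually provides, you can query the machine from a code cell. The following is a minimal sketch using only the standard library. The helper name describe_resources is made up for illustration, and reading /proc/meminfo assumes a Linux machine (which Colab virtual machines are); on other systems the memory is reported as unknown.

```python
import os


def describe_resources(meminfo_path="/proc/meminfo"):
    """Return (number of CPU cores, total RAM in GiB or None if unknown)."""
    n_cores = os.cpu_count()

    total_gib = None
    # /proc/meminfo exists on Linux; on other systems we skip the RAM lookup
    if os.path.exists(meminfo_path):
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    # The line looks like "MemTotal:  13290480 kB"
                    total_gib = int(line.split()[1]) / 1024**2
                    break

    return n_cores, total_gib


n_cores, total_gib = describe_resources()
print(f"CPU cores: {n_cores}")
print("RAM:", "unknown" if total_gib is None else f"{total_gib:.1f} GiB")
```

The numbers you see may differ from session to session, since the allocation varies.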

If your notebook is idle for too long, you will get disconnected from your notebook. “Idle” means that cells are not being edited or executed. The idle timeout varies depending on the load on Google’s computers; I find that I almost always get disconnected if idle for an hour.
Your virtual machine will disconnect if it has been in use for too long. It is typically available for only 12 hours before disconnecting, though times can vary, again based on load. If your sampling code is inefficient, a calculation may exceed 12 hours, so this may present a problem.
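
One rough way to judge whether a long calculation will fit within the time limit is to time a small pilot run and extrapolate. The sketch below does this with the standard library. The function estimate_hours and the stand-in workload are hypothetical, and the linear scaling is an assumption that may not hold for real samplers (warmup, for instance, does not scale this way), so treat the result as an order-of-magnitude guide only.

```python
import time


def estimate_hours(run_pilot, pilot_size, full_size):
    """Time run_pilot() and linearly extrapolate to full_size, in hours."""
    start = time.perf_counter()
    run_pilot()
    elapsed = time.perf_counter() - start

    # Assumes run time grows in proportion to the size of the job
    return elapsed * (full_size / pilot_size) / 3600


def pilot():
    """Stand-in for a short sampling run (hypothetical workload)."""
    sum(i * i for i in range(100_000))


hours = estimate_hours(pilot, pilot_size=100, full_size=4000)
print(f"Estimated full run: {hours:.4f} hours")
if hours > 12:
    print("This would likely exceed the typical 12-hour limit.")
```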

These limitations are in place so that Google can offer Colab for free. If you want more cores, longer timeouts, etc., you might want to check out Colab Pro. However, the free tier should work well for you in the course. You can of course always run on your own machine, and in fact are encouraged to do so.

There are additional software-specific watchouts when using Colab.

Colab does not allow for full functionality of Bokeh apps with Python callbacks.
Colab instances have specific software installed, so you will need to install anything else you need in your notebook. This is not a major burden, and is discussed in the next section.

To circumvent these limitations, you may wish to upgrade to Colab Pro, which is not free. I recommend reading the Colab FAQs for more information about Colab.

Software in Colab

When you launch a Google Colab notebook, much of the software we will use in class is already installed. It is not always the latest version of the software, however. In fact, as of January 2024, Colab is running Python 3.10, whereas you will run Python 3.11 on your machine through your Anaconda installation. Nonetheless, most (but not all) of the analyses we do for this class will work just fine in Colab.
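
If you want to know which versions a session is actually running, you can ask Python directly. This is a small sketch using the standard library's importlib.metadata; the helper installed_versions is a name chosen here for illustration, and the package list is only an example.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", ".".join(str(v) for v in sys.version_info[:3]))

packages = ["numpy", "polars", "bokeh", "arviz", "cmdstanpy", "bebi103"]
for name, version in installed_versions(packages).items():
    print(f"{name:10s} {version or 'not installed'}")
```

Any package reported as not installed needs to be installed in the session before you import it.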

Because the notebooks in Colab have software preinstalled, and no more, you will often need to install software before you can run the rest of the code in a notebook. To enable this, when necessary, in the first code cell of each notebook in this class, we will have the following code (or a variant thereof depending on what is needed or if the default installations of Colab change). Running this code will not affect running your notebook on your local machine; the same notebook will work on your local machine or on Colab. Importantly, when using Stan, you will need to install Stan in your Colab session using cmdstanpy.install_cmdstan(), which can take some time, usually several minutes.

# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

The above method works well and ensures that you have the most recent version of CmdStan installed. The drawback is that the installation of CmdStan takes several minutes. As an alternative, if you want a quicker installation, you can use pre-built binaries of CmdStan for Colab.

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

Notebooks will use this faster mode of installing CmdStan.
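
The setup cells capture the output of pip but never inspect it, so a failed installation passes silently and only shows up later as an import error. If you run into that, a variant along the following lines may help with debugging. It is a sketch, not the course's setup code: run_checked is a hypothetical helper, and the package list is just an example.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command and return (succeeded, combined stdout and stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


if "google.colab" in sys.modules:
    # Call pip through the running interpreter so packages land in the
    # same environment the notebook is using
    cmd = [sys.executable, "-m", "pip", "install", "--upgrade", "iqplot", "watermark"]
    ok, log = run_checked(cmd)
    if not ok:
        # Show the tail of pip's output, where the error usually appears
        print(log[-2000:])
```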

In addition to installing the necessary software on a Colab instance, the setup cell also sets the path to the data sets we will use in the course. When running in Colab, data sets are fetched from cloud storage on AWS. When running on your local machine for homework, the data are expected in a data/ directory one level up from where you are working.
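
As an illustration of how data_path is meant to be used, the sketch below builds the location of a data file in both settings. The function get_data_path and the file name example.csv are hypothetical; the two path strings are the ones from the setup cell. Because both end in a slash, plain string concatenation is enough.

```python
import sys


def get_data_path(on_colab):
    """Return the data location used on Colab or on a local machine."""
    if on_colab:
        return "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
    return "../data/"


on_colab = "google.colab" in sys.modules
data_path = get_data_path(on_colab)

# "example.csv" is a made-up file name for illustration
fname = data_path + "example.csv"
print(fname)
```

The resulting string can then be handed to whatever function reads the file, provided that function accepts both local paths and URLs.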

In most notebooks, the Colab and data path setup code cells are hidden in the HTML rendering to avoid clutter, but will be present when you download the notebooks.

A sample calculation

To verify that Colab works for you, fire up a Colab instance and run the code below.


# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np

import bebi103
import cmdstanpy
import arviz as az

import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()

schools_data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

schools_code = """
data {
  int<lower=0> J; // number of schools
  vector[J] y; // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates
}

parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}

transformed parameters {
  vector[J] theta = mu + tau * eta;
}

model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

with open("schools_code.stan", "w") as f:
    f.write(schools_code)

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="schools_code.stan")
    samples = sm.sample(data=schools_data, output_dir="./", show_progress=False)

samples = az.from_cmdstanpy(samples)

bebi103.stan.clean_cmdstan()

# Make a plot of samples
p = bokeh.plotting.figure(
    frame_height=250, frame_width=250, x_axis_label="μ", y_axis_label="τ"
)
p.scatter(
    np.ravel(samples.posterior["mu"]),
    np.ravel(samples.posterior["tau"]),
    alpha=0.1
)

bokeh.io.show(p)

[Plot: scatter of posterior samples, with μ on the x-axis and τ on the y-axis]

Computing environment


%load_ext watermark
%watermark -v -p numpy,cmdstanpy,arviz,bebi103,bokeh,jupyterlab

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

numpy     : 1.26.4
cmdstanpy : 1.2.4
arviz     : 0.20.0
bebi103   : 0.1.25
bokeh     : 3.4.1
jupyterlab: 4.2.5