44  Principal component analysis: A heuristic approach

Open in Google Colab | Download notebook

Code
# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import warnings

import numpy as np
import scipy.io

import sklearn
import sklearn.decomposition
import sklearn.preprocessing

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

44.1 Principal component analysis: A heuristic approach

In our discussion of principal component analysis, we will first approach the problem heuristically without defining a full generative model. We will then demonstrate a Bayesian model behind this heuristic.

44.1.1 The problem of PCA

We often obtain high-dimensional data sets, in which each observation has many dimensions. In the parlance of practitioners of machine learning, each multidimensional observation is called a sample and each dimension is called a feature. We will use the observation/dimension and sample/feature nomenclature interchangeably. Considering the Palmer’s penguin data set, each observation (sample) is a penguin, and the dimensions (features) are bill depth, bill length, flipper length, and body mass.

The features, though many, may be related. We can imagine that there are unobservable variables, called latent variables at play the influence many features. For example, consider the four features of the penguin. They could all scale with the caloric intake of the penguins, in which case the latent variable of caloric intake could use used to predict all four of the features; it is responsible for most of the variation in what is observed. (We have seen latent variables before when we \(z_{ik}\) in Gaussian mixture models we encountered in Chapter 34.)

Let us now formalize this. Let \(\mathbf{y}\in \mathbb{R}^D\) be a set of observable features. There are \(D\) of them. In the case of the penguins, \(D = 4\). Let \(\mathbf{z} \in \mathbb{R}^L\) be the set of latent variables that influence the observations. Typically, \(L < D\) and often \(L \ll D\), and, as such, if we can infer the latent variable \(\mathbf{z}\), we say we have achieved dimensionality reduction. Specifically, we define a model for \(\mathbf{y}\), \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters. So, to achieve dimensionality reduction, we want to find a function \(h(\mathbf{z};\theta)\) that most faithfully reproduces the observed \(\mathbf{y}\).

44.1.2 Linear \(h(\mathbf{z};\theta)\)

As is commonly done, we will restrict the functions \(h(\mathbf{z};\theta)\) we choose to be linear functions. Further, we stipulate that \(h(\mathbf{z};\theta)\) serves to orthogonally project \(\mathbf{z}\) onto \(\mathbf{y}\).

\[\begin{align} h(\mathbf{z};\mathsf{W}) = \mathsf{W}\cdot \mathbf{z}, \end{align} \]

where the parameters \(\theta\) are given by the \(D \times L\) matrix \(\mathsf{W}\), commonly referred to as a loading matrix. This matrix is semiorthogonal amd unitary such that the columns are orthonormal vectors and \(\mathsf{W}^\mathsf{T}\cdot \mathsf{W} = \mathsf{I}\), the identity matrix. If \(\mathbf{w}_j\) is a column of \(\mathsf{W}\), then

\[\begin{align} \mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_k = \delta_{jk}, \end{align} \]

where \(\delta_{jk}\) is the Kronecker delta, which is equal to one when \(j = k\) and zero otherwise.

Now, let \(\hat{\mathbf{y}}\in \mathbb{R}^D\) be the a data point we get by applying \(h(\mathbf{z};\mathsf{W})\),

\[\begin{align} \hat{\mathbf{y}} = h(\mathbf{z}; \mathsf{W}) = \mathsf{W} \cdot \mathbf{z}. \end{align} \tag{44.1}\]

We seek a loading matrix \(\mathsf{W}\) and set of latent variables \(\{\mathbf{z}\}\) that most faithfully reproduce the measured data \(\{\mathbf{y}\}\), where I am using the braces to denote a set of data; that is \(\{\mathbf{y}\} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots \mathbf{y}_N\}\), where \(N\) is the number of observations we have made. That is, we want \(\mathbf{y}_i\) to be as close as possible to \(\hat{\mathbf{y}}_i\) for all \(i\). Toward this end, we define a loss function as the mean square distance between \(\mathbf{y}_i\) and \(\hat{\mathbf{y}}_i\).

\[\begin{align} \text{loss}(\mathsf{W}, \{\mathbf{z}\}) = \frac{1}{N}\sum_{i=1}^N \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i\rVert_2^2 = \frac{1}{N}\sum_{i=1}^N \lVert \mathbf{y}_i - \mathsf{W}\cdot \mathbf{z}_i\rVert_2^2, \end{align} \tag{44.2}\]

The goal of principal component analysis is to find the \(\mathsf{W}\) that minimizes this loss function. Once this optimal \(\mathsf{W}\) is found, we compute \(\hat{\mathbf{z}} = \mathsf{W}^\mathsf{T}\cdot \mathbf{y}\); the entries of \(\hat{\mathbf{z}}\) are called the principal components.

44.1.3 The optimal projection

In this section, we work out what loading matrix \(\mathsf{W}\) optimizes the above loss function. Be careful not to get lost in the weeds of the mathematics that follows. Just remember that we are simply finding the projection that minimizes the mean squared difference between measured and projected data.

As we work to find the optimal projection, it will help us to write the loss function in terms of the columns of \(\mathsf{W}\).

\[\begin{align} \text{loss}(\mathsf{W}, \{\mathbf{z}\}) &= \frac{1}{N}\sum_{i=1}^N \left\lVert \mathbf{y}_i - \sum_{j=1}^L z_{ij}\mathbf{w}_j\right\rVert_2^2 \\[1em] &= \frac{1}{N}\sum_{i=1}^N\left(\mathbf{y}_i - \sum_{j=1}^L z_{ij}\mathbf{w}_j\right)^\mathsf{T}\cdot\left(\mathbf{y}_i - \sum_{j=1}^L z_{ij}\mathbf{w}_j\right) \\[1em] &= \frac{1}{N}\sum_{i=1}^N\left(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - 2\sum_{j=1}^L z_{ij}\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j + \sum_{j=1}^L\sum_{k=1}^L z_{ij}z_{ik}\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k\right). \end{align}\]

We wish to find the \(\{\mathbf{w}_j\}\) and \(\{\mathbf{z}_j\}\) that minimize the loss function. To do so while enforcing the orthonormality of \(\mathsf{W}\), we define a Lagrangian

\[\begin{align} \mathcal{L} &= \text{loss}(\mathsf{W}, \{\mathbf{z}\}) + \sum_{j=1}^L \lambda_j(1 - \mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_j) + \sum_{j=1}^L\sum_{k=j+1}^L \mu_{jk}\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k \\[1em] &= \frac{1}{N}\sum_{i=1}^N\left(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - 2\sum_{j=1}^L z_{ij}\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j + \sum_{j=1}^L\sum_{k=1}^L z_{ij}z_{ik}\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k\right) \\[1em] &\;\;\;\;+ \sum_{j=1}^L \lambda_j(\mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_j - 1) + \sum_{j=1}^L\sum_{k=j+1}^L \mu_{jk}\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k. \end{align}\]

where the Lagrange multipliers denoted with \(\lambda\) enforce the normality of the columns of \(\mathsf{W}\) and those denoted with \(\mu\) enforce the orthogonality of the columns of \(\mathsf{W}\).

Because the loss function and the normality constraints are all strictly convex, the necessary and sufficient conditions for minimizing the loss function are the Karush-Kuhn-Tucker (KKT) conditions,

\[\begin{align} &\frac{\partial \mathcal{L}}{\partial \mathbf{w}_j} = \mathbf{0}, \\[1em] &\frac{\partial \mathcal{L}}{\partial z_{ij}} = 0, \\[1em] &\frac{\partial \mathcal{L}}{\partial\lambda_j} = 0, \\[1em] &\frac{\partial \mathcal{L}}{\partial\mu_{jk}} = 0. \end{align}\]

for all \(i\in[1,N]\), \(j\in[1,L]\), and \(k \in [j+1,L]\). To avoid stumbling over indices that are present in sums, I prefer to differentiate with respect to a variable with a specific index, e.g., with respect to \(z_{lm}\). With that in mind, let us compute the derivatives of the KKT conditions one-by-one, in reverse order as they are listed above.

\[\begin{align} \frac{\partial \mathcal{L}}{\partial\mu_{lm}} = \mu_{lm}\mathbf{w}_l^\mathsf{T}\cdot\mathbf{w}_m = 0, \end{align}\]

noting that \(m > l\). Not surprisingly, this means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k = 0\) for all \(j\ne k\); the columns of \(\mathsf{W}\) are orthogonal.

Turning now to the derivative with respect to the \(\lambda\)’s,

\[\begin{align} \frac{\partial \mathcal{L}}{\partial\lambda_l} = \mathbf{w}_l^\mathsf{T}\cdot \mathbf{w}_l - 1 = 0, \end{align}\]

which means that \(\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j = 1\). With the previous KKT condition, this means that the columns of \(\mathsf{W}\) are orthonormal, or

\[\begin{align} \mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_k = \delta_{jk}. \end{align}\]

Now considering the derivative with respect to \(z_{lm}\).

\[\begin{align} \frac{\partial \mathcal{L}}{\partial z_{lm}} &= \frac{1}{N}\left(-2\mathbf{y}_l^\mathsf{T}\cdot\mathbf{w}_m + 2\sum_{k=1}^Lz_{lm}\mathbf{w}_m^T\cdot \mathbf{w}_k\right)\\[1em] &= -\frac{2}{N}\left(\mathbf{y}_l^\mathsf{T}\cdot\mathbf{w}_m - z_{lm}\right) = 0, \end{align}\]

where we have used \(\mathbf{w}_m^\mathsf{T}\cdot \mathbf{w}_k = \delta_{mk}\), which we worked out from the first two KKT conditions we considered. Thus, we have,

\[\begin{align} z_{ij} = \mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j = \mathbf{w}_j^\mathsf{T}\cdot\mathbf{y}_i. \end{align}\]

We have satisfied three of the four KKT conditions. We can substitute these results into the loss function because doing so does not change the feasible set of minimizers. The updated Lagrangian is then

\[\begin{align} \text{loss}(\mathsf{W}) = &= \frac{1}{N}\sum_{i=1}^N\left(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - 2\sum_{j=1}^L (\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j)\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j + \sum_{j=1}^L(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j)^2\right) \\[1em] &= \frac{1}{N}\sum_{i=1}^N\left(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j)^2\right) \\[1em] &= \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \left[\frac{1}{N}\sum_{i=1}^N(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j)^2\right] \\[1em] &= \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j, \end{align}\]

where we have recognized the bracketed term as a quadratic form with the \(D\times D\) matrix \(\hat{\mathsf{\Sigma}}\) having entries given by

\[\begin{align} \hat{\Sigma}_{jk} = \frac{1}{N}\sum_{i=1}^N y_{ij}\, y_{ik}. \end{align}\]

We will clarify the suggestive symbol we chose for this matrix momentarily. The updated Lagrangian is then

\[\begin{align} \mathcal{L} &= \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j \\[1em] &\;\;\;\;+ \sum_{j=1}^L \lambda_j(\mathbf{w}_j^\mathsf{T}\cdot \mathbf{w}_j - 1) - \sum_{j=1}^L\sum_{k=j+1}^L \mu_{jk}\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_k. \end{align}\]

Now, differentiating with respect to \(\mathbf{w}_l\) and setting the result to zero to satisfy our final KKT condition, we have

\[\begin{align} \frac{\partial \mathcal{L}}{\partial \mathbf{w}_l} = -2\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_l + 2\lambda_l\,\mathbf{w}_l + \sum_{k=1}^{l-1} \mu_{kl}\mathbf{w}_k + \sum_{k=l+1}^{L} \mu_{lk}\mathbf{w}_k = \mathbf{0}. \end{align}\]

Dotting both sides of this equation with \(\mathbf{w}_l^\mathsf{T}\) yields

\[\begin{align} -2\mathbf{w}_l^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_l + 2\lambda_j\,\mathbf{w}_l^\mathsf{T}\cdot\mathbf{w}_j + \sum_{k=1}^{l-1} \mu_{kj}\mathbf{w}_l^\mathsf{T}\cdot\mathbf{w}_k + \sum_{k=l+1}^{L} \mu_{lk}\mathbf{w}_l^\mathsf{T}\cdot\mathbf{w}_k = 0. \end{align}\]

Using

\[\begin{align} \mathbf{w}_l^\mathsf{T}\cdot \mathbf{w}_k = 0, \end{align}\]

for \(l \ne k\), this reduces to

\[\begin{align} -2\mathbf{w}_l^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_l + 2\lambda_l\mathbf{w}_l^\mathsf{T}\cdot\mathbf{w}_l = 0, \end{align}\]

which we can rewrite as

\[\begin{align} \frac{\mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j}{\mathbf{w}_j^\mathsf{T}\cdot\mathbf{w}_j} = \lambda_j. \end{align}\]

We recognize this equation as an expression for a Rayleigh quotient. Evidently, then, the Lagrange multipliers \(\lambda_j\) are the eigenvalues of \(\hat{\mathsf{\Sigma}}\) and \(\mathbf{w}_j\) are the normalized eigenvectors. Inserting this expression for \(\mathbf{w}_j\) into the loss function yields

\[\begin{align} \text{loss}(\mathsf{W}) = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \mathbf{w}_j^\mathsf{T}\cdot(\lambda_j\mathbf{w}_j) = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \lambda_j. \end{align}\]

The loss function is minimized if we choose the \(L\) largest eigenvalues of \(\hat{\mathsf{\Sigma}}\). Therefore, the columns of \(\mathsf{W}\) are comprised of the eigenvectors of \(\hat{\mathsf{\Sigma}}\) corresponding to its largest eigenvalues.

44.1.4 Centering and scaling, the interpretation of \(\hat{\mathsf{\Sigma}}\), and maximal variance

Assume for a moment that the mean of \(\mathbf{y}\) is zero,

\[\begin{align} \bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i = \mathbf{0}, \end{align} \]

and the variance of each of the \(D\) entries in \(\mathbf{y}\) is one,

\[\begin{align} s_j = \frac{1}{N}\sum_{i=1}^N (y_{i,j} - \bar{y}_j)^2 = \frac{1}{N}\sum_{i=1}^N y_{i,j}^2 = 1\;\forall j\in[1, D]. \end{align} \]

In that case, the entries of \(\hat{\mathsf{\Sigma}}\) are

\[\begin{align} \hat{\Sigma}_{jk} = \frac{1}{N}\sum_{i=1}^N y_{ij}\, y_{ik} = \frac{1}{N}\sum_{i=1}^N (y_{ij} - \bar{y_j})(y_{ik} - \bar{y_k}), \end{align} \tag{44.3}\]

which we recognize as the elements of the empirical covariance matrix (or, equivalently, since the diagonal entries are all one, an empirical correlation matrix). Thus, \(\hat{\mathsf{\Sigma}}\) is the plug-in estimate for the covariance matrix when using centered-and-scaled data.

Consider a principal component \(z_{ij}\). The plug-in estimate for the variance of this is

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \frac{1}{N}\sum_{i=1}^N z_{ij}^2 - \left(\frac{1}{N}\sum_{i=1}^Nz_{ij}\right)^2. \end{align} \]

Let us compute the first moment.

\[\begin{align} \frac{1}{N}\sum_{i=1}^Nz_{ij} = \frac{1}{N}\sum_{i=1}^N \mathbf{w}_{j}^\mathsf{T} \cdot \mathbf{y}_i = \mathbf{w}_{j}^\mathsf{T}\cdot\left(\frac{1}{N}\sum_{i=1}^N \mathbf{y}_i\right) = 0, \end{align}\]

where we have recognized the average in parentheses to be zero since the data are centered. Therefore, the plug-in estimate for the variance is

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \frac{1}{N}\sum_{i=1}^N z_{ij}^2 = \frac{1}{N}\sum_{i=1}^N \frac{1}{N}\sum_{i=1}^N(\mathbf{y}_i^\mathsf{T}\cdot\mathbf{w}_j)^2. \end{align}\]

We have seen this expression before. This is a quaratic form involving the covariance matrix \(\hat{\mathsf{\Sigma}}\).

\[\begin{align} \widehat{\mathbb{V}(z_{ij})} = \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j. \end{align}\]

Recall that the loss function is

\[\begin{align} \text{loss}(\mathsf{W}) = \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i^\mathsf{T}\cdot\mathbf{y}_i - \sum_{j=1}^L \mathbf{w}_j^\mathsf{T}\cdot\hat{\mathsf{\Sigma}}\cdot\mathbf{w}_j = \text{constant} - \widehat{\mathbb{V}(z_{ij})}. \end{align}\]

Therefore, finding the transformation \(\mathbf{w}_j\) that minimizes the loss function is the same as finding the transformation that maximizes the variance of the projection. This is why the principal components are often referred to as the directions that have maximal variance.

As a final note, I mention that it is important to center and scale the data set before performing PCA because given directions may have high variance just because of the scaling of the measurement. This is generally the variance we do not wish to interpret.

44.1.5 Prescription for PCA

Following the above analysis, we arrive at a prescription for PCA. After choosing the number of principal components \(L\), do the following.

  1. Center and scale the data set by subtracting the mean and dividing by the standard deviation along each dimension.
  2. Compute the empirical covariance matrix.
  3. Compute the first \(L\) eigenvalues and eigenvectors.
  4. Construct the loading matrix \(\mathsf{W}\) from the eigenvectors corresponding to the \(L\) largest eigenvalues.
  5. The principal components are \(\mathbf{z}_i = \mathbf{W}^\mathsf{T} \cdot \mathbf{y}_i \;\forall i\).

44.1.6 Nonidentifiability of PCA

The loading matrix \(\mathsf{W}\) is comprised of the eigenvectors of the empirical covariance matrix and therefore the columns are orthogonal, since the covariance matrix is symmetric. However, any of these eigenvectors can be multiplied by a scalar (which is true in general for an eigenvector; if \(\mathbf{v}\) is an eigenvector of \(\mathsf{A}\), then so is \(\alpha \mathbf{v}\) for real \(\alpha \ne 0\).) We can insist that the columns of \(\mathsf{W}\) for an orthonormal basis, such that the magnitude of each eigenvector is 1, which is what is commonly done. However, any given column of \(\mathsf{W}\) may be multiplied by \(-1\), still resulting in a nonidentifiability. We therefore cannot uniquely find a loading matrix \(\mathsf{W}\) and therefore a unique set of principal components.

44.2 PCA of a data set

We will perform PCA on data from Remedios, et al. that we visited in ?sec-matlab-files. Recall that in that data set, we had recording from 115 neurons over 10 or so minutes. Let’s load in the data set and convert everything to Numpy arrays as we did in ?sec-matlab-files.

# Load in the data set and separate into Numpy arrays
data = scipy.io.loadmat(os.path.join(data_path, 'hypothalamus_calcium_imaging_remedios_et_al.mat'))
neural_data = data['neural_data']     # Row is time, column is neuron
attack_vector = data['attack_vector'].flatten()
sex_vector = data['sex_vector'].flatten()

We also have time points, which we can set up knowing that the sampling rate is 30 Hz.

# Time points at 30 Hz sampling
t = np.arange(len(attack_vector)) / 30

Finally, we can center and scale the data. We will also transpose the neural data so that it has dimension \(N\times D\) for ease in comparing to the above theoretical results.

# Transpose
y = neural_data.T

# Center data set by subtracting the mean from each time point and dividing by the standard deviation
y = (y - np.mean(y, axis=0)) / np.std(y, axis=0)

We can now carry out the prescription of PCA. We are hoping to visualize the trajectory of the neuronal activity of the mouse on a 2D plot, so we will specify that there are \(L = 2\) latent variables (principal componenents).

# Two principle components
L = 2

# Compute the plugin estimate for the covariance matrix
cov = np.cov(y.T)

# Compute the eigensystem
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Sort highest to lowest
idx = eigenvalues.argsort()[::-1]

# Extract largest L to build eigenvalues and loading matrix
lam = eigenvalues[idx][:L]
W = eigenvectors[:, idx][:, :L]

# Compute principal components
z = np.dot(W.T, y.T)

We will plot the visualization of the principal components in a moment, but for now, we will take the opportunity to compute the fraction of the variance that each principal component contributes. Given that the eigenvalues are the eigenvalues of the covariance matrix, each eigenvalue gives the scale of the variance along the given eigenvector. Thus, the fraction of the total variance that runs along principal component \(j\) is \(\lambda_j / \sum_{j=1}^L \lambda_j\).

lam / eigenvalues.sum()
array([0.22153642, 0.15257567])

Evidently, the first two principal components account for nearly 40% of the total variance.

Now, we will proceed to plot the principal components. We will plot them against each other and color the glyphs by time or by sex of the presented mouse. We will also plot each of the two principal components vs time.

p1 = bokeh.plotting.figure(
    frame_height=350,
    frame_width=350,
    x_axis_label="PC 1",
    y_axis_label="PC 2",
)

p2 = bokeh.plotting.figure(
    frame_height=150,
    frame_width=550,
    x_axis_label="time",
    y_axis_label="PC",
)

# Set up date for the plot
time_color = bebi103.viz.q_to_color(t, bokeh.palettes.Viridis256)
sex_color = ["orchid" if s else "dodgerblue" for s in sex_vector]
cds = bokeh.models.ColumnDataSource(dict(pc1=z[0], pc2=z[1], t=t, color=time_color))

# We'll allow selection of which color we want to visualize
color_selector = bokeh.models.RadioButtonGroup(labels=["time", "sex"], active=0)
js_code = """
cds.data['color'] = color_selector.active == 0 ? time_color : sex_color;
cds.change.emit();
"""

color_selector.js_on_event(
    "button_click",
    bokeh.models.CustomJS(
        code=js_code,
        args=dict(
            time_color=time_color,
            sex_color=sex_color,
            cds=cds,
            color_selector=color_selector,
        ),
    ),
)

# Populate glyphs
p1.scatter(source=cds, x="pc1", y="pc2", color="color")
p2.line(source=cds, x='t', y='pc1', legend_label='PC 1')
p2.line(source=cds, x='t', y='pc2', color="orange", legend_label='PC 2')

# Lay out and show!
bokeh.io.show(
    bokeh.layouts.column(
        bokeh.layouts.row(p1, bokeh.layouts.column(bokeh.models.Div(text="Color by"), color_selector)),
        p2
    )
)

When looking at the coloring by sex, it is clear that there is a difference along the PC 1-axis depending on what sex of mouse is presented. When the male mouse is introduced about 250 seconds into the experiment, PC 1 shifts to exceed PC 2.

44.2.1 Reconstruction of neuronal measurements from principal components

By restricting ourselves to only two latent variables, we necessarily have omitted some of the dynamics observed in the system. To assess what we have lost, we can reconstruct the neuronal signal from the estimates of the latent variables and the corresponding loading matrix \(\mathsf{W}\). Recall from Equation 44.1 that the reconstructed data set is

\[\begin{align} \hat{\mathbf{y}}_i = h(\mathbf{z}_i; \mathsf{W}) = \mathsf{W} \cdot \mathbf{z}_i\;\forall i. \end{align} \]

We can directly compute this and compare to the measured signal.

# Compute the reconstruction
y_hat = np.dot(W, z)

# Uncenter and unscale the reconstruction
y_hat = np.std(neural_data, axis=0) * y_hat + np.mean(neural_data, axis=0)

# Kwargs for plotting
kwargs = dict(
    x_interpixel_distance=1/30,
    y_interpixel_distance=1,
    frame_height=150,
    frame_width=600,
    x_axis_label="time (s)",
    y_axis_label="neuron",
)

# Plot original data
p_neuron = bebi103.image.imshow(neural_data, **kwargs)

# Plot reconstructed data
p_hat = bebi103.image.imshow(y_hat, **kwargs)

p_neuron.yaxis.major_label_text_font_size = '0pt'
p_hat.yaxis.major_label_text_font_size = '0pt'

bokeh.io.show(bokeh.layouts.column(p_neuron, p_hat))