Homework 3.2: MAP estimates and zero-inflation for Drop-Seq controls (80 pts)
Data set download: File 1, File 2
This problem is inspired bythis paper by Valentine Svenssonand the related blog posts (viewablehereandhere).
Single-Cell RNA Seq (scRNA-Seq) is a technology that is coming into more and more common use. It enables researchers to get counts of mRNA transcripts for specific genes in single cells. In the technique, (usually) individual cells are encased in a droplet in a microfluidic device. Importantly, each droplet also has a cDNA “barcode” molecule that enables identification of which droplet reads of a given mRNA sequence come from.
An output from a scRNA-Seq experiment is a count matrix. In a count matrix, each row corresponds to a given droplet (indexed by its barcode sequence) and each column is a gene. The entries in the count matrix are integer counts of mRNA transcripts. The process by which the count matrix is generated from the reads is important and worth considering in a detailed analysis. Tools such as Kalisto enable computing the count matrix. We will not attempt to fully model the generative process of the count matrix.
As we saw with smFISH data last term, mRNA abundances in cells are often modeled using a Negative Binomial distribution, as it describes the distributions of transcripts under a model for bursty gene expression. The Negative Binomial distribution may also be useful for modeling mRNA transcript counts, even in the absence of cells, and we will explore that in this problem.
As controls for scRNA-Seq measurements, Zheng, et al. injected an ERCC (External RNA Controls Consortium) spike-in solution of RNA molecules into droplets. The idea here is that each droplet has RNA molecules that get sequenced, but there are no cells. Thus, any confounding problems with getting RNA transcript counts from cells is absent.
a) We would expect the counts for RNA transcripts for each gene to be Poisson distributed. Why?
b) It turns out that even in the control experiments, the RNA transcripts are not Poisson distributed. As Valentine Svensson points out in his paper and blog posts I linked to above, this is cause for some consternation among the research community using these techniques. There are various explanations for why this may be, but the cause is generally unknown. To capture the count distribution, then, we need a distribution that is more flexible than the Poisson. If you read about the Negative Binomial distribution in the Distribution Explorer, you will see that the Poisson distribution is a limit of the Negative Binomial. This is best understood in the \((\mu,\phi)\) formulation, as you can read in the Distribution Explorer.
The count matrix for this control experiment may be downloaded here: https://s3.amazonaws.com/bebi103.caltech.edu/data/zheng_gemcode_control.csv. (The data set was generated by Valentine Svensson—presumably from the raw data set deposited by Zheng and coworkers to the Short Read Archive under accession number SRP073767—available from FigShare, licensed under a CC-BY-4.0 license.) Additionally, information about the concentrations of the transcripts in the ERCC mix can be obtained from Thermo-Fisher. You can download these data here. For the data in the count matrix, the concentrations from ERCC mix 1 were diluted 50-fold.
For ease of discussion, although the sequences in the ERCC are not strictly genes, we will refer to them as “genes” here.
In this part of the problem, perform exploratory data analysis to determine which, if any, genes have obvious deviation from being Negative Binomially distributed. What do the the genes that deviate have in common? There is surely some explanation for why these genes have strange distributions having to do with how the experiment is performed, but we will not speculate on that here. Hint: Having most of the counts be zero is not obviously inconsistent with being Negative Binomially distributed.
c) Considering only the genes that do not have obvious deviation from Negative Binomial behavior, find the most probable parameter values for \(\mu\) and \(\phi\) for each gene. You will have to come up with reasonable priors for these parameters in this calculation.
You should do a quick graphical verification that a Negative Binomial likelihood is indeed a reasonable model. We will discuss more principled ways to check models in coming weeks; just do a quick-and-dirty look for now.
Also, compute credible intervals for each \(\mu\) and \(\phi\) that you calculate by assuming local Normality near the MAP. Do you notice any problems with the credible intervals? What might be the source of these problems?
d) The problem of zero inflation commonly occurs in Drop-Seq analysis. A zero-inflated distribution is typically defined as follows. Say count data follow some distribution \(F\), which may have nonzero probability mass at zero. Let \(p\) be the probability of obtaining a measurement of zero (in this case, not detecting any RNA transcript in a given droplet) due to some unspecified reason that is not included in the distribution \(F\). Then the counts \(y\) are distributed according to a zero-inflated version of \(F\) defined by a mixture model
\begin{align} y \sim p \,\delta_{0y} + (1 - p)F, \end{align}
where \(\delta_{0y}\) is one if \(y=0\) and zero otherwise.
In his paper, Valentine Svensson investigated whether the control scRNA-Seq data we are considering here are zero inflated relative to the Negative Binomial distribution. We will take a similar approach here. First, write down an expression for the probability of observing zero counts in terms of the parameters \(\mu\) and \(\phi\). For each set of MAP parameters you computed, compute the theoretical probability of obtaining zero counts. Then, plot these versus the fraction of zero counts for each gene. From this plot, do you think the data are zero-inflated?
e) As you discussed in part (a), we expect the counts to be Poisson distributed. It stands to reason that the deviation from being Poisson distributed is due to technical issues that are involved in quantifying all genes. (Of course, we saw that there may very well be other technical issues because the counts of some genes are obviously not Negative Binomially distributed.) With this in mind, we propose a new likelihood, wherein gene \(i\) has parameter \(\mu_i\), but all genes share a common parameter \(\phi\). That is the count of transcripts of gene \(i\) in droplet \(j\) is distributed as
\begin{align} y_{ij} \sim \text{NegBinom}(\mu_i, \phi)\,\forall i,j. \end{align}
Obtain MAP parameters for \(\{\mu_i\}\) and \(\phi\). Note that this is a trickier optimization problem because you now need to solve one big optimization problem. Report the MAP parameter values. Also report credible intervals if you can (the calculation of the Hessian will probably take several minutes) and comment on the results.
f) Using the MAP parameter \(\phi^*\) from part (e), make a smooth curve of the probability of observing a zero count versus \(\mu\). Then, overlay a plot of the fraction of observed zero counts versus the MAP estimate for \(\mu\). In light of this plot, do you think the data are zero-inflated?