Bayes’s theorem as a model for learning

Let’s say we did an experiment and got a data set \(y_1\) in an investigation of hypothesis \(\theta\). Then, our posterior distribution is

\[\begin{aligned} g(\theta\mid y_1) = \frac{f(y_1 \mid \theta)\, g(\theta )}{f(y_1)}. \end{aligned}\]
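
To make this concrete, here is a minimal sketch of evaluating such a posterior numerically on a grid. The specifics are assumptions for illustration, not part of the derivation: a Bernoulli likelihood with success probability \(\theta\), a uniform prior, and a made-up data set `y1`.

```python
import numpy as np
import scipy.stats as st

# Hypothetical data set y_1: outcomes of independent Bernoulli trials
# governed by parameter theta (the probability of success)
y1 = np.array([1, 0, 1, 1, 0, 1])

# Grid of theta values on which to evaluate the distributions
theta = np.linspace(0, 1, 1000)

# Prior g(theta): uniform on [0, 1]
prior = np.ones_like(theta)

# Likelihood f(y1 | theta), evaluated at each grid point
likelihood = np.array([st.bernoulli.pmf(y1, p).prod() for p in theta])

# Evidence f(y1): numerical integral of likelihood * prior over theta
evidence = np.sum(likelihood * prior) * (theta[1] - theta[0])

# Posterior g(theta | y1) by Bayes's theorem
posterior = likelihood * prior / evidence
```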

Now, let’s say we did another experiment and got data \(y_2\). We already know \(y_1\) ahead of this experiment, so our prior is \(g(\theta\mid y_1)\), which is the posterior from the first experiment. So, we have

\[\begin{aligned} g(\theta\mid y_1, y_2) = \frac{f(y_2 \mid y_1, \theta)\, g(\theta \mid y_1)}{f(y_2 \mid y_1)}. \end{aligned}\]

Now, we plug in Bayes’s theorem applied to our first data set, giving

\[\begin{aligned} g(\theta\mid y_1, y_2) = \frac{f(y_2 \mid y_1, \theta)\,f(y_1 \mid \theta)\, g(\theta )}{f(y_2 \mid y_1)\, f(y_1 )}. \end{aligned}\]

By the product rule, the denominator is \(f(y_1, y_2 )\). Also by the product rule,

\[\begin{aligned} f(y_2 \mid y_1, \theta)\,f(y_1 \mid \theta) = f(y_1, y_2 \mid \theta). \end{aligned}\]

Inserting these expressions into the above expression for \(g(\theta\mid y_1, y_2)\) yields

\[\begin{aligned} g(\theta\mid y_1, y_2) = \frac{f(y_1, y_2 \mid \theta)\,g(\theta)}{f(y_1, y_2)}. \end{aligned}\]

So, acquiring more data gave us more information about our hypothesis in the same way as if we had simply combined \(y_1\) and \(y_2\) into a single data set. Acquisition of more and more data thus serves to help us learn more and more about our hypothesis or parameter value.
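
We can check this equivalence numerically. Below is a minimal sketch in the spirit of the grid calculation above, again with made-up Bernoulli data sets and a hypothetical `update()` helper, and assuming (as for independent Bernoulli trials) that \(f(y_2 \mid y_1, \theta) = f(y_2 \mid \theta)\).

```python
import numpy as np
import scipy.stats as st

# Hypothetical Bernoulli data sets from two successive experiments
y1 = np.array([1, 0, 1, 1, 0, 1])
y2 = np.array([0, 1, 1, 1])

# Grid of theta values and grid spacing for normalization
theta = np.linspace(0, 1, 1000)
dtheta = theta[1] - theta[0]

def update(prior, y):
    """Return the normalized posterior on the theta grid given data y."""
    likelihood = np.array([st.bernoulli.pmf(y, p).prod() for p in theta])
    posterior = likelihood * prior
    return posterior / (np.sum(posterior) * dtheta)

# Start from a uniform prior g(theta)
prior = np.ones_like(theta)

# Sequential updating: the posterior from y1 serves as the prior for y2
post_seq = update(update(prior, y1), y2)

# Batch updating: treat y1 and y2 as a single data set
post_batch = update(prior, np.concatenate((y1, y2)))

# The two approaches give the same posterior
print(np.allclose(post_seq, post_batch))  # True
```

Within floating point error, the sequentially updated posterior and the posterior computed from the combined data set are identical, as the derivation above says they must be.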

Bayes’s theorem thus describes how we learn from data. We acquire data, and that updates our posterior distribution. That posterior distribution then becomes the prior distribution for interpreting the next data set we acquire, and so on. Data constantly update our knowledge.