Bayesian Parameter Estimation (BPE) is fundamentally different from maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation. Whereas the latter two solve for a single optimal set of parameters $\hat{\boldsymbol \theta}$ for the model, BPE treats $\boldsymbol \theta$ as a random variable with a distribution $p(\boldsymbol \theta)$.
Setup
We are given a dataset $\mathcal D$, which contains $n$ i.i.d. feature vectors $\mathbf x_j$. Given a new feature vector $\mathbf x$, we want to assign it to some class $\omega$. One way to do this is via the Bayes decision rule (assuming equal class priors). That is, we choose class $\omega_j$ over class $\omega_i$ if
$$ p(\mathbf x | \mathcal D_j) > p(\mathbf x | \mathcal D_i) $$where $\mathcal D_j$ contains only the features belonging to class $\omega_j$, and likewise for $\mathcal D_i$. We can’t evaluate these densities directly without assuming further structure on the underlying distribution.
So, let’s assume that the distribution $p(\mathbf x | \mathcal D_j)$ is fully described by a model parameterized only by $\boldsymbol \theta$, a random variable. This distribution tells us how likely we are to observe $\mathbf x$ if it belongs to class $\omega_j$. From now on, I omit the subscript on $\mathcal D_j$ for brevity. Then we observe that
$$ \begin{align*} p(\mathbf x | \mathcal D) &= \int p(\mathbf x, \boldsymbol \theta | \mathcal D) d \boldsymbol \theta \\\\ &= \int p(\mathbf x | \boldsymbol \theta) p(\boldsymbol \theta | \mathcal D) d \boldsymbol \theta \end{align*} $$where the second equality uses the fact that $\mathbf x$ is independent of $\mathcal D$ given $\boldsymbol \theta$. This is much more manageable. We can compute $p(\mathbf x | \boldsymbol \theta)$ by plugging $\mathbf x$ into our assumed model. $p(\boldsymbol \theta | \mathcal D)$ can also be computed, since
$$ \begin{align*} p(\boldsymbol \theta | \mathcal D) &= \frac{p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta)}{\int p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta) d \boldsymbol \theta} \quad \text{(Bayes' Rule)} \\\\ &= \alpha \cdot p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta) \\\\ &= \alpha \cdot p(\boldsymbol \theta) \prod_{\mathbf x \in \mathcal D} p(\mathbf x | \boldsymbol \theta) \quad (\mathcal D \text{ i.i.d.}) \end{align*} $$where $\alpha$ is a normalizing constant that does not depend on $\boldsymbol \theta$. To summarize, we have devised a method that gives the likelihood of $\mathbf x$ averaged over all possible parameters $\boldsymbol \theta$, weighted by the posterior $p(\boldsymbol \theta | \mathcal D)$, which combines the prior on $\boldsymbol \theta$ with its likelihood given the class conditional data $\mathcal D$.
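To make these two equations concrete, here is a minimal numerical sketch (Python with NumPy; the 1D toy model, the grid, and all variable names are my own choices for illustration, not part of the method). It computes $p(\theta | \mathcal D)$ on a grid of $\theta$ values and then approximates the integral for $p(x | \mathcal D)$ with a Riemann sum.

```python
import numpy as np

# Toy 1D example: likelihood p(x | theta) = N(theta, 1), prior p(theta) = N(0, 2^2).
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)     # the i.i.d. dataset D

theta_grid = np.linspace(-5.0, 5.0, 2001)          # discretize theta for the integrals
d_theta = theta_grid[1] - theta_grid[0]

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# p(theta | D) = alpha * p(theta) * prod_k p(x_k | theta); alpha is found by normalizing numerically.
log_post = np.log(normal_pdf(theta_grid, 0.0, 4.0))                       # log prior
log_post += sum(np.log(normal_pdf(x_k, theta_grid, 1.0)) for x_k in data) # log likelihood
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum() * d_theta

# p(x | D) = integral of p(x | theta) p(theta | D) dtheta, approximated by a Riemann sum.
def predictive(x):
    return np.sum(normal_pdf(x, theta_grid, 1.0) * posterior) * d_theta

print(predictive(1.5), predictive(4.5))   # a new x near the data gets a higher density
```

The grid-based integral is only feasible here because $\theta$ is one-dimensional; the Gaussian case below avoids numerical integration entirely.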
Gaussian Case
When our model is a Gaussian with an unknown mean $\boldsymbol \mu$, which has prior distribution $p(\boldsymbol \mu)$, and a known covariance $\boldsymbol \Sigma$, BPE is straightforward to compute. In this case, our parameter set consists of just $\boldsymbol \mu$.
We assume the following (a short sampling sketch after this list makes the generative model concrete):

- $p(\mathbf x | \boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu, \boldsymbol \Sigma)$. That is, our model is valid for each class.
- $p(\boldsymbol \mu) \sim \mathcal N(\boldsymbol \mu_0, \boldsymbol \Sigma_0)$. Here, $\boldsymbol \mu_0, \boldsymbol \Sigma_0$ are our “best guess” for the shape of each class conditional distribution, before seeing the data.
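To see what these two assumptions mean generatively, here is a tiny sampling sketch (Python with NumPy; every number and name is a hypothetical placeholder): the class mean is drawn once from the prior, and the class conditional data are then drawn around it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior over the class mean: mu ~ N(mu_0, Sigma_0); features: x | mu ~ N(mu, Sigma).
mu_0 = np.zeros(2)
Sigma_0 = 4.0 * np.eye(2)   # broad prior: we are unsure where the class mean lies
Sigma = np.eye(2)           # known class conditional covariance

mu = rng.multivariate_normal(mu_0, Sigma_0)          # one draw of the class mean
D = rng.multivariate_normal(mu, Sigma, size=50)      # i.i.d. class conditional data
```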
Keeping in mind that our goal is to compute $p(\mathbf x | \mathcal D)$, we first need to find $p(\boldsymbol \mu | \mathcal D)$. From Bayes’ theorem:
$$ p(\mu | \mathcal{D}) = \frac{p(\mathcal{D} | \mu) p(\mu)}{p(\mathcal{D})} \propto p(\mathcal{D} | \mu) p(\mu) $$Plugging in the Gaussian densities:
$$ \begin{align*} p(\mu | \mathcal{D}) &\propto \left( \prod_{k=1}^n \exp\left( -\frac{1}{2} (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) \right) \right) \exp\left( -\frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\\\ &= \exp\left( -\frac{1}{2} \sum_{k=1}^n (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) - \frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\\\ &= \exp\left( -\frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) \right) \end{align*} $$where
$$ \begin{align*} \Sigma_n &= \left( n \Sigma^{-1} + \Sigma_0^{-1} \right)^{-1} \\\\ \mu_n &= \Sigma_n \left( \Sigma^{-1} \sum_{k=1}^n \mathbf x_k + \Sigma_0^{-1} \mu_0 \right) \end{align*} $$
Derivation
We notice that the exponent is quadratic in $\mu$. This means $p(\mu | \mathcal D)$ must also be a Gaussian! Let’s put it in standard form. We handle the first and second terms in the exponent separately. First term:
$$ \begin{align*} &\sum_{k=1}^n (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) \\\\ &= \sum_{k=1}^n \left[ \mathbf{x}_k^\top \Sigma^{-1} \mathbf{x}_k - 2 \mathbf{x}_k^\top \Sigma^{-1} \mu + \mu^\top \Sigma^{-1} \mu \right] \\\\ &= \text{const} - 2 \mu^\top \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + n \mu^\top \Sigma^{-1} \mu \end{align*} $$Second term:
$$ (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) = \mu^\top \Sigma_0^{-1} \mu - 2 \mu^\top \Sigma_0^{-1} \mu_0 + \text{const} $$Grouping them back together, the negative of the exponent is (up to an additive constant)
$$ \frac{1}{2} \left[ \mu^\top (n \Sigma^{-1} + \Sigma_0^{-1}) \mu - 2 \mu^\top \left( \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + \Sigma_0^{-1} \mu_0 \right) \right] + \text{const} $$which simplifies to
$$ \frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) + \text{const} $$where
$$ \begin{align*} \Sigma_n^{-1} &= n \Sigma^{-1} + \Sigma_0^{-1} \\\\ \Sigma_n^{-1} \mu_n &= \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + \Sigma_0^{-1} \mu_0 \\\\ \end{align*} $$which are found by equating the quadratic and linear terms in $\mu$.
Therefore, $p(\boldsymbol \mu | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n)$.
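Here is a short sketch of the resulting posterior update (Python with NumPy; the dimensions, stand-in data, and names are arbitrary). It computes $\boldsymbol \mu_n, \boldsymbol \Sigma_n$ from the formulas above and, as a sanity check on the completing-the-square step, confirms that the two quadratic forms in the exponent differ only by a $\mu$-independent constant.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 2, 10
X = rng.normal(size=(n, d))        # stand-in data x_1, ..., x_n
Sigma = np.eye(d)                  # known covariance
Sigma_0 = 4.0 * np.eye(d)          # prior covariance
mu_0 = np.zeros(d)                 # prior mean

Sigma_inv, Sigma_0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma_0)
Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma_0_inv)               # (n Sigma^-1 + Sigma_0^-1)^-1
mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + Sigma_0_inv @ mu_0)  # Sigma_n (Sigma^-1 sum x_k + Sigma_0^-1 mu_0)

def original(mu):
    """Sum of the two quadratic forms in the exponent (without the -1/2 factor)."""
    return (sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
            + (mu - mu_0) @ Sigma_0_inv @ (mu - mu_0))

def standard(mu):
    """Completed-square form (mu - mu_n)^T Sigma_n^-1 (mu - mu_n)."""
    return (mu - mu_n) @ np.linalg.inv(Sigma_n) @ (mu - mu_n)

# The two forms should differ only by the same mu-independent constant.
mu_a, mu_b = rng.normal(size=d), rng.normal(size=d)
print(original(mu_a) - standard(mu_a), original(mu_b) - standard(mu_b))
```

As $n$ grows, $\boldsymbol \mu_n$ in this sketch moves toward the sample mean of the stand-in data while $\boldsymbol \Sigma_n$ shrinks, which is the usual posterior concentration behavior.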
To complete the exercise, we need to find $p(\mathbf x | \mathcal D)$. Since $\mathbf x | \boldsymbol \mu \sim \mathcal N(\boldsymbol \mu, \boldsymbol \Sigma)$, we can write $\mathbf x = \boldsymbol \mu + \boldsymbol \epsilon$, where $\boldsymbol \epsilon \sim \mathcal N(\mathbf 0, \boldsymbol \Sigma)$ is independent of $\boldsymbol \mu$. Given $\mathcal D$, we have $\boldsymbol \mu \sim \mathcal N(\boldsymbol \mu_n, \boldsymbol \Sigma_n)$, and the sum of independent Gaussians is again Gaussian, so $\mathbf x | \mathcal D \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$.
So it turns out that, with this method, we don’t need to evaluate an integral at all!
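If you don’t trust the algebra, the integral we avoided can also be estimated by Monte Carlo and compared against the closed form. Below is a sketch (Python with NumPy and SciPy; $\boldsymbol \mu_n, \boldsymbol \Sigma_n$ are placeholders rather than values fitted from real data).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Placeholder posterior and known covariance (in practice these come from the update above).
mu_n = np.array([1.0, -0.5])
Sigma_n = 0.1 * np.eye(2)
Sigma = np.eye(2)
x = np.array([1.2, 0.0])

# Closed form: p(x | D) = N(x; mu_n, Sigma_n + Sigma)
closed_form = multivariate_normal(mu_n, Sigma_n + Sigma).pdf(x)

# Monte Carlo: draw mu ~ p(mu | D) and average p(x | mu) over the samples.
mus = rng.multivariate_normal(mu_n, Sigma_n, size=100_000)
diffs = x - mus                                              # (N, d) residuals
Sigma_inv = np.linalg.inv(Sigma)
quad = np.einsum('nd,de,ne->n', diffs, Sigma_inv, diffs)     # Mahalanobis terms
norm = np.sqrt((2.0 * np.pi) ** len(x) * np.linalg.det(Sigma))
monte_carlo = np.mean(np.exp(-0.5 * quad) / norm)

print(closed_form, monte_carlo)   # the two values should agree closely
```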
In Summary
- $p(\boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu_0, \boldsymbol \Sigma_0)$, where $\boldsymbol \mu_0, \boldsymbol \Sigma_0$ are “guessed”
- $p(\mathbf x | \boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu, \boldsymbol \Sigma)$, where $\boldsymbol \mu$ is the unknown class mean (a random variable) and $\boldsymbol \Sigma$ is the known class conditional covariance
- $p(\boldsymbol \mu | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n)$
- $p(\mathbf x | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$. This is the density used in the Bayes decision rule; a classifier sketch follows below
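Putting everything together, here is a sketch of the full pipeline under the assumptions above (Python with NumPy and SciPy; the class data, prior, and function names are hypothetical): fit the posterior for each class, form each predictive density $\mathcal N(\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$, and classify a new $\mathbf x$ with the decision rule from the Setup section, assuming equal class priors.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class(D, mu_0, Sigma_0, Sigma):
    """Return the predictive distribution p(x | D) = N(mu_n, Sigma_n + Sigma) for one class."""
    n = len(D)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_0_inv = np.linalg.inv(Sigma_0)
    Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma_0_inv)
    mu_n = Sigma_n @ (Sigma_inv @ D.sum(axis=0) + Sigma_0_inv @ mu_0)
    return multivariate_normal(mu_n, Sigma_n + Sigma)

def classify(x, predictives):
    """Decision rule with equal class priors: pick the class with the largest p(x | D_j)."""
    return int(np.argmax([p.pdf(x) for p in predictives]))

# Hypothetical two-class example with a shared known covariance and a common prior.
rng = np.random.default_rng(3)
Sigma = np.eye(2)
mu_0, Sigma_0 = np.zeros(2), 4.0 * np.eye(2)
D1 = rng.multivariate_normal([-2.0, 0.0], Sigma, size=40)
D2 = rng.multivariate_normal([+2.0, 0.0], Sigma, size=40)
predictives = [fit_class(D, mu_0, Sigma_0, Sigma) for D in (D1, D2)]

print(classify(np.array([-1.5, 0.2]), predictives))   # expected: 0 (class 1)
print(classify(np.array([+1.8, -0.3]), predictives))  # expected: 1 (class 2)
```

Note that the whole Bayesian machinery reduces to a couple of matrix inversions per class, since the Gaussian prior on $\boldsymbol \mu$ is conjugate to the Gaussian likelihood with known covariance.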