Bayesian Parameter Estimation (BPE) is fundamentally different from maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation. Whereas the latter two solve for a single optimal set of parameters $\hat{\boldsymbol \theta}$ for the model, BPE treats $\boldsymbol \theta$ as a random variable with a distribution $p(\boldsymbol \theta)$.
Setup
We are given a dataset $\mathcal D$, which contains $n$ i.i.d. feature vectors $\mathbf x_j$. Given a new feature vector $\mathbf x$, we want to assign it to some class $\omega$. One way to do this is via the Bayes decision rule (assuming equal class priors). That is, we choose class $\omega_j$ over class $\omega_i$ if
$$ p(\mathbf x | \mathcal D_j) > p(\mathbf x | \mathcal D_i) $$where $\mathcal D_j$ contains only the features belonging to class $\omega_j$, and likewise for $\mathcal D_i$. We can’t evaluate these densities directly without assuming further structure on the underlying distribution.
So, let’s assume that the distribution $p(\mathbf x | \mathcal D_j)$ is fully described by a model parameterized only by $\boldsymbol \theta$, a random variable. This distribution tells us how likely we are to observe $\mathbf x$ if it belongs to class $\omega_j$. From now on, I omit the subscript on $\mathcal D_j$ for brevity. Then we observe that
$$ \begin{align*} p(\mathbf x | \mathcal D) &= \int p(\mathbf x, \boldsymbol \theta | \mathcal D) d \boldsymbol \theta \\\\ &= \int p(\mathbf x | \boldsymbol \theta) p(\boldsymbol \theta | \mathcal D) d \boldsymbol \theta \end{align*} $$where the second equality uses the fact that $\mathbf x$ is independent of $\mathcal D$ given $\boldsymbol \theta$. This is much more manageable. We can compute $p(\mathbf x | \boldsymbol \theta)$ by plugging $\mathbf x$ into our assumed model. $p(\boldsymbol \theta | \mathcal D)$ can also be computed, since
$$ \begin{align*} p(\boldsymbol \theta | \mathcal D) &= \frac{p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta)}{\int p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta) d \boldsymbol \theta} \quad \text{(Bayes' Rule)} \\\\ &= \alpha \cdot p(\mathcal D | \boldsymbol \theta) p(\boldsymbol \theta) \\\\ &= \alpha \cdot p(\boldsymbol \theta) \prod_{\mathbf x \in \mathcal D} p(\mathbf x | \boldsymbol \theta) \quad (\mathcal D \text{ i.i.d.}) \end{align*} $$where $\alpha$ is a normalizing constant that does not depend on $\boldsymbol \theta$. To summarize, we have devised a method that gives the likelihood of $\mathbf x$ averaged over all possible parameters $\boldsymbol \theta$, weighted by the posterior $p(\boldsymbol \theta | \mathcal D)$, which combines the prior on $\boldsymbol \theta$ with its likelihood given the class conditional data $\mathcal D$.
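To make these two equations concrete, here is a minimal numerical sketch (Python with NumPy; the 1D toy model, the grid, and all variable names are my own choices for illustration, not part of the method). It computes $p(\theta | \mathcal D)$ on a grid of $\theta$ values and then approximates the integral for $p(x | \mathcal D)$ with a Riemann sum.

```python
import numpy as np

# Toy 1D example: likelihood p(x | theta) = N(theta, 1), prior p(theta) = N(0, 2^2).
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)     # the i.i.d. dataset D

theta_grid = np.linspace(-5.0, 5.0, 2001)          # discretize theta for the integrals
d_theta = theta_grid[1] - theta_grid[0]

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# p(theta | D) = alpha * p(theta) * prod_k p(x_k | theta); alpha is found by normalizing numerically.
log_post = np.log(normal_pdf(theta_grid, 0.0, 4.0))                       # log prior
log_post += sum(np.log(normal_pdf(x_k, theta_grid, 1.0)) for x_k in data) # log likelihood
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum() * d_theta

# p(x | D) = integral of p(x | theta) p(theta | D) dtheta, approximated by a Riemann sum.
def predictive(x):
    return np.sum(normal_pdf(x, theta_grid, 1.0) * posterior) * d_theta

print(predictive(1.5), predictive(4.5))   # a new x near the data gets a higher density
```

The grid-based integral is only feasible here because $\theta$ is one-dimensional; the Gaussian case below avoids numerical integration entirely.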
Gaussian Case
When our model is a Gaussian with an unknown mean $\boldsymbol \mu$, which has prior distribution $p(\boldsymbol \mu)$, and a known covariance $\boldsymbol \Sigma$, BPE is straightforward to compute. In this case, our parameter set consists of just $\boldsymbol \mu$.
We assume the following (a short sampling sketch after this list makes the generative model concrete):

- $p(\mathbf x | \boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu, \boldsymbol \Sigma)$. That is, our model is valid for each class.
- $p(\boldsymbol \mu) \sim \mathcal N(\boldsymbol \mu_0, \boldsymbol \Sigma_0)$. Here, $\boldsymbol \mu_0, \boldsymbol \Sigma_0$ are our “best guess” for the shape of each class conditional distribution, before seeing the data.
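To see what these two assumptions mean generatively, here is a tiny sampling sketch (Python with NumPy; every number and name is a hypothetical placeholder): the class mean is drawn once from the prior, and the class conditional data are then drawn around it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior over the class mean: mu ~ N(mu_0, Sigma_0); features: x | mu ~ N(mu, Sigma).
mu_0 = np.zeros(2)
Sigma_0 = 4.0 * np.eye(2)   # broad prior: we are unsure where the class mean lies
Sigma = np.eye(2)           # known class conditional covariance

mu = rng.multivariate_normal(mu_0, Sigma_0)          # one draw of the class mean
D = rng.multivariate_normal(mu, Sigma, size=50)      # i.i.d. class conditional data
```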
Keeping in mind that our goal is to compute $p(\mathbf x | \mathcal D)$, we first need to find $p(\boldsymbol \mu | \mathcal D)$. From Bayes’ theorem:
$$ p(\mu | \mathcal{D}) = \frac{p(\mathcal{D} | \mu) p(\mu)}{p(\mathcal{D})} \propto p(\mathcal{D} | \mu) p(\mu) $$Plugging in the Gaussian densities:
$$ \begin{align*} p(\mu | \mathcal{D}) &\propto \left( \prod_{k=1}^n \exp\left( -\frac{1}{2} (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) \right) \right) \exp\left( -\frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\\\ &= \exp\left( -\frac{1}{2} \sum_{k=1}^n (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) - \frac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\\\ &= \exp\left( -\frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) \right) \end{align*} $$where
$$ \begin{align*} \Sigma_n &= \left( n \Sigma^{-1} + \Sigma_0^{-1} \right)^{-1} \\\\ \mu_n &= \Sigma_n \left( \Sigma^{-1} \sum_{k=1}^n \mathbf x_k + \Sigma_0^{-1} \mu_0 \right) \end{align*} $$
Derivation
We notice that the exponent is quadratic in $\mu$. This means $p(\mu | \mathcal D)$ must also be a Gaussian! Let’s put it in standard form. We handle the first and second terms in the exponent separately. First term:
$$ \begin{align*} &\sum_{k=1}^n (\mathbf{x}_k - \mu)^\top \Sigma^{-1} (\mathbf{x}_k - \mu) \\\\ &= \sum_{k=1}^n \left[ \mathbf{x}_k^\top \Sigma^{-1} \mathbf{x}_k - 2 \mathbf{x}_k^\top \Sigma^{-1} \mu + \mu^\top \Sigma^{-1} \mu \right] \\\\ &= \text{const} - 2 \mu^\top \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + n \mu^\top \Sigma^{-1} \mu \end{align*} $$Second term:
$$ (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) = \mu^\top \Sigma_0^{-1} \mu - 2 \mu^\top \Sigma_0^{-1} \mu_0 + \text{const} $$Grouping them back together, the negative of the exponent is (up to an additive constant)
$$ \frac{1}{2} \left[ \mu^\top (n \Sigma^{-1} + \Sigma_0^{-1}) \mu - 2 \mu^\top \left( \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + \Sigma_0^{-1} \mu_0 \right) \right] + \text{const} $$which simplifies to
$$ \frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) + \text{const} $$where
$$ \begin{align*} \Sigma_n^{-1} &= n \Sigma^{-1} + \Sigma_0^{-1} \\\\ \Sigma_n^{-1} \mu_n &= \Sigma^{-1} \sum_{k=1}^n \mathbf{x}_k + \Sigma_0^{-1} \mu_0 \\\\ \end{align*} $$which are found by equating the quadratic and linear terms in $\mu$.
Therefore, $p(\boldsymbol \mu | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n)$.
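Here is a short sketch of the resulting posterior update (Python with NumPy; the dimensions, stand-in data, and names are arbitrary). It computes $\boldsymbol \mu_n, \boldsymbol \Sigma_n$ from the formulas above and, as a sanity check on the completing-the-square step, confirms that the two quadratic forms in the exponent differ only by a $\mu$-independent constant.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 2, 10
X = rng.normal(size=(n, d))        # stand-in data x_1, ..., x_n
Sigma = np.eye(d)                  # known covariance
Sigma_0 = 4.0 * np.eye(d)          # prior covariance
mu_0 = np.zeros(d)                 # prior mean

Sigma_inv, Sigma_0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma_0)
Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma_0_inv)               # (n Sigma^-1 + Sigma_0^-1)^-1
mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + Sigma_0_inv @ mu_0)  # Sigma_n (Sigma^-1 sum x_k + Sigma_0^-1 mu_0)

def original(mu):
    """Sum of the two quadratic forms in the exponent (without the -1/2 factor)."""
    return (sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
            + (mu - mu_0) @ Sigma_0_inv @ (mu - mu_0))

def standard(mu):
    """Completed-square form (mu - mu_n)^T Sigma_n^-1 (mu - mu_n)."""
    return (mu - mu_n) @ np.linalg.inv(Sigma_n) @ (mu - mu_n)

# The two forms should differ only by the same mu-independent constant.
mu_a, mu_b = rng.normal(size=d), rng.normal(size=d)
print(original(mu_a) - standard(mu_a), original(mu_b) - standard(mu_b))
```

As $n$ grows, $\boldsymbol \mu_n$ in this sketch moves toward the sample mean of the stand-in data while $\boldsymbol \Sigma_n$ shrinks, which is the usual posterior concentration behavior.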
To complete the exercise, we need to find $p(\mathbf x | \mathcal D)$. Since $\mathbf x | \boldsymbol \mu \sim \mathcal N(\boldsymbol \mu, \boldsymbol \Sigma)$, we can write $\mathbf x = \boldsymbol \mu + \boldsymbol \epsilon$, where $\boldsymbol \epsilon \sim \mathcal N(\mathbf 0, \boldsymbol \Sigma)$ is independent of $\boldsymbol \mu$. Given $\mathcal D$, we have $\boldsymbol \mu \sim \mathcal N(\boldsymbol \mu_n, \boldsymbol \Sigma_n)$, and the sum of independent Gaussians is again Gaussian, so $\mathbf x | \mathcal D \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$.
So it turns out that, with this method, we don’t need to evaluate an integral at all!
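If you don’t trust the algebra, the integral we avoided can also be estimated by Monte Carlo and compared against the closed form. Below is a sketch (Python with NumPy and SciPy; $\boldsymbol \mu_n, \boldsymbol \Sigma_n$ are placeholders rather than values fitted from real data).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Placeholder posterior and known covariance (in practice these come from the update above).
mu_n = np.array([1.0, -0.5])
Sigma_n = 0.1 * np.eye(2)
Sigma = np.eye(2)
x = np.array([1.2, 0.0])

# Closed form: p(x | D) = N(x; mu_n, Sigma_n + Sigma)
closed_form = multivariate_normal(mu_n, Sigma_n + Sigma).pdf(x)

# Monte Carlo: draw mu ~ p(mu | D) and average p(x | mu) over the samples.
mus = rng.multivariate_normal(mu_n, Sigma_n, size=100_000)
diffs = x - mus                                              # (N, d) residuals
Sigma_inv = np.linalg.inv(Sigma)
quad = np.einsum('nd,de,ne->n', diffs, Sigma_inv, diffs)     # Mahalanobis terms
norm = np.sqrt((2.0 * np.pi) ** len(x) * np.linalg.det(Sigma))
monte_carlo = np.mean(np.exp(-0.5 * quad) / norm)

print(closed_form, monte_carlo)   # the two values should agree closely
```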
In Summary
- $p(\boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu_0, \boldsymbol \Sigma_0)$, where $\boldsymbol \mu_0, \boldsymbol \Sigma_0$ are “guessed”
- $p(\mathbf x | \boldsymbol \mu) \sim \mathcal N (\boldsymbol \mu, \boldsymbol \Sigma)$, where $\boldsymbol \mu$ is the unknown class mean (a random variable) and $\boldsymbol \Sigma$ is the known class conditional covariance
- $p(\boldsymbol \mu | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n)$
- $p(\mathbf x | \mathcal D) \sim \mathcal N (\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$. This is the density used in the Bayes decision rule; a classifier sketch follows below
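Putting everything together, here is a sketch of the full pipeline under the assumptions above (Python with NumPy and SciPy; the class data, prior, and function names are hypothetical): fit the posterior for each class, form each predictive density $\mathcal N(\boldsymbol \mu_n, \boldsymbol \Sigma_n + \boldsymbol \Sigma)$, and classify a new $\mathbf x$ with the decision rule from the Setup section, assuming equal class priors.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class(D, mu_0, Sigma_0, Sigma):
    """Return the predictive distribution p(x | D) = N(mu_n, Sigma_n + Sigma) for one class."""
    n = len(D)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_0_inv = np.linalg.inv(Sigma_0)
    Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma_0_inv)
    mu_n = Sigma_n @ (Sigma_inv @ D.sum(axis=0) + Sigma_0_inv @ mu_0)
    return multivariate_normal(mu_n, Sigma_n + Sigma)

def classify(x, predictives):
    """Decision rule with equal class priors: pick the class with the largest p(x | D_j)."""
    return int(np.argmax([p.pdf(x) for p in predictives]))

# Hypothetical two-class example with a shared known covariance and a common prior.
rng = np.random.default_rng(3)
Sigma = np.eye(2)
mu_0, Sigma_0 = np.zeros(2), 4.0 * np.eye(2)
D1 = rng.multivariate_normal([-2.0, 0.0], Sigma, size=40)
D2 = rng.multivariate_normal([+2.0, 0.0], Sigma, size=40)
predictives = [fit_class(D, mu_0, Sigma_0, Sigma) for D in (D1, D2)]

print(classify(np.array([-1.5, 0.2]), predictives))   # expected: 0 (class 1)
print(classify(np.array([+1.8, -0.3]), predictives))  # expected: 1 (class 2)
```

Note that the whole Bayesian machinery reduces to a couple of matrix inversions per class, since the Gaussian prior on $\boldsymbol \mu$ is conjugate to the Gaussian likelihood with known covariance.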