
Nathaniel Thomas

Bayesian Parameter Estimation

November 25, 2024

Bayesian Parameter Estimation (BPE) is fundamentally different from MLE or MAP. Whereas the latter two solve for an optimal set of parameters $\hat{\theta}$ for the model, BPE treats $\theta$ as a random variable with a distribution $p(\theta)$.

Setup

We are given a dataset $D$, which contains $n$ i.i.d. feature vectors $x_j$. Given a new feature vector $x$, we want to assign it to some class $\omega$. One way to do this is with Bayes’ decision rule. That is, we choose class $\omega_j$ over class $\omega_i$ if

$$
p(x \mid D_j) > p(x \mid D_i)
$$

where $D_j$ contains only the features belonging to class $\omega_j$, and vice versa. We can’t solve this directly without assuming further structure on the underlying distribution.

So, let’s assume that the distribution $p(x \mid D_j)$ is fully described by a model parameterized only by $\theta$, a random variable. This distribution tells us how likely we are to find $x$ if it were in class $\omega_j$. From now on, I omit the subscript on $D_j$ for brevity. Then we observe that

$$
p(x \mid D) = \int p(x, \theta \mid D) \, d\theta = \int p(x \mid \theta) \, p(\theta \mid D) \, d\theta
$$

This is much more manageable. We can compute $p(x \mid \theta)$ by plugging $x$ into our assumed model. $p(\theta \mid D)$ can also be computed, since

$$
\begin{aligned}
p(\theta \mid D) &= \frac{p(D \mid \theta) \, p(\theta)}{\int p(D \mid \theta) \, p(\theta) \, d\theta} && \text{(Bayes’ rule)} \\
&= \alpha \cdot p(D \mid \theta) \, p(\theta) \\
&= \alpha \cdot p(\theta) \prod_{x \in D} p(x \mid \theta) && \text{($D$ i.i.d.)}
\end{aligned}
$$

To summarize, we have devised a method that gives us a likelihood for $x$, averaged over all possible parameters $\theta$, weighted by the prior and likelihood of $\theta$ given the class-conditional data $D$.
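To make this recipe concrete, here is a minimal numerical sketch for a one-dimensional model, approximating the integral over $\theta$ on a grid. The data, the $\mathcal{N}(0, 4)$ prior, the unit-variance likelihood, and the grid bounds are all made-up choices for illustration, not anything prescribed above.

```python
import numpy as np

# Hypothetical 1-D example: likelihood p(x | theta) = N(theta, 1), prior p(theta) = N(0, 4).
D = np.array([1.8, 2.3, 1.9, 2.6])   # made-up class-conditional data
thetas = np.linspace(-10, 10, 2001)  # grid over the parameter theta
dtheta = thetas[1] - thetas[0]

def gauss(x, mean, var):
    """Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# p(theta | D) = alpha * p(theta) * prod_k p(x_k | theta), normalized on the grid
prior = gauss(thetas, 0.0, 4.0)
likelihood = np.prod(gauss(D[:, None], thetas[None, :], 1.0), axis=0)
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta

def predictive(x):
    """p(x | D) = integral of p(x | theta) p(theta | D) dtheta, via the grid sum."""
    return np.sum(gauss(x, thetas, 1.0) * posterior) * dtheta

print(predictive(2.0))  # how likely a new point x = 2.0 is under this class
```

Comparing `predictive(x)` across the per-class datasets $D_j$ is exactly the decision rule from the start of this section. The Gaussian case below replaces the grid with a closed form.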

Gaussian Case

In the case that our model is a Gaussian with an unknown mean $\mu$, which has distribution $p(\mu)$, and a known covariance $\Sigma$, BPE is quite easy to compute. In this case, our parameter set consists of just $\mu$.

We assume the following:

  1. $p(x \mid \mu) \sim \mathcal{N}(\mu, \Sigma)$. That is, our model is valid for each class.

  2. $p(\mu) \sim \mathcal{N}(\mu_0, \Sigma_0)$. Here, $\mu_0, \Sigma_0$ are our “best guess” for the shape of each class-conditional distribution, before seeing the data.

Keeping in mind that our goal is to compute $p(x \mid D)$, we first need to find $p(\mu \mid D)$. From Bayes’ theorem:

$$
p(\mu \mid D) = \frac{p(D \mid \mu) \, p(\mu)}{p(D)} \propto p(D \mid \mu) \, p(\mu)
$$

Plugging in the Gaussian formulas:

$$
\begin{aligned}
p(\mu \mid D) &\propto \left( \prod_{k=1}^{n} \exp\!\left( -\tfrac{1}{2} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu) \right) \right) \exp\!\left( -\tfrac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\
&= \exp\!\left( -\tfrac{1}{2} \sum_{k=1}^{n} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu) - \tfrac{1}{2} (\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) \right) \\
&\propto \exp\!\left( -\tfrac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) \right)
\end{aligned}
$$

where

$$
\begin{aligned}
\Sigma_n &= \left( n \Sigma^{-1} + \Sigma_0^{-1} \right)^{-1} \\
\mu_n &= \Sigma_n \left( \Sigma^{-1} \sum_{k=1}^{n} x_k + \Sigma_0^{-1} \mu_0 \right)
\end{aligned}
$$
Derivation

We notice that the exponent is quadratic in $\mu$. This means $p(\mu \mid D)$ must also be a Gaussian! Let’s put it in standard form. We handle the first and second terms in the exponent separately. First term:

$$
\begin{aligned}
\sum_{k=1}^{n} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu)
&= \sum_{k=1}^{n} \left[ x_k^\top \Sigma^{-1} x_k - 2 x_k^\top \Sigma^{-1} \mu + \mu^\top \Sigma^{-1} \mu \right] \\
&= \text{const} - 2 \mu^\top \Sigma^{-1} \sum_{k=1}^{n} x_k + n \mu^\top \Sigma^{-1} \mu
\end{aligned}
$$

Second term:

$$
(\mu - \mu_0)^\top \Sigma_0^{-1} (\mu - \mu_0) = \mu^\top \Sigma_0^{-1} \mu - 2 \mu^\top \Sigma_0^{-1} \mu_0 + \text{const}
$$

Grouping them back together:

$$
\frac{1}{2} \left[ \mu^\top \left( n \Sigma^{-1} + \Sigma_0^{-1} \right) \mu - 2 \mu^\top \left( \Sigma^{-1} \sum_{k=1}^{n} x_k + \Sigma_0^{-1} \mu_0 \right) \right] + \text{const}
$$

which simplifies to

$$
\frac{1}{2} (\mu - \mu_n)^\top \Sigma_n^{-1} (\mu - \mu_n) + \text{const}
$$

where

$$
\begin{aligned}
\Sigma_n^{-1} &= n \Sigma^{-1} + \Sigma_0^{-1} \\
\Sigma_n^{-1} \mu_n &= \Sigma^{-1} \sum_{k=1}^{n} x_k + \Sigma_0^{-1} \mu_0
\end{aligned}
$$

which can be found by equating like terms.

Therefore, $p(\mu \mid D) \sim \mathcal{N}(\mu_n, \Sigma_n)$.
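The update for $\mu_n$ and $\Sigma_n$ is easy to compute directly. A minimal numpy sketch, where the data `X`, the prior $(\mu_0, \Sigma_0)$, and the known covariance $\Sigma$ are made-up inputs chosen just for illustration:

```python
import numpy as np

# Known likelihood covariance Sigma, and the "guessed" prior N(mu0, Sigma0)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
mu0 = np.zeros(2)
Sigma0 = 10.0 * np.eye(2)

# Made-up class-conditional data D: one row per sample x_k
X = np.random.default_rng(0).normal(loc=[2.0, -1.0], scale=1.0, size=(50, 2))
n = X.shape[0]

Sigma_inv = np.linalg.inv(Sigma)
Sigma0_inv = np.linalg.inv(Sigma0)

# Sigma_n = (n Sigma^{-1} + Sigma_0^{-1})^{-1}
Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma0_inv)

# mu_n = Sigma_n (Sigma^{-1} sum_k x_k + Sigma_0^{-1} mu_0)
mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + Sigma0_inv @ mu0)

print(mu_n)     # sits between the sample mean and mu0, pulled toward the data as n grows
print(Sigma_n)  # shrinks roughly like Sigma / n for large n
```

With a broad prior ($\Sigma_0$ large), $\mu_n$ lands close to the sample mean; with a tight prior it stays near $\mu_0$.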

To complete the exercise, we need to find $p(x \mid D)$. Since $x \mid \mu \sim \mathcal{N}(\mu, \Sigma)$, we can express $x = \mu + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma)$ is independent of $\mu$. Because a sum of independent Gaussians is Gaussian, with means and covariances adding, $x \mid D \sim \mathcal{N}(\mu_n, \Sigma_n + \Sigma)$.

So it turns out with this method that we don’t need to evaluate an integral at all!
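Putting the pieces together, classification just means evaluating the density $\mathcal{N}(\mu_n, \Sigma_n + \Sigma)$ for each class and picking the larger value. A sketch under the same assumptions, with illustrative names (`posterior_params`, `classify`) and made-up two-class data of my own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_params(X, mu0, Sigma0, Sigma):
    """Return (mu_n, Sigma_n) for one class, given its data X (rows are the x_k)."""
    n = X.shape[0]
    Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(n * Sigma_inv + Sigma0_inv)
    mu_n = Sigma_n @ (Sigma_inv @ X.sum(axis=0) + Sigma0_inv @ mu0)
    return mu_n, Sigma_n

def classify(x, class_data, mu0, Sigma0, Sigma):
    """Pick the class j maximizing p(x | D_j) = N(x; mu_n, Sigma_n + Sigma)."""
    scores = []
    for X in class_data:
        mu_n, Sigma_n = posterior_params(X, mu0, Sigma0, Sigma)
        scores.append(multivariate_normal.pdf(x, mean=mu_n, cov=Sigma_n + Sigma))
    return int(np.argmax(scores))

# Made-up two-class example with a shared known covariance and a broad prior
rng = np.random.default_rng(1)
D0 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
D1 = rng.normal([3.0, 3.0], 1.0, size=(40, 2))
Sigma, mu0, Sigma0 = np.eye(2), np.zeros(2), 10.0 * np.eye(2)

print(classify(np.array([2.5, 2.8]), [D0, D1], mu0, Sigma0, Sigma))  # expect class 1
```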

In Summary

  • $p(\mu) \sim \mathcal{N}(\mu_0, \Sigma_0)$, where $\mu_0, \Sigma_0$ are “guessed”
  • $p(x \mid \mu) \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the unknown class-conditional mean and $\Sigma$ is the known covariance
  • $p(\mu \mid D) \sim \mathcal{N}(\mu_n, \Sigma_n)$, with $\mu_n, \Sigma_n$ computed from $D$ as above
  • $p(x \mid D) \sim \mathcal{N}(\mu_n, \Sigma_n + \Sigma)$. This density is used in the Bayes decision rule

