
Nathaniel Thomas

Maximum Likelihood Estimation

November 24, 2024

Goal

We are given a dataset $\mathcal{D}$, which contains feature vectors $\mathbf{x}_k$ and class labels $\omega_k$. Denote by $\mathcal{D}_i$ the set of features belonging to class $\omega_i$. We assume the following (a small synthetic sketch of this setup appears after the list):

  1. $p(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \Sigma_j)$. That is, given a class label, the distribution of features belonging to that class forms a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance $\Sigma_j$.
  2. The samples $\mathbf{x} \in \mathcal{D}_i$ are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.
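To make the setup concrete, here is a minimal sketch (not part of the derivation) that draws such a dataset with numpy. The two classes, their means, and their covariances are made-up values for illustration only.

```python
# Minimal sketch: a synthetic dataset matching the assumptions above,
# with hypothetical parameters for two classes in d = 2 dimensions.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (mu_i, Sigma_i) for each class omega_i.
params = {
    0: (np.array([0.0, 0.0]), np.eye(2)),
    1: (np.array([3.0, 1.0]), np.array([[2.0, 0.5],
                                         [0.5, 1.0]])),
}

# D_i: i.i.d. samples x ~ N(mu_i, Sigma_i) for each class.
D = {i: rng.multivariate_normal(mu, Sigma, size=100)
     for i, (mu, Sigma) in params.items()}

print(D[0].shape, D[1].shape)   # (100, 2) (100, 2)
```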

The problem that MLE seeks to solve is to find the parameters $\boldsymbol{\mu}_j, \Sigma_j$ under which the observed data are most probable. We denote

$$\theta = (\boldsymbol{\mu}, \Sigma),$$

which includes the means and covariances for every class. The likelihood of $\theta$ is

$$l(\theta) = p(\mathcal{D} \mid \theta),$$

and the MLE of $\theta$, written $\hat{\theta}$, is

$$\hat{\theta} = \underset{\theta}{\arg\max}\; l(\theta).$$

In practice, we work with the log-likelihood for simpler computation:

$$\log l(\theta) = \log p(\mathcal{D} \mid \theta),$$

since the logarithm is monotonic, so maximizing the log-likelihood is equivalent to maximizing the likelihood. In words, the likelihood tells us the probability of generating our dataset if each data point were drawn independently from the distribution defined by $\theta$. The $\hat{\theta}$ that maximizes this probability defines our best estimate of the distribution from which $\mathcal{D}$ was drawn.

We can attempt to find $\hat{\theta}$ by setting the gradient of the log-likelihood to zero and verifying that the solution is a maximum. However, this does not guarantee a global maximum.
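As a quick illustration of the definition, the sketch below finds $\hat{\theta}$ numerically for a one-dimensional Gaussian by minimizing the negative log-likelihood with scipy. The true parameters, dataset size, and parameterization are assumptions made for the example, not part of the derivation in this post.

```python
# Minimal sketch: theta_hat = argmax_theta log p(D | theta), found by
# numerically minimizing the negative log-likelihood of a 1-D Gaussian.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.5, scale=1.0, size=500)   # hypothetical dataset

def neg_log_likelihood(theta, x):
    mu, log_sigma = theta                  # sigma parameterized by its log (> 0)
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the true values (2.5, 1.0)
```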

Example: Unknown $\boldsymbol{\mu}$

Let’s assume that each element $\mathbf{x}_k$ in our dataset $\mathcal{D}$ is drawn from a multivariate Gaussian with known covariance $\Sigma$ but unknown mean $\boldsymbol{\mu}$. What is the MLE of $\boldsymbol{\mu}$?

$$\hat{\boldsymbol{\mu}} = \underset{\boldsymbol{\mu}}{\arg\max}\; p(\mathcal{D} \mid \boldsymbol{\mu}).$$

To find the MLE of $\boldsymbol{\mu}$, we maximize the likelihood function. For a multivariate Gaussian distribution:

$$p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu})\right),$$

where $d$ is the dimension of $\mathbf{x}_k$.
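As a sanity check on the formula, this sketch evaluates the density by hand and compares it against `scipy.stats.multivariate_normal`. The values of $\boldsymbol{\mu}$, $\Sigma$, and $\mathbf{x}_k$ are arbitrary choices for the example.

```python
# Minimal sketch: the multivariate Gaussian density above, computed manually
# and checked against scipy's reference implementation.
import numpy as np
from scipy.stats import multivariate_normal

d = 2
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x_k = np.array([0.5, 0.0])

diff = x_k - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
         ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
reference = multivariate_normal(mean=mu, cov=Sigma).pdf(x_k)

print(np.isclose(manual, reference))   # True
```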

Since we assumed that the samples are independent, the likelihood of the dataset $\mathcal{D}$ is the product of the likelihoods of each $\mathbf{x}_k$. This product becomes a sum in log-space:

$$\begin{aligned}
\log p(\mathcal{D} \mid \boldsymbol{\mu}) &= \sum_{k=1}^{n} \log p(\mathbf{x}_k \mid \boldsymbol{\mu}) \\
&= -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu}).
\end{aligned}$$
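This decomposition is easy to verify numerically. The sketch below compares the sum of per-sample log densities against the closed-form expression on synthetic data; the mean, covariance, and sample size are assumptions for the example.

```python
# Minimal sketch: dataset log-likelihood as a sum of per-sample log densities,
# compared against the closed-form expression above.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
X = rng.multivariate_normal(mu, Sigma, size=200)   # rows are x_k
n, d = X.shape

# Sum of log p(x_k | mu).
sum_of_logs = multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

# Closed form: -(nd/2) log(2 pi) - (n/2) log|Sigma| - (1/2) sum of quadratic forms.
diffs = X - mu
quad = np.einsum('ki,ij,kj->', diffs, np.linalg.inv(Sigma), diffs)
closed_form = (-0.5 * n * d * np.log(2 * np.pi)
               - 0.5 * n * np.log(np.linalg.det(Sigma))
               - 0.5 * quad)

print(np.isclose(sum_of_logs, closed_form))   # True
```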

Taking the gradient with respect to $\boldsymbol{\mu}$ and setting it to zero:

$$\nabla_{\boldsymbol{\mu}} \log p(\mathcal{D} \mid \hat{\boldsymbol{\mu}}) = \sum_{k=1}^{n} \Sigma^{-1}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = 0.$$
Derivation of gradient

Consider the quadratic form, where $\mathbf{x} \in \mathbb{R}^{d \times 1}$ and $\Sigma \in \mathbb{R}^{d \times d}$:

$$f(\mathbf{x}) = \mathbf{x}^\top \Sigma \mathbf{x} = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i \Sigma_{ij} x_j.$$

Computing the partial derivative with respect to $x_k$:

$$\frac{\partial f}{\partial x_k} = \sum_{j=1}^{d} \Sigma_{kj} x_j + \sum_{i=1}^{d} x_i \Sigma_{ik},$$

where the first term comes from $i = k$ and the second from $j = k$. We notice that

$$\frac{\partial f}{\partial x_k} = (\Sigma \mathbf{x})_k + (\Sigma^\top \mathbf{x})_k,$$

so

$$\nabla_{\mathbf{x}}\left(\mathbf{x}^\top \Sigma \mathbf{x}\right) = (\Sigma + \Sigma^\top)\mathbf{x}.$$
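This identity can be checked numerically with a central finite difference. The sketch below uses a deliberately non-symmetric matrix so that both terms of the gradient matter; the dimension and values are arbitrary.

```python
# Minimal sketch: verifying grad_x (x^T Sigma x) = (Sigma + Sigma^T) x
# against a central finite-difference approximation.
import numpy as np

rng = np.random.default_rng(3)
d = 4
Sigma = rng.normal(size=(d, d))     # not symmetric, to exercise both terms
x = rng.normal(size=d)

f = lambda v: v @ Sigma @ v
analytic = (Sigma + Sigma.T) @ x

eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(analytic, numeric))   # True
```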

In our case, we differentiate with respect to $\boldsymbol{\mu}$ rather than $\mathbf{x}_k - \boldsymbol{\mu}$ itself, which introduces a negative sign by the chain rule. Using this and the fact that $\Sigma^{-1}$ is symmetric (as the inverse of a covariance matrix), the result above gives:

$$\nabla_{\boldsymbol{\mu}}\left((\mathbf{x}_k - \hat{\boldsymbol{\mu}})^\top \Sigma^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})\right) = -2\Sigma^{-1}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}).$$

Multiplying the gradient equation by $\Sigma$ on both sides gives $\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = 0$, so

$$\sum_{k=1}^{n} \mathbf{x}_k = \sum_{k=1}^{n} \hat{\boldsymbol{\mu}} = n\hat{\boldsymbol{\mu}},$$

which implies

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k,$$

which is the sample mean! This is an intuitive result: under a Gaussian model with known covariance, the most likely value of the mean is simply the average of the observed samples.
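As a final check, the sketch below maximizes the log-likelihood over $\boldsymbol{\mu}$ numerically (with $\Sigma$ treated as known) and confirms that the optimizer lands on the sample mean. The data and parameters are made up for the example.

```python
# Minimal sketch: with Sigma known, numerically maximizing the likelihood over
# mu recovers the sample mean, matching the closed-form result derived above.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
true_mu = np.array([2.0, -1.0])
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
X = rng.multivariate_normal(true_mu, Sigma, size=300)

def neg_log_likelihood(mu):
    return -multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

mu_hat_opt = minimize(neg_log_likelihood, x0=np.zeros(2)).x
mu_hat_closed = X.mean(axis=0)   # the sample mean derived above

print(np.allclose(mu_hat_opt, mu_hat_closed, atol=1e-4))   # True
```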

