Goal
We are given a dataset $\mathcal{D}$, which contains feature vectors $\mathbf{x}_k$ and class labels $\omega_k$. Denote $\mathcal{D}_i$ as the set of features of class $\omega_i$. We assume the following:
- That $p(\mathbf{x} \mid \omega_j) \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. That is, given a class label, the distribution of features belonging to that class forms a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance $\boldsymbol{\Sigma}_j$.
- The samples $\mathbf{x} \in \mathcal{D}_i$ are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.
The goal of maximum likelihood estimation (MLE) is to find the parameters $\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j$ under which the observed data are most likely. We denote
$$ \boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}), $$which collects the means and covariances of all classes. Because the samples in $\mathcal{D}_i$ carry no information about the parameters of any other class, each class can be estimated separately, so we treat one class at a time and drop the subscript $j$ below. The likelihood of $\boldsymbol{\theta}$ is
$$ l(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}), $$and the MLE of $\boldsymbol{\theta}$, $\hat{\boldsymbol{\theta}}$, is
$$ \hat{\boldsymbol{\theta}} = \arg \max_{\boldsymbol{\theta}} l(\boldsymbol{\theta}). $$In practice, we use the log-likelihood for simpler computation:
$$ \log l(\boldsymbol{\theta}) = \log p(\mathcal{D} \mid \boldsymbol{\theta}), $$since the logarithm is monotonically increasing, so maximizing the log-likelihood is equivalent to maximizing the likelihood. In words, the likelihood is the probability (density) of generating our dataset if each datapoint were drawn independently from the distribution defined by $\boldsymbol \theta$. The $\hat{\boldsymbol{\theta}}$ that maximizes this quantity is our best estimate of the distribution from which $\mathcal{D}$ was drawn.
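As a concrete illustration (not part of the original derivation), the sketch below evaluates both quantities on a small synthetic dataset: the raw likelihood, a product of many small densities, typically underflows to zero, while the log-likelihood stays a usable number. The dataset, the candidate parameters, and all variable names are made up for this sketch.

```python
# Sketch: likelihood vs. log-likelihood of a toy dataset under a fixed candidate theta.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # toy dataset D

# A candidate theta = (mu, Sigma); any fixed values would do for this illustration.
mu, Sigma = np.zeros(2), np.eye(2)
densities = multivariate_normal(mu, Sigma).pdf(D)            # p(x_k | theta) for each sample

likelihood = np.prod(densities)              # product over i.i.d. samples; underflows here
log_likelihood = np.sum(np.log(densities))   # the same information, numerically stable

print(likelihood, log_likelihood)
```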
We can attempt to find $\hat{\boldsymbol{\theta}}$ by setting the gradient of the log-likelihood to $\mathbf{0}$ and verifying that the solution is a maximum. However, a stationary point found this way is not guaranteed to be a global maximum.
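For intuition, here is a minimal sketch of the numerical alternative: instead of solving the gradient equation in closed form, we can hand the negative log-likelihood to a generic optimizer. The 1-D Gaussian, the toy data, and the log-sigma parametrization are all assumptions made purely for this illustration.

```python
# Sketch: finding theta-hat numerically for a 1-D Gaussian by minimizing the
# negative log-likelihood with a generic optimizer.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
D = rng.normal(loc=3.0, scale=2.0, size=1000)      # toy dataset

def neg_log_likelihood(theta):
    mu, log_sigma = theta                          # optimize log(sigma) so sigma stays positive
    return -norm(loc=mu, scale=np.exp(log_sigma)).logpdf(D).sum()

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                           # should be close to 3.0 and 2.0
```

Parametrizing the scale as $\log \sigma$ keeps it positive without adding an explicit constraint to the optimizer.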
Example: Unknown $\boldsymbol{\mu}$
Let’s assume that each element $\mathbf{x}_k$ in our dataset $\mathcal{D}$ is drawn from a multivariate Gaussian with known covariance $\boldsymbol{\Sigma}$ but unknown mean $\boldsymbol{\mu}$. What is the MLE of $\boldsymbol{\mu}$?
$$ \hat{\boldsymbol{\mu}} = \arg \max_{\boldsymbol{\mu}} p(\mathcal{D} \mid \boldsymbol{\mu}). $$To find $\hat{\boldsymbol{\mu}}$, we maximize the likelihood function. For a multivariate Gaussian, the density of a single sample is
$$ p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})\right), $$where $d$ is the dimension of $\mathbf{x}_k$.
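As a quick sanity check (with arbitrary toy numbers, not from the text), the density formula above can be evaluated directly with NumPy and compared against SciPy's implementation:

```python
# Sketch: evaluating the multivariate Gaussian density formula at a single point.
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -1.0])
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
d = x.size

diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
p = np.exp(-0.5 * quad) / norm_const

print(p, multivariate_normal(mu, Sigma).pdf(x))      # the two values should agree
```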
Since we assumed that samples are independent, the likelihood of the dataset $\mathcal{D}$ is the product of the likelihoods of each $\mathbf{x}_k$. This becomes a sum in log-space:
$$ \begin{align*} \log p(\mathcal{D} \mid \boldsymbol{\mu}) &= \sum_{k=1}^n \log p(\mathbf{x}_k \mid \boldsymbol{\mu}) \\ &= -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{k=1}^n (\mathbf{x}_k - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}). \end{align*} $$
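This closed form can be checked numerically; the sketch below (again with an arbitrary toy dataset) compares the three-term expression to a direct sum of per-sample log-densities.

```python
# Sketch: closed-form Gaussian log-likelihood vs. a direct sum of log-densities.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal(mu, Sigma, size=200)
n, d = D.shape

diffs = D - mu
quad_terms = np.einsum('ki,ij,kj->k', diffs, np.linalg.inv(Sigma), diffs)
closed_form = (-n * d / 2 * np.log(2 * np.pi)
               - n / 2 * np.log(np.linalg.det(Sigma))
               - 0.5 * quad_terms.sum())

direct = multivariate_normal(mu, Sigma).logpdf(D).sum()
print(closed_form, direct)                           # should match up to floating point
```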
Taking the gradient and setting it to zero:
$$ \nabla_{\boldsymbol{\mu}} \log p(\mathcal{D} \mid \hat{\boldsymbol{\mu}}) = \sum_{k=1}^n \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0}. $$
Derivation of gradient
Consider the quadratic form, where $\mathbf{x} \in \mathbb{R}^{d \times 1}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$:
$$ f(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x} = \sum_{i=1}^d \sum_{j=1}^d x_i \Sigma_{ij} x_j. $$Computing the gradient:
$$ \frac{\partial f}{\partial x_k} = \sum_{j=1}^d \Sigma_{kj} x_j + \sum_{i=1}^d x_i \Sigma_{ik}, $$where the first sum comes from the terms with $i=k$ and the second from the terms with $j=k$. We notice that:
$$ \frac{\partial f}{\partial x_k} = \left( \boldsymbol{\Sigma} \mathbf{x} \right)_k + \left( \boldsymbol{\Sigma}^\top \mathbf{x} \right)_k $$so,
$$ \nabla_{\mathbf{x}} \left( \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x} \right) = (\boldsymbol{\Sigma} + \boldsymbol{\Sigma}^\top) \mathbf{x}, $$which reduces to $2\boldsymbol{\Sigma}\mathbf{x}$ when $\boldsymbol{\Sigma}$ is symmetric.
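A quick finite-difference check of this identity, on an arbitrary (deliberately non-symmetric) matrix chosen just for illustration, might look like the following sketch:

```python
# Sketch: numerically verifying grad(x^T Sigma x) = (Sigma + Sigma^T) x.
import numpy as np

rng = np.random.default_rng(3)
Sigma = rng.normal(size=(4, 4))           # deliberately non-symmetric
x = rng.normal(size=4)

f = lambda v: v @ Sigma @ v               # f(x) = x^T Sigma x

eps = 1e-6
numeric_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)   # central differences
    for e in np.eye(4)
])
analytic_grad = (Sigma + Sigma.T) @ x

print(np.allclose(numeric_grad, analytic_grad, atol=1e-5))   # expected: True
```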
In our case we differentiate with respect to $\boldsymbol{\mu}$, and the chain rule contributes a factor of $-1$ because $\nabla_{\boldsymbol{\mu}} (\mathbf{x}_k - \boldsymbol{\mu}) = -\mathbf{I}$. Since $\boldsymbol{\Sigma}^{-1}$ is symmetric (it is the inverse of a covariance matrix), the result above gives $$ \nabla_{\boldsymbol{\mu}} \left( (\mathbf{x}_k - \hat{\boldsymbol{\mu}})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) \right) = -2\boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}). $$Returning to the gradient equation and multiplying on the left by $\boldsymbol{\Sigma}$ gives $\sum_{k=1}^n (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0}$, so
$$ \sum_{k=1}^n \mathbf{x}_k = \sum_{k=1}^n \hat{\boldsymbol{\mu}} = n \hat{\boldsymbol{\mu}}, $$which implies:
$$ \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^n \mathbf{x}_k, $$which is the sample mean! This is an intuitive result: the maximum likelihood estimate of the mean is simply the average of the observed samples.
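As a closing sanity check (again on synthetic data, with $\boldsymbol{\Sigma}$ assumed known as in this example), the sketch below verifies numerically that the gradient vanishes at the sample mean and that perturbing it only lowers the log-likelihood:

```python
# Sketch: the sample mean makes the gradient vanish and maximizes the log-likelihood.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal([1.0, -2.0], Sigma, size=500)

mu_hat = D.mean(axis=0)                                   # the MLE derived above

def log_likelihood(mu):
    return multivariate_normal(mu, Sigma).logpdf(D).sum()

# The gradient sum_k Sigma^{-1} (x_k - mu_hat) vanishes at the sample mean ...
grad_at_mle = np.linalg.solve(Sigma, (D - mu_hat).sum(axis=0))
print(np.allclose(grad_at_mle, 0.0))                      # expected: True

# ... and perturbing mu_hat in random directions only lowers the log-likelihood.
ll_at_mle = log_likelihood(mu_hat)
for delta in rng.normal(scale=0.1, size=(5, 2)):
    assert log_likelihood(mu_hat + delta) < ll_at_mle
print("sample mean beats all tested perturbations")
```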