Goal
We are given a dataset $\mathcal{D}$, which contains feature vectors $\mathbf{x}_k$ and class labels $\omega_k$. Denote $\mathcal{D}_i$ as the set of features of class $\omega_i$. We assume the following:
- That $p(\mathbf{x} \mid \omega_j) \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. That is, given a class label, the distribution of features belonging to that class forms a Gaussian with mean $\boldsymbol{\mu}_j$ and covariance $\boldsymbol{\Sigma}_j$.
- The samples $\mathbf{x} \in \mathcal{D}_i$ are independent and identically distributed (i.i.d.) according to this assumed Gaussian distribution.
The goal of maximum likelihood estimation (MLE) is to find the parameters $\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j$ under which the observed data are most likely. We denote
$$ \boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}), $$which collects the means and covariances of all classes. Because the samples in $\mathcal{D}_i$ carry no information about the parameters of any other class, each class can be estimated separately, so we treat one class at a time and drop the subscript $j$ below. The likelihood of $\boldsymbol{\theta}$ is
$$ l(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}), $$and the MLE of $\boldsymbol{\theta}$, $\hat{\boldsymbol{\theta}}$, is
$$ \hat{\boldsymbol{\theta}} = \arg \max_{\boldsymbol{\theta}} l(\boldsymbol{\theta}). $$In practice, we use the log-likelihood for simpler computation:
$$ \log l(\boldsymbol{\theta}) = \log p(\mathcal{D} \mid \boldsymbol{\theta}), $$since the logarithm is monotonically increasing, so maximizing the log-likelihood is equivalent to maximizing the likelihood. In words, the likelihood is the probability (density) of generating our dataset if each datapoint were drawn independently from the distribution defined by $\boldsymbol \theta$. The $\hat{\boldsymbol{\theta}}$ that maximizes this quantity is our best estimate of the distribution from which $\mathcal{D}$ was drawn.
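As a concrete illustration (not part of the original derivation), the sketch below evaluates both quantities on a small synthetic dataset: the raw likelihood, a product of many small densities, typically underflows to zero, while the log-likelihood stays a usable number. The dataset, the candidate parameters, and all variable names are made up for this sketch.

```python
# Sketch: likelihood vs. log-likelihood of a toy dataset under a fixed candidate theta.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # toy dataset D

# A candidate theta = (mu, Sigma); any fixed values would do for this illustration.
mu, Sigma = np.zeros(2), np.eye(2)
densities = multivariate_normal(mu, Sigma).pdf(D)            # p(x_k | theta) for each sample

likelihood = np.prod(densities)              # product over i.i.d. samples; underflows here
log_likelihood = np.sum(np.log(densities))   # the same information, numerically stable

print(likelihood, log_likelihood)
```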
We can attempt to find $\hat{\boldsymbol{\theta}}$ by setting the gradient of the log-likelihood to $\mathbf{0}$ and verifying that the solution is a maximum. However, a stationary point found this way is not guaranteed to be a global maximum.
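For intuition, here is a minimal sketch of the numerical alternative: instead of solving the gradient equation in closed form, we can hand the negative log-likelihood to a generic optimizer. The 1-D Gaussian, the toy data, and the log-sigma parametrization are all assumptions made purely for this illustration.

```python
# Sketch: finding theta-hat numerically for a 1-D Gaussian by minimizing the
# negative log-likelihood with a generic optimizer.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
D = rng.normal(loc=3.0, scale=2.0, size=1000)      # toy dataset

def neg_log_likelihood(theta):
    mu, log_sigma = theta                          # optimize log(sigma) so sigma stays positive
    return -norm(loc=mu, scale=np.exp(log_sigma)).logpdf(D).sum()

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                           # should be close to 3.0 and 2.0
```

Parametrizing the scale as $\log \sigma$ keeps it positive without adding an explicit constraint to the optimizer.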
Example: Unknown $\boldsymbol{\mu}$
Let’s assume that each element $\mathbf{x}_k$ in our dataset $\mathcal{D}$ is drawn from a multivariate Gaussian with known covariance $\boldsymbol{\Sigma}$ but unknown mean $\boldsymbol{\mu}$. What is the MLE of $\boldsymbol{\mu}$?
$$ \hat{\boldsymbol{\mu}} = \arg \max_{\boldsymbol{\mu}} p(\mathcal{D} \mid \boldsymbol{\mu}). $$To find $\hat{\boldsymbol{\mu}}$, we maximize the likelihood function. For a multivariate Gaussian, the density of a single sample is
$$ p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})\right), $$where $d$ is the dimension of $\mathbf{x}_k$.
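As a quick sanity check (with arbitrary toy numbers, not from the text), the density formula above can be evaluated directly with NumPy and compared against SciPy's implementation:

```python
# Sketch: evaluating the multivariate Gaussian density formula at a single point.
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -1.0])
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
d = x.size

diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
p = np.exp(-0.5 * quad) / norm_const

print(p, multivariate_normal(mu, Sigma).pdf(x))      # the two values should agree
```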
Since we assumed that samples are independent, the likelihood of the dataset $\mathcal{D}$ is the product of the likelihoods of each $\mathbf{x}_k$. This becomes a sum in log-space:
$$ \begin{align*} \log p(\mathcal{D} \mid \boldsymbol{\mu}) &= \sum_{k=1}^n \log p(\mathbf{x}_k \mid \boldsymbol{\mu}) \\ &= -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{k=1}^n (\mathbf{x}_k - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}). \end{align*} $$
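This closed form can be checked numerically; the sketch below (again with an arbitrary toy dataset) compares the three-term expression to a direct sum of per-sample log-densities.

```python
# Sketch: closed-form Gaussian log-likelihood vs. a direct sum of log-densities.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal(mu, Sigma, size=200)
n, d = D.shape

diffs = D - mu
quad_terms = np.einsum('ki,ij,kj->k', diffs, np.linalg.inv(Sigma), diffs)
closed_form = (-n * d / 2 * np.log(2 * np.pi)
               - n / 2 * np.log(np.linalg.det(Sigma))
               - 0.5 * quad_terms.sum())

direct = multivariate_normal(mu, Sigma).logpdf(D).sum()
print(closed_form, direct)                           # should match up to floating point
```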
Taking the gradient and setting it to zero:
$$ \nabla_{\boldsymbol{\mu}} \log p(\mathcal{D} \mid \hat{\boldsymbol{\mu}}) = \sum_{k=1}^n \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0}. $$
Derivation of gradient
Consider the quadratic form, where $\mathbf{x} \in \mathbb{R}^{d \times 1}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$:
$$ f(\mathbf{x}) = \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x} = \sum_{i=1}^d \sum_{j=1}^d x_i \Sigma_{ij} x_j. $$Computing the gradient:
$$ \frac{\partial f}{\partial x_k} = \sum_{j=1}^d \Sigma_{kj} x_j + \sum_{i=1}^d x_i \Sigma_{ik}, $$where the first sum comes from the terms with $i=k$ and the second from the terms with $j=k$. We notice that:
$$ \frac{\partial f}{\partial x_k} = \left( \boldsymbol{\Sigma} \mathbf{x} \right)_k + \left( \boldsymbol{\Sigma}^\top \mathbf{x} \right)_k $$so,
$$ \nabla_{\mathbf{x}} \left( \mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x} \right) = (\boldsymbol{\Sigma} + \boldsymbol{\Sigma}^\top) \mathbf{x}, $$which reduces to $2\boldsymbol{\Sigma}\mathbf{x}$ when $\boldsymbol{\Sigma}$ is symmetric.
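A quick finite-difference check of this identity, on an arbitrary (deliberately non-symmetric) matrix chosen just for illustration, might look like the following sketch:

```python
# Sketch: numerically verifying grad(x^T Sigma x) = (Sigma + Sigma^T) x.
import numpy as np

rng = np.random.default_rng(3)
Sigma = rng.normal(size=(4, 4))           # deliberately non-symmetric
x = rng.normal(size=4)

f = lambda v: v @ Sigma @ v               # f(x) = x^T Sigma x

eps = 1e-6
numeric_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)   # central differences
    for e in np.eye(4)
])
analytic_grad = (Sigma + Sigma.T) @ x

print(np.allclose(numeric_grad, analytic_grad, atol=1e-5))   # expected: True
```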
In our case we differentiate with respect to $\boldsymbol{\mu}$, and the chain rule contributes a factor of $-1$ because $\nabla_{\boldsymbol{\mu}} (\mathbf{x}_k - \boldsymbol{\mu}) = -\mathbf{I}$. Since $\boldsymbol{\Sigma}^{-1}$ is symmetric (it is the inverse of a covariance matrix), the result above gives $$ \nabla_{\boldsymbol{\mu}} \left( (\mathbf{x}_k - \hat{\boldsymbol{\mu}})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) \right) = -2\boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}). $$Returning to the gradient equation and multiplying on the left by $\boldsymbol{\Sigma}$ gives $\sum_{k=1}^n (\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0}$, so
$$ \sum_{k=1}^n \mathbf{x}_k = \sum_{k=1}^n \hat{\boldsymbol{\mu}} = n \hat{\boldsymbol{\mu}}, $$which implies:
$$ \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^n \mathbf{x}_k, $$which is the sample mean! This is an intuitive result: the maximum likelihood estimate of the mean is simply the average of the observed samples.
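As a closing sanity check (again on synthetic data, with $\boldsymbol{\Sigma}$ assumed known as in this example), the sketch below verifies numerically that the gradient vanishes at the sample mean and that perturbing it only lowers the log-likelihood:

```python
# Sketch: the sample mean makes the gradient vanish and maximizes the log-likelihood.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
D = rng.multivariate_normal([1.0, -2.0], Sigma, size=500)

mu_hat = D.mean(axis=0)                                   # the MLE derived above

def log_likelihood(mu):
    return multivariate_normal(mu, Sigma).logpdf(D).sum()

# The gradient sum_k Sigma^{-1} (x_k - mu_hat) vanishes at the sample mean ...
grad_at_mle = np.linalg.solve(Sigma, (D - mu_hat).sum(axis=0))
print(np.allclose(grad_at_mle, 0.0))                      # expected: True

# ... and perturbing mu_hat in random directions only lowers the log-likelihood.
ll_at_mle = log_likelihood(mu_hat)
for delta in rng.normal(scale=0.1, size=(5, 2)):
    assert log_likelihood(mu_hat + delta) < ll_at_mle
print("sample mean beats all tested perturbations")
```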