Mixture Models

Spherical Gaussian

\(P(x; \mu, \sigma^2) = \cfrac{1}{(\sqrt{2\pi\sigma^2})^d} e^{-||x-\mu||^2/(2\sigma^2)}\) where \(d\) is the dimension of \(x\)

from a sample:

\(\hat{\mu} = \cfrac{1}{n}\sum\limits_{i = 1}^{n} x_i\)

\(\hat{\sigma}^2 = \cfrac{1}{dn}\sum\limits_{i = 1}^{n} ||x_i - \hat{\mu}||^2\)
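The density and the two estimators above can be sketched in NumPy (function names are illustrative, not from the notes):

```python
import numpy as np

def spherical_gaussian_pdf(x, mu, sigma2):
    """Density of a spherical (isotropic) Gaussian at point x."""
    d = mu.shape[0]
    sq_dist = np.sum((np.asarray(x) - mu) ** 2, axis=-1)  # ||x - mu||^2
    norm = (2 * np.pi * sigma2) ** (d / 2)                # (sqrt(2*pi*sigma^2))^d
    return np.exp(-sq_dist / (2 * sigma2)) / norm

def fit_spherical_gaussian(X):
    """Maximum-likelihood estimates of mu and sigma^2 from an (n, d) sample."""
    n, d = X.shape
    mu = X.mean(axis=0)                         # (1/n) sum x_i
    sigma2 = np.sum((X - mu) ** 2) / (d * n)    # (1/(dn)) sum ||x_i - mu||^2
    return mu, sigma2
```

Note the \(1/(dn)\) factor: the single \(\hat{\sigma}^2\) is averaged over all \(d\) coordinates, since the covariance is \(\sigma^2 I\).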

Mixture of Spherical Gaussians

Assuming that there are \(k\) clusters, there will be \(k\) gaussians.

\(p_i\), \(i \in \{1, 2, ..., k\}\) - the mixing proportions: the expected fraction of points in each cluster, with \(\sum_i p_i = 1\)

If all the parameters of our model are \(\theta\) then:

\(P(x | \theta) = \sum\limits_{i=1}^{k}p_i \mathbb{P}(x ; \hat{\mu}_i, \hat{\sigma}_i^2)\)
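The mixture density is just the \(p_i\)-weighted sum of the component densities; a minimal sketch (names are my own):

```python
import numpy as np

def mixture_pdf(x, p, mus, sigma2s):
    """Evaluate P(x | theta) = sum_i p_i * N(x; mu_i, sigma_i^2 * I)."""
    d = len(x)
    total = 0.0
    for p_i, mu_i, s2_i in zip(p, mus, sigma2s):
        sq = np.sum((np.asarray(x) - mu_i) ** 2)
        total += p_i * np.exp(-sq / (2 * s2_i)) / (2 * np.pi * s2_i) ** (d / 2)
    return total
```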

Estimating Mixtures of Labeled Data

\(\delta (i|t) = \begin{cases} 1 & \text{if } x_t \text{ is assigned to cluster } i \\ 0 & \text{otherwise}\end{cases}\)

The maximum likelihood objective:

\(\sum\limits_{t = 1}^{n} \sum\limits_{i = 1}^{k} \delta(i|t) \log \left[ p_i \mathbb{P}(x_t ; \hat{\mu}_i, \hat{\sigma}_i^2) \right]\)

\(\hat{n}_i = \sum\limits_{t = 1}^{n} \delta(i|t)\) (num points assigned to cluster \(i\))

\(p_i = \cfrac{\hat{n}_i}{n}\)

\(\hat{\mu}_i = \cfrac{1}{\hat{n}_i}\sum\limits_{t = 1}^{n} \delta(i|t) x_t\)

\(\hat{\sigma}_i^2 = \cfrac{1}{d \hat{n}_i}\sum\limits_{t = 1}^{n} \delta(i|t) ||x_t - \hat{\mu}_i||^2\)
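With labels, the four formulas above are direct per-cluster averages; a self-contained sketch (the function name and label encoding are assumptions):

```python
import numpy as np

def fit_labeled_mixture(X, labels, k):
    """ML estimates for a k-component spherical GMM from labeled data.
    X: (n, d) array; labels: (n,) ints in {0, ..., k-1} playing the role of delta."""
    n, d = X.shape
    p = np.empty(k)
    mus = np.empty((k, d))
    sigma2s = np.empty(k)
    for i in range(k):
        mask = labels == i                  # delta(i|t) as a boolean mask
        n_i = mask.sum()                    # n_hat_i: points assigned to cluster i
        p[i] = n_i / n
        mus[i] = X[mask].mean(axis=0)
        sigma2s[i] = np.sum((X[mask] - mus[i]) ** 2) / (d * n_i)
    return p, mus, sigma2s
```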

Estimating Mixtures Without Labels

Maximise:

\(\sum\limits_{t = 1}^{n} \log \sum\limits_{i=1}^{k}p_i \mathbb{P}(x_t ; \hat{\mu}_i, \hat{\sigma}_i^2)\)

A closed-form solution cannot be obtained here, so the parameters are estimated iteratively with the EM algorithm.

Initialise all \(\hat{\sigma}_i^2 = \hat{\sigma}^{2} = \cfrac{1}{d n}\sum\limits_{t = 1}^{n} ||x_t - \hat{\mu}||^2\), the variance of the full sample (with \(\hat{\mu}\) the overall sample mean).

The E Step:

\(w(i|t) = \cfrac{p_i \mathbb{P}(x_t ; \hat{\mu}_i, \hat{\sigma}_i^2)}{\mathbb{P}(x_t | \theta)} = \cfrac{p_i \mathbb{P}(x_t ; \hat{\mu}_i, \hat{\sigma}_i^2)}{\sum\limits_{j=1}^{k}p_j \mathbb{P}(x_t ; \hat{\mu}_j, \hat{\sigma}_j^2)}\)

\(w(i|t)\) softly assigns each point to a cluster by a weight; this is similar to the labeled case, where \(\delta\) gave a hard \(0\)/\(1\) assignment.
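The E step is a per-point normalisation of the weighted component densities; a NumPy sketch (the function name is my own):

```python
import numpy as np

def e_step(X, p, mus, sigma2s):
    """Soft assignments w(i|t): posterior probability of cluster i for point t.
    Returns a (k, n) array whose columns each sum to 1."""
    n, d = X.shape
    k = len(p)
    w = np.empty((k, n))
    for i in range(k):
        sq = np.sum((X - mus[i]) ** 2, axis=1)  # ||x_t - mu_i||^2 for all t
        w[i] = p[i] * np.exp(-sq / (2 * sigma2s[i])) \
               / (2 * np.pi * sigma2s[i]) ** (d / 2)
    return w / w.sum(axis=0)  # divide by P(x_t | theta), the mixture density
```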

The M Step:

\(\hat{n}_i = \sum\limits_{t = 1}^{n} w(i|t)\) (effective num points assigned to cluster \(i\))

\(p_i = \cfrac{\hat{n}_i}{n}\)

\(\hat{\mu}_i = \cfrac{1}{\hat{n}_i}\sum\limits_{t = 1}^{n} w(i|t) x_t\)

\(\hat{\sigma}_i^2 = \cfrac{1}{d \hat{n}_i}\sum\limits_{t = 1}^{n} w(i|t) ||x_t - \hat{\mu}_i||^2\)
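The M step mirrors the labeled-data formulas with \(w(i|t)\) in place of \(\delta(i|t)\); a sketch assuming a (k, n) weight matrix as input:

```python
import numpy as np

def m_step(X, w):
    """Re-estimate p_i, mu_i, sigma_i^2 from soft assignments w of shape (k, n)."""
    n, d = X.shape
    n_hat = w.sum(axis=1)                   # effective counts n_hat_i
    p = n_hat / n
    mus = (w @ X) / n_hat[:, None]          # weighted means
    sigma2s = np.array([
        np.sum(w[i][:, None] * (X - mus[i]) ** 2) / (d * n_hat[i])
        for i in range(len(n_hat))
    ])
    return p, mus, sigma2s
```

With hard 0/1 weights this reduces exactly to the labeled-data estimates.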

EM has the same convergence properties as \(k\)-means: each iteration does not decrease the objective (here the log-likelihood), so it converges, but possibly only to a local optimum.
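Putting the pieces together, the whole procedure can be sketched as one loop (initialisation choices are assumptions; the notes only fix the initial \(\hat{\sigma}_i^2\)):

```python
import numpy as np

def em_spherical_gmm(X, k, n_iter=50, init_mus=None, seed=0):
    """EM for a k-component spherical GMM (a sketch; init of means is an assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if init_mus is None:
        mus = X[rng.choice(n, k, replace=False)].copy()  # random data points as means
    else:
        mus = np.array(init_mus, dtype=float)
    # initialise every sigma_i^2 to the overall sample variance, as in the notes
    sigma2s = np.full(k, np.sum((X - X.mean(axis=0)) ** 2) / (d * n))
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step: w(i|t) proportional to p_i * N(x_t; mu_i, sigma_i^2 I)
        sq = ((X[None, :, :] - mus[:, None, :]) ** 2).sum(-1)          # (k, n)
        dens = np.exp(-sq / (2 * sigma2s[:, None])) \
               / (2 * np.pi * sigma2s[:, None]) ** (d / 2)
        w = p[:, None] * dens
        w /= w.sum(axis=0)
        # M step: re-estimate parameters from the soft counts
        n_hat = w.sum(axis=1)
        p = n_hat / n
        mus = (w @ X) / n_hat[:, None]
        sq = ((X[None, :, :] - mus[:, None, :]) ** 2).sum(-1)
        sigma2s = (w * sq).sum(axis=1) / (d * n_hat)
    return p, mus, sigma2s
```

On well-separated clusters the recovered means settle near the true centers within a few iterations.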