Finding Hidden Structure

The relationship between these three keeps slipping from my mind. Writing this down for my future self who will inevitably get confused again. (Written with help from Claude.)

You have data. No labels. But visually, you can see “clusters.”

Automatically finding these clusters is the problem of clustering. There is a fundamental question hidden here:

“Which cluster does each data point belong to?” — How do we infer this hidden information?

Three methods answer this question: K-means, GMM, and the EM algorithm. Remarkably, these are not independent methods but nested inside each other like Russian dolls.


```mermaid
graph LR
    subgraph EM["EM Algorithm"]
        subgraph GMM["GMM"]
            K["K-means"]
        end
    end
    style EM stroke:#2196F3,stroke-width:2px
    style GMM stroke:#FF9800,stroke-width:2px
    style K stroke:#E91E63,stroke-width:2px
```


K-means: Cutting with Scissors

K-means is the most intuitive clustering method. The algorithm repeats two steps:

  1. Assign each data point to the nearest center
  2. Move each center to the mean of its assigned group

```mermaid
graph TD
    A["Initialize centers"] --> B["Assign each point to
nearest center"]
    B --> C["Move centers to
group means"]
    C --> D{"Any change?"}
    D -->|"Yes"| B
    D -->|"No"| E["Done"]
```

The key characteristic is hard assignment. Each data point belongs to exactly one cluster. “70% A, 30% B” is not allowed.

What K-means minimizes is:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

The sum of squared distances between each data point and its assigned cluster center. Simple and fast. But the assumptions are strong:

  • All clusters are spherical
  • All clusters are roughly the same size
  • Boundaries are sharp

Real data is usually not this clean.
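The two steps above fit in a few lines of NumPy. A minimal sketch, not a library implementation — the function name and toy data are mine, and there is no guard against a center losing all its points:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: alternate hard assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to the nearest center (hard assignment).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned group.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
```

On data this clean, the recovered centers land near the two blob means. The interesting failures start exactly where the assumptions below break.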


GMM: Wrapping in Clouds

The Gaussian Mixture Model relaxes K-means’ assumptions. Each cluster is modeled not as a “point” but as a “cloud” (Gaussian distribution).

$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

  • $\pi_k$: proportion of cluster $k$ (mixing weight)
  • $\mu_k$: center of cluster $k$
  • $\Sigma_k$: shape and size of cluster $k$ (covariance matrix)

The crucial difference from K-means is soft assignment. For each data point, we compute “the probability that this point came from cluster $k$.”

$$r_{ik} = \frac{\pi_k \cdot \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_j \pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}$$

This $r_{ik}$ is called responsibility — “how responsible is cluster $k$ for data point $x_i$?”
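To make the formula concrete, here is a minimal NumPy sketch of the responsibility computation in 1D (the function names are mine, not a library API):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def responsibilities(x, pis, mus, sigmas):
    """r_ik = pi_k N(x_i|mu_k, sigma_k^2) / sum_j pi_j N(x_i|mu_j, sigma_j^2)."""
    w = np.array([[pi * gauss_pdf(xi, mu, s)
                   for pi, mu, s in zip(pis, mus, sigmas)]
                  for xi in np.atleast_1d(x)])
    return w / w.sum(axis=1, keepdims=True)

# A point exactly halfway between two equal clusters gets 50/50 responsibility.
r = responsibilities(1.0, pis=[0.5, 0.5], mus=[0.0, 2.0], sigmas=[1.0, 1.0])
print(r)  # → [[0.5 0.5]]
```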

| | K-means | GMM |
|---|---|---|
| Cluster shape | Spherical only | Elliptical, various sizes |
| Assignment | Hard (0 or 1) | Soft (probability) |
| Parameters | Centers only | Centers, covariance, mixing weights |
| Interpretation | “This point is A” | “This point is A with 80% probability” |

EM Algorithm: A General Principle for Finding Hidden Things

How do we learn GMM’s parameters? If we knew which cluster each data point came from, it would be easy. But that is exactly the information we want to find. A chicken-and-egg situation.

If we know cluster assignments, we can compute parameters. If we know parameters, we can compute assignments.

The EM (Expectation-Maximization) algorithm breaks this deadlock by alternating between the two.

  • E-step (Expectation): Compute the expected values of latent variables given current parameters
  • M-step (Maximization): Update parameters using those expected values

```mermaid
graph TD
    A["Initialize parameters θ⁰"] --> B["E-step
Compute posterior of
latent variables given θ"]
    B --> C["M-step
Update parameters
using posteriors"]
    C --> D{"Converged?"}
    D -->|"No"| B
    D -->|"Yes"| E["Final parameters θ*"]
```

Applied to GMM:

  • E-step: Compute responsibility $r_{ik}$ (probability each point belongs to each cluster)
  • M-step: Update $\mu_k$, $\Sigma_k$, $\pi_k$ using $r_{ik}$ as weights

EM is not just for GMM. It is a general framework applicable to any probabilistic model with latent variables — Hidden Markov Models, topic models, and more.


Key Insight: K-means Is the Limit of GMM

Here is where the most interesting connection emerges. If we impose these constraints on GMM:

  1. Fix all covariances to $\sigma^2 I$ (spherical, same size)
  2. Take $\sigma \to 0$

What happens to the responsibility $r_{ik}$?

$$\lim_{\sigma \to 0} r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_i - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Soft assignment becomes hard assignment. Probability 1 for the nearest cluster, 0 for everything else. This is exactly K-means’ assign step.

Intuitively: variance $\sigma^2$ is “how spread out the cluster cloud is.” When clouds become infinitely sharp (variance → 0), each cloud becomes a point, and soft boundaries become razor-sharp boundaries.
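The hardening is easy to see numerically. A toy 1D sketch with two equal-weight, equal-variance clusters (with equal σ, the Gaussian normalizers cancel, so only the exponents matter):

```python
import numpy as np

# Two spherical clusters at mu = 0 and mu = 2; the point x = 0.8 is nearer mu = 0.
mus = np.array([0.0, 2.0])
x = 0.8
for sigma in [2.0, 0.5, 0.1]:
    # Equal mixing weights and equal sigma: normalizers cancel in the ratio.
    w = np.exp(-0.5 * ((x - mus) / sigma) ** 2)
    r = w / w.sum()
    print(sigma, r.round(3))
# sigma = 2.0 → r ≈ [0.52, 0.48]  (soft, nearly ambivalent)
# sigma = 0.5 → r ≈ [0.83, 0.17]
# sigma = 0.1 → r ≈ [1.00, 0.00]  (effectively K-means' hard assignment)
```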


```mermaid
graph LR
    A["GMM
σ² large
spread clouds"] --> B["GMM
σ² small
sharp clouds"]
    B --> C["K-means
σ² → 0
points"]
    A -.- D["soft assignment
probabilistic boundary"]
    C -.- E["hard assignment
sharp boundary"]
    style A stroke:#2196F3,stroke-width:2px
    style B stroke:#FF9800,stroke-width:2px
    style C stroke:#E91E63,stroke-width:2px
```

EM’s Guarantee: It Improves Every Iteration

Why does EM work? The key is the ELBO (Evidence Lower Bound).

What we want to maximize is the log-likelihood $\log p(X|\theta)$. When direct maximization is hard, EM repeatedly raises a lower bound on it.

$$\log p(X|\theta) \geq \underbrace{E_{q(Z)}[\log p(X,Z|\theta)] + H[q(Z)]}_{\text{ELBO}}$$

  • The E-step tightens the ELBO against $\log p(X|\theta)$ (pushes the lower bound up as far as possible)
  • The M-step raises the tightened ELBO further (improves parameters)

Through this process, the log-likelihood never decreases. Monotonic increase is guaranteed at every iteration. However, convergence to the global optimum is not guaranteed — EM can get stuck in local optima.
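The monotonicity argument rests on one standard identity (see Bishop 2006, Ch. 9). For any distribution $q(Z)$ over the latent variables:

$$\log p(X|\theta) = \text{ELBO}(q, \theta) + D_{KL}(q(Z) \,\|\, p(Z|X,\theta))$$

Because the KL term is nonnegative, the ELBO is always a lower bound. The E-step sets $q(Z) = p(Z|X,\theta^{\text{old}})$, driving the KL term to zero, so the bound touches the log-likelihood at the current parameters. The M-step then increases the ELBO in $\theta$; since the bound was tight at $\theta^{\text{old}}$ and remains a lower bound at $\theta^{\text{new}}$, the log-likelihood can only rise or stay put.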


Through the Lens of Information Geometry

Going one step deeper, EM’s workings can be cleanly explained in the language of information geometry.

EM = Iteratively Reducing KL Divergence

The E-step and M-step each minimize a different KL divergence.

  • E-step: Drives the KL divergence between $q(Z)$ and $p(Z|X,\theta)$ to zero.

$$q^*(Z) = \arg\min_q D_{KL}(q(Z) \,\|\, p(Z|X,\theta)) = p(Z|X,\theta)$$

  • M-step: Reduces the KL divergence between the model and data distributions.

$$\theta^* = \arg\min_\theta D_{KL}(p_{\text{data}} \,\|\, p_\theta)$$

EM is “alternately reducing two kinds of distance” — once in latent variable space, once in parameter space.


```mermaid
graph TD
    A["Current state"] --> B["E-step
KL divergence between
q(Z) and p(Z|X,θ) → 0"]
    B --> C["M-step
KL divergence between
model and data ↓"]
    C --> D["Better state"]
    D --> B
    style B stroke:#4CAF50,stroke-width:2px
    style C stroke:#FF9800,stroke-width:2px
```

Parameter Space Is Not Flat

Ordinary gradient descent treats parameter space as Euclidean (flat). But the parameter space of probability distributions is curved.

For example, changing a Gaussian’s mean from $\mu=0$ to $\mu=1$ versus from $\mu=100$ to $\mu=101$ is the same “numerical” change. But how much the distribution shape changes depends entirely on the variance ($\sigma^2$). Small variance means a big shape change; large variance means almost no change.

What measures “how much the distribution actually changes” is Fisher information.

$$F_{ij}(\theta) = E_{p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \cdot \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right]$$

Fisher information is the “metric tensor” that tells us the curvature of parameter space. Correcting the gradient with it gives the natural gradient.

$$\Delta\theta = -F(\theta)^{-1} \nabla_\theta \ell(\theta)$$
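A quick Monte Carlo sanity check of the Fisher information definition, assuming the simplest case — a 1D Gaussian with known variance, where $F = 1/\sigma^2$ for the mean parameter:

```python
import numpy as np

# For N(mu, sigma^2), the score w.r.t. mu is (x - mu) / sigma^2,
# so Fisher information is F = E[score^2] = 1 / sigma^2.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, 1_000_000)
score = (x - mu) / sigma**2
F_mc = np.mean(score**2)
print(F_mc)  # Monte Carlo estimate, close to 1 / sigma^2 = 0.25
```

With $\sigma = 2$, the estimate lands near $0.25$: the larger the variance, the less one observation tells us about the mean, so the natural gradient $F^{-1}\nabla_\theta\ell$ takes correspondingly larger steps in $\mu$ — exactly the $\mu=0 \to 1$ versus $\mu=100 \to 101$ intuition above.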

EM ≈ Natural Gradient

Amari’s information-geometric analysis (1998) shows that the EM update can be viewed as a form of natural gradient descent. That is, EM automatically accounts for the curvature of parameter space, updating parameters in the “informationally most efficient direction.”

This is why EM often converges faster than plain gradient descent. It follows the shortest path in probability distribution space, not in flat coordinates.

Information-Geometric Hierarchy: K-means → GMM → EM

| Method | Information-geometric interpretation |
|---|---|
| K-means | Assigns each point to a δ-distribution; KL divergence is either infinity or zero |
| GMM-EM | Projection onto the manifold of Gaussian mixtures |
| EM (general) | Natural gradient descent on the probability manifold defined by the latent variable model |

Fisher Information and “Inertia” in GMM

One more intuition about Fisher information in the GMM context:

  • A cluster with many high-responsibility data points → large Fisher information for that cluster’s parameters → parameters resist change (“high inertia”)
  • Conversely, regions with ambiguous responsibility (data points split across clusters) → small Fisher information → parameters change easily (“flexible”)

Confident assignments lead to stability; uncertain regions respond more sensitively. In the physics analogy: heavy objects (confident clusters) resist being pushed, while light objects (uncertain clusters) move dramatically with small forces.


Summary: Three Branches from One Root


```mermaid
graph TD
    EM["EM Algorithm
General framework for MLE
with latent variables"]
    GMM["GMM
Latent = cluster membership
Observation model = Gaussian"]
    KM["K-means
GMM with σ→0
soft → hard assignment"]
    IG["Information Geometry
EM = natural gradient
Automatically accounts for
parameter space curvature"]
    EM --> GMM
    GMM --> KM
    EM -.- IG
    style EM stroke:#2196F3,stroke-width:2px
    style GMM stroke:#FF9800,stroke-width:2px
    style KM stroke:#E91E63,stroke-width:2px
    style IG stroke:#9C27B0,stroke-width:2px
```
  • K-means is the intuitive approach: “assign to the nearest center”
  • GMM gains flexibility through probabilistic assignment
  • EM is the general framework for “estimating hidden variables”
  • Information geometry explains why EM is efficient and why these three relate the way they do

Ultimately, these three methods answer the same question — “how do we find invisible structure?” — at different levels of generality.


References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9.
  • Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251-276.
  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. JRSS-B, 39(1), 1-38.