Why Information Geometry Matters

What does it mean for an AI model to “learn”? Simply put, it’s the process of reducing how wrong it is. But there’s a hidden question in this process:

“In which direction, and how quickly, should it change to be most efficient?”

The mathematical tool that answers this question is Information Geometry.


A Metaphor Borrowed from Physics

Recall Newton’s second law of motion:

$$F = ma$$

  • $F$ (force): the driving power that pushes an object
  • $m$ (mass): the degree of resistance to change
  • $a$ (acceleration): the actual rate of change that occurs

The heavier an object, the slower it moves under the same force. AI learning works in a surprisingly similar way.


```mermaid
graph LR
    A["Force (F)"] -->|"÷ Mass (m)"| B["Acceleration (a)"]
    C["Information\nMismatch"] -->|"÷ Information\nInertia"| D["Learning\nSpeed"]
    style A fill:#ff9999
    style C fill:#ff9999
    style B fill:#99ff99
    style D fill:#99ff99
```


Step 1: “How Wrong Are We?” — KL Divergence

An AI model holds “predictions” about the world. The tool that measures how different these predictions are from reality is KL Divergence (Kullback-Leibler Divergence).

$$D_{KL}(p \| q_\theta) = \sum_x p(x) \log \frac{p(x)}{q_\theta(x)}$$

It looks complex, but the core idea is simple:

  • $p(x)$: the actual pattern of the world (ground truth)
  • $q_\theta(x)$: what the AI currently believes (prediction)
  • $D_{KL}$: the gap between the two (larger means more wrong; strictly a divergence rather than a true distance, since it isn’t symmetric)

Think of it like this: it’s a numerical score for “how far off the weather forecast was from the actual weather.”

Large value → the model is very wrong → it needs to change a lot.
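As a concrete sketch (a toy example of my own, not from the text), here is the discrete KL divergence in NumPy, scored on a small weather-style distribution:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability arrays.

    Assumes q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only sum over outcomes that actually occur under p (0 * log 0 := 0).
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A "weather forecast" example: ground truth vs. the model's current belief.
p_true = [0.7, 0.2, 0.1]    # sunny, cloudy, rainy
q_model = [0.4, 0.4, 0.2]

print(kl_divergence(p_true, q_model))   # > 0: the model is still wrong
print(kl_divergence(p_true, p_true))    # 0.0: perfect prediction
```

A perfect prediction scores exactly zero; any mismatch scores strictly positive.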


Step 2: “The Force Driving Change” — The Gradient

The force for change is the gradient of the KL divergence. Just as a ball rolls down the steepest slope of a hill, the AI moves in the direction that “reduces its wrongness the fastest.”

$$\text{Force} = -\nabla_\theta D_{KL}(p \| q_\theta)$$

The minus sign means “the direction that reduces wrongness.” We’re going downhill, not uphill.
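To make the “force” concrete, here is a minimal sketch using a one-parameter Bernoulli model (my own toy setup, not from the text): the KL divergence to a ground-truth coin, its analytic gradient, and the resulting downhill force:

```python
import numpy as np

# Ground truth: a coin that lands heads with probability 0.7.
p = 0.7

def kl(theta):
    """D_KL(p || q_theta) between two Bernoulli distributions."""
    return p * np.log(p / theta) + (1 - p) * np.log((1 - p) / (1 - theta))

def kl_grad(theta):
    """Analytic d/dtheta of the KL divergence above."""
    return -p / theta + (1 - p) / (1 - theta)

theta = 0.4                 # the model's current (wrong) belief
force = -kl_grad(theta)     # minus sign: point "downhill"
print(force)                # positive: pushes theta toward 0.7
```

At θ = 0.4 the force is positive, i.e. it pushes the belief up toward the true value 0.7, exactly the “direction that reduces wrongness.”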


Step 3: “Resistance to Change” — The Fisher Information Matrix

Here comes the core concept of information geometry: the Fisher Information Matrix.

$$F_{ij}(\theta) = E_{q_\theta}\left[\frac{\partial \log q_\theta(x)}{\partial \theta_i} \cdot \frac{\partial \log q_\theta(x)}{\partial \theta_j}\right]$$

If the formula feels intimidating, think of it this way:

“If we nudge the model’s parameters just slightly, how sensitively do the predictions change?”

  • Large Fisher information → small parameter changes cause big prediction shifts → “rigid state” → resists change
  • Small Fisher information → parameter changes barely affect predictions → “flexible state” → changes easily

It plays the same role as mass ($m$) in physics. Just as heavier objects are harder to push, models with large Fisher information resist change.
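For a single-parameter model the Fisher “matrix” is just a number, which makes the definition easy to check directly. A sketch for the Bernoulli model (the model choice is mine): the closed form versus a Monte Carlo estimate of the expectation in the formula:

```python
import numpy as np

def fisher_bernoulli(theta):
    """Closed-form Fisher information of a Bernoulli(theta) model."""
    return 1.0 / (theta * (1 - theta))

def fisher_monte_carlo(theta, n=200_000, seed=0):
    """Estimate F(theta) = E_q[(d log q_theta(x) / d theta)^2] by sampling x ~ q_theta."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < theta                          # Bernoulli(theta) samples
    score = np.where(x, 1 / theta, -1 / (1 - theta))   # d log q_theta(x) / d theta
    return float(np.mean(score ** 2))

print(fisher_bernoulli(0.5))    # 4.0: the most "flexible" point
print(fisher_bernoulli(0.99))   # ~101: a near-certain model is "rigid"
print(fisher_monte_carlo(0.8))  # close to 1/(0.8 * 0.2) = 6.25
```

Near the boundary (θ ≈ 0.99) tiny parameter nudges swing the predictions hard, so Fisher information is large; at θ = 0.5 the model is most tolerant of change.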


Step 4: Putting It All Together — Natural Gradient Descent

Combining these three elements like Newton’s law, we get the core equation of AI learning:

$$\Delta\theta = -F(\theta)^{-1} \nabla_\theta D_{KL}(p \| q_\theta)$$

| Physics | Information Geometry | Meaning |
| --- | --- | --- |
| Acceleration $a$ | Parameter change $\Delta\theta$ | The actual change that occurs |
| Force $F$ | KL divergence gradient $\nabla D_{KL}$ | The driving force behind change |
| Inverse mass $1/m$ | Inverse Fisher matrix $F^{-1}$ | Flexibility to change |

This is Natural Gradient Descent: the method of learning along “the most efficient path in information space.”
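Putting the three pieces together for the same toy Bernoulli model (an illustrative sketch under that one-parameter assumption, not a general implementation):

```python
import numpy as np

p = 0.7        # ground-truth Bernoulli parameter
theta = 0.1    # the model's initial (wrong) belief
lr = 0.5       # learning rate

def kl_grad(theta):
    """d/dtheta of D_KL(p || q_theta) for Bernoulli distributions."""
    return -p / theta + (1 - p) / (1 - theta)

def fisher(theta):
    """Fisher information of Bernoulli(theta)."""
    return 1.0 / (theta * (1 - theta))

for step in range(10):
    # Natural gradient update: Delta theta = -lr * F^{-1} * grad D_KL
    theta = theta - lr * kl_grad(theta) / fisher(theta)

print(round(theta, 4))   # ≈ 0.6994, nearly converged to p = 0.7
```

For this model the algebra collapses nicely: the gradient is $(\theta - p)/\theta(1-\theta)$ and the inverse Fisher factor is $\theta(1-\theta)$, so the natural step is simply proportional to $\theta - p$ and the update heads straight for the truth.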


Standard Gradient Descent vs Natural Gradient Descent

Standard gradient descent (SGD) simply moves in the “steepest direction.” But this depends on the coordinate system of the parameter space. The same problem can lead to different directions if you change the coordinates.

Natural gradient descent moves in the “informationally most efficient direction.” Because it measures steepness with the Fisher matrix rather than with raw parameter distances, its update is (to first order) the same no matter which coordinate system you use.


```mermaid
graph TD
    A["Current Model State"] --> B{"Which direction?"}
    B -->|"Standard SGD"| C["Steepest in\nparameter space"]
    B -->|"Natural Gradient"| D["Most efficient in\ninformation space"]
    C --> E["Path depends on\ncoordinate system"]
    D --> F["Always the\nshortest path"]
```

To use an analogy: standard SGD walks along the grid lines of a map, while natural gradient descent considers the actual terrain to find the fastest route.
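This coordinate (in)dependence can be checked numerically. The sketch below (my own construction) parametrizes the same Bernoulli model two ways, by θ directly and by its logit η, and compares one update step of each rule:

```python
import numpy as np

p, lr = 0.7, 1e-3

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def kl_grad_theta(t):
    """d/dtheta of D_KL(p || Bernoulli(theta))."""
    return (t - p) / (t * (1 - t))

# Same model, two coordinate systems: theta, and eta with theta = sigmoid(eta).
theta = 0.1
eta = np.log(theta / (1 - theta))

# --- plain gradient descent: the two coordinates disagree ---
theta_sgd = theta - lr * kl_grad_theta(theta)
grad_eta = kl_grad_theta(theta) * theta * (1 - theta)   # chain rule
eta_sgd = eta - lr * grad_eta
print(theta_sgd, sigmoid(eta_sgd))    # clearly different resulting beliefs

# --- natural gradient: steps agree (to first order) for small lr ---
F_theta = 1 / (theta * (1 - theta))       # Fisher info in theta coordinates
F_eta = theta * (1 - theta)               # Fisher info in eta coordinates
theta_nat = theta - lr * kl_grad_theta(theta) / F_theta
eta_nat = eta - lr * grad_eta / F_eta
print(theta_nat, sigmoid(eta_nat))    # (approximately) the same belief
```

With a small learning rate, the two natural-gradient updates land on essentially the same new distribution, while the two plain-gradient updates visibly diverge, the “grid lines of the map” versus the “actual terrain.”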


Where Is This Actually Used?

This isn’t abstract theory. It’s actively used in real AI systems:

  • TRPO/PPO (Reinforcement Learning): trust-region policy methods behind robot control and game AI; TRPO in particular takes natural gradient steps using the Fisher matrix explicitly
  • Adam Optimizer: the most widely used deep learning optimizer; its per-parameter step scaling can be read as a rough diagonal approximation of Fisher-style curvature

Key Takeaway

What information geometry ultimately tells us is this:

“The learning speed ($\Delta\theta$) of a system is proportional to the gradient of information mismatch ($\nabla D_{KL}$), adjusted by the structural stability of the predictive model ($F$).”

Just as physics’ $F=ma$ describes the motion of objects, information geometry’s natural gradient equation describes the “motion of intelligence.” It provides a unified mathematical framework for understanding biological adaptation, neural network learning, and the evolution of all predictive systems.