Some thoughts on Information Geometry

Author

Michele Cespa

Published

December 12, 2025

“Classical thermodynamics … is the only physical theory of universal content which I am convinced … will never be overthrown.” Albert Einstein

Introduction

Very much continuing from the theme of the last post, I want to talk about some pretty elegant approaches to statistical physics - motivated purely from the statistical point of view. I think much of the astonishing success of thermodynamics (as Einstein clearly agrees) can be attributed to how well it abstracts ideas away from the noisy physical properties of a system. To tackle these topics properly, I’ll need to introduce quite a few things from the statistical toolbox, so bear with me for a bit…

Projections

In geometry we are often interested in “shortest distances” which motivates the notion of orthogonal projections. The standard Euclidean projection can be formulated in a few ways but the one useful to us will be:

let \(Q = \vec{\theta} \cdot \vec{x}\) define a plane and \(P\) be some point we want to project onto the plane: \[ \begin{align} proj(P, Q) = \arg\min_{x'} || Q(x') - P ||^2 \end{align} \]

This is nothing new, but serves as a reminder that projection is an extremisation problem and often we deal with the convenient Euclidean distance, which among other things is symmetric in its arguments. The KL-Divergence will serve as our “distance” measure when talking about probability distributions. We can now think of the planes we project onto as manifolds whose coordinates are the parameters of the distribution - for example, the manifold of Gaussians would be parametrised by the coordinate tuple \((\mu, \sigma)\). Similar to the Euclidean projection we can define the following:

let \(p \in \mathcal{P}\) be our distribution of interest and \(\mathcal{Q}\) be the manifold embedded in \(\mathcal{P}\) onto which we want to project: \[ \begin{align} proj_M(p, \mathcal{Q}) &= \arg\min_{q \in \mathcal{Q}} D(p || q) \\ proj_I(p, \mathcal{Q}) &= \arg\min_{q \in \mathcal{Q}} D(q || p) \end{align} \]

The KL-Divergence (\(D(p || q)\)) is in general not symmetric in its arguments and hence is not quite a “distance” measure, although locally (to second order in the separation between \(p\) and \(q\)) it is symmetric and defines the Fisher Information metric. This asymmetry gives rise to two distinct projections, which for now I’ll denote “M” and “I”.
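To make the asymmetry concrete, here is a minimal sketch using two made-up distributions on three outcomes (the numbers and the helper name `kl` are purely illustrative). It also nudges \(p\) slightly to show that the two directions agree to leading order for nearby distributions.

```python
import math

def kl(p, q):
    """D(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two arbitrary (hypothetical) distributions on three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

d_pq = kl(p, q)  # the direction minimised over q by the "M" projection
d_qp = kl(q, p)  # the direction minimised over q by the "I" projection
print(f"D(p||q) = {d_pq:.4f}, D(q||p) = {d_qp:.4f}")

# Locally the asymmetry fades: perturb p slightly and compare both directions.
eps = 1e-3
p_near = [0.5 + eps, 0.4 - eps, 0.1]
d_fwd = kl(p, p_near)
d_bwd = kl(p_near, p)
print(f"nearby: {d_fwd:.3e} vs {d_bwd:.3e}")
```

The two global values differ noticeably, while the two nearby values agree up to a third-order correction.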

Exponential Families

Another ingredient we’ll need is the notion of an exponential family. These functional forms turn out to be a very convenient way to express distributions. An exponential family for points \(x \in \mathbb{R}^n\) is characterised by:

  • an input subspace \(S \subseteq \mathbb{R}^n\)
  • a base measure \(h: \mathbb{R}^n \rightarrow \mathbb{R}\)
  • a set of features \(\mathbf{T}(x) = (T_1(x),..., T_m(x))\) with \(T_i: \mathbb{R}^n \rightarrow \mathbb{R}\)

The exponential family is then:

\[ \begin{align} p_{\theta}(x) &= \frac{1}{Z}\exp[\mathbf{\theta} \cdot \mathbf{T}(x)]h(x) \\ Z &= \sum_{x \in S} \exp[\mathbf{\theta} \cdot \mathbf{T}(x)]h(x) \end{align} \]

\(Z\) is a partition function which turns out to be an immensely valuable object. For convenience, this \(Z\) is sometimes subsumed into the exponent and written as \(G(\theta) = \log Z\) - the log partition function. Most common distributions we deal with (Bernoulli, Beta, Poisson, Gaussian etc) can be written in this form but the parameters \(\theta\) may not turn out to be the ones we are familiar with. I’ll demonstrate for the case of a 1D Gaussian:

In this case: \(S = \mathbb{R}\), \(h = \frac{1}{\sqrt{2\pi}}\), \(\mathbf{T}(x) = (x, x^2)\) \[ \begin{align} p_{\theta}(x) = \frac{1}{\sqrt{2\pi}}\exp\Big[(x, x^2) \cdot \Big(\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2}\Big) - \Big(\frac{\mu^2}{2\sigma^2} + \log\sigma\Big)\Big] \end{align} \]

So the natural parameters are not the familiar \((\mu, \sigma)\) pair but instead \((\theta_1, \theta_2) = (\frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2})\) and the log partition function is \(G(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma = -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\log(-2\theta_2)\).
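As a sanity check on the Gaussian example, the sketch below evaluates the density in exponential-family form and in the textbook form and compares them. Expanding the square in the Gaussian exponent gives a log partition function of \(\frac{\mu^2}{2\sigma^2} + \log\sigma\), which is what the code uses. The function names and the test values of \(\mu\) and \(\sigma\) are my own choices.

```python
import math

def natural_params(mu, sigma):
    """Map (mu, sigma) to the natural parameters (theta_1, theta_2)."""
    return (mu / sigma**2, -1.0 / (2.0 * sigma**2))

def log_partition(theta1, theta2):
    """G(theta) for the 1D Gaussian, written in natural parameters."""
    return -theta1**2 / (4.0 * theta2) - 0.5 * math.log(-2.0 * theta2)

def density_exp_family(x, mu, sigma):
    """h(x) * exp(theta . T(x) - G(theta)) with T(x) = (x, x^2)."""
    t1, t2 = natural_params(mu, sigma)
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(t1 * x + t2 * x**2 - log_partition(t1, t2))

def density_standard(x, mu, sigma):
    """The familiar Gaussian density, for comparison."""
    return math.exp(-(x - mu)**2 / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))

mu, sigma = 1.2, 0.8  # arbitrary test values
for x in (-1.0, 0.0, 0.7, 2.5):
    print(x, density_exp_family(x, mu, sigma), density_standard(x, mu, sigma))
```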

M-Projection and MLE

To convince you there is in fact a point to all this, let’s consider the gradient of \(G\):

\[ \begin{align} G(\theta) &= \log Z = \log\Bigg[\sum_{x \in S} \exp[\mathbf{\theta} \cdot \mathbf{T}(x)] h(x) \Bigg] \\ \frac{\partial G}{\partial \theta_k} &= \frac{1}{Z} \sum_{x \in S} T_k(x) \exp[\mathbf{\theta} \cdot \mathbf{T}(x)] h(x) = \mathbb{E}[T_k(x)] \end{align} \]

This is interesting. The gradient of the log partition function is exactly the expected value of our feature vector. What if we use an exponential family as our parametric distribution when performing an MLE?
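The identity can be checked numerically. The sketch below assumes a small finite support \(S = \{0, 1, 2, 3\}\), features \(\mathbf{T}(x) = (x, x^2)\) and \(h(x) = 1\), all chosen for illustration only, and compares a central finite-difference gradient of \(G\) with the expected features.

```python
import math

S = [0, 1, 2, 3]  # hypothetical finite support, with base measure h(x) = 1

def features(x):
    return (x, x * x)

def dot(theta, x):
    return sum(t * f for t, f in zip(theta, features(x)))

def log_partition(theta):
    return math.log(sum(math.exp(dot(theta, x)) for x in S))

def expected_features(theta):
    """E[T_k(x)] under p_theta, computed directly from the probabilities."""
    G = log_partition(theta)
    probs = [math.exp(dot(theta, x) - G) for x in S]
    return tuple(sum(p * features(x)[k] for p, x in zip(probs, S)) for k in range(2))

def numerical_gradient(theta, step=1e-6):
    """Central finite-difference estimate of dG/dtheta_k."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += step
        down[k] -= step
        grad.append((log_partition(up) - log_partition(down)) / (2 * step))
    return tuple(grad)

theta = (0.3, -0.2)  # arbitrary parameter values
grad_G = numerical_gradient(theta)
mean_T = expected_features(theta)
print(grad_G, mean_T)
```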

\[ \begin{align} \mathcal{L} = \sum_{i=1}^N \log p_{\theta}(x_i) &= \sum_{i=1}^N \Big[\mathbf{\theta} \cdot \mathbf{T}(x_i) - G(\theta) + \log h(x_i)\Big] \\ &= \theta \cdot \sum_{i=1}^N \mathbf{T}(x_i) - NG(\theta) + \sum_{i=1}^N \log h(x_i) \\ \frac{\partial \mathcal{L}}{\partial \theta_k} &= \sum_{i=1}^N T_k(x_i) - N\frac{\partial G}{\partial \theta_k} = 0 \\ \frac{\partial G}{\partial \theta_k} &= \frac{1}{N}\sum_{i=1}^N T_k(x_i) = \mathbb{E}_{p_{\theta}}[T_k(x)] \end{align} \]

This is even better. The MLE distribution is the one whose expected feature values match the sample averages of the features exactly. Tying this back to projections, we can consider the likelihood maximisation in terms of KL-Divergence:
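Here is a minimal sketch of that moment matching, assuming a Poisson model written as an exponential family with \(T(x) = x\), \(h(x) = 1/x!\) and \(G(\theta) = e^{\theta}\). The data, step size and iteration count are made up for illustration. Plain gradient ascent on the average log-likelihood drives the model mean \(\partial G / \partial \theta\) onto the sample mean.

```python
import math

data = [2, 0, 3, 1, 4, 2, 2, 5, 1, 0]  # hypothetical count data
sample_mean = sum(data) / len(data)

# Gradient ascent on (1/N) * log-likelihood; its gradient is
# (sample average of T) - dG/dtheta = sample_mean - exp(theta).
theta = 0.0
learning_rate = 0.1
for _ in range(500):
    gradient = sample_mean - math.exp(theta)
    theta += learning_rate * gradient

model_mean = math.exp(theta)  # E[T(x)] = dG/dtheta under the fitted model
print(f"theta = {theta:.6f}, model mean = {model_mean:.6f}, sample mean = {sample_mean:.6f}")
```

For this model the stationarity condition can also be solved by hand, giving \(\hat{\theta} = \log\) of the sample mean, so the loop is only there to show the gradient at work.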

For notational convenience, let \(p\) be the ground truth distribution and our model (previously \(p_{\theta}\)) be \(q \in \mathcal{Q}\) where \(\mathcal{Q}\) is the manifold parametrised by \(\theta\).

\[ \begin{align} \arg\max_{q \in \mathcal{Q}} \mathcal{L} &= \arg\max_{q \in \mathcal{Q}} \frac{1}{N} \sum_{i=1}^N \log q(x_i) \\ &\approx \arg\max_{q \in \mathcal{Q}} \mathbb{E}_p[\log q] \\ &= \arg\min_{q \in \mathcal{Q}} -\mathbb{E}_p[\log q] \\ &= \arg\min_{q \in \mathcal{Q}} \big(\mathbb{E}_p[\log p] - \mathbb{E}_p[\log q]\big) \\ &= \arg\min_{q \in \mathcal{Q}} D(p || q) \end{align} \]

Therefore, in the limit of large \(N\), the MLE distribution is the M-projection of the ground truth \(p\) onto the manifold of our parametrised model \(\mathcal{Q}\). If we choose this manifold to describe an exponential family, then the expected feature values under our MLE distribution will match the sample averages exactly.
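A small numerical sketch of an M-projection, under assumptions of my own choosing: the "ground truth" \(p\) is an arbitrary distribution on \(\{0, \dots, 4\}\), and the model manifold is the binomial family with four trials, an exponential family with \(T(x) = x\). A brute-force grid search over the success probability minimises \(D(p || q)\), and the minimiser ends up matching the mean of \(p\).

```python
import math

def kl(p, q):
    """D(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial(n, pi):
    """Binomial(n, pi) probabilities for x = 0..n."""
    return [math.comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]

p = [0.05, 0.15, 0.30, 0.30, 0.20]  # hypothetical ground-truth distribution
n = len(p) - 1
mean_p = sum(x * px for x, px in enumerate(p))

# Grid search over the single model parameter (illustrative, not efficient).
grid = [k / 10000 for k in range(1, 10000)]
pi_star = min(grid, key=lambda pi: kl(p, binomial(n, pi)))
mean_q = n * pi_star

print(f"E_p[x] = {mean_p:.4f}, E_q[x] = {mean_q:.4f}")
```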

I-Projection and MaxEnt

Sometimes we don’t have any individual data points at all, but we may have some moments which we want our model to preserve. We’ll get to the thermodynamics later, but take a classic thermal reservoir, where we can talk about the internal energy \(U\), a macroscopic average. The data points would constitute individual samples of energy which (suppose, for example, that fluctuations are extremely fast) may be beyond the resolution of our measuring instrument (a simple thermometer). Anyhow, what to do?

If we only have moments as prior knowledge, and we know that unfortunately the moments of a distribution do not uniquely characterise a distribution, it might be difficult to choose a parametric model. So we don’t. Instead we produce the most unbiased distribution we can subject to these constraints: the maximum entropy (MaxEnt) distribution.

Let \(\alpha_i = \mathbb{E}[T_i(x)] = \sum_{x \in \mathcal{X}} q(x) T_i(x)\) be the fixed values for our moments.

\[ \begin{align} H_{\theta} &= -\sum_{x \in \mathcal{X}} q(x) \log q(x) + \sum_i \theta_i\Big(\alpha_i - \sum_{x \in \mathcal{X}} q(x) T_i(x)\Big) + \beta\Big(1 - \sum_{x \in \mathcal{X}} q(x)\Big)\\ q_{\theta} &= \arg\max_{q} H_{\theta} \\ &= \arg\min_{q} \big[-H_{\theta} + H(q, u)\big] \text{ (u is the uniform distribution, so } H(q, u) \text{ is a constant)} \\ &= \arg\min_{q} D_{\theta}(q || u) \\ &= \frac{1}{Z}\exp\Big[-\sum_j \theta_j T_j(x)\Big] \end{align} \]

This is an I-projection. We are looking for the distribution \(q\) which is closest to the uniform distribution (under our constraints). The convenient thing about this formalism is the ease of introducing a non-uniform prior. If we have a non-uniform prior we replace the \(u\) in the KL-Divergence with our prior \(p\). Before actually doing this, let’s just convince ourselves this is in fact valid:

\[ \begin{align} -D(q || p) = H(q) - H(q, p) \end{align} \]

We want to maximise the entropy of \(q\) while keeping the cross-entropy between \(q\) and our prior \(p\) small. This corresponds to maximising \(-D(q || p)\) or equivalently minimising \(D(q || p)\). By Gibbs’ inequality, \(H(q, p) \geq H(q)\) with equality only at \(q = p\), so without any constraints we would simply recover the prior. With the moment constraints imposed, the result is:

\[ \begin{align} q_{\theta} &= \arg\min_{q} D_{\theta}(q || p) \\ &= \frac{1}{Z}\exp[-\mathbf{\theta} \cdot \mathbf{T}(x)] p(x) \end{align} \]

This is an exponential family where the base measure \(h(x)\) is our prior distribution and the features are our imposed moments.
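To see the I-projection in action, here is a sketch of the classic loaded-die exercise: find the distribution over the faces of a die that is closest to a prior while having a prescribed mean. The solution has the form \(q(x) \propto e^{-\theta x} p(x)\), so only the scalar \(\theta\) needs to be found, here by simple bisection. The target means, the non-uniform prior and all function names are illustrative assumptions.

```python
import math

def tilted(theta, prior, values):
    """q(x) proportional to exp(-theta * x) * prior(x)."""
    weights = [math.exp(-theta * x) * px for x, px in zip(values, prior)]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(q, values):
    return sum(x * qx for x, qx in zip(values, q))

def solve_theta(target, prior, values, lo=-5.0, hi=5.0, iterations=100):
    """Bisection on theta; the mean of the tilted distribution falls as theta grows."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid, prior, values), values) > target:
            lo = mid  # mean still too high, so theta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]

# Uniform prior: the plain MaxEnt distribution with mean 4.5.
uniform = [1 / 6] * 6
theta_u = solve_theta(4.5, uniform, faces)
q_u = tilted(theta_u, uniform, faces)

# Non-uniform (hypothetical) prior: same machinery, different base measure.
prior = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
theta_p = solve_theta(3.0, prior, faces)
q_p = tilted(theta_p, prior, faces)

print(theta_u, [round(x, 4) for x in q_u])
print(theta_p, [round(x, 4) for x in q_p])
```

Because the target mean of 4.5 exceeds the uniform mean of 3.5, the solved \(\theta\) comes out negative and the probabilities increase with the face value.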

What does Bayes think?

I’ve now used the word prior enough times to warrant a quick detour into the Bayesian perspective on things. My treatment thus far has been purely frequentist: any prior knowledge has been encoded as a “target” distribution we minimise distance to, not as a Bayesian prior. This isn’t a discussion on Bayesian inference so I’ll keep the example brief.

Consider the standard exercise of estimating the MAP parameter(s) \(\hat{\theta}_{MAP}\) and the associated distribution in the limit of a large dataset:

\[ \begin{align} \hat{\theta}_{MAP} &= \arg\max_{\theta} \lim_{N \rightarrow \infty} \Big[\sum_{i=1}^N \log p(x_i | \theta) + \log p(\theta)\Big] \\ &= \arg\max_{\theta} \lim_{N \rightarrow \infty} \sum_{i=1}^N \log p(x_i | \theta) \\ &= \hat{\theta}_{MLE} \end{align} \]

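The prior term can be dropped because it stays fixed while the log-likelihood sum grows with \(N\) (assuming the prior is non-zero around the optimum). But we know that the MLE distribution (also in the limit of large \(N\)) corresponds to an M-projection from the ground truth onto the manifold of our parametrisation. So in the limit of large data, the MAP estimator converges to the MLE (nothing new) which in turn corresponds to a projection (new bit).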
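As a concrete (and assumed) instance, take a Bernoulli likelihood with a Beta\((a, b)\) prior, where both estimators have closed forms. The sketch below holds the observed success frequency at 0.3 and grows \(N\); the prior hyperparameters are arbitrary.

```python
def mle(k, n):
    """Maximum likelihood estimate of the Bernoulli parameter from k successes in n trials."""
    return k / n

def map_estimate(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior (valid when both shape parameters exceed 1)."""
    return (k + a - 1) / (n + a + b - 2)

a, b = 5.0, 2.0  # hypothetical prior hyperparameters, deliberately far from the data
gaps = []
for n in (10, 100, 10_000, 1_000_000):
    k = 3 * n // 10  # keep the observed frequency fixed at 0.3
    gap = abs(map_estimate(k, n, a, b) - mle(k, n))
    gaps.append(gap)
    print(f"N = {n:>9}: MLE = {mle(k, n):.6f}, MAP = {map_estimate(k, n, a, b):.6f}, gap = {gap:.2e}")
```

The gap shrinks roughly like \(1/N\), which is the prior term being swamped by the likelihood.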

Thermodynamics

Let’s derive the canonical partition function through the MaxEnt principle. We have a thermal system with internal energy \(U\) (an average over the energies of all microstates). We are interested in the probability of the system being in the \(i^{th}\) microstate, which we call \(q_i\).

\[ \begin{align} H_{\beta} &= -\sum_i q_i \log q_i + \beta(U - \sum_i q_i E_i) + \gamma(1 - \sum_i q_i) \\ q_i &= \arg\max_{q} H_{\beta} = \frac{1}{Z} e^{-\beta E_i} \\ Z & = \sum_i e^{-\beta E_i} \end{align} \]

Of course this \(\beta\) parameter is the familiar \(1/k_BT\). Annoyingly for the pure statisticians, to obtain this relation we do need to actually invoke some physics, albeit not much.

The first law of thermodynamics (for reversible changes) states that \(dU = TdS - pdV\), from which we derive that \(\frac{1}{T} = (\frac{\partial S}{\partial U})_V\). I derived the physical entropy \(S\) in my previous blog post.

\[ \begin{align} S &= -k_B \sum_i q_i \log q_i \\ &= -k_B \sum_i q_i (-\log Z - \beta E_i) \\ &= k_B \log Z + \beta k_B U \\ \frac{\partial S}{\partial U} &= \beta k_B = 1/T \\ \beta &= 1/k_BT \end{align} \]

What we did here was an I-projection with uniform prior while fixing the expected value of energy to \(U\). As we noted earlier, the result is a member of an exponential family, which means that the gradient of the log partition function recovers our energy: \(U = -\frac{\partial \log Z}{\partial \beta}\), with the minus sign coming from the \(-\beta E_i\) in the exponent - a common result in thermodynamics.
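These relations are easy to check numerically. The sketch below assumes a toy system with three microstates of energies 0, 1 and 2 in arbitrary units with \(k_B = 1\) (my own choices), and verifies that \(U = -\partial \log Z / \partial \beta\), that \(S = \log Z + \beta U\), and that \(\partial S / \partial U = \beta\) when \(\beta\) is varied.

```python
import math

energies = [0.0, 1.0, 2.0]  # hypothetical microstate energies, units where k_B = 1

def log_Z(beta):
    return math.log(sum(math.exp(-beta * E) for E in energies))

def boltzmann(beta):
    """Canonical probabilities q_i = exp(-beta * E_i) / Z."""
    lz = log_Z(beta)
    return [math.exp(-beta * E - lz) for E in energies]

def internal_energy(beta):
    return sum(q * E for q, E in zip(boltzmann(beta), energies))

def entropy(beta):
    return -sum(q * math.log(q) for q in boltzmann(beta))

beta, step = 0.7, 1e-5

# U from the probabilities versus U from the log partition function.
U = internal_energy(beta)
U_from_Z = -(log_Z(beta + step) - log_Z(beta - step)) / (2 * step)

# Entropy from its definition versus the closed form log Z + beta * U.
S = entropy(beta)
S_formula = log_Z(beta) + beta * U

# dS/dU along the family, estimated by varying beta on both sides.
dS = entropy(beta + step) - entropy(beta - step)
dU = internal_energy(beta + step) - internal_energy(beta - step)
dS_dU = dS / dU

print(U, U_from_Z)
print(S, S_formula)
print(dS_dU, beta)
```

The last check works even though \(\log Z\) itself depends on \(\beta\): the extra terms cancel precisely because \(\partial \log Z / \partial \beta = -U\).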

The End

I’ve necessarily been brief and less rigorous in places because I think what’s most important to take from the study of these things is the intuition of what the framework is telling you, the maths can sometimes get in the way. For more rigorous treatments I recommend any of the following: