Statistical Inference on the Circle
This is a mathematical digression into statistical inference on the circle, motivated by a strange mix of things:
In learning group representations with machine learning I was exposed to the von Mises probability distribution defined on the circle. Practically this distribution is quite effective and efficient to work with. Mathematically, though, its definition depends on an embedding of the circle in \(\mathbb{R}^2\) rather than only on intrinsic properties of \(S^1\). From a geometric perspective we would ideally work with mathematical objects on the circle that depend only on its intrinsic structure. Thus I got totally distracted and ended up on this tangent.
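For reference, the von Mises density is
\[\begin{equation} p(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x - \mu)}}{2\pi I_0(\kappa)} \end{equation}\]
where \(I_0\) is a modified Bessel function; the \(\cos(x - \mu)\) in the exponent is the inner product of the embedded points \((\cos x, \sin x)\) and \((\cos \mu, \sin \mu)\), which is exactly where the embedding in \(\mathbb{R}^2\) enters.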
Mean values are closely tied to \(\mathbb{R}\) and its arithmetic. When we step away from the real line to the circle, the natural operation is somewhat different. The circle is a group under addition mod 1 (if we give it coordinates \([0,1]\)). What is the natural analog of the mean value then? Is this even a meaningful question?
The second point relates to an interest that extends far beyond the circle: in many cases the mean and variance of a distribution do not give an accurate picture of it. This is common in 'robust statistics', where one attempts to estimate properties of a distribution in the presence of outliers or samples from some other polluting distribution. Such cases push me to consider where moments (mean, variance, ...) fail to effectively or efficiently describe a distribution.
What is Natural?
My initial response to the von Mises distribution was 'that's not a natural object on the circle'. But what would a 'natural' analog of the Normal (Gaussian) distribution be on other sample spaces? The Normal distribution has many unique and desirable properties; which of these properties we want to maintain will define a 'Normal' distribution on other sample spaces.
Properties of Normal Distribution
The Normal distribution is the unique absolutely continuous distribution on \(\mathbb{R}\) whose cumulants beyond the first two (i.e., other than the mean and variance) are zero.
It is also the unique continuous distribution with the maximum entropy for a specified mean and variance.
The Normal distribution belongs to the family of stable distributions, which are the attractors of sums of independent, identically distributed random variables, whether or not the mean or variance is finite.
It is invariant under the Fourier transform (more precisely, the standard Normal density is an eigenfunction of the Fourier transform).
The Normal distribution is stable in the sense that the space of Normal distributions is closed under convolution. (This does not uniquely specify it.)
It solves the diffusion or heat equation. Writing the density of the Normal distribution as a function of \(t = \sigma^2\),
\[\begin{equation} f(x | \mu, t) = \frac{1}{\sqrt{2\pi t}} e^{-(x-\mu)^2 / 2 t} \end{equation}\]
we have
\[\begin{equation} \partial_t f(x | \mu, t) = \frac{1}{2} \partial_x^2 f(x | \mu, t) \end{equation}\]
so the density solves the diffusion equation, and as \(t \rightarrow 0\) it approaches the delta function \(\delta(x - \mu)\).
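A quick numerical sanity check of this property (a minimal numpy sketch; the evaluation point and step size are arbitrary choices):

```python
import numpy as np

# Finite-difference check that the Normal density f(x | mu, t), with t = sigma^2,
# satisfies the heat equation d_t f = (1/2) d_x^2 f at an arbitrary point.
def f(x, mu, t):
    return np.exp(-(x - mu) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

x, mu, t, h = 0.7, 0.0, 1.3, 1e-4
dt_f = (f(x, mu, t + h) - f(x, mu, t - h)) / (2 * h)                    # time derivative
dxx_f = (f(x + h, mu, t) - 2 * f(x, mu, t) + f(x - h, mu, t)) / h**2    # second space derivative
print(dt_f, 0.5 * dxx_f)  # the two values should agree to several decimal places
```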
Moment Based Definitions
- Finitely many non-zero cumulants.
- Maximum Entropy distribution with given moments.
Each of these requires a definition of the expectation value. Expectation values are defined if our sample space is a vector space (or a Banach space, where we can use the Bochner integral), or if we choose an embedding of our sample space into a vector space.
The von Mises distribution falls into the latter case: it is the maximum entropy distribution on the circle once we fix the expectations of the embedding coordinates \(\cos x\) and \(\sin x\). If the problem at hand provides a reasonable choice of functions \(f_i : S^1 \rightarrow \mathbb{R}\), we can construct other maximum entropy distributions by fixing those moments, as in the sketch below.
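A minimal numpy sketch of the embedding-based approach (the sample values are arbitrary): with the moments of \(\cos x\) and \(\sin x\) fixed, the natural location estimate is the direction of the average embedded point in \(\mathbb{R}^2\), an 'extrinsic' mean that exists only because we chose the embedding.

```python
import numpy as np

# Embedding-based ("extrinsic") mean on the circle: average the embedded points
# (cos x, sin x) in R^2 and take the direction of the result.
def embedded_mean_direction(samples):
    c, s = np.mean(np.cos(samples)), np.mean(np.sin(samples))
    return np.arctan2(s, c) % (2 * np.pi)

x = np.array([0.1, 6.2, 0.3])            # samples clustered around 0 (i.e. around 2*pi)
print(embedded_mean_direction(x))         # close to 0, unlike the naive average of about 2.2
```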
Fourier Transform
A notion of Fourier transform is available for many sample spaces that are also groups. But aside from \(\mathbb{R}^n\), the Fourier transform has a different domain than the original sample space (for the circle it maps functions on \(S^1\) to sequences on \(\mathbb{Z}\)), so talking about something being an eigenfunction of the Fourier transform is right out.
Stability
Stability under a binary operation depends on the choice of binary operation. The circle, though, has a natural binary operation that makes it into a group.
For \(S^1 = [0,1]\) let \(x+y\) be the sum modulo 1.
What are the distributions such that, given two independent circle-valued random variables \(X\) and \(Y\), the distribution of the sum \(x+y\) is in the same family? Consider two distributions on the circle with densities \(\rho_X(x), \rho_Y(y)\) and their respective Fourier coefficients \(\rho_X(n), \rho_Y(n)\) for \(n \in \mathbb{Z}\). The density of the sum is the convolution, whose Fourier coefficients are the products of the original coefficients:
\[\begin{equation} \rho(z) = \int_{S^1} \rho_X(x) \rho_Y(z - x) \, dx = \sum_{n} e^{i 2\pi n z} \rho_X(n) \rho_Y(n) \end{equation}\]
So convolution multiplies Fourier coefficients, and any family of densities whose Fourier coefficients have the form
\[\begin{equation} \rho(n) = \exp\left( \sum_i \alpha_i f_i(n) \right) \end{equation}\]
is stable in the sense that convolution preserves the family (it simply adds the \(\alpha_i\)).
We must further require that \(\rho(x) = \sum_n e^{i 2\pi n x} \rho(n)\) is non-negative for all \(x\); equivalently, the coefficients \(\rho(n)\) must be the Fourier coefficients of a non-negative function. The simplest viable form is \(\rho(n) = \exp(-\alpha |n|^\beta)\). These are, unsurprisingly, related to the family of stable distributions on \(\mathbb{R}\).
On \(\mathbb{R}\) the Normal distribution is the unique member of the family of stable distributions with finite variance. On the circle that distinguishing property vanishes, and there is an even wider range of admissible values for \(\beta\) than on \(\mathbb{R}\).
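The closure under convolution is easy to check numerically. A sketch (numpy; the parameter values are arbitrary illustrative choices): convolving two members of the family with the same \(\beta\) should reproduce the member with \(\alpha_1 + \alpha_2\), since convolution multiplies the Fourier coefficients.

```python
import numpy as np

# Densities on the circle with Fourier coefficients exp(-alpha * |n|**beta).
def density(x, alpha, beta, n_max=60):
    n = np.arange(1, n_max + 1)
    return (1 + 2 * np.cos(np.outer(x, n)) @ np.exp(-alpha * n ** beta)) / (2 * np.pi)

M = 1024                                    # uniform grid on [0, 2*pi)
x = np.linspace(0, 2 * np.pi, M, endpoint=False)
alpha1, alpha2, beta = 0.3, 0.7, 1.5
conv = np.fft.ifft(np.fft.fft(density(x, alpha1, beta)) *
                   np.fft.fft(density(x, alpha2, beta))).real * (2 * np.pi / M)
direct = density(x, alpha1 + alpha2, beta)
print(np.max(np.abs(conv - direct)))        # ~ machine precision: closed under convolution
```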
Central Limit Theorem
There is a slight challenge here. A binary operation is not the full story for the central limit theorem. Given just a commutative binary operation and a set of samples we can construct the sum \(x_1 + x_2 + \dots\), but we also require something like the mean, i.e. a solution of \(n \bar{x} = x_1 + x_2 + \dots + x_n\). For the real numbers this has a unique solution; on the circle there are \(n\) possible solutions, so a 'mean' is not uniquely defined. The non-uniqueness here is the same as the non-uniqueness of the \(n\)-th root of a complex number, and it reappears when attempting to do statistical inference on the circle below.
Diffusion Equation \(\rightarrow\) Theta Function
The diffusion equation and the delta function (distribution) can be defined far and wide. On the circle we have the diffusion equation
\[\begin{equation} \partial_t f(x,t) = \partial_x^2 f(x,t) \end{equation}\]
which is solved by a theta function. The function
\[\begin{equation} \Theta(x - \mu, it) = \sum_n e^{-n^2 t} e^{i n (x - \mu)} \end{equation}\]
is a probability distribution on the circle that solves the diffusion equation
\[\begin{equation} \partial_t \Theta(x - \mu,it) = \sum_n -n^2 e^{-n^2 t} e^{i n (x - \mu)} = \partial_x^2 \Theta(x - \mu, it) \end{equation}\]
and whose limit as \(t \rightarrow 0\) is \(\delta(x - \mu)\) (delta function defined on the circle).
The theta function has a wide array of interesting (and wild) mathematical properties. Within statistics this distribution is usually called the Wrapped Normal distribution, part of a wider family of distributions obtained by 'wrapping' a distribution on the real line around the circle.
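A numerical check of this identification (a numpy sketch; the evaluation point and width are arbitrary): by Poisson summation the Fourier series above, divided by \(2\pi\), equals the wrapped Normal density with variance \(\sigma^2 = 2t\).

```python
import numpy as np

# Theta as a Fourier series vs. the "wrap a Gaussian around the circle" construction.
def theta_series(y, t, n_max=50):
    n = np.arange(-n_max, n_max + 1)
    return np.sum(np.exp(-n ** 2 * t) * np.cos(n * y))      # imaginary parts cancel pairwise

def wrapped_normal(y, t, k_max=50):
    k = np.arange(-k_max, k_max + 1)
    sigma2 = 2 * t
    return np.sum(np.exp(-(y + 2 * np.pi * k) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

y, t = 1.2, 0.3
print(theta_series(y, t) / (2 * np.pi), wrapped_normal(y, t))   # the two values should match
```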
Of the properties surveyed above, the diffusion equation gives the cleanest generalization: it requires only a Laplacian, which is intrinsic to any Riemannian manifold, and it makes no reference to an embedding of the circle in \(\mathbb{R}^2\). I am biased towards this criterion because it singles out a unique candidate for the 'Normal' distribution on the circle, and more generally in any setting where the diffusion equation can be written down (a Riemannian manifold, for instance).
So with this we can consider the probability distribution
\[\begin{equation} P(x | \mu, t) = \Theta(x - \mu, it) \end{equation}\]
Statistical Inference with the Theta Distribution
Given a probability distribution from this family, denoting \(\theta(t, x - \mu) = \Theta(x - \mu, it)\), with
- the maximum of the density located at \(\mu\),
- the width of the density set by \(t\),
it is natural to ask how we estimate these parameters from a discrete set of samples. The problem of estimating the maximum of the posterior distribution alone is quite interesting. Suppose we have a set of \(N\) samples \(x_n\) drawn from a distribution with density \(\theta(t, x - \mu)\) with known \(t\) and unknown \(\mu\), and we want a best estimate of \(\mu\). For the Normal distribution the maximum likelihood estimate of \(\mu\) coincides with the mean, so this provides a nice concrete example of how the notion of a 'mean value' extends beyond \(\mathbb{R}\).
Before delving too far into the details let's see where intuition leads. Assuming the ML estimate behaves something like the mean value, we expect \(N \mu = \sum_n x_n\). For the reals this has a unique solution, but on the circle we have
\[\begin{equation} \mu = \frac{1}{N} \sum_n x_n + \frac{2\pi r}{N} \;\;\text{for}\;\; r \in \{0, 1, \dots, N-1\} \end{equation}\]
There are \(N\) possible 'mean values'. (This is quite generic, especially in compact groups.) We thus expect that the likelihood will depend on the full set of these mean values, with some additional mechanism that selects between them; the sketch below illustrates this.
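A brute-force sketch of that selection mechanism (numpy; the width, sample size, and sampling scheme are my own illustrative choices, not part of the derivation below): evaluate the likelihood at each of the \(N\) candidate mean values and compare.

```python
import numpy as np

# With t known, evaluate the likelihood L(mu) = prod_n theta(t, x_n - mu)
# at each candidate mean mu_r = mean(x_n) + 2*pi*r/N.
def theta_density(y, t, n_max=50):
    n = np.arange(-n_max, n_max + 1)
    return np.exp(-n ** 2 * t) @ np.cos(np.outer(n, y))     # truncated Fourier series

rng = np.random.default_rng(0)
t, N, mu_true = 0.2, 5, 2.0
x = np.mod(mu_true + np.sqrt(2 * t) * rng.standard_normal(N), 2 * np.pi)   # wrapped-Normal samples

candidates = np.mod(np.mean(x) + 2 * np.pi * np.arange(N) / N, 2 * np.pi)
log_lik = [np.sum(np.log(theta_density(x - mu, t))) for mu in candidates]
print(candidates)
print(log_lik)    # the largest entry marks the candidate closest to the ML estimate
```

For a narrow density the winning candidate lands near the true \(\mu\); the remaining candidates are the other branches, which the full likelihood keeps as additional (smaller) modes.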
Already at 2 samples the problem is non-trivial (without prior knowledge of theta functions). After applying the right magic identity for products of theta functions, much rearranging and mangling of summations, or an elegant argument I haven't found yet… we find that the product of 2 theta densities with the same width \(t\) is
\[\begin{align} \theta(t, x_1 - \mu) \theta(t, x_2 - \mu) = \frac{1}{2} \sum_{r=0}^{1} & \theta\left( t/2, \frac{x_1-x_2}{2} + r\pi \right) \\ & \times \theta\left( t/2, \frac{ x_1 + x_2 }{ 2 } + r\pi - \mu \right) \end{align}\]
So the product of the densities reduces to a linear combination of products of 2 \(\theta\) functions where only one of them depends on \(\mu\). As expected, each term is peaked at one of the possible mean values.
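A quick numerical check of this two-sample identity (a numpy sketch with a truncated Fourier series and arbitrary values of \(t, x_1, x_2, \mu\)):

```python
import numpy as np

# Verify: theta(t, x1-mu) * theta(t, x2-mu)
#   = 1/2 * sum_{r=0,1} theta(t/2, (x1-x2)/2 + r*pi) * theta(t/2, (x1+x2)/2 + r*pi - mu)
def theta(t, y, n_max=60):
    n = np.arange(-n_max, n_max + 1)
    return np.sum(np.exp(-n ** 2 * t) * np.cos(n * y))

t, x1, x2, mu = 0.4, 0.9, 2.6, 1.1
lhs = theta(t, x1 - mu) * theta(t, x2 - mu)
rhs = 0.5 * sum(theta(t / 2, (x1 - x2) / 2 + r * np.pi) *
                theta(t / 2, (x1 + x2) / 2 + r * np.pi - mu) for r in range(2))
print(lhs, rhs)   # the two values should agree
```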
As we bring in more samples, at each stage we are simplifying products of the form (with \(x'\) one of the candidate mean values built from the first \(N-1\) samples)
\[\begin{equation} \theta(t, x_N - \mu) \theta(t/(N-1), x' - \mu) \end{equation}\]
Using the product formula below we have
\[\begin{equation} \frac{1}{N} \sum_{r=0}^{N-1} \theta\left(\frac{t}{N(N-1)}, \frac{x' - x_N + 2\pi r}{N}\right) \theta\left(t/N, \frac{(N-1)x' + x_N + 2\pi r}{N} - \mu\right) \end{equation}\]
We get a linear combination of \(\theta\) densities each peaked at a different possible mean value for our samples.
The parameter estimation above relies on identities involving products of theta functions. Most importantly, if \(t\) and \(t'\) are commensurate, i.e. \(t'/t = a/b\) for positive integers \(a, b\), then
\[\begin{align*} \theta(t,x) \theta(t',x') =& \frac{1}{a+b} \sum_{r=0}^{a+b-1} \\ & \theta\left(\frac{t}{b(a+b)}, \frac{x - x' + 2\pi r}{a+b}\right) \\ & \cdot \theta\left(\frac{t t'}{t+t'}, \frac{a x + b x' + 2\pi a r}{a+b}\right) \end{align*}\]
To prove this we start with the Fourier representation, which converts the product into a double sum where the two theta functions are decoupled.
\[\begin{equation} \theta(t,x) \theta(t', x') = \sum_{n,m} e^{-n^2 t - m^2 t' + inx + imx'} \end{equation}\]
The goal is a change of variables that separates the weighted mean \(\bar{x} = (ax + bx')/(a+b)\) from the difference \(d = (x-x')/(a+b)\) in the phase. With hindsight, define \(q = n+m\) and \(p = bn - am\). With this choice the exponent becomes
\[\begin{align} -(n^2 + m^2 a/b) t + inx + imx' &= - q^2 \frac{a t}{a+b} - p^2 \frac{t}{b(a+b)} \\ &\quad + i q\frac{a x + b x'}{a+b} + i p \frac{x - x'}{a+b} \end{align}\]
We can put this all together in a suggestive form
\[\begin{equation} \theta(t,x) \theta(t',x') = \sum_{p,q} e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}\]
for
\[\begin{align} s &= \frac{t}{b(a+b)} \\ s' &= \frac{at}{a+b} \\ \bar{x} &= \frac{a x + b x'}{a+b} \\ d &= \frac{x-x'}{a+b} \end{align}\]
This is very nearly again the product of 2 theta functions, where one of the theta functions depends very clearly on a weighted average of \(x\) and \(x'\). Why nearly? If we call the mapping \((n,m) \rightarrow (q,p) = (n+m, bn - am)\) \(\phi\), then the sum ranges over \(\phi(\mathbb{Z}^2)\), not all of \(\mathbb{Z}^2\). The mapping \(\phi\) is injective but NOT surjective. We can see this most clearly by looking at the inverse mapping
\[\begin{equation} \phi^{-1}(q,p) = \left(\frac{aq+p}{a+b}, \frac{bq-p}{a+b}\right) \end{equation}\]
so the image of a generic pair of integers \((q,p)\) under the inverse mapping will be rational rather than integral. A pair \((q,p)\) is in the image of \(\phi\) iff \(aq+p \equiv 0 \;\mathrm{mod}\; (a+b)\) and \(bq-p \equiv 0 \;\mathrm{mod}\; (a+b)\); since \((aq+p) + (bq-p) = (a+b)q\), these two conditions are equivalent.
To get a sum over all integers, rather than just the image \(\phi(\mathbb{Z}^2)\), we introduce discrete delta functions that enforce the membership condition (here \(\delta(k) = 1\) when \(k \equiv 0 \;\mathrm{mod}\; (a+b)\) and \(0\) otherwise), then expand them as Fourier series over \(\mathbb{Z}/(a+b)\mathbb{Z}\)
\[\begin{equation} \theta(t,x) \theta(t',x') = \sum_{p,q \in \mathbb{Z}} \delta(aq+p) \delta(bq-p) e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}\]
Now the sums range over all of \(\mathbb{Z}^2\), at the cost of two delta functions. Expanding each as a sum of exponentials over \(\mathbb{Z}/(a+b)\mathbb{Z}\)
\[\begin{align} \theta(t,x) \theta(t',x') = \frac{1}{(a+b)^2} \sum_{p,q \in \mathbb{Z}} \sum_{r,r'} &\; e^{i 2\pi r (aq+p)/(a+b)}\, e^{i 2\pi r' (bq-p)/(a+b)} \\ &\cdot e^{-p^2 s - q^2 s'}\, e^{i p d + i q \bar{x}} \end{align}\]
The double sum over \(r, r'\) depends only on the difference \(r - r'\), so one sum is free and gives a factor of \((a+b)\). Using \(a \equiv -b \;\mathrm{mod}\; (a+b)\) to combine the phase factors
\[\begin{equation} \theta(t,x) \theta(t',x') = \frac{1}{a+b} \sum_{p,q \in \mathbb{Z}} \sum_{r=0}^{a+b-1} e^{i 2\pi r (aq+p)/(a+b)} e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}\]
And everything conveniently reduces to a linear combination of products of theta functions.
\[\begin{equation} \theta(t,x) \theta(t',x') = \frac{1}{a+b} \sum_{r=0}^{a+b-1} \theta\left(s,\, d + \frac{2\pi r}{a+b}\right) \theta\left(s',\, \bar{x} + \frac{2\pi a r}{a+b}\right) \end{equation}\]
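As a sanity check on the final formula, a numpy sketch with arbitrary choices of \(a, b, t, x, x'\):

```python
import numpy as np

# Verify the general product identity for a commensurate pair t'/t = a/b.
def theta(t, y, n_max=80):
    n = np.arange(-n_max, n_max + 1)
    return np.sum(np.exp(-n ** 2 * t) * np.cos(n * y))

a, b, t = 2, 3, 0.5
tp = a / b * t                                  # t' with t'/t = a/b
x, xp = 1.3, 2.1
s, sp = t / (b * (a + b)), a * t / (a + b)      # widths from the change of variables
d, xbar = (x - xp) / (a + b), (a * x + b * xp) / (a + b)

lhs = theta(t, x) * theta(tp, xp)
rhs = sum(theta(s, d + 2 * np.pi * r / (a + b)) *
          theta(sp, xbar + 2 * np.pi * a * r / (a + b)) for r in range(a + b)) / (a + b)
print(lhs, rhs)   # should agree up to truncation error
```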
Conclusion
The diffusion equation gives a clean, intrinsic definition of “normal” on the circle — the theta function — without requiring an embedding in \(\mathbb{R}^2\).
The statistical inference problem reveals something more interesting. The posterior for \(\mu\) given \(N\) samples is not a single theta distribution but a linear combination of \(N\) theta distributions, each peaked at a different candidate mean value. This is not a failure of the method — it is the correct answer. The circle genuinely has \(N\) distinct solutions to \(N\mu = \sum x_n\), and the likelihood reflects all of them. The non-uniqueness of the mean flagged in the motivation is not a nuisance to be resolved but a structural feature that the posterior faithfully represents.
This also answers the question left open by the moment-based approach: on the circle, the ML estimate of \(\mu\) does not reduce to a simple sample mean, because no unique sample mean exists. The posterior is the honest replacement.
A natural extension is to other Riemannian manifolds, where the Laplacian is still intrinsically defined and the diffusion equation still picks out a canonical family of distributions. The inference problem will generically exhibit the same multi-modal posterior structure wherever the manifold’s topology obstructs a unique notion of mean.