Statistical Inference on the Circle

10 minute read

This is a mathematical digression into statistical inference on the circle, motivated by a strange mix of things:

  1. In learning group representations with machine learning I was exposed to the von Mises probability distribution defined on the circle. Practically this distribution appears quite effective and efficient to work with. Mathematically though, the definition of this probability distribution depends on the embedding of the circle in $\mathbb{R}^2$ as opposed to depending only on intrinsic properties of $S^1$. From a geometric perspective ideally we would work with mathematical objects on the circle that depend only on intrinsic properties of the circle. Thus I got totally distracted and ended up on this tangent.

  2. Mean values are closely tied to $\mathbb{R}$ and the arithmetic on it. When we step away from the real line to the circle, the natural operation is somewhat different. The circle is a group under addition mod 1 (if we give the circle coordinates $[0,1)$). What is the natural analog of the mean value then? Is this even a meaningful question?

  3. The second point relates to an interest that extends far beyond the circle - in many cases the mean and variance of a distribution do not give an accurate picture of the distribution. This is common in ‘robust statistics’ where one is attempting to estimate properties of a distribution in the presence of outliers or samples from some other polluting distribution. Such cases push me to consider where moments (mean, variance, …) fail to effectively or efficiently describe a distribution.

What is Natural?

My initial response to the von Mises distribution was ‘that’s not a natural object on the circle’. What would be the ‘natural’ analog of the Normal (Gaussian) distribution on other sample spaces though? The Normal distribution has many unique & desirable properties; which of these properties we want to maintain will define a ‘Normal’ distribution on other sample spaces.

Properties of Normal Distribution

  1. The normal distribution is the unique absolutely continuous distribution on $\mathbb{R}$ whose cumulants beyond the first two (i.e., other than the mean and variance) are zero.

  2. It is also the unique continuous distribution with the maximum entropy for a specified mean and variance.

  3. The Normal distribution belongs to the family of stable distributions, which are the attractors of sums of independent, identically distributed random variables, whether or not the mean or variance is finite.

  4. It is invariant under the Fourier transform (i.e., it is an eigenfunction of the Fourier transform).

  5. The Normal distribution is Stable in the sense that the space of Normal distributions is closed under convolution. (This does not uniquely specify it).

  6. It solves the diffusion or heat equation. Given

\begin{equation} f(x | \mu, t) = \frac{1}{\sqrt{2\pi t}} e^{-(x-\mu)^2 / 2 t} \end{equation}

the density of the Normal distribution written as a function of $t = \sigma^2$, we have

\begin{equation} \partial_t f(x | \mu, t) = \frac{1}{2} \partial_x^2 f(x | \mu, t) \end{equation}

so the density solves the diffusion equation, and as $t \rightarrow 0$ it approaches the delta function $\delta(x - \mu)$ (a quick numerical check follows).
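Here is a minimal numerical sketch of that claim using finite differences (plain NumPy; the values of $\mu$, $t$ and the grid are arbitrary):

```python
import numpy as np

# Gaussian density written as a function of t = sigma^2
def gaussian(x, mu, t):
    return np.exp(-(x - mu) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
mu, t, dt = 0.0, 1.0, 1e-5

f = gaussian(x, mu, t)
# d_t f by a centered difference in t, d_x^2 f by a centered second difference in x
df_dt = (gaussian(x, mu, t + dt) - gaussian(x, mu, t - dt)) / (2 * dt)
d2f_dx2 = (f[2:] - 2 * f[1:-1] + f[:-2]) / dx ** 2

# residual of d_t f = (1/2) d_x^2 f on the interior grid points
print(np.max(np.abs(df_dt[1:-1] - 0.5 * d2f_dx2)))  # small, limited by finite-difference error
```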

Moment Based Definitions

  • Finitely many non-zero cumulants.
  • Maximum Entropy distribution with given moments.

Each of these requires a definition of expectation value. Expectation values are defined if our sample space is a vector space (or a Banach space where we can define the Bochner integral), or if we choose an embedding of our sample space in a vector space.

The von Mises distribution falls into the latter case: its defining moments are $E[\cos x]$ and $E[\sin x]$, which come from embedding $S^1$ in $\mathbb{R}^2$. If there is a reasonable choice of functions $f_i : S^1 \rightarrow \mathbb{R}$ given by the problem at hand, we can construct other maximum entropy distributions fixing those moments.
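As a quick sketch of the embedding view (the values of $\kappa$ and $\mu$ below are arbitrary, and I lean on scipy's von Mises implementation): the von Mises distribution constrains the first circular moment $E[e^{ix}] = \frac{I_1(\kappa)}{I_0(\kappa)} e^{i\mu}$, which we can compare against a Monte Carlo estimate.

```python
import numpy as np
from scipy.stats import vonmises
from scipy.special import iv

kappa, mu = 2.0, 0.7  # arbitrary concentration and location
samples = vonmises(kappa, loc=mu).rvs(size=200_000, random_state=0)

# first circular moment: the embedded moment the max-entropy construction constrains
theory = iv(1, kappa) / iv(0, kappa) * np.exp(1j * mu)
empirical = np.mean(np.exp(1j * samples))
print(theory, empirical)  # agree to Monte Carlo accuracy
```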

Fourier Transform

A notion of Fourier transform is available for many sample spaces that are also groups. But aside from $\mathbb{R}^n$ the Fourier transform has a different domain than the original sample space, so talking about something being an eigenfunction of the Fourier transform is right out.

Stability

Stability under a binary operation depends on the choice of binary operation. The circle though has a natural binary operation that makes it into a group.

For $S^1 = [0,1)$ let $x+y$ be the sum modulo 1.

What are the distributions such that, given two independent circle-valued random variables $X$ and $Y$, the distribution of the sum $x+y$ is in the same family? Consider two distributions on the circle with densities $\rho_X(x), \rho_Y(y)$ and respective Fourier transforms $\rho_X(n), \rho_Y(n)$ for $n \in \mathbb{Z}$. The density of the sum $z = x+y$ is the convolution

\begin{equation} \rho_Z(z) = \int_0^1 \rho_X(x)\, \rho_Y(z - x)\, dx = \sum_{n} e^{i 2\pi n z} \rho_X(n) \rho_Y(n) \end{equation}

So any family of densities whose Fourier transforms have the form

\begin{equation} \rho(n) = \exp\left(\sum_i \alpha_i f_i(n)\right) \end{equation}

is stable in the sense that convolution preserves the family (convolution simply adds the coefficients $\alpha_i$).

We must further require that $\rho(x) = \sum_n e^{i 2\pi n x} \rho_X(n)$ is non-negative for all $x$, or equivalently that $\rho_X(n)$ is the Fourier transform of a non-negative function. The simplest viable form for the Fourier transform is $\rho(n) = \exp(-\alpha |n|^\beta)$. These are, unsurprisingly, related to the family of stable distributions on $\mathbb{R}$.
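A minimal sketch of this construction (the values of $\alpha$, $\beta$, the truncation and the grid are arbitrary): build the density on $[0,1)$ directly from the coefficients $\exp(-\alpha |n|^\beta)$ and check non-negativity and normalization.

```python
import numpy as np

def density_from_coeffs(x, alpha, beta, N=60):
    n = np.arange(-N, N + 1)
    coeffs = np.exp(-alpha * np.abs(n) ** beta)
    # rho(x) = sum_n rho_hat(n) e^{i 2 pi n x}; real because the coefficients are even in n
    return np.real(np.exp(2j * np.pi * np.outer(x, n)) @ coeffs)

x = np.linspace(0.0, 1.0, 2000, endpoint=False)
for beta in (0.5, 1.0, 2.0):
    rho = density_from_coeffs(x, alpha=0.3, beta=beta)
    # non-negativity on the grid, and integral over [0,1) ~ grid mean ~ 1
    print(beta, rho.min() >= 0, rho.mean())
```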

On $\mathbb{R}$ the Normal distribution is the unique member of the family of stable distributions with finite variance. On the circle that distinguishing property vanishes - and there is even a wider range of possible values for $\beta$ than on $\mathbb{R}$.

Central Limit Theorem

Slight challenge here. A binary operation is not the full story for the central limit theorem. Given just a commutative binary operation and a set of samples we can construct the sum $x_1 + x_2 + \ldots$. We also require something like the mean: $n \bar{x} = x_1 + x_2 + \ldots + x_n$. For the real numbers this has a unique solution; for the circle there are $n$ possible solutions, so a ‘mean’ is not uniquely defined. The non-uniqueness here is the same as the non-uniqueness of the $n$-th root of a complex number. This non-uniqueness of the mean re-appears when attempting to do statistical inference on the circle below.
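A tiny illustration of that non-uniqueness (arbitrary numbers, angles as coordinates on $[0, 2\pi)$): every candidate mean below reproduces the same sum, exactly as every $n$-th root of a unit complex number has the same $n$-th power.

```python
import numpy as np

s, n = 2.4, 4                       # stand-ins for the sum x_1 + ... + x_n and the sample count
z = np.exp(1j * s)
candidates = [(s + 2 * np.pi * r) / n for r in range(n)]
# each candidate mean, multiplied back up by n, lands on the same point of the circle
print([np.allclose(np.exp(1j * c) ** n, z) for c in candidates])  # all True
```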

Diffusion Equation $\rightarrow$ Theta Function

The diffusion equation and the delta function (distribution) can be defined far and wide. On the circle we have the diffusion equation

\begin{equation} \partial_t f(x,t) = \partial_x^2 f(x,t) \end{equation}

which is solved by a theta function. The density

\begin{equation} \Theta(x - \mu, it) = \sum_n e^{-n^2 t} e^{i n (x - \mu)} \end{equation}

is a probability distribution on the circle (after normalizing by $1/2\pi$) that solves the diffusion equation

\begin{equation} \partial_t \Theta(x - \mu,it) = \sum_n -n^2 e^{-n^2 t} e^{i n (x - \mu)} = \partial_x^2 \Theta(x - \mu, it) \end{equation}

and whose limit as $t \rightarrow 0$ is $\delta(x - \mu)$ (delta function defined on the circle).

The Theta function has a wide array of interesting / wild mathematical properties. Within statistics this distribution is often called the Wrapped Normal distribution, part of a wider family of distributions obtained by ‘wrapping’ a distribution on the real line around the circle.

I am biased towards this as it presents a unique candidate for the ‘Normal’ distribution on the circle, and more generally in any situation where we can define the diffusion equation (a Riemannian manifold for instance).

So with this we can consider the probability distribution

\begin{equation} P(x | \mu, t) = \Theta(x - \mu, it) \end{equation}
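To make this concrete, here is a small numerical sketch comparing the truncated theta series against literally wrapping a Gaussian around the circle (Poisson summation says they agree). The $1/2\pi$ normalization and the identification $\sigma^2 = 2t$ implied by $\partial_t f = \partial_x^2 f$ are my conventions here, and $\mu$, $t$ are arbitrary test values.

```python
import numpy as np

def theta_density(x, mu, t, M=50):
    # truncated Fourier series, normalized to integrate to 1 on [0, 2*pi)
    n = np.arange(-M, M + 1)
    return np.real(np.exp(1j * np.outer(x - mu, n)) @ np.exp(-n ** 2 * t)) / (2 * np.pi)

def wrapped_normal(x, mu, t, K=50):
    # sum of Gaussian copies (variance 2t) shifted by multiples of 2*pi
    k = np.arange(-K, K + 1)
    z = x[:, None] - mu + 2 * np.pi * k[None, :]
    return np.sum(np.exp(-z ** 2 / (4 * t)) / np.sqrt(4 * np.pi * t), axis=1)

x = np.linspace(0, 2 * np.pi, 500, endpoint=False)
print(np.max(np.abs(theta_density(x, 1.0, 0.3) - wrapped_normal(x, 1.0, 0.3))))  # ~ 0
```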

Statistical Inference with the Theta Distribution

Given a probability distribution from this family, denoting $\theta(t, x - \mu) = \Theta(x - \mu, it)$, with

  • the maximum of the density specified by $\mu$,
  • the width of the density specified by $t$,

it is natural to ask how we estimate these parameters given a discrete set of samples. The problem of estimating the maximum of the posterior distribution alone is quite interesting. Suppose we have a set of $N$ samples $x_n$ drawn from a distribution with density $\theta(t, x - \mu)$ with known $t$ and unknown $\mu$. We want to obtain a best estimate of $\mu$. For the Normal distribution the maximum likelihood estimate of $\mu$ coincides with the mean. This provides a nice concrete example of how the notion of a ‘mean value’ extends beyond $\mathbb{R}$.

Before delving too far into the details let's see where intuition leads. Assuming the ML estimate behaves something like the mean value, we expect $N \mu = \sum_n x_n$. For the reals this has a unique solution, but on the circle we have

\begin{equation} \mu = \frac{1}{N} \sum_n x_n + \frac{2\pi r}{N} \quad \text{for} \quad r \in \{0, 1, \ldots, N-1\} \end{equation}

There are $N$ possible ‘mean values’. (This is quite generic, especially in compact groups.) We thus expect that the likelihood will depend on the full set of these mean values, with some additional mechanism that selects between them.
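Before the algebra, a rough numerical sketch of this selection mechanism (the wrapped-Gaussian sampler and the values of $t$, $\mu$, $N$ are all arbitrary choices): enumerate the candidate means and score each by the log-likelihood under the truncated theta density.

```python
import numpy as np

def theta_density(x, mu, t, M=50):
    n = np.arange(-M, M + 1)
    return np.real(np.exp(1j * np.outer(np.atleast_1d(x) - mu, n)) @ np.exp(-n ** 2 * t)) / (2 * np.pi)

rng = np.random.default_rng(0)
t, mu_true, N = 0.2, 2.0, 5
# crude sampler: draw from a Gaussian with sigma^2 = 2t and wrap onto [0, 2*pi)
samples = np.mod(mu_true + np.sqrt(2 * t) * rng.standard_normal(N), 2 * np.pi)

candidates = np.mod(samples.sum() / N + 2 * np.pi * np.arange(N) / N, 2 * np.pi)

def log_likelihood(mu):
    return np.sum(np.log(theta_density(samples, mu, t)))

for mu in candidates:
    print(f"mu = {mu:.3f}   log-likelihood = {log_likelihood(mu):.2f}")
# the candidate nearest mu_true should carry the largest log-likelihood
```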

Already at 2 samples the problem is non-trivial (without prior knowledge of theta functions). After applying the right magic identity for products of theta functions, much rearranging and mangling of summations, or an elegant argument I haven’t found yet… we find that the product of 2 theta densities with the same width $t$ is

\begin{align} \theta(t, x_1 - \mu)\, \theta(t, x_2 - \mu) = \frac{1}{2} \sum_{r=0}^{1} & \theta\left( t/2, \frac{x_1-x_2}{2} + r\pi \right) \newline & \times \theta\left( t/2, \frac{ x_1 + x_2 }{ 2 } + r\pi - \mu \right) \end{align}

So the product of the densities reduces to a linear combination of products of 2 $\theta$ functions, where only one of them depends on $\mu$. As expected each term is peaked at one of the possible mean values.
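Here is a quick numerical spot-check of the identity (truncated Fourier sums; $x_1$, $x_2$, $\mu$, $t$ are arbitrary test values):

```python
import numpy as np

def theta(t, x, M=60):
    # theta(t, x) = sum_n exp(-n^2 t + i n x), truncated at |n| <= M
    n = np.arange(-M, M + 1)
    return np.real(np.sum(np.exp(-n ** 2 * t + 1j * n * x)))

x1, x2, mu, t = 0.7, 2.9, 1.3, 0.4
lhs = theta(t, x1 - mu) * theta(t, x2 - mu)
rhs = 0.5 * sum(
    theta(t / 2, (x1 - x2) / 2 + r * np.pi)
    * theta(t / 2, (x1 + x2) / 2 + r * np.pi - mu)
    for r in range(2)
)
print(lhs, rhs)  # agree to truncation / floating-point accuracy
```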

As we bring in more samples, at each stage we are simplifying products of the form

\begin{equation} \theta(t, x_N - \mu)\, \theta(t/(N-1), x' - \mu) \end{equation}

Using the product formula below we have

\begin{equation} \frac{1}{N} \sum_{r=0}^{N-1} \theta\left(\frac{t}{N(N-1)}, \frac{x' - x_N + 2\pi r}{N}\right) \theta\left(t/N, \frac{(N-1)x' + x_N + 2\pi r}{N} - \mu\right) \end{equation}

We get a linear combination of $\theta$ densities each peaked at a different possible mean value for our samples.

The parameter estimation above relies on identities involving products of theta functions. Most importantly, if $t$ and $t'$ are commensurate, i.e. $t'/t = a/b \in \mathbb{Q}$, then

\begin{align} \theta(t,x) \theta(t',x') =& \frac{1}{a+b} \sum_{r=0}^{a+b-1} \newline & \theta\left(\frac{t}{b(a+b)}, \frac{x - x' + 2\pi r}{a+b}\right) \newline & \cdot \theta\left(\frac{tt'}{t+t'}, \frac{a x + b x' + 2\pi a r}{a+b}\right) \end{align}

To prove this we start with the Fourier representation.

\begin{equation} \theta(t,x) \theta(t', x') = \sum_{n,m} e^{-n^2 t - m^2 t' + inx + imx'} \end{equation}

With the benefit of hindsight we can do a change of variables for the sum, defining $q = n+m$ and $p = bn - am$. With this choice the exponent becomes

\begin{equation} -(n^2 + m^2 a/b) t + inx + imx' = - q^2 \frac{a t}{a+b} - \frac{p^2 t}{b(a+b)} + i q\frac{a x + b x'}{a+b} + i p \frac{x - x'}{a+b} \end{equation}

We can put this all together in a suggestive form

\begin{equation} \theta(t,x) \theta(t',x') = \sum_{p,q} e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}

for

\begin{align} s &= \frac{t}{b(a+b)} \newline s' &= \frac{at}{a+b} \newline \bar{x} &= \frac{a x + b x'}{a+b} \newline d &= \frac{x-x'}{a+b} \end{align}

This is very nearly again the product of 2 theta functions, where one of the theta functions depends very clearly on a weighted average of $x$ and $x'$. Why nearly? If we call the mapping $(n,m) \rightarrow (n+m, bn - am)$ $\phi$, then the sum will range over $\phi(\mathbb{Z}^2)$. The mapping $\phi$ is injective but NOT surjective. We can see this most clearly by looking at the inverse mapping

\begin{equation} \phi^{-1}(p,q) = \left(\frac{aq+p}{a+b}, \frac{bq-p}{a+b}\right) \end{equation}

so the image of a generic pair of integers $p,q$ under the inverse mapping will be rational. A pair $(p,q)$ is in the image of $\phi$ iff $aq+p \equiv 0 \pmod{a+b}$ and $bq-p \equiv 0 \pmod{a+b}$ (the two conditions are equivalent, since their sum is $(a+b)q$).
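This is easy to check by brute force (small arbitrary values of $a$, $b$ and small search ranges):

```python
# enumerate the image of phi(n, m) = (n + m, b*n - a*m) and compare with the congruence conditions
a, b = 2, 3
image = {(n + m, b * n - a * m) for n in range(-20, 21) for m in range(-20, 21)}
for q in range(-5, 6):
    for p in range(-5, 6):
        condition = (a * q + p) % (a + b) == 0 and (b * q - p) % (a + b) == 0
        assert ((q, p) in image) == condition
print("the congruence conditions characterize the image of phi")
```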

To get a sum over all integers we can introduce (Kronecker-type) delta functions, non-zero only when their argument is $0 \bmod (a+b)$, that restrict us to $\phi(\mathbb{Z}^2)$

\begin{equation} \theta(t,x) \theta(t',x') = \sum_{p,q \in \mathbb{Z}} \delta(aq+p)\, \delta(bq-p)\, e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}

Now the sums have the right domain, but we have picked up some delta functions. We can expand those as sums of exponentials though:

\begin{equation} \theta(t,x) \theta(t',x') = \frac{1}{(a+b)^2} \sum_{p,q \in \mathbb{Z}} \; \sum_{r,r'=0}^{a+b-1} e^{i 2\pi r (aq+p)/(a+b)} e^{i 2\pi r' (bq-p)/(a+b)} e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}

Using $a \equiv -b \pmod{a+b}$ and the fact that the expression depends only on the difference $r-r'$, we can simplify this to

\begin{equation} \theta(t,x) \theta(t',x') = \frac{1}{a+b} \sum_{p,q \in \mathbb{Z}} \; \sum_{r=0}^{a+b-1} e^{i 2\pi r (aq+p)/(a+b)} e^{-p^2 s - q^2 s'} e^{i p d + i q \bar{x}} \end{equation}

And everything conveniently reduces to a linear combination of products of theta functions.

\begin{equation} \theta(t,x) \theta(t',x') = \frac{1}{a+b} \sum_{r=0}^{a+b-1} \theta\left(s, d + \frac{2\pi r}{a+b}\right) \theta\left(s', \bar{x} + \frac{2\pi a r}{a+b}\right) \end{equation}
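And a final brute-force check of the product formula, with an arbitrary commensurate pair $t'/t = a/b = 2/3$ and arbitrary test angles (truncated Fourier sums as before):

```python
import numpy as np

def theta(t, x, M=60):
    n = np.arange(-M, M + 1)
    return np.real(np.sum(np.exp(-n ** 2 * t + 1j * n * x)))

a, b = 2, 3
t = 0.5
tp = a * t / b                     # t', so that t'/t = a/b
x, xp = 0.9, 2.2                   # arbitrary test values for x, x'

lhs = theta(t, x) * theta(tp, xp)
rhs = sum(
    theta(t / (b * (a + b)), (x - xp + 2 * np.pi * r) / (a + b))
    * theta(t * tp / (t + tp), (a * x + b * xp + 2 * np.pi * a * r) / (a + b))
    for r in range(a + b)
) / (a + b)
print(lhs, rhs)  # agree to truncation / floating-point accuracy
```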

Conclusion

  • The definition of “normal” distribution coming from the diffusion equation generalizes well.

  • The posterior for the location parameter of the “normal” distribution on the circle is a linear combination of distributions peaked at each distinct “mean” value.

This is still a simple example - a natural extension is to consider the sphere.
