Notation in Statistics

Categories: Statistics, Notation

The distribution is the primary object in probability theory. Expectation values and moments are tools for summarizing it — tools that can fail. A case for thinking in measures.

Author: Adam Henderson
Published: March 24, 2017

Notation is powerful

“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race.” (Alfred North Whitehead, “An Introduction to Mathematics”)

“Good notation can make the difference between a readable paper and an unreadable one.” (Terence Tao, “Use Good Notation”)

See also: hypotext - notation

The central complaint here: the language of Random Variables and Expectation Values encourages treating moments as the primary object of probability theory. They are not. The probability measure — the distribution — is the primary object. Moments are tools for summarizing or estimating properties of that distribution. This distinction matters because the tools can fail while the distribution remains perfectly well-defined. When they fail, the measure-first view stays coherent; the moment-first view breaks down.

Probability is about Measures

Modern probability theory derives from Kolmogorov, who laid its foundations in measure theory. Measure theory is commonly taught as a prerequisite to integration, but in probability theory measure theory is the star and integration is a useful tool. In practice the emphasis usually gets inverted: expectation values, which are integrals, end up treated as the primary objects.

Sample Spaces

In the language of measure theory a sample space is a measurable space — a set \(X\) equipped with a sigma algebra \(\Sigma\).

A sigma algebra is a collection of subsets of \(X\) such that:

  1. \(\Sigma\) contains \(\emptyset\)
  2. \(\Sigma\) is closed under complements: if \(A \in \Sigma\) then \(X - A \in \Sigma\).
  3. \(\Sigma\) is closed under countable unions: given \(A_n \in \Sigma\), then \(\cup_n A_n \in \Sigma\).

This pair \((X,\Sigma)\) gives us the space of possible outcomes and the sets for which probability can be defined. Two canonical examples:

  • Finite Discrete: \(X = \{1,\ldots,n\}\) with \(\Sigma = \mathcal{P}(X)\). The sample space of a die, a deck of cards, etc.
  • Borel Algebra — the smallest sigma algebra containing all open sets in a topology. For \(\mathbb{R}^n\) with the standard topology, this is the canonical continuous sample space.
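
As a concrete anchor, here is a minimal sketch in Python (the helper names are my own, not standard) of the finite discrete case: a die's sample space with its power-set sigma algebra, and a direct check of the closure axioms.

```python
from itertools import chain, combinations

# Sample space of a six-sided die.
X = frozenset(range(1, 7))

def power_set(s):
    """All subsets of s: the power-set sigma algebra of a finite space."""
    items = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

sigma = power_set(X)

# The sigma-algebra axioms (countable unions reduce to finite unions here).
assert frozenset() in sigma                                  # contains the empty set
assert all((X - A) in sigma for A in sigma)                  # closed under complements
assert all((A | B) in sigma for A in sigma for B in sigma)   # closed under unions
```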

Probability Space

A Probability Space is a measurable space equipped with a measure \(P\) satisfying

  1. Positivity: \(P : \Sigma \rightarrow \mathbb{R}^+\)
  2. Normalization: \(P(X) = 1\)
  3. \(P(\emptyset) = 0\)
  4. Countable additivity: for disjoint \(A_n \in \Sigma\), \(P(\cup_n A_n) = \sum_n P(A_n)\)

Given \((X, \Sigma)\) we can consider \(\mathcal{P}(X, \Sigma)\) — the space of all normalized measures. These are the possible probability distributions on our sample space. For the finite discrete case this is simply the space of non-negative functions \(f : X \rightarrow \mathbb{R}\) summing to 1.
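
In code, a minimal sketch of this finite discrete case (a fair die is assumed purely for illustration): a distribution is a table of non-negative weights summing to 1, and the measure of an event is the sum of the weights it contains.

```python
from fractions import Fraction

# A probability measure on X = {1, ..., 6}: non-negative weights summing to 1.
weights = {x: Fraction(1, 6) for x in range(1, 7)}   # a fair die
assert sum(weights.values()) == 1                    # normalization: P(X) = 1

def P(A):
    """Measure of an event A, a subset of the sample space."""
    return sum(weights[x] for x in A)

print(P(set()))             # 0      P(empty set) = 0
print(P({2, 4, 6}))         # 1/2    probability of an even roll
print(P(set(range(1, 7))))  # 1      P(X) = 1
```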

At this stage we have a complete language for discussing probabilities — and this is where integration typically creeps in.

Random Variables

A Random Variable is a measurable function from a probability space to the reals. Such a function can be integrated; its integral is the expectation value, and integrals of functions of it give moments.
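
As a minimal sketch (a fair die again, chosen purely for illustration): the expectation is the integral of the random variable against the measure, which in the finite case is a weighted sum, and higher moments are the same integral applied to powers.

```python
# Expectation and moments of a real random variable on a finite sample space.
# P is a fair die; f is the identity map (the face value itself).
P = {x: 1 / 6 for x in range(1, 7)}

def moment(f, k):
    """k-th moment of f under P: the integral of f**k against the measure."""
    return sum(f(x) ** k * p for x, p in P.items())

f = lambda x: x
mean = moment(f, 1)                   # 3.5
variance = moment(f, 2) - mean ** 2   # 35/12, about 2.917
print(mean, variance)
```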

Measurable functions are the natural morphisms between measurable spaces. For measurable spaces \(X\) and \(Y\), a measurable function \(f: X \rightarrow Y\) satisfies: \(f^{-1}(B) \in \Sigma_X\) for every \(B \in \Sigma_Y\).

The preimage behaves well under all set operations:

  1. \(f^{-1}(Y)=X\)
  2. \(f^{-1}(\emptyset) = \emptyset\)
  3. \(f^{-1}(\cap_\alpha V_\alpha) = \cap_\alpha f^{-1} (V_\alpha)\)
  4. \(f^{-1}(\cup_\alpha V_\alpha) = \cup_\alpha f^{-1} (V_\alpha)\)
  5. \(f^{-1}(Y - V) = X - f^{-1}(V)\)

So a measurable \(f\) induces a morphism \(f^{-1} : \Sigma_Y \rightarrow \Sigma_X\) of sigma algebras.
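
A small sketch (the parity map is an arbitrary choice of mine) that checks these preimage identities for a concrete function between finite spaces, where every function is measurable under the power-set sigma algebras:

```python
# f : X -> Y, the parity of a die roll.  On finite spaces with power-set
# sigma algebras, every function is measurable.
X = {1, 2, 3, 4, 5, 6}
Y = {"odd", "even"}

def f(x):
    return "even" if x % 2 == 0 else "odd"

def preimage(V):
    """f^{-1}(V) = {x in X : f(x) in V}."""
    return {x for x in X if f(x) in V}

# The preimage respects the set operations used to build a sigma algebra.
V, W = {"even"}, {"odd"}
assert preimage(Y) == X
assert preimage(set()) == set()
assert preimage(V | W) == preimage(V) | preimage(W)   # unions
assert preimage(V & W) == preimage(V) & preimage(W)   # intersections
assert preimage(Y - V) == X - preimage(V)             # complements
```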

We Really Care about Measures

Random variables are a means to define probability measures — and probability measures are the primary object of interest.

Given a measurable function \(f : X \rightarrow Y\) and a probability measure \(P\) on \(X\), we get a probability measure on \(Y\) for free — the pushforward:

\[f_*(P)(V) = P(f^{-1}(V))\]

One can verify: \(f_*(P)(\emptyset) = P(\emptyset) = 0\), \(f_*(P)(Y) = P(X) = 1\), and countable additivity holds. This is consistent with the operational meaning: draw samples from \(P\), evaluate \(f\) on each, and the resulting distribution is \(f_*(P)\).
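
A minimal sketch of the pushforward in the finite discrete case (the parity map is again an arbitrary choice of mine): the measure of \(V\) is the \(P\)-measure of \(f^{-1}(V)\), and it matches the empirical distribution obtained by sampling from \(P\) and applying \(f\).

```python
from collections import Counter
import random

# P on X = {1, ..., 6}: a fair die.  f maps each outcome to its parity.
weights = {x: 1 / 6 for x in range(1, 7)}

def f(x):
    return "even" if x % 2 == 0 else "odd"

def pushforward(V):
    """f_*(P)(V) = P(f^{-1}(V))."""
    return sum(w for x, w in weights.items() if f(x) in V)

print(pushforward({"even"}))   # 0.5 (up to float rounding)

# Operational check: sample from P, evaluate f, inspect the empirical distribution.
random.seed(0)
samples = random.choices(list(weights), weights=list(weights.values()), k=100_000)
empirical = Counter(f(x) for x in samples)
print(empirical["even"] / len(samples))   # close to 0.5
```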

A real random variable defines a probability distribution on \(\mathbb{R}\) via this pushforward — and it is the pushforward distribution that we care about. Expectation values and moments are tools for estimating properties of this distribution. The distribution is the object; the moments are summaries of it.

Artificially Restricted Scope

The standard definition restricts random variables to real-valued measurable functions — a restriction motivated by integration. But we frequently care about measurable functions to other spaces:

  • Random sentences from a generative text model are not real-valued.
  • Categorically we want all measurable spaces and all measurable functions between them. Real random variables form a Comma Category — the category over \(\mathbb{R}\) with the Borel algebra.
  • Compositions of random variables arise constantly; restricting to \(\mathbb{R}\) makes this awkward.

At minimum, a random variable should be any measurable map from a probability space \((X, P)\) to a measurable space \(Y\).

Conflicting Notation

Working with a single sample space, arithmetic on real random variables is pointwise:

  • \((f + g)(x) = f(x) + g(x)\)
  • \((fg)(x) = f(x)g(x)\)
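
On a single sample space this is ordinary function arithmetic, as in this short sketch (the functions are made up for illustration):

```python
# Two random variables on the same sample space X = {1, ..., 6}.
f = lambda x: x         # the face value
g = lambda x: x % 2     # 1 for odd faces, 0 for even faces

# (f + g)(x) = f(x) + g(x): both are evaluated on the same outcome.
h = lambda x: f(x) + g(x)
print([h(x) for x in range(1, 7)])   # [2, 2, 4, 4, 6, 6]
```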

But statements like “let \(h = f + g\) where \(f\) and \(g\) are independent Normal random variables” introduce a second, different notion of addition. What is really happening:

  1. Start with measure spaces \(X\), \(Y\) and random variables \(f\), \(g\).
  2. Construct the product space \(X \times Y\).
  3. Extend: \(f(x,y) = f(x)\) and \(g(x,y) = g(y)\).
  4. Now add: \((f + g)(x,y) = f(x) + g(y)\) — pointwise on the product.

The \(+\) bundles together an operation on the sample space (taking the product) with an operation on the random variables. This is not a problem with random variables per se, but a consequence of suppressing the underlying sample spaces from the notation.
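
A short sketch of steps 1-4 in the finite discrete case (two dice, assumed purely for illustration): form the product measure on \(X \times Y\), add pointwise on the product space, and push forward under \((x, y) \mapsto x + y\) to get the distribution of the sum.

```python
from collections import defaultdict
from itertools import product

# Independent dice: a measure P_X on X and a measure P_Y on Y.
P_X = {x: 1 / 6 for x in range(1, 7)}
P_Y = {y: 1 / 6 for y in range(1, 7)}

# Product measure on X x Y, pointwise sum, and pushforward in one pass.
dist_of_sum = defaultdict(float)
for (x, px), (y, py) in product(P_X.items(), P_Y.items()):
    dist_of_sum[x + y] += px * py

print(dict(sorted(dist_of_sum.items())))
# {2: 1/36, 3: 2/36, ..., 7: 6/36, ..., 12: 1/36} (as floats)
```

Written this way, the overloaded \(+\) is visibly a product-space construction followed by a pushforward, not pointwise addition on a shared sample space.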

Real Random Variables are Coordinates

Random variables primarily act as coordinate functions — mapping from a sample space to real values we can compute with. This is a close analog to coordinate charts on a manifold: \(f : U \rightarrow \mathbb{R}\) for \(U \subset M\).

On a manifold, the existence of coordinate functions does not mean that integrating those coordinates is geometrically meaningful. The coordinate values depend on the chart; the geometry does not. Similarly, the fact that a random variable can be integrated does not mean the integral — the expectation — is the right object to focus on. The pushforward measure is the intrinsic object; the expectation is one chart-dependent summary of it.

Where Moments Fail

Two examples where the distribution is perfectly well-defined but moment-based reasoning breaks down.

Long-tailed distributions. The Cauchy distribution has density \(\frac{1}{\pi(1 + x^2)}\) — a completely well-defined, symmetric distribution centered at 0. Its mean does not exist: the integral \(\int |x| \, \frac{1}{\pi(1+x^2)} \, dx\) diverges, so the expectation is undefined. A framework that treats the mean as the primary object has nothing to say about a distribution that is otherwise completely tractable. The distribution is fine; the moment is the thing that breaks.
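
A quick illustration (NumPy; the seed and sample sizes are arbitrary choices of mine): the running sample mean of Cauchy draws never settles down, while the same statistic for a normal distribution converges.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

# Running means after 10^3, 10^4, 10^5, 10^6 samples.
for k in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={k:>9,}  normal mean={normal[:k].mean():+.4f}  "
          f"cauchy mean={cauchy[:k].mean():+.4f}")
# The normal column converges toward 0; the Cauchy column keeps wandering,
# because the integral defining its mean is not convergent.
```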

More generally, heavy-tailed distributions (Pareto with a small shape parameter, stable distributions with \(\alpha < 2\)) have moments that fail to exist beyond some order, in some cases including the mean itself. The distribution is still the right object.

Distributions on manifolds. On the circle \(S^1\), the distribution is well-defined — the wrapped normal / theta function gives a clean, intrinsically-defined probability measure. But the mean is not unique: for \(N\) samples \(x_n\), the equation \(N\mu = \sum_n x_n \pmod{2\pi}\) has \(N\) distinct solutions for \(\mu\) on \(S^1\). A moment-first view produces ambiguity; a measure-first view does not. The full posterior over \(\mu\) is a mixture peaked at each candidate mean — the distribution contains the information, the moment does not. (See Statistical Inference on the Circle for the full story.)
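
A tiny sketch of the ambiguity (NumPy; the sample angles are made up for illustration): each winding number gives a different solution of \(N\mu \equiv \sum_n x_n \pmod{2\pi}\), so there are \(N\) equally valid candidate means.

```python
import numpy as np

# Three angles on the circle, in [0, 2*pi); values chosen only for illustration.
x = np.array([0.1, 2.0, 5.5])
N = len(x)

# Solve N * mu = sum(x) (mod 2*pi): one candidate mean per winding number k.
candidates = (x.sum() + 2 * np.pi * np.arange(N)) / N % (2 * np.pi)
print(candidates)   # N distinct candidate "means" on S^1
```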

Conclusion

Probability measures are the right primary object. Random variables are coordinate functions that induce pushforward measures; expectation values and moments are summaries of those measures. The summaries are useful but not fundamental — they can fail to exist, fail to be unique, or fail to capture the structure of the distribution. Working directly with measures avoids these failures and generalizes cleanly to any measurable space, not just \(\mathbb{R}^n\).