Symmetries in Machine Learning I
Exact symmetries in machine learning problems are rare — but when they exist, ignoring them is costly. A model that doesn’t know translation is a symmetry has to learn each translated version of a pattern separately. One that does can work directly on the quotient space of equivalence classes.
The quotient is the right mathematical object. The problem is that it often has no obvious coordinates. This post defines invariance and covariance precisely, works through an idealized example that makes the coordinate problem concrete, and sets up the question that Part II answers via representation theory.
Inspired by Invariant Information Clustering, which leverages transformation invariance for unsupervised learning.
Exact Symmetries in ML
Simple supervised machine learning tasks can be classified by
- Domain \(X\) - or the function inputs
- Range \(Y\) - or the function outputs
A particular instance of a task consists of
- A set of observations of (input, output) pairs, \(\mathcal{D} \subset X \times Y\)
- A cost function, defined given a candidate function \(f: X \rightarrow Y\) and a probability distribution on \(X \times Y\) (typically defined by the observed data): \(C : (X \rightarrow Y) \times \mathcal{P}(X \times Y) \rightarrow \mathbb{R}\)
We say ‘simple’ here as there are plenty of cases where observations are from \(X \times Y\) while the function to learn has a different range \(Y'\). The most common example is logistic regression, where the observations are from a discrete set \(L\) while the function outputs a probability distribution over \(L\).
Symmetries are typically observed for tasks where the domain \(X\) admits transformations \(\phi_{\alpha} : X \rightarrow X\) such that either
- The outputs are invariant under transformation of the inputs by \(\phi_{\alpha}\): the outputs \(y\) associated to \(x\) and \(\phi_{\alpha}(x)\) are the same for all \(\alpha\); or
- The outputs are covariant under \(\phi_{\alpha}\): qualitatively, the outputs associated to \(x\) and \(\phi_{\alpha}(x)\) are related in a known way.
Denoting the action of our symmetry transformations \(\phi_{\alpha} : X \rightarrow X\) on \(x\) by \(\phi_{\alpha} \cdot x\), the optimal \(f\) given the cost function should satisfy one of the following requirements.
- Invariance : \(f(\phi_{\alpha} \cdot x) = f(x)\) for all \(\alpha, x\)
- Covariance : There exists a function \(G\) taking a transformation of \(X\) and producing a transformation of \(Y\), \(G : (X \rightarrow X) \rightarrow (Y \rightarrow Y)\), such that it
- Defines transformation of outputs : \(f(\phi_{\alpha} \cdot x) = G(\phi_{\alpha}) \cdot f(x)\) for all \(\alpha, x\)
- Is a Functor : \(f(\phi_{\alpha} \cdot (\phi_{\beta} \cdot x)) = G(\phi_{\alpha} \cdot \phi_{\beta}) \cdot f(x) = G(\phi_{\alpha}) \cdot G(\phi_{\beta}) \cdot f(x)\) for all \(\alpha, \beta, x\)
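As a minimal sketch of both requirements on cyclic shifts of a list (pure Python; the names `shift`, `f_inv`, `f_cov` are illustrative, not from any library): the sum of entries is invariant, while the argmax position is covariant, with \(G(\phi_n)\) acting on positions by addition mod \(N\).

```python
def shift(x, n):
    """Cyclic shift: (phi_n . x)(m) = x(m - n)."""
    N = len(x)
    return [x[(m - n) % N] for m in range(N)]

def f_inv(x):
    """Invariant function: the sum of entries ignores position entirely."""
    return sum(x)

def f_cov(x):
    """Covariant function: the argmax position shifts along with the input."""
    return max(range(len(x)), key=lambda m: x[m])

x = [0, 3, 1, 0, 0]   # unique maximum, so the argmax is well defined
N = len(x)
for n in range(N):
    assert f_inv(shift(x, n)) == f_inv(x)            # f(phi_n . x) = f(x)
    assert f_cov(shift(x, n)) == (f_cov(x) + n) % N  # G(phi_n) adds n mod N
```

The functor property holds here because shifting by \(n\) and then \(m\) is the same as shifting by \(n + m\), and \(G\) respects that composition.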
Such exact symmetries are rare in practice but idealized versions of standard problems do admit exact symmetries.
Idealized Images : Suppose we remove questions of “what happens at the boundary”, allowing images composed of pixels that extend infinitely in all directions (functions \(\mathbb{Z}^2\) to \(\mathbb{R}^3\)). Then we expect:
- Classification of image content to be translation invariant.
- Detection of bounding boxes to be translation covariant, with the bounding box translated along with the image.
- Classification to be dilation invariant, where we allow transformations that take each pixel and “blow it up” to a \(K \times K\) square of pixels of identical color.
Once we introduce image boundaries we are reduced to having “near” or “approximate” symmetries that will be discussed another time.
Simple Idealized Example
A highly idealized example, similar to image models, takes as domain \(X\) the functions from the cyclic group \(\mathbb{Z}_N\) (addition mod \(N\)) to \(\mathbb{Z}_2\).
- Domain \(X\) : \(\mathbb{Z}_N \rightarrow \mathbb{Z}_2\)
- Interpretation : Finite strings of bits.
- Range \(Y\) : Any
- Group of Transformations : \(\mathbb{Z}_N\) acting on itself by translation
- For \(n \in \mathbb{Z}_N\) define \((\phi_n \cdot x)(m) = x(m - n)\)
- Interpretation : Translation with periodic boundary conditions.
Suppose we have domain expert knowledge that the task is exactly invariant with respect to the transformations - how do we incorporate that knowledge? Can we directly work with functions \(f: (\mathbb{Z}_N \rightarrow \mathbb{Z}_2) \rightarrow \mathbb{R}\) that are invariant under the group of transformations?
\[ f(\phi_n \cdot x) = f(x) \text{ for all } n \in \mathbb{Z}_N \]
What do these invariant functions look like?
Exploring Invariant Functions
If \(f(x) = f(\phi_n \cdot x)\) for all \(n\) then it is clearly sufficient to specify \(f\) on any single element of the following subset of \(X\),
\[ [x] = \{\phi_n \cdot x \mid n \in \mathbb{Z}_N\} \]
to know the value of \(f\) on the entire subset. This set consists of all possible translations / transformations of the element \(x\).
For \(N=3\), writing the input \(x\) as a string of bits: \([101] = \{ 101, 011, 110 \}\)
Each such subset is called an equivalence class under \(\sim_\phi\) and we denote the set of these equivalence classes as \(X/\sim_{\phi}\). One avenue to understanding the invariant functions is to directly define the functions on \(X / \sim_{\phi}\) (typically called the quotient of \(X\) with respect to \(\sim_{\phi}\)).
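For the bit-string example the class \([x]\) can be enumerated directly; a minimal sketch (function names are mine):

```python
def rotations(x):
    """All cyclic shifts of a bit string x."""
    return [x[n:] + x[:n] for n in range(len(x))]

def orbit(x):
    """The equivalence class [x] = { phi_n . x : n in Z_N }, as a set."""
    return set(rotations(x))

# The N = 3 example from the text:
assert orbit("101") == {"101", "011", "110"}
```

Note that the orbit can have fewer than \(N\) elements when \(x\) has internal periodicity (e.g. `orbit("111")` is a single string).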
For our simple example it is amusing to note that these equivalence classes are called “necklaces” (see the MathWorld entry on necklaces) and have been actively studied within combinatorics, where the primary object of study is the number of distinct equivalence classes. This also hints that the study of these objects is non-trivial!
For different \(N\) we can write down representatives of the equivalence classes.

- \(N=1\) : 0,1
- \(N=2\) : 00,10,11
- \(N=3\) : 000,100,110,111
- \(N=4\) : 0000,1000,1100,1010,1110,1111
- \(N=5\) : 00000,10000,11000,10100,11100,10110,11110,11111
- \(N=6\) : 000000,100000,110000,101000,100100,111000,110100,110010,111111,011111,001111,111010,011011,101010
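These representatives can be enumerated by brute force, picking the lexicographically smallest rotation from each orbit; a sketch (names mine). The resulting counts 2, 3, 4, 6, 8, 14 agree with the lists above (this is OEIS A000031, the number of binary necklaces).

```python
def canonical(x):
    """Lexicographically smallest rotation: one representative per necklace."""
    return min(x[n:] + x[:n] for n in range(len(x)))

def necklaces(N):
    """All binary necklaces of length N, as canonical representatives."""
    reps = {canonical(format(i, f"0{N}b")) for i in range(2 ** N)}
    return sorted(reps)

assert [len(necklaces(N)) for N in range(1, 7)] == [2, 3, 4, 6, 8, 14]
```

The brute force is exponential in \(N\), which already hints at the efficiency problems discussed below.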
With access to these it is possible to define a general invariant function by assigning a value \(f([x])\) to each equivalence class \([x]\); then \(f(x) = f([y])\) whenever \(x \in [y]\). While this allows us to construct invariant functions, they are not pleasant or efficient to work with.
With this understanding one can imagine taking each observation \((x, y)\) and mapping it to \(([x], y)\), so using \(\mathcal{D}' = \{ ([x], y) \mid (x,y) \in \mathcal{D} \}\) as training data. Two difficulties arise.
- Efficiently determining which equivalence class the input \(x\) belongs to may be burdensome.
- Where elements of \(X\) can be simply expressed as elements of \(\mathbb{Z}_2^N\), providing simple coordinates (features) and so a means to define a parametrized family of functions \(f_{\theta}\) to optimize over, the quotient \(X / \sim_{\phi}\) has no such obvious coordinates.
The quotient \(X/\sim_\phi\) is the right object to work with — but it has no obvious coordinates. Elements of \(\mathbb{Z}_2^N\) have simple coordinates (bits); the necklace classes do not. Determining which equivalence class an input belongs to is itself non-trivial, and defining a parametrized family of functions to optimize over requires coordinates.
This is the central problem: the quotient is mathematically clean but practically intractable without a coordinate system.
Responses to the Coordinate Problem
Five strategies, each exploiting the symmetry differently:
1. Work directly with the quotient. Map each input to its equivalence class and train on \(X/{\sim_\phi}\) directly. Mathematically clean — an invariant function on \(X\) is exactly a function on \(X/{\sim_\phi}\). In practice this requires (a) efficiently determining which class a given input belongs to, and (b) coordinates on the quotient to parametrize a family of functions. For necklaces, neither is obvious.
2. Data augmentation. Keep raw inputs; use knowledge of the group to expand the training set. For each example \((x, y)\), add all transformed copies \((\phi_\alpha \cdot x, y)\) for every \(\alpha \in G\). The model works in the original coordinates but sees enough rotations of each pattern to learn the invariance from data. The cost: sample complexity scales with \(|G|\) — for \(\mathbb{Z}_N\), a factor of \(N\) more examples to match what invariant coordinates give for free.
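For the bit-string example this augmentation is a one-liner over orbits; a sketch (names mine):

```python
def rotations(x):
    """All cyclic shifts of x: the orbit of x under Z_N."""
    return [x[n:] + x[:n] for n in range(len(x))]

def augment(data):
    """Expand each (x, y) into (phi_n . x, y) for every n in Z_N."""
    return [(xr, y) for x, y in data for xr in rotations(x)]

D = [("100", 0.5)]
assert augment(D) == [("100", 0.5), ("001", 0.5), ("010", 0.5)]
```

The dataset grows by a factor of \(|G| = N\) per example, making the sample-complexity cost explicit.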
3. Invariant featurization. Preprocess inputs into features that are constant across equivalence classes before the model sees them. The model implicitly operates on \(X/{\sim_\phi}\) — two inputs in the same class are identical in feature space. Symmetry is handled entirely upstream; the downstream model is unconstrained.
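One classical choice of shift-invariant features for the bit-string example is the circular autocorrelation; a sketch (pure Python, names mine):

```python
def circ_autocorr(x):
    """Shift-invariant features: c_k = sum_m x[m] * x[(m + k) % N]."""
    N = len(x)
    return tuple(sum(x[m] * x[(m + k) % N] for m in range(N)) for k in range(N))

def features(bits):
    """Featurize a bit string; equal for all members of a necklace class."""
    return circ_autocorr([int(b) for b in bits])

# Two inputs in the same equivalence class get identical features:
assert features("101") == features("011") == features("110")
```

A caveat worth flagging: these features are also reflection-invariant, so for larger \(N\) they can conflate distinct necklaces - they illustrate the idea rather than give a complete invariant. Part II's Fourier coordinates address this more systematically.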
4. Equivariant model components. Build the symmetry into the architecture. Equivariant layers transform predictably under the group action; an invariant output layer (e.g. global pooling) collapses to a value constant across each orbit. The model operates on raw inputs but is constrained by construction to respect the symmetry.
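For the bit-string example this can be sketched with a circular cross-correlation layer followed by max pooling (an illustration of the pattern, not a full architecture; names are mine):

```python
def shift(x, n):
    """Cyclic shift: (phi_n . x)(m) = x(m - n)."""
    N = len(x)
    return [x[(m - n) % N] for m in range(N)]

def circ_conv(x, w):
    """Equivariant layer: circular cross-correlation commutes with shifts."""
    N = len(x)
    return [sum(w[k] * x[(m + k) % N] for k in range(len(w))) for m in range(N)]

def model(x, w):
    """Equivariant layer + invariant global pooling: constant on each orbit."""
    return max(circ_conv(x, w))

x = [1, 0, 1, 1, 0]
w = [1.0, -2.0, 0.5]
# Equivariance: conv(shift(x)) == shift(conv(x))
assert circ_conv(shift(x, 2), w) == shift(circ_conv(x, w), 2)
# Invariance of the pooled output across the whole orbit
assert all(model(shift(x, n), w) == model(x, w) for n in range(len(x)))
```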
Strategies 3 and 4 are closely related. An invariant function \(f : X \to Y\) factors as \(f = g \circ \pi\) where \(\pi : X \to X/{\sim}\) is the quotient map. Invariant featurization is \(\pi\); the downstream model is \(g\). An equivariant architecture with an invariant output implements the same factorization differently. The precise conditions under which they are equivalent are the subject of Part II.
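For the bit-string example the factorization can be made literal, taking the lexicographic-minimum representative as the quotient map (a sketch; names mine):

```python
def pi(x):
    """Quotient map: send x to a representative of its class [x]."""
    return min(x[n:] + x[:n] for n in range(len(x)))

def g(rep):
    """Any function of the representative, here its value as a binary integer."""
    return int(rep, 2)

def f(x):
    """f = g o pi is invariant by construction, whatever g is."""
    return g(pi(x))

assert f("011") == f("110") == f("101")
```

Note that `g` alone is not invariant (it depends on bit positions); invariance comes entirely from composing with `pi`.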
5. Gauge fixing. Choose a canonical representative from each equivalence class — a gauge slice that intersects each orbit exactly once. In physics, a gauge condition is an algebraic constraint on the field that selects this representative globally. The ML analogue faces a difficulty: the data arrives spread across orbits, so alignment must happen per sample. Algorithmically this means defining a deterministic canonical form (e.g. lexicographic minimum rotation) — fast, but prone to discontinuities and Gribov ambiguities where multiple rotations tie. The learned variant, Spatial Transformer Networks, trains an alignment network to predict the rotation per sample; the main model then sees the aligned input. Unlike strategies 3 and 4, gauge fixing keeps raw signals and an unconstrained model, but introduces a learned normalization stage whose correctness is hard to verify.
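A minimal sketch of the deterministic variant for the bit-string example, using the lexicographic minimum rotation as the canonical form (names mine):

```python
def align(x):
    """Deterministic gauge fix: rotate x to its lexicographically smallest form."""
    return min(x[n:] + x[:n] for n in range(len(x)))

# Every element of an orbit maps to the same aligned input...
assert {align("1100"), align("0110"), align("0011"), align("1001")} == {"0011"}

# ...but ties create a Gribov-style ambiguity: two rotations of "1010"
# both attain the minimum "0101" (shifts n = 1 and n = 3).
ties = [n for n in range(4) if "1010"[n:] + "1010"[:n] == align("1010")]
assert ties == [1, 3]
```

The tie in the second check is exactly the ambiguity mentioned above: the gauge slice fails to intersect the orbit of a periodic string in a unique shift.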
Part II takes strategies 2–4 up concretely — quantifying the sample complexity cost of ignoring symmetry and showing that the representation theory of \(\mathbb{Z}_N\) (Fourier modes) provides the natural coordinates on the quotient that make strategy 3 tractable.