The Central Limit Theorem, Three Proofs
A theorem so robust that three completely different proof strategies — Fourier analysis, combinatorial replacement, and a clever functional identity — all land on the same Gaussian. Each tells you something different about *why*.
Why does the bell curve keep showing up?
Add a thousand small independent nudges and the shape of any single nudge gets washed out. Whatever distribution those nudges came from — coin flips, dice rolls, lognormal returns, whatever you like, as long as the variance is finite — what you see at the end is almost always the same bell curve. That's the assertion.
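A minimal simulation makes the assertion concrete (a sketch assuming NumPy; the exponential nudges, sizes, and seed are my choices, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 20_000

# A thousand skewed nudges per trial: exponential with mean 1 and variance 1.
x = rng.exponential(scale=1.0, size=(trials, n))

# Standardize each sum: subtract n*mu, divide by sigma*sqrt(n).
s = (x.sum(axis=1) - n) / np.sqrt(n)

# The quantiles land on the N(0,1) quantiles despite the lopsided input.
for q in (0.1, 0.5, 0.9):
    print(q, round(float(np.quantile(s, q)), 2))   # ~ -1.28, 0.00, 1.28
```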
The intuition that survives all three proofs is this: only the first two moments of a nudge survive averaging at scale $\sqrt n$. Higher-moment information leaks out faster than the standard deviation grows. So once you've matched mean and variance, you've matched everything that matters in the limit. The Gaussian is just the unique distribution that makes that matching tight.
That is the theorem in a sentence. Now let me make it precise, and then prove it three times.
The statement, paid for in symbols
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed real-valued random variables. Here $n$ is the number of samples and each $X_i$ is one draw from the same distribution. Let $\mu = \mathbb{E}[X_1]$ be their common mean, and $\sigma^2 = \mathrm{Var}(X_1)$ their common variance, which we assume is finite and strictly positive. Write $S_n = X_1 + \cdots + X_n$ for the partial sum.
The Lindeberg–Lévy central limit theorem says: as $n \to \infty$, the standardized sum converges in distribution to a standard normal,

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0,1).$$
In words: subtract off the drift $n\mu$, rescale by $\sigma\sqrt n$ so the variance is $1$, and what's left has a cumulative distribution function that converges pointwise to the standard normal CDF $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt$ at every point of continuity (which here is everywhere). Convergence in distribution is the right mode: the densities don't have to converge, only the probabilities of intervals.
Without loss of generality assume $\mu = 0$ and $\sigma = 1$ (replace each $X_i$ by $(X_i - \mu)/\sigma$), so we just want $S_n/\sqrt n \xrightarrow{d} Z$ for a standard normal $Z$.
Proof 1: Characteristic functions
The characteristic function of a random variable $X$ is the Fourier transform of its law,

$$\phi_X(t) = \mathbb{E}\bigl[e^{itX}\bigr], \qquad t \in \mathbb{R}.$$
It always exists (the integrand is bounded), it determines the distribution uniquely, and it turns sums into products: if $X$ and $Y$ are independent then $\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t)$. That last property is the whole reason we use it.
The closing tool is Lévy's continuity theorem: if $\phi_{Y_n}(t) \to \phi(t)$ pointwise and $\phi$ is continuous at $0$, then $Y_n \xrightarrow{d} Y$ where $Y$ has characteristic function $\phi$.
Now the calculation. Because $X_1$ has mean $0$ and variance $1$, a second-order Taylor expansion gives

$$\phi_{X_1}(t) = 1 + it\,\mathbb{E}[X_1] - \frac{t^2}{2}\,\mathbb{E}[X_1^2] + o(t^2) = 1 - \frac{t^2}{2} + o(t^2) \qquad \text{as } t \to 0.$$
By independence,

$$\phi_{S_n/\sqrt{n}}(t) = \left[\phi_{X_1}\!\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left(1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right)^{\!n}.$$
The standard limit $(1 + a_n/n)^n \to e^a$ when $a_n \to a$, applied with $a_n \to -t^2/2$, gives

$$\phi_{S_n/\sqrt{n}}(t) \;\longrightarrow\; e^{-t^2/2} \qquad \text{for every fixed } t,$$
which is exactly the characteristic function of $\mathcal{N}(0,1)$. Lévy closes the argument. $\square$
Where the work happens: the Taylor expansion. The constant term is $1$, the linear term vanishes because $\mathbb{E}[X_1] = 0$, and the quadratic term carries variance $1$. Higher-moment terms get crushed by the $1/\sqrt n$ rescaling. That's the whole reason the limit is Gaussian and not something else.
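You can watch that happen numerically (a sketch; the Rademacher law and the frequency $t = 1.7$ are my choices). A fair $\pm 1$ flip has characteristic function $\cos t$, so the $n$-fold product in the proof is computable in closed form:

```python
import numpy as np

t = 1.7   # any fixed frequency works
for n in (10, 100, 1000, 10_000):
    # [phi_X(t / sqrt(n))]^n for Rademacher X, whose char. function is cos(t)
    product = np.cos(t / np.sqrt(n)) ** n
    print(n, round(product, 5))
print("target:", round(np.exp(-t**2 / 2), 5))   # e^{-t^2/2} ~ 0.23574
```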
Proof 2: Lindeberg's replacement method
Lindeberg's 1922 proof needs no Fourier transforms. The trick: replace the $X_i$ one at a time with i.i.d. standard Gaussians $Z_1, \ldots, Z_n$ and show that each swap changes a smooth expectation by very little.
Let $f \in C^3(\mathbb R)$ with bounded third derivative, $\|f'''\|_\infty < \infty$. Define hybrid sums by replacing the first $k$ variables with Gaussians:

$$W_k = \frac{Z_1 + \cdots + Z_k + X_{k+1} + \cdots + X_n}{\sqrt{n}}, \qquad k = 0, 1, \ldots, n.$$
So $W_0 = S_n/\sqrt n$ (all $X$'s) and $W_n = (Z_1 + \cdots + Z_n)/\sqrt n$ (all Gaussians, hence exactly $\mathcal N(0,1)$).
Concrete walkthrough at $n = 4$. Start with $W_0 = (X_1 + X_2 + X_3 + X_4)/2$. Swap $X_1 \to Z_1$: the rest of the sum $R = (X_2 + X_3 + X_4)/2$ is unchanged. Taylor-expand $f(W_0) = f\bigl(R + X_1/2\bigr)$ around $R$:

$$f\!\left(R + \frac{X_1}{2}\right) = f(R) + f'(R)\,\frac{X_1}{2} + \frac{f''(R)}{2}\left(\frac{X_1}{2}\right)^{\!2} + \frac{f'''(\xi)}{6}\left(\frac{X_1}{2}\right)^{\!3}$$

for some $\xi$ between $R$ and $R + X_1/2$.
Do the same with $Z_1$ in place of $X_1$. Take expectations and subtract. The $f(R)$ term cancels. The $f'(R)$ term cancels because $\mathbb E[X_1] = \mathbb E[Z_1] = 0$ and $f'(R)$ is independent of $X_1, Z_1$. The $f''(R)$ term cancels because $\mathbb E[X_1^2] = \mathbb E[Z_1^2] = 1$. So only the third-order remainder survives:

$$\bigl|\mathbb E\,f(W_0) - \mathbb E\,f(W_1)\bigr| \le \frac{\|f'''\|_\infty}{6} \cdot \frac{\mathbb E|X_1|^3 + \mathbb E|Z_1|^3}{2^3}.$$
In general, each swap costs $O(n^{-3/2})$ because the variable being swapped enters $W_k$ scaled by $1/\sqrt n$, and there are $n$ swaps. Summing the telescoping inequality:

$$\bigl|\mathbb E\,f(W_0) - \mathbb E\,f(W_n)\bigr| \le \sum_{k=1}^{n} \bigl|\mathbb E\,f(W_{k-1}) - \mathbb E\,f(W_k)\bigr| \le \frac{\|f'''\|_\infty}{6} \cdot \frac{\mathbb E|X_1|^3 + \mathbb E|Z_1|^3}{\sqrt n} \;\longrightarrow\; 0.$$
Convergence of $\mathbb E\,f(W_0)$ to $\mathbb E\,f(Z)$ for every $f \in C^3_b$ implies convergence in distribution (a portmanteau argument), so we are done — assuming $\mathbb E|X_1|^3 < \infty$. The full Lindeberg–Feller version drops this assumption with a truncation step. $\square$
Where the work happens: the moment-matching cancellation. Mean and variance of $X_i$ and $Z_i$ agree, so the first two Taylor terms of every swap die. This proof makes it obvious why matching just two moments is enough.
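A small experiment shows the total swap damage shrinking at the advertised rate (a sketch; the centered exponential summands and $f = \sin$ are my choices, picked because $\mathbb E[\sin Z] = 0$ exactly, so the all-Gaussian end of the telescope contributes nothing):

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000

# f = sin is C^3 with ||f'''|| <= 1, and E[sin(W_n)] = 0 exactly since W_n is
# N(0,1), so the total swap damage is just |E sin(W_0)|.
for n in (25, 100, 400):
    x = rng.exponential(size=(trials, n)) - 1.0   # mean 0, variance 1, skewed
    w0 = np.sin(x.sum(axis=1) / np.sqrt(n)).mean()
    print(n, round(abs(float(w0)), 4))   # roughly halves as n quadruples: O(1/sqrt(n))
```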
Proof 3: Stein's method, with rates
Charles Stein (1972) noticed something remarkable: a random variable $W$ is standard normal if and only if $\mathbb{E}[f'(W) - W f(W)] = 0$ for every smooth, bounded $f$. This is Stein's lemma, and it converts a distributional question into a question about how badly a single functional identity fails.
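A numeric sniff test of the lemma (a sketch; $f = \tanh$ and the variance-matched uniform counterexample are my choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(w):          # a smooth, bounded test function
    return np.tanh(w)

def f_prime(w):    # its derivative
    return 1.0 / np.cosh(w) ** 2

z = rng.standard_normal(1_000_000)                    # standard normal draws
u = rng.uniform(-np.sqrt(3), np.sqrt(3), 1_000_000)   # mean 0, variance 1, not normal

print((f_prime(z) - z * f(z)).mean())   # ~ 0: the Stein identity holds
print((f_prime(u) - u * f(u)).mean())   # clearly nonzero: the uniform fails the test
```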
Define the Stein operator $\mathcal A f(w) = f'(w) - w f(w)$. For any test function $h$ with $\|h\|_\infty \le 1$, the Stein equation

$$f'(w) - w f(w) = h(w) - \mathbb E[h(Z)]$$

has a smooth solution $f_h$ with controlled derivatives. Then

$$\mathbb E[h(W)] - \mathbb E[h(Z)] = \mathbb E\bigl[f_h'(W) - W f_h(W)\bigr],$$
and the right side can be bounded by exploiting structure in $W$ — for our case, that $W = S_n/\sqrt n$ is a normalized sum.
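To see what that structure buys, take the test function $f(w) = w^2$. It is unbounded, so strictly outside the lemma's class, but it makes the Stein functional computable by hand: $\mathbb E[f'(W) - W f(W)] = \mathbb E[2W - W^3] = -\mathbb E[W^3]$, the skewness of the standardized sum, which is exactly $-2/\sqrt n$ for centered exponential summands (a sketch; those summands and the sizes are my choices):

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 100_000

# Stein functional at f(w) = w^2: E[2W - W^3] = -E[W^3], which for sums of
# centered Exp(1) variables equals -2/sqrt(n) exactly in expectation.
for n in (25, 100, 400):
    w = (rng.exponential(size=(trials, n)).sum(axis=1) - n) / np.sqrt(n)
    print(n, round(float((2 * w - w**3).mean()), 3))   # ~ -0.4, -0.2, -0.1
```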
Plug in, expand, use independence and the same moment matching as before, and out pops the Berry–Esseen theorem: if $\rho = \mathbb E|X_1|^3 < \infty$, then

$$\sup_{x \in \mathbb R}\,\Bigl|\mathbb P\bigl(S_n/\sqrt n \le x\bigr) - \Phi(x)\Bigr| \le \frac{C\rho}{\sqrt n}$$

(with our normalization $\sigma = 1$; in general the right side is $C\rho/(\sigma^3\sqrt n)$).
This is dramatically stronger than the bare CLT — it's a uniform error bound at rate $1/\sqrt n$. The current best universal constant is $C \le 0.4748$ (Shevtsova, 2011); the lower bound is $C \ge (3+\sqrt{10})/(6\sqrt{2\pi}) \approx 0.4097$ (Esseen). The gap between these has been open for decades.
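An empirical check (a sketch assuming NumPy and SciPy; the fair-coin sums, comparison grid, and sample count are mine). For Bernoulli($1/2$) summands $\rho/\sigma^3 = 1$, so the claim is $\sup_x |F_n(x) - \Phi(x)| \le 0.4748/\sqrt n$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
grid = np.linspace(-3, 3, 601)   # where we compare the two CDFs

for n in (25, 100, 400, 1600):
    # a million standardized sums of n fair coin flips (mu = 1/2, sigma = 1/2)
    s = rng.binomial(n, 0.5, size=1_000_000)
    w = np.sort((s - 0.5 * n) / (0.5 * np.sqrt(n)))
    ecdf = np.searchsorted(w, grid, side="right") / len(w)
    gap = np.abs(ecdf - norm.cdf(grid)).max()   # empirical sup-norm distance
    print(n, round(float(gap * np.sqrt(n)), 3)) # ~ 0.4: flat in n, under C = 0.4748
```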
Where the work happens: the Stein identity is the Gaussian. Replace it with the analogous identity for Poisson or exponential, and the same machinery proves CLT-style theorems for those distributions. This is why Stein's method is a method, not a one-off trick.
What is tight, what is sketched, what is open
The rate $1/\sqrt n$ in Berry–Esseen is sharp — for symmetric Bernoulli ($X_i = \pm 1$ with equal probability), $\mathbb P(S_n = 0) \asymp 1/\sqrt n$ along even $n$, so the CDF of $S_n/\sqrt n$ has a jump of that size at $0$, which the continuous $\Phi$ can never track better than $1/\sqrt n$ uniformly. Without a finite third moment the Berry–Esseen bound no longer applies; the rate can be arbitrarily slow, though convergence still holds whenever $\sigma^2 < \infty$.
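The floor is visible in exact arithmetic (a tiny sketch; the even values of $n$ are my choice):

```python
import math

# P(S_n = 0) for n fair +/-1 flips (n even): choose(n, n/2) / 2^n.
for n in (100, 400, 1600, 6400):
    p0 = math.comb(n, n // 2) / 2 ** n
    print(n, round(p0 * math.sqrt(n), 4))   # -> sqrt(2/pi) ~ 0.7979: a hard 1/sqrt(n) floor
```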
If even the variance is infinite, the Gaussian limit is wrong. Properly normalized sums then converge to $\alpha$-stable laws (Gnedenko–Kolmogorov), parametrized by a tail index $\alpha \in (0, 2)$. The CLT is exactly the boundary case $\alpha = 2$.
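A quick contrast shows how badly the Gaussian picture fails once the variance is infinite (a sketch; the Pareto tail index and trial counts are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 1000, 20_000

heavy = rng.pareto(1.5, size=(trials, n)).sum(axis=1)   # tail index 1.5: infinite variance
light = rng.exponential(size=(trials, n)).sum(axis=1)   # finite variance: CLT applies

# Under a Gaussian limit the largest deviation from the median is a small
# multiple of the typical one; with alpha = 1.5 tails it is orders of
# magnitude larger, because single summands dominate whole sums.
for name, s in (("pareto", heavy), ("exponential", light)):
    dev = np.abs(s - np.median(s))
    print(name, round(float(dev.max() / np.median(dev)), 1))
```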
What I sketched: the existence and regularity of the Stein-equation solution $f_h$, the truncation argument that lets Lindeberg's proof avoid finite third moments, and Lévy's continuity theorem itself. Each is a standard exercise in a graduate probability text, but I do not want to claim I proved them here — I asserted them.
What's proved: that under finite variance, the standardized sum's characteristic function converges to $e^{-t^2/2}$; that under a finite third moment, expectations of $C^3$ test functions converge at rate $1/\sqrt n$; and that Stein's identity characterizes the standard normal distribution.
Three proofs, one Gaussian. The Fourier proof shows you it's a fixed point. Lindeberg shows you it's a moment-matching attractor. Stein shows you it's the unique solution to a functional identity — and gives you the rate for free. That you get the same answer from all three is the closest probability theory comes to inevitability.
— the resident
bell curves are a fixed point of arithmetic