Sampling Distributions & the Central Limit Theorem

The Bridge Between Descriptive and Inferential Statistics

You've collected data and computed a sample mean x̄. But x̄ is just one estimate of the population mean μ. If you took a different sample, you'd get a different x̄.

The sampling distribution describes how sample statistics (like x̄) vary across all possible samples of the same size. Understanding this variation is the foundation of all inferential statistics.

The Sampling Distribution of the Mean

An Experiment

Imagine a population of 10,000 employee salaries with μ = ₹76,000 and σ = ₹12,000.

Now take many random samples of size n = 50, and compute x̄ for each:

Sample 1 (n=50): x̄₁ = ₹74,200
Sample 2 (n=50): x̄₂ = ₹77,800
Sample 3 (n=50): x̄₃ = ₹75,600
Sample 4 (n=50): x̄₄ = ₹76,900
...
Sample 1000 (n=50): x̄₁₀₀₀ = ₹75,400

Plot all 1,000 sample means as a histogram. What shape does it take?

→ Approximately bell-shaped and symmetric, even if the original salary distribution was skewed (provided n is large enough).
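The experiment above can be tried in a few lines of Python. This is only a sketch: the right-skewed salary population is invented for illustration (a shifted gamma distribution chosen so that μ ≈ 76,000 and σ ≈ 12,000), and the seed is arbitrary.

```python
import random
import statistics

random.seed(42)  # arbitrary fixed seed so the run is repeatable

# Hypothetical right-skewed population of 10,000 salaries (an assumption):
# 52,000 + Gamma(shape=4, scale=6,000) has mean 76,000 and SD 12,000 in theory.
population = [52_000 + random.gammavariate(4, 6_000) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Draw 1,000 random samples of size 50 (without replacement); keep each mean.
sample_means = [
    statistics.fmean(random.sample(population, 50)) for _ in range(1_000)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"population:   mean = {pop_mean:,.0f}, SD = {pop_sd:,.0f}")
print(f"sample means: mean = {mean_of_means:,.0f}, SD = {sd_of_means:,.0f}")
```

The mean of the sample means should land close to the population mean, while their SD should be only a small fraction of the population SD. A histogram of sample_means would show the bell shape.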

The Central Limit Theorem (CLT)

Arguably the most important theorem in statistics:

If X₁, X₂, ..., Xₙ are independent and identically distributed (i.i.d.) random variables with mean μ and finite standard deviation σ, then as n → ∞, the sampling distribution of the sample mean x̄ approaches a Normal distribution:

x̄ ~ N(μ, σ²/n)     approximately, for large n

Mean of the sampling distribution: μₓ̄ = μ
Standard deviation of the sampling distribution (called the standard error): SE = σ/√n

What the CLT Tells Us

The sampling distribution of x̄ is approximately normal — regardless of the shape of the population distribution — as long as n is large enough (rule of thumb: n ≥ 30)
The mean of the sampling distribution equals the population mean: E(x̄) = μ → x̄ is an unbiased estimator of μ
The spread of the sampling distribution decreases as n increases → Larger samples give more precise estimates

The Standard Error (SE)

The standard error is the standard deviation of the sampling distribution:

SE = σ / √n

Where:
σ = population standard deviation
n = sample size

If σ is unknown (the usual case), estimate it with the sample standard deviation s:
SE ≈ s / √n
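As a minimal sketch, the estimated SE can be computed directly from a sample. The function name standard_error and the salary figures below are made up for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations to estimate s")
    s = statistics.stdev(sample)  # sample SD, with n - 1 in the denominator
    return s / math.sqrt(n)


# Hypothetical salaries (in rupees), invented for this example only.
salaries = [71_000, 78_500, 64_200, 90_100, 75_300, 82_700, 69_900, 77_400]
se = standard_error(salaries)
print(f"mean = {statistics.fmean(salaries):,.0f}, estimated SE = {se:,.0f}")
```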

SE and Sample Size

Population: μ = 76,000, σ = 12,000

n=10:   SE = 12,000/√10  = 12,000/3.162 = 3,795
n=25:   SE = 12,000/√25  = 12,000/5.00 = 2,400
n=50:   SE = 12,000/√50  = 12,000/7.07 = 1,697
n=100:  SE = 12,000/√100 = 12,000/10.0 = 1,200
n=400:  SE = 12,000/√400 = 12,000/20.0 = 600
n=2500: SE = 12,000/√2500 = 12,000/50 = 240

Key insight: To halve the SE, you must quadruple the sample size.

SE reduction: ×4 sample size → ×½ SE
This is the law of diminishing returns in sampling precision.
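The table and the "quadruple n to halve SE" rule can be checked with a short sketch (se_of_mean is just a helper name chosen here):

```python
import math

sigma = 12_000  # population SD from the salary example


def se_of_mean(sigma, n):
    """Standard error of the sample mean for a known population SD."""
    return sigma / math.sqrt(n)


for n in (10, 25, 50, 100, 400, 2500):
    print(f"n = {n:>4}: SE = {se_of_mean(sigma, n):>6,.0f}")

# Quadrupling n halves the SE, because sqrt(4n) = 2 * sqrt(n).
ratio = se_of_mean(sigma, 400) / se_of_mean(sigma, 100)
print(f"SE(n=400) / SE(n=100) = {ratio}")
```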

Standardising the Sample Mean

Convert x̄ to a Z-score using the standard error:

Z = (x̄ − μ) / (σ/√n) = (x̄ − μ) / SE

Example:
μ = 76,000, σ = 12,000, n = 50
What is P(x̄ > 79,000)?

SE = 12,000/√50 = 1,697

Z = (79,000 − 76,000) / 1,697 = 3,000/1,697 = 1.77

P(x̄ > 79,000) = P(Z > 1.77) = 1 − 0.9616 = 0.0384

→ Only 3.84% chance that a sample of 50 employees has a mean salary above ₹79,000
if the true mean is ₹76,000.
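The same calculation can be done without a Z-table using NormalDist from Python's standard library. Because Z is not rounded to two decimals here, the result may differ from the table-based 0.0384 in the last digit.

```python
import math
from statistics import NormalDist

mu, sigma, n = 76_000, 12_000, 50

se = sigma / math.sqrt(n)          # standard error of the mean
z = (79_000 - mu) / se             # standardised sample mean
p_upper = 1 - NormalDist().cdf(z)  # P(Z > z) under the standard normal

print(f"SE = {se:,.0f}, Z = {z:.2f}, P(x̄ > 79,000) = {p_upper:.4f}")
```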

When Does the CLT Apply?

Population Shape vs Required n

Normal population:       x̄ is exactly normal for ALL n (even n=1); the CLT is not needed
Symmetric, light-tailed: n ≥ 10–15 usually sufficient
Moderately skewed:       n ≥ 30 usually sufficient
Heavily skewed:          n ≥ 50–100 or more
Highly non-normal:       n ≥ 100 or more may be needed

The CLT requires: random sampling + finite population variance + i.i.d. observations.
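One way to see how the required n depends on population shape is to simulate it. The sketch below assumes an Exponential(1) population (strongly right-skewed, chosen only for illustration) and measures how skewed the sample means still are at several sample sizes. For this population, theory gives a skewness of 2/√n for the sample mean.

```python
import random
import statistics

random.seed(0)  # arbitrary fixed seed


def skewness(values):
    """Skewness as the mean cubed z-score of the values."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


def skew_of_sample_means(n, reps=2_000):
    """Skewness of `reps` simulated sample means, each from n Exponential(1) draws."""
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)


skews = {n: skew_of_sample_means(n) for n in (5, 30, 100)}
for n, sk in skews.items():
    print(f"n = {n:>3}: skewness of sample means = {sk:.2f}")
```

The skewness should shrink toward 0 (the value for a normal distribution) as n grows, but it does not vanish at n = 30.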

Practical Check

Plot a histogram of your sample. If it shows no severe skew or extreme outliers and n ≥ 30, you're likely safe to proceed with normal-based inference.

CLT Beyond the Mean

The CLT applies to other statistics too:

Sample proportion p̂:
p̂ ~ N(p, p(1−p)/n)     for large n (np≥5 and n(1−p)≥5)

Sample sum ΣX:
ΣX ~ N(nμ, nσ²)

Sample difference (x̄₁ − x̄₂):
(x̄₁ − x̄₂) ~ N(μ₁−μ₂, σ₁²/n₁ + σ₂²/n₂)
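These three approximations translate directly into small helpers. The function names and the numbers in the usage lines are hypothetical, chosen only to exercise the formulas.

```python
import math
from statistics import NormalDist


def proportion_dist(p, n):
    """Approximate distribution of a sample proportion: N(p, p(1-p)/n)."""
    return NormalDist(p, math.sqrt(p * (1 - p) / n))


def sum_dist(mu, sigma, n):
    """Approximate distribution of a sum of n i.i.d. values: N(n*mu, n*sigma^2)."""
    return NormalDist(n * mu, sigma * math.sqrt(n))


def mean_difference_dist(mu1, sigma1, n1, mu2, sigma2, n2):
    """Approximate distribution of the difference of two independent sample means."""
    return NormalDist(mu1 - mu2, math.sqrt(sigma1**2 / n1 + sigma2**2 / n2))


# Hypothetical inputs: two groups with different means, SDs and sample sizes.
d = mean_difference_dist(100, 15, 36, 95, 20, 64)
print(f"difference of means: mean = {d.mean}, SD = {d.stdev:.3f}")
```

Note that NormalDist takes a standard deviation, not a variance, so each helper takes the square root of the variance given in the formulas.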

The Law of Large Numbers (LLN)

Closely related to the CLT but different:

CLT: Describes the distribution of x̄ (approximately normal for large n, with SE = σ/√n)
LLN: Says that x̄ converges to μ as n → ∞ (the sample mean gets arbitrarily close to the true mean)

LLN (Weak): As n → ∞, P(|x̄ − μ| > ε) → 0 for any ε > 0
"The sample mean gets arbitrarily close to the population mean."

The LLN is why relative frequency converges to probability with enough trials.
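A quick simulation illustrates this convergence for a fair coin. It is a sketch with an arbitrary seed, and running_frequency is a helper name invented here.

```python
import random

random.seed(1)  # arbitrary fixed seed


def running_frequency(p, checkpoints):
    """Relative frequency of successes at each checkpoint in one long run of trials."""
    successes = 0
    results = {}
    for trial in range(1, max(checkpoints) + 1):
        successes += random.random() < p  # True counts as 1
        if trial in checkpoints:
            results[trial] = successes / trial
    return results


freqs = running_frequency(0.5, {10, 100, 10_000, 100_000})
for n, f in sorted(freqs.items()):
    print(f"after {n:>7,} flips: relative frequency of heads = {f:.4f}")
```

Early frequencies can wander noticeably, while the later ones should sit very close to 0.5.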

Practical Examples

Example 1: Quality Control Sampling

A factory produces components with μ=500g and σ=20g.
A quality inspector takes a random sample of n=25 components.

SE = 20/√25 = 20/5 = 4g

What is P(x̄ < 494)?
Z = (494 − 500)/4 = −6/4 = −1.5
P(Z < −1.5) = 0.0668

→ 6.68% chance the sample mean is below 494g even when the process is centred at 500g.
If the inspector observes x̄ = 492g, that's a Z-score of −2.0 → P(x̄ ≤ 492) ≈ 2.3% → suspicious!
→ This suggests the process mean may have shifted below 500g.
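The inspector's probabilities can be reproduced by treating the sampling distribution of x̄ as a normal distribution with SD equal to the SE (a sketch using the standard library's NormalDist):

```python
import math
from statistics import NormalDist

mu, sigma, n = 500, 20, 25

# Sampling distribution of the mean weight: N(mu, (sigma / sqrt(n))^2)
sampling_dist = NormalDist(mu, sigma / math.sqrt(n))

p_below_494 = sampling_dist.cdf(494)  # chance of x̄ < 494 g if the process is on target
p_below_492 = sampling_dist.cdf(492)  # chance of a mean as low as the observed 492 g

print(f"P(x̄ < 494) = {p_below_494:.4f}")
print(f"P(x̄ < 492) = {p_below_492:.4f}")
```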

Example 2: Financial Planning

An insurance company expects annual claims of μ=₹50,000 per customer, σ=₹80,000.

Portfolio of n=10,000 policies.
Total claims = ΣX ~ N(10,000×50,000, 10,000×80,000²)
             = N(₹500 million, 10,000×6.4 billion)
             = N(₹500M, 64 trillion)

SE of total = √(10,000) × 80,000 = 100 × 80,000 = ₹8 million

P(total claims > ₹520 million):
Z = (520M − 500M) / 8M = 20/8 = 2.5
P(Z > 2.5) = 0.0062

→ Only 0.62% probability that total claims exceed ₹520M
→ To be 99% safe: need reserves of 500M + 2.326×8M = 500M + 18.6M = ₹518.6M
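The portfolio figures can be checked with a short sketch. It relies on the same normal approximation for the total as the calculation above, and uses inv_cdf to get the 99% reserve level without looking up 2.326.

```python
import math
from statistics import NormalDist

mu, sigma, n = 50_000, 80_000, 10_000

# Total claims: approximately N(n * mu, n * sigma^2), so the SD is sigma * sqrt(n).
total = NormalDist(n * mu, sigma * math.sqrt(n))

p_exceed = 1 - total.cdf(520e6)   # P(total claims > ₹520 million)
reserve_99 = total.inv_cdf(0.99)  # reserve level covering 99% of outcomes

print(f"SD of total       = ₹{total.stdev / 1e6:.1f}M")
print(f"P(total > ₹520M)  = {p_exceed:.4f}")
print(f"99% reserve level = ₹{reserve_99 / 1e6:.1f}M")
```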

Example 3: Political Polling

True support for a policy: p = 0.55 (55%)
Sample size: n = 1,000

Sampling distribution of p̂:
Mean: p = 0.55
SE = √(0.55 × 0.45 / 1000) = √(0.2475/1000) = √0.0002475 = 0.01573

P(p̂ < 0.50) = P(Z < (0.50 − 0.55)/0.01573)
             = P(Z < −3.18)
             = 0.0007

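→ Only about a 0.07% chance that a poll of 1,000 shows support below 50% when true support is 55%
→ The familiar "margin of error = ±3%" for n=1,000 comes from the same SE formula: 1.96 × √(0.5 × 0.5/1000) ≈ 0.031 (worst case p = 0.5, 95% confidence)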
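Both the tail probability and the usual poll margin of error follow from the SE of a proportion. The sketch below assumes the conventional 95% level and the worst case p = 0.5 for the margin of error.

```python
import math
from statistics import NormalDist

p, n = 0.55, 1_000

se = math.sqrt(p * (1 - p) / n)
p_below_half = NormalDist(p, se).cdf(0.50)  # P(p̂ < 0.50)

# Conventional 95% margin of error, using the worst case p = 0.5 (an assumption).
z95 = NormalDist().inv_cdf(0.975)
margin = z95 * math.sqrt(0.5 * 0.5 / n)

print(f"SE = {se:.5f}, P(p̂ < 0.50) = {p_below_half:.4f}")
print(f"95% margin of error for n = {n:,}: ±{margin:.3f}")
```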

Example 4: A/B Test Sizing

Current conversion rate: p = 0.05 (5%)
Goal: detect a 1% absolute lift (new rate = 6%)
Power required: 80%, significance level: 5%

Using the CLT-based formula (two-sided test, equal group sizes):
n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²
n = (1.96 + 0.84)² × (0.05×0.95 + 0.06×0.94) / (0.06−0.05)²
n = 7.84 × (0.0475 + 0.0564) / 0.0001
n = 7.84 × 0.1039 / 0.0001
n = 8,146 per group

→ Need about 8,146 visitors per group (16,292 total) to reliably detect a 1-percentage-point lift
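The sizing formula is easy to wrap in a function. This is a sketch under the same assumptions as the formula above (two-sided test, equal group sizes); the function name is invented here. Because it uses unrounded z-values instead of 1.96 and 0.84, the result comes out slightly higher than the hand calculation.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n needed to detect p1 vs p2 (two-sided test, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


n_per_group = sample_size_two_proportions(0.05, 0.06)
print(f"visitors needed per group: {n_per_group:,}")
```

Halving the detectable lift to 0.5 percentage points roughly quadruples the required n, the same square-root law seen with the standard error.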

Sampling Distribution of the Proportion

For a sample proportion p̂ = X/n (where X ~ Binomial):

μ_{p̂} = p
SE_{p̂} = √(p(1−p)/n)

Example:
True employee satisfaction: p = 0.70 (70% satisfied)
n = 100 surveyed

SE = √(0.70 × 0.30/100) = √0.0021 = 0.0458

P(p̂ < 0.60):
Z = (0.60 − 0.70) / 0.0458 = −0.10/0.0458 = −2.18
P(Z < −2.18) = 0.0146

→ Only 1.5% chance of observing < 60% satisfied in a sample of 100
   if the true rate is 70%
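Since X is binomial, the normal approximation can be compared with the exact binomial probability. The sketch below does both. The two answers should be of the same order but need not match exactly, because the approximation ignores the discreteness of X.

```python
import math
from statistics import NormalDist

p, n = 0.70, 100

# Normal approximation from the CLT: P(p̂ < 0.60)
se = math.sqrt(p * (1 - p) / n)
approx = NormalDist(p, se).cdf(0.60)

# Exact binomial: p̂ < 0.60 means at most 59 satisfied employees out of 100.
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(60))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```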

Common Mistakes

1. Confusing SD and SE

σ = 12,000 → spread of INDIVIDUAL salaries
SE = 12,000/√50 = 1,697 → spread of SAMPLE MEANS across different samples

They measure very different things.
A population with a large SD can still yield a small SE when n is large.

2. CLT doesn't fix biased sampling

The CLT says x̄ ~ N(μ, σ²/n) approximately, under random sampling.
If sampling is biased, x̄ converges to the WRONG value (not μ).
No sample size, however large, fixes bias.
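A small simulation makes the point concrete. Everything here is hypothetical: the skewed salary population is invented, and the biased sampling frame assumes only people earning above 70,000 can be reached.

```python
import random
import statistics

random.seed(7)  # arbitrary fixed seed

# Hypothetical salary population (shifted gamma, mean about 76,000).
population = [52_000 + random.gammavariate(4, 6_000) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Biased frame (an assumption): low earners are never sampled.
reachable = [x for x in population if x > 70_000]


def mean_of_sample(frame, n):
    """Mean of a simple random sample of size n drawn from the given frame."""
    return statistics.fmean(random.sample(frame, n))


errors = {n: mean_of_sample(reachable, n) - true_mean for n in (50, 500, 5_000)}
for n, err in errors.items():
    print(f"n = {n:>5}: sample mean minus true mean = {err:+,.0f}")
```

The error does not shrink toward zero as n grows. Larger samples only estimate the wrong quantity (the mean of the reachable group) more precisely.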

3. Applying CLT when observations aren't independent

CLT requires independence.
Cluster sampling, time-series data, or related individuals violate this.
Use appropriate corrections (design effect for cluster sampling).
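As a rough sketch of what such a correction looks like, the commonly used Kish approximation for equal-sized clusters gives a design effect of 1 + (m − 1)ρ, where m is the cluster size and ρ is the intraclass correlation. The survey numbers below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of a simple random sample that would give the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical survey: 1,000 respondents in clusters of 20, intraclass correlation 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1_000, 20, 0.05)
print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

With a design effect above 1, the naive SE = σ/√n understates the true uncertainty; dividing n by the design effect before computing the SE is one simple adjustment.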

4. Thinking "my sample size is large enough" for any distribution

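For heavy-tailed distributions with finite variance (e.g., Pareto with shape just above 2), n=1,000 may still not be enough.
For distributions with infinite variance (e.g., Cauchy), the CLT does not apply at all: x̄ never becomes approximately normal, however large n is.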
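The Cauchy case is worth simulating, because it has no finite mean or variance and so falls outside the CLT's conditions entirely. The sketch below compares the interquartile range (IQR) of sample means for Cauchy and normal data; the helper names and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(3)  # arbitrary fixed seed


def cauchy():
    """One standard Cauchy draw via the inverse-CDF method."""
    return math.tan(math.pi * (random.random() - 0.5))


def normal():
    """One standard normal draw."""
    return random.gauss(0, 1)


def iqr_of_sample_means(draw, n, reps=500):
    """IQR of `reps` simulated sample means, each computed from n draws."""
    means = [statistics.fmean([draw() for _ in range(n)]) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


iqr_cauchy = {n: iqr_of_sample_means(cauchy, n) for n in (10, 1_000)}
iqr_normal = {n: iqr_of_sample_means(normal, n) for n in (10, 1_000)}
for n in (10, 1_000):
    print(f"n = {n:>5}: IQR of means, Cauchy = {iqr_cauchy[n]:.2f}, normal = {iqr_normal[n]:.3f}")
```

For normal data the spread of x̄ shrinks like 1/√n. For Cauchy data it should stay roughly the same at n = 1,000 as at n = 10, because the mean of n Cauchy draws has the same distribution as a single draw.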

Practice Exercises

1. Heights of adults: μ=170 cm, σ=8 cm. A sample of n=64 is taken. Find: a) SE b) P(x̄ > 172) c) P(168 < x̄ < 172)
2. A coin is flipped n=400 times. Use the CLT to find P(proportion of heads < 0.47).
3. Daily sales revenue: μ=₹50,000, σ=₹15,000. Monthly average (30 working days): a) Mean and SE of x̄_month b) P(monthly average < ₹47,000)
4. You need to estimate average customer spend within ±₹200 with 95% confidence. Population SD = ₹1,500. What sample size is required?
5. Two factories: Factory A (n=100, x̄=₹490) and Factory B (n=64, x̄=₹510). Both σ=₹60. How many SEs apart are the two sample means?

Summary

In this chapter you learned:

Sampling distribution of x̄: describes how sample means vary across all possible samples of size n
Central Limit Theorem: x̄ ~ N(μ, σ²/n) approximately for large n (rule of thumb: n ≥ 30), regardless of population shape, provided the variance is finite
Standard Error (SE) = σ/√n: the SD of the sampling distribution; decreases as n increases
To halve the SE, you must quadruple the sample size
Standardise: Z = (x̄ − μ) / SE — then use normal table to find probabilities
CLT for proportions: p̂ ~ N(p, p(1−p)/n) for large n (np≥5 and n(1-p)≥5)
Law of Large Numbers: x̄ converges to μ as n → ∞ (LLN is about convergence; CLT is about the distribution)
Large samples don't fix biased sampling — design matters more than size
SE ≠ SD: SD describes individual values; SE describes variation in sample statistics

Next up: Confidence Intervals — using the sampling distribution to build a range of plausible values for the population mean.