ANOVA — Analysis of Variance | Statistics Tutorial | Meritshot

Why Not Just Run Multiple t-Tests?

If you want to compare means across 3 groups (A, B, C), you could run three separate t-tests: A vs B, A vs C, B vs C.

Problem: With α=0.05 per test and 3 tests, the probability of at least one false positive (treating the tests as independent) is:

P(at least one Type I error) = 1 − (1−0.05)³ = 1 − 0.857 = 0.143 = 14.3%

The more groups, the worse this gets. With 5 groups there are 10 pairwise tests, and the chance of at least one false positive rises to about 40%.

ANOVA compares all groups simultaneously in a single test, maintaining the overall error rate at α.
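The inflation is easy to tabulate for any number of groups. The sketch below assumes, as the formula above does, that the pairwise tests are independent (they are not exactly, so treat the results as approximations); the function name `familywise_error_rate` is illustrative.

```python
from math import comb

def familywise_error_rate(k_groups: int, alpha: float = 0.05) -> float:
    """Approximate P(at least one Type I error) across all pairwise t-tests."""
    n_tests = comb(k_groups, 2)          # number of distinct pairs of groups
    return 1 - (1 - alpha) ** n_tests    # assumes the tests are independent

for k in (3, 4, 5, 6):
    print(f"{k} groups, {comb(k, 2)} tests: {familywise_error_rate(k):.3f}")
```

For 3 groups this reproduces the 14.3% above, and for 5 groups the roughly 40% figure.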

The ANOVA Idea

ANOVA decomposes the total variability in the data into two parts:

Total Variability = Variability BETWEEN groups + Variability WITHIN groups

                   ↑                               ↑
           Due to group differences         Due to random variation
           (what we care about)             within each group

If between-group variation >> within-group variation → evidence that the group means differ
F = Between-group variance / Within-group variance

When H₀ is true (all group means equal), both variances estimate the same σ², so F ≈ 1. When H₁ is true (groups differ), F tends to exceed 1; how far above 1 it must be to count as significant is set by the F-distribution.
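A small simulation can make the "F ≈ 1 under H₀" claim concrete. This is a sketch under stated assumptions: three groups of 5 are drawn from one normal population (the mean of 75 and SD of 5 are hypothetical), and 3.885 is the 5% critical value for F(2, 12) used in the worked example later on. The helper name `f_statistic` is illustrative.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F = MSB / MSW for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

random.seed(0)  # fixed seed so the run is repeatable

# H0 holds by construction: every group comes from the same population
null_fs = [
    f_statistic([[random.gauss(75, 5) for _ in range(5)] for _ in range(3)])
    for _ in range(2000)
]

avg_f = sum(null_fs) / len(null_fs)
share_above = sum(f > 3.885 for f in null_fs) / len(null_fs)
print(f"average F under H0: {avg_f:.2f}")
print(f"share of F values above 3.885: {share_above:.3f}")
```

The average should land near 1 (in theory (N−k)/(N−k−2) = 1.2 for 12 within-group degrees of freedom), and roughly 5% of the simulated F values should exceed the critical value.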

Hypotheses

H₀: μ₁ = μ₂ = μ₃ = ... = μₖ    (all k group means are equal)
H₁: At least one mean is different from the others

Note: H₁ says "at least one is different" — not which one or how many.

The F-Statistic

F = MSB / MSW = Between-group Mean Square / Within-group Mean Square

Where:
MSB = SSB / (k−1)     Between-group Mean Square
MSW = SSW / (N−k)     Within-group Mean Square

SSB = Σnᵢ(x̄ᵢ − x̄_grand)²    Between-group Sum of Squares
SSW = ΣΣ(xᵢⱼ − x̄ᵢ)²          Within-group Sum of Squares
SST = SSB + SSW                Total Sum of Squares

k = number of groups
N = total number of observations
nᵢ = size of group i
x̄ᵢ = mean of group i
x̄_grand = grand mean = (Σxᵢⱼ)/N

The F statistic follows an F-distribution with (k−1) and (N−k) degrees of freedom.
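The identity SST = SSB + SSW is not an extra assumption; it follows from splitting each deviation from the grand mean around its group mean. A short derivation in the notation above:

```latex
\begin{aligned}
\mathrm{SST}
  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \bigl(x_{ij} - \bar{x}_{\text{grand}}\bigr)^2 \\
  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}
     \bigl[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}_{\text{grand}})\bigr]^2 \\
  &= \underbrace{\sum_{i}\sum_{j} (x_{ij} - \bar{x}_i)^2}_{\mathrm{SSW}}
   + \underbrace{\sum_{i} n_i (\bar{x}_i - \bar{x}_{\text{grand}})^2}_{\mathrm{SSB}}
   + 2\sum_{i} (\bar{x}_i - \bar{x}_{\text{grand}})
       \underbrace{\sum_{j} (x_{ij} - \bar{x}_i)}_{=\,0}
\end{aligned}
```

The cross term vanishes because deviations from a group's own mean always sum to zero.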

Worked Example

Scenario: Compare average performance scores across 3 training methods.

Method A: 72, 68, 75, 70, 74        n₁=5, x̄₁=71.8
Method B: 85, 80, 88, 82, 90        n₂=5, x̄₂=85.0
Method C: 78, 75, 82, 79, 76        n₃=5, x̄₃=78.0

k = 3, N = 15

Grand mean:
x̄_grand = (Σall observations) / 15
         = (72+68+75+70+74 + 85+80+88+82+90 + 78+75+82+79+76) / 15
         = (359 + 425 + 390) / 15
         = 1174 / 15
         = 78.27

Step 1: SSB (Between groups)
SSB = n₁(x̄₁−x̄_grand)² + n₂(x̄₂−x̄_grand)² + n₃(x̄₃−x̄_grand)²
    = 5(71.8−78.27)² + 5(85.0−78.27)² + 5(78.0−78.27)²
    = 5(−6.47)² + 5(6.73)² + 5(−0.27)²
    = 5(41.86) + 5(45.29) + 5(0.073)
    = 209.3 + 226.45 + 0.365
    = 436.11

Step 2: SSW (Within groups)
Method A deviations from x̄₁=71.8:
(72−71.8)² + (68−71.8)² + (75−71.8)² + (70−71.8)² + (74−71.8)²
= 0.04 + 14.44 + 10.24 + 3.24 + 4.84 = 32.8

Method B deviations from x̄₂=85.0:
(85−85)² + (80−85)² + (88−85)² + (82−85)² + (90−85)²
= 0 + 25 + 9 + 9 + 25 = 68.0

Method C deviations from x̄₃=78.0:
(78−78)² + (75−78)² + (82−78)² + (79−78)² + (76−78)²
= 0 + 9 + 16 + 1 + 4 = 30.0

SSW = 32.8 + 68.0 + 30.0 = 130.8

Step 3: ANOVA Table
Source          SS       df      MS          F
Between       436.11    k−1=2   218.06      20.01
Within        130.8     N−k=12   10.9
Total         566.91    N−1=14

F = MSB / MSW = 218.06 / 10.9 = 20.01
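Steps 1 to 3 can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the variable names are illustrative. Because it keeps the grand mean at full precision instead of rounding it to 78.27, SSB comes out as about 436.13 rather than 436.11, while F still rounds to 20.01.

```python
from statistics import mean

scores = {
    "A": [72, 68, 75, 70, 74],
    "B": [85, 80, 88, 82, 90],
    "C": [78, 75, 82, 79, 76],
}

k = len(scores)
n_total = sum(len(g) for g in scores.values())
grand_mean = sum(sum(g) for g in scores.values()) / n_total

# Between-group and within-group sums of squares
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in scores.values())
ssw = sum(sum((x - mean(g)) ** 2 for x in g) for g in scores.values())

msb = ssb / (k - 1)         # between-group mean square
msw = ssw / (n_total - k)   # within-group mean square
f_stat = msb / msw

print(f"SSB={ssb:.2f}  SSW={ssw:.2f}  SST={ssb + ssw:.2f}")
print(f"MSB={msb:.2f}  MSW={msw:.2f}  F={f_stat:.2f}")
```

If SciPy is installed, `scipy.stats.f_oneway` applied to the three lists should report the same F statistic.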

Step 4: Critical value
F(2, 12) at α=0.05: F_crit = 3.885
Our F = 20.01 >> 3.885 → REJECT H₀

p-value < 0.001

Conclusion: There is significant evidence of a difference in average performance
scores between training methods (F(2,12) = 20.01, p < 0.001).
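The critical value and p-value in Step 4 normally come from a table or a statistics library. For the special case of 2 numerator degrees of freedom (k = 3 groups), the F tail probability has a simple closed form, P(F > f) = (1 + 2f/df₂)^(−df₂/2), which the sketch below uses. The function names are illustrative, and the shortcut does not apply to other values of k.

```python
def f_sf_df1_2(f: float, df2: int) -> float:
    """P(F > f) for an F distribution with 2 and df2 degrees of freedom."""
    return (1 + 2 * f / df2) ** (-df2 / 2)

def f_crit_df1_2(alpha: float, df2: int) -> float:
    """Critical value: the f for which f_sf_df1_2(f, df2) equals alpha."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)

f_stat, df2, alpha = 20.01, 12, 0.05
f_crit = f_crit_df1_2(alpha, df2)
p_value = f_sf_df1_2(f_stat, df2)

print(f"F_crit = {f_crit:.3f}")
print(f"p-value = {p_value:.5f}")
print("reject H0" if f_stat > f_crit else "fail to reject H0")
```

For general degrees of freedom, SciPy's `scipy.stats.f.ppf` and `scipy.stats.f.sf` serve the same purpose.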

ANOVA Table Structure

Source of      Sum of     Degrees of   Mean Square        F-ratio        p-value
Variation      Squares    Freedom      (MS = SS/df)
──────────────────────────────────────────────────────────────────────────────────
Between        SSB        k−1          MSB = SSB/(k−1)    F = MSB/MSW    p
(Treatment)
Within         SSW        N−k          MSW = SSW/(N−k)
(Error)
──────────────────────────────────────────────────────────────────────────────────
Total          SST        N−1

Post-Hoc Tests

ANOVA tells you THAT at least one mean differs, but not WHICH pairs differ.

Post-hoc tests compare all pairs of means while controlling the family-wise error rate:

Tukey's HSD (Honest Significant Difference)

The most commonly used post-hoc test for balanced designs (equal group sizes):

HSD = q* × √(MSW/n)

Where q* is from the studentised range distribution (depends on α, k, and N−k) and n is the number of observations per group

For our example (k=3, df_W=12, α=0.05):
q* ≈ 3.77 (from table)
HSD = 3.77 × √(10.9/5) = 3.77 × 1.477 = 5.57

Pairwise differences:
|x̄₁ − x̄₂| = |71.8 − 85.0| = 13.2 > 5.57 → A vs B: SIGNIFICANT
|x̄₁ − x̄₃| = |71.8 − 78.0| = 6.2 > 5.57 → A vs C: SIGNIFICANT
|x̄₂ − x̄₃| = |85.0 − 78.0| = 7.0 > 5.57 → B vs C: SIGNIFICANT

All three pairs are significantly different!
Method B has the highest mean; Method A has the lowest.
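The HSD comparison above can be sketched as follows. The value q* = 3.77 is taken from the table lookup quoted in the example, not computed, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

means = {"A": 71.8, "B": 85.0, "C": 78.0}
msw, n_per_group = 10.9, 5
q_star = 3.77  # table value quoted above for k=3, df_W=12, alpha=0.05

hsd = q_star * sqrt(msw / n_per_group)
print(f"HSD = {hsd:.2f}")

results = {}
for g1, g2 in combinations(means, 2):
    diff = abs(means[g1] - means[g2])
    results[(g1, g2)] = diff > hsd   # True when the pair differs significantly
    verdict = "significant" if diff > hsd else "not significant"
    print(f"{g1} vs {g2}: |diff| = {diff:.1f} -> {verdict}")
```

Because the design is balanced, one threshold serves every pair; with unequal group sizes each pair would need its own standard error.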

Test            When to Use
Tukey's HSD     All pairwise comparisons with equal group sizes (balanced design)
Bonferroni      A small number of pre-planned comparisons; works with unequal n; conservative
Scheffé         Complex contrasts (not just pairwise); most conservative
Games-Howell    Unequal variances (the pairwise counterpart to Welch's ANOVA)
Dunnett's       Compare each group to one control group only
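Of the tests in the table, Bonferroni is the simplest to apply by hand: divide α by the number of comparisons and test each pair at that stricter level. A minimal sketch for the three-group example follows; the function name is illustrative, and the family-wise figure assumes independent comparisons.

```python
from math import comb

def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons

k = 3                                    # groups in the worked example
m = comb(k, 2)                           # number of pairwise comparisons
alpha_each = bonferroni_alpha(0.05, m)
fwer = 1 - (1 - alpha_each) ** m         # family-wise rate if tests were independent

print(f"{m} comparisons, each tested at alpha = {alpha_each:.4f}")
print(f"family-wise error rate under independence: {fwer:.4f}")
```

The resulting family-wise rate stays just under 0.05, which is why the method is described as conservative.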

Assumptions of One-Way ANOVA

1. INDEPENDENCE: Observations within and across groups are independent
   Check: Study design (random sampling/assignment)

2. NORMALITY: Each group's data is approximately normal
   Check: Histogram or QQ plot per group; Shapiro-Wilk test
   Robust to this with large samples (n≥30 per group) via CLT

3. HOMOSCEDASTICITY (Equal Variances): All groups have the same σ²
   Check: Levene's test; rule of thumb (largest SD / smallest SD ≤ 2)
   If violated: use Welch's ANOVA (unequal variances version)

Checking Homoscedasticity

Levene's test: H₀: all group variances are equal

Rule of thumb: if largest SD / smallest SD > 2, consider Welch's ANOVA

Our example:
Method A: s₁ = √(32.8/4) = √8.2 = 2.86
Method B: s₂ = √(68/4) = √17 = 4.12
Method C: s₃ = √(30/4) = √7.5 = 2.74

Max/Min SD ratio = 4.12/2.74 = 1.50 < 2 → OK to use standard ANOVA
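Both checks can be sketched in plain Python. The Levene statistic here is the original mean-centred version, which is a one-way ANOVA F computed on the absolute deviations |x − group mean|; some libraries (SciPy's `scipy.stats.levene`, for instance) centre on the median by default, so their output may differ slightly. Variable names are illustrative.

```python
from statistics import mean, stdev

groups = {
    "A": [72, 68, 75, 70, 74],
    "B": [85, 80, 88, 82, 90],
    "C": [78, 75, 82, 79, 76],
}

# Rule of thumb: ratio of the largest to the smallest sample SD
sds = {name: stdev(g) for name, g in groups.items()}
sd_ratio = max(sds.values()) / min(sds.values())
print({name: round(s, 2) for name, s in sds.items()}, "ratio =", round(sd_ratio, 2))

# Levene's statistic: one-way ANOVA F on absolute deviations from group means
abs_dev = [[abs(x - mean(g)) for x in g] for g in groups.values()]
k = len(abs_dev)
n_total = sum(len(z) for z in abs_dev)
grand = sum(sum(z) for z in abs_dev) / n_total
ssb = sum(len(z) * (mean(z) - grand) ** 2 for z in abs_dev)
ssw = sum(sum((v - mean(z)) ** 2 for v in z) for z in abs_dev)
levene_w = (ssb / (k - 1)) / (ssw / (n_total - k))
print(f"Levene W = {levene_w:.2f} (compare with F_crit(2, 12) = 3.885)")
```

W falls well below the critical value, so the test agrees with the rule of thumb: there is no evidence against equal variances in this example.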

Effect Size: Eta-Squared (η²)

Measures the proportion of total variability explained by group differences:

η² = SSB / SST = 436.11 / 566.91 = 0.769

Interpretation:
η² = 0.01: small effect
η² = 0.06: medium effect
η² = 0.14: large effect

η² = 0.769 → 76.9% of the variability in scores is explained by the training method → very large effect!

Partial η² is used in factorial ANOVA; ω² is a less biased alternative to η².
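Both effect sizes can be computed from the ANOVA table. The sketch below uses the sums of squares from the worked example and the usual one-way formula ω² = (SSB − (k−1)·MSW) / (SST + MSW); variable names are illustrative.

```python
ssb, ssw = 436.11, 130.8      # sums of squares from the worked example
k, n_total = 3, 15

sst = ssb + ssw
msw = ssw / (n_total - k)

eta_sq = ssb / sst
omega_sq = (ssb - (k - 1) * msw) / (sst + msw)

print(f"eta^2   = {eta_sq:.3f}")
print(f"omega^2 = {omega_sq:.3f}")
```

ω² comes out somewhat lower than η² (about 0.72 against 0.77), reflecting the upward bias of η² in small samples.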

Two-Way ANOVA (Brief Overview)

One-way ANOVA has one factor. Two-way ANOVA has two factors simultaneously:

Example: Does performance depend on:
- Factor 1: Training method (A, B, C)
- Factor 2: Experience level (Junior, Senior)
- AND does their INTERACTION matter? (Does Method A work better for Juniors than for Seniors?)

Two-way ANOVA produces THREE F-tests:
1. Main effect of Training method
2. Main effect of Experience level
3. Interaction effect (Training × Experience)

If the interaction is significant → the effect of one factor depends on the level of the other.
In that case, interpret the main effects with caution and examine an interaction plot.
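To show where the three F-tests come from, here is a minimal sketch of the sum-of-squares decomposition for a balanced two-way design. The scores are hypothetical, invented only for illustration, and the formulas assume the same number of observations in every cell.

```python
from statistics import mean

# Hypothetical scores: data[method][experience] -> replicates (n = 3 per cell)
data = {
    "A": {"Junior": [70, 72, 68], "Senior": [74, 75, 73]},
    "B": {"Junior": [85, 88, 84], "Senior": [83, 82, 86]},
    "C": {"Junior": [76, 78, 75], "Senior": [80, 82, 79]},
}
methods = list(data)
levels = list(data["A"])
a, b, n = len(methods), len(levels), 3

cell = {(m, e): mean(data[m][e]) for m in methods for e in levels}
row = {m: mean(cell[m, e] for e in levels) for m in methods}   # method means
col = {e: mean(cell[m, e] for m in methods) for e in levels}   # experience means
grand = mean(cell.values())

ss_a = b * n * sum((row[m] - grand) ** 2 for m in methods)
ss_b = a * n * sum((col[e] - grand) ** 2 for e in levels)
ss_ab = n * sum((cell[m, e] - row[m] - col[e] + grand) ** 2
                for m in methods for e in levels)
ss_e = sum((x - cell[m, e]) ** 2
           for m in methods for e in levels for x in data[m][e])

ms_e = ss_e / (a * b * (n - 1))                    # error mean square
f_a = (ss_a / (a - 1)) / ms_e                      # main effect: method
f_b = (ss_b / (b - 1)) / ms_e                      # main effect: experience
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_e        # interaction

print(f"F(method) = {f_a:.2f}, F(experience) = {f_b:.2f}, F(interaction) = {f_ab:.2f}")
```

Each F is compared with its own F-distribution: (a−1), (b−1) and (a−1)(b−1) numerator degrees of freedom respectively, all with ab(n−1) denominator degrees of freedom.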

Non-Parametric Alternative: Kruskal-Wallis

When ANOVA assumptions are severely violated (non-normal data, small samples, ordinal data):

Kruskal-Wallis H test:
- Non-parametric version of one-way ANOVA
- Uses ranks instead of raw values
- H₀: all populations have the same distribution (not just equal means)
- Decision: reject if H > χ²_critical (df = k−1)

Post-hoc for Kruskal-Wallis: Dunn's test
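As an illustration, the H statistic can be computed by hand for the training-method data from the worked example. This sketch uses average ranks for ties and the standard tie correction; the p-value line relies on the fact that the χ² tail probability is exp(−H/2) when df = 2, so it applies only to three groups. With just 5 observations per group the χ² approximation is rough.

```python
from math import exp

groups = {
    "A": [72, 68, 75, 70, 74],
    "B": [85, 80, 88, 82, 90],
    "C": [78, 75, 82, 79, 76],
}

# Rank all observations together; tied values share the average of their ranks
pooled = sorted(x for g in groups.values() for x in g)
n_total = len(pooled)
rank_of = {}
for value in set(pooled):
    first = pooled.index(value) + 1            # 1-based rank of first occurrence
    count = pooled.count(value)
    rank_of[value] = first + (count - 1) / 2   # average rank across the tie

h = 12 / (n_total * (n_total + 1)) * sum(
    sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups.values()
) - 3 * (n_total + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N)
ties = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
h /= 1 - ties / (n_total ** 3 - n_total)

df = len(groups) - 1
p_value = exp(-h / 2)   # chi-square tail probability, valid only for df = 2
print(f"H = {h:.2f}, df = {df}, p = {p_value:.4f}")
```

H exceeds the χ² critical value of 5.991 for df = 2 at α = 0.05, so the rank-based test reaches the same conclusion as the ANOVA for these data.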

Practical Examples

Example 1: Salary Comparison Across Departments

HR asks: Is there a significant salary difference across Finance, Tech, Marketing, and HR?

Finance (n=12): x̄₁=82k, s₁=9k
Technology (n=15): x̄₂=95k, s₂=14k
Marketing (n=10): x̄₃=71k, s₃=8k
HR (n=8): x̄₄=65k, s₄=7k

Step 1: Run one-way ANOVA on the summary statistics above (SSB ≈ 5999, SSW = 4554, in k²) → F(3, 41) ≈ 18.0, p < 0.001 → reject H₀

Step 2: Tukey's HSD post-hoc (Tukey-Kramer form, since group sizes differ):
Tech vs HR: significant (30k difference)
Tech vs Marketing: significant (24k difference)
Tech vs Finance: significant (13k difference), so all pairs involving Technology are significant
Finance vs HR: significant (17k difference)
Finance vs Marketing: borderline (11k difference)

Conclusion: Technology employees earn significantly more than all other departments.
η² = SSB/SST ≈ 0.57 → 57% of salary variance explained by department → large effect.

Example 2: Website Load Time by Server Region

Three server regions, 20 measurements each (milliseconds):
Region A (n=20): x̄=245ms
Region B (n=20): x̄=310ms
Region C (n=20): x̄=280ms

F(2, 57) = 8.92, p = 0.0004 → reject H₀

Post-hoc (Tukey's HSD; with equal n one threshold applies to every pair, and the F and means above imply HSD ≈ 37ms):
Region A vs B: significant (65ms difference)
Region A vs C: not significant (35ms difference)
Region B vs C: not significant (30ms difference)

→ Region A is significantly faster than B; C does not differ significantly from either
→ Action: Route more traffic to Region A

Example 3: Customer Satisfaction by Complaint Resolution

Customer satisfaction scores (1–10) after complaints resolved in <1 day, 1–3 days, >3 days.

Since satisfaction scale is ordinal, use Kruskal-Wallis:
H = 15.3, df = 2, p = 0.0005 → reject H₀

Dunn's post-hoc:
<1 day vs >3 days: significant (higher satisfaction for faster resolution)
1–3 days vs >3 days: significant
<1 day vs 1–3 days: not significant

→ Satisfaction does not differ significantly between <1 day and 1–3 days, but it is significantly lower once resolution takes more than 3 days.

Common Mistakes

1. Not checking assumptions

If variances are very unequal (Levene's p<0.05): use Welch's ANOVA
If data is severely non-normal with small n: use Kruskal-Wallis

2. Running post-hoc tests without a significant ANOVA

If overall F is not significant → stop. Don't run post-hoc tests.
"Fishing" for significant pairs inflates error rates.

3. Confusing ANOVA with a test of variances

ANOVA tests MEANS — not variances! The name is misleading.
It works by partitioning variance, but the null hypothesis is about MEANS.

4. Interpreting η² as the full story

η² = 0.05 is "small" but in a practical context might still matter.
Always combine with domain knowledge and the actual magnitude of differences.

Practice Exercises

1. Three diets: Diet A (n=8): x̄=68kg, s=3. Diet B (n=8): x̄=72kg, s=4. Diet C (n=8): x̄=65kg, s=3.5. Compute SSB, SSW, the F-statistic, and test at α=0.05.
2. Explain why comparing 5 groups with separate pairwise t-tests is problematic. How many tests are required, and what is the inflated false positive rate (α=0.05 per test)?
3. ANOVA result: F(3,36)=5.2, p=0.004. Which post-hoc test would you use if groups have equal sizes? If group sizes are very unequal?
4. An ANOVA produces η²=0.23. What percentage of variability is explained by the group factor? Is this a small, medium, or large effect?
5. ANOVA assumptions are violated: groups are non-normal with small n (n=6 per group). What alternative test would you use?

Summary

In this chapter you learned:

- Why ANOVA: running multiple t-tests inflates the Type I error rate; ANOVA tests all groups simultaneously at level α
- F-statistic = MSB/MSW = (Between-group variance) / (Within-group variance); F ≈ 1 under H₀ and tends to exceed 1 when groups differ
- ANOVA table: SSB (df=k−1), SSW (df=N−k), SST (df=N−1); F = MSB/MSW
- H₀: all group means equal; H₁: at least one mean differs
- Post-hoc tests (after a significant ANOVA): Tukey's HSD (equal n), Bonferroni, Games-Howell (unequal variances)
- Assumptions: independence, normality within each group, equal variances (homoscedasticity)
- Levene's test checks equal variances; use Welch's ANOVA if they are unequal; use Kruskal-Wallis if the data are severely non-normal
- Effect size η² = SSB/SST → proportion of variance explained by groups; 0.01=small, 0.06=medium, 0.14=large
- Two-way ANOVA: two factors plus their interaction; an interaction means the effect of one factor depends on the level of the other
- ANOVA tests MEANS, not variances, despite the name

Next up: Correlation — quantifying the linear relationship between two variables.
