Why Not Just Run Multiple t-Tests?
If you want to compare means across 3 groups (A, B, C), you could run three separate t-tests: A vs B, A vs C, B vs C.
Problem: With α=0.05 per test and 3 tests, the probability of at least one false positive is:
P(at least one Type I error) = 1 − (1−0.05)³ = 1 − 0.857 = 0.143 = 14.3%
The more groups, the worse this gets: 5 groups require 10 pairwise tests, which pushes the chance of at least one false positive to about 40%.
ANOVA compares all groups simultaneously in a single test, maintaining the overall error rate at α.
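The inflation described above is easy to tabulate. The sketch below is illustrative only: the function name `familywise_error_rate` is made up for this example, and it treats the pairwise tests as independent, which is the same simplification the formula above makes.

```python
from math import comb


def familywise_error_rate(num_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests.

    Assumes the tests are independent, as in the formula in the text.
    """
    num_tests = comb(num_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** num_tests


for k in (3, 4, 5, 6):
    rate = familywise_error_rate(k)
    print(f"{k} groups -> {comb(k, 2):2d} tests -> {rate:.1%}")
```

The number of tests grows as k(k−1)/2, so the error rate climbs quickly even for a modest number of groups.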
The ANOVA Idea
ANOVA decomposes the total variability in the data into two parts:
Total Variability = Variability BETWEEN groups + Variability WITHIN groups
- BETWEEN groups: due to group differences (what we care about)
- WITHIN groups: due to random variation within each group
If between-group variation >> within-group variation → evidence that the group means differ
F = Between-group variance / Within-group variance
When H₀ is true (all group means equal), F tends to be close to 1. When H₁ is true (group means differ), F tends to be larger than 1; how far above 1 it falls determines significance.
Hypotheses
H₀: μ₁ = μ₂ = μ₃ = ... = μₖ (all k group means are equal)
H₁: At least one mean is different from the others
Note: H₁ says "at least one is different" — not which one or how many.
The F-Statistic
F = MSB / MSW = Between-group Mean Square / Within-group Mean Square
Where:
MSB = SSB / (k−1), the between-group mean square
MSW = SSW / (N−k), the within-group mean square
SSB = Σnᵢ(x̄ᵢ − x̄_grand)², the between-group sum of squares
SSW = ΣΣ(xᵢⱼ − x̄ᵢ)², the within-group sum of squares
SST = SSB + SSW, the total sum of squares
k = number of groups
N = total number of observations
nᵢ = size of group i
x̄ᵢ = mean of group i
x̄_grand = grand mean = (Σxᵢⱼ)/N
The F statistic follows an F-distribution with (k−1) and (N−k) degrees of freedom.
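The definitions above translate almost line for line into code. This is a minimal sketch using only the standard library; the function name `one_way_anova` and the toy data are invented for illustration, and the function stops at the F ratio without computing a p-value.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, mean squares and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [fmean(g) for g in groups]

    # SSB: squared distance of each group mean from the grand mean, weighted by n_i
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSW: squared distance of each observation from its own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    msb, msw = ssb / df_between, ssw / df_within
    return {
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "df": (df_between, df_within),
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }


toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data, not from the text
result = one_way_anova(toy)
print(result)
```

For the toy data the group means are 2, 3 and 6, which gives SSB = 26, SSW = 6 and F = 13 on (2, 6) degrees of freedom.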
Worked Example
Scenario: Compare average performance scores across 3 training methods.
Method A: 72, 68, 75, 70, 74 (n₁=5, x̄₁=71.8)
Method B: 85, 80, 88, 82, 90 (n₂=5, x̄₂=85.0)
Method C: 78, 75, 82, 79, 76 (n₃=5, x̄₃=78.0)
k = 3, N = 15
Grand mean:
x̄_grand = (Σall observations) / 15
= (72+68+75+70+74 + 85+80+88+82+90 + 78+75+82+79+76) / 15
= (359 + 425 + 390) / 15
= 1174 / 15
= 78.27
Step 1: SSB (Between groups)
SSB = n₁(x̄₁−x̄_grand)² + n₂(x̄₂−x̄_grand)² + n₃(x̄₃−x̄_grand)²
= 5(71.8−78.27)² + 5(85.0−78.27)² + 5(78.0−78.27)²
= 5(−6.47)² + 5(6.73)² + 5(−0.27)²
= 5(41.86) + 5(45.29) + 5(0.073)
= 209.3 + 226.45 + 0.365
= 436.11 (436.13 if the grand mean is carried at full precision instead of being rounded to 78.27)
Step 2: SSW (Within groups)
Method A deviations from x̄₁=71.8:
(72−71.8)² + (68−71.8)² + (75−71.8)² + (70−71.8)² + (74−71.8)²
= 0.04 + 14.44 + 10.24 + 3.24 + 4.84 = 32.8
Method B deviations from x̄₂=85.0:
(85−85)² + (80−85)² + (88−85)² + (82−85)² + (90−85)²
= 0 + 25 + 9 + 9 + 25 = 68.0
Method C deviations from x̄₃=78.0:
(78−78)² + (75−78)² + (82−78)² + (79−78)² + (76−78)²
= 0 + 9 + 16 + 1 + 4 = 30.0
SSW = 32.8 + 68.0 + 30.0 = 130.8
Step 3: ANOVA Table
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between | 436.11 | k−1=2 | 218.06 | 20.01 |
| Within | 130.8 | N−k=12 | 10.9 | |
| Total | 566.91 | N−1=14 | | |
F = MSB / MSW = 218.06 / 10.9 = 20.01
Step 4: Critical value
F(2, 12) at α=0.05: F_crit = 3.885
Our F = 20.01 >> 3.885 → REJECT H₀
p-value < 0.001
Conclusion: There is significant evidence of a difference in average performance
scores between training methods (F(2,12) = 20.01, p < 0.001).
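The hand calculation can be cross-checked in a few lines. The sketch below recomputes the sums of squares from the raw scores, so its SSB comes out slightly higher than the hand-rounded figure, while F still rounds to 20.01. The p-value uses a closed form that holds only when the numerator degrees of freedom equal 2, as they do here; the variable names are illustrative.

```python
from statistics import fmean

scores = {
    "A": [72, 68, 75, 70, 74],
    "B": [85, 80, 88, 82, 90],
    "C": [78, 75, 82, 79, 76],
}
all_scores = [x for g in scores.values() for x in g]
k, n_total = len(scores), len(all_scores)
grand_mean = fmean(all_scores)

ssb = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in scores.values())
ssw = sum((x - fmean(g)) ** 2 for g in scores.values() for x in g)
df_between, df_within = k - 1, n_total - k
f_stat = (ssb / df_between) / (ssw / df_within)

# Upper-tail probability of the F distribution. This closed form is valid
# ONLY for numerator df = 2: P(F > f) = (1 + 2f/df_within) ** (-df_within/2)
assert df_between == 2
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

F_CRIT = 3.885  # F(2, 12) critical value at alpha = 0.05, as quoted in the text
reject_h0 = f_stat > F_CRIT

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.5f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

For designs with more than three groups the closed form no longer applies, and a statistics library or an F table is needed for the p-value.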
ANOVA Table Structure
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square (MS = SS/df) | F-ratio | p-value |
|---|---|---|---|---|---|
| Between (Treatment) | SSB | k−1 | MSB = SSB/(k−1) | F = MSB/MSW | p |
| Within (Error) | SSW | N−k | MSW = SSW/(N−k) | | |
| Total | SST | N−1 | | | |
Post-Hoc Tests
ANOVA tells you THAT at least one mean differs, but not WHICH pairs differ.
Post-hoc tests compare all pairs of means while controlling the family-wise error rate:
Tukey's HSD (Honest Significant Difference)
The most commonly used post-hoc test for balanced designs (equal group sizes):
HSD = q* × √(MSW/n)
Where q* is from the studentised range distribution (depends on α, k, and N−k)
For our example (k=3, df_W=12, α=0.05):
q* ≈ 3.77 (from table)
HSD = 3.77 × √(10.9/5) = 3.77 × 1.477 = 5.57
Pairwise differences:
|x̄₁ − x̄₂| = |71.8 − 85.0| = 13.2 > 5.57 → A vs B: SIGNIFICANT
|x̄₁ − x̄₃| = |71.8 − 78.0| = 6.2 > 5.57 → A vs C: SIGNIFICANT
|x̄₂ − x̄₃| = |85.0 − 78.0| = 7.0 > 5.57 → B vs C: SIGNIFICANT
All three pairs are significantly different!
Method B has the highest mean; Method A has the lowest.
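The comparison against the HSD threshold can be scripted as follows. This is a sketch for the balanced case only, and it takes q* = 3.77 from the table lookup above instead of computing it, since the standard library has no studentised range distribution.

```python
from itertools import combinations
from math import sqrt

means = {"A": 71.8, "B": 85.0, "C": 78.0}
msw, n_per_group = 10.9, 5
Q_CRIT = 3.77  # studentised range value for k=3, df=12, alpha=0.05 (from the text)

hsd = Q_CRIT * sqrt(msw / n_per_group)

results = {}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    results[(a, b)] = (diff, diff > hsd)
    verdict = "significant" if diff > hsd else "not significant"
    print(f"{a} vs {b}: |diff| = {diff:.1f} vs HSD = {hsd:.2f} -> {verdict}")
```

Because every group has the same n, one threshold serves all pairs; with unequal group sizes each pair would need its own standard error.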
Other Post-Hoc Tests
| Test | When to Use |
|---|---|
| Tukey's HSD | All pairwise comparisons with equal n (the Tukey-Kramer variant handles unequal n) |
| Bonferroni | A small number of pre-planned comparisons; works with unequal n; conservative |
| Scheffé | Complex contrasts (not just pairwise); most conservative |
| Games-Howell | Unequal variances (the post-hoc counterpart of Welch's ANOVA) |
| Dunnett's | Compare each group to one control group only |
Assumptions of One-Way ANOVA
1. INDEPENDENCE: Observations within and across groups are independent
Check: Study design (random sampling/assignment)
2. NORMALITY: Each group's data is approximately normal
Check: Histogram or QQ plot per group; Shapiro-Wilk test
Robust to this with large samples (n≥30 per group) via CLT
3. HOMOSCEDASTICITY (Equal Variances): All groups have the same σ²
Check: Levene's test; visual inspection (SD ratio ≤ 2)
If violated: use Welch's ANOVA (unequal variances version)
Checking Homoscedasticity
Levene's test: H₀: all group variances are equal
Rule of thumb: if largest SD / smallest SD > 2, consider Welch's ANOVA
Our example:
Method A: s₁ = √(32.8/4) = √8.2 = 2.86
Method B: s₂ = √(68/4) = √17 = 4.12
Method C: s₃ = √(30/4) = √7.5 = 2.74
Max/Min SD ratio = 4.12/2.74 = 1.50 < 2 → OK to use standard ANOVA
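Both checks can be run on the worked-example scores with the standard library. The sketch below computes the SD ratio without intermediate rounding (about 1.51, against 1.50 from the rounded SDs) and then a Levene-type statistic. It uses the Brown-Forsythe variant, which centres on group medians, and that choice is an assumption of this example, not something the text specifies. The statistic is a one-way ANOVA F computed on absolute deviations, so it is compared with the same F(2, 12) critical value quoted earlier.

```python
from statistics import fmean, median, stdev

groups = {
    "A": [72, 68, 75, 70, 74],
    "B": [85, 80, 88, 82, 90],
    "C": [78, 75, 82, 79, 76],
}

# Rule of thumb: ratio of largest to smallest sample SD
sds = {name: stdev(values) for name, values in groups.items()}
sd_ratio = max(sds.values()) / min(sds.values())

# Brown-Forsythe (median-centred Levene): ANOVA on absolute deviations
abs_dev = [[abs(x - median(g)) for x in g] for g in groups.values()]
k = len(abs_dev)
n_total = sum(len(d) for d in abs_dev)
grand = fmean([x for d in abs_dev for x in d])
ssb = sum(len(d) * (fmean(d) - grand) ** 2 for d in abs_dev)
ssw = sum((x - fmean(d)) ** 2 for d in abs_dev for x in d)
w_stat = (ssb / (k - 1)) / (ssw / (n_total - k))

F_CRIT = 3.885  # F(2, 12) at alpha = 0.05, as quoted in the worked example
print(f"SD ratio = {sd_ratio:.2f}")
print(f"W = {w_stat:.3f} ->", "unequal variances" if w_stat > F_CRIT else "no evidence of unequal variances")
```

Both checks point the same way here: the spread of the three groups is similar enough for the standard ANOVA.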
Effect Size: Eta-Squared (η²)
Measures the proportion of total variability explained by group differences:
η² = SSB / SST = 436.11 / 566.91 = 0.769
Interpretation:
η² = 0.01: small effect
η² = 0.06: medium effect
η² = 0.14: large effect
η² = 0.769 → 76.9% of the variability in scores is explained by the training method → very large effect!
Partial η² is used in factorial ANOVA; ω² is a less biased alternative to η².
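To see how far the two measures differ on the worked example, the sketch below computes η² alongside ω². The ω² formula used is the common one for a one-way design, (SSB − (k−1)·MSW) / (SST + MSW); the text does not state it, so treat it as an assumption of this example.

```python
ssb, ssw = 436.11, 130.8   # sums of squares from the worked example
k, n_total = 3, 15

sst = ssb + ssw
msw = ssw / (n_total - k)

eta_sq = ssb / sst
# omega-squared subtracts the between-group variation expected by chance alone
omega_sq = (ssb - (k - 1) * msw) / (sst + msw)

print(f"eta^2   = {eta_sq:.3f}")
print(f"omega^2 = {omega_sq:.3f}")
```

ω² comes out lower than η² (about 0.717 against 0.769), and the gap tends to be larger in small samples.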
Two-Way ANOVA (Brief Overview)
One-way ANOVA has one factor. Two-way ANOVA examines two factors at the same time:
Example: Does performance depend on:
- Factor 1: Training method (A, B, C)
- Factor 2: Experience level (Junior, Senior)
- AND does their INTERACTION matter? (Does Method A work better for Juniors?)
Two-way ANOVA produces THREE F-tests:
1. Main effect of Training method
2. Main effect of Experience level
3. Interaction effect (Training × Experience)
If the interaction is significant → the effect of one factor depends on the level of the other.
In that case, interpret the main effects with caution and inspect an interaction plot.
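For a balanced design (the same number of observations in every cell), the three F ratios come from a sum-of-squares decomposition much like the one-way case. The sketch below uses entirely hypothetical scores, two per cell, and stops at the F ratios without looking up critical values. It does not cover unbalanced designs, which need a different approach.

```python
from statistics import fmean

# HYPOTHETICAL data: data[method][level] is the list of scores in that cell
data = {
    "A": {"Junior": [70, 72], "Senior": [74, 76]},
    "B": {"Junior": [80, 82], "Senior": [88, 90]},
    "C": {"Junior": [76, 78], "Senior": [78, 80]},
}
methods, levels = list(data), list(data["A"])
a, b = len(methods), len(levels)
n = len(data["A"]["Junior"])  # observations per cell (balanced design)

cell = {(m, l): fmean(data[m][l]) for m in methods for l in levels}
all_x = [x for m in methods for l in levels for x in data[m][l]]
grand = fmean(all_x)
method_mean = {m: fmean([cell[m, l] for l in levels]) for m in methods}
level_mean = {l: fmean([cell[m, l] for m in methods]) for l in levels}

ss_method = b * n * sum((method_mean[m] - grand) ** 2 for m in methods)
ss_level = a * n * sum((level_mean[l] - grand) ** 2 for l in levels)
ss_inter = n * sum(
    (cell[m, l] - method_mean[m] - level_mean[l] + grand) ** 2
    for m in methods for l in levels
)
ss_error = sum((x - cell[m, l]) ** 2 for m in methods for l in levels for x in data[m][l])
ss_total = sum((x - grand) ** 2 for x in all_x)

df_method, df_level = a - 1, b - 1
df_inter, df_error = (a - 1) * (b - 1), a * b * (n - 1)
ms_error = ss_error / df_error

f_method = (ss_method / df_method) / ms_error
f_level = (ss_level / df_level) / ms_error
f_inter = (ss_inter / df_inter) / ms_error

print(f"Method:      F({df_method}, {df_error}) = {f_method:.2f}")
print(f"Experience:  F({df_level}, {df_error}) = {f_level:.2f}")
print(f"Interaction: F({df_inter}, {df_error}) = {f_inter:.2f}")
```

The four sums of squares add up to the total, which is a useful sanity check on any implementation.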
Non-Parametric Alternative: Kruskal-Wallis
When ANOVA assumptions are severely violated (non-normal data, small samples, ordinal data):
Kruskal-Wallis H test:
- Non-parametric version of one-way ANOVA
- Uses ranks instead of raw values
- H₀: all populations have the same distribution (not just equal means)
- Decision: reject if H > χ²_critical (df = k−1)
Post-hoc for Kruskal-Wallis: Dunn's test
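The rank-based procedure can be sketched in plain Python. The example below applies it to the three training-method samples from the worked example, purely to illustrate the mechanics. The function name `kruskal_wallis` is made up, tied values share the average of their ranks, and the usual tie correction is applied. The p-value shortcut exp(−H/2) holds only for df = 2, that is, for exactly three groups, and 5.991 is the standard table value for χ² with 2 degrees of freedom at α = 0.05.

```python
from math import exp


def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)

    rank_of = {}   # value -> average rank
    tie_term = 0   # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i                             # how many values share this rank
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    rank_sums = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sums - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))


samples = [
    [72, 68, 75, 70, 74],   # Method A
    [85, 80, 88, 82, 90],   # Method B
    [78, 75, 82, 79, 76],   # Method C
]
h_stat = kruskal_wallis(samples)
p_value = exp(-h_stat / 2)   # chi-square upper tail, valid ONLY for df = 2
CHI2_CRIT = 5.991            # chi-square critical value, df = 2, alpha = 0.05

print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if h_stat > CHI2_CRIT else "Fail to reject H0")
```

On these data the rank-based test reaches the same decision as the ANOVA, though with a larger p-value, which is typical when the parametric assumptions are reasonable.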
Practical Examples
Example 1: Salary Comparison Across Departments
HR asks: Is there a significant salary difference across Finance, Tech, Marketing, and HR?
Finance (n=12): x̄₁=82k, s₁=9k
Technology (n=15): x̄₂=95k, s₂=14k
Marketing (n=10): x̄₃=71k, s₃=8k
HR (n=8): x̄₄=65k, s₄=7k
Step 1: Run one-way ANOVA → F(3, 41) ≈ 18.0, p < 0.001 → reject H₀
Step 2: Tukey's HSD post-hoc:
Tech vs HR: significant (30k difference)
Tech vs Marketing: significant (24k difference)
Finance vs HR: significant
All pairs involving Technology are significant
Finance vs Marketing: borderline
Conclusion: Technology employees earn significantly more than employees in every other department.
η² ≈ 0.57 → about 57% of salary variance is explained by department → large effect.
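A one-way ANOVA needs only each group's size, mean and standard deviation, so the figures in this example can be reconstructed from the summary table. The sketch below treats the rounded summary values as exact; a result computed from the raw salaries could differ somewhat from this reconstruction.

```python
# name: (n, mean in thousands, SD in thousands), copied from the summary above
departments = {
    "Finance": (12, 82, 9),
    "Technology": (15, 95, 14),
    "Marketing": (10, 71, 8),
    "HR": (8, 65, 7),
}

k = len(departments)
n_total = sum(n for n, _, _ in departments.values())
grand_mean = sum(n * mean for n, mean, _ in departments.values()) / n_total

# SSB uses the group means; SSW is rebuilt from the SDs, since (n-1)*s^2
# equals the sum of squared deviations within a group
ssb = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in departments.values())
ssw = sum((n - 1) * sd ** 2 for n, _, sd in departments.values())

df_between, df_within = k - 1, n_total - k
f_stat = (ssb / df_between) / (ssw / df_within)
eta_sq = ssb / (ssb + ssw)

print(f"F({df_between}, {df_within}) = {f_stat:.1f}, eta^2 = {eta_sq:.2f}")
```

With these inputs the reconstruction gives F(3, 41) of about 18.0 and η² of about 0.57.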
Example 2: Website Load Time by Server Region
Three server regions, 20 measurements each (milliseconds):
Region A (n=20): x̄=245ms
Region B (n=20): x̄=310ms
Region C (n=20): x̄=280ms
F(2, 57) = 8.92, p = 0.0004 → reject H₀
Post-hoc (Tukey):
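Region A vs B: significant (65ms difference)
Region A vs C: not significant (35ms, p=0.12)
Region B vs C: not significant (30ms; with equal n, Tukey's HSD applies one threshold to every pair, so a gap smaller than the non-significant 35ms cannot be significant)
→ Region A is significantly faster than B; C does not differ significantly from either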
→ Action: Route more traffic to Region A
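In a balanced design every pair is judged against the same HSD, so the ordering of the gaps fixes the ordering of the verdicts. As a consistency check, the sketch below backs MSW out of the reported F and the three means, then derives the threshold. The q* value of 3.40 is an assumed approximate table value for k=3 and roughly 57 to 60 error degrees of freedom at α=0.05; it is not given in the text.

```python
from itertools import combinations
from math import sqrt

means = {"A": 245, "B": 310, "C": 280}   # mean load time in ms, from the example
n, f_reported = 20, 8.92
Q_CRIT = 3.40  # ASSUMPTION: approximate studentised range value (k=3, df~57-60)

k = len(means)
grand_mean = sum(means.values()) / k     # equal n, so a plain average suffices
msb = n * sum((m - grand_mean) ** 2 for m in means.values()) / (k - 1)
msw = msb / f_reported                   # rearranged from F = MSB / MSW
hsd = Q_CRIT * sqrt(msw / n)

significant = {
    (a, b): abs(means[a] - means[b]) > hsd for a, b in combinations(means, 2)
}
print(f"HSD = {hsd:.1f} ms")
for (a, b), is_sig in significant.items():
    print(f"Region {a} vs {b}: {abs(means[a] - means[b])} ms ->",
          "significant" if is_sig else "not significant")
```

Under these assumptions the threshold is about 37 ms, so only the 65 ms gap between Regions A and B clears it.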
Example 3: Customer Satisfaction by Complaint Resolution
Customer satisfaction scores (1–10) after complaints resolved in <1 day, 1–3 days, >3 days.
Since the satisfaction scale is ordinal, use Kruskal-Wallis:
H = 15.3, df = 2, p = 0.0005 → reject H₀
Dunn's post-hoc:
<1 day vs >3 days: significant (higher satisfaction for faster resolution)
1–3 days vs >3 days: significant
<1 day vs 1–3 days: not significant
→ Satisfaction does not differ significantly between complaints resolved in under 1 day and those resolved in 1–3 days; it drops once resolution takes longer than 3 days.
Common Mistakes
1. Not checking assumptions
If variances are very unequal (Levene's p<0.05): use Welch's ANOVA
If data is severely non-normal with small n: use Kruskal-Wallis
2. Running post-hoc tests without a significant ANOVA
If overall F is not significant → stop. Don't run post-hoc tests.
"Fishing" for significant pairs inflates error rates.
3. Treating ANOVA as a test of variances
ANOVA tests MEANS — not variances! The name is misleading.
It works by partitioning variance, but the null hypothesis is about MEANS.
4. Interpreting η² as the full story
η² = 0.05 is "small" but in a practical context might still matter.
Always combine with domain knowledge and the actual magnitude of differences.
Practice Exercises
- Three diets: Diet A (n=8): x̄=68kg, s=3. Diet B (n=8): x̄=72kg, s=4. Diet C (n=8): x̄=65kg, s=3.5. Compute SSB, SSW, and the F-statistic, then test at α=0.05.
- Explain why comparing 5 groups with separate pairwise t-tests is problematic. How many tests are needed, and what is the resulting probability of at least one false positive (α=0.05 per test)?
- ANOVA result: F(3,36)=5.2, p=0.004. Which post-hoc test would you use if groups have equal sizes? If group sizes are very unequal?
- An ANOVA produces η²=0.23. What percentage of variability is explained by the group factor? Is this a small, medium, or large effect?
- ANOVA assumptions are violated: groups are non-normal with small n (n=6 per group). What alternative test would you use?
Summary
In this chapter you learned:
- Why ANOVA: running multiple t-tests inflates the Type I error rate; ANOVA tests all groups simultaneously at level α
- F-statistic = MSB/MSW = (Between-group variance) / (Within-group variance); F tends to exceed 1 when group means differ
- ANOVA table: SSB (df=k−1), SSW (df=N−k), SST (df=N−1); F = MSB/MSW
- H₀: all group means equal; H₁: at least one mean different
- Post-hoc tests (after significant ANOVA): Tukey's HSD (equal n), Bonferroni, Games-Howell (unequal variances)
- Assumptions: independence, normality per group, equal variances (homoscedasticity)
- Levene's test checks equal variances; Welch's ANOVA if violated; Kruskal-Wallis if non-normal
- Effect size η² = SSB/SST → proportion of variance explained by groups; 0.01=small, 0.06=medium, 0.14=large
- Two-way ANOVA: two factors + their interaction; interaction means effect of one factor depends on the other
- ANOVA tests MEANS, not variances — despite the name
Next up: Correlation — quantifying the linear relationship between two variables.