Chapter 17 of 18

Correlation — Pearson, Spearman & Causation

Quantify the linear relationship between two variables — Pearson r, Spearman ρ, interpreting r², and why correlation does not imply causation.

Meritshot · 11 min read

What Is Correlation?

Correlation measures the strength and direction of the relationship between two variables. It answers: Do the variables tend to move together, and if so, how strongly?

Examples:
→ Do study hours and exam scores move together? (positive)
→ Do absences and grades move together? (negative)
→ Do shoe size and IQ move together? (none)

Correlation is always between −1 and +1.

Types of Correlation

r > 0: Positive correlation — as X increases, Y tends to increase
r < 0: Negative correlation — as X increases, Y tends to decrease
r = 0: No linear relationship

Strength:
|r| ≥ 0.9: Very strong
0.7 ≤ |r| < 0.9: Strong
0.5 ≤ |r| < 0.7: Moderate
0.3 ≤ |r| < 0.5: Weak
|r| < 0.3: Very weak or negligible

1. Pearson Correlation Coefficient (r)

Pearson r measures the linear relationship between two continuous variables.

Formula

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / [√Σ(xᵢ − x̄)² × √Σ(yᵢ − ȳ)²]

Equivalently:
r = [Σ(xᵢyᵢ) − n×x̄×ȳ] / [√(Σxᵢ² − n×x̄²) × √(Σyᵢ² − n×ȳ²)]

Properties:
- Dimensionless (no units)
- Symmetric: r(X,Y) = r(Y,X)
- −1 ≤ r ≤ +1
- r = +1 or −1: perfect linear relationship
- r = 0: no linear relationship (may still have non-linear relationship)

Worked Example

Dataset: 8 employees — hours of study per week vs performance score

Employee  Study hrs (x)  Score (y)
1              2            55
2              4            60
3              6            70
4              8            75
5             10            80
6             12            88
7             14            92
8             16            95

n = 8
x̄ = (2+4+6+8+10+12+14+16)/8 = 72/8 = 9
ȳ = (55+60+70+75+80+88+92+95)/8 = 615/8 = 76.875

Computing Σ(xᵢ − x̄)(yᵢ − ȳ):
Employee 1: (2−9)(55−76.875) = (−7)(−21.875) = 153.125
Employee 2: (4−9)(60−76.875) = (−5)(−16.875) = 84.375
Employee 3: (6−9)(70−76.875) = (−3)(−6.875) = 20.625
Employee 4: (8−9)(75−76.875) = (−1)(−1.875) = 1.875
Employee 5: (10−9)(80−76.875) = (+1)(+3.125) = 3.125
Employee 6: (12−9)(88−76.875) = (+3)(+11.125) = 33.375
Employee 7: (14−9)(92−76.875) = (+5)(+15.125) = 75.625
Employee 8: (16−9)(95−76.875) = (+7)(+18.125) = 126.875

Σ(xᵢ − x̄)(yᵢ − ȳ) = 153.125 + 84.375 + 20.625 + 1.875 + 3.125 + 33.375 + 75.625 + 126.875 = 499.0

Σ(xᵢ − x̄)²:
(−7)² + (−5)² + (−3)² + (−1)² + (1)² + (3)² + (5)² + (7)²
= 49 + 25 + 9 + 1 + 1 + 9 + 25 + 49 = 168

Σ(yᵢ − ȳ)²:
(−21.875)² + (−16.875)² + (−6.875)² + (−1.875)² + (3.125)² + (11.125)² + (15.125)² + (18.125)²
= 478.516 + 284.766 + 47.266 + 3.516 + 9.766 + 123.766 + 228.766 + 328.516
= 1,504.875

r = 499.0 / √(168 × 1504.875)
  = 499.0 / √252,819
  = 499.0 / 502.81
  = 0.992

Very strong positive correlation: r = 0.99
As study hours increase, performance scores increase almost perfectly linearly.
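
The hand calculation above can be cross-checked in a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` is an illustrative choice, not taken from any particular package.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the deviation-score formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

study_hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

r = pearson_r(study_hours, scores)
print(f"r = {r:.3f}")  # r = 0.992, matching the hand calculation
```

Library routines such as `scipy.stats.pearsonr` or `numpy.corrcoef` should return the same coefficient if you prefer not to write the loop yourself.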

Significance Testing for r

A non-zero r in the sample doesn't necessarily mean the population correlation ρ ≠ 0. Test significance:

H₀: ρ = 0 (no linear relationship in the population)
H₁: ρ ≠ 0

Test statistic:
t = r × √(n−2) / √(1−r²)    with df = n−2

For our example:
t = 0.992 × √(8−2) / √(1−0.992²)
  = 0.992 × √6 / √(1−0.984)
  = 0.992 × 2.449 / √0.016
  = 2.429 / 0.1265
  = 19.2

df = 6
p < 0.0001 → highly significant

The correlation between study hours and scores is highly significant.
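
As a rough sketch, the test statistic can be computed as follows. The function name `correlation_t` is hypothetical, and 2.447 is the usual two-tailed 5% critical value for df = 6 read from a t-table. Using the rounded r = 0.992 reproduces the 19.2 above; carrying more decimals of r gives a somewhat larger t, because 1 − r² is very sensitive to rounding when r is close to 1.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_stat = correlation_t(0.992, 8)
T_CRIT_DF6 = 2.447  # two-tailed critical value, alpha = 0.05, df = 6 (from a t-table)

print(f"t = {t_stat:.1f}, df = 6")
print("reject H0" if abs(t_stat) > T_CRIT_DF6 else "fail to reject H0")
```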

Coefficient of Determination (r²)

r² is the proportion of variance in Y that is explained by the linear relationship with X:

r² = 0.992² = 0.984 = 98.4%

Interpretation: 98.4% of the variability in performance scores can be
explained by (is associated with) variability in study hours.

The remaining 1.6% is not accounted for by the linear relationship (other factors or random noise).

r vs r²

r = 0.7 → r² = 0.49 (49% of Y variance explained)
r = 0.5 → r² = 0.25 (25% of Y variance explained)
r = 0.3 → r² = 0.09 (9% of Y variance explained)

A "moderate" r of 0.5 only explains 25% of variance — the relationship
is less practically meaningful than the raw r suggests.
Always report r² alongside r.

2. Spearman Rank Correlation (ρ)

Spearman ρ (rho) measures the monotonic relationship between two variables. It works on ranks, not raw values.

Use when:

  • Data is ordinal (ratings, rankings)
  • Data is non-normal or has outliers
  • The relationship is monotonic but not necessarily linear
  • Small sample sizes

Formula

ρ = 1 − (6 × Σdᵢ²) / (n(n²−1))

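Where dᵢ = rank(xᵢ) − rank(yᵢ) for each observation. This shortcut is exact only when there are no tied ranks; with ties, assign average ranks and compute Pearson r on the ranks instead.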

Worked Example

Scenario: 8 candidates ranked by two interviewers (A and B)

Candidate  Interviewer A rank  Interviewer B rank  d = A−B   d²
1                 1                    2              −1        1
2                 2                    1              +1        1
3                 3                    4              −1        1
4                 4                    3              +1        1
5                 5                    7              −2        4
6                 6                    5              +1        1
7                 7                    6              +1        1
8                 8                    8               0        0

Σd² = 1 + 1 + 1 + 1 + 4 + 1 + 1 + 0 = 10

ρ = 1 − (6 × 10) / (8 × (64−1))
  = 1 − 60 / (8 × 63)
  = 1 − 60/504
  = 1 − 0.119
  = 0.881

Strong agreement between the two interviewers (ρ = 0.881).
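
The same calculation as a short Python sketch. The function name `spearman_from_ranks` is illustrative, and the shortcut formula it uses assumes the inputs are already ranks with no ties.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman rho via the d-squared shortcut (assumes no tied ranks)."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two equal-length rank lists with at least 2 items")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

interviewer_a = [1, 2, 3, 4, 5, 6, 7, 8]
interviewer_b = [2, 1, 4, 3, 7, 5, 6, 8]

rho = spearman_from_ranks(interviewer_a, interviewer_b)
print(f"rho = {rho:.3f}")  # rho = 0.881
```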

Pearson vs Spearman

Feature                  Pearson r                        Spearman ρ
Measures                 Linear relationship              Monotonic relationship
Data type                Interval/Ratio                   Ordinal or ranked interval/ratio
Outlier sensitivity      High                             Low (ranks reduce impact)
Distribution assumption  Bivariate normal (for testing)   None
When to prefer           Normal data, no outliers         Non-normal, ordinal, or outliers present
Scale                    Raw values                       Ranks

Rule of thumb: Use Pearson when data is continuous, roughly normal, and outlier-free. Use Spearman otherwise.

The Scatter Plot: Always Plot First

Before computing any correlation, always look at the scatter plot:

Pattern Types and Their r:
                        
r ≈ +1: points cluster   r ≈ −1: points cluster   r ≈ 0: no pattern
around upward line       around downward line     (scattered cloud)

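Symmetric U-shaped curve (e.g. y = x² centred on zero): r ≈ 0, and Spearman ρ ≈ 0 as well,
because the pattern is not monotonic.
Monotonic curve (e.g. exponential growth): Spearman ρ = ±1 while Pearson |r| < 1.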

ALWAYS PLOT first — r = 0 doesn't mean "no relationship"
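
To see how the two coefficients behave on curved data, here is a small sketch on made-up points: a symmetric U-shape (y = x²) and a monotonic curve (y = eˣ). The helper names `pearson_r`, `average_ranks` and `spearman_rho` are illustrative; Spearman is computed as Pearson on average ranks so that ties are handled.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
u_shape = [x ** 2 for x in xs]        # perfect relationship, but not monotonic
growth = [math.exp(x) for x in xs]    # monotonic, but not linear

print("U-shape:", pearson_r(xs, u_shape), spearman_rho(xs, u_shape))
print("Growth: ", pearson_r(xs, growth), spearman_rho(xs, growth))
```

On the U-shape both coefficients come out at zero even though y is fully determined by x. On the exponential curve Spearman is 1 while Pearson is noticeably below 1. Only a plot reveals the U-shape.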

The Anscombe Quartet

Four datasets with the SAME Pearson r ≈ 0.82:

Dataset I: Linear relationship — appropriate to use r
Dataset II: Curved relationship — r misleads!
Dataset III: Almost perfect line with ONE outlier pulling r down
Dataset IV: Vertical cloud with ONE outlier creating the correlation

Moral: Same r, wildly different patterns. ALWAYS plot the scatter first.
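
The shared correlation is easy to confirm numerically. The sketch below uses the quartet's values as they are commonly tabulated from Anscombe's 1973 paper; treat the listing as an assumption and check it against your own copy of the data. The helper `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in results.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

All four values should agree to about three decimal places, which is exactly why the coefficient alone cannot tell these datasets apart.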

Correlation Is Not Causation

This is one of the most important concepts in statistics.

A significant, strong correlation between X and Y does NOT by itself mean that:
1. X causes Y
2. Y causes X
3. X and Y have any direct relationship at all

Explanations for correlation without causation:

1. COMMON CAUSE (confounding variable):
   Ice cream sales correlate with drowning rates.
   Cause: Hot weather → more ice cream AND more swimming
   Fix: Control for temperature

2. REVERSE CAUSATION:
   Hospital visits correlate with illness.
   Does the hospital cause illness? Or does illness cause hospital visits?

3. COINCIDENTAL CORRELATION (spurious):
   Nicolas Cage film releases correlate with pool drownings (r=0.67)
   U.S. per capita cheese consumption correlates with deaths from becoming
   tangled in bedsheets (r=0.95)
   These are statistically significant but meaningless coincidences.

4. SELECTION BIAS:
   In a dataset of interviewed candidates, skills and friendliness can appear
   negatively correlated, but only because applicants who are both unskilled
   AND unfriendly never get interviewed, so they are missing from the data.
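
The common-cause case (item 1) can be illustrated with a small simulation. Everything here is made up for illustration: the variable names, the standardised units, the noise level and the seed. Two outcomes are generated from a shared driver and never influence each other, yet they correlate strongly. Controlling for the driver, using the partial correlation formula covered later in this chapter, removes nearly all of that correlation.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(42)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical standardised temperature readings
temperature = [random.gauss(0, 1) for _ in range(n)]
# Both outcomes depend on temperature plus independent noise;
# neither one has any effect on the other.
ice_cream = [t + random.gauss(0, 0.5) for t in temperature]
drownings = [t + random.gauss(0, 0.5) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)

# Partial correlation of ice cream and drownings, controlling for temperature
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw correlation:     {r_xy:.2f}")
print(f"partial correlation: {partial:.2f}")
```

With this setup the raw correlation should land near 0.8 and the partial correlation near 0, although the exact figures depend on the random draw.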

Establishing Causality

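Correlation alone is not sufficient to establish causation, and a linear correlation is not strictly necessary either, because a causal effect can be non-linear or masked by other variables. To establish causality: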

1. RANDOMISED CONTROLLED EXPERIMENT:
   Randomly assign subjects to treatment/control → eliminates confounding
   Gold standard for causality

2. TEMPORAL ORDER:
   Cause must precede effect (if X causes Y, X must change before Y)

3. MECHANISM:
   There should be a plausible biological/physical/economic mechanism
   explaining how X affects Y

4. DOSE-RESPONSE:
   More X should lead to more Y (or less Y if negative)

5. ELIMINATION OF ALTERNATIVES:
   Rule out confounders, reverse causation, selection bias

Without an experiment, causal inference requires strong assumptions
and careful design (e.g. instrumental variables, difference-in-differences,
regression discontinuity, or propensity score matching, as used in economics).

Practical Examples

Example 1: Marketing Spend and Revenue

Monthly data for 12 months:
Marketing spend (₹000): 10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40
Revenue (₹000):         85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240

r ≈ 0.996 (very strong positive)
r² ≈ 0.992

Scatter plot: linear pattern with tight clustering

Interpretation: Very strong linear relationship between marketing spend
and revenue. Note that r does not say how much revenue changes per unit
of spend; that comes from the regression slope, not from r.

Caution: Cannot conclude marketing CAUSES revenue without controlling
for confounders (seasonality, competitors, economic conditions).
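
The coefficient for this example can be recomputed directly from the twelve data pairs. The sketch below also computes the least-squares slope, because the slope rather than r is what tells you how much revenue changes per unit of spend. Variable names are illustrative.

```python
import math

spend = [10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40]               # ₹000
revenue = [85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240]    # ₹000

n = len(spend)
mean_x = sum(spend) / n
mean_y = sum(revenue) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, revenue))
sxx = sum((x - mean_x) ** 2 for x in spend)
syy = sum((y - mean_y) ** 2 for y in revenue)

r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx  # change in revenue (₹000) per additional ₹000 of spend

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, slope = {slope:.2f}")
```

On these figures r comes out at roughly 0.996, and the slope is about 5, meaning each extra ₹1,000 of spend is associated with roughly ₹5,000 more revenue. That is an association in this sample, not a causal estimate.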

Example 2: Customer Satisfaction and Retention

12 product categories — satisfaction score (1–10) and 1-year retention rate (%):
r = 0.72, r² = 0.518

Interpretation: Strong positive relationship by the scale above. Satisfaction
is associated with 52% of the variance in retention rate, which is practically
meaningful, though 48% of retention variance is unexplained.

Example 3: Outlier Effect on Pearson

Original 7 data points: r = 0.15 (very weak)
Add one extreme outlier (x=100, y=200):
New r = 0.88 (very strong!)

The SINGLE outlier completely changed the interpretation.
Solution: Compute Spearman ρ (far less affected, since the outlier contributes only its rank):
Spearman ρ = 0.18 (consistent with original pattern)

This shows why Spearman is more robust when outliers are present.
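
The same effect can be reproduced on a small hypothetical dataset. These seven points are invented for illustration and are not the data behind the figures above. Adding the single point (100, 200) pushes Pearson r from 0 to nearly 1, while Spearman ρ moves far less. It does still move, because the new point ranks highest on both variables.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simple_ranks(values):
    """Ranks 1..n; assumes all values are distinct (no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    return pearson_r(simple_ranks(xs), simple_ranks(ys))

# Hypothetical data with no linear trend
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [3, 6, 2, 7, 1, 5, 4]

# Same data plus one extreme outlier
xs_out = xs + [100]
ys_out = ys + [200]

print(f"Pearson  before: {pearson_r(xs, ys):.2f}  after: {pearson_r(xs_out, ys_out):.2f}")
print(f"Spearman before: {spearman_rho(xs, ys):.2f}  after: {spearman_rho(xs_out, ys_out):.2f}")
```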

Partial Correlation

Correlation between X and Y after removing the influence of a third variable Z:

r(XY.Z) = (r_XY − r_XZ × r_YZ) / √[(1 − r_XZ²)(1 − r_YZ²)]

Example:
r(education, salary) = 0.65
r(education, experience) = 0.55
r(experience, salary) = 0.70

Partial correlation of education and salary controlling for experience:
r(ed, sal | exp) = (0.65 − 0.55×0.70) / √[(1−0.3025)(1−0.49)]
                 = (0.65 − 0.385) / √(0.6975 × 0.51)
                 = 0.265 / √0.3557
                 = 0.265 / 0.5964
                 = 0.44

The relationship is weaker once we control for experience — part of the
education-salary correlation was driven by more-educated people also
having more relevant experience.
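
The formula translates directly into code. A minimal sketch, with `partial_corr` as an illustrative function name:

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denom

# X = education, Y = salary, Z = experience
r_ed_sal = partial_corr(r_xy=0.65, r_xz=0.55, r_yz=0.70)
print(f"r(ed, sal | exp) = {r_ed_sal:.2f}")  # 0.44
```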

Common Mistakes

1. Computing r without looking at the scatter plot

Non-linear relationships, outliers, and clusters can all produce
misleading r values. Always visualise first.

2. Interpreting r as the slope

r = 0.8 does NOT mean "for every 1 unit increase in X, Y increases by 0.8"
That's the regression coefficient, not r.
r only tells you the strength of the linear relationship.

3. Treating absence of correlation as absence of relationship

r ≈ 0 means no LINEAR relationship.
X and Y could have a perfect quadratic, circular, or sinusoidal relationship
with r = 0. Use scatter plots to detect such patterns; Spearman ρ helps only
when the relationship is monotonic.

4. Extrapolating from correlation to causation

Every statistics course teaches this, yet it's violated constantly in news reports.
"Study shows coffee drinkers live longer" → confounders abound.
Always ask: What else could explain this pattern?

Practice Exercises

  1. Compute Pearson r for: X: 1, 3, 5, 7, 9 and Y: 4, 8, 12, 16, 20. Interpret the result and compute r².

  2. Two teachers rank 6 students: Teacher A: 1, 2, 3, 4, 5, 6 Teacher B: 2, 1, 4, 3, 6, 5 Compute Spearman ρ.

  3. Sales team size and revenue both increase over 10 years. r=0.95. A manager concludes "hiring more salespeople causes revenue to grow." Identify all possible alternative explanations.

  4. Dataset with r=0.3. Is this statistically significant at α=0.05 with n=100? What about with n=10? Compute both t-statistics.

  5. You find r=0.85 between variable A and variable B. A colleague says r²=0.85, so 85% of variance in B is explained by A. What's wrong?

Summary

In this chapter you learned:

  • Pearson r: measures linear relationship between two continuous variables; r = Σ[(x−x̄)(y−ȳ)] / ((n−1)×sₓ×sᵧ), where sₓ and sᵧ are sample standard deviations
  • Interpretation: −1=perfect negative, 0=none, +1=perfect positive; |r|≥0.7 strong, 0.5–0.7 moderate, 0.3–0.5 weak
  • r²: proportion of Y variance explained by X; always report alongside r
  • Significance test: t = r√(n−2)/√(1−r²) with df=n−2; tests H₀: ρ=0
  • Spearman ρ: rank-based correlation; robust to outliers and non-normality; use for ordinal data
  • Pearson vs Spearman: Pearson = linear, sensitive to outliers; Spearman = monotonic, robust
  • Always visualise: scatter plot first — Anscombe's Quartet shows identical r for completely different patterns
  • r=0 ≠ no relationship: could be a non-linear or non-monotonic pattern
  • Correlation ≠ causation: common cause, reverse causation, spurious correlations all produce r≠0
  • Partial correlation: removes the influence of a third variable to isolate the direct relationship

Next up: Linear Regression — the most powerful tool for modelling and predicting relationships.