Correlation — Pearson, Spearman & Causation | Statistics Tutorial | Meritshot

What Is Correlation?

Correlation measures the strength and direction of the relationship between two variables. It answers: Do the variables tend to move together, and if so, how strongly?

Examples:
→ Do study hours and exam scores move together? (positive)
→ Do absences and grades move together? (negative)
→ Do shoe size and IQ move together? (none)

Correlation is always between −1 and +1.

Types of Correlation

r > 0: Positive correlation — as X increases, Y tends to increase
r < 0: Negative correlation — as X increases, Y tends to decrease
r = 0: No linear relationship

Strength:
|r| ≥ 0.9: Very strong
0.7 ≤ |r| < 0.9: Strong
0.5 ≤ |r| < 0.7: Moderate
0.3 ≤ |r| < 0.5: Weak
|r| < 0.3: Very weak or negligible
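
The bands above can be turned into a small helper. This is only a sketch: the function name `describe_correlation` is made up for illustration, and the cut-offs are the conventions listed above, not a universal standard.

```python
def describe_correlation(r: float) -> str:
    """Label r using the strength bands listed above (a convention, not a law)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        strength = "very weak or negligible"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no linear relationship"
    return f"{strength}, {direction}"


print(describe_correlation(0.85))   # strong, positive
print(describe_correlation(-0.42))  # weak, negative
```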

1. Pearson Correlation Coefficient (r)

Pearson r measures the linear relationship between two continuous variables.

Formula

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / [√Σ(xᵢ − x̄)² × √Σ(yᵢ − ȳ)²]

Equivalently:
r = [Σ(xᵢyᵢ) − n×x̄×ȳ] / [√(Σxᵢ² − n×x̄²) × √(Σyᵢ² − n×ȳ²)]

Properties:
- Dimensionless (no units)
- Symmetric: r(X,Y) = r(Y,X)
- −1 ≤ r ≤ +1
- r = +1 or −1: perfect linear relationship
- r = 0: no linear relationship (may still have non-linear relationship)
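
A quick numerical check of these properties, written as a minimal sketch. The height and weight values are hypothetical illustration data, and both function names are invented here. The deviation form and the raw-sum form should agree, swapping X and Y should not change r, and neither should a change of units.

```python
import math


def pearson_deviation_form(x, y):
    """r from deviations about the means (the first formula above)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


def pearson_raw_sum_form(x, y):
    """r from raw sums (the equivalent computational formula)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    denom_x = math.sqrt(sum(a * a for a in x) - n * mean_x ** 2)
    denom_y = math.sqrt(sum(b * b for b in y) - n * mean_y ** 2)
    return numerator / (denom_x * denom_y)


# Hypothetical illustration data (not from the tutorial).
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0]

r_dev = pearson_deviation_form(heights_cm, weights_kg)
r_raw = pearson_raw_sum_form(heights_cm, weights_kg)
r_swapped = pearson_deviation_form(weights_kg, heights_cm)

# Changing units (cm to inches) rescales x but leaves r unchanged.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = pearson_deviation_form(heights_in, weights_kg)

print(r_dev, r_raw, r_swapped, r_inches)
```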

Worked Example

Dataset: 8 employees — hours of study per week vs performance score

Employee  Study hrs (x)  Score (y)
1              2            55
2              4            60
3              6            70
4              8            75
5             10            80
6             12            88
7             14            92
8             16            95

n = 8
x̄ = (2+4+6+8+10+12+14+16)/8 = 72/8 = 9
ȳ = (55+60+70+75+80+88+92+95)/8 = 615/8 = 76.875

Computing Σ(xᵢ − x̄)(yᵢ − ȳ):
Employee 1: (2−9)(55−76.875) = (−7)(−21.875) = 153.125
Employee 2: (4−9)(60−76.875) = (−5)(−16.875) = 84.375
Employee 3: (6−9)(70−76.875) = (−3)(−6.875) = 20.625
Employee 4: (8−9)(75−76.875) = (−1)(−1.875) = 1.875
Employee 5: (10−9)(80−76.875) = (+1)(+3.125) = 3.125
Employee 6: (12−9)(88−76.875) = (+3)(+11.125) = 33.375
Employee 7: (14−9)(92−76.875) = (+5)(+15.125) = 75.625
Employee 8: (16−9)(95−76.875) = (+7)(+18.125) = 126.875

Σ(xᵢ − x̄)(yᵢ − ȳ) = 153.125 + 84.375 + 20.625 + 1.875 + 3.125 + 33.375 + 75.625 + 126.875 = 499.0

Σ(xᵢ − x̄)²:
(−7)² + (−5)² + (−3)² + (−1)² + (1)² + (3)² + (5)² + (7)²
= 49 + 25 + 9 + 1 + 1 + 9 + 25 + 49 = 168

Σ(yᵢ − ȳ)²:
(−21.875)² + (−16.875)² + (−6.875)² + (−1.875)² + (3.125)² + (11.125)² + (15.125)² + (18.125)²
= 478.516 + 284.766 + 47.266 + 3.516 + 9.766 + 123.766 + 228.766 + 328.516
= 1,504.875

r = 499.0 / √(168 × 1504.875)
  = 499.0 / √252,819
  = 499.0 / 502.81
  = 0.992

Very strong positive correlation: r = 0.99
As study hours increase, performance scores increase almost perfectly linearly.
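
The hand calculation above can be reproduced step by step. The sketch below uses only the eight data pairs from the table; the variable names are chosen here for readability.

```python
import math

study_hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(study_hours)
mean_x = sum(study_hours) / n
mean_y = sum(scores) / n

# The three sums used in the formula.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(study_hours, scores))
sum_xx = sum((x - mean_x) ** 2 for x in study_hours)
sum_yy = sum((y - mean_y) ** 2 for y in scores)

r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(f"means: {mean_x}, {mean_y}")
print(f"cross-products: {sum_xy}, squares: {sum_xx}, {sum_yy}")
print(f"r = {r:.3f}")
```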

Significance Testing for r

A non-zero r in the sample doesn't necessarily mean the population correlation ρ ≠ 0. Test significance:

H₀: ρ = 0 (no linear relationship in the population)
H₁: ρ ≠ 0

Test statistic:
t = r × √(n−2) / √(1−r²)    with df = n−2

For our example:
t = 0.992 × √(8−2) / √(1−0.992²)
  = 0.992 × √6 / √(1−0.984)
  = 0.992 × 2.449 / √0.016
  = 2.429 / 0.1265
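  = 19.2
(Because 1 − r² is so small here, t is sensitive to rounding: carrying the unrounded r = 0.99243 through gives t ≈ 19.8. The conclusion does not change.)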

df = 6
p < 0.0001 → highly significant

The correlation between study hours and scores is highly significant.
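
A minimal sketch of the test statistic follows. The function name is hypothetical, and the critical value 2.447 is the standard two-sided 5% table value for 6 degrees of freedom. An exact p-value needs a t-distribution routine that the Python standard library does not provide, so the sketch compares t against the table value instead.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-sided 5% critical value of t with 6 degrees of freedom (table value).
T_CRIT_DF6 = 2.447

t_rounded = correlation_t_statistic(0.992, 8)   # r rounded to 3 decimals
r_exact = 499.0 / math.sqrt(168 * 1504.875)     # sums from the worked example
t_exact = correlation_t_statistic(r_exact, 8)

print(f"t with r = 0.992:   {t_rounded:.2f}")
print(f"t with unrounded r: {t_exact:.2f}")
print("reject H0 at 5%:", abs(t_exact) > T_CRIT_DF6)
```

Either way t is far beyond the critical value, so the rounding affects the reported statistic but not the decision.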

Coefficient of Determination (r²)

r² is the proportion of variance in Y that is explained by the linear relationship with X:

r² = 0.992² = 0.984 = 98.4%

Interpretation: 98.4% of the variability in performance scores can be
explained by (is associated with) variability in study hours.

The remaining 1.6% is due to other factors not in the model.

r vs r²

r = 0.7 → r² = 0.49 (49% of Y variance explained)
r = 0.5 → r² = 0.25 (25% of Y variance explained)
r = 0.3 → r² = 0.09 (9% of Y variance explained)

A "moderate" r of 0.5 only explains 25% of variance — the relationship
is less practically meaningful than the raw r suggests.
Always report r² alongside r.

2. Spearman Rank Correlation (ρ)

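Spearman ρ (rho) measures the monotonic relationship between two variables. It works on ranks, not raw values. It is often written r_s to avoid confusion with the population correlation ρ used in the significance test above.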

Use when:

- Data is ordinal (ratings, rankings)
- Data is non-normal or has outliers
- The relationship is monotonic but not necessarily linear
- The sample is small and normality cannot be checked

Formula

ρ = 1 − (6 × Σdᵢ²) / (n(n²−1))

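Where dᵢ = rank(xᵢ) − rank(yᵢ) for each observation. This shortcut is exact only when there are no tied ranks; with ties, assign average ranks and compute Pearson r on the ranks instead.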

Worked Example

Scenario: 8 candidates ranked by two interviewers (A and B)

Candidate  Interviewer A rank  Interviewer B rank  d = A−B   d²
1                 1                    2              −1        1
2                 2                    1              +1        1
3                 3                    4              −1        1
4                 4                    3              +1        1
5                 5                    7              −2        4
6                 6                    5              +1        1
7                 7                    6              +1        1
8                 8                    8               0        0

Σd² = 1 + 1 + 1 + 1 + 4 + 1 + 1 + 0 = 10

ρ = 1 − (6 × 10) / (8 × (64−1))
  = 1 − 60 / (8 × 63)
  = 1 − 60/504
  = 1 − 0.119
  = 0.881

Strong agreement between the two interviewers (ρ = 0.881).
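
The same calculation as a short sketch. The function name is made up here, and the shortcut formula assumes no tied ranks, which holds for these two interviewers.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no tied ranks."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


interviewer_a = [1, 2, 3, 4, 5, 6, 7, 8]
interviewer_b = [2, 1, 4, 3, 7, 5, 6, 8]

rho = spearman_from_ranks(interviewer_a, interviewer_b)
print(f"rho = {rho:.3f}")
```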

Pearson vs Spearman

Feature	Pearson r	Spearman ρ
Measures	Linear relationship	Monotonic relationship
Data type	Interval/Ratio	Ordinal or ranked interval/ratio
Outlier sensitivity	High	Low (ranks reduce impact)
Distribution assumption	Bivariate normal (for testing)	None
When to prefer	Normal data, no outliers	Non-normal, ordinal, or outliers present
Scale	Raw values	Ranks

Rule of thumb: Use Pearson when data is continuous, roughly normal, and outlier-free. Use Spearman otherwise.

The Scatter Plot: Always Plot First

Before computing any correlation, always look at the scatter plot:

Pattern types and their r:

r ≈ +1: points cluster around an upward-sloping line
r ≈ −1: points cluster around a downward-sloping line
r ≈ 0: no linear pattern (scattered cloud)

Symmetric curve (e.g. a U-shaped quadratic): r ≈ 0, and Spearman ρ ≈ 0 too, because the pattern is not monotonic
Monotonic curve (e.g. exponential growth): |r| < 1, but Spearman ρ = ±1

ALWAYS PLOT first: r = 0 doesn't mean "no relationship"
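
To see how the two coefficients react to curved patterns, here is a small sketch on made-up data. A symmetric U-shape is not monotonic, so both Pearson r and Spearman ρ come out at zero. An exponential curve is monotonic, so ρ is exactly 1 while r falls short of 1. The helper names are chosen here, and Spearman is computed as the Pearson r of average ranks.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rho as the Pearson r of the (average) ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [-3, -2, -1, 0, 1, 2, 3]
y_u_shape = [v ** 2 for v in x]           # symmetric, not monotonic
y_exponential = [math.exp(v) for v in x]  # curved but monotonic

r_u, rho_u = pearson_r(x, y_u_shape), spearman_rho(x, y_u_shape)
r_exp, rho_exp = pearson_r(x, y_exponential), spearman_rho(x, y_exponential)

print(f"U-shape:     r = {r_u:.3f}, rho = {rho_u:.3f}")
print(f"exponential: r = {r_exp:.3f}, rho = {rho_exp:.3f}")
```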

The Anscombe Quartet

Four datasets with the SAME Pearson r ≈ 0.82:

Dataset I: Linear relationship — appropriate to use r
Dataset II: Curved relationship — r misleads!
Dataset III: Almost perfect line with ONE outlier pulling r down
Dataset IV: Vertical cloud with ONE outlier creating the correlation

Moral: Same r, wildly different patterns. ALWAYS plot the scatter first.
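
The claim is easy to verify numerically. The values below are Anscombe's published 1973 data, which this tutorial does not list, so treat them as an outside addition. The sketch only confirms that r matches across the four sets; the differences show up only in the plots.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
results = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}

for name, r in results.items():
    print(f"Dataset {name}: r = {r:.3f}")
```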

Correlation Is Not Causation

This is one of the most important ideas in statistics.

A significant, strong correlation between X and Y does NOT mean:
1. X causes Y
2. Y causes X
3. They have any direct relationship at all

Explanations for correlation without causation:

1. COMMON CAUSE (confounding variable):
   Ice cream sales correlate with drowning rates.
   Cause: Hot weather → more ice cream AND more swimming
   Fix: Control for temperature

2. REVERSE CAUSATION:
   Hospital visits correlate with illness.
   Does the hospital cause illness? Or does illness cause hospital visits?

3. COINCIDENTAL CORRELATION (spurious):
   Nicolas Cage film releases correlate with pool drownings (r=0.67)
   U.S. per capita cheese consumption correlates with deaths from becoming
   tangled in bedsheets (r=0.95)
   These can pass a significance test, yet they are meaningless coincidences:
   compare enough pairs of time series and some will line up by chance.

4. SELECTION BIAS:
   In a dataset of interviewed job applicants, skills and friendliness can be
   negatively correlated, but only because applicants who are both unskilled
   AND unfriendly don't get interviewed.
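
Selection bias is easy to reproduce in a simulation. The sketch below rests on stated assumptions: skill and friendliness are independent standard-normal scores, and a hypothetical screening rule invites only applicants whose combined score is above 0, so anyone low on both traits is excluded. None of these numbers come from real hiring data.

```python
import math
import random


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_applicants = 20_000

# Assumption: the two traits are generated independently of each other.
skill = [rng.gauss(0, 1) for _ in range(n_applicants)]
friendliness = [rng.gauss(0, 1) for _ in range(n_applicants)]

# Hypothetical screening rule: interview only if the combined score exceeds 0.
interviewed = [(s, f) for s, f in zip(skill, friendliness) if s + f > 0]

r_all = pearson_r(skill, friendliness)
r_interviewed = pearson_r([s for s, _ in interviewed],
                          [f for _, f in interviewed])

print(f"all applicants:   r = {r_all:.3f}")
print(f"interviewed only: r = {r_interviewed:.3f}")
```

Among all applicants r is close to zero by construction. Among the interviewed it is clearly negative, even though neither trait influences the other.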

Establishing Causality

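Correlation alone is not sufficient to establish causation, and it is not strictly necessary either: a real causal effect can show r ≈ 0 if it is non-linear or masked by other factors. To establish causality: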

1. RANDOMISED CONTROLLED EXPERIMENT:
   Randomly assign subjects to treatment/control → eliminates confounding
   Gold standard for causality

2. TEMPORAL ORDER:
   Cause must precede effect (if X causes Y, X must change before Y)

3. MECHANISM:
   There should be a plausible biological/physical/economic mechanism
   explaining how X affects Y

4. DOSE-RESPONSE:
   More X should lead to more Y (or less Y if negative)

5. ELIMINATION OF ALTERNATIVES:
   Rule out confounders, reverse causation, selection bias

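Without an experiment, causal inference requires strong assumptions
and careful design, for example instrumental variables (IV),
difference-in-differences (DiD), regression discontinuity (RD), or
propensity score matching, all widely used in economics.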

Practical Examples

Example 1: Marketing Spend and Revenue

Monthly data for 12 months:
Marketing spend (₹000): 10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40
Revenue (₹000):         85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240

r = 0.996 (very strong positive)
r² = 0.992

Scatter plot: linear pattern with tight clustering

Interpretation: Very strong linear relationship between marketing spend
and revenue. Note that r says how tightly the points follow a line, not
how much revenue changes per rupee of spend. That quantity is the
regression slope (about 5.06 for these data, i.e. roughly ₹5,060 more
revenue per extra ₹1,000 of spend, on average).

Caution: Cannot conclude marketing CAUSES revenue without controlling
for confounders (seasonality, competitors, economic conditions).
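
The figures for this example can be checked directly from the twelve monthly pairs. The sketch below computes r, r², and the least-squares slope side by side, to make clear that the slope is a separate quantity from r. Variable names are chosen here.

```python
import math

# Both series are in thousands of rupees, as listed above.
marketing_spend = [10, 15, 12, 20, 18, 25, 22, 30, 28, 35, 32, 40]
revenue = [85, 110, 95, 130, 125, 155, 145, 175, 170, 210, 200, 240]

n = len(marketing_spend)
mean_x = sum(marketing_spend) / n
mean_y = sum(revenue) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marketing_spend, revenue))
sxx = sum((x - mean_x) ** 2 for x in marketing_spend)
syy = sum((y - mean_y) ** 2 for y in revenue)

r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx  # least-squares slope: revenue change per unit of spend

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, slope = {slope:.2f}")
```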

Example 2: Customer Satisfaction and Retention

12 product categories — satisfaction score (1–10) and 1-year retention rate (%):
r = 0.72, r² = 0.518

Interpretation: Strong positive relationship (0.7 ≤ |r| < 0.9). Satisfaction
explains 52% of the variance in retention rate. With n = 12 the correlation is
statistically significant (t ≈ 3.28, df = 10, p < 0.01) and practically
meaningful, though 48% of retention variance is unexplained.

Example 3: Outlier Effect on Pearson

Original 7 data points: r = 0.15 (very weak)
Add one extreme outlier (x=100, y=200):
New r = 0.88 (strong!)

The SINGLE outlier completely changed the interpretation.
Solution: Compute Spearman ρ (far less affected by the outlier, since only its rank matters):
Spearman ρ = 0.18 (consistent with original pattern)

This shows why Spearman is more robust when outliers are present.
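
The tutorial does not list the seven points behind this example, so the sketch below uses made-up data to show the same effect. On these invented points, adding (100, 200) pushes Pearson r from about 0.2 to nearly 1, while Spearman ρ moves only from about 0.2 to about 0.5. With so few points even the rank-based measure shifts somewhat. The ranking helper assumes no ties.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(x, y):
    return pearson_r(ranks(x), ranks(y))


# Hypothetical points with a weak relationship.
x = [1, 2, 3, 4, 5, 6, 7]
y = [4, 2, 6, 1, 7, 3, 5]

# The same points plus one extreme outlier.
x_out = x + [100]
y_out = y + [200]

r_before, r_after = pearson_r(x, y), pearson_r(x_out, y_out)
rho_before, rho_after = spearman_rho(x, y), spearman_rho(x_out, y_out)

print(f"Pearson:  {r_before:.2f} -> {r_after:.2f}")
print(f"Spearman: {rho_before:.2f} -> {rho_after:.2f}")
```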

Partial Correlation

Correlation between X and Y after removing the influence of a third variable Z:

r(XY.Z) = (r_XY − r_XZ × r_YZ) / √[(1 − r_XZ²)(1 − r_YZ²)]

Example:
r(education, salary) = 0.65
r(education, experience) = 0.55
r(experience, salary) = 0.70

Partial correlation of education and salary controlling for experience:
r(ed, sal | exp) = (0.65 − 0.55×0.70) / √[(1−0.3025)(1−0.49)]
                 = (0.65 − 0.385) / √(0.6975 × 0.51)
                 = 0.265 / √0.3557
                 = 0.265 / 0.5964
                 = 0.44

The relationship is weaker once we control for experience — part of the
education-salary correlation was driven by more-educated people also
having more relevant experience.
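
The formula translates directly into a few lines. This is a minimal sketch with a hypothetical function name; it takes the three pairwise correlations as inputs and does not work from raw data.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of X and Y after removing the linear influence of Z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when X or Y is perfectly correlated with Z")
    return (r_xy - r_xz * r_yz) / denominator


# Education vs salary, controlling for experience (values from the example).
r_partial = partial_correlation(r_xy=0.65, r_xz=0.55, r_yz=0.70)
print(f"r(ed, sal | exp) = {r_partial:.2f}")
```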

Common Mistakes

1. Computing r without looking at the scatter plot

Non-linear relationships, outliers, and clusters can all produce
misleading r values. Always visualise first.

2. Interpreting r as the slope

r = 0.8 does NOT mean "for every 1 unit increase in X, Y increases by 0.8"
That's the regression coefficient, not r.
r only tells you the strength of the linear relationship.
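
A small sketch makes the difference concrete. The data are hypothetical and the function name is invented here. Expressing the same scores in tenths of a point multiplies the slope by 10 but leaves r untouched.

```python
import math


def pearson_and_slope(x, y):
    """Return (r, least-squares slope of y on x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx


# Hypothetical data: study hours and scores.
hours = [1, 2, 3, 4, 5]
score_points = [52, 55, 61, 64, 68]
score_tenths = [10 * s for s in score_points]  # same scores, different unit

r1, slope1 = pearson_and_slope(hours, score_points)
r2, slope2 = pearson_and_slope(hours, score_tenths)

print(f"points: r = {r1:.3f}, slope = {slope1:.1f}")
print(f"tenths: r = {r2:.3f}, slope = {slope2:.1f}")
```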

3. Treating absence of correlation as absence of relationship

r ≈ 0 means no LINEAR relationship.
X and Y could have a perfect quadratic, circular, or sinusoidal relationship
with r = 0. Use scatter plots to detect such patterns. Spearman ρ helps only
when the non-linear pattern is monotonic; it is also near 0 for U-shaped or
circular patterns.

4. Extrapolating from correlation to causation

Every statistics course teaches this, yet it's violated constantly in news reports.
"Study shows coffee drinkers live longer" → confounders abound.
Always ask: What else could explain this pattern?

Practice Exercises

1. Compute Pearson r for X: 1, 3, 5, 7, 9 and Y: 4, 8, 12, 16, 20. Interpret the result and compute r².
2. Two teachers rank 6 students. Teacher A: 1, 2, 3, 4, 5, 6; Teacher B: 2, 1, 4, 3, 6, 5. Compute Spearman ρ.
3. Sales team size and revenue both increase over 10 years, with r = 0.95. A manager concludes "hiring more salespeople causes revenue to grow." Identify all possible alternative explanations.
4. A dataset has r = 0.3. Is this statistically significant at α = 0.05 with n = 100? What about with n = 10? Compute both t-statistics.
5. You find r = 0.85 between variable A and variable B. A colleague says r² = 0.85, so 85% of variance in B is explained by A. What's wrong?

Summary

In this chapter you learned:

Pearson r: measures linear relationship between two continuous variables; r = Σ[(x−x̄)(y−ȳ)] / ((n−1)×sₓ×sᵧ), where sₓ and sᵧ are the sample standard deviations
Interpretation: −1=perfect negative, 0=none, +1=perfect positive; |r|≥0.9 very strong, 0.7–0.9 strong, 0.5–0.7 moderate, 0.3–0.5 weak
r²: proportion of Y variance explained by X; always report alongside r
Significance test: t = r√(n−2)/√(1−r²) with df=n−2; tests H₀: ρ=0
Spearman ρ: ranks-based correlation; robust to outliers and non-normality; use for ordinal data
Pearson vs Spearman: Pearson = linear, sensitive to outliers; Spearman = monotonic, robust
Always visualise: scatter plot first — Anscombe's Quartet shows identical r for completely different patterns
r=0 ≠ no relationship: could be a non-linear or non-monotonic pattern
Correlation ≠ causation: common cause, reverse causation, spurious correlations all produce r≠0
Partial correlation: removes the influence of a third variable to isolate the direct relationship

Next up: Linear Regression — the most powerful tool for modelling and predicting relationships.
