Why Spread Matters
Two datasets can have the same mean but completely different characters:
Team A scores: 70, 72, 74, 76, 78 → mean = 74, very consistent
Team B scores: 40, 55, 74, 90, 111 → mean = 74, wildly variable
Measures of spread describe how scattered the data is around the centre. Without them, the mean tells an incomplete — sometimes misleading — story.
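The two teams can be compared with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names `team_a` and `team_b` are only illustrative.

```python
from statistics import mean, stdev

team_a = [70, 72, 74, 76, 78]
team_b = [40, 55, 74, 90, 111]

for name, scores in (("Team A", team_a), ("Team B", team_b)):
    # Both teams share the same mean, but their sample standard deviations differ widely.
    print(f"{name}: mean = {mean(scores)}, SD = {stdev(scores):.2f}")
```

The mean alone cannot tell the two teams apart; the standard deviation (introduced later in this chapter) can.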
Sample Dataset
Annual salaries for 8 analysts (₹ thousands):
62, 68, 72, 75, 78, 82, 85, 91
n = 8
x̄ = (62+68+72+75+78+82+85+91) / 8 = 613 / 8 = 76.625
1. Range
The simplest measure of spread.
Range = Maximum − Minimum
Range = 91 − 62 = 29 (₹29,000)
Limitation: Uses only two values. One extreme outlier completely changes it.
Same data, but with the top salary (91) replaced by an outlier of 200 (₹200k):
Range = 200 − 62 = 138 ← nearly five times the original range, even though 7 of 8 values are unchanged
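The sensitivity of the range can be checked directly. A small sketch, assuming a hypothetical helper named `value_range` and a second list in which one salary has been changed to 200:

```python
def value_range(values):
    """Return max - min for a non-empty sequence of numbers."""
    return max(values) - min(values)

salaries = [62, 68, 72, 75, 78, 82, 85, 91]   # ₹ thousands
with_outlier = salaries[:-1] + [200]           # one value changed to 200

print(value_range(salaries))       # 29
print(value_range(with_outlier))   # 138
```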
2. Variance
Measures the average squared deviation from the mean.
Why Square the Deviations?
Raw deviations sum to zero:
Σ(xᵢ − x̄) = (62−76.625) + (68−76.625) + ... + (91−76.625) = 0 (always)
Squaring makes every deviation non-negative, so positive and negative deviations can no longer cancel.
Sample Variance Formula
s² = Σ(xᵢ − x̄)² / (n − 1)
We divide by (n−1), not n, because we're estimating the population variance from a sample. Dividing by (n−1) makes s² an unbiased estimator of σ². This is called Bessel's correction.
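One way to see the effect of Bessel's correction is a small simulation. The sketch below assumes a hypothetical normal population with mean 50 and σ = 10 (so σ² = 100), draws many samples of size 5, and averages the two competing estimates. The seed, sample size, and number of trials are arbitrary choices.

```python
import random
from statistics import mean, pvariance, variance

rng = random.Random(0)      # fixed seed so the run is repeatable
true_variance = 10 ** 2     # hypothetical population: normal, sigma = 10

biased, unbiased = [], []
for _ in range(2000):
    sample = [rng.gauss(50, 10) for _ in range(5)]
    biased.append(pvariance(sample))    # divides by n
    unbiased.append(variance(sample))   # divides by n - 1

print(f"true variance:            {true_variance}")
print(f"average when dividing by n:   {mean(biased):.1f}")
print(f"average when dividing by n-1: {mean(unbiased):.1f}")
```

With n = 5, the divide-by-n estimate is always exactly 4/5 of the divide-by-(n−1) estimate, so its average falls short of σ² by about 20%, while the corrected estimate averages out close to σ².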
Calculation
x̄ = 76.625
Deviations and squared deviations:
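| x | x − x̄ | (x − x̄)² |
|---|---|---|
| 62 | −14.625 | 213.891 |
| 68 | −8.625 | 74.391 |
| 72 | −4.625 | 21.391 |
| 75 | −1.625 | 2.641 |
| 78 | 1.375 | 1.891 |
| 82 | 5.375 | 28.891 |
| 85 | 8.375 | 70.141 |
| 91 | 14.375 | 206.641 |
| Sum | 0 | 619.875 |

The squared deviations are shown rounded to three decimals; the sum is taken over the unrounded values.
s² = 619.875 / (8−1) = 619.875 / 7 = 88.554 (in squared ₹ thousands)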
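The same calculation can be reproduced step by step. This is a sketch with the standard-library `statistics` module; the intermediate names (`x_bar`, `deviations`, `sum_sq`, `s2`) are illustrative.

```python
from statistics import mean, variance

salaries = [62, 68, 72, 75, 78, 82, 85, 91]  # ₹ thousands

x_bar = mean(salaries)
deviations = [x - x_bar for x in salaries]
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations
s2 = sum_sq / (len(salaries) - 1)            # Bessel's correction: divide by n - 1

print(sum(deviations))   # 0.0, the raw deviations cancel
print(round(s2, 3))      # 88.554
```

`statistics.variance(salaries)` returns the same sample variance in one call.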
Population Variance
σ² = Σ(xᵢ − μ)² / N (divide by N when you have ALL data)
Limitation of variance: Units are squared (₹²) — hard to interpret. We need the standard deviation.
3. Standard Deviation
The square root of the variance — back in the original units.
Sample SD: s = √s² = √88.554 = 9.41 (₹9,410)
Population SD: σ = √σ²
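As a quick check, the standard library offers both versions. The sketch below treats the eight salaries first as a sample and then, purely for comparison, as if they were the whole population.

```python
from statistics import pstdev, stdev

salaries = [62, 68, 72, 75, 78, 82, 85, 91]  # ₹ thousands

s = stdev(salaries)        # sample SD: divides by n - 1
sigma = pstdev(salaries)   # population SD: divides by N
print(f"s = {s:.2f}, sigma = {sigma:.2f}")
```

For the same data the population formula always gives the smaller number, because it divides the same sum of squares by N rather than n − 1.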
Interpretation
x̄ = 76.625 (₹76,625)
s = 9.41 (₹9,410)
→ A typical analyst's salary is about ₹9,410 away from the mean
→ Most salaries fall roughly between x̄ ± s = ₹67,215 and ₹86,035
The Empirical Rule (68-95-99.7 Rule)
For bell-shaped (approximately normal) distributions:
About 68% of data falls within 1 SD of the mean: (x̄ ± 1s)
About 95% of data falls within 2 SDs of the mean: (x̄ ± 2s)
About 99.7% of data falls within 3 SDs of the mean: (x̄ ± 3s)
If exam scores: x̄ = 70, s = 10
→ about 68% of students scored between 60 and 80
→ about 95% scored between 50 and 90
→ about 99.7% scored between 40 and 100
→ only about 0.3% scored below 40 or above 100
This rule is covered in depth in Chapter 10 (Normal Distribution).
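The three percentages can be checked against an exact normal curve. A sketch, assuming the exam scores follow a normal distribution with mean 70 and SD 10 exactly (real scores would only be approximately normal); `coverage` is an illustrative name.

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=10)  # assumed exactly normal

coverage = {}
for k in (1, 2, 3):
    low, high = 70 - k * 10, 70 + k * 10
    coverage[k] = scores.cdf(high) - scores.cdf(low)   # share of scores in [low, high]
    print(f"within {k} SD ({low} to {high}): {coverage[k]:.1%}")
```

The rounded figures 68, 95 and 99.7 are approximations of these shares, which is why the rule is only a guide for data that are roughly bell-shaped.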
Properties of Standard Deviation
- Always ≥ 0 (zero only if all values are identical)
- Same units as the original data
- Sensitive to outliers (like the mean)
- The most widely used measure of spread for symmetric distributions
4. Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of data — robust to outliers.
Quartiles
Sort the data. Divide into four equal quarters:
Sorted: 62, 68, 72, 75, 78, 82, 85, 91
Q1 (25th percentile): median of the lower half
Lower half: 62, 68, 72, 75
Q1 = (68+72)/2 = 70
Q2 (50th percentile) = Median = (75+78)/2 = 76.5
Q3 (75th percentile): median of the upper half
Upper half: 78, 82, 85, 91
Q3 = (82+85)/2 = 83.5
IQR = Q3 − Q1 = 83.5 − 70 = 13.5 (₹13,500)
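The median-of-halves method above can be sketched as a small function. The name `quartiles` is hypothetical, and the handling of odd-length data (leaving the middle value out of both halves) is an assumption, since the text only shows the even case. Other conventions exist: `statistics.quantiles`, for example, interpolates and can return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: for an odd number of values, the middle value is
    excluded from both the lower and the upper half.
    """
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles([62, 68, 72, 75, 78, 82, 85, 91])
print(q1, q2, q3, q3 - q1)
```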
Interpretation
IQR = 13.5 → the middle 50% of salaries span a range of ₹13,500
25% earn below ₹70k, 75% earn below ₹83.5k
Outlier Detection with IQR (Tukey's Fences)
Lower fence = Q1 − 1.5 × IQR = 70 − 1.5 × 13.5 = 70 − 20.25 = 49.75
Upper fence = Q3 + 1.5 × IQR = 83.5 + 1.5 × 13.5 = 83.5 + 20.25 = 103.75
Any value below 49.75 or above 103.75 is flagged as a potential outlier.
Our data: all values between 62 and 91 → no outliers by this rule.
If the top salary were 200 instead of 91, the quartiles and fences would not change: 200 > 103.75 → outlier flagged.
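The fence calculation is easy to wrap in a helper. A sketch, with `tukey_fences` as a hypothetical name and the multiplier `k` defaulting to the 1.5 used above:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(70, 83.5)
candidates = [62, 91, 200]   # smallest salary, largest salary, and a value of 200
flagged = [x for x in candidates if x < lower or x > upper]
print(lower, upper, flagged)   # 49.75 103.75 [200]
```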
When to Use IQR vs Standard Deviation
| | IQR | Standard Deviation |
|---|---|---|
| Resistant to outliers | ✓ | ✗ |
| Uses all data | ✗ (middle 50%) | ✓ |
| Used with | Median | Mean |
| Best for | Skewed data, outlier detection | Symmetric data, normal distribution |
| Used in | Box plots | Most statistical tests |
5. Coefficient of Variation (CV)
Compares spread relative to the mean — useful when comparing distributions with different units or scales.
CV = (s / x̄) × 100%
Salary dataset: CV = (9.41 / 76.625) × 100% = 12.3%
Interpretation: The standard deviation is 12.3% of the mean.
→ Moderate variability relative to the mean
Comparing Two Datasets
Dataset A: Salaries in Finance: x̄ = 82 (₹ thousands), s = 9 → CV = 11%
Dataset B: Project durations: x̄ = 12 days, s = 4 days → CV = 33%
Project durations are relatively more variable than salaries,
even though their SD is the smaller number. The two SDs are in different units, so they cannot be compared directly.
CV is dimensionless, which allows comparison across different units or scales.
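The CV figures in this section can be reproduced with a one-line formula. In the sketch below, `coefficient_of_variation` is a hypothetical helper; rejecting a non-positive mean is a design choice here, since the ratio is not meaningful when the mean is zero or negative.

```python
def coefficient_of_variation(sd, mean):
    """Return the CV as a percentage of the mean."""
    if mean <= 0:
        raise ValueError("CV needs a positive mean")
    return sd / mean * 100

print(f"{coefficient_of_variation(9.41, 76.625):.1f}%")   # salary dataset
print(f"{coefficient_of_variation(9, 82):.1f}%")          # Dataset A
print(f"{coefficient_of_variation(4, 12):.1f}%")          # Dataset B
```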
6. The Box Plot (Box-and-Whisker Plot)
Visualises the five-number summary: Min, Q1, Median, Q3, Max.
Box Plot for salary data:
Min       Q1     Median   Q3       Max
├─────────[━━━━━━┿━━━━━━━━]────────┤
62        70     76.5     83.5     91
The box spans Q1 to Q3 (the IQR).
The line inside the box is the median.
Whiskers extend to Min and Max, or, when outliers are shown separately, to the most extreme values that still lie within the fences.
Points beyond the fences are plotted individually as outliers (dots/circles).
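The numbers a box plot draws can be collected in one pass. This sketch assumes the median-of-halves quartiles used earlier and one common whisker convention (whiskers stop at the most extreme values inside Tukey's fences). `box_plot_stats` is a hypothetical name, and plotting libraries may use slightly different quartile rules.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Return the five-number summary, whisker ends, and outliers."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)   # Tukey's fences
    inside = [x for x in data if lower <= x <= upper]
    return {
        "five_number": (data[0], q1, median(data), q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower or x > upper],
    }

stats = box_plot_stats([62, 68, 72, 75, 78, 82, 85, 91])
print(stats)
```

For the salary data there are no outliers, so the whiskers coincide with Min and Max.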
Side-by-Side Box Plots
Comparing distributions across groups — more informative than just comparing means:
Finance ──[━━━━━━━━━━]──
Tech ──────[━━━━━━━━━━━━━━━]─────
Marketing ──[━━━━━━]──
Quick visual: Tech has a higher median and a wider spread than Finance or Marketing.
Practical Examples
Example 1: Comparing Two Investment Strategies
Strategy A monthly returns (%): 2.1, 2.3, 2.0, 2.4, 2.2, 2.5, 2.1, 2.3, 2.0, 2.2
Strategy B monthly returns (%): 4.0, −1.0, 5.5, 0.5, 3.8, −2.0, 6.1, 0.3, 4.2, −0.4
x̄_A = 2.21%, s_A = 0.17%, CV_A = 7.5%
x̄_B = 2.10%, s_B = 2.92%, CV_B = 139%
Strategy A has a similar mean return but dramatically lower variability.
Strategy B has high-upside months but also losses, so it carries much higher risk.
Risk-adjusted, Strategy A is clearly superior (similar return, far less risk).
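The summary statistics for the two strategies can be recomputed from the monthly returns. A sketch using sample (n − 1) standard deviations; `summarise` is a hypothetical helper.

```python
from statistics import mean, stdev

strategy_a = [2.1, 2.3, 2.0, 2.4, 2.2, 2.5, 2.1, 2.3, 2.0, 2.2]
strategy_b = [4.0, -1.0, 5.5, 0.5, 3.8, -2.0, 6.1, 0.3, 4.2, -0.4]

def summarise(returns):
    """Return (mean, sample SD, CV in percent) for a list of returns."""
    m, s = mean(returns), stdev(returns)
    return m, s, s / m * 100

for name, returns in (("A", strategy_a), ("B", strategy_b)):
    m, s, cv = summarise(returns)
    print(f"Strategy {name}: mean = {m:.2f}%, s = {s:.2f}%, CV = {cv:.1f}%")
```

Because the CV divides by the mean, it becomes unstable when the mean return is close to zero, so it is best read alongside the SD rather than instead of it.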
Example 2: Quality Control
Two production lines, target output = 500 units/day:
Line A: 495, 498, 502, 504, 497, 501, 500, 503, 499, 501
x̄_A = 500, s_A = 2.8 units, CV = 0.56%
Line B: 480, 510, 490, 520, 485, 515, 505, 495, 510, 490
x̄_B = 500, s_B = 13.7 units, CV = 2.75%
Both lines produce 500 units on average.
Line B is about 5× as variable (13.7 / 2.8 ≈ 4.9), which points to worse quality control.
Example 3: Student Performance Analysis
Class of 30 students, scores:
Q1 = 58, Median = 72, Q3 = 84
IQR = 84 − 58 = 26
Lower fence = 58 − 1.5(26) = 19 → any score below 19 = outlier
Upper fence = 84 + 1.5(26) = 123 → any score above 123 = outlier (impossible here)
Three students scored 12, 15, 18 → all three are outliers
→ These students need individual attention
Comparing All Measures of Spread
| Measure | Formula | Resistant? | Best for |
|---|---|---|---|
| Range | Max − Min | No | Quick overview |
| Variance (s²) | Σ(x−x̄)²/(n−1) | No | Further calculations |
| Std Dev (s) | √s² | No | Symmetric data, normal distribution |
| IQR | Q3 − Q1 | Yes | Skewed data, outlier detection |
| CV | (s/x̄)×100% | No | Comparing across different units |
Common Mistakes
1. Confusing sample SD with population SD
Sample: divide by (n−1) → s (estimating population spread from a sample)
Population: divide by N → σ (you have ALL the data)
Using n instead of (n−1) for a sample underestimates the true spread.
2. Interpreting SD without context
s = 5 marks is huge for a 0–20 quiz but tiny for a 0–1000 exam. Use CV, or compare s with the scale of the data, for context.
3. Reporting only the mean without spread
"Average salary = ₹76.6k" is incomplete. "Average = ₹76.6k, SD = ₹9.4k" tells you much more, and "Median = ₹76.5k, IQR = ₹13.5k" is even better for skewed salary distributions.
4. Using SD for skewed distributions
Salary data with the top value replaced by an outlier (₹200k):
s ≈ 45k (distorted by the outlier, up from 9.4k)
IQR = 13.5k (unaffected)
Report IQR + Median for skewed salary data.
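The contrast between the two measures can be checked numerically. The sketch below assumes the outlier scenario means the top salary (91) becomes 200, and it uses the median-of-halves quartiles from earlier in the chapter; `iqr` is a hypothetical helper.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

original = [62, 68, 72, 75, 78, 82, 85, 91]
with_outlier = [62, 68, 72, 75, 78, 82, 85, 200]   # assumption: top salary becomes 200

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(f"{label}: s = {stdev(data):.1f}, IQR = {iqr(data)}")
```

A single extreme value moves the SD several-fold, while the IQR, which only looks at the middle half of the data, does not move at all.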
Practice Exercises
- Calculate the range, variance, and standard deviation for the dataset: 3, 7, 7, 9, 11, 14, 15, 18, 21, 25.
- Find Q1, Q3, and IQR for: 12, 15, 18, 22, 25, 28, 31, 35, 40. Identify any outliers using Tukey's fences.
- Two mutual funds:
  - Fund X: mean annual return = 12%, SD = 2%
  - Fund Y: mean annual return = 18%, SD = 9%

  Which is more variable relative to its return? Calculate CV for each.
- A factory tracks daily defects: 2, 4, 3, 5, 4, 3, 4, 2, 25, 4. Calculate the mean and SD with and without the outlier (25). Which measure of spread better describes this dataset?
- Explain in words why we divide by (n−1) instead of n when calculating sample variance.
Summary
In this chapter you learned:
- Range = Max − Min; simple but outlier-sensitive
- Variance s² = Σ(x−x̄)²/(n−1); divide by (n−1) for sample (Bessel's correction); units are squared
- Standard deviation s = √s²; same units as data; 68% within 1 SD, 95% within 2 SDs (for normal distributions)
- IQR = Q3 − Q1; spread of middle 50%; resistant to outliers
- Tukey's fences: Q1−1.5×IQR and Q3+1.5×IQR — values beyond these are potential outliers
- CV = (s/x̄)×100% — relative spread; compare across different scales or units
- Box plot visualises Min, Q1, Median, Q3, Max — great for comparing groups
- Use SD+Mean for symmetric data; use IQR+Median for skewed data or when outliers are present
Next up: Data Visualisation for Statistics — histograms, box plots, scatter plots, and how to choose the right chart for statistical data.