What Is Linear Regression?
Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a straight line (a plane or hyperplane when there are several predictors) to the data.
- Simple linear regression: one predictor X → one outcome Y
- Multiple linear regression: two or more predictors X₁, X₂, ... → one outcome Y
Goal: Find the line that best describes how Y changes with X, then use it to predict Y for new values of X.
Examples:
→ Predict salary based on years of experience
→ Predict revenue based on marketing spend
→ Predict house price based on area in sq ft
The Simple Linear Regression Model
Y = β₀ + β₁X + ε
Where:
Y = dependent variable (outcome, response)
X = independent variable (predictor, feature)
β₀ = intercept (value of Y when X = 0)
β₁ = slope (change in Y for a one-unit increase in X)
ε = error term (unexplained variation)
In practice, we estimate:
ŷ = b₀ + b₁x
Where ŷ (y-hat) is the predicted value and b₀, b₁ are the estimated coefficients.
Ordinary Least Squares (OLS)
OLS finds the line that minimises the sum of squared residuals (the vertical distances between observed Y values and predicted ŷ values):
Residual: eᵢ = yᵢ − ŷᵢ = yᵢ − (b₀ + b₁xᵢ)
Minimise: SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²
The OLS solution:
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
= r × (sᵧ / sₓ)
b₀ = ȳ − b₁ × x̄
Where r = Pearson correlation, sᵧ = SD of Y, sₓ = SD of X
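For readers who want to see where these formulas come from, here is a short derivation sketch: set the partial derivatives of SSE with respect to b₀ and b₁ to zero and solve. The shorthand S_xy, S_xx and S_yy is introduced here only for convenience and is not notation used elsewhere in the chapter.

```latex
\begin{aligned}
\mathrm{SSE}(b_0, b_1) &= \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial b_0} &= -2\sum_{i}\bigl(y_i - b_0 - b_1 x_i\bigr) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1\bar{x} \\
\frac{\partial\,\mathrm{SSE}}{\partial b_1} &= -2\sum_{i} x_i\bigl(y_i - b_0 - b_1 x_i\bigr) = 0 \\
\text{substituting } b_0:\quad
  & \sum_i x_i\,(y_i - \bar{y}) \;-\; b_1 \sum_i x_i\,(x_i - \bar{x}) = 0 \\
b_1 &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
     = \frac{S_{xy}}{S_{xx}} \\
\text{and since } r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},\ 
  \frac{s_y}{s_x} = \sqrt{\frac{S_{yy}}{S_{xx}}}:\quad
  & b_1 = r \cdot \frac{s_y}{s_x}
\end{aligned}
```

The fourth line uses the fact that Σ x̄(yᵢ − ȳ) = 0 and Σ x̄(xᵢ − x̄) = 0, which is why the raw xᵢ can be replaced by the deviations (xᵢ − x̄).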
Worked Example
Dataset: 8 employees — study hours per week (X) and performance score (Y)
| Employee | X (hours) | Y (score) |
|---|---|---|
| 1 | 2 | 55 |
| 2 | 4 | 60 |
| 3 | 6 | 70 |
| 4 | 8 | 75 |
| 5 | 10 | 80 |
| 6 | 12 | 88 |
| 7 | 14 | 92 |
| 8 | 16 | 95 |
From correlation chapter:
n = 8
x̄ = 9, ȳ = 76.875
Σ(xᵢ − x̄)(yᵢ − ȳ) = 499.0
Σ(xᵢ − x̄)² = 168
Step 1: Compute slope b₁
b₁ = 499.0 / 168 = 2.970
Interpretation: Each additional hour of study per week is associated with
a 2.97-point increase in performance score (on average).
Step 2: Compute intercept b₀
b₀ = ȳ − b₁ × x̄ = 76.875 − 2.970 × 9 = 76.875 − 26.73 = 50.145
Interpretation: An employee who studies 0 hours per week would be
predicted to score 50.1 (the baseline when X=0). Because the data only
cover 2 to 16 hours, this value is itself an extrapolation.
Regression equation:
ŷ = 50.145 + 2.970x
Step 3: Predictions
For x = 5 hours: ŷ = 50.145 + 2.970(5) = 50.145 + 14.85 = 65.0
For x = 11 hours: ŷ = 50.145 + 2.970(11) = 50.145 + 32.67 = 82.8
For x = 20 hours: ŷ = 50.145 + 2.970(20) = 50.145 + 59.40 = 109.5 ← extrapolation!
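The three steps above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not a production routine; the function name `fit_simple_ols` and the variable names are illustrative choices made here.

```python
# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]


def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_ols(hours, scores)
print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")

# Predictions, flagging any x outside the fitted range of 2 to 16 hours.
for x_new in (5, 11, 20):
    note = "" if min(hours) <= x_new <= max(hours) else "  (extrapolation)"
    print(f"x = {x_new:>2}: y-hat = {b0 + b1 * x_new:.1f}{note}")
```

The intercept prints as 50.143 rather than 50.145 because the code does not round the slope to 2.970 before computing b₀; the predictions agree with the hand calculation to one decimal place.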
Goodness of Fit: R-Squared
R² (coefficient of determination) measures how much of the total variability in Y is explained by the regression model:
Total Sum of Squares:
SST = Σ(yᵢ − ȳ)² = total variability in Y
Sum of Squares Explained:
SSR = Σ(ŷᵢ − ȳ)² = variability explained by the model
Sum of Squared Errors:
SSE = Σ(yᵢ − ŷᵢ)² = unexplained variability (residuals)
Relationship: SST = SSR + SSE
R² = SSR/SST = 1 − SSE/SST
For simple regression: R² = r²
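The decomposition can be checked numerically. The sketch below uses the eight-employee dataset from the worked example and plain Python; variable names are illustrative. It confirms that SST = SSR + SSE and that R² equals r² for a one-predictor OLS fit with an intercept.

```python
import math

# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)

# OLS coefficients and fitted values.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

# The three sums of squares.
sst = sum((y - y_bar) ** 2 for y in scores)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))

r_squared = 1 - sse / sst
pearson_r = s_xy / math.sqrt(s_xx * sst)

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {pearson_r ** 2:.4f}")
```

Because the code keeps full precision, its SSE can differ in the second decimal from a hand calculation that rounds each residual first.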
Computing R² for Our Example
Predicted values (ŷᵢ = 50.145 + 2.970xᵢ):
x=2: ŷ=56.1 x=4: ŷ=62.0 x=6: ŷ=68.0 x=8: ŷ=73.9
x=10: ŷ=79.8 x=12: ŷ=85.8 x=14: ŷ=91.7 x=16: ŷ=97.7
Residuals (eᵢ = yᵢ − ŷᵢ):
e₁ = 55 − 56.1 = −1.1
e₂ = 60 − 62.0 = −2.0
e₃ = 70 − 68.0 = +2.0
e₄ = 75 − 73.9 = +1.1
e₅ = 80 − 79.8 = +0.2
e₆ = 88 − 85.8 = +2.2
e₇ = 92 − 91.7 = +0.3
e₈ = 95 − 97.7 = −2.7
SSE = 1.1² + 2.0² + 2.0² + 1.1² + 0.2² + 2.2² + 0.3² + 2.7²
= 1.21 + 4.00 + 4.00 + 1.21 + 0.04 + 4.84 + 0.09 + 7.29
= 22.68 (22.73 when the residuals are not rounded first; the unrounded value is used below)
SST = Σ(yᵢ − ȳ)² = 1504.875 (computed in correlation chapter)
R² = 1 − 22.73/1504.875 = 1 − 0.015 = 0.985
98.5% of the variability in performance scores is explained by study hours.
Interpreting R²
R² = 0.985: Excellent fit; study hours explain 98.5% of score variance
R² = 0.70: Good fit; predictor explains 70% of outcome variance
R² = 0.30: Moderate fit; only 30% explained, many other factors matter
R² = 0.05: Poor fit; X barely predicts Y
Context matters: social science often accepts R²=0.20; physics might demand R²>0.99
Hypothesis Testing for Regression Coefficients
Testing the Slope (b₁)
H₀: β₁ = 0 (X has no linear effect on Y)
H₁: β₁ ≠ 0
Test statistic:
t = b₁ / SE(b₁) with df = n − 2
SE(b₁) = √(MSE / Σ(xᵢ − x̄)²) where MSE = SSE/(n−2)
For our example:
MSE = 22.73 / (8−2) = 22.73/6 = 3.788
SE(b₁) = √(3.788/168) = √0.02255 = 0.1502
t = 2.970 / 0.1502 = 19.77, df = 6
p < 0.0001 → slope is highly significant
95% CI for β₁:
b₁ ± t*(0.025, df=6) × SE(b₁)
= 2.970 ± 2.447 × 0.1502
= 2.970 ± 0.37
= (2.60, 3.34)
We are 95% confident that each additional study hour is associated with an average increase of 2.6 to 3.3 points in score.
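The slope test can be sketched in standard-library Python. The critical value 2.447 (two-sided 95%, df = 6) is taken from the text and not computed, since the standard library has no t-distribution; all names are illustrative. Results computed at full precision may differ slightly from hand calculations based on rounded residuals.

```python
import math

# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

T_CRIT_95_DF6 = 2.447  # two-sided 95% critical value for df = 6, as quoted in the text

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual variance estimate (MSE) with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
mse = sse / (n - 2)

# Standard error of the slope, t statistic, and 95% confidence interval.
se_b1 = math.sqrt(mse / s_xx)
t_stat = b1 / se_b1
ci_low = b1 - T_CRIT_95_DF6 * se_b1
ci_high = b1 + T_CRIT_95_DF6 * se_b1
reject_h0 = abs(t_stat) > T_CRIT_95_DF6

print(f"SE(b1) = {se_b1:.4f}, t = {t_stat:.2f}, df = {n - 2}")
print(f"95% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
print("Reject H0 at the 5% level" if reject_h0 else "Fail to reject H0")
```

Comparing |t| with the critical value gives the same decision as comparing the p-value with 0.05; an exact p-value would need a t-distribution routine from a statistics library.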
The F-Test for Overall Significance
H₀: β₁ = 0 (model has no predictive value)
H₁: β₁ ≠ 0
F = MSR / MSE where MSR = SSR/1 for simple regression
For simple regression, F = t² (the t-test and F-test are equivalent).
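A quick numerical check of the F = t² identity, again on the eight-employee dataset and with illustrative variable names:

```python
import math

# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))

msr = ssr / 1        # one predictor, so one regression degree of freedom
mse = sse / (n - 2)  # residual degrees of freedom
f_stat = msr / mse
t_stat = b1 / math.sqrt(mse / s_xx)

print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The two printed numbers match because SSR = b₁² × Σ(xᵢ − x̄)², so MSR/MSE is algebraically the square of b₁/SE(b₁).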
Residual Analysis
After fitting the model, always examine the residuals to check if assumptions are met.
What Residuals Should Look Like
If all assumptions are satisfied:
1. Residuals randomly scatter around 0 (no pattern)
2. Constant spread (no funnel shape) — homoscedasticity
3. Approximately normally distributed
4. No autocorrelation (for time series data)
Residual Plots
1. Residuals vs Fitted (ŷ) Plot:
Good: random scatter around horizontal zero line
Bad: curved pattern → non-linearity; fan shape → heteroscedasticity
2. Normal QQ Plot of Residuals:
Good: points fall roughly along the straight reference line
Bad: S-curve or extreme points → non-normal residuals
3. Scale-Location Plot:
Good: horizontal band with roughly equal spread
Bad: upward trend → variance increases with fitted values
4. Residuals vs Leverage (Cook's Distance):
High leverage + large residual → influential point
Investigate observations with Cook's D > 4/n (a common rule of thumb); remove a point only if it turns out to be a data error
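Leverage and Cook's distance can be computed directly for a one-predictor model. The sketch below uses the standard textbook formulas hᵢ = 1/n + (xᵢ − x̄)²/Σ(xᵢ − x̄)² and Dᵢ = eᵢ²/(p × MSE) × hᵢ/(1 − hᵢ)², with p = 2 estimated coefficients; these formulas are not given in the chapter and the variable names are illustrative.

```python
# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
p = 2  # number of estimated coefficients: b0 and b1

x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how far each x sits from the mean of x.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in hours]

# Cook's distance combines residual size with leverage.
cooks_d = [
    (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
    for e, h in zip(residuals, leverage)
]

threshold = 4 / n
flagged = [i + 1 for i, d in enumerate(cooks_d) if d > threshold]

for i, (h, d) in enumerate(zip(leverage, cooks_d), start=1):
    print(f"employee {i}: leverage = {h:.3f}, Cook's D = {d:.3f}")
print(f"threshold 4/n = {threshold:.2f}, flagged employees: {flagged}")
```

With only eight observations the 4/n cutoff is coarse, so a flag is a prompt to look more closely at a point, not a verdict on it.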
Assumptions of Linear Regression (LINE)
L — LINEARITY: True relationship between X and Y is linear
Check: scatter plot, residuals vs fitted plot
I — INDEPENDENCE: Observations are independent of each other
Check: study design; Durbin-Watson for time series
N — NORMALITY: Residuals are approximately normally distributed
Check: QQ plot of residuals, Shapiro-Wilk test on residuals
E — EQUAL VARIANCE (Homoscedasticity): Variance of residuals is
constant across all values of X
Check: residuals vs fitted plot; Breusch-Pagan test
If assumptions are violated:
- Non-linearity → transform variables (log, square root), add polynomial terms
- Non-normal residuals → transform Y (log Y, √Y), use robust regression
- Heteroscedasticity → weighted least squares, robust standard errors
- Dependence (time series) → use time series methods (AR, MA)
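As one concrete check from the list above, the Durbin-Watson statistic is simple to compute by hand: DW = Σ(eₜ − eₜ₋₁)² / Σeₜ². It ranges from 0 to 4, and values near 2 suggest little first-order autocorrelation. The sketch below treats the row order of the eight-employee dataset as if it were time order, purely for illustration; that data is not actually a time series.

```python
# Eight-employee dataset from the worked example (row order treated as time order
# for illustration only; this is an assumption, not a property of the data).
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

# Sum of squared successive differences divided by the sum of squared residuals.
numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, n))
denominator = sum(e ** 2 for e in residuals)
durbin_watson = numerator / denominator

print(f"Durbin-Watson statistic = {durbin_watson:.2f}")
```

Formal use of the statistic requires comparing it with tabulated lower and upper bounds for the given n and number of predictors, which this sketch does not do.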
Making Predictions
Interpolation vs Extrapolation
Our model: ŷ = 50.145 + 2.970x, fitted on X = 2 to 16
Interpolation (safe): x = 7 → ŷ = 50.145 + 20.79 = 70.9
We have data on both sides of x=7, so the prediction is reasonably reliable.
Extrapolation (risky): x = 25 → ŷ = 50.145 + 74.25 = 124.4
No data near x=25. The linear pattern may not hold.
If scores are capped at 100, a prediction of 124.4 is impossible, so extrapolation clearly fails here.
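One way to make the distinction hard to overlook is to have the prediction function report whether the input lies inside the fitted range. A minimal sketch, with the function name `predict_score` chosen here for illustration:

```python
B0, B1 = 50.145, 2.970  # coefficients of the fitted equation in the text
X_MIN, X_MAX = 2, 16    # range of study hours the model was fitted on


def predict_score(hours):
    """Return (prediction, is_extrapolation) for a given number of study hours."""
    y_hat = B0 + B1 * hours
    is_extrapolation = not (X_MIN <= hours <= X_MAX)
    return y_hat, is_extrapolation


for h in (7, 25):
    y_hat, extrapolated = predict_score(h)
    label = "extrapolation, treat with caution" if extrapolated else "interpolation"
    print(f"x = {h}: y-hat = {y_hat:.1f} ({label})")
```

Returning the flag alongside the number leaves the decision to the caller; raising an error or issuing a warning instead would be an equally reasonable design.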
Prediction Interval vs Confidence Interval for Mean Response
Confidence Interval for mean response at x = x₀:
ŷ ± t* × SE_mean (narrower, for the AVERAGE Y at x=x₀)
SE_mean = s × √(1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²), where s = √MSE
Prediction Interval for a new individual observation:
ŷ ± t* × SE_pred (wider, for ONE new Y at x=x₀)
SE_pred = s × √(1 + 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²)
SE_pred > SE_mean because individual observations vary around the mean line.
For x₀ = 10 (in our example), ŷ = 79.8 (79.845 before rounding), t* = 2.447 for df = 6, SE_mean ≈ 0.7 and SE_pred ≈ 2.1:
95% CI for mean: (78.1, 81.6)
95% PI for individual: (74.8, 84.9)
The PI is ALWAYS wider than the CI for the mean.
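Both intervals can be computed with the standard textbook standard-error formulas for simple regression: s√(1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²) for the mean response, with an extra 1 under the root for a new observation. The sketch below is standard-library Python; the critical value 2.447 (95%, df = 6) is taken from the text, and the function name `intervals_at` is illustrative.

```python
import math

# Eight-employee dataset from the worked example.
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

T_CRIT_95_DF6 = 2.447  # two-sided 95% critical value for df = 6, as quoted in the text

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s = math.sqrt(sse / (n - 2))  # residual standard error, sqrt(MSE)


def intervals_at(x0):
    """Return (y_hat, ci, pi) where ci and pi are (low, high) tuples at x = x0."""
    y_hat = b0 + b1 * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = s * math.sqrt(spread)      # uncertainty in the fitted line
    se_pred = s * math.sqrt(1 + spread)  # adds the scatter of one new observation
    ci = (y_hat - T_CRIT_95_DF6 * se_mean, y_hat + T_CRIT_95_DF6 * se_mean)
    pi = (y_hat - T_CRIT_95_DF6 * se_pred, y_hat + T_CRIT_95_DF6 * se_pred)
    return y_hat, ci, pi


y_hat, ci, pi = intervals_at(10)
print(f"y-hat = {y_hat:.1f}")
print(f"95% CI for mean response: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for one new score: ({pi[0]:.1f}, {pi[1]:.1f})")
```

Both intervals widen as x₀ moves away from x̄, because the (x₀ − x̄)² term grows; calling `intervals_at(16)` shows this.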
Multiple Linear Regression (Brief Overview)
Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Example: Predict salary (in ₹000) based on years of experience AND years of education
ŷ = 35 + 2.5(experience) + 8.3(education)
Interpretations:
b₁ = 2.5: Holding education constant, each additional year of experience
is associated with ₹2,500 higher salary
b₂ = 8.3: Holding experience constant, each additional year of education
is associated with ₹8,300 higher salary
Key concepts:
→ Adjusted R²: penalises adding unhelpful predictors (plain R² never decreases when predictors are added, even useless ones)
→ Multicollinearity: predictors correlated with each other → inflated SEs, unstable coefficients
Check: VIF (Variance Inflation Factor); VIF > 10 indicates severe multicollinearity
→ Feature selection: include only predictors that improve the model (AIC, BIC, cross-validation)
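Two of these quantities have simple closed forms: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) for k predictors, and VIFⱼ = 1/(1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on the other predictors. The sketch below wraps them in two helper functions whose names are chosen here for illustration; the R²ⱼ = 0.90 input is a made-up value, not a result from this chapter.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def vif(r2_j):
    """Variance inflation factor, given R-squared of predictor j on the other predictors."""
    if not 0 <= r2_j < 1:
        raise ValueError("r2_j must be in [0, 1)")
    return 1 / (1 - r2_j)


# The chapter's simple regression: R^2 = 0.985, n = 8 observations, k = 1 predictor.
print(f"adjusted R^2 = {adjusted_r2(0.985, n=8, k=1):.4f}")

# Hypothetical input: a predictor that is 90% explained by the other predictors.
print(f"VIF = {vif(0.90):.1f}")
```

The hypothetical R²ⱼ of 0.90 sits right at the VIF = 10 rule-of-thumb boundary mentioned above.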
Practical Examples
Example 1: Marketing Spend → Revenue
Data (12 months):
Marketing spend (₹000) range: 10–40
Revenue (₹000) range: 85–240
OLS result:
ŷ = 32.5 + 5.14 × marketing
b₁ = 5.14 (SE=0.18, t=28.6, p<0.001)
R² = 0.987
Interpretation:
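Each ₹1,000 increase in marketing spend is associated with a ₹5,140 increase in revenue.
(About ₹5.14 of revenue per additional rupee of spend, on average; this is a revenue ratio, not a net ROI.)
The intercept (₹32,500) is the model's estimate of revenue at zero marketing spend. Since observed spend ranged from ₹10,000 to ₹40,000, treat it as an extrapolation.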
Prediction for ₹22,000 marketing spend:
ŷ = 32.5 + 5.14(22) = 32.5 + 113.1 = 145.6 → ₹145,600 expected revenue
Example 2: Property Prices
Dataset: 50 properties
X = floor area (sq ft), Y = price (₹ lakhs)
ŷ = −12.5 + 0.45 × area
b₁ = 0.45 (SE=0.03, t=15.0, p<0.001)
R² = 0.822
Interpretation:
Each additional square foot of area is associated with ₹45,000 higher price.
82.2% of price variation is explained by floor area.
The remaining 17.8% is not explained by floor area and may reflect location, age, amenities, etc.
Residual analysis revealed: one property with very high leverage (new luxury property).
After investigation, retained in the model — it reflects the real market.
Example 3: Detecting Non-Linearity
Scenario: Engine temperature (X) vs fuel efficiency (Y)
Initial model: ŷ = 40 + 0.08x, R²=0.48 (poor)
Residuals vs fitted plot: clear U-shaped pattern → non-linear!
Better model with quadratic term:
ŷ = 10 + 1.2x − 0.003x²
R² = 0.91 (much better)
Lesson: Always check residuals. A significant slope doesn't mean the
model is adequate — the relationship might be non-linear.
Simple vs Multiple Regression
| Aspect | Simple | Multiple |
|---|---|---|
| Predictors | 1 (X) | 2+ (X₁, X₂, …) |
| Model | ŷ = b₀ + b₁X | ŷ = b₀ + b₁X₁ + b₂X₂ + ... |
| R² formula | = r² | SSR/SST (direct) |
| Adjusted R² | Not needed | Needed (penalises extra predictors) |
| Key concern | Linearity | Multicollinearity + Linearity |
| Coefficient interpretation | Marginal effect | Partial effect (holding others constant) |
Common Mistakes
1. Extrapolating beyond the data range
If the model was fit on X = 5 to 50, predicting at X = 200 is dangerous.
The relationship may be completely different outside the observed range.
Always state: "Valid for X between [min, max] in the training data."
2. Interpreting the intercept when X=0 is not meaningful
Model: salary = 20,000 + 1,500 × experience
b₀ = 20,000 "at zero years of experience" → could be meaningful
But: model fit on 3–20 years; X=0 is extrapolation!
The intercept is a mathematical anchor, not always interpretable.
3. Confusing statistical significance with effect size
With n=10,000, b₁=0.001 might be significant (t=10, p<0.001).
BUT a 0.001-unit change in Y per unit X might be economically meaningless.
Always report both: b₁ and its practical significance.
4. Omitting confounders → omitted variable bias
Regress salary on education: b₁=5,000 (per year of education)
But experience is correlated with education AND salary.
Omitting experience → education's coefficient absorbs some of experience's effect.
Include relevant control variables to reduce bias.
5. R² as the only measure of model quality
Two models on the same data:
Model A: R²=0.82, residuals show clear pattern (wrong shape)
Model B: R²=0.75, residuals look clean (assumptions met)
Model B is BETTER despite lower R².
Always combine R² with residual diagnostics.
Practice Exercises
- For the data: X: 1, 2, 3, 4, 5 and Y: 3, 5, 4, 8, 7, compute b₁, b₀, and ŷ when x=6. State whether this is interpolation or extrapolation.
- An OLS regression gives SST=800, SSE=200. Compute SSR and R². Interpret the R² value.
- Regression of test score (Y) on study time (X, hours): ŷ = 40 + 6X, R²=0.64, n=30. Test H₀: β₁=0 at α=0.05 given SE(b₁)=1.2.
- A residuals vs fitted plot shows a clear funnel shape (residuals fan out as ŷ increases). Which assumption is violated? What remedies exist?
- Two models: Model A (X₁ only): R²=0.55, Adjusted R²=0.54. Model B (X₁, X₂, X₃): R²=0.57, Adjusted R²=0.51. Which model would you prefer and why?
Summary
In this chapter you learned:
- Simple linear regression model: ŷ = b₀ + b₁X where b₁ = slope (Δŷ per ΔX) and b₀ = intercept (ŷ when X=0)
- OLS: minimises Σ(yᵢ − ŷᵢ)²; estimates: b₁ = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)², b₀ = ȳ − b₁x̄
- R² = 1 − SSE/SST = proportion of Y variance explained; for simple regression R² = r²
- Significance test for slope: t = b₁/SE(b₁), df = n−2; tests whether X has any linear effect on Y
- Residuals = observed − predicted; residual plots diagnose assumption violations
- LINE assumptions: Linearity, Independence, Normality (of residuals), Equal Variance
- Prediction interval (for one new observation) is always wider than the confidence interval (for the mean response)
- Interpolation (within data range) is safe; extrapolation (outside range) is risky
- Multiple regression: adds more predictors; interpret each coefficient as partial effect (holding others constant)
- Adjusted R² penalises unnecessary predictors; use it to compare models with different numbers of predictors
- Common pitfalls: extrapolation, omitted variable bias, ignoring residual patterns, treating R² as the only model quality measure
This completes the Statistics tutorial series — you now have a complete foundation from descriptive statistics through regression, covering all major methods used in data science, business analytics, finance, and research.