Linear Regression & Model Evaluation

What Is Linear Regression?

Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting the straight line (or, with several predictors, the plane) that best matches the data.

Simple linear regression: one predictor X → one outcome Y
Multiple linear regression: two or more predictors X₁, X₂, ... → one outcome Y

Goal: Find the line that best describes how Y changes with X, then use it to predict Y for new values of X.

Examples:
→ Predict salary based on years of experience
→ Predict revenue based on marketing spend
→ Predict house price based on area in sq ft

The Simple Linear Regression Model

Y = β₀ + β₁X + ε

Where:
Y    = dependent variable (outcome, response)
X    = independent variable (predictor, feature)
β₀   = intercept (expected value of Y when X = 0)
β₁   = slope (expected change in Y for a one-unit increase in X)
ε    = error term (unexplained variation)

In practice, we estimate:
ŷ = b₀ + b₁x

Where ŷ (y-hat) is the predicted value and b₀, b₁ are the estimated coefficients.

Ordinary Least Squares (OLS)

OLS finds the line that minimises the sum of squared residuals (the vertical distances between observed Y values and predicted ŷ values):

Residual: eᵢ = yᵢ − ŷᵢ = yᵢ − (b₀ + b₁xᵢ)

Minimise: SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²

The OLS solution:
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
   = r × (sᵧ / sₓ)

b₀ = ȳ − b₁ × x̄

Where r = Pearson correlation, sᵧ = SD of Y, sₓ = SD of X
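
The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `ols_fit` and `slope_via_correlation` and the toy data are assumptions made for this example, not part of the chapter.

```python
from math import sqrt

def ols_fit(x, y):
    """Closed-form OLS estimates (b0, b1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

def slope_via_correlation(x, y):
    """Same slope computed as r * (s_y / s_x), using sample standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r = sxy / sqrt(sxx * syy)   # Pearson correlation
    s_x = sqrt(sxx / (n - 1))
    s_y = sqrt(syy / (n - 1))
    return r * s_y / s_x

# Toy data (hypothetical), roughly following y = 1 + 2x
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = ols_fit(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"slope via r * (s_y / s_x) = {slope_via_correlation(x, y):.3f}")
```

Both routes give the same slope because the (n − 1) factors in r, sᵧ and sₓ cancel, leaving Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².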

Worked Example

Dataset: 8 employees — study hours per week (X) and performance score (Y)

Employee  X (hours)  Y (score)
1              2        55
2              4        60
3              6        70
4              8        75
5             10        80
6             12        88
7             14        92
8             16        95

From the correlation chapter:
n = 8
x̄ = 9, ȳ = 76.875
Σ(xᵢ − x̄)(yᵢ − ȳ) = 499.0
Σ(xᵢ − x̄)² = 168

Step 1: Compute slope b₁
b₁ = 499.0 / 168 = 2.970

Interpretation: Each additional hour of study per week is associated with
a 2.97-point increase in performance score (on average).

Step 2: Compute intercept b₀
b₀ = ȳ − b₁ × x̄ = 76.875 − 2.970 × 9 = 76.875 − 26.73 = 50.145

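Interpretation: An employee who studies 0 hours per week would be
predicted to score 50.1 (the baseline when X=0). The data cover only
2 to 16 hours, so X=0 lies outside the observed range and this
baseline is itself an extrapolation.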

Regression equation:
ŷ = 50.145 + 2.970x

Step 3: Predictions
For x = 5 hours:  ŷ = 50.145 + 2.970(5) = 50.145 + 14.85 = 65.0
For x = 11 hours: ŷ = 50.145 + 2.970(11) = 50.145 + 32.67 = 82.8
For x = 20 hours: ŷ = 50.145 + 2.970(20) = 50.145 + 59.40 = 109.5 ← extrapolation!
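
The three steps above can be reproduced with a short script. This is a sketch in plain Python; the variable names (`hours`, `scores`) and the helper `predict` are illustrative choices, and the data are the eight employees from the table.

```python
# Data from the worked example: study hours (x) and performance scores (y)
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)

b1 = sxy / sxx              # Step 1: slope
b0 = y_bar - b1 * x_bar     # Step 2: intercept

def predict(x):
    """Predicted score for x study hours, using the unrounded coefficients."""
    return b0 + b1 * x

print(f"Sxy = {sxy:.1f}, Sxx = {sxx:.0f}")
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
for x_new in (5, 11, 20):   # Step 3: predictions
    print(f"x = {x_new:>2}: y_hat = {predict(x_new):.1f}")
```

The script keeps full precision, so its intercept comes out as 50.143 rather than 50.145; the hand calculation rounds the slope to 2.970 before computing b₀. The predictions agree to one decimal place.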

Goodness of Fit: R-Squared

R² (coefficient of determination) measures how much of the total variability in Y is explained by the regression model:

Total Sum of Squares:
SST = Σ(yᵢ − ȳ)² = total variability in Y

Sum of Squares Explained:
SSR = Σ(ŷᵢ − ȳ)² = variability explained by the model

Sum of Squared Errors:
SSE = Σ(yᵢ − ŷᵢ)² = unexplained variability (residuals)

Relationship: SST = SSR + SSE

R² = SSR/SST = 1 − SSE/SST

For simple regression: R² = r²

Computing R² for Our Example

Predicted values (ŷᵢ = 50.145 + 2.970xᵢ):
x=2:  ŷ=56.1     x=4:  ŷ=62.0     x=6:  ŷ=67.9     x=8:  ŷ=73.9
x=10: ŷ=79.8     x=12: ŷ=85.8     x=14: ŷ=91.7     x=16: ŷ=97.6

Residuals (eᵢ = yᵢ − ŷᵢ):
e₁ = 55 − 56.1 = −1.1
e₂ = 60 − 62.0 = −2.0
e₃ = 70 − 67.9 = +2.1
e₄ = 75 − 73.9 = +1.1
e₅ = 80 − 79.8 = +0.2
e₆ = 88 − 85.8 = +2.2
e₇ = 92 − 91.7 = +0.3
e₈ = 95 − 97.6 = −2.6

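SSE = 1.1² + 2.0² + 2.1² + 1.1² + 0.2² + 2.2² + 0.3² + 2.6²
    = 1.21 + 4.0 + 4.41 + 1.21 + 0.04 + 4.84 + 0.09 + 6.76
    = 22.56

(The residuals are rounded to one decimal place. With unrounded fitted values, SSE ≈ 22.73; the difference is rounding error and does not change the conclusions below.)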

SST = Σ(yᵢ − ȳ)² = 1504.875  (computed in the correlation chapter)

R² = 1 − 22.56/1504.875 = 1 − 0.015 = 0.985

98.5% of the variability in performance scores is explained by study hours.
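
As a cross-check, the decomposition SST = SSR + SSE and the identity R² = r² can be verified numerically. The sketch below uses only the standard library and the eight data points from the worked example; variable names are illustrative.

```python
from math import sqrt

hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in hours]

sst = sum((y - y_bar) ** 2 for y in scores)               # total variability
ssr = sum((f - y_bar) ** 2 for f in fitted)               # explained by the model
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))   # left in the residuals

r_squared = 1 - sse / sst
r = sxy / sqrt(sxx * sst)   # Pearson correlation between hours and scores

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

Because no intermediate value is rounded, the SSE printed here can differ slightly from a hand calculation that uses residuals rounded to one decimal place.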

Interpreting R²

R² = 0.985: Excellent fit — study hours explain 98.5% of score variance
R² = 0.70:  Good fit — predictor explains 70% of outcome variance
R² = 0.30:  Moderate fit — only 30% explained; many other factors matter
R² = 0.05:  Poor fit — X barely predicts Y

Context matters: social science often accepts R²=0.20; physics might demand R²>0.99

Hypothesis Testing for Regression Coefficients

Testing the Slope (b₁)

H₀: β₁ = 0  (X has no linear effect on Y)
H₁: β₁ ≠ 0

Test statistic:
t = b₁ / SE(b₁)    with df = n − 2

SE(b₁) = √(MSE / Σ(xᵢ − x̄)²)   where MSE = SSE/(n−2)

For our example:
MSE = 22.56 / (8−2) = 22.56/6 = 3.76
SE(b₁) = √(3.76/168) = √0.02238 = 0.1496

t = 2.970 / 0.1496 = 19.85, df = 6
p < 0.0001 → slope is highly significant

95% CI for β₁:
b₁ ± t*(0.025, df=6) × SE(b₁)
= 2.970 ± 2.447 × 0.1496
= 2.970 ± 0.366
= (2.604, 3.336)

We are 95% confident that each additional study hour is associated with an average increase of 2.6 to 3.3 points in score.
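
The slope test and its confidence interval can be sketched as follows. The standard library has no t-distribution, so this example hard-codes the critical value 2.447 (df = 6) quoted in the text; that constant, and the variable names, are assumptions of the sketch. A statistics library could supply the critical value and an exact p-value instead.

```python
from math import sqrt

hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual variance estimate: SSE divided by its degrees of freedom (n - 2)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
mse = sse / (n - 2)

se_b1 = sqrt(mse / sxx)     # standard error of the slope
t_stat = b1 / se_b1         # test statistic for H0: beta1 = 0

T_CRIT = 2.447              # two-sided 95% critical value for df = 6 (from a t-table)
margin = T_CRIT * se_b1
ci = (b1 - margin, b1 + margin)
reject_h0 = abs(t_stat) > T_CRIT

print(f"SE(b1) = {se_b1:.4f}, t = {t_stat:.2f}, df = {n - 2}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
print("Reject H0 at the 5% level" if reject_h0 else "Fail to reject H0 at the 5% level")
```

The script works from unrounded residuals, so its t statistic and interval endpoints may differ from the hand calculation in the second or third decimal place.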

The F-Test for Overall Significance

H₀: β₁ = 0  (model has no predictive value)
H₁: β₁ ≠ 0

F = MSR / MSE    where MSR = SSR/1 for simple regression

For simple regression, F = t² (the t-test and F-test are equivalent).
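
The equivalence F = t² can be checked numerically on the worked example. This is a small sketch with illustrative variable names, using only the standard library.

```python
from math import sqrt, isclose

hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))

msr = ssr / 1           # one predictor, so 1 degree of freedom for the model
mse = sse / (n - 2)     # n - 2 degrees of freedom for the residuals
f_stat = msr / mse

t_stat = b1 / sqrt(mse / sxx)

print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
print("F equals t^2:", isclose(f_stat, t_stat ** 2, rel_tol=1e-9))
```

The identity follows from SSR = b₁² × Σ(xᵢ − x̄)², so F = b₁² × Σ(xᵢ − x̄)² / MSE, which is exactly (b₁ / SE(b₁))².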

Residual Analysis

After fitting the model, always examine the residuals to check if assumptions are met.

What Residuals Should Look Like

If all assumptions are satisfied:
1. Residuals randomly scatter around 0 (no pattern)
2. Constant spread (no funnel shape) — homoscedasticity
3. Approximately normally distributed
4. No autocorrelation (for time series data)

Residual Plots

1. Residuals vs Fitted (ŷ) Plot:
   Good: random scatter around horizontal zero line
   Bad:  curved pattern → non-linearity; fan shape → heteroscedasticity

2. Normal QQ Plot of Residuals:
   Good: points fall roughly along the straight reference line
   Bad:  S-curve or extreme points → non-normal residuals

3. Scale-Location Plot:
   Good: horizontal band with roughly equal spread
   Bad:  upward trend → variance increases with fitted values

4. Residuals vs Leverage (Cook's Distance):
   High leverage + large residual → influential point
   Investigate observations with Cook's D > 4/n (a common rule of thumb);
   remove them only if they turn out to be errors
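
Leverage and Cook's distance for simple regression can be computed directly from the residuals. The sketch below assumes the standard formulas hᵢ = 1/n + (xᵢ − x̄)²/Σ(xᵢ − x̄)² for leverage and Dᵢ = eᵢ²/(p × MSE) × hᵢ/(1 − hᵢ)² for Cook's distance, with p = 2 estimated coefficients; these formulas are not stated in the chapter, and the variable names are illustrative.

```python
hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]

n = len(hours)
p = 2   # number of estimated coefficients (b0 and b1)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how far each x value sits from the centre of the x data
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in hours]

# Cook's distance: combines residual size with leverage
cooks_d = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
           for e, h in zip(residuals, leverage)]

threshold = 4 / n   # rule-of-thumb cutoff from the text
flagged = [i + 1 for i, d in enumerate(cooks_d) if d > threshold]

for i, (h, d) in enumerate(zip(leverage, cooks_d), start=1):
    print(f"employee {i}: leverage = {h:.3f}, Cook's D = {d:.3f}")
print("Flagged for investigation:", flagged)
```

With only eight observations the 4/n cutoff is coarse, and an end point of the x range can be flagged simply because it has both the highest leverage and a moderately large residual. A flag is a prompt to look at the observation, not a verdict on it.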

Assumptions of Linear Regression (LINE)

L — LINEARITY: True relationship between X and Y is linear
                Check: scatter plot, residuals vs fitted plot

I — INDEPENDENCE: Observations are independent of each other
                   Check: study design; Durbin-Watson for time series

N — NORMALITY: Residuals are approximately normally distributed
                Check: QQ plot of residuals, Shapiro-Wilk test on residuals

E — EQUAL VARIANCE (Homoscedasticity): Variance of residuals is
                   constant across all values of X
                   Check: residuals vs fitted plot; Breusch-Pagan test
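
For the independence check, the Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses two hypothetical residual sequences (not data from this chapter) to show the two extremes; values near 2 suggest little first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual sequences, in time order
drifting = [2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0]      # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours flip sign

print(f"drifting residuals:    DW = {durbin_watson(drifting):.2f}")
print(f"alternating residuals: DW = {durbin_watson(alternating):.2f}")
```

The statistic only makes sense when the observations have a natural order, such as time; for cross-sectional data like the employee example, independence is judged from the study design instead.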

If assumptions are violated:

Non-linearity → transform variables (log, square root), add polynomial terms
Non-normal residuals → transform Y (log Y, √Y), use robust regression
Heteroscedasticity → weighted least squares, robust standard errors
Dependence (time series) → use time series methods (AR, MA)

Making Predictions

Interpolation vs Extrapolation

Our model: ŷ = 50.145 + 2.970x, fitted on X = 2 to 16

Interpolation (safe): x = 7 → ŷ = 50.145 + 20.79 = 70.9
  We have data on both sides of x=7. Prediction is reliable.

Extrapolation (risky): x = 25 → ŷ = 50.145 + 74.25 = 124.4
  No data near x=25. The linear pattern may not hold.
  If scores are capped at 100, a prediction of 124.4 is impossible, so extrapolation clearly fails here.
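
One practical safeguard is to have the prediction code report when a request falls outside the fitted range. The following is a minimal sketch; the function name `predict_with_range_check` and the constants are illustrative, with the coefficients and range taken from the model above.

```python
B0, B1 = 50.145, 2.970      # rounded coefficients from the worked example
X_MIN, X_MAX = 2, 16        # range of study hours the model was fitted on

def predict_with_range_check(x):
    """Return (y_hat, is_extrapolation) for a given number of study hours."""
    y_hat = B0 + B1 * x
    is_extrapolation = not (X_MIN <= x <= X_MAX)
    return y_hat, is_extrapolation

for x in (7, 25):
    y_hat, outside = predict_with_range_check(x)
    label = "extrapolation" if outside else "interpolation"
    print(f"x = {x}: y_hat = {y_hat:.1f} ({label})")
```

Returning the flag alongside the prediction leaves the decision to the caller, who may want to warn, refuse, or clip the result depending on context.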

Prediction Interval vs Confidence Interval for Mean Response

Confidence Interval for mean response at x = x₀:
ŷ ± t* × SE_mean    (narrower — for the AVERAGE Y at x=x₀)

Prediction Interval for a new individual observation:
ŷ ± t* × SE_pred    (wider — for ONE new Y at x=x₀)

SE_pred > SE_mean because individual observations vary around the mean line.

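For x₀ = 10 (in our example), ŷ = 79.8, with MSE = 3.76 and t* = 2.447:
SE_mean = √(MSE × [1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²]) = √(3.76 × [1/8 + 1/168]) = 0.70
SE_pred = √(MSE × [1 + 1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²]) = √(3.76 × [1 + 1/8 + 1/168]) = 2.06
95% CI for mean: 79.8 ± 2.447 × 0.70 = (78.1, 81.5)
95% PI for individual: 79.8 ± 2.447 × 2.06 = (74.8, 84.8)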

The PI is ALWAYS wider than the CI for the mean.
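
Both intervals can be computed from the raw data. The sketch below assumes the standard formulas SE_mean = √(MSE × [1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²]) and SE_pred = √(MSE × [1 + 1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²]), and hard-codes the t critical value 2.447 for df = 6; the function name `intervals_at` is illustrative.

```python
from math import sqrt

hours = [2, 4, 6, 8, 10, 12, 14, 16]
scores = [55, 60, 70, 75, 80, 88, 92, 95]
T_CRIT = 2.447   # two-sided 95% critical value for df = 6 (from a t-table)

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores)) / (n - 2)

def intervals_at(x0):
    """Return (y_hat, CI for the mean response, PI for one new observation) at x0."""
    y_hat = b0 + b1 * x0
    distance_term = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = sqrt(mse * distance_term)         # uncertainty in the fitted line
    se_pred = sqrt(mse * (1 + distance_term))   # plus scatter of an individual
    ci = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
    pi = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)
    return y_hat, ci, pi

y_hat, ci, pi = intervals_at(10)
print(f"y_hat at x0 = 10: {y_hat:.1f}")
print(f"95% CI for mean response: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for one new score: ({pi[0]:.1f}, {pi[1]:.1f})")
```

The extra "1 +" inside SE_pred is the variance of a single observation around the line, which is why the prediction interval cannot be narrower than the confidence interval. Both intervals also widen as x₀ moves away from x̄.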

Multiple Linear Regression (Brief Overview)

Model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

Example: Predict salary (in ₹ thousands) based on years of experience AND years of education
ŷ = 35 + 2.5(experience) + 8.3(education)

Interpretations:
b₁ = 2.5: Holding education constant, each additional year of experience
           is associated with ₹2,500 higher salary
b₂ = 8.3: Holding experience constant, each additional year of education
           is associated with ₹8,300 higher salary

Key concepts:
→ Adjusted R²: penalises adding unhelpful predictors (plain R² never decreases when predictors are added)
→ Multicollinearity: predictors correlated with each other → inflated SEs, unstable coefficients
   Check: VIF (Variance Inflation Factor); VIF > 10 indicates severe multicollinearity
→ Feature selection: include only predictors that improve the model (AIC, BIC, cross-validation)
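
Two of these quantities have simple closed forms. The sketch below assumes the usual definitions, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) for k predictors, and VIF = 1/(1 − r²) in the special case of exactly two predictors with correlation r. The function names and the numbers plugged in are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def vif_two_predictors(r12):
    """VIF when the model has exactly two predictors with correlation r12."""
    return 1 / (1 - r12 ** 2)

# Hypothetical comparison on n = 30 observations:
# two extra predictors raise R-squared slightly but lower adjusted R-squared.
small_model = adjusted_r2(0.55, n=30, k=1)
large_model = adjusted_r2(0.57, n=30, k=3)
print(f"1 predictor:  adjusted R^2 = {small_model:.3f}")
print(f"3 predictors: adjusted R^2 = {large_model:.3f}")

# VIF grows quickly as two predictors become more strongly correlated
for r12 in (0.5, 0.9, 0.95):
    print(f"r = {r12}: VIF = {vif_two_predictors(r12):.2f}")
```

With more than two predictors, the r² in the VIF formula is replaced by the R² from regressing that predictor on all the others.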

Practical Examples

Example 1: Marketing Spend → Revenue

Data (12 months):
Marketing spend (₹000) range: 10–40
Revenue (₹000) range: 85–240

OLS result:
ŷ = 32.5 + 5.14 × marketing
b₁ = 5.14 (SE=0.18, t=28.6, p<0.001)
R² = 0.987

Interpretation:
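Each ₹1,000 increase in marketing spend is associated with a ₹5,140 increase in revenue.
(On average, about ₹5.14 of additional revenue per ₹1 of additional spend.)

The intercept (₹32,500) is the model's estimate of revenue at zero marketing spend. Zero lies outside the observed range (₹10,000 to ₹40,000), so read it with caution.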

Prediction for ₹22,000 marketing spend:
ŷ = 32.5 + 5.14(22) = 32.5 + 113.1 = 145.6 → ₹145,600 expected revenue

Example 2: Property Prices

Dataset: 50 properties
X = floor area (sq ft), Y = price (₹ lakhs)

ŷ = −12.5 + 0.45 × area
b₁ = 0.45 (SE=0.03, t=15.0, p<0.001)
R² = 0.822

Interpretation:
Each additional square foot of area is associated with a ₹45,000 (0.45 lakh) higher price.
82.2% of price variation is explained by floor area.
The remaining 17.8% is not explained by area and may reflect location, age, amenities, etc.

Residual analysis revealed: one property with very high leverage (new luxury property).
After investigation, retained in the model — it reflects the real market.

Example 3: Detecting Non-Linearity

Scenario: Engine temperature (X) vs fuel efficiency (Y)

Initial model: ŷ = 40 + 0.08x, R²=0.48 (poor)

Residuals vs fitted plot: clear curved (inverted-U) pattern → non-linear!

Better model with quadratic term:
ŷ = 10 + 1.2x − 0.003x²
R² = 0.91 (much better)

Lesson: Always check residuals. A significant slope doesn't mean the
model is adequate — the relationship might be non-linear.
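
The effect of adding a quadratic term can be illustrated on synthetic data. In the sketch below, the data are generated from the quadratic model above plus a hypothetical alternating ±3 wiggle; they are not the engine measurements behind the example, and the helper names (`solve`, `polyfit`, `r_squared`) are illustrative. The fit solves the least-squares normal equations directly so that only the standard library is needed.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - known) / m[i][i]
    return coef

def polyfit(x, y, degree):
    """Least-squares polynomial coefficients [b0, b1, ...] via the normal equations."""
    size = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(size)] for i in range(size)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(size)]
    return solve(a, b)

def r_squared(x, y, coef):
    """R-squared of the polynomial with the given coefficients."""
    y_bar = sum(y) / len(y)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

# Synthetic data: curve from the quadratic model, plus a hypothetical +/-3 wiggle
x = list(range(40, 241, 20))
y = [10 + 1.2 * xi - 0.003 * xi ** 2 + (3 if i % 2 == 0 else -3)
     for i, xi in enumerate(x)]

coef_lin = polyfit(x, y, 1)
coef_quad = polyfit(x, y, 2)
r2_lin = r_squared(x, y, coef_lin)
r2_quad = r_squared(x, y, coef_quad)

print(f"linear:    R^2 = {r2_lin:.3f}")
print(f"quadratic: R^2 = {r2_quad:.3f}, x^2 coefficient = {coef_quad[2]:.4f}")
```

Solving the normal equations is fine for a small, low-degree illustration like this one; for higher degrees or wider x ranges a numerically more stable least-squares routine from a numerical library would be a safer choice.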

Simple vs Multiple Regression

Aspect                      Simple            Multiple
Predictors                  1 (X)             2+ (X₁, X₂, …)
Model                       ŷ = b₀ + b₁X      ŷ = b₀ + b₁X₁ + b₂X₂ + ...
R² formula                  = r²              SSR/SST (direct)
Adjusted R²                 Not needed        Needed (penalises extra predictors)
Key concern                 Linearity         Multicollinearity + Linearity
Coefficient interpretation  Marginal effect   Partial effect (holding others constant)

Common Mistakes

1. Extrapolating beyond the data range

If the model was fit on X = 5 to 50, predicting at X = 200 is dangerous.
The relationship may be completely different outside the observed range.
Always state: "Valid for X between [min, max] in the training data."

2. Interpreting the intercept when X=0 is not meaningful

Model: salary = 20,000 + 1,500 × experience
b₀ = 20,000 "at zero years of experience" → could be meaningful
But: model fit on 3–20 years; X=0 is extrapolation!
The intercept is a mathematical anchor, not always interpretable.

3. Confusing statistical significance with effect size

With n=10,000, b₁=0.001 might be significant (t=10, p<0.001).
BUT a 0.001-unit change in Y per unit X might be economically meaningless.
Always report both: b₁ and its practical significance.

4. Omitting confounders → omitted variable bias

Regress salary on education: b₁=5,000 (per year of education)
But experience is correlated with education AND salary.
Omitting experience → education's coefficient absorbs some of experience's effect.
Include relevant control variables to reduce bias.

5. R² as the only measure of model quality

Two models on the same data:
Model A: R²=0.82, residuals show clear pattern (wrong shape)
Model B: R²=0.75, residuals look clean (assumptions met)
Model B is BETTER despite lower R².
Always combine R² with residual diagnostics.

Practice Exercises

For the data: X: 1, 2, 3, 4, 5 and Y: 3, 5, 4, 8, 7, compute b₁, b₀, and ŷ when x=6. State whether this is interpolation or extrapolation.
An OLS regression gives SST=800, SSE=200. Compute SSR and R². Interpret the R² value.
Regression of test score (Y) on study time (X, hours): ŷ = 40 + 6X, R²=0.64, n=30. Test H₀: β₁=0 at α=0.05 given SE(b₁)=1.2.
A residuals vs fitted plot shows a clear funnel shape (residuals fan out as ŷ increases). Which assumption is violated? What remedies exist?
Two models: Model A (X₁ only): R²=0.55, Adjusted R²=0.54. Model B (X₁, X₂, X₃): R²=0.57, Adjusted R²=0.51. Which model would you prefer and why?

Summary

In this chapter you learned:

Simple linear regression model: ŷ = b₀ + b₁X where b₁ = slope (Δŷ per ΔX) and b₀ = intercept (ŷ when X=0)
OLS: minimises Σ(yᵢ − ŷᵢ)²; estimates: b₁ = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)², b₀ = ȳ − b₁x̄
R² = 1 − SSE/SST = proportion of Y variance explained; for simple regression R² = r²
Significance test for slope: t = b₁/SE(b₁), df = n−2; tests whether X has any linear effect on Y
Residuals = observed − predicted; residual plots diagnose assumption violations
LINE assumptions: Linearity, Independence, Normality (of residuals), Equal Variance
Prediction interval (for one new observation) is always wider than the confidence interval (for the mean response)
Interpolation (within data range) is safe; extrapolation (outside range) is risky
Multiple regression: adds more predictors; interpret each coefficient as partial effect (holding others constant)
Adjusted R² penalises unnecessary predictors; use it to compare models with different numbers of predictors
Common pitfalls: extrapolation, omitted variable bias, ignoring residual patterns, treating R² as the only model quality measure

This completes the Statistics tutorial series — you now have a complete foundation from descriptive statistics through regression, covering all major methods used in data science, business analytics, finance, and research.