Linear Regression

What Is Linear Regression?

Linear regression is the simplest and most widely used algorithm for regression — the task of predicting a continuous numeric target from one or more input features. Think house prices in ₹ lakhs, a customer's expected lifetime spend, tomorrow's temperature, or the delivery time of a Swiggy order. In each case the answer is a number on a continuous scale, not a category.

The idea is to fit a straight line (or, with several features, a flat plane or hyperplane) through the data so that the line captures the underlying trend. Once fitted, you feed in a new input and read off the predicted number.

Here is the intuition. Imagine Priya, an analyst at a real-estate firm in Pune, plots flat area (sq ft) on the x-axis and price (₹ lakhs) on the y-axis. The dots trend upward — bigger flats cost more. Linear regression draws the single straight line that sits "closest" to all those dots at once. That line is the model. For a new 900 sq ft flat, she moves up from 900 on the x-axis to the line and reads the predicted price.

Regression  → predict a number   (price, temperature, demand)
Classification → predict a category (spam / not-spam, churn / stay)

Linear regression is the natural first model to reach for: it is fast, it trains on almost any hardware, and — unlike many black-box models — its output is fully interpretable. You can look at the fitted coefficients and say exactly how much each feature moves the prediction. For classification problems (yes/no, category labels), you instead use Logistic Regression, covered in the next chapter.

This chapter is the machine-learning view of regression — model fitting, gradient descent, and the scikit-learn workflow. The deeper statistical treatment (hypothesis tests on coefficients, confidence and prediction intervals, formal assumption diagnostics) lives in the Linear Regression & Model Evaluation chapter of the Statistics tutorial.

Simple vs Multiple Linear Regression

There are two flavours, differing only in how many input features you use.

Simple linear regression uses a single feature X to predict the target Y. Example: predict salary from years of experience alone.
Multiple linear regression uses two or more features X1, X2, ..., Xn. Example: predict salary from experience, education level, and city tier.

The model equation for the simple case is a straight line:

ŷ = w·x + b

where
  ŷ = predicted target value ("y-hat")
  x = the single input feature
  w = weight / slope (how much ŷ changes per unit of x)
  b = bias / intercept (value of ŷ when x = 0)

The multiple case generalises to a weighted sum of all features plus a bias:

ŷ = w1·x1 + w2·x2 + ... + wn·xn + b

In compact vector form:
ŷ = wᵀx + b      (dot product of the weight vector w and feature vector x, plus bias b)

Statisticians write the same thing with betas — Y = β0 + β1·X1 + ... + βn·Xn + ε, where β0 is the intercept, the βi are coefficients, and ε is the irreducible error. The machine-learning w/b notation and the statistics β notation describe the identical model.
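To make the weighted-sum and dot-product forms concrete, here is a minimal NumPy sketch. The weights, bias, and feature values are made-up numbers chosen for illustration, not the output of a fitted model.

```python
import numpy as np

# Hypothetical weights and bias (illustrative values only)
w = np.array([2.0, -1.0])   # one weight per feature
b = 0.5                     # bias / intercept

# One example with two features: x1 = 3, x2 = 4
x = np.array([3.0, 4.0])

# Written out as a weighted sum ...
y_hat_sum = w[0] * x[0] + w[1] * x[1] + b
# ... and as a dot product of w and x, plus the bias
y_hat_dot = w @ x + b

# For many examples at once, stack them as rows of a matrix
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 1.0]])
y_hat_all = X @ w + b

print(y_hat_sum, y_hat_dot)   # both 2.5
print(y_hat_all)              # [2.5 0.5 1.5]
```

The second row, where every feature is 0, returns the bias on its own, which is exactly the "value of ŷ when x = 0" reading of b.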

Aspect	Simple Linear Regression	Multiple Linear Regression
Number of features	1	2 or more
Model shape	A line in 2D	A plane / hyperplane in n dimensions
Equation	`ŷ = w·x + b`	`ŷ = wᵀx + b`
Main extra concern	Is the relationship linear?	Multicollinearity between features
Coefficient meaning	Marginal effect of `x`	Partial effect, holding other features fixed

The Cost Function: Mean Squared Error

To fit the line we need a way to score how good any given line is. For each training example the residual (error) is the gap between the true value and the prediction:

residualᵢ = yᵢ − ŷᵢ

We square each residual (so positive and negative errors do not cancel, and large errors are penalised more), then average across all m training examples. That average is the Mean Squared Error (MSE) — the cost function linear regression minimises:

             1   m
MSE(w, b) = --- · Σ (yᵢ − ŷᵢ)²
             m  i=1

           1   m
        = --- · Σ (yᵢ − (wᵀxᵢ + b))²
           m  i=1

Some texts use 1/(2m) instead of 1/m; the extra 1/2 just makes the calculus tidier and does not change where the minimum sits. A related metric, Root Mean Squared Error (RMSE = √MSE), is popular because it is back in the same units as the target (₹, minutes, degrees) and is therefore easy to communicate.

Lower MSE  → the line sits closer to the data → better fit
MSE = 0    → the line passes through every point exactly (rare, often overfitting)
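The MSE and RMSE definitions above can be checked on a handful of numbers. The true values and predictions in this sketch are invented for the arithmetic only.

```python
import numpy as np

# Made-up targets and predictions for four training examples
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.0])

residuals = y_true - y_pred        # [ 0.5, -0.5,  1.0,  0.0]
mse = np.mean(residuals ** 2)      # (0.25 + 0.25 + 1.0 + 0.0) / 4 = 0.375
rmse = np.sqrt(mse)                # back in the units of the target

print("MSE :", mse)
print("RMSE:", round(rmse, 4))
```

Note how the single residual of 1.0 contributes as much to the sum as the other three examples combined twice over; that is the "large errors are penalised more" effect of squaring.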

The whole job of "training" is to search for the values of w and b that make MSE as small as possible. There are two standard ways to do that search.

How the Line Is Fit

Method 1: The Normal Equation (closed-form solution)

Because MSE is a smooth, bowl-shaped (convex) function of the weights, calculus gives an exact formula for the minimising weights in one shot — no iteration required. Stacking all features into a matrix X (with a column of 1s for the bias) and all targets into a vector y, the optimal weight vector is:

w = (Xᵀ X)⁻¹ Xᵀ y

This is the normal equation. In scikit-learn, LinearRegression reaches the same least-squares solution, but it calls a numerically stable least-squares solver (SVD-based) rather than forming a literal matrix inverse.

Pros: exact answer, no learning rate to tune, no iterations.
Cons: computing (Xᵀ X)⁻¹ costs roughly O(n³) in the number of features n, so it becomes slow when you have very many features; it also struggles when features are highly collinear (Xᵀ X is near-singular).
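As a rough sketch of the closed-form fit, the snippet below builds a tiny noise-free dataset from known weights (an assumption made so the answer can be checked), solves the normal equation with np.linalg.solve, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

# Made-up features; targets generated from y = 1 + 2*x1 + 3*x2 with no noise
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

# Prepend a column of 1s so the first weight plays the role of the bias
X = np.column_stack([np.ones(len(features)), features])

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X explicitly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which avoids forming X^T X at all
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(w_normal, 6))   # bias, w1, w2
print("lstsq          :", np.round(w_lstsq, 6))
```

Both routes recover the bias 1 and the weights 2 and 3 here. On nearly collinear features the two can drift apart, which is one reason libraries prefer a dedicated least-squares solver.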

Method 2: Gradient Descent (iterative optimisation)

When you have millions of rows or thousands of features, the closed form is too expensive. Gradient descent instead starts from a guess and repeatedly nudges the weights downhill on the MSE surface until it reaches the bottom.

The intuition: imagine standing on a foggy hillside and wanting the lowest point. You feel the slope under your feet and take a small step in the steepest downhill direction. Repeat until the ground is flat. The gradient is that slope; the learning rate is your step size.

The partial derivatives of MSE give the update rules. Each iteration:

For each feature j:
  wⱼ ← wⱼ − α · (∂MSE/∂wⱼ)
  b  ← b  − α · (∂MSE/∂b)

where the gradients are:
  ∂MSE/∂wⱼ = (−2/m) · Σ (yᵢ − ŷᵢ) · xᵢⱼ
  ∂MSE/∂b  = (−2/m) · Σ (yᵢ − ŷᵢ)

and α (alpha) is the learning rate, typically 0 < α < 1
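The update rules translate almost line for line into NumPy. The following is a minimal batch gradient descent sketch, assuming a small noise-free dataset generated from y = 4 + 3x; the function name gradient_descent and its default settings are illustrative choices, not a library API.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on MSE for a linear model y_hat = X @ w + b."""
    m, n = X.shape
    w = np.zeros(n)   # start from an all-zero guess
    b = 0.0
    for _ in range(n_iters):
        error = y - (X @ w + b)                 # residuals y_i - y_hat_i
        grad_w = (-2.0 / m) * (X.T @ error)     # dMSE/dw_j for every j at once
        grad_b = (-2.0 / m) * error.sum()       # dMSE/db
        w -= alpha * grad_w                     # step downhill
        b -= alpha * grad_b
    return w, b

# Made-up, already centred feature; targets from y = 4 + 3x
x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
X = x.reshape(-1, 1)
y = 4.0 + 3.0 * x

w, b = gradient_descent(X, y)
print("w:", np.round(w, 4), "b:", round(b, 4))   # close to 3 and 4
```

Both gradients are computed from the same residuals before either parameter is changed, so w and b are updated simultaneously, as the rules above intend.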

Choosing α matters. Too small and training crawls; too large and the steps overshoot the minimum and the cost can diverge to infinity. A common range to try is α between about 0.0001 and 0.3.

α too small → many tiny steps, very slow convergence
α just right → smooth, steady decrease in MSE each iteration
α too large → MSE bounces around or explodes (diverges)
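These three regimes can be reproduced on a toy problem. The sketch below reuses a made-up dataset generated from y = 4 + 3x and runs 50 iterations at three learning rates; the value at which divergence starts depends on the data, so 1.1 is only "too large" for this particular example.

```python
import numpy as np

# Made-up, centred feature; targets from y = 4 + 3x
x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
y = 4.0 + 3.0 * x

def mse_after_training(alpha, n_iters=50):
    """Run batch gradient descent from w = b = 0 and return the final MSE."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        error = y - (w * x + b)
        w -= alpha * (-2.0 / m) * np.sum(error * x)
        b -= alpha * (-2.0 / m) * np.sum(error)
    return np.mean((y - (w * x + b)) ** 2)

mse_small = mse_after_training(0.001)   # barely moves from the starting MSE of 25
mse_good = mse_after_training(0.1)      # converges to nearly zero
mse_large = mse_after_training(1.1)     # overshoots further on every step

for name, value in [("small", mse_small), ("good", mse_good), ("large", mse_large)]:
    print(f"alpha {name:>5}: final MSE = {value:.3e}")
```

In practice a quick way to spot the "too large" case is to log the loss every few iterations and stop as soon as it starts rising.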

Variants you will meet: Batch GD uses all rows per step, Stochastic GD (SGD) uses one random row per step (noisier but fast on huge data), and Mini-batch GD uses small batches — the practical default in modern ML. Because features on wildly different scales distort the gradient, gradient descent needs feature scaling (standardisation) to converge well; see the Feature Engineering & Scaling chapter.

Aspect	Normal Equation	Gradient Descent
Style	Closed-form, one shot	Iterative
Learning rate	Not needed	Must tune `α`
Cost in features `n`	About `O(n³)`	About `O(n)` per training row per step
Very many features	Slow	Scales well
Very many rows	Fine if the data fits in memory	Excellent (works with mini-batches)
Needs feature scaling	No	Yes
Used by	`LinearRegression`	`SGDRegressor`, deep learning

Interpreting Coefficients and Intercept

The great strength of linear regression is that the fitted numbers mean something concrete.

The intercept b is the predicted target when every feature equals 0. Often this is a mathematical anchor rather than a meaningful scenario (a flat with 0 sq ft does not exist), so interpret it with care.
Each coefficient wⱼ is the change in the predicted target for a one-unit increase in that feature, holding all other features constant. The "holding others constant" part is what makes it a partial effect in multiple regression.

Suppose a salary model (in ₹ thousands) fits to:

  salary = 320 + 45·(years_experience) + 60·(education_level)

Reading the coefficients:
  • Intercept 320  → a fresher (0 years, education_level 0) is anchored at ₹3.2 lakh
  • w = 45         → each extra year of experience adds ₹45,000, holding education fixed
  • w = 60         → each extra education level adds ₹60,000, holding experience fixed
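The "holding the other feature fixed" reading can be verified with plain arithmetic. This small sketch wraps the example equation in a helper (the function name predict_salary and the sample inputs are illustrative) and changes one feature at a time.

```python
def predict_salary(years_experience, education_level):
    """Salary in ₹ thousands, using the example equation from the text."""
    return 320 + 45 * years_experience + 60 * education_level

fresher = predict_salary(0, 0)      # intercept only: 320
base = predict_salary(5, 2)         # 320 + 225 + 120 = 665
plus_year = predict_salary(6, 2)    # one more year, education unchanged
plus_level = predict_salary(5, 3)   # one more education level, experience unchanged

print("fresher           :", fresher)
print("base (5 yrs, lvl 2):", base)
print("effect of +1 year :", plus_year - base)    # 45
print("effect of +1 level:", plus_level - base)   # 60
```

The differences equal the coefficients regardless of the starting point, which is what makes a linear model's coefficients so easy to explain.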

Two cautions. First, the magnitude of a coefficient depends on the feature's units: a coefficient on "area in sq ft" and one on "area in sq m" differ by a factor of about 10.76 (the number of sq ft in one sq m) even though the model is identical. To compare feature importances fairly, standardise the features first, or inspect standardised coefficients. Second, a large coefficient is not the same as a statistically significant one; significance testing (t-tests, p-values on coefficients) is covered in the Statistics tutorial.
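The first caution is easy to demonstrate. The sketch below fits the same one-feature model twice, once with area in sq ft and once in sq m, using np.polyfit as a stand-in for a simple linear regression. The five data points are made up for illustration.

```python
import numpy as np

SQFT_PER_SQM = 10.7639

# Made-up flats: area and price (₹ lakhs)
area_sqft = np.array([650.0, 900.0, 1100.0, 1500.0, 2000.0])
price = np.array([42.0, 68.0, 75.0, 115.0, 170.0])
area_sqm = area_sqft / SQFT_PER_SQM

# np.polyfit(x, y, 1) returns [slope, intercept] of the least-squares line
slope_sqft, icpt_sqft = np.polyfit(area_sqft, price, 1)
slope_sqm, icpt_sqm = np.polyfit(area_sqm, price, 1)

print("slope per sq ft:", round(slope_sqft, 4))
print("slope per sq m :", round(slope_sqm, 4))
print("ratio          :", round(slope_sqm / slope_sqft, 4))   # the unit factor

# Standardising first removes the unit dependence
def standardise(v):
    return (v - v.mean()) / v.std()

slope_std_sqft, _ = np.polyfit(standardise(area_sqft), price, 1)
slope_std_sqm, _ = np.polyfit(standardise(area_sqm), price, 1)
print("standardised slopes:", round(slope_std_sqft, 4), round(slope_std_sqm, 4))
```

The raw slopes differ by the unit conversion factor while the predictions and the standardised slopes are the same, so only the standardised values are safe to compare across features.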

R-Squared: How Good Is the Fit?

MSE is expressed in squared target units and RMSE in the target's own units, but either is hard to judge in the abstract: is an RMSE of 40 good or bad? The coefficient of determination, R² (R-squared), gives a unit-free score of how much of the target's variability the model explains.

R² = 1 − SS_res / SS_tot = 1 − [ Σ (yᵢ − ŷᵢ)² ] / [ Σ (yᵢ − ȳ)² ]

where
  SS_res = sum of squared residuals (model's errors)
  SS_tot = total sum of squares (squared deviations of y around its mean ȳ)

Interpretation:

R² = 1.0  → the model explains 100% of the variance (perfect fit)
R² = 0.85 → the model explains 85% of the variance in the target
R² = 0.0  → the model is no better than always predicting the mean ȳ
R² < 0    → the model is worse than predicting the mean (possible on held-out data when the fit is poor)
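The cases listed above can be reproduced by computing R² directly from its definition. The helper name r_squared and the numbers are illustrative; scikit-learn's r2_score computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from the definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])      # mean 25, SS_tot = 500

close_fit = np.array([12.0, 18.0, 31.0, 39.0])   # SS_res = 10
mean_only = np.full(4, y_true.mean())            # always predict the mean
reversed_fit = np.array([40.0, 30.0, 20.0, 10.0])  # SS_res = 2000

print("close fit   :", r_squared(y_true, close_fit))      # 0.98
print("mean only   :", r_squared(y_true, mean_only))      # 0.0
print("reversed fit:", r_squared(y_true, reversed_fit))   # -3.0
```

The last case shows that R² is not a squared quantity in the usual sense: predictions that are systematically wrong can push it well below zero.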

For multiple regression, prefer Adjusted R², which penalises adding features that do not genuinely help. On the training data, plain R² never decreases when you add a feature, so it can flatter a bloated model. What counts as a "good" R² is domain-dependent: physics experiments may demand R² above 0.99, while noisy social or marketing data may treat 0.30 as useful.
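The text does not spell out the Adjusted R² formula; the sketch below assumes the standard one, 1 − (1 − R²)·(m − 1)/(m − p − 1), where m is the number of rows and p the number of features. The function name and the sample figures are illustrative.

```python
def adjusted_r2(r2, m, p):
    """Adjusted R² for m observations and p features (standard formula)."""
    return 1.0 - (1.0 - r2) * (m - 1) / (m - p - 1)

# Same plain R² of 0.85 on 50 rows, reached with 3 features versus 20 features
lean = adjusted_r2(0.85, m=50, p=3)
bloated = adjusted_r2(0.85, m=50, p=20)

print("3 features :", round(lean, 4))      # about 0.8402
print("20 features:", round(bloated, 4))   # about 0.7466
```

With the same plain R², the model that needed 20 features is marked down much more heavily, which is the penalty the paragraph above describes.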

Assumptions of Linear Regression

Linear regression's coefficients and error estimates are trustworthy only when a few assumptions roughly hold. A handy mnemonic is LINE:

Linearity — the true relationship between features and target is linear. Curved patterns need transformed features (log, polynomial terms) or a different model.
Independence — observations are independent of each other (a concern with time-series or clustered data).
Normality — the residuals are approximately normally distributed (matters most for inference and prediction intervals).
Equal variance (homoscedasticity) — the spread of residuals is roughly constant across the range of predictions, not fanning out.

You check these mainly by plotting residuals versus fitted values (should be a random cloud around zero) and a Q-Q plot of residuals. The deeper treatment — formal tests, diagnostics, and remedies — is in the Statistics tutorial's regression chapter. For prediction-focused ML work, mild violations are often tolerable, but severe ones bias your coefficients and inflate error.
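Alongside the plots, a rough numeric check can hint at unequal variance. The sketch below is only a heuristic, not a formal test: it generates synthetic data whose noise grows with x (an assumption built into the example), fits a line with np.polyfit, and compares the residual spread in the lower and upper halves of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the noise is deliberately scaled by x, so variance is NOT constant
x = np.linspace(1, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200) * x

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Split residuals by fitted value and compare their spread
order = np.argsort(fitted)
low_half = residuals[order[:100]]
high_half = residuals[order[100:]]
spread_ratio = high_half.std() / low_half.std()

print("mean residual:", round(residuals.mean(), 6))   # essentially 0 for a least-squares fit
print("spread ratio :", round(spread_ratio, 2))       # well above 1 signals fanning out
```

A ratio near 1 is what a healthy, random cloud would give; a ratio well above 1 is the numeric counterpart of the funnel shape in a residual plot.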

Full Example with scikit-learn

Let's predict flat prices in ₹ lakhs from area, number of bedrooms, and building age. This uses the modern scikit-learn workflow: train_test_split, fit, predict, and metric functions.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# --- 1. Build a small illustrative dataset (values are made up for teaching) ---
data = pd.DataFrame({
    "area_sqft":  [650, 900, 1100, 1500, 800, 1250, 2000, 1750, 950, 1400],
    "bedrooms":   [1,   2,   2,    3,    1,   2,    4,    3,    2,   3],
    "age_years":  [10,  5,   8,    2,    15,  6,    1,    3,    12,  4],
    "price_lakh": [42,  68,  75,   115,  50,  92,   170,  135,  70,  108],
})

X = data[["area_sqft", "bedrooms", "age_years"]]
y = data["price_lakh"]

# --- 2. Hold out a test set so we measure generalisation, not memorisation ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# --- 3. Fit the model ---
model = LinearRegression()
model.fit(X_train, y_train)

# --- 4. Inspect the learned coefficients and intercept ---
print("Intercept (b):", round(model.intercept_, 2))
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature:>10}: {round(coef, 3)}")

# --- 5. Predict on the held-out test set ---
y_pred = model.predict(X_test)

# --- 6. Evaluate ---
print("MAE :", round(mean_absolute_error(y_test, y_pred), 2))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))
print("R2  :", round(r2_score(y_test, y_pred), 3))

# --- 7. Predict the price of a new flat ---
new_flat = pd.DataFrame({"area_sqft": [1000], "bedrooms": [2], "age_years": [7]})
print("Predicted price (lakh):", round(model.predict(new_flat)[0], 2))

The console output looks like this (numbers are illustrative; your exact values will differ and depend on the split):

Intercept (b): 5.31
   area_sqft: 0.071
    bedrooms: 6.842
   age_years: -0.913
MAE : 4.10
RMSE: 5.02
R2  : 0.981
Predicted price (lakh): 83.6

Reading the coefficients: each extra sq ft adds about ₹7,100 to the price, each extra bedroom about ₹6.84 lakh, and each additional year of age lowers the price by about ₹91,000; all three signs are sensible. The R2 of 0.981 says the model explains roughly 98% of the price variance on the test set, although with only three held-out flats (25% of ten rows, rounded up) that score is a rough indication rather than a stable estimate.

Using a Pipeline with Scaling

When you later switch to gradient-descent-based SGDRegressor or add regularisation, wrap scaling and the model in a Pipeline so the scaler is fit on training data only and applied consistently at prediction time:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

pipe = make_pipeline(
    StandardScaler(),                     # scale features (essential for gradient descent)
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
)
pipe.fit(X_train, y_train)
print("R2 (SGD pipeline):", round(pipe.score(X_test, y_test), 3))

When to Use It — and Its Limitations

Reach for linear regression when:

The target is continuous and the relationship with features is roughly linear.
You need an interpretable model — stakeholders want to know why the number is what it is.
You want a fast, low-variance baseline before trying fancier models.
You have a modest number of features relative to rows.

Strengths	Limitations
Simple, fast, cheap to train	Assumes a linear feature-target relationship
Fully interpretable coefficients	Sensitive to outliers (squared error punishes them)
Needs little data to get going	Cannot capture complex non-linear patterns unaided
A strong, honest baseline	Hurt by multicollinearity among features
Extends cleanly to regularised forms	Underfits genuinely complex data

When linearity breaks down, options include adding polynomial features, transforming variables, using regularised variants (Ridge, Lasso — see the Overfitting & Regularization chapter), or moving to tree-based regressors like Random Forests.
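As one illustration of the first option, the sketch below adds a squared term with scikit-learn's PolynomialFeatures. The data are synthetic (a noise-free parabola, chosen so the effect is obvious), and the scores are computed on the training data purely to show the change in fit, not as a measure of generalisation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved relationship: y = 0.5 * x^2 + 1
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + 1.0

# A plain straight line cannot follow the curve
linear = LinearRegression().fit(x, y)

# Adding x^2 as an extra feature keeps the model linear in its weights
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

linear_r2 = linear.score(x, y)
poly_r2 = poly.score(x, y)
print("R2 straight line :", round(linear_r2, 3))
print("R2 with x^2 term :", round(poly_r2, 3))
```

The model with the squared term is still linear regression; only the feature set changed, so everything said about fitting and interpreting coefficients continues to apply.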

Common Mistakes

Forgetting to scale features before gradient descent. LinearRegression solves the least-squares problem directly, so its predictions do not depend on feature scale, but SGDRegressor and regularised models do. Unscaled features make gradient descent converge slowly or not at all. Standardise inside a pipeline.
Extrapolating far outside the training range. A model fit on flats of 650 to 2,000 sq ft should not be trusted to price a 10,000 sq ft mansion. The linear pattern may not hold there, and predictions become unreliable.
Judging fit on the training set only. A high training R² can hide overfitting. Always report metrics on a held-out test set (see the Train-Test Split & Cross-Validation chapter).
Comparing raw coefficient magnitudes across differently-scaled features. A coefficient on "area in sq ft" looks tiny next to "bedrooms", yet may matter more. Standardise features before comparing importances.
Ignoring outliers. Because MSE squares errors, one extreme point can drag the whole line toward it. Inspect residuals and consider robust alternatives if outliers are genuine anomalies.
Using linear regression for a categorical target. Predicting a yes/no or class label with linear regression is a modelling error — use Logistic Regression for classification instead.
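The outlier mistake in the list above is worth seeing in numbers. This sketch uses made-up points that lie exactly on y = 2x, then corrupts a single target and refits with np.polyfit.

```python
import numpy as np

# Ten made-up points lying exactly on y = 2x
x = np.arange(1.0, 11.0)
y_clean = 2.0 * x

# Same data with one extreme value at the last point (100 instead of 20)
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print("slope without outlier:", round(slope_clean, 3))     # 2.0
print("slope with outlier   :", round(slope_outlier, 3))   # roughly tripled
```

One corrupted value out of ten is enough to move the slope from 2 to above 6, because the squared error on that point dominates the cost.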

Practice Exercises

Simple fit by hand. For X = [1, 2, 3, 4, 5] and Y = [2, 4, 5, 4, 6], fit LinearRegression with scikit-learn, then print coef_ and intercept_. Predict Y at X = 6 and state whether this is interpolation or extrapolation.
Interpret coefficients. A model fits sales = 12 + 3.5·(ad_spend) + 1.2·(store_size). Explain in one sentence each what the intercept, 3.5, and 1.2 mean. State the assumption implied by the phrase "holding other features constant".
Gradient descent intuition. You train an SGDRegressor and the loss increases every iteration and eventually becomes nan. Which hyperparameter is almost certainly wrong, and in which direction should you change it? What preprocessing step might also be missing?
Metrics. Given a model with SS_res = 120 and SS_tot = 800, compute R² by hand and interpret it. Would you prefer this over a model with R² = 0.60? What extra metric would you look at?
Pipeline. Build a make_pipeline(StandardScaler(), LinearRegression()) on any dataset with features of very different scales, and confirm the test R² matches a plain LinearRegression (scaling should not change LinearRegression's predictions — explain why).
Residual check. After fitting, plot residuals (y_test − y_pred) against y_pred. Describe what a healthy plot looks like and what a funnel shape would indicate about the LINE assumptions.

Summary

Linear regression predicts a continuous target by fitting a line/hyperplane: ŷ = wᵀx + b.
Simple regression uses one feature; multiple regression uses many, and each coefficient is a partial effect holding other features constant.
The model is fit by minimising the Mean Squared Error, MSE = (1/m)·Σ(yᵢ − ŷᵢ)².
Two fitting methods: the normal equation w = (XᵀX)⁻¹Xᵀy (exact, one shot; LinearRegression reaches the same solution with a least-squares solver) and gradient descent (iterative, scales to huge data, needs a learning rate α and feature scaling).
The gradient-descent update is wⱼ ← wⱼ − α·(∂MSE/∂wⱼ); too-large α diverges, too-small α crawls.
Coefficients and the intercept are directly interpretable; standardise features before comparing their magnitudes.
R² (1 − SS_res/SS_tot) reports the fraction of target variance explained; use Adjusted R² with many features.
The LINE assumptions (Linearity, Independence, Normality of residuals, Equal variance) should roughly hold — the deeper treatment is in the Statistics tutorial.
In scikit-learn: train_test_split → LinearRegression().fit() → .coef_ / .intercept_ → .predict() → r2_score / mean_squared_error; wrap scaling in a Pipeline for gradient-descent variants.
Watch for extrapolation, outliers, unscaled features with SGD, and never use it for categorical targets.

Linear regression is your interpretable, fast baseline for any numeric-prediction problem — master it and you understand the backbone of many more advanced models.

Next up: Logistic Regression — despite the name, it is a classification algorithm; you will see how it reuses the linear equation but squashes the output through a sigmoid to predict probabilities and class labels.