What Is Logistic Regression?
Despite the word "regression" in its name, logistic regression is a classification algorithm. It answers yes/no questions: Will this customer churn? Is this transaction fraud? Will this loan applicant default?
In the previous chapter you saw how Linear Regression predicts a continuous number like a salary or a house price. But if you tried to predict a label (say, churn = 1 vs stay = 0) with a straight line, the model would happily output values like -0.4 or 1.7 — nonsense as probabilities. Logistic regression fixes this by wrapping the linear model in a squashing function that keeps the output between 0 and 1.
Intuitive analogy. Think of a bank loan officer named Priya. She looks at an applicant's income, credit score, and existing loans, and mentally adds up "evidence for approval." A little evidence nudges her confidence up; a lot of evidence pushes it near certainty; strong negative evidence pushes it near zero. She never says "I am 170% confident" — her confidence saturates. Logistic regression models exactly this: it computes a weighted score from the features, then converts that score into a calibrated-looking probability that flattens out near 0 and near 1.
Goal: learn feature weights so that the predicted probability of the positive class is high for positive examples and low for negative ones — then apply a threshold to turn probability into a decision.
Examples:
→ Predict whether a customer will churn (churn = 1, stay = 0)
→ Predict whether a UPI transaction is fraudulent (fraud = 1, legit = 0)
→ Predict whether an email is spam (spam = 1, ham = 0)
→ Predict whether a patient has a disease (positive = 1, negative = 0)
From a Line to a Probability: The Sigmoid Function
Logistic regression starts with the same linear combination you know from linear regression. This raw score is called the logit or log-odds, and we usually denote it z.
Linear score (logit):
z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
z can be any real number from -∞ to +∞.
The problem: z is unbounded, but a probability must live in the range 0 <= p <= 1. We pass z through the sigmoid (also called the logistic) function, which squashes any real number into that range.
Sigmoid / logistic function:
σ(z) = 1 / (1 + e^(-z))
Key behaviour:
z → -∞ ⇒ σ(z) → 0
z = 0 ⇒ σ(z) = 0.5
z → +∞ ⇒ σ(z) → 1
The curve is S-shaped: steep near z = 0, flat in the tails.
Putting the two pieces together gives the full model. The predicted probability that the label is 1 (given features x) is:
p = P(y = 1 | x) = σ(z) = 1 / (1 + e^(-(β₀ + β₁x₁ + ... + βₙxₙ)))
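Logistic regression is a linear classifier because the score z is linear in the features and the sigmoid is strictly increasing, so thresholding p is the same as thresholding z. The boundary between classes is therefore linear in the features, even though the mapping from score to probability is a smooth curve.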
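The two pieces above (linear score, then sigmoid) can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names sigmoid and predict_probability are made up for this example, and the branch on the sign of z is an added assumption to avoid overflow for very large |z|.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in [0, 1] without overflow."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def predict_probability(X, beta0, beta):
    """P(y = 1 | x) for each row of X: linear score z, then sigmoid."""
    z = beta0 + np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    return sigmoid(z)

scores = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
probs = sigmoid(scores)
print(probs)  # tails saturate at 0 and 1, and z = 0 maps to 0.5
```

Calling predict_probability([[400.0]], -4.0, [0.01]) returns 0.5, because the score is exactly zero at that point.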
A Quick Numerical Feel
Suppose a fitted model gives z = β₀ + β₁·(credit_score):
z = -4 + 0.01 × credit_score
For credit_score = 300: z = -4 + 3 = -1.0 ⇒ σ(-1.0) = 0.27
For credit_score = 400: z = -4 + 4 = 0.0 ⇒ σ(0.0) = 0.50
For credit_score = 500: z = -4 + 5 = 1.0 ⇒ σ(1.0) = 0.73
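Notice: equal 100-point jumps in the feature do NOT in general cause equal
jumps in probability. The two jumps above are both 0.23 only because 300 and
500 sit symmetrically around z = 0. One more step, credit_score = 600, gives
z = 2.0 ⇒ σ(2.0) = 0.88, a jump of just 0.15: the sigmoid flattens near the extremes.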
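To see the flattening over a wider range, the same fitted score z = -4 + 0.01 × credit_score can be evaluated at a few more points. The extra credit scores (600 and 700) are added here purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rows = []
previous = None
for credit_score in (300, 400, 500, 600, 700):
    z = -4 + 0.01 * credit_score
    p = sigmoid(z)
    # Change in probability compared with the previous 100-point step.
    jump = None if previous is None else round(p - previous, 2)
    rows.append((credit_score, round(p, 2), jump))
    previous = p

for row in rows:
    print(row)
```

The probabilities come out as 0.27, 0.50, 0.73, 0.88, 0.95, so the jump per 100 points shrinks from 0.23 to 0.15 to 0.07 as the score moves away from z = 0.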
The Decision Boundary and Threshold
The sigmoid gives a probability. To turn it into a hard class label, we apply a threshold — by default 0.5.
Decision rule (default threshold t = 0.5):
if p >= 0.5 ⇒ predict class 1
if p < 0.5 ⇒ predict class 0
Because σ(z) = 0.5 exactly when z = 0, the default threshold is
equivalent to:
predict class 1 when z = β₀ + β₁x₁ + ... + βₙxₙ >= 0
The set of points where z = 0 is the decision boundary. Because z is linear in the features, this boundary is a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).
The threshold is a business decision, not a mathematical constant. For fraud detection you might lower the threshold to 0.2 so you catch more fraud (higher recall) at the cost of more false alarms. For a spam filter you might raise it to 0.8 so you rarely misclassify a real email. You will study how to pick thresholds using the ROC curve and precision-recall trade-offs in the Model Evaluation Metrics chapter.
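Because the sigmoid is strictly increasing, any probability threshold t corresponds to a cutoff on the raw score: p >= t exactly when z >= log(t / (1 - t)). The sketch below shows that mapping; logit and predict_label are illustrative helper names, and the threshold is assumed to lie strictly between 0 and 1.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the log-odds for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_label(z, threshold=0.5):
    """Classify by comparing the raw score z with the cutoff logit(threshold)."""
    return int(z >= logit(threshold))

# Score cutoffs implied by the thresholds mentioned in the text.
cutoffs = {t: round(logit(t), 3) for t in (0.2, 0.5, 0.8)}
print(cutoffs)

# The same applicant (z = -1.0, p about 0.27) is flagged at t = 0.2 but not at t = 0.5.
print(predict_label(-1.0, threshold=0.2), predict_label(-1.0, threshold=0.5))
```

Lowering the threshold shifts the decision boundary parallel to itself (the cutoff on z moves from 0 to about -1.39 for t = 0.2); it does not change the boundary's orientation.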
Cost Function: Log Loss (Cross-Entropy)
Linear regression minimises squared error. Logistic regression does not: squared error on top of the sigmoid gives a non-convex cost surface that can contain flat regions and local minima where gradient descent stalls. Instead we use log loss, also called binary cross-entropy, which is convex for logistic regression and rewards confident, correct probabilities.
Cost for a single example (yᵢ is the true label, pᵢ is the predicted probability):
cost = -[ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]
Intuition:
→ If yᵢ = 1, cost = -log(pᵢ) → 0 when pᵢ → 1, → ∞ when pᵢ → 0
→ If yᵢ = 0, cost = -log(1 - pᵢ) → 0 when pᵢ → 0, → ∞ when pᵢ → 1
Total cost over m examples (this is what training minimises):
J(β) = -(1/m) · Σᵢ [ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]
The key insight: log loss punishes confident wrong answers brutally. Predicting p = 0.99 for something that is actually class 0 costs -log(0.01) ≈ 4.6, while predicting p = 0.51 for the same mistake costs only -log(0.49) ≈ 0.71. This is why the model learns not just to be right, but to be appropriately uncertain. There is no neat closed-form solution like OLS, so the coefficients β are fit iteratively using optimisation algorithms (gradient descent, lbfgs, newton-cg, etc.).
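The per-example cost and the averaged cost J(β) can be checked by hand with NumPy. This is a minimal sketch: the function name log_loss_per_example is made up, the labels and probabilities are invented, and the eps clipping is an added assumption to avoid log(0).

```python
import numpy as np

def log_loss_per_example(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy for each example (eps clipping is an assumption)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Two wrong predictions (true class 0) and two right ones (true class 1),
# each made once confidently (0.99) and once hesitantly (0.51).
y_true = np.array([0, 0, 1, 1])
p_pred = np.array([0.99, 0.51, 0.99, 0.51])

costs = log_loss_per_example(y_true, p_pred)
mean_cost = float(costs.mean())  # this average is J(beta) for the four examples

print(np.round(costs, 3))
print(round(mean_cost, 3))
```

The confident mistake costs roughly 4.61, more than six times the hesitant mistake at about 0.71, while the confident correct answer costs only about 0.01.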
Interpreting Coefficients: Log-Odds and Odds Ratios
This is logistic regression's superpower over black-box models: the coefficients are interpretable. The trick is to think in terms of odds rather than probability.
Odds = p / (1 - p) (e.g. p = 0.75 ⇒ odds = 3, i.e. "3 to 1")
The model is linear in the LOG of the odds:
log( p / (1 - p) ) = z = β₀ + β₁x₁ + ... + βₙxₙ
So each coefficient βⱼ is the change in log-odds for a one-unit
increase in xⱼ (holding other features constant).
Log-odds are hard to explain to a stakeholder, so we exponentiate the coefficient to get an odds ratio:
Odds ratio for feature xⱼ: OR = e^(βⱼ)
Interpretation of a one-unit increase in xⱼ (others held constant):
OR > 1 ⇒ odds of class 1 MULTIPLY by OR (feature increases likelihood)
OR = 1 ⇒ no effect
OR < 1 ⇒ odds of class 1 shrink (feature decreases likelihood)
Worked Interpretation
Churn model coefficient: β_tenure = -0.12 (tenure in months)
Odds ratio = e^(-0.12) = 0.887
Interpretation:
"Each additional month a customer stays multiplies their odds of
churning by 0.887 — roughly an 11% reduction in churn odds per month,
holding all other features constant."
Another: β_monthly_charge = 0.03 (charge in ₹100 units)
Odds ratio = e^(0.03) = 1.030
"Each extra ₹100 of monthly charge multiplies churn odds by 1.03
(a 3% increase in churn odds)."
Note the difference from linear regression: the coefficient acts on the odds multiplicatively, not on the probability additively. A 0.03 coefficient does not mean "3% more probability."
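To make the "odds, not probability" point concrete, the sketch below applies the tenure odds ratio to customers who start at different churn probabilities. The helper apply_odds_ratio and the starting probabilities are illustrative assumptions; only β_tenure = -0.12 comes from the example above.

```python
import math

def apply_odds_ratio(p, beta, delta=1.0):
    """Probability after a feature increases by `delta` units, starting from p."""
    odds = p / (1.0 - p)
    new_odds = odds * math.exp(beta * delta)  # odds are multiplied by e^(beta * delta)
    return new_odds / (1.0 + new_odds)

beta_tenure = -0.12
odds_ratio = math.exp(beta_tenure)
print(f"odds ratio = {odds_ratio:.3f}")

# The same odds ratio moves the probability by different amounts
# depending on where the customer starts.
changes = {}
for p in (0.10, 0.50, 0.90):
    new_p = apply_odds_ratio(p, beta_tenure)
    changes[p] = p - new_p
    print(f"start p = {p:.2f} -> after one more month p = {new_p:.3f}")
```

A customer at p = 0.50 drops by about 3 percentage points, while customers at p = 0.10 or p = 0.90 drop by only about 1 point, even though the odds shrink by the same factor in every case.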
Binary vs Multiclass Classification
Plain logistic regression handles two classes. Real problems often have more (e.g. classify a support ticket as Billing, Technical, or Sales). There are two standard extensions.
One-vs-Rest (OvR)
Train one binary classifier per class: "this class vs everything else." For K classes you fit K models, get K probabilities, and predict the class whose model gives the highest score.
3 classes {A, B, C} ⇒ train 3 binary models:
Model_A: A vs (B or C)
Model_B: B vs (A or C)
Model_C: C vs (A or B)
Predict = class with the highest probability.
Softmax / Multinomial Logistic Regression
A single model that outputs a full probability distribution over all K classes at once, using the softmax function (a generalisation of the sigmoid). The probabilities across all classes sum to 1.
Softmax for class k out of K classes:
P(y = k | x) = e^(zₖ) / Σⱼ e^(zⱼ) (sum over j = 1..K)
The K probabilities are guaranteed to sum to 1.0.
In scikit-learn, recent versions of LogisticRegression fit the multinomial (softmax) model automatically for multiclass problems when the solver supports it (the default lbfgs does). To force one-vs-rest instead, wrap the estimator in sklearn.multiclass.OneVsRestClassifier. Multinomial is usually better calibrated when classes are genuinely mutually exclusive; OvR is a robust, simple fallback.
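The softmax formula above can be sketched directly in NumPy. The per-class scores below are invented for illustration, and subtracting the row maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp()."""
    scores = np.atleast_2d(np.asarray(scores, dtype=float))
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical scores z_k for two support tickets
# (columns: Billing, Technical, Sales).
z = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 0.0]])

probs = softmax(z)
predicted = probs.argmax(axis=1)  # index of the most probable class per row

print(np.round(probs, 3))
print(predicted)
```

With only two classes and scores (z, 0), softmax reduces to the sigmoid of z, which is the sense in which it generalises the binary model.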
Logistic Regression in scikit-learn
Here is a realistic end-to-end binary classification workflow. We predict customer churn from a small feature set. Note the pipeline with scaling — logistic regression is sensitive to feature scale when regularised (more on that below).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)
# --- Illustrative telecom churn data (features already engineered) ---
# Columns: tenure_months, monthly_charge, num_support_calls
df = pd.read_csv("telecom_churn.csv")
X = df[["tenure_months", "monthly_charge", "num_support_calls"]]
y = df["churned"] # 1 = churned, 0 = stayed
# Stratify to preserve the churn ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
# Scale + model in one pipeline so scaling is fit ONLY on training data
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
model.fit(X_train, y_train)
# Hard predictions (uses the default 0.5 threshold internally)
y_pred = model.predict(X_test)
# Probabilities — column 1 is P(class = 1)
y_proba = model.predict_proba(X_test)[:, 1]
print("Accuracy :", round(accuracy_score(y_test, y_pred), 3))
print("ROC-AUC :", round(roc_auc_score(y_test, y_proba), 3))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Illustrative output (numbers shown for shape, not a real benchmark):
Accuracy : 0.855
ROC-AUC : 0.889
[[512  38]
 [ 71 129]]
              precision    recall  f1-score   support
           0       0.88      0.93      0.90       550
           1       0.77      0.65      0.70       200
    accuracy                           0.85       750
   macro avg       0.83      0.79      0.80       750
weighted avg       0.85      0.85      0.85       750
Using predict_proba and a Custom Threshold
predict_proba is what makes logistic regression valuable — you get a ranked score, not just a label. To catch more churners (higher recall) you can lower the decision threshold yourself.
# Default predict() uses 0.5. Here we use a lower, business-chosen threshold.
threshold = 0.35
y_pred_custom = (y_proba >= threshold).astype(int)
# Lowering the threshold typically increases recall (catch more class 1)
# at the cost of precision (more false positives).
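The precision/recall trade-off is easiest to see on a tiny example. The labels and probabilities below are made up for illustration and do not come from the churn model above.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and predicted probabilities (6 negatives, 4 positives).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.60,
                    0.30, 0.45, 0.70, 0.90])

results = {}
for threshold in (0.5, 0.35):
    y_pred = (y_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold = {threshold}: precision = {precision:.2f}, recall = {recall:.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.35 raises recall from 0.50 to 0.75 and lowers precision from 0.67 to 0.60: one more true positive is caught at the price of one more false alarm.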
Reading Coefficients as Odds Ratios
# Pull the fitted coefficients out of the pipeline's classifier step.
clf = model.named_steps["clf"]
feature_names = X.columns
coefs = pd.DataFrame({
    "feature": feature_names,
    "coefficient": clf.coef_[0],
    "odds_ratio": np.exp(clf.coef_[0])
}).sort_values("odds_ratio", ascending=False)
print(coefs)
Illustrative output (coefficients are on SCALED features):
feature coefficient odds_ratio
2 num_support_calls 0.640 1.897
1 monthly_charge 0.310 1.363
0 tenure_months -0.880 0.415
Read as: after scaling, more support calls sharply raise churn odds,
while longer tenure strongly lowers them.
Because the pipeline scaled the features, these odds ratios describe the effect of a one-standard-deviation change, not a one-raw-unit change. If you need raw-unit odds ratios for stakeholders, fit the model on unscaled features (accepting the caveats in Common Mistakes).
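Another option, sketched below under stated assumptions, is to keep the scaled pipeline and convert its coefficients back to raw units afterwards. StandardScaler computes (x - mean) / scale, so dividing each scaled coefficient by the matching entry of scaler.scale_ gives the slope per raw unit. The data here is synthetic (made-up tenure and charge columns), and the model is still regularized on the scaled features; only the reporting unit changes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (all values are invented).
rng = np.random.default_rng(0)
n = 500
tenure = rng.uniform(1, 72, n)   # months
charge = rng.uniform(2, 20, n)   # monthly charge in units of 100 rupees
X = np.column_stack([tenure, charge])
true_z = 0.5 - 0.05 * tenure + 0.10 * charge
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_z))).astype(int)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
model.fit(X, y)

scaler = model.named_steps["scaler"]
clf = model.named_steps["clf"]

# z = b + sum_j w_j * (x_j - mean_j) / scale_j, so the raw-unit slope is w_j / scale_j.
raw_coef = clf.coef_[0] / scaler.scale_
raw_intercept = clf.intercept_[0] - np.sum(clf.coef_[0] * scaler.mean_ / scaler.scale_)
raw_odds_ratio = np.exp(raw_coef)  # odds multiplier per one raw unit

print(np.round(raw_odds_ratio, 3))
```

Plugging raw_intercept and raw_coef into the sigmoid on the unscaled X reproduces model.predict_proba, which is a handy sanity check before quoting the numbers to stakeholders.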
The Regularization Parameter C
Logistic regression in scikit-learn is regularized by default (L2). Regularization discourages large coefficients, which prevents overfitting and stabilises the model when features are correlated. The strength is controlled by C.
C is the INVERSE of regularization strength:
Small C ⇒ strong regularization ⇒ smaller coefficients, simpler model (may underfit)
Large C ⇒ weak regularization ⇒ larger coefficients, more flexible (may overfit)
C = 1.0 is the scikit-learn default.
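You can also switch the penalty type: penalty="l1" for sparse models that zero out useless features, penalty="l2" for the default, or penalty="elasticnet" for a mix of the two. The default lbfgs solver supports only L2; L1 needs solver="liblinear" or solver="saga", and elastic net needs solver="saga" together with an l1_ratio value. Tune C with cross-validation.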
from sklearn.model_selection import GridSearchCV
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
print("Best C :", grid.best_params_)
print("Best ROC-AUC:", round(grid.best_score_, 3))
The full treatment of the bias-variance trade-off, why regularization works, and L1 vs L2 lives in the Bias-Variance, Overfitting & Regularization chapter. For now, remember: C is a knob you tune, and it interacts with feature scaling — which is why the pipeline above scales first.
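A quick way to see what C does is to fit the same pipeline at a few values and compare the size of the coefficient vector. The sketch below uses synthetic data from make_classification (an assumption for illustration, not the churn dataset) and reports the L2 norm of the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem, used only to illustrate the effect of C.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=1, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    pipe.fit(X, y)
    # make_pipeline names each step after its lowercased class name.
    coef = pipe.named_steps["logisticregression"].coef_[0]
    coef_norms[C] = float(np.linalg.norm(coef))
    print(f"C = {C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```

The smallest C should give a clearly smaller norm than the other two, which is the shrinkage described above; how much the norm grows between C = 1 and C = 100 depends on the data.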
When to Use Logistic Regression: Pros and Cons
| Aspect | Details |
|---|---|
| Problem type | Binary or multiclass classification; probabilistic ranking |
| Output | Calibrated-ish probability via predict_proba, plus a hard label |
| Interpretability | High — coefficients map directly to odds ratios |
| Training speed | Very fast; scales to large datasets |
| Decision boundary | Linear (a hyperplane) in the feature space |
| Assumptions | Roughly linear relationship between features and log-odds |
| Handles non-linearity | Only if you add polynomial or interaction features manually |
| Feature scaling | Recommended (required for fair regularization and faster convergence) |
Pros
- Fast, simple, and a strong baseline — always try it before reaching for complex models.
- Coefficients are interpretable as odds ratios, which regulators and business stakeholders love.
- Outputs probabilities, not just labels, so you can rank and set custom thresholds.
- Rarely overfits when regularized; works well even with many features.
Cons
- The decision boundary is linear; it cannot capture complex non-linear patterns without feature engineering.
- Sensitive to outliers and to strongly correlated features (multicollinearity destabilises coefficients).
- Assumes a roughly linear relationship between features and the log-odds.
- Probabilities are not automatically well-calibrated, especially with imbalanced classes or heavy regularization.
If the boundary is clearly non-linear, algorithms like K-Nearest Neighbors, Decision Trees, or Support Vector Machines (all covered later in this series) may fit better.
Common Mistakes
1. Not scaling features before regularized logistic regression
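Because L2 regularization penalises the SIZE of coefficients, the penalty
depends on each feature's units. A small-scale feature (e.g. age in years)
needs a large coefficient and is shrunk hard, while a large-scale feature
(e.g. income in ₹) needs only a tiny coefficient and is barely penalised.
Always scale (StandardScaler) inside a Pipeline so the penalty is fair
and the solver converges quickly. Symptom: a "ConvergenceWarning".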
2. Ignoring class imbalance
If only 2% of transactions are fraud, a model predicting "never fraud"
scores 98% accuracy — and is useless. Fixes:
→ use class_weight="balanced" in LogisticRegression
→ resample (SMOTE / undersampling)
→ judge with ROC-AUC, precision, recall — NOT plain accuracy.
3. Treating predict_proba output as perfectly calibrated
A predicted 0.90 does NOT guarantee "90 out of 100 such cases are positive."
Regularization and imbalance distort calibration. If you need trustworthy
probabilities (e.g. for pricing or risk), calibrate with
CalibratedClassifierCV and check a calibration (reliability) curve.
4. Interpreting coefficients on scaled features as raw-unit effects
After StandardScaler, a coefficient describes a ONE-STANDARD-DEVIATION
change in the feature, not a one-rupee or one-year change. Don't tell a
stakeholder "each extra ₹1 does X" when the model saw scaled inputs.
5. Using the default 0.5 threshold blindly
0.5 is rarely optimal for imbalanced or cost-asymmetric problems.
Choose the threshold from business costs (cost of a false negative vs
false positive) using the ROC / precision-recall curve — not by default.
6. Confusing coefficient magnitude with importance across unscaled features
On unscaled data, a small coefficient on a large-range feature can matter
more than a big coefficient on a tiny-range feature. Compare importance
only after scaling, or use standardized coefficients.
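Mistake 2 is easy to reproduce. The sketch below builds a synthetic dataset with about 5% positives (an assumption for illustration: two overlapping Gaussian blobs already on the same scale, so no scaler is used) and compares minority-class recall with and without class_weight="balanced".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 1900 negatives, 100 positives, two features.
rng = np.random.default_rng(42)
n_neg, n_pos = 1900, 100
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_neg, 2)),
    rng.normal(loc=1.5, scale=1.0, size=(n_pos, 2)),
])
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

recalls = {}
for label, weight in (("default", None), ("balanced", "balanced")):
    clf = LogisticRegression(class_weight=weight, max_iter=1000)
    clf.fit(X_train, y_train)
    # Recall on class 1: the share of true positives the model catches.
    recalls[label] = recall_score(y_test, clf.predict(X_test))
    print(f"{label:>8}: minority-class recall = {recalls[label]:.2f}")
```

Reweighting shifts the decision boundary toward the majority class, so more positives are caught; expect precision to fall in exchange, which is why the threshold and the metric should be chosen together.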
Practice Exercises
1. A fitted model has z = -3 + 0.5·x. Compute the predicted probability p = σ(z) for x = 4, x = 6, and x = 8. At what value of x is the decision boundary (where p = 0.5)?
2. A logistic regression coefficient for a "has_dependents" (0/1) feature is β = -0.7. Compute the odds ratio and explain in plain English what it means for churn.
3. Load any binary dataset (e.g. scikit-learn's load_breast_cancer). Build a Pipeline with StandardScaler and LogisticRegression, fit it, and report accuracy and ROC-AUC on a stratified test set.
4. Using the model from Exercise 3, extract predict_proba, then compute predictions at thresholds 0.3, 0.5, and 0.7. Describe how precision and recall shift as the threshold changes.
5. Explain in one paragraph why logistic regression uses log loss instead of mean squared error as its cost function.
6. Take an imbalanced dataset. Train two models, one with default settings and one with class_weight="balanced", and compare their recall on the minority class. What changed and why?
Summary
In this chapter you learned:
- Logistic regression is classification, not regression: it predicts the probability of a class, then applies a threshold.
- The sigmoid σ(z) = 1 / (1 + e^(-z)) squashes the linear logit z into the range 0 to 1.
- The decision boundary is where z = 0 (probability 0.5); the threshold is a tunable business choice, not a fixed constant.
- Training minimises log loss (cross-entropy), which is convex and heavily punishes confident wrong answers.
- Coefficients are interpretable: e^(βⱼ) is the odds ratio, i.e. how a one-unit feature change multiplies the odds of the positive class.
- Multiclass is handled by one-vs-rest (K binary models) or softmax / multinomial (one model, probabilities sum to 1).
- In scikit-learn, use a Pipeline with StandardScaler, call fit / predict, and use predict_proba for probabilities and custom thresholds.
- C is the inverse regularization strength: small C regularizes more; tune it with cross-validation (see the regularization chapter).
- Watch for the classic pitfalls: unscaled features, imbalanced classes, uncalibrated probabilities, and blind use of the 0.5 threshold.
Logistic regression is the workhorse baseline of classification — fast, interpretable, and probabilistic — and a model you will reach for constantly in real data science work.
Next up: K-Nearest Neighbors (KNN) — a delightfully simple, non-parametric classifier that makes predictions by looking at the closest examples in the training data, with no explicit training step at all.