Support Vector Machines (SVM)

What Is a Support Vector Machine?

A Support Vector Machine (SVM) is a classifier that separates two classes by drawing the widest possible street between them. Instead of just finding any line that splits the data, an SVM finds the line that leaves the largest margin — the biggest gap — between the two groups. The data points that sit right on the edge of that street are called the support vectors, and they alone define the boundary.

In the previous chapter you saw how Naive Bayes classifies using probabilities, and earlier how Logistic Regression finds a linear decision boundary by maximising likelihood. SVMs attack the same problem from a purely geometric angle: forget probabilities, just find the boundary that is as far as possible from the nearest points of both classes.

Intuitive analogy. Imagine Priya is a highway planner asked to build a road separating two villages — one on the left, one on the right. She could build the road hugging the left village, or hugging the right one, but a wise planner builds it right down the middle, leaving the maximum shoulder on both sides. If a new house is later built near the road, the wide shoulder means it is far less likely to end up on the wrong side. That widest-possible road is the SVM's maximum-margin hyperplane, and the houses closest to the road on each side — the ones that actually constrain where the road can go — are the support vectors.

Goal: find the hyperplane that maximises the margin between classes, so the model generalises well to unseen points.

Examples:
→ Classify a UPI transaction as fraud vs legitimate
→ Classify an email as spam vs ham (text -> high-dimensional vectors)
→ Classify a tumour as malignant vs benign from medical scan features
→ Recognise handwritten digits (image pixels -> classes 0..9)

The Hyperplane and the Margin

In two dimensions the decision boundary is a line; in three dimensions it is a plane; in higher dimensions it is a hyperplane. Whatever the dimension, an SVM describes it with a weight vector w and a bias b.

Decision boundary (hyperplane):
w · x + b = 0

Decision rule:
predict +1  if  w · x + b > 0
predict -1  if  w · x + b < 0

Here w · x is the dot product of the weight vector and the feature vector.

The margin is the distance from the hyperplane to the closest data point of either class. SVM chooses w and b so that this margin is as large as possible. Geometrically, we can scale w and b so the closest points satisfy w · x + b = +1 (positive side) and w · x + b = -1 (negative side). The width of the street between those two lines is:

Margin width = 2 / ||w||

where ||w|| is the length (Euclidean norm) of the weight vector.

Maximising 2 / ||w|| is the same as minimising ||w||. So the SVM optimisation, in its simplest form, is:

Minimise:   (1/2) ||w||^2
subject to: yᵢ (w · xᵢ + b) >= 1   for every training point i

where yᵢ is +1 or -1 (the true class label).

The constraint says "every point must be on the correct side of its margin line, by at least a full unit." Points that sit exactly on the margin (where the constraint is tight, yᵢ (w · xᵢ + b) = 1) are the support vectors. Move any other point around and the boundary does not change; move a support vector and the whole street shifts. This is why SVMs are memory-efficient at prediction time — only the support vectors matter.

Hard Margin vs Soft Margin: The C Parameter

The model above is a hard-margin SVM: it insists that every point be correctly separated with room to spare. That works only if the data is perfectly linearly separable and noise-free — which real data almost never is. One mislabelled point or one overlapping outlier makes a hard margin impossible.

The fix is the soft margin, which allows some points to violate the margin (or even land on the wrong side) in exchange for a wider, more robust street. Each violation is measured by a slack variable ξᵢ (xi), and we pay a penalty for it. This introduces the most important SVM hyperparameter, C.

Soft-margin objective:
Minimise:   (1/2) ||w||^2 + C * Σ ξᵢ
subject to: yᵢ (w · xᵢ + b) >= 1 - ξᵢ,   ξᵢ >= 0

C controls the trade-off between a wide margin and few violations.

Think of C as how much you punish mistakes:

Large C (e.g. C = 1000): violations are very expensive. The model tries hard to classify every training point correctly, producing a narrow margin that hugs the data. This lowers bias but risks overfitting to noise.
Small C (e.g. C = 0.01): violations are cheap. The model tolerates some misclassified points to keep a wide, smooth margin. This is stronger regularization — lower variance, but risks underfitting.

C is therefore a regularization knob, the inverse of the regularization strength you met with regularized linear models. You tune it — usually on a log scale like [0.01, 0.1, 1, 10, 100] — with cross-validation.

Rule of thumb:
C too large  -> overfits, wiggly boundary, many training points memorised
C too small  -> underfits, boundary ignores real structure
Sweet spot   -> found via cross-validation (see the Train-Test Split chapter)

The Kernel Trick: Handling Non-Linear Boundaries

So far the boundary is a straight hyperplane. But what if the classes are arranged in a circle-inside-a-circle pattern, where no straight line can separate them? The elegant SVM answer is the kernel trick.

The idea: map the data into a higher-dimensional space where it becomes linearly separable, then find a hyperplane there. For example, a 2D circle of points can be separated by a plane once you add a third feature x₁² + x₂² — points near the centre get small values, points far out get large values, and a flat plane slices between them. Back in the original 2D space, that plane projects down to a circle.

The magic is that SVMs never need to actually compute those high-dimensional coordinates. The optimisation depends only on dot products between points, and a kernel function K(xᵢ, xⱼ) computes the dot product in the higher-dimensional space directly from the original features — cheaply, without ever building the new coordinates.

Kernel replaces the dot product:
xᵢ · xⱼ   ->   K(xᵢ, xⱼ)

The model implicitly works in a richer feature space,
but only ever evaluates the kernel on original inputs.

The Common Kernels

Linear:      K(xᵢ, xⱼ) = xᵢ · xⱼ
             A plain hyperplane. Best when data is (nearly) linearly separable
             or very high-dimensional (e.g. text).

Polynomial:  K(xᵢ, xⱼ) = (γ * xᵢ · xⱼ + r)^d
             Curved boundaries of polynomial degree d.
             Params: degree d, gamma γ, coef0 r.

RBF/Gaussian: K(xᵢ, xⱼ) = exp(-γ * ||xᵢ - xⱼ||^2)
             Smooth, flexible, "bumpy" boundaries. The default and
             most popular kernel. Only one main param: gamma γ.

Sigmoid:     K(xᵢ, xⱼ) = tanh(γ * xᵢ · xⱼ + r)
             Rarely the best choice; behaves like a shallow neural net.

The gamma Parameter (RBF)

For the RBF (Radial Basis Function) kernel, gamma controls how far the influence of a single training point reaches.

Large gamma: each point's influence is very local — the boundary wraps tightly around individual points, creating islands. High variance, prone to overfitting.
Small gamma: each point's influence is broad — the boundary is smooth and simple. High bias, prone to underfitting.

gamma large  -> tight, wiggly, island-like boundary (overfit)
gamma small  -> smooth, almost linear boundary (underfit)

Intuition: gamma is like the reach of a point's "gravity".
C and gamma are tuned together via grid search on a log scale.

C and gamma interact, so you almost always tune them jointly with GridSearchCV, sweeping both over log-spaced grids.

Why Scaling Is Critical

This is the single biggest gotcha with SVMs. The RBF and polynomial kernels depend on distances between points (||xᵢ - xⱼ||), and the margin depends on ||w||. If one feature is measured in rupees (range 0 to 5,00,000) and another in years (range 0 to 40), the rupee feature will completely dominate every distance calculation. The years feature becomes invisible, and the SVM effectively ignores it.

The fix: standardise all features to comparable scales before fitting — typically zero mean and unit variance with StandardScaler (covered in the Feature Engineering & Scaling chapter).

Standardisation:
z = (x - mean) / std

Every feature ends up centred at 0 with std 1,
so no single feature dominates the distance/kernel computation.

Always put the scaler and the SVM into a Pipeline so scaling parameters are learned only on the training fold — this prevents data leakage during cross-validation.

SVM in Scikit-Learn

Scikit-learn provides SVC (Support Vector Classifier) for classification and SVR (Support Vector Regressor) for regression. Here is a realistic end-to-end classification workflow.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Load a real dataset: predict malignant vs benign tumours
data = load_breast_cancer()
X, y = data.data, data.target   # 569 samples, 30 features

# 1. Split first (never let test data touch the scaler)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Pipeline: scaling is MANDATORY for SVM
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf"))   # RBF is the sensible default
])

# 3. Tune C and gamma together on a log-spaced grid
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV F1  :", round(grid.best_score_, 3))

# 4. Evaluate the tuned model on the held-out test set
best = grid.best_estimator_
y_pred = best.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Illustrative output (your exact numbers will vary):

Best params: {'svc__C': 10, 'svc__gamma': 0.01}
Best CV F1  : 0.978

[[41  2]
 [ 1 70]]
              precision    recall  f1-score   support
           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71
    accuracy                           0.97       114

A few things worth noting:

Nested parameter names use double underscores: svc__C targets the C of the step named "svc" inside the pipeline.
probability=True on SVC enables predict_proba, but it retrains with internal cross-validation and is slow — only turn it on if you truly need probabilities. Otherwise use decision_function, which returns the signed distance to the hyperplane.
Inspect best.named_steps["svc"].n_support_ to see how many support vectors were used per class.

SVR: SVM for Regression

The same margin idea extends to regression as Support Vector Regression (SVR). Instead of separating classes, SVR fits a "tube" of width epsilon around the target values, and only points outside the tube (the support vectors) incur a penalty. It shares the kernel, C, and gamma parameters.

from sklearn.svm import SVR

svr = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR(kernel="rbf", C=10, gamma=0.1, epsilon=0.1))
])
svr.fit(X_train_reg, y_train_reg)   # predicts a continuous value

Use SVR for smooth non-linear regression on small-to-medium tabular data — for example predicting a used car's resale price in ₹ from mileage, age, and engine size.

Strengths and Weaknesses

SVMs shine in high-dimensional spaces — even when you have more features than samples — because the margin depends only on the support vectors, not the raw dimensionality. This makes them a classic choice for text classification (thousands of word features) and bioinformatics. But they scale poorly to very large datasets: training cost grows roughly between quadratic and cubic in the number of samples, so hundreds of thousands of rows become painfully slow.

Strengths	Weaknesses
Effective in high-dimensional spaces (features `>=` samples)	Training is slow on very large datasets (`>` ~100k rows)
Robust to overfitting when `C` and `gamma` are tuned well	Requires careful feature scaling before training
Kernel trick captures complex non-linear boundaries	Sensitive to the choice of kernel and hyperparameters
Uses only support vectors, so prediction is memory-light	Poor interpretability with non-linear kernels (no simple coefficients)
Works well when classes have a clear margin of separation	No native probability output (must be enabled, and it is slow)
Strong out-of-the-box accuracy on small/medium clean data	Struggles with heavy class overlap and lots of noise

When to reach for an SVM: small-to-medium datasets, many features, a need for a strong non-linear boundary, and you can afford to tune hyperparameters. When to avoid it: millions of rows (prefer linear models, tree ensembles, or LinearSVC / stochastic gradient methods), or when you need transparent, explainable coefficients (prefer Logistic Regression or Decision Trees).

Common Mistakes

1. Forgetting to scale the features

Without StandardScaler, a feature measured in rupees (0 - 500000) swamps
a feature measured in years (0 - 40). The RBF kernel then sees only the
big-scale feature and the model is effectively blind to the rest.
FIX: always wrap SVC/SVR in a Pipeline with StandardScaler.

2. Leaking test data into the scaler

Fitting StandardScaler on the FULL dataset before splitting lets test
statistics leak into training -> optimistic, dishonest scores.
FIX: split first, then fit the scaler inside a Pipeline / cross-validation.

3. Using the default C and gamma and never tuning them

The defaults are rarely optimal. Untuned gamma that is too large gives a
boundary that memorises noise (overfit); too small gives a near-linear
underfit. FIX: GridSearchCV over C and gamma on a log scale, jointly.

4. Picking the wrong kernel for the data shape

Using a linear kernel on clearly circular/non-linear data underfits badly.
Using RBF on huge, already-linearly-separable, high-dim text data wastes
time and can overfit. FIX: start with RBF for tabular data, linear for
sparse high-dimensional text; compare via cross-validation.

5. Throwing a very large dataset at SVC

SVC training is roughly O(n^2) to O(n^3) in the number of samples. On
500k+ rows it can run for hours. FIX: subsample, use LinearSVC / SGDClassifier
for linear cases, or switch to a tree ensemble (see Random Forests).

6. Enabling probability=True by default

probability=True triggers internal cross-validation to calibrate outputs,
which is slow and can even disagree with predict(). FIX: leave it off unless
you genuinely need probabilities; otherwise use decision_function.

Practice Exercises

Explain in your own words why an SVM prefers the widest separating street rather than any separating line. What role do the support vectors play, and what happens to the boundary if you delete a non-support-vector point?
You set C = 1000 and get 100% training accuracy but poor test accuracy. Is the model overfitting or underfitting? Which direction should you move C, and why?
For an RBF SVM, describe what happens to the decision boundary as gamma increases from very small to very large. Relate each extreme to bias and variance.
Load the load_wine dataset from scikit-learn. Build a Pipeline with StandardScaler and SVC(kernel="rbf"), then use GridSearchCV to tune C over [0.1, 1, 10, 100] and gamma over [0.001, 0.01, 0.1, 1]. Report the best parameters and the test-set accuracy.
Take any dataset and train the same SVM twice — once with StandardScaler and once without. Compare the accuracies and explain the difference in terms of how the RBF kernel uses distances.
When would you choose a linear kernel over an RBF kernel? Give two concrete data situations (one favouring each) and justify your choice.

Summary

In this chapter you learned:

An SVM classifies by finding the maximum-margin hyperplane — the widest street separating two classes; the closest points are the support vectors and they alone define the boundary.
The boundary is w · x + b = 0, the margin width is 2 / ||w||, and training minimises ||w|| subject to every point being on the correct side of its margin.
Hard margin demands perfect separation; the practical soft margin allows violations penalised by C. Large C = narrow margin, low bias, risk of overfitting; small C = wide margin, stronger regularization, risk of underfitting.
The kernel trick implicitly maps data into a higher-dimensional space using only dot products, enabling non-linear boundaries. Common kernels: linear, polynomial, and RBF/Gaussian (the default).
For the RBF kernel, gamma sets how far each point's influence reaches: large gamma overfits (local, wiggly), small gamma underfits (smooth). Tune C and gamma together with GridSearchCV.
Scaling is mandatory — always wrap SVC/SVR in a Pipeline with StandardScaler to keep distance-based kernels honest and avoid leakage.
Use SVC for classification and SVR for regression; SVMs excel in high-dimensional spaces but scale poorly to very large datasets and offer weak interpretability with non-linear kernels.
Avoid the classic pitfalls: no scaling, test-data leakage, untuned C/gamma, the wrong kernel, huge datasets, and needlessly enabling probability=True.

SVMs give you a powerful, geometry-driven classifier that turns "draw the best boundary" into a precise optimisation problem — and the kernel trick lets that boundary bend around almost any shape.

Next up: K-Means Clustering — we leave labelled data behind and enter unsupervised learning, discovering natural groupings in data by iteratively assigning points to the nearest of k cluster centres.