K-Nearest Neighbors (KNN)

What Is K-Nearest Neighbors?

Imagine Priya moves into a new neighbourhood in Pune and wants to guess the monthly rent of a flat she is eyeing. The most natural thing she does is look at the most similar nearby flats — the same size, the same locality, the same floor — and average their rents. If five comparable flats rent for around ₹24,000, she reasonably guesses ₹24,000 for hers too.

That everyday instinct is K-Nearest Neighbors (KNN). To predict for a new data point, KNN finds the k most similar points in the training data — its nearest neighbours — and lets them vote.

Classification: the new point is assigned the majority class among its k neighbours.
Regression: the new point's value is the average (or weighted average) of its k neighbours' values.

KNN rests on a single, powerful assumption: points that are close together in feature space tend to have similar outcomes. "Birds of a feather flock together." If that assumption holds for your data, KNN works remarkably well despite being one of the simplest algorithms in machine learning.

The KNN idea in one line:
To label a new point, ask its k closest neighbours what they are,
then go with the majority (classification) or the average (regression).

Lazy Learning: KNN Does Almost Nothing at Training Time

Most algorithms you have met — Linear Regression, Logistic Regression — are eager learners: they spend effort during training to fit parameters (the coefficients), then throw the training data away and predict cheaply from those parameters.

KNN is the opposite. It is a lazy learner (also called instance-based or memory-based). Its "training" step is almost trivial: it simply stores the entire training dataset. No coefficients are estimated, no line or boundary is fitted in advance. All the real work is deferred to prediction time, when it must compute distances from the new point to every stored point.

Eager learner (e.g. Logistic Regression):
  Training  -> SLOW  (fit parameters)
  Predict   -> FAST  (plug into formula)

Lazy learner (KNN):
  Training  -> FAST  (just memorise the data)
  Predict   -> SLOW  (compute distances to all stored points)

This trade-off has real consequences. KNN is delightful for quick experiments and small datasets, but it can become painfully slow at inference on large data — a point we return to in the pitfalls. It is also non-parametric: it makes no assumption about the shape of the decision boundary, so it can carve out highly irregular, curved boundaries that a straight-line model never could.

Measuring "Closeness": Distance Metrics

The whole method hinges on the word closest. To find the nearest neighbours, KNN needs a distance metric — a formula that quantifies how far apart two points are. The default and most common is Euclidean distance: the ordinary straight-line distance you would measure with a ruler.

Euclidean Distance (L2)

For two points p and q with n features each:

Euclidean distance:
d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )

        = sqrt( sum over i of (pi - qi)^2 )

Manhattan Distance (L1)

Manhattan distance (also "taxicab" or "city-block" distance) sums the absolute differences. Picture a taxi driving along a grid of city blocks: it cannot cut diagonally, so it travels the sum of horizontal and vertical moves.

Manhattan distance:
d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|

        = sum over i of |pi - qi|

Manhattan distance is often more robust when features are on grid-like scales or when you want to reduce the outsized pull of a single large-difference dimension.

Minkowski Distance (the General Form)

Minkowski distance is the parent formula that unifies both. It has a parameter p (confusingly also called p, unrelated to the point label):

Minkowski distance of order p:
d(a, b) = ( sum over i of |ai - bi|^p ) ^ (1 / p)

Special cases:
  p = 1  -> Manhattan distance (L1)
  p = 2  -> Euclidean distance (L2)

In scikit-learn, KNeighborsClassifier uses Minkowski with p=2 (Euclidean) by default. Set p=1 for Manhattan.

Worked example (2 features): flat A vs flat B
  A = (area = 800 sqft, rent = 20)   (rent in thousands of rupees)
  B = (area = 850 sqft, rent = 22)

Euclidean:  sqrt( (800-850)^2 + (20-22)^2 ) = sqrt( 2500 + 4 ) = sqrt(2504) = 50.04
Manhattan:  |800-850| + |20-22| = 50 + 2 = 52

Notice something alarming in that example: the area difference (50) completely drowns out the rent difference (2), even though rent is arguably just as important. That is not a quirk — it is the reason feature scaling is non-negotiable for KNN, as we will see next.

Choosing a Distance Metric

Metric	Formula (fenced above)	When to prefer it
Euclidean (L2)	Minkowski with `p = 2`	Continuous features, roughly spherical clusters; the sensible default
Manhattan (L1)	Minkowski with `p = 1`	High-dimensional data; when you want less sensitivity to large single-feature gaps
Minkowski (general)	order `p`	Tune `p` via cross-validation to interpolate between L1 and L2
Hamming	count of differing positions	Purely categorical or binary features
Cosine	angle between vectors	Text or sparse data where direction matters more than magnitude

Why Feature Scaling Is Essential for KNN

This is the single most important practical rule for KNN, so it gets its own section. Because every distance formula adds up differences across features, any feature measured on a large numeric scale dominates the distance — and therefore dominates who counts as a "neighbour" — regardless of how important it actually is.

Consider predicting whether a customer will churn using two features:

Feature 1: monthly_spend  in rupees, range   500  to  50000   (spread ~ 49500)
Feature 2: tenure_years    in years, range     0  to     10   (spread ~ 10)

Distance between two customers:
d = sqrt( (spend_diff)^2 + (tenure_diff)^2 )

A spend difference of 20000 contributes 20000^2 = 400,000,000
A tenure difference of 5    contributes     5^2 =           25

tenure is utterly ignored. The model is effectively 1-dimensional.

The fix is feature scaling (covered in depth in the Feature Engineering & Scaling chapter): rescale every feature to a comparable range so each contributes fairly.

Standardisation (StandardScaler): centre each feature to mean 0 and scale to standard deviation 1.
Min-Max scaling (MinMaxScaler): squeeze each feature into the range [0, 1].

Standardisation:  z = (x - mean) / std
Min-Max scaling:  x_scaled = (x - min) / (max - min)

Because scaling must be learned from the training data only (fitting a scaler on the whole dataset causes data leakage — see the Train-Test Split & Cross-Validation chapter), the correct pattern is to bundle the scaler and the classifier into a Pipeline. We do exactly that in the code section below. If you take away one habit from this chapter, let it be: always scale before KNN, and always inside a pipeline.

Choosing k: The Central Hyperparameter

The k in KNN — how many neighbours vote — is the knob that controls the bias-variance trade-off (covered fully in the Bias-Variance, Overfitting & Regularization chapter).

Small k (e.g. k = 1): the prediction follows the single nearest point. The decision boundary is jagged and wraps tightly around individual training points. This is low bias but high variance — the model chases noise and overfits. A single mislabelled point creates its own island of wrong predictions.
Large k (e.g. k equal to half the dataset): many far-away points vote, blurring the boundary into something very smooth. This is high bias but low variance — the model oversmooths and ignores genuine local structure. In the extreme, k = n just predicts the overall majority class for every point.

Effect of k on the decision boundary:

k = 1     ->  extremely wiggly boundary, memorises noise      (OVERFIT)
k = 5..15 ->  smooth but faithful boundary                    (usually good)
k = n     ->  flat: always predicts the global majority       (UNDERFIT)

Practical Rules for Picking k

Use cross-validation. Do not guess. Try a range of k values and pick the one with the best mean cross-validation score. This is the only reliable method.
Use an odd k for binary classification. An odd number of voters prevents tie votes (e.g. k = 4 could split 2-2; k = 5 cannot deadlock between two classes).
A rough starting point is k near sqrt(n), but treat it only as a starting neighbourhood for the cross-validation search, never as a final answer.
Weight neighbours by distance. Setting weights="distance" lets closer neighbours count more than distant ones, which often softens the choice of k and improves accuracy.

KNN in Scikit-Learn

Let's build a complete, leak-free classifier. We use the classic Iris dataset, scale inside a pipeline, and tune k with cross-validation via GridSearchCV.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load data
X, y = load_iris(return_X_y=True)
print(X.shape)   # (150, 4)  -> 150 flowers, 4 measurements

# 2. Split first (so the scaler never sees the test rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Scaler + KNN bundled in a pipeline (prevents scaling leakage)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])

# 4. Tune k and the weighting scheme with 5-fold cross-validation
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2]          # 1 = Manhattan, 2 = Euclidean
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_, 3))

# 5. Evaluate the tuned model on the untouched test set
y_pred = grid.predict(X_test)
print("Test accuracy:", round(accuracy_score(y_test, y_pred), 3))

Illustrative output:
(150, 4)
Best params: {'knn__n_neighbors': 5, 'knn__p': 2, 'knn__weights': 'uniform'}
Best CV accuracy: 0.958
Test accuracy: 0.967

Inspecting the Neighbours Directly

KNN is refreshingly transparent — you can ask it which training points it consulted for a given prediction.

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# For the first test flower, find its 5 nearest training neighbours
distances, indices = knn.kneighbors(X_test[:1], n_neighbors=5)
print("Neighbour indices:", indices[0])       # rows in X_train
print("Neighbour distances:", np.round(distances[0], 3))
print("Neighbour classes:", y_train[indices[0]])   # the votes being cast

Illustrative output:
Neighbour indices: [ 61  92  34  70  18]
Neighbour distances: [0.141 0.173 0.201 0.223 0.245]
Neighbour classes: [1 1 1 1 1]

All five neighbours are class 1, so the majority vote is an unambiguous 1.

KNN for Regression

The regressor is identical in spirit but averages the neighbours' target values instead of voting.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Predict flat rent from features like area, bedrooms, age, distance-to-metro
# (X_rent, y_rent assumed already loaded and scaled inside a pipeline)
reg = KNeighborsRegressor(n_neighbors=5, weights="distance")

# weights="distance": nearer flats influence the predicted rent more strongly
scores = cross_val_score(reg, X_train, y_train, cv=5,
                         scoring="neg_root_mean_squared_error")
print("CV RMSE:", round(-scores.mean(), 3))

For regression, weights="distance" is especially natural: a flat 50 metres away should shape your rent estimate more than one 5 kilometres away.

The Curse of Dimensionality

KNN's Achilles' heel is high-dimensional data. As the number of features grows, a counter-intuitive geometric fact takes hold: in high dimensions, all points become almost equally far from one another. The very notion of a "nearest" neighbour dissolves — the nearest and the farthest points end up at nearly the same distance, so the neighbourhood is no longer meaningfully "local".

Why distances flatten out in high dimensions:

- With 2 features, points fill a familiar plane; near vs far is obvious.
- With 500 features, the space is so vast that training data is spread
  incredibly thin. Every point is isolated in a huge empty volume.
- The ratio (farthest distance - nearest distance) / nearest distance
  shrinks toward 0 as dimensions rise. "Nearest" stops meaning "similar".

The practical symptoms: KNN's accuracy degrades, and it also slows down (more features means costlier distance sums). Defences:

Reduce dimensionality first with feature selection or PCA (see the Dimensionality Reduction & PCA chapter).
Engineer fewer, more informative features (see the Feature Engineering & Scaling chapter).
Prefer other algorithms — tree-based models or SVMs — when the feature count is genuinely large and cannot be reduced.

A useful rule of thumb: KNN thrives with a modest number of features and plenty of samples; it struggles when features outnumber the informative signal.

Pros and Cons of KNN

Aspect	Pros	Cons
Simplicity	Intuitive, no training phase, easy to explain	Naive implementation is `O(n)` per prediction
Assumptions	Non-parametric; makes no assumption about boundary shape	Assumes nearby points share labels — fails if that is untrue
Flexibility	Handles multi-class naturally; does both classification and regression	Sensitive to `k`, distance metric, and feature scaling
Data needs	Learns complex, irregular boundaries from data alone	Needs many samples; suffers the curse of dimensionality
Interpretability	Transparent — you can see exactly which neighbours voted	No compact model or coefficients to inspect globally
Robustness	Weighted voting softens noise	Very sensitive to irrelevant features and outliers
Cost	Near-zero training cost	Slow inference and high memory (stores all data) on big datasets

Common Mistakes

1. Forgetting to scale the features

By far the most common KNN error. Without scaling, a large-range feature (rupees, population, area in sq ft) hijacks the distance metric and small-range features are ignored. Always standardise or min-max scale, and do it inside a pipeline so the scaler is fit on training folds only.

2. Choosing k by hand instead of by cross-validation

Picking k = 5 "because it feels right" leaves accuracy on the table. Search a range of k values with GridSearchCV and let the cross-validation score decide. Remember to use an odd k for binary classification to avoid tie votes.

3. Using an enormous k

Setting k too large drags in far-away, irrelevant points and oversmooths the boundary until the model just predicts the majority class everywhere. If your best cross-validated k is huge relative to n, KNN is probably the wrong tool for this data.

4. Applying KNN to very high-dimensional data

With hundreds or thousands of features, the curse of dimensionality makes "nearest" meaningless. Reduce dimensions with PCA or feature selection first, or switch to an algorithm that handles high dimensions gracefully.

5. Expecting fast predictions on large datasets

Because KNN computes distances to every stored point at prediction time, inference is slow and memory-hungry on millions of rows. For production at scale, use approximate nearest-neighbour libraries, KD-Tree/Ball-Tree structures (set algorithm="kd_tree" or "ball_tree"), or a different model altogether.

6. Ignoring irrelevant features and outliers

KNN treats every feature as equally meaningful in the distance sum, so noisy or irrelevant columns corrupt the neighbourhood. Drop unhelpful features, and remember that a single mislabelled point can flip predictions when k is small — weighted voting (weights="distance") mitigates this.

Practice Exercises

By hand, compute both the Euclidean and Manhattan distance between the points (3, 4, 0) and (0, 0, 5). State which metric gives the larger value here and why.
Load the wine dataset from scikit-learn. Build a pipeline of StandardScaler plus KNeighborsClassifier and use GridSearchCV to find the best k from [1, 3, 5, ..., 25]. Report the best k and its cross-validation accuracy.
Train two KNN classifiers on the same data: one with StandardScaler and one without. Compare their test accuracies and explain the gap in terms of distance domination.
For a fixed dataset, plot cross-validation accuracy against k for k in [1, 3, 5, ..., 31]. Identify the region of overfitting (small k) and oversmoothing (large k), and pick the k you would deploy.
Using kneighbors, retrieve the 5 nearest neighbours of any one test point and print their classes. Verify by hand that the predicted label matches the majority vote of those neighbours.
Explain, in two or three sentences, why KNN accuracy would likely fall if you added 200 columns of pure random noise to a dataset, referencing the curse of dimensionality.

Summary

In this chapter you learned:

KNN predicts by consulting the k closest training points: majority vote for classification, average for regression. It assumes nearby points share outcomes.
It is a lazy, instance-based, non-parametric learner: training just memorises the data, so all the cost falls on prediction time.
Predictions depend on a distance metric — Euclidean (p=2), Manhattan (p=1), and the general Minkowski form that unifies them.
Feature scaling is essential: because distance sums features, an unscaled large-range feature dominates. Always scale, inside a pipeline to avoid leakage.
Choosing k trades bias against variance: small k overfits (noisy boundary), large k oversmooths (flat boundary). Tune it with cross-validation and use an odd k for binary problems.
KNeighborsClassifier and KNeighborsRegressor implement it in scikit-learn; weights="distance" lets closer neighbours count more.
The curse of dimensionality cripples KNN in high dimensions — reduce features with PCA or feature selection first.
Watch for: forgetting to scale, hand-picking k, huge k, high-dimensional data, slow inference on big data, and irrelevant features or outliers.

KNN's greatest gift is intuition: it is the algorithm that most closely mirrors how people actually reason by analogy, which makes it a superb baseline and a clear window into what "learning from similarity" really means.

Next up: Decision Trees — swap distance-based voting for a flowchart of yes/no questions that splits the data into pure regions, giving you a model you can literally read top to bottom.