Chapter 15 of 20

Hierarchical Clustering & DBSCAN

Two clustering methods beyond k-means — agglomerative hierarchical clustering with linkage and dendrograms, and density-based DBSCAN that finds arbitrary shapes and labels outliers, with scikit-learn examples and a comparison table.

Meritshot16 min read
Machine LearningHierarchical ClusteringDBSCANDendrogramDensity ClusteringUnsupervised Learning
All Machine Learning Chapters

Why Look Beyond K-Means?

In the previous chapter you learned K-Means Clustering: pick k, drop k centroids, assign each point to its nearest centroid, repeat. It is fast and famous — but it makes two big assumptions that often fail in real data.

  • It needs you to know k in advance.
  • It assumes clusters are round, roughly equal-sized blobs, because every point simply joins the nearest centre.

Real data is messier. A rider-density map of Bengaluru is not made of tidy circles; customer segments do not come in equal sizes; and fraud or sensor faults show up as outliers that k-means is forced to shove into some cluster anyway. This chapter covers two methods that relax those assumptions.

  • Hierarchical clustering builds a tree of clusters and lets you decide k after looking at the structure — no upfront guess needed.
  • DBSCAN groups points by density, so it finds arbitrary shapes (crescents, rings, snakes) and explicitly labels sparse points as noise / outliers instead of forcing them into a cluster.

Intuitive analogy. Imagine organising a wedding guest list. Hierarchical thinking is genealogical: start with individuals, merge siblings into families, families into extended families, and so on up to the whole clan — then "cut" the family tree at whatever level of grouping you want. DBSCAN thinking is like spotting friend-circles at the reception: wherever people are packed tightly together, that is a group; a couple standing alone in a corner belongs to no group at all — they are noise, and that is a perfectly valid answer.

Both are unsupervised — there are no labels, only the structure the data reveals.

Hierarchical (Agglomerative) Clustering

The most common form is agglomerative (bottom-up) hierarchical clustering. It works like this:

1. Start: every point is its own cluster (n points → n clusters).
2. Find the two CLOSEST clusters and merge them into one.
3. Repeat step 2 — each round reduces the cluster count by one.
4. Stop when everything is in a single cluster (1 big cluster).

Because it records the entire sequence of merges, you get a full hierarchy. You never had to commit to a number of clusters while running the algorithm — you choose k afterwards by deciding where to "cut" the tree.

There is also a top-down variant called divisive clustering (start with one cluster, keep splitting), but it is rarely used in practice, so we focus on the agglomerative version.

Linkage: How "Distance Between Clusters" Is Defined

Step 2 says "find the two closest clusters" — but what does distance between two groups of points mean? That choice is called the linkage criterion, and it dramatically changes the shapes you get. Let A and B be two clusters.

Single linkage    : distance = MIN distance between any point in A and any point in B
Complete linkage  : distance = MAX distance between any point in A and any point in B
Average linkage   : distance = AVERAGE of all pairwise A–B distances
Ward's linkage    : merge the pair that increases total within-cluster variance the LEAST

Each behaves differently:

LinkageHow it measures cluster distanceTends to produceWatch out for
SingleNearest pair (min)Long, straggly "chains"; can follow curved shapesChaining — two blobs joined by a thin bridge merge into one
CompleteFarthest pair (max)Compact, roughly equal-diameter clustersSensitive to outliers (one far point inflates the max)
AverageMean of all pairsA balance between single and completeLess interpretable than the extremes
WardMinimises variance increaseCompact, similar-sized clusters (very k-means-like)Requires Euclidean distance; assumes spherical-ish clusters

Ward's method is the most popular default for general use because it produces clean, balanced clusters. Single linkage is the one to reach for when clusters are genuinely elongated or non-convex — but beware the chaining effect.

The Dendrogram: Reading and Cutting the Tree

Every agglomerative run produces a dendrogram — a tree diagram where the height of each merge equals the distance at which two clusters joined. Low merges = very similar points joined early; high merges = dissimilar groups joined late.

Height (merge distance)
  |
  |                 ┌──────────────┐          <- tall bar: two DISSIMILAR
  |                 │              │              groups merge here
  |          ┌──────┘         ┌────┘
  |       ┌──┘             ┌──┘
  |    ┌──┴─┐          ┌───┴──┐
  |  ┌─┴┐ ┌─┴┐       ┌─┴─┐  ┌─┴─┐
     P1 P2 P3 P4     P5 P6  P7 P8    <- individual points at the bottom

To pick k, draw a horizontal cut across the dendrogram. The number of vertical lines the cut crosses is the number of clusters.

Rule of thumb for choosing where to cut:
→ Cut across the TALLEST vertical gap that no horizontal merge crosses.
→ A tall gap means: "merging further would join very dissimilar groups,"
  so stopping there gives well-separated clusters.

This is the hierarchical answer to k-means' elbow method: instead of running many models, you inspect one tree and read k straight off it.

Hierarchical Clustering in scikit-learn

AgglomerativeClustering does the clustering; SciPy is used separately to draw the dendrogram.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Customers of an Indian D2C brand: annual spend (₹000) and visits/month
data = pd.DataFrame({
    "annual_spend": [12, 15, 11, 88, 92, 85, 45, 48, 43, 90],
    "visits":       [ 2,  3,  2, 14, 15, 13,  7,  8,  6, 16],
})

# ALWAYS scale first — distance-based methods are sensitive to feature units
X = StandardScaler().fit_transform(data)

# n_clusters=3 chosen after inspecting the dendrogram (see next snippet)
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

data["cluster"] = labels
print(data)
   annual_spend  visits  cluster
0            12       2        1
1            15       3        1
2            11       2        1
3            88      14        0
4            92      15        0
5            85      13        0
6            45       7        2
7            48       8        2
8            43       6        2
9            90      16        0

Three clear segments emerge: low-spend/low-visit (1), mid-tier (2), and high-value loyalists (0). Note there is no .predict() for new points — agglomerative clustering has no cluster "centres" to compare against, so to label a new customer you must re-fit or fall back to a nearest-neighbour rule.

To choose k visually, draw the dendrogram with SciPy:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, method="ward")   # the full merge history
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.axhline(y=6, color="red", linestyle="--")  # horizontal "cut" line
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.title("Ward Dendrogram — cut crosses 3 lines → k = 3")
plt.show()

The linkage matrix Z from SciPy is the raw record of every merge; AgglomerativeClustering gives you the final labels. In practice you plot the dendrogram, read off k, then set n_clusters=k.

DBSCAN: Density-Based Clustering

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Instead of centres or trees, it uses a simple, powerful idea: a cluster is a dense region of points separated from other dense regions by sparse space. Points in sparse space are labelled noise.

It has just two parameters:

eps         : the radius of a point's neighbourhood (how close counts as "near")
min_samples : the minimum number of points required within eps
              (including the point itself) to call a region "dense"

Every point falls into one of three roles:

Core point     : has AT LEAST min_samples points within distance eps.
                 (It sits in the dense interior of a cluster.)
Border point    : within eps of a core point, but does NOT itself have
                 min_samples neighbours. (It sits on a cluster's edge.)
Noise point    : neither core nor border — too isolated to belong anywhere.
                 DBSCAN labels these as -1.

The algorithm then grows clusters outward from core points: start at an unvisited core point, sweep in every core point reachable through overlapping eps-neighbourhoods, attach the border points on the fringe, and stop. Whatever is left over is noise.

Why DBSCAN Shines

  • No k required. The number of clusters emerges from the density structure.
  • Arbitrary shapes. Because clusters grow by connectivity, DBSCAN happily traces crescents, rings, and spirals that k-means would slice in half.
  • Built-in outlier detection. Sparse points become label -1 — you never have to force a lone anomaly into a cluster. This makes DBSCAN a genuinely useful fraud / fault-detection tool.

DBSCAN in scikit-learn

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving crescents — the classic case k-means fails on
X_raw, _ = make_moons(n_samples=300, noise=0.06, random_state=42)
X = StandardScaler().fit_transform(X_raw)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Clusters found:", n_clusters)
print("Noise points:", n_noise)
Clusters found: 2
Noise points: 4

DBSCAN recovers both crescents correctly and flags a handful of stray points as noise — something k-means with k=2 cannot do, because it would cut each moon straight down the middle. Notice we never told DBSCAN there were two clusters; it discovered that from the data.

Choosing eps and min_samples

These two knobs make or break DBSCAN, so tune them deliberately.

min_samples : rule of thumb → at least (number_of_dimensions + 1).
              A common default is 4 or 5 for 2-D data. Larger values
              make the algorithm more conservative (more noise).

eps         : use a "k-distance plot". For every point, compute the
              distance to its k-th nearest neighbour (k = min_samples),
              sort those distances ascending, and plot them. The sharp
              "elbow / knee" in the curve is a good value for eps.
from sklearn.neighbors import NearestNeighbors
import numpy as np

k = 5  # match your intended min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
kth = np.sort(distances[:, -1])   # distance to the k-th neighbour, sorted

# Plot `kth`; the y-value at the sharp upward bend ≈ good eps
# (a flat stretch that suddenly rises marks the density boundary)

Because DBSCAN uses a single global eps, it struggles when clusters have very different densities — one eps cannot be tight enough for a dense cluster and loose enough for a sparse one at the same time. The variant HDBSCAN (a separate library) addresses this by allowing variable density, and is worth knowing about once the basics click.

When Does Each Method Shine?

None of these three is "best" — they encode different assumptions. Match the method to your data.

ScenarioBest fitWhy
You already know k; clusters are round and similar-sized; data is largeK-MeansFast, scales to millions of rows, simple to explain
You do not know k and want to explore structure at multiple levelsHierarchicalThe dendrogram reveals nested structure and lets you pick k after the fact
Clusters are elongated, curved, or wildly non-sphericalDBSCANGrows clusters by density/connectivity, not by distance to a centre
Data contains outliers you want isolated, not absorbedDBSCANLabels sparse points as noise (-1) instead of forcing membership
Clusters have very different densitiesK-Means or HDBSCANSingle-eps DBSCAN handles one density well; k-means or HDBSCAN cope better
Small dataset, interpretability matters (e.g. taxonomy)HierarchicalThe tree is a rich, human-readable artefact

A quick head-to-head on the properties people care about most:

PropertyK-MeansHierarchical (Agglomerative)DBSCAN
Need to pre-specify k?YesNo (choose by cutting the tree)No
Cluster shape it findsSpherical / convexDepends on linkage (Ward is convex)Arbitrary
Handles noise / outliers?No (forces every point in)No (assigns every point)Yes (labels as -1)
Main parameterskn_clusters, linkageeps, min_samples
Can label brand-new points?Yes (.predict)No (re-fit needed)No (approximate re-fit)
Typical time complexityO(n · k · iters)O(n^2) to O(n^3)roughly O(n log n) with an index
Scales to very large n?ExcellentPoor (memory/time heavy)Moderate to good

The scalability row is the practical dealbreaker: standard agglomerative clustering stores an n x n distance matrix, so a few hundred thousand rows can exhaust memory. K-means and DBSCAN (with a spatial index) scale far better.

A Side-by-Side Worked Comparison

Here is the same non-spherical dataset run through all three, so you can see the difference in one place.

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X_raw, _ = make_moons(n_samples=400, noise=0.07, random_state=0)
X = StandardScaler().fit_transform(X_raw)

km   = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg  = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
db   = DBSCAN(eps=0.25, min_samples=5).fit_predict(X)

for name, lbl in [("KMeans", km), ("Agglo-single", agg), ("DBSCAN", db)]:
    n_clusters = len(set(lbl)) - (1 if -1 in lbl else 0)
    print(f"{name:14s} clusters={n_clusters}  noise={(lbl == -1).sum()}")
KMeans         clusters=2  noise=0     <- cuts each moon in half (wrong grouping)
Agglo-single   clusters=2  noise=0     <- single linkage traces the crescents
DBSCAN         clusters=2  noise=6     <- traces crescents AND isolates 6 stray points

On crescent-shaped data, k-means splits the moons the wrong way (it draws a straight boundary through the middle), while single-linkage hierarchical and DBSCAN both follow the true curved shapes. DBSCAN additionally reports the noisy fringe points as outliers. This is the whole story of this chapter in one experiment.

Common Mistakes

1. Forgetting to scale features

All three methods are distance-based. If annual_spend runs 0–100,000 and visits runs 0–30, spend dominates every distance and visits are effectively ignored. Always standardise (StandardScaler) or normalise before clustering, exactly as covered in the Feature Engineering & Scaling chapter.

2. Treating eps as a magic constant

eps is not transferable across datasets — it depends on scale, dimensionality, and density. Copying an eps from a tutorial (including this one) onto your data will usually give one giant cluster or all-noise. Tune it with a k-distance plot every time.

3. Expecting DBSCAN to handle very uneven densities

A single global eps cannot serve a dense cluster and a sparse cluster simultaneously. If your clusters have very different densities, DBSCAN will either merge the sparse one into noise or split the dense one. Reach for HDBSCAN or k-means instead.

4. Ignoring the -1 noise label in DBSCAN

The -1 label is not a cluster — it is "unassigned." A frequent bug is looping over set(labels) and counting -1 as a real cluster, or plotting it with the others as if it were meaningful. Always subtract it out: len(set(labels)) - (1 if -1 in labels else 0).

5. Cutting a dendrogram at an arbitrary height

Do not just eyeball "somewhere in the middle." Cut across the tallest vertical gap that no merge crosses — that is where clusters are most separated. Cutting through a busy region gives unstable, meaningless groupings.

6. Using single linkage without expecting chaining

Single linkage is powerful for elongated shapes, but the same min-distance rule causes chaining: two distinct blobs joined by even a thin trail of points merge into one cluster. If your clusters look suspiciously merged, try Ward or complete linkage.

Practice Exercises

  1. Given five 1-D points at positions 2, 5, 9, 10, 25, perform agglomerative clustering by hand using single linkage. Write out the merge order and the distance at each merge. At what cut height do you get exactly 2 clusters?

  2. You run AgglomerativeClustering with linkage="ward" and get three roughly equal, compact clusters. You suspect the true clusters are actually two long curved bands. Which linkage would you switch to, and why?

  3. On a dataset, DBSCAN with eps=0.3, min_samples=5 returns almost all points as noise (-1). List two changes to the parameters that could reduce the amount of noise, and explain the trade-off of each.

  4. Explain in one or two sentences why DBSCAN can find two interleaving crescent shapes but k-means with k=2 cannot.

  5. You have 2 million rows and need clusters quickly for a nightly job. Which of the three methods would you rule out on scalability grounds, and which would you try first?

  6. Sketch (in words) a k-distance plot and describe how you would read a good value of eps off it. What does k correspond to in that plot?

Summary

In this chapter you learned:

  • Hierarchical (agglomerative) clustering starts with every point as its own cluster and repeatedly merges the two closest clusters bottom-up, recording the full hierarchy — no need to pre-specify k.
  • The linkage criterion defines cluster-to-cluster distance: single (min, chains and curves), complete (max, compact), average (balanced), and Ward (minimises variance increase, the popular default).
  • A dendrogram visualises the merges; you choose k by drawing a horizontal cut across the tallest uncrossed vertical gap.
  • DBSCAN clusters by density using two parameters — eps (neighbourhood radius) and min_samples (points needed to be dense) — classifying points as core, border, or noise (-1).
  • DBSCAN needs no k, finds arbitrary shapes, and isolates outliers as noise — but a single global eps struggles with clusters of very different densities.
  • Choose by data: k-means for large, round, known-k data; hierarchical to explore unknown structure and pick k afterwards; DBSCAN for non-spherical shapes and outlier-heavy data.
  • Always scale features first, tune eps with a k-distance plot, and never treat the -1 noise label as a real cluster.

Clustering groups rows; next we turn to the columns. When you have dozens or hundreds of features, distances become unreliable and visualisation impossible — so we compress the feature space while keeping the signal.

Next up: Dimensionality Reduction & PCA — how principal component analysis rotates your data onto a handful of high-variance axes to fight the curse of dimensionality, speed up models, and let you actually plot high-dimensional data.