Chapter 14 of 20

K-Means Clustering

Learn how K-Means groups unlabeled data into k clusters — the assign-and-update algorithm, the inertia objective, choosing k with the elbow method and silhouette score, k-means++ initialization, why scaling matters, and a full scikit-learn customer-segmentation workflow.

Meritshot17 min read
Machine LearningK-MeansClusteringUnsupervised LearningElbow MethodSilhouette
All Machine Learning Chapters

What Is K-Means Clustering?

Every algorithm you have met so far in this series — Linear Regression, Logistic Regression, Decision Trees, SVM — is supervised: the training data comes with a correct answer (a label y) and the model learns to reproduce it. K-Means is different. It is unsupervised. There are no labels. You hand the algorithm a pile of unlabeled data points and ask a single question: "Which of these points naturally belong together?"

K-Means clustering partitions your data into k groups (called clusters) so that points inside a cluster are close to one another and far from points in other clusters. Each cluster is summarised by its centroid — the mean position of all points assigned to it. The k in the name is exactly that: the number of clusters, and you must choose it before you start.

Intuitive analogy. Imagine Priya, a store manager, spreading 500 loyalty-card customers out as dots on a large notice board, positioned by how much they spend and how often they visit. She wants to organise them into, say, 4 marketing segments. She sticks 4 pins into the board at rough starting spots. Then she repeats two moves until nothing changes: (1) connect every customer to their nearest pin, and (2) slide each pin to the average location of the customers now attached to it. After a handful of rounds the pins stop moving — and each pin, with its cluster of customers, becomes a segment: "high-value regulars," "occasional big spenders," "frequent bargain hunters," and "dormant customers." That two-step dance is the entire K-Means algorithm.

Goal: find centroid positions and point assignments that make each cluster as tight (internally compact) as possible.

Where K-Means is used:
→ Customer segmentation for targeted marketing (spend vs frequency)
→ Grouping products or SKUs by sales pattern
→ Image colour compression (reduce millions of colours to k)
→ Grouping delivery locations into k zones for route planning
→ Document / news-article grouping by topic

The K-Means Algorithm, Step by Step

K-Means alternates between two steps — assignment and update — until the clusters stabilise. This is a special case of a general strategy called Expectation-Maximization (EM).

Input: data points X, number of clusters k

STEP 0 — INITIALIZE
    Pick k starting centroids (positions in feature space).
    (Naive: pick k random points. Better: k-means++, explained later.)

REPEAT:
    STEP 1 — ASSIGN (the "E" step)
        For every point, compute its distance to all k centroids.
        Assign the point to the NEAREST centroid.
        (This carves the space into k regions.)

    STEP 2 — UPDATE (the "M" step)
        For each cluster, move its centroid to the MEAN
        position of all points currently assigned to it.

UNTIL:
    assignments stop changing (or centroids move less than a tolerance,
    or a maximum number of iterations is reached).

OUTPUT: k centroids + a cluster label for every point.

The reason this works: Step 1 can only lower (or hold) the total distance because each point moves to its closest centroid, and Step 2 can only lower (or hold) it because the mean is the point that minimises the sum of squared distances to a group. Since the objective keeps dropping and cannot go below zero, the algorithm is guaranteed to converge. The catch — and we will return to it — is that it converges to a local optimum that depends on where the centroids started.

A Tiny Walk-Through

1-D toy data (spend in ₹000): [1, 2, 3, 8, 9, 10],  k = 2

INITIALIZE centroids at c1 = 2, c2 = 9 (say).

ASSIGN:
  1,2,3 are closer to c1=2   → cluster A
  8,9,10 are closer to c2=9  → cluster B

UPDATE:
  c1 = mean(1,2,3) = 2.0
  c2 = mean(8,9,10) = 9.0

ASSIGN again: assignments unchanged → CONVERGED.

Final clusters: {1,2,3} around 2.0  and  {8,9,10} around 9.0

The Objective: Inertia (Within-Cluster Sum of Squares)

K-Means is not wandering aimlessly — it is minimising a precise quantity called inertia, also known as the within-cluster sum of squares (WCSS). Inertia is the total squared distance from every point to the centroid of its assigned cluster.

Inertia (WCSS):

J = Σ (over clusters i = 1..k)  Σ (over points x in cluster Cᵢ)  || x − μᵢ ||²

Where:
  Cᵢ = the set of points assigned to cluster i
  μᵢ = the centroid (mean) of cluster i
  || x − μᵢ ||² = squared Euclidean distance from point x to its centroid

Lower J = tighter, more compact clusters.

Two important consequences follow directly from that formula:

  • K-Means uses squared Euclidean distance. That is why the mean is the right centroid (the mean minimises squared distance) and why the geometry assumes round, straight-line notions of closeness.
  • Inertia always decreases as k increases. With k = n (one cluster per point) inertia hits 0. So you can never pick k by simply minimising inertia — you would always pick the largest k. This is the whole reason the elbow method and silhouette score exist.

In scikit-learn, the fitted attribute model.inertia_ gives you J for the final clustering.

k-means++ Initialization

The single biggest weakness of the classic algorithm is sensitivity to the starting centroids. Bad random starts can leave the algorithm stuck in a poor local optimum — for example, two centroids landing inside the same natural blob while a real, separate blob gets split.

k-means++ is a smarter initialization that spreads the starting centroids out before the main loop even begins.

k-means++ initialization:

1. Choose the FIRST centroid at random from the data points.
2. For every remaining point, compute D(x) = distance to the
   nearest centroid already chosen.
3. Choose the NEXT centroid randomly, with probability proportional
   to D(x)²  → points far from existing centroids are far more likely.
4. Repeat steps 2–3 until k centroids are chosen.
5. Run standard K-Means from these well-spread starting points.

Because centroids start far apart, k-means++ converges faster and to better solutions on average. It is the default in scikit-learn (init="k-means++"), and you should almost never turn it off. On top of that, scikit-learn runs the whole procedure several times from different seeds (n_init) and keeps the run with the lowest inertia — a second layer of protection against unlucky starts.

Why Scaling Is Essential

K-Means measures distance, and distance is dominated by whichever feature has the largest numeric range. If one feature is "annual spend in ₹" (0 to 500000) and another is "number of visits" (0 to 50), the spend feature utterly swamps the distance calculation — visits contribute almost nothing, and the clusters become "high spend vs low spend" while ignoring behaviour entirely.

Two customers, features = [annual_spend (₹), visits_per_year]:
  A = [200000, 5]
  B = [201000, 40]

Raw squared distance ≈ (1000)² + (35)²  = 1,000,000 + 1,225
  → the ₹1,000 spend gap dwarfs the huge behavioural gap of 35 visits.

After standard scaling, both features sit on the same scale and
"visits" gets a fair say in the clustering.

Always scale your features before K-Means (typically StandardScaler, or MinMaxScaler when features are bounded). This is not optional polish — it is the difference between meaningful and meaningless clusters. Feature scaling is covered in depth in the Feature Engineering & Scaling chapter; here it is a hard requirement. The one exception is when all features are genuinely on the same natural scale (e.g. pixel intensities 0 to 255).

Choosing k: The Elbow Method

Since you must pick k yourself and inertia always falls as k rises, you need a principled way to choose. The elbow method plots inertia against k and looks for the "elbow" — the point where adding another cluster stops buying you much reduction in inertia.

Fit K-Means for k = 1, 2, 3, ..., 10 and record inertia each time.
Plot inertia (y-axis) vs k (x-axis).

Typical shape:

inertia │*
        │ *
        │  *
        │   *___  ← the "elbow" (here around k = 3 or 4)
        │      *____
        │           *_____*_____*
        └─────────────────────────── k
          1  2  3  4  5  6  7  8

Steep drop before the elbow = real structure being captured.
Flat tail after the elbow = you are just splitting hairs.
Pick k at the elbow: extra clusters beyond it add little.

The elbow method is quick and intuitive, but honest practitioners admit the elbow is often ambiguous — sometimes the curve bends gently and there is no obvious corner. When that happens, lean on the silhouette score instead (or as a tie-breaker).

Choosing k: The Silhouette Score

The silhouette score measures how well each point sits inside its own cluster compared to the nearest neighbouring cluster. Unlike inertia, it rewards clusters that are both tight and well separated, so it does not blindly improve with larger k.

For a single point i:
  a(i) = mean distance from i to all OTHER points in its OWN cluster
         (how tight — smaller is better)
  b(i) = mean distance from i to all points in the NEAREST OTHER cluster
         (how separated — larger is better)

  silhouette(i) = ( b(i) − a(i) ) / max( a(i), b(i) )

Range: −1 to +1
  near +1  → point is well inside its cluster, far from others (great)
  near  0  → point sits on the border between two clusters
  near −1  → point is probably in the WRONG cluster

Overall silhouette score = the average of silhouette(i) over all points.
Choose the k that MAXIMISES this average.

A rough reading guide: an average silhouette above roughly 0.5 suggests reasonable, well-defined structure; values near 0.2 or below suggest the clusters are weak, overlapping, or that k is wrong. Use it together with the elbow method — when both point to the same k, you can be confident.

K-Means in scikit-learn: Customer Segmentation

Here is a realistic end-to-end workflow. Priya's retail chain wants to segment loyalty customers by annual spend and visit frequency so the marketing team can target each group. Note the pipeline with StandardScaler — as stressed above, scaling is mandatory.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score

# --- Illustrative loyalty-customer data ---
# Columns: annual_spend (₹), visits_per_year, avg_basket_value (₹)
df = pd.read_csv("loyalty_customers.csv")

features = ["annual_spend", "visits_per_year", "avg_basket_value"]
X = df[features]

# Scale first — distances must be fair across features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means with k = 4 (chosen via the elbow / silhouette steps below).
kmeans = KMeans(
    n_clusters=4,
    init="k-means++",   # smart initialization (the default)
    n_init=10,          # run 10 times, keep the lowest-inertia result
    max_iter=300,
    random_state=42,
)
labels = kmeans.fit_predict(X_scaled)

df["segment"] = labels
print("Inertia (WCSS):", round(kmeans.inertia_, 1))
print("Silhouette   :", round(silhouette_score(X_scaled, labels), 3))
Illustrative output (numbers shown for shape, not a real benchmark):

Inertia (WCSS): 812.4
Silhouette   : 0.541

Choosing k with the Elbow and Silhouette in Code

inertias, silhouettes = [], []
K = range(2, 9)

for k in K:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    lbl = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, lbl))

for k, inr, sil in zip(K, inertias, silhouettes):
    print(f"k={k}  inertia={inr:8.1f}  silhouette={sil:.3f}")
Illustrative output:

k=2  inertia= 1980.6  silhouette=0.462
k=3  inertia= 1204.3  silhouette=0.508
k=4  inertia=  812.4  silhouette=0.541   ← elbow + highest silhouette
k=5  inertia=  690.1  silhouette=0.497
k=6  inertia=  602.8  silhouette=0.471
k=7  inertia=  548.9  silhouette=0.443
k=8  inertia=  511.2  silhouette=0.419

Inertia drops steeply up to k=4, then flattens (the elbow), and the
silhouette peaks at k=4 too → choose k = 4.

Profiling the Segments (Turning Clusters into Business Meaning)

A cluster label of 0 or 2 means nothing to the marketing team. The real work is interpreting the centroids — converting them back to the original units and giving each segment a name.

# Centroids live in SCALED space — invert the scaling to read real ₹ / counts.
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
profile = pd.DataFrame(centroids, columns=features)
profile["n_customers"] = pd.Series(labels).value_counts().sort_index().values
print(profile.round(0))
Illustrative output (centroids in original units):

   annual_spend  visits_per_year  avg_basket_value  n_customers
0       48000.0             42.0            1150.0          210
1      310000.0             18.0           17200.0           64
2       12000.0              4.0            3000.0          295
3      165000.0             36.0            4600.0          131

Interpretation for the marketing team:
  Segment 0 → "Frequent value shoppers"  (many visits, small baskets)
  Segment 1 → "Premium occasional buyers" (few visits, huge baskets)
  Segment 2 → "Dormant / low-value"        (rare visits, low spend)
  Segment 3 → "Loyal high-value regulars"  (frequent AND high spend)

That final table is the deliverable — not the labels, but the story the centroids tell. Assigning a new customer to a segment later is a single call: scale their features, then kmeans.predict(...).

Limitations of K-Means

K-Means is fast and intuitive, but its assumptions are strong. Knowing where it breaks tells you when to reach for the algorithms in the next chapter.

  • It assumes spherical, roughly equal-size clusters. Because it uses squared Euclidean distance from a centroid, K-Means loves round blobs of similar size. It fails on elongated, crescent, or nested-ring shapes — it will slice them into geometric wedges that ignore the true structure.
  • You must choose k in advance. The algorithm cannot discover the "right" number of clusters on its own; the elbow method and silhouette are heuristics, not guarantees.
  • It is sensitive to initialization. Different random starts can give different clusterings. k-means++ and n_init > 1 mitigate this but do not fully eliminate it.
  • It is sensitive to outliers. Because centroids are means, a single extreme point can drag a centroid far off, distorting a whole cluster. Consider removing outliers first, or use KMedoids / MiniBatchKMeans variants.
  • It is sensitive to feature scaling. Skip scaling and the largest-range feature silently dominates — the single most common cause of nonsense clusters.
  • Every point gets assigned. K-Means has no notion of "noise" or "this point belongs to nothing" — a limitation that DBSCAN, in the next chapter, was designed to solve.

Pros and Cons

AspectDetails
Learning typeUnsupervised — no labels required
What you provideThe number of clusters k and (ideally) scaled features
Distance metricSquared Euclidean (centroids are means)
Cluster shape assumedSpherical, roughly equal size and density
Speed / scaleVery fast; near-linear in points, scales to large data (MiniBatchKMeans for huge sets)
DeterminismNot deterministic across seeds unless random_state is fixed
Handles noise/outliersPoorly — every point is forced into a cluster
InterpretabilityHigh — centroids are readable segment profiles
Needs feature scalingYes — essentially mandatory

Pros

  • Simple to understand and explain to non-technical stakeholders.
  • Fast and scalable; handles large datasets comfortably.
  • Centroids give directly interpretable cluster "profiles."
  • Assigning a new point to a cluster is instant (predict).

Cons

  • You must guess k, and the best k is often ambiguous.
  • Assumes round, equal-size clusters — fails on odd shapes and varying densities.
  • Sensitive to initialization, outliers, and unscaled features.
  • Forces every point into a cluster, with no outlier / noise category.

Common Mistakes

1. Forgetting to scale the features

K-Means is a distance algorithm. If one feature ranges 0–500000 and
another 0–50, the big-range feature dictates every distance and the
small one is ignored. Symptom: clusters split only along the largest
feature. Fix: StandardScaler (or MinMaxScaler) BEFORE fitting — always.

2. Picking k by minimising inertia

Inertia ALWAYS decreases as k grows and hits 0 at k = n. If you choose
k to minimise inertia you will always pick the largest k, which is
meaningless. Use the elbow method and silhouette score instead.

3. Reading centroids in scaled space

After StandardScaler, cluster_centers_ are in scaled units, not ₹ or
counts. Telling stakeholders "this segment spends 1.4" is nonsense.
Fix: scaler.inverse_transform(kmeans.cluster_centers_) before profiling.

4. Running with a single initialization

n_init=1 leaves you at the mercy of one random start and a bad local
optimum. Keep n_init at 10+ (the default is already 10 in modern
scikit-learn) and set random_state for reproducibility.

5. Forcing K-Means onto non-spherical or noisy data

Crescents, rings, elongated blobs, or data with clear outliers will
produce misleading round clusters. Plot the data first (or a 2-D PCA
projection). If the shapes are not round, use DBSCAN or hierarchical
clustering (next chapter) instead.

6. Treating cluster labels as meaningful numbers

The label 0, 1, 2, 3 are arbitrary IDs — they can swap between runs and
carry no order (cluster 3 is not "more" than cluster 1). Never feed raw
cluster IDs into a model as a numeric feature; treat them as categories.

Practice Exercises

  1. For the 1-D data [2, 4, 6, 20, 22, 24] with k = 2 and initial centroids c1 = 4, c2 = 22, run one full assignment-and-update cycle by hand. What are the new centroids, and has it converged?

  2. Explain in your own words why inertia can never be used directly to choose k. What is the value of inertia when k equals the number of data points?

  3. Generate synthetic 2-D blobs with scikit-learn's make_blobs (say 4 centers). Fit KMeans for k from 2 to 8, then plot inertia vs k and silhouette vs k. Do both methods agree on the true k?

  4. Take any customer or product dataset, scale it, fit KMeans with your chosen k, and then use scaler.inverse_transform(kmeans.cluster_centers_) to write a one-line business name for each cluster.

  5. Fit K-Means on the same data twice — once with init="random", n_init=1 and once with init="k-means++", n_init=10. Compare the two inertia_ values and explain the difference.

  6. Create a dataset of two crescent-moon shapes using make_moons. Fit K-Means with k = 2 and describe why the resulting clusters do not match the true moons. Which algorithm from the next chapter would handle this correctly?

Summary

In this chapter you learned:

  • K-Means is unsupervised — it groups unlabeled data into k clusters, each summarised by a centroid (the cluster mean).
  • The algorithm alternates two steps until stable: assign each point to its nearest centroid, then update each centroid to the mean of its points.
  • It minimises inertia (within-cluster sum of squares): J = Σ Σ || x − μᵢ ||², using squared Euclidean distance.
  • Inertia always falls as k grows, so choose k with the elbow method (inertia vs k) and the silhouette score ((b − a) / max(a, b), higher is better), ideally where both agree.
  • k-means++ spreads the initial centroids out for faster, better convergence, and n_init runs multiple seeds — both are scikit-learn defaults.
  • Feature scaling is essentially mandatory, because K-Means is a distance algorithm dominated by large-range features.
  • In scikit-learn: scale, then KMeans(n_clusters=k).fit_predict(X); read clusters via inertia_, cluster_centers_ (invert the scaling), and predict for new points.
  • Limitations: assumes spherical, equal-size clusters; needs you to pick k; sensitive to init, outliers, and scaling; forces every point into a cluster with no noise category.

K-Means is the natural first tool for any clustering task — fast, interpretable, and a strong baseline — as long as you respect its round-cluster assumptions.

Next up: Hierarchical Clustering & DBSCAN — clustering methods that do not force you to pick k upfront (hierarchical dendrograms) and that gracefully handle odd shapes and noisy outliers (density-based DBSCAN), covering exactly the cases where K-Means struggles.