Dimensionality Reduction & PCA

What Is Dimensionality Reduction?

Dimensionality reduction is the process of taking data with many features (columns) and re-expressing it using far fewer features, while keeping as much of the useful information as possible.

Think of it like a good summary. A 300-page book has enormous "dimensionality" — every word matters in principle — but a well-written one-page synopsis captures the plot, the characters, and the ending. You lose some detail, but you keep what matters. Principal Component Analysis (PCA) is the most widely used technique for producing that kind of compressed summary of numerical data.

A dataset with d features is a cloud of points living in d-dimensional space. Dimensionality reduction finds a smaller space (say 2 or 10 dimensions) that the cloud can be projected into with minimal distortion.

Examples:
→ 784 pixel columns of a handwritten digit image → compressed to 40 features
→ 60 financial ratios per company → 5 components fed to a clustering model
→ 100 survey questions → 2 components you can plot on a scatter chart

Why Reduce Dimensions?

There are several practical reasons, and most real projects hit at least one of them.

The Curse of Dimensionality

As the number of features grows, the volume of the space explodes and your data points become sparse — every point ends up roughly equidistant from every other point. Distance-based methods (see the K-Nearest Neighbors (KNN) and K-Means Clustering chapters) rely on "near" and "far" being meaningful, and that breaks down in very high dimensions.

Rough intuition:
- In 2D, 100 points can reasonably fill a unit square.
- In 10D, you would need roughly 100^5 = 10^10 points for the same density (10 per axis).
- Distances between all pairs of points converge → "nearest" loses meaning.
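The distance-concentration effect can be checked numerically. The following is a minimal sketch, assuming uniformly random points in a unit hypercube; the helper name relative_contrast and the choice of 1000 points are illustrative, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

The relative contrast shrinks as the dimension grows: the farthest point is barely farther away than the nearest one, which is what "nearest loses meaning" refers to.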

Noise, Compute, and Redundancy

Noise: Many features are noisy or nearly constant. Dropping the low-variance directions often improves a downstream model by removing junk.
Compute and memory: Fewer features mean faster training, smaller models, and lower storage — important when you have millions of rows.
Redundancy / correlation: Highly correlated features (height_cm and height_inches, or revenue and profit) carry overlapping information. PCA collapses correlated features into shared directions.
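To see redundancy collapse in practice, here is a small sketch on synthetic data (the columns and their distributions are made up for illustration). Two of the three columns are the same height in different units, so after scaling they should share a single direction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=500)
height_inches = height_cm / 2.54          # same information, different unit
weight_kg = rng.normal(70, 12, size=500)  # generated independently of height

X = np.column_stack([height_cm, height_inches, weight_kg])
X_scaled = StandardScaler().fit_transform(X)

ratios = PCA().fit(X_scaled).explained_variance_ratio_
print(np.round(ratios, 3))
```

The last ratio comes out as essentially zero: three columns, but only two directions that carry any variance.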

Visualisation

Humans can only see 2D and 3D. If you have 50 features and want to look at the structure of your data — clusters, outliers, class separation — you must project it down to 2 or 3 dimensions first. This is one of the most common everyday uses of PCA.

The Intuition Behind PCA

PCA answers one question: in which directions does the data vary the most?

Imagine a scatter of points that form a long, tilted, cigar-shaped cloud. The data clearly stretches most along the length of the cigar, and only a little across its width. PCA finds that long axis first — this is Principal Component 1 (PC1), the direction of maximum variance. It then finds PC2, the direction of the next-most variance that is perpendicular (orthogonal) to PC1, and so on.

Key properties of principal components:
1. PC1 captures the most variance possible in a single direction.
2. PC2 is orthogonal to PC1 and captures the most of the REMAINING variance.
3. Each later component is orthogonal to all earlier ones.
4. There are up to d components for d original features, but the first
   few usually capture the bulk of the total variance.

Projection is the second half of the idea. Once you have the principal components (the new axes), you re-express each data point by its coordinates along those axes. Keep only the top k components and you have reduced d features to k features. Because the top components hold most of the variance, the projected points still preserve most of the original shape.
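The cigar-shaped cloud and the projection step can be reproduced in a few lines. This is an illustrative sketch on synthetic 2D data (the 30-degree tilt and the spread values are arbitrary choices): it checks that PC1 and PC2 are orthogonal, that projecting is just a dot product with the component directions, and how much shape survives when only PC1 is kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# A tilted, cigar-shaped 2D cloud: long along one axis, thin along the other.
stretched = rng.normal(size=(300, 2)) * np.array([5.0, 0.5])
angle = np.deg2rad(30)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X = stretched @ rotation.T

pca = PCA(n_components=2).fit(X)
pc1, pc2 = pca.components_

print("PC1 direction:", np.round(pc1, 2))            # along the long axis (sign is arbitrary)
print("PC1 . PC2 =", round(float(pc1 @ pc2), 10))    # orthogonal, so approximately 0

# Projection by hand: centre the data, then take dot products with the new axes.
X_centered = X - pca.mean_
coords_manual = X_centered @ pca.components_.T
coords_sklearn = pca.transform(X)
print(np.allclose(coords_manual, coords_sklearn))

# Keep only PC1 (k = 1) and map back to 2D to see how much shape survives.
X_approx = np.outer(coords_manual[:, 0], pc1) + pca.mean_
relative_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X_centered)
print("relative reconstruction error with k=1:", round(float(relative_error), 3))
```

The reconstruction error is small because almost all of the spread lies along the long axis, so one coordinate per point is nearly enough to describe the cloud.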

Analogy: photographing a 3D object.
A photo is a 2D projection of a 3D scene. A good photographer picks the
angle (the projection direction) that shows the object's shape best —
you can still recognise a chair from a single well-chosen photo.
PCA picks that "best angle" automatically, by maximising variance.

The Math, at a High Level

PCA is computed from the covariance matrix of the (scaled) features. Its eigenvectors are the principal components — the new axis directions — and each eigenvector's eigenvalue measures how much variance lies along that direction. Larger eigenvalue = more important component.

At a high level:
- Covariance matrix C summarises how features vary together.
- Solve for eigenvectors v and eigenvalues lambda:  C v = lambda v
- Eigenvectors  = principal component directions (orthogonal)
- Eigenvalues   = variance captured along each direction
- Sort by eigenvalue descending → PC1, PC2, PC3, ...

(The linear-algebra details of eigenvectors and eigenvalues are covered
in the Statistics tutorial. For ML you mostly need the intuition above.)

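You will rarely compute this by hand, since scikit-learn does it for you. Still, knowing that components are eigenvectors of the covariance matrix explains why they are orthogonal (the covariance matrix is symmetric, and eigenvectors of a symmetric matrix can always be chosen orthogonal) and why each explained variance ratio is just an eigenvalue divided by the sum of all eigenvalues.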
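For readers who want to see the link between eigenvectors and PCA directly, the following sketch does the eigen-decomposition with NumPy on the scaled wine data and compares the result with scikit-learn. It is a verification aid under the assumptions stated in the comments, not how scikit-learn computes PCA internally.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the scaled features (columns are variables).
C = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # columns are eigenvectors

manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print(np.allclose(eigenvalues, pca.explained_variance_))
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))

# Directions match up to sign: |v . component| should be 1 for each pair.
alignment = np.abs(np.sum(eigenvectors.T * pca.components_, axis=1))
print(np.allclose(alignment, 1.0))
```

The sign of an eigenvector is arbitrary, which is why the comparison uses the absolute value of the dot product rather than requiring the vectors to be identical.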

You MUST Scale Before PCA

This is the single most important practical rule of PCA. PCA is driven by variance, and variance depends on units.

Suppose one feature is annual salary in rupees (values around 500000) and another is years of experience (values around 5). The salary column's raw variance is astronomically larger simply because the numbers are bigger, not because it is more informative. PCA would align PC1 almost entirely with salary and all but ignore experience.

Fix: standardise every feature to mean 0 and standard deviation 1 first.

z = (x − mean) / std_dev

After standardisation, every feature contributes on an equal footing and
PCA measures genuine shared variation, not accidental unit differences.
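A minimal sketch of the formula above, checked against scikit-learn's StandardScaler. The five rows are hypothetical toy values chosen only to show the difference in scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 = salary in rupees, column 1 = years of experience.
X = np.array([
    [350_000.0, 2.0],
    [480_000.0, 4.0],
    [520_000.0, 5.0],
    [610_000.0, 7.0],
    [900_000.0, 12.0],
])

print("raw variances:", X.var(axis=0))

# z = (x - mean) / std_dev, computed column by column.
# NumPy's default std is the population standard deviation (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print("scaled means:", np.round(z_sklearn.mean(axis=0), 6))
print("scaled std devs:", np.round(z_sklearn.std(axis=0), 6))
```

The raw variances differ by many orders of magnitude, while after scaling both columns have mean 0 and standard deviation 1.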

Use StandardScaler, and always fit the scaler on the training data only, then apply it to test data — the same discipline taught in the Feature Engineering & Scaling chapter. Doing this inside a Pipeline prevents leakage automatically.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale first, then PCA — always in this order.
pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=42)),
])
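As a usage sketch, this is roughly how such a pipeline would be applied with a train/test split. The wine dataset and the 25% test size are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=42)),
])

# fit_transform learns the scaler statistics and the components from training rows only.
X_train_2d = pca_pipeline.fit_transform(X_train)

# transform reuses those learned values; nothing is re-estimated from the test rows.
X_test_2d = pca_pipeline.transform(X_test)

print(X_train_2d.shape, X_test_2d.shape)
```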

Explained Variance and Choosing k

Every component reports how much of the total variance it captures. In scikit-learn this is the explained_variance_ratio_ attribute — an array that sums to at most 1.0 (that is, 100% of the variance).

There are two standard ways to choose the number of components k.

1. Cumulative Variance Threshold

Add up the explained-variance ratios until you cross a target such as 95%. Keep just enough components to explain that much variance.

Example explained_variance_ratio_ for 8 features:
PC1: 0.52   cumulative: 0.52
PC2: 0.23   cumulative: 0.75
PC3: 0.11   cumulative: 0.86
PC4: 0.06   cumulative: 0.92
PC5: 0.04   cumulative: 0.96  ← crosses 0.95 here
PC6: 0.02   cumulative: 0.98
PC7: 0.01   cumulative: 0.99
PC8: 0.01   cumulative: 1.00

To retain 95% of the variance, keep k = 5 components (8 → 5).
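The same arithmetic in NumPy, using the example ratios from the table above:

```python
import numpy as np

ratios = np.array([0.52, 0.23, 0.11, 0.06, 0.04, 0.02, 0.01, 0.01])
cumulative = np.cumsum(ratios)

threshold = 0.95
# argmax returns the index of the first True; add 1 to turn a 0-based index into a count.
k = int(np.argmax(cumulative >= threshold)) + 1
print(k)   # → 5
```

One caveat with this idiom: if no entry reaches the threshold, argmax returns 0 and k would wrongly come out as 1, so it is worth checking that the last cumulative value is at least the threshold.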

2. The Scree Plot (Elbow)

Plot each component's explained variance in descending order. The curve usually drops steeply and then flattens — the elbow marks the point of diminishing returns. Components after the elbow add little and are mostly noise.

Scree plot shape (explained variance vs component number):

var │ *
    │  *
    │   *
    │     *  ← elbow: keep components up to here
    │       * * * *   (flat tail = noise)
    └──────────────── component number
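A scree plot along these lines can be drawn with matplotlib. This is one possible sketch on the scaled wine data; the output filename and the added cumulative curve are choices made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
ratios = PCA().fit(X_scaled).explained_variance_ratio_
component_numbers = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(component_numbers, ratios, marker="o", label="per component")
ax.plot(component_numbers, np.cumsum(ratios), marker="s", linestyle="--", label="cumulative")
ax.axhline(0.95, color="grey", linewidth=0.8)   # 95% reference line
ax.set_xlabel("component number")
ax.set_ylabel("explained variance ratio")
ax.set_title("Scree plot (wine dataset, scaled)")
ax.legend()
fig.savefig("scree_plot.png", dpi=120)   # or plt.show() in an interactive session
```

Plotting the cumulative curve on the same axes lets you read the elbow and the variance threshold from one picture.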

A convenient shortcut in scikit-learn: pass a float to n_components. PCA(n_components=0.95) automatically keeps the smallest number of components that explains at least 95% of the variance.
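A quick check of the float shortcut on the scaled wine data:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

pca_95 = PCA(n_components=0.95).fit(X_scaled)

# n_components_ (trailing underscore) is the number of components actually kept.
print(pca_95.n_components_)
print(round(float(pca_95.explained_variance_ratio_.sum()), 3))
```

The fitted attribute n_components_ tells you how many components the threshold selected, so you do not have to compute the cumulative sum yourself.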

PCA in scikit-learn

Here is the core workflow: scale, fit PCA, inspect explained variance, and transform.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a real dataset: 178 wines described by 13 chemical features.
data = load_wine()
X, y = data.data, data.target
print(X.shape)          # (178, 13)  → 13 features

# Step 1: MUST scale first (keep the fitted scaler so it can be reused on new data)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: fit PCA (keep all components initially to study variance)
pca_full = PCA(random_state=42)
pca_full.fit(X_scaled)

# Step 3: inspect explained variance
ratios = pca_full.explained_variance_ratio_
print(np.round(ratios[:5], 3))
# → [0.362 0.192 0.111 0.071 0.066]

cumulative = np.cumsum(ratios)
print(np.round(cumulative[:7], 3))
# → [0.362 0.554 0.665 0.736 0.802 0.851 0.893]

# How many components to reach 95%?
k = np.argmax(cumulative >= 0.95) + 1
print("Components for 95% variance:", k)   # → 10 (13 features → 10)

Once you know k, refit PCA with that many components and transform the data. fit_transform learns the components and projects in one call.

# Reduce to the chosen number of components
pca = PCA(n_components=10, random_state=42)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                 # (178, 10)
print(pca.explained_variance_ratio_.sum())  # ≈ 0.96

# On NEW/test data, only call transform (never fit again):
# X_test_reduced = pca.transform(scaler.transform(X_test))

The reduced array X_reduced can now be fed to any downstream model — a classifier, a clusterer, a regressor — exactly like ordinary features.
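As an illustration of feeding reduced features to a model, the sketch below chains scaling, PCA, and a classifier. The choice of LogisticRegression and the 25% test split are assumptions for the example; any estimator could take its place.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)   # scaler and PCA are fitted on training rows only

test_accuracy = model.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

Putting the classifier in the same Pipeline means the test rows pass through the already-fitted scaler and PCA when score is called.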

Using PCA to Visualise Data in 2D

One of the most common uses of PCA is squeezing many features down to two so you can plot them and see the structure.

import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline

# Scale + PCA to exactly 2 components for plotting
viz = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=42)),
])
X_2d = viz.fit_transform(X)            # shape (178, 2)

# How much did we keep with just 2 dimensions?
ev = viz.named_steps["pca"].explained_variance_ratio_
print(np.round(ev, 3), "sum:", round(ev.sum(), 3))
# → [0.362 0.192] sum: 0.554   (2 components capture ~55% of variance)

plt.figure(figsize=(7, 5))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=40)
plt.xlabel("PC1 (36.2% variance)")
plt.ylabel("PC2 (19.2% variance)")
plt.title("Wine dataset projected onto 2 principal components")
plt.colorbar(label="wine class")
plt.show()

Even though 2 components hold only about 55% of the variance, the three wine classes usually separate into visibly distinct clusters, which is a good early sign that the classes are learnable before you train a model.

Beyond PCA: t-SNE and UMAP

PCA is linear — it can only find straight-line directions. When structure is curved or tangled (data lying on a spiral or a folded sheet), linear projections struggle. For visualisation of such non-linear structure, two techniques are popular.

t-SNE (t-distributed Stochastic Neighbor Embedding): excellent at revealing local clusters in 2D. Great for eyeballing groups, but distances between clusters and cluster sizes are not meaningful, and it is slow on large data.
UMAP (Uniform Manifold Approximation and Projection): similar spirit to t-SNE, usually faster, and preserves more of the global structure.

Rule of thumb:
- Need reduced features to FEED a model, or reversible/interpretable
  compression, or speed → use PCA.
- Need a pretty 2D picture that separates non-linear clusters
  → use t-SNE or UMAP (for viewing only, not as model inputs).

t-SNE and UMAP are almost always used purely to look at data, not to produce features for training — their output is not stable or reversible the way PCA's is.
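For comparison with the PCA plot, here is a minimal t-SNE sketch using scikit-learn's TSNE on the scaled wine data. The perplexity value is simply a common default, not a tuned setting. UMAP lives in a separate third-party package and is not shown.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
print(X_tsne.shape)   # → (178, 2)

# Unlike PCA, scikit-learn's TSNE has no transform method for unseen rows.
print(hasattr(tsne, "transform"))   # → False
```

The missing transform method is one concrete reason t-SNE output is awkward as a model input: new rows cannot be mapped into an existing embedding without refitting.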

Pros, Cons, and When to Use PCA

Aspect	Detail
Strength: decorrelation	Produces uncorrelated (orthogonal) components; removes redundancy
Strength: speed	Fast, deterministic, scales to large datasets
Strength: denoising	Dropping low-variance components often removes noise
Strength: visualisation	Standard tool for projecting to 2D or 3D
Weakness: linear only	Cannot capture curved / non-linear structure
Weakness: interpretability	Components are weighted mixes of original features, not real-world quantities
Weakness: variance is not importance	High-variance directions are not always the ones useful for prediction
Requirement	Features must be scaled first; works on numeric data only
Use when	Many correlated numeric features, need speed, denoising, or a 2D plot
Avoid when	You need feature-level interpretability, or structure is highly non-linear

The Interpretability Caveat

This deserves emphasis. A principal component is a linear combination of all original features. With made-up weights for illustration, PC1 = 0.31*alcohol − 0.24*malic_acid + 0.14*ash + ... and so on across every feature. It is not "the alcohol feature." You can inspect these weights (called loadings, available in pca.components_) to guess what a component roughly represents, but a change in a component's value no longer translates into a statement like "feature X increased by 5." If a stakeholder needs to know exactly which original variable drove a decision, as is common in credit scoring or healthcare, PCA may cost you more than it saves.

# Loadings: how each original feature contributes to each component
import pandas as pd

loadings = pd.DataFrame(
    pca.components_[:2].T,           # first two components
    columns=["PC1", "PC2"],
    index=data.feature_names,
)
print(loadings.round(2))
# Large absolute values → that feature strongly shapes the component.

Common Mistakes

1. Not Scaling Before PCA

The number-one error. Without StandardScaler, features with large numeric ranges (salary, revenue) dominate every component and small-range features (ratios, counts) are ignored. Always scale first, ideally inside a Pipeline.
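The effect is easy to demonstrate on the wine data by fitting PCA with and without scaling. A short sketch:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = data.data

raw_ratios = PCA().fit(X).explained_variance_ratio_
scaled_ratios = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("raw   :", np.round(raw_ratios[:3], 3))
print("scaled:", np.round(scaled_ratios[:3], 3))

# Which raw feature has the largest variance? That one dominates the unscaled PC1.
largest = int(np.argmax(X.var(axis=0)))
print("largest-variance feature:", data.feature_names[largest])
```

On the raw data nearly all of the variance lands in PC1, driven by the single feature with the largest numeric range; after scaling it is spread across several components.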

2. Fitting PCA (or the Scaler) on the Full Dataset

Calling fit on train + test together leaks information from the test set into your components. Fit on training data only, then transform the test data. A Pipeline inside cross-validation handles this correctly — see Train-Test Split & Cross-Validation.

3. Over-Reducing and Throwing Away Signal

Cutting to 2 components just because it plots nicely can discard variance your model needs. Use a variance threshold (like 95%) for modelling, and reserve the aggressive 2D reduction for visualisation only.

4. Treating Components as Original Features

Writing "PC1 is the alcohol level" is wrong — PC1 is a mixture of all features. Do not report a principal component as if it were a single interpretable variable.

5. Assuming High Variance Means High Predictive Power

PCA maximises variance, not correlation with the target. Occasionally the signal a downstream classifier needs sits in a low-variance component that a variance threshold would discard. If accuracy falls after PCA, keep more components or reconsider whether PCA belongs in the pipeline.
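One way to check this is to treat the number of components as a hyperparameter and let cross-validation judge it. The sketch below assumes a LogisticRegression classifier and an arbitrary grid of k values; because the scaler and PCA sit inside the Pipeline, they are refitted on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a parameter inside the pipeline.
search = GridSearchCV(
    pipeline,
    param_grid={"pca__n_components": [2, 5, 10, 13]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"k={params['pca__n_components']:>2}  mean CV accuracy={score:.3f}")
print("best k:", search.best_params_["pca__n_components"])
```

If the scores keep rising as k grows, the dropped components were carrying signal the classifier could use.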

6. Running PCA on Non-Numeric or Un-Encoded Data

PCA needs numeric input. One-hot encoded categorical columns and PCA interact awkwardly (binary variance is not comparable to continuous variance). Handle categoricals thoughtfully, or apply PCA only to the continuous block of features.
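Applying PCA only to the continuous block can be done with a ColumnTransformer. The following is a sketch on a hypothetical toy table; the column names and values are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "income":  [32.0, 45.5, 51.0, 38.2, 77.9, 60.3],
    "debt":    [5.1, 9.8, 12.0, 6.4, 20.5, 15.2],
    "savings": [2.0, 4.1, 6.3, 3.0, 12.8, 8.9],
    "region":  ["north", "south", "south", "east", "north", "east"],
})
numeric_cols = ["income", "debt", "savings"]
categorical_cols = ["region"]

preprocess = ColumnTransformer(
    [
        # Continuous block: scale, then compress with PCA.
        ("num_pca", Pipeline([
            ("scaler", StandardScaler()),
            ("pca", PCA(n_components=2)),
        ]), numeric_cols),
        # Categorical block: one-hot encode and leave PCA out of it.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,   # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)   # → (6, 5): 2 principal components + 3 one-hot columns
```

The one-hot columns pass through untouched, so their binary variance never competes with the continuous features inside PCA.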

Practice Exercises

1. A dataset has 40 features. After scaling and running full PCA you get a cumulative explained-variance array whose values first cross 0.90 at index 11 and 0.95 at index 17 (0-based). How many components do you keep for a 90% threshold? For 95%? Explain the trade-off.
2. Load the load_breast_cancer dataset from scikit-learn. Build a Pipeline of StandardScaler then PCA(n_components=2), transform the data, and make a 2D scatter plot coloured by the diagnosis label. Report the summed explained variance of the two components.
3. Take any dataset and fit PCA twice: once on raw features and once on scaled features. Print explained_variance_ratio_ for both and explain, in one sentence, why the raw-feature version concentrates almost all variance in PC1.
4. For a fitted PCA, print pca.components_[0] (the PC1 loadings) alongside the feature names. Identify the three features with the largest absolute loadings and describe, in plain language, what PC1 seems to represent.
5. Compare PCA(n_components=0.95) with manually choosing k from a cumulative-variance calculation on the same scaled data. Confirm they select the same number of components and explain what the float argument does.
6. Conceptual: A colleague reduces 50 features to 2 with PCA, trains a classifier on the 2 components, and reports lower accuracy than on the full 50 features. Give two distinct reasons this could happen and how you would diagnose each.

Summary

In this chapter you learned:

Dimensionality reduction re-expresses many features with fewer, motivated by the curse of dimensionality, noise, compute cost, and the need to visualise data in 2D or 3D.
PCA finds orthogonal principal components — directions of maximum variance — and projects the data onto the top k of them; components are the eigenvectors of the covariance matrix and their eigenvalues measure captured variance.
You MUST scale features (StandardScaler) before PCA, because PCA is variance-driven and raw units would dominate; do it inside a Pipeline and fit on training data only.
explained_variance_ratio_ tells you how much variance each component holds; choose k with a cumulative threshold (e.g. 95%) or the scree-plot elbow — or let PCA(n_components=0.95) decide.
In scikit-learn, fit_transform learns and projects in one step; on new data use only transform. PCA is a standard tool for 2D visualisation.
t-SNE and UMAP handle non-linear structure but are for viewing data, not for producing model features.
PCA is fast, decorrelating, and denoising, but linear and hard to interpret — components are mixtures of all features (loadings in pca.components_), so never treat a component as a single original variable.
Common mistakes: not scaling, leaking by fitting on all data, over-reducing, mislabelling components, and assuming variance equals predictive power.

Dimensionality reduction gives you cleaner, faster, more visualisable data — but a smaller feature set still needs to be judged by how well the final model performs.

Next up: Model Evaluation Metrics — how to measure whether a model is actually any good, from accuracy, precision, recall, and F1 to ROC-AUC, confusion matrices, and the right metric for each problem.