Data Preprocessing & Cleaning

What Is Data Preprocessing?

Data preprocessing is the set of steps that transform messy, real-world data into a clean, numeric, consistent table that a machine learning model can actually learn from. It sits right after data collection in the workflow you saw in The Machine Learning Workflow chapter, and before any model ever sees the data.

Think of it like cooking. You do not throw whole, unwashed vegetables into a pot and call it dinner. You wash them, peel them, chop them to a consistent size, throw out the rotten ones, and measure the salt. Raw data is the unwashed vegetables; preprocessing is the prep work. Skip it and even the fanciest algorithm produces garbage.

Why is raw data never model-ready?

It has holes. Real datasets are full of blanks — a customer skipped a form field, a sensor dropped a reading.
It has repeats. The same record gets logged twice by a buggy pipeline.
It has extreme values. A data-entry typo records a salary of ₹50,00,000 per month instead of per year.
It is not numeric. Most models only understand numbers, but your data has columns like City = "Mumbai" or Plan = "Premium".
It is inconsistent. The same city appears as "Delhi", " delhi ", and "DELHI".

The golden rule of this chapter: every cleaning decision (which value to fill a blank with, which categories exist, where to clip outliers) must be learned from the training data only, then applied to the test/validation data. Learning it from the whole dataset leaks future information into your model. We come back to this repeatedly.

A Realistic Messy Dataset

Let us build a small, dirty customer dataset for an Indian fintech company and clean it end to end.

import numpy as np
import pandas as pd

data = {
    "name":     ["Priya", "Rahul", "Anjali", "Rahul", "Vikram", "Neha", "Karan"],
    "city":     ["Mumbai", "delhi ", "Bengaluru", "delhi ", "MUMBAI", None, "Pune"],
    "age":      [28, 34, np.nan, 34, 41, 29, 250],       # 250 is a bad value
    "salary":   [55000, 72000, 61000, 72000, np.nan, 48000, 90000],  # monthly ₹
    "plan":     ["Basic", "Premium", "Standard", "Premium", "Premium", "Basic", None],
    "churned":  [0, 1, 0, 1, 0, 0, 1],                   # target label
}
df = pd.DataFrame(data)
print(df.shape)
print(df.isna().sum())

(7, 6)
name       0
city       1
age        1
salary     1
plan       1
churned    0
dtype: int64

Notice the problems already: a duplicate row (Rahul), inconsistent city casing/whitespace, a missing city and plan, missing age and salary, and an impossible age = 250. We will fix all of these.

Handling Missing Values

Missing values are the most common defect. You have two broad choices: drop them or impute (fill in) them.

Drop vs Impute

DROP  — remove rows or columns that contain missing values.
        Safe when: very few rows affected, OR a whole column is mostly empty.
        Danger: you throw away real information and can bias the sample.

IMPUTE — fill the gap with a reasonable substitute.
         Safe when: missingness is modest and you can justify the fill value.
         Danger: a bad fill value introduces noise or fake signal.

# Drop rows with ANY missing value
df.dropna()

# Drop a column that is more than 40% missing
threshold = 0.40
df.dropna(axis=1, thresh=int((1 - threshold) * len(df)))

Mean, Median, or Mode?

When you impute a numeric column, the usual fill values are the mean or the median. For a categorical column, you use the mode (most frequent value).

Column type	Fill with	Use when
Numeric, roughly symmetric	Mean	No strong outliers or skew
Numeric, skewed or outlier-prone	Median	Income, prices, counts (robust to extremes)
Categorical	Mode (most frequent)	Any category column
Any (advanced)	Model-based (KNN, regression)	Missingness relates to other features

Rule of thumb: prefer the median for money-like columns (salary, price, transaction amount) because it is not dragged around by a few huge values.

For salary = [55000, 72000, 61000, 72000, NaN, 48000, 90000]:
  mean   of the 6 known values = 66,333
  median of the 6 known values = 66,500

SimpleImputer (scikit-learn)

SimpleImputer learns the fill value with .fit() and applies it with .transform(). This split is exactly what lets us fit on train only.

from sklearn.impute import SimpleImputer

# Numeric imputer: fill with the median learned from training data
num_imputer = SimpleImputer(strategy="median")
df["salary"] = num_imputer.fit_transform(df[["salary"]])

# Categorical imputer: fill with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df["plan"] = cat_imputer.fit_transform(df[["plan"]]).ravel()

print(df[["salary", "plan"]].isna().sum())

salary    0
plan      0
dtype: int64

Available strategies: "mean", "median", "most_frequent", and "constant" (with fill_value=..., e.g. filling text gaps with "Unknown").

KNNImputer (a smarter option)

SimpleImputer fills every gap in a column with the same value. KNNImputer instead looks at the k most similar rows (nearest neighbours across the other features) and fills the gap with their average. It captures relationships between columns — for example, that younger customers tend to earn less — at the cost of more computation. Reach for it when a plain median clearly loses signal.

from sklearn.impute import KNNImputer

knn_imp = KNNImputer(n_neighbors=3)
numeric_cols = ["age", "salary"]
df[numeric_cols] = knn_imp.fit_transform(df[numeric_cols])

Removing Duplicates

Duplicate rows over-weight some records and can leak identical rows across the train/test split, quietly inflating your accuracy.

print("Before:", df.shape)          # Before: (7, 6)
print("Duplicate rows:", df.duplicated().sum())

df = df.drop_duplicates().reset_index(drop=True)
print("After:", df.shape)           # After: (6, 6)

Be deliberate about what "duplicate" means. Two customers can legitimately share an age and city. Often you should de-duplicate on a business key such as a customer ID:

df = df.drop_duplicates(subset=["name"], keep="first")

Fixing Inconsistent / Dirty Data

Before encoding, standardise the content of text columns. Otherwise "delhi ", "Delhi", and "DELHI" become three different categories.

# Strip whitespace and standardise casing
df["city"] = df["city"].str.strip().str.title()   # " delhi " -> "Delhi"

# Fix data types (numbers stored as text are a classic bug)
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")

# Map messy free-text categories to a clean set
plan_map = {"basic": "Basic", "std": "Standard", "premium": "Premium"}
df["plan"] = df["plan"].str.strip().str.lower().map(plan_map).fillna(df["plan"])

print(df["city"].unique())

['Mumbai' 'Delhi' 'Bengaluru' 'Pune']

Common dirty-data fixes to keep on your checklist:

Whitespace: .str.strip() removes leading/trailing spaces.
Casing: .str.lower() / .str.title() unifies case.
Types: pd.to_numeric(..., errors="coerce") turns unparseable text into NaN so you can then impute it.
Units: make sure a column is all the same unit (all monthly salary, not a mix of monthly and annual).

Outlier Detection & Treatment

An outlier is a value that sits far from the rest of the distribution. Some are errors (age = 250); some are real and important (a genuine high-value customer). Your job is to detect them, then decide.

The IQR Method

The Interquartile Range (IQR) method flags values that fall outside a fence built from the quartiles. It is robust and makes no normality assumption.

Q1 = 25th percentile,  Q3 = 75th percentile
IQR = Q3 - Q1

Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR

A value is an outlier if it is below the lower fence OR above the upper fence.

Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

outliers = df[(df["age"] < lower) | (df["age"] > upper)]
print("Fences:", round(lower, 1), "to", round(upper, 1))
print(outliers[["name", "age"]])

Fences: 19.5 to 49.5
    name   age
6  Karan   250

The Z-Score Method

The z-score measures how many standard deviations a value is from the mean. A common cutoff flags values whose absolute z-score is greater than 3. This method assumes a roughly bell-shaped distribution, so it is weaker for heavily skewed data.

z = (x - mean) / std

Flag x as an outlier if abs(z) is greater than 3.

from scipy import stats

z = np.abs(stats.zscore(df["salary"].dropna()))
print((z > 3).sum(), "salary outliers by z-score")

Treating Outliers — and When to Keep Them

Once flagged, you have several options:

Technique	What it does	When to use
Remove the row	Drops the record entirely	Clear data-entry errors, very few of them
Cap / Winsorize	Clips values to the fence (e.g. set to `upper`)	Real but extreme values you want to keep, softened
Transform	Apply `log` to compress the tail	Right-skewed money/count data
Keep as-is	Do nothing	The outlier is genuine and meaningful

# The age = 250 is impossible -> treat as an error and remove
df = df[(df["age"] >= lower) & (df["age"] <= upper)].reset_index(drop=True)

# Alternative: cap salary at the upper fence instead of removing
sal_upper = df["salary"].quantile(0.75) + 1.5 * (
    df["salary"].quantile(0.75) - df["salary"].quantile(0.25))
df["salary"] = df["salary"].clip(upper=sal_upper)

When to keep outliers: if you are building a fraud-detection or anomaly-detection model, the outliers are the signal — removing them destroys the very thing you want to catch. Never delete an outlier just because it is inconvenient; confirm it is an error first.

Encoding Categorical Variables

Models do arithmetic, so text categories must become numbers. The right encoding depends on whether the categories have a meaningful order.

Label / Ordinal Encoding

Ordinal encoding maps each category to an integer. This is correct only when the categories have a natural order, because the model will treat the numbers as having magnitude (2 > 1).

from sklearn.preprocessing import OrdinalEncoder

# plan has a real order: Basic < Standard < Premium
plan_order = [["Basic", "Standard", "Premium"]]
ord_enc = OrdinalEncoder(categories=plan_order)
df["plan_encoded"] = ord_enc.fit_transform(df[["plan"]]).astype(int)

print(df[["plan", "plan_encoded"]].drop_duplicates())

       plan  plan_encoded
0     Basic             0
1   Premium             2
2  Standard             1

LabelEncoder does the same integer mapping but is meant for the target column (y), not features — it does not support the fit/transform split cleanly on multiple feature columns.

One-Hot Encoding

For nominal categories with no order (like city), integer codes would fake an ordering (Pune = 3 > Mumbai = 1?). One-hot encoding instead creates one binary column per category.

city = "Mumbai"    ->  city_Mumbai=1, city_Delhi=0, city_Pune=0, ...
city = "Delhi"     ->  city_Mumbai=0, city_Delhi=1, city_Pune=0, ...

The pandas quick way:

dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
df = pd.concat([df, dummies], axis=1)

The scikit-learn way (preferred inside pipelines, because it remembers the training categories):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
city_encoded = ohe.fit_transform(df[["city"]])
print(ohe.get_feature_names_out(["city"]))

['city_Bengaluru' 'city_Delhi' 'city_Mumbai' 'city_Pune']

Two important settings:

drop_first=True (pandas) or drop="first" (sklearn) removes one redundant column to avoid the dummy variable trap (perfect collinearity), which matters for linear models.
handle_unknown="ignore" tells the encoder how to behave when a category appears in the test set that was never seen in training — it outputs all zeros instead of crashing.

The High-Cardinality Problem

One-hot encoding a column with thousands of unique values (pincode, product_id, merchant_name) explodes your feature count into thousands of sparse columns. This slows training, wastes memory, and hurts many models.

Strategies for high-cardinality columns:

Group rare categories into a single "Other" bucket (e.g. keep the top 20 cities, lump the rest).
Frequency / count encoding: replace each category with how often it appears.
Target encoding: replace each category with the mean of the target for that category — powerful, but a major leakage risk, so it must be fit on training folds only.

ColumnTransformer: Different Steps for Different Columns

Real datasets need different preprocessing per column — impute-and-scale the numbers, impute-and-one-hot the categories. ColumnTransformer applies separate pipelines to separate column groups and stitches the results back together, all inside a single fit/transform object.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "salary"]
categorical_features = ["city", "plan"]

# One pipeline per column type
numeric_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

categorical_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])

Now wire it to the train/test split so that all fitting happens on the training set only — this is the leakage-safe pattern you should default to.

from sklearn.model_selection import train_test_split

X = df[numeric_features + categorical_features]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# fit_transform on TRAIN, transform-only on TEST
X_train_prepped = preprocessor.fit_transform(X_train)
X_test_prepped = preprocessor.transform(X_test)   # <-- no fit here!

print("Train shape:", X_train_prepped.shape)

StandardScaler (used above) is covered in depth in the next chapter, Feature Engineering & Scaling; here it simply shows how numeric and categorical steps live side by side.

Preventing Data Leakage — The Non-Negotiable Rule

Data leakage happens when information from outside the training set sneaks into the model. The most common preprocessing leak is computing a statistic (a median, a scaler's mean, the set of categories) over the entire dataset and only then splitting.

WRONG (leaks the test set's information):
  1. Impute missing values using the median of the FULL dataset
  2. Scale using the mean/std of the FULL dataset
  3. train_test_split(...)

RIGHT (leakage-safe):
  1. train_test_split(...)  <-- split FIRST
  2. preprocessor.fit_transform(X_train)  <-- learn medians/means/categories on TRAIN
  3. preprocessor.transform(X_test)       <-- APPLY the same learned values to TEST

The safest habit is to put every preprocessing step inside a Pipeline / ColumnTransformer and only ever call .fit() on training data. Cross-validation (see Train-Test Split & Cross-Validation) then re-fits the preprocessing on each fold automatically, which is exactly what you want.

Common Mistakes

1. Fitting preprocessing on the whole dataset before splitting

Computing an imputer's median or a scaler's mean over ALL rows and then
splitting leaks test information into training. Always split first, then
fit_transform on train and transform-only on test.

2. One-hot encoding a high-cardinality column blindly

One-hot encoding a pincode column with 5,000 unique values creates 5,000
columns. Group rare categories into "Other", or use frequency/target
encoding (target encoding only inside training folds).

3. Using mean imputation on skewed money columns

Salaries and transaction amounts are right-skewed. The mean is dragged up
by a few large values. Use the median for money-like columns.

4. Ordinal-encoding an unordered category

Encoding city as Mumbai=0, Delhi=1, Pune=2 tells the model Pune is
"greater than" Mumbai. There is no such order -> one-hot encode instead.
Reserve ordinal encoding for genuinely ordered categories (Basic < Premium).

5. Deleting outliers that are real signal

In fraud or anomaly detection the outliers ARE the target. Confirm a value
is a data-entry error before removing it; otherwise cap, transform, or keep.

6. Forgetting handle_unknown for one-hot encoding

If the test set contains a category never seen in training, a naive encoder
crashes. Set handle_unknown="ignore" so unseen categories map to all zeros.

Practice Exercises

Given the column age = [25, 30, NaN, 45, NaN, 33], compute both the mean and median of the known values. Which would you use to impute, and why does it barely matter here but matter a lot for salary = [40000, 45000, NaN, 50000, 900000]?
Write pandas + scikit-learn code that de-duplicates a DataFrame on a customer_id column, keeping the most recent record (assume a signup_date column exists).
For spend = [200, 210, 195, 205, 5000], compute Q1, Q3, IQR, and the upper IQR fence. Is 5000 an outlier? Give one reason you might keep it and one reason you might cap it.
A state column has 28 unique values and a pincode column has 4,000 unique values. Which would you one-hot encode as-is, and how would you handle the other? Justify.
Build a ColumnTransformer for a dataset with numeric columns ["income", "tenure"] and categorical columns ["city", "membership"] (where membership is ordered: Bronze < Silver < Gold). Impute numerics with the median, one-hot the unordered category, and ordinal-encode the ordered one.
Explain in two sentences why calling scaler.fit_transform(X_test) instead of scaler.transform(X_test) is a data-leakage bug, and what the correct call is.

Summary

In this chapter you learned how to turn raw, messy data into a model-ready dataset:

Raw data is never model-ready — it has missing values, duplicates, outliers, text categories, and inconsistencies that must be fixed first.
Missing values: drop when trivial or a column is mostly empty; otherwise impute. Use the median for skewed/money columns, mean for symmetric numerics, mode for categories. SimpleImputer handles the fit/transform split; KNNImputer fills using similar rows.
Duplicates over-weight records and leak across splits — remove them, ideally on a business key.
Outliers: detect with IQR (Q1 - 1.5*IQR to Q3 + 1.5*IQR) or z-score (abs(z) > 3); then remove errors, cap/winsorize, transform, or keep genuine signal.
Dirty data: strip whitespace, unify casing, fix types with pd.to_numeric(..., errors="coerce"), standardise units.
Encoding: ordinal for ordered categories, one-hot (OneHotEncoder / pd.get_dummies) for unordered ones; watch out for high cardinality and set handle_unknown="ignore".
ColumnTransformer applies different pipelines to different column groups in one object.
Prevent leakage: split first, then fit_transform on train and transform only on test — fit every preprocessing step on the training data only.

Clean data beats a clever model almost every time; the effort you invest here pays off in every chapter that follows.

Next up: Feature Engineering & Scaling — creating better features, and putting your numeric columns on a common scale with standardisation, normalisation, and transformations.