What Is Data Preprocessing?
Data preprocessing is the set of steps that transform messy, real-world data into a clean, numeric, consistent table that a machine learning model can actually learn from. It sits right after data collection in the workflow you saw in The Machine Learning Workflow chapter, and before any model ever sees the data.
Think of it like cooking. You do not throw whole, unwashed vegetables into a pot and call it dinner. You wash them, peel them, chop them to a consistent size, throw out the rotten ones, and measure the salt. Raw data is the unwashed vegetables; preprocessing is the prep work. Skip it and even the fanciest algorithm produces garbage.
Why is raw data never model-ready?
- It has holes. Real datasets are full of blanks — a customer skipped a form field, a sensor dropped a reading.
- It has repeats. The same record gets logged twice by a buggy pipeline.
- It has extreme values. A data-entry typo records a salary of
₹50,00,000per month instead of per year. - It is not numeric. Most models only understand numbers, but your data has columns like
City = "Mumbai"orPlan = "Premium". - It is inconsistent. The same city appears as
"Delhi"," delhi ", and"DELHI".
The golden rule of this chapter: every cleaning decision (which value to fill a blank with, which categories exist, where to clip outliers) must be learned from the training data only, then applied to the test/validation data. Learning it from the whole dataset leaks future information into your model. We come back to this repeatedly.
A Realistic Messy Dataset
Let us build a small, dirty customer dataset for an Indian fintech company and clean it end to end.
import numpy as np
import pandas as pd
data = {
"name": ["Priya", "Rahul", "Anjali", "Rahul", "Vikram", "Neha", "Karan"],
"city": ["Mumbai", "delhi ", "Bengaluru", "delhi ", "MUMBAI", None, "Pune"],
"age": [28, 34, np.nan, 34, 41, 29, 250], # 250 is a bad value
"salary": [55000, 72000, 61000, 72000, np.nan, 48000, 90000], # monthly ₹
"plan": ["Basic", "Premium", "Standard", "Premium", "Premium", "Basic", None],
"churned": [0, 1, 0, 1, 0, 0, 1], # target label
}
df = pd.DataFrame(data)
print(df.shape)
print(df.isna().sum())
(7, 6)
name 0
city 1
age 1
salary 1
plan 1
churned 0
dtype: int64
Notice the problems already: a duplicate row (Rahul), inconsistent city casing/whitespace, a missing city and plan, missing age and salary, and an impossible age = 250. We will fix all of these.
Handling Missing Values
Missing values are the most common defect. You have two broad choices: drop them or impute (fill in) them.
Drop vs Impute
DROP — remove rows or columns that contain missing values.
Safe when: very few rows affected, OR a whole column is mostly empty.
Danger: you throw away real information and can bias the sample.
IMPUTE — fill the gap with a reasonable substitute.
Safe when: missingness is modest and you can justify the fill value.
Danger: a bad fill value introduces noise or fake signal.
# Drop rows with ANY missing value
df.dropna()
# Drop a column that is more than 40% missing
threshold = 0.40
df.dropna(axis=1, thresh=int((1 - threshold) * len(df)))
Mean, Median, or Mode?
When you impute a numeric column, the usual fill values are the mean or the median. For a categorical column, you use the mode (most frequent value).
| Column type | Fill with | Use when |
|---|---|---|
| Numeric, roughly symmetric | Mean | No strong outliers or skew |
| Numeric, skewed or outlier-prone | Median | Income, prices, counts (robust to extremes) |
| Categorical | Mode (most frequent) | Any category column |
| Any (advanced) | Model-based (KNN, regression) | Missingness relates to other features |
Rule of thumb: prefer the median for money-like columns (salary, price, transaction amount) because it is not dragged around by a few huge values.
For salary = [55000, 72000, 61000, 72000, NaN, 48000, 90000]:
mean of the 6 known values = 66,333
median of the 6 known values = 66,500
SimpleImputer (scikit-learn)
SimpleImputer learns the fill value with .fit() and applies it with .transform(). This split is exactly what lets us fit on train only.
from sklearn.impute import SimpleImputer
# Numeric imputer: fill with the median learned from training data
num_imputer = SimpleImputer(strategy="median")
df["salary"] = num_imputer.fit_transform(df[["salary"]])
# Categorical imputer: fill with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df["plan"] = cat_imputer.fit_transform(df[["plan"]]).ravel()
print(df[["salary", "plan"]].isna().sum())
salary 0
plan 0
dtype: int64
Available strategies: "mean", "median", "most_frequent", and "constant" (with fill_value=..., e.g. filling text gaps with "Unknown").
KNNImputer (a smarter option)
SimpleImputer fills every gap in a column with the same value. KNNImputer instead looks at the k most similar rows (nearest neighbours across the other features) and fills the gap with their average. It captures relationships between columns — for example, that younger customers tend to earn less — at the cost of more computation. Reach for it when a plain median clearly loses signal.
from sklearn.impute import KNNImputer
knn_imp = KNNImputer(n_neighbors=3)
numeric_cols = ["age", "salary"]
df[numeric_cols] = knn_imp.fit_transform(df[numeric_cols])
Removing Duplicates
Duplicate rows over-weight some records and can leak identical rows across the train/test split, quietly inflating your accuracy.
print("Before:", df.shape) # Before: (7, 6)
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print("After:", df.shape) # After: (6, 6)
Be deliberate about what "duplicate" means. Two customers can legitimately share an age and city. Often you should de-duplicate on a business key such as a customer ID:
df = df.drop_duplicates(subset=["name"], keep="first")
Fixing Inconsistent / Dirty Data
Before encoding, standardise the content of text columns. Otherwise "delhi ", "Delhi", and "DELHI" become three different categories.
# Strip whitespace and standardise casing
df["city"] = df["city"].str.strip().str.title() # " delhi " -> "Delhi"
# Fix data types (numbers stored as text are a classic bug)
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")
# Map messy free-text categories to a clean set
plan_map = {"basic": "Basic", "std": "Standard", "premium": "Premium"}
df["plan"] = df["plan"].str.strip().str.lower().map(plan_map).fillna(df["plan"])
print(df["city"].unique())
['Mumbai' 'Delhi' 'Bengaluru' 'Pune']
Common dirty-data fixes to keep on your checklist:
- Whitespace:
.str.strip()removes leading/trailing spaces. - Casing:
.str.lower()/.str.title()unifies case. - Types:
pd.to_numeric(..., errors="coerce")turns unparseable text intoNaNso you can then impute it. - Units: make sure a column is all the same unit (all monthly salary, not a mix of monthly and annual).
Outlier Detection & Treatment
An outlier is a value that sits far from the rest of the distribution. Some are errors (age = 250); some are real and important (a genuine high-value customer). Your job is to detect them, then decide.
The IQR Method
The Interquartile Range (IQR) method flags values that fall outside a fence built from the quartiles. It is robust and makes no normality assumption.
Q1 = 25th percentile, Q3 = 75th percentile
IQR = Q3 - Q1
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR
A value is an outlier if it is below the lower fence OR above the upper fence.
Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df["age"] < lower) | (df["age"] > upper)]
print("Fences:", round(lower, 1), "to", round(upper, 1))
print(outliers[["name", "age"]])
Fences: 19.5 to 49.5
name age
6 Karan 250
The Z-Score Method
The z-score measures how many standard deviations a value is from the mean. A common cutoff flags values whose absolute z-score is greater than 3. This method assumes a roughly bell-shaped distribution, so it is weaker for heavily skewed data.
z = (x - mean) / std
Flag x as an outlier if abs(z) is greater than 3.
from scipy import stats
z = np.abs(stats.zscore(df["salary"].dropna()))
print((z > 3).sum(), "salary outliers by z-score")
Treating Outliers — and When to Keep Them
Once flagged, you have several options:
| Technique | What it does | When to use |
|---|---|---|
| Remove the row | Drops the record entirely | Clear data-entry errors, very few of them |
| Cap / Winsorize | Clips values to the fence (e.g. set to upper) | Real but extreme values you want to keep, softened |
| Transform | Apply log to compress the tail | Right-skewed money/count data |
| Keep as-is | Do nothing | The outlier is genuine and meaningful |
# The age = 250 is impossible -> treat as an error and remove
df = df[(df["age"] >= lower) & (df["age"] <= upper)].reset_index(drop=True)
# Alternative: cap salary at the upper fence instead of removing
sal_upper = df["salary"].quantile(0.75) + 1.5 * (
df["salary"].quantile(0.75) - df["salary"].quantile(0.25))
df["salary"] = df["salary"].clip(upper=sal_upper)
When to keep outliers: if you are building a fraud-detection or anomaly-detection model, the outliers are the signal — removing them destroys the very thing you want to catch. Never delete an outlier just because it is inconvenient; confirm it is an error first.
Encoding Categorical Variables
Models do arithmetic, so text categories must become numbers. The right encoding depends on whether the categories have a meaningful order.
Label / Ordinal Encoding
Ordinal encoding maps each category to an integer. This is correct only when the categories have a natural order, because the model will treat the numbers as having magnitude (2 > 1).
from sklearn.preprocessing import OrdinalEncoder
# plan has a real order: Basic < Standard < Premium
plan_order = [["Basic", "Standard", "Premium"]]
ord_enc = OrdinalEncoder(categories=plan_order)
df["plan_encoded"] = ord_enc.fit_transform(df[["plan"]]).astype(int)
print(df[["plan", "plan_encoded"]].drop_duplicates())
plan plan_encoded
0 Basic 0
1 Premium 2
2 Standard 1
LabelEncoder does the same integer mapping but is meant for the target column (y), not features — it does not support the fit/transform split cleanly on multiple feature columns.
One-Hot Encoding
For nominal categories with no order (like city), integer codes would fake an ordering (Pune = 3 > Mumbai = 1?). One-hot encoding instead creates one binary column per category.
city = "Mumbai" -> city_Mumbai=1, city_Delhi=0, city_Pune=0, ...
city = "Delhi" -> city_Mumbai=0, city_Delhi=1, city_Pune=0, ...
The pandas quick way:
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
df = pd.concat([df, dummies], axis=1)
The scikit-learn way (preferred inside pipelines, because it remembers the training categories):
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
city_encoded = ohe.fit_transform(df[["city"]])
print(ohe.get_feature_names_out(["city"]))
['city_Bengaluru' 'city_Delhi' 'city_Mumbai' 'city_Pune']
Two important settings:
drop_first=True(pandas) ordrop="first"(sklearn) removes one redundant column to avoid the dummy variable trap (perfect collinearity), which matters for linear models.handle_unknown="ignore"tells the encoder how to behave when a category appears in the test set that was never seen in training — it outputs all zeros instead of crashing.
The High-Cardinality Problem
One-hot encoding a column with thousands of unique values (pincode, product_id, merchant_name) explodes your feature count into thousands of sparse columns. This slows training, wastes memory, and hurts many models.
Strategies for high-cardinality columns:
- Group rare categories into a single
"Other"bucket (e.g. keep the top 20 cities, lump the rest). - Frequency / count encoding: replace each category with how often it appears.
- Target encoding: replace each category with the mean of the target for that category — powerful, but a major leakage risk, so it must be fit on training folds only.
ColumnTransformer: Different Steps for Different Columns
Real datasets need different preprocessing per column — impute-and-scale the numbers, impute-and-one-hot the categories. ColumnTransformer applies separate pipelines to separate column groups and stitches the results back together, all inside a single fit/transform object.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
numeric_features = ["age", "salary"]
categorical_features = ["city", "plan"]
# One pipeline per column type
numeric_pipe = Pipeline(steps=[
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
])
categorical_pipe = Pipeline(steps=[
("impute", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
("num", numeric_pipe, numeric_features),
("cat", categorical_pipe, categorical_features),
])
Now wire it to the train/test split so that all fitting happens on the training set only — this is the leakage-safe pattern you should default to.
from sklearn.model_selection import train_test_split
X = df[numeric_features + categorical_features]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42)
# fit_transform on TRAIN, transform-only on TEST
X_train_prepped = preprocessor.fit_transform(X_train)
X_test_prepped = preprocessor.transform(X_test) # <-- no fit here!
print("Train shape:", X_train_prepped.shape)
StandardScaler (used above) is covered in depth in the next chapter, Feature Engineering & Scaling; here it simply shows how numeric and categorical steps live side by side.
Preventing Data Leakage — The Non-Negotiable Rule
Data leakage happens when information from outside the training set sneaks into the model. The most common preprocessing leak is computing a statistic (a median, a scaler's mean, the set of categories) over the entire dataset and only then splitting.
WRONG (leaks the test set's information):
1. Impute missing values using the median of the FULL dataset
2. Scale using the mean/std of the FULL dataset
3. train_test_split(...)
RIGHT (leakage-safe):
1. train_test_split(...) <-- split FIRST
2. preprocessor.fit_transform(X_train) <-- learn medians/means/categories on TRAIN
3. preprocessor.transform(X_test) <-- APPLY the same learned values to TEST
The safest habit is to put every preprocessing step inside a Pipeline / ColumnTransformer and only ever call .fit() on training data. Cross-validation (see Train-Test Split & Cross-Validation) then re-fits the preprocessing on each fold automatically, which is exactly what you want.
Common Mistakes
1. Fitting preprocessing on the whole dataset before splitting
Computing an imputer's median or a scaler's mean over ALL rows and then
splitting leaks test information into training. Always split first, then
fit_transform on train and transform-only on test.
2. One-hot encoding a high-cardinality column blindly
One-hot encoding a pincode column with 5,000 unique values creates 5,000
columns. Group rare categories into "Other", or use frequency/target
encoding (target encoding only inside training folds).
3. Using mean imputation on skewed money columns
Salaries and transaction amounts are right-skewed. The mean is dragged up
by a few large values. Use the median for money-like columns.
4. Ordinal-encoding an unordered category
Encoding city as Mumbai=0, Delhi=1, Pune=2 tells the model Pune is
"greater than" Mumbai. There is no such order -> one-hot encode instead.
Reserve ordinal encoding for genuinely ordered categories (Basic < Premium).
5. Deleting outliers that are real signal
In fraud or anomaly detection the outliers ARE the target. Confirm a value
is a data-entry error before removing it; otherwise cap, transform, or keep.
6. Forgetting handle_unknown for one-hot encoding
If the test set contains a category never seen in training, a naive encoder
crashes. Set handle_unknown="ignore" so unseen categories map to all zeros.
Practice Exercises
-
Given the column
age = [25, 30, NaN, 45, NaN, 33], compute both the mean and median of the known values. Which would you use to impute, and why does it barely matter here but matter a lot forsalary = [40000, 45000, NaN, 50000, 900000]? -
Write pandas + scikit-learn code that de-duplicates a DataFrame on a
customer_idcolumn, keeping the most recent record (assume asignup_datecolumn exists). -
For
spend = [200, 210, 195, 205, 5000], compute Q1, Q3, IQR, and the upper IQR fence. Is5000an outlier? Give one reason you might keep it and one reason you might cap it. -
A
statecolumn has 28 unique values and apincodecolumn has 4,000 unique values. Which would you one-hot encode as-is, and how would you handle the other? Justify. -
Build a
ColumnTransformerfor a dataset with numeric columns["income", "tenure"]and categorical columns["city", "membership"](wheremembershipis ordered:Bronze < Silver < Gold). Impute numerics with the median, one-hot the unordered category, and ordinal-encode the ordered one. -
Explain in two sentences why calling
scaler.fit_transform(X_test)instead ofscaler.transform(X_test)is a data-leakage bug, and what the correct call is.
Summary
In this chapter you learned how to turn raw, messy data into a model-ready dataset:
- Raw data is never model-ready — it has missing values, duplicates, outliers, text categories, and inconsistencies that must be fixed first.
- Missing values: drop when trivial or a column is mostly empty; otherwise impute. Use the median for skewed/money columns, mean for symmetric numerics, mode for categories.
SimpleImputerhandles the fit/transform split;KNNImputerfills using similar rows. - Duplicates over-weight records and leak across splits — remove them, ideally on a business key.
- Outliers: detect with IQR (
Q1 - 1.5*IQRtoQ3 + 1.5*IQR) or z-score (abs(z) > 3); then remove errors, cap/winsorize, transform, or keep genuine signal. - Dirty data: strip whitespace, unify casing, fix types with
pd.to_numeric(..., errors="coerce"), standardise units. - Encoding: ordinal for ordered categories, one-hot (
OneHotEncoder/pd.get_dummies) for unordered ones; watch out for high cardinality and sethandle_unknown="ignore". ColumnTransformerapplies different pipelines to different column groups in one object.- Prevent leakage: split first, then
fit_transformon train andtransformonly on test — fit every preprocessing step on the training data only.
Clean data beats a clever model almost every time; the effort you invest here pays off in every chapter that follows.
Next up: Feature Engineering & Scaling — creating better features, and putting your numeric columns on a common scale with standardisation, normalisation, and transformations.