Introduction to Machine Learning

What Is Machine Learning?

Machine learning (ML) is a way of building software that learns patterns from data instead of following instructions a programmer wrote by hand. You give the computer examples of a problem, and it figures out the rules on its own.

Here is the one-sentence definition worth memorising:

Machine learning is the study of algorithms that improve their performance at some task by learning from data rather than from explicit programming.

An intuitive analogy: imagine teaching a child to recognise a mango. You do not hand them a rulebook that says "if the fruit is oval, yellow-orange, 8 to 15 cm long, and smells sweet, then it is a mango." You simply show them many mangoes (and many non-mangoes), and after enough examples they generalise the concept. ML works the same way. Instead of a rulebook, the algorithm sees labelled examples and learns the pattern that separates a mango from an apple.

The task could be predicting a house price, flagging a fraudulent transaction, sorting emails into spam and not-spam, or recommending the next movie. In each case, we do not write the logic by hand. We collect examples and let the algorithm learn.

ML vs Traditional Rule-Based Programming

The cleanest way to understand ML is to contrast it with the way software was written for decades.

In traditional (rule-based) programming, a human studies the problem, writes the rules as code, and the program applies those rules to inputs to produce outputs:

Traditional programming:
  Rules (written by a human)  +  Data  →  Answers

Machine learning:
  Data  +  Answers (examples)  →  Rules (learned by the algorithm)

Notice that ML flips the arrows. In classical programming you supply the rules and get answers. In ML you supply the data and the answers (during training), and the machine hands you back the rules — encoded inside a model.
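
To see the flipped arrows in code, here is a minimal sketch. It assumes a made-up delivery fee rule (a ₹30 base plus ₹8 per km) invented purely for illustration; the names `delivery_fee`, `distances_km`, and `fees` are hypothetical. The first half writes the rule by hand. The second half gives scikit-learn's `LinearRegression` only the data and the answers, and it should recover roughly the same rule as learned parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: a human writes the rule.
# (Hypothetical fee structure, invented for this illustration.)
def delivery_fee(distance_km: float) -> float:
    return 30 + 8 * distance_km

# Machine learning: supply data plus answers, get the rule back.
distances_km = np.array([[1.0], [2.0], [3.5], [5.0], [8.0], [12.0]])   # data
fees = np.array([delivery_fee(d[0]) for d in distances_km])            # answers

model = LinearRegression()
model.fit(distances_km, fees)

# The "rule" now lives in the model's learned parameters.
learned_slope = model.coef_[0]
learned_intercept = model.intercept_
print(f"Learned rule: fee = {learned_slope:.2f} * km + {learned_intercept:.2f}")
```

For a rule this simple you would of course just write the function. The point is only that the learned parameters encode the same relationship without anyone typing it in.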

Consider a concrete example: detecting spam email.

Rule-based approach: A developer writes conditions such as "if the subject contains the word LOTTERY, mark as spam", "if there are more than 5 links, mark as spam". This quickly becomes unmanageable. Spammers change tactics, edge cases pile up, and you end up maintaining thousands of brittle rules.
ML approach: You collect 50,000 emails already labelled spam or not-spam, hand them to a learning algorithm, and it discovers the statistical patterns that distinguish the two — even patterns a human would never think to write down.
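
The two approaches can be sketched side by side. This is an illustration under stated assumptions, not a production filter: the six training emails are invented, the function name `rule_based_is_spam` is hypothetical, and the bag-of-words Naive Bayes pipeline is just one reasonable choice of learning algorithm (the text above does not prescribe one).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Rule-based approach: the two hand-written conditions from the text.
def rule_based_is_spam(subject: str, num_links: int) -> bool:
    return "LOTTERY" in subject.upper() or num_links > 5

print(rule_based_is_spam("You won the LOTTERY", num_links=1))  # True
print(rule_based_is_spam("You won the L0TTERY", num_links=1))  # False: a zero defeats the rule

# ML approach: learn from labelled examples (toy data, 1 = spam, 0 = not spam).
emails = [
    "win the lottery now claim your free prize",
    "free prize waiting claim your cash now",
    "cheap loans free cash click now",
    "meeting agenda for monday project review",
    "please review the attached project report",
    "lunch on monday with the project team",
]
labels = [1, 1, 1, 0, 0, 0]

# Word counts feed a Naive Bayes classifier.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

new_emails = ["claim your free cash prize", "project meeting on monday"]
predictions = spam_model.predict(new_emails)
print(list(predictions))  # [1, 0]
```

Notice that the second rule check fails as soon as a spammer swaps one character, while the learned model never needed a rule about any particular word.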

Aspect	Rule-Based Programming	Machine Learning
Logic source	Hand-coded by a developer	Learned from data
Handles new patterns	Poorly — needs new rules	Adapts if retrained on new data
Best when	Rules are few, stable, and known	Rules are complex, fuzzy, or unknown
Example	Tax calculation, `if age >= 18` checks	Fraud detection, image recognition
Maintenance	Edit code for every new case	Retrain on fresh labelled data
Explainability	Fully transparent	Ranges from transparent to opaque

The key insight: use rules when the logic is simple and known, and use ML when the logic is too complex or too fuzzy to write by hand.

Why Machine Learning Matters Now

ML is not a new idea: the core algorithms date from the 1950s through the 1990s. What changed in the last two decades is the availability of three ingredients that make ML practical at scale:

Data. Every UPI payment, Swiggy order, and Ola ride generates records. Organisations now sit on enormous datasets, and ML is hungry for data.
Compute. Cheap GPUs and cloud platforms mean a model that once needed a supercomputer can now be trained on a laptop or a rented cloud instance for a few hundred rupees.
Tooling. Libraries like scikit-learn, pandas, and NumPy let you go from raw data to a trained model in a few dozen lines of Python. You no longer implement the mathematics from scratch.

The virtuous cycle:
  More data  →  better models  →  better products
      ↑                                   ↓
  more usage  ←  more users  ←  more value delivered

When these three ingredients came together, ML moved from research labs into the products you use every day.

Where Machine Learning Is Used

ML shows up across almost every industry. A few representative domains:

Finance — fraud detection. Banks and UPI apps score every transaction in milliseconds. A model trained on past fraudulent and legitimate transactions flags a ₹80,000 payment from an unusual location at 3 a.m. as suspicious.
E-commerce — recommendations. When Flipkart or Amazon suggests "customers who bought this also bought…", a recommendation model is predicting what you are most likely to want next.
Healthcare — diagnosis support. Models trained on labelled X-rays or scans help radiologists flag likely tumours, prioritising urgent cases.
Natural Language Processing (NLP). Spam filters, sentiment analysis of product reviews, chatbots, and machine translation all rest on ML models that learn from text.
Computer vision. Face unlock on your phone, automatic number-plate recognition at toll gates, and quality inspection on factory lines all classify images.
Operations & logistics. Demand forecasting for a Zomato dark kitchen and delivery-time prediction for a courier company are regression problems solved with ML.

The common thread: in each case there is abundant historical data and the underlying rule is too complex to hand-code.

Core Machine Learning Terminology

Before writing any code, you need a shared vocabulary. These terms appear in every chapter that follows, so learn them now.

The Data

Dataset — the full collection of examples you learn from. Usually a table where each row is one example and each column is one measured quantity.
Feature (also attribute, predictor, independent variable) — an input column the model uses to make a prediction. The full set of features is conventionally called X (a capital X because it is usually a matrix — many rows, many columns).
Label (also target, outcome, dependent variable) — the answer you want to predict. Conventionally called y (lowercase, because it is usually a single column).
Instance / sample / observation — one row of the dataset: one set of features together with its label.

Example dataset — predicting whether a loan is repaid:

  age   income(₹)   loan_amount(₹)   repaid?   ← columns
  ---------------------------------------------
  28    600000      200000           yes        ← one instance (row)
  45    1200000     800000           yes
  33    450000      500000           no
   |         |            |             |
   └─────── features (X) ──┘        label (y)
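
In code, the same table might look like the sketch below. It assumes pandas is installed, and the column names (`income`, `loan_amount`, `repaid`) are simplified versions of the headers in the diagram. Selecting columns is all it takes to separate the features X from the label y.

```python
import pandas as pd

# The three example rows from the loan table above.
loans = pd.DataFrame({
    "age": [28, 45, 33],
    "income": [600000, 1200000, 450000],
    "loan_amount": [200000, 800000, 500000],
    "repaid": ["yes", "yes", "no"],
})

X = loans[["age", "income", "loan_amount"]]   # features: a table (matrix)
y = loans["repaid"]                           # label: a single column

print(X.shape)          # (3, 3): 3 instances, 3 features
print(y.shape)          # (3,):   one label per instance
print(loans.iloc[0])    # one instance: the first row, features plus label
```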

The Model and the Loop

Model — the object that has learned the pattern. Concretely, it is a mathematical function f, with parameters set during training, that maps features to a prediction: ŷ = f(X).
Training (also fitting, learning) — the process of adjusting the model's parameters so its predictions match the known labels as closely as possible. In scikit-learn this is the .fit() method.
Inference (also prediction, scoring) — using the trained model to produce a prediction for new, unseen data. In scikit-learn this is the .predict() method.
Generalization — the model's ability to perform well on data it has never seen before. This is the whole point. A model that memorises the training data but fails on new data is useless. (We measure this by holding out a test set, covered in the Train-Test Split & Cross-Validation chapter.)

Term	Also called	Symbol / method	Plain meaning
Feature	Predictor, attribute	`X`	The inputs
Label	Target, outcome	`y`	The answer to predict
Model	Estimator, hypothesis	`f`	The learned function
Training	Fitting, learning	`.fit()`	Learn from examples
Inference	Prediction, scoring	`.predict()`	Apply to new data
Generalization	Out-of-sample performance	measured on test set	Works on unseen data

Where ML Sits Within AI and Data Science

The terms AI, machine learning, and deep learning are often used interchangeably, but they nest inside one another:

Artificial Intelligence (AI)
  └── the broad goal: machines that perform tasks needing "intelligence"
      │
      └── Machine Learning (ML)
            └── systems that learn those tasks from data
                │
                └── Deep Learning
                      └── ML using large multi-layer neural networks

Artificial Intelligence is the widest umbrella: any technique that makes a machine act intelligently — including old-fashioned hand-coded rule engines.
Machine Learning is the subset of AI where behaviour is learned from data rather than hand-coded.
Deep Learning is a subset of ML built on large neural networks (the subject of the final chapter, Introduction to Neural Networks & Deep Learning).

And where does data science fit? Data science is the broad practice of extracting insight and value from data — it includes data cleaning, visualisation, statistics, and communication. ML is one of the most powerful tools in the data scientist's kit, but a data scientist also does plenty of work that is not ML at all.

When to Use ML vs Simple Rules

ML is powerful, but it is not always the right choice. Reaching for it when a simple rule would do is a common and expensive mistake. Use this checklist.

Prefer simple rules when:

The logic is small and well understood. "Charge 18% GST" needs no model — it needs one line of code.
You need 100% guaranteed, auditable behaviour (legal, safety, or compliance logic).
You have very little data.
A wrong prediction is unacceptable and there is a known correct formula.

Prefer machine learning when:

The rules are too complex or fuzzy to write by hand (recognising a face, understanding a sentence).
The pattern changes over time and you can retrain on fresh data (fraud tactics evolve).
You have enough historical labelled data to learn from.
Being approximately right at scale is more valuable than being perfectly right on a handful of cases.

Situation	Recommended approach
Compute GST on an invoice	Rule (`amount * 0.18`)
Decide if a user is `age >= 18`	Rule
Predict tomorrow's demand for a product	ML (regression)
Flag a transaction as fraud	ML (classification)
Recommend the next video	ML (recommendation)
Validate that an email contains an `@`	Rule

A good rule of thumb: if you can easily write the rule, write the rule. Save ML for the problems where you cannot.
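
For comparison, the three "Rule" rows in the table fit in a few lines of plain Python. This is a minimal sketch; the function names are hypothetical, and the 18% rate is simply the figure used in the text's example.

```python
GST_RATE = 0.18  # the 18% rate from the example above

def gst_on_invoice(amount: float) -> float:
    """Return the GST due on an invoice amount."""
    return amount * GST_RATE

def is_adult(age: int) -> bool:
    """Return True if the user is 18 or older."""
    return age >= 18

def has_at_sign(email: str) -> bool:
    """Minimal check from the table: does the address contain an '@'?"""
    return "@" in email

print(is_adult(17))                        # False
print(is_adult(18))                        # True
print(has_at_sign("priya@example.com"))    # True
print(has_at_sign("priya.example.com"))    # False
```

No training data, no model, and the behaviour is fully auditable, which is why rules win whenever they are this easy to write.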

Your First End-to-End ML Example

Enough theory — let us run the entire loop once, from data to a measured prediction. We will use scikit-learn's built-in Iris dataset (measurements of 150 flowers across 3 species) and train a model to predict the species from four measurements.

This tiny example touches every concept above: features (X), labels (y), a train/test split, fitting a model, inference, and measuring generalization with accuracy.

# Step 0: imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: load the data
iris = load_iris()
X = iris.data      # features: 150 rows x 4 columns (measurements)
y = iris.target    # labels:   150 species codes (0, 1, 2)

print("Feature matrix shape:", X.shape)   # (rows, columns)
print("Label vector shape:  ", y.shape)
print("Species names:       ", list(iris.target_names))

# Step 2: split into training and test sets
# The model learns on the training set and is judged on the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the data for testing
    random_state=42     # fixed seed so the split is reproducible
)

# Step 3: create and TRAIN (fit) the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)     # this is the "learning" step

# Step 4: INFERENCE — predict on data the model has never seen
y_pred = model.predict(X_test)

# Step 5: measure GENERALIZATION with accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2%}")

Running the code prints something like this (your exact accuracy may vary slightly with the scikit-learn version):

Feature matrix shape: (150, 4)
Label vector shape:   (150,)
Species names:        ['setosa', 'versicolor', 'virginica']
Test accuracy: 100.00%

Let us connect each step back to the terminology:

X and y are the features and labels we defined earlier.
train_test_split holds out a portion of data so we can honestly measure generalization — the model is scored only on rows it never saw during training.
model.fit(...) is training: the algorithm adjusts its internal parameters to match the training labels.
model.predict(...) is inference: applying the learned function to new inputs.
accuracy_score answers the real question — how often is the model right on unseen data?

That is the complete machine learning loop. Every chapter in this series expands one part of it — better data preparation, smarter features, different algorithms, and more honest evaluation — but the shape you just saw never changes.

Predicting a Single New Flower

To make inference concrete, imagine a botanist named Priya measures a new flower. We feed those four numbers to the trained model:

# A new, unseen flower: [sepal_length, sepal_width, petal_length, petal_width]
new_flower = [[5.1, 3.5, 1.4, 0.2]]

prediction = model.predict(new_flower)
species = iris.target_names[prediction[0]]
print("Predicted species:", species)

Predicted species: setosa

Priya did not write any rules about petal sizes. The model learned them from 120 training examples and now generalises to a flower it has never encountered.

Common Pitfalls

Even at this introductory stage, beginners repeatedly stumble on the same issues. Watch for these.

1. Evaluating on the training data

Wrong: measure accuracy on the SAME data the model trained on.
       → The model may have memorised it — accuracy looks great but is a lie.
Right: always measure on a held-out test set (or via cross-validation).

Reporting training accuracy as if it were real performance is one of the most common mistakes newcomers make in ML.
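
The gap is easy to reproduce. The sketch below is an illustration under assumptions: it uses a synthetic dataset from scikit-learn's `make_classification` with 20% label noise (`flip_y=0.2`), and an unrestricted decision tree, chosen here only because it can memorise its training rows. Neither comes from the Iris example above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: about 20% of labels are randomised.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A fully grown tree can memorise every training row, noise included.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # Wrong: scored on rows it has seen
test_acc = tree.score(X_test, y_test)      # Right: scored on held-out rows

# Cross-validation: average held-out accuracy over 5 different splits.
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"Training accuracy:        {train_acc:.2%}")
print(f"Test accuracy:            {test_acc:.2%}")
print(f"Cross-validated accuracy: {cv_acc:.2%}")
```

The training score comes out perfect because the tree has memorised the noise; the test and cross-validated scores are the honest estimates.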

2. Reaching for ML when a rule would do

If the task is "flag orders above ₹10,000", that is one if statement, not a model. ML adds data needs, training cost, and unpredictability. Use it only when the rule is genuinely hard to write.

3. Confusing AI, ML, and deep learning

Not every AI system is ML, and not every ML model is a neural network. Using the terms loosely leads to choosing an over-complex tool. Many business problems, especially those built on tabular data, are solved well by simple models like logistic regression or random forests, with no deep learning involved.

4. Ignoring data quality

Garbage in, garbage out. A model can only be as good as the data it learns from. Mislabelled examples, missing values, and biased samples all propagate straight into predictions.

Data cleaning (covered in Data Preprocessing & Cleaning) is where much of the effort in a real ML project goes.

5. Expecting perfection

ML makes probabilistic predictions, not guarantees. A fraud model will occasionally miss a fraud and occasionally flag a legitimate payment. Design your product to tolerate mistakes rather than assuming the model is always right.
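
One way to design for uncertainty is to look at the model's class probabilities instead of only its top answer. The sketch below refits the Iris model from earlier so that it runs on its own, then calls `predict_proba`, which scikit-learn's `LogisticRegression` provides. The 0.9 cut-off and the "needs human review" fallback are assumptions for illustration, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Refit the same model as in the end-to-end example.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# One probability per species for Priya's flower; they sum to 1.
probabilities = model.predict_proba([[5.1, 3.5, 1.4, 0.2]])[0]
for name, p in zip(iris.target_names, probabilities):
    print(f"{name:<12}{p:.3f}")

# Hypothetical policy: act automatically only on confident predictions.
CONFIDENCE_THRESHOLD = 0.9
top = probabilities.argmax()
if probabilities[top] >= CONFIDENCE_THRESHOLD:
    decision = str(iris.target_names[top])
else:
    decision = "needs human review"
print("Decision:", decision)
```

The same pattern applies to a fraud model: low-confidence cases can be routed to a person instead of being blocked or approved automatically.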

6. Forgetting to set a random seed

Without random_state, your train/test split changes every run, so your reported accuracy drifts and your results are not reproducible. Fix the seed while developing.
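
A short check makes the effect visible. This sketch splits the Iris data twice with the same seed and twice with no seed, then compares which rows landed in the test set. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed: the two splits pick exactly the same test rows.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_with_seed = np.array_equal(X_test_a, X_test_b)
print("Same test rows with a fixed seed:", same_with_seed)   # True

# No seed: each call shuffles differently, so the test rows almost always differ.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2)
_, X_test_d, _, _ = train_test_split(X, y, test_size=0.2)
same_without_seed = np.array_equal(X_test_c, X_test_d)
print("Same test rows without a seed:   ", same_without_seed)
```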

Practice Exercises

Rules vs ML. For each task, decide whether you would use a hand-coded rule or machine learning, and justify in one line: (a) converting a temperature from Celsius to Fahrenheit, (b) predicting whether a customer will churn next month, (c) checking if a password is at least 8 characters, (d) recognising handwritten pincodes on envelopes.
Terminology. You have a dataset of used cars with columns brand, year, km_driven, and selling_price, and your goal is to predict the price. Identify the features (X) and the label (y), and state whether one row is an instance or a feature.
Run the loop. Modify the Iris example to use test_size=0.3 instead of 0.2. Re-run it and note how many flowers are now in X_train and X_test. Does the test accuracy change?
Swap the model. Replace LogisticRegression with KNeighborsClassifier: import it with from sklearn.neighbors import KNeighborsClassifier and create the model with KNeighborsClassifier(). Keep everything else the same, then fit, predict, and compare accuracy. (This algorithm is covered in the K-Nearest Neighbors chapter.)
Explain generalization. In two or three sentences, explain to a non-technical colleague why we hold out a test set instead of measuring accuracy on the training data.
Domain mapping. Pick any app on your phone and list two features it might use and one label it might predict for one of its ML-powered functions (for example, a food-delivery app predicting delivery time).

Summary

In this chapter you learned:

Machine learning builds software that learns patterns from data instead of relying on hand-coded rules.
ML flips traditional programming: instead of rules + data → answers, ML uses data + answers → rules (packaged as a model).
ML matters now because three ingredients aligned: abundant data, cheap compute, and mature tooling like scikit-learn.
It powers finance fraud detection, e-commerce recommendations, healthcare diagnosis support, NLP, and computer vision — anywhere data is plentiful and the rule is hard to write.
Core terms: dataset, features (X), labels (y), model, training/fitting (.fit()), inference/prediction (.predict()), and generalization (performance on unseen data).
ML is a subset of AI, deep learning is a subset of ML, and ML is one powerful tool within data science.
Use simple rules when the logic is small and known; use ML when the logic is complex, fuzzy, or evolving and you have enough labelled data.
You ran the complete ML loop once: load data, train_test_split, fit, predict, and measure accuracy on a held-out test set.
Common pitfalls: evaluating on training data, over-using ML, confusing AI/ML/DL, ignoring data quality, expecting perfection, and forgetting the random seed.

You now understand what machine learning is, why it works, and what the end-to-end loop looks like.

Next up: Types of Machine Learning — how supervised, unsupervised, and reinforcement learning differ, and how to recognise which type your problem belongs to.