Database Design & Normalization

Why Database Design Matters

Poor database design leads to:

Data anomalies — inserting, updating, or deleting a record causes unintended effects in other records
Data redundancy — the same information stored in many places; update one and miss others
Inconsistency — different rows disagree about the same fact
Query difficulty — poorly structured data is hard to query correctly

Good design prevents all of these through normalization and careful schema planning.

The Design Process

Understand requirements — what data needs to be stored? What questions will be asked?
Identify entities — things with distinct existence (Customer, Order, Product)
Identify attributes — properties of each entity (customer: name, email, phone)
Identify relationships — how entities relate (customer places orders)
Choose primary keys — unique identifier for each entity
Normalize — eliminate redundancy while preserving information
Define constraints — enforce business rules at the database level

Entity-Relationship (ER) Modeling

An ER diagram maps the conceptual model before writing any SQL.

Entities and Attributes

CUSTOMER:  customer_id (PK), name, email, phone, city
ORDER:     order_id (PK), order_date, status, total_amount
PRODUCT:   product_id (PK), name, price, category, stock_qty

Relationship Types

Type	Description	Example
One-to-One (1:1)	One A relates to one B	Employee ↔ Passport
One-to-Many (1:N)	One A relates to many B	Customer → Orders
Many-to-Many (M:N)	Many A relate to many B	Orders ↔ Products

Implementing Relationships in SQL

One-to-Many: FK in the "many" side

CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,  -- FK to customers
    order_date  DATE,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- One customer has many orders; each order belongs to one customer

Many-to-Many: Junction/bridge table

-- Orders and Products: one order can have many products; one product can be in many orders
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL DEFAULT 1,
    unit_price DECIMAL(10,2) NOT NULL,

    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id)   REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

One-to-One: FK with UNIQUE constraint

CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee_contracts (
    contract_id INTEGER PRIMARY KEY,
    employee_id INTEGER UNIQUE NOT NULL,  -- UNIQUE enforces 1:1
    start_date  DATE,
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);

What Is Normalization?

Normalization is the process of organising a database to reduce data redundancy and improve data integrity. It's done through a series of Normal Forms (NF) — each one removes a specific type of anomaly.

The Un-normalised Table

Start with a single "flat" table that stores everything:

order_data (bad design)
order_id | customer_name | customer_email     | customer_city | product1 | qty1 | product2 | qty2 | order_date
---------|---------------|--------------------|----|----------|------|----------|------|---
1001     | Priya Sharma  | priya@example.com  | Delhi  | Laptop | 1  | Mouse | 2  | 2026-01-10
1002     | Raj Patel     | raj@example.com    | Mumbai | Phone  | 1  | NULL  | NULL| 2026-01-15
1003     | Priya Sharma  | priya@example.com  | Delhi  | Tablet | 1  | NULL  | NULL| 2026-02-01

Problems:

Customer info repeated across orders (update email in one place, miss another)
Fixed number of product columns (what if order has 10 items?)
Empty NULLs for unused product slots

First Normal Form (1NF)

1NF Rules:

Each column contains atomic (indivisible) values — no lists, no arrays, no repeated columns
Each row must be unique (requires a primary key)
No repeating groups (no product1, product2, product3 columns)

Violation in our table:

product1, qty1, product2, qty2 are repeating groups
A customer might have multiple phone numbers in one cell: "9999,8888"

Fix — Remove repeating groups:

-- 1NF: separate rows for each product in the order
orders (1NF):
order_id | customer_name | customer_email   | customer_city | product_name | qty | order_date
---------|---------------|------------------|--------------|--------------|-----|----------
1001     | Priya Sharma  | priya@example.com| Delhi        | Laptop       | 1   | 2026-01-10
1001     | Priya Sharma  | priya@example.com| Delhi        | Mouse        | 2   | 2026-01-10
1002     | Raj Patel     | raj@example.com  | Mumbai       | Phone        | 1   | 2026-01-15
1003     | Priya Sharma  | priya@example.com| Delhi        | Tablet       | 1   | 2026-02-01

Now 1NF — but we still have redundancy (customer info repeated for each product in order 1001).

Second Normal Form (2NF)

2NF requires:

Already in 1NF
Every non-key column must depend on the entire primary key — not just part of it

If the primary key is (order_id, product_name), then:

qty depends on both (the quantity of a product in a specific order) ✓
customer_name depends only on order_id (not on product_name) ✗ — partial dependency

Fix — Split into separate tables:

-- Customers
CREATE TABLE customers (
    customer_id    INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    email          TEXT UNIQUE,
    city           TEXT
);

-- Orders (each order belongs to one customer)
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL,
    order_date    DATE    NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- Products
CREATE TABLE products (
    product_id    INTEGER PRIMARY KEY,
    name          TEXT    NOT NULL,
    price         DECIMAL(10,2)
);

-- Order items (the bridge table)
CREATE TABLE order_items (
    order_id    INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    qty         INTEGER NOT NULL DEFAULT 1,
    unit_price  DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id)   REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

Now in 2NF: each column depends on the full primary key of its table.

Third Normal Form (3NF)

3NF requires:

Already in 2NF
No transitive dependencies — non-key column should not depend on another non-key column

Example Violation

-- employees with transitive dependency
employees:
employee_id | name  | department_id | department_name | department_budget
------------|-------|--------------|-----------------|------------------
1           | Priya | 10           | Finance         | 500000
2           | Raj   | 20           | Technology      | 750000
3           | Meera | 10           | Finance         | 500000

department_name depends on department_id (not on employee_id)
department_budget depends on department_id
These are transitive dependencies: employee_id → department_id → department_name

Fix — Move department data to its own table:

CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT    NOT NULL,
    budget          DECIMAL(12,2)
);

CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    name          TEXT    NOT NULL,
    department_id INTEGER NOT NULL,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);

Now in 3NF: every non-key column depends only on the primary key.

Boyce-Codd Normal Form (BCNF)

BCNF is a slightly stronger version of 3NF: every determinant must be a candidate key. For most practical database design, 3NF is sufficient.

Summary of Normal Forms

Normal Form	Removes
1NF	Repeating groups, non-atomic values
2NF	Partial dependencies (on part of composite PK)
3NF	Transitive dependencies (non-key → non-key)
BCNF	All anomalies from functional dependencies

Denormalization — When to Break the Rules

Normalization is best for write-heavy systems (fewer anomalies, easier updates). But for read-heavy systems (analytics, reporting), joining many tables is expensive.

Denormalization intentionally adds redundancy to improve read performance:

-- Denormalised orders table (no need to join for common reports)
CREATE TABLE orders_denorm (
    order_id       INTEGER PRIMARY KEY,
    order_date     DATE,
    customer_id    INTEGER,
    customer_name  TEXT,    -- redundant but fast to read
    customer_city  TEXT,    -- redundant
    total_items    INTEGER, -- pre-computed
    total_amount   DECIMAL(10,2) -- pre-computed
);

You trade storage and write complexity for read speed. This is common in:

Data warehouses (star schema, snowflake schema)
Analytics databases (BigQuery, Redshift)
Pre-aggregated reporting tables

Real-World Design Patterns

Audit Columns

Every production table should track when and by whom rows were created or changed:

CREATE TABLE employees (
    id          INTEGER   PRIMARY KEY AUTOINCREMENT,
    name        TEXT      NOT NULL,
    department  TEXT      NOT NULL,
    salary      DECIMAL   NOT NULL,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by  INTEGER,  -- user who created the record
    updated_by  INTEGER   -- user who last modified it
);

Soft Deletes

Don't delete rows — mark them as deleted to preserve history:

ALTER TABLE employees ADD COLUMN deleted_at TIMESTAMP;
ALTER TABLE employees ADD COLUMN is_deleted INTEGER DEFAULT 0;

-- "Delete"
UPDATE employees SET is_deleted = 1, deleted_at = CURRENT_TIMESTAMP WHERE id = 5;

-- Normal queries filter out deleted rows
SELECT * FROM employees WHERE is_deleted = 0;

-- Or use a view
CREATE VIEW active_employees AS
SELECT * FROM employees WHERE is_deleted = 0;

Lookup/Reference Tables

Instead of storing string values like 'pending', 'confirmed' everywhere, use a reference table:

CREATE TABLE order_statuses (
    id   INTEGER PRIMARY KEY,
    name TEXT    NOT NULL UNIQUE
);
INSERT INTO order_statuses VALUES (1,'pending'),(2,'confirmed'),(3,'shipped'),(4,'delivered'),(5,'cancelled');

CREATE TABLE orders (
    id        INTEGER PRIMARY KEY,
    status_id INTEGER NOT NULL,
    FOREIGN KEY (status_id) REFERENCES order_statuses(id)
);

Changing a status name requires updating one row, not thousands.

Versioning / History Tables

Keep a full history of changes:

CREATE TABLE product_prices (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id  INTEGER NOT NULL,
    price       DECIMAL(10,2) NOT NULL,
    valid_from  DATE NOT NULL,
    valid_to    DATE,  -- NULL means currently active
    FOREIGN KEY (product_id) REFERENCES products(id)
);

-- Current price
SELECT price FROM product_prices
WHERE product_id = 42 AND valid_to IS NULL;

-- Historical price on a specific date
SELECT price FROM product_prices
WHERE product_id = 42
  AND valid_from <= '2025-01-01'
  AND (valid_to IS NULL OR valid_to > '2025-01-01');

Complete Example: E-Commerce Schema

CREATE TABLE categories (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE products (
    id          INTEGER       PRIMARY KEY AUTOINCREMENT,
    name        VARCHAR(200)  NOT NULL,
    description TEXT,
    price       DECIMAL(10,2) NOT NULL CHECK (price >= 0),
    stock       INTEGER       NOT NULL DEFAULT 0 CHECK (stock >= 0),
    category_id INTEGER,
    created_at  TIMESTAMP     DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (category_id) REFERENCES categories(id)
);

CREATE TABLE customers (
    id         INTEGER       PRIMARY KEY AUTOINCREMENT,
    name       VARCHAR(100)  NOT NULL,
    email      VARCHAR(150)  NOT NULL UNIQUE,
    city       VARCHAR(100),
    created_at TIMESTAMP     DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    id          INTEGER       PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER       NOT NULL,
    status      VARCHAR(20)   NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending','confirmed','shipped','delivered','cancelled')),
    order_date  DATE          NOT NULL DEFAULT CURRENT_DATE,
    total       DECIMAL(12,2),
    FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT
);

CREATE TABLE order_items (
    order_id   INTEGER       NOT NULL,
    product_id INTEGER       NOT NULL,
    qty        INTEGER       NOT NULL CHECK (qty > 0),
    unit_price DECIMAL(10,2) NOT NULL CHECK (unit_price >= 0),
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id)   REFERENCES orders(id)   ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE RESTRICT
);

-- Useful indexes
CREATE INDEX idx_orders_customer   ON orders (customer_id);
CREATE INDEX idx_orders_date       ON orders (order_date);
CREATE INDEX idx_products_category ON products (category_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);

Common Mistakes

1. Storing multiple values in one column

-- Bad
CREATE TABLE employees (
    tags TEXT  -- "finance, accounting, excel" — hard to query
);

-- Good
CREATE TABLE employee_tags (
    employee_id INTEGER,
    tag         TEXT,
    PRIMARY KEY (employee_id, tag)
);

2. Using VARCHAR for status fields

-- Risky — typos like 'Pendin' or 'PENDING' get through
status VARCHAR(20)

-- Better — use a CHECK constraint or FK to a status table
status VARCHAR(20) CHECK (status IN ('pending','confirmed','shipped'))

3. Premature normalization

Normalizing all the way to 5NF for a small lookup table creates unnecessary complexity. Use judgment — 3NF is the standard target for OLTP databases.

4. No surrogate key on junction tables

-- Fine for simple cases
PRIMARY KEY (order_id, product_id)

-- But sometimes you need to reference an order_item specifically:
-- Adding a surrogate key helps
CREATE TABLE order_items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate for easy reference
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    UNIQUE (order_id, product_id)                  -- still enforce uniqueness
);

Practice Exercises

A table student_courses has columns: student_name, student_email, course_name, instructor_name, instructor_email, grade. Identify all violations of 1NF, 2NF, and 3NF, then design a normalised schema.
Design a schema for a library system: books can have multiple authors; members can borrow multiple books; each loan has a due date and return date.
Add audit columns (created_at, updated_at, created_by) to the orders table from the e-commerce schema.
Implement a soft-delete pattern for the customers table and create a view that shows only non-deleted customers.
Why might a data warehouse intentionally denormalize and store customer name directly in the orders table?

Summary

In this chapter you learned:

Good database design prevents anomalies, redundancy, and inconsistency
ER modeling: entities (things), attributes (properties), relationships (1:1, 1:N, M:N)
1:N → foreign key in the "many" side; M:N → junction/bridge table
1NF — atomic values, no repeating groups, unique rows
2NF — no partial dependencies on a composite PK (full key dependency)
3NF — no transitive dependencies (non-key → non-key removed)
BCNF — every determinant is a candidate key
Denormalization — trading update complexity for read performance (analytics/reporting)
Audit columns (created_at, updated_at) — always include in production tables
Soft deletes — preserve history instead of physically deleting rows
Reference/lookup tables — avoid magic strings scattered across many tables
History/versioning tables — track changes over time with valid_from / valid_to

You've now completed the full SQL tutorial! You have the knowledge to design solid schemas, write complex queries, optimise performance, and ensure data integrity in any relational database.