Normalization

Normalization decision flow — 1NF through BCNF

Normalization eliminates data redundancy and update anomalies by decomposing tables into well-structured forms. Each normal form imposes stricter constraints.

Why Normalize?

Update Anomalies

Without normalization, redundant data causes problems:

StudentCourses(student_id, student_name, course_id, course_name, instructor)

| sid | name  | cid  | course   | instructor |
|-----|-------|------|----------|------------|
| 1   | Alice | CS50 | Intro CS | Dr. Smith  |
| 1   | Alice | CS51 | Data Str | Dr. Jones  |
| 2   | Bob   | CS50 | Intro CS | Dr. Smith  |

Insertion anomaly: Can't add a course without a student enrolled.

Deletion anomaly: Deleting Bob's enrollment loses the fact that CS50 is taught by Dr. Smith (if Bob is the last student).

Update anomaly: If Dr. Smith changes, must update every row mentioning them. Miss one → inconsistency.

Solution: Decompose into Students, Courses, and Enrollments tables.

First Normal Form (1NF)

Rule: Every attribute contains only atomic (indivisible) values. No repeating groups, no sets, no nested tables.

Violation:

| student | courses         |
|---------|-----------------|
| Alice   | CS50, CS51      |  ← multivalued!
| Bob     | CS50            |

Fix: One row per student-course pair, or a separate Enrollments table.

Note: Modern databases support JSON, arrays, etc. — 1NF is about the relational model, not about what databases technically allow.

Second Normal Form (2NF)

Rule: 1NF + every non-key attribute is fully dependent on the entire primary key (no partial dependencies).

Only relevant for tables with composite primary keys.

Violation (PK = {student_id, course_id}):

Enrollments(student_id, course_id, student_name, grade)

student_name depends only on student_id, not on the full key → partial dependency.

Fix: Move student_name to a Students table.

Students(student_id, student_name)
Enrollments(student_id, course_id, grade)

Third Normal Form (3NF)

Rule: 2NF + no transitive dependencies (non-key attribute depending on another non-key attribute).

Formally: For every FD X → A, either X is a superkey OR A is part of a candidate key (prime attribute).

Violation:

Students(student_id, name, department_id, department_name)

student_id → department_id → department_name is transitive. department_name depends on department_id, not directly on student_id.

Fix:

Students(student_id, name, department_id)
Departments(department_id, department_name)

3NF Test

For every non-trivial FD X → Y in the relation:

X is a superkey, OR
Y is a prime attribute (part of some candidate key)

If both conditions fail for any FD → not in 3NF.

Boyce-Codd Normal Form (BCNF)

Rule: For every non-trivial FD X → Y, X must be a superkey.

Stricter than 3NF — removes the "Y is prime" exception.

3NF but not BCNF example:

CourseInstructor(student, course, instructor)
FDs: {student, course} → instructor
     instructor → course  (each instructor teaches one course)

Candidate keys: {student, course}, {student, instructor}
instructor → course: instructor is not a superkey → violates BCNF
But course is a prime attribute → satisfies 3NF

Fix: Decompose into:

Teaches(instructor, course)
Takes(student, instructor)

Trade-off: BCNF decomposition may not preserve all FDs (no dependency preservation guarantee). 3NF always preserves FDs.

Fourth Normal Form (4NF)

Rule: BCNF + no multi-valued dependencies (except those implied by superkeys).

Multi-valued dependency (MVD): X →→ Y means: for each X value, the set of Y values is independent of the other attributes.

Violation:

PersonSkillsLangs(person, skill, language)

| person | skill   | language |
|--------|---------|----------|
| Alice  | Python  | English  |
| Alice  | Python  | French   |
| Alice  | Rust    | English  |
| Alice  | Rust    | French   |

person →→ skill and person →→ language (skills and languages are independent). Must store every combination → redundancy.

Fix:

PersonSkills(person, skill)
PersonLanguages(person, language)

Fifth Normal Form (5NF / PJNF)

Rule: 4NF + no join dependencies that aren't implied by candidate keys.

A join dependency means the table can be reconstructed by joining three or more of its projections.

Rarely encountered in practice. Most schemas are well-served by 3NF or BCNF.

Denormalization Tradeoffs

Aspect	Normalized	Denormalized
Redundancy	None	Controlled redundancy
Update anomalies	None	Must manage manually
Storage	Less	More
Read performance	May need joins	Faster reads (pre-joined)
Write performance	Faster writes	Slower (update multiple copies)
Query complexity	More joins	Simpler queries

Normalize by default. Denormalize for measured performance needs.

When to Denormalize

Read-heavy OLAP workloads: Pre-compute aggregates, materialized views.
Caching columns: Store computed values to avoid expensive joins.
Embedding related data: Store frequently-accessed related data in the same row (in document databases, this is natural).
Reporting tables: Flatten for simple reporting queries.

Lossless Join Decomposition

A decomposition is lossless if the original table can be reconstructed exactly by joining the decomposed tables.

Lossless condition (for binary decomposition R into R₁, R₂):

(R₁ ∩ R₂) → R₁  OR  (R₁ ∩ R₂) → R₂

The common attributes must be a superkey of at least one of the decomposed tables.

BCNF decomposition is always lossless. But it may not preserve all FDs.

Find a canonical cover of FDs.
For each FD X → A: create table (X, A).
If no table contains a candidate key: add a table with a candidate key.
Remove redundant tables (subset of another table's attributes).

Result: 3NF, lossless, dependency-preserving.

BCNF Decomposition Algorithm

Find a non-trivial FD X → Y where X is not a superkey.
Decompose R into R₁(XY) and R₂(R - Y + X).
Repeat for each decomposed relation that violates BCNF.

Result: BCNF, lossless. May not preserve all FDs.

Practical Guidelines

Start at 3NF: Sufficient for most OLTP applications.
Use BCNF if possible: When dependency preservation isn't sacrificed.
Don't over-normalize: 4NF/5NF are rarely needed. Understand when they apply.
Denormalize intentionally: Measure first. Document the denormalization and its rationale.
Use constraints: Even in denormalized schemas, enforce consistency with triggers, application code, or batch validation.

Applications in CS

Schema design: Normalization provides a systematic approach to eliminating redundancy.
Data migration: Normalizing legacy schemas is a common refactoring task.
ETL: Extract-Transform-Load pipelines often denormalize during the "Load" phase (star schema for analytics).
ORM design: ORMs work best with properly normalized schemas.
Performance tuning: Understanding normal forms helps decide when denormalization is justified.

Normalization

Why Normalize?

Update Anomalies

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

3NF Test

Boyce-Codd Normal Form (BCNF)

Fourth Normal Form (4NF)

Fifth Normal Form (5NF / PJNF)

Denormalization Tradeoffs

When to Denormalize

Lossless Join Decomposition

Dependency Preservation

Normalization Algorithm

3NF Synthesis Algorithm

BCNF Decomposition Algorithm

Practical Guidelines

Applications in CS