5 min read
On this page

Normalization

Normalization decision flow — 1NF through BCNF

Normalization eliminates data redundancy and update anomalies by decomposing tables into well-structured forms. Each normal form imposes stricter constraints.

Why Normalize?

Update Anomalies

Without normalization, redundant data causes problems:

StudentCourses(student_id, student_name, course_id, course_name, instructor)

| sid | name  | cid  | course   | instructor |
|-----|-------|------|----------|------------|
| 1   | Alice | CS50 | Intro CS | Dr. Smith  |
| 1   | Alice | CS51 | Data Str | Dr. Jones  |
| 2   | Bob   | CS50 | Intro CS | Dr. Smith  |

Insertion anomaly: Can't add a course without a student enrolled.

Deletion anomaly: Deleting Bob's enrollment loses the fact that CS50 is taught by Dr. Smith (if Bob is the last student).

Update anomaly: If Dr. Smith changes, must update every row mentioning them. Miss one → inconsistency.

Solution: Decompose into Students, Courses, and Enrollments tables.

First Normal Form (1NF)

Rule: Every attribute contains only atomic (indivisible) values. No repeating groups, no sets, no nested tables.

Violation:

| student | courses         |
|---------|-----------------|
| Alice   | CS50, CS51      |  ← multivalued!
| Bob     | CS50            |

Fix: One row per student-course pair, or a separate Enrollments table.

Note: Modern databases support JSON, arrays, etc. — 1NF is about the relational model, not about what databases technically allow.

Second Normal Form (2NF)

Rule: 1NF + every non-key attribute is fully dependent on the entire primary key (no partial dependencies).

Only relevant for tables with composite primary keys.

Violation (PK = {student_id, course_id}):

Enrollments(student_id, course_id, student_name, grade)

student_name depends only on student_id, not on the full key → partial dependency.

Fix: Move student_name to a Students table.

Students(student_id, student_name)
Enrollments(student_id, course_id, grade)

Third Normal Form (3NF)

Rule: 2NF + no transitive dependencies (non-key attribute depending on another non-key attribute).

Formally: For every FD X → A, either X is a superkey OR A is part of a candidate key (prime attribute).

Violation:

Students(student_id, name, department_id, department_name)

student_id → department_id → department_name is transitive. department_name depends on department_id, not directly on student_id.

Fix:

Students(student_id, name, department_id)
Departments(department_id, department_name)

3NF Test

For every non-trivial FD X → Y in the relation:

  • X is a superkey, OR
  • Y is a prime attribute (part of some candidate key)

If both conditions fail for any FD → not in 3NF.

Boyce-Codd Normal Form (BCNF)

Rule: For every non-trivial FD X → Y, X must be a superkey.

Stricter than 3NF — removes the "Y is prime" exception.

3NF but not BCNF example:

CourseInstructor(student, course, instructor)
FDs: {student, course} → instructor
     instructor → course  (each instructor teaches one course)

Candidate keys: {student, course}, {student, instructor}
instructor → course: instructor is not a superkey → violates BCNF
But course is a prime attribute → satisfies 3NF

Fix: Decompose into:

Teaches(instructor, course)
Takes(student, instructor)

Trade-off: BCNF decomposition may not preserve all FDs (no dependency preservation guarantee). 3NF always preserves FDs.

Fourth Normal Form (4NF)

Rule: BCNF + no multi-valued dependencies (except those implied by superkeys).

Multi-valued dependency (MVD): X →→ Y means: for each X value, the set of Y values is independent of the other attributes.

Violation:

PersonSkillsLangs(person, skill, language)

| person | skill   | language |
|--------|---------|----------|
| Alice  | Python  | English  |
| Alice  | Python  | French   |
| Alice  | Rust    | English  |
| Alice  | Rust    | French   |

person →→ skill and person →→ language (skills and languages are independent). Must store every combination → redundancy.

Fix:

PersonSkills(person, skill)
PersonLanguages(person, language)

Fifth Normal Form (5NF / PJNF)

Rule: 4NF + no join dependencies that aren't implied by candidate keys.

A join dependency means the table can be reconstructed by joining three or more of its projections.

Rarely encountered in practice. Most schemas are well-served by 3NF or BCNF.

Denormalization Tradeoffs

| Aspect | Normalized | Denormalized | |---|---|---| | Redundancy | None | Controlled redundancy | | Update anomalies | None | Must manage manually | | Storage | Less | More | | Read performance | May need joins | Faster reads (pre-joined) | | Write performance | Faster writes | Slower (update multiple copies) | | Query complexity | More joins | Simpler queries |

Normalize by default. Denormalize for measured performance needs.

When to Denormalize

  • Read-heavy OLAP workloads: Pre-compute aggregates, materialized views.
  • Caching columns: Store computed values to avoid expensive joins.
  • Embedding related data: Store frequently-accessed related data in the same row (in document databases, this is natural).
  • Reporting tables: Flatten for simple reporting queries.

Lossless Join Decomposition

A decomposition is lossless if the original table can be reconstructed exactly by joining the decomposed tables.

Lossless condition (for binary decomposition R into R₁, R₂):

(R₁ ∩ R₂) → R₁  OR  (R₁ ∩ R₂) → R₂

The common attributes must be a superkey of at least one of the decomposed tables.

BCNF decomposition is always lossless. But it may not preserve all FDs.

Dependency Preservation

A decomposition preserves dependencies if all original FDs can be checked within individual decomposed tables (without joins).

3NF decomposition can always be both lossless and dependency-preserving. BCNF decomposition may sacrifice dependency preservation.

Normalization Algorithm

3NF Synthesis Algorithm

  1. Find a canonical cover of FDs.
  2. For each FD X → A: create table (X, A).
  3. If no table contains a candidate key: add a table with a candidate key.
  4. Remove redundant tables (subset of another table's attributes).

Result: 3NF, lossless, dependency-preserving.

BCNF Decomposition Algorithm

  1. Find a non-trivial FD X → Y where X is not a superkey.
  2. Decompose R into R₁(XY) and R₂(R - Y + X).
  3. Repeat for each decomposed relation that violates BCNF.

Result: BCNF, lossless. May not preserve all FDs.

Practical Guidelines

  1. Start at 3NF: Sufficient for most OLTP applications.
  2. Use BCNF if possible: When dependency preservation isn't sacrificed.
  3. Don't over-normalize: 4NF/5NF are rarely needed. Understand when they apply.
  4. Denormalize intentionally: Measure first. Document the denormalization and its rationale.
  5. Use constraints: Even in denormalized schemas, enforce consistency with triggers, application code, or batch validation.

Applications in CS

  • Schema design: Normalization provides a systematic approach to eliminating redundancy.
  • Data migration: Normalizing legacy schemas is a common refactoring task.
  • ETL: Extract-Transform-Load pipelines often denormalize during the "Load" phase (star schema for analytics).
  • ORM design: ORMs work best with properly normalized schemas.
  • Performance tuning: Understanding normal forms helps decide when denormalization is justified.