CS145 Lecture Notes (6) -- Redundancy, Redundancy, and Normalization

For all but the very simplest database application, there are many, many different relational database "designs" (schemas) that can be used to store the relevant data.

Q: Is one schema better than another?
A: Often, yes

Relational database design is one of the more theoretical topics of CS145, but it has fairly practical applications.

We will focus on one very important aspect of relational design: Redundancy, and the problems it causes. We will look at two different kinds of redundancy. For each, we will:

Characterize the redundancy formally
Define a "Normal Form" for relations that prevents it
Learn how to detect it and eliminate it

Request for patience: This is a topic for which it is hard to see the importance at first. When designing schemas at the scale we have been dealing with, people tend to naturally avoid the redundancy problems we will discuss. Also, the formalization of the redundancy can get a little tedious. It will turn out to be useful though, so have patience.

Example schema:

   Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)
   Apply(ID, campus, date, major)

Anomalies

This relation exhibits three types of "anomalies":

Redundancy
Update Anomaly
Deletion Anomaly

Functional Dependencies

Suppose Student.priority is determined by Student.cumGPA.

(show example)

Formally:

   For every pair of tuples t and u in Student:
   if t[cumGPA] = u[cumGPA] then t[priority] = u[priority]

This is a "functional dependency" (FD): a constraint specified with the schema of a relation.

Based on knowledge of the real world
All data must adhere to it

Notation for declaring an FD for a relation R:

   A₁, A₂, ..., A_m → B₁, B₂, ..., B_n   (commas may be omitted)

states that for every pair of tuples t and u in R: if

t[A₁,...,A_m]
= u[A₁,...,A_m]

then t[B₁,..,B_n] = u[B₁,..,B_n].

We will abbreviate A₁, A₂, ..., A_m as AA (or "A-bar") and B₁, B₂, ..., B_n as BB (or "B-bar").

(simple abstract example)

Question: What are some functional dependencies for Student besides cumGPA → priority?

Question: What are some functional dependencies for Apply?

Functional Dependencies and Keys

AA is a key for R if:

AA → BB, where BB is all attributes of R
no subset of AA satisfies (1), i.e., AA is minimal

(show abstract example)

Subtlety: What if relation R can contain duplicate tuples?

Note that key is required to be minimal; introduce "superkey"

Trivial FD

AA → BB where BB is a subset of AA

Nontrivial FD

AA → BB where BB is not a subset of AA (could have some overlap)

Completely nontrivial FD

AA → BB with no overlap between AA and BB

=> Most of the time we're interested in completely nontrivial FD's.

Rules for Functional Dependencies

Splitting Rule

If AA → B₁, B₂, ..., B_n then AA → B₁, AA → B₂, ..., AA → B_n

Question: Can we also split the left-hand side?

Combining Rule

If AA → B₁, AA → B₂, ..., AA → B_n then AA → B₁, B₂, ..., B_n

Trivial Dependency Rules

If AA → BB then AA → (BB - AA)
If AA → BB then AA → (BB U AA)

Transitive Rule

If AA → BB and BB → CC then AA → CC

Closure of Attributes

Given a relation R, a set of FD's for R, and a set of attributes {A₁, A₂, ..., A_m} of R:

   Find all attributes B in R such that A1, A2, ..., Am → B

This set of attributes is the "closure" and is denoted {A1, A2, ..., Am}+

Algorithm for computing closure:

   start with {A₁, A₂, ..., A_m}
   repeat until no change:
      if current set of attributes includes LHS of a dependency,
      add RHS attributes to the set

(Effectively applies combining and transitive rules until there's no more change.)

Example: closure of {ID, HScode} in Student given FD's:

   ID → name, address, cumGPA
   cumGPA → priority
   HScode → HSname, HScity

Question: How can we exploit closure to test whether a set of attributes is a key?

Related question: How can we find all keys given a set of FD's?

Using FD's to Produce a Good Relational Schema

Start with set of relations
Define FD's (and therefore keys) for them based on real world
Transform relations to "normal form" ("normalize" the relations)

Relation Decomposition

Goal: "decompose" relations into smaller, better ones as above. Do it automatically with a formal framework based on FD's (and later MVD's).

Definition: R₁(A₁, ..., A_m) / R₂(B₁, ..., B_n) is a decomposition of R(C₁, ..., C_k) if {A₁, ..., A_m} U {B₁, ..., B_n} = {C₁, ..., C_k}

(diagram)

Idea of decomposition:

R₁ = π_{A₁, ..., A_m} (R), eliminating duplicates
R₂ = π_{B₁, ..., B_n} (R), eliminating duplicates
R₁ NATURAL-JOIN R₂ = R

Note: R is usually not actually created, it's just a step in the design process.

Example decomposition:

(check criteria for decomposition)

Boyce-Codd Normal Form (BCNF)

Defines which decompositions are good ones

Given: relation R and set of FD's for R

Definition: R is in BCNF with respect to its FD's if for every nontrivial FD AA → BB, AA contains a key.

Question: Why does violating this requirement produce a "bad" relation?

Example:

   Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)

   FD's: ID → name, address, cumGPA
         cumGPA → priority
         HScode → HSname, HScity

   Key:  ID, HScode

Question: Is the relation in BCNF?

Each violation produces anomalies.

Example:

   Apply(ID, campus, date, major)

Can apply to campus multiple times for different majors, but can only apply to a campus once per day

   FD's: ID, campus, date → major
   Key:  ID, campus, date

Question: Is the relation in BCNF?

Algorithm for decomposing a relation into BCNF relations:

   Input: relation schema R and set of FD's for R

   (1) compute keys for R based on FD's
   (2) repeat until no more BCNF violations:
         (2a) pick any R' with AA → BB that violates BCNF
         (2b) decompose R' into R1(AA,BB) and R2(AA,CC)
              where CC is all attributes in R' except (AA U BB)
         (2c) compute FD's for R1 and R2
         (2d) compute keys for R1 and R2 based on FD's

(diagram)

Question: How can we compute keys in steps (1) and (2d)?

Question: How do we compute FD's in step (2c)?

(run algorithm on Student example)

=> Final decomposed relations may be different depending on which violating FD is chosen in each iteration (step 2(a)), but all decompositions will be in BCNF.

Multivalued Dependency (MVD)

Definition:

   AA ->> BB is an MVD for relation R if:
   For all tuples t,u in R:
      If t[AA] = u[AA] then there exists a v in R such that:
         (1) v[AA] = t[AA]
         (2) v[BB] = t[BB]
         (3) v[CC] = u[CC] where CC is all attributes in R except (AA U BB)

(show with picture; show implied fourth tuple)

MVD's are also called "tuple-generating dependencies."

Example:

Apply(ID, campus, sport)
Assume all students apply to each campus with all sports.

Question: What are FD's?

Question: What is key?

Question: Is it in BCNF?

Question: What are MVD's?

(show example data to verify)

Intuition: MVD's uncover situations where independent facts related to a certain object are being squished together in one relation.

Trivial MVD

AA ->> BB where BB is a subset of AA or (AA U BB) contains all attributes of R

Nontrivial MVD

AA ->> BB where BB is not a subset of AA and (AA U BB) does not contain all attributes of R

Rules for Multivalued Dependencies

FD-is-an-MVD Rule

If AA → BB then AA ->> BB

Prove by showing (1), (2), (3) in MVD definition.

Transitive Rule

If AA ->> BB and BB ->>CC then AA ->>CC

Complementation Rule

If AA ->>BB then AA ->>CC where CC is all attributes in R except (AA U BB)

Question: Are there any rules for FD's that do not apply for MVD's?

Fourth Normal Form (4NF)

Given: relation R and set of MVD's for R

Definition: R is in 4NF with respect to its MVD's if for every nontrivial MVD AA ->>BB, AA contains a key.

Note: Since every FD is also an MVD, 4NF implies BCNF

Question: What happens in the MVD definition if AA contains a key?

Algorithm for decomposing a relation into 4NF relations (same idea as BCNF):

   Input: relation schema R and set of FD's and MVD's for R

   (1) compute keys for R based on FD's
   (2) repeat until no more 4NF violations:
         (2a) pick any R' with AA ->>BB that violates 4NF
         (2b) decompose R' into R1(AA,BB) and R2(AA,CC)
              where CC is all attributes in R' except (AA U BB)
         (2c) compute FD's and MVD's for R1 and R2
         (2d) compute keys for R1 and R2 based on FD's

Question: How do we compute MVD's in step (2c)?

(run algorithm on Apply example)