For all but the very simplest database application, there are many, many different relational database "designs" (schemas) that can be used to store the relevant data.

- Q: Is one schema better than another?
- A: Often, yes

Relational database design is one of the more theoretical topics of CS145, but it has fairly practical applications.

We will focus on one very important aspect of relational design: **Redundancy**, and
the problems it causes. We will look at two different kinds of redundancy. For each,
we will:

- Characterize the redundancy formally
- Define a "Normal Form" for relations that prevents it
- Learn how to detect it and eliminate it

Request for patience: *This is a topic for which it is hard to see the importance
at first. When designing schemas at the scale we have been dealing with, people
tend to naturally avoid the redundancy problems we will discuss. Also,
the formalization of the redundancy can get a little tedious. It will turn out to
be useful though, so have patience.*

Example schema:

Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority) Apply(ID, campus, date, major)

- Redundancy
- Update Anomaly
- Deletion Anomaly

`Student.priority`

is determined by
`Student.cumGPA`

.
*(show example)*

Formally:

For every pair of tuples t and u in Student: if t[cumGPA] = u[cumGPA] then t[priority] = u[priority]This is a "functional dependency" (FD): a constraint specified with the schema of a relation.

- Based on knowledge of the real world
- All data must adhere to it

Astates that for every pair of tuples t and u in R: if_{1}, A_{2}, ..., A_{m}→ B_{1}, B_{2}, ..., B_{n}(commas may be omitted)

`t[A`_{1},...,A_{m}]
= u[A_{1},...,A_{m}]

then `t[B`_{1},..,B_{n}] = u[B_{1},..,B_{n}]

.
We will abbreviate `A`

as _{1}, A_{2}, ..., A_{m}`AA`

(or
"A-bar") and `B`

as _{1}, B_{2}, ..., B_{n}`BB`

(or "B-bar").

*(simple abstract example)*

`Student`

besides `cumGPA → priority`

`Apply?`

- AA → BB, where BB is all attributes of R
- no subset of AA satisfies (1), i.e., AA is minimal

Note that key is required to be minimal; introduce "superkey"

*Question: Can we also split the left-hand side?*

If AA → BB then AA → (BB U AA)

Find all attributes B in R such that A1, A2, ..., Am → BThis set of attributes is the "closure" and is denoted {A1, A2, ..., Am}+

**Algorithm** for computing closure:

start with {A(Effectively applies combining and transitive rules until there's no more change.)_{1}, A_{2}, ..., A_{m}} repeat until no change: if current set of attributes includes LHS of a dependency, add RHS attributes to the set

*Example:* closure of `{ID, HScode}`

in `Student`

given FD's:

ID → name, address, cumGPA cumGPA → priority HScode → HSname, HScity

- Start with set of relations
- Define FD's (and therefore keys) for them based on real world
- Transform relations to "normal form" ("normalize" the relations)

**Definition:** R_{1}(A_{1}, ..., A_{m}) / R_{2}(B_{1}, ..., B_{n}) is a decomposition of
R(C_{1}, ..., C_{k}) if {A_{1}, ..., A_{m}} U {B_{1}, ..., B_{n}} = {C_{1}, ..., C_{k}}

*(diagram)*

Idea of decomposition:

- R
_{1}= π_{A1, ..., Am}(R), eliminating duplicates - R
_{2}= π_{B1, ..., Bn}(R), eliminating duplicates - R
_{1}NATURAL-JOIN R_{2}= R

*Example decomposition:*

**Given:** relation R and set of FD's for R

**Definition:** R is in BCNF with respect to its FD's if for every
nontrivial FD AA → BB, AA contains a key.

*Question: Why does violating this requirement produce a "bad" relation?*

Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority) FD's: ID → name, address, cumGPA cumGPA → priority HScode → HSname, HScity Key: ID, HScode

*Question: Is the relation in BCNF?*

Each violation produces anomalies.

Apply(ID, campus, date, major)Can apply to campus multiple times for different majors, but can only apply to a campus once per day

FD's: ID, campus, date → major Key: ID, campus, date

*Question: Is the relation in BCNF?*

Input: relation schema R and set of FD's for R (1) compute keys for R based on FD's (2) repeat until no more BCNF violations: (2a) pick any R' with AA → BB that violates BCNF (2b) decompose R' into R1(AA,BB) and R2(AA,CC) where CC is all attributes in R' except (AA U BB) (2c) compute FD's for R1 and R2 (2d) compute keys for R1 and R2 based on FD's

*Question: How can we compute keys in steps (1) and (2d)?*

*Question: How do we compute FD's in step (2c)?*

`Student`

example)=> Final decomposed relations may be different depending on which violating FD is chosen in each iteration (step 2(a)), but all decompositions will be in BCNF.

**Definition:**

AA ->> BB is an MVD for relation R if: For all tuples t,u in R: If t[AA] = u[AA] then there exists a v in R such that: (1) v[AA] = t[AA] (2) v[BB] = t[BB] (3) v[CC] = u[CC] where CC is all attributes in R except (AA U BB)

MVD's are also called "tuple-generating dependencies."

*Example:*

` Apply(ID, campus, sport) `

Assume all students apply to each campus with all sports.

*Question: What are FD's?*

*Question: What is key?*

*Question: Is it in BCNF?*

*Question: What are MVD's?*

*(show example data to verify)*

Prove by showing (1), (2), (3) in MVD definition.

*Question: Are there any rules for FD's that do not apply for MVD's?*

**Definition:** R is in 4NF with respect to its MVD's if for
every nontrivial MVD AA ->>BB, AA contains a key.

**Note:** Since every FD is also an MVD, 4NF implies BCNF

*Question: What happens in the MVD definition if AA contains a key?*

Input: relation schema R and set of FD's and MVD's for R (1) compute keys for R based on FD's (2) repeat until no more 4NF violations: (2a) pick any R' with AA ->>BB that violates 4NF (2b) decompose R' into R1(AA,BB) and R2(AA,CC) where CC is all attributes in R' except (AA U BB) (2c) compute FD's and MVD's for R1 and R2 (2d) compute keys for R1 and R2 based on FD's

*Question: How do we compute MVD's in step (2c)?
*

*(run algorithm on Apply example)*