CS145 - Spring 2002
Introduction to Databases
Assignment #3   --   Due Wednesday May 1
|
- The procedure for turning in this assignment, the late policy, and the
grading (and non-grading) of exercises and challenge problems, is
exactly the same as for Assignment #1.
- Suggestion: You may want to do the written problems first,
since the relational design material covered in the written work is
directly applicable to the project work.
- Consider a relation R(A,B,C). Suppose R
contains the following four tuples:
A B C
-------
1 2 3
1 2 4
5 2 3
5 2 6
(a) List all completely nontrivial functional dependencies
that hold on this instance of R.
(b) List all nontrivial multivalued dependencies that hold
on this instance of R.
- Do Exercise 3.6.4 in the textbook (page 117).
- Consider a database for a university that may include information
about students, courses, professors, etc.
(a) Within this general domain, specify the schema for an
example relation and a set of functional dependencies (FD's) over the
relation such that the relation is not in Boyce-Codd Normal Form
(BCNF). Your relation need not be extensive (i.e., it need not
include many attributes and it may represent only a very small part of
the complete university information), but the schema and the FD's
should be realistic in their modeling of the real world. You should
make no assumptions besides those captured by the schema and the FD's.
(b) Decompose your relation from part (a) so that the new
schema is in BCNF.
(c) Now specify the schema for an example relation and a set
of FD's and multivalued dependencies (MVD's) over the relation such
that the relation is in BCNF but is not in Fourth Normal Form (4NF).
Again, your relation need not be extensive but the schema and the
dependencies should be realistic in their modeling of the real world,
and you should make no assumptions besides those captured by the
schema and the dependencies. The relation you give for this part of
the problem can capture similar or different information from that
captured in part (a).
(d) Decompose your relation from part (c) so that the new
schema is in 4NF.
- Do part (c) of Exercise 3.5.4 in the textbook
(page 101).
- Do part (b) of Exercise 3.7.8 in the textbook (page 127).
- Consider the following rule for functional dependencies:
IF A -> BB, CC -> D, and CC is a subset of BB
THEN A -> D
where A and D are single attributes and BB and CC are sets of
attributes. For simplicity you may assume that A and D are not in BB
or CC. Prove the correctness of this rule. Your proof should not be
based on closure or on other rules for functional dependencies, but
rather it should be based on the formal definition of functional
dependencies:
AA -> BB =
for all tuples t and u, if t[AA]=u[AA] then t[BB]=u[BB]
where AA and BB are sets of attributes.
Your proof might have roughly the following form: Suppose A ->
BB holds, CC -> D holds, and CC is a subset of BB. Then for all tuples
t and u, ... [fill in] ... To prove that A -> D holds, we need to prove
that for all tuples t and u, ... [fill in]. Therefore A -> D holds.
- Another way to prove the rule in Exercise 6 is to use
closure of attributes. Given the facts:
A -> BB, CC -> D, CC is a subset of BB
compute the closure {A}+ of the attribute set {A}.
Does {A}+ contain attribute D? If so, you have
proven the rule, i.e., you have shown A -> D. (As an example
of this type of proof, see the justification of the Transitive Rule
on page 96 of the textbook.)
- A third way to prove the rule in Exercise 6 is to use other
rules for functional dependencies that already have been proven.
By applying the splitting, combining, and
transitive rules it is possible to infer A -> D from
the three facts:
A -> BB, CC -> D, CC is a subset of BB
Show the steps in the inference.
- (This one is relatively straightforward for a challenge
problem, but more time-consuming than a typical exercise.)
Consider
the following rule for multivalued dependencies:
IF AA ->> BB and AA ->> CC
THEN AA ->> (BB minus CC)
where AA, BB, and CC are sets of attributes, and minus
performs set difference. For simplicity you may assume that AA does
not intersect BB or CC. As in Exercise 6, your proof should be based
on the formal definition of (in this case) multivalued dependencies,
and not on other rules for multivalued dependencies.
Your proof might have roughly the following form: Suppose AA ->>
BB and AA ->> CC hold. Then for all tuples t and u there exists a
tuple v such that ... [fill in] ... Let DD = BB minus CC. To prove
that AA ->> DD holds, we need to prove that for all tuples t and u
there exists a tuple v such that ... [fill in]. Therefore AA ->> DD
holds.
- Consider a relation R(A1, A2, ..., An) where n
> 1. Give a general algorithm for constructing a relation instance
that satisfies the functional dependencies:
A1 -> A2
A2 -> A3
A3 -> A4
...
A(n-1) -> An
and all functional dependencies that follow from these, but does not
satisfy any other functional dependencies. Even better, give an
algorithm that finds a minimal such instance.
- Consider the Boyce-Codd Normal Form (BCNF) decomposition
algorithm specified in class and discussed in the textbook. We start
with a relation R and a set F of functional
dependencies (FDs) for R. We decompose R into
R1 and R2 based on a dependency in F that
violates BCNF, and then we compute the sets of FDs F1 for
R1 and F2 for R2 so we can continue with
the algorithm. Suppose that when we compute the FDs for R1
and R2, instead of finding all FDs that follow from our
original set F and that apply to R1 and R2
(i.e., that include only attributes from R1 or R2),
we simply use the original set F of FDs specified for
R and take those that apply to R1 and R2.
Can we get a decomposition that is not in BCNF? If not, argue why the
algorithms still works. If so, illustrate the problem: give a
relation schema R and a set of FDs for R, and trace
how the BCNF decomposition algorithm produces a final set of relations
that is not in BCNF.
Now that you're familiar with the basics of Oracle and interacting
with Oracle from a Web interface, it's time to delve into the real
project: AuctionBase. We're providing you
with a significant amount of actual auction data encoded in XML files.
(The data was downloaded in 2001 from eBay courtesy of the University of
Wisconsin Databae Group, then transformed and augmented for use in
CS145.) You will examine the data and design a relational schema for
it. You will then use an XML parser (if you wish) as the basis for a
program that transforms the data from its XML form into Oracle's load
file format, conforming to your relational schema. Finally, you will
create your schema in Oracle, load the data, and try a few queries.
Part A: Examine the XML data files
We are providing an XML-encoded auction data set for you to use in
your project. The data files are located in directory
/usr/class/cs145/ebay_data/ (symbolically linked for your convenience as http://www.stanford.edu/class/cs145/ebay_data/).
- As a small data set for initial experimentation and
debugging we suggest you use just one file:
/usr/class/cs145/ebay_data/items-0.xml (also http://www.stanford.edu/class/cs145/ebay_data/items-0.xml).
It contains 500 auctions, comprising about 900K bytes of ASCII data.
- Your AuctionBase system also must work on the large data
set, which consists of all 40 files:
/usr/class/cs145/ebay_data/items-?.xml, for ? =
0..39. There are a total of about 20,000 auctions, comprising about
38 megabytes of ASCII data.
Each XML data file is valid with respect to the XML DTD specified in
file /usr/class/cs145/ebay_data/items.dtd (also http://www.stanford.edu/class/cs145/ebay_data/items.dtd).
Your first task is to examine the DTD and the XML files to
completely understand the data you will be starting with. You will
translate this data into relations and load it into your AuctionBase
system. Please read the auction data help file in
/usr/class/cs145/ebay_data/items.txt (also http://www.stanford.edu/class/cs145/ebay_data/items.txt).
One of the most important things to understand about the data you're
starting with is that it represents a single point in time. (Very
specifically, it represents the point in time December 20th, 2001, one
second after midnight.) It contains items that have been auctioned
off in the past and items that are "currently" up for auction. Once
your AuctionBase system is online, you will provide the capability to
enter bids on items, to "move forward" in time so that auctions close,
and to add new users and auctions.
As you develop your AuctionBase system you are free to impose any
reasonable auction rules you like -- in particular you are not
required to follow eBay's rules -- as long as they are consistent with
the provided data. (For example, if the provided data contains an
auction with two bids that are 50 cents apart, you cannot impose a bid
increment minimum of one dollar.)
Note that some of the Description elements in the XML data
are quite long. We suggest that you use the Oracle attribute type
VARCHAR2(4000) (which is the maximum length) to store
descriptions, and simply truncate any that are too long to fit. If
you really want to store the full descriptions you can try using the
Oracle datatypes LONG or CLOB, but they have a
number of restrictions and most likely will cause you grief. See Limits from
SQL for
Web Nerds for a discussion of the problems associated with storing
long strings in Oracle.
Part B: Design your relational schema
Now that you understand the data you'll be working with, design a
good relational schema for it.
- List your relations. Please specify all keys that hold on
each relation. You need not specify attribute types at this stage.
- List all nontrivial functional dependencies that hold on each
relation, excluding those that effectively specify keys.
- Are all of your relations in Boyce-Codd Normal Form (BCNF)? If
not, either redesign them and start over, or explain why you feel it
is advantageous to use non-BCNF relations.
- List all nontrivial multivalued dependencies that hold on each
relation, excluding those that are also functional dependencies.
- Are all of your relations in Fourth Normal Form (4NF)? If not,
either redesign them and start over, or explain why you feel it is
advantageous to use non-4NF relations.
Part C: Write a data transformation program
Now write a program that transforms the XML data into Oracle load
files that are consistent with the relational schema you settled on in
Part B. This coding will most likely represent the bulk of your
work for this project part. You may use absolutely any language
you like to code your data tranformation program. However, you will
probably find that your job is much easier if you make use of an XML
parser. We are providing two parser skeletons, one for C++ users and
one for Java users. If you use these skeleton programs, all of the
"parsing" is done for you: the Java version invokes JDK 1.4's built-in
XML parser, and the C++ version uses the Xerces C++ parser (http://xml.apache.org/xerces-c/).
You need to fill in code that processes the internal representation of
the XML tree and produces Oracle files according to the relational
schema you designed in Part B. Detailed information is provided as
comments in the respective skeletons:
- The C++ skeleton program is in files
/usr/class/cs145/hw3/MyParserSkeleton.cc and
/usr/class/cs145/hw3/MyParserSkeleton.h.
- The Java skeleton program is in file
/usr/class/cs145/hw3/MyParserSkeleton.java.
You may also find the notes
from the XML Parsing Help Session useful.
Naturally we suggest that you fully debug your program on the small
data set before unleashing it on the large one.
Note on duplicate elimination: When transforming the XML
data to relational tuples you may discover that you generate certain
tuples multiple times but only want to retain one copy. There are
several ways to eliminate duplicates, including:
- Coding duplicate-elimination as part of your transformation program.
- Using Unix tools (e.g., sort and uniq) directly
on the generated load files to eliminate duplicate lines. We
recommend this method as the easiest one, but the choice is yours.
- Relying on the Oracle bulk loader, which will generate an error
but continue loading when a tuple with a duplicate value in a key
attribute is encountered, without loading the erroneous tuple. (It's
a hack but it works!) If you use this approach you may need
to increase your sqlldr errors setting or your load
may not complete, and you should make sure to look out for other,
unintentional, errors.
Part D: Load your data into Oracle
The final step is to create and populate your AuctionBase database.
Using the sqlplus interface to Oracle, issue CREATE
TABLE commands for all of the relations in your schema. Three
suggestions:
- For efficiency we suggest that you specify a PRIMARY KEY
for each table that has at least one key. (See the document Getting
Started with Oracle, available through the course Web site
Project page, for details of specifying primary keys in table
declarations.)
- For attributes representing dates/times, we suggest that you use
Oracle's built-in DATE type. Information is available in the
document Oracle
Dates and Times (available through the course Web site Project
page), and in a chapter called Dates from SQL for Web
Nerds, also linked from the course Web site Project page.
- For attributes representing money, we suggest you use
NUMBER(8,2) to specify a numeric type with 2 digits after the
decimal point.
Once the tables are created, load your data through
sqlplus in the same way that you loaded tables
Courses and Grades in parts 1 and 2 of the project.
Note that for long string fields like VARCHAR2(4000) you
must specify the length of the field in your control file, as shown
here for the Description attribute:
LOAD DATA
[...]
(ItemID, ..., Description CHAR(4000))
At this point you may want to revisit the Note on maintaining
your databases included in Assignment
#2.
Part E: Test your Oracle database
Take your newly loaded database for a "test drive" by running a few
SQL queries over it through sqlplus. Try some simple queries
over one relation and some more complex queries involving joins. Make
sure the results look correct.
What to submit
Create a submission directory containing the following files:
- Your parser code files (C++, Java, Perl, etc.). Please name the
main program file MyParser.xxx. For example, if you
used our Java skeleton, you should submit MyParser.java, and if
you used our C++ skeleton, you should submit MyParser.cc and
MyParser.h.
- A text file named README, containing the following items,
clearly labeled (a) through (e):
(a) Any special compilation or running instructions for your
program and for loading your data into Oracle (e.g., if you did not
eliminate duplicates inside your transformation program, specify the
procedure used to eliminate duplicates)
(b) The 5 items listed in Part B of the assignment description
(c) The CREATE TABLE commands and text of your sqlldr control files
(d) A sample few lines from each of the Oracle load files
produced by your transformation program, suitably labeled so we
can match them against your relational schema. Do not submit the
entire load files!
(e) A transcript showing your table creation commands in Oracle,
the successful loading of your complete database on the large
data set, and the successful execution of 2-3 sample queries over
the large data set demonstrating that the data set is correct.
If your sample queries generate large results, please include only
the first few tuples of each result in your submission.
Once your submission directory is created, submission
instructions are exactly the same as for Project Parts 1 and 2.
Remember that points may be deducted if you do not follow the
submission procedures exactly as specified, including file naming and
contents, and if you submit more than once, all submissions except your
last will be ignored.