### Assignment #8 Due Wednesday June 6

#### LOGISTICS

Because this assignment is due on the last day of regular classes, no late assignments will be accepted. On-campus students must turn in this assignment by 4:00 PM on June 6. SITN student assignments must either be turned in by 4:00 PM on June 6 or have a courier timestamp no later than June 7. "Chits" for turning work in late without penalty are not applicable for this assignment.

This is a relatively long written assignment covering quite a bit of material. Please plan accordingly.

#### Problem 1

You are to design a database that maintains information for producing a weekly television guide for a given region (such as Northern California). The data should include information about television shows, television networks, cities, channels, show times, etc. For starters, you may make the following assumptions:
• A given channel in a given city is associated with one network.
• A given show is either owned by a network (and shown on a channel associated with that network) or is a local show and may be shown on any channel.
• Not all shows are shown in all cities, and the days and times for a given show may differ from city to city.
• You may ignore cable channels, which generally are not city-dependent.
Please feel free to make additional assumptions about the real world in your design, as long as the assumptions are reasonably realistic and are stated clearly as part of your solution.

(a) Specify your database design using ODL. In addition to class (interface) definitions with attributes and relationships, don't forget to include keys and inverse relationships as appropriate. Note that there are many choices to make in designing an ODL schema for this database - there is no single right answer to this question by any means, although some answers may be better than others.

(b) Using the method for translating an ODL design into relations, produce a set of relations for your database design. In the cases where different mappings to relations are possible, choose whichever one you think is more appropriate. Be sure to specify (underline) keys for all relations.

#### Problem 2

Recall that a key for an ODL class C is a set of attributes {A1,A2,...,An} of C such that no two objects in C will ever have the same value for all attributes {A1, A2,...,An}. Although technically ODL does not allow relationships in keys, it is certainly reasonable for a key to include relationships as well as attributes. In this case, a key for a class C is a set of attributes and/or relationships {A1,...,An,R1,...,Rm} of C such that no two objects in C will have the same value for all attributes and relationships {A1,...,An,R1,...,Rm}.

(a) Give an example of an ODL class (interface) definition and a key for the class such that the key includes at least one relationship, and there is no natural key for the class that includes only attributes. Use any real-world domain you find appropriate.

(b) What concept in the entity-relationship model corresponds to an ODL class where a relationship is needed in the key?

#### Problem 3

Consider the following self-explanatory database design specified in ODL. The database includes information about athletes in general, about basketball players specifically, and for basketball players about their injuries and the commercials they have appeared in.
```  interface Athlete (key (name,jerseyNumber)) {
attribute string name;
attribute int jerseyNumber;
attribute int year; }

attribute string position;
attribute Set <Struct{string bodypart,int severity}> Injuries;
relationship Set<Commercial> starredIn
inverse Commercial::star; }

interface Commercial {
attribute string company;
attribute string product;
```
Convert this design into an object-relational design based on the SQL3 standard as covered in class and the textbook. Show the definitions of any row types or value types you use, along with the actual relations. Be sure to specify (underline) keys for all relations. If any of your relations are not in Boyce-Codd Normal Form (BCNF) or not in Fourth Normal Form (4NF), please say so and explain why.

#### Problem 4

Consider the following self-explanatory object-relational schema using SQL3 row types and references:
```  create row type AddressType as (street string, city string)
create row type CollegeType as (name string, city string)

create table Student of type StudentType
create table College of type CollegeType
create table Attends(student ref(StudentType), college ref(CollegeType),
tuition integer)
create table Roommates(student1 ref(StudentType), student2 ref(StudentType))
```
Write SQL3 queries for each of the following. You may assume that student and college names are unique, that all students have exactly one address, attend one college, and have one roommate, and that all colleges are located in exactly one city.

(a) Find the names of all students who live in Palo Alto.

(b) Find the names of all students who attend Stanford.

(c) Find the names of all students who live in the same city as the college they attend and pay tuition of at least \$10,000.

(d) Find the names of all students who live in the same city and on the same street as their roommate. (Let's assume that addresses in the database are home addresses and not college addresses - otherwise this query would return everyone.)

#### Problem 5

Consider the following relation that records sales information at a retail store:
```  Sales(saleID,date,itemID,color,size,qty,unitPrice)
```

Consider the following two queries over the Sales relation:

```  // Total revenue grouped by date and item:
SELECT date, itemID, SUM(Qty*unitPrice)
FROM Sales
GROUP BY date, itemID

// Total revenue grouped by item and color
SELECT itemID, color, SUM(qty*unitPrice)
FROM Sales
GROUP BY itemID, color
```

Specify a view V over the Sales relation. You should choose V so that if V is stored as a materialized view, then V can be used to substantially speed up the execution of both of the above queries, assuming that the Sales relation is very large. In addition to specifying V, show how each of the two queries above can be rewritten into an equivalent query that uses V instead of Sales.

#### Problem 6

Consider a much simpler relation Sales(itemID), which just lists items sold. The relation has no key, i.e., a value for itemID may appear multiple times in Sales. One fairly straightforward data mining problem is called the frequent itemsets problem: find those items that appear most often in a relation. Suppose we want to find the top ten itemID's in Sales based on how often they appear. Write a SQL query to produce the desired result. If you find it useful to define a sequence of one or more views used by the final query you may do so. Also, if it helps you may assume that frequencies are unique, i.e., that there are no two itemID's that appear the same number of times in Sales.

#### Problem 7

Consider a slightly more complicated version of the Sales relation than in the previous problem: now Sales(saleID,itemID) contains "market basket" data as discussed in class - it records purchase transactions identified by saleID and the items purchased in each transaction. Attributes [saleID,itemID] together form a key for this relation.

We want to mine from this relation all single-attribute association rules with certain support and confidence thresholds (as defined in class). Specifically, write a SQL query that produces all pairs of itemID's [I1,I2] such that I1->I2 is an association rule with support > 0.1 and confidence > 0.6. If you find it useful to define a sequence of one or more views used by the final query you may do so.

#### Problem 8

(a) Suppose your friend asks you to send her information about the courses you're taking this quarter. Being a savvy e-commerce type of person, your friend would like to receive the information as an XML document accompanied by a DTD, so she can parse it automatically and put the information into her proprietary database. Without going overboard, specify an XML encoding of your course information, and include a DTD that describes the structure of the data. Only the basics are necessary: course number, department, instructor, time, location, and expected grade. Your encoding should use at least two levels of nesting and should include at least one attribute of type ID and at least one attribute of type IDREF or IDREFS. If you are an SITN student taking only one course this quarter, include at least two additional courses that you've taken in the past or intend to take in the future.

(b) Now your friend wants to pose a query over your data, specifically she wants to find the instructors of all courses in which you are expecting to receive an "A" grade. Based on the XML structure in part (a), write this query using any (or all) of the XML query languages discussed in class. (If you prefer to use an XML query language not discussed in class, you are welcome to do so as long as you name and provide a citation for the language.)