HW #3

CS245 PROBLEM SET #3
Due by 5:30pm, Tuesday, July 27, 1999

Problem 1: Relational Algebra (20 points)

The outerjoin of relations R and S is a variation of the join (or natural join) of R and S. The outerjoin result is formed by starting with R join S, and adding any dangling tuples from R or S ( a tuple is dangling if it did not join with any tuple from the other relation). The added dangling tuples must be padded with a special NULL symbol for all attributes that they do not possess, but that appear in the join result. For example, consider R and S as follows.

R:     A   B		   S:	   B   C
    -----------		        -----------
       1   2                       1   4
       2   5                       2   6

R outerjoin S is

       A       B       C
    -----------------------
       1       2       6
       NULL    1       4
       2       5       NULL

(a): Write the relational algebra expression for outerjoin using the basic operators (e.g., select, project, join, cross-product, union, difference). You may use a special relation N = {<NULL>} that contains a single one attribute tuple with value NULL.
(b): Prove select_{B < 10}(R outerjoin S) = select_{B < 10}(R) outerjoin select_{B < 10}(S). You may use the transformation laws introduced in class.

Problem 2: Cost Estimation (20 points)

Consider three relations R(A,B), S(B,C), T(C,D) with the following statistics:

N(R) = 200, V(A, R) = 50, V(B, R) = 100
N(S) = 300, V(B, S) = 50, V(C, S) = 50
N(T) = 400, V(C, T) = 40, V(D, T) = 100

N(R) represents the number of tuples of R, and V(A, R) represents the image size of attribute A in relation R. Estimate the sizes of relations that are the results of the following expressions:

(a): select_{B = 20}(S) join T
(b): select_{R.A < S.C}(R join S)

You may make the simplifying assumptions of containment of value sets and preservation of value sets as mentioned in the textbook on page 372.

Problem 3: Query Plan Evaluation (20 points)

Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique (no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept entirely in memory (i.e., you do not need to read it from disk).

For relation S, you have two indexes: (i) a non-clustered non-unique B+-tree index for attribute B, and (ii) a clustered non-unique B+-tree index for attribute C. Assume that these two indexes are also kept in memory. Also, assume that all of the tuples of S that agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than one block is needed to store all of the tuples with some value of C, then these blocks will be sequentially located on the disk).

Other relevant data:

2000 tuples of R are stored per block on disk.
200 tuples of S are stored per block on disk.
N(R) = 2,000,000
N(S) = 200,000
V(B, S) = 40,000
V(C, S) = 800

You want to execute the following query:

SELECT * 
FROM R, S
WHERE (R.B=S.B) AND (R.C=S.C)

We present you with two query plans:

Plan 1:

for every block BL of R retrieved using the clustered index on A for R do
    for every tuple r of BL do
         use the index on B for S to retrieve all of the tuples s of S such that s.B=r.B;
         for each of these tuples s do
	       if (s.C=r.C) 
	       then output r.A, r.B, r.C, s.B, s.C, s.D, s.E;

Plan 2:

for every block BL of R, retrieved using the clustered index on A for R do
    for every tuple r of BL do
         use the index on C for S to retrieve all of the tuples s of S such that s.C=r.C;
         for each of these tuples s do 
	       if (s.B=r.B) 
	       then output r.A, r.B, r.C, s.B, s.C, s.D, s.E;

(a): Analyze the above two plans carefully in terms of their behavior regarding accesses to disk, and explain which of the plans is therefore better. Be sure to include in your analysis which accesses to disk are sequential accesses and which ones are random accesses.
(b): Can you think of an alternative plan that is better than the above two in terms of disk access? You do not need to submit detail computations for this question.

Problem 4: Join Orderings (20 points)

Assume we want to evaluate the following relational expression:

select_{c1}(R) join select_{c2}(S) join select_{c3}(T)

There are many plans we can construct from this expression, each corresponding to a different join order. Moreover, there are many different algorithms for evaluating a join (e.g., nested loop join, merge sort join, hash join) or a selection (e.g., with or without using an index). That means that each query plan can be evaluated in many different ways, depending on the way each relational operator is executed. We call each such "instantiation" of a query plan a physical query plan.

Assume we can choose between two different ways for doing selections and between three different join algorithms (we can do one selection one way, and another selection another way). Also assume that joins are asymmetric, i.e., R join S is different from S join R. Finally, assume that we are only considering left-deep join orders.

What is the number of different physical query plans available for the above relational expression, given our assumptions? Generalize your answer to the case of an expression involving n selections and n-1 joins.

Note that the exact selection predicates are irrelevant, and so are the schemata of the three relations.