HW #4

CS245A PROBLEM SET #4
Due by 5pm, Tuesday, February 17, 1998.
4 problems, equal credit. 100 pts.

Problem 1: Analysis of hash join

Consider the running example for joins contained in the course notes. What is the minimum number of memory buffers that are needed for the hash join algorithm described in the notes to work? Explain your answer. Notice that, when joining R and S after they are hashed, we do not need to bring into memory both one whole bucket of R and one whole bucket of S to do the join (which is what the notes suggest).
Derive a general expression for the minimum memory requirements of hash join, in terms of the number k of available memory buffers, the sizes (in blocks) of R and S and any other relevant parameters.
Consider the following enhancement of hash join, mentioned in the course notes: Try to keep some buckets of R in main memory, if there is space. Then, when bucketizing S, if a record hashes to a value that corresponds to a bucket of R still in main memory, do the join immediately without bucketizing it. Given the amount of main memory blocks we have in our disposal in our scenario (101 blocks), how many R buckets can we keep in main memory? How many blocks will they take up?
What is the total cost of the hash join if we employ this trick?

You can assume that there is no significant data skew (big assumption, but hey! it's only a homework problem)

Problem 2: Query plan evaluation

Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique (no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept entirely in memory (i.e., you do not need to read it from disk).

For relation S, you have two indexes: (i) a non-clustered non-unique B+-tree index for attribute B, and (ii) a clustered non-unique B+-tree index for attribute C. Assume that these two indexes are also kept in memory. Also, assume that all of the tuples of S that agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than one block is needed to store all of the tuples with some value of C, then these blocks will be sequentially located on the disk).

Other relevant data:

2000 tuples of R are stored per block on disk.
N(R) = 2,000,000 (number of tuples of R)
200 tuples of S are stored per block on disk.
N(S) = 200,000 (number of tuples of S)
V(B,S)=40,000 (image size of attribute B in S)
V(C,S)=800 (image size of attribute C in S)

You want to execute the following query:

SELECT * 
FROM R, S
WHERE (R.B=S.B) AND (R.C=S.C)

We present you with two query plans:

Plan 1:

For every block BL of R, retrieved using the clustered index on A for R
    For every tuple r of BL
         Use the index on B for S to retrieve all
             of the tuples s of S such that s.B=r.B
             For each of these tuples s, if s.C=r.C, output r.A, r.B, r.C,
                                                            s.B, s.C, s.D, s.E

Plan 2:

For every block BL of R, retrieved using the clustered index on A for R
    For every tuple r of BL
         Use the index on C for S to retrieve all
             of the tuples s of S such that s.C=r.C
             For each of these tuples s, if s.B=r.B, output r.A, r.B, r.C,
                                                            s.B, s.C, s.D, s.E

Analyze the above two plans carefully in terms of their behavior regarding accesses to disk, and explain which of the plans is therefore better. Be sure to include in your analysis which accesses to disk are sequential accesses and which ones are random accesses.
Can you think of an alternative plan that is better than the above two in terms of disk access? You do not need to submit detail computations for this question.

Problem 3: Counting join orderings

Assume we want to evaluate the following relational expression:

As mentioned in the notes, there are many plans we can construct from this expression, each corresponding to a different join order. Moreover, there are many different algorithms for evaluating a join (e.g., nested loop join, merge sort join, hash join) or a selection (e.g., with or without using an index). That means that each query plan can be evaluated in many different ways, depending on the way each relational operator is executed. We call each such "instantiation" of a query plan a physical query plan.

Assume we can choose between two different ways for doing selections and between three different join algorithms (we can do one selection one way, and another selection another way). Also assume that joins are asymmetric, i.e., is different from

Finally, assume that we are only considering left-deep join orders.

What is the number of different physical query plans available for the above relational expression, given our assumptions? Generalize your answer to the case of an expression involving n selections and n-1 joins.

Note that the exact selection predicates are irrelevant, and so are the schemata of the three relations.

Problem 4: Relational Algebra

Let R(A,B) and S(B,C) be binary relations. For each of the following two equalities, prove that it always holds or give a simple counterexample:
a)

Last modified: Mon Feb 2 20:38:34 PST 1998