CS245A PROBLEM SET #4 Due by 5pm, Tuesday, February 17, 1998. 4 problems, equal credit. 100 pts. |

- Consider the running example for joins contained in the course
notes. What is the minimum number of memory buffers that are needed
for the hash join algorithm described in the notes to work? Explain
your answer.
Notice that, when joining R and S after they are hashed, we do not
*need*to bring into memory both one whole bucket of R and one whole bucket of S to do the join (which is what the notes suggest). - Derive a general expression for the minimum memory requirements of
hash join, in terms of the number
*k*of available memory buffers, the sizes (in blocks) of R and S and any other relevant parameters. - Consider the following enhancement of hash join, mentioned in the course notes: Try to keep some buckets of R in main memory, if there is space. Then, when bucketizing S, if a record hashes to a value that corresponds to a bucket of R still in main memory, do the join immediately without bucketizing it. Given the amount of main memory blocks we have in our disposal in our scenario (101 blocks), how many R buckets can we keep in main memory? How many blocks will they take up?
- What is the total cost of the hash join if we employ this trick?

Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique (no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept entirely in memory (i.e., you do not need to read it from disk).

For relation S, you have two indexes: (i) a non-clustered non-unique B+-tree index for attribute B, and (ii) a clustered non-unique B+-tree index for attribute C. Assume that these two indexes are also kept in memory. Also, assume that all of the tuples of S that agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than one block is needed to store all of the tuples with some value of C, then these blocks will be sequentially located on the disk).

Other relevant data:

- 2000 tuples of R are stored per block on disk.
- N(R) = 2,000,000 (number of tuples of R)
- 200 tuples of S are stored per block on disk.
- N(S) = 200,000 (number of tuples of S)
- V(B,S)=40,000 (image size of attribute B in S)
- V(C,S)=800 (image size of attribute C in S)

SELECT * FROM R, S WHERE (R.B=S.B) AND (R.C=S.C)We present you with two query plans:

* Plan 1:*

For every block BL of R, retrieved using the clustered index on A for R For every tuple r of BL Use the index on B for S to retrieve all of the tuples s of S such that s.B=r.B For each of these tuples s, if s.C=r.C, output r.A, r.B, r.C, s.B, s.C, s.D, s.E

*Plan 2:*

For every block BL of R, retrieved using the clustered index on A for R For every tuple r of BL Use the index on C for S to retrieve all of the tuples s of S such that s.C=r.C For each of these tuples s, if s.B=r.B, output r.A, r.B, r.C, s.B, s.C, s.D, s.E

- Analyze the above two plans carefully in terms of their behavior regarding
accesses to disk, and explain which of the plans is therefore better. Be
sure to include in your analysis which accesses to disk are sequential
accesses and which ones are random accesses.
- Can you think of an alternative plan that is better than the above two in terms of disk access? You do not need to submit detail computations for this question.

As mentioned in the notes, there are many plans we can construct from
this expression, each corresponding to a different join order.
Moreover, there are many different algorithms for evaluating a join
(e.g., nested loop join, merge sort join, hash join) or a selection
(e.g., with or without using an index). That means that each query
plan can be evaluated in many different ways, depending on the way
each relational operator is executed. We call each such
"instantiation" of a query plan a *physical query plan*.

Assume we can choose between two different ways for doing selections and between three different join algorithms (we can do one selection one way, and another selection another way). Also assume that joins are asymmetric, i.e., is different from

Finally, assume that we are only considering *left-deep* join
orders.

What is the number of different physical query plans available for the
above relational expression, given our assumptions? Generalize your
answer to the case of an expression involving *n* selections
and *n-1* joins.

Note that the exact selection predicates are irrelevant, and so are the schemata of the three relations.

a)

b)

Last modified: Mon Feb 2 20:38:34 PST 1998