CS245 Winter 1999 Assignment 5 due in class on Thursday February 11 PROBLEM 1 (a) What is the difference between the SELECT operator in relational algebra and the SELECT keyword in SQL? Do they effectively do the same thing? If not, what is the difference? (b) Assume that R(A,B) and S(B,C) are relations. Convert the generalized relational algebra expression below to a corresponding SQL statement: PROJECT (attribute-list) [ SELECT (condition) [ R NATJOIN S ] ] PROBLEM 2 Show equivalent relational algebra expressions to the ones given below, where the selects and projects are done as early as possible. (a) SELECT(P and Q and M)[ R NATJOIN S ] (b) SELECT(P or Q)[ R NATJOIN S ] (c) PROJECT(E)[ [SELECT(P and Q and M) [R NATJOIN S] ] NATJOIN T ] where R(A,B,C), S(C,D), and T(D,E,F) are relations, P is a predicate over D, Q is a predicate over B, and M is a predicate over C. NOTE: To keep things in ASCII, I am not using special symbols. SELECT(P) is the same as sigma subscript P. R NATJOIN S is the natural join of relations R and S. PROBLEM 3 Suppose you have 2 relations, R(A, B, C) and S(B, C, D, E). You have a clustered unique (no duplicate keys) B+-tree index on attribute A for relation R. Assume this index is kept entirely in memory (i.e., you do not need to read it from disk). For relation S, you have two indices: * a non-clustered non-unique B+-tree index for attribute B * a clustered non-unique B+-tree index for attribute C Furthermore, assume that these two indices are kept in memory (i.e. you do not need to read them from disk). Also, assume that all of the tuples of S that agree on attribute C are stored in sequentially adjacent blocks on disk (that is, if more than one block is needed to store all of the tuples with some value of C, then these blocks will be sequentially located on the disk). Other relevant data: * 500 tuples of R are stored per block on disk. * n_R= 750,000 (number of tuples of R) * 100 tuples of S are stored per block on disk. * n_S= 250,000 (number of tuples of S) * V(B,S)= 50,000 (image size of attribute B in S) * V(C,S)= 50 (image size of attribute C in S) You want to execute the following query: SELECT * FROM R, S WHERE (R.B=S.B) AND (R.C=S.C) We present you with two query plans: Plan 1: For every block B of R, retrieved using the clustered index on A for R For every tuple r of B Use the index on B for S to retrieve all of the tuples s of S such that s.B=r.B For each of these tuples s, if s.C=r.C, output r.A, r.B, r.C, s.B, s.C, s.D, s.E Plan 2: For every block B of R, retrieved using the clustered index on A for R For every tuple r of B Use the index on C for S to retrieve all of the tuples s of S such that s.C=r.C For each of these tuples s, if s.B=r.B, output r.A, r.B, r.C, s.B, s.C, s.D, s.E You should analyze each of these plans carefully in terms of their behavior regarding accesses to disk. Your analysis should consider the behavior both in terms of the number of I/Os, as well as the total access time. Explain which of the plans is therefore better under what circumstances. Be sure to include in your analysis what accesses to disk are sequential accesses and which ones are random accesses. * Now consider what happens if instead V(C,S) = 500. Will one plan perform better than the other? * Instead of using the V-function to do your estimate, assume that the number of tuples returned, n_T, is proportional to the size of the domain of an attribute. Let DOM(B,S) = 1000, and DOM(C,S) = 100. Re-calculate your estimates. END OF HOMEWORK 5