CS245 Winter 2000 Assignment 2 Due in class on Thursday January 20 A relational database system holds three relations, C (customers), A (accounts) and T (transactions), with the following characteristics: Relation C Tuples stored as fixed length, fixed format records, length = 3000 bytes Tuples contain key attribute C.N (customer number), length 10 bytes; other fields and record header make up rest There are 10,000 C tuples There is an index on attribute C.N. Relation A Tuples stored as fixed length, fixed format records, length = 1000 bytes Tuples contain attribute A.N (number of customer who owns account), length 10 bytes; Tuples also contain attribute A.I (account identifier), lenght 10 bytes; other fields and header make up rest of the record There are 20,000 A tuples There is an index on attribute A.N. Relation T Tuples stored as fixed length, fixed format records, length = 500 bytes Tuples contain attribute T.I (the identifier of the account involved), length 10 bytes; other fields (+header) make up rest There are 60,000 T tuples. There is an index on attribute T.I. While the number of accounts associated with each customer vary, for evaluation purposes we may assume that each customer has 2 accounts, and each account has 3 transactions records associated with it. The records are to be placed in a collection of 8 KB disk blocks that have been reserved to exclusively hold C, A, or T records, or combinations of those records. (That is, there are no other types of records in the blocks we are discussing in this problem.) Each block uses 100 bytes for its header; records are not spanned.) Three disk organization strategies are being considered: [sequential] All the C records are placed sequentially (ordered by customer number) in one subset of blocks; Records of type A are separate in another set of blocks; accounts are not ordered by customer number. Finally, transaction records are in a third set of blocks, not ordered by account identifier. [clustered] For each customer record, the 2 accounts for that customer (C.N = A.N) reside in the same block. Similarly, the 6 trasnaction records for those accounts are in the same block. The customer records are not sequenced in any way. [random] The records are placed as follows, without regard to C.N, A.N, AI, T.I values, as follows: Each block contains one random C record, 2 random A records, and 6 random T records. We are also told there are four types of queries that constitute the vast majority of the workload: [probe] Given a customer number, get his/her customer record [ordered scan] For all customers, in increasing customer number order, get each customer record [plain scan] For all customers, in any order, get all customer records. [join] For a given customer number C.N, get his customer record followed by all his transaction records. (That is, get all transaction records with T.I = A.I, for any account with A.N = C.N.) Problem 1: For each of the storage strategies, compute how many total disk blocks are needed for holding relations C, A and T. Briefly explain your answers. Display your final results as explained below. Problem 2: For each query type and for each storage strategy, estimate the number of disk blocks that must be transferred from disk to execute the query. Briefly explain each answer. Display your final results as explained below. Assume you only have one buffer page (8 KB) in memory; thus, each time you need a block it counts as one IO (unless the request is for exactly the same block you already have in the buffer). You also have enough memory to hold a single working copy of a C record, of an A record, and of a T record. So, for example, to do a join, you first read in a block containing a C record, and copy the record to the working C area. Then you read in the block containing an A record into your 8K buffer, and join the two records. Also, ignore any IOs due to index accesses. That is, you may assume the indexes reside in memory. DISPLAYING FINAL ANSWERS: In Problems 1 and 2 you are to compute a total of 15 values. Please display them in a table like this: (You also need to explain the derivation of the numbers, this is just a summary.) | storage || IOs for | IOs for | IOs for | IOs for | | cost(Blks)|| [probe] | [ord scan]| [pln scan]| [join] | ---------------------------------------------------------------------------- [sequential] | || | | | | ---------------------------------------------------------------------------- [clustered] | || | | | | ---------------------------------------------------------------------------- [random] | || | | | | ---------------------------------------------------------------------------- HINT: Do not concern yourself with events that are very unlikely. For example, say you want to get the three account records for a particular customer, and that these records are randomly placed on a large number of disk blocks. There is some chance that the account records you want happen to be in the same block (just by luck). However, for this problem you may assume this probability is negligible and it will take you exactly three IOs to get those three records.