CS245  Winter 1999
 
Assignment 2
due in class on Thursday January 21

A relational database system holds two relations, C (customers) and
A (accounts), with the following characteristics:

Relation C
  Tuples stored as fixed length, fixed format records, length = 640 bytes
  Tuples contain key attribute C.N (customer name), length 20 bytes;
                 other fields and record header make up rest
  There are 6,000 C tuples
  There is an index on attribute C.N.

Relation A
  Tuples stored as fixed length, fixed format records, length = 350 bytes
  Tuples contain attribute A.N (name of customer who owns account),
                        length 20 bytes; other fields (+header) make up rest
  There are 20,000 A tuples
  There is an index on attribute A.N.

While the number of accounts associated with each customer vary,
we have been told that for evaluation purposes we may assume that:
      five out of six customers have exactly 3 accounts
      one out of six customers have exactly 5 accounts

The records are to be placed in a collection of 4 KB disk blocks
that have been reserved to exclusively hold C, or A, or both types of records.
(That is, there are no other types of records in the blocks we are
discussing in this problem.)
Each block uses up 16 bytes for its header; records are not spanned.)
Three disk organization strategies are being considered:

  [sequential]  All the C records are placed sequentially (ordered
                by customer name) in one subset of blocks;
                Records of type A are separate in another set of blocks;
                accounts are not ordered by customer name.

  [clustered]   For each customer record, the accounts for that customer
                (C.N = A.N) must reside in the same block.
                The customer records are not sequenced in any way.

                Assume you have 1000 blocks with 2 customer records each
                and the corresponding 3 + 5 = 8 account records.
                The rest of the blocks have the remaining 4000 customers
                (two per block) that only have 3 accounts.

  [random]      The records are placed as follows, without regard to C.N
                or A.N values, as follows:
                    2,000 blocks with three C and six A records
                    the remaining A records in blocks with eleven records each.

We are also told there are four types of queries that constitute the vast
majority of the workload:

[probe]         Given a customer name, get his customer record
[ordered scan]  For all customers, in increasing customer name order,
                    get each customer record
[plain scan]    For all customers, in any order, get all customer records.
[join]          For all customers (any order) do:
                     Get customer record followed by all its accounts
                (This is exactly how join queries are processed.)

Problem 1:
For each of the storage strategies, compute how many total disk blocks
are needed for hold relations C and A. Briefly explain your answers.
Display your final results as explained below.

Problem 2:
For each query type and for each storage strategy,
estimate the number of disk blocks
that must be transferred from disk to execute the query.
Briefly explain each answer.  Display your final results as explained below.

Assume you only have one buffer page (4 KB) in memory;
thus, each time you need a block it counts as one IO (unless the request is
for exactly the same block you already have in the buffer).
You also have enough memory to hold a single working copy of a C record
and a single working copy of an A record. So, for example,
to do a join, you first read in a block containing a C record, and
copy the record to the working C area. Then you read in the block containing
an A record into your 4K buffer, and join the two records.

Also, ignore any IOs due to index accesses.  For example,
say I am examining a customer with C.N = "Smith".
The index will point directly to the blocks that hold Smith's
accounts. In my computations, I only need to consider the IO's
for getting the actual C and A records, not for accessing the index.
(Another way to put this is that you may assume the index resides in memory.)

DISPLAYING FINAL ANSWERS:
In Problems 1 and 2 you are to compute a total of 15 values.
Please display them in a table like this:
(You also need to explain the derivation of the numbers, this is just a
summary.)

              | storage   ||  IOs for  |  IOs for  |  IOs for  |  IOs for  |
              | cost(Blks)||  [probe]  | [ord scan]| [pln scan]|  [join]   |
----------------------------------------------------------------------------
[sequential]  |           ||           |           |           |           |
----------------------------------------------------------------------------
[clustered]   |           ||           |           |           |           |
----------------------------------------------------------------------------
[random]      |           ||           |           |           |           |
----------------------------------------------------------------------------


HINT:  Do not concern yourself with events that are very unlikely.
For example, say you want to get the three account records for a particular
customer, and that these records are randomly placed on a large
number of disk blocks. There is some chance that the account
records you want happen to be in the same block (just by luck).
However, for this problem you may assume this probability is negligible
and it will take you exactly three IOs to get those three records.