 CS245A PROBLEM SET #3
Due by 5pm, Tuesday, February 3, 1998.
4 problems, equal credit. 100 pts.

Problem 1: Linear Hashing
Suppose we start with an empty linear hash table, with only (empty)
buckets for 0 and 1.
Our hash function produces k bits for some very large k.
Blocks hold two records, as in the class example, but unlike the class
example, we use the threshold 100%; that is, we do not add an extra
bucket until the number of records exceeds twice the number of buckets.
To be more precise, the addition of a bucket occurs after a request to
insert into the table is made, but before the insertion itself is done.
We insert records into the table, and by strange coincidence, the
results of applying the hash function to the first key is 00...000, the
result of applying it to the second key is 00...001, for the next
00...010, then 00..011, 00...100, 00...101, and so on.
That is, the hash value of the ith key is i1 in binary.
Question: what is the first record that causes an overflow block to be
created?
Explain your reasoning briefly.
Problem 2: Grid Files
Suppose we have a relation R(x,y) where the range of both x and y is
from 1 to 1000.
We'd like to store R in a grid file, using three partitions (i.e., four
stripes) in the x direction and the same in the y direction (i.e., 16
regions).
Unfortunately, the data in R is surprising; it consists of the 1000
pairs (i,i) for i=1,2,...,1000.
For what partitions is the maximum number of points in
a region is minimized?
What is that maximum? (Hint: it's not 250.)
Problem 3: MultipleKey Indexes
There is a relation R(X,Y,Z), where the pair of attributes X and Y
together form the key.
X ranges from 1 to 100, and Y from 1 to 1000.
For each X there are records with 100 different values of Y, and for
each Y there are records with 10 different values of X.
(i.e., there are 10,000 records in relation R).
We need to answer queries of the form
SELECT Z
FROM R
WHERE X = c AND Y = d;
for various values of constants c and d.
We decide to set up a multiplekey index, using blocks that can hold 10
keypointer pairs (a key can be either an X or Yvalue).
The options we have are:
 i)
 We can choose either X or Y to be the first attribute.
 ii)
 We can build on top of any index one or more additional levels of
index.
Your questions:
 a)
 Describe a structure with X as the first attribute
that minimizes the number of disk reads necessary
to answer a query of the above form.
How many disk reads do you perform?
 b)
 Describe a structure with Y as the first attribute
that minimizes the number of disk reads necessary
to answer a query of the above form.
How many disk reads do you perform?
 c)
 Suppose that you were allowed to buffer
11 blocks in main memory at all times.
Which blocks would you choose?
How would it affect the average number of disk reads needed to answer a
query of the above form?
Problem 4: Partitioned Hashing
Suppose we have a hash table whose buckets are numbered 0 to (2^n)1
(i.e., bucket numbers have n bits).
We wish to store in the table a relation with two attributes, X and Y.
A query will either ask for those tuples with a given value of X, or it
will ask for those tuples with a given value of Y (never both).
With probability p, the value of X will be specified.
 a)
 Suppose we partition the hash function so that m bits are devoted
to X and the remaining nm to Y.
Give a formula in terms of n, m, and p for the expected number of
buckets that must be examined to answer a random query.
 b)
 Find the value of m, as a function of n and p, that minimizes the
expected number of buckets examined.
Do not worry that this value is unlikely to be an integer.