Database Systems: The Complete Book Solutions for Chapter 14

## Solutions for Section 14.1

### Exercise 14.1.1(a)

First, notice that if a rectangle does not have its upper-right corner in the northeast quadrant of the point (10,20), then it surely does not intersect the given rectangle. Likewise, if a rectangle does not have its lower-left corner in the southwest quadrant of the point (40,30), it cannot intersect. However, all rectangles that meet both conditions do intersect the given rectangle. An SQL query to check these conditions is:
```     SELECT id
FROM Rectangles
WHERE xur >= 10.0 AND yur >= 20.0 AND
xll <= 40.0 AND yll <= 30.0;
```

### Exercise 14.1.2(a)

```     SELECT color, COUNT(*)
FROM Sales
WHERE item = 'shirt'
GROUP BY color
HAVING COUT(*) > 1000;
```

### Exercise 14.1.4

When we use the index for x, we retrieve 1000n pointers. These pointers are on about 5n leaf blocks. As in Example 5.5, we shall add one disk I/O for an intermediate block, and none for a leaf block. The index for YY requires the same number of B-tree block accesses, for a total of 10n+2 index blocks.

Because the data file is assumed sorted by x, the pointers in the intersection, of which there are n^2/1,000,000, are spread over fraction n/1000 of the blocks of the data file or 10n blocks. Thus, the total number of block accesses is 20n+2.

This number can be mor or less than reading all 10,000 data blocks. It is less, and therefore offers an advantage over a scan of the entire data file, if n < 500.

### Exercise 14.1.5(a)

It is easier to calculate the probability that we will not be successful; it is the probability that each of 1,000,000 points chosen randomly and independently is outside a circle of radius d. The probability that any one point is outside the circle is 1 - pid^2/1000000, so the probability that all 1,000,000 points are outside is that raised to the millionth power.

There is a useful approximation for small x, that (1-x)^{a/x} = e^{-a}. Thus, the approximate probability of failure is e^{-pi*d^2}, and the probability of success, i.e., that we shall find a point within radius d is 1 - e^{-pi*d^2}.

## Solutions for Section 14.2

### Exercise 14.2.1(a)

Among the many ways to accomplish this task is to put grid lines in the speed dimension at 800 and 1150 and in the HD dimension at 15, 50, and 70.

### Exercise 14.2.4(a)

No line in either dimension will split any other bucket besides the one into which the point (1000,192) falls. Thus, we could choose a line like RAM = 150 or a line like SPEED = 950.

### Exercise 14.2.5(b)

The distance to the closest point found so far is a little more than 15. Points closer than that to the target point (110,205) can be found in five neighboring rectangles, those with lower-left corners at (80,200), (80,150), (100,150), (120,150), and (120,200).

### Exercise 14.2.6(a)

It is possible to have 7 nonempty buckets, but no more. To see that there can be no more, notice that each of the 6 grid lines can only intersect the line of the data points once, and each intersection increases the number of line segments of the data points by 1. By choosing the grid points so that no two cross on the line of data points, we can in fact divide the data into 7 nonempty buckets.

### Exercise 14.2.9

The approximate number of blocks/bucket is ceiling(m^2/100), because m^2 is the expected number of points in a square of side m, and there are 100 points per block. Since we assume the query boundaries do not coincide with a grid line, the expected number of stripes covered by the query in each dimension is 50/m + 1. Thus, the expected number of buckets that must be examined per query is (50/m + 1)^2.

Interestingly if we ignore the ceiling function, then the product of buckets per query and blocks per bucket is about 25, independent of m. However, because of the ceiling, we can never do this well. First, it doesn't make sense to pick m below 10, because the ceiling function tells us that we shall still require one block per bucket, no matter how small we make m.

At m = 10, the product is one block per bucket times 36 buckets/query = 36. We cannot do better than 36 by increasing m. To see why, remove the ceiling function to get a lower bound on the number of blocks per query: (m^2/100)(50/m + 1)^2 = 25 + m + m^2/100. Since m >= 10, the above expression is at least 36.

Thus, we conclude that m = 10, and 36 blocks per query is the best we can do.

## Solutions for Section 14.3

### Exercise 14.3.3(a)

Revised 11/29/05.

For the index on x we need 10 first-level index blocks to hold all 100 values of x and their corresponding pointers to indexes on y. Thus, for x we need a second-level index with one block. Finding the y-index for the x-value x=C thus requires two disk I/O's.

Now, we must enter the y-index for this x-value. Since there are 100 y-values for each x-value, each y-index looks just like the x-index. We thus need two more disk I/O's to access the record with the proper x and y if there is one. If there is indeed a record with x=C and y=D, then a fifth disk I/O is needed to retrieve the z-value.

### Exercise 14.3.5(a)

To get to the block with (30,260), you have to go right at a node with salary = 150 and left at a node with salary = 300. Thus the salary must be at least 150, and less than 300. Also, you must go left at a node with age = 47, so the age must be less than 47.

### Exercise 14.3.10

Nope; can't do it. Consider the case where all the points are in a line. We can split that line into three parts by the x- and y-axes, but we can't split it into four parts. Thus, one quadrant is always empty.

## Solutions for Section 14.4

### Exercise 14.4.1(a)

First, the computers 1001 through 1011 and 1013 will correspond to positions 1 through 12 of the bit vectors, respectively. Then the bit vectors and compressed bit vectors for speed are:

SpeedUncompressedCompressed
700:10000001000000110110
733:00000000000111101011
750:00000000010011101001
866:001100000000101000
1000:000010000000110100
1100:00000000001011101010
1200:00000000100011101000
1300:000001000000110101
1400:000000100000110110
1500:01000000000001

### Exercise 14.4.5(a)

First, let's translate the bit vector into ``runs.'' The lengths of the runs of 0's preceding the 1's are 1, 0, 7, and 5. The codes for these numbers are 01, 00, 110111, and 110101, respectively, so the encoding is 0100110111110101.

### Exercise 14.4.6

Suppose that pointers require p bits, and keys require k bits. Since each of the n records has a key and pointer in the B-tree, the leaves alone occupy at least n(k+p) bits. If there are, say, 100 key-pointer pairs in a block, then the total space of the B-tree is only about 1% more than the space of the leaves, so we can neglect all but the space of the leaves.

We must thus compare 2log_2n with k+p, since these are the multipliers of n in the formulas for space taken by bitmap indexes and B-trees, respectively. The contribution of k is a little tricky. If the key has many bits, but there are few different values of the key in the file (a common situation), then the space for listing the different key values in the bitmap index, which we ignored in Section 5.4.2, is much less than the space for representing keys in the B-tree. However, if the number of actual key-values in the file approximates the 2^k, then the term nk in the formula for B-tree space can be deleted, since to be fair we should have added the same term into the cost of the bitmap index.

Let us, to be conservative, compare only p with 2log_2n. If there are n records in the file, surely pointers must take at least log_2n bits. However, in practice, pointers must do more than distinguish among records of one file; they must distinguish locations on many disks, and therefore 32 bits is a minimum for pointers, with considerably higher values possible. If p = 32, then the bitmap index will take less space if 2log_2n < 32, or log_2n < 16, or n < 65,536.

Thus, for a reasonable sized file, the space required by the bitmap and B-tree are comparable under our assumptions above. The advantage goes to the bitmap index in a number of situations, including:

1. Pointers larger than 32 bits.

2. Keys require many bits, but the number of different key values is small.

There are a number of advantages to the B-tree, even when it takes more space, including the ability to handle range queries efficiently and the speed with which access to single records is provided.