| Oracle9i Data Warehousing Guide Release 2 (9.2) Part Number A96520-01 |
|
This chapter discusses aggregation of SQL, a basic aspect of data warehousing.
Aggregation is a fundamental part of data warehousing. To improve
aggregation performance in your warehouse, Oracle provides the following
extensions to the GROUP BY clause:
The CUBE, ROLLUP, and
GROUPING SETS extensions to SQL make querying and
reporting easier and faster. ROLLUP calculates aggregations such as
SUM, COUNT, MAX, MIN, and
AVG at increasing levels of aggregation, from the most detailed up
to a grand total. CUBE is an extension similar to
ROLLUP, enabling a single statement to calculate all possible
combinations of aggregations. CUBE can generate the information
needed in cross-tabulation reports with a single query.
CUBE, ROLLUP, and the
GROUPING SETS extension let you specify exactly the
groupings of interest in the GROUP BY clause. This
allows efficient analysis across multiple dimensions without performing a
CUBE operation. Computing a full cube creates a heavy processing
load, so replacing cubes with grouping sets can significantly increase
performance. CUBE, ROLLUP, and grouping sets produce a
single result set that is equivalent to a UNION ALL of
differently grouped rows.
To enhance performance, CUBE, ROLLUP, and
GROUPING SETS can be parallelized: multiple processes
can simultaneously execute all of these statements. These capabilities make
aggregate calculations more efficient, thereby enhancing database performance,
and scalability.
The three GROUPING functions help you identify the
group each row belongs to and enable sorting subtotal rows and filtering
results.
| See
Also:
Oracle9i SQL Reference for further details |
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
To visualize data that has many dimensions, analysts commonly use
the analogy of a data cube, that is, a space where facts are stored at the
intersection of n dimensions. Figure 18-1
shows a data cube and how it can be used differently by various groups. The cube
stores sales data organized by the dimensions of product,
market, sales, and time. Note that this
is only a metaphor: the actual data is physically stored in normal tables. The
cube data consists of both detail and aggregated data.

You can retrieve slices of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 18-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube.
Answering multidimensional questions often involves accessing and querying huge quantities of data, sometimes in millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Aggregations, such as sums and counts, across many dimensions are vital to multidimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.
Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems--all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit from optimized aggregation performance.
Oracle9i extensions provide aggregation features and bring many benefits, including:
To illustrate the use of the GROUP BY
extension, this chapter uses the sh data of the sample schema. All
the examples refer to data from this scenario. The hypothetical company has
sales across the world and tracks sales by both dollars and quantities
information. Because there are many rows of data, the queries shown here
typically have tight constraints on their WHERE clauses to limit
the results to a small number of rows.
Table 18-1
is a sample cross-tabular report showing the total sales by
country_id and channel_desc for the US and UK through
the Internet and direct sales in September 2000.
| Channel | Country | ||
|---|---|---|---|
|
UK |
US |
Total | |
|
Direct Sales |
1,378,126 |
2,835,557 |
4,213,683 |
|
Internet |
911,739 |
1,732,240 |
2,643,979 |
|
Total |
2,289,865 |
4,567,797 |
6,857,662 |
Consider that even a simple report such as this, with just nine
values in its grid, generates four subtotals and a grand total. The subtotals
are the shaded numbers. Half of the values needed for this report would not be
calculated with a query that requested
SUM(amount_sold) and did a GROUP
BY(channel_desc, country_id). To get the higher-level
aggregates would require additional queries. Database commands that offer
improved calculation of subtotals bring major benefits to querying, reporting,
and analytical operations.
SELECT channel_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09' AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, country_id); CHANNEL_DESC CO SALES$ -------------------- -- -------------- Direct Sales UK 1,378,126 Direct Sales US 2,835,557 Direct Sales 4,213,683 Internet UK 911,739 Internet US 1,732,240 Internet 2,643,979 UK 2,289,865 US 4,567,797 6,857,662
NULLs returned by the GROUP
BY extensions are not always the traditional null meaning value
unknown. Instead, a NULL may indicate that its row is a subtotal.
To avoid introducing another non-value in the database system, these subtotal
values are not given a special tag.
See "GROUPING
Functions"
for details on how the NULLs representing subtotals are
distinguished from NULLs stored in the data.
ROLLUP enables a SELECT statement to
calculate multiple levels of subtotals across a specified group of dimensions.
It also calculates a grand total. ROLLUP is a simple extension to
the GROUP BY clause, so its syntax is extremely easy
to use. The ROLLUP extension is highly efficient, adding minimal
overhead to a query.
The action of ROLLUP is straightforward: it creates
subtotals that roll up from the most detailed level to a grand total, following
a grouping list specified in the ROLLUP clause. ROLLUP
takes as its argument an ordered list of grouping columns. First, it calculates
the standard aggregate values specified in the GROUP
BY clause. Then, it creates progressively higher-level subtotals,
moving from right to left through the list of grouping columns. Finally, it
creates a grand total.
ROLLUP creates subtotals at n+1 levels, where n is the
number of grouping columns. For instance, if a query specifies
ROLLUP on grouping columns of time,
region, and department (n=3), the result set will
include rows at four aggregation levels.
You might want to compress your data when using ROLLUP.
This is particularly useful when there are few updates to older
partitions.
| See
Also:
Oracle9i SQL Reference for data compression syntax and restrictions |
Use the ROLLUP extension in tasks involving
subtotals.
ROLLUP(y, m, day)
or ROLLUP(country, state, city).
ROLLUP can simplify and speed up the
maintenance of summary tables. ROLLUP appears in the GROUP
BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
This example uses the data in the sales history store data, the same
data as was used in Example 18-1.
The ROLLUP is across three dimensions.
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$-------------------- -------- -- --------------Direct Sales 2000-09 UK 1,378,126Direct Sales 2000-09 US 2,835,557Direct Sales 2000-09 4,213,683Direct Sales 2000-10 UK 1,388,051Direct Sales 2000-10 US 2,908,706Direct Sales 2000-10 4,296,757Direct Sales 8,510,440Internet 2000-09 UK 911,739Internet 2000-09 US 1,732,240Internet 2000-09 2,643,979Internet 2000-10 UK 876,571Internet 2000-10 US 1,893,753Internet 2000-10 2,770,324Internet 5,414,30313,924,743
Note that results do not always add due to rounding.
This query returns the following sets of rows:
GROUP BY without using
ROLLUP
country_id for each combination of
channel_desc and calendar_month
calendar_month_desc and country_id for each
channel_desc value
You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause creates
subtotals at (2+1=3) aggregation levels. That is, at level (expr1,
expr2, expr3), (expr1,
expr2), and (expr1).
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY channel_desc, ROLLUP(calendar_month_desc, country_id); CHANNEL_DESC CALENDAR CO SALES$ -------------------- -------- -- -------------- Direct Sales 2000-09 UK 1,378,126 Direct Sales 2000-09 US 2,835,557 Direct Sales 2000-09 4,213,683 Direct Sales 2000-10 UK 1,388,051 Direct Sales 2000-10 US 2,908,706 Direct Sales 2000-10 4,296,757 Direct Sales 8,510,440 Internet 2000-09 UK 911,739 Internet 2000-09 US 1,732,240 Internet 2000-09 2,643,979 Internet 2000-10 UK 876,571 Internet 2000-10 US 1,893,753 Internet 2000-10 2,770,324 Internet 5,414,303
This query returns the following sets of rows:
GROUP BY without using
ROLLUP
country_id for each combination of
channel_desc and calendar_month_desc
calendar_month_desc and country_id for each
channel_desc value
CUBE takes a specified set of grouping columns and
creates subtotals for all of their possible combinations. In terms of
multidimensional analysis, CUBE generates all the subtotals that
could be calculated for a data cube with the specified dimensions. If you have
specified CUBE(time, region, department),
the result set will include all the values that would be included in an
equivalent ROLLUP statement plus additional combinations. For
instance, in Example 18-1,
the departmental totals across regions (279,000 and 319,000) would not be
calculated by a ROLLUP(time, region,
department) clause, but they would be calculated by a
CUBE(time, region, department) clause. If
n columns are specified for a CUBE, there
will be 2 to the n combinations of subtotals returned. Example 18-3
gives an example of a three-dimension cube.
| See
Also:
Oracle9i SQL Reference for syntax and restrictions |
Consider Using CUBE in any situation requiring
cross-tabular reports. The data needed for cross-tabular reports can be
generated with a single SELECT using CUBE. Like
ROLLUP, CUBE can be helpful in generating summary
tables. Note that population of summary tables is even faster if the
CUBE query executes in parallel.
| See
Also:
Chapter 21, "Using Parallel Execution" for information on parallel execution |
CUBE is typically most suitable in queries that use
columns from multiple dimensions rather than columns representing different
levels of a single dimension. For instance, a commonly requested
cross-tabulation might need subtotals for all the combinations of
month, state, and product. These are
three independent dimensions, and analysis of all possible subtotal combinations
is commonplace. In contrast, a cross-tabulation showing all possible
combinations of year, month, and day
would have several values of limited interest, because there is a natural
hierarchy in the time dimension. Subtotals such as profit by day of
month summed across year would be unnecessary in most analyses. Relatively few
users need to ask "What were the total sales for the 16th of each month across
the year?" See "Hierarchy
Handling in ROLLUP and CUBE"
for an example of handling rollup calculations efficiently.
CUBE appears in the GROUP BY
clause in a SELECT statement. Its form is:
SELECT ... GROUP BY CUBE (grouping_column_reference_list)
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$-------------------- -------- -- --------------Direct Sales 2000-09 UK 1,378,126Direct Sales 2000-09 US 2,835,557Direct Sales 2000-09 4,213,683Direct Sales 2000-10 UK 1,388,051Direct Sales 2000-10 US 2,908,706Direct Sales 2000-10 4,296,757Direct Sales UK 2,766,177Direct Sales US 5,744,263Direct Sales 8,510,440Internet 2000-09 UK 911,739Internet 2000-09 US 1,732,240Internet 2000-09 2,643,979Internet 2000-10 UK 876,571Internet 2000-10 US 1,893,753Internet 2000-10 2,770,324Internet UK 1,788,310Internet US 3,625,993Internet 5,414,3032000-09 UK 2,289,8652000-09 US 4,567,7972000-09 6,857,6622000-10 UK 2,264,6222000-10 US 4,802,4592000-10 7,067,081UK 4,554,487US 9,370,25613,924,743
This query illustrates CUBE aggregation across three
dimensions.
Partial CUBE resembles partial ROLLUP in
that you can limit it to certain dimensions and precede it with columns outside
the CUBE operator. In this case, subtotals of all possible
combinations are limited to the dimensions within the cube list (in
parentheses), and they are combined with the preceding items in the
GROUP BY list.
GROUP BY expr1, CUBE(expr2, expr3)
This syntax example calculates 2*2, or 4, subtotals. That is:
Using the sales database, you can issue the following
statement:
SELECT channel_desc, calendar_month_desc, country_id,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels
WHERE sales.time_id=times.time_id AND
sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc IN ('2000-09', '2000-10')
AND country_id IN ('UK', 'US')
GROUP BY channel_desc, CUBE(calendar_month_desc, country_id);
CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales 2000-09 UK 1,378,126
Direct Sales 2000-09 US 2,835,557
Direct Sales 2000-09 4,213,683
Direct Sales 2000-10 UK 1,388,051
Direct Sales 2000-10 US 2,908,706
Direct Sales 2000-10 4,296,757
Direct Sales UK 2,766,177
Direct Sales US 5,744,263
Direct Sales 8,510,440
Internet 2000-09 UK 911,739
Internet 2000-09 US 1,732,240
Internet 2000-09 2,643,979
Internet 2000-10 UK 876,571
Internet 2000-10 US 1,893,753
Internet 2000-10 2,770,324
Internet UK 1,788,310
Internet US 3,625,993
Internet 5,414,303
Just as for ROLLUP, multiple SELECT
statements combined with UNION ALL statements could
provide the same information gathered through CUBE. However, this
might require many SELECT statements. For an n-dimensional cube, 2
to the n SELECT statements are needed. In the
three-dimension example, this would mean issuing SELECT statements
linked with UNION ALL. So many SELECT
statements yield inefficient processing and very lengthy SQL.
Consider the impact of adding just one more dimension when
calculating all possible combinations: the number of SELECT
statements would double to 16. The more columns used in a CUBE
clause, the greater the savings compared to the UNION
ALL approach.