Oracle9i Data Warehousing Guide Release 2 (9.2) Part Number A96520-01 |
|
This chapter discusses aggregation of SQL, a basic aspect of data warehousing.
Aggregation is a fundamental part of data warehousing. To improve
aggregation performance in your warehouse, Oracle provides the following
extensions to the GROUP
BY
clause:
The CUBE
, ROLLUP
, and
GROUPING
SETS
extensions to SQL make querying and
reporting easier and faster. ROLLUP
calculates aggregations such as
SUM
, COUNT
, MAX
, MIN
, and
AVG
at increasing levels of aggregation, from the most detailed up
to a grand total. CUBE
is an extension similar to
ROLLUP
, enabling a single statement to calculate all possible
combinations of aggregations. CUBE
can generate the information
needed in cross-tabulation reports with a single query.
CUBE
, ROLLUP
, and the
GROUPING
SETS
extension let you specify exactly the
groupings of interest in the GROUP
BY
clause. This
allows efficient analysis across multiple dimensions without performing a
CUBE
operation. Computing a full cube creates a heavy processing
load, so replacing cubes with grouping sets can significantly increase
performance. CUBE
, ROLLUP
, and grouping sets produce a
single result set that is equivalent to a UNION
ALL
of
differently grouped rows.
To enhance performance, CUBE
, ROLLUP
, and
GROUPING
SETS
can be parallelized: multiple processes
can simultaneously execute all of these statements. These capabilities make
aggregate calculations more efficient, thereby enhancing database performance,
and scalability.
The three GROUPING
functions help you identify the
group each row belongs to and enable sorting subtotal rows and filtering
results.
See
Also:
Oracle9i SQL Reference for further details |
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
To visualize data that has many dimensions, analysts commonly use
the analogy of a data cube, that is, a space where facts are stored at the
intersection of n dimensions. Figure 18-1
shows a data cube and how it can be used differently by various groups. The cube
stores sales data organized by the dimensions of product
,
market
, sales
, and time
. Note that this
is only a metaphor: the actual data is physically stored in normal tables. The
cube data consists of both detail and aggregated data.
You can retrieve slices of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 18-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube.
Answering multidimensional questions often involves accessing and querying huge quantities of data, sometimes in millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Aggregations, such as sums and counts, across many dimensions are vital to multidimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.
Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems--all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit from optimized aggregation performance.
Oracle9i extensions provide aggregation features and bring many benefits, including:
To illustrate the use of the GROUP
BY
extension, this chapter uses the sh
data of the sample schema. All
the examples refer to data from this scenario. The hypothetical company has
sales across the world and tracks sales by both dollars and quantities
information. Because there are many rows of data, the queries shown here
typically have tight constraints on their WHERE
clauses to limit
the results to a small number of rows.
Table 18-1
is a sample cross-tabular report showing the total sales by
country_id
and channel_desc
for the US and UK through
the Internet and direct sales in September 2000.
Channel | Country | ||
---|---|---|---|
UK |
US |
Total | |
Direct Sales |
1,378,126 |
2,835,557 |
4,213,683 |
Internet |
911,739 |
1,732,240 |
2,643,979 |
Total |
2,289,865 |
4,567,797 |
6,857,662 |
Consider that even a simple report such as this, with just nine
values in its grid, generates four subtotals and a grand total. The subtotals
are the shaded numbers. Half of the values needed for this report would not be
calculated with a query that requested
SUM
(amount_sold
) and did a GROUP
BY
(channel_desc, country_id
). To get the higher-level
aggregates would require additional queries. Database commands that offer
improved calculation of subtotals bring major benefits to querying, reporting,
and analytical operations.
SELECT channel_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09' AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, country_id); CHANNEL_DESC CO SALES$ -------------------- -- -------------- Direct Sales UK 1,378,126 Direct Sales US 2,835,557 Direct Sales 4,213,683 Internet UK 911,739 Internet US 1,732,240 Internet 2,643,979 UK 2,289,865 US 4,567,797 6,857,662
NULLs
returned by the GROUP
BY
extensions are not always the traditional null meaning value
unknown. Instead, a NULL
may indicate that its row is a subtotal.
To avoid introducing another non-value in the database system, these subtotal
values are not given a special tag.
See "GROUPING
Functions"
for details on how the NULLs
representing subtotals are
distinguished from NULLs
stored in the data.
ROLLUP
enables a SELECT
statement to
calculate multiple levels of subtotals across a specified group of dimensions.
It also calculates a grand total. ROLLUP
is a simple extension to
the GROUP
BY
clause, so its syntax is extremely easy
to use. The ROLLUP
extension is highly efficient, adding minimal
overhead to a query.
The action of ROLLUP
is straightforward: it creates
subtotals that roll up from the most detailed level to a grand total, following
a grouping list specified in the ROLLUP
clause. ROLLUP
takes as its argument an ordered list of grouping columns. First, it calculates
the standard aggregate values specified in the GROUP
BY
clause. Then, it creates progressively higher-level subtotals,
moving from right to left through the list of grouping columns. Finally, it
creates a grand total.
ROLLUP
creates subtotals at n+1 levels, where n is the
number of grouping columns. For instance, if a query specifies
ROLLUP
on grouping columns of time
,
region
, and department
(n=3), the result set will
include rows at four aggregation levels.
You might want to compress your data when using ROLLUP
.
This is particularly useful when there are few updates to older
partitions.
See
Also:
Oracle9i SQL Reference for data compression syntax and restrictions |
Use the ROLLUP
extension in tasks involving
subtotals.
ROLLUP(y,
m,
day)
or ROLLUP(country,
state,
city)
.
ROLLUP
can simplify and speed up the
maintenance of summary tables. ROLLUP
appears in the GROUP
BY
clause in a SELECT
statement. Its form is:
SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
This example uses the data in the sales history store data, the same
data as was used in Example 18-1.
The ROLLUP
is across three dimensions.
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales 2000-09 UK 1,378,126
Direct Sales 2000-09 US 2,835,557
Direct Sales 2000-09 4,213,683
Direct Sales 2000-10 UK 1,388,051
Direct Sales 2000-10 US 2,908,706
Direct Sales 2000-10 4,296,757
Direct Sales 8,510,440
Internet 2000-09 UK 911,739
Internet 2000-09 US 1,732,240
Internet 2000-09 2,643,979
Internet 2000-10 UK 876,571
Internet 2000-10 US 1,893,753
Internet 2000-10 2,770,324
Internet 5,414,303
13,924,743
Note that results do not always add due to rounding.
This query returns the following sets of rows:
GROUP
BY
without using
ROLLUP
country_id
for each combination of
channel_desc
and calendar_month
calendar_month_desc
and country_id
for each
channel_desc
value
You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP
BY
clause creates
subtotals at (2+1=3) aggregation levels. That is, at level (expr1
,
expr2
, expr3
), (expr1
,
expr2
), and (expr1
).
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY channel_desc, ROLLUP(calendar_month_desc, country_id); CHANNEL_DESC CALENDAR CO SALES$ -------------------- -------- -- -------------- Direct Sales 2000-09 UK 1,378,126 Direct Sales 2000-09 US 2,835,557 Direct Sales 2000-09 4,213,683 Direct Sales 2000-10 UK 1,388,051 Direct Sales 2000-10 US 2,908,706 Direct Sales 2000-10 4,296,757 Direct Sales 8,510,440 Internet 2000-09 UK 911,739 Internet 2000-09 US 1,732,240 Internet 2000-09 2,643,979 Internet 2000-10 UK 876,571 Internet 2000-10 US 1,893,753 Internet 2000-10 2,770,324 Internet 5,414,303
This query returns the following sets of rows:
GROUP
BY
without using
ROLLUP
country_id
for each combination of
channel_desc
and calendar_month_desc
calendar_month_desc
and country_id
for each
channel_desc
value
CUBE
takes a specified set of grouping columns and
creates subtotals for all of their possible combinations. In terms of
multidimensional analysis, CUBE
generates all the subtotals that
could be calculated for a data cube with the specified dimensions. If you have
specified CUBE(time
, region
, department)
,
the result set will include all the values that would be included in an
equivalent ROLLUP
statement plus additional combinations. For
instance, in Example 18-1,
the departmental totals across regions (279,000 and 319,000) would not be
calculated by a ROLLUP(time
, region
,
department)
clause, but they would be calculated by a
CUBE(time
, region
, department)
clause. If
n columns are specified for a CUBE
, there
will be 2 to the n combinations of subtotals returned. Example 18-3
gives an example of a three-dimension cube.
See
Also:
Oracle9i SQL Reference for syntax and restrictions |
Consider Using CUBE
in any situation requiring
cross-tabular reports. The data needed for cross-tabular reports can be
generated with a single SELECT
using CUBE
. Like
ROLLUP
, CUBE
can be helpful in generating summary
tables. Note that population of summary tables is even faster if the
CUBE
query executes in parallel.
See
Also:
Chapter 21, "Using Parallel Execution" for information on parallel execution |
CUBE
is typically most suitable in queries that use
columns from multiple dimensions rather than columns representing different
levels of a single dimension. For instance, a commonly requested
cross-tabulation might need subtotals for all the combinations of
month
, state
, and product
. These are
three independent dimensions, and analysis of all possible subtotal combinations
is commonplace. In contrast, a cross-tabulation showing all possible
combinations of year
, month
, and day
would have several values of limited interest, because there is a natural
hierarchy in the time
dimension. Subtotals such as profit by day of
month summed across year would be unnecessary in most analyses. Relatively few
users need to ask "What were the total sales for the 16th of each month across
the year?" See "Hierarchy
Handling in ROLLUP and CUBE"
for an example of handling rollup calculations efficiently.
CUBE
appears in the GROUP
BY
clause in a SELECT
statement. Its form is:
SELECT ... GROUP BY CUBE (grouping_column_reference_list)
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales 2000-09 UK 1,378,126
Direct Sales 2000-09 US 2,835,557
Direct Sales 2000-09 4,213,683
Direct Sales 2000-10 UK 1,388,051
Direct Sales 2000-10 US 2,908,706
Direct Sales 2000-10 4,296,757
Direct Sales UK 2,766,177
Direct Sales US 5,744,263
Direct Sales 8,510,440
Internet 2000-09 UK 911,739
Internet 2000-09 US 1,732,240
Internet 2000-09 2,643,979
Internet 2000-10 UK 876,571
Internet 2000-10 US 1,893,753
Internet 2000-10 2,770,324
Internet UK 1,788,310
Internet US 3,625,993
Internet 5,414,303
2000-09 UK 2,289,865
2000-09 US 4,567,797
2000-09 6,857,662
2000-10 UK 2,264,622
2000-10 US 4,802,459
2000-10 7,067,081
UK 4,554,487
US 9,370,256
13,924,743
This query illustrates CUBE
aggregation across three
dimensions.
Partial CUBE
resembles partial ROLLUP
in
that you can limit it to certain dimensions and precede it with columns outside
the CUBE
operator. In this case, subtotals of all possible
combinations are limited to the dimensions within the cube list (in
parentheses), and they are combined with the preceding items in the
GROUP
BY
list.
GROUP BY expr1, CUBE(expr2, expr3)
This syntax example calculates 2*2, or 4, subtotals. That is:
Using the sales
database, you can issue the following
statement:
SELECT channel_desc, calendar_month_desc, country_id,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels
WHERE sales.time_id=times.time_id AND
sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc IN ('2000-09', '2000-10')
AND country_id IN ('UK', 'US')
GROUP BY channel_desc, CUBE(calendar_month_desc, country_id);
CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales 2000-09 UK 1,378,126
Direct Sales 2000-09 US 2,835,557
Direct Sales 2000-09 4,213,683
Direct Sales 2000-10 UK 1,388,051
Direct Sales 2000-10 US 2,908,706
Direct Sales 2000-10 4,296,757
Direct Sales UK 2,766,177
Direct Sales US 5,744,263
Direct Sales 8,510,440
Internet 2000-09 UK 911,739
Internet 2000-09 US 1,732,240
Internet 2000-09 2,643,979
Internet 2000-10 UK 876,571
Internet 2000-10 US 1,893,753
Internet 2000-10 2,770,324
Internet UK 1,788,310
Internet US 3,625,993
Internet 5,414,303
Just as for ROLLUP
, multiple SELECT
statements combined with UNION
ALL
statements could
provide the same information gathered through CUBE
. However, this
might require many SELECT
statements. For an n-dimensional cube, 2
to the n SELECT
statements are needed. In the
three-dimension example, this would mean issuing SELECT
statements
linked with UNION
ALL
. So many SELECT
statements yield inefficient processing and very lengthy SQL.
Consider the impact of adding just one more dimension when
calculating all possible combinations: the number of SELECT
statements would double to 16. The more columns used in a CUBE
clause, the greater the savings compared to the UNION
ALL
approach.