18
SQL for Aggregation in Data Warehouses

This chapter discusses aggregation of SQL, a basic aspect of data warehousing.

Overview of SQL for Aggregation in Data Warehouses

Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides the following extensions to the GROUP BY clause:

CUBE and ROLLUP extensions to the GROUP BY clause
Three GROUPING functions
GROUPING SETS expression

The CUBE, ROLLUP, and GROUPING SETS extensions to SQL make querying and reporting easier and faster. ROLLUP calculates aggregations such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single statement to calculate all possible combinations of aggregations. CUBE can generate the information needed in cross-tabulation reports with a single query.

CUBE, ROLLUP, and the GROUPING SETS extension let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows.

To enhance performance, CUBE, ROLLUP, and GROUPING SETS can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.

The three GROUPING functions help you identify the group each row belongs to and enable sorting subtotal rows and filtering results.

See Also:

Oracle9i SQL Reference for further details

Analyzing Across Multiple Dimensions

One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.

Here are some examples of multidimensional requests:

Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.

All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.

To visualize data that has many dimensions, analysts commonly use the analogy of a data cube, that is, a space where facts are stored at the intersection of n dimensions. Figure 18-1 shows a data cube and how it can be used differently by various groups. The cube stores sales data organized by the dimensions of product, market, sales, and time. Note that this is only a metaphor: the actual data is physically stored in normal tables. The cube data consists of both detail and aggregated data.

Figure 18-1 Logical Cubes and Views by Different Users

You can retrieve slices of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 18-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube.

Answering multidimensional questions often involves accessing and querying huge quantities of data, sometimes in millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Aggregations, such as sums and counts, across many dimensions are vital to multidimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.

Optimized Performance

Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems--all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit from optimized aggregation performance.

Oracle9i extensions provide aggregation features and bring many benefits, including:

Simplified programming requiring less SQL code for many tasks
Quicker and more efficient query processing
Reduced client processing loads and network traffic because aggregation work is shifted to servers
Opportunities for caching aggregations because similar queries can leverage existing work

An Aggregate Scenario

To illustrate the use of the GROUP BY extension, this chapter uses the sh data of the sample schema. All the examples refer to data from this scenario. The hypothetical company has sales across the world and tracks sales by both dollars and quantities information. Because there are many rows of data, the queries shown here typically have tight constraints on their WHERE clauses to limit the results to a small number of rows.

Example 18-1 Simple Cross-Tabular Report With Subtotals

Table 18-1 is a sample cross-tabular report showing the total sales by country_id and channel_desc for the US and UK through the Internet and direct sales in September 2000.

Table 18-1 Simple Cross-Tabular Report With Subtotals

Channel	Country
	UK	US	Total
Direct Sales	1,378,126	2,835,557	4,213,683
Internet	911,739	1,732,240	2,643,979
Total	2,289,865	4,567,797	6,857,662

Consider that even a simple report such as this, with just nine values in its grid, generates four subtotals and a grand total. The subtotals are the shaded numbers. Half of the values needed for this report would not be calculated with a query that requested SUM(amount_sold) and did a GROUP BY(channel_desc, country_id). To get the higher-level aggregates would require additional queries. Database commands that offer improved calculation of subtotals bring major benefits to querying, reporting, and analytical operations.

SELECT channel_desc,  country_id,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ 
FROM sales, customers, times, channels 
WHERE sales.time_id=times.time_id AND
   sales.cust_id=customers.cust_id AND
   sales.channel_id= channels.channel_id AND 
   channels.channel_desc IN ('Direct Sales', 'Internet') AND 
   times.calendar_month_desc='2000-09'
   AND country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);

CHANNEL_DESC         CO SALES$
-------------------- -- --------------
Direct Sales         UK      1,378,126
Direct Sales         US      2,835,557
Direct Sales                 4,213,683
Internet             UK        911,739
Internet             US      1,732,240
Internet                     2,643,979
                     UK      2,289,865
                     US      4,567,797
                             6,857,662

Interpreting NULLs in Examples

NULLs returned by the GROUP BY extensions are not always the traditional null meaning value unknown. Instead, a NULL may indicate that its row is a subtotal. To avoid introducing another non-value in the database system, these subtotal values are not given a special tag.

See "GROUPING Functions" for details on how the NULLs representing subtotals are distinguished from NULLs stored in the data.

ROLLUP Extension to GROUP BY

ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding minimal overhead to a query.

The action of ROLLUP is straightforward: it creates subtotals that roll up from the most detailed level to a grand total, following a grouping list specified in the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates progressively higher-level subtotals, moving from right to left through the list of grouping columns. Finally, it creates a grand total.

ROLLUP creates subtotals at n+1 levels, where n is the number of grouping columns. For instance, if a query specifies ROLLUP on grouping columns of time, region, and department (n=3), the result set will include rows at four aggregation levels.

You might want to compress your data when using ROLLUP. This is particularly useful when there are few updates to older partitions.

See Also:

Oracle9i SQL Reference for data compression syntax and restrictions

When to Use ROLLUP

Use the ROLLUP extension in tasks involving subtotals.

It is very helpful for subtotaling along a hierarchical dimension such as time or geography. For instance, a query could specify a ROLLUP(y, m, day) or ROLLUP(country, state, city).
For data warehouse administrators using summary tables, ROLLUP can simplify and speed up the maintenance of summary tables.

ROLLUP Syntax

ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:

SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)

Example 18-2 ROLLUP

This example uses the data in the sales history store data, the same data as was used in Example 18-1. The ROLLUP is across three dimensions.

SELECT channel_desc, calendar_month_desc, country_id,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ 
FROM sales, customers, times, channels 
WHERE sales.time_id=times.time_id AND
   sales.cust_id=customers.cust_id AND
   sales.channel_id= channels.channel_id AND 
   channels.channel_desc IN ('Direct Sales', 'Internet') AND 
   times.calendar_month_desc IN ('2000-09', '2000-10')
   AND country_id IN ('UK', 'US')
GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id);

CHANNEL_DESC         CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales         2000-09  UK      1,378,126
Direct Sales         2000-09  US      2,835,557
Direct Sales         2000-09          4,213,683
Direct Sales         2000-10  UK      1,388,051
Direct Sales         2000-10  US      2,908,706
Direct Sales         2000-10          4,296,757
Direct Sales                          8,510,440
Internet             2000-09  UK        911,739
Internet             2000-09  US      1,732,240
Internet             2000-09          2,643,979
Internet             2000-10  UK        876,571
Internet             2000-10  US      1,893,753
Internet             2000-10          2,770,324
Internet                              5,414,303
                                   13,924,743

Note that results do not always add due to rounding.

This query returns the following sets of rows:

Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
First-level subtotals aggregating across country_id for each combination of channel_desc and calendar_month
Second-level subtotals aggregating across calendar_month_desc and country_id for each channel_desc value
A grand total row

Partial Rollup

You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:

GROUP BY expr1, ROLLUP(expr2, expr3);

In this case, the GROUP BY clause creates subtotals at (2+1=3) aggregation levels. That is, at level (expr1, expr2, expr3), (expr1, expr2), and (expr1).

Example 18-3 Partial ROLLUP

SELECT channel_desc, calendar_month_desc, country_id,
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels 
WHERE sales.time_id=times.time_id AND
   sales.cust_id=customers.cust_id AND
   sales.channel_id= channels.channel_id AND 
   channels.channel_desc IN ('Direct Sales', 'Internet') AND 
   times.calendar_month_desc IN ('2000-09', '2000-10')
   AND country_id IN ('UK', 'US')
GROUP BY channel_desc, ROLLUP(calendar_month_desc, country_id);

CHANNEL_DESC         CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales         2000-09  UK      1,378,126
Direct Sales         2000-09  US      2,835,557
Direct Sales         2000-09          4,213,683
Direct Sales         2000-10  UK      1,388,051
Direct Sales         2000-10  US      2,908,706
Direct Sales         2000-10          4,296,757
Direct Sales                          8,510,440
Internet             2000-09  UK        911,739
Internet             2000-09  US      1,732,240
Internet             2000-09          2,643,979
Internet             2000-10  UK        876,571
Internet             2000-10  US      1,893,753
Internet             2000-10          2,770,324
Internet                              5,414,303

This query returns the following sets of rows:

Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
First-level subtotals aggregating across country_id for each combination of channel_desc and calendar_month_desc
Second-level subtotals aggregating across calendar_month_desc and country_id for each channel_desc value
It does not produce a grand total row

CUBE Extension to GROUP BY

CUBE takes a specified set of grouping columns and creates subtotals for all of their possible combinations. In terms of multidimensional analysis, CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions. If you have specified CUBE(time, region, department), the result set will include all the values that would be included in an equivalent ROLLUP statement plus additional combinations. For instance, in Example 18-1, the departmental totals across regions (279,000 and 319,000) would not be calculated by a ROLLUP(time, region, department) clause, but they would be calculated by a CUBE(time, region, department) clause. If n columns are specified for a CUBE, there will be 2 to the n combinations of subtotals returned. Example 18-3 gives an example of a three-dimension cube.

See Also:

Oracle9i SQL Reference for syntax and restrictions

When to Use CUBE

Consider Using CUBE in any situation requiring cross-tabular reports. The data needed for cross-tabular reports can be generated with a single SELECT using CUBE. Like ROLLUP, CUBE can be helpful in generating summary tables. Note that population of summary tables is even faster if the CUBE query executes in parallel.

See Also:

Chapter 21, "Using Parallel Execution" for information on parallel execution

CUBE is typically most suitable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension. For instance, a commonly requested cross-tabulation might need subtotals for all the combinations of month, state, and product. These are three independent dimensions, and analysis of all possible subtotal combinations is commonplace. In contrast, a cross-tabulation showing all possible combinations of year, month, and day would have several values of limited interest, because there is a natural hierarchy in the time dimension. Subtotals such as profit by day of month summed across year would be unnecessary in most analyses. Relatively few users need to ask "What were the total sales for the 16th of each month across the year?" See "Hierarchy Handling in ROLLUP and CUBE" for an example of handling rollup calculations efficiently.

CUBE Syntax

CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:

SELECT ...  GROUP BY CUBE (grouping_column_reference_list)

Example 18-4 CUBE

SELECT channel_desc, calendar_month_desc, country_id, 
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ 
FROM sales, customers, times, channels 
WHERE sales.time_id=times.time_id AND
   sales.cust_id=customers.cust_id AND
   sales.channel_id= channels.channel_id AND 
   channels.channel_desc IN ('Direct Sales', 'Internet') AND 
   times.calendar_month_desc IN ('2000-09', '2000-10')
   AND country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, calendar_month_desc, country_id);

CHANNEL_DESC         CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales         2000-09  UK      1,378,126
Direct Sales         2000-09  US      2,835,557
Direct Sales         2000-09          4,213,683
Direct Sales         2000-10  UK      1,388,051
Direct Sales         2000-10  US      2,908,706
Direct Sales         2000-10          4,296,757
Direct Sales                  UK      2,766,177
Direct Sales                  US      5,744,263
Direct Sales                          8,510,440
Internet             2000-09  UK        911,739
Internet             2000-09  US      1,732,240
Internet             2000-09          2,643,979
Internet             2000-10  UK        876,571
Internet             2000-10  US      1,893,753
Internet             2000-10          2,770,324
Internet                      UK      1,788,310
Internet                      US      3,625,993
Internet                              5,414,303
                  2000-09  UK      2,289,865
                  2000-09  US      4,567,797
                  2000-09          6,857,662
                  2000-10  UK      2,264,622
                  2000-10  US      4,802,459
                  2000-10          7,067,081
                           UK      4,554,487
                           US      9,370,256
                                  13,924,743

This query illustrates CUBE aggregation across three dimensions.

Partial CUBE

Partial CUBE resembles partial ROLLUP in that you can limit it to certain dimensions and precede it with columns outside the CUBE operator. In this case, subtotals of all possible combinations are limited to the dimensions within the cube list (in parentheses), and they are combined with the preceding items in the GROUP BY list.

Partial CUBE Syntax

GROUP BY expr1, CUBE(expr2, expr3)

This syntax example calculates 2*2, or 4, subtotals. That is:

(expr1, expr2, expr3)
(expr1, expr2)
(expr1, expr3)
(expr1)

Example 18-5 Partial CUBE

Using the sales database, you can issue the following statement:

SELECT channel_desc, calendar_month_desc, country_id, 
   TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ 
FROM sales, customers, times, channels 
WHERE sales.time_id=times.time_id AND
   sales.cust_id=customers.cust_id AND
   sales.channel_id= channels.channel_id AND 
   channels.channel_desc IN ('Direct Sales', 'Internet') AND 
   times.calendar_month_desc IN ('2000-09', '2000-10')
   AND country_id IN ('UK', 'US')
GROUP BY channel_desc, CUBE(calendar_month_desc, country_id);

CHANNEL_DESC         CALENDAR CO SALES$
-------------------- -------- -- --------------
Direct Sales         2000-09  UK      1,378,126
Direct Sales         2000-09  US      2,835,557
Direct Sales         2000-09          4,213,683
Direct Sales         2000-10  UK      1,388,051
Direct Sales         2000-10  US      2,908,706
Direct Sales         2000-10          4,296,757
Direct Sales                  UK      2,766,177
Direct Sales                  US      5,744,263
Direct Sales                          8,510,440
Internet             2000-09  UK        911,739
Internet             2000-09  US      1,732,240
Internet             2000-09          2,643,979
Internet             2000-10  UK        876,571
Internet             2000-10  US      1,893,753
Internet             2000-10          2,770,324
Internet                      UK      1,788,310
Internet                      US      3,625,993
Internet                              5,414,303

Calculating Subtotals Without CUBE

Just as for ROLLUP, multiple SELECT statements combined with UNION ALL statements could provide the same information gathered through CUBE. However, this might require many SELECT statements. For an n-dimensional cube, 2 to the n SELECT statements are needed. In the three-dimension example, this would mean issuing SELECT statements linked with UNION ALL. So many SELECT statements yield inefficient processing and very lengthy SQL.

Consider the impact of adding just one more dimension when calculating all possible combinations: the number of SELECT statements would double to 16. The more columns used in a CUBE clause, the greater the savings compared to the UNION ALL approach.

18SQL for Aggregation in Data Warehouses

Overview of SQL for Aggregation in Data Warehouses

Analyzing Across Multiple Dimensions

Figure 18-1 Logical Cubes and Views by Different Users

Optimized Performance

An Aggregate Scenario

Example 18-1 Simple Cross-Tabular Report With Subtotals

Table 18-1 Simple Cross-Tabular Report With Subtotals

Interpreting NULLs in Examples

ROLLUP Extension to GROUP BY

When to Use ROLLUP

ROLLUP Syntax

Example 18-2 ROLLUP

Partial Rollup

Example 18-3 Partial ROLLUP

CUBE Extension to GROUP BY

When to Use CUBE

CUBE Syntax

Example 18-4 CUBE

Partial CUBE

Partial CUBE Syntax

Example 18-5 Partial CUBE

Calculating Subtotals Without CUBE

18
SQL for Aggregation in Data Warehouses