| Oracle9i Data Warehousing Guide Release 1 (9.0.1) Part Number A90237-01 |
|
This chapter discusses aggregation of SQL, a basic aspect of data warehousing. It contains these topics:
Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides the following extensions to the GROUP BY clause:
CUBE and ROLLUP Extensions to the GROUP BY Clause
GROUPING Functions
GROUPING SETS Expression
The CUBE, ROLLUP, and GROUPING SETS extensions to SQL make querying and reporting easier and faster. ROLLUP calculates aggregations such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single statement to calculate all possible combinations of aggregations. CUBE can generate the information needed in cross-tabulation reports with a single query.
CUBE, ROLLUP, and the GROUPING SETS extension let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows.
To enhance performance, CUBE, ROLLUP, and GROUPING SETS can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
The three GROUPING functions help you identify the group each row belongs to and enable sorting subtotal rows and filtering results.
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
All the requests above involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
To visualize data that has many dimensions, analysts commonly use the analogy of a data cube, that is, a space where facts are stored at the intersection of n dimensions. Figure 18-1 shows a data cube and how it can be used differently by various groups. The cube stores sales data organized by the dimensions of product, market, and time. Note that this is only a metaphor: the actual data is physically stored in normal tables. The cube data consists of both detail and aggregated data.

You can retrieve slices of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 18-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube.
Answering multidimensional questions often involves accessing and querying huge quantities of data, sometimes in millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Aggregations, such as sums and counts, across many dimensions are vital to multidimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.
Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems--all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit from optimized aggregation performance.
Oracle9i extensions provide aggregation features and bring many benefits, including:
To illustrate the use of the GROUP BY extension, this chapter uses the Sales History data of the common schema. All the examples refer to data from this scenario. The hypothetical company has sales across the world and tracks sales by both dollars and quantities information. Because there are many rows of data, the queries shown here typically have tight constraints on their WHERE clauses to limit the results to a small number of rows.
Example 18-1 is a sample cross-tabular report showing the total sales by country_id and channel_desc for the US and UK through the Internet and Direct Sales in September 2000:
|
Country |
|||
| Channel |
|
||
| UK | US | Total | |
| Direct Sales |
1,378,126 |
2,835,557 |
4,213,683 |
| Internet |
911,739 |
1,732,240 |
2,643,979 |
| Total |
2,289,865 |
4,567,797 |
6,857,662 |
Consider that even a simple report like Example 18-1, with just nine values in its grid, generates four subtotals and a grand total. The subtotals are the shaded numbers. Half of the values needed for this report would not be calculated with a query that requested SUM(amount_sold) and did a GROUP BY(channel_desc, country_id). To get the higher-level aggregates would require additional queries. Database commands that offer improved calculation of subtotals bring major benefits to querying, reporting, and analytical operations.
SELECT channel_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09' AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, country_id); CHANNEL_DESC CO SALES$ -------------------- -- -------------- Direct Sales UK 1,378,126 Direct Sales US 2,835,557 Direct Sales 4,213,683 Internet UK 911,739 Internet US 1,732,240 Internet 2,643,979 UK 2,289,865 US 4,567,797 6,857,662
NULLs returned by the GROUP BY extensions are not always the traditional null meaning value unknown. Instead, a NULL may indicate that its row is a subtotal. For instance, the first NULL value shown in Example 18-1 is in the Calendar_month_desc column. This NULL means that the row is a subtotal for "All Months" returned by this query for the Direct Sales channel, which is a subtotal for September and October 2000. To avoid introducing another non-value in the database system, these subtotal values are not given a special tag.
See "GROUPING Functions" for details on how the NULLs representing subtotals are distinguished from NULLs stored in the data.
ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding minimal overhead to a query.
The action of ROLLUP is straightforward: it creates subtotals that roll up from the most detailed level to a grand total, following a grouping list specified in the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates progressively higher-level subtotals, moving from right to left through the list of grouping columns. Finally, it creates a grand total.
ROLLUP creates subtotals at n+1 levels, where n is the number of grouping columns. For instance, if a query specifies ROLLUP on grouping columns of time, region, and department (n=3), the result set will include rows at four aggregation levels.
Use the ROLLUP extension in tasks involving subtotals.
ROLLUP(y, m, day) or ROLLUP(country, state, city).
ROLLUP can simplify and speed up the maintenance of summary tables.
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
This example uses the data in the sales history store data, the same data as was used in Example 18-1. The ROLLUP is across three dimensions.
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$-------------------- -------- -- --------------Direct Sales 2000-09 UK 1,378,126Direct Sales 2000-09 US 2,835,557Direct Sales 2000-09 4,213,683Direct Sales 2000-10 UK 1,388,051Direct Sales 2000-10 US 2,908,706Direct Sales 2000-10 4,296,757Direct Sales 8,510,440Internet 2000-09 UK 911,739Internet 2000-09 US 1,732,240Internet 2000-09 2,643,979Internet 2000-10 UK 876,571Internet 2000-10 US 1,893,753Internet 2000-10 2,770,324Internet 5,414,30313,924,743
Note that results do not always add due to rounding.
This query returns the following sets of rows:
GROUP BY without using ROLLUP
country_id for each combination of channel_desc and calendar_month
calendar_month_desc and country_id for each channel_desc value
You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause creates subtotals at (2+1=3) aggregation levels. That is, at level (expr1, expr2, expr3), (expr1, expr2), and (expr1).
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY channel_desc, ROLLUP(calendar_month_desc, country_id); CHANNEL_DESC CALENDAR CO SALES$ -------------------- -------- -- -------------- Direct Sales 2000-09 UK 1,378,126 Direct Sales 2000-09 US 2,835,557 Direct Sales 2000-09 4,213,683 Direct Sales 2000-10 UK 1,388,051 Direct Sales 2000-10 US 2,908,706 Direct Sales 2000-10 4,296,757 Direct Sales 8,510,440 Internet 2000-09 UK 911,739 Internet 2000-09 US 1,732,240 Internet 2000-09 2,643,979 Internet 2000-10 UK 876,571 Internet 2000-10 US 1,893,753 Internet 2000-10 2,770,324 Internet 5,414,303
This query returns the following sets of rows:
GROUP BY without using ROLLUP
country_id for each combination of channel_desc and calendar_month_desc
calendar_month_desc and country_id for each channel_desc value
CUBE takes a specified set of grouping columns and creates subtotals for all of their possible combinations. In terms of multidimensional analysis, CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions. If you have specified CUBE(time, region, department), the result set will include all the values that would be included in an equivalent ROLLUP statement plus additional combinations. For instance, in Example 18-1, the departmental totals across regions (279,000 and 319,000) would not be calculated by a ROLLUP(time, region, department) clause, but they would be calculated by a CUBE(time, region, department) clause. If n columns are specified for a CUBE, there will be 2 to the n combinations of subtotals returned. Example 18-3 gives an example of a three-dimension cube.
Consider Using CUBE in any situation requiring cross-tabular reports. The data needed for cross-tabular reports can be generated with a single SELECT using CUBE. Like ROLLUP, CUBE can be helpful in generating summary tables. Note that population of summary tables is even faster if the CUBE query executes in parallel.
CUBE is typically most suitable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension. For instance, a commonly requested cross-tabulation might need subtotals for all the combinations of month, state, and product. These are three independent dimensions, and analysis of all possible subtotal combinations is commonplace. In contrast, a cross-tabulation showing all possible combinations of year, month, and day would have several values of limited interest, because there is a natural hierarchy in the time dimension. Subtotals such as profit by day of month summed across year would be unnecessary in most analyses. Relatively few users need to ask "What were the total sales for the 16th of each month across the year?" See "Hierarchy Handling in ROLLUP and CUBE" for an example of handling rollup calculations efficiently.
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY CUBE (grouping_column_reference_list)
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$-------------------- -------- -- --------------Direct Sales 2000-09 UK 1,378,126Direct Sales 2000-09 US 2,835,557Direct Sales 2000-09 4,213,683Direct Sales 2000-10 UK 1,388,051Direct Sales 2000-10 US 2,908,706Direct Sales 2000-10 4,296,757Direct Sales UK 2,766,177Direct Sales US 5,744,263Direct Sales 8,510,440Internet 2000-09 UK 911,739Internet 2000-09 US 1,732,240Internet 2000-09 2,643,979Internet 2000-10 UK 876,571Internet 2000-10 US 1,893,753Internet 2000-10 2,770,324Internet UK 1,788,310Internet US 3,625,993Internet 5,414,3032000-09 UK 2,289,8652000-09 US 4,567,7972000-09 6,857,6622000-10 UK 2,264,6222000-10 US 4,802,4592000-10 7,067,081UK 4,554,487US 9,370,25613,924,743
This query illustrates CUBE aggregation across three dimensions
Partial CUBE resembles partial ROLLUP in that you can limit it to certain dimensions and precede it with columns outside the CUBE operator. In this case, subtotals of all possible combinations are limited to the dimensions within the cube list (in parentheses), and they are combined with the preceding items in the GROUP BY list.
GROUP BY expr1, CUBE(expr2, expr3)
The above syntax example calculates 2*2, or 4, subtotals. That is:
Using the sales database, you can issue the following statement:
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY channel_desc, CUBE(calendar_month_desc, country_id);CHANNEL_DESC CALENDAR CO SALES$-------------------- -------- -- -------------- Direct Sales 2000-09 UK 1,378,126 Direct Sales 2000-09 US 2,835,557 Direct Sales 2000-09 4,213,683 Direct Sales 2000-10 UK 1,388,051 Direct Sales 2000-10 US 2,908,706 Direct Sales 2000-10 4,296,757 Direct Sales UK 2,766,177 Direct Sales US 5,744,263 Direct Sales 8,510,440 Internet 2000-09 UK 911,739 Internet 2000-09 US 1,732,240 Internet 2000-09 2,643,979 Internet 2000-10 UK 876,571 Internet 2000-10 US 1,893,753 Internet 2000-10 2,770,324 Internet UK 1,788,310 Internet US 3,625,993 Internet 5,414,303
Just as for ROLLUP, multiple SELECT statements combined with UNION ALL statements could provide the same information gathered through CUBE. However, this might require many SELECT statements. For an n-dimensional cube, 2 to the n SELECT statements are needed. In the three-dimension example, this would mean issuing SELECT statements linked with UNION ALL. So many SELECT statements yield inefficient processing and very lengthy SQL.
Consider the impact of adding just one more dimension when calculating all possible combinations: the number of SELECT statements would double to 16. The more columns used in a CUBE clause, the greater the savings compared to the UNION ALL approach.
Two challenges arise with the use of ROLLUP and CUBE. First, how can you programmatically determine which result set rows are subtotals, and how do you find the exact level of aggregation for a given subtotal? You often need to use subtotals in calculations such as percent-of-totals, so you need an easy way to determine which rows are the subtotals. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How can you differentiate between the two?
GROUPING handles these problems. Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, returns a 0.
GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT ... [GROUPING(dimension_column)...] ... GROUP BY ... {CUBE | ROLLUP} (dimension_column)
This example uses GROUPING to create a set of mask columns for the result set shown in Example 18-3. The mask columns are easy to analyze programmatically.
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc) as Ch, GROUPING(calendar_month_desc) AS Mo, GROUPING(country_id) AS Co FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id); CHANNEL_DESC CALENDAR CO SALES$ CH MO CO -------------------- -------- -- -------------- --------- --------- --------- Direct Sales 2000-09 UK 1,378,126 0 0 0 Direct Sales 2000-09 US 2,835,557 0 0 0 Direct Sales 2000-09 4,213,683 0 0 1 Direct Sales 2000-10 UK 1,388,051 0 0 0 Direct Sales 2000-10 US 2,908,706 0 0 0 Direct Sales 2000-10 4,296,757 0 0 1 Direct Sales 8,510,440 0 1 1 Internet 2000-09 UK 911,739 0 0 0 Internet 2000-09 US 1,732,240 0 0 0 Internet 2000-09 2,643,979 0 0 1 Internet 2000-10 UK 876,571 0 0 0 Internet 2000-10 US 1,893,753 0 0 0 Internet 2000-10 2,770,324 0 0 1 Internet 5,414,303 0 1 1 13,924,743 1 1 1
A program can easily identify the detail rows above by a mask of "0 0 0" on the T, R, and D columns. The first level subtotal rows have a mask of "0 0 1", the second level subtotal rows have a mask of "0 1 1", and the overall total row has a mask of "1 1 1".
You can resolve ambiguity in result sets by using the GROUPING and other functions as shown in the code below.
SELECT DECODE(GROUPING(channel_desc), 1, 'All Channels', channel_desc) AS Channel, DECODE(GROUPING(country_id), 1, 'All Countries', country_id) AS Country, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc= '2000-09' AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, country_id); CHANNEL COUNTRY SALES$ -------------------- ------------- -------------- Direct Sales UK 1,378,126 Direct Sales US 2,835,557 Direct Sales All Countries 4,213,683 Internet UK 911,739 Internet US 1,732,240 Internet All Countries 2,643,979 All Channels UK 2,289,865 All Channels US 4,567,797 All Channels All Countries 6,857,662
These results include text values clarifying which rows have aggregations.
Grouping function used to differentiate aggregate-based "NULL" from stored NULL values.
To understand the SQL statement above, note its first column specification, which handles the channel_desc column. In the first line of the SQL code above:
SELECT DECODE(GROUPING(channel_desc), 1, 'All Channels', channel_desc) AS Channel,
The channel_desc value is determined with a DECODE function that contains a GROUPING function. The GROUPING function returns a 1 if a row value is an aggregate created by ROLLUP or CUBE, otherwise it returns a 0. The DECODE function then operates on the GROUPING function's results. It returns the text "All Channels" if it receives a 1 and the channel_desc value from the database if it receives a 0. Values from the database will be either a real value such as "Internet" or a stored NULL. The second column specification, displaying country_id, works the same way.
The GROUPING function is not only useful for identifying NULLs, it also enables sorting subtotal rows and filtering results. In Example 18-8, you retrieve a subset of the subtotals created by a CUBE and none of the base-level aggregations. The HAVING clause constrains columns that use GROUPING functions.
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc) CH, GROUPING(calendar_month_desc) MO, GROUPING(country_id) CO FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_id) HAVING (GROUPING(channel_desc)=1 AND GROUPING(calendar_month_desc)= 1 AND GROUPING(country_id)=1) OR (GROUPING(channel_desc)=1 AND GROUPING(calendar_month_desc)= 1) OR (GROUPING(country_id)=1 AND GROUPING(calendar_month_desc)= 1); CHANNEL_DESC C CO SALES$ CH MO CO -------------------- - -- -------------- --------- --------- --------- UK 4,554,487 1 1 0 US 9,370,256 1 1 0 Direct Sales 8,510,440 0 1 1 Internet 5,414,303 0 1 1 13,924,743 1 1 1
Compare the result set of Example 18-8 with that in Example 18-3 to see how Example 18-8 is a precisely specified group: it contains only the yearly totals, regional totals aggregated over time and department, and the grand total.
To find the GROUP BY level of a particular row, a query must return GROUPING function information for each of the GROUP BY columns. If we do this using the GROUPING function, every GROUP BY column requires another column using the GROUPING function. For instance, a four-column GROUP BY clause needs to be analyzed with four GROUPING functions. This is inconvenient to write in SQL and increases the number of columns required in the query. When you want to store the query result sets in tables, as with materialized views, the extra columns waste storage space.
To address these problems, Oracle9i introduces the GROUPING_ID function. GROUPING_ID returns a single number that enables you to determine the exact GROUP BY level. For each row, GROUPING_ID takes the set of 1's and 0's that would be generated if you used the appropriate GROUPING functions and concatenates them, forming a bit vector. The bit vector is treated as a binary number, and the number's base-10 value is returned by the GROUPING_ID function. For instance, if you group with the expression CUBE(a, b) the possible values are as shown in Table 18-2:
| Aggregation Level | Bit Vector | GROUPING_ID |
|---|---|---|
|
a, b |
0 0 |
0 |
|
a |
0 1 |
1 |
|
b |
1 0 |
2 |
|
Grand Total |
1 1 |
3 |
GROUPING_ID clearly distinguishes groupings created by grouping set specification, and it is very useful during refresh and rewrite of materialized views.
While the extensions to GROUP BY offer power and flexibility, they also allow complex result sets that can include duplicate groupings. The GROUP_ID function lets you distinguish among duplicate groupings. If there are multiple sets of rows calculated for a given level, GROUP_ID assigns the value of 0 to all the rows in the first set. All other sets of duplicate rows for a particular grouping are assigned higher values, starting with 1. For example, consider the following query, which generates a duplicate grouping:
SELECT country_id, cust_state_province, SUM(amount_sold), GROUPING_ID(country_id, cust_state_province) GROUPING_ID, GROUP_ID() FROM sales, customers, times WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND times.time_id= '30-OCT-00' AND country_id IN ('FR', 'ES') GROUP BY GROUPING SETS (country_id, ROLLUP(country_id, cust_state_province)); CO CUST_STATE_PROVINCE SUM(AMOUNT_SOLD) GROUPING_ID GROUP_ID() -- ---------------------------------------- ---------------- ---------- ES Alicante 8939 0 0 ES Almeria 1053 0 0 ES Barcelona 6312 0 0 ES Girona 220 0 0 ES Malaga 8137 0 0 ES Salamanca 324 0 0 ES Valencia 7588 0 0 FR Alsace 5099 0 0 FR Aquitaine 13183 0 0 FR Brittany 3938 0 0 FR Centre 2968 0 0 FR Ile-de-France 16449 0 0 FR Languedoc-Roussillon 20228 0 0 FR Midi-Pyrenees 2322 0 0 FR Pays de la Loire 1096 0 0 FR Provence-Alpes-Cote d'Azur 1208 0 0 FR Rhtne-Alpes 7637 0 0 106701 3 0 ES 32573 1 0 FR 74128 1 0 ES 32573 1 1 FR 74128 1 1
The query above generates the following groupings: (country_id, cust_state_province), (country_id), (country_id), and (). Note that the grouping (country_id) is repeated twice. The syntax for GROUPING SETS is explained in "GROUPING SETS Expression".
This function helps you filter out duplicate groupings from the result. For example, you can filter out duplicate (region) groupings from the above example by adding a HAVING clause condition GROUP_ID()=0 to the query.
You can selectively specify the set of groups that you want to create using a GROUPING SETS expression within a GROUP BY clause. This allows precise specification across multiple dimensions without computing the whole CUBE. For example, you can say:
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY GROUPING SETS((channel_desc, calendar_month_desc, country_id), (channel_desc, country_id), (calendar_month_desc, country_id));
Note that this statement uses composite columns, described in "Composite Columns". This statement calculates aggregates over three groupings:
(channel_desc, calendar_month_desc, country_id)
(channel_desc, country_id)
(calendar_month_desc, country_id)
Compare the above statement with the alternative below, which uses the CUBE operation and the GROUPING_ID function to return the desired rows:
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, GROUPING_ID(channel_desc, calendar_month_desc, country_id) gid FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_id) HAVING GROUPING_ID(channel_desc, calendar_month_desc, country_id)=0 OR GROUPING_ID(channel_desc, calendar_month_desc, country_id)=2 OR GROUPING_ID(channel_desc, calendar_month_desc, country_id)=4;;
The above statement computes all the 8 (2 *2 *2) groupings, though only the above 3 groups are of interest to you.
Another alternative is the following statement, which is lengthy due to several unions. This statement requires three scans of the base table, making it inefficient. CUBE and ROLLUP can be thought of as grouping sets with very specific semantics. The following equivalences show this fact:
CUBE(a, b, c)
is equivalent to
GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ()) ROLLUP(a, b, c)
is equivalent to
GROUPING SETS ((a, b, c), (a, b), ())
GROUPING SETS syntax lets you define multiple groupings in the same query. GROUP BY computes all the groupings specified and combines them with UNION ALL. For example,
GROUP BY GROUPING sets (channel_desc, calendar_month_desc, country_id )
is equivalent to:
GROUP BY channel_desc UNION ALL GROUP BY calendar_month_desc UNION ALL country_id
Table 18-3 shows grouping sets specification and equivalent GROUP BY specification. Note that some examples use composite columns.
In the absence of an optimizer that looks across query blocks to generate the execution plan, a query based on UNION would need multiple scans of the base table, sales. This could be very inefficient as fact tables will normally be huge. Using GROUPING SETS statements, all the groupings of interest are available in the same query block.
A composite column is a collection of columns that are treated as a unit during the computation of groupings. You specify the columns in parentheses as in the following statement:
ROLLUP (year, (quarter, month), day)
In this statement, the data is not rolled up across year and quarter, but is instead equivalent to the following groupings of a UNION ALL:
Here, (quarter, month) form a composite column and are treated as a unit. In general, composite columns are useful in ROLLUP, CUBE, GROUPING SETS, and concatenated groupings. For example, in CUBE or ROLLUP, composite columns would mean skipping aggregation across certain levels. That is,
GROUP BY ROLLUP(a, (b, c))
is equivalent to
GROUP BY a, b, c UNION ALL GROUP BY a UNION ALL GROUP BY ()
Here, (b, c) are treated as a unit and rollup will not be applied across (b, c). It is as if you have an alias, for example z, for (b, c) and the GROUP BY expression reduces to GROUP BY ROLLUP(a, z). Compare this with the normal rollup as in:
GROUP BY ROLLUP(a, b, c)
which would be
GROUP BY a, b, c UNION ALL GROUP BY a, b UNION ALL GROUP BY a UNION ALL GROUP BY ().
Similarly,
GROUP BY CUBE((a, b), c) would be equivalent to GROUP BY a, b, c UNION ALL GROUP BY a, b UNION ALL GROUP BY c UNION ALL GROUP By ()
In GROUPING SETS, a composite column is used to denote a particular level of GROUP BY. See Table 18-3 for more examples of composite columns.
You do not have full control over what aggregation levels you want with CUBE and ROLLUP. For example, the following statement:
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_id);
results in Oracle computing the following groupings:
(channel_desc, calendar_month_desc, country_id)
(channel_desc, calendar_month_desc)
(channel_desc)
()
If you are just interested in grouping of lines (1), (3) and (4) in the above, you cannot limit the calculation to those groupings without using composite columns. With composite columns, this is possible by treating month and country as a single unit while rolling up. Columns enclosed in parentheses are treated as a unit while computing CUBE and ROLLUP. Thus, you would say:
SELECT channel_desc, calendar_month_desc, country_id, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY ROLLUP(channel_desc, (calendar_month_desc, country_id));
Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product operation enables even a small number of concatenated groupings to generate a large number of final groups. The concatenated groupings are specified simply by listing multiple grouping sets, cubes, and rollups, and separating them with commas. Here is an example of concatenated grouping sets:
GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
The SQL above defines the following groupings:
(a, c), (a, d), (b, c), (b, d)
Concatenation of grouping sets is very helpful for these reasons:
You can also specify more than one grouping in the GROUP BY clause. For example, if you want aggregated sales values for each product rolled up across all levels in the time dimension (year, month and day), and across all levels in the geography dimension (region), you can issue the following statement:
SELECT channel_desc, calendar_year, calendar_quarter_desc, country_id, cust_state_province, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_id IN ('UK', 'US') GROUP BY channel_desc, GROUPING SETS (ROLLUP(calendar_year, calendar_quarter_desc), ROLLUP(country_id, cust_state_province)) ;
This results in the following groupings:
channel_desc, calendar_year, calendar_quarter_desc)
channel_desc, calendar_year)
channel_desc)
channel_desc, country_id, cust_state_province)
channel_desc, country_id)
channel_desc)
This is the cross-product of the following:
channel_desc
ROLLUP(calendar_year, calendar_quarter_desc), which is equivalent to calendar_year, calendar_quarter_desc), (calendar_year), ())
ROLLUP(country_id, cust_state_province), which is equivalent to ((country_id, cust_state_province), (country_id), ())
Note that the output above contains two occurrences of (channel_desc) group. To filter out the extra (channel_desc) group, the query could use a GROUP_ID function.
Another concatenated join example is given below, showing the cross product of two grouping sets:
SELECT country_id, cust_state_province,calendar_year, calendar_quarter_desc,TO_CHAR(SUM(amount_sold),'9,999,999,999') SALES$FROM sales, customers, times, channelsWHERE sales.time_id=times.time_id ANDsales.cust_id=customers.cust_id ANDsales.channel_id= channels.channel_id ANDchannels.channel_desc IN ('Direct Sales','Internet') ANDtimes.calendar_month_desc IN ('2000-09','2000-10')AND country_id IN ('UK','US')GROUP BYGROUPING SETS (country_id, cust_state_province),GROUPING SETS (calendar_year, calendar_quarter_desc);
This statement results in the computation of groupings:
country_id, year), (country_id, calendar_quarter_desc), (cust_state_province, year) and (cust_state_province, calendar_quarter_desc)
One of the most important uses for concatenated groupings is to generate the aggregates needed for a hierarchical cube of data. A hierarchical cube is a data set where the data is aggregated along the rollup hierarchy of each of its dimensions and these aggregations are combined across dimensions. It includes the typical set of aggregations needed for business intelligence queries. By using concatenated groupings, you can generate all the aggregations needed by a hierarchical cube with just n ROLLUPs (where n is the number of dimensions), and avoid generating unwanted aggregations.
Consider just three of the dimensions in the Sales History data set, each of which has a multilevel hierarchy:
year, quarter, month, day (week is in a separate hierarchy)
category, subcategory, prod_name
region, subregion, country, state, city
This data is represented using a column for each level of the hierarchies, creating a total of twelve columns for dimensions, plus the columns holding sales figures.
For our business intelligence needs, we would like to calculate and store certain aggregates of the various combinations of dimensions. In Example 18-13, we create the aggregates for all levels, except for "day", which would create too many rows. In particular, we want to use ROLLUP within each dimension to generate useful aggregates. Once we have the ROLLUP-based aggregates within each dimension, we want to combine them with the other dimensions. This will generate our hierarchical cube. Note that this is not at all the same as a CUBE using all twelve of the dimension columns: that would create 2 to the 12th power (4,096) aggregation groups, of which we need only a small fraction. Concatenated grouping sets make it easy to generate exactly the aggregations we need. Below we show GROUP BY clause needed:
SELECT calendar_year, calendar_quarter_desc, calendar_month_desc, country_region, country_subregion, countries.country_id, cust_state_province, cust_city, prod_cat_desc, prod_subcat_desc, prod_name, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels, countries, products WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND sales.prod_id=products.prod_id AND customers.country_id=countries.country_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND prod_name IN ('Ruckpart Eclipse', 'Ukko Plain Gortex Boot') AND countries.country_id IN ('UK', 'US') GROUP BY ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc), ROLLUP(country_region, country_subregion, countries.country_id, cust_state_province, cust_city), ROLLUP(prod_cat_desc, prod_subcat_desc, prod_name);
The ROLLUPs in the GROUP BY specification above generate the following groups, four for each dimension:
The concatenated grouping sets specified in the SQL above will take the ROLLUP aggregations listed in the table and perform a cross-product on them. The cross-product will create the 96 (4x4x6) aggregate groups needed for a hierarchical cube of the data. There are major advantages in using three ROLLUP expressions to replace what would otherwise require 96 grouping set expressions: the concise SQL is far less error-prone to develop and far easier to maintain, and it enables much better query optimization. You can picture how a cube with more dimensions and more levels would make the use of concatenated groupings even more advantageous.
This section discusses the following topics.
The ROLLUP and CUBE extensions work independently of any hierarchy metadata in your system. Their calculations are based entirely on the columns specified in the SELECT statement in which they appear. This approach enables CUBE and ROLLUP to be used whether or not hierarchy metadata is available. The simplest way to handle levels in hierarchical dimensions is by using the ROLLUP extension and indicating levels explicitly through separate columns. The code below shows a simple example of this with months rolled up to quarters and quarters rolled up to years.
SELECT calendar_year, calendar_quarter_number, calendar_month_number, SUM(amount_sold) FROM sales, times, products, customers WHERE sales.time_id=times.time_id AND sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND prod_name IN ('Ruckpart Eclipse', 'Ukko Plain Gortex Boot') AND country_id = 'US' AND calendar_year=1999 GROUP BY ROLLUP(calendar_year, calendar_quarter_number, calendar_month_number); CALENDAR_YEAR CALENDAR_QUARTER_NUMBER CALENDAR_MONTH_NUMBER SUM(AMOUNT_SOLD) ------------- ----------------------- --------------------- ---------------- 1999 1 2 79652 1999 1 3 156738 1999 1 236390 1999 2 4 97802 1999 2 5 116282 1999 2 6 85914 1999 2 299998 1999 3 7 113256 1999 3 8 79270 1999 3 9 103200 1999 3 295726 1999 832114 832114
CUBE, ROLLUP, and GROUPING SETS do not restrict the GROUP BY clause column capacity. The GROUP BY clause, with or without the extensions, can work with up to 255 columns. However, the combinatorial explosion of CUBE makes it unwise to specify a large number of columns with the CUBE extension. Consider that a 20-column list for CUBE would create 2 to the 20 combinations in the result set. A very large CUBE list could strain system resources, so any such query needs to be tested carefully for performance and the load it places on the system.
The HAVING clause of SELECT statements is unaffected by the use of GROUP BY. Note that the conditions specified in the HAVING clause apply to both the subtotal and non-subtotal rows of the result set. In some cases a query may need to exclude the subtotal rows or the non-subtotal rows from the HAVING clause. This can be achieved by using a GROUPING or GROUPING_ID function together with the HAVING clause. See Example 18-8 and its associated SQL statement for an example.
In many cases, a query needs to order the rows in a certain way, and this is done with the ORDER BY clause. The ORDER BY clause of a SELECT statement is unaffected by the use of GROUP BY, since the ORDER BY clause is applied after the GROUP BY calculations are complete.
Note that the ORDER BY specification makes no distinction between aggregate and non-aggregate rows of the result set. For instance, you might wish to list sales figures in declining order, but still have the subtotals at the end of each group. Simply ordering sales figures in descending sequence will not be sufficient, since that will place the subtotals (the largest values) at the start of each group. Therefore, it is essential that the columns in the ORDER BY clause include columns that differentiate aggregate from non-aggregate columns. This requirement means that queries using ORDER BY along with aggregation extensions to GROUP BY will generally need to use one or more of the GROUPING functions.
The examples in this chapter show ROLLUP and CUBE used with the SUM function. While this is the most common type of aggregation, these extensions can also be used with all other functions available to the GROUP BY clause, for example, COUNT, AVG, MIN, MAX, STDDEV, and VARIANCE. COUNT, which is often needed in cross-tabular analyses, is likely to be the second most commonly used function.
The WITH clause (formally known as subquery_factoring_clause) enables you to reuse the same query block in a SELECT statement when it occurs more than once within a complex query. WITH is a part of the SQL-99 standard. This is particularly useful when a query has multiple references to the same query block and there are joins and aggregations. Using the WITH clause, Oracle retrieves the results of a query block and stores them in the user's temporary tablespace. Note that Oracle9i does not support recursive use of the WITH clause.
The following query is an example of where you can improve performance and write SQL more simply by using the WITH clause. The query calculates the sum of sales for each channel and holds it under the name channel_summary. Then it checks each channel's sales total to see if any channel's sales are greater than one third of the total sales. By using the WITH clause, the channel_summary data is calculated just once, avoiding an extra scan through the large sales table.
WITH channel_summary AS (SELECT channels.channel_desc, SUM(amount_sold) AS channel_totalFROM sales, channelsWHERE sales.channel_id = channels.channel_idGROUP BY channels.channel_desc)SELECT channel_desc, channel_totalFROM channel_summaryWHERE channel_total > (SELECT SUM(channel_total) * 1/3FROM channel_summary);CHANNEL_DESC CHANNEL_TOTAL -------------------- ------------- Direct Sales 312829530
Note that the example above could also be performed efficiently using the reporting aggregate functions described in Chapter 19, "SQL for Analysis in Data Warehouses".
|
|
![]() Copyright © 1996-2001, Oracle Corporation. All Rights Reserved. |
|