| Oracle8i Application Developer's Guide - Fundamentals Release 8.1.5 A68003-01 |
|
This chapter covers the following topics:
The last decade has seen a tremendous increase in the use of query, reporting, and on-line analytical processing (OLAP) tools, often in conjunction with data warehouses and data marts. Enterprises exploring new markets and facing greater competition expect these tools to provide the maximum possible decision-making value from their data resources.
Oracle expands its long-standing support for analytical applications in Oracle8i release 8.1.5 with the CUBE and ROLLUP extensions to SQL. Oracle also provides optimized performance and simplified syntax for Top-N queries. These enhancements make important calculations significantly easier and more efficient, enhancing database performance, scalability and simplicity.
ROLLUP and CUBE are simple extensions to the SELECT statement's GROUP BY clause. ROLLUP creates subtotals at any level of aggregation needed, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single statement to calculate all possible combinations of subtotals. CUBE can generate the information needed in cross-tab reports with a single query. To enhance performance, both CUBE and ROLLUP are parallelized: multiple processes can simultaneously execute both types of statements.
Enhanced Top-N queries enable more efficient retrieval of the largest and smallest values of a data set. This chapter presents concepts, syntax, and examples of CUBE, ROLLUP and Top-N analysis.
One of the key concepts in decision support systems is "multi-dimensional analysis": examining the enterprise from all necessary combinations of dimensions. We use the term "dimension" to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as "facts." The facts may be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multi-dimensional requests:
All the requests above constrain multiple dimensions. Many multi-dimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
To visualize data that has many dimensions, analysts commonly use the analogy of a data "cube," that is, a space where facts are stored at the intersection of n dimensions. Figure 20-1 shows a data cube and how it could be used differently by various groups. The cube stores sales data organized by the dimensions of Product, Market, and Time.
We can retrieve "slices" of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 20-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube.
Answering multi-dimensional questions often involves huge quantities of data, sometimes millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Subtotals across many dimensions are vital to multi-dimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.
Not only multi-dimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems--all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels needs optimized performance.
To leverage the power of the database server, powerful aggregation commands should be available inside the SQL engine. New extensions in Oracle provide these features and bring many benefits, including:
Oracle8i provides all these benefits with the new CUBE and ROLLUP extensions to the GROUP BY clause. These extensions adhere to the ANSI and ISO proposals for SQL3, a draft standard for enhancements to SQL.
To illustrate CUBE, ROLLUP, and Top-N queries, this chapter uses a hypothetical videotape sales and rental company. All the examples given refer to data from this scenario. The hypothetical company has stores in several regions and tracks sales and profit information. The data is categorized by three dimensions: Time, Department, and Region. The time dimensions are 1996 and 1997, the departments are Video Sales and Video Rentals, and the regions are East, West, and Central.
Table 20-1 is a sample cross-tabular report showing the total profit by region and department in 1997:
|
1997 |
|||
| Region |
Department |
||
| Video Rental Profit | Video Sales Profit | Total Profit | |
| Central |
82,000 |
85,000 |
167,000 |
| East |
101,000 |
137,000 |
238,000 |
| West |
96,000 |
97,000 |
193,000 |
| Total |
279,000 |
319,000 |
598,000 |
Consider that even a simple report like Table 20-1, with just twelve values in its grid, needs five subtotals and a grand total. The subtotals are the shaded numbers, such as Video Rental Profits across regions, namely, 279,000, and Eastern region profits across department, namely, 238,000. Half of the values needed for this report would not be calculated with a query that used a standard SUM() and GROUP BY. Database commands that offer improved calculation of subtotals bring major benefits to querying, reporting and analytical operations.
ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding minimal overhead to a query.
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
ROLLUP's action is straightforward: it creates subtotals which "roll up" from the most detailed level to a grand total, following a grouping list specified in the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates progressively higher-level subtotals, moving from right to left through the list of grouping columns. Finally, it creates a grand total.
ROLLUP will create subtotals at n+1 levels, where n is the number of grouping columns. For instance, if a query specifies ROLLUP on grouping columns of Time, Region, and Department ( n=3), the result set will include rows at four aggregation levels.
This example of ROLLUP uses the data in the video store database.
SELECT Time, Region, Department, sum(Profit) AS Profit FROM sales GROUP BY ROLLUP(Time, Region, Dept)
As you can see in Table 20-2, this query returns the following sets of rows:
GROUP BY without using ROLLUP
Table 20-2 ROLLUP Aggregation across Three Dimensions
The NULL values returned by ROLLUP and CUBE are not always the traditional NULL value meaning "value unknown." Instead, a NULL may indicate that its row is a subtotal. For instance, the first NULL value shown in Table 20-2 is in the Department column. This NULL means that the row is a subtotal for "All Departments" for the Central region in 1996. To avoid introducing another non-value in the database system, these subtotal values are not given a special tag.
See the section "GROUPING Function" for details on how the NULLs representing subtotals are distinguished from NULLs stored in the data.
The result set in Table 20-1 could be generated by the UNION of four SELECT statements, as shown below. This is a subtotal across three dimensions. Notice that a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked with UNION ALL.
SELECT Time, Region, Department, SUM(Profit) FROM Sales GROUP BY Time, Region, Department UNION ALL SELECT Time, Region, '' , SUM(Profit) FROM Sales GROUP BY Time, Region UNION ALL SELECT Time, '', '', SUM(Profits) FROM Sales GROUP BY Time UNION ALL SELECT '', '', '', SUM(Profits) FROM Sales;
The approach shown in the SQL above has two shortcomings compared to using the ROLLUP operator. First, the syntax is complex, requiring more effort to generate and understand. Second, and more importantly, query execution is inefficient because the optimizer receives no guidance about the user's overall goal. Each of the four SELECT statements above causes table access even though all the needed subtotals could be gathered with a single pass. The ROLLUP extension makes the desired result explicit and gathers its results with just one table access.
The more columns used in a ROLLUP clause, the greater the savings versus the UNION approach. For instance, if a four-column ROLLUP replaces a UNION of 5 SELECT statements, the reduction in table access is four-fifths or 80%.
Some data access tools calculate subtotals on the client side and thereby avoid the multiple SELECT statements described above. While this approach can work, it places significant loads on the computing environment. For large reports, the client must have substantial memory and processing power to handle the subtotaling tasks. Even if the client has the necessary resources, a heavy processing burden for subtotal calculations may slow down the client in its performance of other activities.
Use the ROLLUP extension in tasks involving subtotals.
ROLLUP of year/month/day or country/state/city.
ROLLUP query executes in parallel.
Note that the subtotals created by ROLLUP are only a fraction of possible subtotal combinations. For instance, in the cross-tab shown in Table 20-1, the departmental totals across regions (279,000 and 319,000) would not be calculated by a ROLLUP(Time, Region, Department) clause. To generate those numbers would require a ROLLUP clause with the grouping columns specified in a different order: ROLLUP(Time, Department, Region). The easiest way to generate the full set of subtotals needed for cross-tabular reports such as those needed for Figure 20-1 is to use the CUBE extension.
CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions. It also calculates a grand total. This is the set of information typically needed for all cross-tabular reports, so CUBE can calculate a cross-tabular report with a single SELECT statement. Like ROLLUP, CUBE is a simple extension to the GROUP BY clause, and its syntax is also easy to learn.
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY CUBE (grouping_column_reference_list)
CUBE takes a specified set of grouping columns and creates subtotals for all possible combinations of them. In terms of multi-dimensional analysis, CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions. If you have specified CUBE(Time, Region, Department), the result set will include all the values that would be included in an equivalent ROLLUP statement plus additional combinations. For instance, in Table 20-1, the departmental totals across regions (279,000 and 319,000) would not be calculated by a ROLLUP(Time, Region, Department) clause, but they would be calculated by a CUBE(Time, Region, Department) clause. If there are n columns specified for a CUBE, there will be 2n combinations of subtotals returned. Table 20-3 gives an example of a three-dimension CUBE.
This example of CUBE uses the data in the video store database.
SELECT Time, Region, Department, sum(Profit) AS Profit FROM sales GROUP BY CUBE (Time, Region, Dept)
Table 20-3 shows the results of this query.
Just as for ROLLUP, multiple SELECT statements combined with UNION statements could provide the same information gathered through CUBE. However, this may require many SELECT statements: for an n-dimensional cube, 2n SELECT statements are needed. In our 3-dimension example, this would mean issuing 8 SELECTS linked with UNION ALL.
Consider the impact of adding just one more dimension when calculating all possible combinations: the number of SELECT statements would double to 16. The more columns used in a CUBE clause, the greater the savings versus the UNION approach. For instance, if a four-column CUBE replaces a UNION of 16 SELECT statements, the reduction in table access is fifteen-sixteenths or 93.75%.
CUBE in any situation requiring cross-tabular reports. The data needed for cross-tabular reports can be generated with a single SELECT using CUBE. Like ROLLUP, CUBE can be helpful in generating summary tables. Note that population of summary tables is even faster if the CUBE query executes in parallel.
CUBE is especially valuable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension. For instance, a commonly requested cross-tabulation might need subtotals for all the combinations of month/state/product. These are three independent dimensions, and analysis of all possible subtotal combinations will be commonplace. In contrast, a cross-tabulation showing all possible combinations of year/month/day would have several values of limited interest, since there is a natural hierarchy in the time dimension. Subtotals such as profit by day of month summed across year would be unnecessary in most analyses.
The examples in this chapter show ROLLUP and CUBE used with the SUM() operator. While this is the most common type of aggregation, the extensions can also be used with all the other functions available to Group by clauses, for example, COUNT, AVG, MIN, MAX, STDDEV, and VARIANCE. COUNT, which is often needed in cross-tabular analyses, is likely be the second most helpful function.
Two challenges arise with the use of ROLLUP and CUBE. First, how can we programmatically determine which result set rows are subtotals, and how do we find the exact level of aggregation of a given subtotal? We will often need to use subtotals in calculations such as percent-of-totals, so we need an easy way to determine which rows are the subtotals we seek. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How does an application or developer differentiate between the two?
To handle these issues, Oracle 8i introduces a new function called GROUPING. Using a single column as its argument, Grouping returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, will return a 0.
GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT ... [GROUPING(dimension_column)...] ... GROUP BY ... {CUBE | ROLLUP}
This example uses grouping to create a set of mask columns for the result set shown in Table 20-3. The mask columns are easy to analyze programmatically.
SELECT Time, Region, Department, SUM(Profit) AS Profit, GROUPING (Time) as T, GROUPING (Region) as R, GROUPING (Department) as D FROM Sales GROUP BY ROLLUP (Time, Region, Department)
Table 20-4 shows the results of this query.
A program can easily identify the detail rows above by a mask of "0 0 0" on the T, R, and D columns. The first level subtotal rows have a mask of "0 0 1", the second level subtotal rows have a mask of "0 1 1", and the overall total row have a mask of "1 1 1".
Table 20-5 shows an ambiguous result set created using the CUBE extension.
| Time | Region | Profit |
|---|---|---|
|
1996 |
East |
200,000 |
|
1996 |
[NULL] |
200,000 |
|
[NULL] |
East |
200,000 |
|
[NULL] |
[NULL] |
190,000 |
|
[NULL] |
[NULL] |
190,000 |
|
[NULL] |
[NULL] |
190,000 |
|
[NULL] |
[NULL] |
390,000 |
In this case, four different rows show NULL for both Time and Region. Some of those NULLs must represent aggregates due to the CUBE extension, and others must be NULLs stored in the database. How can we tell which is which? GROUPING functions, combined with the NVL and DECODE functions, resolve the ambiguity so that human readers can easily interpret the values.
We can resolve the ambiguity by using the GROUPING and other functions in the code below.
SELECT decode(grouping(Time), 1, 'All Times', Time) as Time, decode(grouping(region), 1, 'All Regions', 0, null)) as Region, sum(Profit) AS Profit from Sales group by CUBE(Time, Region)
This code generates the result set in Table 20-6. These results include text values clarifying which rows have aggregations.
| Time | Region | Profit |
|---|---|---|
|
1996 |
East |
200,000 |
|
1996 |
All Regions |
200,000 |
|
All Times |
East |
200,000 |
|
[NULL] |
[NULL] |
190,000 |
|
[NULL] |
All Regions |
190,000 |
|
All Times |
[NULL] |
190,000 |
|
All Times |
All Regions |
390,000 |
To explain the SQL statement above, we will examine its first column specification, which handles the Time column. Look at the first line of the in the SQL code above, namely,
decode(grouping(Time), 1, 'All Times', Time) as Time,
The Time value is determined with a DECODE function that contains a GROUPING function. The GROUPING function returns a 1 if a row value is an aggregate created by ROLLUP or CUBE, otherwise it returns a 0. The DECODE function then operates on the GROUPING function's results. It returns the text "All Times" if it receives a 1 and the time value from the database if it receives a 0. Values from the database will be either a real value such as 1996 or a stored NULL. The second column specification, displaying Region, works the same way.
The GROUPING function is not only useful for identifying NULLs, it also enables sorting subtotal rows and filtering results. In the example below (Table 20-7), we retrieve a subset of the subtotals created by a CUBE and none of the base-level aggregations. The HAVING clause constrains columns which use GROUPING functions.
SELECT Time, Region, Department, SUM(Profit) AS Profit, GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D FROM Sales GROUP BY CUBE (Time, Region, Department) HAVING (D=1 AND R=1 AND T=1) OR (R=1 AND D=1) OR (T=1 AND D=1)
Table 20-7 shows the results of this query.
| Time | Region | Department | Profit |
|---|---|---|---|
|
1996 |
[NULL] |
[NULL] |
526,000 |
|
1997 |
[NULL] |
[NULL] |
598,000 |
|
[NULL] |
Central |
[NULL] |
316,000 |
|
[NULL] |
East |
[NULL] |
442,000 |
|
[NULL] |
West |
[NULL] |
366,000 |
|
[NULL] |
[NULL] |
[NULL] |
1,124,000 |
Compare the result set of Table 20-7 with that in Table 20-3 to see how Table 20-7 is a precisely specified group: it contains only the yearly totals, regional totals aggregated over time and department, and the grand total.
This section discusses the following topics.
The ROLLUP and CUBE extensions work independently of any hierarchy metadata in your system. Their calculations are based entirely on the columns specified in the SELECT statement in which they appear. This approach enables CUBE and ROLLUP to be used whether or not hierarchy metadata is available. The simplest way to handle levels in hierarchical dimensions is by using the ROLLUP extension and indicating levels explicitly through separate columns. The code below shows a simple example of this with months rolled up to quarters and quarters rolled up to years.
SELECT Year, Quarter, Month, SUM(Profit) AS Profit FROM sales GROUP BY ROLLUP(Year, Quarter, Month)
This query returns the rows in Table 20-8.
CUBE and ROLLUP do not restrict the GROUP BY clause column capacity. The GROUP BY clause, with or without the extensions, can work with up to 255 columns. However, the combinatorial explosion of CUBE makes it unwise to specify a large number of columns with the CUBE extension. Consider that a 20-column list for CUBE would create 220 combinations in the result set. A very large CUBE list could strain system resources, so any such query needs to be tested carefully for performance and the load it places on the system.
The HAVING clause of SELECT statements is unaffected by the use of ROLLUP and CUBE. Note that the conditions specified in the HAVING clause apply to both the subtotal and non-subtotal rows of the result set. In some cases a query may need to exclude the subtotal rows or the non-subtotal rows from the HAVING clause. This can be achieved by using the GROUPING function together with the HAVING clause. See Table 20-7 and its associated SQL for an example.
Top-N queries ask for the n largest or smallest values of a column. An example is "What are the top ten best selling products in the U.S.?" Of course, we may also want to ask "What are the 10 worst selling products?" Both largest-values and smallest-values sets are considered Top-N queries.
Top-N queries use a consistent nested query structure with the elements described below.
ORDER BY clause to ensure that the ranking is in the desired order. For results retrieving the largest values, a DESC parameter is needed.
The high-level structure of these queries is:
SELECT column_list ROWNUM FROM (SELECT column_list FROM table ORDER BY Top-N_column) WHERE ROWNUM <= N
To illustrate the concepts here, we extend the scenario used in our earlier examples. We will now access the name of the sales representative associated with each sale, stored in the "name" column. and the sales commission earned on every sale. The SQL below returns the top 10 sales representatives ordered by dollar sales, with sample data shown in Table 20-9:
select ROWNUM AS Rank, Name, Region, Sales from (select Name, Region, sum(Sales) AS Sales from Sales GROUP BY Name, Region order by sum(Sales) DESC) WHERE ROWNUM <= 10
This example can be augmented to show the sales representatives' ranks both for sales and commissions in a single query. We now extend our query to include the sales commission earned on every sale, stored in the "commission" column. The extra information requires another layer of nested subquery. Although interpreting several layers of queries can be challenging, the SQL below has been formatted to clarify the meaning.
Below is the SQL needed for our scenario, with the sample results shown in Table 20-10. To understand the query, please step through the code following the number sequence shown at the left edge:
4) SELECT ROWNUM as SalesRank, Name, Region, SalesDollars, CommRank from 2) (SELECT Name, Region, SalesDollars, ROWNUM AS CommRank from 1) ( SELECT Name, Region, sum(Sales) AS SalesDollars, sum(commission) FROM Sales GROUP BY Name, Region ORDER BY sum(Commission) DESC ) 3) ORDER BY Sales DESC ) 5) WHERE ROWNUM <=10
Note that the results in Table 20-10 show how commission ranks are not identical to sales ranks in this data set: some representatives had higher or lower commission rates tied to specific sales.
Joint Technical Committee ISO/IEC JTC 1, Information Technology. ISO Working Draft Database Language SQL --Part 2: Foundation (SQL/Foundation), Document ID: ISO/IEC FCD 9075-2:199x, September 1997.