|Oracle8i Data Warehousing Guide
Release 2 (8.1.6)
Part Number A76994-01
The following topics provide information about schemas in a data warehouse:
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. The most common data-warehouse schema model is a star schema. For this reason, most of the examples in this book utilize a star schema. However, a significant but smaller number of data warehouses use third-normal-form (3NF) schemas, or other schemas which are more highly normalized than star schemas. These 3NF data warehouses are typically very large data warehouses, which are used primarily for loading data and for feeding data marts. These data warehouses are not typically used for heavy end-user query workloads.
Some features of the Oracle8i database, such as the star transformation feature described in this chapter, are specific to star schemas, however, the vast majority of Oracle8i's data warehousing features are equally applicable to both star schemas and 3NF schemas.
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram of a star schema resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of lookup tables. Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other.
Cost-based optimization recognizes star queries and generates efficient execution plans for them. (Star queries are not recognized by rule-based optimization.)
A typical fact table contains keys and measures. For example, a simple fact table might contain the measure Sales, and keys Time, Product, and Market. In this case, there would be corresponding dimension tables for Time, Product, and Market. The Product dimension table, for example, would typically contain information about each product number that appears in the fact table. A measure is typically a numeric or character column, and can be taken from one column in one table or derived from two columns in one table or two columns in more than one table.
A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table normally has a concatenated index on the key columns to facilitate this type of join.
The main advantages of star schemas are that they:
Figure 16-1 presents a graphical representation of a star schema.
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a Product table, a Product_Category table, and a Product_Manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Figure 16-2 presents a graphical representation of a snowflake schema.
In order to get the best possible performance for star queries, it is important to follow some basic guidelines:
When a data warehouse satisfies these conditions, the majority of the star queries running in the data warehouse will use a query-execution strategy known as the star transformation. The star transformation provides very efficient query performance for star queries.
The star transformation is a cost-based query transformation aimed at executing star queries efficiently. Whereas the star optimization works well for schemas with a small number of dimensions and dense fact tables, the star transformation may be considered as an alternative if any of the following holds true:
The star transformation does not rely on computing a Cartesian product of the dimension tables, which makes it better suited for cases where fact table sparsity and/or a large number of dimensions would lead to a large Cartesian product with few rows having actual matches in the fact table. In addition, rather than relying on concatenated indexes, the star transformation is based on combining bitmap indexes on individual fact table columns.
The transformation can thus combine indexes corresponding precisely to the constrained dimensions. There is no need to create many concatenated indexes where the different column orders match different patterns of constrained dimensions in different queries.
This section provides an example of the star transformation. The star transformation is a powerful, and interesting, optimization technique which relies upon implicitly rewriting (or transforming) the SQL of the original star query.
The end user never needs to know any of the details about the star transformation; Oracle's cost-based optimizer will automatically choose the star transformation where appropriate.
However, the DBA may be interested to learn the details of the star transformation. This section will enable the DBA to understand how the star transformation algorithm works, and moreover, the DBA will be able to recognize the execution plans of star queries which are using the star transformation.
Oracle processes a star query using two basic phases. The first phase retrieves exactly the necessary rows from the fact table (the result set). Because this retrieval utilizes bitmap indexes, it is very efficient.
The second phase joins this result set to the dimension tables. Below is an example of how an end-user may query this data warehouse: "What were the sales and profits for the grocery department of stores in the west and southwest sales districts over the last three quarters?" This is a simple star query. The SQL generated by an end-user tool could look like:
SELECT store.sales_district, time.fiscal_period, SUM(sales.dollar_sales) revenue, SUM(dollar_sales) - SUM(dollar_cost) income FROM sales, store, time, product WHERE sales.store_key = store.store_key AND sales.time_key = time.time_key AND sales.product_key = product.product_key AND time.fiscal_period IN ('3Q95', '4Q95', '1Q96') and product.department = 'Grocery' AND store.sales_district IN ('San Francisco', 'Los Angeles') GROUP BY store.sales_district, time.fiscal_period;
Oracle will process this query in two phases. In the first phase, Oracle will use the bitmap indexes on the foreign-key columns of the fact table to identify and retrieve the only the necessary rows from the fact table. That is, Oracle will retrieve the result set from the fact table using essentially the following query:
SELECT ... FROM sales WHERE store_key IN (SELECT store_key FROM store WHERE sales_district IN ('WEST', 'SOUTHWEST')) AND time_key IN (SELECT time_key FROM time WHERE quarter IN ('3Q96', '4Q96', '1Q97')) AND product_key IN (SELECT product_key FROM product WHERE department = 'GROCERY');
This is the transformation step of the algorithm, because the original star query has been transformed into this subquery representation. This method of accessing the fact table leverages the strengths of Oracle's bitmap indexes. Intuitively, bitmap indexes provide a set-based processing scheme within a relational database. Oracle has implemented very fast methods for doing set operations such as AND (an intersection in standard set-based terminology), OR (a set-based union), MINUS, and COUNT.
In this star query, a bitmap index on store_key is used to identify the set of all rows in the fact table corresponding to sales in the West sales district. This set is represented as a bitmap (a string of 1's and 0's that indicates which rows of the fact table are members of the set).
A similar bitmap is retrieved for the fact-table rows corresponding to the sale in the Southwest sales district. The bitmap OR operation is used to combine this set of Southwest sales with the set of West sales.
Additional set operations will be done for the time dimension and the product dimension. At this point in the star query processing, there are three bitmaps: each bitmap corresponds to a separate dimension table, and each bitmap represents the set of rows of the fact table that satisfy that individual dimension's constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND operation. This final bitmap represents the set of rows in the fact table that satisfy all of the constraints on the dimension table; this is the result set, the exact set of rows from the fact table needed to evaluate the query. Note that none of the actual data in the fact table has been accessed; all of these operations rely solely on the bitmap indexes and the dimension tables. Because of the bitmap indexes' patented, compressed data representations, the bitmap set-based operations are extremely efficient.
Once the result set is identified, the bitmap is used to access the actual data from the sales table. Only those rows that are required for the end-user's query are retrieved from the fact table.
The second phase of this query is to join these rows from the fact table to the dimension tables. Oracle will use the most efficient method for accessing and joining the dimension tables. Many dimension are very small, and table scans are typically the most efficient access method for these dimension tables. For large dimension tables, table scans may not be the most efficient access method. In the example above, a bitmap index on product.department may be used to quickly identify all of those products in the grocery department. Oracle8's cost-based optimizer will automatically determine which access method is most appropriate for a given dimension table, based upon the cost-based optimizer's knowledge about the sizes and data distributions of each dimension table.
The specific join method (as well as indexing method) for each dimension tables will likewise be intelligently determined by the cost-based optimizer. A hash join is often the most efficient algorithm for joining the dimension tables. The final answer is returned to the user once all of the dimension tables have been joined. The query technique of retrieving only the matching rows from one table and then joining to another table is commonly known as a semi-join.
The following execution plan might result from "Star Transformation Example":
SELECT STATEMENT HASH JOIN HASH JOIN HASH JOIN TABLE ACCESS SALES BY INDEX ROWID BITMAP CONVERSION TO ROWIDS BITMAP AND BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS STORE FULL BITMAP INDEX SALES_STORE_KEY RANGE SCAN BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS TIME FULL BITMAP INDEX SALES_TIME_KEY RANGE SCAN BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS PRODUCTS FULL BITMAP INDEX SALES_PRODUCT_KEY RANGE SCAN TABLE ACCESS TIME FULL TABLE ACCESS PRODUCTS FULL TABLE ACCESS STORE FULL
In this plan, the fact table is accessed through a bitmap access path based on a bitmap AND of three merged bitmaps. The three bitmaps are generated by the BITMAP MERGE row source being fed bitmaps from row source trees underneath it. Each such row source tree consists of a BITMAP KEY ITERATION row source which fetches values from the subquery row source tree, which in this example is just a full table access. For each such value, the BITMAP KEY ITERATION row source retrieves the bitmap from the bitmap index. After the relevant fact table rows have been retrieved using this access path, they are joined with the dimension tables and temporary tables to produce the answer to the query.
The star transformation is a cost-based transformation in the following sense. The optimizer generates and saves the best plan it can produce without the transformation. If the transformation is enabled, the optimizer then tries to apply it to the query and if applicable, generates the best plan using the transformed query. Based on a comparison of the cost estimates between the best plans for the two versions of the query, the optimizer will then decide whether to use the best plan for the transformed or untransformed version.
If the query requires accessing a large percentage of the rows in the fact table, it may well be better to use a full table scan and not use the transformations. However, if the constraining predicates on the dimension tables are sufficiently selective that only a small portion of the fact table needs to be retrieved, the plan based on the transformation will probably be superior.
Note that the optimizer will generate a subquery for a dimension table only if it decides that it is reasonable to do so based on a number of criteria. There is no guarantee that subqueries will be generated for all dimension tables. The optimizer may also decide, based on the properties of the tables and the query, that the transformation does not merit being applied to a particular query. In this case the best regular plan will be used.
Star transformation is not supported for tables with any of the following characteristics:
In addition, temporary tables will not be used by star transformation under the following conditions: