This chapter describes some of the basic ideas in business intelligence.
This chapter contains the following topics:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
A typical data warehouse query scans thousands or millions of rows.For example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
Data warehouses usually store many months or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
Summaries in Oracle are called materialized views.
Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.
Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the
If any errors occur during loading, an error is logged and the operation can continue.
Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different computer architectures and operating systems.
Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios as mentioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various transformations steps.
A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
Multiple rows to be returned from a function
Results of SQL subqueries (that select multiple rows) to be passed directly to functions
Functions take cursors as input
Functions can be parallelized
Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path
SELECT statement in conjunction with the
SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining.
The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (
DELETE) are possible and no indexes can be created on them.
External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
See Also:"Table Compression"
Change Data Capture efficiently identifies and captures data that has been added to, updated, or removed from Oracle relational tables, and makes the change data available for use by applications.
Oftentimes, data warehousing involves the extraction and transportation of relational data from one or more source databases into the data warehouse for analysis. Change Data Capture quickly identifies and processes only the data that has changed, not entire tables, and makes the change data available for further use.
Change Data Capture does not depend on intermediate flat files to stage the data outside of the relational database. It captures the change data resulting from
DELETE operations made to user tables. The change data is then stored in a database object called a change table, and the change data is made available to applications in a controlled way.
One technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can perform a number of roles, such as improving query performance or providing replicated data.
Previously, organizations using summaries spent a significant amount of time and effort creating summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. Summary management eased the workload of the database administrator and meant that the user no longer needed to be aware of the summaries that had been defined. The database administrator creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables and views at the detail data level.
The query rewrite mechanism in the Oracle database server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end user or database application can construct queries that directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because any change to the summaries will affect the queries that reference them.
To help you select from among the many possible materialized views in your schema, Oracle provides a collection of materialized view analysis and advisor functions and procedures in the
DBMS_ADVISOR package. Collectively, these functions are called the SQL Access Advisor, and they are callable from any PL/SQL program. The SQL Access Advisor recommends materialized views from a hypothetical or user-defined workload or one obtained from the SQL cache. You can run the SQL Access Advisor from Oracle Enterprise Manager or by invoking the
See Also:Oracle Database Performance Tuning Guide for information about materialized views and the SQL Access Advisor
Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
Reduced response time for large classes of ad hoc queries
Reduced storage requirements compared to other indexing techniques
Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory
Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids.
Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
Bitmap indexes are most effective for queries that contain multiple conditions in the
WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. A good candidate for a bitmap index would be a gender column due to the low number of possible values.
When Oracle runs SQL statements in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a statement among multiple processes, Oracle can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel processing.
Parallel execution dramatically reduces response time for data-intensive operations on large databases typically associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large-scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up among many CPUs on a single Oracle system. You can also implement parallel execution on certain types of online transaction processing (OLTP) and hybrid systems.
Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high.
Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you need to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance.
Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere seconds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution time.
See Also:Oracle Database Data Warehousing Guide for specific information on tuning your parameter files and database to take full advantage of parallel execution
When parallel execution is not used, a single server process performs all necessary processing for the sequential execution of a SQL statement. For example, to perform a full table scan (such as
FROM emp), one process performs the entire operation, as illustrated in Figure 16-5.
Figure 16-6 illustrates several parallel execution servers performing a scan of the table
emp. The table is divided dynamically (dynamic partitioning) into load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordinator. Each granule is a range of physical blocks of the table
emp. The mapping of granules to execution servers is not static, but is determined at execution time. When an execution server finishes reading the rows of the table
emp corresponding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granules are exhausted, in other words, until the entire table
emp has been read. The parallel execution servers send results back to the parallel execution coordinator, which assembles the pieces into the desired full table scan.
Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL statement all have the same degree of parallelism.
See Also:Oracle Database Data Warehousing Guide for information on granules as well as how Oracle divides work and handles DOP in multiuser environments
Oracle has introduced many SQL operations for performing analytic operations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, this syntax offers much better performance.
This section discusses:
Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides extensions to the GROUP BY clause to make querying and reporting easier and faster. Some of these extensions enable you to:
Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
Calculate all possible combinations of aggregations with a single statement
Generate the information needed in cross-tabulation reports with a single query
These extension let you specify exactly the groupings of interest in the
BY clause. This allows efficient analysis across multiple dimensions without performing a
CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance.
ROLLUP, and grouping sets produce a single result set that is equivalent to a
ALL of differently grouped rows.
To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
Oracle has advanced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate:
Rankings and percentiles
Moving window calculations
Linear regression statistics
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.
Other features include the
CASE expressions provide if-then logic useful in many situations.
To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity.
MODEL clause brings a new level of power and flexibility to SQL calculations. With the
MODEL clause, you can create a multidimensional array from query results and then apply formulas to this array to calculate new values. The formulas can range from basic arithmetic to simultaneous equations using recursion. For some applications, the
MODEL clause can replace PC-based spreadsheets. Models in SQL leverage Oracle's strengths in scalability, manageability, collaboration, and security. The core query engine can work with unlimited quantities of data. By defining and executing models within the database, users avoid transferring large datasets to and from separate modeling environments. Models can be shared easily across workgroups, ensuring that calculations are consistent for all applications. Just as models can be shared, access can also be controlled precisely with Oracle's security features. With its rich functionality, the
MODEL clause can enhance all types of applications.
Oracle OLAP provides the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access.
Basing an OLAP system directly on the Oracle database server offers the following benefits:
There is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.
Partitioning allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Transportable tablespaces also support high availability. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.
Oracle lets you precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users.
Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager lets you view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server using system views.
Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlights are:
Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations.
Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table.
Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.
The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine grained access control enables cell-level security for OLAP users. Fine grained access control works with minimal burden on query processing, and it enables efficient centralized security management.
Oracle Data Mining (ODM) embeds data mining within the Oracle Database. The data never leaves the database — the data, data preparation, model building, and model scoring results all remain in the database. This enables Oracle to provide an infrastructure for application developers to integrate data mining seamlessly with database applications. Some typical examples of the applications that data mining are used in are call centers, ATMs, ERM, and business planning applications.
By eliminating the need for extracting data into specialized tools and then importing the results back into the database, you can save significant amounts of time. In addition, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code.
Oracle Data Mining supports the following algorithms:
For classification, Naive Bayes, Adaptive Bayes Networks, and Support Vector Machines (SVM)
For regression, Support Vector Machines
For clustering, k-means and O-Cluster
For feature extraction, Non-Negative Matrix Factorization (NMF)
For sequence matching and annotation, BLAST
ODM also includes several feature and performance enhancements.
Javadoc descriptions of classes for detailed information about the classes that constitute the ODM Java API