7
Constraints

This chapter describes constraints, and discusses:

Why Constraints are Useful in a Data Warehouse

Constraints provide a mechanism for ensuring that data conforms to guidelines specified by the database administrator. The most common types of constraints include unique constraints (ensuring that a given column is unique), not-null constraints, and foreign-key constraints (which ensure that two keys share a primary key-foreign key relationship).

Constraints can be used for the following basic purposes in a data warehouse:

Data cleanliness: Constraints can be used to verify that the data in the data warehouse conforms to basic level of data consistency and correctness. Thus, constraints are often used in a data warehouse to prevent the introduction of dirty data.
Supporting query optimization: The Oracle database will utilize constraints when optimizing SQL queries. Although constraints can be useful in many aspects of query optimization, constraints are particularly important for query-rewrite of materialized views.

Unlike many relational-database environments, data in a data warehouse is typically added and/or modified under very controlled circumstances during the ETT process. Multiple users typically do not update the data warehouse directly; this is considerably different from the usage of a typical operational system.

Thus, the specific usage of constraints in a data warehouse may vary considerably from the usage of constraints in operational systems. Oracle8i provides a wide breadth of constraint functionality to address both the needs of data warehouses as well as the needs to operational systems.

Many significant constraint features for data warehousing were introduced in Oracle8i, so that readers familiar with Oracle's constraint functionality in Oracle7 and Oracle8 should take special note of the new functionality described in this chapter. In fact, many Oracle7 and Oracle8-based data warehouses lacked constraints because of concerns about constraint performance. The new Oracle8i functionality for constraints is designed to address these concerns for data warehouses.

Overview of Constraint States

In order to understand how to best utilize constraints in a data warehouse, it is important to first understand the basic purposes of constraints:

Enforcement: In order to use a constraint for enforcement, the constraint must be in the ENABLE state. An ENABLEd constraint can be used to ensure that all data-modifications upon a given table (or tables) will satisfy the conditions of the constraints. Data-modification operations, which would result in data that violated the constraint, will fail with a constraint-violation error.
Validation: In order to use a constraint for validation, the constraint must be in the VALIDATE state. If the constraint is validated, then all of the data that currently resides in the table satisfies the constraint.

Note that validation is independent of enforcement. Although the typical constraint in an operational system is both enabled and validated, any constraint could be validated but not enforced or vice versa (enforced but not validated). These latter two cases can be very useful for data warehouses, and are discussed in the examples below.
Belief: In order to use a constraint for belief, the constraint must be in the RELY state. In some cases, the data-warehouse administrator may know that a given constraint is true. The RELY state provides the data-warehouse administrator with a mechanism for telling Oracle8i that a given constraint is believed to be true.

Note that the RELY state is only meaningful for constraints that have not been validated.

Typical Data Warehouse Constraints

This section assumes that most readers are familiar with the typical usage of constraints; that is, constraints that are both enabled and validated. For data warehousing, many users have discovered the such constraints may not be effective, because it may be prohibitively costly to build and maintain such constraints.

Unique Constraints in a Data Warehouse

A unique constraint is typically enforced using a unique index. However, in a data warehouse, in which tables can be extremely large, creating a unique index may be very costly both in terms of processing time and disk space.

Suppose that a data warehouse contained a table SALES, with includes a column SALES_ID. SALES_ID uniquely identifies a single sales transaction, and the data-warehouse wants to ensure that this column is unique within the data warehouse.

One way to create the constraint is:

ALTER TABLE sales ADD CONSTRAINT sales_unique UNIQUE(sales_id);

By default, this constraint is both enabled and validated. Oracle will implicitly create a unique index on sales_id to support this constraint. However, this index may be problematic in a data warehouse for three reasons:

The unique index can be very large, since the SALES table could easily have millions or even billions of rows.
The unique index is rarely used for query execution; most typical data warehousing queries will not have predicates on unique keys, so this index will probably not improve performance.
If SALES is partitioned along a column other than SALES_ID, the unique index must be a global index. This can detrimentally affect all maintenance operations on the SALES table.

Why is an index required for unique constraints? The index is used for enforcement. That is, the unique index is necessary to ensure that each individual row that is modified in the SALES table will satisfy the unique constraint.

For data warehousing tables, an alternative mechanism for unique constraints is:

ALTER TABLE sales ADD CONSTRAINT sales_unique UNIQUE (sales_id) 
  DISABLE VALIDATE;

This statement will create a unique constraint, but since the constraint is disabled, there is no need for a unique index. This approach can be advantageous for many data warehousing environments, because the constraint now ensures uniqueness without the shortcomings of a unique index.

However, there are trade-offs for the data warehouse administrator to consider with DISABLE VALIDATE constraints. Since this constraint is disabled, no DML statements which may modify the unique column are permitted against the SALES table. Thus, there are two strategies for modifying this table in the presence of a constraint:

Use DDL to add data to this table (such as exchanging partitions). See the example in Chapter 14, "Loading and Refreshing".
Prior to the modifying this table, drop the constraint. Then, make all of the necessary data modifications. Finally, recreate the disabled constraint. Note that recreating the constraint will be much more efficient that re-creating an enabled constraint. This approach will not guarantee that the data in the SALES table is unique while the constraint has been dropped.

Foreign Key Constraints in a Data Warehouse

In a star schema data warehouse, foreign-key constraints are typically created in order to validate the relationship between the fact table and the dimension tables. An example constraint might be:

ALTER TABLE sales ADD CONSTRAINT fk_sales_time
FOREIGN KEY (sales_time_id) REFERENCES time (time_id)
ENABLE VALIDATE;

However, there are several different requirements in which a data warehouse administrator may choose to use different states for the foreign-key constraints. In particular, the ENABLE NOVALIDATE state can be useful for data warehousing. A data warehouse administrator may choose to use an ENABLE NOVALIDATE constraint when either:

There is data in the tables that currently disobeys the constraint, but the data warehouse administrator wishes to create a constraint for enforcement.
There is a requirement to quickly create an enforced constraint.

Suppose that the data warehouse loaded new data into the fact table(s) every day, but only refreshed the dimension tables on the weekend. Then, during the week, it is possible that the dimension tables and fact tables may in fact disobey the foreign-key constraints. Nevertheless, the data warehouse administrator may wish to maintain the enforcement of this constraint to prevent any changes that might affect the foreign key constraint outside of the ETT process. Thus, the foreign key constraints could be created every night, following the ETT process, as:

ALTER TABLE sales ADD CONSTRAINT fk_sales_time
FOREIGN KEY (sales_time_id) REFERENCES time (time_id)
ENABLE NOVALIDATE;

Another usage of ENABLE NOVALIDATE is for quickly creating an enforced constraint, even when the constraint is believed to be true. Suppose that the ETT process verifies that a foreign-key constraint is true. Rather than have the database re-verify this foreign-key constraint (which would require time and database resources), the data-warehouse administrator could instead create a foreign-key constraint using ENABLE NOVALIDATE.

RELY Constraints

It is not uncommon for the ETT process to verify that certain constraints are true. For example, the ETT process may validate all of the foreign keys in the incoming data for the fact table. In these environments, the data warehouse administrator may choose to trust the ETT process to provide clean data, instead of implementing constraints in the data warehouse.

For example, during the ETT process, the data warehouse administrator may have verified that a foreign-key constraint is true. Rather than have the database re-verify this foreign-key constraint (which would require time and database resources), the data warehouse administrator could instead create a foreign-key constraint with the RELY state:

ALTER TABLE sales add CONSTRAINT fk_sales_time
FOREIGN KEY (sales_time_id) REFERENCES time (time_id) 
DISABLE NOVALIDATE rely;

The existence of RELY constraints, even though they are not used for data-validation, may be important for several purposes, including:

constraints may be used to enable more sophisticated query-rewrites for materialized views. See Chapter 19, "Query Rewrite", for further details.
other data-warehousing tools may retrieve information regarding constraints directly from the Oracle data dictionary.

Creating a RELY constraint is very inexpensive. Because the constraint is not being validated, there is no data processing necessary to create such a constraint.

Constraints and Parallelism

All constraints can be validated in parallel. When validating constraints on very large tables, parallelism is often necessary to meet performance goals. The degree of parallelism for a given constraint operation is determined by the default degree of parallelism of the underlying table.

Constraints and Partitioning

Many aspects of creating and maintaining constraints can be executed on a per-partition basis. Later chapters will discuss the significance of partitioning for data warehousing, and partitioning can provide benefits to constraint management (just as partitioning provide benefits to managing of many other operations). For example, Chapter 14, "Loading and Refreshing", provides a scenario example of creating unique and foreign-key constraints on a separate staging table, and these constraints are maintained during the EXCHANGE PARTITION statement.

7 Constraints