Oracle® Business Intelligence Standard Edition One Tutorial
Release 10g (10.1.3.2.1)
E10312-01
B Design the Data Mart

This chapter looks at the issues involved in the design of a data mart. Think of this chapter as a collection of tips on how to run your data mart implementation project. Just as important as learning what you should do is learning what to watch out for—the things that can trip you up on a project like this (and these may not always be technical issues).

The driving business factor for the data mart is the need for information, and the best way to start the design process is by identifying the business needs. You should involve those who have an investment in the data mart, such as the business sponsor and the end user, as early as possible in the design process. Together, you should agree on the information requirements that the data mart must fulfill, the data sources, the technical requirements (such as how often the data needs to be refreshed from the source), and the success criteria for the project.

The steps in designing a data mart are:

  1. Conducting a study to define the scope of the project

  2. Defining the business and technical requirements for the data mart

  3. Developing the logical and physical design of the data mart

This chapter contains the following topics:

B.1 Defining the Scope of the Data Mart Project

Before you begin to implement the data mart, you need to develop a plan for its delivery. Critical inputs to this plan are the information requirements and priorities of your users. After this information has been defined and approved by your business sponsor, you can develop your list of key deliverables and assign responsibilities to your team.

Your first task is to define the scope of the project. The scope of the data mart defines the boundaries of the project and is typically expressed in some combination of geography, organization and application, or business functions. Defining scope usually requires making compromises as you try to balance resources (such as people, systems, and budget) with the scheduled completion date and the capabilities you promised to deliver. Defining your scope and making it clear to everyone involved is important because it sets expectations and gives you a basis for deciding whether later requests belong in this phase or in a future one.

B.2 Defining the Requirements for the Data Mart

To start the implementation of the data mart, you need to define the business and technical requirements. However, you should expect the requirements to change as users use the initial implementation and are better able to communicate their requirements. The development of the data mart is an iterative process, because the data mart evolves in response to feedback from users.

This section contains the following topics:

B.2.1 Define Business Requirements

The purpose of the data mart is to provide access to data that is specific to a particular department or functional area. The data should be at a meaningful level of detail for the kind of analysis that the end users want to perform, and should be presented in the business terms that they understand. The expectation is that the analysis of the data in a data mart will lead to more informed business decisions. Therefore, you need to understand how the business person makes decisions—what questions the users ask in the decision-making process, and what data is necessary to answer those questions. The best way to understand the business processes is through interviews with the business people. The requirements identified as a result of these interviews comprise the business requirements for your data mart.

As you gather requirements for the data mart, you should focus your efforts on the information needs of a single subject area. Remember that no requirements document will be complete enough to get all information at the outset, and you need to design the data mart to accommodate changing needs. However, if you introduce new subjects or deviate from your primary theme, you will lose your focus and schedule.

Even though the data mart addresses one subject area, it usually has many business users, each with different requirements and expectations. Try to identify at least one representative from each area of the business for your interviews. For example, if you are building a marketing data mart, interview several people involved in various aspects of the marketing function (such as a marketing analyst, channel specialist, direct marketing manager, and promotion manager).

Use a consistent set of questions or an interview template for each interview. The questions should focus on the users' information requirements, such as content, frequency of update, priorities, and level of detail. Finally, set a definite time limit on the interview and requirements gathering phase; otherwise, it could continue indefinitely as you try to refine each requirement. You cannot collect all requirements in this time frame, but you will get enough to create a road map. Needs will change during the implementation period, and you will need a way to evaluate and accommodate requests for changes, or to reject them and defer them to a future phase.

B.2.2 Identify Technical Requirements

You must identify the technical requirements. These specify where you get the data that feeds the data mart. The primary sources of data for data marts are the operational systems that handle the day-to-day transactional activities. Usually, these operational systems are online transaction processing (OLTP) systems. Your data mart may be fed from more than one such operational source. However, you cannot usually transfer the data from the operational system into the data mart without intermediate processing. You need to understand how clean the operational data is and how much formatting or translation is needed to integrate it with other sources. Also, you need to determine how often you must refresh or update the data. For example, if you use the data for relatively long-term planning and analytical horizons, you may need a weekly or monthly feed rather than a daily feed. Note that the frequency of update does not necessarily determine the level of detail in the data mart. You can have a monthly feed of data summarized by week. In this initial phase of data mart design, you need to identify data sources, the kind of data cleansing needed, and the frequency with which data should be refreshed.

B.2.3 How Do You Know If You Have Done It Right?

When you finish your interviews, you have a set of information and performance requirements that your data mart application must meet. You should be realistic and prioritize the needs and develop a list of success criteria. To prioritize your list, ask yourself these questions:

  • Is performance the primary concern?

  • Are you constrained by your systems configuration?

  • How often do you want to update or append to the data?

  • Do the users expect the data mart to be a comprehensive source for departmental data, or is the data mart limited in scope to a particular topic within that department?

  • Is your scope consistent with IT architecture, or can you develop autonomously?

Consider the answers to these questions as you develop your priorities and critical success factors.

In summary, here are some guidelines to facilitate your requirements definition process:

  • Involve the end users throughout the process.

  • Structure your requirements analysis framework: define the requirements from the perspectives of the business sponsor, the IT architect, the data mart developer, and the end users.

  • Manage the expectations of the end users.

B.3 Data Mart Design

At the beginning of the design stage, business requirements are already defined, the scope of your data mart application has been agreed upon, and you have a conceptual design. Now, you need to translate your requirements into a system deliverable. In this step, you create the logical and physical design for the data mart and, in the process, define the specific data content, relationships within and between groups of data, the system environment supporting your data mart, the data transformations required, and the frequency with which data is refreshed.

The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects.

Your data mart design should be oriented toward the needs of your end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. Your design is driven primarily by end-user utility, but the end users may not know what they need until they see it. A well-planned design allows for growth and changes as the needs of users change and evolve.

The quality of your design determines your success in meeting the initial requirements. Because you do not have the luxury of unlimited system and network resources, optimal utilization of resources is determined primarily by your design. By beginning with the logical design, you focus on the information requirements without getting bogged down immediately with implementation detail.

Note that you are not forced to work in a top-down fashion. You can reverse-engineer an existing data schema and use this as a starting point for your design. If your data requirements are very clear and you are familiar with the source data, you might be able to begin at the physical design level and then proceed directly to the physical implementation. In practice, it takes several iterations before you achieve the right design.

This section contains the following topics:

B.3.1 Creating a Logical Design

A logical design is a conceptual, abstract design. You do not deal with the physical implementation details yet; you deal only with defining the types of information that you need. The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

While entity-relationship diagramming has traditionally been associated with highly normalized models, such as OLTP applications, the technique is still useful in dimensional modeling. You just approach it differently. In dimensional modeling, instead of seeking to discover atomic units of information and all of the relationships between them, you try to identify which information belongs to a central fact table and which information belongs to its associated dimension tables.

Attention to design is critical. Keep your business requirements on hand throughout the design process. Nothing else is more important!

As part of the design process, you map operational data from your source into subject-oriented information in your target data mart schema. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.

The elements that help you to determine the data mart schema are the model of your source data and your user requirements. Sometimes, you can get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data mart from this. The physical implementation of the logical data mart model may require some changes due to your system parameters—size of computer, number of users, storage capacity, type of network, and software. You will need to make decisions as you develop the logical design:

  • Facts and dimensions

  • Granularity of the facts

  • Relationship between the entities

  • Historical duration of the data

B.3.2 Creating a Wish List of Data

You generate the wish list of your data elements from the business user requirements. This tutorial assumes that the scope of the data mart is fully specified by the users. Often, you must look beyond the specific requests of the users and anticipate future needs.

Start with the business parameters that matter to your subject area. For a Sales and Marketing data mart, parameters might be Customer, Geography, Product, Sales, and Promotions. Remember Time—do you want to look at monthly, daily, or weekly figures?

Then, create a list of desired data elements, either from the requirements provided by the users, or by brainstorming with them. At the end of this exercise, you should have the following:

  • A list of data elements, both raw and calculated

  • Attributes of the data, such as character or numeric data types

  • Reasonable groupings of the data, such as geographical regions for the elements country, county, city

  • An idea of the relationship between the data, such as "a city is within a county"
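These outputs can be captured in a lightweight data dictionary before any schema work begins. The sketch below is a minimal Python illustration; the element names, types, and groupings are hypothetical examples in the spirit of the Sales and Marketing scenario, not part of the tutorial's schema:

```python
# A minimal data dictionary for the wish list (element names are hypothetical).
wish_list = {
    "elements": {
        "dollar_sales": {"type": "numeric", "kind": "raw"},
        "unit_sales":   {"type": "numeric", "kind": "raw"},
        "avg_price":    {"type": "numeric", "kind": "calculated",
                         "formula": "dollar_sales / unit_sales"},
        "product_name": {"type": "character", "kind": "raw"},
        "city":         {"type": "character", "kind": "raw"},
        "county":       {"type": "character", "kind": "raw"},
        "country":      {"type": "character", "kind": "raw"},
    },
    # Reasonable groupings of the data.
    "groupings": {"geography": ["country", "county", "city"]},
    # Relationships between the data ("a city is within a county").
    "relationships": [("city", "within", "county"),
                      ("county", "within", "country")],
}

# List the calculated elements, which the load process must compute.
calculated = [name for name, meta in wish_list["elements"].items()
              if meta["kind"] == "calculated"]
print(calculated)  # ['avg_price']
```

Keeping raw and calculated elements in one structure makes the list easy to review with users and flags the derived fields early.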

Typical data fields of interest in the Sales and Marketing example might be dollar sales, unit sales, product names, packages, promotion characteristics, regions, and countries. Identify the critical fields—those that drove the creation of the data mart. Data such as dollar sales or unit sales are critical for a sales data mart.

Users may provide you with reports to give you an idea of their data requirements. These reports may be existing reports, or the kind of reports they would like to see. Reports are a good vehicle to get the users to articulate their needs.

At this point, you can separate the data into numeric data (the facts) and textual or descriptive data (the dimensions). During the iterative process of interaction with the end user, ask why certain data is important—what decisions are driven by this data? Some insight into the business processes will help you anticipate future data needs.

B.3.3 Identifying Sources

Now, you have a list of dimensions and facts that you want for your data mart. The question is, can you get the data? And if yes, at what price? Data sources can range from operational systems, such as order processing systems, to spreadsheets. You need to map the individual elements from your wish list to the sources. You should start with the largest, most comprehensive source and seek other sources as needed.

Typically, a large percentage of the data comes from one or two sources. The dimensions can usually be mapped to lookup tables in your operational system. In their raw form, the facts can be mapped to the transaction tables. For use in the data mart, the transaction data usually needs to be aggregated, based on the specified level of granularity. Granularity is the lowest level of information that the user might want. You may find that some of the requested data cannot be mapped. This usually happens when groupings in the source system are not consistent with the desired groups within the data mart. For example, in a telecommunications company, calls can be aggregated easily by area code. However, your data mart needs data by postal code. Because an area code contains multiple postal codes and one postal code may span multiple area codes, it is difficult to map these dimensions.

You may find that some data is too costly to acquire. For example, the promotion data that the users requested may not be obtained easily because the information is not consistent across time or promotion. To translate to a common system format would be very costly.

B.3.4 Classifying Data for the Data Mart Schema

At this point, you have started thinking about the classification of your data as facts and dimensions. A common representation of facts, dimensions, and the relationships between them in data mart applications is the star schema. Typically, it contains a dimension of time and is optimized for access and analysis. It is called a star schema because the graphical representation looks like a star with a large fact table in the center and the smaller dimension tables arranged around it.

Advanced design modeling may involve schemas, called snowflake or constellation schemas, which are more complex than the simple star schema shown. The following sections provide more information about dimensions, facts, and level of granularity.
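To make the star shape concrete, the following sketch creates a minimal star schema: three dimension tables arranged around a central fact table. The table and column names are illustrative assumptions, and sqlite3 (from Python's standard library) stands in for the target database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables surround the central fact table. They are created first,
# because the fact table's foreign keys reference them.
cur.executescript("""
CREATE TABLE time_dim (
    time_key    INTEGER PRIMARY KEY,   -- warehouse (synthetic) key
    calendar_dt TEXT,
    month       TEXT,
    year        INTEGER
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_code TEXT,                 -- natural key from the source
    product_name TEXT
);
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,
    city         TEXT,
    country      TEXT
);
-- The fact table: numeric metrics plus one key column per dimension.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    dollar_sales REAL,
    unit_sales   INTEGER,
    PRIMARY KEY (time_key, product_key, customer_key)  -- composite key
);
""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['customer_dim', 'product_dim', 'sales_fact', 'time_dim']
```

Note that the fact table's primary key is composite: it is made up of the keys of the surrounding dimensions.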

This section contains the following topics:

B.3.4.1 Dimensions

In your classification exercise, many of the fields from the OLTP source will end up as dimensions. The big design issue is to decide when a field is just another item in an existing dimension, or when it should have its own dimension. The time dimension is generated independently using the discrete dates in the OLTP source. This offers flexibility in doing any time series analysis. For a true star schema, the creation order of the dimension tables does not matter as long as they are created before the fact table. Generally, a table must be created before other tables can reference it. Therefore, be sure to create all dimension tables first.

B.3.4.2 Facts

Facts are the numeric metrics of the business. They support mathematical calculations used to report on and analyze the business. Some numeric data are dimensions in disguise, even if they seem to be facts. If you are not interested in a summarization of a particular item, the item may actually be a dimension. Database size and overall performance will improve if you categorize borderline fields as dimensions.

For example, assume that you have a membership database for a health club and want to find out how much of the club brand vitamins the members buy. In your wish list, you have several queries like "Give me the usage by age by..." and "Give me the average age of members by..." Is age a fact or a dimension? Make it a dimension.
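The grouping behavior is what distinguishes the two roles: you sum a fact, but you group by a dimension. A minimal sketch, using a hypothetical vitamin-sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical purchases of club-brand vitamins by members of various ages.
cur.execute("CREATE TABLE vitamin_sales (member_age INTEGER, units INTEGER)")
cur.executemany("INSERT INTO vitamin_sales VALUES (?, ?)",
                [(25, 3), (25, 1), (40, 2), (40, 5), (55, 4)])

# Age is a dimension, so it appears in GROUP BY; units is the fact we sum.
rows = cur.execute(
    "SELECT member_age, SUM(units) FROM vitamin_sales "
    "GROUP BY member_age ORDER BY member_age").fetchall()
print(rows)  # [(25, 4), (40, 7), (55, 4)]
```

Because age appears in the GROUP BY clause rather than inside an aggregate function, it behaves as a dimension here; summing ages would be meaningless.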

B.3.4.3 Granularity

After you define the facts and dimensions, you determine the appropriate granularity for the data in the data mart. At this point, you know why your users have requested a particular level of information within a dimension. You need to estimate the resource requirements to provide the requested level of granularity and, based on the costs, decide whether or not you can support this level of granularity.

B.3.5 Designing the Star Schema

After you have a list of all facts, dimensions, and the desired level of granularity, you are ready to create the star schema. The next step is to define the relationships between the fact and dimension tables using keys.

A primary key is one or more columns that make the row within a table unique. The primary key of the fact table can consist of several columns. Such a key is called a composite or concatenated key.

It is a good idea to use system-generated keys (synthetic keys), in place of natural keys, to link the facts and the dimensions. This provides the data mart administrator with control of the keys within the data mart environment, even if the keys change in the operational system.

A synthetic key is a generated sequence of integers. You include the synthetic keys in the dimension table, in addition to the natural key. Then, you use the synthetic key in the fact table as the column that joins the fact table to the dimension table.
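The key-generation step can be sketched as follows. The product rows and column names are hypothetical; the point is that the dimension row carries both the warehouse key and the natural key, while the fact row carries only the warehouse key:

```python
import itertools

# Hypothetical source rows keyed by the natural product code.
source_products = [
    {"product_code": "VIT-C-500", "product_name": "Vitamin C 500mg"},
    {"product_code": "VIT-D-1K",  "product_name": "Vitamin D 1000IU"},
]

# A synthetic (warehouse) key is just a generated sequence of integers.
next_key = itertools.count(1)
key_map = {}          # natural key -> warehouse key
product_dim = []
for row in source_products:
    wk = next(next_key)
    key_map[row["product_code"]] = wk
    # The dimension row carries BOTH the warehouse key and the natural key.
    product_dim.append({"product_key": wk, **row})

# Fact rows join on the warehouse key, not the natural key, so a change in
# product-code naming conventions would not ripple into the fact table.
fact_row = {"product_key": key_map["VIT-C-500"], "unit_sales": 3}
print(fact_row)  # {'product_key': 1, 'unit_sales': 3}
```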

Although creating synthetic keys requires additional planning and work, the keys can provide benefits over natural keys:

  • Natural keys are often long character strings, such as in a product code. Because synthetic keys are integers, response time to queries is improved.

  • The data mart administrator has control over the synthetic key. If a manufacturing group changes the product code naming conventions, the changes do not affect the structure of the data mart. Consider using synthetic keys for most dimension tables. (In the rest of this tutorial, we refer to synthetic keys as warehouse keys.)

The process of translating the data from the OLTP database and loading the target star schema requires mapping between the schemas. The mapping may require aggregations or other transforms.
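A minimal sketch of such a mapping, assuming hypothetical transaction rows and a monthly level of granularity:

```python
from collections import defaultdict

# Hypothetical OLTP transaction rows (one row per individual sale).
transactions = [
    {"date": "2007-01-05", "product_code": "VIT-C-500", "amount": 10.0},
    {"date": "2007-01-20", "product_code": "VIT-C-500", "amount": 15.0},
    {"date": "2007-02-02", "product_code": "VIT-C-500", "amount": 8.0},
]

# Aggregate to the target granularity: product by month.
totals = defaultdict(float)
for t in transactions:
    month = t["date"][:7]                     # "YYYY-MM"
    totals[(t["product_code"], month)] += t["amount"]

print(sorted(totals.items()))
# [(('VIT-C-500', '2007-01'), 25.0), (('VIT-C-500', '2007-02'), 8.0)]
```

In a real load, this aggregation would typically be expressed in the mapping tool or in SQL, but the shape of the transform is the same: many transaction rows collapse into one fact row per combination of dimension values.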

B.3.6 Moving from Logical to Physical Design

During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database, including tables and constraints. This description optimizes the placement of the physical database structures to attain the best performance. Because data mart users execute certain types of queries, you want to optimize the data mart database to perform well for those types of queries. Physical design decisions, such as the type of index or partitioning, have a huge impact on query performance.

As the data mart becomes successful and more widely used, more and more users will access it. Over time, the volume of data will also grow. Scalability, the ability to increase the volume of data and number of users, is an important consideration when you move from your logical design to a physical representation. To accommodate the need for scalability, you should minimize the limitations of factors such as hardware capacity, software, and network bandwidths.

This section contains the following topics:

B.3.6.1 Estimating the Size of the Data Mart

In estimating the size of your data mart, you need to develop a method that will accommodate its future growth. There are several methods for estimating the size of the database. Here is one approach:

  1. Use a representative sample of the source data to determine the number of rows in the fact table.

  2. Estimate the size of one row in the fact table.

  3. Estimate the size of the fact table by multiplying the number of rows by the size of one row.

  4. Estimate the size of the data mart. Generally, the total size of the data mart is three to five times the size of the fact table.

This process is usually iterative. Each time the design changes, you should estimate the size again. Even if you think that your star schema is small, you should do this calculation once.
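The four estimation steps can be worked through with illustrative numbers (all figures below are assumptions, not measurements):

```python
# Steps 1-2: assumed figures derived from a representative sample.
fact_rows = 2_000_000        # rows extrapolated from the sample
row_bytes = 120              # estimated size of one fact-table row

# Step 3: size of the fact table.
fact_table_bytes = fact_rows * row_bytes

# Step 4: total mart size, using the three-to-five-times rule of thumb.
low_estimate = 3 * fact_table_bytes
high_estimate = 5 * fact_table_bytes

gib = 1024 ** 3
print(f"fact table: {fact_table_bytes / gib:.2f} GiB")
print(f"data mart: {low_estimate / gib:.2f} - {high_estimate / gib:.2f} GiB")
```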

After you calculate the size, you can validate your assumptions by doing the following:

  1. Extract sample files.

  2. Load data into the database.

  3. Compute exact expected row lengths.

  4. Add overhead for indexing, rollback, and temporary tablespaces, and a file system staging area for flat files.

To plan for future growth, you can use the ratio of the estimated size to the largest possible size of the fact table to calculate the future size of the data mart:

  1. For each dimension, check the granularity that you want and estimate the number of entries in the finest level.

  2. Multiply the number of entries of all dimensions to get the maximum possible rows.

  3. Calculate the ratio of actual rows from representative data to possible rows.

  4. Estimate the growth for each dimension table over a period of time.

  5. Multiply together the projected numbers of rows of all the dimension tables.

  6. Adjust the number, using the ratio calculated in Step 3.

  7. Multiply the result by the fact table row size.
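The growth projection above can be worked through with illustrative numbers (all dimension counts and the row size are assumptions):

```python
# Steps 1-2: assumed entries at the finest level of each dimension,
# multiplied together to give the maximum possible number of fact rows.
dims = {"time": 365, "product": 500, "customer": 10_000}
max_rows = 1
for n in dims.values():
    max_rows *= n

# Step 3: sparsity ratio of actual rows (from representative data)
# to possible rows.
actual_rows = 2_000_000
ratio = actual_rows / max_rows

# Steps 4-5: assumed dimension sizes after a year of growth,
# multiplied together.
future_dims = {"time": 730, "product": 600, "customer": 12_000}
future_max = 1
for n in future_dims.values():
    future_max *= n

# Step 6: adjust by the sparsity ratio; Step 7: multiply by row size.
row_bytes = 120
future_fact_rows = future_max * ratio
future_fact_bytes = future_fact_rows * row_bytes
print(f"projected fact rows: {future_fact_rows:,.0f}")
print(f"projected fact table: {future_fact_bytes / 1024**3:.2f} GiB")
```

The key assumption is that the fact table's sparsity (Step 3) stays roughly constant as the dimensions grow; if users start populating combinations that were previously empty, the projection will understate the growth.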

You may need to schedule a regular batch job to refresh your data mart from your sources. Depending on the data volumes and system load, this job may take several hours. Plan your data mart refresh so that under normal circumstances it can be accomplished within the time allowed for batch processing, usually at night. In your planning process, you should also estimate the data volume that will be refreshed. Develop a strategy for purging the data beyond the specified retention period.

B.3.6.2 What Is Metadata?

Metadata is information about the data. For a data mart, metadata includes:

  • A description of the data in business terms

  • Format and definition of the data in system terms

  • Data sources and frequency of refreshing data

The primary objective for the metadata management process is to provide a directory of technical and business views of the data mart metadata. Metadata can be categorized as technical metadata and business metadata.

Technical metadata consists of metadata created during the creation of the data mart, as well as metadata to support the management of the data mart. This includes data acquisition rules, the transformation of source data into the format required by the target data mart, and schedules for backing up and refreshing data.

Business metadata allows end users to understand what information is available in the data mart and how it can be accessed.

You use the technical metadata to determine data extraction rules and refresh schedules for the Oracle Warehouse Builder component. Similarly, you use the business metadata to define the business layer used by the Oracle BI Answers tool.