Mining an Essbase Database

In This Section:

Understanding Data Mining

Essbase Data Mining Framework

Accessing Data Mining Functionality

Algorithms

Transformations

Importing and Exporting Models

Understanding Data Mining

Data mining tools can sift through data to uncover hidden relationships and patterns. For example, you may find that people who bought root beer in July also bought ice cream in July. You can then use this knowledge to create an advertising campaign around root beer floats.

You can use data mining to tell you things about existing data, as in the root beer example. Such data mining is called descriptive. You can also use data mining to forecast future results based on past performance. For example, you can forecast sales for next year based on sales for this year. Such data mining is called predictive.

The data mining process uses an algorithm to analyze a set of data and then projects the relationships the algorithm finds against another set of data. An algorithm is a method (set of instructions) that is used to analyze the data and apply the collected knowledge to other data.

For example, suppose you want to determine how sales of televisions, DVDs, and VCRs in the East region correspond to sales of cameras in the same region. You can use the regression algorithm to model sales of cameras as a function of sales of TVs, VCRs, and DVDs. The algorithm performs a series of mathematical steps to create a concise representation (model) of the knowledge obtained from analysis of data. Then it uses this model to predict the value of a dependent (target) variable based on the value of one or more independent (predictor) variables.
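As a minimal illustration of what building such a model means, the following Java sketch fits a single predictor to a target by ordinary least squares and then predicts from the fitted coefficients. It is a toy under stated assumptions: one predictor instead of the multilinear case, hypothetical sales figures, and no connection to cube data.

public final class SimpleRegression {
    // Fits y = a + b*x by ordinary least squares; a toy stand-in for the
    // multilinear model that the regression algorithm builds from cube data.
    static double[] train(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope
        double a = (sy - b * sx) / n;                         // intercept
        return new double[] { a, b }; // the "model": two coefficients
    }

    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        // Hypothetical data: TV sales (predictor) and camera sales (target).
        double[] tvSales     = { 110, 150, 90, 130 };
        double[] cameraSales = {  60,  80, 50,  70 };
        double[] model = train(tvSales, cameraSales);
        System.out.println("Predicted camera sales: " + predict(model, 120));
    }
}

The two fitted coefficients are the model in miniature: training produces them, and prediction is simply their application to new predictor values.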

In some cases, to fit the model to the given data better, you may need to modify the data. Transformations mathematically adjust data values before an algorithm analyzes the data. For example, a transformation could exponentiate values before the algorithm includes them in its analysis.

Essbase Data Mining Framework

Data Mining Framework is a collection of features that enables the execution of data mining on Essbase databases.

Data Mining is not available in the default Administration Services deployment. To make Data Mining available, use the Administration Services Console. See the Oracle Essbase Administration Services Online Help.

Note:

Data Mining Framework does not currently operate on relational data, that is, Hybrid Analysis data. Data mining is also not currently supported for Unicode-mode applications. The names of data mining algorithms, models, transformations, and templates must contain only ASCII characters.

To understand the processes for mining an Essbase database you need to become familiar with the following terms:

  • Algorithm: a method (set of instructions) that is used to analyze the data.

  • Model: a collection of an algorithm's findings about examined data. A model can be used (applied) against a different set of data to generate useful information about that data.

  • Task: a step in the data mining process.

  • Task template: a task specification that can be modified and used again for a variation of that task.

Essbase Data Mining Tasks

The Essbase data mining procedure includes the following tasks:

  • Specify a build task.

    You implement a build task through a task template. The Mining Wizard, available with Administration Services, provides a build task template and steps you through the process of specifying a build task. You can also use MaxL statements and sample template files provided with Data Mining Framework.

    In the build task, you specify:

    • The algorithm to use.

    • The database to mine and a representative set of data.

    • Parameters that determine how the algorithm is applied.

  • Execute the build task to build, or train, a model.

    When you execute a build task, the specified algorithm is applied against the specified set of data to generate a data mining model. During the training process, the algorithm discovers and describes internally the patterns and relationships in the data that can be used for prediction. Later, the trained model can use these patterns and relationships to generate new information from a different, but similarly structured, set of data.

    See Training the Model.

  • Specify a test task to test the model. This is an optional task to verify the accuracy of the model you have built.

    You implement a test task through a task template using the Mining Wizard or MaxL statements.

    In the test task, you specify:

    • The model to use. This is a model that you have built.

    • A database and set of data to mine. Specify a set of data with known results or correlations.

  • Execute the test task to generate test results.

    After training the model on one set of data, you execute the model against a different set of data with known results or correlations. You can compare the results generated by the model with the known results to determine the accuracy of the model.

    See Testing the Model.

  • Specify an apply task.

    You implement an apply task through a task template using the Mining Wizard or MaxL statements.

    In the apply task, you specify:

    • The model to use. This is a model that you have built.

    • A database and set of data to mine

    • Where in the cube to store the results

  • Execute the apply task to generate result records.

    When you execute an apply task, the model is executed against the specified set of data to generate result records. Data Mining Framework writes results back to the Essbase cube.

    See Applying the Model.

  • Query the models or mining results.

    Query the models or results to view their contents.

  • Specify a scoring task.

You implement a scoring task through a task template using the Mining Wizard or MaxL statements.

    In the scoring task, you specify:

    • The model to use. This is a model that you have built.

    • The data values to be mined. These values may come from the Essbase cube (cube scoring) or may be entered manually (data scoring).

  • Execute the scoring task to generate results.

    When you execute a scoring task, the model is executed against the data you specify and the results are displayed by the Administration Services Console.

The following sections describe the data mining process in more detail.

Creating Data Mining Models

The first step in building a data mining model is to understand the business situation or problem and choose the appropriate algorithm to use for analysis or prediction. Algorithms determine how you process data.

Data Mining Framework provides a set of powerful algorithms that can be used for a broad range of business applications. See Algorithms for a description of the available algorithms. The description of each algorithm provides information about the type of problem that the algorithm is best suited for.

Note:

Data Mining Framework also enables you to easily register and use new algorithms created by Oracle or third-party vendors.

Preparing for Data Mining

The one essential prerequisite for performing data mining is that you understand your data and the problem you are trying to solve. Data mining is a powerful tool and can yield new insights. If you already have a strong hunch about your data, data mining can be particularly useful in confirming or denying your hunch, and giving you some additional insights and directions to follow.

Before you mine an Essbase database, make sure that the database is loaded and calculated.

Using an Algorithm

All data mining algorithms require training, or learning. Training is the process of executing an algorithm against a representative set of data to discover and describe the patterns and relationships in the data that can be used for prediction. After a model has been trained against a representative set of data, it can then be applied to a wider set of data to derive useful information.

Learning can be either supervised or unsupervised. Supervised learning discovers patterns and relationships in the data by comparing the resulting data to a set of known data. Unsupervised learning discovers patterns and relationships in the data without comparing the data to known answers.

The algorithm vendor determines the specific information that you must provide to use the algorithm. The regression algorithm, for example, employs supervised learning, so the model requires both input data and output data.
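This distinction shows up directly in the shape of the data an algorithm consumes. The following Java sketch is conceptual only; the actual interfaces that algorithms implement are defined in the Data Mining Framework SDK and are not reproduced here.

// Conceptual sketch: supervised training consumes known outputs in addition
// to inputs; unsupervised training consumes inputs alone.
interface Model {
    double[] apply(double[] input);
}

interface SupervisedAlgorithm {
    Model train(double[][] inputs, double[][] knownOutputs);
}

interface UnsupervisedAlgorithm {
    Model train(double[][] inputs);
}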

To use an algorithm, you enter its settings, parameters, and MaxL DML expressions describing its input and output.

Settings are determined by Data Mining Framework and are therefore the same for all algorithms, but they influence how an algorithm operates. For example, the framework provides a setting that defines how the algorithm treats missing values.

Parameters are specific to each algorithm and determine how the algorithm operates on the data. For example, the clustering algorithm provides a parameter to specify the maximum number of clusters to return.

MaxL DML expressions specify the input and output data for the algorithm. Administration Services Console offers simplified and advanced interfaces for entering these expressions. In the simplified interface, expressions are specified per domain. In the advanced interface, expressions are specified per domain within each accessor. Accessors and domains are determined by the algorithm. In a build task, supervised algorithms such as the Regression algorithm have accessors to define the independent or input data (for example, predictor accessors) and accessors to define the dependent or expected output data (for example, target accessors). Unsupervised algorithms, such as clustering, have a predictor accessor only.

Test and apply tasks generally have input and output accessors.

Accessors consist of domains, which are typically MaxL DML expressions that define components of the accessor. For example, the predictor accessor for the Regression algorithm contains the following domains:

  • Predictor

    Defines the member or member set combination that determines the predictor domain.

  • Sequence

    Defines the cases to be traversed for the predictor variable. Sequence is often a subset of the time dimension.

  • External

    Defines the scope of the predictor (optional).

  • Anchor

    Defines restrictions from additional dimensions (optional).

The target accessor has the same set of domains as the predictor accessor, except that it has a Target domain instead of a Predictor domain. You write MaxL DML expressions to define the predictor and target accessors. In the simplified interface, you write one expression per domain, regardless of the accessor it belongs to.

For example, consider this sample data mining problem:

Given the number of TVs, DVDs, and VCRs sold during a particular period, in the East region, how many cameras were sold in the same period in the East? Restrict sales data to prior year actual sales.

Using the regression algorithm, the predictor and target accessors define the model for this problem. When the values are the same, you can define them in simplified mode as follows:

Domain      Value
Predictor   {[Television], [DVD], [VCR]}
Target      {[Camera]}
Sequence    {[Jan 1].Level.Members}
External    {[East].Children}
Anchor      {([2001], [Actual], [Sales])}

If you need to specify different expressions for similar domains, you should define them in advanced mode; for example:

Domain               Value
Predictor.Predictor  {[Television], [DVD], [VCR]}
Predictor.Sequence   {[Jan 1].Level.Members}
Predictor.External   {[East].Children}
Predictor.Anchor     {([2001], [Actual], [Sales])}
Target.Target        {[Camera]}
Target.Sequence      {[Jan 1].Level.Members}
Target.External      {[West].Children}
Target.Anchor        {([2002], [Actual], [Sales])}

In both interfaces, a predictor component (for example, predictor.sequence) and the corresponding target component (target.sequence) must be the same size.

The external domain is used when you need to build a family of models. For each city in the East ({[East].Children}), the algorithm models camera sales as a function of TV, DVD, and VCR sales. Data Mining Framework creates, under the same name, a family of results or models: a separate model or result for each city in the East. If a single model is desired, the external domain can be omitted (left blank).

When you define accessors using the Data Mining Wizard in Advanced mode, you can have mathematical transformations applied to them. For additional information, see Transformations.

  To perform a data mining build task, see “Creating or Modifying Build Tasks” in the Oracle Essbase Administration Services Online Help.

Training the Model

In the final step of the build task, Data Mining Framework executes the algorithm against the data specified by the accessors to build, or train, the model. During the training process, the algorithm discovers and describes the patterns and relationships in the data that can be used for prediction.

Internally, the algorithm represents the patterns and relationships it has discovered as a set of mathematical coefficients. Later, the trained model can use these patterns and relationships to generate new information from a different, but similarly structured, set of data.

Note:

If you cancel a data mining model while you are training it, the transaction is rolled back.

Testing the Model

After the model is trained, it is ready to use. If you have a known set of existing results, you can test the model against these known results. To test a model, you create a test task. In the test task, you specify a model you have trained and a set of accessors. In addition to accessors specified for input, you specify test accessors that reference the known set of results.

The test task compares the results derived from the trained model to the set of known results you specify and determines whether the results match within a specified range of expected error. (A sketch illustrating this comparison follows the list below.) If the results do not match, you can do any of the following:

  • Verify the homogeneity of the known data. If the structure of the known data does not match the structure of the test data, the results will not match.

  • Verify the integrity of the known data. Corrupt or incomplete data can cause the test model to fail.

  • Verify the integrity of the test input data.

  • Consider changing the stringency of the settings. At the very least, a less stringent setting that returns a positive test gives you an idea of how closely the trained model matches the known data.
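As an illustration of the matching idea, the following Java sketch compares predicted results with known results using a root-mean-square error tolerance. The tolerance rule is an assumption made for illustration; the framework's actual test criteria are controlled through the test task's settings.

public final class TestComparison {
    // Illustrative only: declare a match when the root-mean-square error
    // between predicted and known results is within a tolerance.
    static boolean withinTolerance(double[] predicted, double[] known, double tolerance) {
        double sumSq = 0;
        for (int i = 0; i < known.length; i++) {
            double diff = predicted[i] - known[i];
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq / known.length) <= tolerance;
    }

    public static void main(String[] args) {
        double[] predicted = { 61.0, 79.5, 50.2 };
        double[] known     = { 60.0, 80.0, 50.0 };
        System.out.println(withinTolerance(predicted, known, 1.0)); // prints true
    }
}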

  To perform a data mining test task, see “Creating or Modifying Test Tasks” in the Oracle Essbase Administration Services Online Help.

Applying the Model

After the model is trained, you can use it on a new set of data. In the apply task, you specify a model that you have trained and a set of accessors. Applying a model runs it against a known set of stored data that you specify in the apply task. Generally, the predictor domain expressions are the same for a build task and its related apply task; you change the sequence, external, or anchor domain to apply the model to a different set of data. For example, you could change the external domain to specify a different region or country, or use the anchor domain to specify a different year.

In the apply task, the target result data is not known, but is to be predicted by the model.

The apply task applies the coefficients generated during training to the new set of data and generates a set of output data. Data Mining Framework writes the result data back to the Essbase cube, into the locations specified by the target accessor expressions. The apply task generates a named result that you can use to query the result data. For information about viewing the result information, see Viewing Data Mining Results.

  To perform a data mining apply task, see “Creating or Modifying Apply Tasks” in the Oracle Essbase Administration Services Online Help.

Viewing Data Mining Results

When you execute an apply task, Data Mining Framework writes mining results back to the Essbase cube. Data Mining Framework creates a result record, in XML format, that contains accessors that specify the location of the result data in the cube.

You can view data mining results through the Data Mining node in Administration Services, or by using MaxL statements and an XML-based tool that can access the results.

  To view data mining results, see “Managing and Viewing Data Mining Results” in the Oracle Essbase Administration Services Online Help.

Scoring the Model

After the model is trained, instead of using an apply task to write the results back to the database, you can perform a scoring task to view the results immediately. The input data can come from the cube (Score on Cube), or you can enter data when you execute the task (Score on Data).

For cube scoring, you must provide all the expressions required in apply mode, with the following exception: the external (if present) and sequence expressions may point only to a single position.

For data scoring, you must enter the external expression (if present) for a single position, and you must enter the predictor data as numbers. No sequence expressions are used.

Note:

An external expression may appear in an apply or scoring task only if it was used during the build; otherwise, it is omitted.

  To perform a data mining scoring task, see “Creating or Modifying Scoring Tasks” in the Oracle Essbase Administration Services Online Help.

Accessing Data Mining Functionality

Data Mining Framework is supported by the MaxL and Administration Services interfaces.

MaxL provides a set of statements explicitly for data mining. With MaxL you can perform all data mining functions, including creating, training, testing, scoring, and applying a data mining model.

Sample model templates for each of the algorithms are available through the Administration Services interface. A model template provides an outline of the accessors needed for that algorithm that you can fill in. It also enables you to provide parameters required by the algorithm.

Algorithms

Essbase provides six commonly used algorithms. You can use these algorithms, or you can create and register your own algorithms. This topic discusses the built-in algorithms and how to create algorithms.

Built-in Algorithms

Essbase supplies the following basic algorithms:

  • Regression. Identifies dependencies between a specific value and other values. For example, multilinear regression can determine how the amount of money spent on advertising and payroll affects sales values.

  • Clustering. Arranges items into groups with similar patterns. You use the clustering algorithm for unsupervised classification. The algorithm examines data and determines how to split the data into groups, or clusters, based on specific properties of the data. The input required to build the model is a collection of vectors with numeric coordinates. The algorithm organizes these vectors into clusters based on their proximity to each other. The basic assumption is that clusters are small relative to the distances between them and, therefore, can be effectively represented by their centers. Hence, the model consists of the coordinates of the center vectors.

    Sequential runs on the same training set may produce slightly different results due to the stochastic nature of the method. You specify the number of clusters to generate, but it is possible the algorithm will find fewer clusters than requested.

    Clusters can provide useful information about market segmentation and can be used with other predictive tools. For example, clustering can determine the kinds of users most likely to respond to an advertising campaign, so that you can target just those users.

  • Neural network. Generalizes and learns from data. For example, neural networks can be used to predict financial results.

    You can use the neural net algorithm for both prediction and classification. This algorithm is much more powerful and flexible than linear regression. For example, you can specify multiple targets as well as multiple predictors.

    On the other hand, the model generated by the neural net algorithm is not as easy to interpret as that from linear regression.

    One use of neural nets is binary classification. A series of inputs (predictors) produces a set of results (targets) normalized to values between zero and one. For example, a set of behaviors results in values between 0 and 1, with 1 being risky and 0 being risk free. Values in between require interpretation; for example, 0.4 is the high end of safe and 0.6 is the low end of risky.

  • Decision tree. Determines simple rules for making decisions. The algorithm results are the answers to a series of yes and no questions. A yes answer leads to one part of the tree and a no answer to another part of the tree. The end result is a yes or no answer. Decision trees are used for classification and prediction. For example, a decision tree can tell you to suggest ice cream to a particular customer because that customer is more likely to buy ice cream with root beer.

    Use the decision tree algorithm to organize a collection of data belonging to several different classes or types. In the build phase, you specify a set of data vectors and provide the class of each. In the apply phase, you provide a set of previously unknown vectors and the algorithm deduces their classes from the model.

    The algorithm constructs a series of simple tests or predicates to create a tree structure. To determine the class of a data vector, the algorithm takes the input data and traverses the tree from the root to the leaves performing a test at each branch.

  • Association Rules. Discovers rules in a series of events. The typical application for this algorithm is market basket analysis: determining which other items people who buy particular items also buy. For example, the result of a market basket analysis might be that men who buy beer also buy diapers.

    You define support and confidence parameters for the algorithm. The algorithm selects sufficiently frequent subsets from a predefined set of items. On input, it reads a sequence of item sets and looks for an item set (or a subset of it) whose frequency in the whole sequence is greater than the support level. Such item sets are broken into antecedent-consequent pairs, called rules. Rule confidence is the ratio of the item set frequency to the antecedent frequency in all the item sets. Rules with confidence greater than the given confidence level are added to the list of "confident" rules.

    The algorithm uses logical shortcuts during computation to avoid considering every combination of item sets, whose number can be practically infinite. Even so, the speed with which the algorithm executes depends on the number of attributes to consider and the frequency with which they occur.

  • Naive Bayes. Predicts class membership probabilities. Naive Bayes is a lightweight classification algorithm. It is fast, uses little memory, and performs satisfactorily in many applications, so consider trying it before turning to decision trees or full-fledged clustering schemes.

    The algorithm treats all the attributes of the case vector as if they were independent of each other. It uses a training sequence of vectors and the definition of conditional probability to calculate the probabilities, or likelihoods, that an attribute with a certain value belongs to a case of a certain class. The model stores these probabilities. In apply mode, the case attributes are used to calculate the likelihood of the case for each class, and the class with the maximum likelihood is assigned to the case. A sketch of this procedure follows this list.
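The following Java sketch illustrates the counting-and-likelihood procedure just described for Naive Bayes. It is a simplification under stated assumptions: categorical attribute values, add-one smoothing, and no tie to cube data; it is not the framework's implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class NaiveBayesSketch {
    private final Map<String, Integer> classCounts = new HashMap<>();
    // attrCounts.get(cls).get(j).get(v) = training cases of class cls
    // whose attribute j has value v
    private final Map<String, List<Map<String, Integer>>> attrCounts = new HashMap<>();
    private final int numAttrs;
    private int total = 0;

    NaiveBayesSketch(int numAttrs) { this.numAttrs = numAttrs; }

    // Build phase: count attribute values per class.
    void train(String[] attrs, String cls) {
        classCounts.merge(cls, 1, Integer::sum);
        List<Map<String, Integer>> perAttr = attrCounts.computeIfAbsent(cls, k -> {
            List<Map<String, Integer>> l = new ArrayList<>();
            for (int j = 0; j < numAttrs; j++) l.add(new HashMap<>());
            return l;
        });
        for (int j = 0; j < numAttrs; j++) perAttr.get(j).merge(attrs[j], 1, Integer::sum);
        total++;
    }

    // Apply phase: assign the class with the maximal likelihood, treating
    // attributes as independent; log-probabilities avoid numeric underflow.
    String classify(String[] attrs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            double score = Math.log(e.getValue() / (double) total); // class prior
            for (int j = 0; j < numAttrs; j++) {
                int count = attrCounts.get(e.getKey()).get(j).getOrDefault(attrs[j], 0);
                // add-one smoothing (denominator simplified for illustration)
                score += Math.log((count + 1.0) / (e.getValue() + 2.0));
            }
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}

Calling train once per known case corresponds to the build phase; calling classify on a new case corresponds to the apply phase.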

For details about the parameters, accessors, and other information used with these algorithms, see the Oracle Essbase Administration Services Online Help, which includes a separate topic for each algorithm provided by Essbase.

Creating Algorithms

You can create your own algorithms using Java and register them with Data Mining Framework. In order to be recognized by Data Mining Framework, an algorithm must implement certain interfaces and have a specific signature.

After a new algorithm is registered, it appears in the list of supplied algorithms and you can use the new algorithm to create build and apply tasks. Data Mining Framework reads the instructions for each parameter in the algorithm from the algorithm signature. The instructions appear in the Build, Score, and Apply Wizard panels where the user sets the algorithm parameters, just like the instructions for the supplied algorithms.

Transformations

Transformations are mathematical operations that you can have applied to accessors when you use the Data Mining Wizard in advanced mode to define a build task model.

Built-In Transformations

Essbase provides the following built-in transformation operations:

  • Exponential transformation with a shift: x -> exp(x + s)

  • Logarithmic transformation with a shift: x -> log(x + s)

  • Power transformation: x -> sign(x) * |x|^p

  • Scale transformation: x -> s * x

  • Shift transformation: x -> x + s

  • Linear transformation: x -> k * x + b
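Each transformation is a simple pointwise function of a value x. Expressed in Java, they look like the following sketch, where s, p, k, and b stand for the shift, power, scale, and intercept parameters you supply; taking |x| in the power transformation keeps the result defined for negative values.

final class Transformations {
    // s: shift; p: power; k: scale factor; b: intercept
    static double exponential(double x, double s) { return Math.exp(x + s); }
    static double logarithmic(double x, double s) { return Math.log(x + s); }
    static double power(double x, double p) {
        return Math.signum(x) * Math.pow(Math.abs(x), p); // sign(x) * |x|^p
    }
    static double scale(double x, double s) { return s * x; }
    static double shift(double x, double s) { return x + s; }
    static double linear(double x, double k, double b) { return k * x + b; }
}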

Creating Transformations

You can create your own transformations using Java and register them with Data Mining Framework. In order to be recognized by Data Mining Framework, a transformation must implement certain interfaces and have a specific signature. These requirements are described in detail in the Algorithm and Transformation Vendor’s Guide included in the Data Mining Framework SDK. After a new transformation is registered, it appears in the list of supplied transformations that are displayed in the Data Mining Wizard in Advanced mode.

  To create and use transformations, see the following topics in the Oracle Essbase Administration Services Online Help:

  • “Applying Transformations”

  • “Managing Data Mining Transformations”

Importing and Exporting Models

Using Predictive Model Markup Language (PMML) format, you can import and export trained data mining models. Importing models enables use of models created outside the Data Mining Framework. You can also use the import-export feature to move models across multiple Essbase Servers.

  To import or export data mining models, see “Importing and Exporting Data Mining Models” in the Oracle Essbase Administration Services Online Help.

Note:

Extensions added to PMML by other vendors may not be supported. When using an imported model, you must identify an algorithm that most closely matches the algorithm of the imported model.