13 Data Mining Algorithms

The following algorithms are supported by Oracle Data Miner:

Settings Information contains the setting information common to most model viewers.

Oracle Data Mining Concepts provides overview information about algorithms, data preparation, and scoring. See the manual for the database version that you connect to, as described in Oracle Data Miner Documentation.

13.1 Anomaly Detection

Anomaly Detection (AD) identifies cases that are unusual within data that is apparently homogeneous. Anomaly detection is an important tool for fraud detection, network intrusion, and other rare events that may have great significance but are hard to find.

Oracle Data Mining uses Support Vector Machine (SVM) as the one-class classifier for Anomaly Detection (AD). When SVM is used for anomaly detection, it has the classification mining function but no target.

There are two ways to search for anomalies:

See Also:

"Applying Anomaly Detection Models" for information about how to use AD models to make predictions

13.1.1 Applying Anomaly Detection Models

One-class SVM models, when applied, produce a prediction and a probability for each case in the scoring data.

  • If the prediction is 1, then the case is considered Typical.

  • If the prediction is 0, then the case is considered Anomalous.

This behavior reflects the fact that the model is trained with normal data.
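
The mapping from prediction values to labels can be sketched in a few lines of Python. This is only an illustration: the helper name `label_prediction` and the sample scored rows are hypothetical and are not part of Oracle Data Miner.

```python
# Hypothetical helper: translate a one-class SVM prediction into a label.
def label_prediction(prediction: int) -> str:
    """Return "Typical" for 1 and "Anomalous" for 0."""
    if prediction == 1:
        return "Typical"
    if prediction == 0:
        return "Anomalous"
    raise ValueError(f"unexpected prediction: {prediction!r}")

# Illustrative scored rows: (case ID, prediction, probability).
scored = [(101, 1, 0.93), (102, 0, 0.71), (103, 1, 0.88)]

# Collect the case IDs that the model considers anomalous.
anomalies = [case_id for case_id, pred, _ in scored
             if label_prediction(pred) == "Anomalous"]
print(anomalies)
```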

13.1.2 Anomaly Detection Viewers and Algorithm Settings

This section discusses the Anomaly Detection Model viewer, the procedure to view the AD Model Viewer and the algorithm settings related to Anomaly Detection. The topics discussed are:

13.1.2.1 Anomaly Detection Model Viewer

You can view an Anomaly Detection model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view selected model.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The Anomaly Detection Model Viewer opens in a new tab. The default name of an Anomaly Detection model has AD in the name.

The information displayed in the model viewer depends on which kernel was used to build the model.

  • If the Gaussian Kernel is used, then there is one tab, Settings.

  • If the Linear kernel is used, then there are three tabs: Coefficients, Compare, and Settings.

An Anomaly Detection model is a special kind of Support Vector Machine Classification model.

13.1.2.1.1 AD Model Viewer for Gaussian Kernel

The model viewer for an AD model with Gaussian Kernel has the following tabs:

  • Summary (AD): Contains Model and Algorithm settings.

  • Input (AD): Contains attributes used to build the model.

13.1.2.1.2 Settings (AD)

The AD Settings Viewer has two tabs:

13.1.2.1.3 Summary (AD)

General settings describe the characteristics of the model, including:

  • Owner

  • Name

  • Type

  • Algorithm

  • Target Attribute

  • Creation Date

  • Duration of Model Build

  • Comments

Algorithm settings control the model build. Algorithm settings are specified when the Build node is defined.

13.1.2.1.4 Input (AD)

A list of the attributes used to build the model. For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute

  • Mining Type:

    • Categorical

    • Numerical

  • Data Prep: YES indicates that data preparation was performed.

When you select an attribute in the Attributes list, the transformation properties viewer displays the transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.1.2.1.5 Anomaly Detection Algorithm Settings

Anomaly Detection models are built using a special version of the SVM classification, one-class SVM. The algorithm has these default settings:

  • Kernel function: The default is System Determined. After the model is built, the kernel function used (Linear or Gaussian) is displayed.

  • Tolerance value: The default is 0.001

  • Specify complexity factors: The default is Do Not

  • Specify the rate of outliers: The default is 0.1

  • Active learning: ON

  • Automatic Data Preparation: ON

13.1.2.1.6 AD Model Viewer for Linear Kernel

The model viewer for an AD model with a Linear Kernel has these tabs:

13.1.2.2 Algorithm Settings for AD

The algorithm for Anomaly Detection is one-class SVM. The kernel setting is one of the following:

  • System Determined (Default)

  • Gaussian

  • Linear

The settings that you can specify for any version of the Support Vector Machine (SVM) algorithm depend on which of the SVM Kernel Functions you select:

Note:

After the model is built, the kernel function used (Linear or Gaussian) is displayed in Kernel Function in Algorithm Settings.

13.1.2.2.1 AD Algorithm Settings for Linear or System Determined Kernel

If you specify a linear kernel or if you let the system determine the kernel, then you can change the following settings:

13.1.2.2.2 AD Algorithm Settings for Gaussian Kernel

If you specify the Gaussian Kernel, then you can change the following settings:

13.1.2.2.3 Rate of Outliers

The rate of outliers is the approximate rate of outliers (negative predictions) produced by a one-class SVM model on the training data. This rate indicates the percent of suspicious records.

The rate is a number greater than 0 and less than or equal to 1. The default value is 0.1.

If you do not want to specify the rate of outliers, then deselect Specify the rate of outliers.
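
As a rough illustration of what the rate means, the following Python sketch checks that a rate falls in the documented range and estimates how many training records would be flagged. The function name `expected_outliers` is hypothetical, and the count is only approximate because the rate itself is approximate.

```python
# Hypothetical helper: validate the rate and estimate the number of
# training records predicted as outliers (negative predictions).
def expected_outliers(rate: float, n_training_rows: int) -> int:
    # The documented range is greater than 0 and less than or equal to 1.
    if not (0 < rate <= 1):
        raise ValueError("rate of outliers must be > 0 and <= 1")
    return round(rate * n_training_rows)

# With the default rate of 0.1 and 2000 training rows, roughly one record
# in ten is expected to be flagged.
flagged = expected_outliers(0.1, 2000)
print(flagged)
```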

13.2 Association

Association is an unsupervised mining function for discovering association rules, that is, predictions of items that are likely to be grouped together. Oracle Data Mining provides one algorithm, Association Rules (AR).

To build an AR model, use an Association Node.

See Also:

"Troubleshooting AR Models" if the model has 0 rules or a very large number of rules

Data for Association Rules (AR) models is usually in transactional form, unlike the data for other kinds of models.

Oracle Data Mining does not support applying (scoring) AR models.

These topics describe AR models:

13.2.1 Calculating Associations

An association mining problem can be broken down into two subproblems:

  1. Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent itemsets.

  2. Calculate Association Rules that express the probability of items occurring together within frequent itemsets.

Apriori calculates the probability of an item being present in a frequent itemset, given that other items are present.

13.2.1.1 Itemsets

An itemset is any combination of two or more items in a transaction.

The maximum number of items in an itemset is user-specified.

  • If the maximum is 2, then all the item pairs are counted.

  • If the maximum is greater than 2, then all the item pairs, all the item triples, and all the item combinations up to the specified maximum are counted.

Association rules are calculated from itemsets. Usually, it is desirable to only generate rules from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.

The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for Association Rules.
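
The counting described above can be sketched in Python. This is a brute-force illustration, not the Apriori implementation used by Oracle Data Mining: it enumerates every combination up to the maximum size instead of pruning candidates. The function name `frequent_itemsets`, its parameters, and the sample baskets are assumptions made for the example, and support is expressed as a fraction rather than a percentage.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_items):
    """Return {itemset: support} for itemsets of 2..max_items items
    that appear in at least min_support (a fraction) of transactions."""
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        for size in range(2, max_items + 1):
            counts.update(combinations(items, size))
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}

# Illustrative market basket data (hypothetical).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# Each pair occurs in 2 of 4 baskets; the triple occurs in only 1 of 4
# and is dropped by the 50% minimum support.
result = frequent_itemsets(baskets, min_support=0.5, max_items=3)
print(result)
```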

13.2.1.2 Association Rules

The Apriori algorithm calculates rules that express probable relationships between items in frequent itemsets. For example, a rule derived from frequent itemsets containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included.

An association rule is of the form IF antecedent THEN consequent. An association rule states that an item or group of items, the antecedent, implies the presence of another item, the consequent, with some probability. Unlike Decision Tree rules, which predict a target, Association Rules simply express correlation.

Association rules have Confidence and Support:

  • Confidence of an association rule indicates how likely the consequent is to appear in a transaction that contains the antecedent. Confidence is the conditional probability that the consequent occurs given the occurrence of the antecedent. In other words, confidence is the ratio of the number of transactions that include all the items in the antecedent and consequent to the number of transactions that include the antecedent.

  • Support of an association rule indicates how frequently the items in the rule occur together. Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.
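
The two measures can be computed directly from a list of transactions. The Python sketch below is illustrative only: the function names and the sample baskets are hypothetical, and the antecedent is assumed to occur in at least one transaction.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Rule support divided by the support of the antecedent."""
    rule_support = support(transactions, set(antecedent) | set(consequent))
    return rule_support / support(transactions, antecedent)

# Illustrative market basket data (hypothetical).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# Rule: IF bread THEN milk.
rule_support = support(baskets, ["bread", "milk"])        # 2 of 4 baskets
rule_confidence = confidence(baskets, ["bread"], ["milk"])  # 2 of the 3 bread baskets
print(rule_support, rule_confidence)
```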

13.2.2 Data for AR Models

Association Rules is normally used with transactional data, but it can also be applied to single-record case data (similar to other algorithms).

Association does not support text.

Native transactional data consists of two columns:

  • Case ID, either categorical or numerical

  • Item ID, either categorical or numerical

Transactional data may also include a third column:

  • Item value, either categorical or numerical

A typical example of transactional data is market basket data. In market basket data, a case represents a basket that might contain many items. Each item is stored in a separate row, and many rows may be needed to represent a case. The Case ID values do not uniquely identify each row. Transactional data is also called Multirecord Case Data.
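
To make the multirecord layout concrete, the following Python sketch groups (Case ID, Item ID) rows into one basket per case. The sample rows and variable names are hypothetical.

```python
from collections import defaultdict

# Transactional rows: one (case ID, item ID) pair per row.
# The case ID repeats, so it does not uniquely identify a row.
rows = [
    (1, "bread"),
    (1, "milk"),
    (2, "milk"),
    (2, "butter"),
    (2, "eggs"),
    (3, "bread"),
]

# Collect the items that belong to each case (basket).
baskets = defaultdict(set)
for case_id, item_id in rows:
    baskets[case_id].add(item_id)

for case_id in sorted(baskets):
    print(case_id, sorted(baskets[case_id]))
```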

When building an Association model, specify the following:

  • Item ID: This is the name of the column that contains the items in a transaction.

  • Item Value: This is the name of a column that contains a value associated with each item in a transaction. The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, Macintosh Apples).

    The default value for Item Value is <Existence>. That is, one or more items identified by Item ID are in the basket.

    If you select a specific value for Item Value, you may have to perform appropriate data preparation. The maximum number of distinct values of Item Value is 10. If the specific value for Item Value is greater than 128, then bin the attribute specified in Item Value using a Transform node.

    For more information, see the discussion of Market Basket data in the Oracle Data Mining User's Guide.

13.2.2.1 Support for Text (AR)

In Oracle Data Miner, Association does not support text.

The Oracle Data Mining API supports text, but using text for Association is not recommended.

13.2.3 Troubleshooting AR Models

AR models may generate many rules with very low Support and Confidence. If you increase Support and Confidence, then you reduce the number of rules generated.

Usually, Confidence should be greater than or equal to Support.

If a model has no rules, then the following message is displayed in the Rules tab of the Model Viewer:

Model contains no rules. Consider rebuilding model with lower confidence and support settings.

See Also:

"Algorithm Settings for AR" to learn how to change these values

13.2.3.1 Algorithm Settings for AR

To change algorithm settings for an Association node:

  1. Right-click the node.

  2. Select Advanced Settings.

  3. Select the model. The following settings are displayed in the Algorithm Settings tab:

    • Maximum rule length. The default is 4

    • Minimum Confidence. The default is 10%

    • Minimum Support. The default is 1%

  4. If no rules were generated, then:

    • First, try decreasing Minimum Support.

    • If that does not work, decrease the minimum Confidence value. It may be necessary to specify a much smaller value for either of these values.

  5. After you have finished, click OK.

  6. Run the node.

13.2.4 AR Model Viewers and Algorithm Settings

This section discusses the Association Rules Model viewer, the procedure to view the AR Model Viewer and the Algorithm Settings related to AR. The topics discussed are:

13.2.4.1 AR Model Viewer

You can view an AR model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view selected model.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

The AR model viewer opens in a new tab. The default name of an Association model has ASSOC in the name. The AR Model viewer has these tabs:

13.2.4.1.1 AR Rules

An Association Rule states that an item or group of items implies the presence of another item. Each rule has a probability. Unlike Decision Tree rules, which predict a target, Association Rules simply express correlation.

If an attribute is a nested column, then the full name is displayed as COLUMN_NAME.SUBNAME. For example, GENDER.MALE. If the attribute is a normal column, then just the column name is displayed.

Oracle Data Mining supports Association Rules that have one or more items in the antecedent (IF part of a rule) and a single item in the consequent (THEN part of the rule). The antecedent may be referred to as the condition, and the consequent as the association.

Rules have Confidence, Support and Lift.

The Rules tab is divided into two sections: Filtering and Sorting in the upper section, and AR Rules Grid in the lower section. Sorting or filtering that is defined using the settings in the upper section applies to all the rules in the model. Sorting or filtering defined using the settings in the lower section applies to the grid display only.

You can perform the following functions in the Rules tab:

  • Sort By: You can specify the order for display of the rules. You can sort rules by:

    • Lift, Confidence, Support, or Length

    • Ascending or Descending order

    For more sort options, click More. You can specify up to four levels of sorting, and you can specify the order for each level.

    See Also:

    "Sorting"

  • Filter: You can filter rules. To see filtering options, click More. You can specify the following in rules:

    • Minimum Lift

    • Minimum Support

    • Minimum Confidence

    • Maximum Items in Rules

    • Minimum Items in Rules

    See Also:

    "Filtering"

  • Fetch Size: Association models often generate many rules. Specify the number of rules to examine by clicking Fetch Size. The default is 1000.

  • Query: You can query the database using the criteria that you specify. For example, if you change the default sorting order, specify filtering, or change the fetch size, then click Query.

13.2.4.1.2 AR Rules Grid

The lower part of the Rules tab displays the retrieved rules in a grid. The following is displayed above the grid:

  • Available Rules: The total number of rules in the model.

  • Rules Retrieved: The number of rules retrieved by the query, that is, the number of rules retrieved subject to filtering.

  • Rule Content: For maximum information, select Name, Subname, and Value; you can select fewer characteristics from the menu. This selection applies to the rules in the grid only. By default, Rule Content is set to the option that best suits the nature of the model.

  • Search: To search the rules, use the search box, identified by find.

    The drop-down list enables you to search rules by All (the default), by Antecedent, or by Consequent. If you search by Antecedent for 117, then all rules that have 117 in the antecedent are displayed.

    To clear a search, click delete.

See Also:

"AR Rules Display" for more information about how the rules are displayed

13.2.4.1.3 AR Rules Display

For each rule, the Rules grid displays the following information:

  • ID: Identifier for the rule, a string of integers.

  • Condition

  • Association

  • Lift: A bar is included in the column. The bar is scaled relative to the largest lift value of any rule in the model.

    See Also:

    "Lift for AR Rules" for more information about Lift

  • Confidence

  • Support

  • Length

  • Antecedent Support

  • Condition Support

You can perform the following tasks:

  • Sort: You can sort the items in the grid by clicking on the title of the column. This sorting applies to the grid only.

  • View details: To see details for a rule, click the rule and examine the Rule Details.

  • Determine validity of a rule: To determine if a rule is valid, you must use support and confidence plus lift, as described in Lift for AR Rules.

For more information, including examples of these statistics, see the discussion of evaluating association rules in Oracle Data Mining Concepts.

13.2.4.1.4 Lift for AR Rules

Both Support for AR Rules and Confidence for AR Rules must be used to determine if a rule is valid. However, there are times when both of these measures may be high, and yet produce a rule that is not useful.

Lift indicates the strength of a rule over the random co-occurrence of the Antecedent and the Consequent, given their individual support. It provides information about the improvement, the increase in probability of the consequent given the antecedent. Lift is defined as follows:

(Rule Support) / (Support(Antecedent) * Support(Consequent))

Lift can also be defined as the Confidence of the rule divided by the Support of the Consequent.
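
The two definitions of Lift give the same number, as the following Python sketch shows. The helper `support` and the sample baskets are hypothetical and serve only as a worked example.

```python
import math

def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

# Illustrative market basket data (hypothetical).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# Rule: IF bread THEN milk.
rule_support = support(baskets, ["bread", "milk"])
antecedent_support = support(baskets, ["bread"])
consequent_support = support(baskets, ["milk"])

# Definition 1: rule support over the product of the individual supports.
lift = rule_support / (antecedent_support * consequent_support)

# Definition 2: confidence of the rule over the support of the consequent.
rule_confidence = rule_support / antecedent_support
lift_from_confidence = rule_confidence / consequent_support

print(lift, math.isclose(lift, lift_from_confidence))
```

In this sample the Lift is below 1 even though Support is 50 percent and Confidence is about 67 percent: bread and milk occur together slightly less often than they would if they were independent.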

13.2.4.1.5 Confidence for AR Rules

The Confidence of a rule indicates how likely the Consequent is to appear in a transaction that contains the Antecedent. Confidence is the conditional probability of the Consequent, given the Antecedent.

AR rules are of the form IF antecedent THEN consequent.

13.2.4.1.6 Support for AR Rules

The Support of a rule indicates how frequently the items in the rule occur together. Support is the ratio of transactions that include all the items in the Antecedent and Consequent to the total number of transactions.

AR rules are of the form IF antecedent THEN consequent.

13.2.4.1.7 Rule Details

The information in the rule grid is displayed in a readable format in the Rule Detail list.

13.2.4.1.8 Sorting

The default sort is:

  1. Sort by Lift in descending order

  2. Sort by Confidence in descending order

  3. Sort by Support in descending order

  4. Sort by rule length in descending order

The sorting specified here applies to all rules in the model.
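
The default four-level ordering can be reproduced with a single sort key. The Python sketch below is illustrative: the rule records and their field names are hypothetical, not the format returned by Oracle Data Miner.

```python
# Hypothetical rule records with the four sortable characteristics.
rules = [
    {"id": 1, "lift": 1.2, "confidence": 0.8, "support": 0.10, "length": 2},
    {"id": 2, "lift": 1.5, "confidence": 0.6, "support": 0.05, "length": 3},
    {"id": 3, "lift": 1.2, "confidence": 0.8, "support": 0.10, "length": 3},
    {"id": 4, "lift": 1.2, "confidence": 0.9, "support": 0.02, "length": 2},
]

# Negate each value so that every level sorts in descending order:
# Lift first, then Confidence, then Support, then rule length.
ordered = sorted(
    rules,
    key=lambda r: (-r["lift"], -r["confidence"], -r["support"], -r["length"]),
)

ordered_ids = [r["id"] for r in ordered]
print(ordered_ids)
```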

13.2.4.1.9 Filtering

To see all the filtering options, click More.

You can specify the following:

  • Filter: Filter rules based on values of rule characteristics. You can specify the following:

    • Minimum lift

    • Minimum support

    • Minimum confidence

    • Maximum items in rule

    • Minimum items in rule

  • Fetch Size: It is the maximum number of rows to fetch. The default is 1000. Smaller values result in faster fetches.

  • Define Item Filters to reduce the number of rules returned.

To define a filter, select Use Filter. After you define the filter, click Query.

13.2.4.1.10 Item Filters

Item filters enable you to see only those rules that contain the items you are interested in. Each item filter marks an item as required for the Association, the Condition, or Both. The rule filter uses OR logic within each side of the rule (Association collection, Condition collection) and AND logic across the two collections. So, for a rule to be returned, it must have at least one Association item AND one Condition item.
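
The OR-within, AND-across logic can be sketched in Python as follows. The function name `rule_matches`, the rule layout, and the sample items are hypothetical. The sketch also assumes that a side with no filter items imposes no constraint, which is not stated in this section.

```python
def rule_matches(rule, condition_items, association_items):
    """OR within each side of the rule, AND across the two sides."""
    # Assumption: an empty filter collection does not restrict that side.
    condition_ok = (not condition_items
                    or bool(set(rule["condition"]) & condition_items))
    association_ok = (not association_items
                      or bool(set(rule["association"]) & association_items))
    return condition_ok and association_ok

# Hypothetical rules: Condition is the IF part, Association the THEN part.
rules = [
    {"id": 1, "condition": ["bread", "butter"], "association": ["milk"]},
    {"id": 2, "condition": ["eggs"], "association": ["milk"]},
    {"id": 3, "condition": ["bread"], "association": ["jam"]},
]

# An item marked Both would be added to both collections.
condition_items = {"bread", "eggs"}
association_items = {"milk"}

matching_ids = [r["id"] for r in rules
                if rule_matches(r, condition_items, association_items)]
print(matching_ids)
```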

You can manage Item Filters using these controls:

  • To open the Add Item Filter dialog box, click add.

  • To delete selected item filters, click delete.

  • To change the Filter column of selected rows to Both, click both. Both implies Association and Condition.

  • To change the Filter column of selected rows to Condition, click condition.

  • To change the Filter column of selected rows to Association, click association.

13.2.4.1.11 Add Item Filter

To open the Add Item Filter dialog box, click add.

The exact information that is displayed depends on the model. For example, if data has different values for the model that you are viewing, then there is a Values column.

Click More to see all possibilities:

  • Specify sorting for item filters: the default is to sort by Attribute Descending and then by Support Ascending.

  • Specify a name for the filter.

  • Change the Fetch Size from the default of 100,000.

  • If you made any changes, then click Query to retrieve the attribute or value pairs.

  • Filter the retrieved items by name or value.

  • Select one or more items in the grid.

  • Select how to use items when filtering rules.

Click OK when you have finished defining the filter.

13.2.4.1.12 Itemsets

Rules are calculated from itemsets. The Itemsets tab displays information about the itemsets.

If an attribute is a nested column, then the full name is displayed as COLUMN_NAME.SUBNAME. For example, GENDER.MALE. If the attribute is a normal column, then just the column name is displayed.

Itemsets have Support. Each itemset contains one or more items.

  • Sort Itemsets: Sort By specifies the order of itemsets. You can sort itemsets by:

    • ID

    • Number of Items

    • Support in Ascending Order

    • Support in Descending Order

    By default, itemsets are sorted by Support in Descending order. For more sort options, click More. To change the sort order, make changes and then click Query.

  • Filter Itemsets

  • View Itemset details. Click the itemset to view the details.

The Itemsets tab displays the following information:

  • Available Itemsets: The total number of itemsets in the model.

  • Itemsets Retrieved: The number of itemsets retrieved by the query. That is, the number of itemsets retrieved subject to filtering.

  • Itemset Content: For maximum information, select all three: Name, Subname, and Value. You can select fewer characteristics from the menu.

See Also:

"Itemsets Display" for more information displayed for each itemset

To see details of an itemset, click the itemset and examine Itemset Details.

Other Tabs: The AR Model viewer has these other tabs:

13.2.4.1.13 Itemsets Display

For each itemset, the Itemsets grid displays the following information:

  • ID: Identifier for the itemset, a string of integers

  • Items

  • Support. A bar in the column illustrates the relative size of the support.

  • Number of Items in the itemset

13.2.4.1.14 Itemset Details

To see itemset details, select one or more itemsets in the itemsets grid. The information in the itemsets grid is displayed in a more readable format.

13.2.4.1.15 Settings (AR)

The Settings tab has the following tabs:

Note:

AR models are not scoreable, that is, they cannot be applied to new data. Models that are not scoreable do not have an Attributes tab in the model viewer.

Other Tabs: The AR Model viewer has these other tabs:

13.2.4.1.16 Summary

This tab shows information about the model. The Summary tab shows two kinds of information:

  • General: Lists the following:

    • Type of Model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the Model build (in minutes)

    • Model size in MB

    • Comments

  • Algorithm: Lists the following:

    • Automatic Preparation (on or off)

    • Minimum Confidence

    • Minimum Support

    To change these values, right-click the model node and select Advanced Settings.

See Also:

"Algorithm Settings for AR" if the model has no rules

13.2.4.1.17 Algorithm Settings

Association (AR) supports these settings:

  • Maximum Rule Length: The maximum number of attributes in each rule. This number must be an integer between 2 and 20. Higher values result in slower builds. The default value is 4.

    You can change the number of attributes in a rule, or you can specify no limit for the number of attributes in a rule. Specifying many attributes in each rule increases the number of rules considerably. A good practice is to start with the default and increase this number slowly.

  • Minimum Confidence: Confidence indicates how likely these items are to occur together in the data. Confidence is the conditional probability that the consequent will occur given the occurrence of the antecedent.

    Confidence is a number between 0 and 100 indicating a percentage. High confidence results in a faster build. The default is 10 percent.

  • Minimum Support: A number between 0 and 100 indicating a percentage. Support indicates how often these items occur together in the data.

    Smaller values for support result in slower builds and require more system resources. The default is 1 percent.

13.3 Decision Tree

The Decision Tree algorithm is a Classification algorithm that generates rules. Oracle Data Mining supports the Decision Tree (DT) algorithm. This section contains the following topics:

13.3.1 Decision Tree Algorithm

The Decision Tree algorithm is based on conditional probabilities. Unlike Naive Bayes, Decision Trees generate rules. A rule is a conditional statement that can easily be used by humans and easily used within a database to identify a set of records.

The Decision Tree algorithm:

  • Creates accurate and interpretable models with relatively little user intervention. The algorithm can be used for both binary and multiclass classification problems.
    The algorithm is fast, both at build time and apply time. The build process for the Decision Tree Algorithm is parallelized. Scoring can be parallelized irrespective of the algorithm.

  • Predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that together uniquely identify specific target values.

Decision Tree scoring is especially fast. The tree structure, created in the model build, is used for a series of simple tests (typically 2 to 7). Each test is based on a single predictor. It is a membership test: either IN or NOT IN a list of values (categorical predictor); or LESS THAN or EQUAL TO some value (numeric predictor).
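
Scoring as a short series of single-predictor tests can be sketched in Python. The nested dictionary layout, the attribute names, and the prediction values below are hypothetical and do not reflect how Oracle Data Mining stores a tree; missing values are not handled.

```python
def score(node, row):
    """Walk the tree, applying one membership or threshold test per level."""
    while "prediction" not in node:
        value = row[node["attribute"]]
        if "values" in node:
            # Categorical predictor: IN or NOT IN a list of values.
            passed = value in node["values"]
        else:
            # Numeric predictor: LESS THAN or EQUAL TO some value.
            passed = value <= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node["prediction"]

# A tiny illustrative tree with two tests.
tree = {
    "attribute": "marital_status",
    "values": {"married", "widowed"},
    "yes": {
        "attribute": "age",
        "threshold": 35,
        "yes": {"prediction": 0},
        "no": {"prediction": 1},
    },
    "no": {"prediction": 0},
}

prediction = score(tree, {"marital_status": "married", "age": 42})
print(prediction)
```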

During the model build, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. Oracle Data Mining offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
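
The two homogeneity metrics can be illustrated with their textbook definitions. The Python sketch below is not Oracle's implementation; the function names and sample labels are hypothetical. A candidate split is scored by the size-weighted impurity of its two child nodes, and lower values indicate more homogeneous children.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(metric, left, right):
    """Impurity of a two-way split, weighted by child node size."""
    n = len(left) + len(right)
    return len(left) / n * metric(left) + len(right) / n * metric(right)

# A perfectly mixed node is maximally impure for two classes.
mixed = ["yes", "yes", "no", "no"]
print(gini(mixed), entropy(mixed))

# A candidate split of seven records into two child nodes.
left = ["yes", "yes", "yes"]
right = ["no", "no", "no", "yes"]
split_gini = weighted_impurity(gini, left, right)
print(split_gini)
```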

13.3.1.1 Decision Tree Rules

Rules provide model transparency, a window on the inner workings of the model. Rules show the basis for the prediction of the model. Oracle Data Mining supports a high level of model transparency.

Confidence and Support are used to rank the rules generated by the Decision Tree Algorithm:

  • Support: It is the number of records in the training data set that satisfy the rule.

  • Confidence: It is the likelihood of the predicted outcome, given that the rule has been satisfied.
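
For a single rule, both measures can be computed from the training records. In the following Python sketch, the function `rule_stats`, the record layout, and the sample rule are hypothetical.

```python
def rule_stats(records, rule, predicted):
    """Return (support, confidence) for one Decision Tree rule.

    support    -- number of records that satisfy the rule
    confidence -- share of those records whose target equals the prediction
    """
    matching = [r for r in records if rule(r)]
    support = len(matching)
    if support == 0:
        return 0, 0.0
    hits = sum(r["target"] == predicted for r in matching)
    return support, hits / support

# Illustrative training records (hypothetical).
records = [
    {"age": 25, "target": "yes"},
    {"age": 28, "target": "yes"},
    {"age": 30, "target": "no"},
    {"age": 45, "target": "no"},
    {"age": 52, "target": "yes"},
]

# Rule: IF age <= 30 THEN target = yes.
support, confidence = rule_stats(records, lambda r: r["age"] <= 30, "yes")
print(support, confidence)
```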

13.3.2 Build, Test, and Apply Decision Tree Models

The Decision Tree manages its own data preparation internally. It does not require pre-treatment of the data. The Decision Tree is not affected by Automatic Data Preparation.

The Decision Tree interprets missing values as missing at random. The algorithm does not support nested tables and thus does not support sparse data.

Building Decision Tree model

To build a Decision Tree model, use a Classification Node. In Oracle Data Mining 12c Release 1 (12.1) or later, Decision Tree supports nested data. The Decision Tree supports text for Oracle Database 12c, but not for earlier releases.

Testing Decision Tree model

By default, a Classification Node tests all models that it builds. The test data is created by splitting the input data into build and test subsets. You can also test a Decision Tree model using a Test Node.

Tuning Decision Tree model

After you build and test a Decision Tree model, you can tune it.

Applying Decision Tree model

To apply a model, use an Apply Node.

13.3.3 Decision Tree Model Viewer and Algorithm Settings

This section discusses the Decision Tree model viewer, the procedure to view the Decision Tree Model viewer and the algorithm settings related to Decision Tree. Topics include:

13.3.3.1 Decision Tree Model Viewer

You can view a Decision Tree model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

    The Decision Tree viewer opens in a new tab. The default name of a Decision Tree model has DT in the name.

The Tree viewer has two tabs:

  • Tree: This tab is displayed by default. Use the Structure Window to navigate and analyze the tree. It is split horizontally into two panes:

    • The upper pane displays the tree. The root node is at the top of the pane. The following information is displayed for each node of the tree:

      • Node number. 0 is the root node.

      • Prediction, the predicted target value.

      • Support for the prediction.

      • Confidence for the prediction.

      • A histogram shows the distribution of target values in the node.

      • Split, the attribute used to split the node (Leaf nodes do not have splits).

    • The lower pane displays rules. To view the rule associated with a node or a link, select the node or link. The rule is displayed in the lower pane. The following information is displayed in the lower pane:

      • Rule

      • Surrogates

      • Target Values

  • Settings (DT)

Icons and menus at the top of the upper pane control how the tree and its nodes are displayed. You can perform the following tasks:

  • Zoom in or zoom out for the tree. You can also select a size from the drop-down list. You can also fit the tree to the window.

  • Change the layout to horizontal. The default Layout Type for the tree is vertical.

  • Hide the histograms displayed in the node.

  • Show less detail.

  • Expand all nodes.

  • Save Rules.

13.3.3.1.1 Save Rules

To save Decision Tree or Clustering rules:

  1. Click Save Rules on the far right of the upper tab.
    By default, rules are saved for leaf nodes only to the Microsoft Windows Clipboard. You can then paste the rules into any rich document, such as a Microsoft Word document.
    You can deselect Leaves Only to save all rules.

  2. To save rules to a file, click Save to File and specify a file name.

  3. Select the location of the file in the Save dialog box. By default, the rules are saved as an HTML file.

  4. Click OK.

13.3.3.1.2 Settings (DT)

The Settings tab has these tabs:

13.3.3.1.3 DT Summary

This tab displays the following information about the model:

  • General: Describes the following:

    • Type of model

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model Build (in minutes)

    • Size of the model (in MB)

    • Comments (if the model has any comments)

  • Algorithm settings are Decision Tree Algorithm Settings.

13.3.3.1.4 DT Inputs

This tab shows information about the attributes used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name

  • Data type

  • Mining Type: Categorical, Numerical, or Text.

  • Target: The check in this column indicates that the attribute is a target.

  • Data Prep

    • YES: Indicates that data preparation was performed.

      If Data Prep is Yes, then select the column (click it). The data preparation is displayed in Data Preparation. To see the reverse transformation for data preparation, select Show Reverse Transformation. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

    • NO: Indicates that no data preparation was performed

13.3.3.1.5 DT Target Values

Displays the target attributes, their data types, and the values of each target attribute.

13.3.3.2 Decision Tree Algorithm Settings

The Decision Tree algorithm supports these settings:

  • Homogeneity Metric:

    • Gini (default)

    • Entropy

  • Maximum Depth: The maximum number of levels of the tree. The default is 7. The value must be an integer in the range 2 to 20.

  • Minimum Records in a Node: The minimum number of records in a node. The default is 10. The value must be an integer greater than or equal to 0.

  • Minimum Percent of Records in a Node: The default is 0.05. The value must be a number in the range 0 to 10.

  • Minimum Records for a Split: The minimum number of records for a split. The default is 20. The value must be an integer greater than or equal to 0.

  • Minimum Percent of Records for a Split: The default is 0.1. The value must be a number in the range 0 to 20.
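The two Homogeneity Metric options can be compared with a short sketch. It uses the textbook definitions of Gini impurity and entropy on per-class record counts in a node. The function names are illustrative, and the exact computation inside Oracle Data Mining is not described in this section.

```python
import math


def gini(counts):
    """Gini impurity for a list of per-class record counts in a node."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) for a list of per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Classes with zero records contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    # A pure node scores 0 under both metrics; an even split scores highest.
    for counts in ([50, 50], [90, 10], [100, 0]):
        print(counts, round(gini(counts), 4), round(entropy(counts), 4))
```

Both metrics are 0 for a node that contains a single target value and reach their maximum when the target values are evenly mixed, so either can be used to rank candidate splits.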

13.4 Expectation Maximization

Expectation Maximization (EM) is a density estimation technique. Oracle Data Mining implements EM as a distribution-based clustering algorithm that uses probability density estimation.

In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.

Note:

Expectation Maximization requires Oracle Database 12c.

Dense areas are interpreted as components or clusters. Density-based clustering is conceptually different from distance-based clustering (such as k-Means), where emphasis is placed on minimizing intracluster distances and maximizing intercluster distances.

The shape of the probability density function used in EM effectively predetermines the shape of the identified clusters. For example, Gaussian density functions can identify single peak symmetric clusters. These clusters are modeled by single components. Clusters of a more complex shape need to be modeled by multiple components. The EM algorithm assigns model components to high-level clusters by default.
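The idea of modeling each cluster with a Gaussian component can be illustrated with a minimal sketch of the EM iteration for a one-dimensional mixture. This is a generic textbook version, not the Oracle Data Mining implementation. The function names, the starting values, and the synthetic sample are assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_em(data, means, variances, priors, iterations=50):
    """Fit a 1-D Gaussian mixture with plain EM; returns (means, variances, priors)."""
    means, variances, priors = list(means), list(variances), list(priors)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            weighted = [priors[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsed component
            priors[j] = nj / len(data)
    return means, variances, priors


def make_sample(seed=0):
    """Synthetic data: two single-peak symmetric clusters around 0 and 8."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(8.0, 1.0) for _ in range(300)]


if __name__ == "__main__":
    means, variances, priors = fit_em(make_sample(), [1.0, 6.0], [1.0, 1.0], [0.5, 0.5])
    print(means, variances, priors)
```

Each fitted component has a mean, a variance, and a prior (its share of the data). A cluster with a more complex shape would need several such components.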

13.4.1 Build and Apply an EM Model

To build an EM model, use a Clustering Node. You must be connected to Oracle Database 12c.

To apply an EM model, use an Apply Node.

13.4.2 EM Model Viewer and Algorithm Settings

This section discusses the Expectation Maximization Model Viewer, the procedure to view the EM Model viewer and the algorithm settings related to EM. The topics include:

13.4.2.1 EM Model Viewer

You can view and examine an EM Model in an EM Model Viewer. You can view the model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

    The EM model viewer opens in a new tab. The default name of an EM model has EM in the name.

The Tree tab is displayed by default. The EM model viewer has these tabs:

13.4.2.1.1 EM Component

The Component tab provides detailed information about the components of the EM model.

The tab is divided into several panes.

The top pane specifies the cluster to view:

  • Component: It is the integer that identifies the cluster. The default value is 1.

  • Prior: It is the prior probability of the specified component.

  • Filter by Attribute Name: Enables you to display only those attributes of interest. Enter the attribute name, and click Query.

  • Fetch Size: It is the number of records fetched. The default is 2,000.

The middle pane displays information about the attributes in the specified component:

  • You can search for a specified Attribute using the search box.

  • The attributes are displayed in a grid. The grid lists Attribute (name), Distribution (as a histogram), and Mean and Variance (numerical attributes only).

    To sort any of these columns, click the column title.

  • To see a larger version of the histogram for an attribute and information about the distribution, select the attribute. The histogram is displayed in the bottom pane.

The bottom pane displays a large version of the selected histogram, data, and projections (if any):

  • The Chart tab contains a larger version of the histogram of the selected attribute.

  • The Data tab shows the frequency of the histogram bins.

  • The Projections tab lists projections in a grid, listing Value and Coefficient for each Attribute Subname.

13.4.2.1.2 EM Details

The Details tab shows global details for the EM model. The following information is displayed:

  • Log Likelihood Improvement

  • Number of Clusters

  • Number of Components

13.4.2.2 EM Algorithm Settings

EM supports these settings:

  • Number of Clusters is the maximum number of leaf clusters generated by the algorithm. EM may return fewer clusters than the number specified, depending on the data. The number of clusters returned by EM cannot be greater than the number of components, which is governed by algorithm-specific settings. Depending on these settings, there may be fewer clusters than components. If component clustering is disabled, then the number of clusters equals the number of components.

    The default is system determined. To specify a specific number of clusters, click User specified and enter an integer value.

  • Component Clustering is selected by default.

    Component Cluster Threshold specifies a dissimilarity threshold value that controls the clustering of EM components. Smaller values may produce more clusters that are more compact while large values may produce fewer clusters that are more spread out. The default value is 2.

  • Linkage Function enables the specification of a linkage function for the agglomerative clustering step. The linkage functions are:

    • Single uses the nearest distance within the branch. The clusters tend to be larger and have arbitrary shapes.

      Single is the default.

    • Average uses the average distance within the branch. There is less chaining effect and the clusters are more compact.

    • Complete uses the maximum distance within the branch. The clusters are smaller and require a strong component overlap.

  • Approximate Computation indicates whether the algorithm should use approximate computations to improve performance.

    For EM, approximate computation is appropriate for large models with many components and for data sets with many columns. The approximate computation uses localized parameter optimization that restricts learning to parameters that are likely to have the most significant impact on the model.

    The values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable

  • Number of Components specifies the maximum number of components in the model. The algorithm automatically determines the number of components, based on improvements in the likelihood function or based on regularization, up to the specified maximum.

    The number of components must be greater than or equal to the number of clusters.

    The default number of components is 20.

  • Max Number of Iterations specifies the maximum number of iterations in the EM core algorithm. Maximum number of iterations must be greater than or equal to 1. This setting applies to the input table/view as a whole and does not allow per attribute specification.

    The default is 100.

  • Log Likelihood Improvement specifies the percentage improvement in the value of the log likelihood function required to add a new component to the model.

    The default value is 0.001.

  • Convergence Criterion specifies the convergence criterion for EM. The convergence criteria are:

    • System Determined (Default)

    • Bayesian Information Criterion

    • Held-aside data set

  • Numerical Distribution specifies the distribution for modeling numeric attributes. The options are the following distributions:

    • Bernoulli

    • Gaussian

    • System Determined (Default)

    When the Bernoulli or Gaussian distribution is chosen, all numerical attributes are modeled using the same distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

  • Gather Class Statistics enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules). Disabling the cluster statistics will result in smaller models and will reduce the model details calculated.

    By default, Gather Class Statistics is selected (enabled).

    • If you disable Gather Class Statistics, then you will not be able to view models.

    • If you enable Gather Class Statistics, then you can specify Min Percent of Attribute Rule Support.

    Min Percent of Attribute Rule Support specifies the percentage of the data rows assigned to a cluster that must be present in an attribute to include that attribute in the cluster rule. The default value is 0.1.

  • Data Preparation and Analysis specifies settings for data preparation and analysis. To view or change the selections, click Settings.

Click OK after you are done.
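The three Linkage Function options differ only in how the distance between two groups is summarized. The sketch below shows the difference using points on a line as a stand-in for EM components. The real algorithm works on a dissimilarity measure between components. The function names and the merge rule in should_merge are assumptions for illustration, not the documented behavior.

```python
from itertools import product


def pairwise(a, b):
    """All distances between 1-D points in group a and points in group b."""
    return [abs(x - y) for x, y in product(a, b)]


def single_linkage(a, b):
    """Nearest distance between the two groups."""
    return min(pairwise(a, b))


def average_linkage(a, b):
    """Average distance between the two groups."""
    distances = pairwise(a, b)
    return sum(distances) / len(distances)


def complete_linkage(a, b):
    """Maximum distance between the two groups."""
    return max(pairwise(a, b))


def should_merge(a, b, linkage, threshold=2.0):
    """Hypothetical merge rule: join two groups when the linkage distance
    does not exceed the threshold (assumed semantics)."""
    return linkage(a, b) <= threshold


if __name__ == "__main__":
    left, right = [0.0, 1.0], [3.0, 5.0]
    for linkage in (single_linkage, average_linkage, complete_linkage):
        print(linkage.__name__, linkage(left, right), should_merge(left, right, linkage))
```

With the same threshold, Single merges most readily and Complete least readily, which matches the description of larger clusters for Single and smaller clusters for Complete.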

13.4.2.2.1 EM Data Preparation and Analysis Settings

This dialog box enables you to view or change these settings:

  • Max Number of Correlated 2D Attributes specifies the maximum number of correlated two-dimensional attributes that will be used in the EM model. Two-dimensional attributes correspond to columns that have a simple data type (not nested).

    The default is 50.

  • Number of Projections per Nested Column specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, then the data will not be projected. The setting applies to all nested columns.

    The default is 50.

  • Number of Quantile Bins (Numerical Columns) specifies the number of quantile bins that will be used for modeling numerical columns with multivalued Bernoulli distributions.

    The default is system determined.

  • Number of Top-N Bins (Categorical Columns) specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

    The default is system determined.

  • Number of Equi-Width Bins (Numerical Columns) specifies the number of equi-width bins that will be used for gathering cluster statistics for numerical columns.

    The default is 11.

  • Include uncorrelated 2D Attributes specifies whether uncorrelated two-dimensional attributes should be included in the model or not. Two-dimensional attributes correspond to columns that are not nested.

    The values are:

    • System Determined (Default)

    • Enable

    • Disable

When you have finished making changes, click OK.
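The binning settings in this dialog box refer to three common binning schemes. The following sketch shows one simple way to compute each of them. The function names, the nearest-rank quantile convention, and the OTHER label for the remaining categories are assumptions. Oracle Data Mining may compute its bins differently.

```python
from collections import Counter


def equi_width_edges(values, bins):
    """Bin edges that split the range [min, max] into equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def quantile_edges(values, bins):
    """Bin edges chosen so that each bin holds roughly the same number of
    values (nearest-rank cut points; other conventions differ slightly)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // bins)] for i in range(bins + 1)]


def top_n_bins(values, n, other_label="OTHER"):
    """Keep the n most frequent categories and pool the rest into one bin."""
    counts = Counter(values)
    top = dict(counts.most_common(n))
    rest = sum(counts.values()) - sum(top.values())
    if rest:
        top[other_label] = rest
    return top


if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 100]
    print(equi_width_edges(numbers, 3))
    print(quantile_edges(numbers, 2))
    print(top_n_bins(["a", "a", "b", "c", "a", "b"], 2))
```

Equi-width bins are sensitive to outliers such as the value 100 in the example, while quantile bins follow the data density.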

13.5 Generalized Linear Models

Generalized Linear Models (GLM) is a statistical technique for linear modeling. Oracle Data Mining supports GLM for both Regression and Classification. The following topics describe GLM models:

13.5.1 Generalized Linear Models Overview

Generalized Linear Models (GLM) include and extend the class of linear models referred to as Linear Regression.

Oracle Data Mining includes two of the most popular members of the GLM family of models with their most popular link and variance functions:

  • Linear Regression with the identity link and variance function equal to the constant 1 (constant variance over the range of response values).

  • Logistic Regression with the logistic link and binomial variance functions.

In Oracle Database 12c, GLM Classification and Regression are enhanced to implement Feature Selection and Feature Generation. This capability, when specified, can enhance the performance of the algorithm and improve the accuracy and interpretability.

See Also:

"Data Preparation for GLM" for information about how to use ADP with GLM

13.5.1.1 Linear Regression

Linear Regression is the GLM Regression algorithm supported by Oracle Data Mining. The algorithm assumes no target transformation and constant variance over the range of target values.

13.5.1.2 Logistic Regression

Binary Logistic Regression is the GLM classification algorithm supported by Oracle Data Mining. The algorithm uses the logit link function and the binomial variance function.
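The link and variance functions named in this overview can be written down directly. The sketch below uses the standard definitions of the identity and logit links and of the constant and binomial variance functions. The function names and the example coefficients are illustrative only.

```python
import math


def identity_link(mu):
    """Linear Regression: the linear predictor is the mean itself."""
    return mu


def constant_variance(mu):
    """Linear Regression: variance function equal to the constant 1."""
    return 1.0


def logit_link(p):
    """Logistic Regression: maps a probability to the linear predictor."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def binomial_variance(p):
    """Logistic Regression: variance function p * (1 - p)."""
    return p * (1.0 - p)


def linear_predictor(intercept, coefficients, x):
    """Intercept plus the weighted sum of the predictor values."""
    return intercept + sum(c * v for c, v in zip(coefficients, x))


def predict_linear(intercept, coefficients, x):
    return linear_predictor(intercept, coefficients, x)


def predict_logistic(intercept, coefficients, x):
    return inverse_logit(linear_predictor(intercept, coefficients, x))


if __name__ == "__main__":
    print(predict_linear(1.0, [2.0], [3.0]))
    print(predict_logistic(-1.0, [0.5], [2.0]))
```

Both model types share the same linear predictor. Only the link that maps it to the predicted value differs.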

13.5.1.3 Data Preparation for GLM

Oracle recommends that you use Automatic Data Preparation with GLM.

13.5.2 GLM Classification Models

You can perform the following tasks with GLM Classification model:

  • Build and Test GLM Classification Model: To build and test a GLM Classification (GLMC) model, use a Classification Node. By default, the Classification Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test Node.

  • Tune GLM Classification Model: After you build and test a GLM classification model, you can tune it.

  • Apply GLM Classification Model: To apply the GLM Classification model, use an Apply Node.

13.5.3 GLM Regression Models

You can perform the following tasks with GLM Regression models:

  • Build and Test GLM Regression model: To build and test a GLM Regression (GLMR) model, use a Regression Node. By default, a Regression Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test Node.

    See Also:

    "Testing Regression Models" for more information about testing

  • Apply GLM Regression model: To apply a GLM Regression model, use an Apply Node.

13.5.4 GLM Model Viewers and Algorithm Settings

This section discusses the Generalized Linear Model viewer, the procedure to view the GLM viewer and the algorithm settings related to the GLM model. The topics are:

13.5.4.1 GLM Classification Model Viewer

The GLM Classification (GLMC) model viewer displays characteristics of a GLMC model. GLMC is also known as Logistic Regression. To view a GLMC model, use one of these methods:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

    The GLMC viewer opens in a new tab. The default name of a GLM model has GLM in the name. The Detail tab is displayed by default.

The viewer has these tabs:

13.5.4.1.1 GLMC Details

Model Details list global metrics for the model as a whole. The metrics display has two columns: Name of the metric and Value of the metric. The following metrics are displayed:

  • Akaike's criterion (AIC) for the fit of the intercept only model

  • Akaike's criterion (AIC) for the fit of the intercept and the covariates (predictors) model

  • Dependent mean

  • Likelihood ratio chi-square value

  • Likelihood ratio chi-square probability value

  • Likelihood ratio degrees of freedom

  • Model converged (Yes or No)

  • -2 log likelihood of the intercept only model

  • -2 log likelihood of the model

  • Number of parameters (number of coefficients, including the intercept)

  • Number of rows

  • Correct Prediction percentage

  • Incorrectly predicted percentage of rows

  • Tied cases prediction, that is, cases where no prediction can be made

  • Pseudo R-Square Cox and Snell

  • Pseudo R-Square Nagelkerke

  • Schwarz's Criterion (SC) for the fit of the intercept-only model

  • Schwarz's Criterion (SC) for the fit of the intercept and the covariates (predictors) model

  • Termination (normal or not)

  • Valid covariance matrix (Yes or No)

Note:

The exact list of metrics computed depends on the model settings.
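Several of the metrics in this list are related through the log likelihood. The following sketch derives them from the log likelihoods of the intercept-only model and of the full model, using the usual textbook formulas. The function and key names are illustrative, and the values that the viewer reports may be computed with different conventions.

```python
import math


def fit_metrics(ll_null, ll_model, n_rows, n_params):
    """Derive global fit metrics for a logistic model.

    ll_null  -- log likelihood of the intercept-only model
    ll_model -- log likelihood of the model with covariates
    n_rows   -- number of rows used to build the model
    n_params -- number of coefficients, including the intercept
    """
    neg2ll_null = -2.0 * ll_null
    neg2ll_model = -2.0 * ll_model
    # Cox and Snell: 1 - (L_null / L_model) ** (2 / n)
    cox_snell = 1.0 - math.exp((neg2ll_model - neg2ll_null) / n_rows)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n_rows)
    return {
        "aic_intercept_only": neg2ll_null + 2.0,
        "aic_model": neg2ll_model + 2.0 * n_params,
        "sc_intercept_only": neg2ll_null + math.log(n_rows),
        "sc_model": neg2ll_model + n_params * math.log(n_rows),
        "lr_chi_square": neg2ll_null - neg2ll_model,
        "lr_degrees_of_freedom": n_params - 1,
        "pseudo_r2_cox_snell": cox_snell,
        "pseudo_r2_nagelkerke": cox_snell / max_cox_snell,
    }


if __name__ == "__main__":
    # 100 rows with a balanced binary target give ll_null = 100 * ln(0.5).
    for name, value in fit_metrics(100 * math.log(0.5), -50.0, 100, 3).items():
        print(name, round(value, 4))
```

Lower AIC and SC values indicate a better trade-off between fit and model size, and SC penalizes additional parameters more heavily than AIC once the number of rows is large.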

13.5.4.1.2 GLMC Coefficients

The Coefficient tab enables you to view GLM coefficients. The viewer supports sorting to control the order in which coefficients are displayed and filtering to select the coefficients to display.

The default is to sort coefficients by absolute value. If you deselect Sort by absolute value, click Query.

The default fetch size is 1000 records. To change the fetch size, specify a new number of records and click Query.

Note:

After you change any criteria on this tab, click Query to query the database. You must click Query even for changes such as selecting or deselecting sort by absolute value or changing the fetch size.

The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. If a coefficient is close to 0, then the bar may be too small to display.

See Also:

"Sort and Search GLMC Coefficients" for information about sorting and searching the grid

  • Target Value: Select a specific target value and see only those coefficients. The default is to display the coefficients of the value that occurs least frequently. It is possible for a target value to have no coefficients; in that case, the list has no entries.

  • Sort by absolute value: The default is to sort the list of coefficients by absolute value; you can deselect this option.

  • Fetch Size: The number of rows displayed. The default is 1000. To figure out if all coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.

Coefficients are listed in a grid. If no items are listed, then there are no coefficients for that target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The linear coefficient estimate for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. (If a value is close to 0, then the bar may be too small to be displayed.)

  • Standardized coefficient: The coefficient rescaled by the ratio of the standard deviation of the predictor to the standard deviation of the target.

    The standardized coefficient places all coefficients on the same scale, so that you can, at a glance, tell the large contributors from the small ones.

  • Standard error

  • Exp (Coefficient): The exponential of the coefficient, that is, e raised to the power of the coefficient.

  • Standard Error of the estimate.

  • Wald Chi Square

  • Probability of greater than Chi Square

  • Test Statistic: For Linear Regression, the t-value of the coefficient estimate; for Logistic Regression, the Wald chi-square value of the coefficient estimate

  • Probability of the test statistic. Used to analyze the significance of specific attributes in the model

  • Variance Inflation Factor

    • 0 for the intercept

    • Null for Logistic Regression

  • Lower Coefficient Limit, lower confidence bound of the coefficient

  • Upper Coefficient Limit, upper confidence bound of the coefficient

  • Exp (Coefficient)

    • Exponentiated coefficient for Logistic Regression

    • Null for Linear Regression

  • Exp (Lower Coefficient Limit)

    • Exponentiated coefficient of the lower confidence bound for Logistic Regression

    • Null for Linear Regression

  • Exp (Upper Coefficient Limit)

    • Exponentiated coefficient of upper confidence bound for Logistic Regression

    • Null for Linear Regression

Note:

Not all statistics are necessarily returned for each coefficient.

Statistics are null if any of the following are true:

  • The statistic does not apply to the mining function. For example, Exp (coefficient) does not apply to Linear Regression.

  • The statistic cannot be calculated because of limitations in system resources.

  • The value of the statistic is infinity.

  • If the model was built using Ridge Regression, or if the covariance matrix is found to be singular during the build, then coefficient bounds (upper and lower) have the value NULL.
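The relationships among the coefficient statistics in the grid can be sketched as follows. The sketch assumes the usual normal-approximation (Wald) formulas for a Logistic Regression coefficient and the definition of the standardized coefficient given above. The function and key names are illustrative, and the viewer may use different conventions.

```python
import math
from statistics import NormalDist, stdev


def coefficient_stats(coef, std_error, confidence_level=0.95):
    """Wald statistics and confidence bounds for one coefficient estimate."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    wald = (coef / std_error) ** 2
    # Chi-square with 1 degree of freedom: P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    lower = coef - z * std_error
    upper = coef + z * std_error
    return {
        "wald_chi_square": wald,
        "pr_gt_chi_square": p_value,
        "lower_coefficient_limit": lower,
        "upper_coefficient_limit": upper,
        "exp_coefficient": math.exp(coef),
        "exp_lower_coefficient_limit": math.exp(lower),
        "exp_upper_coefficient_limit": math.exp(upper),
    }


def standardized_coefficient(coef, predictor_values, target_values):
    """Coefficient rescaled by sd(predictor) / sd(target)."""
    return coef * stdev(predictor_values) / stdev(target_values)


if __name__ == "__main__":
    for name, value in coefficient_stats(0.5, 0.25).items():
        print(name, round(value, 4))
```

Because the bounds are built from the standard error, they cannot be produced when the covariance matrix is unavailable, which is consistent with the NULL bounds described above for Ridge Regression and singular covariance matrices.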

13.5.4.1.3 Sort and Search GLMC Coefficients

You can sort the numerical columns by clicking the title of the column. For example, to arrange the coefficients in increasing order, click Coefficients in the grid.

Use search to search for items. The default is to search by Attribute (name).

There are search options that limit the columns searched. The filter settings with the (or)/(and) suffixes enable you to enter multiple strings separated by spaces. For example, if you select Attribute/Value/Coefficient(or), the filter string A .02 produces all rows where the Attribute or the Value starts with the letter A or the Coefficient starts with 0.02.

To clear a search, click delete.

13.5.4.1.4 GLMC Compare

GLM Classification Compare viewer is similar to the SVM Coefficients Compare viewer except that the GLM model can only be built for binary classification models. Only two target class values would be available to compare.

See Also:

"GLMC Compare" for details

13.5.4.1.5 GLMC Diagnostics

The Diagnostics tab for GLM Classification displays diagnostics for each Case ID in the build data. You can filter the results.

Note:

Diagnostics are not generated by default. To generate diagnostics, you must specify a Case ID and select Generate Row Diagnostics in Advanced Settings.

The following information is displayed in the Diagnostics grid:

  • CASE_ID

  • TARGET_VALUE for the row in the training data

  • TARGET_VALUE_PROB, probability associated with the target value

  • HAT, value of diagonal element of the HAT matrix

  • WORKING_RESIDUAL, the residual with the adjusted dependent variable

  • PEARSON_RESIDUAL, the raw residual scaled by the estimated standard deviation of the target

  • DEVIANCE_RESIDUAL, contribution to the overall goodness of the fit of the model

  • C, confidence interval displacement diagnostic

  • CBAR, confidence interval displacement diagnostic

  • DIFDEV, change in the deviance due to deleting an individual observation

  • DIFCHISQ, change in the Pearson chi-square
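The residual columns can be illustrated with the textbook definitions for a binary logistic model. The sketch below computes the working, Pearson, and deviance residuals for a single row from the observed target (0 or 1) and the predicted probability. The function name and the returned keys are illustrative, and the values in the Diagnostics grid may follow different conventions.

```python
import math


def logistic_residuals(y, p):
    """Residuals for one row of a binary logistic model.

    y -- observed target, 0 or 1
    p -- predicted probability that the target is 1 (0 < p < 1)
    """
    variance = p * (1.0 - p)  # binomial variance function
    raw = y - p
    # Pearson: raw residual scaled by the estimated standard deviation.
    pearson = raw / math.sqrt(variance)
    # Working: raw residual on the scale of the linear predictor.
    working = raw / variance
    # Deviance: signed square root of the row's contribution to the deviance.
    log_likelihood = math.log(p) if y == 1 else math.log(1.0 - p)
    deviance = math.copysign(math.sqrt(-2.0 * log_likelihood), raw)
    return {"working": working, "pearson": pearson, "deviance": deviance}


if __name__ == "__main__":
    print(logistic_residuals(1, 0.8))
    print(logistic_residuals(0, 0.8))
```

A row that the model predicts poorly, such as an observed 0 with a predicted probability of 0.8, produces residuals with a large magnitude and is a candidate for closer inspection.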

13.5.4.1.6 GLMC Settings

The Settings tab has these other tabs:

13.5.4.1.7 GLMC Data Usage

Describes data usage for the model.

13.5.4.1.8 GLMC Summary

General settings describe the characteristics of the model, including:

  • Name

  • Type

  • Algorithm

  • Target Attribute

  • Creation Date

  • Duration of Model Build

  • Comments

Algorithm settings control the model build. Algorithm settings are specified when the Build node is defined.

See Also:

"GLM Classification Algorithm Settings" for a list of model settings

After a model is built, values calculated by the system are displayed on this tab. For example, if you select System Determined for Enable Ridge Regression, this tab shows if Ridge Regression was enabled and what ridge value was calculated.

13.5.4.1.9 GLMC Inputs

A list of the attributes used to build the model. For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data type: The data type of the attribute.

  • Mining type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Prep: YES indicates that data preparation was performed on the attribute.

When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list. To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and reverse transformations are not always displayed.

13.5.4.1.10 GLMC Target Values

Displays the target attributes, their data types, and the values of each target attribute.

13.5.4.2 GLM Classification Algorithm Settings

GLM supports these settings for classification:

  • Generate Row Diagnostics: By default, Generate Row Diagnostics is deselected. To generate row diagnostics, you must select this option and also specify a Case ID.

    If you do not specify a Case ID, then this setting is not available.

    You can view Row Diagnostics on the Diagnostics tab in the model viewer. To further analyze row diagnostics, use a Model Details node to extract the row diagnostics table.

  • Confidence Level: A positive number that is less than 1.0. This value indicates the degree of certainty that the true probability lies within the confidence bounds computed by the model. The default confidence is 0.95.

  • Reference Class name: The Reference Target Class is the target value used as a reference in a binary Logistic Regression model. The probabilities are produced for the other (non-reference) classes. By default, the algorithm chooses the value with the highest prevalence (the most cases). If there are ties, then the attributes are sorted alpha-numerically in ascending order. The default for Reference Class name is System Determined, that is, the algorithm determines the value.

    See Also:

    "Choose Reference Value (GLMC)" for more information about how to select a specific value

  • Missing Values Treatment: The default is Mean Mode, that is, use mean for numeric values and mode for categorical values. You can also select Delete Row to delete any row that contains missing values. If you delete rows with missing values, then the same missing values treatment (delete rows) must be applied to any data that the model is applied to.

  • Specify Row Weights Column: By default, Row Weights Column is not specified. The Row Weights Column is a column in the training data that contains a weighting factor for the rows.

    Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times.

    Row weights can also be used to emphasize certain rows during model construction. For example, you can use row weights to bias the model toward rows that are more recent and away from potentially obsolete data.

    To specify a Row Weights column, click the check box and select the column from the list.

  • Ridge Regression: By default, Ridge Regression is system determined (not disabled) in both Oracle Database 11g and 12c.

    Note:

    The Ridge Regression setting in both Oracle Database 11g and Oracle Database 12c should be consistent (system determined).

    If you select Ridge Regression, then Feature Selection is automatically deselected.

    Ridge Regression is a technique that compensates for multicollinearity (multivariate regression with correlated predictors). Oracle Data Mining supports Ridge Regression for both regression and classification mining functions.

    To specify options for Ridge Regression, click Option to open the Ridge Regression Option Dialog (GLMC).

    When Ridge Regression is enabled, fewer global details are returned. For example, when Ridge Regression is enabled, no prediction bounds are produced.

    Note:

    If you are connected to Oracle Database 11g Release 2 (11.2) and you get the error ORA-40024 when you build a GLM model, enable Ridge Regression and rebuild the model.

  • Feature Selection: By default, Feature Selection is deselected. This setting requires connection to Oracle Database 12c. To specify Feature Selection or view or specify Feature Selection settings, click Option to open the Feature Selection Option Dialog.

    If you select Feature Selection, then Ridge Regression is automatically deselected.

    Note:

    The Feature Selection setting is available only in Oracle Database 12c.

  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For GLM, approximation is appropriate for data sets that have many rows and are densely populated (not sparse).

    The values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable
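The two Missing Values Treatment options can be sketched on a small in-memory table. In the sketch, rows are dictionaries and None marks a missing value. The function names and the sample rows are illustrative. This is not how Oracle Data Mining applies the treatment internally.

```python
from statistics import mean, mode


def mean_mode_impute(rows, numeric_cols, categorical_cols):
    """Mean Mode: fill numeric gaps with the mean, categorical gaps with the mode."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fills[col] = mode([r[col] for r in rows if r[col] is not None])
    return [
        {col: (fills[col] if value is None and col in fills else value)
         for col, value in row.items()}
        for row in rows
    ]


def delete_rows(rows):
    """Delete Row: drop every row that contains a missing value."""
    return [row for row in rows if all(value is not None for value in row.values())]


if __name__ == "__main__":
    sample = [
        {"age": 30, "color": "red"},
        {"age": None, "color": "red"},
        {"age": 50, "color": None},
        {"age": 40, "color": "blue"},
    ]
    print(mean_mode_impute(sample, ["age"], ["color"]))
    print(delete_rows(sample))
```

Mean Mode keeps every row at the cost of introducing estimated values, while Delete Row keeps only complete rows and therefore must also be applied to the data that the model scores.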

13.5.4.2.1 Feature Selection Option Dialog

The setting requires connection to Oracle Database 12c.

If you select Feature Selection, then Ridge Regression is automatically deselected. This dialog box enables you to specify Feature Selection for a GLMC or a GLMR model:

  • Feature Selection Criteria: The default setting is system determined. You can select one of the following:

    • Akaike Information

    • Schwarz Bayesian Information

    • Risk Inflation

    • Alpha Investing

  • Max Number of Features: The default setting is system determined.

    To specify the maximum number of features, click the User specified option and enter an integer number of features.

  • Feature Identification: The default setting is system determined.

    You can also choose:

    • Enable Sampling

    • Disable Sampling

  • Feature Acceptance: The default setting is system determined.

    You can also choose:

    • Strict

    • Relaxed

  • Prune Model: By default, Enable is selected. You can also select Disable.

  • Categorical Predictor Treatment: By default, Add One at a Time is selected. You can also select Add All at Once.

    If you accept the default, that is Add One at a Time, then Feature Generation is not selected. If you select Feature Generation, then the default is Quadratic Candidates. You can also select Cubic Candidates.

13.5.4.2.2 Choose Reference Value (GLMC)

To select a value, click Edit. In the Choose Reference Value dialog, select Custom. Then, select one of the values in the target values list. Click OK.

13.5.4.2.3 Ridge Regression Option Dialog (GLMC)

You can use the system-determined Ridge Value or you can supply your own. By default, the system determined value is used.

Click OK.
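The effect of the Ridge Value can be seen in a minimal sketch with a single predictor. The penalty is added to the sum of squares in the denominator of the slope estimate, so larger ridge values shrink the coefficient toward 0. This is the textbook formulation on centered data. The function name is illustrative, and how Oracle Data Mining scales or chooses the ridge value is not described in this section.

```python
def ridge_slope(x, y, ridge_value):
    """Slope of a single-predictor ridge fit on centered data.

    With ridge_value = 0 this is the ordinary least-squares slope.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    # The ridge value inflates the denominator, shrinking the estimate.
    return sxy / (sxx + ridge_value)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.0, 4.0, 6.0, 8.0]
    for ridge_value in (0.0, 1.0, 5.0):
        print(ridge_value, ridge_slope(x, y, ridge_value))
```

With several correlated predictors, the same penalty is added to the diagonal of the cross-product matrix, which keeps the matrix invertible and stabilizes the coefficient estimates.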

13.5.4.3 GLM Regression Model Viewer

The GLM Regression (GLMR) model viewer displays characteristics of a GLMR model. GLMR is also known as Linear Regression.

To view a GLMR model, use one of these methods:

  • Right-click the node where the model was built and select Go to Properties from the context menu. In the Models section of Properties, select the model and click view selected model.

  • Select the workflow node where the model was built and right-click. Select View Models from the context menu and then select the model to view.

The default name of a GLM model has GLM in the name.

The GLMR viewer opens in a new tab.

The Detail tab is displayed by default.

The GLM Regression Model Viewer has these tabs:

13.5.4.3.1 GLMR Coefficients

The Coefficient tab enables you to view GLM coefficients. The viewer supports sorting to control the order in which coefficients are displayed and filtering to select the coefficients to display.

See Also:

"GLMR Settings" for more information about confidence for GLM regression models

By default, coefficients are sorted by absolute value. You can deselect or select Sort by absolute value and click Query.

The default fetch size is 1,000 records. To change the fetch size, specify a new number of records and click Query.

Note:

After you change any criteria on this tab, click Query to query the database. You must click Query even for changes such as selecting or deselecting sort by absolute value or changing the fetch size.

Sort and Search GLMC Coefficients describes sorting and searching the grid.

The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. If a coefficient is close to 0, then the bar may be too small to display.

  • Sort by absolute value: Sort the list of coefficients by absolute value.

  • Fetch Size: The number of rows displayed. To figure out if all coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.

Coefficients are listed in a grid. If no items are listed, then there are no coefficients for that target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The linear coefficient estimate for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. (If a value is close to 0, then the bar may be too small to be displayed.)

  • Standard Error of the estimate

  • Wald Chi Squared

  • Pr > Chi Square

  • Upper coefficient limit

  • Lower coefficient limit

Note:

Not all statistics are necessarily returned for each coefficient.

Statistics are null if any of the following are true:

  • The statistic does not apply to the mining function. For example, exp_coefficient does not apply to Linear Regression.

  • The statistic cannot be calculated because of limitations in the system resources.

  • The value of the statistics is infinity.

  • If the model was built using Ridge Regression, or if the covariance matrix is found to be singular during the build, then coefficient bounds (upper and lower) have the value NULL.
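The relationship between the statistical columns of the grid can be illustrated with a short Python sketch. It assumes the textbook definitions: the Wald chi-square is the squared ratio of the coefficient to its standard error, the probability comes from a chi-square distribution with one degree of freedom, and the limits are normal-approximation bounds at the confidence level of the model. The function name and the sample numbers are hypothetical, and Oracle Data Mining may compute these values differently.

```python
import math
from statistics import NormalDist


def wald_summary(coefficient, std_error, confidence_level=0.95):
    """Return textbook Wald statistics for one coefficient (illustrative only)."""
    wald_chi_sq = (coefficient / std_error) ** 2
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(wald_chi_sq / 2.0))
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    return {
        "wald_chi_sq": wald_chi_sq,
        "p_value": p_value,
        "lower": coefficient - z * std_error,
        "upper": coefficient + z * std_error,
    }


# Hypothetical coefficient and standard error, not taken from a real model.
stats = wald_summary(coefficient=1.2, std_error=0.4)
print(stats)
```

A larger standard error widens the interval between the lower and upper limits, and a higher confidence level has the same effect.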

Other Tabs: The viewer has these other tabs:

13.5.4.3.2 GLMR Details

The Model Details list global metrics for the model as a whole. The metrics display has two columns: Name of the metric and Value of the metric. The following metrics are displayed:

  • Adjusted R-Square

  • Akaike's information criterion

  • Coefficient of variation

  • Corrected total degrees of freedom

  • Corrected total sum of squares

  • Dependent mean

  • Error degrees of freedom

  • Error mean square

  • Error sum of squares

  • Model F value statistic

  • Estimated mean square error

  • Hocking Sp statistic

  • JP statistic (the final prediction error)

  • Model converged (Yes or No)

  • Model degrees of freedom

  • Model F value probability

  • Model mean square

  • Model sum of squares

  • Number of parameters (the number of coefficients, including the intercept)

  • Number of rows

  • Root mean square error

  • R-square

  • Schwarz's Bayesian Information Criterion

  • Termination

  • Valid covariance matrix computed (Yes or No)
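Several of these metrics are derived from the same sums of squares. The following Python sketch shows the standard textbook relationships for a small, made-up data set with one predictor. The function names and the data are hypothetical, and the formulas are the usual least-squares definitions, which are assumed (not confirmed) to match the values that the viewer reports.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def global_metrics(ys, predictions, n_params):
    """Textbook goodness-of-fit metrics; n_params includes the intercept."""
    n = len(ys)
    y_mean = sum(ys) / n
    total_ss = sum((y - y_mean) ** 2 for y in ys)   # corrected total sum of squares
    error_ss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    model_ss = total_ss - error_ss
    error_df = n - n_params                         # error degrees of freedom
    model_df = n_params - 1                         # model degrees of freedom
    error_ms = error_ss / error_df                  # error mean square
    r_square = 1.0 - error_ss / total_ss
    return {
        "dependent_mean": y_mean,
        "r_square": r_square,
        "adjusted_r_square": 1.0 - (1.0 - r_square) * (n - 1) / error_df,
        "model_f_value": (model_ss / model_df) / error_ms,
        "root_mean_square_error": math.sqrt(error_ms),
    }


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
metrics = global_metrics(ys, [intercept + slope * x for x in xs], n_params=2)
print(metrics)
```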

13.5.4.3.3 GLMR Diagnostics

The Diagnostics tab displays diagnostics for each Case ID in the build data. You can filter the results.

Note:

Diagnostics are not generated by default. To generate diagnostics, you must specify a Case ID and select Generate Row Diagnostics.

The following information is displayed in the Diagnostics grid:

  • CASE_ID

  • TARGET_VALUE for the row in the training data

  • PREDICTED_VALUE, value predicted by the model for the target

  • HAT, value of the diagonal element of the HAT matrix

  • RESIDUAL, the residual with the adjusted dependent variable

  • STD_ERR_RESIDUAL, Standard Error of the residual

  • STUDENTIZED_RESIDUAL

  • PRED_RES, predicted residual

  • COOKS_D, Cook's D influence statistic
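The diagnostics columns are related to each other through the hat value of each row. The following Python sketch computes them for a one-predictor least-squares fit using the standard textbook formulas. The function name and the data are hypothetical, and the formulas are assumed to correspond to the columns above; the exact definitions used by Oracle Data Mining may differ.

```python
import math


def row_diagnostics(xs, ys):
    """Per-row diagnostics for a one-predictor least-squares fit (illustrative)."""
    n, n_params = len(xs), 2                      # intercept + slope
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    error_ms = sum(r * r for r in residuals) / (n - n_params)

    rows = []
    for case_id, (x, y, resid) in enumerate(zip(xs, ys, residuals), start=1):
        hat = 1.0 / n + (x - x_mean) ** 2 / sxx   # diagonal of the hat matrix
        std_err_resid = math.sqrt(error_ms * (1.0 - hat))
        studentized = resid / std_err_resid
        rows.append({
            "CASE_ID": case_id,
            "TARGET_VALUE": y,
            "PREDICTED_VALUE": y - resid,
            "HAT": hat,
            "RESIDUAL": resid,
            "STD_ERR_RESIDUAL": std_err_resid,
            "STUDENTIZED_RESIDUAL": studentized,
            "PRED_RES": resid / (1.0 - hat),
            "COOKS_D": studentized ** 2 * hat / (n_params * (1.0 - hat)),
        })
    return rows


diagnostics = row_diagnostics([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
for row in diagnostics:
    print(row)
```

Rows with a large HAT value lie far from the center of the predictor values, and rows with a large COOKS_D value have a strong influence on the fitted coefficients.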

Other Tabs: The viewer has these other tabs:

13.5.4.3.4 GLMR Settings

The Settings tab has these tabs:

Other Tabs: The viewer has these other tabs:

13.5.4.3.5 GLMR Summary

General settings describe the characteristics of the model, including owner, name, type, algorithm, target attribute, creation date, duration of model build, and comments.

Algorithm settings control model build; algorithm settings are specified when the build node is defined.

See Also:

"GLM Regression Algorithm Settings" for a list of algorithm settings

After a model is built, values calculated by the system are displayed on this tab. For example, if you select System Determined for Enable Ridge Regression, this tab shows if Ridge Regression was enabled and what ridge value was calculated.

Other Tabs: The Settings tab has this other tab:

13.5.4.3.6 GLMR Inputs

This tab lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data type: The data type of the attribute.

  • Mining type: Categorical or numerical

  • Target: A check mark indicates that the attribute is a target attribute.

  • Data Prep: YES indicates that data preparation was performed.

When you select an attribute in the Attributes list, the transformation properties viewer displays the imbedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list. To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

Other Tabs: The Settings tab has this other tab:

13.5.4.4 GLM Regression Algorithm Settings

GLM supports these settings for regression:

  • Generate Row Diagnostics is set to OFF by default. To generate row diagnostics, you must select this option and also specify a Case ID.

    If you do not specify a Case ID, then this setting is not available.

    You can view Row Diagnostics on the Diagnostics tab when you view the model. To further analyze row diagnostics, use a Model Details node to extract the row diagnostics table.

  • Confidence Level: A positive number that is less than 1.0. This level indicates the degree of certainty that the true probability lies within the confidence bounds computed by the model. The default confidence is 0.95.

  • Missing Values Treatment: The default is Mean Mode. That is, use Mean for numeric values and Mode for categorical values.

    You can also select Delete Row to delete any row that contains missing values. If you delete rows with missing values, then the same missing values treatment (delete rows) must be applied to any data that the model is applied to.

  • Specify Row Weights Column: The Row Weights Column is a column in the training data that contains a weighting factor for the rows. By default, Row Weights Column is not specified. Row weights can be used:

    • As a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times.

    • To emphasize certain rows during model construction. For example, to bias the model toward rows that are more recent and away from potentially obsolete data.

  • Ridge Regression: Ridge Regression is a technique that compensates for multicollinearity (multivariate regression with correlated predictors). Oracle Data Mining supports Ridge Regression for both regression and classification mining functions.

    By default, Ridge Regression is system determined (not disabled) in both Oracle Database 11g and Oracle Database 12c. If you select Ridge Regression, then Feature Selection is automatically deselected.

    To specify options for Ridge Regression, click Option to open the Ridge Regression Option Dialog (GLMR).

    When Ridge Regression is enabled, fewer global details are returned. For example, when Ridge Regression is enabled, no prediction bounds are produced.

    Note:

    If you are connected to Oracle Database 11g Release 2 (11.2) and you get the error ORA-40024 when you build a GLM model, enable Ridge Regression and rebuild the model.

  • Feature Selection: This setting requires connection to Oracle Database 12c. By default, Feature Selection is deselected. To specify Feature Selection or view or specify Feature Selection settings, click Option to open the Feature Selection Option Dialog.

    If you select Feature Selection, then Ridge Regression is automatically deselected.

    Note:

    The Feature Selection setting is available only in Oracle Database 12c.

  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For GLM, approximation is appropriate for data sets that have many rows and are densely populated (not sparse).

    Values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable
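The use of row weights as a compact representation of repeated rows, described under Specify Row Weights Column, can be checked with a small Python sketch. It fits a weighted least-squares line for one predictor and shows that weighting a row by 3 gives the same coefficients as repeating that row three times. The function name and the data are hypothetical, and the sketch illustrates the general statistical idea rather than the internal implementation of Oracle Data Mining.

```python
def weighted_fit(xs, ys, weights):
    """Weighted least squares for one predictor; returns (intercept, slope)."""
    w_total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_total
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Compact form: three distinct rows, with the second row weighted by 3.
compact = weighted_fit([1.0, 2.0, 4.0], [1.5, 2.5, 5.0], weights=[1, 3, 1])

# Expanded form: the same data with the repeated row written out.
expanded = weighted_fit(
    [1.0, 2.0, 2.0, 2.0, 4.0],
    [1.5, 2.5, 2.5, 2.5, 5.0],
    weights=[1, 1, 1, 1, 1],
)
print(compact, expanded)
```

Non-integer weights work in the same way, which is how more recent rows can be emphasized over potentially obsolete rows.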

13.5.4.4.1 Ridge Regression Option Dialog (GLMR)

You can use the System Determined Ridge Value or you can supply your own. By default, the System Determined value is used. Produce Variance Inflation Factor (VIF) is not selected by default. You can select it.

Click OK.
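To illustrate what the ridge value and the Variance Inflation Factor measure, the following Python sketch uses two highly correlated predictors. It solves the ridge normal equations for centered data and computes the VIF from the correlation between the two predictors. The function names and the data are hypothetical, and the formulas are the standard textbook ones; how Oracle Data Mining determines the ridge value is not shown here.

```python
def ridge_two_predictors(x1, x2, y, ridge_value):
    """Ridge coefficients for two centered predictors via the 2x2 normal equations."""
    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    x1, x2, y = centered(x1), centered(x2), centered(y)
    a = sum(u * u for u in x1) + ridge_value      # ridge value added to the diagonal
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2) + ridge_value
    d1 = sum(u * t for u, t in zip(x1, y))
    d2 = sum(v * t for v, t in zip(x2, y))
    det = a * c - b * b
    return (c * d1 - b * d2) / det, (a * d2 - b * d1) / det


def vif_two_predictors(x1, x2):
    """With exactly two predictors, both share VIF = 1 / (1 - r^2)."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    return 1.0 / (1.0 - s12 * s12 / (s11 * s22))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]                    # nearly a copy of x1
y = [2.0, 4.1, 6.2, 7.9, 10.1]

plain = ridge_two_predictors(x1, x2, y, ridge_value=0.0)
shrunk = ridge_two_predictors(x1, x2, y, ridge_value=1.0)
vif = vif_two_predictors(x1, x2)
print(plain, shrunk, vif)
```

A large VIF indicates that the predictors are strongly correlated. A positive ridge value shrinks the coefficients, which stabilizes the estimates in that situation.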

13.5.4.4.2 Choose Reference Value (GLMR)

To select a value:

  1. Click Edit.

  2. In the Choose Reference dialog box, click Custom.

  3. Select one of the values in the Target Values field.

  4. Click OK.

13.6 k-Means

The k-Means (KM) algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters, provided there are enough distinct cases.

Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.
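The following Python sketch shows how the choice of distance metric can change the cluster that a data point is assigned to. It uses the usual definitions of Euclidean distance and of cosine distance (1 minus the cosine similarity). The function names and the sample centroids are hypothetical. Fast Cosine is not shown because its definition is not given here.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_cluster(point, centroids, distance=euclidean_distance):
    """Index of the centroid that is closest to the point under the given metric."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


centroids = [(1.0, 1.0), (6.0, 3.0)]
point = (5.0, 5.0)
by_euclidean = nearest_cluster(point, centroids, euclidean_distance)
by_cosine = nearest_cluster(point, centroids, cosine_distance)
print(by_euclidean, by_cosine)
```

In this example the point is closer to the second centroid in absolute position, but it points in the same direction as the first centroid, so the two metrics assign it to different clusters.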

Use a Clustering Node to build KM models.

Use an Apply Node to apply a KM model to new data.

The following topics describe KM models:

13.6.1 k-Means Algorithm

Oracle Data Mining implements an enhanced version of the k-Means algorithm with the following features:

  • The algorithm builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-Means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.

  • The algorithm grows the tree one node at a time (unbalanced approach). Depending on the Split Criterion setting, the node with the largest variance or the largest size is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified as a build setting.

  • The algorithm provides probabilistic scoring and assignment of data to clusters.

  • The algorithm returns the following, for each cluster:

    • A centroid (cluster prototype). The centroid reports the mode for categorical attributes, or the mean and variance for numerical attributes.

    • Histograms (one for each attribute).

    • A rule describing the hyper box that encloses the majority of the data assigned to the cluster.
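The top-down growth of the tree can be sketched in Python as follows. This is a simplified, one-dimensional illustration of a bisecting approach: at each step the leaf with the largest variance (or size) is split in two, until the requested number of leaf clusters is reached or no leaf has enough distinct cases. The function names and the data are hypothetical, and the final refinement of all nodes that the enhanced algorithm performs is omitted.

```python
from statistics import mean, pvariance


def two_means(points, max_iter=30):
    """Split 1-D points into two groups with a basic 2-means loop."""
    centers = [min(points), max(points)]          # deterministic starting centers
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points if abs(p - centers[0]) <= abs(p - centers[1])]
        right = [p for p in points if abs(p - centers[0]) > abs(p - centers[1])]
        if not right:
            break                                 # all points are identical
        updated = [mean(left), mean(right)]
        if updated == centers:
            break                                 # assignments no longer change
        centers = updated
    return left, right


def grow_leaves(points, max_clusters, split_criterion="variance"):
    """Grow leaf clusters one binary split at a time (unbalanced tree)."""
    score = pvariance if split_criterion == "variance" else len
    leaves = [list(points)]
    while len(leaves) < max_clusters:
        candidates = [leaf for leaf in leaves if len(set(leaf)) > 1]
        if not candidates:
            break                                 # not enough distinct cases
        target = max(candidates, key=score)
        leaves.remove(target)
        leaves.extend(two_means(target))
    return leaves


data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
leaves = grow_leaves(data, max_clusters=3)
print(sorted(sorted(leaf) for leaf in leaves))
```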

The clusters discovered by enhanced k-Means are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The k-Means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components.
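The mixture-model interpretation can be illustrated with a short Python sketch of probabilistic assignment. Each cluster is treated as a spherical normal distribution with a shared variance, and the probability of each cluster for a data point is obtained with Bayes' rule. The function name, the priors, and the variance are hypothetical, and this is a sketch of the general idea, not the scoring code that Oracle Data Mining uses.

```python
import math


def cluster_probabilities(point, centroids, priors, variance):
    """Posterior probability of each cluster for one point (shared spherical variance)."""
    weights = []
    for centroid, prior in zip(centroids, priors):
        squared_distance = sum((x - c) ** 2 for x, c in zip(point, centroid))
        # The normalizing constant is the same for every component, so it cancels.
        weights.append(prior * math.exp(-squared_distance / (2.0 * variance)))
    total = sum(weights)
    return [w / total for w in weights]


centroids = [(0.0, 0.0), (5.0, 5.0)]
probabilities = cluster_probabilities(
    point=(1.0, 1.0), centroids=centroids, priors=[0.5, 0.5], variance=4.0
)
print(probabilities)
```

With equal priors and a shared variance, the most probable cluster is the one whose centroid is nearest to the data point, which matches the distance-based assignment.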

Note:

The k-Means algorithm samples one million rows. You can use the sample to build the model.

13.6.2 KM Model Viewer and Algorithm Settings

This section discusses the k-Means (KM) Model viewer, the procedure to view the KM Model viewer and the algorithm settings related to KM. Topics are:

13.6.2.1 KM Model Viewer

The KM Model Viewer lets you examine a KM model. You can view a KM model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

    The KM model viewer opens in a new tab. The default name of a k-Means model has KM in the name. The Tree tab is displayed by default.

The KM model viewer has these tabs:

13.6.2.1.1 EM, KM, and OC Tree

The tree viewers for EM, KM, and OC operate in the same way.

The Tree Viewer is the graphical tree for hierarchical clusters. When you view the tree:

You can compare the attributes in a given node with the attributes in the population using EM, KM, and OC Compare.

Viewing Information:

To view information about a particular node:

  1. Select the node.

  2. In the lower pane, the following are displayed in each of these tabs:

    • Centroid: Displays the centroid of the cluster

    • Cluster Rule: Displays the rule that all elements of the cluster satisfy.

Display Control:

The following control the display of the tree as a whole:

  • Zoom in: Zooms in to the diagram, providing a more detailed view of the rule.

  • Zoom out: Zooms out from the diagram, providing a view of much or all of the rule.

  • Percent size: Enables you to select an exact percentage to zoom the view.

  • Fit to Window: Zooms out from the diagram until the whole diagram fits within the screen.

  • Layout Type: Enables you to select horizontal layout or vertical layout; the default is vertical.

  • Expand All Nodes: Shows all branches of the tree.

  • Show more detail: Shows more data for each tree node. Click again to show less detail.

  • Top Attributes: Displays the top N attributes. N is 5 by default. To change N, select a different number from the list.

  • Refresh: Enables you to apply the changed Query Settings.

  • Query Settings: Enables you to change the number of top settings. The default is 10. You can save a different number as the new default.

  • Save Rules.

13.6.2.1.2 Cluster (Viewer)

The Cluster tabs for EM, KM, and OC operate in the same way.

The Cluster tab enables you to view information about a selected cluster. The viewer supports filtering so that only selected probabilities are displayed.

The following information is displayed:

  • Cluster: The ID of the cluster being viewed. To view another cluster, select a different ID from the menu. You can view Leaves Only (terminal clusters) by selecting Leaves Only. Leaves Only is the default.

  • Fetch Size: Default is 20. You can change this value.

    If you change Fetch Size, click Query to see the new display.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Name of the attribute.

  • Histogram of the attribute values in the cluster.

  • Confidence displayed as both a number and with a bar indicating a percentage. If confidence is very small, then no bar is displayed.

  • Support, the number of cases.

  • Mean, displayed for numeric attributes.

  • Mode, displayed for categorical attributes.

  • Variance

To view a larger version of the histogram, select an attribute; the histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

You can search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box.

The drop-down list enables you to search the grid by Attribute (the default) or by Mode. Type the search term in the box next to search.

To clear a search, click delete.

Other Tabs: The model viewer has these other tabs:

13.6.2.1.3 EM, KM, and OC Compare

The Compare tabs for EM, KM, and OC operate in the same way. The Compare tab enables you to compare two clusters in the same model. The display enables you to select the two clusters to compare.

You can perform the following tasks:

  • Compare Clusters: To select clusters to compare, pick them from the lists. The clusters are compared by comparing attribute values. The comparison is displayed in a grid. You can use Compare to compare an individual cluster with the population.

  • Rename Clusters: To rename clusters, click Edit. This opens the Rename Cluster dialog box. By default, only leaf clusters are shown. To show all nodes, deselect Leaves Only. The default Fetch Size is 20. You can change this value.

  • Search Attribute: To search an attribute, enter its name in the search box. You can also search by rank.

  • Create Query: If you make any changes, click Query.

For each cluster, a histogram is generated that shows the attribute values in the cluster. To see enlarged histograms for a cluster, click the attribute that you are interested in. The enlarged histograms are displayed below the attribute grid.

In some cases, there may appear to be Missing Histograms in a Cluster.

13.6.2.1.4 Compare Cluster with Population

To see how an individual cluster compares with the population:

  1. Click Compare.

  2. Deselect Leaves Only.

  3. Select the root node as Cluster 1. This is cluster 1, if the clusters have not been renamed. The distribution of attribute values in Cluster 1 represents the distribution of values in the population as a whole. Select the cluster that you want to compare with the population as Cluster 2.

  4. You can now compare the distribution of values for each attribute in the cluster selected as Cluster 2 with the values in Cluster 1.

13.6.2.1.5 Missing Histograms in a Cluster

If clusters are built using sparse data, then some attribute values are not present in the records assigned to a cluster.

In this case, a cluster comparison shows the centroid and histogram values for the cluster where the attribute is present and leaves blanks for the cluster where the attribute is not present.

13.6.2.1.6 Rename Cluster

The title bar of the dialog box shows the cluster to rename. Cluster ID is a number. You can change it to a string. Enter the new name and click OK.

Note:

Two different clusters cannot have the same name.

13.6.2.1.7 KM Settings

The Settings tab displays information about how the model was built:

Other Tabs: The KM Model Viewer has these other tabs:

13.6.2.1.8 Cluster Model Settings (Viewer)

The Settings tab of the model viewer contains two tabs: Cluster Model Summary and Cluster Model Input.

13.6.2.1.9 Cluster Model Summary

The Summary tab lists the following:

  • General settings lists the following information:

    • Type of Model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the Model Build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm settings list the algorithm and algorithm settings used to build the model.

13.6.2.1.10 Cluster Model Input

The Input tab is displayed only for models that can be scored. It lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining type: Categorical or numerical.

  • Data Prep: YES indicates that data preparation was performed.

When you select an attribute in the Attributes list, the transformation properties viewer displays the imbedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.6.2.2 KM Algorithm Settings

The k-Means (KM) algorithm supports these settings:

  • Number of Clusters is the maximum number of leaf clusters generated by the algorithm. The default is 10. k-Means usually produces the exact number of clusters specified, unless there are fewer distinct data points.

  • Growth Factor is a number greater than 1 and less than or equal to 5. This value specifies the growth factor for memory allocated to hold cluster data. Default is 2.

  • Convergence Tolerance must be between 0.001 (slow build) and 0.1 (fast build). The default is 0.01. The tolerance controls the convergence of the algorithm. The smaller the value, the closer the result is to the optimal solution, at the cost of longer run times. This parameter interacts with the Number of Iterations setting.

  • Distance Function specifies how the algorithm calculates distance. The default distance function is Euclidean. The other distance functions are:

    • Cosine

    • Fast Cosine

  • Number of Iterations must be greater than or equal to 1. The default is 30. This value is the maximum number of iterations for the k-Means algorithm. In general, more iterations result in a slower build. However, the algorithm may reach the maximum, or it may converge early. The convergence is determined by whether the Convergence Tolerance setting is satisfied.

  • Min Percent Attribute Support is not an integer. The range of the value for Min Percent Attribute Support is:

    • Greater than or equal to 0, and

    • Less than or equal to 1.

    The default value is 0.1. The default value enables you to highlight the more important predicates instead of producing a long list of predicates that have very low support.

    You can use this value to filter out rule predicates that do not meet the support threshold. Setting this value too high can result in very short or even empty rules.

    In extreme cases, for very sparse data, all attribute predicates may be filtered out so that no rule is produced. If no rule is produced, then you can lower the support threshold and rebuild the model to make the algorithm produce rules even if the predicate support is very low.

  • Number of Histogram Bins is a positive integer; the default value is 10. This value specifies the number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equiwidth. All attributes have the same number of bins except attributes with a single value that have only one bin.

  • Split Criterion is either Variance or Size. The default is Variance. The split criterion is related to the initialization of the k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time. Size results in placing the new cluster in the area where the largest current cluster is located. Variance places the new cluster in the area of the most spread out cluster.
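The Number of Histogram Bins setting can be illustrated with a Python sketch of equi-width binning. The bin boundaries are computed once from the entire training data, and the same boundaries are then used to count the values that fall in a single cluster. An attribute with a single value gets one bin. The function names and the data are hypothetical, and details such as how boundary values are assigned to bins are assumptions.

```python
def bin_edges(all_values, n_bins=10):
    """Equi-width bin boundaries computed globally from the training data."""
    low, high = min(all_values), max(all_values)
    if low == high:
        return [low, high]                        # single-valued attribute: one bin
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins)] + [high]


def histogram(cluster_values, edges):
    """Count the values of one cluster using the global bin boundaries."""
    n_bins = len(edges) - 1
    low, high = edges[0], edges[-1]
    counts = [0] * n_bins
    for value in cluster_values:
        if high == low:
            index = 0
        else:
            # The maximum value is placed in the last bin.
            index = min(int((value - low) / (high - low) * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


training_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
edges = bin_edges(training_values, n_bins=5)
cluster_counts = histogram([0, 1, 9, 10], edges)
print(edges, cluster_counts)
```

Because the boundaries are global, the histograms of different clusters share the same bins and can be compared bar by bar.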

13.7 Naive Bayes

The Naive Bayes (NB) algorithm is used to build Classification models. You can build, test, apply, and tune a Naive Bayes model.

The following topics describe Naive Bayes:

13.7.1 Naive Bayes Algorithm

The Naive Bayes (NB) algorithm is based on conditional probabilities. It uses Bayes' Theorem, which calculates a probability by counting the frequency of values and combinations of values in the historical data. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred.

Assumption:

Naive Bayes makes the assumption that each predictor is conditionally independent of the others. For a given target value, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.

Sometimes, the distribution of a given predictor is clearly not representative of the larger population. For example, there might be only a few customers under 21 in the training data, but in fact, there are many customers in this age group in the wider customer base. To compensate, you can specify prior probabilities (priors) when training the model.
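The counting approach and the effect of priors can be sketched in Python as follows. The sketch counts how often each attribute value occurs with each target value, multiplies the resulting conditional probabilities under the independence assumption, and normalizes the result. The function names, the data, and the prior values are hypothetical. The sketch omits refinements such as handling of values that never occur with a target value, so it is an illustration of the idea rather than the Oracle Data Mining implementation.

```python
from collections import Counter, defaultdict


def train(rows, targets):
    """Count target values and (attribute position, target) value frequencies."""
    class_counts = Counter(targets)
    value_counts = defaultdict(Counter)
    for row, target in zip(rows, targets):
        for position, value in enumerate(row):
            value_counts[(position, target)][value] += 1
    return class_counts, value_counts


def posterior(row, class_counts, value_counts, priors=None):
    """Probability of each target value for one row, assuming independent predictors."""
    total = sum(class_counts.values())
    scores = {}
    for target, count in class_counts.items():
        # Use the supplied prior if given, otherwise the observed frequency.
        score = priors[target] if priors else count / total
        for position, value in enumerate(row):
            score *= value_counts[(position, target)][value] / count
        scores[target] = score
    norm = sum(scores.values())
    return {target: score / norm for target, score in scores.items()}


rows = [("young", "yes"), ("young", "no"), ("old", "yes"),
        ("old", "yes"), ("old", "no"), ("young", "yes")]
targets = ["buy", "no", "buy", "buy", "no", "no"]

class_counts, value_counts = train(rows, targets)
observed = posterior(("old", "yes"), class_counts, value_counts)
adjusted = posterior(("old", "yes"), class_counts, value_counts,
                     priors={"buy": 0.2, "no": 0.8})
print(observed, adjusted)
```

Supplying priors changes the predicted probabilities without changing the counts collected from the training data.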

There are several Advantages of Naive Bayes.

13.7.1.1 Advantages of Naive Bayes

The advantages of the Naive Bayes model are:

  • The Naive Bayes algorithm provides fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows.

  • The Naive Bayes build process is parallelized. Scoring can also be parallelized irrespective of the algorithm.

  • Naive Bayes can be used for both binary and multiclass classification problems.

13.7.2 Naive Bayes Viewers and Algorithm Settings

This section discusses the Naive Bayes (NB) Model viewer, the procedure to view the NB Model viewer and the algorithm settings related to NB model. Topics are:

13.7.2.1 Naive Bayes Model Viewer

The NB Model Viewer lets you examine an NB model. You can view an NB model using any one of the following methods.

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

    The model viewer opens in a new tab. The default name of a Naive Bayes model has NB in the name.

The NB model viewer has these tabs:

13.7.2.1.1 Probabilities (NB)

The Probabilities tab lists the probabilities calculated during model build. You can sort and filter the order in which probabilities are displayed.

The relative value of probabilities is shown graphically as a bar, with a blue bar for positive values and red bar for negative values. For numbers close to zero, the bar may be too small to be displayed.

Select Target Value. The probabilities associated with the selected value are displayed. The default is to display the probabilities for the value that occurs least frequently.

Probabilities are listed in the Grid.

Other Tabs: The NB Model Viewer has these other tabs:

13.7.2.1.2 Grid

If no items are listed, then there are no values that satisfy the criteria that you specified:

  • Row Count: The number of rows displayed.

  • Grid Filter: Use the Grid Filter to filter the information in the grid.

The probabilities grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Probability: The probability for the value of the attribute. Probability is displayed as both a number and with a bar indicating a percentage. If the probability is very small, then no bar is displayed.

13.7.2.1.3 Fetch Size

This value limits the number of rows returned regardless of Filter or Server settings. The default fetch size is 1000. Change the Fetch Size by clicking the up or down arrows. If you change the Fetch Size, click Query.

13.7.2.1.4 Grid Filter

The filter control check enables you to filter the items that are displayed in the grid. The filtering is done as you type in the filter search box.

To see the filter categories, click the down arrow next to the binoculars icon. The following categories are supported for probabilities:

  • Attribute: Filters the Attribute (name) column. This is the default category. For example, to display all entries with CUST in the attribute name, enter CUST in the search box.

  • Value: Filters the value column.

  • Probability: Filters the probability column.

  • All (And): Enter one or more strings. Their values are compared against the Attribute and Value columns using the AND condition. For example, enter CUST M to display rows where the attribute name contains CUST and the value is M.

  • All (Or): Works the same as All (And) except that the comparison uses an OR condition.

The Grid Filter for Compare lists similar categories:

  • Name: Filters by attribute name (Default).

  • Value: Filters the value column.

  • Attribute/Value/Propensity (or): Filters for values in any of the attribute, value, and propensity columns.

  • Attribute/Value/Propensity (and): Filters for values in any of the attribute, value, and propensity columns.

  • Propensity for Target Value 1: Filters the propensity values for Target Value 1.

  • Propensity for Target Value 2: Filters the propensity values for Target Value 2.

After you enter one or more strings into the filter search box, delete is displayed. Click this icon to clear the search string.
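The difference between the All (And) and All (Or) categories can be sketched in Python. The sketch assumes that the search text is split on spaces and that each string is matched as a case-sensitive substring of the Attribute or Value column. These matching rules, the function name, and the sample rows are assumptions made for illustration; the actual matching rules of the Grid Filter may differ.

```python
def filter_rows(rows, search_text, mode="and"):
    """Keep rows whose Attribute or Value column matches all (or any) search strings."""
    terms = search_text.split()
    if not terms:
        return list(rows)                         # an empty filter keeps every row
    combine = all if mode == "and" else any
    return [
        row for row in rows
        if combine(term in row["attribute"] or term in row["value"] for term in terms)
    ]


grid = [
    {"attribute": "CUST_GENDER", "value": "M"},
    {"attribute": "CUST_GENDER", "value": "F"},
    {"attribute": "AGE", "value": "M"},
]

matches_and = filter_rows(grid, "CUST M", mode="and")
matches_or = filter_rows(grid, "CUST M", mode="or")
print(matches_and)
print(matches_or)
```

With the AND condition only the first row is kept. With the OR condition all three rows are kept, because each row matches at least one of the strings.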

13.7.2.1.5 Compare (NB)

The Compare tab enables you to compare results for two different target values. Select the two target values.

The default values for Target Value 1 and Target Value 2 are displayed. You can do the following:

  • Change the Target Values. The Target Values that you select must be different.

  • Use the Grid Filter to display specific values.

  • Change the Fetch Size.

  • Sort the grid columns. The grid for compare has these columns:

    • Attribute: Name of the attribute

    • Value: Value of the attribute

    • Propensity for target value 1

    • Propensity for target value 2

    For both propensities, a histogram bar is displayed. The maximum value of propensity is 1.0. The minimum is -1.0.

    Propensity shows which of the two target values has a more predictive relationship for a given attribute value pair. Propensity can be measured in terms of being predicted for or against a target value, where prediction against is shown as a negative value.

Other Tabs: The NB Model Viewer has these other tabs:

13.7.2.1.6 Settings (NB)

The Settings tab displays information about how the model was built:

Other Tabs: The NB Model Viewer has these other tabs:

13.7.2.1.7 Settings (NB)

The Settings tab shows information about the model.

The Settings tab has these tabs:

13.7.2.1.8 Summary (NB)

The Summary tab describes all model settings. Model settings describe characteristics of model building. The Settings are divided into:

13.7.2.1.9 Naive Bayes Algorithm Settings

This section identifies the algorithm and whether Automatic Data Preparation (ADP) is ON or OFF.

These settings are specific to Naive Bayes:

  • Pair wise Threshold: The minimum percentage of pair wise occurrences required for including a predictor in the model. The default is 0.

  • Singleton Threshold: The minimum percentage of singleton occurrences required for including a predictor in the model. The default is 0.

13.7.2.1.10 Input (NB)

The Input tab is displayed only for models that can be scored.

It lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Prep: YES indicates that data preparation was performed.

When you select an attribute in the Attributes list, the transformation properties viewer displays the imbedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.7.2.1.11 Weights (NB)

Weights that are calculated by the system for each target value are displayed on the Weights tab. If you tune the model, the weights may change.

13.7.2.1.12 Target Values (NB)

Displays the following:

  • Target Attributes

  • Data Types

  • Values of each Target Attribute

13.7.2.2 Naive Bayes Test Viewer

By default, any Classification or Regression model is automatically tested. A Classification model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the Build node and select View Results.

See Also:

"Testing Classification Models" for more information about Test Viewers

13.8 Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is the unsupervised algorithm used by Oracle Data Mining for feature extraction.

These topics describe NMF:

13.8.1 Using Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is useful when there are many attributes and the attributes are ambiguous or have weak predictability. By combining attributes, NMF can produce meaningful patterns, topics, or themes.

NMF is especially well-suited for text mining. In a text document, the same word can occur in different places with different meanings. For example, hike can be applied to the outdoors or to interest rates. By combining attributes, NMF introduces context, which is essential for predictive power:


"hike" + "mountain" -> "outdoor sports"
"hike" + "interest" -> "interest rates"

13.8.2 How Does Nonnegative Matrix Factorization Work

Nonnegative Matrix Factorization (NMF) uses techniques from multivariate analysis and linear algebra. NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attribute set. The coefficients of these linear combinations are nonnegative.

NMF decomposes a data matrix V into the product of two lower rank matrices W and H so that V is approximately equal to W times H. NMF uses an iterative procedure to modify the initial values of W and H so that the product approaches V. The procedure terminates when the approximation error converges or the specified number of iterations is reached.
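The iterative procedure can be illustrated with a small pure-Python sketch. The text above does not say which update rule Oracle Data Mining uses, so this example swaps in the well-known Lee and Seung multiplicative update rule as a stand-in. The function names, the default tolerance, and the sample matrix are assumptions made for illustration only.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Approximation error: Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, num_features, max_iterations=50, tolerance=1e-4, seed=0):
    """Factor nonnegative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)          # same seed gives the same result
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(num_features)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(num_features)]
    tiny = 1e-12                       # guards against division by zero
    previous = frobenius_error(v, w, h)
    for _ in range(max_iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + tiny) for j in range(m)]
             for k in range(num_features)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + tiny) for k in range(num_features)]
             for i in range(n)]
        error = frobenius_error(v, w, h)
        if abs(previous - error) < tolerance:   # approximation error has converged
            break
        previous = error
    return w, h

# Hypothetical 4 x 3 nonnegative data matrix, factored into 2 features.
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 0.0], [6.0, 2.0, 0.0]]
w, h = nmf(v, num_features=2)
print(frobenius_error(v, w, h))
```

Because the updates only multiply nonnegative numbers, W and H stay nonnegative throughout, which is the property that gives NMF its name.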

When applied, an NMF model maps the original data into the new set of attributes (features) discovered by the model.

13.8.3 NMF Model Viewer and Algorithm Settings

This section discusses the Nonnegative Matrix Factorization (NMF) Model viewer, the procedure to view the NMF Model viewer and the algorithm settings related to NMF. The topics are:

13.8.3.1 NMF Model Viewer

View an NMF model in one of the following ways:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The model viewer opens in a new tab. The Settings tab is displayed by default.

The NMF Model Viewer has these tabs:

13.8.3.1.1 Coefficients (NMF)

For a given Feature ID, the coefficients are displayed in the Coefficients grid. The title of the grid Coefficients x of y displays the number of rows returned out of all the rows available in the model.

By default, Feature IDs are integers.

Fetch Size limits the number of rows returned. The default is 1000 or the value specified in the Preference settings for Model Viewers.

You can perform the following tasks:

The Coefficients grid has these columns:

  • Attribute, attribute name

  • Value, value of attribute

  • Coefficient. The value is shown as a bar with the value centered in the bar. Positive values are light blue. Negative values are red.

13.8.3.1.2 Rename (NMF)

You can rename the selected Feature ID.

  1. Enter the new name in the Feature ID field.

  2. Click OK.

Note:

Different features should have different names.

13.8.3.1.3 Filter (NMF)

To view the filter categories, click find.

The filter categories are:

  • Attribute (Default): Search for an attribute name.

  • Value: This is the value column.

  • Coefficient: This is the coefficient column.

To create a filter, enter a string in the text box. After a string has been entered, a delete icon is displayed. To clear the filter, click the icon.

13.8.3.1.4 Settings (NMF)

The Settings tab has these tabs:

13.8.3.1.5 Summary (NMF)

  • General Settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

13.8.3.1.6 Inputs (NMF)

This tab shows information about the attributes used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Data Prep: YES indicates that data preparation was performed. If Data Prep is YES, then select the column (click it). The data preparation is displayed in Data Preparation at the bottom of the tab.
    To see the reverse transformation for data preparation, select Show Reverse Transformation. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.8.3.2 NMF Algorithm Settings

The Nonnegative Matrix Factorization (NMF) algorithm supports these settings:

  • Convergence Tolerance: Indicates the minimum convergence tolerance value. The default is 0.5.

  • Automatic preparation: ON (Default). Indicates automatic data preparation.

  • NMFS_NONNEGATIVE_SCORING: Enabled or Disabled. The default is Enabled (NMFS_NONNEG_SCORING_ENABLE).

  • Number of features: The default is to not specify the number of features. If you do not specify the number of features, then the algorithm determines the number of features.

    To specify the number of features, select Specify number of features and enter an integer. The number of features must be a positive integer less than or equal to the smaller of the number of attributes and the number of cases. In many cases, 5 or some other number less than or equal to 7 gives good results.

  • Number of iterations: Indicates the maximum number of iterations to be performed. The default is 50.

  • Random Seed: The random seed for the sample. The default value is -1. You can change the seed. To get the same results when you repeat the operation, use the same random seed.
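The constraint on the number of features can be expressed as a simple check. This is a minimal sketch of the rule stated in the settings list; the function name and error messages are hypothetical and do not correspond to any Oracle Data Miner API.

```python
def validate_num_features(num_features, num_attributes, num_cases):
    """Check that num_features is a positive integer <= min(attributes, cases)."""
    if isinstance(num_features, bool) or not isinstance(num_features, int):
        raise TypeError("number of features must be an integer")
    upper_bound = min(num_attributes, num_cases)
    if not 1 <= num_features <= upper_bound:
        raise ValueError(
            f"number of features must be between 1 and {upper_bound}"
        )
    return num_features

# 5 features for 20 attributes and 1000 cases is within the allowed range.
print(validate_num_features(5, num_attributes=20, num_cases=1000))
```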

13.9 Orthogonal Partitioning Clustering

Orthogonal Partitioning Clustering (O-Cluster) is a clustering algorithm that is proprietary to Oracle. The requirements to build and apply the O-Cluster algorithm:

The following topics describe O-Cluster:

13.9.1 O-Cluster Algorithm

The O-Cluster (OC) algorithm creates a hierarchical grid-based clustering model. That is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.

The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. The sensitivity parameter defines a baseline density level. Only areas with a peak density above this baseline level can be identified as clusters.

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.
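The scoring step can be sketched as a Bayesian assignment over mixture components. In the example below, each cluster carries a prior, a mean and variance for each numerical attribute, and a value-probability table for each categorical attribute. The data layout, the function names, and all the numbers are hypothetical; the sketch only illustrates the product-of-independent-distributions idea and is not Oracle's implementation.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def cluster_posteriors(case, clusters):
    """Return the probability of each cluster for one case (Bayes' rule)."""
    scores = {}
    for cluster_id, c in clusters.items():
        likelihood = c["prior"]
        for attr, (mean, variance) in c["numeric"].items():
            likelihood *= normal_pdf(case[attr], mean, variance)   # independent normals
        for attr, probabilities in c["categorical"].items():
            likelihood *= probabilities.get(case[attr], 0.0)       # multinomial term
        scores[cluster_id] = likelihood
    total = sum(scores.values())
    if total == 0.0:
        # No cluster explains the case; fall back to equal probabilities.
        return {cluster_id: 1.0 / len(scores) for cluster_id in scores}
    return {cluster_id: s / total for cluster_id, s in scores.items()}

# Two hypothetical leaf clusters.
clusters = {
    1: {"prior": 0.5, "numeric": {"age": (25.0, 16.0)},
        "categorical": {"region": {"west": 0.8, "east": 0.2}}},
    2: {"prior": 0.5, "numeric": {"age": (60.0, 25.0)},
        "categorical": {"region": {"west": 0.3, "east": 0.7}}},
}

posteriors = cluster_posteriors({"age": 27.0, "region": "west"}, clusters)
print(max(posteriors, key=posteriors.get))   # ID of the most probable cluster
```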

O-Cluster goes through the data in chunks until it converges. There is no explicit limit on the number of rows processed.

O-Cluster handles missing values naturally as missing at random. The algorithm does not support nested tables and thus does not support sparse data.

Note:

OC does not support text.

13.9.2 OC Model Viewer and Algorithm Settings

This section discusses the O-Cluster Model viewer, the procedure to view the OC Model viewer and the algorithm settings related to OC. The topics are:

13.9.2.1 OC Model Viewer

The OC Model Viewer lets you examine an OC model. You can view an OC model using one of the following methods:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The OC model viewer opens in a new tab. The Tree tab is displayed by default. The default name of an O-Cluster model has OC in the name.

The OC Model viewer has these tabs:

13.9.2.1.1 Detail (OC)

The Details tab enables you to view details for a cluster. You can discover what values of an attribute are in the selected cluster. The viewer supports filtering so that only selected probabilities are displayed.

The following information is displayed:

  • Cluster: The ID of the cluster being viewed. You can change the cluster by selecting a different ID. Select Leaves Only to view terminal clusters only.

  • Fetch Size: The number of columns selected. The default is 50. You can change the Fetch Size. If you change the Fetch Size, click Query.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Attribute: An attribute is a predictor in a predictive model or an item of descriptive information in a descriptive model. Data attributes are the columns of data that are used to build a model. Data attributes undergo transformations so that they can be used as categoricals or numericals by the model. Categoricals and numericals are model attributes.

  • Histogram: The attribute values in the selected cluster are displayed as a histogram.

    To view a larger version of the histogram, select an attribute. The histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

  • Confidence: Displayed as both a number and with a bar indicating a percentage. If the confidence is very small, then no bar is displayed.

  • Support: The number of cases.

  • Mean: Displayed for numeric attributes.

  • Mode: Displayed for categorical attributes.

  • Variance

You can perform the following tasks:

  • Sort the attributes in the cluster. To sort, click the appropriate column heading in the grid. For example, to sort by attribute name, click Attribute. The attributes are sorted by:

    • Confidence

    • Support

    • Mean

    • Mode

    • Variance

    • Attribute name

  • Search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box next to view.

  • Search the grid by Attribute. The drop down list enables you to search the grid by Attribute (the default) or by Mode. Enter the search term in the search field. To clear a search, click delete.

Other Tabs: The OC Model Viewer has this other tab:

13.9.2.1.2 Settings (OC)

The Settings tab displays information about how the model was built:

13.9.2.1.3 Summary (OC)

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm settings lists the following:

    • The name of the algorithm.

    • The settings that control the model build. Algorithm settings are specified when the build node is defined.

13.9.2.1.4 Inputs (OC)

The Inputs tab is displayed only for models that can be scored.

When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.9.2.2 OC Algorithm Settings

The O-Cluster (OC) algorithm supports these settings:

  • Number of Clusters: It is the maximum number of leaf clusters generated by the algorithm. The default is 10.

  • Buffer Size: It is the maximum size of the memory buffer, in logical records, that can be used by the algorithm. The default is 50,000 logical records.

  • Sensitivity: A number between 0 (fewer clusters) and 1 (more clusters). The default is 0.5. This value specifies the peak density required for separating a new cluster. This value is related to the global uniform density.

13.10 Singular Value Decomposition and Principal Components Analysis

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are unsupervised algorithms used by Oracle Data Mining for feature extraction.

Unlike Nonnegative Matrix Factorization, SVD and PCA are orthogonal linear transformations that are optimal for capturing the underlying variance of the data. This property is extremely useful for reducing the dimensionality of high-dimensional data and for supporting meaningful data visualization.

Note:

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) require Oracle Database 12c.

In addition to dimensionality reduction, SVD and PCA have several other important applications, such as data denoising (smoothing), data compression, matrix inversion, and solving a system of linear equations. The Oracle Data Mining implementation of SVD and PCA can effectively support all of these areas.

SVD is implemented as a feature extraction algorithm. PCA is implemented as a special scoring method for the SVD algorithm.

See Also:

For more information about SVD and PCA models and algorithm settings:

13.10.1 Build and Apply SVD and PCA Models

To build an SVD or PCA model, use a Feature Extraction Node. A Feature Extraction model creates a Feature Build Node. If you are connected to Oracle Database 12c, then a Feature Build node creates one NMF model and one PCA model. You can add an SVD model.

To apply an SVD or PCA model, use an Apply Node.

13.10.2 PCA Model Viewer and Algorithm Settings

This section discusses the PCA Model viewer, the procedure to view the PCA Model viewer and the algorithm settings related to PCA. The topics are:

13.10.2.1 PCA Model Viewer

The PCA Model Viewer lets you examine a successfully built PCA model. You can view a PCA model by using any one of the following methods.

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The model viewer opens in a new tab. The default name of a PCA model has PCA in the name.

The model viewer has these tabs:

13.10.2.1.1 Coefficients (PCA)

For a given Feature ID, the coefficients are displayed in the Coefficients grid. The title of the grid Coefficients x of y displays the number of rows returned out of all the rows available in the model. By default, Feature IDs are integers (1, 2, 3, …). The Eigenvalue for the selected Feature ID is displayed as a read-only value.

You can perform the following tasks:

The Coefficients grid has these columns:

  • Attribute

  • Singular Value

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

    The default is Sort by absolute value. To sort by signed value, deselect the option and then click Query.

13.10.2.1.2 Rename (PCA)

You can rename the selected Feature ID.

  1. Enter the new name in the Feature ID field.

  2. Click OK.

Note:

Different features should have different names.

13.10.2.1.3 Filter (PCA)

To view the filter categories, click view.

The filter categories are:

  • Attribute, the default; search for an attribute name.

  • Singular Value, the Singular value column

To create a filter, enter a string in the text box. After a string has been entered, a delete icon is displayed. To clear the filter, click the icon.

13.10.2.1.4 PCA Scree Plot

In the PCA Scree Plot:

  • Features are plotted along the X-axis.

  • Cutoff is plotted along the Y-axis.

  • Variance is plotted as a red line.

  • Cumulative percent is plotted as a blue line.

A grid below the graph shows Eigenvalue, Variance, and Cumulative Percent Variance for each Feature ID.
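The relationship between the columns of such a grid can be sketched from a list of eigenvalues. The example assumes that the variance column is the share of the total variance explained by each feature, expressed as a percentage, and that the cumulative column is its running sum. This interpretation, the function name, and the eigenvalues are assumptions for illustration.

```python
def scree_table(eigenvalues):
    """Return (feature_id, eigenvalue, percent_variance, cumulative_percent) rows."""
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for feature_id, eigenvalue in enumerate(eigenvalues, start=1):
        percent = 100.0 * eigenvalue / total   # share of total variance
        cumulative += percent                  # running sum for the cumulative line
        rows.append((feature_id, eigenvalue, percent, cumulative))
    return rows

# Hypothetical eigenvalues for four features, largest first.
for row in scree_table([4.0, 2.0, 1.0, 1.0]):
    print(row)
```

In this example the first two features account for 75 percent of the variance, which is the kind of reading a scree plot is meant to support when you choose a cutoff.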

13.10.2.1.5 PCA Details

This tab displays the values of these global details of the SVD model:

  • Number of Components

  • Suggested Cutoff

13.10.2.1.6 Settings (PCA)

The Settings tab has these tabs:

13.10.2.1.7 Summary (PCA)

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

13.10.2.1.8 Inputs (PCA)

This tab shows information about the attributes used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Data Prep: YES indicates that data preparation was performed.

    When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

    To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.10.2.2 PCA Algorithm Settings

The PCA algorithm supports these settings:

  • Number of features: The default is System Determined. To specify a value, select User specified and enter an integer value.

  • Approximate Computation: The default is System Determined. You can select either Enable or Disable. Approximate computations improve performance.

    See Also:

    "SVD Algorithm Settings" for more information about Approximate Computation

  • Projections: The default is to not select Projections.

  • Scoring Mode: It is the scoring mode to use, either Singular Value Decomposition Scoring or Principal Components Analysis Scoring. The default is Principal Components Analysis Scoring (PCA scoring).

    • When the build data is scored with SVD, the projections will be the same as the U matrix.

    • When the build data is scored with PCA, the projections will be the product of the U and S matrices.

  • U Matrix Output: Whether or not the U matrix produced by SVD persists. The U matrix in SVD has as many rows as the number of rows in the Build data. To avoid creating a large model, the U matrix persists only when U Matrix Output is enabled. When U Matrix Output is enabled, the Build data must include a Case ID. The default is Disable.
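The difference between the two scoring modes can be illustrated for a single feature. The sketch below finds the leading singular triplet (u, s, v) of a small matrix X by power iteration, so that the SVD-style projection is the column u of the U matrix and the PCA-style projection is u scaled by the singular value s, which equals X times v. The function name, the use of power iteration, and the sample data (already centered) are assumptions for this example; the text above does not describe how Oracle Data Mining computes the decomposition.

```python
import math

def top_singular_triplet(x, iterations=200):
    """Leading left vector u, singular value s, and right vector v of X."""
    n_rows, n_cols = len(x), len(x[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        xv = [sum(a * b for a, b in zip(row, v)) for row in x]                    # X v
        xtxv = [sum(x[i][j] * xv[i] for i in range(n_rows)) for j in range(n_cols)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in xtxv))
        v = [c / norm for c in xtxv]
    xv = [sum(a * b for a, b in zip(row, v)) for row in x]
    s = math.sqrt(sum(c * c for c in xv))
    u = [c / s for c in xv]
    return u, s, v

# Hypothetical centered build data: 4 cases, 2 attributes.
x = [[2.0, 1.0], [-2.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
u, s, v = top_singular_triplet(x)

svd_projection = u                       # SVD scoring: same as the U matrix column
pca_projection = [s * ui for ui in u]    # PCA scoring: U multiplied by S
print(svd_projection)
print(pca_projection)
```

The SVD projection always has unit length, while the PCA projection keeps the scale of the data along the feature.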

13.10.3 SVD Model Viewer and Algorithm Settings

This section discusses the SVD Model viewer, the procedure to view the SVD Model viewer and the algorithm settings related to SVD. The topics are:

13.10.3.1 SVD Model Viewer

The SVD Model Viewer lets you examine a successfully built SVD model. You can view an SVD model by using one of the following methods:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The model viewer opens in a new tab. The default name of an SVD model has SVD in the name.

The model viewer has these tabs:

13.10.3.1.1 Coefficients (SVD)

For a given Feature ID, the coefficients are displayed in the Coefficients grid. The title of the grid Coefficients x of y displays the number of rows returned out of all the rows available in the model. By default, Feature IDs are integers.

The Eigenvalue for the selected Feature ID is displayed as a read-only value.

Fetch Size limits the number of rows returned. The default is 1,000 or the value specified in the Preference settings for Model Viewers.

You can perform the following tasks:

The Coefficients grid has these columns:

  • Attribute, Attribute name

  • Singular Value

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

    The default is Sort by absolute value. To sort by signed value, deselect the option and then click Query.

13.10.3.1.2 Rename (SVD)

You can rename the selected Feature ID. Enter the new name and click OK. Different features should have different names.

13.10.3.1.3 Filter (SVD)

To view the filter categories, click view.

The filter categories are:

  • Attribute, the default; search for an attribute name

  • Singular Value, the singular value column

To create a filter, enter a string in the text box. After a string has been entered, a delete icon is displayed. To clear the filter, click the icon.

13.10.3.1.4 SVD Singular Values

A grid shows the Singular Value for each Feature ID.

13.10.3.1.5 SVD Details

This tab displays the values of these global details of the SVD model:

  • Number of Components

  • Suggested Cutoff

13.10.3.1.6 Settings (SVD)

The Settings tab has these tabs:

13.10.3.1.7 Summary (SVD)

  • General settings list the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm settings list the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

13.10.3.1.8 Inputs (SVD)

This tab shows information about the attributes used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Data Prep: YES indicates that data preparation was performed.

    When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

    To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.10.3.2 SVD Algorithm Settings

The SVD algorithm supports the following settings:

  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For SVD, approximation is often appropriate for data sets with many columns. An approximate low-rank decomposition provides good solutions at a reasonable computational cost. If you disable approximate computations for SVD, then approximation depends on the characteristics of the data. For data sets with more than 2500 attributes (the maximum number of features allowed) only approximate decomposition is possible. If approximate computation is disabled for a data set with more than 2500 attributes, then an exception is raised.

    Values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable

  • Automatic Preparation: ON or OFF. The default is ON.

  • Number of Features: System Determined (Default). You can specify a number.

  • Scoring Mode: It is the scoring mode to use, either Singular Value Decomposition Scoring or Principal Components Analysis Scoring. The default is Singular Value Decomposition Scoring (SVD scoring).

    • When the build data is scored with SVD, the projections will be the same as the U matrix.

    • When the build data is scored with PCA, the projections will be the product of the U and S matrices.

  • U Matrix Output: Whether or not the U matrix produced by SVD persists. The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix persists only when U Matrix Output is enabled. When U Matrix Output is enabled, the build data must include a Case ID. The default is Disable.

13.11 Support Vector Machine

You can use the Support Vector Machine (SVM) algorithm to build Classification, Regression, and Anomaly Detection models. The following topics explain Support Vector Machine:

13.11.1 Support Vector Machine Algorithms

The Support Vector Machine (SVM) algorithms are a suite of algorithms that can be used with a variety of problems and data. By changing one kernel for another, SVM can solve a variety of data mining problems. Oracle Data Mining supports two kernel functions:

  • Linear

  • Gaussian

The key features of SVM are:

  • SVM can emulate traditional methods, such as Linear Regression and Neural Nets, but goes far beyond those methods in flexibility, scalability, and speed.

  • SVM can be used to solve the following kinds of problems: Classification, Regression, and Anomaly Detection.

    Oracle Data Mining uses SVM as the one-class classifier for anomaly detection. When SVM is used for anomaly detection, it has the classification mining function but no target. Applying a One-class SVM model results in a prediction and a probability for each case in the scoring data. If the prediction is 1, then the case is considered typical. If the prediction is 0, then the case is considered anomalous.

13.11.1.1 How Support Vector Machines Work

Data records with n attributes can be considered as points in n-dimensional space. SVM attempts to separate the points into subsets with homogeneous target values. Points are separated by hyperplanes in the linear case, and by non-linear separators in the non-linear case (Gaussian). SVM finds those vectors that define the separators giving the widest separation of classes (the support vectors). This is easy to visualize if n = 2; in that case, SVM finds a straight line (linear) or a curve (non-linear) separating the classes of points in the plane.
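For the linear case, the separating hyperplane can be written as w·x + b = 0, and a point is classified by the side of the hyperplane it falls on. The sketch below assumes the usual canonical scaling, in which the support vectors satisfy |w·x + b| = 1, so that the margin width is 2 / ||w||. The weight vector, the offset, and the function names are hypothetical values chosen for illustration.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin hyperplanes under canonical scaling."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separator in the plane (n = 2): the line 3x + 4y - 5 = 0.
w, b = [3.0, 4.0], -5.0
print(predict(w, b, [3.0, 4.0]))   # point on the positive side
print(predict(w, b, [0.0, 0.0]))   # point on the negative side
print(margin_width(w))
```

Maximizing the separation of classes corresponds to making ||w|| as small as possible while still classifying the training points correctly.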

SVM solves regression problems by defining an n-dimensional tube around the data points, determining the vectors giving the widest separation.

13.11.1.2 SVM Kernel Functions

The Support Vector Machine (SVM) algorithm supports two kernel functions: Gaussian and Linear. The choice of kernel function depends on the type of model (classification or regression) that you are building and on your data.

When you choose a Kernel function, select one of the following:

  • System Determined (Default)

  • Gaussian

  • Linear

For Classification models and Anomaly Detection models, use the Gaussian kernel for solving problems where the classes are not linearly separable, that is, the classes cannot be separated by lines or planes. Gaussian kernel models allow for powerful non-linear class separation modeling. If the classes are linearly separable, then use the linear kernel.

For Regression problems, the linear kernel is similar to approximating the data with a line. The linear kernel is more robust than fitting a line to the data. The Gaussian kernel approximates the data with a non-linear function.
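The two kernel functions can be written in a few lines. The linear kernel is the dot product of two points; the Gaussian kernel decays with the squared distance between them. The parameter name `sigma` and its default value are assumptions made for this sketch, and the exact parameterization used by Oracle Data Mining may differ.

```python
import math

def linear_kernel(x, y):
    """Dot product of two points."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); equals 1 when x and y coincide."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))
print(gaussian_kernel([0.0, 0.0], [1.0, 0.0]))
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0]))
```

Because the Gaussian kernel measures similarity by distance rather than by direction, it can separate classes that no line or plane can separate.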

13.11.2 Building and Testing SVM Models

This section describes how to build and test SVM models. You specify building a model by connecting the Data Source node that represents the build data to an appropriate Build node.

By default, a Classification or Regression node tests all the models that it builds. By default, the test data is created by splitting the input data into build and test subsets. Alternatively, you can connect two data sources to the build node, or you can test the model using a Test Node.
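Splitting the input data into build and test subsets can be sketched as a seeded random partition. The function name, the `test_fraction` parameter, and the value 0.4 used in the example are hypothetical; the split ratio that a node actually uses is configured in the node and is not stated here.

```python
import random

def split_build_test(rows, test_fraction, seed=0):
    """Randomly partition rows into (build, test) subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Ten hypothetical case IDs, with 40 percent held out for testing.
build, test = split_build_test(range(10), test_fraction=0.4)
print(len(build), len(test))
```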

You can build three kinds of SVM models:

13.11.2.1 SVM Classification Models

SVM Classification (SVMC) is based on the concept of decision planes that define decision boundaries. A decision plane separates sets of objects that have different class memberships. SVM finds the vectors (support vectors) that define the separators giving the widest separation of classes.

SVMC supports both binary and multiclass targets.

To build and test an SVMC model, use a Classification Node. By default, the Classification Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test Node.

After you test an SVMC model, you can tune it.

SVMC uses SVM Weights to specify the relative importance of target values.

13.11.2.1.1 SVM Weights

SVM models are automatically initialized to achieve the best average prediction across all classes. If the training data does not represent a realistic distribution, then you can bias the model to compensate for class values that are under-represented. If you increase the weight for a class, then the percentage of correct predictions for that class should increase.

13.11.2.2 SVM Regression Models

SVM uses an epsilon-insensitive loss function to solve regression problems. SVM Regression (SVMR) tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.

The epsilon factor is a regularization setting for SVMR. It balances the margin of error with model robustness to achieve the best generalization to new data.
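The epsilon-insensitive loss can be written as max(0, |actual - predicted| - epsilon): errors smaller than epsilon cost nothing, and larger errors are penalized only by the amount that exceeds epsilon. The function name and the sample values below are chosen for illustration.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)

# A prediction within epsilon of the target is not counted as an error.
print(epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1))
# A prediction outside the tube is penalized by the excess over epsilon.
print(epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1))
```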

To build and test an SVMR model, use a Regression Node. By default, the Regression Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test Node.

See Also:

"Testing Regression Models" for more information about testing an SVMR model

13.11.2.3 SVM Anomaly Detection Models

Oracle Data Mining uses one-class SVM for anomaly detection (AD). There is no target for anomaly detection.

To build an AD model, use an Anomaly Detection Node connected to an appropriate data source.

13.11.3 Applying SVM Models

You apply a model to new data to predict behavior. Use an Apply Node to apply an SVM model.

You can apply all three kinds of SVM models.

See Also:

"Applying One-Class SVM Models" for more information about interpreting the apply results for One-Class SVM (anomaly detection) models

13.11.3.1 Applying One-Class SVM Models

One-class SVM models, when applied, produce a prediction and a probability for each case in the scoring data. This behavior reflects the fact that the model is trained with normal data.

  • If the prediction is 1, then the case is considered typical.

  • If the prediction is 0, then the case is considered anomalous.
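The mapping from prediction value to interpretation can be sketched as follows. The function name and the sample scores are hypothetical; only the meaning of 1 and 0 comes from the text above.

```python
def describe_prediction(prediction: int, probability: float) -> str:
    """Translate a one-class SVM prediction (1 or 0) into a readable label."""
    labels = {1: "typical", 0: "anomalous"}
    if prediction not in labels:
        raise ValueError(f"unexpected prediction value: {prediction!r}")
    return f"{labels[prediction]} (probability {probability:.2f})"


# Hypothetical scored cases: (prediction, probability).
scored_cases = [(1, 0.93), (0, 0.71)]
for prediction, probability in scored_cases:
    print(describe_prediction(prediction, probability))
```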

13.11.4 SVM Model Viewers and Algorithm Settings

This section discusses the SVM Model viewer, the procedure to view the SVM Model viewer and the algorithm settings related to SVM. The topics are:

13.11.4.1 SVM Classification Model Viewer

The SVM Model Viewer lets you examine an SVM Classification model. You can view an SVMC model by using one of the following methods:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view. The model viewer opens in a new tab. The information displayed in the model viewer depends on which kernel was used to build the model.

    • If the Gaussian kernel was used, then there is one tab, Settings.

    • If the Linear Kernel was used, then there are three tabs, Coefficients, Compare, and Settings.

The tabs displayed in an SVMC model viewer depend on the kernel used to build the model:

13.11.4.1.1 SVMC Model Viewer for Models with Linear Kernel

If the SVMC model has a Linear kernel, then the viewer has these tabs:

13.11.4.1.2 SVMC Model Viewer for Models with Gaussian Kernel

If the SVMC model has a Gaussian kernel, then the viewer has these tabs:

13.11.4.1.3 Coefficients (SVMC Linear)

Support Vector Machine Models built with the Linear Kernel have coefficients; the coefficients are real numbers. The number of coefficients may be quite large.

The Coefficients tab enables you to view SVM coefficients. The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

Coefficients are displayed in the Coefficients Grid (SVMC). The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

13.11.4.1.4 Coefficients Grid (SVMC)

The coefficients grid has these controls:

  • Target Value: Select a specific target value and see the coefficients associated with that value. The default is to display the coefficients for the value that occurs least frequently.

  • Sort By Absolute Value: If selected, coefficients are sorted by absolute value. If you sort by absolute value, then a coefficient of -2 comes before a coefficient of 1.9. The default is to sort by absolute value.

  • Fetch Size: The number of rows displayed. To figure out if all the coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.
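The effect of sorting by absolute value can be sketched with a small example. The attribute names and coefficient values are hypothetical, and descending order is an assumption inferred from the statement that -2 comes before 1.9.

```python
# Hypothetical (attribute, coefficient) pairs, for illustration only.
coefficients = [("AGE", 1.9), ("INCOME", -2.0), ("TENURE", 0.3)]

# Sort by absolute value: the sign is ignored, so -2.0 ranks ahead of 1.9.
by_absolute_value = sorted(coefficients, key=lambda item: abs(item[1]), reverse=True)

# Sort by signed value: negative coefficients fall to the end.
by_signed_value = sorted(coefficients, key=lambda item: item[1], reverse=True)

print([name for name, _ in by_absolute_value])
print([name for name, _ in by_signed_value])
```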

You can search for attributes by name. Use view. If no items are listed in the grid, then there are no coefficients for the selected target value.

The coefficients grid has these columns:

  • Attribute: Name of the attribute.

  • Value: Value of the attribute. If the attribute is binned, then this may be a range.

  • Coefficient: The coefficient for the value of the attribute, for the selected target value.

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

13.11.4.1.5 Compare (SVMC Linear)

Support Vector Machine Models built with the Linear kernel allow the comparison of target values. For selected attributes, Data Miner calculates the propensity (that is, the natural inclination or preference) to favor one of two target values. For example, propensity for target value 1 is the propensity to favor target value 1.

See Also:

"Propensity"

To compare target values:

  1. Select how to display information:

    • Fetch Size: The default fetch size is 1000 attributes. You can change this number.

    • Sort by absolute value is the default. You can deselect this option.

  2. Select two distinct target values to compare:

    • Target Value 1: Select the first target value.

    • Target Value 2: Select the second target value.

  3. Click Query. If you have not changed any defaults, then this step is not necessary.

The following information is displayed in the grid:

  • Attribute: The name of the attribute.

  • Value: The value of the attribute.

  • Propensity for Target_Value_1: Propensity to favor Target Value 1.

  • Propensity for Target_Value_2: Propensity to favor Target Value 2.

You can Search the grid in several ways:

13.11.4.1.6 Search

Use view to search the grid.

You can search by name (the default), by value, and by propensity for Target Value 1 or propensity for Target Value 2.

  • To select a different search option, click the triangle beside the binoculars.

  • To clear a search, click delete.

13.11.4.1.7 Propensity

Propensity shows, for a given attribute/value pair, which of the two target values has the stronger predictive relationship with that pair. Propensity can be measured in terms of being predicted for or against a target value. If propensity is against a value, then the number is negative.

13.11.4.1.8 Settings (SVMC)

The Settings tab displays information about how the model was built:

  • Summary (SVMC) tab: Contains Model and Algorithm settings.

  • Inputs (SVMC) tab: Contains attributes used to build the model.

  • Target Values (SVMC) tab: Contains targets.

  • Cost Matrix/Benefit tab: If you tune the model, then this tab displays the cost matrix created by tuning.

13.11.4.1.9 Summary (SVMC)

  • General settings list the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments.

  • Algorithm settings list the following:

13.11.4.1.10 Settings (SVMC Linear)

The Settings tab has these tabs:

13.11.4.1.11 Inputs (SVMC)

A list of the attributes used to build the model. For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Prep: YES indicates that data preparation was performed.

    When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

    To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.11.4.1.12 Target Values (SVMC)

Shows the values of the target attributes.

  • Click view to search for target values.

  • Click delete to clear search.

13.11.4.1.13 Weights (SVMC)

In Support Vector Machines classifications, weights are a biasing mechanism for specifying the relative importance of target values (classes). SVM models are automatically initialized to achieve the best average prediction across all classes. However, if the training data does not represent a realistic distribution, then you can bias the model to compensate for class values that are underrepresented. If you increase the weight for a class, then the percentage of correct predictions for that class should increase.
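One common way to express class weights is to scale the hinge loss of each example by the weight of its class, so that errors on a heavily weighted class cost more. The sketch below shows that textbook idea only; it is not Oracle Data Mining's implementation, and the function name, scores, and weights are hypothetical.

```python
def weighted_hinge_loss(examples, class_weights):
    """Sum the hinge loss of (label, score) pairs, scaled by each class weight.

    Labels are -1 or +1; score is the raw model output for the example.
    """
    total = 0.0
    for label, score in examples:
        hinge = max(0.0, 1.0 - label * score)
        total += class_weights[label] * hinge
    return total


# Hypothetical scored examples; +1 is the under-represented class.
examples = [(+1, 0.2), (-1, -1.5), (-1, 0.5)]
equal_weights = weighted_hinge_loss(examples, {+1: 1.0, -1: 1.0})
biased_weights = weighted_hinge_loss(examples, {+1: 3.0, -1: 1.0})
print(equal_weights, biased_weights)
```

With the larger weight, the poorly scored +1 example contributes three times as much to the total, which pushes training toward predicting that class correctly.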

13.11.4.1.14 Algorithm Settings for SVMC

For Classification, the SVM algorithm has these settings:

  • Algorithm Name: Support Vector Machine

  • Kernel Function: Gaussian or Linear

  • Tolerance value: The default is 0.001

  • Specify complexity factor: By default, it is not specified.

  • Active Learning: ON

  • Standard Deviation (Gaussian Kernel only)

  • Cache Size (Gaussian Kernel only)

See Also:

"SVM Classification Algorithm Settings" for more information about these settings

13.11.4.2 SVM Classification Test Viewer

By default, any Classification or Regression model is automatically tested. A Classification model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the build node and select View Results.

See Also:

"Testing Classification Models" for more information about the Test Viewers

13.11.4.3 SVM Classification Algorithm Settings

The settings that you can specify for the Support Vector Machine (SVM) algorithm depend on the Kernel function that you select. See "SVM Kernel Functions" for information about how to select a Kernel Function.

The meaning of the individual settings is the same for both Classification and Regression.

To edit SVM Classification algorithm settings:

  1. You can edit the settings by using one of the following options:

    • Right-click the Classification node and select Advanced Settings.

    • Right-click the Classification node and select Edit. Then, click Advanced.

  2. In the Algorithm Settings tab, the settings are available. Select the Kernel Function. The options are:

  3. Click OK after you are done.

13.11.4.3.1 Algorithm Settings for Linear or System Determined Kernel (SVMC)

If you specify a linear kernel or if you let the system determine the kernel, then you can change the following settings:

13.11.4.3.2 Algorithm Settings for Gaussian Kernel (SVMC)

If you specify the Gaussian kernel, then you can change the following settings:

13.11.4.4 SVM Regression Model Viewer

The SVM Regression Model Viewer lets you examine an SVMR model. You can view an SVMR model by using one of the following methods:

  • Method 1

    1. Right-click the node where the model was built.

    2. Select Go to Properties.

    3. In the Models section in Properties, click view.

  • Method 2

    1. Select the workflow node where the model was built.

    2. Right-click and click View Models.

    3. Select the model to view.

The information displayed in the model viewer depends on which kernel was used to build the model.

  • If the Gaussian kernel was used, then there is one tab, Settings.

  • If the Linear Kernel was used, then there are three tabs: Coefficients, Compare, and Settings.

The tabs displayed in an SVMR model viewer depend on the kernel used to build the model:

13.11.4.4.1 SVMR Model Viewer for Models with Linear Kernel

If the SVMR model has a linear kernel, then the viewer has these tabs:

13.11.4.4.2 SVMR Model Viewer for Models with Gaussian Kernel

If the SVMR model has a Gaussian kernel, then the viewer has these tabs:

13.11.4.4.3 Coefficients (SVMR)

Support Vector Machine Models built with the Linear Kernel have coefficients. The coefficients are real numbers. The number of coefficients may be quite large.

The Coefficients tab enables you to view SVMR coefficients. The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

The coefficients are displayed in the SVMR Coefficients Grid. The relative value of the coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

13.11.4.4.4 SVMR Coefficients Grid

Information about coefficients is organized as follows:

  • Sort by absolute value: The default is to sort by absolute value. For example, 1 and -1 have the same absolute value. If you change this value, then you must click Query.

  • Fetch Size: It is the maximum number of rows to fetch; the default is 1,000. Smaller values result in faster fetches. If you change this value, then you must click Query.

  • Coefficients: The number of coefficients displayed; for example, 95 out of 95, indicating that there are 95 coefficients and all 95 of them are displayed.

You can perform the following tasks:

  • Search: Use view to search for items. You can search by:

    • Attribute name (Default)

    • Value

    • Coefficient

    • All (AND): If you search by this criterion, then you search for items that satisfy all the criteria specified. For example, a search for ED Bac finds all attributes where both values appear.

    • All (Or): If you search by this criterion, then you search for attributes that include at least one value.

  • Clear search: To clear a search, click delete.

  • To select a different search option, click the triangle beside the binoculars.
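The difference between the All (AND) and All (Or) search options can be sketched as follows. Case-insensitive substring matching, the function name, and the sample rows are assumptions made for the example; the viewer's exact matching rules are not described here.

```python
def matches(row, terms, mode):
    """Return True if the row text contains all ('and') or any ('or') of the terms."""
    text = " ".join(row).lower()
    hits = [term.lower() in text for term in terms]
    if mode == "and":
        return all(hits)
    if mode == "or":
        return any(hits)
    raise ValueError("mode must be 'and' or 'or'")


# Hypothetical (attribute, value) rows from a coefficients grid.
rows = [("EDUCATION", "Bach."), ("EDUCATION", "Masters"), ("OCCUPATION", "Sales")]
terms = "ED Bac".split()

and_rows = [row for row in rows if matches(row, terms, "and")]
or_rows = [row for row in rows if matches(row, terms, "or")]
print(and_rows)
print(or_rows)
```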

Coefficients are listed in a grid. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The value of each coefficient for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. If a value is close to 0, then the bar may be too small to be displayed.

13.11.4.4.5 Inputs (SVMR)

A list of the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Prep: YES indicates that data preparation was performed.

    When you select an attribute in the Attributes list, the transformation properties viewer displays the embedded transformation, created by either the user or Automatic Data Preparation, in the Model transformations list.

    To see the reverse transformation, click Show Reverse Expression. Transformations are displayed in SQL notation. Not all transformations have a reverse. Transformations and Reverse transformations are not always displayed.

13.11.4.4.6 Settings (SVMR)

The Settings tab displays information about how the model was built:

13.11.4.4.7 Summary (SVMR)

  • General settings list the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments.

  • Algorithm settings list the following:

13.11.4.5 SVM Regression Test Viewer

By default, any Classification or Regression model is automatically tested. A Regression model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the Build node and select View Results.

13.11.4.6 SVM Regression Algorithm Settings

The settings that you can specify for the Support Vector Machine (SVM) algorithm depend on the Kernel function that you select.

See Also:

"SVM Kernel Functions" for information about how to select a Kernel Function

The meaning of the individual settings is the same for both classification and regression.

To edit SVM Regression algorithm settings:

  1. You can edit the settings by using one of the following options:

    • Right-click the Regression node and select Advanced Settings.

    • Right-click the Regression node and select Edit. Then click Advanced.

  2. In the Algorithm Settings tab, the settings are available. Select Kernel Function. The options are:

  3. Click OK after you are done.

13.11.4.6.1 Algorithm Settings for Linear or System Determined Kernel (SVMR)

If you specify a linear kernel or if you let the system determine the kernel, then you can change the following settings for an SVM Regression model:

13.11.4.6.2 Algorithm Settings for Gaussian Kernel (SVMR)

If you specify the Gaussian kernel, then you can change the following settings for an SVM Regression model:

13.11.4.6.3 Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors that maintain accuracy while enhancing the speed of the model. The key features of Active Learning are:

  • Increases performance for a linear kernel. For a Gaussian kernel, Active Learning both increases performance and reduces the size of the model. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

13.11.4.6.4 Automatic Data Preparation

Most algorithms require some form of data transformation. During the model building process, Oracle Data Mining can automatically perform the transformations required by the algorithm. You can supplement the automatic transformations with additional transformations of your own, or you can manage all the transformations yourself.

In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. In most cases, this process results in reasonable model quality.

13.11.4.6.5 Cache Size (Gaussian Kernel)

If you select the Gaussian kernel, then you can specify the size of the cache used for storing computed kernels during the build operation. The default size is 50 megabytes.

The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach taken to build is to converge within a chunk of data at a time, then to test for violators outside of the chunk. The build is complete when there are no more violators within tolerance. The size of the chunk is chosen such that the associated kernels can be maintained in memory in a Kernel Cache. The larger the chunk size, the better the chunk represents the population of training data and the fewer times new chunks must be created. Generally, larger caches imply faster builds.
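To get a feel for how cache size limits chunk size, the back-of-the-envelope sketch below assumes that a chunk of n rows needs a full n-by-n matrix of kernel values, each stored in 8 bytes. Both assumptions are illustrative only; the actual storage layout used by Oracle Data Mining is not described here, and the function name is hypothetical.

```python
import math


def max_chunk_rows(cache_megabytes: int, bytes_per_kernel_value: int = 8) -> int:
    """Largest n such that an n-by-n kernel matrix fits in the cache.

    Assumes a dense square matrix and a fixed size per kernel value.
    """
    cache_bytes = cache_megabytes * 1024 * 1024
    return math.isqrt(cache_bytes // bytes_per_kernel_value)


print(max_chunk_rows(50))   # the default cache size
print(max_chunk_rows(100))  # doubling the cache grows the chunk by about 41%
```

Under these assumptions the chunk size grows only with the square root of the cache size, so large increases in cache produce more modest increases in chunk size.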

13.11.4.6.6 Complexity Factor

The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify the complexity factors.

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity. Its purpose is to avoid both over-fit (an over-complex model fitting noise in the training data) and under-fit (a model that is too simple).

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.
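The trade-off can be illustrated with the textbook soft-margin objective for a linear SVM, in which the complexity factor multiplies the total error term. This is a sketch of the standard formulation, not Oracle Data Mining's internal objective; the function name, weights, and data points are hypothetical.

```python
def soft_margin_objective(weights, bias, examples, complexity_factor):
    """Return 0.5 * ||w||^2 + C * (sum of hinge losses) for a linear model."""
    regularizer = 0.5 * sum(w * w for w in weights)
    hinge_total = 0.0
    for features, label in examples:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        hinge_total += max(0.0, 1.0 - label * score)
    return regularizer + complexity_factor * hinge_total


# Hypothetical data: the third example lies on the wrong side of the boundary.
examples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
weights, bias = (1.0, 0.0), 0.0

low_penalty = soft_margin_objective(weights, bias, examples, complexity_factor=0.1)
high_penalty = soft_margin_objective(weights, bias, examples, complexity_factor=10.0)
print(low_penalty, high_penalty)
```

With a small complexity factor the misclassified example barely affects the objective, so the simpler model is preferred; with a large one the same error dominates, and training would favor a model that separates the classes more exactly.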

If the histogram of the target attribute is skewed to the left or to the right, then try increasing complexity.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. If you specify a complexity factor for Anomaly Detection, then the default is 1.

13.11.4.6.7 Standard Deviation (Gaussian Kernel)

If you select the Gaussian kernel, then you can specify the standard deviation of the Gaussian kernel. This value must be positive. The default is to not specify the standard deviation.

For anomaly detection, if you specify standard deviation, then the default is 1.
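For reference, the linear kernel and the commonly used form of the Gaussian kernel can be sketched as follows. The formula exp(-||x - y||^2 / (2 * sigma^2)) is the textbook definition; the exact parameterization used by Oracle Data Mining is an assumption here, and the function names are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, std_dev):
    """Textbook Gaussian kernel; std_dev must be positive."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * std_dev ** 2))


print(linear_kernel((1.0, 2.0), (3.0, 4.0)))
print(gaussian_kernel((0.0, 0.0), (1.0, 0.0), std_dev=1.0))
print(gaussian_kernel((0.0, 0.0), (1.0, 0.0), std_dev=5.0))
```

A larger standard deviation makes the kernel value fall off more slowly with distance, so more distant points are treated as similar.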

13.11.4.6.8 Tolerance Value

Tolerance value is the maximum size of a violation of convergence criteria such that the model is considered to have converged. The default value is 0.001. Larger values imply faster building but less accurate models.

13.12 Settings Information

This section contains topics about settings that are common to most algorithms:

13.12.1 General Settings

The Settings tab of a model viewer displays settings in two categories:

  • General Settings, described in this topic

  • Algorithms Settings that are specific to the selected algorithm

These General Settings are provided for all algorithms:

  • Type: The mining function for the model: anomaly detection, association rules, attribute importance, classification, clustering, feature extraction, or regression.

  • Owner: The data mining account (schema) used to build the model.

  • Model Name: The name of the model.

  • Target Attribute: The target attribute; only Classification and Regression models have targets.

  • Creation Date: The date when the model was created, in the form MM/DD/YYYY.

  • Duration: Time in minutes required to build the model.

  • Size: The size of the model in megabytes.

  • Comment: For models not created using Oracle Data Miner, this option displays comments embedded in the models. To see comments for models built using Oracle Data Miner, go to Properties for the node where the model is built.

    Models created using Oracle Data Miner may contain BALANCED, NATURAL, CUSTOM, or TUNED. Oracle Data Miner inserts these values to indicate if the model has been tuned and in what way it was tuned.

13.12.2 Other Settings

Other Settings include:

  • Limit Number of Attributes in Each Rule: Sets the maximum number of attributes in each rule. By default, this option is selected. The number must be an integer between 2 and 20. Higher values result in slower builds. You can change the number of attributes in a rule, or you can specify no limit for the number of attributes in a rule. It is a good practice to start with the default and increase this number slowly.

    • To specify no limit, deselect this option.

    • Specifying many attributes in each rule increases the number of rules considerably.

    • The default is 3.

  • Automatic preparation: ON or OFF. ON signifies that Automatic Data Preparation (ADP) is used for normalization and outlier detection. The SVM algorithm automatically handles missing value treatment and the transformation of categorical data. Normalization and outlier detection must be handled by ADP or prepared manually. The default is ON.

  • Minimum Support: A number between 0 and 100 indicating a percentage. Smaller values for support result in slower builds and require more system resources. The default is 5%.

  • Minimum Confidence: Confidence in the rules. A number between 0 and 100 indicating a percentage. High confidence results in a faster build. The default is 10%.
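To make the Minimum Support and Minimum Confidence percentages concrete, the sketch below computes both measures for a single rule using their standard definitions. The function name, the transactions, and the rule are hypothetical; only the 5% and 10% defaults come from the list above.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support %, confidence %) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = 100.0 * n_both / len(transactions)
    confidence = 100.0 * n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_and_confidence(transactions, {"bread"}, {"milk"})

# The rule is kept only if it meets both thresholds (defaults: 5% and 10%).
rule_is_kept = support >= 5.0 and confidence >= 10.0
print(support, confidence, rule_is_kept)
```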

13.12.3 Epsilon Value

You specify the epsilon value for an SVM model by clicking the option Yes in answer to the question Do you want to specify an epsilon value?. The epsilon value must be either greater than 0 or undefined.

SVM makes a distinction between small errors and large errors. The difference is defined by the epsilon value. The algorithm calculates and optimizes an epsilon value internally, or you can supply a value.

  • If the number of support vectors defined by the model is very large, then try a larger epsilon value.

  • If there are very high cardinality categorical attributes, then try decreasing epsilon.

By default, no epsilon value is specified. In such a case, the algorithm calculates an epsilon value.