13 Data Mining Algorithms

The models in Oracle Data Miner are supported by different data mining algorithms.

The algorithms supported by Oracle Data Miner are:

See Also:

Oracle Data Mining Concepts for more information about data mining functions, data preparation, scoring, and data mining algorithms.

Anomaly Detection

Anomaly Detection (AD) identifies cases that are unusual within data that is apparently homogeneous.

Anomaly detection is an important tool for fraud detection, network intrusion, and other rare events that may have great significance but are hard to find.

Oracle Data Mining uses Support Vector Machine (SVM) as the one-class classifier for Anomaly Detection (AD). When SVM is used for anomaly detection, it has the classification mining function but no target.

There are two ways to search for anomalies:

  • By building and applying an Anomaly Detection model. To build an AD model, use an Anomaly Detection node, connected to an appropriate Data Source node.

  • By using an Anomaly Detection query, one of the Predictive Query nodes.

Applying Anomaly Detection Models

Oracle Data Mining uses Support Vector Machine (SVM) as the one-class classifier for Anomaly Detection (AD).

When SVM is used for Anomaly Detection, it has the Classification mining function but no target. One-class SVM models, when applied, produce a prediction and a probability for each case in the scoring data.

  • If the prediction is 1, then the case is considered Typical.

  • If the prediction is 0, then the case is considered Anomalous.

This behavior reflects the fact that the model is trained with normal data.
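As an illustration only (Oracle Data Miner itself requires no code for this), the same Typical/Anomalous labeling can be reproduced with the one-class SVM in scikit-learn, an open-source analogue of the algorithm described here. Note that scikit-learn returns +1/-1 rather than Oracle's 1/0, so the sketch maps the predictions accordingly:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # training data: normal cases only
model = OneClassSVM(kernel="rbf", nu=0.1).fit(normal)

scoring = np.array([[0.1, -0.2],    # close to the training distribution
                    [8.0, 8.0]])    # far outside it
raw = model.predict(scoring)        # scikit-learn returns +1 / -1
labels = np.where(raw == 1, 1, 0)   # map to Oracle-style 1 (Typical) / 0 (Anomalous)
print(labels)                       # prints [1 0]
```

Because the model is trained only on normal data, the second case falls outside the learned boundary and is labeled 0 (Anomalous).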

Algorithm Settings for AD

The algorithm for Anomaly Detection is one-class SVM.

The kernel setting is one of the following:

  • System Determined (Default)

  • Gaussian

  • Linear

The settings that you can specify for any version of the Support Vector Machine (SVM) algorithm depend on which SVM kernel function you select.

Note:

After the model is built, the kernel function used (Linear or Gaussian) is displayed in Kernel Function in Algorithm Settings.

Related Topics

Anomaly Detection Algorithm Settings for Linear or System Determined Kernel

The Anomaly Detection algorithm settings for linear kernel or system determined kernel include Tolerance Value, Complexity Factor, Rate of Outliers, and Active Learning.

If you specify a linear kernel or if you let the system determine the kernel, then you can change the following settings:

Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors, maintaining accuracy while enhancing the speed of the model.

Note:

Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

The key features of Active Learning are:

  • Increases performance for a linear kernel. Active learning both increases performance and reduces the size of the Gaussian kernel. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

Complexity Factor

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity.

Its purpose is to avoid over-fit (an over-complex model that fits noise in the training data) and under-fit (a model that is too simple). The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify the complexity factors.

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

If the histogram of the target attribute is skewed to the left or to the right, then try increasing complexity.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. If you specify a complexity factor for Anomaly Detection, then the default is 1.

Rate of Outliers

The rate of outliers is the approximate rate of outliers (negative predictions) produced by a one-class SVM model on the training data. This rate indicates the percent of suspicious records.

The rate is a number greater than 0 and less than or equal to 1. The default value is 0.1.

If you do not want to specify the rate of outliers, then deselect Specify the Rate of Outliers.

Tolerance Value

Tolerance Value is the maximum size of a violation of convergence criteria such that the model is considered to have converged.

The default value is 0.001. Larger values imply faster building but less accurate models.
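For readers who want to experiment outside Oracle Data Miner, the settings above map loosely onto the parameters of scikit-learn's one-class SVM. The correspondence (Tolerance Value to `tol`, Rate of Outliers to `nu`) is an assumption made here for illustration, not Oracle's API:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, size=(500, 3))  # hypothetical "normal" training data

# Assumed analogues of the settings above (not Oracle's API):
#   Tolerance Value  -> tol  (convergence tolerance; default 0.001 here as well)
#   Rate of Outliers -> nu   (upper bound on the fraction of training cases
#                             predicted as outliers; default 0.1 above)
model = OneClassSVM(kernel="linear", tol=0.001, nu=0.1).fit(train)

outlier_rate = float(np.mean(model.predict(train) == -1))
print(round(outlier_rate, 2))  # approximately the requested rate of outliers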

Anomaly Detection Algorithm Settings for Gaussian Kernel

The Anomaly Detection algorithm settings for Gaussian kernel include Tolerance Value, Complexity Factor, Rate of Outliers, Active Learning, Standard Deviation, and Cache Size.

If you specify the Gaussian Kernel, then you can change the following settings:

Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors, maintaining accuracy while enhancing the speed of the model.

Note:

Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

The key features of Active Learning are:

  • Increases performance for a linear kernel. Active learning both increases performance and reduces the size of the Gaussian kernel. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

Cache Size (Gaussian Kernel)

If you select the Gaussian kernel, then you can specify the size of the cache used for storing computed kernels during the build operation.

The default size is 50 megabytes.

The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach taken to build is to converge within a chunk of data at a time, then to test for violators outside of the chunk. The build is complete when there are no more violators within tolerance. The size of the chunk is chosen such that the associated kernels can be maintained in memory in a Kernel Cache. The larger the chunk size, the better the chunk represents the population of training data and the fewer number of times new chunks must be created. Generally, larger caches imply faster builds.

Complexity Factor

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity.

Its purpose is to avoid over-fit (an over-complex model that fits noise in the training data) and under-fit (a model that is too simple). The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify the complexity factors.

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

If the histogram of the target attribute is skewed to the left or to the right, then try increasing complexity.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. If you specify a complexity factor for Anomaly Detection, then the default is 1.

Rate of Outliers

The rate of outliers is the approximate rate of outliers (negative predictions) produced by a one-class SVM model on the training data. This rate indicates the percent of suspicious records.

The rate is a number greater than 0 and less than or equal to 1. The default value is 0.1.

If you do not want to specify the rate of outliers, then deselect Specify the Rate of Outliers.

Standard Deviation (Gaussian Kernel)

Standard Deviation is a measure that is used to quantify the amount of variation.

If you select the Gaussian kernel, then you can specify the standard deviation of the Gaussian kernel. This value must be positive. The default is to not specify the standard deviation.

For Anomaly Detection, if you specify Standard Deviation, then the default is 1.

Tolerance Value

Tolerance Value is the maximum size of a violation of convergence criteria such that the model is considered to have converged.

The default value is 0.001. Larger values imply faster building but less accurate models.
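The Standard Deviation setting controls the width of the Gaussian kernel. In implementations that parameterize the kernel by gamma, such as scikit-learn, an assumed correspondence is gamma = 1 / (2 * sigma^2); the sketch below uses that mapping for illustration only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def gamma_from_stddev(sigma: float) -> float:
    """Gaussian kernel exp(-||x-y||^2 / (2*sigma^2)) expressed via
    scikit-learn's gamma parameterization exp(-gamma*||x-y||^2)."""
    return 1.0 / (2.0 * sigma ** 2)

rng = np.random.default_rng(2)
train = rng.normal(size=(300, 2))  # hypothetical training data

sigma = 1.0  # the Anomaly Detection default when Standard Deviation is specified
model = OneClassSVM(kernel="rbf", gamma=gamma_from_stddev(sigma), nu=0.1).fit(train)
print(gamma_from_stddev(sigma))  # prints 0.5
```

Larger standard deviations produce a smoother boundary (smaller gamma); smaller standard deviations produce a tighter fit around the training data.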

Anomaly Detection Model Viewer

The information displayed in the model viewer depends on which kernel was used to build the model.

The tabs that are displayed depend on the kernel used:

  • If the Gaussian Kernel is used, then there is one tab, Settings.

  • If the Linear kernel is used, then there are three tabs: Coefficients, Compare, and Settings.

An Anomaly Detection model is a special kind of Support Vector Machine Classification model.

AD Model Viewer for Gaussian Kernel

The information displayed in the model viewer depends on which kernel was used to build the model.

The Model Viewer of an AD model with Gaussian Kernel displays the information in the Inputs tab and Settings tab.

Settings (AD)

The Settings tab in the Anomaly Detection model viewer displays information under Summary and Inputs.

The Anomaly Detection Settings tab has the following tabs:

Summary (AD)

The General Settings under the Summary tab describe the characteristics of the model.

This includes:

  • Owner

  • Name

  • Type

  • Algorithm

  • Target Attribute

  • Creation Date

  • Duration of Model Build

  • Comments

Algorithm Settings control the model build. Algorithm settings are specified when the Build node is defined.

Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Input (AD)

In the Inputs tab, the attributes that are used to build the model are displayed.

For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute

  • Mining Type:

    • Categorical

    • Numerical

    • Mixed: Indicates that the input signature column takes on more than one attribute type.

    • Partition: Indicates that the input signature column is used as the partitioning key.

  • Data Preparation: YES indicates that data preparation was performed. This helps to distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is indicated as YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.


Anomaly Detection Algorithm Settings

Anomaly Detection models are built using a special version of SVM classification: one-class SVM.

The algorithm has these default settings:

  • Kernel function: The default is System Determined. After the model is built, the kernel function used (Linear or Gaussian) is displayed.

  • Tolerance value: The default is 0.001.

  • Specify complexity factors: The default is to not specify a complexity factor.

  • Specify the rate of outliers: The default is 0.1.

  • Active learning: ON

  • Automatic Data Preparation: ON

Related Topics

AD Model Viewer for Linear Kernel

The information displayed in the model viewer depends on which kernel was used to build the model.

The model viewer of an AD model with a Linear Kernel has these tabs:

Coefficients (SVMC Linear)

Support Vector Machine Models built with the Linear Kernel have coefficients. The coefficients are real numbers. The number of coefficients may be quite large.

The Coefficients tab enables you to view SVM coefficients. The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

Coefficients are displayed in the Coefficients Grid. The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

Settings (AD)

The Settings tab in the Anomaly Detection model viewer displays information under Summary and Inputs.

The Anomaly Detection Settings tab has the following tabs:

Compare (SVMC Linear)

Support Vector Machine models built with the Linear kernel enable you to compare target values.

For selected attributes, Data Miner calculates the propensity, that is, the natural inclination or preference to favor one of two target values. For example, propensity for target value 1 is the propensity to favor target value 1.

To compare target values:

  1. Select how to display information:
    • Fetch Size: The default fetch size is 1000 attributes. You can change this number.

    • Sort by absolute value: This is the default. You can deselect this option.

  2. Select two distinct target values to compare:
    • Target Value 1: Select the first target value.

    • Target Value 2: Select the second target value.

  3. Click Query. If you have not changed any defaults, then this step is not necessary.

The following information is displayed in the grid:

  • Attribute: The name of the attribute.

  • Value: Value of the attribute

  • Propensity for Target_Value_1: Propensity to favor Target Value 1.

  • Propensity for Target_Value_2: Propensity to favor Target Value 2.

Viewing Models in Model Viewer

After you build a model successfully, you can view the model details in the Model Viewer.

The node where the model is built must be run successfully.
You can access the Model Viewer in two ways. To access it using the View Models context menu option:
  1. Select the workflow node where the model was built.

  2. Right-click and select View Models.

  3. Select the model to view.

You can also view the models from model properties:
  1. Right-click the node where the model was built.

  2. Select Go to Properties.

  3. In the Models section in Properties, click the view icon.

Association

Association is an unsupervised mining function for discovering association rules, that is, predictions of items that are likely to be grouped together. Oracle Data Mining provides one algorithm, Association Rules (AR).

To build an AR model, use an Association node.

Note:

If the model has 0 rules or a very large number of rules, then you may be required to troubleshoot the AR models.

Data for Association Rules (AR) models is usually in transactional form, unlike the data for other kinds of models.

Oracle Data Mining does not support applying (scoring) AR models.

Topics include:

Related Topics

Calculating Associations

When calculating association rules, the Apriori algorithm calculates the probability of an item being present in a frequent itemset, given that other items are present. It proceeds by first identifying the frequent individual items in the database.

An association mining problem can be broken down into two subproblems:

  1. Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent itemsets.

  2. Calculate Association Rules that express the probability of items occurring together within frequent itemsets.
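These two phases can be sketched in a few lines of Python. This is a brute-force illustration of the idea on hypothetical toy data; the real Apriori algorithm prunes candidate itemsets rather than enumerating them all:

```python
from itertools import combinations

# Toy market-basket transactions (hypothetical data).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]
min_support = 0.4  # minimum frequency for a "frequent itemset"

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Phase 1: find all frequent itemsets up to a maximum number of items.
items = sorted(set().union(*transactions))
frequent = {frozenset(c)
            for k in (1, 2, 3)
            for c in combinations(items, k)
            if support(set(c)) >= min_support}

# Phase 2: derive rules (antecedent -> single-item consequent) from frequent itemsets.
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for consequent in itemset:
        antecedent = itemset - {consequent}
        confidence = support(itemset) / support(antecedent)
        rules.append((set(antecedent), consequent, round(confidence, 2)))
```

For example, the itemset {milk, bread, butter} appears in 2 of 5 transactions (support 0.4), so it is frequent, and it yields the rule IF milk AND bread THEN butter with confidence 0.4 / 0.6 = 0.67.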

Itemsets

An itemset comprises one or more items.

The maximum number of items in an itemset is user-specified.

  • If the maximum is 2, then all the item pairs are counted.

  • If the maximum is greater than 2, then all the item pairs, all the item triples, and all the item combinations up to the specified maximum are counted.

Association rules are calculated from itemsets. Usually, it is desirable to only generate rules from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.

The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for Association Rules.

Related Topics

Association Rules

An association rule is of the form IF antecedent THEN consequent.

An association rule states that an item or group of items, the antecedent, implies the presence of another item, the consequent, with some probability. Unlike Decision Tree rules, which predict a target, Association Rules simply express correlation.

The Apriori algorithm calculates rules that express probable relationships between items in frequent itemsets. For example, a rule derived from frequent itemsets containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included.

Association rules have Confidence and Support:

  • Confidence of an association rule indicates the probability of both the antecedent and the consequent appearing in the same transaction. Confidence is the conditional probability that the consequent occurs given the occurrence of the antecedent. In other words, confidence is the ratio of the rule support to the support of the antecedent.

  • Support of an association rule indicates how frequently the items in the rule occur together. Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.
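Both metrics follow directly from transaction counts, as this small Python sketch (hypothetical toy data) shows:

```python
# Hypothetical transactions; each set is one case (basket).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

antecedent, consequent = {"A", "B"}, {"C"}  # the rule IF A and B THEN C

def frac(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_support = frac(antecedent | consequent)  # all rule items together: 2/5 = 0.4
confidence = rule_support / frac(antecedent)  # P(consequent | antecedent): 0.4/0.6
print(rule_support, round(confidence, 2))
```

Here the rule items occur together in 2 of 5 transactions (support 0.4), and of the 3 transactions containing the antecedent, 2 also contain the consequent (confidence about 0.67).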

Data for AR Models

Association does not support text. Association Rules are normally used with transactional data, but they can also be applied to single-record case data (as with other algorithms).

Native transactional data consists of two columns:

  • Case ID, either categorical or numerical

  • Item ID, either categorical or numerical

Transactional data may also include a third column:

  • Item value, either categorical or numerical

A typical example of transactional data is market basket data. In market basket data, a case represents a basket that might contain many items. Each item is stored in a separate row, and many rows may be needed to represent a case. The Case ID values do not uniquely identify each row. Transactional data is also called Multirecord Case Data.

When building an Association model, specify the following:

  • Item ID: This is the name of the column that contains the items in a transaction.

  • Item Value: This is the name of a column that contains a value associated with each item in a transaction. The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, Macintosh Apples).

    The default value for Item Value is Existence. That is, one or more items identified by Item ID are in the basket.

    If you select a specific value for Item Value, then you may have to perform appropriate data preparation. The maximum number of distinct values of Item Value is 10. If Item Value has more than 128 distinct values, then bin the attribute specified in Item Value using a Transform node.

Support for Text (AR)

In Oracle Data Miner, Association does not support text.

Although the Oracle Data Mining API supports text, using text for Association is not recommended.

Troubleshooting AR Models

Association Rules models may generate many rules with very low Support and Confidence.

If you increase Support and Confidence, then you reduce the number of rules generated.

Usually, Confidence should be greater than or equal to Support.

If a model has no rules, then the following message is displayed in the Rules tab of the Model Viewer: Model contains no rules. Consider rebuilding model with lower confidence and support settings.

Algorithm Settings for Association Rules

To change the algorithm settings for an Association node:

  1. Right-click the node.
  2. Select Advanced Settings.
  3. Select the model. The following settings are displayed in the Algorithm Settings tab:
    • Maximum rule length: The default is 4

    • Minimum Confidence: The default is 10%

    • Minimum Support: The default is 1%

  4. If no rules were generated, then:
    • First, try decreasing Minimum Support.

    • If that does not work, then decrease the Minimum Confidence value. You may need to specify much smaller values for either setting.

  5. After you have finished, click OK.
  6. Run the node.

Related Topics

AR Model Viewer

The AR model viewer opens in a new tab. The default name of an Association model contains ASSOC.

The AR Model viewer has these tabs:

AR Rules

An Association Rule states that an item or group of items implies the presence of another item.

Each rule has a probability. Unlike Decision Tree rules, which predict a target, Association Rules simply express correlation.

If an attribute is a nested column, then the full name is displayed as COLUMN_NAME.SUBNAME. For example, GENDER.MALE. If the attribute is a normal column, then just the column name is displayed.

Oracle Data Mining supports Association Rules that have one or more items in the antecedent (IF part of a rule) and a single item in the consequent (THEN part of the rule). The antecedent may be referred to as the condition, and the consequent as the association.

Rules have Confidence, Support and Lift.

The Rules tab is divided into two sections: Filtering and Sorting in the upper section, and the AR Rules Grid in the lower section. Sorting or filtering defined using the settings in the upper section applies to all the rules in the model. Sorting or filtering defined using the settings in the lower section applies to the grid display only.

You can perform the following functions in the Rules tab:

  • Sort By: You can specify the order for display of the rules. You can sort rules by:

    • Lift, Confidence, Support, or Length

    • Ascending or Descending order

    Note:

    You can sort by Aggregate information only if aggregates were defined on the node.
  • Filter: To see filtering options, click More and select Use Filter. The filter table has the following columns:

    • TYPE: Indicates the type, either Metric or Item.

    • FILTER ON: Double-click and select any one of the following:

      • Lift

      • Confidence

      • Reverse Confidence

      • Item Count

      • Support

      • Support Count

    • FILTER FOR: Indicates whether the filter is for the Rule, Antecedent, or Consequent. Double-click to edit.

    • VALUES: You can set the range of values here. Double-click to edit the range of values and click Apply.

  • Fetch Size: Association models often generate many rules. Specify the number of rules to examine by clicking Fetch Size. The default is 1000.

  • Query: You can query the database using the criteria that you specify. For example, if you change the default sorting order, specify filtering, or change the fetch size, then click Query.

AR Rules Grid

The lower part of the Rules tab displays the retrieved rules in a grid. The following is displayed above the grid:

  • Available Rules: The total number of rules in the model.

  • Rules Retrieved: The number of rules retrieved by the query, that is, the number of rules retrieved subject to filtering.

  • Rule Content: For maximum information, select Name, Subname, and Value; you can select fewer characteristics from the menu. This selection applies to the rules in the grid only. By default, Rule Content is set to the combination that is most informative for the model.

Related Topics

AR Rules Display

For each rule, the Rules grid displays the following information:

  • ID: Identifier for the rule, a string of integers.

  • Condition

  • Association

  • Lift: A bar is included in the column. The size of the bar is scaled relative to the largest lift value of any rule in the model.

  • Confidence

  • Support

  • Length

  • Antecedent Support

  • Condition Support

Aggregate columns can be displayed if aggregates are defined. The number of columns that appear by default is controlled by the preferences settings for Association Rules.

You can perform the following tasks:

  • Sort: You can sort the items in the grid by clicking on the title of the column. This sorting applies to the grid only.

  • View details: To see details for a rule, click the rule and examine the rule details.

  • Determine validity of a rule: To determine if a rule is valid, you must use support and confidence plus lift.

  • Choose a different itemset display structure for selected rules. To choose a different display structure, select the rule and click the gear icon. The Choose ItemSet Display Structure dialog box opens.

For more information, including examples of these statistics, see the discussion of evaluating association rules in Oracle Data Mining Concepts.

Lift for AR Rules

Lift can be defined as the Confidence of the combination of items divided by the support of the consequent.

Support for AR Rules and Confidence for AR rules must be used to determine if a rule is valid. However, there are times when both of these measures may be high, and yet produce a rule that is not useful.

Lift indicates the strength of a rule over the random co-occurrence of the Antecedent and the Consequent, given their individual support. It provides information about the improvement, the increase in probability of the consequent given the antecedent. Lift is defined as follows:

(Rule Support) / (Support(Antecedent) * Support(Consequent))
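The formula translates directly into code; the following sketch uses hypothetical support values for illustration:

```python
def lift(rule_support, antecedent_support, consequent_support):
    """Lift = (Rule Support) / (Support(Antecedent) * Support(Consequent)).
    Values above 1 indicate that the antecedent and consequent co-occur
    more often than expected if they were independent."""
    return rule_support / (antecedent_support * consequent_support)

# Hypothetical example: rule support 0.4, antecedent support 0.6,
# consequent support 0.5 -> lift of about 1.33, a positive association.
print(round(lift(0.4, 0.6, 0.5), 2))
```

A lift of exactly 1 means the rule adds nothing over the random co-occurrence of its items, which is why lift is checked alongside support and confidence.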

Confidence for AR Rules

The Confidence of a rule indicates the probability of both the antecedent and the consequent appearing in the same transaction.

Confidence is the conditional probability of the consequent, given the antecedent.

Association Rules are of the form IF antecedent THEN consequent.

Support for AR Rules

The Support of a rule indicates how frequently the items in the rule occur together.

Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.

Association Rules are of the form IF antecedent THEN consequent.

Choose ItemSet Display Structure

The Choose ItemSet Display Structure dialog box allows you to select different formats to view transactional data in the Rules tab of the AR Model viewer.

Oracle Data Miner builds AR models that use transaction format only. The transactional data is represented in AR rules as name.subname, where name=column name and subname=item name. The value is not selected, since AR supports aggregation metrics. If you see Value, then it defaults to 1.
If you use the Oracle Data Mining API, then you can build a model that uses non-transactional data. In that case, only Name is filled in, and Subname is empty. Oracle Data Miner cannot build such a model, because a non-transactional structure is not commonly used to represent Market Basket Analysis data. However, you can reference such a model through the Model node and then view it.
You must create and run an Association Rules model to view the AR Model viewer.
  1. In the Rules tab, select a rule in the Rules section.
  2. Click the gear icon in the Rules section.
    The Choose ItemSet Display Structure dialog box opens.
  3. In the Choose ItemSet Display Structure field, select an option from the drop-down list. The available options are:
    • Name
    • Subname
    • Name = Value
    • Name.Subname = Value
    • Subname = Value
  4. Click OK.
Rule Details

The information in the rule grid is displayed in a readable format in the Rule Detail list.

Sorting

The default sort is:

  1. Sort by Lift in descending order (default)

  2. Sort by Confidence in descending order

  3. Sort by Support in descending order

  4. Sort by rule length in descending order

The sorting specified here applies to all rules in the model.

Filtering

To see all the filtering options, click More.

You can specify the following:

  • Filter: Filter rules based on the values of rule characteristics. You can specify the following:

    • Minimum lift

    • Minimum support

    • Minimum confidence

    • Maximum items in rule

    • Minimum items in rule

  • Fetch Size: It is the maximum number of rows to fetch. The default is 1000. Smaller values result in faster fetches.

  • Define item filters to reduce the number of rules returned.

To define a filter, select Use Filter. After you define the filter, click Query.

Related Topics

Item Filters

Item filters enable you to see only those rules that contain the items that you are interested in. An item filter marks an item as required in the Association, the Condition, or Both. The rule filter uses OR logic within each side of the rule (the Association collection and the Condition collection), but AND logic across the two sides. So, for a rule to be returned, it must contain at least one requested Association item AND at least one requested Condition item.
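The OR-within-a-side, AND-across-sides behavior can be sketched as a simple predicate over rules (the rule representation and item names here are hypothetical):

```python
def rule_passes(rule, condition_items, association_items):
    """A rule passes when it contains at least one requested Condition item
    (OR within that side) AND at least one requested Association item
    (AND across the two sides). An empty request on a side matches any rule."""
    antecedent, consequent = rule  # (Condition items, Association items)
    ok_condition = not condition_items or bool(antecedent & condition_items)
    ok_association = not association_items or bool(consequent & association_items)
    return ok_condition and ok_association

rules = [
    ({"milk", "bread"}, {"butter"}),
    ({"milk"}, {"eggs"}),
    ({"bread"}, {"butter"}),
]
kept = [r for r in rules if rule_passes(r, {"milk"}, {"butter"})]
print(kept)  # only the first rule has a milk Condition AND a butter Association
```

The second rule fails the Association side and the third fails the Condition side, so only the first is returned.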

You can manage Item Filters using these controls:

  • To open the Add Item Filter dialog box, click add.

  • To delete selected item filters, click delete.

  • To change the Filter column of selected rows to Both, click both. Both implies Association and Condition.

  • To change the Filter column of selected rows to Condition, click condition.

  • To change the Filter column of selected rows to Association, click association.

Related Topics

Add Item Filter

To open the Add Item Filter dialog box, click add.

The exact information that is displayed depends on the model. For example, if data has different values for the model that you are viewing, then there is a Values column.

Click More to see all possibilities:

  • Specify sorting for item filters: the default is to sort by Attribute Descending and then by Support Ascending.

  • Specify a name for the filter.

  • Change the Fetch Size from the default of 100,000.

  • If you made any changes, then click Query to retrieve the attribute or value pairs.

  • Filter the retrieved items by name or value.

  • Select one or more items in the grid.

  • Select how to use items when filtering rules.

Click OK when you have finished defining the filter.

Itemsets

Rules are calculated from itemsets. The Itemsets tab displays information about the itemsets.

If an attribute is a nested column, then the full name is displayed as COLUMN_NAME.SUBNAME. For example, GENDER.MALE. If the attribute is a normal column, then just the column name is displayed.

Itemsets have Support. Each itemset contains one or more items.

  • Sort Itemsets: Sort By specifies the order of itemsets. You can sort itemsets by:

    • ID

    • Number of Items

    • Support in Ascending Order

    • Support in Descending Order

    By default, itemsets are sorted by Support in Descending order. For more sort options, click More. To change the sort order, make changes and then click Query.

  • Filter Itemsets

  • View Itemset details. Click the itemset to view the details.

The Itemsets tab displays the following information:

  • Available Itemsets: The total number of itemsets in the model.

  • Itemsets Retrieved: The number of itemsets retrieved by the query. That is, the number of itemsets retrieved subject to filtering.

  • Itemset Content: For maximum information, select all three: Name, Subname, and Value. You can also select a subset of these characteristics from the menu.

Other Tabs: The AR Model viewer has these other tabs:

  • AR Rules

  • Settings

Itemsets Display

For each itemset, the Itemsets grid displays the following information:

  • ID: Identifier for the itemset, a string of integers

  • Items

  • Support: A bar in the column illustrates the relative size of the support.

  • Number of Items in the itemset

Itemset Details

To see itemset details, select one or more itemsets in the itemsets grid. The information in the itemsets grid is displayed in a more readable format.

Settings (AR)

AR models are not scoreable, that is, they cannot be applied to new data. Models that are not scoreable do not have an Attributes tab in the model viewer.

The Settings tab has the following tabs:

  • Summary

Other Tabs:

  • Itemsets

  • AR Rules

Related Topics

Summary

The Summary tab contains the following information about the model.

  • General: Lists the following:

    • Type of Model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the Model build (in minutes)

    • Model size in MB

    • Comments

  • Algorithm: Lists the following:

    • Automatic Preparation (on or off)

    • Minimum Confidence

    • Minimum Support

    To change these values, right-click the model node and select Advanced Settings.

  • Build Details: Lists the following:

    • Itemset Counts

    • Maximum Support

    • Number of Rows

    • Rule Count

    • Transaction Count

Algorithm Settings

Association (AR) supports these algorithm settings:

  • Maximum Rule Length: The maximum number of attributes in each rule. This number must be an integer between 2 and 20. Higher values result in slower builds. The default value is 4.

    You can change the number of attributes in a rule, or you can specify no limit for the number of attributes in a rule. Specifying many attributes in each rule increases the number of rules considerably. A good practice is to start with the default and increase this number slowly.

  • Minimum Confidence: Confidence indicates how likely it is that these items occur together in the data. Confidence is the conditional probability that the consequent will occur given the occurrence of the antecedent.

    Confidence is a number between 0 and 100 indicating a percentage. High confidence results in a faster build. The default is 10 percent.

  • Minimum Support: A number between 0 and 100 indicating a percentage. Support indicates how often these items occur together in the data.

    Smaller values for support result in slower builds and require more system resources. The default is 1 percent.

  • Minimum Support Count: Accepts any integer. Default value is 1.

  • Minimum Reverse Confidence (%): Accepts any float value. Default is 0.0 %.
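The support, confidence, and lift metrics used by these settings can be sketched in a few lines (an illustrative computation over toy transactions, not Oracle's implementation):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Conditional probability of the consequent given the antecedent.
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Ratio of observed co-occurrence to what independence would predict.
    return confidence(antecedent, consequent) / support(consequent)
```

For example, the rule milk implies bread has support 0.5 (2 of 4 transactions) and confidence 2/3, since 2 of the 3 transactions containing milk also contain bread.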

Related Topics

Viewing Models in Model Viewer

After you build a model successfully, you can view the model details in the Model Viewer.

The node where the model is built must be run successfully.

You can access the Model Viewer in two ways. To access it using the View Models context menu option:
  1. Select the workflow node where the model was built.

  2. Right-click and select View Models.

  3. Select the model to view.

You can also view the models from model properties:
  1. Right-click the node where the model was built.

  2. Select Go to Properties.

  3. In the Models section in Properties, click the view icon.

Decision Tree

The Decision Tree algorithm is a Classification algorithm that generates rules. Oracle Data Mining supports the Decision Tree (DT) algorithm.

Topics include:

Decision Tree Algorithm

The Decision Tree algorithm is based on conditional probabilities.

Unlike Naive Bayes, Decision Trees generate rules. A rule is a conditional statement that can be used by humans and used within a database to identify a set of records.

The Decision Tree algorithm:

  • Creates accurate and interpretable models with relatively little user intervention. The algorithm can be used for both binary and multiclass classification problems. The algorithm is fast, both at build time and apply time. The build process for the Decision Tree Algorithm is parallelized. Scoring can be parallelized irrespective of the algorithm.

  • Predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that together uniquely identify specific target values.

Decision Tree scoring is especially fast. The tree structure, created in the model build, is used for a series of simple tests (typically 2 to 7). Each test is based on a single predictor. It is a membership test: either IN or NOT IN a list of values (categorical predictor), or LESS THAN or EQUAL TO some value (numeric predictor).
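The scoring procedure above can be sketched as a short traversal. The tree structure and attribute names here are hypothetical, chosen only to show the two test types (value-list membership and numeric threshold):

```python
# Each internal node holds one test on a single predictor; leaves
# hold the prediction. This dict layout is purely illustrative.
tree = {
    "test": ("age", "<=", 30),
    "yes": {"test": ("region", "in", {"NORTH", "EAST"}),
            "yes": {"predict": "BUY"},
            "no": {"predict": "NO_BUY"}},
    "no": {"predict": "NO_BUY"},
}

def score(node, case):
    # Walk from the root, applying one simple test per level.
    while "predict" not in node:
        attr, op, operand = node["test"]
        if op == "<=":
            # Numeric predictor: LESS THAN OR EQUAL TO some value.
            branch = "yes" if case[attr] <= operand else "no"
        else:
            # Categorical predictor: IN a list of values.
            branch = "yes" if case[attr] in operand else "no"
        node = node[branch]
    return node["predict"]
```

Scoring a case such as `{"age": 25, "region": "NORTH"}` needs only two tests before reaching a leaf.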

During the model build, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. Oracle Data Mining offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
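The two homogeneity metrics can be stated concretely. The sketch below (an illustration of the standard gini and entropy formulas, not Oracle's internal code) scores a candidate split by the weighted impurity of its child nodes; the build prefers the split with the lowest value:

```python
from math import log2

def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Shannon entropy in bits; a pure node scores 0.
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def split_impurity(left, right, metric=gini):
    # Weighted impurity of the two child nodes produced by a split.
    n = sum(left) + sum(right)
    return (sum(left) * metric(left) + sum(right) * metric(right)) / n
```

A perfectly separating split, such as class counts [10, 0] versus [0, 10], has weighted impurity 0 under either metric, while an uninformative 50/50 node scores 0.5 (gini) or 1.0 (entropy).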

Decision Tree Rules

Rules show the basis for the prediction of the model.

Rules provide model transparency, a window on the inner workings of the model. Oracle Data Mining supports a high level of model transparency.

Confidence and Support are used to rank the rules generated by the Decision Tree Algorithm:

  • Support: The number of records in the training data set that satisfy the rule.

  • Confidence: The likelihood of the predicted outcome, given that the rule has been satisfied.
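These two definitions translate directly into code. The sketch below (illustrative only; record layout and rule predicate are hypothetical) counts the training records that satisfy a rule and the fraction of those that carry the predicted outcome:

```python
records = [
    {"age": 25, "target": "BUY"},
    {"age": 28, "target": "BUY"},
    {"age": 29, "target": "NO_BUY"},
    {"age": 45, "target": "NO_BUY"},
]

def rule_metrics(records, antecedent, predicted):
    # Support: number of training records that satisfy the rule.
    matching = [r for r in records if antecedent(r)]
    support = len(matching)
    # Confidence: likelihood of the predicted outcome given the rule.
    hits = sum(r["target"] == predicted for r in matching)
    confidence = hits / support if support else 0.0
    return support, confidence

s, c = rule_metrics(records, lambda r: r["age"] <= 30, "BUY")
```

Here the rule "age <= 30 predicts BUY" has support 3 and confidence 2/3.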

Build, Test, and Apply Decision Tree Models

The Decision Tree algorithm manages its own data preparation internally. It does not require pre-treatment of the data.

The Decision Tree is not affected by Automatic Data Preparation.

The Decision Tree algorithm interprets missing values as missing at random. In releases earlier than Oracle Data Mining 12c, the algorithm does not support nested tables and therefore does not support sparse data. You can perform the following tasks with a Decision Tree model:
  • Build Decision Tree model: To build a Decision Tree model, use a Classification node. In Oracle Data Mining 12c Release 1 (12.1) or later, Decision Tree supports nested data. Decision Tree supports text for Oracle Database 12c and later, but not for earlier releases.

  • Test Decision Tree model: By default, a Classification Node tests all models that it builds. The test data is created by splitting the input data into build and test subsets. You can also test a Decision Tree model using a Test node.

  • Tune Decision Tree model: After you build and test a Decision Tree model, you can tune it.

  • Apply Decision Tree model: To apply a model, use an Apply node.

Decision Tree Algorithm Settings

Lists the settings supported by the Decision Tree algorithm.

  • Homogeneity Metric:

    • Gini (default)

    • Entropy

  • Maximum Depth: The maximum number of levels of the tree. The default is 7. The value must be an integer in the range 2 to 20.

  • Minimum Records in a Node: The minimum number of records in a node. The default is 10. The value must be an integer greater than or equal to 0.

  • Minimum Percent of Records in a Node: The default is 0.05. The value must be a number in the range 0 to 10.

  • Maximum Number of Supervised Bins: The upper limit for the number of bins per attribute for algorithms that use supervised binning. The default value is 32.

  • Minimum Records for a Split: The minimum number of records for a split. The default is 20. The value must be an integer greater than or equal to 0.

  • Minimum Percent of Records for a Split: The default is 0.1. The value must be a number in the range 0 to 20.

Decision Tree Model Viewer

The default name of a Decision Tree model has DT in the name. The Tree viewer has two tabs:

  • Tree: This tab is displayed by default. Use the Structure window to navigate and analyze the tree. It is split horizontally into two panes:

    • The upper pane displays the tree. The root node is at the top of the pane. The following information is displayed for each node of the tree:

      • Node number: 0 is the root node.

      • Prediction: The predicted target value.

      • Support: The support for the prediction.

      • Confidence: The confidence of the prediction.

      • A histogram shows the distribution of target values in the node.

      • Split: The attribute used to split the node (Leaf nodes do not have splits).

    • The lower pane displays rules. To view the rule associated with a node or a link, select the node or link. The rule is displayed in the lower pane. The following information is displayed in the lower pane:

      • Rule

      • Surrogates

      • Target Values

  • Settings

Icons and menus at the top of the upper pane control how the tree and its nodes are displayed. You can perform the following tasks:

  • Zoom in or zoom out for the tree. You can also select a size from the drop-down list. You can also fit the tree to the window.

  • Change the layout to horizontal. The default Layout Type for the tree is vertical.

  • Hide the histograms displayed in the node.

  • Show less detail.

  • Expand all nodes.

  • Save Rules

Save Rules

You can save the Decision Tree or Clustering Rules as a file in your system.

To save the algorithm rules:

  1. Click Save Rules on the far right of the upper tab. By default, rules are saved for leaf nodes only to the Microsoft Windows Clipboard. You can then paste the rules into any rich document, such as a Microsoft Word document. You can deselect Leaves Only to save all rules.
  2. To save rules to a file, click Save to File and specify a file name.
  3. Select the location of the file in the Save dialog box. By default, the rules are saved as an HTML file.
  4. Click OK.

Settings (DT)

The Settings tab contains information related to the model summary, inputs, target values, cost matrix (if the model is tuned), partition keys (if the model is partitioned) and so on.

In the Partition field, click the partition name. The partition detail is displayed in the Partition Details window.

Click search to open the Select Partition dialog box to view filtered partitions based on the Partition keys.

The Settings tab includes:

Related Topics

DT Summary

The Summary tab displays the following information about the model.

  • General contains the following information:

    • Type of model

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model Build (in minutes)

    • Size of the model (in MB)

    • Comments (if the model has any comments)

  • Algorithm

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

DT Inputs

The Inputs tab shows information about the attributes used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name

  • Data type

  • Mining Type: Categorical, Numerical, or Text.

  • Target: The check in this column indicates that the attribute is a target.

  • Data Prep

    • Data Preparation: YES indicates that data preparation was performed. It helps to distinguish between User and Auto Data Preparation (ADP), because ADP could be turned off while the user still embeds a transformation. If Data Preparation is indicated as YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

    • Partition Key: YES indicates that the attribute is a partition key.

Partition Keys

The Partition Keys tab lists the columns that are partitioned.

Along with the partitioned columns, the Partition Keys tab lists the following details:
  • Partition Name

  • Source

  • Data Type

  • Value

DT Target Values

The Target Values tab displays the target attributes, their data types, and the values of each target attribute.

Expectation Maximization

Expectation Maximization (EM) is a density estimation technique. Oracle Data Mining implements EM as a distribution-based clustering algorithm that uses probability density estimation.

In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.

Note:

Expectation Maximization requires Oracle Database 12c and later.

Dense areas are interpreted as components or clusters. Density-based clustering is conceptually different from distance-based clustering (such as k-Means), where emphasis is placed on minimizing intracluster distances and maximizing intercluster distances.

The shape of the probability density function used in EM effectively predetermines the shape of the identified clusters. For example, Gaussian density functions can identify single peak symmetric clusters. These clusters are modeled by single components. Clusters of a more complex shape need to be modeled by multiple components. The EM algorithm assigns model components to high-level clusters by default.
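To make the density-estimation idea concrete, here is a minimal one-dimensional EM sketch for a two-component Gaussian mixture (a textbook illustration under simplifying assumptions, not Oracle's multi-attribute implementation). Each component models one dense area of the data:

```python
from math import exp, pi, sqrt

def gauss(x, mu, var):
    # Gaussian probability density function.
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_1d(data, iters=50):
    # Crude initialization: one component at each extreme of the data.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return mu, var, w

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
mu, var, w = em_1d(data)
```

On this toy data the two component means converge near the two dense areas, around 1.0 and 5.1.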

Build and Apply an EM Model

To build and apply an Expectation Maximization model, use a Clustering node and an Apply node respectively.

To build an EM model, use a Clustering node.

Note:

You must be connected to Oracle Database 12c and later.

To apply an EM model, use an Apply node.

Related Topics

EM Algorithm Settings

Lists the settings supported by the Expectation Maximization algorithm.

The settings are:

  • Number of Clusters is the maximum number of leaf clusters generated by the algorithm. EM may return fewer clusters than the number specified, depending on the data. The number of clusters returned by EM cannot be greater than the number of components, which is governed by algorithm-specific settings. Depending on these settings, there may be fewer clusters than components. If component clustering is disabled, then the number of clusters equals the number of components.

    The default is system determined. To specify a specific number of clusters, click User specified and enter an integer value.

  • Component Clustering is selected by default.

    Component Cluster Threshold specifies a dissimilarity threshold value that controls the clustering of EM components. Smaller values may produce more clusters that are more compact while large values may produce fewer clusters that are more spread out. The default value is 2.

  • Linkage Function enables the specification of a linkage function for the agglomerative clustering step. The linkage functions are:

    • Single uses the nearest distance within the branch. The clusters tend to be larger and have arbitrary shapes.

      Single is the default.

    • Average uses the average distance within the branch. There is less chaining effect and the clusters are more compact.

    • Complete uses the maximum distance within the branch. The clusters are smaller and require a strong component overlap.

  • Approximate Computation indicates whether the algorithm should use approximate computations to improve performance.

    For EM, approximate computation is appropriate for large models with many components and for data sets with many columns. The approximate computation uses localized parameter optimization that restricts learning to parameters that are likely to have the most significant impact on the model.

    The values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable

  • Number of Components specifies the maximum number of components in the model. The algorithm automatically determines the number of components, based on improvements in the likelihood function or based on regularization, up to the specified maximum.

    The number of components must be greater than or equal to the number of clusters.

    The default number of components is 20.

  • Max Number of Iterations specifies the maximum number of iterations in the EM core algorithm. Maximum number of iterations must be greater than or equal to 1. This setting applies to the input table/view as a whole and does not allow per attribute specification.

    The default is 100.

  • Log Likelihood Improvement specifies the percentage improvement in the value of the log likelihood function required to add a new component to the model.

    The default value is 0.001.

  • Convergence Criterion specifies the convergence criterion for EM. The convergence criteria are:

    • System Determined (Default)

    • Bayesian Information Criterion

    • Held-aside data set

  • Numerical Distribution specifies the distribution for modeling numeric attributes. The options are the following distributions:

    • Bernoulli

    • Gaussian

    • System Determined (Default)

    When the Bernoulli or Gaussian distribution is chosen, all numerical attributes are modeled using the same distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

  • Levels of Details enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules). Disabling the cluster statistics will result in smaller models and will reduce the model details calculated.

    • If you select All, then the algorithm setting is enabled.

    • If you select Hierarchy, then the algorithm setting is disabled.

  • Min Percent of Attribute Rule Support specifies the percentage of the data rows assigned to a cluster that must be present in an attribute to include that attribute in the cluster rule. The default value is 0.1.

  • Data Preparation and Analysis specifies settings for data preparation and analysis. To view or change the selections, click Settings.

  • Random Seed controls the seed of the random generator used in Expectation Maximization. The value must be a non-negative integer. Default is 0.

  • Model Search enables search in EM, where different model sizes are explored and the best size is selected. By default, the setting is set to DISABLE.

  • Remove Small Components allows the algorithm to remove very small components from the solution. By default, the setting is set to ENABLE.

Click OK after you are done.
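The three linkage functions listed above can be sketched directly (an illustration of standard agglomerative linkage definitions, with a hypothetical one-dimensional distance function; not Oracle's internal code):

```python
def pairwise(a, b, dist):
    # All distances between members of branch a and branch b.
    return [dist(x, y) for x in a for y in b]

def single_linkage(a, b, dist):
    # Nearest distance within the branch: larger, arbitrary-shaped clusters.
    return min(pairwise(a, b, dist))

def average_linkage(a, b, dist):
    # Average distance: less chaining effect, more compact clusters.
    d = pairwise(a, b, dist)
    return sum(d) / len(d)

def complete_linkage(a, b, dist):
    # Maximum distance: smaller clusters requiring strong overlap.
    return max(pairwise(a, b, dist))

dist = lambda x, y: abs(x - y)
a, b = [0.0, 1.0], [3.0, 5.0]
```

For the two branches above, single linkage reports 2.0, average linkage 3.5, and complete linkage 5.0, which is why single linkage merges branches most readily.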

EM Data Preparation and Analysis Settings

This dialog box enables you to view or change these settings:

  • Max Number of Correlated 2D Attributes specifies the maximum number of correlated two-dimensional attributes that will be used in the EM model. Two-dimensional attributes correspond to columns that have a simple data type (not nested).

    The default is 50.

  • Number of Projections per Nested Column specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, then the data will not be projected. The setting applies to all nested columns.

    The default is 50.

  • Number of Quantile Bins (Numerical Columns) specifies the number of quantile bins that will be used for modeling numerical columns with multivalued Bernoulli distributions.

    The default is system determined.

  • Number of Top-N Bins (Categorical Columns) specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

    The default is system determined.

  • Number of Equi-Width Bins (Numerical Columns) specifies the number of equi-width bins that will be used for gathering cluster statistics for numerical columns.

    The default is 11.

  • Include uncorrelated 2D Attributes specifies whether uncorrelated two-dimensional attributes should be included in the model or not. Two-dimensional attributes correspond to columns that are not nested.

    The values are:

    • System Determined (Default)

    • Enable

    • Disable

When you have finished making changes, click OK.
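The difference between the equi-width and quantile bins mentioned above can be sketched as follows (illustrative formulas only; Oracle computes its bins internally):

```python
def equi_width_bins(values, n_bins):
    # Edges divide the observed range into n_bins equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def quantile_bins(values, n_bins):
    # Edges placed so each bin holds roughly the same number of values.
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

values = [1, 2, 2, 3, 3, 3, 10]
```

On skewed data such as the list above, equi-width edges ([4.0, 7.0]) leave most values in the first bin, while quantile edges ([2, 3]) balance the bin populations.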

EM Model Viewer

You can view and examine an EM Model in an EM Model Viewer.

The Tree tab is displayed by default. The EM model viewer has these tabs:

EM, KM, and OC Tree Viewer

The Tree Viewer is the graphical tree for hierarchical clusters.

The tree viewers for Expectation Maximization, k-Means, and Orthogonal Clustering operate in the same way. When you view the tree:

  • The Workflow Thumbnail opens to give you a view of the entire tree.

  • The Structure window helps you navigate and analyze the tree.

You can compare the attributes in a given node with the attributes in the population using EM, KM, and OC Compare.

To view information about a particular node:

  1. Select the node.

  2. In the lower pane, the following are displayed in each of these tabs:

    • Centroid: Displays the centroid of the cluster.

    • Cluster Rule: Displays the rule that all elements of the cluster satisfy.

Display Control:

The following control the display of the tree as a whole:

  • Zoom in: Zooms in to the diagram, providing a more detailed view of the rule.

  • Zoom out: Zooms out from the diagram, providing a view of much or all of the rule.

  • Percent size: Enables you to select an exact percentage to zoom the view.

  • Fit to Window: Zooms out from the diagram until the whole diagram fits within the screen.

  • Layout Type: Enables you to select horizontal layout or vertical layout; the default is vertical.

  • Expand: Expands all nodes to show all branches of the tree.

  • Show more detail: Shows more data for each tree node. Click again to show less detail.

  • Top Attributes: Displays the top N attributes. N is 5 by default. To change N, select a different number from the list.

  • Refresh: Enables you to apply the changed Query Settings.

  • Query Settings: Enables you to change the number of top settings. The default is 10. You can save a different number as the new default.

  • Save Rules

Cluster (Viewer)

The Cluster tab enables you to view information about a selected cluster. The viewer supports filtering so that only selected probabilities are displayed.

The Cluster tab operates in the same way for EM, KM, and OC models.

The following information is displayed:

  • Cluster: The ID of the cluster being viewed. To view another cluster, select a different ID from the menu. You can view Leaves Only (terminal clusters) by selecting Leaves Only. Leaves Only is the default.

  • Fetch Size: Default is 20. You can change this value.

    If you change Fetch Size, then click Query to see the new display.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Name of the attribute.

  • Histogram of the attribute values in the cluster.

  • Confidence displayed as both a number and with a bar indicating a percentage. If confidence is very small, then no bar is displayed.

  • Support, the number of cases.

  • Mean, displayed for numeric attributes.

  • Mode, displayed for categorical attributes.

  • Variance

To view a larger version of the histogram, select an attribute; the histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

You can search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box.

The drop-down list enables you to search the grid by Attribute (the default) or by Mode. Type the search term in the box next to search.

To clear a search, click delete.

Other Tabs: The model viewer also has these tabs:

  • EM, KM, and OC Tree Viewer

  • EM, KM, and OC Compare

  • Settings

Cluster Model Settings (Viewer)

The Settings tab in the Cluster Model Viewer contains information related to model summary and model inputs.

The information is available in the following tabs:

EM, KM, and OC Compare

In the Compare tab, you can compare two clusters in the same model.

The Compare tab operates in the same way for EM, KM, and OC models. The display enables you to select the two clusters to compare.

You can perform the following tasks:

  • Compare Clusters: To select clusters to compare, pick them from the lists. The clusters are compared by comparing attribute values. The comparison is displayed in a grid. You can use Compare to compare an individual cluster with the population.

  • Rename Clusters: To rename clusters, click Edit. This opens the Rename Cluster dialog box. By default, only leaf clusters are displayed. To show all nodes, deselect Leaves Only. The default Fetch Size is 20. You can change this value.

  • Search Attribute: To search an attribute, enter its name in the search box. You can also search by rank.

  • Create Query: If you make any changes, then click Query.

For each cluster, a histogram is generated that shows the attribute values in the cluster. To see enlarged histograms for a cluster, click the attribute that you are interested in. The enlarged histograms are displayed below the attribute grid.

In some cases, there may be missing histograms in a cluster.

EM Component

The Component tab provides detailed information about the components of the EM model.

The tab is divided into several panes.

The top pane specifies the component to view:

  • Component: The integer that identifies the component. The default value is 1.

  • Prior: The prior probability of the specified component.

  • Filter by Attribute Name: Enables you to display only those attributes of interest. Enter the attribute name, and click Query.

  • Fetch Size: The number of records fetched. The default is 2,000.

The middle pane displays information about the attributes in the specified component:

  • You can search for a specified Attribute using the search box.

  • The attributes are displayed in a grid. The grid lists Attribute (name), Distribution (as a histogram), and Mean and Variance (numerical attributes only).

    To sort any of these columns, click the column title.

  • To see a larger version of the histogram for an attribute and information about the distribution, select the attribute. The histogram is displayed in the bottom pane.

The bottom pane displays a large version of the selected histogram, data, and projections (if any):

  • The Chart tab contains a larger version of the histogram of the selected attribute.

  • The Data tab shows the frequency of the histogram bins.

  • The Projections tab lists projections in a grid, listing Value and Coefficient for each Attribute Subname.

EM Details

The Details tab contains global details about the Expectation Maximization model.

The following information is displayed:

  • Log Likelihood Improvement

  • Number of Clusters

  • Number of Components

Explicit Semantic Analysis

The Explicit Semantic Analysis (ESA) algorithm uses the concepts of an existing knowledge base as features, instead of latent features derived by latent semantic analysis methods such as Singular Value Decomposition.

Each concept or feature is represented by an attribute vector and identified by a Feature ID. The elements of these attribute vectors quantify the strength of association between the corresponding attributes and the concept. Elements of the attribute vectors may also be categorical values indicating properties of the concept. ESA creates an inverted index that maps every attribute to the knowledge base concepts, that is, to a vector of concept-attribute association values. ESA is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. In ESA, a word and a document are represented as follows:

  • Word: Represented as a column vector in the tf-idf matrix of the text corpus. Typically, the text corpus is Wikipedia.

  • Document (string of words): Represented as the centroid of the vectors representing its words.
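The two representations above can be sketched on a toy corpus. This is a didactic illustration only (a three-concept stand-in for a Wikipedia-scale knowledge base, with a simple tf-idf weighting); it is not Oracle's implementation:

```python
from math import log

corpus = {  # concept -> text; stand-in for a Wikipedia-like knowledge base
    "astronomy": "star planet telescope star",
    "cooking": "recipe oven flour",
    "sailing": "boat star wind",
}

docs = {c: t.split() for c, t in corpus.items()}
concepts = sorted(docs)

def tf_idf(word, concept):
    # Term frequency in the concept document times inverse document frequency.
    tf = docs[concept].count(word) / len(docs[concept])
    df = sum(word in ws for ws in docs.values())
    return tf * log(len(docs) / df)

def word_vector(word):
    # A word is a column of the tf-idf matrix: one entry per concept.
    return [tf_idf(word, c) for c in concepts]

def doc_vector(text):
    # A document is the centroid of the vectors of its words.
    vecs = [word_vector(w) for w in text.split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(concepts))]
```

For example, "recipe" maps to a vector that is nonzero only in the cooking concept, while "star" weighs astronomy more heavily than sailing because it occurs there more often.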

Oracle Data Mining provides a prebuilt ESA model based on Wikipedia, and users can import the model into Oracle Data Miner for data mining purposes.

Uses of Algorithm

You can use the Explicit Semantic Analysis (ESA) algorithm in the area of text processing.

Specific areas of text processing are:

  • Document classification

  • Semantic relatedness calculations

  • Information retrieval
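For the semantic relatedness use case, a common approach is to compare the concept-association vectors of two texts with cosine similarity. The vectors below are hypothetical, chosen only to illustrate the comparison (this is a generic technique, not a documented Oracle API):

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two concept-association vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical ESA vectors over three knowledge-base concepts.
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]
```

Here `doc_a` and `doc_b` load on the same concept, so their similarity is high; `doc_c` loads on a different concept and scores much lower against `doc_a`.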

Supported Mining Models

Lists the data mining models supported by Explicit Semantic Algorithm.

The data mining models are:

  • Model Node

  • Model Details Node

  • Apply Node

ESA Algorithm Settings

Lists the settings supported by the Explicit Semantic Analysis algorithm.

The settings are:

  • Algorithm Name: Displays the name Explicit Semantic Analysis

  • Automatic Preparation: ON (Default). Indicates automatic data preparation.

  • Maximum Number of Text Features: Displays the number of text features.

  • Minimum Items: Determines the minimum number of non-zero entries that must be present in an input row. The default values are:

    • For text inputs: 100

    • For non-text inputs: 0

  • Minimum Number of Rows Required for Token: Displays the minimum number of rows that is required for a token.

  • Missing Values Treatment: If there are missing values in columns with simple data types, then the algorithm replaces missing categorical values with the mode, and missing numerical values with the mean. If there are missing values in nested columns, then the algorithm interprets them as sparse.

  • Sampling: Indicates Enabled or Disabled

  • Threshold Value: Sets the threshold value below which entries in the transformed build data are treated as insignificant. The value must be a non-negative number. The default is 0.00000001.

  • Top N Feature: Controls the maximum number of features per attribute.
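The missing values treatment described in the settings above (mode for categorical columns, mean for numerical columns in simple data types) can be sketched as follows; this is an illustration of the stated behavior, not Oracle's internal code:

```python
def impute(values):
    """Replace None entries: mean for numerical columns,
    mode for categorical columns (sketch of the treatment
    described above for simple data types)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = sum(present) / len(present)           # mean
    else:
        fill = max(set(present), key=present.count)  # mode
    return [fill if v is None else v for v in values]

nums = impute([10.0, None, 30.0])
cats = impute(["red", "red", None, "blue"])
```

Missing values in nested columns, by contrast, are interpreted as sparse rather than imputed.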

ESA Model Viewer

The ESA Model Viewer displays ESA coefficients, alerts, features and algorithm settings.

The Model Viewer has the following tabs:

  • Coefficients: Displays the ESA coefficients. You can specify a Feature ID to search for the coefficients and their attributes.

  • Features

  • Settings

  • Alerts: Displays alerts related to partitioned models, if any.

Features

The Features tab displays all the features along with the Feature IDs and the corresponding items.

The lower panel contains the following tabs:

  • Tag Cloud: Displays the selected feature in a tag cloud format. You can sort the feature tags based on coefficients or alphabetical order. You can also view them in ascending or descending order. To copy and save the cloud image, right-click and select:

    • Save Image As

    • Copy Image to Clipboard

  • Coefficients: Displays the attributes of the selected feature along with their values and coefficients in a tabular format.

Settings (ESA)

The Settings tab contains generic information about the model, algorithm, inputs and text features in the following tabs.

  • Summary: Displays information under three categories:

    • General: Displays generic information related to the model, such as the model name, model type, creation date, duration and so on.

    • Algorithm: Displays information related to the algorithm.

    • Build Details: Displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

  • Inputs: Displays the name, Data Type, Mining Type, Data Preparation, and Partition Key for each attribute.

    • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

    • Partition Key: Yes indicates that the attribute is a partition key

  • Text Features: This tab is visible only if text processing takes place. The tab displays words along with the associated document frequencies.

Coefficients ESA

Displays the attributes of the selected feature along with their values and coefficients in a tabular format. You can search for a feature by specifying the Feature ID.

  • Click the search (magnifying glass) icon. In the Find Values dialog box, enter a feature to search for. You can also provide other parameters to search for features.

  • Click the query (green right arrow) icon to query a feature.

Generalized Linear Models

Generalized Linear Models (GLM) is a statistical technique for linear modeling. Oracle Data Mining supports GLM for both Regression and Classification.

The following topics describe GLM models:

Generalized Linear Models Overview

Generalized Linear Models (GLM) include and extend the class of linear models referred to as Linear Regression.

Oracle Data Mining includes two of the most popular members of the GLM family of models with their most popular link and variance functions:

  • Linear Regression with the identity link and variance function equal to the constant 1 (constant variance over the range of response values).

  • Logistic Regression with the logistic link and binomial variance functions.

In Oracle Database 12c and later, GLM Classification and Regression are enhanced to implement Feature Selection and Feature Generation. This capability, when specified, can enhance the performance of the algorithm and improve the accuracy and interpretability.

Linear Regression

Linear Regression is the GLM Regression algorithm supported by Oracle Data Mining. The algorithm assumes no target transformation and constant variance over the range of target values.

Logistic Regression

Binary Logistic Regression is the GLM classification algorithm supported by Oracle Data Mining. The algorithm uses the logit link function and the binomial variance function.
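
The two link functions above can be illustrated with a one-predictor sketch in Python. This is conceptual only, not Oracle's implementation:

```python
import math

def linear_predict(x, intercept, coef):
    """Linear Regression: identity link, so the prediction is the linear predictor itself."""
    return intercept + coef * x

def logistic_predict(x, intercept, coef):
    """Logistic Regression: the logit link maps the linear predictor to a probability."""
    eta = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-eta))   # inverse logit (sigmoid)

print(linear_predict(2.0, 1.0, 0.5))    # 2.0
print(logistic_predict(0.0, 0.0, 0.5))  # 0.5
```

The binomial variance function follows from the probability: for a predicted probability p, the variance of the binary target is p(1 - p), whereas Linear Regression assumes constant variance over the whole target range.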

Data Preparation for GLM

Oracle recommends that you use Automatic Data Preparation with GLM.

GLM Classification Models

Use a Classification node to build, test, and apply a GLM Classification model.

You can perform the following tasks with GLM Classification model:

  • Build and Test GLM Classification Model: To build and test a GLM Classification (GLMC) model, use a Classification node. By default, the Classification Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test node.

  • Tune GLM Classification Model: After you build and test a GLM classification model, you can tune it.

  • Apply GLM Classification Model: To apply the GLM Classification model, use an Apply node.

GLM Classification Algorithm Settings

Lists the settings supported by the GLM algorithm.

The settings for Classification include:

  • Generate Row Diagnostics: By default, Generate Row Diagnostics is deselected. To generate row diagnostics, you must select this option and also specify a Case ID.

    If you do not specify a Case ID, then this setting is not available.

    You can view Row Diagnostics on the Diagnostics tab in the model viewer. To further analyze row diagnostics, use a Model Details node to extract the row diagnostics table.

  • Confidence Level: A positive number that is less than 1.0. This value indicates the degree of certainty that the true probability lies within the confidence bounds computed by the model. The default confidence is 0.95.

  • Reference Class name: The Reference Target Class is the target value used as a reference in a binary Logistic Regression model. The probabilities are produced for the other (non-reference) classes. By default, the algorithm chooses the value with the highest prevalence (the most cases). If there are ties, then the attributes are sorted alpha-numerically in ascending order. The default for Reference Class name is System Determined, that is, the algorithm determines the value.

  • Missing Values Treatment: The default is Mean Mode, that is, use mean for numeric values and mode for categorical values. You can also select Delete Row to delete any row that contains missing values. If you delete rows with missing values, then the same missing values treatment (delete rows) must be applied to any data that the model is applied to.

  • Specify Row Weights Column: By default, Row Weights Column is not specified. The Row Weights Column is a column in the training data that contains a weighting factor for the rows.

    Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times.

    Row weights can also be used to emphasize certain rows during model construction. For example, to bias the model toward rows that are more recent and away from potentially obsolete data.

    To specify a Row Weights column, click the check box and select the column from the list.

  • Feature Selection: By default, Feature Selection is deselected. This setting requires connection to Oracle Database 12c or later. To specify Feature Selection or view or specify Feature Selection settings, click Option to open the Feature Selection Option dialog box.

    If you select Feature Selection, then Ridge Regression is automatically deselected.

    Note:

    The Feature Selection setting is available in Oracle Database 12c and later.

  • Solver: Allows you to choose the GLM Solver. The options are:

    • System Determined (Default)

    • Stochastic Gradient Descent

    • Cholesky

    • QR

    Note:

    This setting is available only if Oracle Data Miner 18.1 is connected to Oracle Database 12.2 and later.

    Sparse Solver: By default, this setting is disabled.

  • Ridge Regression: By default, Ridge Regression is system determined (not disabled) in both Oracle Database 11g and 12c and later.

    Note:

    The Ridge Regression setting in both Oracle Database 11g and Oracle Database 12c and later must be consistent (system determined).

    If you select Ridge Regression, then Feature Selection is automatically deselected.

    Ridge Regression is a technique that compensates for multicollinearity (multivariate regression with correlated predictors). Oracle Data Mining supports Ridge Regression for both regression and classification mining functions.

    To specify options for Ridge Regression, click Option to open the Ridge Regression Option dialog box.

    When Ridge Regression is enabled, fewer global details are returned. For example, when Ridge Regression is enabled, no prediction bounds are produced.

    Note:

    If you are connected to Oracle Database 11g Release 2 (11.2) and you get the error ORA-40024 when you build a GLM model, then enable Ridge Regression and rebuild the model.

  • Convergence Tolerance: Determines the convergence tolerance of the GLM algorithm. The value must be in the range 0 to 1, non-inclusive. The default is system determined.

    Note:

    This setting is available only if Oracle Data Miner 18.1 is connected to Oracle Database 12.2 and later.
  • Number of Iterations: Controls the maximum number of iterations for the GLM algorithm. Default is system determined.

    Note:

    This setting is available only if Oracle Data Miner 18.1 is connected to Oracle Database 12.2 and later.
  • Batch Rows: Controls the number of rows in a batch used by the solver. Default is 2000.

    Note:

    This setting is available only if Oracle Data Miner 18.1 is connected to Oracle Database 12.2 and later.
  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For GLM, approximation is appropriate for data sets that have many rows and are densely populated (not sparse).

    The values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable
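
The effect of the ridge penalty on correlated predictors can be sketched with a tiny two-predictor example. This is a conceptual illustration using assumed X'X and X'y values, not Oracle's solver:

```python
# Ridge regression adds a penalty lam to the diagonal of X'X, which
# stabilizes the nearly singular normal equations that arise when
# predictors are strongly correlated (multicollinearity).
def ridge_2x2(xtx, xty, lam):
    """Solve (X'X + lam*I) b = X'y for two predictors via Cramer's rule."""
    a, b = xtx[0][0] + lam, xtx[0][1]
    c, d = xtx[1][0], xtx[1][1] + lam
    det = a * d - b * c
    return [(d * xty[0] - b * xty[1]) / det,
            (a * xty[1] - c * xty[0]) / det]

# Nearly collinear predictors: X'X is close to singular.
xtx = [[1.0, 0.99], [0.99, 1.0]]
xty = [1.0, 0.98]
ols   = ridge_2x2(xtx, xty, 0.0)  # wild, opposite-sign coefficients
ridge = ridge_2x2(xtx, xty, 0.1)  # shrunken, same-sign coefficients
```

Without the penalty the solution swings to roughly +1.5 and -0.5; with a modest penalty both coefficients shrink to moderate positive values, which is the stabilizing behavior Ridge Regression provides.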

Feature Selection Option Dialog

The Feature Selection Option dialog box enables you to specify feature selection for a GLMC or a GLMR model.

In the Algorithm Settings tab, if you select Feature Selection, then Ridge Regression is automatically deselected. Select Feature Selection, and click Option. In the Feature Selection Option dialog box, provide the following settings:

Note:

The setting requires connection to Oracle Database 12c or later.
  • Prune Model: By default, Enable is selected. You can also select Disable.

  • Max Number of Features: The default setting is system determined.

    To specify the number of features, click User Specified and enter an integer number of features.

  • Feature Selection Criteria: The default setting is system determined. You can select one of the following:

    • Akaike Information

    • Schwarz Bayesian Information

    • Risk Inflation

    • Alpha Investing

  • Feature Identification: The default setting is system determined.

    You can also choose:

    • Enable Sampling

    • Disable Sampling

  • Feature Acceptance: The default setting is system determined.

    You can also choose:

    • Strict

    • Relaxed

  • Categorical Predictor Treatment: By default, Add One at a Time is selected. You can also select Add All at Once.

    If you accept the default, that is Add One at a Time, then Feature Generation is not selected. If you select Feature Generation, then the default is Quadratic Candidates. You can also select Cubic Candidates.
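
The Akaike and Schwarz Bayesian criteria listed above both trade goodness of fit against model size. A sketch of the standard formulas follows (illustrative log-likelihood values; not Oracle's internal computation):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Schwarz Bayesian Information Criterion: k*ln(n) - 2*logL (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Adding a feature raises logL slightly; both criteria ask whether the
# improvement justifies the extra parameter, and BIC penalizes harder
# as the number of rows n grows.
print(aic(-100.0, 3), aic(-99.5, 4))               # 206.0 207.0
print(bic(-100.0, 3, 1000) < bic(-99.5, 4, 1000))  # True
```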

Choose Reference Value (GLMC)

You can set the reference value for Generalized Linear Model in the Choose Reference Value dialog box.

To set a reference value for the Generalized Linear Model:

  1. In the Advanced Settings dialog box, go to the Algorithm Settings tab.

  2. In the Reference Class Name field, click Edit. The Choose Reference Value dialog box opens.

  3. In the Choose Reference Value dialog, select Custom. Click the search icon to select a reference value for the model.

  4. Select one of the values in the Target Values list.

  5. Click OK.

Ridge Regression Option Dialog (GLMC)

You can use the system-determined Ridge Value or you can supply your own. By default, the system determined value is used.

Click OK.

GLM Classification Model Viewer

The GLM Classification (GLMC) model viewer displays the characteristics of a GLMC model. GLMC is also known as Logistic Regression.

To view a GLMC model, use one of these methods:

The viewer has these tabs:

  • Details

  • Coefficients

  • Compare

  • Diagnostics. Diagnostics are not generated by default.

  • Settings

GLMC Details

Model Details lists the global metrics for the model as a whole.

The metrics display has two columns: Name of the metric and Value of the metric. The following metrics are displayed:

  • Akaike's criterion (AIC) for the fit of the intercept only model

  • Akaike's criterion for the fit of the intercept and the covariates (predictors) model

  • Dependent mean

  • Likelihood ratio chi-square value

  • Likelihood ratio chi-square probability value

  • Likelihood ratio degrees of freedom

  • Model converged (Yes or No)

  • -2 log likelihood of the intercept only model

  • -2 log likelihood of the model

  • Number of parameters (number of coefficients, including the intercept)

  • Number of rows

  • Correct Prediction percentage

  • Incorrectly predicted percentage of rows

  • Tied cases prediction, that is, cases where no prediction can be made

  • Pseudo R-Square Cox and Snell

  • Pseudo R-Square Nagelkerke

  • Schwarz's Criterion (SC) for the fit of the intercept-only model

  • Schwarz's Criterion for the fit of the intercept and the covariates (predictors) model

  • Termination (normal or not)

  • Valid covariance matrix (Yes or No)

Note:

The exact list of metrics computed depends on the model settings.
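
Several of these metrics follow directly from the two log-likelihoods (intercept-only and full model). A sketch of the standard pseudo R-square formulas, using made-up log-likelihood values rather than output from an actual model:

```python
import math

def pseudo_r2_cox_snell(loglik_null, loglik_model, n):
    """Cox and Snell pseudo R-square from the intercept-only and
    full-model log-likelihoods over n rows."""
    return 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_model))

def pseudo_r2_nagelkerke(loglik_null, loglik_model, n):
    """Nagelkerke rescales Cox and Snell so its maximum attainable value is 1."""
    cs = pseudo_r2_cox_snell(loglik_null, loglik_model, n)
    return cs / (1.0 - math.exp((2.0 / n) * loglik_null))

ll0, ll1, n = -150.0, -120.0, 300   # hypothetical log-likelihoods
print(round(pseudo_r2_cox_snell(ll0, ll1, n), 3))   # 0.181
print(round(pseudo_r2_nagelkerke(ll0, ll1, n), 3))  # 0.287
```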

Other Tabs: The viewer also has these tabs:

  • Coefficients

  • Compare

  • Diagnostics (if generated)

  • Settings

GLMC Coefficients

The Coefficient tab allows you to view GLM coefficients.

The viewer supports sorting to control the order in which coefficients are displayed and filtering to select the coefficients to display.

The default is to sort coefficients by absolute value. If you select or deselect Sort by Absolute Value, then click Query to refresh the display.

The default fetch size is 1000 records. To change the fetch size, specify a new number of records and click Query.

Note:

After you change any criteria on this tab, click Query to query the database. You must click Query even for changes such as selecting or deselecting sort by absolute value or changing the fetch size.

The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. If a coefficient is close to 0, then the bar may be too small to display.

  • Target Value: Select a specific target value and see only those coefficients. The default is to display the coefficients of the value that occurs least frequently. It is possible for a target value to have no coefficients; in that case, the list has no entries.

  • Sort by absolute value: The default is to sort the list of coefficients by absolute value; you can deselect this option.

  • Fetch Size: The number of rows displayed. The default is 1000. To verify that all coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.

Coefficients are listed in a grid. If no items are listed, then there are no coefficients for that target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The linear coefficient estimate for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. (If a value is close to 0, then the bar may be too small to be displayed.)

  • Standardized coefficient: The coefficient rescaled by the ratio of the standard deviation of the predictor to the standard deviation of the target.

    The standardized coefficient places all coefficients on the same scale, so that you can, at a glance, tell the large contributors from the small ones.

  • Exp (Coefficient): The exponentiated coefficient, that is, e raised to the power of the coefficient.

  • Standard Error of the estimate.

  • Wald Chi Square

  • Probability of greater than Chi Square

  • Test Statistic: For linear Regression, the t-value of the coefficient estimate; for Logistic Regression, the Wald chi-square value of the coefficient estimate

  • Probability of the test statistic. Used to analyze the significance of specific attributes in the model

  • Variance Inflation Factor

    • 0 for the intercept

    • Null for Logistic Regression

  • Lower Coefficient Limit, lower confidence bound of the coefficient

  • Upper Coefficient Limit, upper confidence bound of the coefficient

  • Exp (Coefficient)

    • Exponentiated coefficient for Logistic Regression

    • Null for Linear Regression

  • Exp (Lower Coefficient Limit)

    • Exponentiated coefficient of the lower confidence bound for Logistic Regression

    • Null for Linear Regression

  • Exp (Upper Coefficient Limit)

    • Exponentiated coefficient of upper confidence bound for Logistic Regression

    • Null for Linear Regression
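
The standardized and exponentiated coefficients described above follow simple formulas. A brief sketch with illustrative values (not model output):

```python
import math

def standardized_coefficient(coef, std_predictor, std_target):
    """Rescale the coefficient by the ratio of the predictor's standard
    deviation to the target's standard deviation, placing all
    coefficients on a common scale."""
    return coef * std_predictor / std_target

def exp_coefficient(coef):
    """Exp (Coefficient): for Logistic Regression, the odds ratio
    associated with a one-unit increase in the predictor."""
    return math.exp(coef)

print(standardized_coefficient(0.8, 2.5, 4.0))
print(exp_coefficient(0.0))  # 1.0 -- a zero coefficient leaves the odds unchanged
```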

Note:

Not all statistics are necessarily returned for each coefficient.

Statistics are null if any of the following are true:

  • The statistic does not apply to the mining function. For example, Exp (coefficient) does not apply to Linear Regression.

  • The statistic cannot be calculated because of limitations in system resources.

  • The value of the statistic is infinity.

  • If the model was built using Ridge Regression, or if the covariance matrix is found to be singular during the build, then coefficient bounds (upper and lower) have the value NULL.

Other Tabs: The viewer also has these tabs:
  • Coefficients

  • Compare

  • Diagnostics

  • Settings

Sort and Search GLMC Coefficients

You can sort numerical columns by clicking the title of the column.

For example, to arrange the coefficients in increasing order, click the Coefficient column heading in the grid.

Use search to search for items. The default is to search by Attribute (name).

There are search options that limit the columns displayed. The filter settings with the (or)/(and) suffixes enable you to enter multiple strings separated by spaces. For example, if you select Attribute/Value/Coefficient(or), then the filter string A .02 returns all rows where the Attribute or the Value starts with the letter A, or the Coefficient starts with 0.02.

To clear a search, click delete.
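
The (or)-style filter described above can be sketched as a prefix match across the displayed columns. This is a simplified approximation of the UI behavior, with hypothetical rows:

```python
def filter_or(rows, terms):
    """Keep rows where any displayed column starts with any of the
    space-separated search terms (each row: attribute, value, coefficient)."""
    return [row for row in rows
            if any(str(cell).startswith(term) for term in terms for cell in row)]

rows = [("AGE", "30", "0.021"),
        ("INCOME", "50K", "0.500"),
        ("AFFINITY", "Y", "-0.100")]

# "A" matches two attribute names; "0.02" matches one coefficient.
matched = filter_or(rows, "A 0.02".split())
print([r[0] for r in matched])  # ['AGE', 'AFFINITY']
```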

GLMC Compare

The GLM Classification Compare viewer is similar to the SVM Coefficients Compare viewer, except that GLM models can be built only for binary classification.

Therefore, only two target class values are available to compare.

Other Tabs: The viewer has the following tabs:

  • Details

  • Coefficients

  • Diagnostics

  • Settings

GLMC Diagnostics

The Diagnostics tab for GLM Classification displays diagnostics for each Case ID in the build data.

You can filter the results.

Note:

Diagnostics are not generated by default. To generate diagnostics, you must specify a Case ID and select Generate Row Diagnostics in Advanced Settings.

The following information is displayed in the Diagnostics grid:

  • CASE_ID

  • TARGET_VALUE for the row in the training data

  • TARGET_VALUE_PROB, probability associated with the target value

  • HAT, value of diagonal element of the HAT matrix

  • WORKING_RESIDUAL, the residual with the adjusted dependent variable

  • PEARSON_RESIDUAL, the raw residual scaled by the estimated standard deviation of the target

  • DEVIANCE_RESIDUAL, contribution to the overall goodness of the fit of the model

  • C, confidence interval displacement diagnostic

  • CBAR, confidence interval displacement diagnostic

  • DIFDEV, change in the deviance due to deleting an individual observation

  • DIFCHISQ, change in the Pearson chi-square
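
For a binary target, the Pearson and deviance residuals have standard closed forms. A sketch follows (illustrative values only, not Oracle's implementation):

```python
import math

def pearson_residual(y, p):
    """Raw residual scaled by the estimated standard deviation of the
    target (binary target y in {0, 1}, predicted probability p)."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of this case's contribution to the model deviance."""
    loglik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * loglik), y - p)

# A case with target 1 predicted at probability 0.8:
print(pearson_residual(1, 0.8))
print(deviance_residual(1, 0.8))
```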

Other Tabs: The viewer has these other tabs:

  • Details

  • Coefficients

  • Compare

  • Settings

GLMC Settings

The Settings tab provides information related to the model summary, algorithm details, partition details in the case of a partitioned model, and so on.

In the Partition field, click the partition name. The partition detail is displayed in the Partition Details window.

Click search to open the Select Partition dialog box to view filtered partitions based on the Partition keys.

The Settings tab contains the following tabs:

Related Topics

Summary

The Summary tab contains information about the characteristics of the model, algorithm settings, and build details.

The General Settings section contains the following information:

  • Name

  • Type

  • Algorithm

  • Target Attribute

  • Creation Date

  • Duration of Model Build

  • Comments

Algorithm Settings control the model build. Algorithm settings are specified when the Build node is defined.

After a model is built, values calculated by the system are displayed on this tab. For example, if you select System Determined for Enable Ridge Regression, then this tab shows if Ridge Regression was enabled, and what ridge value was calculated.

Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Other Tabs: The Settings tab has these other tabs:

  • Inputs

  • Target Values

Inputs

The Inputs tab displays the list of the attributes used to build the model.

For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Partition Keys

The Partition Keys tab lists the columns that are partitioned.

Along with the partitioned columns, the Partition Keys tab lists the following details:
  • Partition Name

  • Source

  • Data Type

  • Value

Weights

The Weights tab displays the weights that are calculated by the system for each target value.

If you tune the model, then the weights may change.

GLMC Target Values

The Target Values tab displays the target attributes, their data types, and the values of each target attribute.

Other Tabs: The Settings tab has these other tabs:

  • Summary

  • Inputs

Related Topics

GLM Regression Models

Use a Regression node to build, test, and apply a GLM Regression model.

You can perform the following tasks with GLM Regression models:

  • Build and Test GLM Regression model: To build and test a GLM Regression (GLMR) model, use a Regression node. By default, a Regression node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test node.

  • Apply GLM Regression model: To apply a GLM Regression model, use an Apply node.

GLM Regression Algorithm Settings

Lists the settings supported by the Generalized Linear Model for Regression.

The settings for Regression are:

  • Generate Row Diagnostics is set to OFF by default. To generate row diagnostics, you must select this option and also specify a Case ID.

    If you do not specify a Case ID, then this setting is not available.

    You can view Row Diagnostics on the Diagnostics tab when you view the model. To further analyze row diagnostics, use a Model Details node to extract the row diagnostics table.

  • Confidence Level: A positive number that is less than 1.0. This level indicates the degree of certainty that the true probability lies within the confidence bounds computed by the model. The default confidence is 0.95.

  • Missing Values Treatment: The default is Mean Mode. That is, use Mean for numeric values and Mode for categorical values. You can also select Delete Row to delete any row that contains missing values. If you delete rows with missing values, then the same missing values treatment (delete rows) must be applied to any data that the model is applied to.

  • Specify Row Weights Column: The Row Weights Column is a column in the training data that contains a weighting factor for the rows. By default, Row Weights Column is not specified. Row weights can be used:

    • As a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times.

    • To emphasize certain rows during model construction. For example, to bias the model toward rows that are more recent and away from potentially obsolete data

  • Feature Selection: This setting requires connection to Oracle Database 12c or later. By default, Feature Selection is deselected. To specify Feature Selection, or to view or specify Feature Selection settings, click Option to open the Feature Selection Option dialog box.

    If you select Feature Selection, then Ridge Regression is automatically deselected.

    Note:

    The Feature Selection setting is available in Oracle Database 12c and later.

  • Solver: Allows you to choose the GLM Solver. The options are:

    • System Determined

    • Stochastic Gradient Descent

    • Cholesky

    • QR

    Sparse Solver: By default, this setting is disabled.

  • Ridge Regression: Ridge Regression is a technique that compensates for multicollinearity (multivariate regression with correlated predictors). Oracle Data Mining supports Ridge Regression for both regression and classification mining functions.

    By default, Ridge Regression is system determined (not disabled) in both Oracle Database 11g and Oracle Database 12c. If you select Ridge Regression, then Feature Selection is automatically deselected.

    To specify options for Ridge Regression, click Option to open the Ridge Regression Option dialog box.

    When Ridge Regression is enabled, fewer global details are returned. For example, when Ridge Regression is enabled, no prediction bounds are produced.

    Note:

    If you are connected to Oracle Database 11g Release 2 (11.2) and you get the error ORA-40024 when you build a GLM model, then enable Ridge Regression and rebuild the model.

  • Convergence Tolerance: Determines the convergence tolerance of the GLM algorithm. The value must be in the range 0 to 1, non-inclusive. The default is system determined.

  • Number of Iterations: Controls the maximum number of iterations for the GLM algorithm. Default is system determined.

  • Batch Rows: Controls the number of rows in a batch used by the solver. Default is 2000.

  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For GLM, approximation is appropriate for data sets that have many rows and are densely populated (not sparse).

    Values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable

Ridge Regression Option Dialog (GLMR)

You can set the ridge value for Generalized Linear Model for Regression in the Ridge Regression Option dialog box.

You can use the System Determined Ridge Value or you can supply your own. By default, the System Determined value is used. Produce Variance Inflation Factor (VIF) is not selected by default.

Choose Reference Value (GLMR)

You can set the reference value for Generalized Linear Model for Regression in the Choose Reference Value dialog box.

To select a value:

  1. Click Edit.
  2. In the Choose Reference Value dialog box, click Custom.
  3. Select one of the values in the Target Values field.
  4. Click OK.

GLM Regression Model Viewer

The GLM Regression (GLMR) model viewer displays characteristics of a GLMR model. GLMR is also known as Linear Regression.

The default name of a GLM model contains GLM.

The GLMR viewer opens in a new tab.

The Detail tab is displayed by default.

The GLM Regression Model Viewer has these tabs:

  • Details

  • Coefficients

  • Diagnostics (The default is to not generate diagnostics.)

  • Settings

GLMR Coefficients

You can view the GLM coefficients in the Coefficient tab.

The viewer supports sorting to control the order in which coefficients are displayed and filtering to select the coefficients to display.

By default, coefficients are sorted by absolute value. You can deselect or select Sort by absolute value and click Query.

The default fetch size is 1,000 records. To change the fetch size, specify a new number of records and click Query.

Note:

After you change any criteria on this tab, click Query to query the database. You must click Query even for changes such as selecting or deselecting sort by absolute value or changing the fetch size.

The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. If a coefficient is close to 0, then the bar may be too small to display.

  • Sort by absolute value: Sort the list of coefficients by absolute value.

  • Fetch Size: The number of rows displayed. To verify that all coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.

Coefficients are listed in a grid. If no items are listed, then there are no coefficients for that target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The linear coefficient estimate for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. If a value is close to 0, then the bar may be too small to be displayed.

  • Standard Error of the estimate

  • Wald Chi Squared

  • Pr > Chi Square

  • Upper coefficient limit

  • Lower coefficient limit

Note:

Not all statistics are necessarily returned for each coefficient.

Statistics are null if any of the following are true:

  • The statistic does not apply to the mining function. For example, exp_coefficient does not apply to Linear Regression.

  • The statistic cannot be calculated because of limitations in the system resources.

  • The value of the statistic is infinity.

  • If the model was built using Ridge Regression, or if the covariance matrix is found to be singular during the build, then coefficient bounds (upper and lower) have the value NULL.

Other Tabs: The viewer has these other tabs:

  • Details

  • Diagnostics

  • Settings

GLMR Details

Model Details lists the global metrics for the model as a whole.

The metrics display has two columns: Name of the metric and Value of the metric. The following metrics are displayed:

  • Adjusted R-Square

  • Akaike's information criterion

  • Coefficient of variation

  • Corrected total degrees of freedom

  • Corrected total sum of squares

  • Dependent mean

  • Error degrees of freedom

  • Error mean square

  • Error sum of squares

  • Model F value statistic

  • Estimated mean square error

  • Hocking Sp statistic

  • JP statistic (the final prediction error)

  • Model converged (Yes or No)

  • Model degrees of freedom

  • Model F value probability

  • Model mean square

  • Model sum of squares

  • Number of parameters (the number of coefficients, including the intercept)

  • Number of rows

  • Root mean square error

  • R-square

  • Schwarz's Bayesian Information Criterion

  • Termination

  • Valid covariance matrix computed (Yes or No)
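
R-square and Adjusted R-square follow from the error and corrected total sums of squares listed above. A brief sketch with illustrative values (not model output):

```python
def r_square(sse, sst):
    """R-square: proportion of the corrected total sum of squares
    explained by the model."""
    return 1.0 - sse / sst

def adjusted_r_square(sse, sst, n, p):
    """Adjusted R-square penalizes additional parameters; p counts
    the coefficients including the intercept, n is the number of rows."""
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

print(r_square(20.0, 100.0))                            # 0.8
print(round(adjusted_r_square(20.0, 100.0, 30, 4), 3))  # 0.777
```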

GLMR Diagnostics

The Diagnostics tab displays diagnostics for each Case ID in the build data.

You can filter the results.

Note:

Diagnostics are not generated by default. To generate diagnostics, you must specify a Case ID and select Generate Row Diagnostics.

The following information is displayed in the Diagnostics grid:

  • CASE_ID

  • TARGET_VALUE for the row in the training data

  • PREDICTED_VALUE, value predicted by the model for the target

  • HAT, value of the diagonal element of the HAT matrix

  • RESIDUAL, the residual with the adjusted dependent variable

  • STD_ERR_RESIDUAL, Standard Error of the residual

  • STUDENTIZED_RESIDUAL

  • PRED_RES, predicted residual

  • COOKS_D, Cook's D influence statistic
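These columns correspond to standard regression diagnostics. The sketch below computes them for a simple linear regression on made-up data; the formulas are textbook ordinary least squares definitions, not Oracle Data Mining internals:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 8.1, 9.9]        # hypothetical target values

n, p = len(xs), 2                     # rows; parameters (slope and intercept)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]          # RESIDUAL
s2 = sum(e ** 2 for e in residuals) / (n - p)                    # error mean square

hats, studentized, cooks = [], [], []
for x, e in zip(xs, residuals):
    h = 1 / n + (x - x_bar) ** 2 / sxx                           # HAT diagonal
    se_res = math.sqrt(s2 * (1 - h))                             # STD_ERR_RESIDUAL
    hats.append(h)
    studentized.append(e / se_res)                               # STUDENTIZED_RESIDUAL
    cooks.append(e ** 2 * h / (p * s2 * (1 - h) ** 2))           # COOKS_D

print(hats, studentized, cooks)
```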

Other Tabs: The viewer has these other tabs:

  • Details

  • Coefficients

  • Settings

GLMR Settings

The Settings tab contains information related to inputs, build details, algorithm settings, and other general settings.

The Settings tab has these tabs:

GLMR Summary

The Summary tab contains information related to general settings, algorithm settings and build details.

  • General settings describe the characteristics of the model, including owner, name, type, algorithm, target attribute, creation date, duration of model build, and comments.

  • Algorithm settings control the model build; algorithm settings are specified when the build node is defined. After a model is built, values calculated by the system are displayed on this tab. For example, if you select System Determined for Enable Ridge Regression, then this tab shows whether Ridge Regression was enabled and what ridge value was calculated.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

  • The Settings tab also contains the GLMR Inputs tab.

GLMR Inputs

The Inputs tab displays the list of attributes that are used to build the model.

For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data type: The data type of the attribute.

  • Mining type: Categorical or numerical.

  • Target: A check mark indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

GLMR Coefficients

You can view the GLM coefficients in the Coefficients tab.

The viewer supports sorting, to control the order in which coefficients are displayed, and filtering, to select which coefficients to display.

By default, coefficients are sorted by absolute value. You can deselect or select Sort by absolute value and click Query.

The default fetch size is 1,000 records. To change the fetch size, specify a new number of records and click Query.

Note:

After you change any criteria on this tab, click Query to query the database. You must click Query even for changes such as selecting or deselecting sort by absolute value or changing the fetch size.

The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. If a coefficient is close to 0, then the bar may be too small to display.

  • Sort by absolute value: Sort the list of coefficients by absolute value.

  • Fetch Size: The number of rows displayed. To determine whether all coefficients are displayed, choose a fetch size greater than the number of rows displayed.

Coefficients are listed in a grid. If no items are listed, then there are no coefficients for that target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The linear coefficient estimate for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. If a value is close to 0, then the bar may be too small to be displayed.

  • Standard Error of the estimate

  • Wald Chi Squared

  • Pr > Chi Square

  • Upper coefficient limit

  • Lower coefficient limit

Note:

Not all statistics are necessarily returned for each coefficient.

Statistics are null if any of the following are true:

  • The statistic does not apply to the mining function. For example, exp_coefficient does not apply to Linear Regression.

  • The statistic cannot be calculated because of limitations in the system resources.

  • The value of the statistic is infinity.

  • If the model was built using Ridge Regression, or if the covariance matrix is found to be singular during the build, then coefficient bounds (upper and lower) have the value NULL.
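For orientation, the Wald Chi Squared, Pr > Chi Square, and coefficient limit columns relate to the coefficient and its standard error by standard one-degree-of-freedom formulas. The numbers below are made up, and this is a textbook sketch rather than Oracle Data Mining output:

```python
import math

coefficient = 0.42          # hypothetical coefficient estimate
std_error = 0.15            # hypothetical standard error of the estimate

z = coefficient / std_error
wald_chi_squared = z ** 2
# Upper-tail probability of a chi-square(1) variate, via the normal tail
pr_gt_chi_square = math.erfc(abs(z) / math.sqrt(2))
# Approximate 95% coefficient limits
upper_limit = coefficient + 1.96 * std_error
lower_limit = coefficient - 1.96 * std_error

print(wald_chi_squared, pr_gt_chi_square, lower_limit, upper_limit)
```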

Other Tabs: The viewer has these other tabs:

  • Details

  • Diagnostics

  • Settings


k-Means

The k-Means (KM) algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters, provided there are enough distinct cases.

Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.
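As a sketch of nearest-cluster assignment under the Euclidean and Cosine options (Fast Cosine is an optimized cosine computation inside Oracle Data Mining, so only plain cosine is shown), with made-up centroids and a made-up point:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

# Hypothetical centroids; a data point is assigned to the nearest one
centroids = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}
point = (1.0, 1.0)
nearest = min(centroids, key=lambda c: euclidean(point, centroids[c]))
print(nearest)
```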

To build and apply KM models:
  • Use a Clustering node to build KM models.

  • Use an Apply node to apply a KM model to new data.

The following topics describe KM models:

Related Topics

k-Means Algorithm

Oracle Data Mining implements an enhanced version of the k-Means algorithm.

The features of the k-Means algorithm are:

  • The algorithm builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-Means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.

  • The algorithm grows the tree, one node at a time (unbalanced approach). Based on a user setting, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified as a build setting.

  • The algorithm provides probabilistic scoring and assignment of data to clusters.

  • The algorithm returns the following, for each cluster:

    • A centroid (cluster prototype). The centroid reports the mode for categorical attributes, or the mean and variance for numerical attributes.

    • Histograms (one for each attribute).

    • A rule describing the hyper box that encloses the majority of the data assigned to the cluster.

The clusters discovered by enhanced k-Means are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The k-Means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components.
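The assign/update cycle at the core of k-Means can be sketched as follows. This is a minimal flat Lloyd-style implementation on made-up points; it omits the hierarchical splits, probabilistic scoring, and the other enhancements described above:

```python
def kmeans(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centroids))}
        for pt in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(pt, centroids[j])))
            clusters[nearest].append(pt)
        # New centroid = per-dimension mean of the assigned points
        centroids = [tuple(sum(dim) / len(dim) for dim in zip(*clusters[i]))
                     if clusters[i] else centroids[i]
                     for i in range(len(centroids))]
    return centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)], iterations=5))
```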

Note:

The k-Means algorithm samples one million rows; the sample is used to build the model.

KM Algorithm Settings

The k-Means (KM) algorithm supports the settings related to number of clusters, growth factor, convergence tolerance, Distance function, number of iterations, and minimum attribute support.

The settings and their descriptions are:

  • Number of Clusters is the maximum number of leaf clusters generated by the algorithm. The default is 10. k-Means usually produces the exact number of clusters specified, unless there are fewer distinct data points.

  • Growth Factor is a number greater than 1 and less than or equal to 5. This value specifies the growth factor for memory allocated to hold cluster data. Default is 2.

  • Convergence Tolerance must be between 0.001 (slow build) and 0.1 (fast build). The default is 0.01. The tolerance controls the convergence of the algorithm. The smaller the value, the closer the solution is to the optimum, at the cost of longer run times. This parameter interacts with the number of iterations parameter.

  • Distance Function specifies how the algorithm calculates distance. The default distance function is Euclidean. The other distance functions are:

    • Cosine

    • Fast Cosine

  • Number of Iterations must be greater than or equal to 1. The default is 30. This value is the maximum number of iterations for the k-Means algorithm. In general, more iterations result in a slower build. However, the algorithm may reach the maximum, or it may converge early. The convergence is determined by whether the Convergence Tolerance setting is satisfied.

  • Min Percent Attribute Support is a fraction, not an integer. Its value must be greater than or equal to 0 and less than or equal to 1. The default value is 0.1, which highlights the more important predicates instead of producing a long list of predicates that have very low support.

    You can use this value to filter out rule predicates that do not meet the support threshold. Setting this value too high can result in very short or even empty rules.

    In extreme cases, for very sparse data, all attribute predicates may be filtered out so that no rule is produced. If no rule is produced, then you can lower the support threshold and rebuild the model to make the algorithm produce rules even if the predicate support is very low.

  • Number of Histogram Bins is a positive integer; the default value is 10. This value specifies the number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equiwidth. All attributes have the same number of bins, except attributes with a single value, which have only one bin.

  • Split Criterion is either Variance or Size. The default is Variance. The split criterion is related to the initialization of the k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time. Size results in placing the new cluster in the area where the largest current cluster is located. Variance places the new cluster in the area of the most spread out cluster.

  • Level of Details determines the level of cluster detail that is computed during the build. The applicable values are:

    • None: No details. Only scoring information is persisted

    • Hierarchy: Cluster Hierarchy and Cluster Record Counts

    • All: Cluster Hierarchy, Record Counts, and all descriptive statistics such as Means, Variances, Modes, Histograms, Rules

  • Random Seed controls the seed of the random generator used during the k-Means initialization. The random seed must be greater than or equal to 0. The default is 0.
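The equiwidth binning used for the attribute histograms can be sketched as follows; the values and bin count are made up, and the single-value special case follows the description above:

```python
def equiwidth_bins(values, n_bins=10):
    """Bin boundaries computed globally from the attribute's min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:                          # single-valued attribute: one bin
        return [lo, hi]
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def histogram(values, edges):
    """Count values per equiwidth bin; the top edge is inclusive."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        i = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[i] += 1
    return counts

vals = [1.0, 2.0, 2.5, 9.0, 10.0]
edges = equiwidth_bins(vals, n_bins=3)
print(edges, histogram(vals, edges))
```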

KM Model Viewer

In the KM Model Viewer, you can examine a KM model.

The KM model viewer contains these tabs:

EM, KM, and OC Tree Viewer

The Tree Viewer is the graphical tree for hierarchical clusters.

The tree viewers for Expectation Maximization, k-Means, and Orthogonal Clustering operate in the same way. When you view the tree:

  • The Workflow Thumbnail opens to give you a view of the entire tree.

  • The Structure window helps you navigate and analyze the tree.

You can compare the attributes in a given node with the attributes in the population using EM, KM, and OC Compare.

To view information about a particular node:

  1. Select the node.

  2. In the lower pane, the following are displayed in each of these tabs:

    • Centroid: Displays the centroid of the cluster

    • Cluster Rule: Displays the rule that all elements of the cluster satisfy.

Display Control:

The following control the display of the tree as a whole:

  • Zoom in: Zooms in to the diagram, providing a more detailed view of the tree.

  • Zoom out: Zooms out from the diagram, providing a view of more or all of the tree.

  • Percent size: Enables you to select an exact percentage to zoom the view.

  • Fit to Window: Zooms out from the diagram until the whole diagram fits within the screen.

  • Layout Type: Enables you to select horizontal layout or vertical layout; the default is vertical.

  • Expand: Expands all nodes, showing all branches of the tree.

  • Show more detail: Shows more data for each tree node. Click again to show less detail.

  • Top Attributes: Displays the top N attributes. N is 5 by default. To change N, select a different number from the list.

  • Refresh: Enables you to apply the changed Query Settings.

  • Query Settings: Enables you to change the number of top attributes. The default is 10. You can save a different number as the new default.

  • Save Rules

Cluster (Viewer)

The Cluster tab enables you to view information about a selected cluster. The viewer supports filtering so that only selected probabilities are displayed.

The Cluster tab for EM, KM, and OC operates in the same way.

The following information is displayed:

  • Cluster: The ID of the cluster being viewed. To view another cluster, select a different ID from the menu. You can view Leaves Only (terminal clusters) by selecting Leaves Only. Leaves Only is the default.

  • Fetch Size: Default is 20. You can change this value.

    If you change Fetch Size, then click Query to see the new display.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Name of the attribute.

  • Histogram of the attribute values in the cluster.

  • Confidence displayed as both a number and with a bar indicating a percentage. If confidence is very small, then no bar is displayed.

  • Support, the number of cases.

  • Mean, displayed for numeric attributes.

  • Mode, displayed for categorical attributes.

  • Variance

To view a larger version of the histogram, select an attribute; the histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

You can search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box.

The drop-down list enables you to search the grid by Attribute (the default) or by Mode. Type the search term in the box next to search.

To clear a search, click delete.

Other Tabs: The NB Model Viewer also has these tabs:

  • EM, KM, and OC Tree Viewer

  • EM, KM, and OC Compare

  • Settings

EM, KM, and OC Compare

In the Compare tab, you can compare two clusters in the same model.

The Compare tab for EM, KM, and OC operates in the same way. The display enables you to select the two clusters to compare.

You can perform the following tasks:

  • Compare Clusters: To select clusters to compare, pick them from the lists. The clusters are compared by comparing attribute values. The comparison is displayed in a grid. You can use Compare to compare an individual cluster with the population.

  • Rename Clusters: To rename clusters, click Edit. This opens the Rename Cluster dialog box. By default, only leaf clusters are displayed. To show all nodes, deselect Leaves Only. The default Fetch Size is 20. You can change this value.

  • Search Attribute: To search for an attribute, enter its name in the search box. You can also search by rank.

  • Create Query: If you make any changes, then click Query.

For each cluster, a histogram is generated that shows the attribute values in the cluster. To see enlarged histograms for a cluster, click the attribute that you are interested in. The enlarged histograms are displayed below the attribute grid.

In some cases, there may be missing histograms in a cluster.

Compare Cluster with Population

You can view how an individual cluster compares with the population.

To compare cluster with the population:

  1. Click Compare.
  2. Deselect Leaves Only.
  3. Select the root node as Cluster 1. This is cluster 1, if the clusters have not been renamed. The distribution of attribute values in Cluster 1 represents the distribution of values in the population as a whole. Select the cluster that you want to compare with the population as Cluster 2.
  4. You can now compare the distribution of values for each attribute in the cluster selected as Cluster 2 with the values in Cluster 1.
Missing Histograms in a Cluster

If clusters are built using sparse data, then some attribute values are not present in the records assigned to a cluster.

In this case, a cluster comparison shows the centroid and histogram values for the cluster where the attribute is present, and leaves blanks for the cluster where the attribute is not present.

Rename Cluster

Cluster ID is a number; you can change it to a string. The title bar of the dialog box shows the cluster to rename.

To rename a cluster:

  1. Type in the new name.

  2. Click OK.

Note:

Two different clusters cannot have the same name.

KM Settings

The Settings tab displays information about how the model was built.

The information is available in the following tabs:

  • Cluster Model Summary

  • Cluster Model Input

Other Tabs:

  • EM, KM, and OC Tree Viewer

  • Cluster Viewer

  • EM, KM, and OC Compare

Cluster Model Settings (Viewer)

The Settings tab in the Cluster Model Viewer contains information related to model summary and model inputs.

The information is available in the following tabs:

Cluster Model Summary

The Summary tab contains generic information related to the model, model build, and algorithms.

The Summary tab lists the following:

  • General settings lists the following information:

    • Type of Model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the Model Build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings list the algorithm and algorithm settings used to build the model.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Cluster Model Input

The Input tab is displayed only for models that can be scored.

The Input tab lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or numerical.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Naive Bayes

The Naive Bayes (NB) algorithm is used to build Classification models. You can build, test, apply, and tune a Naive Bayes model.

  • To build an NB model, use a Classification Node. By default, a Classification Node tests all models that it builds. The test data is created by splitting the input data into build and test subsets.

  • To test an NB model, you can also use a Test node.

  • To apply an NB model to new data, use an Apply node.

  • To tune an NB model, you must first build and test an NB model.

The following topics describe Naive Bayes:

Naive Bayes Algorithm

The Naive Bayes (NB) algorithm is based on conditional probabilities and uses Bayes' Theorem.

The algorithm applies Bayes' Theorem, estimating probabilities by counting the frequency of values and combinations of values in the historical data. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred.

Assumption:

Naive Bayes makes the assumption that each predictor is conditionally independent of the others. For a given target value, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.

Sometimes, the distribution of a given predictor is clearly not representative of the larger population. For example, there might be only a few customers under 21 in the training data, but in fact, there are many customers in this age group in the wider customer base. To compensate, you can specify prior probabilities (priors) when training the model.
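The counting approach can be illustrated with a toy example. The rows, predictor, and target values below are made up, and this is a sketch of Bayes' Theorem over frequency counts, not Oracle Data Mining's implementation:

```python
from collections import Counter

# Hypothetical training rows: (predictor value, target value)
rows = [("M", "buy"), ("M", "buy"), ("F", "buy"),
        ("M", "skip"), ("F", "skip"), ("F", "skip"), ("F", "skip")]

n = len(rows)
target_counts = Counter(t for _, t in rows)   # frequencies of each target value
joint_counts = Counter(rows)                  # frequencies of (value, target) pairs

def posterior(value, target):
    """P(target | value) via Bayes' Theorem, estimated from frequency counts."""
    p_target = target_counts[target] / n
    p_value_given_target = joint_counts[(value, target)] / target_counts[target]
    p_value = sum(joint_counts[(value, t)] for t in target_counts) / n
    return p_value_given_target * p_target / p_value

print(posterior("M", "buy"))
```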

Advantages of Naive Bayes

The advantages of Naive Bayes model are:

  • The Naive Bayes algorithm provides fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows.

  • The Naive Bayes build process is parallelized. Scoring can also be parallelized irrespective of the algorithm.

  • Naive Bayes can be used for both binary and multiclass classification problems.

Naive Bayes Test Viewer

By default, any Classification or Regression model is automatically tested. You have the option to view the test results.

A Classification model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the Build node and select View Results.

Naive Bayes Model Viewer

The Naive Bayes Model Viewer enables you to examine a Naive Bayes model.

You can view a Naive Bayes model using any one of the following methods.

The NB model viewer has these tabs:

Probabilities (NB)

The Probabilities tab lists the probabilities calculated during model build. You can sort and filter the order in which probabilities are displayed.

The relative value of probabilities is shown graphically as a bar, with a blue bar for positive values and red bar for negative values. For numbers close to zero, the bar may be too small to be displayed.

Select Target Value. The probabilities associated with the selected value are displayed. The default is to display the probabilities for the value that occurs least frequently.

Probabilities are listed in the grid.

Other Tabs: The NB Model Viewer has these other tabs:

  • Compare

  • Settings

Grid

In the grid, you can view the row count and the grid filter.

If no items are listed, then there are no values that satisfy the criteria that you specified:

  • Row Count: The number of rows displayed.

  • Grid Filter: Use the Grid Filter to filter the information in the grid.

The probabilities grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Probability: The probability for the value of the attribute. Probability is displayed as both a number and with a bar indicating a percentage. If the probability is very small, then no bar is displayed.

Fetch Size

Fetch Size limits the number of rows returned regardless of Filter or Server settings.

The default fetch size is 1000. Change the Fetch Size by clicking the up or down arrows. If you change the Fetch Size, then click Query.

Grid Filter

In the grid filter, you can filter items according to different categories.

The filter control enables you to filter the items that are displayed in the grid. The filtering is done as you type in the filter search box.

To see the filter categories, click the down arrow next to the binoculars icon. The following categories are supported for probabilities:

  • Attribute: Filters the Attribute (name) column. This is the default category. For example, to display all entries with CUST in the attribute name, enter CUST in the search box.

  • Value: Filters the value column.

  • Probability: Filters the probability column.

  • All (And): Enter one or more strings; their values are compared against the Attribute and Value columns using an AND condition. For example, enter CUST M to display rows where the attribute name contains CUST and the value is M.

  • All (Or): Works the same as All (And) except that the comparison uses an OR condition.
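The And/Or behavior can be sketched as substring matching over the Attribute and Value columns. The grid rows and search terms below are hypothetical:

```python
# Hypothetical grid rows as (attribute, value) pairs
rows = [("CUST_GENDER", "M"), ("CUST_GENDER", "F"), ("AGE", "30")]

def matches(row, terms, mode):
    """Each term must hit some column (and), or any term may hit (or)."""
    hits = [any(term in column for column in row) for term in terms]
    return all(hits) if mode == "and" else any(hits)

matched = [r for r in rows if matches(r, ["CUST", "M"], "and")]
print(matched)
```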

The Grid Filter for Compare lists similar categories:

  • Name: Filters by attribute name (Default).

  • Value: Filters the value column.

  • Attribute/Value/Propensity (or): Filters for values in any of the attribute, value, and propensity columns.

  • Attribute/Value/Propensity (and): Filters for values that match in all of the attribute, value, and propensity columns.

  • Propensity for Target Value 1: Filters the propensity values for Target Value 1.

  • Propensity for Target Value 2: Filters the propensity values for Target Value 2.

After you enter one or more strings into the filter search box, delete is displayed. Click this icon to clear the search string.

Compare (NB)

The Compare tab enables you to compare results for two different target values.

Select the two target values. The default values for Target Value 1 and Target Value 2 are displayed.

You can do the following:

  • Change the Target Values. The Target Values that you select must be different.

  • Use the Grid Filter to display specific values.

  • Change the Fetch Size.

  • Sort the grid columns. The grid for compare has these columns:

    • Attribute: Name of the attribute

    • Value: Value of the attribute

    • Propensity for target value 1

    • Propensity for target value 2

    For both propensities, a histogram bar is displayed. The maximum value of propensity is 1.0. The minimum is -1.0.

    Propensity shows which of the two target values has a more predictive relationship for a given attribute value pair. Propensity can be measured in terms of being predicted for or against a target value, where prediction against is shown as a negative value.

Other Tabs:

  • Probabilities

  • Settings

Settings (NB)

The Settings tab contains information related to the model summary, inputs, target values, cost matrix (if the model is tuned), partition keys (if the model is partitioned) and so on.

In the Partition field, click the partition name. The partition detail is displayed in the Partition Details window.

Click search to open the Select Partition dialog box.

The Settings tab displays information about how the model was built:

Other Tabs: The NB Model Viewer has these other tabs:

  • Compare

  • Probabilities

Settings (NB)

The Settings tab shows information about the model.

The Settings tab has these tabs:

Summary (NB)

The Summary tab describes all model settings.

Model settings describe characteristics of model building. The Settings are divided into:

Input (NB)

The Input tab for Naive Bayes is displayed only for models that can be scored.

The Input tab lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-specified and Automatic Data Preparation (ADP), because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Weights

The Weights tab displays the weights that are calculated by the system for each target value.

If you tune the model, then the weights may change.

Target Values

The Target Values tab displays the target attributes, their data types, and the values of each target attribute.

The Target Values tab for Naive Bayes displays the following:

  • Target Attributes

  • Data Types

  • Values of each target attribute



Naive Bayes Algorithm Settings

Lists the settings for the Naive Bayes algorithm.

This section identifies the algorithm and whether Automatic Data Preparation is ON or OFF.

These settings are specific to Naive Bayes:

  • Pairwise Threshold: The minimum percentage of pairwise occurrences required for including a predictor in the model. The default is 0.

  • Singleton Threshold: The minimum percentage of singleton occurrences required for including a predictor in the model. The default is 0.
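A simplified reading of the Singleton Threshold is a minimum-frequency filter on predictor values. The semantics here are a sketch (the exact behavior is defined by Oracle Data Mining), and the data is made up:

```python
from collections import Counter

values = ["A", "A", "A", "B", "B", "C"]   # hypothetical values of one predictor
singleton_threshold = 20.0                 # minimum percentage of occurrences

counts = Counter(values)
n = len(values)
# Keep only the values whose frequency meets the threshold percentage
kept = {v for v, c in counts.items() if 100.0 * c / n >= singleton_threshold}
print(sorted(kept))
```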

General Settings

The generic settings are contained in the Settings tab and General tab.

The Settings tab of a model viewer displays settings in three categories:

  • General displays generic information about the model, as described in this topic.

  • Algorithm Settings displays information that is specific to the selected algorithm.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

The General tab contains the following information for all algorithms:

  • Type: The mining function for the model: anomaly detection, association rules, attribute importance, classification, clustering, feature extraction, or regression.

  • Owner: The data mining account (schema) used to build the model.

  • Model Name: The name of the model.

  • Target Attribute: The target attribute; only Classification and Regression models have targets.

  • Creation Date: The date when the model was created, in the form MM/DD/YYYY.

  • Duration: Time in minutes required to build the model.

  • Size: The size of the model in megabytes.

  • Comment: For models not created using Oracle Data Miner, this option displays comments embedded in the models. To see comments for models built using Oracle Data Miner, go to Properties for the node where the model is built.

    Models created using Oracle Data Miner may contain BALANCED, NATURAL, CUSTOM, or TUNED. Oracle Data Miner inserts these values to indicate if the model has been tuned and in what way it was tuned.

Input (NB)

The Input tab for Naive Bayes is displayed only for models that can be scored.

It lists the attributes used to build the model. For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-supplied and Automatic Data Preparation (ADP) transformations, because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Partition Keys

The Partition Keys tab lists the columns that are partitioned.

Along with the partitioned columns, the Partition Keys tab lists the following details:
  • Partition Name

  • Source

  • Data Type

  • Value

Weights

The Weights tab displays the weights that are calculated by the system for each target value.

If you tune the model, then the weights may change.

Target Values

The Target Values tab displays the target attributes, their data types, and the values of each target attribute.

The Target Values tab for Naive Bayes displays the following:

  • Target Attributes

  • Data Types

  • Values of each target attribute

Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is the unsupervised algorithm used by Oracle Data Mining for feature extraction.

  • To build an NMF model, use a Feature Extraction node.

  • To apply an NMF model to new data, use an Apply node.

Using Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) is useful when there are many attributes and the attributes are ambiguous or have weak predictability.

By combining attributes, NMF can produce meaningful patterns, topics, or themes.

NMF is especially well-suited for text mining. In a text document, the same word can occur in different places with different meanings. For example, hike can be applied to the outdoors or to interest rates. By combining attributes, NMF introduces context, which is essential for predictive power:

  • "hike" + "mountain" -> "outdoor sports"
  • "hike" + "interest" -> "interest rates"

How Does Nonnegative Matrix Factorization Work?

Nonnegative Matrix Factorization (NMF) uses techniques from multivariate analysis and linear algebra.

NMF decomposes multivariate data by creating a user-defined number of features. Each feature is a linear combination of the original attribute set. The coefficients of these linear combinations are nonnegative.

NMF decomposes a data matrix V into the product of two lower rank matrices W and H so that V is approximately equal to W times H. NMF uses an iterative procedure to modify the initial values of W and H so that the product approaches V. The procedure terminates when the approximation error converges or the specified number of iterations is reached.
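
The iterative procedure can be sketched in a few lines; this is a minimal illustration using the well-known multiplicative update rules, not the exact procedure Oracle Data Mining uses:

```python
import numpy as np

def nmf(V, k, iters=200, tol=1e-6, seed=0):
    """Minimal NMF sketch: factor V (m x n, nonnegative) into W (m x k)
    and H (k x n) with multiplicative updates, stopping when the
    approximation error converges or the iteration limit is reached."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    prev = np.inf
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update H, stays nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update W, stays nonnegative
        err = np.linalg.norm(V - W @ H)
        if abs(prev - err) < tol:               # convergence check
            break
        prev = err
    return W, H

V = np.random.default_rng(1).random((6, 4))     # toy nonnegative data matrix
W, H = nmf(V, k=2)
# W @ H approximates V, and both factors are nonnegative
```

Because the coefficients are constrained to be nonnegative, each row of V is explained as an additive combination of the k features, which is what makes the features interpretable as themes.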

When applied to new data, an NMF model maps the original data into the new set of attributes (features) discovered by the model.

NMF Algorithm Settings

Lists the settings supported by the Nonnegative Matrix Factorization (NMF) algorithm.

The settings are:

  • Convergence Tolerance: Indicates the minimum convergence tolerance value. The default is 0.5.

  • Automatic Data Preparation: ON (Default). Indicates automatic data preparation.

  • Non Negative Scoring: Controls whether NMF scoring results are truncated at zero, that is, whether negative values are produced. Options are Enabled or Disabled. By default, nonnegative scoring is enabled.

  • Number of Features: The default is to not specify the number of features. If you do not specify the number of features, then the algorithm determines the number of features.

    To specify the number of features, select Specify number of features and enter an integer. The number of features must be a positive integer less than or equal to the minimum of the number of attributes and the number of cases. In many cases, 5 or some other number less than or equal to 7 gives good results.

  • Number of iterations: Indicates the maximum number of iterations to be performed. The default is 50.

  • Random Seed: The random seed for the sample. The default value is -1. The seed can be changed. If you plan to repeat this operation and get the same results, then ensure that you use the same random seed.

NMF Model Viewer

In the NMF model viewer, you can view information related to the model and algorithm, such as coefficients and settings.

The NMF Model Viewer has these tabs:

Coefficients (NMF)

For a given Feature ID, the coefficients are displayed in the Coefficients grid.

The grid title, Coefficient x of y, displays the number of rows returned out of all the rows available in the model.

By default, Feature IDs are integers.

Fetch Size limits the number of rows returned. The default is 1000 or the value specified in the Preference settings for Model Viewers.

You can perform the following tasks:

  • Rename

  • Filter

The Coefficients grid has these columns:

  • Attribute: Attribute name

  • Value: Value of attribute

  • Coefficient: The value is shown as a bar with the value centered in the bar. Positive values are light blue. Negative values are red.

Rename (NMF)

You can rename any Feature ID in the Rename dialog box.

To rename the selected Feature ID:

  1. Enter the new name in the Feature ID field.
  2. Click OK.

Note:

Different features should have different names.

Filter (NMF)

In the Filter dialog box, you can create filters and view them by category, such as attribute, value, and coefficient.

To view the filter categories, click find.

The filter categories are:

  • Attribute (Default): Search for an attribute name.

  • Value: This is the value column.

  • Coefficient: This is the coefficient column.

To create a filter, enter a string in the text box. After a string has been entered, the delete icon is displayed. To clear the filter, click the icon.

Features

The Features tab displays all the features along with the Feature IDs and the corresponding items.

The lower panel contains the following tabs:

  • Tag Cloud: Displays the selected feature in a tag cloud format. You can sort the feature tags based on coefficients or alphabetical order. You can also view them in ascending or descending order. To copy and save the cloud image, right-click and select:

    • Save Image As

    • Copy Image to Clipboard

  • Coefficients: Displays the attributes of the selected feature along with their values and coefficients in tabular format.

Settings (NMF)

The Settings tab contains information related to inputs, build details, algorithm settings, and other general settings.

Summary (NMF)

The Summary tab contains information related to build details, algorithm settings, and other general settings.

The sections in the Summary tab are:

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (NMF)

The Inputs tab displays the list of attributes that are used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute

  • Mining Type: Categorical or Numerical

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-supplied and Automatic Data Preparation (ADP) transformations, because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Orthogonal Partitioning Clustering

Orthogonal Partitioning Clustering (O-Cluster or OC) is a clustering algorithm that is proprietary to Oracle.

The requirements to build and apply O-Cluster models are:

  • To build OC models, use a Clustering node.

  • To apply an OC model to new data, use an Apply node.

The following topics describe O-Cluster:

Related Topics

O-Cluster Algorithm

The O-Cluster (OC) algorithm creates a hierarchical grid-based clustering model. That is, it creates axis-parallel (orthogonal) partitions in the input attribute space.

The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.

The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. The sensitivity parameter defines a baseline density level. Only areas with a peak density above this baseline level can be identified as clusters.

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.
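
As a rough sketch of how such a mixture model scores a row, the following assigns cluster probabilities from a product of independent normal (numerical) and multinomial (categorical) terms; the cluster parameters and attribute names below are hypothetical, not values the product would produce:

```python
import math

def cluster_probabilities(row, clusters):
    """Illustrative mixture-model scoring: each cluster's likelihood is
    its prior times independent normal densities for numerical
    attributes and multinomial probabilities for categorical ones."""
    def normal_pdf(x, mean, sd):
        return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    scores = {}
    for cid, c in clusters.items():
        like = c["prior"]
        for attr, (mean, sd) in c["normals"].items():
            like *= normal_pdf(row[attr], mean, sd)
        for attr, probs in c["multinomials"].items():
            like *= probs.get(row[attr], 1e-9)
        scores[cid] = like
    total = sum(scores.values())
    return {cid: s / total for cid, s in scores.items()}   # normalize

# Hypothetical two-cluster model over AGE (numerical) and REGION (categorical).
clusters = {
    1: {"prior": 0.6, "normals": {"AGE": (30.0, 5.0)},
        "multinomials": {"REGION": {"South": 0.7, "NorthEast": 0.3}}},
    2: {"prior": 0.4, "normals": {"AGE": (55.0, 8.0)},
        "multinomials": {"REGION": {"South": 0.2, "NorthEast": 0.8}}},
}
p = cluster_probabilities({"AGE": 29.0, "REGION": "South"}, clusters)
# cluster 1 receives the higher probability for this row
```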

O-Cluster goes through the data in chunks until it converges. There is no explicit limit on the number of rows processed.

O-Cluster handles missing values naturally as missing at random. The algorithm does not support nested tables and thus does not support sparse data.

Note:

OC does not support text.

OC Algorithm Settings

Lists the settings supported by the O-Cluster (OC) algorithm.

The settings are:

  • Number of Clusters: The maximum number of leaf clusters generated by the algorithm. The default is 10.

  • Buffer Size: The maximum size of the memory buffer, in logical records, that can be used by the algorithm. The default is 50,000 logical records.

  • Sensitivity: A number between 0 (fewer clusters) and 1 (more clusters). The default is 0.5. This value specifies the peak density required for separating a new cluster. This value is related to the global uniform density.

OC Model Viewer

In the OC Model Viewer, you can examine the details of an OC model.

The OC Model viewer has these tabs:

EM, KM, and OC Tree Viewer

The Tree Viewer is the graphical tree for hierarchical clusters.

The tree viewers for Expectation Maximization, k-Means, and O-Cluster operate in the same way. When you view the tree:

  • The Workflow Thumbnail opens to give you a view of the entire tree.

  • The Structure window helps you navigate and analyze the tree.

You can compare the attributes in a given node with the attributes in the population using EM, KM, and OC Compare.

To view information about a particular node:

  1. Select the node.

  2. In the lower pane, the following are displayed in each of these tabs:

    • Centroid: Displays the centroid of the cluster

    • Cluster Rule: Displays the rule that all elements of the cluster satisfy.

Display Control:

The following control the display of the tree as a whole:

  • Zoom in: Zooms in to the diagram, providing a more detailed view of the rule.

  • Zoom out: Zooms out from the diagram, providing a view of much or all of the rule.

  • Percent size: Enables you to select an exact percentage to zoom the view.

  • Fit to Window: Zooms out from the diagram until the whole diagram fits within the screen.

  • Layout Type: Enables you to select horizontal layout or vertical layout; the default is vertical.

  • Expand: Expands all nodes to show the branches of the tree.

  • Show more detail: Shows more data for each tree node. Click again to show less detail.

  • Top Attributes: Displays the top N attributes. N is 5 by default. To change N, select a different number from the list.

  • Refresh: Enables you to apply the changed Query Settings.

  • Query Settings: Enables you to change the number of top settings. The default is 10. You can save a different number as the new default.

  • Save Rules

Cluster (Viewer)

The Cluster tab enables you to view information about a selected cluster. The viewer supports filtering so that only selected probabilities are displayed.

The Cluster tab for EM, KM, and OC operates in the same way.

The following information is displayed:

  • Cluster: The ID of the cluster being viewed. To view another cluster, select a different ID from the menu. You can view Leaves Only (terminal clusters) by selecting Leaves Only. Leaves Only is the default.

  • Fetch Size: Default is 20. You can change this value.

    If you change Fetch Size, then click Query to see the new display.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Name of the attribute.

  • Histogram of the attribute values in the cluster.

  • Confidence displayed as both a number and with a bar indicating a percentage. If confidence is very small, then no bar is displayed.

  • Support, the number of cases.

  • Mean, displayed for numeric attributes.

  • Mode, displayed for categorical attributes.

  • Variance

To view a larger version of the histogram, select an attribute; the histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

You can search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box.

The drop-down list enables you to search the grid by Attribute (the default) or by Mode. Type the search term in the box next to search.

To clear a search, click delete.

Other Tabs: The OC Model Viewer also has these tabs:

  • EM, KM, and OC Tree Viewer

  • EM, KM, and OC Compare

  • Settings

EM, KM, and OC Compare

In the Compare tab, you can compare two clusters in the same model.

The Compare tab for EM, KM, and OC operates in the same way. The display enables you to select the two clusters to compare.

You can perform the following tasks:

  • Compare Clusters: To select clusters to compare, pick them from the lists. The clusters are compared by comparing attribute values. The comparison is displayed in a grid. You can use Compare to compare an individual cluster with the population.

  • Rename Clusters: To rename clusters, click Edit. This opens the Rename Cluster dialog box. By default, only leaf clusters are displayed. To show all nodes, deselect Leaves Only. The default Fetch Size is 20. You can change this value.

  • Search Attribute: To search an attribute, enter its name in the search box. You can also search by rank.

  • Create Query: If you make any changes, then click Query.

For each cluster, a histogram is generated that shows the attribute values in the cluster. To see enlarged histograms for a cluster, click the attribute that you are interested in. The enlarged histograms are displayed below the attribute grid.

In some cases, there may be missing histograms in a cluster.

Detail (OC)

In the Details tab, you can view details of a cluster. You can discover what values of an attribute are in the selected cluster.

The viewer supports filtering so that only selected probabilities are displayed. The following information is displayed:

  • Cluster: The ID of the cluster being viewed. You can change the cluster by selecting a different ID. Select Leaves Only to view terminal clusters only.

  • Fetch Size: The number of columns selected. The default is 50. You can change the Fetch Size. If you change the Fetch Size, then click Query.

The grid lists the attributes in the cluster. For each attribute, the following information is displayed:

  • Attribute: An attribute is a predictor in a predictive model or an item of descriptive information in a descriptive model. Data attributes are the columns of data that are used to build a model. Data attributes undergo transformations so that they can be used as categoricals or numericals by the model. Categoricals and numericals are model attributes.

  • Histogram: The attribute values in the selected cluster are displayed as a histogram.

    To view a larger version of the histogram, select an attribute. The histogram is displayed below the grid. Place the cursor over a bar in the histogram to see the details of the histogram including the exact value.

  • Confidence: Displayed as both a number and with a bar indicating a percentage. If the confidence is very small, then no bar is displayed.

  • Support: The number of cases.

  • Mean: Displayed for numeric attributes.

  • Mode: Displayed for categorical attributes.

  • Variance

You can perform the following tasks:

  • Sort the attributes in the cluster. To sort, click the appropriate column heading in the grid. For example, to sort by attribute name, click Attribute. The attributes are sorted by:

    • Confidence

    • Support

    • Mean

    • Mode

    • Variance

    • Attribute name

  • Search the attribute list for a specific attribute name or for a specific value of mode. To search, use the search box next to view.

  • Search the grid by Attribute. The drop-down list enables you to search the grid by Attribute (the default) or by Mode. Enter the search term in the search field. To clear a search, click delete.

Other Tabs: The OC Model Viewer has the Settings tab.

Related Topics

Settings (OC)

The Settings tab displays information about how the model was built:

Related Topics

Summary (OC)

The Summary tab contains information related to build details, algorithm settings, and other general settings.

The sections in the Summary tab are:

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings lists the following:

    • The name of the algorithm.

    • The settings that control the model build. Algorithm settings are specified when the build node is defined.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (OC)

The Inputs tab is displayed only for models that can be scored.

It displays the following information:
  • Name: The name of the attribute.

  • Data Type: The data type of the attribute

  • Mining Type: Categorical or Numerical

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-supplied and Automatic Data Preparation (ADP) transformations, because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Interpreting Cluster Rules

Cluster Rules are presented as mathematical notations.

After running the Clustering Build node, the Clustering node builds three models, one each using the O-Cluster, k-Means, and Expectation Maximization algorithms. In the Model viewer, the clustering rules of each cluster are displayed in the Rules tab in the lower pane of the Tree tab. For each rule, you can view the details of the algorithm in the Settings tab under Summary. Select a cluster in the tree to view the rules of the selected cluster.

Example 13-1 Example of a Cluster Rule

Suppose the rule of the selected cluster is as follows:

If TIME_AS_CUSTOMER In ("1", "2")

And N_OF_DEPENDENTS = "(.857143; 1.71429]"

And HOUSE_OWNERSHIP = "1"

And N_MORTGAGES = "1"

And REGION In ("NorthEast", "South")

Then Cluster is: 19

The rule is generated in terms of bins. The rule in this example can be interpreted as follows:

In the rule If TIME_AS_CUSTOMER In ("1", "2"), the attribute TIME_AS_CUSTOMER considers rows that have the value 1 or 2. Because the column's mining type is categorical, the rule is expressed as a set.

The rule N_OF_DEPENDENTS = "(.857143; 1.71429]" means .857143 < N_OF_DEPENDENTS <= 1.71429. Because the column's mining type is numerical, the bin is expressed as a range.

The rules HOUSE_OWNERSHIP = "1" and N_MORTGAGES = "1" mean that the attributes HOUSE_OWNERSHIP and N_MORTGAGES consider rows that have the value 1.

The rule REGION In ("NorthEast", "South") means that the attribute REGION considers rows whose value is "NorthEast" or "South".

Rows that satisfy all of these conditions are assigned to cluster 19.
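
The whole rule can be read as a simple predicate; the following sketch restates the example rule directly, using set membership for the categorical bins and a half-open range for the numerical bin:

```python
def in_cluster_19(row):
    """Illustrative check of the example cluster rule: categorical bins
    are set-membership tests, numerical bins are half-open ranges."""
    return (
        row["TIME_AS_CUSTOMER"] in {"1", "2"}
        and 0.857143 < row["N_OF_DEPENDENTS"] <= 1.71429
        and row["HOUSE_OWNERSHIP"] == "1"
        and row["N_MORTGAGES"] == "1"
        and row["REGION"] in {"NorthEast", "South"}
    )

row = {"TIME_AS_CUSTOMER": "2", "N_OF_DEPENDENTS": 1.0,
       "HOUSE_OWNERSHIP": "1", "N_MORTGAGES": "1", "REGION": "South"}
# in_cluster_19(row) → True: this row satisfies every condition of the rule
```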

Singular Value Decomposition and Principal Components Analysis

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are unsupervised algorithms used by Oracle Data Mining for feature extraction.

Unlike NMF, SVD and PCA are orthogonal linear transformations that are optimal for capturing the underlying variance of the data. This property is extremely useful for reducing the dimensionality of high-dimensional data and for supporting meaningful data visualization.

Note:

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) require Oracle Database 12c or later.

In addition to dimensionality reduction, SVD and PCA have several other important applications, such as data denoising (smoothing), data compression, matrix inversion, and solving systems of linear equations. All these areas can be effectively supported by the Oracle Data Mining implementation of SVD/PCA.

SVD is implemented as a feature extraction algorithm. PCA is implemented as a special scoring method for the SVD algorithm.

Build and Apply SVD and PCA Models

To build an SVD or PCA model, use a Feature Extraction node.

A Feature Extraction model creates a Feature Build Node. If you are connected to Oracle Database 12c or later, then a Feature Build node creates one NMF model and one PCA model. You can add an SVD model.

To apply an SVD or PCA model, use an Apply node.

PCA Algorithm Settings

Lists the settings supported by the PCA algorithm.

  • Number of features: The default is System Determined. To specify a value, select User specified and enter an integer value.

  • Solver: The solver setting indicates the type of SVD solver used for computing the Principal Components Analysis (PCA) for the data. Solvers are grouped into narrow-data (Tall-Skinny SVD) solvers and wide-data (Stochastic SVD) solvers. The options are:

    • Tall-Skinny (for QR computation). This is the default solver for narrow data.

    • Tall-Skinny (for Eigenvalue computation)

    • Stochastic (for QR computation). If you select this option, then click Option. This opens the Solver (Stochastic QR Computation) dialog box. This is the default for wide data.

    • Stochastic (for Eigenvalue computation)

    Note:

    The solvers using QR computation (tssvd and ssvd) are more stable and produce higher-quality results for an ill-conditioned data matrix than the solvers using Eigenvalue computation (tseigen and steigen). The improved stability comes at a higher computational cost.

  • Tolerance: By default, it is set to System Determined. To specify a value, click User Specified. The value must be a number greater than 0 and less than 1.

  • Approximate Computation: The default is System Determined. You can select either Enable or Disable. Approximate computations improve performance.

  • Projections: The default is to not select Projections.

  • Number of Features: The default is System Determined. You can specify a number.

  • Scoring Mode: The scoring mode to use, either Singular Value Decomposition Scoring or Principal Components Analysis Scoring. The default is Principal Components Analysis Scoring (PCA scoring).

    • When the build data is scored with SVD, the projections will be the same as the U matrix.

    • When the build data is scored with PCA, the projections will be the product of the U and S matrices.

  • U Matrix Output: Whether or not the U matrix produced by SVD persists. The U matrix in SVD has as many rows as the number of rows in the Build data. To avoid creating a large model, the U matrix persists only when U Matrix Output is enabled. When U Matrix Output is enabled, the Build data must include a Case ID. The default is Disable.
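
The relationship between the two scoring modes can be illustrated with a thin SVD; this is a sketch of the underlying linear algebra, not of the product's scoring code:

```python
import numpy as np

# Sketch of the two scoring modes on the build data, assuming
# V = U * S * Vt is the thin SVD of the (centered) data matrix V.
V = np.random.default_rng(0).random((8, 3))
U, S, Vt = np.linalg.svd(V, full_matrices=False)

svd_projections = U       # SVD scoring: the projections equal the U matrix
pca_projections = U * S   # PCA scoring: the product of the U and S matrices

# Each row of the projections is the scored representation of the
# corresponding build-data row.
```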

Related Topics

Solver (Stochastic QR Computation)

You can specify the settings for the Stochastic (QR computation) solver here:

  1. In the Oversampling field, specify a value greater than or equal to 1 and less than or equal to 10000. The default is 5. A larger oversampling value yields better accuracy but incurs a higher training cost. The value configures the number of columns in the sampling matrix used by the Stochastic SVD solver. The number of columns in this matrix is the sum of the requested number of features and the oversampling setting.
  2. In the Power Iterations field, specify a value greater than or equal to 0 and less than or equal to 20. The default is 2. A larger value improves the accuracy of the solver.
  3. In the Random Seed field, specify a value greater than or equal to 0 and less than or equal to 4294967296. The default is 0. The random seed value initializes the sampling matrix used by the Stochastic SVD solver.
  4. Click OK.
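
A minimal sketch of how a stochastic SVD solver uses these settings: oversampling widens the sampling matrix, power iterations sharpen the subspace estimate, and the seed initializes the sampling matrix. This follows the standard randomized SVD recipe rather than Oracle's exact implementation:

```python
import numpy as np

def stochastic_svd(A, k, oversampling=5, power_iters=2, seed=0):
    """Sketch of a randomized (stochastic) SVD with QR: the sampling
    matrix has k + oversampling columns, and power iterations improve
    the estimate of the dominant subspace."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversampling))  # sampling matrix
    Y = A @ Omega                                       # range sketch
    for _ in range(power_iters):                        # power iterations
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                              # orthonormal basis (QR)
    B = Q.T @ A                                         # small projected matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)   # cheap exact SVD
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]

A = np.random.default_rng(1).random((100, 30))
U, S, Vt = stochastic_svd(A, k=5)
# U, S, Vt approximate the top 5 singular triplets of A
```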

PCA Model Viewer

You can examine the details of a Principal Components Analysis model in the PCA Model Viewer.

The model viewer has these tabs:

Coefficients (PCA)

For a given Feature ID, the coefficients are displayed in the Coefficients grid.

The grid title, Coefficients x of y, displays the number of rows returned out of all the rows available in the model. By default, Feature IDs are integers (1, 2, 3, …). The Eigenvalue for the selected Feature ID is displayed as a read-only value.

You can perform the following tasks:

  • View and create filter categories

  • Rename feature IDs

The Coefficients grid has these columns:

  • Attribute

  • Singular Value

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

    The default is Sort by absolute value. If you deselect this option, then click Query.

Rename (PCA)

In the Rename dialog box, you can rename the selected Feature ID. To rename:

  1. Enter the new name in the Feature ID field.
  2. Click OK.

Note:

Different features should have different names.

Filter (PCA)

To view the filter categories, click view.

The filter categories are:

  • Attribute: (Default). Search for an attribute name.

  • Singular Value: The Singular value column

To create a filter, enter a string in the text box. After a string has been entered, the delete icon is displayed. To clear the filter, click the icon.

PCA Scree Plot

A scree plot displays the eigenvalues associated with a component or factor.

You can use scree plots in Principal Components Analysis to visually assess which components or factors explain most of the variability in the data. In the PCA Scree Plot:

  • Features are plotted along the X-axis.

  • Cutoff is plotted along the Y-axis.

  • Variance is plotted as a red line.

  • Cumulative percent is plotted as a blue line.

A grid below the graph shows Eigenvalue, Variance, and Cumulative Percent Variance for each Feature ID.
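
The quantities shown in the grid can be derived from the eigenvalues alone; here is a small sketch with made-up eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues for five features, in descending order.
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.3])

# Percent of total variance explained by each feature, and the running total.
variance_pct = 100 * eigenvalues / eigenvalues.sum()
cumulative_pct = np.cumsum(variance_pct)

# e.g. the first two features together explain cumulative_pct[1] percent
# of the total variance; the cumulative line always ends at 100.
```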

Features

The Features tab displays all the features along with the Feature IDs and the corresponding items.

The lower panel contains the following tabs:

  • Tag Cloud: Displays the selected feature in a tag cloud format. You can sort the feature tags based on coefficients or alphabetical order. You can also view them in ascending or descending order. To copy and save the cloud image, right-click and select:

    • Save Image As

    • Copy Image to Clipboard

  • Coefficients: Displays the attributes of the selected feature along with their values and coefficients in tabular format.

PCA Details

The Details tab displays the global details of the SVD model.

The following information is displayed:

  • Number of Components

  • Suggested Cutoff

Settings (PCA)

The Settings tab displays information about how the model was built.

The Settings tab contains these tabs:

Summary (PCA)

The Summary tab contains information related to build details, algorithm settings, and other general settings.

The sections in the Summary tab contain the following information:

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (PCA)

The Inputs tab displays the list of attributes that are used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-supplied and Automatic Data Preparation (ADP) transformations, because ADP can be turned off while the user still embeds a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

SVD Algorithm Settings

Lists the settings supported by the SVD algorithm.

  • Approximate Computation: Specifies whether the algorithm should use approximate computations to improve performance. For SVD, approximation is often appropriate for data sets with many columns: an approximate low-rank decomposition provides good solutions at a reasonable computational cost. When Approximate Computation is System Determined, whether approximation is used depends on the characteristics of the data. For data sets with more than 2500 attributes (the maximum number of features allowed), only approximate decomposition is possible. If approximate computation is disabled for a data set with more than 2500 attributes, then an exception is raised.

    Values for Approximate Computation are:

    • System Determined (Default)

    • Enable

    • Disable

  • Automatic Preparation: ON or OFF. The default is ON.

  • Number of Features: System Determined (Default). You can specify a number.

  • Solver: Indicates the solver to be used for computing the Singular Value Decomposition (SVD) of the data. Solvers are grouped into narrow-data (Tall-Skinny SVD) solvers and wide-data (Stochastic SVD) solvers. The options are:

    • Tall-Skinny (for QR computation). This is the default solver for narrow data.

    • Tall-Skinny (for Eigenvalue computation)

    • Stochastic (for QR computation). If you select this option, then click Option. This opens the Solver (Stochastic QR Computation) dialog box. This is the default for wide data.

    • Stochastic (for Eigenvalue computation)

    Note:

    The solvers using QR computation (tssvd and ssvd) are more stable and produce higher quality results for an ill-conditioned data matrix than the solvers using Eigenvalue computation (tseigen and steigen). The improved stability comes at a higher computational cost.

  • Tolerance: By default, it is set to System Determined. To specify a value, click User Specified. The value must be a number greater than 0 and less than 1.

  • Scoring Mode: The scoring mode to use, either Singular Value Decomposition (SVD) scoring or Principal Components Analysis (PCA) scoring. The default is SVD scoring.

    • When the build data is scored with SVD, the projections will be the same as the U matrix.

    • When the build data is scored with PCA, the projections will be the product of the U and S matrices.

  • U Matrix Output: Specifies whether the U matrix produced by SVD is persisted. The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix is persisted only when U Matrix Output is enabled. When U Matrix Output is enabled, the build data must include a Case ID. The default is Disable.
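The two scoring modes above can be sketched in a few lines of plain Python (illustrative only: the tiny U and S matrices below are made up, standing in for the factors that the SVD build actually computes):

```python
# Illustrative sketch of the two scoring modes. For a data matrix X with
# SVD X = U * S * V^T, SVD scoring projects the build data onto U, while
# PCA scoring projects it onto U * S. The matrices here are invented.
def matmul(a, b):
    # Plain-Python matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

U = [[1, 0], [0, 1]]            # left singular vectors (one row per case)
S = [[3, 0], [0, 2]]            # diagonal matrix of singular values

svd_projections = U             # SVD scoring: the U matrix itself
pca_projections = matmul(U, S)  # PCA scoring: U scaled by the singular values
# pca_projections == [[3, 0], [0, 2]]
```

The sketch shows why the two modes differ only by a per-component scaling: PCA projections are the SVD projections stretched by the corresponding singular values.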

SVD Model Viewer

You can examine the details of a Singular Value Decomposition model in the SVD Model Viewer.

The SVD model viewer has these tabs:

Coefficients (SVD)

For a given Feature ID, the coefficients are displayed in the Coefficients grid.

The title of the grid Coefficients x of y displays the number of rows returned out of all the rows available in the model. By default, Feature IDs are integers.

The Eigenvalue for the selected Feature ID is displayed as a read-only value.

Fetch Size limits the number of rows returned. The default is 1,000 or the value specified in the Preference settings for Model Viewers.

You can perform the following tasks:

  • Rename

  • Filter

The Coefficients grid has these columns:

  • Attribute

  • Singular Value

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

    The default is Sort by absolute value. To sort by signed value, deselect the option and then click Query.

Rename (SVD)

You can rename the selected Feature ID. Enter the new name and click OK. Different features should have different names.

Filter (SVD)

To view the filter categories, click view.

The filter categories are:

  • Attribute, the default; search for an attribute name

  • Singular Value, the singular value column

To create a filter, enter a string in the text box. After a string has been entered, delete is displayed. To clear the filter, click it.

Features

The Features tab displays all the features along with the Feature IDs and the corresponding items.

The lower panel contains the following tabs:

  • Tag Cloud: Displays the selected feature in a tag cloud format. You can sort the feature tags based on coefficients or alphabetical order. You can also view them in ascending or descending order. To copy and save the cloud image, right-click and select:

    • Save Image As

    • Copy Image to Clipboard

  • Coefficients: Displays the attributes of the selected feature along with their values and coefficients in a tabular format.

SVD Singular Values

The Singular Values for each Feature ID are displayed in a grid.

SVD Details

This tab displays the value for these global details of the SVD model:

  • Number of Components

  • Suggested Cutoff

Settings (SVD)

The Settings tab displays information about how the model was built.

The Settings tab contains these tabs:

  • Summary

  • Inputs

Summary (SVD)

The Summary tab contains information related to build details, algorithm settings, and other general settings.

The sections in the Summary tab contain the following information:

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (SVD)

The Inputs tab displays the list of attributes that are used to build the model.

Oracle Data Miner does not necessarily use all of the attributes in the build data. For example, if the values of an attribute are constant, then that attribute is not used.

For each attribute used to build the model, this tab displays:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Data Preparation: YES indicates that data preparation was performed. This helps to distinguish between user-specified and Automatic Data Preparation (ADP): ADP can be turned off, but the user can still embed a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Support Vector Machine

You can use the Support Vector Machine (SVM) algorithm to build Classification, Regression, and Anomaly Detection models.

The following topics explain Support Vector Machine:

Support Vector Machine Algorithms

The Support Vector Machine (SVM) algorithms are a suite of algorithms that can be used with a variety of problems and data. By changing one kernel for another, SVM can solve a variety of data mining problems.

Oracle Data Mining supports two kernel functions:

  • Linear

  • Gaussian

The key features of SVM are:

  • SVM can emulate traditional methods, such as Linear Regression and Neural Nets, but goes far beyond those methods in flexibility, scalability, and speed.

  • SVM can be used to solve the following kinds of problems: Classification, Regression, and Anomaly Detection.

    Oracle Data Mining uses SVM as the one-class classifier for anomaly detection. When SVM is used for anomaly detection, it has the classification mining function but no target. Applying a One-class SVM model results in a prediction and a probability for each case in the scoring data. If the prediction is 1, then the case is considered typical. If the prediction is 0, then the case is considered anomalous.
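The 1/0 rule above can be sketched as follows. The scored rows are hypothetical (case ID, prediction, probability) triples, not output from an actual apply operation:

```python
# Hedged sketch: interpreting one-class SVM apply output. Each row is a
# hypothetical (case_id, prediction, probability) triple; prediction 1
# means typical, prediction 0 means anomalous.
scored = [("case-1", 1, 0.93), ("case-2", 0, 0.71), ("case-3", 1, 0.55)]

anomalies = [case_id for case_id, prediction, _prob in scored if prediction == 0]
# anomalies == ["case-2"]
```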

How Support Vector Machines Work

SVM solves regression problems by defining an n-dimensional tube around the data points, determining the vectors giving the widest separation.

Data records with n attributes can be considered as points in n-dimensional space. SVM attempts to separate the points into subsets with homogeneous target values. Points are separated by hyperplanes in the linear case, and by non-linear separators in the non-linear case (Gaussian). SVM finds those vectors that define the separators giving the widest separation of classes (the support vectors). This is easy to visualize if n = 2; in that case, SVM finds a straight line (linear) or a curve (non-linear) separating the classes of points in the plane.
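A minimal sketch of the n = 2 linear case: the decision reduces to which side of a line a point falls on. The weight vector and bias below are illustrative values, not coefficients learned by SVM:

```python
# Sketch of a linear separator in two dimensions: points on one side of
# the line w . x + b = 0 get one class, points on the other side get the
# other. The weight vector and bias are made-up illustrative values.
w = (1.0, -1.0)   # hypothetical weight vector
b = 0.0           # hypothetical bias

def decision(x):
    # Signed distance (up to scale) of point x from the separating line.
    return w[0] * x[0] + w[1] * x[1] + b

def classify(x):
    return 1 if decision(x) >= 0 else 0

# classify((2.0, 1.0)) == 1, classify((1.0, 2.0)) == 0
```

In the non-linear (Gaussian) case, the same sign-of-decision-function idea applies, but the separator is a curve rather than a straight line.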

SVM Kernel Functions

The Support Vector Machine (SVM) algorithm supports two kernel functions: Gaussian and Linear.

The choice of kernel function depends on the type of model (classification or regression) that you are building and on your data.

When you choose a Kernel function, select one of the following:

  • System Determined (Default)

  • Gaussian

  • Linear

For Classification models and Anomaly Detection models, use the Gaussian kernel for solving problems where the classes are not linearly separable, that is, the classes cannot be separated by lines or planes. Gaussian kernel models allow for powerful non-linear class separation modeling. If the classes are linearly separable, then use the linear kernel.

For Regression problems, the linear kernel is similar to approximating the data with a line. The linear kernel is more robust than fitting a line to the data. The Gaussian kernel approximates the data with a non-linear function.

Building and Testing SVM Models

You specify building a model by connecting the Data Source node that represents the build data to an appropriate Build node.

By default, a Classification or Regression node tests all the models that it builds. By default, the test data is created by splitting the input data into build and test subsets. Alternatively, you can connect two data sources to the build node, or you can test the model using a Test node.

You can build three kinds of SVM models:

SVM Classification Models

SVM Classification (SVMC) is based on the concept of decision planes that define decision boundaries.

A decision plane is one that separates between a set of objects having different class memberships. SVM finds the vectors (support vectors) that define the separators giving the widest separation of classes.

SVMC supports both binary and multiclass targets.

To build and test an SVMC model, use a Classification node. By default, the SVMC Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test node.

After you test an SVMC model, you can tune it.

SVMC uses SVM Weights to specify the relative importance of target values.

SVM Weights

SVM models are automatically initialized to achieve the best average prediction across all classes. If the training data does not represent a realistic distribution, then you can bias the model to compensate for class values that are under-represented. If you increase the weight for a class, then the percentage of correct predictions for that class should increase.
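The biasing effect can be sketched as a per-class penalty multiplier; the class names and weight values below are invented for illustration:

```python
# Sketch: a higher class weight makes errors on that class more costly
# during training, biasing the model toward predicting it correctly.
# The class names and weights are hypothetical.
class_weights = {"fraud": 5.0, "legitimate": 1.0}

def weighted_penalty(true_class, misclassification_error):
    return class_weights[true_class] * misclassification_error

# weighted_penalty("fraud", 2.0) == 10.0
```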

SVM Regression Models

SVM Regression (SVMR) models try to find a continuous function such that the maximum number of data points lies within the epsilon-wide insensitivity tube.

SVM uses an epsilon-insensitive loss function to solve regression problems. Predictions falling within epsilon distance of the true target value are not interpreted as errors.

The epsilon factor is a regularization setting for SVMR. It balances the margin of error with model robustness to achieve the best generalization to new data.
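The epsilon-insensitive loss described above can be sketched directly in Python:

```python
# Sketch of the epsilon-insensitive loss: residuals that fall inside the
# epsilon-wide tube contribute zero error; only the excess beyond epsilon
# counts toward the training error.
def epsilon_insensitive_loss(actual, predicted, epsilon):
    return max(0.0, abs(actual - predicted) - epsilon)

# epsilon_insensitive_loss(10.0, 10.4, 0.5) == 0.0  (inside the tube)
# epsilon_insensitive_loss(10.0, 12.0, 0.5) == 1.5  (0.5 of the residual forgiven)
```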

To build and test an SVMR model, use a Regression node. By default, the Regression Node tests the models that it builds. Test data is created by splitting the input data into build and test subsets. You can also test a model using a Test node.

SVM Anomaly Detection Models

Oracle Data Mining uses one-class SVM for Anomaly Detection (AD).

There is no target for Anomaly Detection. To build an AD model, use an Anomaly Detection node connected to an appropriate data source.

Applying SVM Models

You apply a model to new data to predict behavior.

Use an Apply node to apply an SVM model.

You can apply all three kinds of SVM models.

Applying One-Class SVM Models

One-class SVM models, when applied, produce a prediction and a probability for each case in the scoring data.

This behavior reflects the fact that the model is trained with normal data.

  • If the prediction is 1, then the case is considered typical.

  • If the prediction is 0, then the case is considered anomalous.

SVM Classification Algorithm Settings

The settings that you can specify for the Support Vector Machine (SVM) algorithm depend on the Kernel function that you select.

The meaning of the individual settings is the same for both Classification and Regression.

To edit the SVM Classification algorithm settings:

  1. You can edit the settings by using one of the following options:
    • Right-click the Classification node and select Advanced Settings.

    • Right-click the Classification node and select Edit. Then, click Advanced.

  2. In the Algorithm Settings tab, select the Kernel Function. The options are:
    • System determined: (Default). After the model is built, the kernel used is displayed in the settings in the model viewer.

    • Linear: If SVM uses the linear kernel, then the model generates coefficients.

    • Gaussian (a non-linear function).

  3. Click OK after you are done.

Algorithm Settings for Linear or System Determined Kernel (SVMC)

Lists the algorithm settings for an SVM Classification model if a linear kernel is specified.

If you specify a linear kernel or if you let the system determine the kernel, then you can change the following settings:

  • Tolerance Value

  • Complexity Factor

  • Active Learning

  • Solver: Displays the list of SVM solvers.

    • System Determined (default)

    • Sub-Gradient Descend. To specify the settings for Sub-Gradient Descend solver, click Option. The Solver (Sub-Gradient Descend) dialog box opens.

    • Interior Point Method

    Note:

    The Solver cannot be selected if the kernel is non-linear.
  • Number of Iterations: Sets the upper limit on the number of SVM iterations.

    • System Determined

    • User Specified

Solver (Sub-Gradient Descend)

You can specify settings for Sub-Gradient Descend in the Solver Options dialog box.

Specify the following settings for Sub-Gradient Descend:

  1. Regularizer: Controls the type of regularization used by the Support Vector Machine solver. The setting can be used only for linear SVM models. Options are:
    • System Determined

    • L1

    • L2

  2. Batch Rows: Sets the batch size for the Support Vector Machine solver. Options are:
    • System Determined

    • Default: 2000

  3. Click OK.

Algorithm Settings for Gaussian Kernel (SVMC)

Lists the algorithm settings for an SVM Classification model if the Gaussian kernel is specified.

If you specify the Gaussian kernel, then you can change the following settings:

  • Tolerance Value

  • Complexity Factor

  • Active Learning

    Note:

    Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

  • Standard Deviation (Gaussian Kernel)

  • Cache Size (Gaussian Kernel)

  • Solver: Displays the list of SVM solvers for Gaussian kernel.

    • System Determined

    • Interior Point Method

  • Number of Iterations: Sets the upper limit on the number of SVM iterations.

    • System Determined

    • User Specified

  • Number of Pivots used in the Incomplete Cholesky Decomposition: Sets the upper limit on the number of pivots used in the incomplete Cholesky decomposition. It is applicable only for non-linear kernels. The value must be a positive integer in the range 1 to 10000. Default is 200.

Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors, maintaining accuracy while enhancing the speed of the model.

Note:

Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

The key features of Active Learning are:

  • Increases performance for a linear kernel. For a Gaussian kernel, Active Learning both increases performance and reduces the size of the model. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

Complexity Factor

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity.

It balances avoiding over-fit (an over-complex model that fits noise in the training data) against under-fit (a model that is too simple). The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify the complexity factors.

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

If the histogram of the target attribute is skewed to the left or to the right, then try increasing complexity.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. For Anomaly Detection, the default complexity factor is 1.
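The trade-off that the complexity factor controls can be sketched as a soft-margin style objective; all values below are illustrative:

```python
# Sketch of the trade-off the complexity factor controls: the build
# minimizes model complexity plus complexity_factor times the training
# error. A large factor punishes errors heavily (risking over-fit); a
# small factor constrains the model (risking under-fit).
def soft_margin_objective(weights, slack_errors, complexity_factor):
    model_complexity = sum(w * w for w in weights)
    training_error = sum(slack_errors)
    return model_complexity + complexity_factor * training_error

# soft_margin_objective([1.0, 1.0], [0.5], 10.0) == 7.0
```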

Tolerance Value

Tolerance Value is the maximum size of a violation of convergence criteria such that the model is considered to have converged.

The default value is 0.001. Larger values imply faster building but less accurate models.

Cache Size (Gaussian Kernel)

If you select the Gaussian kernel, then you can specify a cache size for the size of the cache used for storing computed kernels during the build operation.

The default size is 50 megabytes.

The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach taken to build is to converge within a chunk of data at a time, then to test for violators outside of the chunk. The build is complete when there are no more violators within tolerance. The size of the chunk is chosen such that the associated kernels can be maintained in memory in a Kernel Cache. The larger the chunk size, the better the chunk represents the population of training data and the fewer number of times new chunks must be created. Generally, larger caches imply faster builds.

Standard Deviation (Gaussian Kernel)

Standard Deviation is a measure that is used to quantify the amount of variation.

If you select the Gaussian kernel, then you can specify the standard deviation of the Gaussian kernel. This value must be positive. The default is to not specify the standard deviation.

For Anomaly Detection, the default Standard Deviation is 1.

SVM Classification Test Viewer

By default, any Classification or Regression model is automatically tested. You have the option to view the test results.

A Classification model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the build node and select View Results.

SVM Classification Model Viewer

You can examine the details of an SVM Classification model in the SVM Model Viewer.

The tabs displayed in an SVMC model viewer depend on the kernel used to build the model:

  • SVMC model viewer for models with Linear Kernel

  • SVMC model viewer for models with Gaussian Kernel

SVMC Model Viewer for Models with Linear Kernel

Lists the tabs that are available if the Support Vector Machines Classification model has a Linear kernel.

The tabs are:

Coefficients (SVMC Linear)

Support Vector Machine Models built with the Linear Kernel have coefficients. The coefficients are real numbers. The number of coefficients may be quite large.

The Coefficients tab enables you to view SVM coefficients. The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

Coefficients are displayed in the Coefficients Grid. The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

Compare (SVMC Linear)

Support Vector Machine models built with the Linear kernel allow the comparison of target values.

For selected attributes, Data Miner calculates the propensity, that is, the natural inclination or preference to favor one of two target values. For example, propensity for target value 1 is the propensity to favor target value 1.

To compare target values:

  1. Select how to display information:
    • Fetch Size: The default fetch size is 1000 attributes. You can change this number.

    • Sort by absolute value: This is the default. You can deselect this option.

  2. Select two distinct target values to compare:
    • Target Value 1: Select the first target value.

    • Target Value 2: Select the second target value.

  3. Click Query. If you have not changed any defaults, then this step is not necessary.

The following information is displayed in the grid:

  • Attribute: The name of the attribute.

  • Value: Value of the attribute.

  • Propensity for Target_Value_1: Propensity to favor Target Value 1.

  • Propensity for Target_Value_2: Propensity to favor Target Value 2.

Settings (SVMC)

The Settings tab contains information related to the model summary, inputs, target values, cost matrix (if the model is tuned), partition keys (if the model is partitioned) and so on.

In the Partition field, click the partition name. The partition detail is displayed in the Partition Details window.

Click search to open the Select Partition dialog box.

The Settings tab displays information about how the model was built:

SVMC Model Viewer for Models with Gaussian Kernel

Lists the tabs that are available if the Support Vector Machines Classification model has a Gaussian kernel.

The tabs are:

Summary (SVMC)

The Summary tab contains information related to inputs, build details, algorithm settings, and other general settings.

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments.

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (SVMC)

The Inputs tab displays the list of attributes that are used to build the model.

For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps to distinguish between user-specified and Automatic Data Preparation (ADP): ADP can be turned off, but the user can still embed a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Weights (SVMC)

In Support Vector Machine classification, weights are a biasing mechanism for specifying the relative importance of target values (classes).

SVM models are automatically initialized to achieve the best average prediction across all classes. However, if the training data does not represent a realistic distribution, then you can bias the model to compensate for class values that are underrepresented. If you increase the weight for a class, then the percentage of correct predictions for that class should increase.

Target Values (SVMC)

Target Values for Support Vector Machine for Classification models show the values of the target attributes.

  • Click view to search for target values.

  • Click delete to clear search.

Coefficients (SVMC Linear)

Support Vector Machine Models built with the Linear Kernel have coefficients. The coefficients are real numbers. The number of coefficients may be quite large.

The Coefficients tab enables you to view SVM coefficients. The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

Coefficients are displayed in the Coefficients Grid. The relative value of coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

Coefficients Grid (SVMC)

The coefficients grid has these controls:

  • Target Value: Select a specific target value and see the coefficients associated with that value. The default is to display the coefficients for the value that occurs least frequently.

  • Sort By Absolute Value: If selected, then coefficients are sorted by absolute value. If you sort by absolute value, then a coefficient of -2 comes before a coefficient of 1.9. The default is to sort by absolute value.

  • Fetch Size: The number of rows displayed. To check whether all the coefficients are displayed, choose a fetch size that is greater than the number of rows displayed.

You can search for attributes by name. Use view. If no items are listed in the grid, then there are no coefficients for the selected target value. The coefficients grid has these columns:

  • Attribute: Name of the attribute.

  • Value: Value of the attribute. If the attribute is binned, then this may be a range.

  • Coefficient: The probability for the value of the attribute.

    The value is shown as a bar with the value centered in the bar. Positive values are light blue; negative values are red.

Compare (SVMC Linear)

Support Vector Machine models built with the Linear kernel allow the comparison of target values.

For selected attributes, Data Miner calculates the propensity, that is, the natural inclination or preference to favor one of two target values. For example, propensity for target value 1 is the propensity to favor target value 1.

To compare target values:

  1. Select how to display information:
    • Fetch Size: The default fetch size is 1000 attributes. You can change this number.

    • Sort by absolute value: This is the default. You can deselect this option.

  2. Select two distinct target values to compare:
    • Target Value 1: Select the first target value.

    • Target Value 2: Select the second target value.

  3. Click Query. If you have not changed any defaults, then this step is not necessary.

The following information is displayed in the grid:

  • Attribute: The name of the attribute.

  • Value: Value of the attribute.

  • Propensity for Target_Value_1: Propensity to favor Target Value 1.

  • Propensity for Target_Value_2: Propensity to favor Target Value 2.

Search

Use view to search the grid.

You can search by name (the default), by value, and by propensity for Target Value 1 or propensity for Target Value 2.

  • To select a different search option, click the triangle beside the binoculars.

  • To clear a search, click delete.

Propensity

Propensity shows, for a given attribute/value pair, which of the two target values has the stronger predictive relationship. Propensity can be measured in terms of being predicted for or against a target value. If propensity is against a value, then the number is negative.

Settings (SVMC)

The Settings tab contains information related to the model summary, inputs, target values, cost matrix (if the model is tuned), partition keys (if the model is partitioned) and so on.

In the Partition field, click the partition name. The partition detail is displayed in the Partition Details window.

Click search to open the Select Partition dialog box.

The Settings tab displays information about how the model was built:

Summary (SVMC)

The Summary tab contains information related to inputs, build details, algorithm settings, and other general settings.

  • General settings lists the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments.

  • Algorithm Settings lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Settings (SVMC Linear)

The Settings tab comprises the following:

Inputs (SVMC)

The Inputs tab displays the list of attributes that are used to build the model.

For each attribute the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps to distinguish between user-specified and Automatic Data Preparation (ADP): ADP can be turned off, but the user can still embed a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Target Values (SVMC)

Target Values for Support Vector Machine for Classification models show the values of the target attributes.

  • Click view to search for target values.

  • Click delete to clear search.

Partition Keys

The Partition Keys tab lists the columns that are partitioned.

Along with the partitioned columns, the Partition Keys tab lists the following details:
  • Partition Name

  • Source

  • Data Type

  • Value

Weights (SVMC)

In Support Vector Machine classification, weights are a biasing mechanism for specifying the relative importance of target values (classes).

SVM models are automatically initialized to achieve the best average prediction across all classes. However, if the training data does not represent a realistic distribution, then you can bias the model to compensate for class values that are underrepresented. If you increase the weight for a class, then the percentage of correct predictions for that class should increase.
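The biasing effect of class weights can be sketched in plain Python (this illustrates the idea only, not the Oracle Data Mining API; the class labels and weights are hypothetical):

```python
def weighted_error(actual, predicted, weights):
    """Total misclassification cost, where each class's errors are scaled
    by its weight. Raising a class's weight makes errors on that class
    more expensive, biasing the fit toward predicting it correctly."""
    return sum(weights[a] for a, p in zip(actual, predicted) if a != p)

actual    = ["fraud", "ok", "ok", "fraud", "ok"]
predicted = ["ok",    "ok", "ok", "ok",    "ok"]

# Equal weights: the two missed fraud cases cost 2.0.
equal = weighted_error(actual, predicted, {"fraud": 1.0, "ok": 1.0})

# Up-weighting the underrepresented class makes the same mistakes cost 6.0,
# so an optimizer minimizing this cost favors catching fraud cases.
biased = weighted_error(actual, predicted, {"fraud": 3.0, "ok": 1.0})
```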

Algorithm Settings for SVMC

For Classification, the SVM algorithm has these settings:

  • Algorithm Name: Support Vector Machine

  • Kernel Function: Gaussian or Linear

  • Tolerance Value: The default is 0.001

  • Specify Complexity Factor: By default, it is not specified.

  • Active Learning: ON

  • Standard Deviation (Gaussian Kernel only)

  • Cache Size (Gaussian Kernel only)

SVM Regression Algorithm Settings

The settings that you can specify for the Support Vector Machine (SVM) algorithm depend on the Kernel function that you select.

The meaning of the individual settings is the same for both Classification and Regression.

To edit the SVM Regression Algorithm settings:

  1. You can edit the settings by using one of the following options:
    • Right-click the Regression node and select Advanced Settings.

    • Right-click the Regression node and select Edit. Then click Advanced.

  2. The settings are available in the Algorithm Settings tab. Select Kernel Function. The options are:
    • System determined (Default). After the model is built, the kernel used is displayed in the settings in the model viewer.

    • Linear. If SVM uses the linear kernel, then the model generates coefficients.

    • Gaussian (a non-linear function).

  3. Click OK after you are done.

Algorithm Settings for Linear or System Determined Kernel (SVMR)

If you specify a linear kernel or if you let the system determine the kernel, then you can change the Tolerance Value, Complexity Factor, and Active Learning settings for an SVM Regression model.

Tolerance Value

Tolerance Value is the maximum size of a violation of convergence criteria such that the model is considered to have converged.

The default value is 0.001. Larger values imply faster building but less accurate models.
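The role of the tolerance value can be sketched as a convergence check (plain Python, illustrative only; the violation values are hypothetical):

```python
def converged(violations, tolerance=0.001):
    """The build stops once the largest remaining violation of the
    convergence criteria is no bigger than the tolerance. A larger
    tolerance accepts a rougher fit, so the build finishes sooner."""
    return max(violations, default=0.0) <= tolerance

# Hypothetical per-case violations at some point during the build:
state = [0.0004, 0.0021, 0.0007]

strict = converged(state)                  # default 0.001: not yet converged
loose  = converged(state, tolerance=0.01)  # looser tolerance: converged
```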

Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors, maintaining accuracy while enhancing the speed of the model.

Note:

Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

The key features of Active Learning are:

  • Increases performance for a linear kernel. For a Gaussian kernel, active learning both increases performance and reduces the size of the model. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

Complexity Factor

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity.

Its responsibility is to avoid over-fit (an over-complex model fitting noise in the training data) and under-fit (a model that is too simple). The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify Complexity Factor.

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

If the histogram of the target attribute is skewed to the left or to the right, then try increasing the complexity factor.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. If you specify a complexity factor for Anomaly Detection, then the default is 1.
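The trade-off that the complexity factor controls can be sketched as a regularized objective (plain Python, illustrative only; the weights and error values are hypothetical, not Oracle internals):

```python
def svm_objective(weights, errors, complexity_factor):
    """Regularized SVM-style objective: the squared weights measure model
    complexity, and the complexity factor C scales the penalty on
    training errors. Minimizing this trades one against the other."""
    regularization = sum(w * w for w in weights)
    error_penalty = complexity_factor * sum(errors)
    return regularization + error_penalty

w, errs = [0.5, -0.3], [0.2, 0.0, 0.4]

# A small C tolerates errors (risk of under-fit); a large C punishes
# them heavily, pushing toward a perfect separation (risk of over-fit).
small_c = svm_objective(w, errs, complexity_factor=0.1)
large_c = svm_objective(w, errs, complexity_factor=100.0)
```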

Algorithm Settings for Gaussian Kernel (SVMR)

If you specify the Gaussian kernel, then you can edit the Tolerance Value, Complexity Factor, Active Learning, Standard Deviation, and Cache Size settings for an SVM Regression model.

Tolerance Value

Tolerance Value is the maximum size of a violation of convergence criteria such that the model is considered to have converged.

The default value is 0.001. Larger values imply faster building but less accurate models.

Complexity Factor

The complexity factor determines the trade-off between minimizing model error on the training data and minimizing model complexity.

Its responsibility is to avoid over-fit (an over-complex model fitting noise in the training data) and under-fit (a model that is too simple). The default is to not specify a complexity factor.

You specify the complexity factor for an SVM model by selecting Specify Complexity Factor.

A very large value of the complexity factor places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes. A small value for the complexity factor places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

If the histogram of the target attribute is skewed to the left or to the right, then try increasing the complexity factor.

The default is to specify no complexity factor, in which case the system calculates a complexity factor. If you do specify a complexity factor, then specify a positive number. If you specify a complexity factor for Anomaly Detection, then the default is 1.

Active Learning

Active Learning is a methodology that optimizes the selection of a subset of the support vectors, maintaining accuracy while enhancing the speed of the model.

Note:

Active Learning is not supported in Oracle Data Miner 18.1 connected to Oracle Database 12.2 and later.

The key features of Active Learning are:

  • Increases performance for a linear kernel. For a Gaussian kernel, active learning both increases performance and reduces the size of the model. This is an important consideration if memory and temporary disk space are issues.

  • Forces the SVM algorithm to restrict learning to the most informative examples and not to attempt to use the entire body of data. Usually, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.

You should not disable this setting.

Active Learning is selected by default. It can be turned off by deselecting Active Learning.

Standard Deviation (Gaussian Kernel)

Standard Deviation is a measure that is used to quantify the amount of variation.

If you select the Gaussian kernel, then you can specify the standard deviation of the Gaussian kernel. This value must be positive. The default is to not specify the standard deviation.

For Anomaly Detection, if you specify Standard Deviation, then the default is 1.
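Assuming the standard Gaussian (RBF) kernel formulation, the standard deviation controls how quickly similarity decays with distance. A plain-Python sketch (not the Oracle implementation):

```python
import math

def gaussian_kernel(x, y, std_dev=1.0):
    """Gaussian (RBF) kernel between two vectors. The value is 1 for
    identical inputs and decays toward 0 with squared distance; a larger
    standard deviation makes the decay slower (a smoother model)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * std_dev ** 2))

identical = gaussian_kernel([1.0, 2.0], [1.0, 2.0])   # always 1.0

# At the same distance, a larger std_dev yields higher similarity.
wide   = gaussian_kernel([0.0], [2.0], std_dev=2.0)
narrow = gaussian_kernel([0.0], [2.0], std_dev=1.0)
```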

Cache Size (Gaussian Kernel)

If you select the Gaussian kernel, then you can specify the size of the cache used to store computed kernels during the build operation.

The default size is 50 megabytes.

The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach taken to build is to converge within a chunk of data at a time, then to test for violators outside of the chunk. The build is complete when there are no more violators within tolerance. The size of the chunk is chosen such that the associated kernels can be maintained in memory in a kernel cache. The larger the chunk size, the better the chunk represents the population of training data and the fewer times new chunks must be created. Generally, larger caches imply faster builds.
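The benefit of the kernel cache can be sketched as a memoizing cache with least-recently-used eviction (plain Python; a conceptual stand-in, not Oracle's implementation):

```python
import math
from collections import OrderedDict

class KernelCache:
    """Fixed-capacity cache of computed kernel values. Repeated requests
    for the same pair are served from memory instead of recomputed; when
    full, the least recently used entry is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.computations = 0  # counts expensive kernel evaluations

    def kernel(self, x, y, std_dev=1.0):
        key = (x, y)
        if key in self.store:
            self.store.move_to_end(key)      # cache hit: no recomputation
            return self.store[key]
        self.computations += 1               # cache miss: compute the kernel
        value = math.exp(-((x - y) ** 2) / (2.0 * std_dev ** 2))
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        self.store[key] = value
        return value

cache = KernelCache(capacity=100)
for _ in range(3):
    cache.kernel(0.0, 1.0)   # computed once, then served from the cache
```

A larger capacity means more kernel values stay resident, which is why larger caches generally imply faster builds.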

Automatic Data Preparation

Most algorithms require some form of data transformation. During the model building process, Oracle Data Mining can automatically perform the transformations required by the algorithm.

You can supplement the automatic transformations with additional transformations of your own, or you can manage all the transformations yourself.

In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.

SVM Regression Test Viewer

By default, any Classification or Regression model is automatically tested. You have the option to view the test results.

A Classification model is tested by comparing the predictions of the model with known results. Oracle Data Miner keeps the latest test result.

To view the test results for a model, right-click the Build node and select View Results.

Related Topics

SVM Regression Model Viewer

You can examine an SVM Regression model in the SVM Regression Model Viewer.

The information displayed in the model viewer depends on which kernel was used to build the model.

  • If the Gaussian kernel was used, then there is only the Settings tab.

  • If the Linear Kernel was used, then there are two tabs: Coefficients and Settings.

The tabs displayed in an SVMR model viewer depend on the kernel used to build the model:

  • SVMR model viewer for models with Linear Kernel

  • SVMR model viewer for models with Gaussian Kernel

SVMR Model Viewer for Models with Linear Kernel

Lists the tabs that are available if the Support Vector Machines for Regression model has a Linear kernel.

The tabs are:

Coefficients (SVMR)

The Coefficients tab enables you to view SVMR coefficients.

Support Vector Machine Models built with the Linear Kernel have coefficients. The coefficients are real numbers. The number of coefficients may be quite large.

The viewer supports sorting to specify the order in which coefficients are displayed and filtering to select which coefficients to display.

The coefficients are displayed in the SVMR Coefficients Grid. The relative value of the coefficients is shown graphically as a bar, with different colors for positive and negative values. For numbers close to zero, the bar may be too small to be displayed.

SVMR Coefficients Grid

Information about coefficients is organized as follows:

  • Sort by Absolute Value: The default is to sort by absolute value. For example, 1 and -1 have the same absolute value. If you change this value, then you must click Query.

  • Fetch Size: It is the maximum number of rows to fetch; the default is 1,000. Smaller values result in faster fetches. If you change this value, then you must click Query.

  • Coefficients: The number of coefficients displayed; for example, 95 out of 95, indicating that there are 95 coefficients and all 95 of them are displayed.

You can perform the following tasks:

  • Search: Use view to search for items. You can search by:

    • Attribute name (Default)

    • Value

    • Coefficient

    • All (AND): If you search by this criterion, then you search for items that satisfy all of the specified criteria. For example, a search for ED Bac finds all attributes where both values appear.

    • All (Or): If you search by this criterion, then you search for attributes that include at least one of the values.

  • Clear search: To clear a search, click delete.

  • To select a different search option, click the triangle beside the binoculars.

Coefficients are listed in a grid. The coefficients grid has these columns:

  • Attribute: Name of the attribute

  • Value: Value of the attribute

  • Coefficient: The value of each coefficient for the selected target value is displayed. A bar is shown in front of (and possibly overlapping) each coefficient. The bar indicates the relative size of the coefficient. For positive values, the bar is light blue; for negative values, the bar is red. If a value is close to 0, then the bar may be too small to be displayed.
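Conceptually, a linear-kernel model scores a case by adding the coefficients of its matching attribute values to an intercept. A plain-Python sketch (the attribute names, values, and coefficients are hypothetical, and this is not the Oracle scoring engine):

```python
# Hypothetical coefficients as they might appear in the grid,
# keyed by (attribute, value); numerical attributes use a single
# coefficient, marked here with value None.
coefficients = {
    ("EDUCATION", "Bac"):     0.42,
    ("EDUCATION", "Masters"): -0.15,
    ("AGE", None):            0.03,
}
intercept = 0.10

def score(case):
    """Linear prediction: intercept plus the coefficient of each matching
    (attribute, value) pair; numerical attributes contribute
    coefficient * value."""
    total = intercept
    for attr, val in case.items():
        if (attr, val) in coefficients:        # categorical match
            total += coefficients[(attr, val)]
        elif (attr, None) in coefficients:     # numerical attribute
            total += coefficients[(attr, None)] * val
    return total

prediction = score({"EDUCATION": "Bac", "AGE": 30})
```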

Settings (SVMR)

The Settings tab displays information about how the model was built.

The information is displayed in the following tabs:

  • Summary tab: Contains the Model and Algorithm settings.

  • Inputs tab: Contains the attributes used to build the model.

SVMR Model Viewer for Models with Gaussian Kernel

Lists the tabs that are available if the Support Vector Machines for Regression model has a Gaussian kernel.

The tabs are:

Summary (SVMR)

The Summary tab contains information related to inputs, build details, algorithm settings, and other general settings.

  • General settings list the following:

    • Type of model (Classification, Regression, and so on)

    • Owner of the model (the Schema where the model was built)

    • Model Name

    • Creation Date

    • Duration of the model build (in minutes)

    • Size of the model (in MB)

    • Comments.

  • Algorithm Settings section lists the following:

    • The name of the algorithm used to build the model.

    • The algorithm settings that control the model build.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

Inputs (SVMR)

The Inputs tab displays the list of attributes that are used to build the model.

For each attribute, the following information is displayed:

  • Name: The name of the attribute.

  • Data Type: The data type of the attribute.

  • Mining Type: Categorical or Numerical.

  • Target: The check icon indicates that the attribute is a target attribute.

  • Data Preparation: YES indicates that data preparation was performed. This helps distinguish between user-supplied and Automatic Data Preparation (ADP): ADP can be turned off, but the user can still embed a transformation. If Data Preparation is YES, then select the column and click it. Each group can contain an Input and a Reverse Expression. If there is no Reverse Expression, then it is not displayed. If there is no Input, then nothing is displayed. Transformations are displayed in SQL notation.

  • Partition Key: YES indicates that the attribute is a partition key.

Settings Information

Certain settings, such as Automatic Data Preparation, Epsilon Value, Support, and Confidence, are common to most algorithms.

This section contains topics about settings that are common to most algorithms:

General Settings

The generic settings are contained in the Settings tab and General tab.

The Settings tab of a model viewer displays settings in three categories:

  • General displays generic information about the model, as described in this topic.

  • Algorithm Settings displays information that is specific to the selected algorithm.

  • Build Details displays computed settings. Computed settings are generated by Oracle Data Mining when the model is created.

The General tab contains the following information for all algorithms:

  • Type: The mining function for the model: anomaly detection, association rules, attribute importance, classification, clustering, feature extraction, or regression.

  • Owner: The data mining account (schema) used to build the model.

  • Model Name: The name of the model.

  • Target Attribute: The target attribute; only Classification and Regression models have targets.

  • Creation Date: The date when the model was created, in the form MM/DD/YYYY.

  • Duration: The time in minutes required to build the model.

  • Size: The size of the model in megabytes.

  • Comment: For models not created using Oracle Data Miner, this option displays comments embedded in the models. To see comments for models built using Oracle Data Miner, go to Properties for the node where the model is built.

    Models created using Oracle Data Miner may contain BALANCED, NATURAL, CUSTOM, or TUNED. Oracle Data Miner inserts these values to indicate if the model has been tuned and in what way it was tuned.

Automatic Data Preparation

In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.

Most algorithms require some form of data transformation. During the model building process, Oracle Data Mining can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.

If Automatic Data Preparation is performed, then the same data preparation is automatically performed for data that is scored using the model. If Automatic Data Preparation is OFF, that is, if you manage all the transformations yourself, then you must prepare the Apply data in the same way that the Build data was prepared.
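The requirement that Apply data be prepared exactly like Build data can be sketched with a simple normalization (plain Python, illustrative; the normalization method and values are hypothetical):

```python
def fit_minmax(column):
    """Learn normalization parameters from the Build data only."""
    return min(column), max(column)

def apply_minmax(column, lo, hi):
    """Reuse the *same* parameters when scoring: the Apply data must be
    transformed exactly as the Build data was, never refitted."""
    return [(v - lo) / (hi - lo) for v in column]

build_column = [10.0, 20.0, 30.0]
lo, hi = fit_minmax(build_column)                 # fitted once, on Build data
scored = apply_minmax([15.0, 40.0], lo, hi)       # same lo/hi reused on Apply data
```

Refitting the parameters on the Apply data would silently shift every scored value, which is exactly the mistake this rule guards against.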

Other Settings

The other settings are related to the number of attributes in a rule, data preparation, support, and confidence.

The settings are:

  • Limit Number of Attributes in Each Rule: By default, this option is selected. It specifies the maximum number of attributes in each rule, which must be an integer between 2 and 20. Higher values result in slower builds. You can change the number of attributes in a rule, or you can specify no limit for the number of attributes in a rule. It is a good practice to start with the default and increase this number slowly.

    • To specify no limit, deselect this option.

    • Specifying many attributes in each rule increases the number of rules considerably.

    • The default is 3.

  • Automatic Preparation: ON or OFF. ON signifies that Automatic Data Preparation is used for normalization and outlier detection. The SVM algorithm automatically handles missing value treatment and the transformation of categorical data. Normalization and outlier detection must be handled by ADP or prepared manually. The default is ON.

  • Minimum Support: A number between 0 and 100 indicating a percentage. Smaller values for support result in slower builds and require more system resources. The default is 5%.

  • Minimum Confidence: Confidence in the rules. A number between 0 and 100 indicating a percentage. High confidence results in a faster build. The default is 10%.
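Minimum Support and Minimum Confidence can be illustrated with a small plain-Python computation (the market-basket data is hypothetical, and this is not the Oracle rule engine):

```python
def support(transactions, itemset):
    """Percentage of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the percentage
    that also contain the consequent."""
    both = support(transactions, antecedent | consequent)
    return 100.0 * both / support(transactions, antecedent)

baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}]

s = support(baskets, {"milk", "bread"})       # 2 of 4 baskets: 50%
c = confidence(baskets, {"milk"}, {"bread"})  # 2 of the 3 milk baskets: ~66.7%
```

A rule is kept only if both values meet the configured minimums, which is why lowering Minimum Support generates far more candidate rules and slows the build.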

Epsilon Value

Support Vector Machine makes a distinction between small errors and large errors. The difference is defined by the epsilon value.

The algorithm calculates and optimizes an epsilon value internally, or you can supply a value.

You can specify the epsilon value for an SVM model by clicking the option Yes in answer to the question Do you want to specify an epsilon value?.

The epsilon value must be either greater than 0 or undefined.

  • If the number of support vectors defined by the model is very large, then try a larger epsilon value.

  • If there are very high cardinality categorical attributes, then try decreasing epsilon.

By default, no epsilon value is specified. In such a case, the algorithm calculates an epsilon value.
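Assuming the standard epsilon-insensitive loss used in SVM regression, the distinction between small and large errors can be sketched as follows (plain Python, illustrative only):

```python
def epsilon_loss(actual, predicted, epsilon):
    """Epsilon-insensitive loss: residuals no larger than epsilon count as
    'small errors' and incur no penalty; only the excess beyond epsilon
    is penalized, so a larger epsilon needs fewer support vectors."""
    return max(0.0, abs(actual - predicted) - epsilon)

small = epsilon_loss(10.0, 10.05, epsilon=0.1)  # inside the tube: no penalty
large = epsilon_loss(10.0, 10.50, epsilon=0.1)  # outside: the 0.4 excess is penalized
```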