This appendix covers the following topics:
Select an algorithm to use when creating a model. The selected algorithm uses the chosen model features to predict a specific target measure. The output of each model varies with the algorithm and features selected. The following descriptions provide a general overview of the algorithm choices for each model type.
When creating a Feature Significance model, you can choose either of the following algorithms:
Random Forest Classifier - Random Forest is an ensemble machine learning algorithm. It builds a number of decision tree models and predicts by combining the outputs of the ensemble of trees.
Chi Square - A chi-square test for independence compares two variables in a table (in our case, one of the selected features and the quality or yield result) to see if they are related. In a more general sense, it tests to see whether distributions of each variable differ from each other. The algorithm compares multiple feature/result variable pairs at once to determine which features have a relationship with the chosen result.
For a detailed explanation of the Random Forest algorithm, see "Random Forest" in Oracle Data Mining Concepts and visit the Apache Spark website at Random Forests.
You can set the following parameters for a random forest algorithm:
Maximum Depth - Maximum depth of each tree.
Minimum No. of Trees - The minimum number of trees to build before calculating the maximum voting or averages of predictions. A higher number of trees produces more accurate, stable predictions, but requires more processing time. Choose as high a value as your processing capacity allows.
Minimum Importance - Specify a minimum importance value; only features whose importance exceeds this value are reported.
For Random Forest algorithm examples, visit the Apache Spark website at Random Forests.
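The parameters above can be sketched with scikit-learn standing in for the product's Spark-based implementation; the data set, threshold value, and parameter mapping below are illustrative assumptions, not product defaults.

```python
# Hypothetical sketch: ranking feature significance with a random forest,
# using scikit-learn in place of the product's Spark-based implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data set: 100 records, 5 candidate features, binary quality result.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Maximum Depth and Minimum No. of Trees map roughly to max_depth and
# n_estimators here; more trees give more stable importance estimates.
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X, y)

# Minimum Importance: keep only features whose importance exceeds a threshold.
MIN_IMPORTANCE = 0.1  # illustrative threshold, not a product default
significant = [(i, imp) for i, imp in enumerate(model.feature_importances_)
               if imp > MIN_IMPORTANCE]
print(sorted(significant, key=lambda t: t[1], reverse=True))
```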
The Chi-Square test is used to test the independence of two events. More specifically, to determine feature significance, the Chi-Square test is used to test whether the occurrence of a specific term (input feature) and the occurrence of a specific class (output variable) are independent. The algorithm generates a p-value, which indicates how likely the observed data would be under the null hypothesis (where the input and output are completely independent). The lower the p-value, the more unlikely the null hypothesis, indicating a relationship between the input and the output variables. Higher p-values support the null hypothesis (no relationship). You must specify the acceptable Minimum Importance (p-value threshold) for the model. Features with a p-value above this limit are not considered significant to the quality or yield result. The system ranks all features with a p-value below the threshold, with the lowest p-value ranked the highest.
Features are ranked as follows:
| Minimum Importance (p-value) | Strength of Relationship |
|---|---|
| 0 to 0.01 | Very Strong |
| 0.01 to 0.05 | Strong |
| 0.05 to Minimum Importance (default value = 0.1) | Weak |
| > Minimum Importance | No relationship |
For more details about the Chi-Square algorithm, visit the Apache Spark website at Hypothesis testing.
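As a rough illustration of the test described above, the following sketch uses scipy's chi-square independence test; the contingency table and feature/result labels are invented for the example.

```python
# Illustrative sketch of the chi-square independence test; the contingency
# table below is made up (rows: a feature level, columns: pass/fail result).
from scipy.stats import chi2_contingency

table = [[90, 10],
         [60, 40]]
chi2, p_value, dof, expected = chi2_contingency(table)

# Rank the relationship strength using the p-value bands from the table above.
MIN_IMPORTANCE = 0.1  # the documented default threshold
if p_value <= 0.01:
    strength = "Very Strong"
elif p_value <= 0.05:
    strength = "Strong"
elif p_value <= MIN_IMPORTANCE:
    strength = "Weak"
else:
    strength = "No relationship"
print(p_value, strength)
```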
When creating an Insight model, you can choose either of the following algorithms:
Apriori - performs market basket analysis by identifying co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence.
Decision Tree - extracts predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction. Predictions models also use the Decision Tree algorithm and Feature Significance models use a variation of the Decision Tree algorithm, called Random Forest.
Association rules set the minimum level of predictability that is acceptable for this algorithm and data set. For a detailed explanation of the Apriori algorithm, see "Apriori" in Oracle Data Mining Concepts. Enter acceptable values for the following three parameters:
Maximum Rule Length - Defines the maximum number of features/predictors that can influence the model's output (quality or yield). The default value is 4 and the highest allowed value is 20.
Minimum Confidence - Specifies the minimum conditional probability of a target outcome, given that you have certain other outcomes in your data set. The default value is 0.75.
Minimum Support - Defines the required minimum percentage of the data containing the target outcome. The default value is 0.2.
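The support and confidence measures behind these parameters can be sketched in plain Python; the transactions and the rule below are invented for illustration and do not come from the product.

```python
# Minimal sketch of the support and confidence measures Apriori uses.
# Transactions and feature names are hypothetical examples.
transactions = [
    {"high_temp", "line_2", "defect"},
    {"high_temp", "line_2", "defect"},
    {"high_temp", "line_1"},
    {"low_temp", "line_2"},
    {"high_temp", "line_2", "defect"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {high_temp, line_2} -> {defect} is kept only if it clears both minimums.
MIN_SUPPORT, MIN_CONFIDENCE = 0.2, 0.75  # the documented defaults
antecedent, consequent = {"high_temp", "line_2"}, {"defect"}
s = support(antecedent | consequent)
c = confidence(antecedent, consequent)
print(s >= MIN_SUPPORT and c >= MIN_CONFIDENCE)
```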
For a detailed explanation of the Decision Tree algorithm, see "Decision Tree" in Oracle Data Mining Concepts.
You can set the following parameters for a decision tree algorithm:
Minimum Records to Node - the minimum number of records that must be present in a node for a split to occur. The default value is 10.
Minimum Records to Split - when the node splits the data into branches, this is the minimum number of records for each branch. The default value is 20.
Maximum Depth - the number of branch levels. The default value is 7.
Impurity Metric - Use either entropy or Gini. Gini is the default value. Purity metrics, also known as homogeneity metrics, assess the quality of alternative split conditions and select the one that results in the most homogeneous child nodes. Purity refers to the degree to which the resulting child nodes are made up of cases with the same target value. The objective is to maximize the purity in the child nodes.
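A hedged sketch of these parameters using scikit-learn's decision tree follows; the correspondence between the product's parameter names and scikit-learn's is an assumption, and the data is synthetic.

```python
# Hedged sketch mapping the decision tree parameters onto scikit-learn.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Assumed rough correspondences (not the product's documented mapping):
#   Minimum Records to Node  -> min_samples_split (records needed to split)
#   Minimum Records to Split -> min_samples_leaf  (records per branch)
#   Maximum Depth            -> max_depth
#   Impurity Metric          -> criterion ("gini" or "entropy")
tree = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=20,
                              max_depth=7, criterion="gini", random_state=0)
tree.fit(X, y)

# The fitted tree can be printed as human-readable if-then-else rules.
print(export_text(tree))
```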
When creating a Predictions model, you can choose either of the following algorithms:
Decision tree - Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction. Insight models also use the Decision Tree algorithm and Feature Significance models use a variation of the Decision Tree algorithm, called Random Forest.
SVM (Support Vector Machine) - Distinct versions of Support Vector Machines (SVM) use different kernel functions to handle different types of data sets. Linear and Gaussian (nonlinear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.
For a detailed explanation of the Decision Tree algorithm, see "Decision Tree" in Oracle Data Mining Concepts.
You can set the following parameters for a decision tree algorithm:
Minimum Records to Node - the minimum number of records that must be present in a node for a split to occur. The default value is 10.
Minimum Records to Split - when the node splits the data into branches, this is the minimum number of records for each branch. The default value is 20.
Maximum Depth - the number of branch levels. The default value is 7.
Impurity Metric - Use either entropy or Gini. Gini is the default value. Purity metrics, also known as homogeneity metrics, assess the quality of alternative split conditions and select the one that results in the most homogeneous child nodes. Purity refers to the degree to which the resulting child nodes are made up of cases with the same target value. The objective is to maximize the purity in the child nodes.
For a detailed explanation of the SVM algorithm, see "Support Vector Machines" in Oracle Data Mining Concepts.
Specify the following SVM parameters:
Active Learning - Value defaults to TRUE. This restricts the SVM algorithm to use the most informative samples of the data rather than attempting to use the whole data set.
Convergence Tolerance - Value defaults to 0.0001. Setting a larger tolerance value for the model convergence criteria enables faster model building, but produces a less accurate model. This optimization parameter is used to minimize the prediction loss of a target value while building a model.
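The kernel choice and convergence tolerance can be sketched with scikit-learn's SVM on synthetic data; Active Learning has no direct scikit-learn equivalent, so only the other two choices are shown, and this is an illustration rather than the product's implementation.

```python
# Illustrative sketch of the SVM choices above using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# kernel="linear" separates classes with the widest margin; "rbf" is the
# Gaussian (nonlinear) kernel. tol mirrors Convergence Tolerance: a larger
# value stops training earlier at some cost in accuracy.
clf = SVC(kernel="linear", tol=0.0001)
clf.fit(X, y)
print(clf.score(X, y))
```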