8.9 Decision Tree

The oml.dt class uses the Decision Tree algorithm for classification.

Decision Tree models are classification models that contain axis-parallel rules. A rule is a conditional statement that can be understood by humans and may be used within a database to identify a set of records.

A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.
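The rule structure described above can be sketched in a few lines of plain Python. This toy classifier is illustrative only: the thresholds are hypothetical, not learned by oml.dt, but they show how axis-parallel comparisons chain into a tree of questions.

```python
# A decision tree is a nest of axis-parallel comparisons: each "question"
# compares one attribute against a threshold, and the answer selects the
# next question until a leaf (a target value) is reached.
def classify(petal_length, petal_width):
    if petal_length <= 2.45:          # first question
        return 'setosa'               # leaf
    elif petal_width <= 1.75:         # second question
        return 'versicolor'           # leaf
    else:
        return 'virginica'            # leaf

print(classify(1.4, 0.2))   # setosa
print(classify(5.0, 2.0))   # virginica
```

Because each rule is a simple conditional on a single attribute, the fitted model can be read and applied directly, for example as a SQL predicate over database records.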

During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The oml.dt class offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
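The two homogeneity metrics have standard definitions, which this short sketch computes for a node's class counts. (This is a self-contained illustration of the formulas, not oml.dt internals.)

```python
import math

def gini(counts):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy: -sum(p_i * log2(p_i)); empty classes contribute 0.
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# A pure node has impurity 0; an evenly mixed node is maximally impure.
print(gini([10, 0]))    # 0.0
print(gini([5, 5]))     # 0.5
print(entropy([5, 5]))  # 1.0
```

A split is chosen to maximize the decrease in impurity from the parent node to its two children, whichever metric is in effect.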

For information on the oml.dt class attributes and methods, invoke help(oml.dt) or see Oracle Machine Learning for Python API Reference.

Settings for a Decision Tree Model

The following table lists settings that apply to Decision Tree models.

Table 8-4 Decision Tree Model Settings

Setting Name Setting Value Description
CLAS_COST_TABLE_NAME

table_name

The name of a table that stores a cost matrix for the algorithm to use in building and applying the model. The cost matrix specifies the costs associated with misclassifications.

The cost matrix table is user-created. The following are the column requirements for the table.

  • Column Name: ACTUAL_TARGET_VALUE

    Data Type: Valid target data type

  • Column Name: PREDICTED_TARGET_VALUE

    Data Type: Valid target data type

  • Column Name: COST

    Data Type: NUMBER

CLAS_MAX_SUP_BINS

2 <= a number <= 2147483647

Specifies the maximum number of bins for each attribute.

The default value is 32.

CLAS_WEIGHTS_BALANCED

ON

OFF

Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF.

TREE_IMPURITY_METRIC

TREE_IMPURITY_ENTROPY

TREE_IMPURITY_GINI

Tree impurity metric for a Decision Tree model.

Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. By default, the algorithm uses TREE_IMPURITY_GINI.

TREE_TERM_MAX_DEPTH

2 <= a number <= 100

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

The default is 7.

TREE_TERM_MINPCT_NODE

0 <= a number <= 10

The minimum number of training rows in a node expressed as a percentage of the rows in the training data.

The default value is 0.05, indicating 0.05%.

TREE_TERM_MINPCT_SPLIT

0 < a number <= 20

Minimum number of rows required to consider splitting a node expressed as a percentage of the training rows.

The default value is 0.1, indicating 0.1%.

TREE_TERM_MINREC_NODE

A number >= 0

Minimum number of rows in a node.

The default value is 10.

TREE_TERM_MINREC_SPLIT

A number > 1

Criteria for splits: the minimum number of records in a parent node, expressed as an absolute value. No split is attempted if the number of records is below this value.

The default value is 20.
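Multiple settings from Table 8-4 can be combined in a single dictionary and passed to the oml.dt constructor, as Example 8-9 does with a single setting. The particular values below are hypothetical; note that values are passed as strings.

```python
# A hypothetical combination of Decision Tree settings from Table 8-4.
# Setting values are supplied as strings.
setting = {'TREE_IMPURITY_METRIC':  'TREE_IMPURITY_ENTROPY',
           'TREE_TERM_MAX_DEPTH':   '5',
           'CLAS_MAX_SUP_BINS':     '64',
           'CLAS_WEIGHTS_BALANCED': 'ON'}

# dt_mod = oml.dt(**setting)   # requires an Oracle database connection
```

Any setting omitted from the dictionary keeps its default value, as shown in the Settings section of the model output in Example 8-9.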

Example 8-9 Using the oml.dt Class

This example demonstrates the use of various methods of the oml.dt class. In the listing for this example, ellipses indicate output that is not shown.

import oml
import pandas as pd
from sklearn import datasets 

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop('COST_MATRIX')
    oml.drop('IRIS')
except: 
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create a cost matrix table in the database.
cost_matrix = [['setosa', 'setosa', 0], 
               ['setosa', 'virginica', 0.2], 
               ['setosa', 'versicolor', 0.8],
               ['virginica', 'virginica', 0],
               ['virginica', 'setosa', 0.5],
               ['virginica', 'versicolor', 0.5],
               ['versicolor', 'versicolor', 0],
               ['versicolor', 'setosa', 0.4], 
               ['versicolor', 'virginica', 0.6]]
cost_matrix = oml.create(
  pd.DataFrame(cost_matrix, 
               columns = ['ACTUAL_TARGET_VALUE', 
                          'PREDICTED_TARGET_VALUE', 'COST']),
               table = 'COST_MATRIX')

# Specify settings.
setting = {'TREE_TERM_MAX_DEPTH':'2'}

# Create a DT model object.
dt_mod = oml.dt(**setting)

# Fit the DT model according to the training data and parameter
# settings.
dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

# Use the model to make predictions on the test data.
dt_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Petal_Length', 
                                                'Species']])

# Return the prediction probability.
dt_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Species']],
               proba = True)

# Make predictions and return the probability for each class
# on new data.
dt_mod.predict_proba(test_dat.drop('Species'), 
                     supplemental_cols = test_dat[:, 
                       ['Sepal_Length', 
                        'Species']]).sort_values(by = ['Sepal_Length',
                                                       'Species'])

dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> try:
...    oml.drop('COST_MATRIX')
...    oml.drop('IRIS')
... except: 
...    pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>> 
>>> # Create a cost matrix table in the database.
... cost_matrix = [['setosa', 'setosa', 0], 
...                ['setosa', 'virginica', 0.2], 
...                ['setosa', 'versicolor', 0.8],
...                ['virginica', 'virginica', 0],
...                ['virginica', 'setosa', 0.5],
...                ['virginica', 'versicolor', 0.5],
...                ['versicolor', 'versicolor', 0],
...                ['versicolor', 'setosa', 0.4], 
...                ['versicolor', 'virginica', 0.6]]
>>> cost_matrix = oml.create(
...  pd.DataFrame(cost_matrix, 
...               columns = ['ACTUAL_TARGET_VALUE', 
...                          'PREDICTED_TARGET_VALUE',
...                          'COST']),
...               table = 'COST_MATRIX')
>>>
>>> # Specify settings.
... setting = {'TREE_TERM_MAX_DEPTH':'2'}
>>> 
>>> # Create a DT model object.
... dt_mod = oml.dt(**setting)
>>> 
>>> # Fit the DT model according to the training data and parameter
... # settings.
>>> dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

Algorithm Name: Decision Tree

Mining Function: CLASSIFICATION

Target: Species

Settings: 
                    setting name            setting value
0                      ALGO_NAME       ALGO_DECISION_TREE
1           CLAS_COST_TABLE_NAME "OML_USER"."COST_MATRIX"
2              CLAS_MAX_SUP_BINS                       32
3          CLAS_WEIGHTS_BALANCED                      OFF
4                   ODMS_DETAILS              ODMS_ENABLE
5   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
6                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
7                      PREP_AUTO                       ON
8           TREE_IMPURITY_METRIC       TREE_IMPURITY_GINI
9            TREE_TERM_MAX_DEPTH                        2
10         TREE_TERM_MINPCT_NODE                      .05
11        TREE_TERM_MINPCT_SPLIT                       .1
12         TREE_TERM_MINREC_NODE                       10
13        TREE_TERM_MINREC_SPLIT                       20

Global Statistics:
   attribute name  attribute value
0        NUM_ROWS              104

Attributes: 
Petal_Length
Petal_Width

Partition: NO

Distributions: 

   NODE_ID TARGET_VALUE  TARGET_COUNT
0        0       setosa            36
1        0   versicolor            35
2        0    virginica            33
3        1       setosa            36
4        2   versicolor            35
5        2    virginica            33

Nodes: 

   parent  node.id  row.count  prediction  \
0     0.0        1         36      setosa   
1     0.0        2         68  versicolor   
2     NaN        0        104      setosa   

                                        split  \
0  (Petal_Length <=(2.4500000000000002E+000))
1   (Petal_Length >(2.4500000000000002E+000))
2                                        None

                                  surrogate  \
0  Petal_Width <=(8.0000000000000004E-001))   
1   Petal_Width >(8.0000000000000004E-001))   
2                                      None   

                                  full.splits  
0  (Petal_Length <=(2.4500000000000002E+000))  
1   (Petal_Length >(2.4500000000000002E+000))  
2                                           (  
>>> 
>>> # Use the model to make predictions on the test data.
... dt_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Petal_Length', 
...                                                 'Species']])
    Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
0            4.9          3.0           1.4      setosa      setosa
1            4.9          3.1           1.5      setosa      setosa
2            4.8          3.4           1.6      setosa      setosa
3            5.8          4.0           1.2      setosa      setosa
...          ...          ...           ...         ...         ...
44           6.7          3.3           5.7   virginica  versicolor
45           6.7          3.0           5.2   virginica  versicolor
46           6.5          3.0           5.2   virginica  versicolor
47           5.9          3.0           5.1   virginica  versicolor
>>> 
>>> # Return the prediction probability.
... dt_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Species']], 
...                proba = True)
    Sepal_Length  Sepal_Width     Species  PREDICTION  PROBABILITY
0            4.9          3.0      setosa      setosa     1.000000
1            4.9          3.1      setosa      setosa     1.000000
2            4.8          3.4      setosa      setosa     1.000000
3            5.8          4.0      setosa      setosa     1.000000
...          ...          ...         ...         ...          ...
44           6.7          3.3   virginica  versicolor     0.514706
45           6.7          3.0   virginica  versicolor     0.514706
46           6.5          3.0   virginica  versicolor     0.514706
47           5.9          3.0   virginica  versicolor     0.514706

>>> # Make predictions and return the probability for each class
>>> # on new data.
>>> dt_mod.predict_proba(test_dat.drop('Species'), 
...                      supplemental_cols = test_dat[:, 
...                        ['Sepal_Length', 
...                         'Species']]).sort_values(by = ['Sepal_Length', 
...                                                        'Species'])
    Sepal_Length     Species  PROBABILITY_OF_SETOSA  \
0            4.4      setosa                    1.0   
1            4.4      setosa                    1.0   
2            4.5      setosa                    1.0   
3            4.8      setosa                    1.0   
...          ...         ...                    ...
42           6.7   virginica                    0.0   
43           6.9  versicolor                    0.0   
44           6.9   virginica                    0.0   
45           7.0  versicolor                    0.0   

    PROBABILITY_OF_VERSICOLOR  PROBABILITY_OF_VIRGINICA  
0                    0.000000                  0.000000  
1                    0.000000                  0.000000  
2                    0.000000                  0.000000  
3                    0.000000                  0.000000  
...                       ...                       ...
42                   0.514706                  0.485294  
43                   0.514706                  0.485294  
44                   0.514706                  0.485294  
45                   0.514706                  0.485294  
>>> 
>>> dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.645833