8.14 Naive Bayes

The oml.nb class creates a Naive Bayes (NB) model for classification.

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.

Naive Bayes assumes that each predictor is conditionally independent of the others, given the value of the target. (Bayes' theorem itself does not require independence; this simplifying "naive" assumption is what makes the conditional probabilities tractable to estimate from frequency counts.)
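The frequency-counting idea can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the oml.nb implementation; the attribute names and data are made up:

```python
from collections import Counter, defaultdict

# Toy training data: (attribute dict, target value) pairs.
rows = [({'outlook': 'sunny', 'windy': 'no'},  'play'),
        ({'outlook': 'sunny', 'windy': 'yes'}, 'stay'),
        ({'outlook': 'rainy', 'windy': 'yes'}, 'stay'),
        ({'outlook': 'sunny', 'windy': 'no'},  'play')]

# Count target frequencies and (attribute, target) value frequencies.
priors = Counter(target for _, target in rows)
cond = defaultdict(Counter)
for attrs, target in rows:
    for attr, value in attrs.items():
        cond[(attr, target)][value] += 1

def score(attrs, target):
    # Unnormalized posterior: P(target) * product of P(value | target).
    p = priors[target] / len(rows)
    for attr, value in attrs.items():
        p *= cond[(attr, target)][value] / priors[target]
    return p

new = {'outlook': 'sunny', 'windy': 'no'}
best = max(priors, key=lambda t: score(new, t))   # 'play'
```

The class with the largest prior-times-conditionals product is the prediction; a production implementation additionally handles zero counts (see the singleton and pairwise thresholds below) and normalizes the scores into probabilities.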

For information on the oml.nb class attributes and methods, invoke help(oml.nb) or see Oracle Machine Learning for Python API Reference.

Settings for a Naive Bayes Model

The following table lists the settings that apply to NB models.

Table 8-12 Naive Bayes Model Settings

Setting Name: CLAS_COST_TABLE_NAME
Setting Value: table_name
Description: The name of a table that stores a cost matrix for the algorithm to use in building the model. The cost matrix specifies the costs associated with misclassifications.

The cost matrix table is user-created. The following are the column requirements for the table.

  • Column Name: ACTUAL_TARGET_VALUE
    Data Type: Valid target data type
  • Column Name: PREDICTED_TARGET_VALUE
    Data Type: Valid target data type
  • Column Name: COST
    Data Type: NUMBER

Setting Name: CLAS_MAX_SUP_BINS
Setting Value: 2 <= a number <= 2147483647
Description: Specifies the maximum number of bins for each attribute. The default value is 32.

Setting Name: CLAS_PRIORS_TABLE_NAME
Setting Value: table_name
Description: The name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data.

The priors table is user-created. The following are the column requirements for the table.

  • Column Name: TARGET_VALUE
    Data Type: Valid target data type
  • Column Name: PRIOR_PROBABILITY
    Data Type: NUMBER

Setting Name: CLAS_WEIGHTS_BALANCED
Setting Value: ON or OFF
Description: Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF.

Setting Name: NABS_PAIRWISE_THRESHOLD
Setting Value: TO_CHAR(0 <= numeric_expr <= 1)
Description: Value of the pairwise threshold for the NB algorithm. The default value is 0.

Setting Name: NABS_SINGLETON_THRESHOLD
Setting Value: TO_CHAR(0 <= numeric_expr <= 1)
Description: Value of the singleton threshold for the NB algorithm. The default value is 0.
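The user-created tables referenced by CLAS_COST_TABLE_NAME and CLAS_PRIORS_TABLE_NAME can be prepared as pandas DataFrames with the required columns and then pushed to the database with oml.create. A minimal sketch, assuming the iris target values from the example below; the table names, cost values, and priors are illustrative, and the commented oml.create call requires an active database connection:

```python
import pandas as pd

# Cost matrix with the three required columns: here misclassifying
# 'virginica' as 'versicolor' is penalized more heavily than other errors.
cost_df = pd.DataFrame(
    [('virginica',  'versicolor', 5.0),
     ('versicolor', 'virginica',  1.0),
     ('setosa',     'versicolor', 1.0)],
    columns=['ACTUAL_TARGET_VALUE', 'PREDICTED_TARGET_VALUE', 'COST'])

# Priors table with the two required columns; probabilities sum to 1.
priors_df = pd.DataFrame(
    {'TARGET_VALUE': ['setosa', 'versicolor', 'virginica'],
     'PRIOR_PROBABILITY': [0.2, 0.3, 0.5]})

# With a connection, each DataFrame would be materialized as a table, e.g.:
# cost_tbl = oml.create(cost_df, table='NB_COST_DEMO')
# and referenced by the corresponding model setting.
```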

Example 8-14 Using the oml.nb Class

This example creates an NB model and uses some of the methods of the oml.nb class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()

train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# User specified settings.
setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}

# Create an oml NB model object.
nb_mod = oml.nb(**setting)

# Fit the NB model according to the training data and parameter
# settings.
nb_mod = nb_mod.fit(train_x, train_y)

# Show details of the model.
nb_mod

# Create a priors table in the database.
priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
priors = oml.create(pd.DataFrame(list(priors.items()), 
                       columns = ['TARGET_VALUE', 
                                  'PRIOR_PROBABILITY']), 
                       table = 'NB_PRIOR_PROBABILITY_DEMO')

# Change the setting parameter and refit the model 
# with a user-defined prior table.
new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
nb_mod = nb_mod.set_params(**new_setting).fit(train_x, 
                                              train_y, 
                                              priors = priors)
nb_mod

# Use the model to make predictions on test data.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Petal_Length', 
                                                'Species']])
# Return the prediction probability.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width',
                                                'Species']], 
               proba = True)


# Return the top two most influential attributes of the highest
# probability class.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Petal_Length',
                                                'Species']], 
               topN_attrs = 2)

# Make predictions and return the probability for each class
# on new data.
nb_mod.predict_proba(test_dat.drop('Species'), 
                     supplemental_cols = test_dat[:, 
                       ['Sepal_Length', 
                        'Species']]).sort_values(by = 
                           ['Sepal_Length',
                            'Species', 
                            'PROBABILITY_OF_setosa', 
                            'PROBABILITY_OF_versicolor'])

# Make predictions on new data and return the mean accuracy.
nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
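As noted for the CLAS_WEIGHTS_BALANCED setting, overall accuracy and average per-class accuracy can diverge when one class dominates. A quick illustration in plain Python with made-up labels, independent of oml:

```python
# Imbalanced toy test set: 9 rows of class 'a', 1 row of class 'b'.
actual    = ['a'] * 9 + ['b']
predicted = ['a'] * 9 + ['a']   # a model that always predicts the dominant class

# Overall accuracy: fraction of all rows predicted correctly.
overall = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Balanced accuracy: mean of per-class accuracies.
classes = sorted(set(actual))
per_class = [
    sum(p == a for p, a in zip(predicted, actual) if a == c)
    / sum(1 for a in actual if a == c)
    for c in classes]
balanced = sum(per_class) / len(per_class)

# overall is 0.9, but balanced accuracy is only 0.5:
# the model never gets class 'b' right.
```

Setting CLAS_WEIGHTS_BALANCED to ON asks the algorithm to build a model that does better on the second metric, at some possible cost to the first.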

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> try:
...    oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
...    oml.drop('IRIS')
... except:
...    pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
>>> dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # User specified settings.
... setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
>>>
>>> # Create an oml NB model object.
... nb_mod = oml.nb(**setting)
>>> 
>>> # Fit the NB model according to the training data and parameter
... # settings.
>>> nb_mod = nb_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nb_mod

Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings: 
                   setting name            setting value
0                     ALGO_NAME         ALGO_NAIVE_BAYES
1         CLAS_WEIGHTS_BALANCED                       ON
2       NABS_PAIRWISE_THRESHOLD                        0
3      NABS_SINGLETON_THRESHOLD                        0
4                  ODMS_DETAILS              ODMS_ENABLE
5  ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
6                 ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
7                     PREP_AUTO                       ON

Global Statistics:
   attribute name  attribute value
0        NUM_ROWS              104

Attributes: 
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors: 

   TARGET_NAME TARGET_VALUE  PRIOR_PROBABILITY  COUNT
0     Species       setosa           0.333333     36
1     Species   versicolor           0.333333     35
2     Species    virginica           0.333333     33

Conditionals: 

    TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE  \
0       Species       setosa   Petal_Length              None       ( ; 1.05]   
1       Species       setosa   Petal_Length              None     (1.05; 1.2]
2       Species       setosa   Petal_Length              None     (1.2; 1.35]
3       Species       setosa   Petal_Length              None    (1.35; 1.45]
...         ...          ...            ...               ...             ...   
152     Species    virginica    Sepal_Width              None    (3.25; 3.35]
153     Species    virginica    Sepal_Width              None    (3.35; 3.45]
154     Species    virginica    Sepal_Width              None    (3.55; 3.65]
155     Species    virginica    Sepal_Width              None    (3.75; 3.85]

     CONDITIONAL_PROBABILITY  COUNT  
0                   0.027778      1
1                   0.027778      1
2                   0.083333      3
3                   0.277778     10
...                      ...    ...  
152                 0.030303      1  
153                 0.060606      2  
154                 0.030303      1  
155                 0.060606      2

[156 rows x 7 columns]

>>> # Create a priors table in the database.
... priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
>>> priors = oml.create(pd.DataFrame(list(priors.items()), 
...                        columns = ['TARGET_VALUE', 
...                                   'PRIOR_PROBABILITY']), 
...                        table = 'NB_PRIOR_PROBABILITY_DEMO')
>>>
>>> # Change the setting parameter and refit the model 
... # with a user-defined prior table.
... new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
>>> nb_mod = nb_mod.set_params(**new_setting).fit(train_x, 
...                                               train_y,
...                                               priors = priors)
>>> nb_mod

Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings: 
                   setting name                          setting value
0                     ALGO_NAME                       ALGO_NAIVE_BAYES
1        CLAS_PRIORS_TABLE_NAME "OML_USER"."NB_PRIOR_PROBABILITY_DEMO"
2         CLAS_WEIGHTS_BALANCED                                    OFF
3       NABS_PAIRWISE_THRESHOLD                                      0
4      NABS_SINGLETON_THRESHOLD                                      0
5                  ODMS_DETAILS                            ODMS_ENABLE
6  ODMS_MISSING_VALUE_TREATMENT                ODMS_MISSING_VALUE_AUTO
7                 ODMS_SAMPLING                  ODMS_SAMPLING_DISABLE
8                     PREP_AUTO                                     ON

Global Statistics:
   attribute name  attribute value
0        NUM_ROWS              104

Attributes: 
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors: 

  TARGET_NAME TARGET_VALUE  PRIOR_PROBABILITY  COUNT
0     Species       setosa                0.2     36
1     Species   versicolor                0.3     35
2     Species    virginica                0.5     33

Conditionals: 

    TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE  \
0       Species       setosa   Petal_Length              None       ( ; 1.05]
1       Species       setosa   Petal_Length              None     (1.05; 1.2]
2       Species       setosa   Petal_Length              None     (1.2; 1.35]
3       Species       setosa   Petal_Length              None    (1.35; 1.45]
...         ...          ...            ...               ...             ...
152     Species    virginica    Sepal_Width              None    (3.25; 3.35]
153     Species    virginica    Sepal_Width              None    (3.35; 3.45]
154     Species    virginica    Sepal_Width              None    (3.55; 3.65]
155     Species    virginica    Sepal_Width              None    (3.75; 3.85]

     CONDITIONAL_PROBABILITY  COUNT  
0                   0.027778      1
1                   0.027778      1
2                   0.083333      3
3                   0.277778     10
...                      ...    ...
152                 0.030303      1
153                 0.060606      2
154                 0.030303      1
155                 0.060606      2

[156 rows x 7 columns]

>>> # Use the model to make predictions on test data.
... nb_mod.predict(test_dat.drop('Species'),
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Petal_Length', 
...                                                 'Species']])
    Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
0            4.9          3.0           1.4      setosa      setosa
1            4.9          3.1           1.5      setosa      setosa
2            4.8          3.4           1.6      setosa      setosa
3            5.8          4.0           1.2      setosa      setosa
...          ...          ...           ...         ...         ...
42           6.7          3.3           5.7   virginica   virginica
43           6.7          3.0           5.2   virginica   virginica
44           6.5          3.0           5.2   virginica   virginica
45           5.9          3.0           5.1   virginica   virginica

>>> # Return the prediction probability.
>>> nb_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width',
...                                                 'Species']], 
...                proba = True)
    Sepal_Length  Sepal_Width     Species  PREDICTION  PROBABILITY
0            4.9          3.0      setosa      setosa     1.000000
1            4.9          3.1      setosa      setosa     1.000000
2            4.8          3.4      setosa      setosa     1.000000
3            5.8          4.0      setosa      setosa     1.000000
...           ...          ...         ...         ...          ...
42           6.7          3.3   virginica   virginica     1.000000
43           6.7          3.0   virginica   virginica     0.953848
44           6.5          3.0   virginica   virginica     1.000000
45           5.9          3.0   virginica   virginica     0.932334

>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> nb_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Petal_Length',
...                                                 'Species']], 
...                topN_attrs = 2)
  Sepal_Length  Sepal_Width Petal_Length    Species PREDICTION \
0          4.9          3.0          1.4     setosa     setosa
1          4.9          3.1          1.5     setosa     setosa
2          4.8          3.4          1.6     setosa     setosa
3          5.8          4.0          1.2     setosa     setosa
... ... ... ... ... ...
42         6.7          3.3          5.7  virginica  virginica
43         6.7          3.0          5.2  virginica  virginica
44         6.5          3.0          5.2  virginica  virginica
45         5.9          3.0          5.1  virginica  virginica
                                   TOP_N_ATTRIBUTES
0 <Details algorithm="Naive Bayes" class="setosa...
1 <Details algorithm="Naive Bayes" class="setosa...
2 <Details algorithm="Naive Bayes" class="setosa...
3 <Details algorithm="Naive Bayes" class="setosa...
...
42 <Details algorithm="Naive Bayes" class="virgin...
43 <Details algorithm="Naive Bayes" class="virgin...
44 <Details algorithm="Naive Bayes" class="virgin...
45 <Details algorithm="Naive Bayes" class="virgin...

>>> # Make predictions and return the probability for each class
... # on new data.
>>> nb_mod.predict_proba(test_dat.drop('Species'), 
...                      supplemental_cols = test_dat[:, 
...                        ['Sepal_Length',
...                         'Species']]).sort_values(by = 
...                            ['Sepal_Length', 
...                             'Species',
...                             'PROBABILITY_OF_setosa',
...                             'PROBABILITY_OF_versicolor'])
    Sepal_Length     Species  PROBABILITY_OF_setosa  \
0            4.4      setosa           1.000000e+00   
1            4.4      setosa           1.000000e+00   
2            4.5      setosa           1.000000e+00   
3            4.8      setosa           1.000000e+00  
...          ...         ...                    ...   
42           6.7   virginica           1.412132e-13
43           6.9  versicolor           5.295492e-20
44           6.9   virginica           5.295492e-20
45           7.0  versicolor           6.189014e-14

     PROBABILITY_OF_versicolor  PROBABILITY_OF_virginica  
0                9.327306e-21              7.868301e-20
1                3.497737e-20              1.032715e-19
2                2.238553e-13              2.360490e-19
3                6.995487e-22              2.950617e-21
...                       ...                       ... 
42               4.741700e-13              1.000000e+00
43               1.778141e-07              9.999998e-01
44               2.963565e-20              1.000000e+00
45               4.156340e-01              5.843660e-01

>>> # Make predictions on new data and return the mean accuracy.
... nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.934783