9.14 Naive Bayes

The oml.nb class creates a Naive Bayes (NB) model for classification.

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.

Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes' theorem requires that the predictors be independent.)
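The frequency-based computation can be sketched in plain Python (a toy illustration of the idea, not the oml.nb implementation): each class score is the class prior multiplied by the per-attribute conditional probabilities, which is exactly what the conditional-independence assumption licenses. All data and attribute names here are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: (outlook, windy) -> play
rows = [
    ("sunny", "no",  "yes"),
    ("sunny", "yes", "no"),
    ("rainy", "no",  "yes"),
    ("rainy", "yes", "no"),
    ("sunny", "no",  "yes"),
]

# Frequency of each target value, and of each
# (attribute value, target value) combination.
target_counts = Counter(t for *_, t in rows)
pair_counts = defaultdict(Counter)
for outlook, windy, target in rows:
    pair_counts[target]["outlook=" + outlook] += 1
    pair_counts[target]["windy=" + windy] += 1

def posterior(outlook, windy):
    # score(t) = P(t) * P(outlook | t) * P(windy | t), then normalize.
    # Assumes at least one class has a nonzero score for these values.
    n = len(rows)
    scores = {}
    for t, cnt in target_counts.items():
        p = cnt / n                                   # prior P(t)
        p *= pair_counts[t]["outlook=" + outlook] / cnt
        p *= pair_counts[t]["windy=" + windy] / cnt
        scores[t] = p
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

print(posterior("sunny", "no"))
```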

For information on the attributes and methods of the oml.nb class, call help(oml.nb) or see Oracle Machine Learning for Python API Reference.

Settings for a Naive Bayes Model

The following table lists the settings that apply to NB models.

Table 9-12 Naive Bayes Model Settings

Setting Name  Setting Value  Description
CLAS_COST_TABLE_NAME

table_name

The name of the table that stores a cost matrix for the algorithm to use in building the model. The cost matrix specifies the costs associated with misclassifications.

The cost matrix table is user-created. The following are the column requirements for the table.

  • Column Name: ACTUAL_TARGET_VALUE

    Data Type: valid target data type

  • Column Name: PREDICTED_TARGET_VALUE

    Data Type: valid target data type

  • Column Name: COST

    Data Type: NUMBER

CLAS_MAX_SUP_BINS

2 <= a number <= 2147483647

Specifies the maximum number of bins for each attribute.

The default value is 32.

CLAS_PRIORS_TABLE_NAME

table_name

The name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data.

The priors table is user-created. The following are the column requirements for the table.

  • Column Name: TARGET_VALUE

    Data Type: valid target data type

  • Column Name: PRIOR_PROBABILITY

    Data Type: NUMBER

CLAS_WEIGHTS_BALANCED

ON

OFF

Indicates whether the algorithm should create a model that balances the target distribution. This setting is most relevant in the presence of rare targets: balancing the distribution improves average accuracy (the average of per-class accuracies) rather than overall accuracy, which favors the dominant class. The default value is OFF.

NABS_PAIRWISE_THRESHOLD

TO_CHAR( 0 <= numeric_expr <= 1)

Value of the pairwise threshold for the NB algorithm.

The default value is 0.

NABS_SINGLETON_THRESHOLD

TO_CHAR( 0 <= numeric_expr <= 1)

Value of the singleton threshold for the NB algorithm.

The default value is 0.
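As a sketch of how these settings fit together (illustrative only: the table layout follows the cost-matrix column requirements above, the table name NB_COST_MATRIX_DEMO and all setting values are invented for this sketch, and the commented-out oml.create and oml.nb calls require a connected OML session as in Example 9-14):

```python
import pandas as pd

# Illustrative cost matrix for a binary target: predicting 'no' when the
# actual value is 'yes' costs 5; the reverse mistake costs 1.
cost_df = pd.DataFrame(
    [('yes', 'yes', 0), ('yes', 'no', 5),
     ('no',  'yes', 1), ('no',  'no', 0)],
    columns = ['ACTUAL_TARGET_VALUE', 'PREDICTED_TARGET_VALUE', 'COST'])

# Illustrative NB settings combining the entries in Table 9-12.
setting = {'CLAS_COST_TABLE_NAME': 'NB_COST_MATRIX_DEMO',
           'CLAS_MAX_SUP_BINS': 16,
           'CLAS_WEIGHTS_BALANCED': 'ON',
           'NABS_SINGLETON_THRESHOLD': '0.01',
           'NABS_PAIRWISE_THRESHOLD': '0.01'}

# In a connected OML session:
# cost_tab = oml.create(cost_df, table = 'NB_COST_MATRIX_DEMO')
# nb_mod = oml.nb(**setting)
```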

Example 9-14 Using the oml.nb Class

This example creates an NB model and uses some of the methods of the oml.nb class.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()

train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# User specified settings.
setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}

# Create an oml NB model object.
nb_mod = oml.nb(**setting)

# Fit the NB model according to the training data and parameter
# settings.
nb_mod = nb_mod.fit(train_x, train_y)

# Show details of the model.
nb_mod

# Create a priors table in the database.
priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
priors = oml.create(pd.DataFrame(list(priors.items()), 
                       columns = ['TARGET_VALUE', 
                                  'PRIOR_PROBABILITY']), 
                       table = 'NB_PRIOR_PROBABILITY_DEMO')

# Change the setting parameter and refit the model 
# with a user-defined prior table.
new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
nb_mod = nb_mod.set_params(**new_setting).fit(train_x, 
                                              train_y, 
                                              priors = priors)
nb_mod

# Use the model to make predictions on test data.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Petal_Length', 
                                                'Species']])
# Return the prediction probability.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width',
                                                'Species']], 
               proba = True)


# Return the top two most influential attributes of the highest
# probability class.
nb_mod.predict(test_dat.drop('Species'), 
               supplemental_cols = test_dat[:, ['Sepal_Length', 
                                                'Sepal_Width', 
                                                'Petal_Length',
                                                'Species']], 
               topN_attrs = 2)

# Make predictions and return the probability for each class
# on new data.
nb_mod.predict_proba(test_dat.drop('Species'), 
                     supplemental_cols = test_dat[:, 
                       ['Sepal_Length', 
                        'Species']]).sort_values(by = 
                           ['Sepal_Length',
                            'Species', 
                            'PROBABILITY_OF_setosa', 
                            'PROBABILITY_OF_versicolor'])

# Make predictions on new data and return the mean accuracy.
nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> try:
...    oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
...    oml.drop('IRIS')
... except:
...    pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
>>> dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # User specified settings.
... setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
>>>
>>> # Create an oml NB model object.
... nb_mod = oml.nb(**setting)
>>> 
>>> # Fit the NB model according to the training data and parameter
... # settings.
>>> nb_mod = nb_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nb_mod

Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings: 
                   setting name            setting value
0                     ALGO_NAME         ALGO_NAIVE_BAYES
1         CLAS_WEIGHTS_BALANCED                       ON
2       NABS_PAIRWISE_THRESHOLD                        0
3      NABS_SINGLETON_THRESHOLD                        0
4                  ODMS_DETAILS              ODMS_ENABLE
5  ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
6                 ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
7                     PREP_AUTO                       ON

Global Statistics:
   attribute name  attribute value
0        NUM_ROWS              104

Attributes: 
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors: 

   TARGET_NAME TARGET_VALUE  PRIOR_PROBABILITY  COUNT
0     Species       setosa           0.333333     36
1     Species   versicolor           0.333333     35
2     Species    virginica           0.333333     33

Conditionals: 

    TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE  \
0       Species       setosa   Petal_Length              None       ( ; 1.05]   
1       Species       setosa   Petal_Length              None     (1.05; 1.2]
2       Species       setosa   Petal_Length              None     (1.2; 1.35]
3       Species       setosa   Petal_Length              None    (1.35; 1.45]
...         ...          ...            ...               ...             ...   
152     Species    virginica    Sepal_Width              None    (3.25; 3.35]
153     Species    virginica    Sepal_Width              None    (3.35; 3.45]
154     Species    virginica    Sepal_Width              None    (3.55; 3.65]
155     Species    virginica    Sepal_Width              None    (3.75; 3.85]

     CONDITIONAL_PROBABILITY  COUNT  
0                   0.027778      1
1                   0.027778      1
2                   0.083333      3
3                   0.277778     10
...                      ...    ...  
152                 0.030303      1  
153                 0.060606      2  
154                 0.030303      1  
155                 0.060606      2

[156 rows x 7 columns]

>>> # Create a priors table in the database.
... priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
>>> priors = oml.create(pd.DataFrame(list(priors.items()), 
...                        columns = ['TARGET_VALUE', 
...                                   'PRIOR_PROBABILITY']), 
...                        table = 'NB_PRIOR_PROBABILITY_DEMO')
>>>
>>> # Change the setting parameter and refit the model 
... # with a user-defined prior table.
... new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
>>> nb_mod = nb_mod.set_params(**new_setting).fit(train_x, 
...                                               train_y,
...                                               priors = priors)
>>> nb_mod

Algorithm Name: Naive Bayes

Mining Function: CLASSIFICATION

Target: Species

Settings: 
                   setting name                          setting value
0                     ALGO_NAME                       ALGO_NAIVE_BAYES
1        CLAS_PRIORS_TABLE_NAME "OML_USER"."NB_PRIOR_PROBABILITY_DEMO"
2         CLAS_WEIGHTS_BALANCED                                    OFF
3       NABS_PAIRWISE_THRESHOLD                                      0
4      NABS_SINGLETON_THRESHOLD                                      0
5                  ODMS_DETAILS                            ODMS_ENABLE
6  ODMS_MISSING_VALUE_TREATMENT                ODMS_MISSING_VALUE_AUTO
7                 ODMS_SAMPLING                  ODMS_SAMPLING_DISABLE
8                     PREP_AUTO                                     ON

Global Statistics:
   attribute name  attribute value
0        NUM_ROWS              104

Attributes: 
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

Priors: 

  TARGET_NAME TARGET_VALUE  PRIOR_PROBABILITY  COUNT
0     Species       setosa                0.2     36
1     Species   versicolor                0.3     35
2     Species    virginica                0.5     33

Conditionals: 

    TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE  \
0       Species       setosa   Petal_Length              None       ( ; 1.05]
1       Species       setosa   Petal_Length              None     (1.05; 1.2]
2       Species       setosa   Petal_Length              None     (1.2; 1.35]
3       Species       setosa   Petal_Length              None    (1.35; 1.45]
...         ...          ...            ...               ...             ...
152     Species    virginica    Sepal_Width              None    (3.25; 3.35]
153     Species    virginica    Sepal_Width              None    (3.35; 3.45]
154     Species    virginica    Sepal_Width              None    (3.55; 3.65]
155     Species    virginica    Sepal_Width              None    (3.75; 3.85]

     CONDITIONAL_PROBABILITY  COUNT  
0                   0.027778      1
1                   0.027778      1
2                   0.083333      3
3                   0.277778     10
...                      ...    ...
152                 0.030303      1
153                 0.060606      2
154                 0.030303      1
155                 0.060606      2

[156 rows x 7 columns]

>>> # Use the model to make predictions on test data.
... nb_mod.predict(test_dat.drop('Species'),
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Petal_Length', 
...                                                 'Species']])
    Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
0            4.9          3.0           1.4      setosa      setosa
1            4.9          3.1           1.5      setosa      setosa
2            4.8          3.4           1.6      setosa      setosa
3            5.8          4.0           1.2      setosa      setosa
...          ...          ...           ...         ...         ...
42           6.7          3.3           5.7   virginica   virginica
43           6.7          3.0           5.2   virginica   virginica
44           6.5          3.0           5.2   virginica   virginica
45           5.9          3.0           5.1   virginica   virginica

>>> # Return the prediction probability.
>>> nb_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width',
...                                                 'Species']], 
...                proba = True)
    Sepal_Length  Sepal_Width     Species  PREDICTION  PROBABILITY
0            4.9          3.0      setosa      setosa     1.000000
1            4.9          3.1      setosa      setosa     1.000000
2            4.8          3.4      setosa      setosa     1.000000
3            5.8          4.0      setosa      setosa     1.000000
...           ...          ...         ...         ...          ...
42           6.7          3.3   virginica   virginica     1.000000
43           6.7          3.0   virginica   virginica     0.953848
44           6.5          3.0   virginica   virginica     1.000000
45           5.9          3.0   virginica   virginica     0.932334

>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> nb_mod.predict(test_dat.drop('Species'), 
...                supplemental_cols = test_dat[:, ['Sepal_Length', 
...                                                 'Sepal_Width', 
...                                                 'Petal_Length',
...                                                 'Species']], 
...                topN_attrs = 2)
  Sepal_Length  Sepal_Width Petal_Length    Species PREDICTION \
0          4.9          3.0          1.4     setosa     setosa
1          4.9          3.1          1.5     setosa     setosa
2          4.8          3.4          1.6     setosa     setosa
3          5.8          4.0          1.2     setosa     setosa
... ... ... ... ... ...
42         6.7          3.3          5.7  virginica  virginica
43         6.7          3.0          5.2  virginica  virginica
44         6.5          3.0          5.2  virginica  virginica
45         5.9          3.0          5.1  virginica  virginica
                                   TOP_N_ATTRIBUTES
0 <Details algorithm="Naive Bayes" class="setosa...
1 <Details algorithm="Naive Bayes" class="setosa...
2 <Details algorithm="Naive Bayes" class="setosa...
3 <Details algorithm="Naive Bayes" class="setosa...
...
42 <Details algorithm="Naive Bayes" class="virgin...
43 <Details algorithm="Naive Bayes" class="virgin...
44 <Details algorithm="Naive Bayes" class="virgin...
45 <Details algorithm="Naive Bayes" class="virgin...

>>> # Make predictions and return the probability for each class
... # on new data.
>>> nb_mod.predict_proba(test_dat.drop('Species'), 
...                      supplemental_cols = test_dat[:, 
...                        ['Sepal_Length',
...                         'Species']]).sort_values(by = 
...                            ['Sepal_Length', 
...                             'Species',
...                             'PROBABILITY_OF_setosa',
...                             'PROBABILITY_OF_versicolor'])
    Sepal_Length     Species  PROBABILITY_OF_SETOSA  \
0            4.4      setosa           1.000000e+00   
1            4.4      setosa           1.000000e+00   
2            4.5      setosa           1.000000e+00   
3            4.8      setosa           1.000000e+00  
...          ...         ...                    ...   
42           6.7   virginica           1.412132e-13
43           6.9  versicolor           5.295492e-20
44           6.9   virginica           5.295492e-20
45           7.0  versicolor           6.189014e-14

     PROBABILITY_OF_VERSICOLOR  PROBABILITY_OF_VIRGINICA  
0                9.327306e-21              7.868301e-20
1                3.497737e-20              1.032715e-19
2                2.238553e-13              2.360490e-19
3                6.995487e-22              2.950617e-21
...                       ...                       ... 
42               4.741700e-13              1.000000e+00
43               1.778141e-07              9.999998e-01
44               2.963565e-20              1.000000e+00
45               4.156340e-01              5.843660e-01

>>> # Make predictions on new data and return the mean accuracy.
... nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.934783