9.14 Naive Bayes
The oml.nb class creates a Naive Bayes (NB) model for classification.
The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.
Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes' theorem requires that the predictors be independent.)
For information on the oml.nb class attributes and methods, call help(oml.nb) or see Oracle Machine Learning for Python API Reference.
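The frequency-counting computation described above can be sketched in plain Python. This is an illustrative toy (the data and helper name are invented for this sketch; oml.nb performs the equivalent computation inside the database):

```python
from collections import Counter

# Toy training rows: (weather, temperature) -> play
rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
    (("overcast", "cool"), "yes"),
]

def nb_scores(x, rows):
    """Score each class as P(class) * product of P(attribute value | class),
    with every probability taken as an observed frequency."""
    class_counts = Counter(target for _, target in rows)
    n = len(rows)
    scores = {}
    for cls, cnt in class_counts.items():
        score = cnt / n  # prior P(class), from target-value frequency
        for i, value in enumerate(x):
            # conditional P(attribute_i = value | class), also a frequency
            match = sum(1 for feats, t in rows if t == cls and feats[i] == value)
            score *= match / cnt
        scores[cls] = score
    return scores

scores = nb_scores(("rain", "cool"), rows)
print(max(scores, key=scores.get))  # -> yes
```

The "naive" independence assumption is what allows the per-attribute conditionals to be multiplied together rather than requiring counts for every joint combination of attribute values.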
Settings for a Naive Bayes Model
The following table lists the settings that apply to NB models.
Table 9-12 Naive Bayes Model Settings
| Setting Name | Setting Value | Description |
|---|---|---|
| CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in building the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created. |
| CLAS_MAX_SUP_BINS | 2 <= a number | Specifies the maximum number of bins for each attribute. The default value is 32. |
| CLAS_PRIORS_TABLE_NAME | table_name | The name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data. The priors table is user-created. |
| CLAS_WEIGHTS_BALANCED | ON, OFF | Indicates whether the algorithm should create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (the average of per-class accuracies) instead of overall accuracy, which favors the dominant class. The default value is OFF. |
| NABS_PAIRWISE_THRESHOLD | A non-negative number | Pairwise threshold for the NB algorithm. The default value is 0. |
| NABS_SINGLETON_THRESHOLD | A non-negative number | Singleton threshold for the NB algorithm. The default value is 0. |
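The trade-off that CLAS_WEIGHTS_BALANCED targets can be seen numerically. The sketch below uses invented counts (not output of oml.nb) to show how a model that always predicts the dominant class scores well on overall accuracy but poorly on average per-class accuracy:

```python
# 95 rows of a dominant class, 5 rows of a rare class (illustrative values).
actual    = ["common"] * 95 + ["rare"] * 5
# A degenerate model that always predicts the dominant class:
predicted = ["common"] * 100

# Overall accuracy: fraction of all rows predicted correctly.
overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_accuracy(cls):
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == cls]
    return sum(a == p for a, p in pairs) / len(pairs)

# Average accuracy: mean of the per-class accuracies.
average = (per_class_accuracy("common") + per_class_accuracy("rare")) / 2

print(overall)  # -> 0.95 (looks good, but the rare class is never found)
print(average)  # -> 0.5  (average of 1.0 for "common" and 0.0 for "rare")
```

Balancing the target distribution pushes the model toward the second metric, at some cost to the first.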
Example 9-14 Using the oml.nb Class
This example creates an NB model and uses some of the methods of the oml.nb class.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
try:
oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
oml.drop('IRIS')
except:
pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]
# User specified settings.
setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
# Create an oml NB model object.
nb_mod = oml.nb(**setting)
# Fit the NB model according to the training data and parameter
# settings.
nb_mod = nb_mod.fit(train_x, train_y)
# Show details of the model.
nb_mod
# Create a priors table in the database.
priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
priors = oml.create(pd.DataFrame(list(priors.items()),
columns = ['TARGET_VALUE',
'PRIOR_PROBABILITY']),
table = 'NB_PRIOR_PROBABILITY_DEMO')
# Change the setting parameter and refit the model
# with a user-defined prior table.
new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
train_y,
priors = priors)
nb_mod
# Use the model to make predictions on test data.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])
# Return the prediction probability.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)
# Return the top two most influential attributes of the highest
# probability class.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']],
topN_attrs = 2)
# Make predictions and return the probability for each class
# on new data.
nb_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:,
['Sepal_Length',
'Species']]).sort_values(by =
['Sepal_Length',
'Species',
'PROBABILITY_OF_setosa',
'PROBABILITY_OF_versicolor'])
# Make predictions on new data and return the mean accuracy.
nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
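For readers without a database connection, a roughly analogous local workflow can be sketched with scikit-learn (already imported above for the data set). Note the assumptions: GaussianNB is a different NB variant (Gaussian conditionals rather than the binned frequency estimates oml.nb uses), and its `priors` parameter plays a role similar to the priors table, so results will not match the in-database model:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load iris into a DataFrame, mirroring the setup of the example above.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width',
                                     'Petal_Length', 'Petal_Width'])
y = iris.target

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.3, random_state=0)

# GaussianNB accepts class priors directly (ordered by class label),
# analogous to the user-defined priors table in the example.
model = GaussianNB(priors=[0.2, 0.3, 0.5]).fit(train_x, train_y)

print(model.predict_proba(test_x)[:3].round(3))  # per-class probabilities
print(model.score(test_x, test_y))               # mean accuracy
```

This is only a point of comparison; the in-database flow above keeps the data and the model inside Oracle Database.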
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
>>> dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # User specified settings.
... setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
>>>
>>> # Create an oml NB model object.
... nb_mod = oml.nb(**setting)
>>>
>>> # Fit the NB model according to the training data and parameter
... # settings.
>>> nb_mod = nb_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nb_mod
Algorithm Name: Naive Bayes
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_WEIGHTS_BALANCED ON
2 NABS_PAIRWISE_THRESHOLD 0
3 NABS_SINGLETON_THRESHOLD 0
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Partition: NO
Priors:
TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT
0 Species setosa 0.333333 36
1 Species versicolor 0.333333 35
2 Species virginica 0.333333 33
Conditionals:
TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ; 1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...
152 Species virginica Sepal_Width None (3.25; 3.35]
153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]
CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1
2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2
[156 rows x 7 columns]
>>> # Create a priors table in the database.
... priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
>>> priors = oml.create(pd.DataFrame(list(priors.items()),
... columns = ['TARGET_VALUE',
... 'PRIOR_PROBABILITY']),
... table = 'NB_PRIOR_PROBABILITY_DEMO')
>>>
>>> # Change the setting parameter and refit the model
... # with a user-defined prior table.
... new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
>>> nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
... train_y,
... priors = priors)
>>> nb_mod
Algorithm Name: Naive Bayes
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_PRIORS_TABLE_NAME "OML_USER"."NB_PRIOR_PROBABILITY_DEMO"
2 CLAS_WEIGHTS_BALANCED OFF
3 NABS_PAIRWISE_THRESHOLD 0
4 NABS_SINGLETON_THRESHOLD 0
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Partition: NO
Priors:
TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT
0 Species setosa 0.2 36
1 Species versicolor 0.3 35
2 Species virginica 0.5 33
Conditionals:
TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ; 1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...
152 Species virginica Sepal_Width None (3.25; 3.35]
153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]
CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1
2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2
[156 rows x 7 columns]
>>> # Use the model to make predictions on test data.
... nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
>>> # Return the prediction probability.
>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 1.000000
43 6.7 3.0 virginica virginica 0.953848
44 6.5 3.0 virginica virginica 1.000000
45 5.9 3.0 virginica virginica 0.932334
>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']],
... topN_attrs = 2)
Sepal_Length Sepal_Width Petal_Length Species PREDICTION \
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
TOP_N_ATTRIBUTES
0 <Details algorithm="Naive Bayes" class="setosa...
1 <Details algorithm="Naive Bayes" class="setosa...
2 <Details algorithm="Naive Bayes" class="setosa...
3 <Details algorithm="Naive Bayes" class="setosa...
...
42 <Details algorithm="Naive Bayes" class="virgin...
43 <Details algorithm="Naive Bayes" class="virgin...
44 <Details algorithm="Naive Bayes" class="virgin...
45 <Details algorithm="Naive Bayes" class="virgin...
>>> # Make predictions and return the probability for each class
... # on new data.
>>> nb_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Species']]).sort_values(by =
... ['Sepal_Length',
... 'Species',
...                                     'PROBABILITY_OF_setosa',
... 'PROBABILITY_OF_versicolor'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.000000e+00
1 4.4 setosa 1.000000e+00
2 4.5 setosa 1.000000e+00
3 4.8 setosa 1.000000e+00
... ... ... ...
42 6.7 virginica 1.412132e-13
43 6.9 versicolor 5.295492e-20
44 6.9 virginica 5.295492e-20
45 7.0 versicolor 6.189014e-14
PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 9.327306e-21 7.868301e-20
1 3.497737e-20 1.032715e-19
2 2.238553e-13 2.360490e-19
3 6.995487e-22 2.950617e-21
... ... ...
42 4.741700e-13 1.000000e+00
43 1.778141e-07 9.999998e-01
44 2.963565e-20 1.000000e+00
45 4.156340e-01 5.843660e-01
>>> # Make predictions on new data and return the mean accuracy.
... nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.934783