10.4 Model Tuning

The oml.automl.ModelTuning class tunes the hyperparameters for the specified classification or regression algorithm and training data.

Model tuning is a laborious machine learning task that requires significant time and effort and relies heavily on data scientist expertise. With limited user input, the oml.automl.ModelTuning class automates this process using a highly parallel, asynchronous, gradient-based hyperparameter optimization algorithm to tune the hyperparameters of Oracle Machine Learning algorithms.

The oml.automl.ModelTuning class supports classification and regression algorithms. To use the oml.automl.ModelTuning class, you specify a data set and an algorithm to obtain a tuned model and its corresponding hyperparameters. An advanced user can provide a customized hyperparameter search space and a non-default scoring metric to this black-box optimizer.

For a partitioned model, if you pass the column on which to partition in the param_space argument of the tune method, then oml.automl.ModelTuning tunes the hyperparameters of the partitioned model.
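
As a minimal sketch of this, the hypothetical call below passes a partition column through param_space alongside an ordinary hyperparameter range. It reuses the automl import and the X and y training data from Example 10-3 below; the ODMS_PARTITION_COLUMNS setting name and the REGION column are illustrative assumptions only, so check the API reference for the exact form that your release expects.

# Sketch only: tune a partitioned Decision Tree model. The partition column is
# assumed to be supplied through an ODMS_PARTITION_COLUMNS entry of param_space;
# 'REGION' is a hypothetical column that would have to exist in the training data.
part_space = {
  'ODMS_PARTITION_COLUMNS': {'type': 'categorical',
                             'range': ['REGION']},
  'TREE_TERM_MAX_DEPTH': {'type': 'discrete',
                          'range': [4, 10]},
}
at = automl.ModelTuning(mining_function='classification', parallel=4)
part_results = at.tune('dt', X, y, score_metric='accuracy',
                       param_space=part_space)
part_results['best_model']  # tuned, partitioned model and its hyperparameters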

For information about the parameters and methods of the class, invoke help(oml.automl.ModelTuning) or see Oracle Machine Learning for Python API Reference.

Example 10-3 Using the oml.automl.ModelTuning Class

This example creates an oml.automl.ModelTuning object.

import oml
from oml import automl
import pandas as pd
from sklearn import datasets

# Load the breast cancer data set.
bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])

# Create the database table BreastCancer.
oml_df = oml.create(pd.concat([X, y], axis=1), 
                    table = 'BreastCancer')

# Split the data set into training and test data.
train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

# Start an automated model tuning run with a Decision Tree model.
at = automl.ModelTuning(mining_function='classification', 
                        parallel=4)
results = at.tune('dt', X, y, score_metric='accuracy')

# Show the tuned model details.
tuned_model = results['best_model']
tuned_model

# Show the best tuned model train score and the 
# corresponding hyperparameters.
score, params = results['all_evals'][0]
"{:.2}".format(score), ["{}:{}".format(k, params[k])
  for k in sorted(params)]

# Use the tuned model to get the score on the test set.
"{:.2}".format(tuned_model.score(X_test, y_test)) 

# An example invocation of model tuning with user-defined  
# search ranges for selected hyperparameters on a new tuning 
# metric (f1_macro).
search_space = {
  'RFOR_SAMPLING_RATIO': {'type': 'continuous', 
                         'range': [0.01, 0.5]}, 
  'RFOR_NUM_TREES': {'type': 'discrete', 
                     'range': [50, 100]}, 
  'TREE_IMPURITY_METRIC': {'type': 'categorical', 
                           'range': ['TREE_IMPURITY_ENTROPY', 
                           'TREE_IMPURITY_GINI']},}
results = at.tune('rf', X, y, score_metric='f1_macro', 
                  param_space=search_space)
score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k]) 
  for k in sorted(params)])

# Some hyperparameter search ranges need to be defined based on the 
# training data set sizes (for example, the number of samples and 
# features). You can use placeholders specific to the data set,
# such as $nr_features and $nr_samples, as the search ranges.
search_space = {'RFOR_MTRY': {'type': 'discrete',
                              'range': [1, '$nr_features/2']}}
results = at.tune('rf', X, y, 
                  score_metric='f1_macro', param_space=search_space)
score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k]) 
  for k in sorted(params)])

# Drop the database table.
oml.drop('BreastCancer')

Listing for This Example

>>> import oml
>>> from oml import automl
>>> import pandas as pd
>>> from sklearn import datasets
>>> 
>>> # Load the breast cancer data set.
... bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>>
>>> # Create the database table BreastCancer.
>>> oml_df = oml.create(pd.concat([X, y], axis=1), 
...                     table = 'BreastCancer')
>>>
>>> # Split the data set into training and test data.
... train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>> 
>>> # Start an automated model tuning run with a Decision Tree model.
... at = automl.ModelTuning(mining_function='classification',
...                         parallel=4)
>>> results = at.tune('dt', X, y, score_metric='accuracy')
>>>
>>> # Show the tuned model details.
... tuned_model = results['best_model']
>>> tuned_model

Algorithm Name: Decision Tree

Mining Function: CLASSIFICATION

Target: TARGET

Settings: 
                    setting name            setting value
0                      ALGO_NAME       ALGO_DECISION_TREE
1              CLAS_MAX_SUP_BINS                       32
2          CLAS_WEIGHTS_BALANCED                      OFF
3                   ODMS_DETAILS             ODMS_DISABLE
4   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
5                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
6                      PREP_AUTO                       ON
7           TREE_IMPURITY_METRIC       TREE_IMPURITY_GINI
8            TREE_TERM_MAX_DEPTH                        8
9          TREE_TERM_MINPCT_NODE                     3.34
10        TREE_TERM_MINPCT_SPLIT                      0.1
11         TREE_TERM_MINREC_NODE                       10
12        TREE_TERM_MINREC_SPLIT                       20

Attributes: 
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error
texture error
perimeter error
area error
smoothness error
compactness error
concavity error
concave points error
symmetry error
fractal dimension error
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension

Partition: NO

>>>
>>> # Show the best tuned model train score and the 
... # corresponding hyperparameters.
... score, params = results['all_evals'][0]
>>> "{:.2}".format(score), ["{}:{}".format(k, params[k]) 
...   for k in sorted(params)]
('0.92', ['CLAS_MAX_SUP_BINS:32', 'TREE_IMPURITY_METRIC:TREE_IMPURITY_GINI', 'TREE_TERM_MAX_DEPTH:7', 'TREE_TERM_MINPCT_NODE:0.05', 'TREE_TERM_MINPCT_SPLIT:0.1'])
>>>
>>> # Use the tuned model to get the score on the test set.
... "{:.2}".format(tuned_model.score(X_test, y_test))
'0.92'
>>>
>>> # An example invocation of model tuning with user-defined
... # search ranges for selected hyperparameters on a new tuning 
... # metric (f1_macro).
... search_space = {
...   'RFOR_SAMPLING_RATIO': {'type': 'continuous', 
...                          'range': [0.01, 0.5]}, 
...   'RFOR_NUM_TREES': {'type': 'discrete', 
...                      'range': [50, 100]}, 
...   'TREE_IMPURITY_METRIC': {'type': 'categorical', 
...                            'range': ['TREE_IMPURITY_ENTROPY', 
...                            'TREE_IMPURITY_GINI']},}
>>> results = at.tune('rf', X, y, score_metric='f1_macro', 
...                   param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k]) 
...   for k in sorted(params)])
('0.92', ['RFOR_NUM_TREES:53', 'RFOR_SAMPLING_RATIO:0.4999951', 'TREE_IMPURITY_METRIC:TREE_IMPURITY_ENTROPY'])
>>>
>>> # Some hyperparameter search ranges need to be defined based on the 
... # training data set sizes (for example, the number of samples and 
... # features). You can use placeholders specific to the data set,
... # such as $nr_features and $nr_samples, as the search ranges.
... search_space = {'RFOR_MTRY': {'type': 'discrete',
...                               'range': [1, '$nr_features/2']}}
>>> results = at.tune('rf', X, y, 
...                   score_metric='f1_macro', param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k]) 
...   for k in sorted(params)])
('0.93', ['RFOR_MTRY:10'])
>>> 
>>> # Drop the database table.
... oml.drop('BreastCancer')