10.3 Feature Selection

The oml.automl.FeatureSelection class identifies the most relevant feature subset for a given training data set and Oracle Machine Learning algorithm.

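In data analytics applications, feature selection is a critical data preprocessing step that strongly affects both run time and model performance. The oml.automl.FeatureSelection class automatically selects the most relevant features for a data set and model. Internally, it uses several feature-ranking algorithms to identify the best feature subset, one that reduces model training time without compromising model performance. Oracle's advanced meta-learning techniques quickly prune the search space of this feature selection optimization.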
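The rank-then-prune idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the algorithm that oml.automl.FeatureSelection uses internally: the helper names (pearson, rank_features, centroid_accuracy, smallest_good_subset), the correlation-based ranking, the nearest-centroid scorer, and the toy data are all hypothetical, and the score is computed on the training rows purely to keep the sketch short.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def rank_features(columns, target):
    """Order feature names by |correlation| with the target, best first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def centroid_accuracy(columns, target, names):
    """Training accuracy of a nearest-centroid classifier on the chosen columns."""
    rows = list(zip(*(columns[n] for n in names)))
    centroids = {}
    for label in set(target):
        members = [r for r, t in zip(rows, target) if t == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(row):
        # Pick the class whose centroid is closest in squared distance.
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[c])))

    hits = sum(predict(r) == t for r, t in zip(rows, target))
    return hits / len(rows)


def smallest_good_subset(ranked, score_fn, tolerance=0.0):
    """Shortest ranked prefix scoring within `tolerance` of the full feature set."""
    full_score = score_fn(ranked)
    for k in range(1, len(ranked) + 1):
        if score_fn(ranked[:k]) >= full_score - tolerance:
            return ranked[:k]
    return ranked


# Toy data (hypothetical): one informative, one noisy, one constant feature.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "signal":   [0.1, 0.2, 0.0, 0.3, 0.9, 1.0, 0.8, 1.1],
    "noise":    [0.5, 0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.6],
    "constant": [1.0] * 8,
}

ranked = rank_features(columns, target)
subset = smallest_good_subset(
    ranked, lambda names: centroid_accuracy(columns, target, names))
print(ranked)  # ['signal', 'noise', 'constant']
print(subset)  # ['signal']
```

Here the single informative column already matches the accuracy of the full set, so the two remaining columns can be dropped without loss. A real search would score candidates on held-out data and would combine several ranking methods rather than one.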

The oml.automl.FeatureSelection class supports classification and regression algorithms. To use the class, specify the data set and the Oracle Machine Learning algorithm for which to perform feature reduction.

For details on the parameters and methods of the class, invoke help(oml.automl.FeatureSelection) or see the Oracle Machine Learning for Python API Reference.

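Example 10-2 Using the oml.automl.FeatureSelection Class

This example uses the oml.automl.FeatureSelection class. It first builds a model on the full data set and computes its predictive accuracy. It then performs automated feature selection, filters the columns down to the identified subset, rebuilds the model, and recomputes the predictive accuracy. Finally, it repeats the feature selection on a copy of the data that includes a case_id column, for reproducible results.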

import oml
from oml import automl
import pandas as pd
import numpy as np
from sklearn import datasets

# Load the digits data set into the database.
digits = datasets.load_digits()
X = pd.DataFrame(digits.data, 
                 columns = ['pixel{}'.format(i) for i 
                             in range(digits.data.shape[1])])
y = pd.DataFrame(digits.target, columns = ['digit'])
oml_df = oml.create(pd.concat([X, y], axis=1), table = 'DIGITS')

# Split the data set into train and test.
train, test = oml_df.split(ratio=(0.8, 0.2), 
                           seed = 1234, strata_cols='digit')
X_train, y_train = train.drop('digit'), train['digit']
X_test, y_test = test.drop('digit'), test['digit']

# Default model performance before feature selection.
mod = oml.svm(mining_function='classification').fit(X_train, 
                                                    y_train)
"{:.2}".format(mod.score(X_test, y_test))

# Create an automated feature selection object with accuracy
# as the score_metric.
fs = automl.FeatureSelection(mining_function='classification', 
                             score_metric='accuracy', parallel=4)

# Get the reduced feature subset on the train data set.
subset = fs.reduce('svm_linear', X_train, y_train)
"{} features reduced to {}".format(len(X_train.columns),
                                   len(subset))

# Use the subset to select the features and create a model on the 
# new reduced data set.
X_new  = X_train[:,subset]
X_test_new = X_test[:,subset]
mod = oml.svm(mining_function='classification').fit(X_new, y_train)
"{:.2} with {:.1f}x feature reduction".format(
  mod.score(X_test_new, y_test),
  len(X_train.columns)/len(X_new.columns))

# Drop the DIGITS table.
oml.drop('DIGITS')

# For reproducible results, add a case_id column with unique row
# identifiers.
row_id = pd.DataFrame(np.arange(digits.data.shape[0]), 
                                columns = ['CASE_ID'])
oml_df_cid = oml.create(pd.concat([row_id, X, y], axis=1), 
                        table = 'DIGITS_CID')

train, test = oml_df_cid.split(ratio=(0.8, 0.2), seed = 1234, 
                               hash_cols='CASE_ID', 
                               strata_cols='digit')
X_train, y_train = train.drop('digit'), train['digit']
X_test, y_test = test.drop('digit'), test['digit']

# Provide the case_id column name to the feature selection 
# reduce function.
subset = fs.reduce('svm_linear', X_train, 
                   y_train, case_id='CASE_ID')
"{} features reduced to {} with case_id".format(
                                           len(X_train.columns)-1, 
                                           len(subset)) 

# Drop the remaining table created in the example.
# (The DIGITS table was already dropped above.)
oml.drop('DIGITS_CID')
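The second half of the script adds a CASE_ID column and passes it as hash_cols so that the split is reproducible. One way to see why a unique row identifier helps is the following standalone sketch of a hash-based split. It is an assumption-laden illustration, not how OML4Py implements split: the function assign_partition, the use of SHA-256, and the 1000 synthetic identifiers are all hypothetical, and stratification is left out.

```python
import hashlib


def assign_partition(case_id, seed=1234, train_ratio=0.8):
    """Map a row identifier to 'train' or 'test' deterministically."""
    digest = hashlib.sha256(f"{seed}:{case_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "train" if fraction < train_ratio else "test"


case_ids = list(range(1000))  # hypothetical unique row identifiers

# Assign partitions twice, visiting the rows in different orders.
first = {cid: assign_partition(cid) for cid in case_ids}
second = {cid: assign_partition(cid) for cid in reversed(case_ids)}

print(first == second)  # True: the assignment depends only on the identifier
train_share = sum(p == "train" for p in first.values()) / len(case_ids)
print(round(train_share, 2))  # close to the requested 0.8 ratio
```

Because each row's partition depends only on its identifier and the seed, the same rows land in the same partition on every run, regardless of the order in which the rows are returned.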

Listing for This Example

>>> import oml
>>> from oml import automl
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
>>> 
>>> # Load the digits data set into the database.
... digits = datasets.load_digits()
>>> X = pd.DataFrame(digits.data, 
...                  columns = ['pixel{}'.format(i) for i 
...                              in range(digits.data.shape[1])])
>>> y = pd.DataFrame(digits.target, columns = ['digit'])
>>> oml_df = oml.create(pd.concat([X, y], axis=1), table = 'DIGITS')
>>>
>>> # Split the data set into train and test.
... train, test = oml_df.split(ratio=(0.8, 0.2),
...                            seed = 1234, strata_cols='digit')
>>> X_train, y_train = train.drop('digit'), train['digit']
>>> X_test, y_test = test.drop('digit'), test['digit']
>>>
>>> # Default model performance before feature selection.
... mod = oml.svm(mining_function='classification').fit(X_train,
...                                                     y_train)
>>> "{:.2}".format(mod.score(X_test, y_test))
'0.92'
>>> 
>>> # Create an automated feature selection object with accuracy
... # as the score_metric.
... fs = automl.FeatureSelection(mining_function='classification', 
...                              score_metric='accuracy', parallel=4)
>>> # Get the reduced feature subset on the train data set.
... subset = fs.reduce('svm_linear', X_train, y_train)
>>> "{} features reduced to {}".format(len(X_train.columns), 
...                                    len(subset))
'64 features reduced to 41'
>>> 
>>> # Use the subset to select the features and create a model on the 
... # new reduced data set.
... X_new  = X_train[:,subset]
>>> X_test_new = X_test[:,subset]
>>> mod = oml.svm(mining_function='classification').fit(X_new, y_train)
>>> "{:.2} with {:.1f}x feature reduction".format(
...   mod.score(X_test_new, y_test),
...   len(X_train.columns)/len(X_new.columns))
'0.92 with 1.6x feature reduction'
>>> 
>>> # Drop the DIGITS table.
... oml.drop('DIGITS')
>>> 
>>> # For reproducible results, add a case_id column with unique row
... # identifiers.
>>> row_id = pd.DataFrame(np.arange(digits.data.shape[0]),
...                                 columns = ['CASE_ID'])
>>> oml_df_cid = oml.create(pd.concat([row_id, X, y], axis=1),
...                         table = 'DIGITS_CID')
>>>
>>> train, test = oml_df_cid.split(ratio=(0.8, 0.2), seed = 1234, 
...                                hash_cols='CASE_ID', 
...                                strata_cols='digit')
>>> X_train, y_train = train.drop('digit'), train['digit']
>>> X_test, y_test = test.drop('digit'), test['digit']
>>>
>>> # Provide the case_id column name to the feature selection
... # reduce function.
>>> subset = fs.reduce('svm_linear', X_train, 
...                    y_train, case_id='CASE_ID')
>>> "{} features reduced to {} with case_id".format(
...                                            len(X_train.columns)-1, 
...                                            len(subset))
'64 features reduced to 45 with case_id'
>>>
>>> # Drop the remaining table created in the example.
... # (The DIGITS table was already dropped above.)
... oml.drop('DIGITS_CID')
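The figures printed in the listing can be checked with a few lines of arithmetic. The sketch below reuses the feature counts shown in the listing (64, 41, and 45). The accuracy value 0.9194 is a made-up stand-in for whatever mod.score returns, used only to show how the "{:.2}" format behaves.

```python
n_all = 64           # features before selection (from the listing)
n_reduced = 41       # features kept in the first run (from the listing)
n_reduced_cid = 45   # features kept in the case_id run (from the listing)

# Reduction factor, formatted to one decimal place as in the example.
factor = "{:.1f}".format(n_all / n_reduced)
factor_cid = "{:.1f}".format(n_all / n_reduced_cid)
print(factor)      # 1.6
print(factor_cid)  # 1.4

# "{:.2}" keeps two significant digits, so a score such as 0.9194
# (hypothetical value) is displayed as 0.92.
shown = "{:.2}".format(0.9194)
print(shown)       # 0.92
```

In the first run the model keeps the same displayed accuracy of 0.92 while using 41 of the 64 features, which is the 1.6x reduction reported in the listing.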