# Training Models¶

## Oracle AutoML¶

Oracle AutoML automates the machine learning experience. It replaces the laborious and time consuming tasks of the data scientist whose workflow is as follows:

1. Select a model from a large number of viable candidate models.

2. For each model, tune the hyperparameters.

3. Select only predictive features to speed up the pipeline and reduce over fitting.

4. Ensure the model performs well on unseen data (also called generalization).

Oracle AutoML automates this workflow and provides you with an optimal model given a time budget. In addition to incorporating these typical machine learning workflow steps, Oracle AutoML is also optimized to produce a high quality model very efficiently. You can achieve this with the following:

• Scalable design: All stages in the Oracle AutoML pipeline exploit both internode and intranode parallelism, which improves scalability and reduces runtime.

• Intelligent choices reduce trials in each stage: Algorithms and parameters are chosen based on dataset characteristics. This ensures that the selected model is accurate and is efficiently selected. You can achieve this using meta learning throughout the pipeline. Meta learning is used in:

• Algorithm selection to choose an optimal model class.

• Adaptive sampling to identify the optimal set of samples.

• Feature selection to determine the ideal feature subset.

• Hyperparameter optimization.

The following topics detail the Oracle AutoML pipeline and individual stages of the pipeline:

## Keras¶

Keras is an open source neural network library. It can run on top of TensorFlow, Theano, and Microsoft Cognitive Toolkit. By default, Keras uses TensorFlow as the backend. Keras is written in Python, but it has support for R and PlaidML, see About Keras.

These examples examine a binary classification problem predicting churn. This is a common type of problem that can be solved using Keras, Tensorflow, and scikit-learn.

If the data is not cached, it is pulled from github, cached, and then loaded.

from os import path
import numpy as np
import pandas as pd
import requests

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

churn_data_file = '/tmp/churn.csv'
if not path.exists(churn_data_file):
# fetch sand save some data
print('fetching data from web...', end =" ")
r = requests.get('https://github.com/darenr/public_datasets/raw/master/churn_dataset.csv')
with open(churn_data_file, 'wb') as fd:
fd.write(r.content)
print("Done")



Keras needs to be imported and scikit-learn needs to be imported to generate metrics. Most of the data preprocessing and modeling can be done using the ADS library. However, the following example demonstrates how to do these tasks with external libraries:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score

from keras.models import Sequential
from keras.layers import Dense


The first step is data preparation. From the pandas.DataFrame, you extract the X and Y-values as numpy arrays. The feature selection is performed manually. The next step is feature encoding using sklearn LabelEncoder. This converts categorical variables into ordinal values (‘red’, ‘green’, ‘blue’ –> 0, 1, 2) to be compatible with Keras. The data is then split using a 80/20 ratio. The training is performed on 80% of the data. Model testing is performed on the remaining 20% of the data to evaluate how well the model generalizes.

feature_name = ['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

response_name = ['Exited']
data = df[[val for sublist in [feature_name, response_name] for val in sublist]].copy()

# Encode the category columns
for col in ['Geography', 'Gender']:
data.loc[:, col] = LabelEncoder().fit_transform(data.loc[:, col])

# Do an 80/20 split for the training and test data
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Scale the features and split the features away from the response
sc = StandardScaler() # Feature Scaling
X_train = sc.fit_transform(train.drop('Exited', axis=1).to_numpy())
X_test = sc.transform(test.drop('Exited', axis=1).to_numpy())
y_train = train.loc[:, 'Exited'].to_numpy()
y_test = test.loc[:, 'Exited'].to_numpy()


The following shows the neural network architecture. It is a sequential model with an input layer with 10 nodes. It has two hidden layers with 255 densely connected nodes and the ReLu activation function. The output layer has a single node with a sigmoid activation function because the model is doing binary classification. The optimizer is Adam and the loss function is binary cross-entropy. The model is optimized on the accuracy metric. This takes several minutes to run.

keras_classifier = Sequential()

keras_classifier.fit(X_train, y_train, batch_size=10, epochs=25)


To evaluate this model, you could use sklearn or ADS.

This example uses sklearn:

y_pred = keras_classifier.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)

print("confusion_matrix:\n", cm)
print("roc_auc_score", auc)


This example uses the ADS evaluator package:

from ads.common.model import ADSModel

eval_test = MLData.build(X = pd.DataFrame(sc.transform(test.drop('Exited', axis=1)), columns=feature_name),
y = pd.Series(test.loc[:, 'Exited']),
name = 'Test Data')
eval_train = MLData.build(X = pd.DataFrame(sc.transform(train.drop('Exited', axis=1)), columns=feature_name),
y = pd.Series(train.loc[:, 'Exited']),
name = 'Training Data')


## Scikit-Learn¶

The sklearn pipeline can be used to build a model on the same churn dataset that was used in the Keras section. The pipeline allows the model to contain multiple stages and transformations. Typically, there are pipeline stages for feature encoding, scaling, and so on. In this pipeline example, a LogisticRegression estimator is used:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline_classifier = Pipeline(steps=[
('clf', LogisticRegression())
])

pipeline_classifier.fit(X_train, y_train)


You can evaluate this model using sklearn or ADS.

## XGBoost¶

XGBoost is an optimized, distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as Gradient Boosting Decision Tree, Gradient Boosting Machines [GBM]) and can be used to solve a variety of data science applications. The unmodified code runs on several distributed environments (Hadoop, SGE, andMPI) and can processes billions of observations, see the XGBoost Documentation.

Import XGBoost with:

from xgboost import XGBClassifier

xgb_classifier.fit(eval_train.X, eval_train.y)


From the three estimators, we create three ADSModel objects. A Keras classifier, a sklearn pipeline with a single LogisticRegression stage, and an XGBoost model:

from ads.common.model import ADSModel

evaluator = ADSEvaluator(eval_test, models=[keras_model, lr_model, xgb_model], training_data=eval_train)
evaluator.show_in_notebook()


In addition to the other services for training models, ADS includes a hyperparameter tuning framework called ADSTuner.

ADSTuner supports using several hyperparameter search strategies that plug into common model architectures like sklearn.

ADSTuner further supports users defining their own search spaces and strategies. This makes ADSTuner functional and useful with any ML library that doesn’t include hyperparameter tuning.

First, import the packages:

import category_encoders as ce
import lightgbm
import logging
import numpy as np
import os
import pandas as pd
import pytest
import sklearn
import xgboost

from lightgbm import LGBMClassifier
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from xgboost import XGBClassifier


This is an example of running the ADSTuner on a support model SGD from sklearn:

model = SGDClassifier() ##Initialize the model
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
tuner = ADSTuner(model, cv=3) ## cv is cross validation splits
tuner.search_space() ##This is the default search space
tuner.tune(X_train, y_train, exit_criterion=[NTrials(10)])


ADSTuner generates a tuning report that lists its trials, best performing hyperparameters, and performance statistics with:

You can use tuner.best_score to get the best score on the scoring metric used (accessible astuner.scoring_name) The best selected parameters are obtained with tuner.best_params and the complete record of trials with tuner.trials

If you have further compute resources and want to continue hyperparameter optimization on a model that has already been optimized, you can use:

tuner.resume(exit_criterion=[TimeBudget(5)], loglevel=logging.NOTSET)
print('So far the best {} score is {}'.format(tuner.scoring_name, tuner.best_score))
print("The best trial found was number: " + str(tuner.best_index))


ADSTuner has some robust visualization and plotting capabilities:

tuner.plot_best_scores()
tuner.plot_intermediate_scores()
tuner.search_space()
tuner.plot_contour_scores(params=['penalty', 'alpha'])
tuner.plot_parallel_coordinate_scores(params=['penalty', 'alpha'])
tuner.plot_edf_scores()


These commands produce the following plots:

ADSTuner supports custom scoring functions and custom search spaces. This example uses a different model:

model2 = LogisticRegression()
strategy = {
'C': LogUniformDistribution(low=1e-05, high=1),
'solver': CategoricalDistribution(['saga']),
'max_iter': IntUniformDistribution(500, 1000, 50)},
scoring=make_scorer(f1_score, average='weighted'),
cv=3)
tuner.tune(X_train, y_train, exit_criterion=[NTrials(5)])


ADSTuner doesn’t support every model. The supported models are:

• ‘Ridge’,

• ‘RidgeClassifier’,

• ‘Lasso’,

• ‘ElasticNet’,

• ‘LogisticRegression’,

• ‘SVC’,

• ‘SVR’,

• ‘LinearSVC’,

• ‘LinearSVR’,

• ‘DecisionTreeClassifier’,

• ‘DecisionTreeRegressor’,

• ‘RandomForestClassifier’,

• ‘RandomForestRegressor’,

• ‘XGBClassifier’,

• ‘XGBRegressor’,

• ‘ExtraTreesClassifier’,

• ‘ExtraTreesRegressor’,

• ‘LGBMClassifier’,

• ‘LGBMRegressor’,

• ‘SGDClassifier’,

• ‘SGDRegressor’

The AdaBoostRegressor model is not supported. This is an example of a custom strategy to use with this model:

model3 = AdaBoostRegressor()
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
tuner = ADSTuner(model3, strategy={'n_estimators': IntUniformDistribution(50, 100)})
tuner.tune(X_train, y_train, exit_criterion=[TimeBudget(5)])


Finally, ADSTuner supports sklearn pipelines:

df, target = pd.read_csv(os.path.join('~', 'advanced-ds', 'tests', 'vor_datasets', 'vor_titanic.csv')), 'Survived'
X = df.drop(target, axis=1)
y = df[target]

numeric_features = X.select_dtypes(include=['int64', 'float64', 'int32', 'float32']).columns
categorical_features = X.select_dtypes(include=['object', 'category', 'bool']).columns

y = preprocessing.LabelEncoder().fit_transform(y)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

num_features = len(numeric_features) + len(categorical_features)

numeric_transformer = Pipeline(steps=[
('num_imputer', SimpleImputer(strategy='median')),
('num_scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('cat_encoder', ce.woe.WOEEncoder())
])

preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)

pipe = Pipeline(
steps=[
('preprocessor', preprocessor),
('feature_selection', SelectKBest(f_classif, k=int(0.9 * num_features))),
('classifier', LogisticRegression())
]
)

def customerize_score(y_true, y_pred, sample_weight=None):
score = y_true == y_pred
return np.average(score, weights=sample_weight)

score = make_scorer(customerize_score)
pipe,
scoring=score,
strategy='detailed',
cv=2,
random_state=42
)


### Notebook Example: Hyperparameter Optimization with ADSTuner¶

Overview:

A hyperparameter is a parameter that is used to control a learning process. This is in contrast to other parameters that are learned in the training process. The process of hyperparameter optimization is to search for hyperparameter values by building many models and assessing their quality. This notebook provides an overview of the ADSTuner hyperparameter optimization engine. ADSTuner can optimize any estimator object that follows the scikit-learn API.

Objectives:

• Introduction

• Synchronous Tuning with Exit Criterion Based on Number of Trials

• Asynchronously Tuning with Exit Criterion Based on Time Budget

• Inspecting the Tuning Trials

• Defining a Custom Search Space and Score

• Changing the Search Space Strategy

• Optimizing a scikit-learn Pipeline()

• References

Important:

Placeholder text for required values are surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name to database_name = "<database_name>" would become database_name = "production".

Datasets are provided as a convenience. Datasets are considered third party content and are not considered materials under your agreement with Oracle applicable to the services. The iris dataset is distributed under the BSD license.

import category_encoders as ce
import lightgbm
import logging
import numpy as np
import os
import pandas as pd
import sklearn
import time

from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif


Introduction

Hyperparameter optimization requires a model, dataset, and an ADSTuner object to perform the search.

ADSTuner() Performs a hyperparameter search using cross-validation. You can specify the number of folds you want to use with the cv parameter.

The ADSTuner() needs a search space to tune the hyperparameters in so you use the strategy parameter. This parameter can be set in two ways. You can specify detailed search criteria or you can use the built-in defaults. For the supported model classes, ADSTuner provides perfunctoryand detailed search spaces that are optimized for the class of model that is being used. The perfunctory option is optimized for a small search space so that the most important hyperparameters are tuned. Generally, this option is used early in your search as it reduces the computational cost and allows you to assess the quality of the model class that you are using. The detailed search space instructs ADSTuner to cover a broad search space by tuning more hyperparameters. Typically, you would use it when you have determined what class of model is best suited for the dataset and type of problem you are working on. If you have experience with the dataset and have a good idea of what the best hyperparameter values are, you can explicitly specify the search space. You pass a dictionary that defines the search space into the strategy.

The parameter storage takes a database URL. For example, sqlite:////home/datascience/example.db. When storage is set to the default value None, a new sqlite database file is created internally in the tmp folder with a unique name. The name format is sqlite:////tmp/hpo_*.db. study_name is the name of this study for this ADSTuner object. One ADSTuner object only has one study_name. However, one database file can be shared among different ADSTuner objects. load_if_exists controls whether to load an existing study from an existing database file. If False, it raises a DuplicatedStudyError when the study_name exists.

The loglevel parameter controls the amount of logging information displayed in the notebook.

This notebook uses the scikit-learn SGDClassifer() model and the iris dataset. This model object is a regularized linear model with stochastic gradient descent (SGD) used to optimize the model parameters.

The next cell creates the SGDClassifer() model, initialize san ADSTuner object, and loads the iris data.

tuner = ADSTuner(SGDClassifier(), cv=3, loglevel=logging.WARNING)

[32m[I 2021-04-21 20:04:03,435][0m A new study created with name: hpo_22cfd4d5-c512-4e84-b7f8-d6d9c721ff05[0m


Each model class has a set of hyperparameters that you need to optimized. The strategy attribute returns what strategy is being used. This can be perfunctory, detailed, or a dictionary that defines the strategy. The method search_space() always returns a dictionary of hyperparameters that are to be searched. Any hyperparameter that is required by the model, but is not listed, uses the default value that is defined by the model class. To see what search space is being used for your model class when strategy is perfunctory or detailed use the search_space() method to see the details.

The adstuner_search_space_update.ipynb notebook has detailed examples about how to work with and update the search space.

The next cell displaces the search strategy and the search space.

print(f'Search Space for strategy "{tuner.strategy}" is: \n {tuner.search_space()}')

Search Space for strategy "perfunctory" is:
{'alpha': LogUniformDistribution(low=0.0001, high=0.1), 'penalty': CategoricalDistribution(choices=['l1', 'l2', 'none'])}


The tune() method starts a tuning process. It has a synchronous and asynchronous mode for tuning. The mode is set with the synchronous parameter. When it is set to False, the tuning process runs asynchronously so it runs in the background and allows you to continue your work in the notebook. When synchronous is set to True, the notebook is blocked until tune() finishes running. The adntuner_sync_and_async.ipynb notebook illustrates this feature in a more detailed way.

The ADSTuner object needs to know when to stop tuning. The exit_criterion parameter accepts a list of criteria that cause the tuning to finish. If any of the criteria are met, then the tuning process stops. Valid exit criteria are:

• NTrials(n): Run for n number of trials.

• TimeBudget(t): Run for t seconds.

• ScoreValue(s): Run until the score value exceeds s.

The default behavior is to run for 50 trials (NTrials(50)).

The stopping criteria are listed in the ads.hpo.stopping_criterion module.

Synchronous Tuning with Exit Criterion Based on Number of Trials

This section demonstrates how to perform a synchronous tuning process with the exit criteria based on the number of trials. In the next cell, the synchronous parameter is set to True and the exit_criterion is set to [NTrials(5)].

tuner.tune(X, y, exit_criterion=[NTrials(5)], synchronous=True)


You can access a summary of the trials by looking at the various attributes of the tuner object. The scoring_name attribute is a string that defines the name of the scoring metric. The best_score attribute gives the best score of all the completed trials. The best_params parameter defines the values of the hyperparameters that have to lead to the best score. Hyperparameters that are not in the search criteria are not reported.

print(f"So far the best {tuner.scoring_name} score is {tuner.best_score} and the best hyperparameters are {tuner.best_params}")

So far the best mean accuracy score is 0.9666666666666667 and the best hyperparameters are {'alpha': 0.002623793623610696, 'penalty': 'none'}


You can also look at the detailed table of all the trials attempted:

tuner.trials.tail()

number value datetime_start datetime_complete duration params_alpha params_penalty user_attrs_mean_fit_time user_attrs_mean_score_time user_attrs_mean_test_score user_attrs_metric user_attrs_split0_test_score user_attrs_split1_test_score user_attrs_split2_test_score user_attrs_std_fit_time user_attrs_std_score_time user_attrs_std_test_score state
0 0 0.966667 2021-04-21 20:04:03.582801 2021-04-21 20:04:05.276723 0 days 00:00:01.693922 0.002624 none 0.172770 0.027071 0.966667 mean accuracy 1.00 0.94 0.96 0.010170 0.001864 0.024944 COMPLETE
1 1 0.760000 2021-04-21 20:04:05.283922 2021-04-21 20:04:06.968774 0 days 00:00:01.684852 0.079668 l2 0.173505 0.026656 0.760000 mean accuracy 0.70 0.86 0.72 0.005557 0.001518 0.071181 COMPLETE
2 2 0.960000 2021-04-21 20:04:06.977150 2021-04-21 20:04:08.808149 0 days 00:00:01.830999 0.017068 l1 0.185423 0.028592 0.960000 mean accuracy 1.00 0.92 0.96 0.005661 0.000807 0.032660 COMPLETE
3 3 0.853333 2021-04-21 20:04:08.816561 2021-04-21 20:04:10.824708 0 days 00:00:02.008147 0.000168 l2 0.199904 0.033303 0.853333 mean accuracy 0.74 0.94 0.88 0.005677 0.001050 0.083799 COMPLETE
4 4 0.826667 2021-04-21 20:04:10.833650 2021-04-21 20:04:12.601027 0 days 00:00:01.767377 0.001671 l2 0.180534 0.028627 0.826667 mean accuracy 0.80 0.96 0.72 0.009659 0.002508 0.099778 COMPLETE

Asynchronously Tuning with Exit Criterion Based on Time Budget

ADSTuner() tuner can be run in an asynchronous mode by setting synchronous=False in the tune() method. This allows you to run other Python commands while the tuning process is executing in the background. This section demonstrates how to run an asynchronous search for the optimal hyperparameters. It uses a stopping criteria of five seconds. This is controlled by the parameter exit_criterion=[TimeBudget(5)].

The next cell starts an asynchronous tuning process. A loop is created that prints the best search results that have been detected so far by using the best_score attribute. It also displays the remaining time in the time budget by using the time_remaining attribute. The attribute status is used to exit the loop.

# This cell will return right away since it's running asynchronous.
tuner.tune(exit_criterion=[TimeBudget(5)])
while tuner.status == State.RUNNING:
print(f"So far the best score is {tuner.best_score} and the time left is {tuner.time_remaining}")
time.sleep(1)

So far the best score is 0.9666666666666667 and the time left is 4.977275848388672
So far the best score is 0.9666666666666667 and the time left is 3.9661824703216553
So far the best score is 0.9666666666666667 and the time left is 2.9267797470092773
So far the best score is 0.9666666666666667 and the time left is 1.912914752960205
So far the best score is 0.9733333333333333 and the time left is 0.9021461009979248
So far the best score is 0.9733333333333333 and the time left is 0


The attribute best_index givse you the index in the trials data frame where the best model is located.

tuner.trials.loc[tuner.best_index, :]

number                                                  10
value                                                 0.98
datetime_start                  2021-04-21 20:04:17.013347
datetime_complete               2021-04-21 20:04:18.623813
duration                            0 days 00:00:01.610466
params_alpha                                      0.014094
params_penalty                                          l1
user_attrs_mean_fit_time                           0.16474
user_attrs_mean_score_time                        0.024773
user_attrs_mean_test_score                            0.98
user_attrs_metric                            mean accuracy
user_attrs_split0_test_score                           1.0
user_attrs_split1_test_score                           1.0
user_attrs_split2_test_score                          0.94
user_attrs_std_fit_time                           0.006884
user_attrs_std_score_time                          0.00124
user_attrs_std_test_score                         0.028284
state                                             COMPLETE
Name: 10, dtype: object


The attribute n_trials reports the number of successfully complete trials that were conducted.

print(f"The total of trials was: {tuner.n_trials}.")

The total of trials was: 11.


Inspecting the Tuning Trials

You can inspect the tuning trials performance using several built in plots.

Note: If the tuning process is still running in the background, the plot runs in real time to update the new changes until the tuning process completes.

# tuner.tune(exit_criterion=[NTrials(5)], loglevel=logging.WARNING) # uncomment this line to see the real-time plot.
tuner.plot_best_scores()
`