5.21 XGBoost

The oml.xgb class supports the in-database, scalable gradient tree boosting algorithm for classification, regression, ranking, and survival models. It makes the open source gradient boosting framework available in the database. It prepares the categorical encoding and missing value replacement using the OML infrastructure, calls the in-database XGBoost implementation, builds and persists the model as a first-class database model object, and supports using the model for prediction.

You can use oml.xgb as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems such as ad click-through rate prediction, hazard risk prediction, web text classification, and so on.

The oml.xgb algorithm takes three types of parameters: general parameters, booster parameters, and task parameters. You set the parameters through the model settings. The algorithm supports most of the settings of the open source XGBoost project. For more information on the supported settings, see XGBoost parameters.
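
For example, the following sketch shows how the three parameter types map to model settings. Each open source XGBoost parameter is passed as a setting whose name carries the xgboost_ prefix, with the value given as a string; the parameter choices below are illustrative, not a definitive configuration.

# A minimal sketch of the three parameter types, assuming the
# xgboost_ prefix naming used in the examples later in this section.
setting = {'xgboost_booster':   'gbtree',          # general parameter
           'xgboost_max_depth': '6',               # booster parameter
           'xgboost_objective': 'binary:logistic'} # task parameter
xgb_mod = oml.xgb('classification', **setting)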

Through oml.xgb, OML4Py supports classification, regression, ranking, and survival models. Binary and multi-class models are supported under the classification machine learning technique, while regression, ranking, count, and survival models are supported under the regression machine learning technique.
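
For example, ranking, count, and survival models are all created with the regression machine learning technique together with the matching objective. The following sketch assumes that the open source objective names (rank:pairwise, count:poisson, survival:cox) pass through the xgboost_objective setting; only rank:pairwise appears in the example later in this section.

# A sketch only; the count and survival objectives are assumptions here.
rank_mod  = oml.xgb('regression', **{'xgboost_objective': 'rank:pairwise'})
count_mod = oml.xgb('regression', **{'xgboost_objective': 'count:poisson'})
surv_mod  = oml.xgb('regression', **{'xgboost_objective': 'survival:cox'})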

oml.xgb also supports partitioned models and internalizes the data preparation.
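
For example, a partitioned model might be requested as in the following sketch, which assumes that the generic ODMS_PARTITION_COLUMNS setting is passed through the settings dictionary in the same way as the xgboost_ settings; the partition column name is hypothetical.

# A hedged sketch of a partitioned model; the 'odms_partition_columns'
# setting name and the 'Species' partition column are assumptions.
setting = {'odms_partition_columns': 'Species'}
part_mod = oml.xgb('classification', **setting)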

Settings for an XGBoost model

The following table lists settings that apply to XGBoost models.

Table 5-19 XGBoost Model Settings

Setting Name: booster

Setting Value: A string that is one of the following:

  • dart
  • gblinear
  • gbtree

Description: The booster to use. The dart and gbtree boosters use tree-based models, whereas gblinear uses linear functions. The default value is gbtree.

Setting Name: num_round

Setting Value: A non-negative integer.

Description: The number of rounds for boosting. The default value is 10.

For more information on the booster settings, see XGBoost parameters.
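
For example, the two settings in Table 5-19 can be combined in a single settings dictionary, as in the following sketch; the values are illustrative.

# A minimal sketch combining the booster and num_round settings;
# values are passed as strings, as in the examples later in this section.
setting = {'xgboost_booster': 'dart', 'xgboost_num_round': '20'}
xgb_mod = oml.xgb('classification', **setting)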

Example 5-21 Using the oml.xgb Class

This example creates an XGBoost model and uses some of the methods of the oml.xgb class.

#Load the iris data from sklearn and combine the target and predictors into a single DataFrame, which matches the form of a database table.
Use the oml.create function to load this pandas DataFrame into the database, which creates a persistent table and returns a proxy object that you assign to z.#

import oml
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:'virginica'}[x], iris.target)), columns = ['Species'])

#For an on-premises database, use the following command to connect to the database.#
oml.connect("<username>","<password>", dsn="<dsn>")
z = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

#Create training data and test data.#

dat = oml.sync(table = "IRIS").split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

#Classification Example:#

#Create an XGBoost model object.#

setting = {'xgboost_max_depth': '3',
...            'xgboost_eta': '1',
...            'xgboost_num_round': '10'}

xgb_mod = oml.xgb('classification', **setting)

#Fit the XGBoost model to the training data.#

xgb_mod.fit(train_x, train_y)  
#Use the model to make predictions on the test data and return the prediction probabilities for each category in Species.#
xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Species']], proba = True).sort_values(by = ['Sepal_Length', 'Sepal_Width']) 
     Sepal_Length  Sepal_Width     Species       TOP_1  TOP_1_VAL
 0            4.4          3.0      setosa      setosa   0.993619
 1            4.4          3.2      setosa      setosa   0.993619
 2            4.5          2.3      setosa      setosa   0.942128
 3            4.8          3.4      setosa      setosa   0.993619
...           ...          ...         ...         ...        ...
 42           6.7          3.3   virginica   virginica   0.996170
 43           6.9          3.1  versicolor  versicolor   0.925217
 44           6.9          3.1   virginica   virginica   0.996170
 45           7.0          3.2  versicolor  versicolor   0.990586

#Regression Example:#

#Create training data and test data.#

dat = oml.sync(table = "IRIS").split()

train_x = dat[0].drop('Sepal_Length')
train_y = dat[0]['Sepal_Length']
test_dat = dat[1]

#Create an XGBoost model object.#

setting = {'xgboost_booster': 'gblinear'}
xgb_mod = oml.xgb('regression', **setting)

#Fit the XGBoost Model according to the training data and parameter settings.#

xgb_mod.fit(train_x, train_y)  
xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']])

#Ranking Example:#

#Create an XGBoost model object.#

setting = {'xgboost_objective': 'rank:pairwise',
...            'xgboost_max_depth': '3',
...            'xgboost_eta': '0.1',
...            'xgboost_gamma': '1.0',
...            'xgboost_num_round': '4'}

xgb_mod = oml.xgb('regression', **setting)

#Fit the XGBoost Model according to the training data and parameter settings.#

xgb_mod.fit(train_x, train_y)
#Use the model to make predictions on the test data, returning the Sepal_Length, Sepal_Width, Petal_Length, and Species columns in the result.#

xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']]) 
 

Listing for This Example

#Load the iris data from sklearn and combine the target and predictors into a single DataFrame, which matches the form of a database table.
Use the oml.create function to load this pandas DataFrame into the database, which creates a persistent table and returns a proxy object that you assign to z.#

>>> import oml
>>> from sklearn import datasets
>>> import pandas as pd

>>> iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:'virginica'}[x], iris.target)), columns = ['Species'])

>>> #For an on-premises database, use the following command to connect to the database.#
>>> oml.connect("<username>","<password>", dsn="<dsn>")
>>> z = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

#Create training data and test data.#

>>> dat = oml.sync(table = "IRIS").split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]

#Classification Example:#

#Create an XGBoost model object.#

>>> setting = {'xgboost_max_depth': '3',
...            'xgboost_eta': '1',
...            'xgboost_num_round': '10'}

>>> xgb_mod = oml.xgb('classification', **setting)

#Fit the XGBoost model to the training data.#

>>> xgb_mod.fit(train_x, train_y)  

Algorithm Name: XGBOOST
Mining Function: CLASSIFICATION
Target: Species

Settings: 
                    setting name            setting value
0                      ALGO_NAME             ALGO_XGBOOST
1          CLAS_WEIGHTS_BALANCED                      OFF
2                   ODMS_DETAILS              ODMS_ENABLE
3   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
4                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
5                      PREP_AUTO                       ON
6                        booster                   gbtree
7                            eta                        1
8                      max_depth                        3
9                    ntree_limit                        0
10                     num_round                       10
11                     objective           multi:softprob

Global Statistics: 
  attribute name attribute value
0       NUM_ROWS             104
1       mlogloss        0.024858

Attributes: 
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width

Partition: NO

ATTRIBUTE IMPORTANCE: 

  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE      GAIN     COVER  \
0  None   Petal_Length              None            None  0.743941  0.560554   
1  None    Petal_Width              None            None  0.162191  0.245400   
2  None   Sepal_Length              None            None  0.003738  0.044741   
3  None    Sepal_Width              None            None  0.090129  0.149306   

   FREQUENCY  
0   0.447761  
1   0.268657  
2   0.119403  
3   0.164179  

#Use the model to make predictions on the test data and return the prediction probabilities for each category in Species.#

>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Species']], proba = True).sort_values(by = ['Sepal_Length', 'Sepal_Width']) 
     Sepal_Length  Sepal_Width     Species       TOP_1  TOP_1_VAL
 0            4.4          3.0      setosa      setosa   0.993619
 1            4.4          3.2      setosa      setosa   0.993619
 2            4.5          2.3      setosa      setosa   0.942128
 3            4.8          3.4      setosa      setosa   0.993619
...           ...          ...         ...         ...        ...
 42           6.7          3.3   virginica   virginica   0.996170
 43           6.9          3.1  versicolor  versicolor   0.925217
 44           6.9          3.1   virginica   virginica   0.996170
 45           7.0          3.2  versicolor  versicolor   0.990586


#Regression Example:#

#Create training data and test data.#

>>> dat = oml.sync(table = "IRIS").split()

>>> train_x = dat[0].drop('Sepal_Length')
>>> train_y = dat[0]['Sepal_Length']
>>> test_dat = dat[1]

#Create an XGBoost model object.#

>>> setting = {'xgboost_booster': 'gblinear'}
>>> xgb_mod = oml.xgb('regression', **setting)

#Fit the XGBoost Model according to the training data and parameter settings.#

>>> xgb_mod.fit(train_x, train_y)  

Algorithm Name: XGBOOST
Mining Function: REGRESSION
Target: Sepal_Length

Settings: 
                   setting name            setting value
0                     ALGO_NAME             ALGO_XGBOOST
1                  ODMS_DETAILS              ODMS_ENABLE
2  ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
3                 ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
4                     PREP_AUTO                       ON
5                       booster                 gblinear
6                   ntree_limit                        0
7                     num_round                       10

Computed Settings: 
              setting name setting value
0  ODMS_EXPLOSION_MIN_SUPP             1

Global Statistics: 
  attribute name attribute value
0       NUM_ROWS             104
1           rmse        0.364149

Attributes: 
Petal_Length
Petal_Width
Sepal_Width
Species

Partition: NO

ATTRIBUTE IMPORTANCE: 

  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE    WEIGHT  CLASS
0  None   Petal_Length              None            None  0.335183      0
1  None    Petal_Width              None            None  0.368738      0
2  None    Sepal_Width              None            None  0.249208      0
3  None        Species              None      versicolor -0.197582      0
4  None        Species              None       virginica -0.170522      0

>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']]) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
     Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
 0            4.9          3.0           1.4      setosa    4.797075
 1            4.9          3.1           1.5      setosa    4.818641
 2            4.8          3.4           1.6      setosa    4.963796
 3            5.8          4.0           1.2      setosa    4.979247
...           ...          ...           ...         ...         ...
 42           6.7          3.3           5.7   virginica    6.990700
 43           6.7          3.0           5.2   virginica    6.674599
 44           6.5          3.0           5.2   virginica    6.563977
 45           5.9          3.0           5.1   virginica    6.456711
 

#Ranking Example:#

#Create an XGBoost model object.#

>>> setting = {'xgboost_objective': 'rank:pairwise',
...            'xgboost_max_depth': '3',
...            'xgboost_eta': '0.1',
...            'xgboost_gamma': '1.0',
...            'xgboost_num_round': '4'}

>>> xgb_mod = oml.xgb('regression', **setting)

#Fit the XGBoost Model according to the training data and parameter settings.#

>>> xgb_mod.fit(train_x, train_y) 
Algorithm Name: XGBOOST
Mining Function: REGRESSION
Target: Sepal_Length

Settings: 
                    setting name            setting value
0                      ALGO_NAME             ALGO_XGBOOST
1                   ODMS_DETAILS              ODMS_ENABLE
2   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
3                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
4                      PREP_AUTO                       ON
5                        booster                   gbtree
6                            eta                      0.1
7                          gamma                      1.0
8                      max_depth                        3
9                    ntree_limit                        0
10                     num_round                        4
11                     objective            rank:pairwise

Computed Settings: 
              setting name setting value
0  ODMS_EXPLOSION_MIN_SUPP             1

Global Statistics: 
  attribute name  attribute value
0       NUM_ROWS              104
1            map                1

Attributes: 
Petal_Length
Petal_Width
Sepal_Width
Species

Partition: NO

ATTRIBUTE IMPORTANCE: 

  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE      GAIN     COVER  \
0  None   Petal_Length              None            None  0.873855  0.677624   
1  None    Petal_Width              None            None  0.083504  0.184802   
2  None    Sepal_Width              None            None  0.042641  0.137574   

   FREQUENCY  
0   0.500000  
1   0.285714  
2   0.214286  

#Use the model to make predictions on the test data, returning the Sepal_Length, Sepal_Width, Petal_Length, and Species columns in the result.#

>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']]) 
     Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
 0            4.9          3.0           1.4      setosa    0.243485
 1            4.9          3.1           1.5      setosa    0.243485
 2            4.8          3.4           1.6      setosa    0.243485
 3            5.8          4.0           1.2      setosa    0.310980
...           ...          ...           ...         ...         ...
 42           6.7          3.3           5.7   virginica    0.771761
 43           6.7          3.0           5.2   virginica    0.728637
 44           6.5          3.0           5.2   virginica    0.728637
 45           5.9          3.0           5.1   virginica    0.674835