## 32 XGBoost

XGBoost is a highly efficient, scalable machine learning algorithm for regression and classification that makes the XGBoost Gradient Boosting open source package available in the database.

### 32.1 About XGBoost

Oracle's XGBoost prepares training data, builds and persists a model, and applies the model for classification and regression.

Oracle Machine Learning for SQL XGBoost is a scalable gradient tree boosting system that supports both classification and regression. It makes available the open source gradient boosting framework.

You can use XGBoost as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems such as ad click-through rate prediction, hazard risk prediction, web text classification, and so on.

The Oracle Machine Learning for SQL XGBoost algorithm takes three types of parameters: general parameters, booster parameters, and task parameters. You set the parameters through the model settings table. The algorithm supports most of the settings of the open source project.

Through XGBoost, Oracle Machine Learning for SQL supports a number of different classification and regression specifications, ranking models, and survival models. Binary and multiclass models are supported under the classification machine learning technique while regression, ranking, count, and survival are supported under the regression machine learning technique.

XGBoost also supports partitioned models and internalizes the data preparation. Currently, XGBoost is available only on the Oracle Database Linux platform.

### 32.2 XGBoost Feature Constraints

Feature interaction constraints allow users to specify which variables can and cannot interact. By focusing on key interactions and eliminating noise, they aid in improving predictive performance. This, in turn, may lead to more generalized predictions.

The feature interaction constraints are described in terms of groupings of features that are allowed to interact. Variables that appear together in a traversal path in decision trees interact with one another because the condition of a child node is dependent on the condition of the parent node. These additional controls on model fit are beneficial to users who have a good understanding of the modeling task, including domain knowledge. Oracle Machine Learning for SQL supports more of the available XGBoost capabilities once these constraints are applied.

Monotonic constraints allow you to impose monotonicity constraints on the features in your boosted model. In many circumstances, there may be a strong prior assumption that the genuine relationship is constrained in some way. This could be owing to commercial factors (only specific feature interactions are of interest) or the type of scientific subject under investigation. A typical form of constraint is that some features have a monotonic connection to the predicted response. In these situations, monotonic constraints may be employed to improve the model's prediction performance. For example, let X be the feature vector with features [x1, …, xi, …, xn] and ƒ(X) be the prediction response. Then the constraint is an increasing constraint if `ƒ(X) ≤ ƒ(X′)` whenever `xi ≤ xi′`, and a decreasing constraint if `ƒ(X) ≥ ƒ(X′)` whenever `xi ≤ xi′`. These feature constraints are listed in DBMS_DATA_MINING — Algorithm Settings: XGBoost.
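As a minimal sketch of applying a monotonic constraint, assuming the open source parameter name `monotone_constraints` is passed through the model settings table like the other booster parameters, and using placeholder names (`XGB_MONO_MODEL`, `TRAIN_DATA_REG`, `TARGET_VALUE`) for your own schema:
```
DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME') := 'ALGO_XGBOOST';
  v_setlst('PREP_AUTO') := 'ON';
  -- '(1,-1)' requests an increasing constraint on the first feature and
  -- a decreasing constraint on the second (open source XGBoost syntax)
  v_setlst('monotone_constraints') := '(1,-1)';
  DBMS_DATA_MINING.CREATE_MODEL2(
    MODEL_NAME          => 'XGB_MONO_MODEL',
    MINING_FUNCTION     => 'REGRESSION',
    DATA_QUERY          => 'SELECT * FROM TRAIN_DATA_REG',
    SET_LIST            => v_setlst,
    CASE_ID_COLUMN_NAME => 'CUST_ID',
    TARGET_COLUMN_NAME  => 'TARGET_VALUE');
END;
/
```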

The `interaction_constraints` setting is used to specify the interaction constraints. The following example predicts the customers most likely to respond positively to an affinity card loyalty program.
```
-----------------------------------------------------------------------
-- Build a Classification Model using Interaction Constraints
-----------------------------------------------------------------------
-- The interaction constraints setting can be used to specify permitted
-- interactions in the model. The constraints must be specified
-- in the form of a nested list, where each inner list is a group of
-- features (column names) that are allowed to interact with each other.
-- For example, assume x0, x1, x2, x3, x4, x5 and x6 are
-- the feature names (column names) of interest.
-- Then setting value [[x0,x1,x2],[x0,x4],[x5,x6]] specifies that:
-- * Features x0, x1 and x2 are allowed to interact with each other
-- but with no other feature.
-- * Features x0 & x4 are allowed to interact with one another
-- but with no other feature.
-- * Features x5 and x6 are allowed to interact with each other
-- but with no other feature.
-------------------------------------------------------------------------
BEGIN DBMS_DATA_MINING.DROP_MODEL('XGB_CLASS_MODEL_INTERACTIONS');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst('ALGO_NAME') := 'ALGO_XGBOOST';
v_setlst('PREP_AUTO') := 'ON';
v_setlst('max_depth') := '2';
v_setlst('eta') := '1';
v_setlst('num_round') := '100';
v_setlst('interaction_constraints') := '[[YRS_RESIDENCE, OCCUPATION],
[OCCUPATION, Y_BOX_GAMES],
[BULK_PACK_DISKETTES,
BOOKKEEPING_APPLICATION]]';
DBMS_DATA_MINING.CREATE_MODEL2(
MODEL_NAME => 'XGB_CLASS_MODEL_INTERACTIONS',
MINING_FUNCTION => 'CLASSIFICATION',
DATA_QUERY => 'SELECT * FROM TRAIN_DATA_CLAS',
SET_LIST => v_setlst,
CASE_ID_COLUMN_NAME => 'CUST_ID',
TARGET_COLUMN_NAME => 'AFFINITY_CARD');
DBMS_OUTPUT.PUT_LINE('Created model: XGB_CLASS_MODEL_INTERACTIONS');
END;
/
```

To view the complete example, see https://github.com/oracle-samples/oracle-db-examples/tree/main/machine-learning/sql/23ai.

### 32.3 XGBoost AFT Model

Survival analysis is a field of statistics that examines the time elapsed between one or more occurrences, such as death in biological organisms and failure in mechanical systems.

The goals of survival analysis include evaluating
patterns of event times, comparing distributions of survival times in different
groups of people, and determining if and how much certain factors affect the
likelihood of an event of interest. The existence of censored data is an important
feature of survival analysis. If a person does not experience an event within the
observation period, they are labeled as censored. **Censoring** is a type of missing data problem in which the time to event
is not recorded for a variety of reasons, such as the study being terminated before
all enrolled subjects have demonstrated the event of interest, or the subject
leaving the study before experiencing an event. Right censoring is defined as
knowing only the lower limit l for the genuine event
time T such that T >
l. Right censoring will take place, for example, for those subjects
whose birth date is known but who are still living when they are lost to follow-up
or when the study concludes. We frequently come upon data that has been
right-censored. The data is said to be left-censored if the event of interest
occurred before the subject was included in the study but the exact date is unknown.
Interval censoring occurs when an occurrence can only be described as occurring
between two observations or examinations.

The Cox proportional hazards model and the Accelerated Failure Time (AFT) model are two major survival analysis methods. Oracle Machine Learning for SQL supports both these models.

Cox regression works for right censored survival time data. The hazard rate is the risk of failure (that is, the risk or likelihood of suffering the event of interest) in a Cox proportional hazards regression model, assuming that the subject has lived up to a particular time. The Cox predictions are returned on a hazard ratio scale. A Cox proportional hazards model has the following form:

h(t, x) = h_{0}(t)e^{βx}

Where h_{0}(t) is the baseline hazard, x is a covariate, and β is an estimated parameter that represents the covariate's effect on the outcome. A Cox proportional hazards model's estimate is interpreted as relative risk rather than absolute risk.
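As a worked illustration of the relative-risk interpretation: for a single binary covariate x, the ratio of the hazards in the two groups follows directly from the model form, because the baseline hazard cancels:
```
h(t, x=1) / h(t, x=0) = h_{0}(t)e^{β·1} / h_{0}(t)e^{β·0} = e^{β}
```
So e^{β} is the hazard ratio associated with a one-unit increase in x, and it is constant over time t, which is the proportional hazards assumption.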

The AFT model fits models to data that can be censored to the left, right, or interval. The AFT model, which models time to an event of interest, is one of the most often used models in survival analysis. AFT is a parametric (it assumes the distribution of response data) survival model. The outcome of AFT models has a physical interpretation that is intuitive. The model has the following form:

ln Y = <W, X> + σZ

Where X is the vector in R^{d} representing the features, W is a vector consisting of d coefficients, each corresponding to a feature, and <W, X> is the usual dot product in R^{d}. Y is the random variable modeling the output label. Z is a random variable of a known probability distribution that represents the "noise"; common choices are the normal distribution, the logistic distribution, and the extreme distribution. σ is a parameter that scales the size of the noise.

The AFT model that works with XGBoost, or gradient boosting, has the following form:

ln Y = T(x) + σZ

Where T(x) represents the output of a decision tree ensemble, using the supplied input x. Since Z is a random variable, you have a likelihood defined for the expression ln Y = T(x) + σZ. As a result, XGBoost's purpose is to maximize the (log) likelihood by fitting a suitable tree ensemble T(x).
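Concretely, the censoring types described earlier enter this likelihood through the label bounds. Writing z(y) = (ln y − T(x))/σ and F_Z, f_Z for the CDF and PDF of Z, one standard construction (matching the open source AFT formulation; the notation here is illustrative) gives each data point with bounds [y_lower, y_upper] the contribution:
```
f_Z(z(y))                            -- uncensored: y_lower = y_upper = y
1 - F_Z(z(y_lower))                  -- right-censored: y_upper = infinity
F_Z(z(y_upper))                      -- left-censored
F_Z(z(y_upper)) - F_Z(z(y_lower))   -- interval-censored
```
This is why right-censored rows carry NULL (infinity) in the right bound column of the example that follows.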

The AFT parameters are listed in DBMS_DATA_MINING — Algorithm Settings: XGBoost.

In the following example, a `SURVIVAL_DATA` table is created that contains data for survival analysis. The XGBoost AFT settings `aft_right_bound_column_name`, `aft_loss_distribution`, and `aft_loss_distribution_scale` are illustrated in this example.
```
-----------------------------------------------------------------------------
-- Create a data table with left and right bound columns
-----------------------------------------------------------------------------
-- The data table 'SURVIVAL_DATA' contains both exact data point and
-- right-censored data point. The left bound column is set by
-- parameter target_column_name. The right bound column is set
-- by setting aft_right_bound_column_name.
-- For right censored data point, the right bound is infinity,
-- which is represented as NULL in the right bound column.
BEGIN EXECUTE IMMEDIATE 'DROP TABLE SURVIVAL_DATA';
EXCEPTION WHEN OTHERS THEN NULL; END;
/
CREATE TABLE SURVIVAL_DATA (INST NUMBER, LBOUND NUMBER, AGE NUMBER,
SEX NUMBER, PHECOG NUMBER, PHKARNO NUMBER,
PATKARNO NUMBER, MEALCAL NUMBER, WTLOSS NUMBER,
RBOUND NUMBER);
INSERT INTO SURVIVAL_DATA VALUES(26, 235, 63, 2, 0, 100, 90, 413, 0, NULL);
INSERT INTO SURVIVAL_DATA VALUES(22, 444, 75, 2, 2, 70, 70, 438, 8, 444);
INSERT INTO SURVIVAL_DATA VALUES(16, 806, 44, 1, 1, 80, 80, 1025, 1, NULL);
INSERT INTO SURVIVAL_DATA VALUES(16, 551, 77, 2, 2, 80, 60, 750, 28, NULL);
INSERT INTO SURVIVAL_DATA VALUES(3, 202, 50, 2, 0, 100, 100, 635, 1, NULL);
INSERT INTO SURVIVAL_DATA VALUES(7, 583, 68, 1, 1, 60, 70, 1025, 7, 583);
INSERT INTO SURVIVAL_DATA VALUES(32, 135, 60, 1, 1, 90, 70, 1275, 0, 135);
INSERT INTO SURVIVAL_DATA VALUES(21, 237, 69, 1, 1, 80, 70, NULL, NULL, NULL);
INSERT INTO SURVIVAL_DATA VALUES(26, 356, 53, 2, 1, 90, 90, NULL, 2, NULL);
INSERT INTO SURVIVAL_DATA VALUES(13, 387, 56, 1, 2, 80, 60, 1075, NULL, 387);
-----------------------------------------------------------------------------
-- Build an XGBoost survival model with survival:aft
-----------------------------------------------------------------------------
BEGIN DBMS_DATA_MINING.DROP_MODEL('XGB_SURVIVAL_MODEL');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst('ALGO_NAME') := 'ALGO_XGBOOST';
v_setlst('max_depth') := '6';
v_setlst('eval_metric') := 'aft-nloglik';
v_setlst('num_round') := '100';
v_setlst('objective') := 'survival:aft';
v_setlst('aft_right_bound_column_name') := 'rbound';
v_setlst('aft_loss_distribution') := 'normal';
v_setlst('aft_loss_distribution_scale') := '1.20';
v_setlst('eta') := '0.05';
v_setlst('lambda') := '0.01';
v_setlst('alpha') := '0.02';
v_setlst('tree_method') := 'hist';
DBMS_DATA_MINING.CREATE_MODEL2(
MODEL_NAME => 'XGB_SURVIVAL_MODEL',
MINING_FUNCTION => 'REGRESSION',
DATA_QUERY => 'SELECT * FROM SURVIVAL_DATA',
TARGET_COLUMN_NAME => 'LBOUND',
CASE_ID_COLUMN_NAME => NULL,
SET_LIST => v_setlst);
END;
/
```

To view the complete example, see https://github.com/oracle-samples/oracle-db-examples/blob/main/machine-learning/sql/23ai/oml4sql-survival-analysis-xgboost.sql.

### 32.4 Ranking Methods

Oracle Machine Learning supports pairwise and listwise ranking methods through XGBoost.

The training data is grouped into a number of sets; each set consists of objects and labels representing their ranking. A ranking function is constructed by minimizing a certain loss function on the training data. The ranking function is then applied to test data to obtain a ranked list of objects. Ranking is enabled for XGBoost using the regression machine learning function.

Pairwise ranking: This approach regards a pair of objects as the learning instance. The pairs and lists are defined by supplying the same `case_id` value. Given a pair of objects, this approach gives an optimal ordering for that pair. Pairwise losses are defined by the order of the two objects. In OML4SQL, the algorithm uses LambdaMART to perform pairwise ranking with the goal of minimizing the average number of inversions in ranking.

Listwise ranking: This approach takes multiple lists of ranked objects as the learning instance. The items in a list must have the same `case_id` value. The algorithm uses LambdaMART to perform listwise ranking.
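As a sketch of building a pairwise ranking model through the regression function, assuming that the open source objective value `rank:pairwise` is passed through the settings table and that `XGB_RANK_MODEL`, `RANK_TRAIN_DATA`, `QUERY_ID`, and `RELEVANCE` are placeholder names:
```
DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME') := 'ALGO_XGBOOST';
  v_setlst('PREP_AUTO') := 'ON';
  -- rank:pairwise is the open source LambdaMART pairwise objective
  v_setlst('objective') := 'rank:pairwise';
  v_setlst('num_round') := '100';
  DBMS_DATA_MINING.CREATE_MODEL2(
    MODEL_NAME          => 'XGB_RANK_MODEL',
    MINING_FUNCTION     => 'REGRESSION',
    DATA_QUERY          => 'SELECT * FROM RANK_TRAIN_DATA',
    SET_LIST            => v_setlst,
    -- rows sharing a case_id value form one list, as described above
    CASE_ID_COLUMN_NAME => 'QUERY_ID',
    TARGET_COLUMN_NAME  => 'RELEVANCE');
END;
/
```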

See Also:

- "Ranking Measures and Loss Functions in Learning to Rank", a research paper presentation on the internet.
- Oracle Database PL/SQL Packages and Types Reference for a listing and explanation of the available model settings for XGBoost.

Note:

The term hyperparameter is also used interchangeably with model setting.

### 32.5 Scoring with XGBoost

Score with XGBoost using the supported SQL functions to predict values.

The SQL scoring functions supported for a classification XGBoost model are `PREDICTION`, `PREDICTION_COST`, `PREDICTION_DETAILS`, `PREDICTION_PROBABILITY`, and `PREDICTION_SET`.

The scoring functions supported for a regression XGBoost model are `PREDICTION` and `PREDICTION_DETAILS`.

The prediction functions return the following information:

- `PREDICTION` returns the predicted value.
- `PREDICTION_COST` returns a measure of cost for a given prediction as an Oracle NUMBER (classification only).
- `PREDICTION_DETAILS` returns the SHAP (SHapley Additive exPlanation) contributions.
- `PREDICTION_PROBABILITY` returns the probability for a given prediction (classification only).
- `PREDICTION_SET` returns the prediction and the corresponding prediction probability for each observation (classification only).
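For example, a classification model such as XGB_CLASS_MODEL_INTERACTIONS, built earlier in this chapter, can be scored as follows (the test table name `TEST_DATA_CLAS` is a placeholder for your own held-out data):
```
SELECT CUST_ID,
       PREDICTION(XGB_CLASS_MODEL_INTERACTIONS USING *)             pred,
       PREDICTION_PROBABILITY(XGB_CLASS_MODEL_INTERACTIONS USING *) prob
FROM   TEST_DATA_CLAS
ORDER  BY prob DESC
FETCH FIRST 10 ROWS ONLY;
```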
