XGBoost AFT Model

Survival analysis is a field of statistics that examines the time that elapses until one or more events occur, such as death in biological organisms or failure in mechanical systems.

The goals of survival analysis include evaluating patterns of event times, comparing distributions of survival times in different groups of people, and determining whether and how much certain factors affect the likelihood of an event of interest. The existence of censored data is an important feature of survival analysis. If a person does not experience an event within the observation period, they are labeled as censored. Censoring is a type of missing data problem in which the time to event is not recorded for a variety of reasons, such as the study being terminated before all enrolled subjects have experienced the event of interest, or the subject leaving the study before experiencing an event.

Right censoring means that only a lower limit l is known for the true event time T, such that T > l. Right censoring occurs, for example, for subjects whose birth date is known but who are still living when they are lost to follow-up or when the study concludes. Right-censored data is encountered frequently. The data is said to be left-censored if the event of interest occurred before the subject was included in the study but the exact date is unknown. Interval censoring occurs when an event is known only to have occurred between two observations or examinations.
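The three censoring types can be told apart from a pair of lower and upper bounds on the event time. The query below is a minimal sketch of that idea. The OBS rows and the LOWER_BOUND and UPPER_BOUND column names are hypothetical and used only for illustration; treating a NULL upper bound as "no upper limit known" and encoding a left-censored observation with a lower bound of 0 are assumptions of this sketch.

```sql
-- Hypothetical observations: each row holds a lower and an upper bound
-- on the event time. NULL in UPPER_BOUND means no upper limit is known.
WITH obs (id, lower_bound, upper_bound) AS (
  SELECT 1, 120, 120  FROM dual UNION ALL   -- event observed exactly at t = 120
  SELECT 2, 300, NULL FROM dual UNION ALL   -- still event-free at t = 300
  SELECT 3,   0,   90 FROM dual UNION ALL   -- event happened some time before t = 90
  SELECT 4,  60,  150 FROM dual             -- event happened between t = 60 and t = 150
)
SELECT id,
       lower_bound,
       upper_bound,
       CASE
         WHEN upper_bound IS NULL       THEN 'right-censored'
         WHEN lower_bound = upper_bound THEN 'uncensored'
         WHEN lower_bound = 0           THEN 'left-censored'
         ELSE                                'interval-censored'
       END AS censoring_type
FROM   obs
ORDER  BY id;
```

Under these assumptions, rows 1 through 4 are labeled uncensored, right-censored, left-censored, and interval-censored, respectively.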

The Cox proportional hazards model and the Accelerated Failure Time (AFT) model are two major survival analysis methods. Oracle Machine Learning for SQL supports both of these models.

Cox regression works for right-censored survival time data. In a Cox proportional hazards regression model, the hazard rate is the risk of failure (that is, the risk or likelihood of experiencing the event of interest) at a particular time, given that the subject has survived up to that time. Cox predictions are returned on a hazard ratio scale. A Cox proportional hazards model has the following form:

h(t, x) = h_0(t) e^{βx}

Where h_0(t) is the baseline hazard, x is a covariate, and β is an estimated parameter that represents the covariate's effect on the outcome. The estimate from a Cox proportional hazards model is interpreted as relative risk rather than absolute risk.
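Relative risk can be made concrete by comparing two subjects. As a short worked derivation (the covariate values x_1 and x_2 are introduced here only for illustration), dividing the hazards of the two subjects cancels the baseline hazard:

```latex
\frac{h(t, x_1)}{h(t, x_2)}
  = \frac{h_0(t)\, e^{\beta x_1}}{h_0(t)\, e^{\beta x_2}}
  = e^{\beta (x_1 - x_2)}
```

The ratio does not depend on t, which is what "proportional hazards" refers to. For instance, a one-unit difference in x multiplies the hazard by e^β.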

The AFT model can be fit to data that is left-censored, right-censored, or interval-censored. The AFT model, which models the time to an event of interest, is one of the most often used models in survival analysis. AFT is a parametric survival model; that is, it assumes a distribution for the response data. The output of an AFT model has an intuitive physical interpretation. The model has the following form:

ln Y = <W, X> + σZ

Where X is the vector in R^d representing the features. W is a vector of d coefficients, each corresponding to a feature. <W, X> is the usual dot product in R^d. Y is the random variable modeling the output label. Z is a random variable with a known probability distribution that represents the “noise”; common choices are the normal distribution, the logistic distribution, and the extreme distribution. σ is a parameter that scales the size of the noise.
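The physical interpretation follows from exponentiating both sides of the model. As a short worked step (Y' and the coefficient w_j for feature x_j are notation introduced here for illustration), let Y' be the survival time when x_j is increased by one unit while the other features and the noise Z are held fixed:

```latex
Y = e^{\langle W, X \rangle}\, e^{\sigma Z},
\qquad
Y' = e^{\langle W, X \rangle + w_j}\, e^{\sigma Z} = e^{w_j}\, Y
```

A one-unit increase in x_j therefore multiplies the time to event by e^{w_j}: a positive coefficient stretches survival time and a negative one shortens it.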

An AFT model that works with XGBoost, or with gradient boosting in general, has the following form:

ln Y = T(x) + σZ

Where T(x) represents the output of a decision tree ensemble for the supplied input x. Since Z is a random variable, a likelihood is defined for the expression ln Y = T(x) + σZ. As a result, XGBoost's objective is to maximize the (log) likelihood by fitting a suitable tree ensemble T(x).
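As a sketch of what that likelihood looks like, write f_Z and F_Z for the probability density function and cumulative distribution function of Z (notation assumed here for illustration). An uncensored label y contributes a density term, while a label known only to lie between y_lo and y_hi contributes the probability of that interval:

```latex
% Uncensored label y
L_{\text{uncensored}}
  = \frac{1}{\sigma y}\, f_Z\!\left(\frac{\ln y - T(x)}{\sigma}\right)

% Censored label with y_{lo} \le Y \le y_{hi}
L_{\text{censored}}
  = F_Z\!\left(\frac{\ln y_{hi} - T(x)}{\sigma}\right)
  - F_Z\!\left(\frac{\ln y_{lo} - T(x)}{\sigma}\right)
```

Right censoring corresponds to letting y_hi grow without bound, so the first term of the censored expression tends to 1; left censoring corresponds to y_lo tending to 0, so the second term tends to 0.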

The AFT parameters are listed in DBMS_DATA_MINING — Algorithm Settings: XGBoost.

The following example shows a code snippet for survival analysis using the XGBoost algorithm. In this example, a SURVIVAL_DATA table is created that contains data for survival analysis. The XGBoost AFT settings aft_right_bound_column_name, aft_loss_distribution, and aft_loss_distribution_scale are illustrated in this example.
-----------------------------------------------------------------------------
--         Create a data table with left and right bound columns
-----------------------------------------------------------------------------

-- The data table 'SURVIVAL_DATA' contains both exact data points and
-- right-censored data points. The left bound column is set by the
-- parameter target_column_name. The right bound column is set
-- by the setting aft_right_bound_column_name.

-- For a right-censored data point, the right bound is infinity,
-- which is represented as NULL in the right bound column.

BEGIN EXECUTE IMMEDIATE 'DROP TABLE SURVIVAL_DATA';
EXCEPTION WHEN OTHERS THEN NULL; END;
/ 
CREATE TABLE SURVIVAL_DATA (INST NUMBER, LBOUND NUMBER, AGE NUMBER, 
                            SEX NUMBER, PHECOG NUMBER, PHKARNO NUMBER, 
                            PATKARNO NUMBER, MEALCAL NUMBER, WTLOSS NUMBER, 
                            RBOUND NUMBER);                
INSERT INTO SURVIVAL_DATA VALUES(26, 235, 63, 2, 0, 100,  90,  413,  0,   NULL);
INSERT INTO SURVIVAL_DATA VALUES(22, 444, 75, 2, 2,  70,  70,  438,  8,   444);
INSERT INTO SURVIVAL_DATA VALUES(16, 806, 44, 1, 1,  80,  80, 1025,  1,   NULL);
INSERT INTO SURVIVAL_DATA VALUES(16, 551, 77, 2, 2,  80,  60,  750, 28,   NULL);
INSERT INTO SURVIVAL_DATA VALUES(3,  202, 50, 2, 0, 100, 100,  635,  1,   NULL);
INSERT INTO SURVIVAL_DATA VALUES(7,  583, 68, 1, 1,  60,  70, 1025,  7,   583);
INSERT INTO SURVIVAL_DATA VALUES(32, 135, 60, 1, 1,  90,  70, 1275,  0,   135);
INSERT INTO SURVIVAL_DATA VALUES(21, 237, 69, 1, 1,  80,  70, NULL, NULL, NULL);
INSERT INTO SURVIVAL_DATA VALUES(26, 356, 53, 2, 1,  90,  90, NULL,   2,  NULL);
INSERT INTO SURVIVAL_DATA VALUES(13, 387, 56, 1, 2,  80,  60, 1075, NULL, 387);

-----------------------------------------------------------------------------
--             Build an XGBoost survival model with survival:aft
-----------------------------------------------------------------------------

BEGIN DBMS_DATA_MINING.DROP_MODEL('XGB_SURVIVAL_MODEL');
EXCEPTION WHEN OTHERS THEN NULL; END;
/
DECLARE
    v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
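    v_setlst('ALGO_NAME')                    := 'ALGO_XGBOOST';
    v_setlst('max_depth')                    := '6';            -- maximum depth of each tree
    v_setlst('eval_metric')                  := 'aft-nloglik';  -- AFT negative log likelihood
    v_setlst('num_round')                    := '100';          -- number of boosting rounds
    v_setlst('objective')                    := 'survival:aft'; -- AFT survival objective
    v_setlst('aft_right_bound_column_name')  := 'rbound';       -- column holding the right bound
    v_setlst('aft_loss_distribution')        := 'normal';       -- distribution of Z
    v_setlst('aft_loss_distribution_scale')  := '1.20';         -- scale parameter sigma
    v_setlst('eta')                          := '0.05';         -- learning rate
    v_setlst('lambda')                       := '0.01';         -- L2 regularization weight
    v_setlst('alpha')                        := '0.02';         -- L1 regularization weight
    v_setlst('tree_method')                  := 'hist';         -- histogram-based tree construction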

    DBMS_DATA_MINING.CREATE_MODEL2(
        MODEL_NAME          => 'XGB_SURVIVAL_MODEL',
        MINING_FUNCTION     => 'REGRESSION',
        DATA_QUERY          => 'SELECT * FROM SURVIVAL_DATA',
        TARGET_COLUMN_NAME  => 'LBOUND',
        CASE_ID_COLUMN_NAME =>  NULL,
        SET_LIST            =>  v_setlst);
END;
/
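Once the model is built, it can be inspected and scored with ordinary queries. The following is a minimal sketch rather than part of the original example. It assumes the XGB_SURVIVAL_MODEL and SURVIVAL_DATA objects created above, the USER_MINING_MODEL_SETTINGS dictionary view, and the PREDICTION scoring function. The PREDICTED_TIME alias is hypothetical, and reading the result as a predicted survival time on the same scale as LBOUND is an assumption carried over from how open-source XGBoost reports survival:aft predictions.

```sql
-- Confirm the AFT settings that were stored with the model.
-- LOWER() is used because the stored case of setting names is not assumed.
SELECT setting_name, setting_value
FROM   user_mining_model_settings
WHERE  model_name = 'XGB_SURVIVAL_MODEL'
AND    LOWER(setting_name) LIKE 'aft%'
ORDER  BY setting_name;

-- Score each row and show the prediction next to the recorded bounds.
-- PREDICTED_TIME is a hypothetical alias chosen for this sketch.
SELECT inst,
       lbound,
       rbound,
       PREDICTION(XGB_SURVIVAL_MODEL USING *) AS predicted_time
FROM   survival_data
ORDER  BY inst;
```

For right-censored rows (RBOUND is NULL), comparing the prediction with LBOUND gives a quick sanity check, since the true event time is known to exceed LBOUND.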