8.12 Generalized Linear Model

The oml.glm class builds a Generalized Linear Model (GLM).

GLM models include and extend the class of linear models. They relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.

GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, Oracle Machine Learning GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.
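To illustrate why a conjugate gradient approach avoids matrix inversion, the following is a minimal sketch of the textbook conjugate gradient method applied to the least squares normal equations. It is not Oracle's implementation: the function names, the toy data, and the tolerance are assumptions made for illustration. The solver only needs a routine that multiplies a vector by X^T X, so the matrix is never formed explicitly or inverted.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x starting at zero
    p = list(r)            # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy data (hypothetical): y = 1 + 2*x exactly; first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

def normal_matvec(v):
    """Compute X^T (X v) without forming or inverting X^T X."""
    Xv = [sum(xij * vj for xij, vj in zip(row, v)) for row in X]
    return [sum(row[j] * Xv[i] for i, row in enumerate(X))
            for j in range(len(v))]

# Right-hand side of the normal equations, X^T y.
Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(2)]

beta = conjugate_gradient(normal_matvec, Xty)
print(beta)  # intercept and slope, close to 1 and 2
```

Because each iteration costs only matrix-vector products, the same pattern scales to wide data where factorizing or inverting X^T X would be expensive.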

GLM can be used to build classification or regression models as follows:

  • Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

  • Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
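The two model types differ in their link and variance functions. The sketch below writes these out using their standard definitions. The function names are chosen for illustration and are not part of the oml API. It shows that the binomial variance changes with the mean response, whereas the linear regression case assumes a constant variance.

```python
import math

def logit(mu):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_variance(mu):
    """Binomial variance function: depends on the mean mu."""
    return mu * (1.0 - mu)

def identity_link(mu):
    """Linear regression applies no transformation to the target."""
    return mu

def constant_variance(mu):
    """Linear regression assumes the same variance for every mean."""
    return 1.0

for mu in (0.1, 0.5, 0.9):
    print(mu, round(logit(mu), 4), round(binomial_variance(mu), 4),
          constant_variance(mu))
```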

The oml.glm class allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.
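As a rough sketch of how settings might be kept apart by model type, the snippet below groups the settings whose descriptions in Table 8-10 refer to classification or to binary logistic regression. The grouping is inferred from those descriptions, and the helper misplaced_settings, the target value 'no', and the "classification" string are assumptions for illustration. The snippet runs without a database connection.

```python
# Settings whose descriptions in Table 8-10 concern classification only
# (grouping inferred from the descriptions, not taken from the API).
CLASSIFICATION_ONLY = {
    "CLAS_COST_TABLE_NAME",
    "CLAS_WEIGHTS_BALANCED",
    "CLAS_WEIGHTS_TABLE_NAME",
    "GLMS_REFERENCE_CLASS_NAME",
}

def misplaced_settings(mining_function, settings):
    """Hypothetical helper: list settings that do not fit the model type."""
    if mining_function == "classification":
        return []
    return sorted(name for name in settings if name in CLASSIFICATION_ONLY)

classification_settings = {
    "CLAS_WEIGHTS_BALANCED": "ON",
    "GLMS_REFERENCE_CLASS_NAME": "no",   # hypothetical target value
}
regression_settings = {"GLMS_ROW_DIAGNOSTICS": "GLMS_ROW_DIAG_ENABLE"}

print(misplaced_settings("classification", classification_settings))
print(misplaced_settings("regression", regression_settings))
print(misplaced_settings("regression", classification_settings))

# With a database connection, a dictionary would be passed as in
# Example 8-12, for instance: oml.glm("regression", **regression_settings)
```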

For information on the oml.glm class attributes and methods, invoke help(oml.glm) or see Oracle Machine Learning for Python API Reference.

Settings for a Generalized Linear Model

The following table lists the settings that apply to GLM models.

Table 8-10 Generalized Linear Model Settings

Setting Name Setting Value Description
CLAS_COST_TABLE_NAME

table_name

The name of a table that stores a cost matrix for the algorithm to use in scoring the model. The cost matrix specifies the costs associated with misclassifications.

The cost matrix table is user-created. The following are the column requirements for the table.

  • Column Name: ACTUAL_TARGET_VALUE

    Data Type: Valid target data type

  • Column Name: PREDICTED_TARGET_VALUE

    Data Type: Valid target data type

  • Column Name: COST

    Data Type: NUMBER

CLAS_WEIGHTS_BALANCED

ON

OFF

Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF.

CLAS_WEIGHTS_TABLE_NAME

table_name

The name of a table that stores weighting information for individual target values in GLM logistic regression models. The weights are used by the algorithm to bias the model in favor of higher weighted classes.

The class weights table is user-created. The following are the column requirements for the table.

  • Column Name: TARGET_VALUE

    Data Type: Valid target data type

  • Column Name: CLASS_WEIGHT

    Data Type: NUMBER

GLMS_BATCH_ROWS

0 or a positive integer.

The number of rows in a batch used by the stochastic gradient descent (SGD) solver. A value of 0 triggers a data-driven batch size estimate.

The default value is 2000.

GLMS_CONF_LEVEL

TO_CHAR(0 < numeric_expr < 1)

The confidence level for coefficient confidence intervals.

The default confidence level is 0.95.

GLMS_CONV_TOLERANCE

The range is (0, 1) non-inclusive.

Convergence tolerance setting of the GLM algorithm.

The default value is system-determined.

GLMS_FTR_GEN_METHOD

GLMS_FTR_GEN_CUBIC

GLMS_FTR_GEN_QUADRATIC

Whether feature generation is cubic or quadratic.

When you enable feature generation, the algorithm automatically chooses the most appropriate feature generation method based on the data.

GLMS_FTR_GENERATION

GLMS_FTR_GENERATION_ENABLE

GLMS_FTR_GENERATION_DISABLE

Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

Note: Feature generation can only be enabled when feature selection is also enabled.

GLMS_FTR_SEL_CRIT

GLMS_FTR_SEL_AIC

GLMS_FTR_SEL_ALPHA_INV

GLMS_FTR_SEL_RIC

GLMS_FTR_SEL_SBIC

Feature selection penalty criterion for adding a feature to the model.

When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

GLMS_FTR_SELECTION

GLMS_FTR_SELECTION_ENABLE

GLMS_FTR_SELECTION_DISABLE

Enable or disable feature selection for GLM.

By default, feature selection is not enabled.

GLMS_MAX_FEATURES

TO_CHAR(0 < numeric_expr <= 2000)

When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model.

By default, the algorithm limits the number of features to ensure sufficient memory.

GLMS_NUM_ITERATIONS

A positive integer.

Maximum number of iterations for the GLM algorithm. The default value is system-determined.

GLMS_PRUNE_MODEL

GLMS_PRUNE_MODEL_ENABLE

GLMS_PRUNE_MODEL_DISABLE

When feature selection is enabled, the algorithm automatically performs pruning based on the data.

GLMS_REFERENCE_CLASS_NAME

target_value

The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the other class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

GLMS_RIDGE_REGRESSION

GLMS_RIDGE_REG_ENABLE

GLMS_RIDGE_REG_DISABLE

Enable or disable ridge regression. Ridge applies to both regression and classification machine learning functions.

When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function.

GLMS_RIDGE_VALUE

TO_CHAR(numeric_expr > 0)

The value of the ridge parameter. Use this setting only when you have configured the algorithm to use ridge regression.

If ridge regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.

GLMS_ROW_DIAGNOSTICS

GLMS_ROW_DIAG_ENABLE

GLMS_ROW_DIAG_DISABLE

Enable or disable row diagnostics.

By default, row diagnostics are disabled.

GLMS_SOLVER

GLMS_SOLVER_CHOL

GLMS_SOLVER_LBFGS_ADMM

GLMS_SOLVER_QR

GLMS_SOLVER_SGD

Specifies the GLM solver. You cannot select the solver if the GLMS_FTR_SELECTION setting is enabled. The default value is system-determined.

The GLMS_SOLVER_CHOL solver uses Cholesky decomposition.

The GLMS_SOLVER_SGD solver uses stochastic gradient descent.

GLMS_SPARSE_SOLVER

GLMS_SPARSE_SOLVER_ENABLE

GLMS_SPARSE_SOLVER_DISABLE

Enable or disable the use of a sparse solver if it is available.

The default value is GLMS_SPARSE_SOLVER_DISABLE.

ODMS_ROW_WEIGHT_COLUMN_NAME

column_name

The name of a column in the training data that contains a weighting factor for the rows. The column datatype must be NUMBER.

You can use row weights as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. You can also use row weights to emphasize certain rows during model construction, for example, to bias the model towards rows that are more recent and away from potentially obsolete data.
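The idea of row weights as a compact representation of repeated rows can be checked with a small weighted least squares fit. The following is a minimal sketch in plain Python, with made-up data and a hypothetical helper named weighted_line_fit. It illustrates the general statistical principle and makes no claim about how the database applies ODMS_ROW_WEIGHT_COLUMN_NAME internally.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Expanded data: the row (1, 2) occurs three times and (3, 5) twice.
x_full = [1.0, 1.0, 1.0, 2.0, 3.0, 3.0]
y_full = [2.0, 2.0, 2.0, 3.0, 5.0, 5.0]
fit_full = weighted_line_fit(x_full, y_full, [1.0] * 6)

# Compact data: one row per configuration, repeat count as the row weight.
fit_compact = weighted_line_fit([1.0, 2.0, 3.0],
                                [2.0, 3.0, 5.0],
                                [3.0, 1.0, 2.0])

print(fit_full)
print(fit_compact)
```

Both calls should give the same intercept and slope up to rounding, which is why a weight column can stand in for physically repeated rows.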

Example 8-12 Using the oml.glm Class

This example demonstrates the use of various methods of the oml.glm class. In the listing for this example, some of the output is not shown as indicated by ellipses.

import oml
import pandas as pd
from sklearn import datasets

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Petal_Width')
train_y = dat[0]['Petal_Width']
test_dat = dat[1]

# Specify settings.
setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}

# Create a GLM model object.
glm_mod = oml.glm("regression", **setting)

# Fit the GLM model according to the training data and parameter
# settings.
glm_mod = glm_mod.fit(train_x, train_y)

# Show the model details.
glm_mod

# Use the model to make predictions on the test data.
glm_mod.predict(test_dat.drop('Petal_Width'), 
                supplemental_cols = test_dat[:,
                  ['Sepal_Length', 'Sepal_Width', 
                   'Petal_Length', 'Species']])

# Return the prediction probability.
glm_mod.predict(test_dat.drop('Petal_Width'),                 
                supplemental_cols = test_dat[:,
                  ['Sepal_Length', 'Sepal_Width', 
                   'Petal_Length', 'Species']],
                proba = True)

glm_mod.score(test_dat.drop('Petal_Width'), 
              test_dat[:, ['Petal_Width']])

# Change the parameter setting and refit the model.
new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
glm_mod.set_params(**new_setting).fit(train_x, train_y)

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> try:
...    oml.drop('IRIS')
... except:
...    pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Petal_Width')
>>> train_y = dat[0]['Petal_Width']
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}
>>>
>>> # Create a GLM model object.
... glm_mod = oml.glm("regression", **setting)
>>>
>>> # Fit the GLM model according to the training data and parameter
... # settings.
>>> glm_mod = glm_mod.fit(train_x, train_y)
>>> 
>>> # Show the model details.
... glm_mod

Algorithm Name: Generalized Linear Model

Mining Function: REGRESSION

Target: Petal_Width

Settings: 
                   setting name                  setting value
0                     ALGO_NAME  ALGO_GENERALIZED_LINEAR_MODEL
1               GLMS_CONF_LEVEL                            .95
2           GLMS_FTR_GENERATION    GLMS_FTR_GENERATION_DISABLE
3            GLMS_FTR_SELECTION     GLMS_FTR_SELECTION_DISABLE
4                   GLMS_SOLVER                 GLMS_SOLVER_QR
5                  ODMS_DETAILS                    ODMS_ENABLE
6  ODMS_MISSING_VALUE_TREATMENT        ODMS_MISSING_VALUE_AUTO
7                 ODMS_SAMPLING          ODMS_SAMPLING_DISABLE
8                     PREP_AUTO                             ON

Computed Settings: 
            setting name            setting value
0    GLMS_CONV_TOLERANCE  .0000050000000000000004
1    GLMS_NUM_ITERATIONS                       30
2  GLMS_RIDGE_REGRESSION    GLMS_RIDGE_REG_ENABLE

Global Statistics:
            attribute name  attribute value
0        ADJUSTED_R_SQUARE         0.949634
1                      AIC         -363.888
2                COEFF_VAR          14.6284
3                CONVERGED              YES
4       CORRECTED_TOTAL_DF              103
5         CORRECTED_TOT_SS          58.4565
6           DEPENDENT_MEAN          1.15577
7                 ERROR_DF               98
8        ERROR_MEAN_SQUARE         0.028585
9        ERROR_SUM_SQUARES          2.80131
10                 F_VALUE          389.405
11                   GMSEP         0.030347
12              HOCKING_SP         0.000295
13                     J_P         0.030234
14                MODEL_DF                5
15         MODEL_F_P_VALUE                0
16       MODEL_MEAN_SQUARE           11.131
17       MODEL_SUM_SQUARES          55.6552
18              NUM_PARAMS                6
19                NUM_ROWS              104
20         RANK_DEFICIENCY                0
21            ROOT_MEAN_SQ          0.16907
22                    R_SQ         0.952079
23                    SBIC         -348.021
24 VALID_COVARIANCE_MATRIX              YES
[1 rows x 25 columns]

Attributes: 
Petal_Length
Sepal_Length
Sepal_Width
Species

Partition: NO

Coefficients: 

           name       level  estimate
0   (Intercept)        None -0.600603
1  Petal_Length        None  0.239775
2  Sepal_Length        None -0.078338
3   Sepal_Width        None  0.253996
4       Species  versicolor  0.652420
5       Species   virginica  1.010438

Fit Details: 
                       name         value
0         ADJUSTED_R_SQUARE  9.496338e-01
1                       AIC -3.638876e+02
2                 COEFF_VAR  1.462838e+01
3        CORRECTED_TOTAL_DF  1.030000e+02
...
21             ROOT_MEAN_SQ  1.690704e-01
22                     R_SQ  9.520788e-01
23                     SBIC -3.480213e+02
24  VALID_COVARIANCE_MATRIX  1.000000e+00

Rank: 
 
6

Deviance: 

2.801309

AIC: 

-364

Null Deviance: 

58.456538

DF Residual: 

98.0

DF Null: 

103.0

Converged: 

True

>>> 
>>> # Use the model to make predictions on the test data.
... glm_mod.predict(test_dat.drop('Petal_Width'),
...                 supplemental_cols = test_dat[:,
...                   ['Sepal_Length', 'Sepal_Width', 
...                    'Petal_Length', 'Species']])
    Sepal_Length  Sepal_Width  Petal_Length     Species  PREDICTION
0            4.9          3.0           1.4      setosa    0.113215
1            4.9          3.1           1.5      setosa    0.162592
2            4.8          3.4           1.6      setosa    0.270602
3            5.8          4.0           1.2      setosa    0.248752
...           ...          ...           ...         ...         ...
42           6.7          3.3           5.7   virginica    2.089876
43           6.7          3.0           5.2   virginica    1.893790
44           6.5          3.0           5.2   virginica    1.909457
45           5.9          3.0           5.1   virginica    1.932483

>>> # Return the prediction probability.
... glm_mod.predict(test_dat.drop('Petal_Width'), 
...                 supplemental_cols = test_dat[:,
...                   ['Sepal_Length', 'Sepal_Width', 
...                    'Petal_Length', 'Species']],
...                 proba = True)
    Sepal_Length  Sepal_Width     Species  PREDICTION
0            4.9          3.0      setosa    0.113215
1            4.9          3.1      setosa    0.162592
2            4.8          3.4      setosa    0.270602
3            5.8          4.0      setosa    0.248752
...          ...          ...         ...         ...
42           6.7          3.3   virginica    2.089876
43           6.7          3.0   virginica    1.893790
44           6.5          3.0   virginica    1.909457
45           5.9          3.0   virginica    1.932483
>>> 
>>> glm_mod.score(test_dat.drop('Petal_Width'), 
...               test_dat[:, ['Petal_Width']])
0.951252
>>>
>>> # Change the parameter setting and refit the model.
... new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
>>> glm_mod.set_params(**new_setting).fit(train_x, train_y)

Algorithm Name: Generalized Linear Model

Mining Function: REGRESSION

Target: Petal_Width

Settings: 
                   setting name                  setting value
0                     ALGO_NAME  ALGO_GENERALIZED_LINEAR_MODEL
1               GLMS_CONF_LEVEL                            .95
2           GLMS_FTR_GENERATION    GLMS_FTR_GENERATION_DISABLE
3            GLMS_FTR_SELECTION     GLMS_FTR_SELECTION_DISABLE
4                   GLMS_SOLVER                GLMS_SOLVER_SGD
5                  ODMS_DETAILS                    ODMS_ENABLE
6  ODMS_MISSING_VALUE_TREATMENT        ODMS_MISSING_VALUE_AUTO
7                 ODMS_SAMPLING          ODMS_SAMPLING_DISABLE
8                     PREP_AUTO                             ON

Computed Settings: 
            setting name          setting value
0        GLMS_BATCH_ROWS                   2000
1    GLMS_CONV_TOLERANCE                  .0001
2    GLMS_NUM_ITERATIONS                    500
3  GLMS_RIDGE_REGRESSION  GLMS_RIDGE_REG_ENABLE
4       GLMS_RIDGE_VALUE                    .01

Global Statistics: 
            attribute name  attribute value
0        ADJUSTED_R_SQUARE          0.94175
1                      AIC         -348.764
2                COEFF_VAR          15.7316
3                CONVERGED               NO
4       CORRECTED_TOTAL_DF              103
5         CORRECTED_TOT_SS          58.4565
6           DEPENDENT_MEAN          1.15577
7                 ERROR_DF               98
8        ERROR_MEAN_SQUARE         0.033059
9        ERROR_SUM_SQUARES          3.23979
10                 F_VALUE          324.347
11                   GMSEP         0.035097
12              HOCKING_SP         0.000341
13                     J_P         0.034966
14                MODEL_DF                5
15         MODEL_F_P_VALUE                0
16       MODEL_MEAN_SQUARE          10.7226
17       MODEL_SUM_SQUARES           53.613
18              NUM_PARAMS                6
19                NUM_ROWS              104
20         RANK_DEFICIENCY                0
21            ROOT_MEAN_SQ          0.181821
22                    R_SQ          0.944578
23                    SBIC         -332.898
24 VALID_COVARIANCE_MATRIX               NO

[1 rows x 25 columns]

Attributes: 
Petal_Length
Sepal_Length
Sepal_Width
Species

Partition: NO

Coefficients: 

           name       level  estimate
0   (Intercept)        None -0.338046
1  Petal_Length        None  0.378658
2  Sepal_Length        None -0.084440
3   Sepal_Width        None  0.137150
4       Species  versicolor  0.151916
5       Species   virginica  0.337535

Fit Details:

                       name         value
0         ADJUSTED_R_SQUARE  9.417502e-01
1                       AIC -3.487639e+02
2                 COEFF_VAR  1.573164e+01
3        CORRECTED_TOTAL_DF  1.030000e+02
...                     ...           ...
21             ROOT_MEAN_SQ  1.818215e-01
22                     R_SQ  9.445778e-01
23                     SBIC -3.328975e+02
24  VALID_COVARIANCE_MATRIX  0.000000e+00

Rank:

6

Deviance: 

3.239787

AIC: 

-349

Null Deviance: 

58.456538

Prior Weights: 

1

DF Residual: 

98.0

DF Null: 

103.0

Converged: 

False
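Several of the Global Statistics in the listing are related through standard linear regression identities. The sketch below recomputes a few of them for the first model (the GLMS_SOLVER_QR fit) from the sums of squares and row counts shown in its output. The formulas are the usual textbook definitions and are assumed, not confirmed, to match how the database computes these values. The variable names are chosen for illustration.

```python
import math

# Values copied from the first model's output in the listing.
num_rows = 104
num_params = 6
error_ss = 2.801309      # ERROR_SUM_SQUARES (also shown as Deviance)
total_ss = 58.456538     # CORRECTED_TOT_SS (also shown as Null Deviance)

# Degrees of freedom.
error_df = num_rows - num_params     # ERROR_DF
total_df = num_rows - 1              # CORRECTED_TOTAL_DF
model_df = num_params - 1            # MODEL_DF

# Goodness-of-fit statistics derived from the sums of squares.
r_sq = 1.0 - error_ss / total_ss
adjusted_r_sq = 1.0 - (1.0 - r_sq) * total_df / error_df
error_mean_square = error_ss / error_df
model_mean_square = (total_ss - error_ss) / model_df
f_value = model_mean_square / error_mean_square
root_mean_sq = math.sqrt(error_mean_square)

print("R_SQ              ", round(r_sq, 6))
print("ADJUSTED_R_SQUARE ", round(adjusted_r_sq, 6))
print("ERROR_MEAN_SQUARE ", round(error_mean_square, 6))
print("MODEL_MEAN_SQUARE ", round(model_mean_square, 4))
print("F_VALUE           ", round(f_value, 3))
print("ROOT_MEAN_SQ      ", round(root_mean_sq, 5))
```

Comparing the printed values with the Global Statistics table is a quick sanity check when reading model details.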