20 Generalized Linear Model
Learn how to use the Generalized Linear Model (GLM) statistical technique for linear modeling.
Oracle Machine Learning for SQL supports GLM for regression and binary classification.
20.1 About Generalized Linear Model
The Generalized Linear Model (GLM) includes and extends the class of linear models, addressing and accommodating some of their restrictive assumptions.
Linear models make a set of restrictive assumptions, most importantly, that the target (dependent variable y) is normally distributed conditioned on the value of predictors with a constant variance regardless of the predicted response value. The advantages of linear models and their restrictions include computational simplicity, an interpretable model form, and the ability to compute certain diagnostic information about the quality of the fit.
GLM relaxes these restrictions, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes. Furthermore, the sum of terms in a linear model can typically have a very large range, encompassing very negative and very positive values. For the binary response example, we would like the response to be a probability in the range [0,1].
GLM accommodates responses that violate the linear model assumptions through two mechanisms: a link function and a variance function. The link function transforms the target range to potentially -infinity to +infinity so that the simple form of linear models can be maintained. The variance function expresses the variance as a function of the predicted response, thereby accommodating responses with nonconstant variances (such as the binary responses).
Oracle Machine Learning for SQL includes two of the most popular members of the GLM family of models with their most popular link and variance functions:

Linear regression with the identity link and variance function equal to the constant 1 (constant variance over the range of response values).

Logistic regression
In other words, the methods of linear regression assume that the target value ranges from minus infinity to infinity and that the target variance is constant over the range. The logistic regression target is either 0 or 1. A logistic regression model estimate is a probability. The job of the link function in logistic regression is to transform the target value into the required range, minus infinity to infinity.
GLM Function  Default Link Function  Other Supported Link Functions
Linear regression (gaussian)  identity  none
Logistic regression (binomial)  logit  probit, cloglog, cauchit, and binomial variance
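As a rough illustration of the two mechanisms described above, the following Python sketch implements the identity and logit links together with their variance functions. The function names are illustrative only and are not part of any Oracle API.

```python
import math

def identity_link(mu):
    """Identity link used by linear regression: the target is left unchanged."""
    return mu

def logit_link(p):
    """Logit link: maps a probability in (0, 1) to (-infinity, +infinity)."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def gaussian_variance(mu):
    """Constant variance function (equal to 1) for linear regression."""
    return 1.0

def binomial_variance(p):
    """Binomial variance function: depends on the predicted probability."""
    return p * (1.0 - p)

# The variance of a binary response changes with the predicted probability.
for p in (0.1, 0.5, 0.9):
    eta = logit_link(p)
    print(p, round(eta, 4), round(inverse_logit(eta), 4), binomial_variance(p))
```

The loop shows that the link scale is unbounded while the variance peaks at a probability of 0.5, which is the nonconstant variance that the variance function accounts for.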
20.2 GLM in Oracle Machine Learning
Learn how Oracle Machine Learning implements the Generalized Linear Model (GLM) algorithm.
GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than nonparametric models.
The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.
20.2.1 Interpretability and Transparency
You can interpret and understand key characteristics of a Generalized Linear Model (GLM) model through model details and global details.
You can interpret Oracle Machine Learning's GLM with ease. Each model build generates many statistics and diagnostics. Transparency is also a key feature: model details describe key characteristics of the coefficients, and global details provide high-level statistics.
20.2.2 Wide Data
Generalized Linear Model (GLM) handles wide data efficiently, building quality models with numerous predictors.
GLM in Oracle Machine Learning is uniquely suited for handling wide data. The algorithm can build and score quality models that use a virtually limitless number of predictors (attributes). The only constraints are those imposed by system resources.
20.2.3 Confidence Bounds
Predict confidence bounds through the Generalized Linear Model (GLM) algorithm.
GLM can predict confidence bounds. In addition to predicting a best estimate and a probability (classification only) for each row, GLM identifies an interval wherein the prediction (regression) or probability (classification) lies. The width of the interval depends upon the precision of the model and a user-specified confidence level.
The confidence level is a measure of how sure the model is that the true value lies within a confidence interval computed by the model. A popular choice for confidence level is 95%. For example, a model might predict that an employee's income is $125K, and that you can be 95% sure that it lies between $90K and $160K. Oracle Machine Learning for SQL supports 95% confidence by default, but that value can be configured.
Note:
Confidence bounds are returned with the coefficient statistics. You can also use the PREDICTION_BOUNDS SQL function to obtain the confidence bounds of a model prediction.
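To make the relationship between confidence level and interval width concrete, here is a minimal Python sketch. It assumes a normal approximation and a hypothetical standard error of 17,857 chosen to roughly match the income example above. It does not reproduce how Oracle Machine Learning computes its bounds.

```python
from statistics import NormalDist

def confidence_bounds(estimate, std_error, level=0.95):
    """Two-sided bounds under a normal approximation (an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical prediction and standard error, for illustration only.
lower, upper = confidence_bounds(125_000, 17_857, level=0.95)
print(round(lower), round(upper))

# A higher confidence level widens the interval.
lower_99, upper_99 = confidence_bounds(125_000, 17_857, level=0.99)
print(round(lower_99), round(upper_99))
```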
20.2.4 Ridge Regression
Understand the use of ridge regression for singularity (exact multicollinearity) in data.
The best regression models are those in which the predictors correlate highly with the target, but there is very little correlation between the predictors themselves. Multicollinearity is the term used to describe multivariate regression with correlated predictors.
Ridge regression is a technique that compensates for multicollinearity. Oracle Machine Learning for SQL supports ridge regression for both regression and classification machine learning techniques. The algorithm automatically uses ridge if it detects singularity (exact multicollinearity) in the data.
Information about singularity is returned in the global model details.
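The effect of ridge on singular data can be sketched in a few lines of Python. The example below is a textbook closed-form ridge solution for two predictors and no intercept, not Oracle's implementation. The function name and the data are hypothetical.

```python
def ridge_fit_two_predictors(x1, x2, y, ridge):
    """Solve (X'X + ridge * I) b = X'y for two predictors and no intercept."""
    a = sum(v * v for v in x1) + ridge
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + ridge
    c1 = sum(u * v for u, v in zip(x1, y))
    c2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("singular system: predictors are exactly collinear")
    return (d * c1 - b * c2) / det, (a * c2 - b * c1) / det

# x2 is exactly 2 * x1, so the unpenalized system (ridge = 0) is singular.
x1 = [1.0, 2.0, 3.0]
x2 = [2.0, 4.0, 6.0]
y = [5.0, 10.0, 15.0]

# A small ridge parameter makes the system solvable.
print(ridge_fit_two_predictors(x1, x2, y, ridge=0.1))
```

With a ridge parameter of 0 the call raises an error because the two predictors carry the same information. Adding the penalty to the diagonal yields a unique, stable solution.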
20.2.4.1 Configuring Ridge Regression
Configure ridge regression through build settings.
You can choose to explicitly enable ridge regression by specifying a build setting for the model. If you explicitly enable ridge, you can use the system-generated ridge parameter or you can supply your own. If ridge is used automatically, the ridge parameter is also calculated automatically.
The configuration choices are summarized as follows:

Whether or not to override the automatic choice made by the algorithm regarding ridge regression

The value of the ridge parameter, used only if you specifically enable ridge regression.
See Also:
Oracle Database PL/SQL Packages and Types Reference for a listing and explanation of the available model settings.
Note:
The term hyperparameter is also used interchangeably with model setting.
20.2.4.2 Ridge and Confidence Bounds
Models built with ridge regression do not support confidence bounds.
20.2.4.3 Ridge and Data Preparation
Learn about preparing data for ridge regression.
When ridge regression is enabled, different data preparation is likely to produce different results in terms of model coefficients and diagnostics. Oracle recommends that you enable Automatic Data Preparation for Generalized Linear Model models, especially when ridge regression is used.
20.3 Scalable Feature Selection
Oracle Machine Learning supports a highly scalable and automated version of feature selection and generation for the Generalized Linear Model algorithm.
This scalable and automated capability can enhance the performance of the algorithm and improve accuracy and interpretability. Feature selection and generation are available for both linear regression and binary logistic regression.
20.3.1 Feature Selection
Feature selection in GLM simplifies models, enhancing interpretability and accuracy by removing irrelevant predictors.
Feature selection is the process of choosing the terms to be included in the model. The fewer terms in the model, the easier it is for human beings to interpret its meaning. In addition, some columns may not be relevant to the value that the model is trying to predict. Removing such columns can enhance model accuracy.
20.3.1.1 Configuring Feature Selection
GLM configured for feature selection automatically determines the default behavior of the model.
Feature selection is a build setting for Generalized Linear Model models. It is not enabled by default. When configured for feature selection, the algorithm automatically determines appropriate default behavior, but the following configuration options are available:

The feature selection criteria can be AIC, SBIC, RIC, or α-investing. When the feature selection criterion is α-investing, feature acceptance can be either strict or relaxed.

The maximum number of features can be specified.

Features can be pruned in the final model. Pruning is based on t-statistics for linear regression or Wald statistics for logistic regression.
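To show how an information criterion trades fit against model size, the following Python sketch uses the textbook definitions of AIC and SBIC. The candidate models and their log-likelihoods are hypothetical, and the source does not state the exact formulas Oracle Machine Learning applies. RIC and α-investing are not shown.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike's information criterion (textbook form); lower is better."""
    return 2.0 * num_params - 2.0 * log_likelihood

def sbic(log_likelihood, num_params, num_rows):
    """Schwarz's Bayesian information criterion (textbook form); lower is better."""
    return num_params * math.log(num_rows) - 2.0 * log_likelihood

# Hypothetical candidates: (name, log-likelihood, number of parameters).
candidates = [("3 features", -520.0, 4), ("10 features", -515.0, 11)]
num_rows = 1000

best = min(candidates, key=lambda c: sbic(c[1], c[2], num_rows))
print(best[0])
```

The larger model fits slightly better, but the penalty for its extra parameters outweighs the gain, so the smaller model is preferred.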
20.3.1.2 Feature Selection and Ridge Regression
Choose between feature selection and ridge regression to configure GLM models.
Feature selection and ridge regression are mutually exclusive. When feature selection is enabled, the algorithm cannot use ridge.
Note:
If you configure the model to use both feature selection and ridge regression, then you get an error.
20.3.2 Feature Generation
Feature generation in GLM adds transformed terms, fitting complex relationships between target and predictors.
Feature generation is the process of adding transformations of terms into the model. Feature generation enhances the power of models to fit more complex relationships between target and predictors.
20.3.2.1 Configuring Feature Generation
Learn about configuring feature generation.
Feature generation is only possible when feature selection is enabled. Feature generation is a build setting. By default, feature generation is not enabled.
The feature generation method can be either quadratic or cubic. By default, the algorithm chooses the appropriate method. You can also explicitly specify the feature generation method.
The following options for feature selection also affect feature generation:

Maximum number of features

Model pruning
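As a generic illustration of quadratic feature generation, the sketch below expands a row of predictors with all squares and pairwise products. The source does not specify which candidate terms the algorithm generates, so this expansion is an assumption, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement

def quadratic_terms(row):
    """Return the original predictors followed by all degree-2 products."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products

# Two predictors yield three quadratic terms: x1*x1, x1*x2, x2*x2.
print(quadratic_terms([2.0, 3.0]))
```

Because the number of generated terms grows quickly with the number of predictors, the maximum number of features and pruning options become important when feature generation is enabled.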
20.4 Tuning and Diagnostics for GLM
Tuning and diagnostics in GLM help optimize model performance and quality through detailed evaluations.
The process of developing a Generalized Linear Model machine learning model typically involves a number of model builds. Each build generates many statistics that you can evaluate to determine the quality of your model. Depending on these diagnostics, you may want to try changing the model settings or making other modifications.
20.4.1 Build Settings
Specify the build settings for Generalized Linear Model (GLM).
You can specify build settings for GLM.
Additional build settings are available to:

Control the use of ridge regression.

Specify the handling of missing values in the training data.

Specify the target value to be used as a reference in a logistic regression model.
See Also:
DBMS_DATA_MINING - Algorithm Settings: Generalized Linear Models for a listing and explanation of the available model settings.
Note:
The term hyperparameter is also used interchangeably with model setting.
20.4.2 Diagnostics
A Generalized Linear Model model generates many metrics to help you evaluate the quality of the model.
20.4.2.1 Coefficient Statistics
Learn about coefficient statistics for linear and logistic regression.
The same set of statistics is returned for both linear and logistic regression, but statistics that do not apply to the machine learning technique are returned as NULL.
Coefficient statistics are returned by the model detail views for a Generalized Linear Model (GLM) model.
20.4.2.2 Global Model Statistics
Learn about high-level statistics describing the model.
Separate high-level statistics describing the model as a whole are returned for linear and logistic regression. When ridge regression is enabled, fewer global details are returned.
Global statistics are returned by the model detail views for a Generalized Linear Model model.
20.4.2.3 Row Diagnostics
Generate row statistics by configuring the Generalized Linear Model (GLM) algorithm.
GLM generates per-row statistics if you specify the name of a diagnostics table in the build setting GLMS_DIAGNOSTICS_TABLE_NAME.
GLM requires a case ID to generate row diagnostics. If you provide the name of a diagnostic table but the data does not include a case ID column, an exception is raised.
20.5 GLM Solvers
The Generalized Linear Model (GLM) algorithm applies different solvers. These solvers employ different approaches for optimization.
The GLM algorithm supports four different solvers: Cholesky, QR, Stochastic Gradient Descent (SGD), and Alternating Direction Method of Multipliers (ADMM) (on top of L-BFGS). The Cholesky and QR solvers employ classical decomposition approaches. The Cholesky solver is faster than the QR solver but less stable numerically. The QR solver handles rank-deficient problems better without the help of regularization.
The SGD and ADMM (on top of L-BFGS) solvers are best suited for large-scale data. The SGD solver employs the stochastic gradient descent optimization algorithm, while ADMM (on top of L-BFGS) uses the Broyden-Fletcher-Goldfarb-Shanno optimization algorithm within an Alternating Direction Method of Multipliers framework. The SGD solver is fast but is sensitive to parameters and requires suitably scaled data to achieve good convergence. The L-BFGS algorithm solves unconstrained optimization problems and is more stable and robust than SGD. Also, L-BFGS is used in conjunction with ADMM, which results in an efficient distributed optimization approach with low communication cost.
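The following Python sketch shows the basic idea of stochastic gradient descent on a tiny least-squares problem. It is a generic illustration, not the solver used by Oracle Machine Learning. The learning rate, epoch count, and data are hypothetical, and the predictor is deliberately kept in the range [0, 1] because the method is sensitive to scaling.

```python
import random

def sgd_linear_fit(xs, ys, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = intercept + slope * x by stochastic gradient descent."""
    rng = random.Random(seed)
    intercept, slope = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the rows in a random order each epoch
        for i in order:
            error = (intercept + slope * xs[i]) - ys[i]
            # Gradient step on the squared error of a single row.
            intercept -= learning_rate * error
            slope -= learning_rate * error * xs[i]
    return intercept, slope

# Noise-free data generated from y = 1 + 2x, with x already scaled to [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0 + 2.0 * x for x in xs]
print(sgd_linear_fit(xs, ys))
```

Each update touches a single row, which is why this family of solvers scales to large data. The same step size applied to unscaled predictors with large magnitudes could diverge.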
20.6 Data Preparation for GLM
Learn about preparing data for the Generalized Linear Model (GLM) algorithm.
Automatic Data Preparation (ADP) implements suitable data transformations for both linear and logistic regression.
See Also:
DBMS_DATA_MINING - Algorithm Settings: Generalized Linear Models for a listing and explanation of the available model settings.
Note:
The term hyperparameter is also used interchangeably with model setting.
Oracle recommends that you use ADP with GLM.
20.6.1 Data Preparation for Linear Regression
ADP ensures optimal data transformations for linear regression, enhancing model accuracy.
When ADP is enabled, the algorithm chooses a transformation based on input data properties and other settings. The transformation can include one or more of the following for numerical data: subtracting the mean, scaling by the standard deviation, or performing a correlation transformation (Neter et al., 1990). If the correlation transformation is applied to numeric data, it is also applied to categorical attributes.
Prior to standardization, categorical attributes are exploded into N-1 columns where N is the attribute cardinality. The most frequent value (mode) is omitted during the explosion transformation. In the case of highest frequency ties, the attribute values are sorted alphanumerically in ascending order, and the first value on the list is omitted during the explosion. This explosion transformation occurs whether or not ADP is enabled.
In the case of high cardinality categorical attributes, the described transformations (explosion followed by standardization) can increase the build data size because the resulting data representation is dense. To reduce memory, disk space, and processing requirements, use an alternative approach. Under these circumstances, the VIF statistic must be used with caution.
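The explosion rule described above can be sketched in Python as follows. The function name is hypothetical, the ordering of the retained columns is an assumption, and the subsequent standardization step is not shown.

```python
from collections import Counter

def explode_categorical(values):
    """Explode a categorical attribute into N-1 indicator columns, omitting the mode."""
    counts = Counter(values)
    top = max(counts.values())
    # Ties for the highest frequency: sort ascending and omit the first value.
    omitted = sorted(v for v, c in counts.items() if c == top)[0]
    kept = sorted(v for v in counts if v != omitted)
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return omitted, kept, rows

# "a" and "b" tie for the highest frequency, so "a" is omitted.
omitted, kept, rows = explode_categorical(["b", "a", "b", "a", "c"])
print(omitted, kept)
print(rows)
```

An attribute with N distinct values produces N-1 columns, which is why high cardinality attributes can inflate the size of the build data.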
See Also:

Neter, J., Wasserman, W., and Kutner, M.H., "Applied Linear Statistical Models", Richard D. Irwin, Inc., Burr Ridge, IL, 1990.
20.6.2 Data Preparation for Logistic Regression
ADP optimizes data for logistic regression, standardizing numerical attributes and exploding categorical attributes.
Categorical attributes are exploded into N-1 columns where N is the attribute cardinality. The most frequent value (mode) is omitted during the explosion transformation. In the case of highest frequency ties, the attribute values are sorted alphanumerically in ascending order and the first value on the list is omitted during the explosion. This explosion transformation occurs whether or not Automatic Data Preparation (ADP) is enabled.
When ADP is enabled, numerical attributes are scaled by the standard deviation. This measure of variability is computed as the standard deviation per attribute with respect to the origin (not the mean) (Marquardt, 1980).
See Also:
Marquardt, D.W., "A Critique of Some Ridge Regression Methods: Comment", Journal of the American Statistical Association, Vol. 75, No. 369, 1980, pp. 87-91.
20.6.3 Missing Values
GLM automatically replaces missing values.
When building or applying a model, Oracle Machine Learning automatically replaces missing values of numerical attributes with the mean and missing values of categorical attributes with the mode.
You can configure the Generalized Linear Model algorithm to override the default treatment of missing values. With the ODMS_MISSING_VALUE_TREATMENT setting, you can cause the algorithm to delete rows in the training data that have missing values instead of replacing them with the mean or the mode. However, when the model is applied, OML4SQL performs the usual mean/mode missing value replacement. As a result, it is possible that the statistics generated from scoring do not match the statistics generated from building the model.
If you want to delete rows with missing values when scoring the model, you must perform the transformation explicitly. To make build and apply statistics match, you must remove the rows with NULLs from the scoring data before performing the apply operation. You can do this by creating a view.
CREATE VIEW viewname AS
  SELECT * FROM tablename
  WHERE column_name1 IS NOT NULL
    AND column_name2 IS NOT NULL
    AND column_name3 IS NOT NULL .....
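The two treatments of missing values can be contrasted with a small Python sketch. It is a simplified illustration that uses None for missing values. The function names are hypothetical and the code does not reflect how OML4SQL performs the replacement internally.

```python
from collections import Counter
from statistics import mean

def replace_missing(column):
    """Replace None with the mean (numerical) or the mode (categorical)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def delete_rows_with_missing(rows):
    """Keep only the rows in which every value is present."""
    return [row for row in rows if all(v is not None for v in row)]

print(replace_missing([1.0, None, 3.0]))
print(replace_missing(["a", None, "a", "b"]))
print(delete_rows_with_missing([(1.0, "a"), (None, "b"), (3.0, None)]))
```

If the build data is filtered with the second function while the scoring data is treated with the first, the two data sets are prepared differently, which is the source of the mismatch in statistics described above.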
Note:
In OML4SQL, missing values in nested data indicate sparsity, not values missing at random.
The value ODMS_MISSING_VALUE_DELETE_ROW is only valid for tables without nested columns. If this value is used with nested data, an exception is raised.
20.7 Linear Regression
GLM supports linear regression, assuming no target transformation and constant variance over target values.
Oracle Machine Learning supports linear regression as the Generalized Linear Model regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values. The algorithm uses the identity link function.
20.7.1 Poisson and Variance Link Function
The Poisson distribution describes the number of occurrences of an event in a given time interval. It is a count distribution, used when the variable of interest is a discrete count variable.
For example, how many times per month will a grocery product be purchased? How many phone calls will be made per hour on the network? The predictors are the conditions that affect the average number of events. The link function is in the following form:
g(μ) = ln(μ) = β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3} + ... + β_{n}x_{n}
where μ is the average event count.
The variance function is in the following form:
Var(μ)=μ
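A short Python sketch of the log link and the Poisson variance function follows. The coefficients and predictor values are hypothetical and the function names are illustrative.

```python
import math

def poisson_mean(coefficients, predictors):
    """Inverse of the log link: mu = exp(b0 + b1*x1 + ... + bn*xn)."""
    eta = coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], predictors))
    return math.exp(eta)

def poisson_variance(mu):
    """Poisson variance function: the variance equals the mean."""
    return mu

# Hypothetical intercept and two coefficients, applied to two predictor values.
mu = poisson_mean([0.5, 0.2, -0.1], [3.0, 2.0])
print(mu, poisson_variance(mu))
```

Because the mean is the exponential of the linear predictor, the predicted count is always positive, whatever the sign of the linear combination.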
20.7.2 Negative Binomial Link Function and Variance
In the Poisson distribution, the variance is equal to the mean. Sometimes, however, the variance of the counts is larger than the predicted mean. This occurrence in count data analysis is called overdispersion. Because the consequences of ignoring overdispersion are potentially severe, models such as negative binomial regression can be applied.
The link function is in the following form:
g(μ) = ln(μ) = β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3} + ... + β_{n}x_{n}
where μ is the average event count.
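A quick way to see overdispersion is to compare the sample variance of the counts with their mean. The Python sketch below does this and also shows one common textbook variance function for the negative binomial model, often called NB2. The source does not state which parameterization Oracle Machine Learning uses, so the alpha parameter and the function names are assumptions.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by the mean; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)

def negative_binomial_variance(mu, alpha):
    """NB2 variance function (an assumed parameterization): mu + alpha * mu**2."""
    return mu + alpha * mu ** 2

# Hypothetical counts with one unusually large value.
counts = [0, 0, 1, 1, 10]
print(dispersion_ratio(counts))
print(negative_binomial_variance(2.0, 0.5))
```

With alpha equal to 0 the variance function reduces to the Poisson case, so the extra parameter only adds variance beyond the mean.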
20.7.3 Coefficient Statistics for Linear Regression
Lists coefficient statistics for linear regression.
Generalized Linear Model regression models generate the following coefficient statistics:

Linear coefficient estimate

Standard error of the coefficient estimate

t-value of the coefficient estimate

Probability of the t-value

Variance Inflation Factor (VIF)

Standardized estimate of the coefficient

Lower and upper confidence bounds of the coefficient
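Two of the statistics in this list have simple textbook definitions, sketched below in Python. The function names and inputs are hypothetical, and the code is not the computation performed by Oracle Machine Learning.

```python
def t_value(estimate, std_error):
    """t-value of a coefficient: the estimate divided by its standard error."""
    return estimate / std_error

def variance_inflation_factor(r_squared):
    """VIF of a predictor, given the R-square from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)

print(t_value(2.0, 0.5))
print(variance_inflation_factor(0.9))
```

A VIF of 1 means the predictor is uncorrelated with the others. Larger values indicate that multicollinearity is inflating the variance of the coefficient estimate.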
20.7.4 Global Model Statistics for Linear Regression
Lists global model statistics for linear regression.
Generalized Linear Model regression models generate the following statistics that describe the model as a whole:

Model degrees of freedom

Model sum of squares

Model mean square

Model F statistic

Model F value probability

Error degrees of freedom

Error sum of squares

Error mean square

Corrected total degrees of freedom

Corrected total sum of squares

Root mean square error

Dependent mean

Coefficient of variation

R-Square

Adjusted R-Square

Akaike's information criterion

Schwarz's Bayesian information criterion

Estimated mean square error of the prediction

Hocking Sp statistic

JP statistic (the final prediction error)

Number of parameters (the number of coefficients, including the intercept)

Number of rows

Whether or not the model converged

Whether or not a covariance matrix was computed
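Several of these statistics follow directly from the actual and predicted target values. The Python sketch below uses the standard textbook formulas. The function name, dictionary keys, and data are hypothetical, and the values returned by Oracle Machine Learning may be computed differently.

```python
import math

def fit_summary(actual, predicted, num_params):
    """Compute a few global fit statistics from actual and predicted values."""
    n = len(actual)
    dependent_mean = sum(actual) / n
    error_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - dependent_mean) ** 2 for a in actual)
    error_df = n - num_params  # num_params includes the intercept
    r_square = 1.0 - error_ss / total_ss
    return {
        "error_sum_of_squares": error_ss,
        "error_mean_square": error_ss / error_df,
        "root_mean_square_error": math.sqrt(error_ss / error_df),
        "dependent_mean": dependent_mean,
        "r_square": r_square,
        "adjusted_r_square": 1.0 - (1.0 - r_square) * (n - 1) / error_df,
    }

summary = fit_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], num_params=2)
for name, value in summary.items():
    print(name, round(value, 4))
```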
20.7.5 Row Diagnostics for Linear Regression
The diagnostics table for GLM regression models provides detailed row-level insights.
For linear regression, the diagnostics table has the columns described in the following table. All the columns are NUMBER, except the CASE_ID column, which preserves the type from the training data.
Table 20-1 Diagnostics Table for GLM Regression Models
Column  Description 


Value of the case ID column 

Value of the target column 

Value predicted by the model for the target 

Value of the diagonal element of the hat matrix 

Measure of error 

Standard error of the residual 

Studentized residual 

Predicted residual 

Cook's D influence statistic 
20.8 Logistic Regression
GLM implements binary logistic regression, transforming target values into a probability scale for classification.
Oracle Machine Learning supports binary logistic regression as a Generalized Linear Model classification algorithm. Link and variance functions are the mechanisms that allow GLM to handle regression targets that depart in known ways from normality.
In logistic regression, a link function relates the explanatory variables (covariates) to the expectation of the response variable. Binomial regression predicts the probability of a success by applying the inverse of a specified link function to a linear combination of covariates. The inverse link function can be any monotonically increasing function that maps values from the range (-∞, ∞) to [0,1]. The inverse link function is created from cumulative distribution functions (CDFs) of well-known random distributions.
The variance has a known functional relationship with the probability, and a binary target probability varies between zero and one. For logistic regression, the variance function is fixed to its known functional relationship with probability. However, there are other options for the link function.
The link function not only transforms the target range into a linear-methods-friendly format, but it also represents a target concept. The analyst can use the target concept to interpret a forecast on two scales: the link scale and the transformed scale. The transformed scale in logistic regression is probability.
20.8.1 Logit Link Function
The logit link transforms a probability into the log of the odds ratio. The odds ratio is the ratio of the predicted probability of the positive class to the predicted probability of the negative class. The log of the odds ratio has the appropriate range.
The odds ratio is a measure of the evidence for or against the positive target class. Odds ratios can be associated with a particular predictor value. Odds ratios are naturally multiplicative, which makes the log of odds ratios additive. The log-odds ratio interprets the influence of a predictor as additive evidence for or against the positive class.
An advantage of the logit link is that the training data can be sampled independently from the two classes. This can be very significant in cases in which one class is rare or costly, such as the instances of a disease. Analysis of disease factors can be done directly from a sample of healthy people and a sample of people with the disease. This type of sampling is known as retrospective sampling.
For logistic regression, the logit link is the default. For technical reasons, this link is called the canonical link.
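The additive nature of log-odds can be illustrated with a short Python sketch. The baseline probability and the coefficient are hypothetical, and the function names are illustrative.

```python
import math

def odds(p):
    """Ratio of the probability of the positive class to that of the negative class."""
    return p / (1.0 - p)

def probability_from_log_odds(log_odds):
    """Inverse logit: converts log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Baseline probability of 0.2 corresponds to odds of 0.25.
baseline_log_odds = math.log(odds(0.2))

# Hypothetical coefficient: adding ln(2) on the log-odds scale doubles the odds.
coefficient = math.log(2.0)
updated = probability_from_log_odds(baseline_log_odds + coefficient)
print(odds(0.2), odds(updated), updated)
```

Adding the coefficient on the link scale multiplies the odds by its exponential, which is why exponentiated coefficients are convenient to interpret.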
20.8.2 Probit Link Function
Probit link uses standard normal distribution to transform target values, ideal for normally distributed targets.
One approach to transforming the range of a probability to the range minus infinity to infinity is to choose a probability distribution that is defined on that range and assign the distribution value that corresponds to the probability as the target value. For example, the probabilities 0, 0.5, and 1.0 correspond to the values -infinity, 0, and +infinity in a standard normal distribution. An inverse cumulative distribution function is a function that determines the value that corresponds to a probability. In this approach, a user matches the particular probability distribution to assumptions regarding the distribution of the target. Users often find it natural to transform a target using the target's known associated distribution. The probit link takes this approach, using the standard normal distribution. An example use case is an analysis of high blood pressure. Blood pressure is assumed to have a normal distribution.
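The probit link and its inverse can be sketched with the standard normal distribution from the Python standard library. The function names are illustrative.

```python
from statistics import NormalDist

_standard_normal = NormalDist()  # mean 0, standard deviation 1

def probit_link(p):
    """Inverse CDF of the standard normal; defined for 0 < p < 1."""
    return _standard_normal.inv_cdf(p)

def inverse_probit(eta):
    """CDF of the standard normal: maps a linear predictor to a probability."""
    return _standard_normal.cdf(eta)

for p in (0.025, 0.5, 0.975):
    eta = probit_link(p)
    print(p, round(eta, 4), round(inverse_probit(eta), 4))
```

Probabilities of exactly 0 and 1 map to minus infinity and plus infinity, so the sketch accepts only values strictly between them.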
20.8.3 Cloglog Link Function
Cloglog link models extreme events effectively, transforming target values using Gumbel distribution.
The Complementary Log-Log (cloglog) link is another example of using an inverse cumulative distribution function to transform the target. It differs from the logit and probit functions because it is asymmetric. It works best when the chance of an event is extremely low or extremely high. Gumbel described these extreme value distributions. The cloglog model is closely related to continuous-time models for event occurrence. The cloglog link function corresponds to the Gumbel CDF. The precipitation from the worst rainstorm in 100 years is an example of data that follows an extreme value distribution (the hundred year rain).
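The asymmetry of the cloglog link is easy to check numerically. The Python sketch below uses the standard definition of the link. The function names are illustrative.

```python
import math

def cloglog_link(p):
    """Complementary log-log link: log(-log(1 - p)), defined for 0 < p < 1."""
    return math.log(-math.log(1.0 - p))

def inverse_cloglog(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 - math.exp(-math.exp(eta))

# A symmetric link would give values of equal size and opposite sign for p and 1 - p.
print(cloglog_link(0.1), cloglog_link(0.9))
print(cloglog_link(0.5))
```

Unlike the logit and probit links, a probability of 0.5 does not map to 0, and the values for 0.1 and 0.9 are not mirror images of each other.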
20.8.4 Cauchit Link Function
Cauchit link uses the Cauchy distribution to transform target values, suitable for data with infinite variance.
The Cauchit link is another application of an inverse cumulative distribution function to transform the target. In this case, the distribution is the Cauchy distribution. The Cauchy distribution is symmetric; however, it has infinite variance. An infinite variance means the probability decays slowly as the values become more extreme. Such distributions are called fat-tailed. The Cauchit link is often used where fewer assumptions are justified with respect to the distribution of the target. The Cauchit link is used to measure data in binomial form when the variance is not considered to be finite.
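The fat tails of the Cauchy distribution can be seen by comparing the inverse cauchit with the inverse logit at an extreme value of the linear predictor. The Python sketch below uses the standard Cauchy CDF. The function names are illustrative.

```python
import math

def cauchit_link(p):
    """Inverse CDF of the standard Cauchy distribution, defined for 0 < p < 1."""
    return math.tan(math.pi * (p - 0.5))

def inverse_cauchit(eta):
    """CDF of the standard Cauchy distribution."""
    return 0.5 + math.atan(eta) / math.pi

def inverse_logit(eta):
    """Inverse logit, included only for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))

# At an extreme linear predictor, the cauchit probability decays much more slowly.
print(inverse_cauchit(-10.0), inverse_logit(-10.0))
```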
20.8.5 Reference Class
Specify the reference class for binary logistic regression in GLM to improve prediction accuracy.
You can use the build setting GLMS_REFERENCE_CLASS_NAME to specify the target value to be used as a reference in a binary logistic regression model. Probabilities are produced for the other (non-reference) class. By default, the algorithm chooses the value with the highest prevalence. If there are ties, the attributes are sorted alphanumerically in an ascending order.
20.8.6 Class Weights
Use class weights to influence target class weighting during model building in GLM.
You can use the build setting CLAS_WEIGHTS_TABLE_NAME to specify the name of a class weights table. Class weights influence the weighting of target classes during the model build.
20.8.7 Coefficient Statistics for Logistic Regression
GLM provides detailed coefficient statistics for logistic regression, aiding in model evaluation.
Generalized Linear Model classification models generate the following coefficient statistics:

Name of the predictor

Coefficient estimate

Standard error of the coefficient estimate

Wald chi-square value of the coefficient estimate

Probability of the Wald chi-square value

Standardized estimate of the coefficient

Lower and upper confidence bounds of the coefficient

Exponentiated coefficient

Exponentiated coefficient for the upper and lower confidence bounds of the coefficient
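Several of these statistics can be derived from a coefficient estimate and its standard error. The Python sketch below uses the textbook Wald formulas with normal-based bounds. The source does not state the exact formulas Oracle Machine Learning uses, and the function name, dictionary keys, and input values are hypothetical.

```python
import math
from statistics import NormalDist

def wald_summary(estimate, std_error, level=0.95):
    """Textbook Wald statistics for a single logistic regression coefficient."""
    z = estimate / std_error
    margin = NormalDist().inv_cdf(0.5 + level / 2.0) * std_error
    lower, upper = estimate - margin, estimate + margin
    return {
        "wald_chi_square": z * z,
        # Upper tail of a chi-square distribution with 1 degree of freedom.
        "probability": math.erfc(abs(z) / math.sqrt(2.0)),
        "lower_bound": lower,
        "upper_bound": upper,
        "exp_coefficient": math.exp(estimate),
        "exp_lower_bound": math.exp(lower),
        "exp_upper_bound": math.exp(upper),
    }

stats = wald_summary(1.0, 0.5)
for name, value in stats.items():
    print(name, round(value, 4))
```

The exponentiated values are on the odds scale, so an exponentiated coefficient above 1 indicates that the predictor increases the odds of the non-reference class.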
20.8.8 Global Model Statistics for Logistic Regression
GLM generates global statistics for logistic regression, supporting model assessment.
Generalized Linear Model classification models generate the following statistics that describe the model as a whole:

Akaike's criterion for the fit of the intercept only model

Akaike's criterion for the fit of the intercept and the covariates (predictors) model

Schwarz's criterion for the fit of the intercept only model

Schwarz's criterion for the fit of the intercept and the covariates (predictors) model

-2 log likelihood of the intercept only model

-2 log likelihood of the model

Likelihood ratio degrees of freedom

Likelihood ratio chi-square probability value

Pseudo R-square Cox and Snell

Pseudo R-square Nagelkerke

Dependent mean

Percent of correct predictions

Percent of incorrect predictions

Percent of ties (probability for two cases is the same)

Number of parameters (the number of coefficients, including the intercept)

Number of rows

Whether or not the model converged

Whether or not a covariance matrix was computed.
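Many of these statistics are functions of the log likelihoods of the intercept only model and the full model. The Python sketch below uses the standard textbook definitions. The function name, dictionary keys, and inputs are hypothetical, and the values reported by Oracle Machine Learning may be computed differently.

```python
import math

def logistic_fit_summary(ll_intercept_only, ll_model, num_params, num_rows):
    """Global fit statistics derived from two log likelihoods."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_intercept_only - ll_model) / num_rows)
    # Largest value Cox and Snell can reach; Nagelkerke rescales by it.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_intercept_only / num_rows)
    return {
        "minus_2_log_likelihood_intercept_only": -2.0 * ll_intercept_only,
        "minus_2_log_likelihood_model": -2.0 * ll_model,
        "aic_model": 2.0 * num_params - 2.0 * ll_model,
        "sbic_model": num_params * math.log(num_rows) - 2.0 * ll_model,
        "pseudo_r_square_cox_snell": cox_snell,
        "pseudo_r_square_nagelkerke": cox_snell / max_cox_snell,
    }

# Hypothetical example: 100 rows, balanced classes, model log likelihood of -50.
summary = logistic_fit_summary(100 * math.log(0.5), -50.0, num_params=3, num_rows=100)
for name, value in summary.items():
    print(name, round(value, 4))
```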
20.8.9 Row Diagnostics for Logistic Regression
GLM provides detailed row diagnostics for logistic regression, offering insights into individual predictions.
For logistic regression, the diagnostics table has the columns described in the following table. All the columns are NUMBER, except the CASE_ID and TARGET_VALUE columns, which preserve the type from the training data.
Table 20-2 Row Diagnostics Table for Logistic Regression
Column  Description 


Value of the case ID column 

Value of the target value 

Probability associated with the target value 

Value of the diagonal element of the hat matrix 

Residual with respect to the adjusted dependent variable 

The raw residual scaled by the estimated standard deviation of the target 

Contribution to the overall goodness of fit of the model 

Confidence interval displacement diagnostic 

Confidence interval displacement diagnostic 

Change in the deviance due to deleting an individual observation 

Change in the Pearson chi-square 
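Two of the per-row residuals described in the table have standard textbook forms for a binary target, sketched below in Python. The mapping of these formulas to specific diagnostics columns is an assumption, as are the function names.

```python
import math

def pearson_residual(y, p):
    """Raw residual scaled by the estimated standard deviation of a binary target."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the row's contribution to the model deviance."""
    likelihood = p if y == 1 else 1.0 - p
    sign = 1.0 if y == 1 else -1.0
    return sign * math.sqrt(-2.0 * math.log(likelihood))

# y is the observed class (0 or 1) and p is the predicted probability of class 1.
for y, p in ((1, 0.9), (1, 0.5), (0, 0.9)):
    print(y, p, round(pearson_residual(y, p), 4), round(deviance_residual(y, p), 4))
```

Rows whose observed class disagrees strongly with the predicted probability, such as the last example, produce the largest residuals.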