4.1.4 Build a Generalized Linear Model
The ore.glm function fits generalized linear models on data in an ore.frame object.
The function uses a Fisher scoring iteratively reweighted least squares (IRLS) algorithm. Instead of the traditional step halving to prevent the selection of less optimal coefficient estimates, ore.glm uses a line search to select new coefficient estimates at each iteration, starting from the current coefficient estimates and moving through the Fisher scoring suggested estimates using the formula (1 - alpha) * old + alpha * suggested, where alpha is in the interval [0, 2]. When the interp control argument is TRUE, the deviance is approximated by a cubic spline interpolation. When it is FALSE, the deviance is calculated using a follow-up data scan.
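The line search formula can be illustrated with a small base R sketch. This is not the ore.glm implementation: it runs in memory on simulated data, handles only the binomial family with a logit link, and evaluates the deviance exactly on a fixed grid of alpha values. The grid, the helper names deviance_at and fisher_step, and the stopping rule are all assumptions made for illustration.

```r
# Illustrative base R sketch of Fisher scoring with a line search.
# Not the ore.glm implementation; grid and helper names are hypothetical.
set.seed(42)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with intercept
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -0.75))))

# Binomial deviance at a given coefficient vector.
deviance_at <- function(beta) {
  mu <- plogis(drop(X %*% beta))
  -2 * sum(dbinom(y, 1, mu, log = TRUE))
}

# One Fisher scoring step: weighted least squares on the working response.
fisher_step <- function(beta) {
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)                       # IRLS weights for the logit link
  z   <- eta + (y - mu) / w                  # working response
  drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
}

alphas <- seq(0, 2, by = 0.25)               # assumed search grid on [0, 2]
beta <- rep(0, ncol(X))
for (iter in 1:25) {
  suggested <- fisher_step(beta)
  # Candidates along the line: (1 - alpha) * old + alpha * suggested
  candidates <- lapply(alphas, function(a) (1 - a) * beta + a * suggested)
  devs <- vapply(candidates, deviance_at, numeric(1))
  new_beta <- candidates[[which.min(devs)]]  # keep the lowest-deviance candidate
  if (max(abs(new_beta - beta)) < 1e-10) break
  beta <- new_beta
}

# Reference fit from base R for comparison.
ref <- glm.fit(X, y, family = binomial())$coefficients
print(cbind(line_search = beta, glm = ref))
```

Because alpha = 0 keeps the current estimates and alpha = 1 takes the full Fisher scoring step, a search over [0, 2] can both shorten and lengthen the suggested step, which step halving alone cannot do.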
Each iteration consists of two or three embedded R execution map/reduce operations: an IRLS operation, an initial line search operation, and, if interp = FALSE, an optional follow-up line search operation. As with ore.lm, the IRLS map operation creates QR decompositions when update = "qr" or cross-products when update = "crossprod" of the model.matrix, or sparse.model.matrix if argument sparse = TRUE, and the IRLS reduce operation block updates those QR decompositions or cross-product matrices. After the algorithm has either converged or reached the maximum number of iterations, a final embedded R map/reduce operation is used to generate the complete set of model-level statistics.
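The cross-product variant of the map/reduce idea can be sketched in base R. The sketch below is an in-memory illustration only, not the embedded R execution that ore.glm performs: the four-way partitioning in chunk_id and the variable names are assumptions, and it shows the unweighted least squares case. In an IRLS step the same additivity would apply to the weighted cross-products.

```r
# Illustrative base R sketch of chunked cross-products (map) and their
# summation (reduce). Chunking scheme and names are hypothetical.
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), runif(n))            # design matrix with intercept
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)

chunk_id <- rep(1:4, length.out = n)         # assumed partitioning into 4 chunks

# "Map": each chunk contributes its own cross-products.
map_out <- lapply(split(seq_len(n), chunk_id), function(rows) {
  Xk <- X[rows, , drop = FALSE]
  list(xtx = crossprod(Xk), xty = crossprod(Xk, y[rows]))
})

# "Reduce": cross-products are additive across chunks.
total <- Reduce(
  function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty),
  map_out
)

# Solve the normal equations from the combined cross-products.
beta_chunked <- drop(solve(total$xtx, total$xty))

# Reference fit on the full data for comparison.
beta_full <- unname(lm.fit(X, y)$coefficients)
print(cbind(chunked = beta_chunked, full = beta_full))
```

Only the small cross-product matrices need to be combined, so no single step has to hold the full model matrix. Cross-products are generally less numerically stable than QR decompositions for ill-conditioned data, which is a common reason to offer both options.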
The ore.glm function returns an ore.glm object.
For information on the ore.glm function arguments, call help(ore.glm).
Example 4-4 Using the ore.glm Function
This example loads the rpart package and then pushes the kyphosis data set to a temporary database table that has the proxy ore.frame object KYPHOSIS. The example builds a generalized linear model using the ore.glm function and one using the glm function and calls the summary function on the models. The listing also fits Poisson regression models to the solder data set in the same way.
# Load the rpart library to get the kyphosis and solder data sets.
library(rpart)
# Logistic regression
KYPHOSIS <- ore.push(kyphosis)
kyphFit1 <- ore.glm(Kyphosis ~ ., data = KYPHOSIS, family = binomial())
kyphFit2 <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())
summary(kyphFit1)
summary(kyphFit2)
Listing for Example 4-4
R> # Load the rpart library to get the kyphosis and solder data sets.
R> library(rpart)
R> # Logistic regression
R> KYPHOSIS <- ore.push(kyphosis)
R> kyphFit1 <- ore.glm(Kyphosis ~ ., data = KYPHOSIS, family = binomial())
R> kyphFit2 <- glm(Kyphosis ~ ., data = kyphosis, family = binomial())
R> summary(kyphFit1)
Call:
ore.glm(formula = Kyphosis ~ ., data = KYPHOSIS, family = binomial())
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3124  -0.5484  -0.3632  -0.1659   2.1613
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.036934   1.449622  -1.405  0.15998
Age          0.010930   0.006447   1.696  0.08997 .
Number       0.410601   0.224870   1.826  0.06786 .
Start       -0.206510   0.067700  -3.050  0.00229 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 61.380 on 77 degrees of freedom
AIC: 69.38
Number of Fisher Scoring iterations: 4
R> summary(kyphFit2)
Call:
glm(formula = Kyphosis ~ ., family = binomial(), data = kyphosis)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3124  -0.5484  -0.3632  -0.1659   2.1613
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.036934   1.449575  -1.405  0.15996
Age          0.010930   0.006446   1.696  0.08996 .
Number       0.410601   0.224861   1.826  0.06785 .
Start       -0.206510   0.067699  -3.050  0.00229 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 61.380 on 77 degrees of freedom
AIC: 69.38
Number of Fisher Scoring iterations: 5
R> # Poisson regression
R> SOLDER <- ore.push(solder)
R> solFit1 <- ore.glm(skips ~ ., data = SOLDER, family = poisson())
R> solFit2 <- glm(skips ~ ., data = solder, family = poisson())
R> summary(solFit1)
Call:
ore.glm(formula = skips ~ ., data = SOLDER, family = poisson())
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.4105  -1.0897  -0.4408   0.6406   3.7927
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.25506    0.10069 -12.465  < 2e-16 ***
OpeningM     0.25851    0.06656   3.884 0.000103 ***
OpeningS     1.89349    0.05363  35.305  < 2e-16 ***
SolderThin   1.09973    0.03864  28.465  < 2e-16 ***
MaskA3       0.42819    0.07547   5.674 1.40e-08 ***
MaskB3       1.20225    0.06697  17.953  < 2e-16 ***
MaskB6       1.86648    0.06310  29.580  < 2e-16 ***
PadTypeD6   -0.36865    0.07138  -5.164 2.41e-07 ***
PadTypeD7   -0.09844    0.06620  -1.487 0.137001
PadTypeL4    0.26236    0.06071   4.321 1.55e-05 ***
PadTypeL6   -0.66845    0.07841  -8.525  < 2e-16 ***
PadTypeL7   -0.49021    0.07406  -6.619 3.61e-11 ***
PadTypeL8   -0.27115    0.06939  -3.907 9.33e-05 ***
PadTypeL9   -0.63645    0.07759  -8.203 2.35e-16 ***
PadTypeW4   -0.11000    0.06640  -1.657 0.097591 .
PadTypeW9   -1.43759    0.10419 -13.798  < 2e-16 ***
Panel        0.11818    0.02056   5.749 8.97e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 6855.7 on 719 degrees of freedom
Residual deviance: 1165.4 on 703 degrees of freedom
AIC: 2781.6
Number of Fisher Scoring iterations: 4