Brief Demo of activeReg

Author: Dr Chibisi Chima-Okereke Created: March 23, 2014 07:08:00 GMT Published: March 23, 2014 07:08:00 GMT

Following a previous announcement concerning the creation of activeReg, a sibling package to activeH5 for carrying out regression on big data sets, we at Active Analytics present a small demonstration of the package using the popular 1987 airline dataset.

The outputs of the linear regression and generalized linear regression functions, h5lm() and h5glm(), will be compared with those of R's lm() and glm() functions.

Linear Regression Analysis

Here we load the data into a data frame, convert the day of the week to a factor, and then save the data as an H5DF object.

# Load the activeReg package
require(activeReg)
# Loading the 1987 airline dataset
df <- readRDS("data/air.RDS")
# converting days of the week to factor
df$DayOfWeek <- factor(df$DayOfWeek)
h5Path <- "data/air.h5"
# Creating H5 Data
h5Data <- newH5DF(df, filePath = h5Path)
Initializing ...
Creating meta data on file ... 
Converting character to factors 
Registering any factor columns 
Creating the matrix for writing ... 
Writing data to file ... 
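
For completeness, the data/air.RDS file used above can be built from the raw 1987 file of the publicly available ASA Data Expo 2009 airline data. The lines below are only a sketch; the CSV file name and the paths are assumptions and not part of activeReg.

# Sketch only: build data/air.RDS from the raw 1987 airline CSV
# (the file name "1987.csv" and the output path are assumptions)
air <- read.csv("1987.csv", stringsAsFactors = FALSE)
saveRDS(air, file = "data/air.RDS")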

We run linear regression using lm():

# Linear regression in R's lm
lm1 <- lm(formula = ArrDelay ~ DayOfWeek, data = df)
summary(lm1)
Call:
lm(formula = ArrDelay ~ DayOfWeek, data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-324.69  -19.07  -10.51    4.15 2588.04 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.51350    0.03763 279.420  < 2e-16 ***
DayOfWeek2  -2.24982    0.05368 -41.909  < 2e-16 ***
DayOfWeek3  -0.55056    0.05356 -10.279  < 2e-16 ***
DayOfWeek4   2.17248    0.05343  40.657  < 2e-16 ***
DayOfWeek5   2.55417    0.05337  47.855  < 2e-16 ***
DayOfWeek6  -4.66690    0.05566 -83.852  < 2e-16 ***
DayOfWeek7  -0.18393    0.05413  -3.398 0.000679 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39.24 on 7275281 degrees of freedom
  (177927 observations deleted due to missingness)
Multiple R-squared:  0.003285, Adjusted R-squared:  0.003284 
F-statistic:  3996 on 6 and 7275281 DF,  p-value: < 2.2e-16
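
A note on the coefficient table above: because DayOfWeek is a factor, R expands it using its default treatment coding, so day 1 is absorbed into the intercept and the remaining six levels appear as the dummy variables DayOfWeek2 to DayOfWeek7. The expansion can be inspected directly with model.matrix():

# Treatment coding of the factor: level 1 is the baseline (intercept),
# the other six levels become 0/1 indicator columns
head(model.matrix(~ DayOfWeek, data = df))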

and then using h5lm():

# Linear regression in activeReg
alm <- h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data)
Processing chunk 1 
Processing chunk 2 
Processing chunk 3 
....
summary(alm)
Call:
 h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data) 
             Estimate Std.Error   t.value P(>|t|)
(Intercept)  10.51350   0.03763 279.41966   0.000
DayOfWeek2   -2.24982   0.05368 -41.90887   0.000
DayOfWeek3   -0.55056   0.05356 -10.27908   0.000
DayOfWeek4    2.17248   0.05343  40.65667   0.000
DayOfWeek5    2.55417   0.05337  47.85496   0.000
DayOfWeek6   -4.66690   0.05566 -83.85159   0.000
DayOfWeek7   -0.18393   0.05413  -3.39799   0.001
logLik: -37021866 , df: 8
AIC: 74043746 , BIC: 74043953
R-Squared: 0.003284723
F-Statistic: 3996.006 on 6 and 7275281 DF, pvalue: 0
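
The estimates, standard errors and t values agree with lm() to the printed precision. A quick numerical check along the following lines can confirm this; note that it assumes the h5lm fit exposes its coefficients through coef(), which may differ from the actual activeReg accessor.

# Sketch of a numerical comparison; coef(alm) is an assumption about the
# h5lm object's interface and may need adjusting
all.equal(unname(coef(lm1)), unname(coef(alm)), tolerance = 1e-6)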

Logistic Regression Analysis

First we present a binomial model using R's glm() function.

# Binomial Regression using R's glm
glm1 <- glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"), data = df)
summary(glm1)
Call:
glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"), 
    data = df)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8974  -0.8395  -0.8204   1.4861   1.6797  
Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -0.861624   0.002098 -410.627   <2e-16 ***
DayOfWeek2  -0.165704   0.003051  -54.314   <2e-16 ***
DayOfWeek3  -0.054420   0.003004  -18.114   <2e-16 ***
DayOfWeek4   0.102810   0.002951   34.843   <2e-16 ***
DayOfWeek5   0.160037   0.002933   54.566   <2e-16 ***
DayOfWeek6  -0.269402   0.003213  -83.845   <2e-16 ***
DayOfWeek7  -0.005563   0.003020   -1.842   0.0655 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 8787789  on 7275287  degrees of freedom
Residual deviance: 8761041  on 7275281  degrees of freedom
  (177927 observations deleted due to missingness)
AIC: 8761055
Number of Fisher Scoring iterations: 4

Then we present the same case using activeReg's h5glm() function.

# Binomial regression using activeReg
aglm <- h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"), data = h5Data)

This is the output for the iterative procedure …

Iteration 0 the coefficients are:
             [,1]
[1,] -0.987369109
[2,] -0.162442345
[3,] -0.054648105
[4,]  0.106537434
[5,]  0.167611700
[6,] -0.257923804
[7,] -0.005643235
Iteration 1 iterations, and the deviance is 8783085 the coefficients are:
             [,1]
[1,] -0.858078279
[2,] -0.165391761
[3,] -0.054227733
[4,]  0.102267311
[5,]  0.159123315
[6,] -0.269351868
[7,] -0.005540061
Iteration 2 iterations, and the deviance is 8761059 the coefficients are:
             [,1]
[1,] -0.861621407
[2,] -0.165702560
[3,] -0.054419338
[4,]  0.102809484
[5,]  0.160036097
[6,] -0.269400869
[7,] -0.005563139
Iteration 3 iterations, and the deviance is 8761041 the coefficients are:
             [,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,]  0.102810400
[5,]  0.160037478
[6,] -0.269401627
[7,] -0.005563187
Iteration 4 iterations, and the deviance is 8761041 the coefficients are:
             [,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,]  0.102810400
[5,]  0.160037478
[6,] -0.269401627
[7,] -0.005563187
summary(aglm)
Call:
 h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"), 
    data = h5Data) 
              Estimate  Std.Error    z.value P(>|z|)
(Intercept) -8.616e-01  2.098e-03 -4.106e+02   0.000
DayOfWeek2  -1.657e-01  3.051e-03 -5.431e+01   0.000
DayOfWeek3  -5.442e-02  3.004e-03 -1.811e+01   0.000
DayOfWeek4   1.028e-01  2.951e-03  3.484e+01   0.000
DayOfWeek5   1.600e-01  2.933e-03  5.457e+01   0.000
DayOfWeek6  -2.694e-01  3.213e-03 -8.384e+01   0.000
DayOfWeek7  -5.563e-03  3.020e-03 -1.842e+00   0.065
logLik: -11001788 , df: 8
AIC: 8761055 , BIC: 22003798
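
The coefficients, standard errors and z values again match glm() to the printed precision. For readers wondering what the iteration trace above corresponds to, the following is a minimal in-memory sketch of iteratively reweighted least squares (Fisher scoring) for the logistic model; it is only an illustration of the algorithm and not the activeReg implementation, which works chunk by chunk on the H5 file.

# Minimal IRLS / Fisher scoring sketch for the logistic model above.
# Illustration only: not the activeReg implementation.
irls_logit <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))                 # starting coefficients
  devOld <- Inf
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)               # linear predictor
    mu  <- 1 / (1 + exp(-eta))            # fitted probabilities
    w   <- mu * (1 - mu)                  # working weights
    z   <- eta + (y - mu) / w             # working response
    beta <- solve(crossprod(X, w * X), crossprod(X, w * z))  # weighted LS step
    dev <- -2 * sum(y * log(mu) + (1 - y) * log(1 - mu))     # binomial deviance
    if (abs(dev - devOld) < tol * (abs(dev) + 0.1)) break
    devOld <- dev
  }
  drop(beta)
}
ok <- !is.na(df$ArrDelay)                 # drop missing arrival delays
X  <- model.matrix(~ DayOfWeek, data = df[ok, ])
y  <- as.numeric(df$ArrDelay[ok] > 10)
cbind(irls = irls_logit(X, y), glm = coef(glm1))   # should agree closely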

Summary

The purpose of this brief look at the activeReg package was to show some basic functionality and the equivalence between its outputs and those of R's lm() and glm() functions. Next time we will look at running regressions on big data.

Thank you.