Following a previous announcement concerning the creation of activeReg, a sibling package to activeH5 for carrying out regression on big data sets, we at Active Analytics present a small demonstration of the package using the popular 1987 airline dataset.
The outputs of the linear regression and generalized linear regression functions, h5lm() and h5glm(), will be compared with those of R’s lm() and glm() functions.
Here we load the data into a data frame and then also save the data as an H5DF object.
# Load the activeReg package
require(activeReg)
# Loading the 1987 airline dataset
df <- readRDS("data/air.RDS")
# converting days of the week to factor
df$DayOfWeek <- factor(df$DayOfWeek)
h5Path <- "data/air.h5"
# Creating H5 Data
h5Data <- newH5DF(df, filePath = h5Path)
Initializing ...
Creating meta data on file ...
Converting character to factors
Registering any factor columns
Creating the matrix for writing ...
Writing data to file ...
We run linear regression using lm():
# Linear regression in R's lm
lm1 <- lm(formula = ArrDelay ~ DayOfWeek, data = df)
summary(lm1)
Call:
lm(formula = ArrDelay ~ DayOfWeek, data = df)
Residuals:
Min 1Q Median 3Q Max
-324.69 -19.07 -10.51 4.15 2588.04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.51350 0.03763 279.420 < 2e-16 ***
DayOfWeek2 -2.24982 0.05368 -41.909 < 2e-16 ***
DayOfWeek3 -0.55056 0.05356 -10.279 < 2e-16 ***
DayOfWeek4 2.17248 0.05343 40.657 < 2e-16 ***
DayOfWeek5 2.55417 0.05337 47.855 < 2e-16 ***
DayOfWeek6 -4.66690 0.05566 -83.852 < 2e-16 ***
DayOfWeek7 -0.18393 0.05413 -3.398 0.000679 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39.24 on 7275281 degrees of freedom
(177927 observations deleted due to missingness)
Multiple R-squared: 0.003285, Adjusted R-squared: 0.003284
F-statistic: 3996 on 6 and 7275281 DF, p-value: < 2.2e-16
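Because DayOfWeek is a factor and R uses treatment contrasts by default, the intercept in this single-factor model is the mean arrival delay for day 1, and each remaining coefficient is the difference from that baseline. As a small worked illustration, the sketch below rebuilds the per-day mean delays from the coefficients printed above. The variable names are illustrative only and are not part of R or activeReg.

```r
# Coefficients copied from the summary(lm1) output above
intercept <- 10.51350
dayEffects <- c(`2` = -2.24982, `3` = -0.55056, `4` = 2.17248,
                `5` = 2.55417, `6` = -4.66690, `7` = -0.18393)

# Day 1 is the baseline; every other day is baseline + its coefficient
meanDelay <- c(`1` = intercept, intercept + dayEffects)
print(round(meanDelay, 5))
```

For example, the fitted mean delay for day 2 is 10.51350 - 2.24982 = 8.26368.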
We then run the same regression using h5lm():
# Linear regression in activeReg
alm <- h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data)
Processing chunk 1
Processing chunk 2
Processing chunk 3
....
summary(alm)
Call:
h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data)
Estimate Std.Error t.value P(>|t|)
(Intercept) 10.51350 0.03763 279.41966 0.000
DayOfWeek2 -2.24982 0.05368 -41.90887 0.000
DayOfWeek3 -0.55056 0.05356 -10.27908 0.000
DayOfWeek4 2.17248 0.05343 40.65667 0.000
DayOfWeek5 2.55417 0.05337 47.85496 0.000
DayOfWeek6 -4.66690 0.05566 -83.85159 0.000
DayOfWeek7 -0.18393 0.05413 -3.39799 0.001
logLik: -37021866 , df: 8
AIC: 74043746 , BIC: 74043953
R-Squared: 0.003284723
F-Statistic: 3996.006 on 6 and 7275281 DF, pvalue: 0
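The "Processing chunk" messages suggest that the data is read from file piece by piece. One common way to fit least squares under that constraint is to accumulate the cross-products X'X and X'y over the chunks and solve the normal equations once at the end. The sketch below illustrates that idea in base R on simulated data. It is an assumption about the general technique, not a description of how activeReg is implemented, and the data, chunk size and variable names are all hypothetical.

```r
# Simulated stand-in for the airline data (hypothetical values)
set.seed(1)
n <- 10000
sim <- data.frame(DayOfWeek = factor(sample(1:7, n, replace = TRUE)))
dayShift <- c(0, -2, -0.5, 2, 2.5, -4.5, 0)
sim$ArrDelay <- 10 + dayShift[as.integer(sim$DayOfWeek)] + rnorm(n, sd = 30)

# Split the row indices into chunks of 2500 rows
chunks <- split(seq_len(n), ceiling(seq_len(n) / 2500))

# Accumulate X'X and X'y one chunk at a time
p <- nlevels(sim$DayOfWeek)
XtX <- matrix(0, p, p)
Xty <- matrix(0, p, 1)
for (idx in chunks) {
  X <- model.matrix(~ DayOfWeek, data = sim[idx, , drop = FALSE])
  y <- sim$ArrDelay[idx]
  XtX <- XtX + crossprod(X)
  Xty <- Xty + crossprod(X, y)
}

# Solve the normal equations once all chunks have been seen
betaChunked <- drop(solve(XtX, Xty))

# Reference fit on the full data frame
betaLm <- coef(lm(ArrDelay ~ DayOfWeek, data = sim))
print(all.equal(unname(betaChunked), unname(betaLm)))
```

Only the p-by-p matrix XtX and the p-by-1 vector Xty need to stay in memory, so the memory cost does not grow with the number of rows.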
First we present a binomial model using R’s glm() function.
# Binomial Regression using R's glm
glm1 <- glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"), data = df)
summary(glm1)
Call:
glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8974 -0.8395 -0.8204 1.4861 1.6797
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.861624 0.002098 -410.627 <2e-16 ***
DayOfWeek2 -0.165704 0.003051 -54.314 <2e-16 ***
DayOfWeek3 -0.054420 0.003004 -18.114 <2e-16 ***
DayOfWeek4 0.102810 0.002951 34.843 <2e-16 ***
DayOfWeek5 0.160037 0.002933 54.566 <2e-16 ***
DayOfWeek6 -0.269402 0.003213 -83.845 <2e-16 ***
DayOfWeek7 -0.005563 0.003020 -1.842 0.0655 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8787789 on 7275287 degrees of freedom
Residual deviance: 8761041 on 7275281 degrees of freedom
(177927 observations deleted due to missingness)
AIC: 8761055
Number of Fisher Scoring iterations: 4
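The coefficients of a logit model are on the log-odds scale, so they are easier to read once mapped back to probabilities with the inverse logit. The sketch below does this for the estimates printed above, using base R's plogis(). The variable names are illustrative only.

```r
# Coefficients copied from the summary(glm1) output above
intercept <- -0.861624
dayEffects <- c(`2` = -0.165704, `3` = -0.054420, `4` = 0.102810,
                `5` = 0.160037, `6` = -0.269402, `7` = -0.005563)

# Linear predictor per day: day 1 is the baseline level
logOdds <- c(`1` = intercept, intercept + dayEffects)

# Inverse logit gives the fitted probability that ArrDelay > 10
probLate <- plogis(logOdds)
print(round(probLate, 3))
```

For day 1 this gives a fitted probability of roughly 0.30 that ArrDelay exceeds 10.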
Then we present the same model using the activeReg function h5glm().
# Binomial regression using activeReg
aglm <- h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"), data = h5Data)
This is the output for the iterative procedure …
Iteration 0 the coefficients are:
[,1]
[1,] -0.987369109
[2,] -0.162442345
[3,] -0.054648105
[4,] 0.106537434
[5,] 0.167611700
[6,] -0.257923804
[7,] -0.005643235
Iteration 1 iterations, and the deviance is 8783085 the coefficients are:
[,1]
[1,] -0.858078279
[2,] -0.165391761
[3,] -0.054227733
[4,] 0.102267311
[5,] 0.159123315
[6,] -0.269351868
[7,] -0.005540061
Iteration 2 iterations, and the deviance is 8761059 the coefficients are:
[,1]
[1,] -0.861621407
[2,] -0.165702560
[3,] -0.054419338
[4,] 0.102809484
[5,] 0.160036097
[6,] -0.269400869
[7,] -0.005563139
Iteration 3 iterations, and the deviance is 8761041 the coefficients are:
[,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,] 0.102810400
[5,] 0.160037478
[6,] -0.269401627
[7,] -0.005563187
Iteration 4 iterations, and the deviance is 8761041 the coefficients are:
[,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,] 0.102810400
[5,] 0.160037478
[6,] -0.269401627
[7,] -0.005563187
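The trace above is typical of an iteratively reweighted least squares (IRLS) fit, the same Fisher scoring scheme that glm() reports. As a rough illustration of what each iteration does, here is a minimal IRLS sketch for a logit model in base R on simulated data. It is not activeReg's implementation, and the data, starting values and stopping rule are assumptions made for the example.

```r
# Simulated binary outcome (hypothetical values)
set.seed(2)
n <- 5000
sim <- data.frame(DayOfWeek = factor(sample(1:7, n, replace = TRUE)))
sim$Late <- rbinom(n, 1, 0.3)

X <- model.matrix(~ DayOfWeek, data = sim)
y <- sim$Late

# Start from zero coefficients and iterate weighted least squares
beta <- rep(0, ncol(X))
for (iter in 1:25) {
  eta <- drop(X %*% beta)        # linear predictor
  mu <- plogis(eta)              # fitted probabilities
  w <- mu * (1 - mu)             # IRLS weights for the logit link
  z <- eta + (y - mu) / w        # working response
  betaNew <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  converged <- max(abs(betaNew - beta)) < 1e-10
  beta <- betaNew
  if (converged) break
}

# Reference fit using glm()
betaGlm <- coef(glm(Late ~ DayOfWeek, family = binomial(link = "logit"), data = sim))
print(all.equal(unname(beta), unname(betaGlm), tolerance = 1e-6))
```

Each pass only needs the weighted cross-products X'WX and X'Wz, which can be accumulated over chunks of rows in the same way as for linear regression.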
summary(aglm)
Call:
h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"),
data = h5Data)
Estimate Std.Error z.value P(>|z|)
(Intercept) -8.616e-01 2.098e-03 -4.106e+02 0.000
DayOfWeek2 -1.657e-01 3.051e-03 -5.431e+01 0.000
DayOfWeek3 -5.442e-02 3.004e-03 -1.811e+01 0.000
DayOfWeek4 1.028e-01 2.951e-03 3.484e+01 0.000
DayOfWeek5 1.600e-01 2.933e-03 5.457e+01 0.000
DayOfWeek6 -2.694e-01 3.213e-03 -8.384e+01 0.000
DayOfWeek7 -5.563e-03 3.020e-03 -1.842e+00 0.065
logLik: -11001788 , df: 8
AIC: 8761055 , BIC: 22003798
The purpose of this brief look at the activeReg package was to show some basic functionality and how closely its outputs match those of R’s lm() and glm() functions. Next time we will look at running regressions on big data.
Thank you.