
Brief Demo of activeReg

Active Analytics Ltd: posted 23 Mar 2014 07:08 by Chibisi Chima-Okereke

Introduction

Following a previous announcement concerning the creation of activeReg, a sibling package to activeH5 for carrying out regression on big data sets, we at Active Analytics present a small demonstration of the package using the popular 1987 airline dataset.

The outputs of the linear regression and generalized linear regression functions h5lm() and h5glm() will be compared with those of R's lm() and glm() functions.

Linear Regression Analysis

Here we load the data into a data frame and then also save the data as an H5DF object.

# Load the activeReg package
require(activeReg)

# Loading the 1987 airline dataset
df <- readRDS("data/air.RDS")
# converting days of the week to factor
df$DayOfWeek <- factor(df$DayOfWeek)


h5Path <- "data/air.h5"
# Creating H5 Data
h5Data <- newH5DF(df, filePath = h5Path)

Output ...

Initializing ...
Creating meta data on file ... 
Converting character to factors 
Registering any factor columns 
Creating the matrix for writing ... 
Writing data to file ... 

We run linear regression using lm():

# Linear regression in R's lm
lm1 <- lm(formula = ArrDelay ~ DayOfWeek, data = df)
summary(lm1)

Output

Call:
lm(formula = ArrDelay ~ DayOfWeek, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-324.69  -19.07  -10.51    4.15 2588.04 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.51350    0.03763 279.420  < 2e-16 ***
DayOfWeek2  -2.24982    0.05368 -41.909  < 2e-16 ***
DayOfWeek3  -0.55056    0.05356 -10.279  < 2e-16 ***
DayOfWeek4   2.17248    0.05343  40.657  < 2e-16 ***
DayOfWeek5   2.55417    0.05337  47.855  < 2e-16 ***
DayOfWeek6  -4.66690    0.05566 -83.852  < 2e-16 ***
DayOfWeek7  -0.18393    0.05413  -3.398 0.000679 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.24 on 7275281 degrees of freedom
  (177927 observations deleted due to missingness)
Multiple R-squared:  0.003285, Adjusted R-squared:  0.003284 
F-statistic:  3996 on 6 and 7275281 DF,  p-value: < 2.2e-16

and then using h5lm():

# Linear regression in activeReg
alm <- h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data)

Output

Processing chunk 1 
Processing chunk 2 
Processing chunk 3 
....

summary(alm)

Output

Call:
 h5lm(formula = ArrDelay ~ DayOfWeek, data = h5Data) 
             Estimate Std.Error   t.value P(>|t|)
(Intercept)  10.51350   0.03763 279.41966   0.000
DayOfWeek2   -2.24982   0.05368 -41.90887   0.000
DayOfWeek3   -0.55056   0.05356 -10.27908   0.000
DayOfWeek4    2.17248   0.05343  40.65667   0.000
DayOfWeek5    2.55417   0.05337  47.85496   0.000
DayOfWeek6   -4.66690   0.05566 -83.85159   0.000
DayOfWeek7   -0.18393   0.05413  -3.39799   0.001

logLik: -37021866 , df: 8
AIC: 74043746 , BIC: 74043953
R-Squared: 0.003284723
F-Statistic: 3996.006 on 6 and 7275281 DF, pvalue: 0
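
As a quick sanity check using only base R (a small aside, not part of the original demo), note that with treatment contrasts the intercept is simply the mean Monday arrival delay, and the remaining coefficients are differences from Monday:

# Treatment contrasts are differences of day-of-week mean delays
means <- tapply(df$ArrDelay, df$DayOfWeek, mean, na.rm = TRUE)
means[1]              # ~10.51, the (Intercept) estimate for Monday
means[-1] - means[1]  # the DayOfWeek2..DayOfWeek7 estimates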

Logistic Regression Analysis

First we present a binomial model using R's glm() function:

# Binomial Regression using R's glm
glm1 <- glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"), data = df)
summary(glm1)

Output ...

Call:
glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8974  -0.8395  -0.8204   1.4861   1.6797  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -0.861624   0.002098 -410.627   <2e-16 ***
DayOfWeek2  -0.165704   0.003051  -54.314   <2e-16 ***
DayOfWeek3  -0.054420   0.003004  -18.114   <2e-16 ***
DayOfWeek4   0.102810   0.002951   34.843   <2e-16 ***
DayOfWeek5   0.160037   0.002933   54.566   <2e-16 ***
DayOfWeek6  -0.269402   0.003213  -83.845   <2e-16 ***
DayOfWeek7  -0.005563   0.003020   -1.842   0.0655 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8787789  on 7275287  degrees of freedom
Residual deviance: 8761041  on 7275281  degrees of freedom
  (177927 observations deleted due to missingness)
AIC: 8761055

Number of Fisher Scoring iterations: 4

Then we present the same model using activeReg's h5glm() function:

# Binomial regression using activeReg
aglm <- h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"), data = h5Data)

This is the output of the iterative fitting procedure; note the deviance converging to the residual deviance reported by glm() above:

Iteration 0 the coefficients are:
             [,1]
[1,] -0.987369109
[2,] -0.162442345
[3,] -0.054648105
[4,]  0.106537434
[5,]  0.167611700
[6,] -0.257923804
[7,] -0.005643235
Iteration 1 iterations, and the deviance is 8783085 the coefficients are:
             [,1]
[1,] -0.858078279
[2,] -0.165391761
[3,] -0.054227733
[4,]  0.102267311
[5,]  0.159123315
[6,] -0.269351868
[7,] -0.005540061
Iteration 2 iterations, and the deviance is 8761059 the coefficients are:
             [,1]
[1,] -0.861621407
[2,] -0.165702560
[3,] -0.054419338
[4,]  0.102809484
[5,]  0.160036097
[6,] -0.269400869
[7,] -0.005563139
Iteration 3 iterations, and the deviance is 8761041 the coefficients are:
             [,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,]  0.102810400
[5,]  0.160037478
[6,] -0.269401627
[7,] -0.005563187
Iteration 4 iterations, and the deviance is 8761041 the coefficients are:
             [,1]
[1,] -0.861623952
[2,] -0.165703523
[3,] -0.054419777
[4,]  0.102810400
[5,]  0.160037478
[6,] -0.269401627
[7,] -0.005563187

Then the summary ...

summary(aglm)

Call:
 h5glm(formula = ArrDelay > 10 ~ DayOfWeek, family = binomial_(link = "logit"), 
    data = h5Data) 
              Estimate  Std.Error    z.value P(>|z|)
(Intercept) -8.616e-01  2.098e-03 -4.106e+02   0.000
DayOfWeek2  -1.657e-01  3.051e-03 -5.431e+01   0.000
DayOfWeek3  -5.442e-02  3.004e-03 -1.811e+01   0.000
DayOfWeek4   1.028e-01  2.951e-03  3.484e+01   0.000
DayOfWeek5   1.600e-01  2.933e-03  5.457e+01   0.000
DayOfWeek6  -2.694e-01  3.213e-03 -8.384e+01   0.000
DayOfWeek7  -5.563e-03  3.020e-03 -1.842e+00   0.065

logLik: -11001788 , df: 8
AIC: 8761055 , BIC: 22003798
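
To put the coefficients on a more familiar scale, base R's plogis() converts the logit intercept into a probability, here the estimated chance that a Monday flight arrives more than 10 minutes late (a quick aside, not part of the original demo):

# Inverse logit of the (Intercept) estimate
plogis(-0.861624)  # ~0.297, the Monday probability of ArrDelay > 10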

Summary

The purpose of this brief look at the activeReg package was to show some basic functionality and the equivalence between its outputs and those of R's lm() and glm() functions. Next time we will look at running regressions on big data.

Thank you.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.


Product Launch: activeReg, a sibling package to activeH5 on-disk and in-memory data storage

Active Analytics Ltd: posted 26 Jan 2014 10:43 by Chibisi Chima-Okereke [ updated 22 Mar 2014 08:33 ]

Introduction

Active Analytics Ltd have now created their second software product, activeReg, a sibling package to activeH5. It allows users to analyse big data using linear regression and generalized linear models; a short sketch of a call follows the list of supported distributions below.

The distributions covered are:

  1. Binomial
  2. Poisson
  3. Gaussian
  4. Quasipoisson
  5. Quasibinomial
  6. Gamma
  7. Inverse Gaussian
  8. Quasi
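
For illustration, here is a sketch of a call. The binomial example mirrors the demo elsewhere on this page; the trailing underscore in the family constructor (binomial_) distinguishes activeReg's families from base R's, and we assume the other families follow the same naming convention:

# Binomial regression on an H5 data set; binomial_ is activeReg's
# family constructor, mirroring R's glm() interface
aglm <- h5glm(formula = ArrDelay > 10 ~ DayOfWeek,
              family = binomial_(link = "logit"), data = h5Data)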


R vs Python: Why R is still the king of statistical computing

Active Analytics Ltd: posted 5 Feb 2014 17:46 by Chibisi Chima-Okereke [ updated 22 Mar 2014 08:18 ]

Introduction

One of our previous blog entries praised the virtues of Python, impressed with its speed on our quickly hacked Chain Ladder code - much faster than the first attempt in R, though this is mostly down to Python's h5py package for the HDF5 storage format, which is a great package. A later blog entry showed an implementation in R based on Active Analytics' activeH5 package, demonstrating performance parity between R and Python.

Recently there have been blog posts on R and Python asking whether Python is now displacing R as a programming language for data science, and trying to ascertain whether Python is really faster than R. This blog entry was written for people who carry out statistical analysis and are trying to decide whether R or Python is the best route to take.

This blog entry is not intended to "rip" into Python. Python is a great language and we at Active Analytics love it - when the tools are available you can build applications very quickly, and docstrings bring sanity to the documentation process. If you use Spyder, your docstrings are immediately searchable, which is a cool feature.

I have written some Python scripts to emulate R features, located in the Active Analytics GitHub repository; they are part of the "littleR" theme.

Is Python really faster than R?

Based on my experience, if you are doing pure floating-point matrix operations the two are about the same; if you are doing real statistical analysis, R is miles ahead. The latter statement will be dealt with later, but for the former, here is a quick (though by no means comprehensive) benchmark for creating and multiplying matrices:

For R

This R code was written in the RStudio IDE, which is a great IDE:

system.time(x <- matrix(runif(1e8), nc = 1e4))
#  user  system elapsed 
# 3.996   0.177   4.193 
system.time(y <- matrix(runif(1e8), nc = 1e4))
#  user  system elapsed 
# 3.967   0.187   4.155 
system.time(z <- t(x) %*% y)
#    user  system elapsed 
# 191.087   8.655  27.203 
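
As an aside, base R's crossprod() computes t(x) %*% y without explicitly forming the transpose and is usually a little faster; a quick check (not part of the original benchmark):

# crossprod(x, y) is equivalent to t(x) %*% y
system.time(z2 <- crossprod(x, y))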

For Python

Here is the script

# -*- coding: utf-8 -*-
# basicBench.py

import numpy as np
import time

# The benchmarks
xA = time.time()
x = np.ndarray(shape = (10000, 10000), buffer = np.random.uniform(size = 10000**2))
xB = time.time()
y = np.ndarray(shape = (10000, 10000), buffer = np.random.uniform(size = 10000**2))
xC = time.time()
z = x.transpose().dot(y)
xD = time.time()

print "Time to create matrix x is " + str(xB - xA) + " s"
print "Time to create matrix y is " + str(xC - xB) + " s"
print "Time for the matrix operation is " + str(xD - xC) + " s"

# To run
# $ python basicBench.py

Results

# Time to create matrix x is 1.261 s
# Time to create matrix y is 1.238 s
# Time for the matrix operation is 28.895 s

This was run on Ubuntu 13.10 on a Core i7 processor with 32 GB of RAM, with OpenBLAS installed (8 threads).

It is interesting to observe that R's matrix multiplication was a little faster, but that creating the matrices of uniformly distributed random numbers is much faster in Python. Even with matrices of this size, the matrix multiplication benchmark gives pretty similar results. If I were doing this operation on matrices of this size or larger, I would use alternative, faster, and less memory-intensive methods (e.g. using my activeH5 package in memory and on disk; see here for more details).

It is worth testing these things out for yourself, thinking carefully about the kind of operations you would like to do and then benchmarking those. You can see from this simple example that the performance of different aspects of the calculation can be quite variable. At the end of the day it is pretty easy to port in a C++ stand-in for R functions using Rcpp, which will perk up any flagging parts of the analysis.

Rcpp vs Numba & Cython

If you are interested in pure speed of numerical calculations, you will find that R & Python are probably about the same, and optimization can be done on both: on the R side by abstracting to C/C++ using Rcpp and RcppArmadillo - which are totally awesome; I have ginned-up many an R routine using these tools. On the Python side there is a lot to choose from: Cython, Numba, and weave. If it's speed you want, the most important choice is not whether to use R or Python; it is whether there is a well-written library available for your application in the given programming language.

I have used R's Rcpp, and Python's Numba and Cython. I would say preference is at least partly down to the individual, but my preference by a long way is Rcpp, which is changing the landscape of R and will be pivotal in R's future; it is quite possible that it is the most important R package out there at the moment. Why do I love Rcpp so much? (A small sketch follows the list below.)

  1. C++ integration with R is seamless. No really, it is.
  2. It minimises having to directly write .Call(), .Internal(), and other R back-end stuff.
  3. It's C++, for crying out loud, with R - don't you get it? Think of all the C/C++ libraries out there; it is now reasonably straightforward to include them in your R package. No really, it is.
  4. You don't have to learn a hybrid language, just R and C++. When I'm eating a steak, I want to know it's a steak. It is difficult to create hybrid languages. You can get away with it when both are scripting languages, by using appropriate decorators etc., essentially allowing the programmer to write both languages in the same place. But when one is a compiled language things get a little dicey, and if in-lining doesn't cut it you are in the tricky territory of defining your own language to be translated and then compiled. This means that experts in either or both languages really have no idea how your language works - they have to learn a third, hybrid language. Whereas even if you are new to C++, you can always learn it (from one or more of the many thousands of books, internet articles, blogs, and other resources) and then just read the Rcpp API - like any other C++ API.
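
To make point 1 concrete, here is a minimal sketch (not from the original post) using Rcpp's cppFunction(), which compiles a snippet of C++ and exposes it to R in one call:

require(Rcpp)

# Compile a C++ function and call it from R as if it were native;
# sumSquares is a made-up example
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
  return total;
}')

sumSquares(c(1, 2, 3))  # 14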

Perhaps the way forward for Python is for its weave tool to be further developed and documented, essentially making it as easy to #include C++ as it is with Rcpp. There is no shame in copying when the model works; after that you can make it better.

R vs Python in statistical analysis

For those who studied statistics 101 at secondary/high school and university, let's take a quick walk down memory lane. One of the first things you learn about is types of data: discrete, continuous, and categorical. How do these data types translate to R and Python, and how flexible are they? As I said in a previous blog, R's raison d'etre is statistical analysis, and those basic data types are part of R's DNA - as is the handling of missing values. Python's scientific computing package numpy, the de facto package for computation, is quite happy for you to have continuous data containing NaN values, but lacks missing values for all other types of data. The powers that be know that it is an issue (see here); I believe that Python cannot make significant inroads into statistical analysis without missing-value types. Currently the best Python has come up with for statistical data types is the pandas package, which does have a categorical data type, but it is currently inflexible - due at least in part to the missing-value problem. You could choose to hack something up using masked arrays, but it would still be sluggish (and you will lose the will to live).
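
For contrast, here is how R handles missing values in factors out of the box (a base-R aside):

# NA is a first-class missing value and factor() keeps it out of the levels
fct <- factor(c("a", "b", NA, "c"))
levels(fct)                  # "a" "b" "c"
is.na(fct)                   # FALSE FALSE  TRUE FALSE
table(fct, useNA = "ifany")  # counts the NA separately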

Here is a demonstration of trying to do R-like operations on basic data types in Python. There are two Python files in this coding example that emulate various R-like qualities; they are located here.

Let's see how factors are treated in the pandas package:

Import a little R

>>> execfile("RUtilities.py")  # littleR utilities

This may look like R

>>> x = sample(letters[0:3], 20, True)

But it really isn't

>>> x
array(['c', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'], 
      dtype='|S1')

Create a categorical variable:

>>> import pandas as pd
>>> y = pd.Categorical(x)
>>> y
Categorical: 
[c, b, b, a, b, b, a, b, c, c, c, a, a, a, b, a, a, c, c, c]
Levels (3): Index(['a', 'b', 'c'], dtype=object)

We run into limitations pretty quickly

>>> y[0] = "f"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Categorical' object does not support item assignment

Let's be cheeky and try to use numpy's nan as a kind of NA:

>>> np.nan
nan
>>> z = x

Hmm, maybe it would be better if this failed altogether rather than doing this ...

>>> z[0] = np.nan
>>> x
array(['n', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'], 
      dtype='|S1')

Okay, so lists are great, right? Let's get our NA into a list and then try to convert it to a factor:

>>> z = x.tolist()
>>> z[0] = np.nan

Yay?

>>> z
[nan, 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a', 'a', 'b', 'a', 'a', 'c', 'c', 'c']

Try making this a factor

>>> z = pd.Categorical(z)
>>> z
Categorical: 
[nan, b, b, a, b, b, a, b, c, c, c, a, a, a, b, a, a, c, c, c]
Levels (4): Index(['a', 'b', 'c', 'nan'], dtype=object)

But do we really want the character "nan" standing in for missing values, and as a level in our categorical variable?

What about integers?

>>> x = np.array([1,2,3,np.nan, 2,4,6], dtype = int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot convert float NaN to integer

The above examples are slightly tongue-in-cheek, but if missing values are important to you, you need R. Are factors that behave as they do in R impossible to get in Python? No, but they are awfully slow - unless you build one yourself in C. Here we have a Python factor object that behaves like R's; here is a demo:

>>> execfile('RUtilities.py')
>>> execfile('factor.py')
>>> import time

>>> x = np.array(['c', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'])
>>> y = factor(x)
>>> y.__class__

>>> y
c b b a b b a b c c c a a a b a a c c c
Levels: a  b  c

Yes, each element is a factor:

>>> y[3]
a
Levels: a  b  c

You can assign to it:

>>> y[3] = "b"
>>> y
c b b b b b a b c c c a a a b a a c c c
Levels: a  b  c

Assigning a value outside the factor's levels results in a "missing" value, similar to R:

>>> y[3] = "f"
>>> y
c   b   b   nan   b   b   a   b   c   c   c   a   a   a   b   a   a   c   c   c
Levels: a  b  c

You can even make multiple assignments:

>>> y[0:3] = "c"
>>> y
c   c   c   nan   b   b   a   b   c   c   c   a   a   a   b   a   a   c   c   c
Levels: a  b  c
>>> y[3:6] = ["c", "a", "c"]
>>> y
c c c c a c a b c c c a a a b a a c c c
Levels: a  b  c

But it's ridiculously slow

>>> x = sample(letters, 50000)
>>> xA = time.time()
>>> y = factor(x)
>>> xB = time.time()
>>> print 'Time taken: ' + (xB - xA).__str__() + ' s'
Time taken: 2.28630208969 s

In R

x = sample(letters, 50000, TRUE)
system.time(y <- factor(x))
   user  system elapsed 
  0.003   0.000   0.002 

Let's build a basic model matrix in Python using pandas and the patsy package, a package for building model matrices.

import numpy as np
import pandas as pd
import patsy as f
import time

# x is the 50000-element character array sampled above
data = pd.DataFrame({"y" : np.random.uniform(size = 50000), "x" : x})
xA = time.time()
mm = f.dmatrices("y ~ x", data)
xB = time.time()
print 'Time taken: ' + str(xB - xA) + ' s'
Time taken: 0.288909912109 s

In R

data <- data.frame("y" = runif(50000), "x" = x)
system.time(mm <- model.matrix(y ~ x, data = data))
   user  system elapsed 
  0.053   0.000   0.052

Note that in the above model matrix example the character array x is used to build the model matrix in Python, so what you are seeing is the speed of doing things the "standard way" with the patsy package.

The R package deluge

At the time of this blog entry there are 5173 R packages on CRAN. What does this mean? Perhaps it means that many thousands of statisticians, data scientists, mathematicians, and others are so comfortable with the R programming environment that they chose it to publicly express their various algorithmic ambitions. Consider also that these packages represent the tip of a very large iceberg in terms of the statisticians, data scientists/analysts, and programmers who interface with R. They have extended the functionality of R to many different areas of statistical analysis and applied mathematics. Perhaps this is what they call a ringing endorsement. Could they be wrong? Maybe, but not likely.

Summary

This blog focused on data-type basics, which explain why it is so difficult to build statistical objects in Python. R users will know that I have barely begun to point out R's super-cool features, but the basics are important - they are the foundation that advanced structures are built on. One of those basics is R's formula object, which underneath is a call object with some attributes. I haven't explored it in this blog, but I mentioned it as one of R's killer features in a previous blog entry; it is both important and convenient that formula is a call object.
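
To illustrate the point about formula (a base-R sketch, not from the original post):

# A formula is a call object underneath, carrying a class and an environment
f <- y ~ x + z
class(f)     # "formula"
is.call(f)   # TRUE
f[[1]]       # the `~` operator
all.vars(f)  # "y" "x" "z"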

For Python to become a true contender in statistical analysis, it will have to become more R-like or build on R-like features - unless they are about to pull some other enormous super-cool rabbit out of a hat (which I don't rule out). When should you use Python? Use Python if your data doesn't contain missing values, is numerical, and you don't need categorical data. Python (numpy) is geared for physics-type applications with matrix and high-dimensional array computations. Python is also a very good general-purpose scripting language and is good at doing most general things - but not statistics. What can R learn from Python? Documentation, docstring style and presentation, and the object-oriented programming style. Python makes OOP pretty straightforward with one good standard; R has three: S3, S4, and reference classes (sometimes called R5). I like reference classes: they work in a similar way to Python's OOP, and though not fully developed I think they are R's best and most promising OOP technique, as sketched below.
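
Here is a minimal sketch of the reference-class style using base R's setRefClass(); the Account class is a made-up example:

# Reference classes (R5) give mutable, Python-like objects in base R
Account <- setRefClass("Account",
  fields  = list(balance = "numeric"),
  methods = list(
    deposit = function(x) { balance <<- balance + x }
  ))

a <- Account$new(balance = 100)
a$deposit(50)
a$balance  # 150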

For quite some time now I have looked on with envy at Python's OOP and its many useful and well-crafted general programming tools. Rcpp and reference classes change this somewhat: you can now pull C/C++ libraries into R in a simple and straightforward manner, and write Python-like OO code.
