R vs Python: Why R is still the king of statistical computing

Active Analytics Ltd: posted 5 Feb 2014 17:46 by Chibisi Chima-Okereke [ updated 22 Mar 2014 08:18 ]

Introduction

One of our previous blog entries praised the virtues of Python impressed with its speed on our quickly hacked Chain Ladder code, much faster than the first go in R – though this is mostly a result of Python's HDF5 storage format h5py package – which is a great package. A later blog entry shows an implementation in R based on Active Analytics activeH5 package showing parity between performance in R and Python.

Recently there have been blog posts on R and Python, whether Python is now displacing R as a programming language for data science and trying to ascertain whether Python is really faster than R. This blog entry was written for people that carry out statistical analysis and are trying to decide whether R or Python is the best route to take.

This blog entry is not intended to "rip" into Python. Python is a great language and we at Active Analytics love it - when the tools are available, you can build applications very quickly, and docstrings bring sanity to the documentation process. If you use Spyder your docstrings are immediately searchable which is a cool feature.

I have written some Python scripts to emulate R features located in Active Analytics Github repository, that's part of the "littleR" theme.

Is Python really faster than R?

Based on my experience if you are doing pure floating point matrix operations, both are about the same, if you are doing real statistical analysis, R is miles ahead. The latter statement will be dealt with later but for the former, here is a quicky (but by no means comprehensive) benchmark for creating and multiplying matrices:

For R

This R code was written in Rstudio IDE which is totally great IDE:

system.time(x <- matrix(runif(1e8), nc = 1e4))
#  user  system elapsed 
# 3.996   0.177   4.193 
system.time(y <- matrix(runif(1e8), nc = 1e4))
#  user  system elapsed 
# 3.967   0.187   4.155 
system.time(z <- t(x) %*% y)
#    user  system elapsed 
# 191.087   8.655  27.203 

For Python

Here is the script

# BasicBench.py

import numpy as np
import time

# The benchmarks

# -*- coding: utf-8 -*-
xA = time.time()
x = np.ndarray(shape = (1e4, 1e4), buffer = np.random.uniform(size = 1e4**2))
xB = time.time()
y = np.ndarray(shape = (1e4, 1e4), buffer = np.random.uniform(size = 1e4**2))
xC = time.time()
z = x.transpose().dot(y)
xD = time.time()

print "Time to create matrix x is " + str(xB - xA) + " s"
print "Time to create matrix y is " + str(xC - xB) + " s"
print "Time to for matrix operation is " + str(xD - xC) + " s"

# To run
# $ python basicBench.py 

Results

# Time to create matrix x is 1.261 s
# Time to create matrix y is 1.238 s
# Time to for matrix operation is 28.895 s

This is running Ubuntu 13.10 on an core i7 processor with 32GB of RAM with openblas installed (8 threads).

It is interesting to observe that R's matrix operation was a little faster but that creating the matrix containing uniformly distributed random numbers in Python is much faster. Even with matrices at this size, the matrix operation benchmark gives pretty similar results. If I was doing this matrix operation on matrices of this size or larger, I would use alternative faster and less memory intensive methods (e.g. using my activeH5 package in memory and on disk see here for more details).

It is worth testing these things out for yourself thinking carefully about the kind of operations you would like to do and then benchmarking those. You can see from this simple example that the performance of different aspects of the calculation can be quite variable. At the end of the day it is pretty easy to port in a C++ stand-in for R functions using Rcpp which will perk up any flagging parts of the analysis.

Rcpp vs Numbas & Cython

If you are interested in pure speed of numerical calculations you will find that both R & Python are probably about the same, optimization can be done on both for instance on the R side abstracting to C/C++ in R using Rcpp and RcppArmadillo - which are totally awesome, I have ginned-up many an R routine using these tools. On the Python side there are lots to choose from, Cython, Numbas, and weave. If its speed you want the most important choice is not whether to use R or Python, it is whether there is a well written library available for your application in the given programming language.

I have used, R's Rcpp, and Python's Numbas and Cython. I would say preference is at least partly down to the individual. My preference by a long way is Rcpp, which is changing the landscape of R and will be pivotal in R's future, it is quite possible that it is the most important R package out there at the moment. Why do I love Rcpp so much?

  1. C++ integration with R is seamless. No really it is.
  2. It minimises having to directly write .Call() .Internal() and other R back-end stuff.
  3. Its C++ for crying out loud with R, don't you get it? Think of all the C/C++ libraries out there. It's now reasonably straight forward to include them in your R package. No really it is.
  4. You don't have to learn a hybrid language, just R and C++. When I'm eating a steak, I want to know it's a steak. It is difficult to create hybrid languages. You can get away with it if they are both scripted languages by using appropriate decorators etc. essentially allowing the programmer to write both languages in the same place. But when one is a compiled language things get a little dicey, and if in-lining doesn't cut it you are in tricky territory of defining you own language to be translated and then compiled. This means that experts of either or both languages really have no idea how your language works - they have to learn a third hybrid language. Whereas even if you are new to C++, you can always learn C++ (from one or more of the many thousands of books and internet articles blogs resources etc.) and then just read the Rcpp API - like any other C++ API.

Perhaps the way forward for Python is that its weave tool should be further developed and documented to essentially make it as easy to #include C++ as it is in Rcpp. There is no shame in copying when the model works, after that you can make it better.

R vs Python in statistical analysis

For those that studied statistics 101 at secondary/high school and university lets take a quick walk down memory lane. One of the first things you learn about are types of data: discrete, continuous, and categorical. How do these data types translate to R and Python and how flexible are the types? As I said in a previous blog R's raison d'etre is statistical analysis and those basic data types are part of R's DNA and so are handling missing values. Python's scientific computing package numpy the defacto package for computation is quite happy for you to have continuous data which can have NAN values but lacks missing values for all types of data. The powers that be know that it is an issue (see here), I believe that Python cannot make significant inroads into statistical analysis without missing types. Currently, best that Python has come up with for statistical data types is the pandas package which does have categorical data types but are currently inflexible - due at least in part to the missing value problem. You could choose to hack something up using masked arrays, but it would still be sluggish (and you will loose the will to live).

Here is a demonstration of trying to do R-like operations on basic data types in Python. There are two python files in this coding example to emulate various R-like qualities, they are located here.

Lets see how factors are treated in the pandas package:

Import a little R


>>> execfile("RUtilities.py")# littleR utilities

This may look like R

 >>> x = sample(letters[0:3], 20, True)

But it really isn't

>>> x
array(['c', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'], 
      dtype='|S1')

Create a categorical variable


>>> import pandas as pd 
>>> y = pd.Categorical(x)
>>> y
Categorical: 
[c, b, b, a, b, b, a, b, c, c, c, a, a, a, b, a, a, c, c, c]
Levels (3): Index(['a', 'b', 'c'], dtype=object)

We run into limitations pretty quickly

 >>> y[0] = "f"
  File "", line 1, in 
TypeError: 'Categorical' object does not support item assignment

Let's be cheeky and try to use numpy's nan as a kind of NA:

>>> np.nan
nan
>>> z = x

Hmm maybe it would be better if this failed altogether than doing this ...

>>> z[0] = np.nan
>>> x
array(['n', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'], 
      dtype='|S1')

Okay so lists are great right? lets get our NA into a list and then try to convert it to a factor >>> z = x.tolist() >>> z[0] = np.nan

Yay?

 >>> z
[nan, 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a', 'a', 'b', 'a', 'a', 'c', 'c', 'c']

Try making this a factor

 >>> z = pd.Categorical(z)
>>> z
Categorical: 
[nan, b, b, a, b, b, a, b, c, c, c, a, a, a, b, a, a, c, c, c]
Levels (4): Index(['a', 'b', 'c', 'nan'], dtype=object)

But do we really want character "nan" as missing values as levels in our categorical variable?

What about integers?

 >>> x = np.array([1,2,3,np.nan, 2,4,6], dtype = int)
Traceback (most recent call last):
  File "", line 1, in 
ValueError: cannot convert float NaN to integer

The above examples are slightly tongue-in-cheek but if missing values are important to you, you need R. But are factors that behave as in R impossible to get in Python? No but they are awful slow - unless you build one yourself in C. Here we have a python factor object that behaves like R's factor here is a demo:

 >>> execfile('RUtilities.py')
>>> execfile('factor.py')
>>> import time

>>> x = np.array(['c', 'b', 'b', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'c', 'a', 'a',
       'a', 'b', 'a', 'a', 'c', 'c', 'c'])
>>> y = factor(x)
>>> y.__class__

>>> y
y
c b b a b b a b c c c a a a b a a c c c
Levels: a  b  c

Yes each element is a factor

 >>> y[3]
a
Levels: a  b  c

You can assign to it
>>> y[3] = "b"
>>> y
c b b b b b a b c c c a a a b a a c c c
Levels: a  b  c

Assigning out of factor scope results in "missing" value similar to R

 >>> y[3] = "f"
>>> y
c   b   b   nan   b   b   a   b   c   c   c   a   a   a   b   a   a   c   c   c
Levels: a  b  c

You can even have multiple insertions

 >>> y[0:3] = "c"
>>> y
c   c   c   nan   b   b   a   b   c   c   c   a   a   a   b   a   a   c   c   c
Levels: a  b  c
>>> y[3:6] = ["c", "a", "c"]
>>> y
c c c c a c a b c c c a a a b a a c c c
Levels: a  b  c

But it's ridiculously slow

 >>> x = sample(letters, 50000)
>>> xA = time.time()
>>> y = factor(x)
>>> xB = time.time()
>>> print 'Time taken: ' + (xB - xA).__str__() + ' s'
Time taken: 2.28630208969 s

In R

x = sample(letters, 50000, TRUE)
system.time(y <- factor(x))
   user  system elapsed 
  0.003   0.000   0.002 

Lets build a basic model matrix in python using pandas and the patsy package, a package for building model matrices.

import pandas as pd
import patsy as f
import time as time
data = pd.DataFrame({"y" :  np.random.uniform(size = 50000), "x" : x})
xA = time.time()
mm = f.dmatrices("y ~ x", data)
xB = time.time()
print 'Time taken: ' + (xB - xA).__str__() + ' s'
Time taken: 0.288909912109 s

In R

data <- data.frame("y" = runif(50000), "x" = x)
system.time(mm <- model.matrix(y ~ x, data = data))
   user  system elapsed 
  0.053   0.000   0.052

Note that in the above model matrix example, the character array x is used to build the model matrix in Python so what you are seeing is the speed for doing things in a "standard way" using the patsy package.

The R package deluge

At the time of this blog entry there are 5173 R packages in CRAN. What does this mean? Perhaps it means that many thousands of statisticians, data scientists, mathematicians and others are so comfortable with the R programming environment they decided to choose it to publicly express their various algorithmic ambitions. Consider also that these packages represent the tip of a very large iceberg in terms of statisticians, data scientists/analysts, and programmers that interface with R. They have extended the functionality of R to many different areas of statistical analysis and applied mathematics. Perhaps this is what they call this a ringing endorsement. Could they be wrong? Maybe, but not likely.

Summary

This blog is focused on data types basics which describe why it is so difficult to build statistical objects in Python. R users will know that it has barely begun to point out the super-cool features but the basics are important - they are the foundation that advanced structures are built on. One of the basics is R's formula object, which is a call object underneath with some attributes. I haven't touched on it in this blog but mentioned it as one of R's killer features in a previous blog entry. I think it is also important and convenient that formula is a call object.

For Python to become a true contender in statistical analysis, it will have to become more R-like or build on R-like features - unless they are about to pull some other enormous super-cool rabbit out of a hat (which I don't rule out). When should you use Python? Use Python if your data doesn't contain missing values, is numerical and you don't need categorical data. Python (numpy) is geared for physics type applications with matrix and high dimensional array computations. Python is also a very good general purpose scripting language and is good at doing most general things - but not statistics. What can R learn from Python? Documentation, doc-strings style and presentation, and the object oriented programming style. Python makes OOP pretty straight forward with one good standard. R has three, S3, S4, and reference classes (sometimes called R5). I like reference classes, it works in a similar to Python's OOP, though not fully developed I think it is R's best and most promising OOP technique.

For quite some time now I have looked on with envy at Python's OOP and its many useful and well crafted general programming tools. Rcpp and reference classes changes this somewhat you can now pull in C/C++ libraries in to R in a simple and straight forward manner and write Python-like OO code.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.