From now on, there will be fundamental changes to this blog. Active Analytics Ltd is no longer just an R-based company; we embrace any open source statistical/software development tools that we feel suit our aims. This blog will now reflect that: there will be much less emphasis on R, more on big data, and more focus on other programming languages and tools. I hope that this will be more stimulating and useful for those who read this blog, and that it will help readers generate alternative interesting ideas.
This blog entry follows on from the previous one on big data Chain Ladder analysis in R, where I used the rhdf5 and ChainLadder packages, plus some quickly implemented code, to carry out chain ladder analysis. This blog entry does the same thing in Python.
I first start by comparing R and Python for statistical analysis in the R vs Python section, feel free to skip over this section - it is quite long.
For many months now I have been itching for an excuse to do a blog posting on data analysis in Python. The brilliantly written Python h5py module for HDF5 file I/O is a good excuse for me to start doing this.
Before I begin I would like to say that nothing I say here is a ‘dis’ on R. I love R; it is the language I started coding in. But there are many tools and programming languages out there, and a good coder and developer will select the right tool for the job - if given the choice.
The philosophy of R is to ‘turn ideas into software quickly and faithfully’, and R delivers on this for data analysis. For trying out an idea and getting something to work quickly in statistical data analysis, the S language has had a big advantage over Python - but this has changed!
These features have been baked into the R language from the very beginning: lapply(), sapply(), vapply(), apply(), tapply(), by(), the plyr and reshape2 packages, and many many more. While some of these functions are relatively easy to implement, they are extremely useful. Some of this functionality is already implemented in one way or another in the pandas package; pandas also has categorical data coding, and this is seamlessly interpreted by the statsmodels package. A lot of very useful stuff has been built on these ‘killer’ features, and this is why R programmers put up with so many of the less desirable challenges of coding in R - because the language is so darn useful for data analysis. However, Python is catching up and is already claiming ground from R as the scripting language of choice in statistical analysis and data mining applications.
I think of Python as being R’s big brother, and when young siblings fight, big brothers often win. Regardless of what happens to R, I think that its philosophy will carry on. In terms of philosophy and language there is some overlap between R and Python. They are both scripting languages, and Python is also designed to allow programmers to implement things quickly, but while R comes from statistics, Python comes from physics and engineering. In Python’s numpy module, for example, you will find much more emphasis on multidimensional arrays (maybe for those 4th rank stiffness tensors physicists play with). Because Python is built by physicists and engineers, the quality is better: it is faster, it is more reliable, the procedures are more consistent (few nasty surprises), the documentation is impressive in its thoroughness (it makes you feel safe, warm and cuddly inside), the modules are more mature, and there are many more cross-platform features (in terms of OS, and connections to other programming languages) than in R. It is a serious, full-featured, grown-up, kitchen-sink-included scripting language, and it is easy to use - and when I say easy, I mean EASY! Whether you want to prototype or do serious development, you can do it quickly in Python. In fact, I don’t believe there is a language that embraces the philosophy of the S language better than Python. More and more R features are being implemented in Python, e.g. pandas, statsmodels, scikits, scipy, and numpy.
I will be using three Python modules today: h5py for HDF5 file I/O, numpy for numerical data crunching, and the time module for timing processes. I am using Python 2.7.5+ on Ubuntu 13.10, on the same computer as the equivalent analysis in R, with no multi-threading/parallel processing tricks.
We now load the modules:
import h5py
import numpy as np
import time
In the spirit of porting R features into Python (or other languages), I hope to slowly develop a theme in my blogs called littleR. These posts will explore how to implement ‘killer’ R features in other languages (probably just Python for now). But if you are an R programmer seriously considering using Python, you should really check out the pandas and statsmodels packages - they are awesome.
In R, there is a function called jitter(); it adds a little noise to each point that you submit to the function. Below is my version of jitter, which I call Jitter() because I like beginning function names with capitals (I also camel-case). As you can see, it is a very simple function, three lines long: each number is taken to be normally distributed around itself, and the standard deviation is assumed to be sigmaScale times smaller (which defaults to 100). Could it be simpler?
def Jitter(array, sigmaScale = 100):
    array = np.random.normal(loc = array, scale = array/sigmaScale)
    return array
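A quick usage sketch, repeating the Jitter() definition so the snippet runs standalone (the exact jittered values depend on numpy's random state, so the comment only describes the scale of the noise):

```python
import numpy as np

def Jitter(array, sigmaScale = 100):
    # Each value is drawn from a normal centred on itself,
    # with standard deviation value/sigmaScale
    array = np.random.normal(loc = array, scale = array/sigmaScale)
    return array

np.random.seed(42)
x = np.array([100.0, 200.0, 300.0])
y = Jitter(x)
# y is x plus a little noise: each entry stays within about 1% of the original
```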
Next is the Lapply() function, which stands in for the lapply()/sapply() functions in R. It will return a list of objects if given a list, or a numpy array if given a numpy object.
# Function to find out whether an item is numpy-type
def isnumpy(item):
    return 'numpy' in str(type(item))

def Lapply(_list, fun, **kwargs):
    iSize = len(_list)
    out = list()
    for index in range(0, iSize):
        out.append(fun(_list[index], **kwargs))
    if isnumpy(_list):
        out = np.array(out)
    return out
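Here is a quick demonstration of the list-in/list-out versus array-in/array-out behaviour, repeating slightly condensed versions of the definitions above so it runs standalone:

```python
import numpy as np

def isnumpy(item):
    return 'numpy' in str(type(item))

def Lapply(_list, fun, **kwargs):
    out = [fun(item, **kwargs) for item in _list]
    # Mirror the input container type: numpy in, numpy out
    return np.array(out) if isnumpy(_list) else out

# A plain list in gives a plain list out ...
squaredList = Lapply([1, 2, 3], lambda x: x**2)
# ... while a numpy array in gives a numpy array out
squaredArray = Lapply(np.array([1, 2, 3]), lambda x: x**2)
```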
Next is the LowerTri() function, which acts a bit like lower.tri() in R but returns indices rather than a logical matrix.
# Okay I don't like typing, so I will declare LowerTri
def LowerTri(mMatrix):
    return np.tril_indices_from(mMatrix, -1)
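To see what LowerTri() returns, here is a small sketch on a 3 x 3 matrix; the -1 offset means the diagonal itself is excluded, so only the strictly lower triangle is indexed:

```python
import numpy as np

def LowerTri(mMatrix):
    return np.tril_indices_from(mMatrix, -1)

m = np.arange(9).reshape(3, 3).astype(float)
rows, cols = LowerTri(m)
# rows == [1, 2, 2], cols == [0, 0, 1]: the strictly lower triangle

# The index pair can be used directly for assignment,
# e.g. to blank out the lower triangle as we do below
m[LowerTri(m)] = np.nan
```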
Again this is pretty straightforward. These functions will be used extensively in the Chain Ladder demonstration below.
Okay, let’s get down to business. In this section we will implement in Python the analysis done in the previous posting in R. Imagine you have a lot of claims triangles that will not fit into memory. I claim that HDF5 (I like to use H5 to mean the same thing) is a very good option for fast I/O and data analysis. The procedure is to simulate the data and write it to H5 format, then read back that data, analyse it using the Chain Ladder method, and write the results to another H5 file.
Again I will say that the h5py package is quite simply awesome; if you work with big data, you should take a look at it - the documentation is amazing. In this blog I make no attempt to optimize performance at all: there is no multi-threading or parallel processing, though these are straightforward to include since the API is so darn good. I could also have written the data set into a three-dimensional array instead of separate data items, but this is simply a quick hack to get an idea of the comparison with R.
First I create the Age-to-Age factors:
# Age-to-Age factors
ageFact = np.linspace(1.9, 1, 10)
ageFact = ageFact.cumprod()
As before we next create a function to generate a triangle row (equivalent to a development year).
# This is a function to create a row of the triangle
def GenerateTriangleRow(iDev, dFactors = ageFact, dInflationRate = 1.02, initClaim = 154):
    dFactors = np.concatenate(([1], dFactors))
    dClaims = Jitter(initClaim)*Jitter(dFactors)*(dInflationRate**iDev)
    return dClaims
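A quick sanity check of a single row, repeating the pieces above so it runs standalone: a row has one entry per development period (the leading 1 in the factor vector gives eleven entries for ten factors), and grows with the cumulative factors:

```python
import numpy as np

def Jitter(array, sigmaScale = 100):
    return np.random.normal(loc = array, scale = array/sigmaScale)

# Cumulative age-to-age factors, as in the main text
ageFact = np.linspace(1.9, 1, 10).cumprod()

def GenerateTriangleRow(iDev, dFactors = ageFact, dInflationRate = 1.02, initClaim = 154):
    dFactors = np.concatenate(([1], dFactors))
    dClaims = Jitter(initClaim)*Jitter(dFactors)*(dInflationRate**iDev)
    return dClaims

np.random.seed(0)
row = GenerateTriangleRow(1)
# row has 11 entries: the initial claim plus ten cumulative developments
```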
Now we create a function to simulate the full triangle
def GenerateTriangle(iSize, **kwargs):
    indices = np.arange(1, iSize + 1)
    mClaimsMatrix = Lapply(indices, GenerateTriangleRow, **kwargs)
    # Reverse the columns
    mClaimsMatrix = mClaimsMatrix[:, ::-1]
    # Assign nan to the lower triangle
    mClaimsMatrix[LowerTri(mClaimsMatrix)] = np.nan
    # Reverse the columns again to get the claims triangle
    mClaimsMatrix = mClaimsMatrix[:, ::-1]
    return mClaimsMatrix
Next we create a function to write the triangle to H5 format. See how easy it is to use the h5py API?
def WriteToH5File(sName, h5File):
    h5File[sName] = GenerateTriangle(11)
    return None
Next we create the H5 file and the matrix names
# H5 file
sClaimsH5 = "data/ClaimsTri.h5"
claimsH5File = h5py.File(sClaimsH5)
# Matrix names
sMatrixNames = Lapply(range(2000), str)
Finally we execute the process to simulate the data and write it to file 2000 times, as in the R case:
# Executing the process to simulate data & write to file
np.random.seed(0)
xA = time.time()
null = Lapply(sMatrixNames, WriteToH5File, h5File = claimsH5File)
claimsH5File.flush()
xB = time.time()
print 'The elapsed time is ' + str(xB - xA)
# This is the time taken to run
# The elapsed time is 1.108 s
Say what? Yes, more than an order of magnitude faster than the R case with no tweaking at all. Be careful, however: all I may have proved is that h5py is faster than rhdf5.
To my knowledge there is no Python Chain Ladder package sitting out there, so let’s quickly hack our own functions.
First we create a function to extract the age-to-age factor for a particular column pair.
def GetFactor(index, mTri):
    colSum = mTri[:np.negative(index + 1), index:(index + 2)].sum(0)
    return colSum[1]/colSum[0]
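To see the slicing at work, here is GetFactor() applied to a toy 3 x 3 triangle of made-up claims figures. The row slice `[:np.negative(index + 1)]` drops exactly the rows whose later column is unobserved (nan), so each factor is a ratio of column sums over fully observed rows:

```python
import numpy as np

def GetFactor(index, mTri):
    # Sum each column of the pair over the rows where both are observed
    colSum = mTri[:np.negative(index + 1), index:(index + 2)].sum(0)
    return colSum[1]/colSum[0]

# A toy claims triangle, nan in the unobserved lower-right corner
mTri = np.array([[10., 15., 18.],
                 [12., 17., np.nan],
                 [14., np.nan, np.nan]])

f0 = GetFactor(0, mTri)   # (15 + 17) / (10 + 12) = 32/22
f1 = GetFactor(1, mTri)   # 18 / 15 = 1.2
```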
We then use this function to apply Chain Ladder to the triangle and return the completed claims development ‘square’ - so called because it is a full matrix.
def GetChainSquare(mClaimTri):
    iSize = mClaimTri.shape[1]
    dFactors = Lapply(np.arange(iSize - 1), GetFactor, mTri = mClaimTri)
    dAntiDiag = mClaimTri[:,::-1].diagonal()[1:]
    for index in range(dAntiDiag.size):
        mClaimTri[index + 1, np.negative(index + 1):] = dAntiDiag[index]*dFactors[np.negative(index + 1):].cumprod()
    return mClaimTri
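Here is the whole thing on the same toy triangle, with condensed copies of the helpers above so the snippet runs standalone. Each latest observed claim (taken from the anti-diagonal) is rolled forward by the cumulative product of the remaining factors:

```python
import numpy as np

def isnumpy(item):
    return 'numpy' in str(type(item))

def Lapply(_list, fun, **kwargs):
    out = [fun(item, **kwargs) for item in _list]
    return np.array(out) if isnumpy(_list) else out

def GetFactor(index, mTri):
    colSum = mTri[:np.negative(index + 1), index:(index + 2)].sum(0)
    return colSum[1]/colSum[0]

def GetChainSquare(mClaimTri):
    iSize = mClaimTri.shape[1]
    dFactors = Lapply(np.arange(iSize - 1), GetFactor, mTri = mClaimTri)
    # Latest observed claim for each origin year but the first
    dAntiDiag = mClaimTri[:,::-1].diagonal()[1:]
    for index in range(dAntiDiag.size):
        # Roll each latest claim forward through the remaining factors
        mClaimTri[index + 1, np.negative(index + 1):] = \
            dAntiDiag[index]*dFactors[np.negative(index + 1):].cumprod()
    return mClaimTri

mTri = np.array([[10., 15., 18.],
                 [12., 17., np.nan],
                 [14., np.nan, np.nan]])
square = GetChainSquare(mTri)
# The nan entries are now filled in, e.g. square[1, 2] == 17 * 1.2 == 20.4
```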
Next we create a function to write this completed square to file:
def WriteSquareToFile(sName, destH5, sourceH5):
    destH5[sName] = GetChainSquare(sourceH5[sName][...])
    return None
Then we create the H5 file to store the processed triangles (there is no reason why we could not simply store these in the same file as the source data):
# Creating the H5 files for the processed triangles
sProcessedH5 = "data/ClaimsSquare.h5"
processedH5 = h5py.File(sProcessedH5)
# Executing the process to do chainladder and write to file
xA = time.time()
null = Lapply(sMatrixNames, WriteSquareToFile, sourceH5 = claimsH5File, destH5 = processedH5)
processedH5.flush()
xB = time.time()
print 'The elapsed time is ' + str(xB - xA)
# This is the time taken to run
# The elapsed time is 1.224 s
No, you haven’t gone blind; it’s just another scintillating performance from the h5py package.
I enjoyed writing the code for this blog - almost everything I attempted worked on the first try. Python is robust, no doubt. All programming languages have their disadvantages and frustrations, but R programmers put up with a lot for the immediacy and convenience of their language. Python is a richer, more stable, and better performing language, and it embraces what is best about S - turning ideas into software quickly and faithfully. It is an easy language to code in, with many similarities to R. I think the performance numbers speak for themselves, but as I have said, they may be benchmarking h5py against rhdf5 rather than Python against R.
Stay tuned to this blog and expect to see a lot more Python related activities here.