Big data Chain Ladder analysis in Python

Active Analytics Ltd: posted 15 Dec 2013 06:51 by Chibisi Chima-Okereke [ updated 26 Jan 2014 10:59 ]

Introduction

From now there will be fundamental changes to this blog. Active Analytics Ltd is now not just an R-based company, we embrace any open source statistical/software development tools that we feel suits our aims. This blog will now reflect that, there will be much less emphasis on R and more on big data, and more focus on other programming languages and tools. I hope that this will be more stimulating and useful for those that read this blog, and that it will help readers generate alternative interesting ideas.

This blog entry follows on from the previous one on Big data Chain Ladder analysis in R where I used the rhdf5 and ChainLadder packages, and some quickly implemented code to carry out chain ladder analysis. This blog entry does the same thing in Python.

I first start by comparing R and Python for statistical analysis in the R vs Python section, feel free to skip over this section - it is quite long.

R vs Python

-The R Side

For many months now I have been itching for an excuse to do a blog posting on data analysis in Python. The brilliantly written Python h5py module for HDF5 file I/O is a good excuse for me to start doing this.

Before I begin I would like to say that nothing I say here is a 'dis' on R. I love R, it is the program I started coding in, but there are many tools and programming languages out there, and a good coder and developer will select the right tool for the right job - if given the choice.

The important feature of R is its philosophy which can be read in my introduction to R course notes that are freely available here. This philosophy is to 'turn ideas into software quickly and faithfully', and R does this in data analysis. For trying out an idea and getting something to work quickly in statistical data analysis the S language has had a big advantage over Python but this has changed!

These features have been baked into the R language from the very beginning:

  1. The data frame (either data.frame/data.table). The pandas package is available in Python and it is brilliant, a lot of the features available in R's data.frame/data.table objects have been ported into this package, it is very feature rich and well worth checking out. I would say that the main difference in the way pandas works is that the functions applied to the DataFrame object are explicitly method based, so if you are coming from an R background it may take a little while to get used to this but it is well worth the time investment!
  2. The formula class and model.matrix. Formula objects are extremely important in creating models. Lets face it, data comes in tables, and has complex types. The formula class allows data to be reshaped and prepared for models using functions such as model.matrix(). Formula also plays important roles in plotting and many other features and quite frankly deserves its own blog entry. The Python analogue is in the patsy package.
  3. The factor class. This is very interesting both as a form of compression, and as a data class. Categorical data is very important in many modelling applications and the integration between the factor class, formula and model matrices is seamless in R both for regression and hypothesis testing. It is also used in summary statistics and split-apply type jobs using tapply(), by(), and plyr and reshape2 packages and many many more. The pandas package has categorical data coding and this is seamlessly interpreted by the statsmodel package.
  4. The apply functions, lapply(), sapply(), vapply(), apply(), tapply(), by(), plyr, reshape2. While some of these functions are relatively easy to implement, they are extremely useful. Some of this functionality is already implemented in one way or another in the pandas package.

A lot of very useful stuff has been built on these 'killer' features and this is why R programmers put up with so many less desirable challenges of coding in R - because the language is so darn useful for data analysis. However Python is already catching up and is already claiming ground from R as the scripting language of choice in statistical analysis and data mining applications.

-The Python Side

I think of Python as being R's big brother, and when young siblings fight big brothers often win. Regardless of what happens to R, I think that its philosophy will carry on. In terms of philosophy and language, there is some overlap in R and Python. They are both scripting languages and Python is also designed to allow programmers to implement things quickly, but while R comes from statistics Python comes from physics and engineering. In Python for example the numpy module, you will find much more emphasis on multidimensional arrays (maybe for those 4th rank stiffness tensors physicists play with). But because Python is built by physicists and engineers, the quality is better, it is faster, it is more reliable, the procedures are more consistent (few nasty surprises), the documentation is impressive in its thoroughness (it make you feel safe, warm and cuddly inside), the modules are also more mature, there are many more features for cross platform (in terms of OS, and connections to other programming languages) than R. It is a serious full featured, grown-up, kitchen sink included scripting programming language and it is easy to use, and when I say easy, I mean EASY! If you want to prototype or for serious development, you can do it quickly in Python. In fact, I don't believe that there is a better language that embraces the philosophy of the S language than Python. More and more features in R are being implemented in Python e.g. pandas, statsmodels, scikits, scipy, and numpy.

Preliminaries

I will be using three Python modules today, h5py for HDF5 file I/O, numpy for numerical data crunching, and the time package for timing processes. I am using Python 2.7.5+ on Ubuntu 13.10. I am using the same computer as the equivalent analysis in R with no multi-threading/parallel processing tricks.

We now load the packages

import h5py as h5py
import numpy as np
import time as time

littleR

In the spirit of porting R-features into Python (or other languages), I hope to slowing develop a theme in my blogs called littleR. These will be exploring how to implement 'killer' R features in other languages (probably just Python for now). But really if you are an R programmer seriously considering using Python you should really check out the pandas and statsmodels packages they are awesome.

Jitter function

In R, there is a function called jitter(), it adds a little noise to each point that you submit to the function. Below is my version of jitter, which I call Jitter() because I like beginning function names with capitals (I also camel-case). As you can see, it is a very simple function, three lines long, each number is taken to be normally distributed around itself and the standard deviation is assumed to be sigmalScale times smaller (which defaults to 100). Could it be simpler?

def Jitter(array, sigmaScale = 100):
    array = np.random.normal(loc = array, scale = array/sigmaScale)
    return array

Lapply function

Next is the Lapply() function which stands in for the lapply()/sapply() function in R. It will return a list objects if given a list or a numpy array if given a numpy object.

# Function to find if item is numpy-type
def isnumpy(item):
    return 'numpy' in str(type(item))

def Lapply(_list, fun, **kwargs):
    iSize = _list.__len__()
    out = list()
    for index in range(0, iSize):
        out.append(fun(_list[index], **kwargs))
    if isnumpy(_list):
        out = np.array(out)
    return out

Next is the LowerTri() function with acts a bit like lower.tri() in R but returns indices rather than logical.

# Okay I don't like typing, so I will declare LowerTri
def LowerTri(mMatrix):
    return np.tril_indices_from(mMatrix, -1)

Again this is pretty straight forward. These functions will be used extensively in the Chain Ladder demonstration below.

Chain Ladder analysis in Python

Okay lets get down to business. In this section we will implement in Python the analysis done in the previous posting in R. Consider you have a lot of claims triangles that will not fit into memory. I claim that HDF5 (I like to use H5 to mean the same thing) is a very good option for fast I/O and data analysis. The procedure is to simulate the data and write it to H5 format, and then read back that data, analyse it using the Chain Ladder method, and then write the results to another H5 file.

Again I will say that the h5py package is quite simply awesome, if you work with big data, you should take a look at it, the documentation is amazing. In this blog I make no attempt to optimize performance at all, there is no multi-threading or parallel processing, though these are straightforward to include since the API is so darn good. I could also have written the data set into a three-dimensional array instead of separate data items but this is simply a quick hack to get an idea of the comparison with R.

Simulating the data

First I create the Age-to-Age factors:

# Age to Age factors
ageFact = ageFact = np.linspace(1.9, 1, 10)
ageFact = ageFact.cumprod()

As before we next create a function to generate a triangle row (equivalent to a development year).

# This is a function to create a row of the triangle
def GenerateTriangleRow(iDev, dFactors = ageFact, dInflationRate = 1.02, initClaim = 154): 
    dFactors = np.concatenate(([1], dFactors)) 
    dClaims = Jitter(initClaim)*Jitter(dFactors)*(dInflationRate**iDev) 
    return dClaims 

Now we create a function to simulate the full triangle

def GenerateTriangle(iSize, **kwargs):
    indices = np.arange(1, iSize + 1)
    mClaimsMatrix = Lapply(indices, GenerateTriangleRow, **kwargs)
    # Reverse the columns
    mClaimsMatrix = mClaimsMatrix[:, ::-1]
    # Assign nan to lower triangle
    mClaimsMatrix[LowerTri(mClaimsMatrix)] = nan
    # Reverse columns to get the claims triangle
    mClaimsMatrix = mClaimsMatrix[:, ::-1]
    return mClaimsMatrix

Next we create a function to write the triangle to H5 format. See how easy it is to use the h5py API?

def WriteToH5File(sName, h5File):
    h5File[sName] = GenerateTriangle(11)
    return None

Next we create the H5 file and the matrix names

# H5 file
sClaimsH5 = "data/ClaimsTri.h5"
claimsH5File = h5py.File(sClaimsH5)

# Matrix names
sMatrixNames = Lapply(range(2000), str)

Finally we execute the process to simulate the data and write to file 2000 times as in the R case

# Executing the process to simulate data & write to file
np.random.seed(0)
xA = time.time()
null = Lapply(sMatrixNames, WriteToH5File, h5File = claimsH5File)
claimsH5File.flush()
xB = time.time()
print 'The elapsed time is ' + str(xB - xA)
This is the time taken to run
# The elapsed time is 1.108 s

Say what? Yes more than an order of magnitude faster than the R case with no tweaking at all. Be careful however, all I may have proved is that h5py is faster than rhdf5.

Chain Ladder analysis

To my knowledge there is no python Chain Ladder package sitting out there, so lets quickly hack our own function.

First we create function to extract the age-to-age factor for a particular column pair

def GetFactor(index, mTri):
    colSum = mTri[:np.negative(index + 1), index:(index + 2)].sum(0)
    return colSum[1]/colSum[0]

We then use this function to apply Chain Ladder to the triangle and return the completed claims development 'square' - because it is a matrix.

def GetChainSquare(mClaimTri):
    iSize = mClaimTri.shape[1]
    dFactors = Lapply(np.arange(iSize - 1), GetFactor, mTri = mClaimTri)
    dAntiDiag = mClaimTri[:,::-1].diagonal()[1:]
    for index in range(dAntiDiag.size):
        mClaimTri[index + 1,np.negative(index + 1):] = dAntiDiag[index]*dFactors[np.negative(index + 1):].cumprod()
    return mClaimTri

Next we create a function to write this completed square to file

def WriteSquareToFile(sName, destH5, sourceH5):
    destH5[sName] = GetChainSquare(sourceH5[sName][...])
    return None

Then we create the H5 file to store the processed files (there is no reason why we cannot simply store this in the same file as the source data)

# Creating the H5 files for the processed triangles
sProcessedH5 = "data/ClaimsSquare.h5"
processedH5 = h5py.File(sProcessedH5)

# Executing the process to do chainladder and write to file
xA = time.time()
null = Lapply(sMatrixNames, WriteSquareToFile, sourceH5 = claimsH5File, destH5 = processedH5)
processedH5.flush()
xB = time.time()
print 'The elapsed time is ' + str(xB - xA)


This is the time taken to run
# The elapsed time is 1.224 s

No you haven't gone blind, its just another scintillating performance from the h5py package

Summary

I enjoyed writing the code for this blog - everything I attempted worked mostly on the first try. Python is robust no doubt. All programming languages have their disadvantages and frustrations, but R programmers put up with a lot for the immediacy and convenience of their language. Python is a richer more stable and better performing language, and it embraces what is best about S, to create ideas quickly and faithfully, it is an easy language to code with many similarities to R. I think the performance numbers speak for themselves but as I have said, they may be benchmarking h5py against rhdf5 rather than R and Python.

Stay tuned to this blog and expect to see a lot more Python related activities here.

Data Science Consulting & Software Training

Active Analytics Ltd. is a Data/Statistical Analysis, and R Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.