Big data Chain Ladder analysis using activeH5

Active Analytics Ltd: posted 26 Jan 2014 11:25 by Chibisi Chima-Okereke [ updated 26 Jan 2014 11:53 ]

Introduction

Active Analytics Ltd has recently released an R package called activeH5 for big data storage and access. It allows very fast access to data frames and matrices on disk and in memory; more details can be found on the product launch blog. Today we compare the performance of the activeH5 package with the results from two previous posts on Chain Ladder analysis in R and Python, which used the rhdf5 and h5py packages. If you have read the previous blogs, note that the triangle simulators have been updated, so their code and times have changed.

Creating the claims triangles

This is the code to simulate the claims triangles. Feel free to skip to the times below the code block.

# Load the package
require(activeH5)

# Age-To-Age Factors
ageFact <- seq(1.9, 1, by = -.1)
# Inflation Rate
infRate <- 1.02
# Reversing columns
revCols <- function(x){
  x[,ncol(x):1]
}

# This shake function is faster than jitter and equivalent to my Jitter() function in Python
shake <- function(vec, sigmaScale = 100)
{
  rnorm(n = length(vec), mean = vec, sd = vec/sigmaScale)
}

# Alternative row generation function
GenerateRow <- function(iDev, dFactors = cumprod(ageFact), dInflationRate = 1.02, initClaim = 154)
{
  shake(initClaim)*shake(c(1, dFactors))*(dInflationRate^iDev)
}

# Function to generate a claims matrix
GenerateTriangle <- function(iSize, ...)
{
  indices = 1:iSize
  mClaimTri = t(sapply(indices, GenerateRow, ...))
  # Reverse columns to get the claims triangle
  mClaimTri = revCols(mClaimTri)
  # Assign NA to the lower triangle
  mClaimTri[lower.tri(mClaimTri)] = NA
  mClaimTri = revCols(mClaimTri)
  return(mClaimTri)
}

# Function to write claims matrix to file
WriteToH5File <- function(dSName, filePath)
{
  mClaims = GenerateTriangle(11)
  h5WriteDoubleMat(dSName, mClaims, dim(mClaims), filePath)
  return(invisible())
}

# The H5 file
claimsH5 = "data/ClaimsTri.h5"
# Create the H5 file
h5CreateFile(claimsH5, 1)
# The matrix names
matNames <- paste("ch", 1:2000, sep = "")
# Simulation time and time taken to write to file
system.time(sapply(matNames, WriteToH5File, filePath = claimsH5))
This is the time taken to run:
#  user  system elapsed 
# 0.850   0.039   0.888 

Compare the above with the time obtained using rhdf5 (7.795 s) and with that obtained using Python's h5py package (1.108 s). The performance of activeH5 and h5py is comparable; in this case activeH5 was a little faster (0.888 s).
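
To make the shape of each simulated matrix concrete, here is a minimal base R sketch of the masking step used in GenerateTriangle. It does not need activeH5. The helper name maskTriangle and the 4 x 4 input are hypothetical and chosen only for illustration; revCols is the same helper as in the code above.

```r
# Reverse the column order of a matrix (same helper as in the post)
revCols <- function(x) x[, ncol(x):1, drop = FALSE]

# Hypothetical helper: mask a full square of claims into a run-off triangle
maskTriangle <- function(mSquare) {
  mTri <- revCols(mSquare)
  mTri[lower.tri(mTri)] <- NA   # future development periods are unknown
  revCols(mTri)
}

# Illustrative 4 x 4 square: rows are origin periods, columns are development periods
mSquare <- matrix(1:16, nrow = 4, byrow = TRUE)
mTri <- maskTriangle(mSquare)
print(mTri)
```

The first row keeps all four values and each later row loses one more trailing value, so the NA entries fill the lower-right corner. This is the part that the Chain Ladder step later projects.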

Analysing the claims triangles

Here we analyse the claims triangles stored on disk and write them to another file using the activeH5 package. Again feel free to skip to the time benchmark below the code block.

# Function for calculating age-to-age factors
GetFactor <- function(index, mTri)
{
  fact = matrix(mTri[-c((nrow(mTri) - index + 1):nrow(mTri)), index:(index + 1)], ncol = 2)
  fact = c(sum(fact[,1]), sum(fact[,2]))  
  return(fact[2]/fact[1])
}

GetChainSquare <- function(mClaimTri)
{
  nCols <- ncol(mClaimTri)
  dFactors = sapply(1:(nCols - 1), GetFactor, mTri = mClaimTri)
  dAntiDiag = diag(revCols(mClaimTri))[2:nCols]
  for(index in 1:length(dAntiDiag))
    mClaimTri[index + 1, (nCols - index + 1):nCols] = dAntiDiag[index]*cumprod(dFactors[(nCols - index):(nCols - 1)])
  mClaimTri
}

WriteClaimsSquare <- function(dSName)
{
  cSquare <- GetChainSquare(h5ReadDoubleMat(dSName, claimsH5))
  h5WriteDoubleMat(dSName, cSquare, dim(cSquare), squareH5)  
}

# Writing the claims square
squareH5 = "data/ClaimsSquare.h5"
# Create the H5 file
h5CreateFile(squareH5, 1)
# Simulation time and time taken to write to file
system.time(sapply(matNames, WriteClaimsSquare))
This is the time taken to run:
#  user  system elapsed 
# 1.229   0.068   1.299

Here the time taken is 1.299 s, compared with 12.986 s for rhdf5 (an order of magnitude slower) and 1.224 s for h5py (similar performance, if a little faster).
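
For readers who want to check what the Chain Ladder step does without reading from an H5 file, the following is a small worked sketch in base R. The 3 x 3 triangle is made up for illustration, and GetFactor is written here in a slightly simplified form that should be equivalent to the version above for a square triangle; treat it as a sketch rather than the package's method.

```r
# Reverse the column order of a matrix (same helper as in the post)
revCols <- function(x) x[, ncol(x):1, drop = FALSE]

# Age-to-age factor between development columns index and index + 1,
# using only the rows where both columns are observed
GetFactor <- function(index, mTri) {
  rows <- seq_len(nrow(mTri) - index)
  sum(mTri[rows, index + 1]) / sum(mTri[rows, index])
}

# Fill the unknown lower-right corner from the latest diagonal
GetChainSquare <- function(mClaimTri) {
  nCols <- ncol(mClaimTri)
  dFactors <- sapply(seq_len(nCols - 1), GetFactor, mTri = mClaimTri)
  dAntiDiag <- diag(revCols(mClaimTri))[2:nCols]
  for (index in seq_along(dAntiDiag)) {
    mClaimTri[index + 1, (nCols - index + 1):nCols] <-
      dAntiDiag[index] * cumprod(dFactors[(nCols - index):(nCols - 1)])
  }
  mClaimTri
}

# Made-up cumulative claims triangle
mTri <- matrix(c(100, 150, 165,
                 110, 165, NA,
                 120, NA,  NA), nrow = 3, byrow = TRUE)
mSquare <- GetChainSquare(mTri)
print(mSquare)
```

The age-to-age factors here are (150 + 165)/(100 + 110) = 1.5 and 165/150 = 1.1, so the second row is completed with 165 x 1.1 = 181.5 and the third row with 120 x 1.5 = 180 followed by 180 x 1.1 = 198.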

Summary

This blog post barely scratches the surface of the features available in the activeH5 package. Its purpose is to give an idea of the package's performance relative to other big data equivalents out there. More blogs and demonstrations of the capabilities of this package will follow. Active Analytics has a roadmap for building and releasing a series of packages for big data analysis. activeH5 will be the bedrock of these packages, providing the data structures and the storage and access performance they need.

Thank you

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.