In this blog entry, I am referring to “big data” as data that is too big to be processed in memory. The method used here is quite simple: the rhdf5 package from Bioconductor will be used to store our dataset, and the fantastic ChainLadder package will be used to process each chunk of the data. The dataset generated here is by no means large; this is a quick and simple demonstration of what is possible in principle. In practice (with more time), there is a lot of optimization that could speed up the process by some orders of magnitude.
Below I start by describing the data and how we generate it, and then present the code for processing the claims triangles using the ChainLadder package. I will then create a bare-bones Chain Ladder function to demonstrate a quick win with speed optimization. The code for this blog entry is located here.
HDF5 is a hierarchical data format designed specifically for storing large amounts of data. It has fast I/O, and I will be reading and writing claims triangles in this format using the rhdf5 package. In this blog entry, I use HDF5 and H5 interchangeably.
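To make the read/write pattern concrete before the main example, here is a minimal rhdf5 round trip on a small matrix; the file name example.h5 is illustrative and not part of the demonstration.
library(rhdf5)
# Create an H5 file, write a small matrix to a named dataset and read it back
h5createFile("example.h5")
h5write(matrix(1:6, nrow = 2), file = "example.h5", name = "m")
h5read(file = "example.h5", name = "m")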
This demonstration will simulate 2000 triangles and carry out a Chain Ladder calculation on each; I also assume that we are only interested in the completed triangle and nothing else. The hard disk used here is a single drive running at 7200 rpm.
The data will be simulated and written to the H5 file by appending the triangles one at a time (the triangle is then a chunk/block).
# Load the packages used throughout: rhdf5 for H5 I/O, ChainLadder for MackChainLadder
library(rhdf5)
library(ChainLadder)
# Age-To-Age Factors
ageFact <- seq(1.9, 1, by = -.1)
# Inflation Rate
infRate <- 1.02
# Function to reverse matrix columns
revCols <- function(x){
x[,ncol(x):1]
}
# We use shake rather than jitter; shake is much faster and equivalent to my Python Jitter() function
shake <- function(vec, sigmaScale = 100)
{
rnorm(n = length(vec), mean = vec, sd = vec/sigmaScale)
}
# Alternative row generation function
GenerateRow <- function(iDev, dFactors = cumprod(ageFact), dInflationRate = 1.02, initClaim = 154)
{
shake(initClaim)*shake(c(1, dFactors))*(dInflationRate^iDev)
}
# Function to generate a claims matrix
GenerateTriangle <- function(iSize, ...)
{
indices = 1:iSize
mClaimTri = t(sapply(indices, GenerateRow, ...))
# Reverse columns to get the claims triangle
mClaimTri = revCols(mClaimTri)
# Assign NA to the unobserved (future) part of the triangle
mClaimTri[lower.tri(mClaimTri)] = NA
mClaimTri = revCols(mClaimTri)
return(mClaimTri)
}
# Creating the claims H5 file
claimsH5 <- "data/ClaimsTri.h5"
claimsH5File <- H5Fcreate(claimsH5)
# Function to write a triangle to H5 file
WriteToH5File <- function(sName, h5File = claimsH5File){
h5writeDataset.matrix(GenerateTriangle(11), h5File, name = sName, level = 0)
}
sMatrixNames <- as.character(1:2000)
system.time(null <- lapply(sMatrixNames, WriteToH5File))
# This is the time taken to run
# user system elapsed
# 7.773 0.025 7.795
We now process the triangles. Again, we are effectively chunking, using each triangle as the unit chunk.
# H5 File to store the claims triangle
sProcessedH5 <- "data/ClaimsSquare.h5"
processedH5 <- H5Fcreate(sProcessedH5)
# Function to process one triangle from the claims H5 file
ChainLadder <- function(sName = "1", file = claimsH5File){
MackChainLadder(h5read(file = file, name = sName), est.sigma="Mack")$FullTriangle
}
# Function to write a processed chainladder to H5 file
WriteSquare <- function(sName = "1", file = processedH5){
h5writeDataset.matrix(ChainLadder(sName = sName), file, name = sName, level = 0)
}
# Process Execution
system.time(null <- lapply(sMatrixNames, WriteSquare))
This is the time taken to run
# user system elapsed
# 203.940 813.666 131.732
Above I said that we are only interested in the completed claims triangle; the MackChainLadder function carries out lots of other very useful calculations and exception handling. Here I create a bare-bones Chain Ladder function to demonstrate a quick speed win.
First, I create a function to get the age-to-age factor at a particular column index.
# Get claims factor at a particular column index
GetFactor <- function(index, mTri)
{
fact = matrix(mTri[-c((nrow(mTri) - index + 1):nrow(mTri)), index:(index + 1)], ncol = 2)
fact = c(sum(fact[,1]), sum(fact[,2]))
return(fact[2]/fact[1])
}
This function carries out the Chain Ladder calculation to create the complete claims matrix (quick and dirty - no exception handling).
# Function to carry out Chain Ladder on a claims triangle
GetChainSquare <- function(mClaimTri)
{
nCols <- ncol(mClaimTri)
dFactors = sapply(1:(nCols - 1), GetFactor, mTri = mClaimTri)
dAntiDiag = diag(revCols(mClaimTri))[2:nCols]
for(index in 1:length(dAntiDiag))
mClaimTri[index + 1, (nCols - index + 1):nCols] = dAntiDiag[index]*cumprod(dFactors[(nCols - index):(nCols - 1)])
mClaimTri
}
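As a quick sanity check (this snippet is mine and not part of the timed runs), the function can be applied directly to a freshly simulated triangle.
# Complete a single simulated 11 x 11 triangle in memory
mSquare <- GetChainSquare(GenerateTriangle(11))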
Now we combine the new Chain Ladder function with the H5 chunk read.
ChainLadder2 <- function(sName = "1", file = claimsH5File){
GetChainSquare(h5read(file = file, name = sName))
}
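The code for the timed run itself is not shown; a minimal sketch, assuming the same write pattern as before with a fresh output file (the ClaimsSquare2.h5 name and the WriteSquare2 wrapper are mine, not from the original code), would be:
# Fresh H5 file for the bare-bones run (file name is illustrative)
sProcessedH5v2 <- "data/ClaimsSquare2.h5"
processedH5v2 <- H5Fcreate(sProcessedH5v2)
# Write the completed triangle produced by ChainLadder2 to the new file
WriteSquare2 <- function(sName = "1", file = processedH5v2){
h5writeDataset.matrix(ChainLadder2(sName = sName), file, name = sName, level = 0)
}
system.time(null <- lapply(sMatrixNames, WriteSquare2))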
This is the time taken to run
# user system elapsed
# 12.957 0.047 12.986
Using this technique or something similar, we can process very large data files. Here I have shown that simple techniques can be used to optimize speed, but this barely scratches the surface of what is possible: we can carry out further optimization on the H5 side, and we can also use multi-threaded/parallel techniques (a sketch of the parallel idea follows below). The important feature is that we can process very large datasets that will not fit into memory, and we can do so in linear time. Once the files are processed, we can read and write to our H5 file at our leisure. Isn’t that nice?
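As a final, hedged sketch of the parallel direction (none of this is from the original post): on a Unix-alike, the read-and-complete step can be farmed out with the parallel package, with each worker opening the claims file by name so no open H5 handle is shared; writes back to a single H5 file should remain serial. The CompleteTriangle helper, the batch of 100 names, and the core count are all illustrative.
library(parallel)
# Read a triangle by dataset name and complete it with the bare-bones Chain Ladder
CompleteTriangle <- function(sName, sFile = "data/ClaimsTri.h5"){
GetChainSquare(h5read(file = sFile, name = sName))
}
# Complete a batch of triangles in parallel; write the results back serially afterwards
lSquares <- mclapply(sMatrixNames[1:100], CompleteTriangle, mc.cores = 4)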