Active Analytics Ltd has recently released an R package called activeH5 for big data storage and access. It allows very fast access to data frames and matrices on disk and in memory; more details can be found on the product launch blog. Today we compare the performance of the activeH5 package with the results from two previous posts on Chain Ladder analysis in R and Python, which used the rhdf5 and h5py packages. If you have read the previous blogs, note that the triangle simulators have been updated, so their code and times have changed.
This is the code to simulate the claims triangles. Feel free to skip to the times below the code block.
# Load the package
require(activeH5)
# Age-To-Age Factors
ageFact <- seq(1.9, 1, by = -.1)
# Inflation Rate
infRate <- 1.02
# Reversing columns
revCols <- function(x){
  x[, ncol(x):1]
}
# This shake function is faster than jitter and equivalent to my Jitter() function in Python
shake <- function(vec, sigmaScale = 100)
{
  rnorm(n = length(vec), mean = vec, sd = vec/sigmaScale)
}
# Alternative row generation function
GenerateRow <- function(iDev, dFactors = cumprod(ageFact), dInflationRate = 1.02, initClaim = 154)
{
  shake(initClaim)*shake(c(1, dFactors))*(dInflationRate^iDev)
}
# Function to generate a claims matrix
GenerateTriangle <- function(iSize, ...)
{
  indices = 1:iSize
  mClaimTri = t(sapply(indices, GenerateRow, ...))
  # Reverse columns to get the claims triangle
  mClaimTri = revCols(mClaimTri)
  # Assign NA to the lower triangle of the column-reversed matrix
  mClaimTri[lower.tri(mClaimTri)] = NA
  mClaimTri = revCols(mClaimTri)
  return(mClaimTri)
}
# Function to write claims matrix to file
WriteToH5File <- function(dSName, filePath)
{
  mClaims = GenerateTriangle(11)
  h5WriteDoubleMat(dSName, mClaims, dim(mClaims), filePath)
  return(invisible())
}
# The H5 file
claimsH5 = "data/ClaimsTri.h5"
# Create the H5 file
h5CreateFile(claimsH5, 1)
# The matrix names
matNames <- paste("ch", 1:2000, sep = "")
# Simulation time and time taken to write to file
system.time(sapply(matNames, WriteToH5File, filePath = claimsH5))
# This is the time taken to run
# user system elapsed
# 0.850 0.039 0.888
Compare the above with the times obtained using rhdf5 (7.795 s) and Python’s h5py package (1.108 s). As you can see, the performance of activeH5 and h5py is comparable; in this case activeH5 was a little faster (0.888 s).
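The column-reversal step in GenerateTriangle above can be hard to picture, so here is a minimal base R sketch of it on a small 4 x 4 matrix. It needs no HDF5 package. The names `m` and `observed` and the matrix values are made up for this illustration and are not part of the benchmark.

```r
# Same helper as in the listing above
revCols <- function(x) {
  x[, ncol(x):1]
}

# Illustrative 4 x 4 matrix (arbitrary values, not simulated claims)
m <- matrix(as.numeric(1:16), nrow = 4)

# Reverse the columns, blank the strict lower triangle, then reverse back
m <- revCols(m)
m[lower.tri(m)] <- NA
m <- revCols(m)

# Number of observed (non-NA) cells in each row
observed <- rowSums(!is.na(m))
print(m)
print(observed)
```

Row i keeps its first n - i + 1 development periods, so the first row is fully observed and the last row has a single value. This is the shape the simulator writes to the H5 file.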
Here we analyse the claims triangles stored on disk and write them to another file using the activeH5 package. Again, feel free to skip to the time benchmark below the code block.
# Function for calculating age-to-age factors
GetFactor <- function(index, mTri)
{
  fact = matrix(mTri[-c((nrow(mTri) - index + 1):nrow(mTri)), index:(index + 1)], ncol = 2)
  fact = c(sum(fact[,1]), sum(fact[,2]))
  return(fact[2]/fact[1])
}
GetChainSquare <- function(mClaimTri)
{
  nCols <- ncol(mClaimTri)
  dFactors = sapply(1:(nCols - 1), GetFactor, mTri = mClaimTri)
  dAntiDiag = diag(revCols(mClaimTri))[2:nCols]
  for(index in 1:length(dAntiDiag))
    mClaimTri[index + 1, (nCols - index + 1):nCols] = dAntiDiag[index]*cumprod(dFactors[(nCols - index):(nCols - 1)])
  mClaimTri
}
WriteClaimsSquare <- function(dSName)
{
  cSquare <- GetChainSquare(h5ReadDoubleMat(dSName, claimsH5))
  h5WriteDoubleMat(dSName, cSquare, dim(cSquare), squareH5)
}
# Writing the claims square
squareH5 = "data/ClaimsSquare.h5"
# Create the H5 file
h5CreateFile(squareH5, 1)
# Simulation time and time taken to write to file
system.time(sapply(matNames, WriteClaimsSquare))
# This is the time taken to run.
# user system elapsed
# 1.229 0.068 1.299
Here we see that the time taken is 1.299 s, compared with 12.986 s for rhdf5 (an order of magnitude slower) and 1.224 s for h5py (similar performance, if a little faster).
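For readers who skipped the listing, the chain ladder step being timed can be sketched on a toy triangle in base R. This is a simplified illustration under stated assumptions, not the benchmarked code: the 3 x 3 triangle values are made up, and the names `tri`, `ageToAge` and `square` are hypothetical.

```r
# Toy 3 x 3 cumulative claims triangle (made-up numbers, NA = not yet observed)
tri <- matrix(c(100, 150, 165,
                110, 160,  NA,
                120,  NA,  NA), nrow = 3, byrow = TRUE)

n <- ncol(tri)

# Age-to-age factor for development period j: ratio of column sums over
# the rows where both column j and column j + 1 are observed
ageToAge <- sapply(1:(n - 1), function(j) {
  rows <- 1:(n - j)
  sum(tri[rows, j + 1]) / sum(tri[rows, j])
})

# Project each incomplete row forward from its latest observed value
square <- tri
for (i in 2:n) {
  last <- n - i + 1   # last observed column in row i
  square[i, (last + 1):n] <- square[i, last] * cumprod(ageToAge[last:(n - 1)])
}

print(ageToAge)
print(square)
```

For this toy triangle the factors are 310/210 (about 1.476) and 1.1, so the second row is completed with 160 x 1.1 = 176. The benchmark applies the same calculation to each of the 2000 stored 11 x 11 triangles.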
This blog post barely scratches the surface of the features available in the activeH5 package. Its purpose is to give an idea of the package's performance relative to other big data equivalents out there. More blogs and demonstrations of the capabilities of this package will follow. Active Analytics has a roadmap for building and releasing a series of packages for big data analysis. activeH5 will be the bedrock of these packages, providing the data structures and the storage and access performance they need.
Thank you