ChainLadder benchmarks with Julia, R, Rcpp

Active Analytics Ltd: posted 06 August 2014 by Chibisi Chima-Okereke

Introduction

Today we decided to talk about a programming language called Julia, its page describes it as a high-level, high-performance dynamic programming for technical computing. If like us you use R, Python, etc. and often need step down to C/C++ for speed, check out the Julia programming language. It promises better performance, with advantages inherent in dynamic scripting programming languages and without needing to step into C/C++ territory to speed up code. Julia's programming style is more multiple dispatch than object oriented programming as in C# and Java. Like Gordon Geko and the creatures in the movie Colony, Julia is a language for greedy programmers that simply want MORE!

Here, we put the Chain Ladder analysis through Julia and compare its speed with the performance in R and Rcpp using code from a previous blog entry. The main reason that we like to use the Chain Ladder calculation to test speed is that though it is not familiar to people outside insurance, it is a very simple calculation on a matrix/array that gives you an idea of how well a language performs basic numerical calculations.

The code in this blog entry is run/compiled on Ubuntu 14.04, core i7 processor having 16 GB RAM, using julia version 0.2.1, g++ version 4.8.2 compiler, R version 3.1.1 and Rcpp version 0.11.2. The scripts are available on Github.

Our next blog is "ChainLadder in < 2 microseconds with Julia's ccall()", if you like this one, you'll want to check out the next.

A little play around

Before jumping into Chain Ladder, let's take Julia out for a quick spin - kick the tires a bit with a casual unscientific timed test and a multiple dispatch example.

More loops please

Unlike R, one of the matras of Julia is "more loops please" and for veteran R programmers that have spent years avoiding for loops this can either be a little unusual or a relief, maybe both. In Julia, loops can actually increase the speed of certain tasks. Consider the following task in R:

 system.time(x <- sin(1:1E6))
   user  system elapsed 
  0.060   0.001   0.060 

In Julia, we can do this:

@elapsed x = sin(1:1E6)
0.042019575

Julia's elapsed time is a little faster but see what happens when we use a loop in Julia:

@elapsed x = [sin(i) for i = 1:1E6] # Note: [sin(i) for i in 1:1E6] is also valid
0.033976399

Pretty sweet eh? This kind of thinking runs counter to an R programmer's training and experience which is to avoid loops as much as possible. If you are wandering, the "@" denotes a macro so @elapsed is a macro that returns the elapsed time of the code in front of it (check out the Julia documentation, it's a great read). Note also the Python-like contraction of the loop into an array. We can try something similar in R:

system.time(for(i in 1:1E6)sin(i))
   user  system elapsed 
  0.195   0.000   0.195

But we did not assign this to x, to do this using a for loop we need to do something like:

system.time({x <- vector(length = 1E6); for(i in 1:1E6){x[i] <- sin(i)}})
   user  system elapsed 
  1.209   0.003   1.212 

Notice that we need to state the length of the vector x, if we try to do something like this

x = 0
system.time(for(i in 1:1E6){x[i] <- sin(i)})

we may be waiting till hell freezes over.

Multiple dispatch

In practical terms, multiple dispatch turns object orientated programming on its head, instead of encapsulating functions or methods inside data structures or objects, functions are effectively overloaded on different data structures. The object oriented paradigm is not a natural way for data-scientists or algorithmists to work, one of the great things about R is to allow programs to be created quickly and simply by providing a nice and powerful programming interface. Julia takes this one step futher and makes it explicit and clear, that we are writing functions for specific data structures - but with an option to write "catch-all" functions to deal with any data type (see the methods section of the Julia manual).

In Julia, we can concatenate strings with "*":

"Hello " * "world!"
"Hello world!"

Consider the more popular symbol for this "+". This is not available by default

"Hello " + "world!"
ERROR: no method +(ASCIIString,ASCIIString)

This is good for us, because previously we tried to overload this operator in R for character but it was locked. In Julia however, we can do this easily:

julia> +(x::ASCIIString, y::ASCIIString) = string(x, y) #string is equivalent to paste() in R
+ (generic function with 93 methods)

Then we can use it:

"Hello " + "world!" + " Goodbye " + "world!"
"Hello world! Goodbye world!"

Chain Ladder timed tests

The tires have been kicked, the engine warmed up, and we're ready to go. In this section we look at Chain Ladder analysis times in Julia, R, and Rcpp. We take the R/Rcpp code from a previous blog entry and run data from R's ChainLadder package auto paid claims on this code and compare it with equivalent Julia code.

The R and Rcpp benchmarks

# ChainLadder data
x = ChainLadder::auto[[1]]
source("chain-ladder.r")
require(microbenchmark)
microbenchmark(GetChainSquareR(x), GetChainSquareCpp(x), times = 1000)

# Benchmark outputs
Unit: microseconds
                 expr    min      lq  median       uq      max neval
   GetChainSquareR(x) 163.09 171.600 175.398 183.2290 1321.261  1000
 GetChainSquareCpp(x)   4.74   5.546   8.076   8.5965   59.558  1000

Yes that is a median time of about 8 microseconds per triangle run using Rcpp, and yes we too think that it is insane.

Chain Ladder in Julia

First we load packages:

using DataArrays, DataFrames

Then the data

# The auto paid data from the ChainLadder package
x = [[101125. 209921 266618 305107 327850 340669 348430 351193 353353 353584];
[102541 203213 260677 303182 328932 340948 347333 349813 350523     NaN];
[114932 227704 298120 345542 367760 377999 383611 385224     NaN     NaN];
[114452 227761 301072 340669 359979 369248 373325     NaN     NaN     NaN];
[115597 243611 315215 354490 372376 382738     NaN     NaN     NaN     NaN];
[127760 259416 326975 365780 386725     NaN     NaN     NaN     NaN     NaN];
[135616 262294 327086 367357     NaN     NaN     NaN     NaN     NaN     NaN];
[127177 244249 317972     NaN     NaN     NaN     NaN     NaN     NaN     NaN];
[128631 246803     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN];
[126288     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN]]

Then the functions for carrying out the analysis: exact analogues of the previous R and C++ functions:

function GetFactor(index, mTri)
	nRow::Int = size(mTri)[1] - index
	mTri = mTri[1:nRow, index:(index + 1)]
	mTri = sum(mTri, 1)
	return mTri[2]/mTri[1]
end
	

function GetChainSquare(mTri)
	nCol = size(mTri)[2]
	dFactors = [GetFactor(i, mTri) for i = 1:(nCol - 1)]
	dAntiDiag = diag(mTri[:, reverse(1:nCol)])[2:nCol]
	for index = 1:length(dAntiDiag)
		mTri[index + 1, (nCol - index + 1):nCol] = dAntiDiag[index]*cumprod(dFactors[(nCol - index):(nCol - 1)])
	end
	mTri
end

A convenience function for benchmarking:

# Benchmark
function bench(n::Int)
	output = [@elapsed GetChainSquare(x) for i = 1:n]
	output = output*1E6 # microseconds
	return DataFrames.DataFrame(min = quantile(output, 0), lq = quantile(output, 0.25), median =quantile(output, 0.5), 
			uq = quantile(output, 0.75), max = quantile(output, 1), neval = n)
end

Now for the benchmark (times are also in microseconds):

bench(1000)
1x6 DataFrame:
           min      lq  median      uq      max neval
[1,]    49.271 50.5192 51.2425 52.0497 173178.0  1000

In this case Julia is over 3 times faster than R and over 6 times slower than RcppArmadillo C++ algorithm. Notice that the data we used contained NaN rather than NA. If we use a DataArray rather than a simple Array which will allow us NA values:

# Benchmark output
bench(1000)
1x6 DataFrame:
           min      lq  median      uq     max neval
[1,]    98.171 101.129 102.262 105.461 13693.8  1000

Still faster than R but now slower than before. With every new feature added to data structures we expect penalties in performance and programmers should be aware of this.

Further optimizations

After tweets from Viral B. Shah, one of the co-creators of Julia hinting that devectorising the insertion, and manually accumulating rather than reusing the using cumprod function should be faster we tried a modification of the code:

function function GetChainSquare2(mTri)
	p = size(mTri)[2] # This is the size of the triangle
	dFactors = [GetFactor(i, mTri) for i = 1:(p - 1)] # the chain ladder factors
	for i = 2:p # iterate over the rows
		for j = (p - i + 2):p # iterative over the columns from the "antidiagonal"
			mTri[i, j] = mTri[i, j-1]*dFactors[j - 1]
		end
	end
	return mTri
end

and with appropriate adjustment of the bench function get about 25 microsecond improvement on the median time:

bench(1000)
1x6 DataFrame:
           min    lq  median      uq     max neval
[1,]    22.385 23.69 25.1305 26.9605 107.778  1000

Ofcourse this means that we can apply some of these changes to the C/C++ algorithm which should make that faster too, but we will continue this at a later date.

Conclusion

The Julia language is still young but since its introduction in 2012 it has rapidly become popular as an alternative or companion to R, Python, etc. for technical and high performance computing and data science applications. Our little drive with this programming language shows that it has plenty to offer those of us that want "it all" from a programming language; but though it leaves R behind in our tests it has a little way to go before we will see C/C++ level performance.

We like to think of Julia as R and Python with the kid's gloves off, some of C's speed and power then throw in high performance tools, clean programming syntax and more all wrapped up as a scripted programming language. It is early days but the number of available packages is rapidly growing. Julia also has an interface to C & Fortran but C++ support is still limited which is a significant drawback for those if us that have an appetite for C++ goodness.

In this blog, we are only starting to scratch the surface of Julia's capabilities but make no mistake, Julia is the next step in the evolution of data science programming languages and we will certainly revisit it in the future. If you are new to Julia, check out the documentation, its one of the best written we have read.

Sometimes people ask whether R or Python will be the most used "data science" programming language, we now have a dark horse, a late entrant to the race, a new kid on the block that could one day grab this particular crown.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.