Today we decided to talk about a programming language called Julia. Its homepage describes it as a high-level, high-performance dynamic programming language for technical computing. If, like us, you use R, Python, etc. and often need to step down to C/C++ for speed, check out the Julia programming language. It promises better performance, with the advantages inherent in dynamic scripting languages and without needing to step into C/C++ territory to speed up code. Julia's programming style leans on multiple dispatch rather than the object oriented programming of C# and Java. Like Gordon Gekko and the creatures in the movie Colony, Julia is a language for greedy programmers that simply want MORE!
Here we put the Chain Ladder analysis through Julia and compare its speed with R and Rcpp, using code from a previous blog entry. The main reason we like to use the Chain Ladder calculation as a speed test is that, though it may be unfamiliar to people outside insurance, it is a very simple calculation on a matrix/array that gives a good idea of how well a language performs basic numerical calculations.
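For readers outside insurance, a rough sketch of the calculation may help. Below is a minimal Python illustration, our own construction rather than the R, C++, or Julia code benchmarked in this post (the function name chain_square is ours): the development factors are ratios of successive column sums over the observed rows, and each row of the triangle is then rolled forward by those factors.

```python
import numpy as np

def chain_square(tri):
    """Complete a cumulative claims triangle (NaN below the antidiagonal)
    using volume-weighted chain ladder development factors.
    A rough sketch of the method, not the code benchmarked in this post."""
    tri = tri.copy()
    n = tri.shape[1]
    # Development factor for each column pair, from the observed rows only
    factors = []
    for j in range(n - 1):
        rows = ~np.isnan(tri[:, j + 1])
        factors.append(tri[rows, j + 1].sum() / tri[rows, j].sum())
    # Roll each row forward from its last observed value, left to right
    for j in range(1, n):
        miss = np.isnan(tri[:, j])
        tri[miss, j] = tri[miss, j - 1] * factors[j - 1]
    return tri
```

The NaN cells below the antidiagonal play the role that NA values play in R's triangles.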
The code in this blog entry was run/compiled on Ubuntu 14.04 on a Core i7 processor with 16 GB RAM, using Julia version 0.2.1, the g++ version 4.8.2 compiler, R version 3.1.1, and Rcpp version 0.11.2. The scripts are available on GitHub.
If you like this one, you'll also want to check out our next blog, “ChainLadder in < 2 microseconds with Julia’s ccall()”.
Before jumping into Chain Ladder, let’s take Julia out for a quick spin and kick the tires a bit, with a casual, unscientific timed test and a multiple dispatch example.
Unlike R, one of the mantras of Julia is “more loops please”, and for veteran R programmers who have spent years avoiding for loops this can be a little unusual, a relief, or both. In Julia, loops can actually increase the speed of certain tasks. Consider the following task in R:
system.time(x <- sin(1:1E6))
# user system elapsed
# 0.060 0.001 0.060
In Julia, we can do this:
@elapsed x = sin(1:1E6)
# 0.042019575
Julia’s elapsed time is a little faster but see what happens when we use a loop in Julia:
@elapsed x = [sin(i) for i = 1:1E6] # Note: [sin(i) for i in 1:1E6] is also valid
# 0.033976399
Pretty sweet, eh? This kind of thinking runs counter to an R programmer’s training and experience, which is to avoid loops as much as possible. If you are wondering, @ denotes a macro, so @elapsed is a macro that returns the elapsed time of the expression that follows it (check out the Julia documentation; it’s a great read). Note also the Python-like contraction of the loop into an array comprehension. We can try something similar in R:
system.time(for(i in 1:1E6)sin(i))
# user system elapsed
# 0.195 0.000 0.195
But we did not assign the result to x; to do that using a for loop, we need something like:
system.time({x = vector(length = 1E6); for(i in 1:1E6){x[i] <- sin(i)}})
# user system elapsed
# 1.209 0.003 1.212
Notice that we need to state the length of the vector x up front. If we try to grow it as we go, as in
x = 0
system.time(for(i in 1:1E6){x[i] <- sin(i)})
we may be waiting till hell freezes over.
In practical terms, multiple dispatch turns object oriented programming on its head: instead of encapsulating functions or methods inside data structures or objects, functions are effectively overloaded on different data structures. The object oriented paradigm is not a natural way for data scientists or algorithm writers to work; one of the great things about R is that it allows programs to be created quickly and simply through a pleasant and powerful programming interface. Julia takes this one step further and makes it explicit and clear that we are writing functions for specific data structures, but with the option to write “catch-all” functions that deal with any data type (see the methods section of the Julia manual).
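As a loose illustration for Python users: the standard library’s functools.singledispatch gives a similar flavour, although it dispatches only on the type of the first argument, whereas Julia dispatches on all of them (the describe function below is our own toy example, not something from the Julia manual).

```python
from functools import singledispatch

# One generic function, many type-specific methods, plus a "catch-all"
@singledispatch
def describe(x):
    return "something else"   # the catch-all method for any other type

@describe.register(int)
def _(x):
    return "an integer"

@describe.register(str)
def _(x):
    return "a string"
```

So describe(3) returns "an integer", describe("hi") returns "a string", and anything unregistered, such as a float, falls through to the catch-all.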
In Julia, we can concatenate strings with *:
"Hello " * "world!"
# "Hello world!"
Consider the more popular symbol for this, +. It is not available by default:
"Hello " + "world!"
# ERROR: no method +(ASCIIString,ASCIIString)
This is good news for us, because we previously tried to overload this operator for character strings in R, but it is locked. In Julia, however, we can do it easily:
+(x::ASCIIString, y::ASCIIString) = string(x, y) # string is equivalent to paste() in R
# + (generic function with 93 methods)
Then we can use it:
"Hello " + "world!" + " Goodbye " + "world!"
# "Hello world! Goodbye world!"
Chain Ladder timed tests
The tires have been kicked, the engine is warmed up, and we’re ready to go. In this section we compare Chain Ladder analysis times in Julia, R, and Rcpp. We take the R/Rcpp code from a previous blog entry, run the auto paid claims data from R’s ChainLadder package through it, and compare with equivalent Julia code.
# ChainLadder data
x = ChainLadder::auto[[1]]
source("chain-ladder.r")
require(microbenchmark)
microbenchmark(GetChainSquareR(x), GetChainSquareCpp(x), times = 1000)
# Benchmark outputs
Unit: microseconds
expr min lq median uq max neval
GetChainSquareR(x) 163.09 171.600 175.398 183.2290 1321.261 1000
GetChainSquareCpp(x) 4.74 5.546 8.076 8.5965 59.558 1000
Yes, that is a median time of about 8 microseconds per triangle run using Rcpp, and yes, we too think that is insane.
First we load packages:
using DataArrays, DataFrames
Then the data:
# The auto paid data from the ChainLadder package
x = [[101125. 209921 266618 305107 327850 340669 348430 351193 353353 353584];
[102541 203213 260677 303182 328932 340948 347333 349813 350523 NaN];
[114932 227704 298120 345542 367760 377999 383611 385224 NaN NaN];
[114452 227761 301072 340669 359979 369248 373325 NaN NaN NaN];
[115597 243611 315215 354490 372376 382738 NaN NaN NaN NaN];
[127760 259416 326975 365780 386725 NaN NaN NaN NaN NaN];
[135616 262294 327086 367357 NaN NaN NaN NaN NaN NaN];
[127177 244249 317972 NaN NaN NaN NaN NaN NaN NaN];
[128631 246803 NaN NaN NaN NaN NaN NaN NaN NaN];
[126288 NaN NaN NaN NaN NaN NaN NaN NaN NaN]]
Then the functions for carrying out the analysis, exact analogues of the previous R and C++ functions:
function GetFactor(index, mTri)
    # Ratio of column sums for development columns index and index + 1,
    # using only the rows where both columns are observed
    nRow::Int = size(mTri)[1] - index
    mTri = mTri[1:nRow, index:(index + 1)]
    mTri = sum(mTri, 1)
    return mTri[2]/mTri[1]
end
function GetChainSquare(mTri)
    nCol = size(mTri)[2]
    # The chain ladder development factors
    dFactors = [GetFactor(i, mTri) for i = 1:(nCol - 1)]
    # The latest observed value in each row (the "antidiagonal")
    dAntiDiag = diag(mTri[:, reverse(1:nCol)])[2:nCol]
    # Project each row forward by the cumulative product of the factors
    for index = 1:length(dAntiDiag)
        mTri[index + 1, (nCol - index + 1):nCol] = dAntiDiag[index]*cumprod(dFactors[(nCol - index):(nCol - 1)])
    end
    mTri
end
A convenience function for benchmarking:
# Benchmark
function bench(n::Int)
    output = [@elapsed GetChainSquare(x) for i = 1:n]
    output = output*1E6 # convert to microseconds
    return DataFrames.DataFrame(min = quantile(output, 0),
        lq = quantile(output, 0.25), median = quantile(output, 0.5),
        uq = quantile(output, 0.75), max = quantile(output, 1), neval = n)
end
Now for the benchmark (times are also in microseconds):
bench(1000)
1x6 DataFrame:
min lq median uq max neval
[1,] 49.271 50.5192 51.2425 52.0497 173178.0 1000
In this case Julia is over 3 times faster than R but over 6 times slower than the RcppArmadillo C++ algorithm. Notice that the data we used contained NaN rather than NA. If we use a DataArray rather than a simple Array, which allows us to hold NA values, we get:
# Benchmark output
bench(1000)
1x6 DataFrame:
min lq median uq max neval
[1,] 98.171 101.129 102.262 105.461 13693.8 1000
Still faster than R, but now slower than before. Every new feature added to a data structure can carry a performance penalty, and programmers should be aware of this.
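A loose analogy from the Python world (our own example, not from this post’s benchmarks): numpy’s plain ndarray versus its masked array, which, like a DataArray, carries missing-value bookkeeping on every operation and usually pays for it in speed.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
plain = rng.random(100_000)
# The richer container: same values, plus a mask tracking missing entries
masked = np.ma.masked_invalid(plain)

# The masked sum does extra mask handling on every call and is
# usually noticeably slower than the plain ndarray sum
t_plain = timeit.timeit(plain.sum, number=100)
t_masked = timeit.timeit(masked.sum, number=100)
print(f"plain: {t_plain:.4f}s  masked: {t_masked:.4f}s")
```

Both containers compute the same sum; only the bookkeeping differs.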
After tweets from Viral B. Shah, one of the co-creators of Julia, hinting that devectorising the insertion and accumulating manually rather than using the cumprod function should be faster, we tried a modification of the code:
function GetChainSquare2(mTri)
    p = size(mTri)[2] # the size of the triangle
    dFactors = [GetFactor(i, mTri) for i = 1:(p - 1)] # the chain ladder factors
    for i = 2:p # iterate over the rows
        for j = (p - i + 2):p # iterate over the columns from the "antidiagonal"
            mTri[i, j] = mTri[i, j - 1]*dFactors[j - 1]
        end
    end
    return mTri
end
and with appropriate adjustment of the bench function we get about a 25 microsecond improvement on the median time:
bench(1000)
1x6 DataFrame:
min lq median uq max neval
[1,] 22.385 23.69 25.1305 26.9605 107.778 1000
Of course, this means that we can apply some of these changes to the C/C++ algorithm, which should make that faster too, but we will continue this at a later date.
The Julia language is still young, but since its introduction in 2012 it has rapidly become popular as an alternative or companion to R, Python, etc. for technical and high-performance computing and data science applications. Our little drive with the language shows that it has plenty to offer those of us who want “it all” from a programming language; but though it leaves R behind in our tests, it has a little way to go before we see C/C++-level performance.
We like to think of Julia as R and Python with the kid gloves off: some of C’s speed and power, plus high-performance tools, clean programming syntax, and more, all wrapped up in a scripted programming language. It is early days, but the number of available packages is growing rapidly. Julia also has interfaces to C and Fortran, but C++ support is still limited, which is a significant drawback for those of us with an appetite for C++ goodness.
In this blog we have only started to scratch the surface of Julia’s capabilities, but make no mistake: Julia is the next step in the evolution of data science programming languages, and we will certainly revisit it in the future. If you are new to Julia, check out the documentation; it’s some of the best written we have read.
Sometimes people ask whether R or Python will end up the most used “data science” programming language; we now have a dark horse, a late entrant to the race, a new kid on the block that could one day grab this particular crown.