Exploratory analysis of actuarial pricing data using R & ggplot2

Active Analytics Ltd: posted 9 Jan 2013 18:13 by Chibisi Chima-Okereke [ updated 22 May 2013 05:39 ]

Introduction

Those in the statistics and data mining industry are well aware of the inroads that R is making in their environment. R has been in heavy use in some sectors and is now being noticed in Insurance. Evidence of this is that this year, the first conference on R in Insurance will be taking place. As someone who has worked in this sector using R, I have to say that its pretty exciting.

The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great looking plots and impressive flexibility have made it a popular amongst the R coding community. The syntax is rather different from other R graphics package allowing users to produce very creative plots with relatively small amounts of code.

In insurance, pricing models can become very complex and sometimes it is useful to have a tool like R to build graphs that are informative of the data structure. These can be useful not only in discussions within pricing teams but also when communicating ideas to non-technical people. Very often presenting the correct graph can save time.

The purpose of this post is to outline some exploratory plots that a pricing analyst might use when looking at data. These are plots allow claims frequency and severity to be viewed in one and two-way type plots, providing some insight to potentially effective rating factors before statistical tests and modelling process. This blog entry is targeted at personal lines insurance pricing analysts that want to use R to look at their data.

Before I begin there are two items to cover:

  1. The charts on this blog entry were produced using svg() function on the grDevices package on R, after generating them in ggplot2.
  2. Disclaimer. This blog entry is not a tutorial for actuarial pricing, but a note on hints for carrying out plots in R. The author and Active Analytics ltd. are not liable for any consequences that emerge from following any part of or all the instruction on this blog entry.

The Data

The data used has appeared in a very interesting paper by Spedicato Giorgio Alfredo as a CAS (Casualty Actuarial Society) working paper. In this paper Spedicato presents a very thorough and informative introduction to R for actuarial pricing using GLM (Generalized Linear Models).
The data resides in the faraway R package and is called motorins. It contains claims data (Payment and perd), exposure data (Insured), number of claims (Claims) and some rating factors (Kilometres, Zone, Bonus, Make).
We first load the packges we need.

require(faraway) # the data source
require(ggplot2) # for plotting
require(gridExtra) # for arranging plots
require(scales) # for the plot scales

We can look at the data table:

head(motorins)
  Kilometres Zone Bonus Make Insured Claims Payment     perd
1          1    1     1    1  455.13    108  392491 3634.176
2          1    1     1    2   69.17     19   46221 2432.684
3          1    1     1    3   72.88     13   15694 1207.231
4          1    1     1    4 1292.39    124  422201 3404.847
5          1    1     1    5  191.01     40  119373 2984.325
6          1    1     1    6  477.66     57  170913 2998.474

Note that perd is the payment per claim (Payment/Claims).

We calculate the exposure weighted claims frequency (AveCount) and rename pred to AvePaid.

claims <- motorins
names(claims)[8] <- "AvePaid"
claims$AveCount <- with(claims, Claims/Insured)
claims$Bonus <- factor(claims$Bonus, ordered = TRUE)

More information on the data can be obtained by using ?motorins in the R interpreter.

Box plots for frequency and severity

We can start by looking at one-way box plots for frequency and severity. First we look at the Kilometres, categorical variable. It is an ordered factor for distance driven each year. The first part of the code uses the qplot() function to create a frequency boxplot(geom = "boxplot") for the frequency. The second part repeats the task for the severity, and the last part of the code simply arranges the plots that have been produced into a 2-column plot. We also use the theme() function to position the legend.

# Frequency box-plot
fp <- qplot(data = claims, x = Kilometres, y = AveCount, fill = Kilometres, 
	geom = "boxplot", ylab = "Average Claims Count\n") + theme(legend.position="bottom")

# Severity box-plot
sp <- qplot(data = claims, x = Kilometres, y = AvePaid, fill = Kilometres, 
	geom = "boxplot", ylab = "Average Severity\n") + theme(legend.position="bottom")

# Arranging the plots
grid.arrange(fp, sp, ncol = 2)

Clearly some transformation is necessary and so we plot on a log y-scale

# Scale transformation
fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))

sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
grid.arrange(fp, sp, ncol = 2)

This is better. The trend in the data on the frequency side becomes clear.

We would want to see the mean as well. Here we use the geom_point() function to add the mean point to each box plot. Please note the use of the I() function. While using ggplot2 it is sometimes necessary to use the I() function to denote that you really do mean what you enter in the brackets.

# Adding the mean point to the box plot
fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
grid.arrange(fp, sp, ncol = 2)

For convenience we can wrap this up as the R function below. In the factor below, we use the substitute(), deparse(), and eval() functions to mediate passing the rating factors that we are interested in into the qplot() function.

frSevBoxPlot <- function(rFactor = Kilometres, Data = claims){
  
  rFactor <- substitute(rFactor)
  
  fp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AveCount, fill = eval(rFactor, list(fill = rFactor)), 
              geom = "boxplot", ylab = "Average Claims Count\n", xlab = deparse(rFactor)) + 
              theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  sp <- qplot(data = Data, x = eval(rFactor, list(x = rFactor)), 
              y = AvePaid, fill = eval(rFactor, list(fill = rFactor)),
              geom = "boxplot", ylab = "Average Severity\n", xlab = deparse(rFactor)) + 
    theme(legend.position="bottom") + scale_fill_discrete(name = paste(deparse(rFactor), "  "))
  fp <- fp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  sp <- sp + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
  fp <- fp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  sp <- sp + geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
    geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
  
  grid.arrange(fp, sp, ncol = 2)
  
}

Now we can plot with impunity. The plots for Zone, Bonus, and Make factors are all below

frSevBoxPlot(rFactor = Zone)
frSevBoxPlot(rFactor = Bonus)
frSevBoxPlot(rFactor = Make)



From the above plots, claims frequency is far more interesting so for the rest of this blog, we will focus on that.

Two-way plots

There are lots of ways to look at the influence of two factors on a variable in ggplot2. One of these is to use the facet_grid() function. In the plot below, we look at the influence of Zone and Bonus on the claims frequency. Both trends are immediately clear. We have shifted to using the ggplot() function, which is a more formal way of defining the plots. In this case the geometry that we want to plot are defined as standalone functions, e.g. geom_boxplot().

svg(filename = paste(path, "7_Zone_Bonus_Two_Way.svg", sep = ""), width = 11, height = 7)
ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_boxplot() + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
dev.off()

An alternative way of presenting this information is using the violin plot. The advantage here is that the shape of the distribution is immediately evident.

ggplot(claims, aes(x = Zone, y = AveCount, fill = Zone)) +  geom_violin(trim = FALSE) + 
  facet_grid(. ~ Bonus, labeller = label_both) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position="bottom") + labs(x = "\nZone", y = "Exposure weighted average Counts\n") + 
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

Histogram

Now we move to histograms which in this case are particularly interesting. This is because we will be plotting histograms of claim frequencies. On the face of it this can be slightly confusing (the idea of frequencies of frequencies) but the plots are also of value. Firstly, we plot a basic histogram.

qplot(AveCount, data = claims, geom = "histogram", y = ..density.., 
      binwidth = .04, colour = I("white"), fill = I("orange"), xlab = "\nExposure weighted average Counts", 
      ylab = "Density", main = "Histogram of average claim counts\n")

For each rating factor, we can produce plots where each frequency bar is apportioned to rating categories. Not only this we can also choose to represent each bar as proportions. Eyeballing the charts side by side, we can see where the claims are and which categories are represented time and again. We do this by defining position = "stack", to stack the bars together, and position = "fill" to represent each bar's proportion categories.

We rap our plotting code into a function

oneWayPlot <- function(rFactor = Kilometres, position = stack){
  
  rFactor <- substitute(rFactor)
  position <- deparse(substitute(position))
  
  switch(position, fill = {ylab = "Proportions"; main = ylab}, {ylab = "Denstiy"; main = "Histogram"})
  ylab <- paste(ylab, "\n")
  main <- paste(main, "of claim counts by", rFactor)
  
  tplot <- qplot(AveCount, data = claims, geom = "histogram", y = ..density.., binwidth = .04, 
                 fill = eval(rFactor, list(fill = rFactor)), ylab = ylab, xlab = "\nExposure weighted average Counts",
                 main = paste(main, "\n"), position = position) + scale_fill_discrete(name = paste(deparse(rFactor), "  ")) + 
    theme(legend.position="bottom")
  
  return(tplot)
}

and then output the plots

# Kilometers
grid.arrange(oneWayPlot(Kilometres, stack), oneWayPlot(Kilometres, fill), ncol = 2)
# Zone
grid.arrange(oneWayPlot(Zone, stack), oneWayPlot(Zone, fill), ncol = 2)
# Bonus
grid.arrange(oneWayPlot(Bonus, stack), oneWayPlot(Bonus, fill), ncol = 2)
# Make
grid.arrange(oneWayPlot(Make, stack), oneWayPlot(Make, fill), ncol = 2)




Summary

Using ggplot2 can clearly produce very interesting and informative graphics. There are few statistical programs like R that have a great potential to change the way that analysis is presented and carried out in the insurance industry. I hope that this post will encourage more actuarial analysts to have a go at R.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.