Crime data exploration in R using ggplot2

Author: Dr Chibisi Chima-Okereke Created: January 16, 2013 13:24:00 GMT Published: May 22, 2013 05:37:00 GMT
figure9

The purpose of this blog post is to outline some exploratory plots using crime data, available from data.gov.uk website and the ggplot2 package in R. The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great looking plots and impressive flexibility have made it a popular amongst R coders. The syntax is rather different from other R graphics packages allowing users to produce very creative plots with relatively small amounts of code. It uses the characteristic “+” operator that allows existing plot elements to be manipulated even after the plot has been created and displayed. This is a rather different approach from other graphics packages and allows a much more dynamic approach to plotting.

Though this blog post has been created for crime data, the principles can be extended to analysis of many different data sets.

Before I begin there are two items to cover:

  1. The graphics files on this blog entry were produced using svg() function on the grDevices package on R.
  2. Disclaimer. This blog entry is not a tutorial for crime analysis, but a note on hints for carrying out plots in R. The author and Active Analytics are not liable for any consequences that emerge from following any part of or all the instruction on this article.

The Data

The data used in this plotting tutorial was from the data.gov.uk website. It contains monthly crime data for the preceding 24 months, in this case from January 2011 to December 2012 (“31/12/12 BTP-Dec-2012”) for crimes on trains collected by the British Transport Police. It contains counts of crime by crime type, information about the station where the crime occurred, the operator, the area, passenger numbers, the address, and whether the crime occurred on the station or the train. This data set has 106,084 rows – large enough to be useful but not immediately unwieldy; it also has problems with missing values in some columns. We are interested in how the number of crimes committed in the Crime.Count column values varies with other variables in the table.

#We load some packages
# Our plotting tool
require(ggplot2)
# For arranging the plots
require(gridExtra)
# For manipulating the plot scales
require(scales)
# For generting our svg files
require(grDevices)
options("stringsAsFactors" = TRUE)
# Path to the folder holding the data csv
path <- "C:\\...\\"

Read the latest data in

btpData <- read.csv(file = paste(path, "BTP-Dec-2012.csv", sep = ""), header = TRUE)
names(btpData)[1] <- "Crime_Type"

The dimensions of the table …

dim(btpData)
# [1] 106084     12

We use the head() function to view the data first 6 rows of the data

head(btpData)
                   Crime_Type Station_Name Month Year Crime.Count Location_Type
1   Criminal damage and arson   Abbey Road     9 2012           1       STATION
2       Anti Social Behaviour   Abbey Road    12 2011           1       STATION
3 Public Disorder and Weapons   Abbey Road     2 2012           1       STATION
4 Public Disorder and Weapons   Abbey Road     5 2012           1      ON TRAIN
5                     Robbery   Abbey Road     5 2012           1      ON TRAIN
6               Vehicle crime   Abbey Road     1 2012           1       STATION
              Operator            OperatorURL               Area
1 Transport for London http://www.tfl.gov.uk/ London Underground
2 Transport for London http://www.tfl.gov.uk/ London Underground
3 Transport for London http://www.tfl.gov.uk/ London Underground
4 Transport for London http://www.tfl.gov.uk/ London Underground
5 Transport for London http://www.tfl.gov.uk/ London Underground
6 Transport for London http://www.tfl.gov.uk/ London Underground
  Passenger.numbers Address    Type
1              NULL    NULL STATION
2              NULL    NULL STATION
3              NULL    NULL STATION
4              NULL    NULL STATION
5              NULL    NULL STATION
6              NULL    NULL STATION

The summary() function allows us to look a data summary. Almost half of the data has missing Address fields and 4919 of the rows have missing Passenger.numbers values. There is some justification that these data points should be removed and we drop them shortly.

summary(btpData)
                       Crime_Type              Station_Name        Month       
 All crime and ASB          :40106   Victoria        :   886   Min.   : 1.000  
 Other Theft                :20066   Waterloo        :   616   1st Qu.: 4.000  
 Anti Social Behaviour      :11152   Liverpool Street:   570   Median : 7.000  
 Violent crime              : 9211   Leeds           :   518   Mean   : 6.576  
 Public Disorder and Weapons: 8814   Euston          :   517   3rd Qu.:10.000  
 Criminal damage and arson  : 5417   London Bridge   :   516   Max.   :12.000  
 (Other)                    :11318   (Other)         :102461                   
      Year       Crime.Count       Location_Type                   Operator    
 Min.   :2011   Min.   :  1.000   ON TRAIN:41204   Transport for London:26944  
 1st Qu.:2011   1st Qu.:  1.000   STATION :64880   South Eastern Trains: 9130  
 Median :2012   Median :  1.000                    South West Trains   : 8745  
 Mean   :2012   Mean   :  2.141                    Southern Railway    : 6818  
 3rd Qu.:2012   3rd Qu.:  2.000                    Northern Rail       : 6611  
 Max.   :2012   Max.   :106.000                    First Great Western : 6342  
                                                   (Other)             :41494  
                                OperatorURL                    Area      
 http://www.tfl.gov.uk/               :26944   London North      :19740  
 http://www.southeasternrailway.co.uk/: 9130   London South      :27648  
 http://www.southwesttrains.co.uk     : 8745   London Underground:23660  
 http://www.southernrailway.com       : 6818   North East        : 8405  
 http://www.northernrail.org/         : 6611   North West        : 9227  
 http://www.firstgreatwestern.co.uk/  : 6342   Scotland          : 5015  
 (Other)                              :41494   Wales and Western :12389  
 Passenger.numbers                                  Address     
 NULL    : 4919    NULL                                 :49843  
 9039907 :  518    FAIRFIELD STREET,MANCHESTER,M1 2PB   :  353  
 32853942:  439    SMALLBROOK QUEESWAY,BIRMINGHAM,B2 4QA:  337  
 10273542:  353    MELTON STREET,EUSTON,NW1 2HS         :  329  
 38357578:  344    YORK ROAD,WATERLOO,SE1 8SW           :  312  
 8540695 :  337    STATION HILL,READING,RG1 1LZ         :  306  
 (Other) :99174    (Other)                              :54604  
         Type      
 LU STATION:23798  
 METROSTN  :  391  
 STATION   :80146  
 SUBWAY    :  169  
 TRAMLINK  : 1580 

The data has now been reloaded with the option("stringsAsFactors" = FALSE), we want to retain the character variables to prevent unexpected behaviour from the variables. First we want to append the date for the end of the month onto each data item.

Coping with dates

Regardless of the software you use for data analysis, date formats can be a little challenging to work with. In our data set, we have Month and Year information, but we want to create column for date which will be the last day of the month since this is when the data was valid for. This can be rather tricky; my original code was verbose but ran okay in speed terms, after a few iterations, it has been reduced to very few lines but runs like a tortoise! The main problem is that I use the seq() function which is not vectorized. I include the tortoise version because it is easier to see what it happening with fewer code lines. I first create a dates vector by using the month and year information but setting the day to 1, the first day of the month. I then use the seq() function to step forward 1 month and subtract 1 day which takes me back to the last day of the original month.

# Creating the dates column
dates <- as.Date(paste(btpData$Year, btpData$Month, "1", sep = "-"))
dates <- sapply(dates, function(x){
  seq(x, by = "1 month", length.out = 2)[2] - 1
})
btpData$Date <- as.Date(dates)

Now we do some basic formatting on the data, making sure that some names are correctly formatted, and drop any rows that have no passenger number information.

# Rename some items
btpData$Crime_Type[btpData$Crime_Type == "Other crime"] <- "Other Crime"
btpData$Crime_Type[btpData$Crime_Type == "Vehicle crime"] <- "Vehicle Crime"
btpData$Crime_Type[btpData$Crime_Type == "Violent crime"] <- "Violent Crime"
btpData$Crime_Type[btpData$Crime_Type == "All crime and ASB"] <- "All Crime and ASB"
btpData$Crime_Type[btpData$Crime_Type == "Criminal damage and arson"] <- "Criminal Damage and Arson"
btpData$Passenger.numbers <- as.numeric(btpData$Passenger.numbers)
# Perhaps looking at the number crime per passenger is sensible
btpData$Count.Prop <- with(btpData, Crime.Count/Passenger.numbers)
btpData$Operator[btpData$Operator == "First Transpennine Express"] <- "First TransPennine Express"
# We drop the data with no Passenger.numbers data
btpData <- subset(btpData, !is.na(Passenger.numbers))

Some preliminary plots

We can start by looking at crime counts by crime type, date, and location. Here we use the aggregate() function to pivot the table to give the output we want. In addition we specify formulas to carry out the aggregation, so that Crime.Count ~ Crime_Type + Date, means that we want the aggregation of the Crime.Count by Crime_Type and Date.

# Aggregating by crime type and location type, and then by date
aggCrime1 <- aggregate(Crime.Count ~ Crime_Type + Location_Type, data = btpData, FUN = sum, na.rm = TRUE)
aggCrime2 <- aggregate(Crime.Count ~ Crime_Type + Date, data = btpData, FUN = sum, na.rm = TRUE)
head(aggCrime1)
                 Crime_Type Location_Type Crime.Count
1         All Crime and ASB      ON TRAIN       38927
2     Anti Social Behaviour      ON TRAIN        4296
3                  Burglary      ON TRAIN           3
4 Criminal Damage and Arson      ON TRAIN        1898
5                     Drugs      ON TRAIN         345
6               Other Crime      ON TRAIN         982
head(aggCrime2)
                 Crime_Type       Date Crime.Count
1         All Crime and ASB 2011-01-31        4544
2     Anti Social Behaviour 2011-01-31         801
3                  Burglary 2011-01-31          29
4 Criminal Damage and Arson 2011-01-31         302
5                     Drugs 2011-01-31         299
6               Other Crime 2011-01-31         154

Line plots

We can look at the crime type counts with date. I will now unpack a little of what we use in the plotting function below. The I() function is the “as is” function, so in this case it “tells” qplot() that you really want the size to be .7. This function is very useful and comes up time and again in specifying various items in R, but mostly in formulas. The scale_y_log10() function allows you to specify a log-scale for the y axis. The scale_x_date() function allows the manipulation of the format of the date axis, in this case in 2 month intervals.

qplot(data = aggCrime2, x = Date, y = Crime.Count, color = Crime_Type, geom = "line",
      ylab = "Crime Count\n", xlab = "\nDate", size = I(.7)) + 
      geom_point(size = I(2.5), shape = 17) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
      labels = trans_format("log10", math_format(10^.x))) + scale_x_date(labels = date_format("%m/%y"), 
      breaks = date_breaks(width = "2 month"))
figure1

Next we create a line plot of how Crime.Count changes with Month for the Years that we posses data.

# Creating the plotting data set
monthYear = aggregate(Crime.Count ~ Month + Year, data = subset(btpData, 
                Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
monthYear$Month = as.factor(monthYear$Month)
monthYear$Year = as.factor(monthYear$Year)

This is a rather simple plot in comparison to the plots that came before. The main difference here is that we use the main ggplot() function to specify the main context items of the plot and add the geometries laterusing the geom_line() and geom_point() functions.

ggplot(monthYear, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
    geom_line() + geom_point() + theme(legend.position = "top") + xlab("\nMonth") + 
    ylab("\nCrime Count")
figure3

Now even though this blog is really about plotting I couldn’t help but notice the item below.

aggregate(Crime.Count ~ Year, data = monthYear, FUN = function(x)sign(diff(x)))
  Year Crime.Count.1 Crime.Count.2 Crime.Count.3 Crime.Count.4 Crime.Count.5
1 2011             1             1            -1             1             1
2 2012             1             1            -1             1            -1
  Crime.Count.6 Crime.Count.7 Crime.Count.8 Crime.Count.9 Crime.Count.10
1             1            -1            -1             1             -1
2             1             1            -1             1             -1
  Crime.Count.11
1             -1
2             -1

So what is it and what does it mean? What I have done is used the aggregate() function, but instead of doing the usual sum over the variable, in this case Year, I take the monthYear data that I have already pre-aggregated in terms of Year and Month and used the diff() function to subtract the Crime.Count of successive months and take the sign of the resulting number. I then print the table below which shows remarkable consistency in the movements of the crime data in successive months. Even though we only have two years of data, we can say something about this pattern since it repeats across many levels. We know that from Jan to Feb, the crime count is likely to increase and March to April the crime count will probably decrease. Before even doing any statistical analysis, we can start getting a feel for the various items in the data set. How many software packages allow such versatility in programming syntax when drilling into your data?

Next we look at the counts data. The chart on the left below shows the crime count for each area, however, the crime count per passenger tells a different story.

AreaYear = aggregate(Crime.Count ~ Area + Year, data = subset(btpData, 
        Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
AreaYearProp = aggregate(Count.Prop ~ Area + Year, data = subset(btpData, 
      Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
AreaYear$Year = ordered(AreaYear$Year)
AreaYearProp$Year = ordered(AreaYear$Year)
ay1 = ggplot(AreaYear, aes(x=Area, y=Crime.Count, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(axis.text.x = element_text(angle = 25, hjust = 1),
  legend.position = "top") + xlab("\nArea") + ylab("\nCrime Count")
theme(axis.text.x = element_text(angle = 25, hjust = 1), 
      legend.position = "top")
ay2 = ggplot(AreaYearProp, aes(x=Area, y=Count.Prop, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(axis.text.x = element_text(angle = 25, hjust = 1),
  legend.position = "top") + 
  xlab("\nArea") + ylab("\nCount Proportion")
grid.arrange(ay1, ay2, ncol = 2)
figure4

Bar plots

Next we look at some bar plots of Crime.Count with Crime_Type. In this plot we use the grid.arrange() function from the gridExtra package to arrange the plots in a single row. We specify the bar plot using geom = "bar", and you will notice that the bars are filled according to Location_Type, by using the fill="Location_Type". The axis.text.x item shows off some of the flexibility of the package allowing rotation of the axis text. We can also change the default text in the legend by using the scale_fill_discrete() function. The bar chart on the left shows crime count on a log scale, notice that position="dodge", indicating that the bars should be arranged side by side. The chart on the right indicates the proportions by using the position="fill" option.

ch1 = qplot(x = Crime_Type, y = Crime.Count, data = aggCrime1, geom = "bar", 
        fill = Location_Type, position = "dodge", xlab = "\nCrime Type", ylab = "Crime Count (log scale)\n") + 
        theme(axis.text.x = element_text(angle = 25, hjust = 1), 
        legend.position = "top") + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
        labels = trans_format("log10", math_format(10^.x))) + 
        scale_fill_discrete(name="Location Type\t", labels = c("ON TRAIN\t", "ON STATION"))
ch2 = qplot(x = Crime_Type, y = Crime.Count, data = aggCrime1, geom = "bar", 
      fill = Location_Type, position = "fill", xlab = "\nCrime Type", ylab = "Crime Count Proportions\n") + 
      theme(axis.text.x = element_text(angle = 25, hjust = 1), 
      legend.position = "top") + 
      scale_fill_discrete(name="Location Type\t", labels = c("ON TRAIN\t", "ON STATION"))
grid.arrange(ch1, ch2, ncol = 2)
figure2

We can plot a similar plots for the Area variable.

locData = aggregate(Crime.Count ~ Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
lc1 = ggplot(locData, aes(x=Crime_Type, y=Crime.Count, fill=Area)) +
  geom_bar(position = "dodge") + 
  theme(axis.text.x = element_text(angle = 25, hjust = 1), 
  legend.position = "none") + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x))) + xlab("Crime Type") + ylab("Crime Type")
 
lc2 = ggplot(locData, aes(x=Crime_Type, y=Crime.Count, fill=Area)) +
  geom_bar(position = "fill") + theme(axis.text.x = 
   element_text(angle = 25, hjust = 1), legend.position = "left") + xlab("Crime Type") + ylab("Crime Type")
 
grid.arrange(lc1, lc2, ncol = 2)
figure10

Some multi-way plots

Line plots

Next we add more variables into the mix and create plot where we can inspect three variables. From the two plots below, it looks like the underground is a comparatively safe place to travel despite the high counts. Here note the use of the facet_grid() function to plot the charts, one for each area.

# First the plot for counts
mayData = aggregate(Crime.Count ~ Month + Area + Year, 
            data = subset(btpData, Crime_Type == "All Crime and ASB"), 
            FUN = sum, na.rm = TRUE)
mayData$Year = ordered(mayData$Year)
mayData$Month = ordered(mayData$Month)
# This is a fantastic little plot showing how crime varies with month over different areas
ggplot(mayData, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(legend.position = "top") + 
  facet_grid(. ~ Area) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x)))
figure5
# Now the plot for proportions
mayData = aggregate(Count.Prop ~ Month + Area + Year, 
                     data = subset(btpData, Crime_Type == "All Crime and ASB"), 
                     FUN = sum, na.rm = TRUE)
mayData$Year = ordered(mayData$Year)
mayData$Month = ordered(mayData$Month)
# The facet plot again
ggplot(mayData, aes(x=Month, y=Count.Prop, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(legend.position = "top") + 
  facet_grid(. ~ Area) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x)))
figure6

Here we repeat the plots for Operator instead of Area. To present the plots in a tabular manner, we use facet_wrap() rather than facet_grid(). We are wrapping here since use of grid; wrapping allows us to make use of the rectangular grid when extending plots by one variable.

# Creating the data set
moyData <- aggregate(Crime.Count ~ Month + Operator + Year, 
            data = subset(btpData, Crime_Type == "All Crime and ASB"),
            FUN = sum, na.rm = TRUE)
moyData$Year <- ordered(moyData$Year)
moyData$Month <- ordered(moyData$Month)
moyData$Crime.Count <- as.numeric(moyData$Crime.Count)
# Remove operators with insufficient data
moyData <- subset(moyData, !Operator %in% c("First Scotrail", "Privately Owned"))
 
# We wrap rather than grid
# Chart for the counts
ggplot(moyData, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(legend.position = "top") + 
  facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("\nCrime Count")
figure7
# Creating the data set
moyData <- aggregate(Count.Prop ~ Month + Operator + Year, 
            data = subset(btpData, Crime_Type == "All Crime and ASB"),
            FUN = sum, na.rm = TRUE)
moyData$Year <- ordered(moyData$Year)
moyData$Month <- ordered(moyData$Month)
# Remove operators with insufficient data
moyData <- subset(moyData, !Operator %in% c("First Scotrail", "Privately Owned"))
# Plot for the proportions
ggplot(moyData, aes(x=Month, y=Count.Prop, colour=Year, group=Year)) +
  geom_line() + geom_point() + theme(legend.position = "top") + 
  facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("\nCrime Count Proportions")
figure8

We can re-plot Crime Count and highlight Crime Type in the plot. In terms of count.

opData = aggregate(Crime.Count ~ Crime_Type + Operator, data = btpData2, FUN = sum, na.rm = TRUE)
ggplot(opData, aes(x=Crime_Type, y=Crime.Count, color=Crime_Type, group = Operator)) + 
      geom_line() + geom_point() + facet_wrap( ~ Operator) +
      theme(axis.text.x = element_blank(), legend.title = element_blank(), 
      legend.position = "top") + xlab("\nCrime Type") + ylab("Crime Count\n")+ 
      scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
      labels = trans_format("log10", math_format(10^.x)))
figure13

And then for proportions

opData = aggregate(Count.Prop ~ Crime_Type + Operator, data = btpData2, FUN = sum, na.rm = TRUE)
ggplot(opData, aes(x=Crime_Type, y=Count.Prop, color=Crime_Type, group = Operator)) + 
      geom_line() + geom_point() + facet_wrap( ~ Operator) +
      theme(axis.text.x = element_blank(), legend.title = element_blank(), 
      legend.position = "top") + xlab("\nCrime Type") + ylab("Crime Count\n")+ 
      scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
      labels = trans_format("log10", math_format(10^.x)))
figure14

Here, we have a different representation using line plots for counts.

locData = aggregate(Crime.Count ~ Month + Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
locData$Month = ordered(locData$Month)
ggplot(locData, aes(x=Month, y=Crime.Count, color=Area, group=Area)) +
  geom_line() + geom_point() + facet_wrap( ~ Crime_Type) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("Crime Count \n")
figure11

For count proportions

locData = aggregate(Count.Prop ~ Month + Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
locData$Month = ordered(locData$Month)
ggplot(locData, aes(x=Month, y=Count.Prop, color=Area, group=Area)) +
  geom_line() + geom_point() + facet_wrap( ~ Crime_Type) + 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x))) + 
  theme(legend.position = "top", legend.title = element_blank()) + 
  ylab("Count Proportions \n") + xlab("\nMonth")
figure12

Box Plots

Now we do a box plot; this time instead of looking at counts with time, we look at count proportions with Crime_Type and Operator.

# Plot crime type facetted by operator
btpData2 <- btpData
btpData2$Year <- ordered(btpData2$Year)
btpData2$Month <- ordered(btpData2$Month)
btpData2 <- subset(btpData2, !Operator %in% c("Eurostar", "First Capital Connect", 
            "First Scotrail", "Heathrow Express", "Island Line", "Midland Metro", 
            "Privately Owned", "Stobart Rail", "First ScotRail", 
            "Strathclyde Partnership for Transport", "Tyne And Wear Metro (nexus)"))
 
ggplot(btpData2, aes(x=Crime_Type, y=Count.Prop, fill=Crime_Type, group=Crime_Type)) +
  geom_boxplot() + theme(legend.position = "top", 
  axis.text.x = element_blank(), legend.title = element_blank()) + 
  xlab("\nCrime Type") + ylab("Count Proportions\n") + 
  facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
  labels = trans_format("log10", math_format(10^.x)))
figure9

Summary

Using ggplot2 can clearly produce very interesting and informative graphics. There are few statistical programs like R that have a great potential to change the way that analysis is presented and carried out. I hope that you have gained some useful information from this blog entry.