The purpose of this blog post is to outline some exploratory plots using crime data, available from data.gov.uk website and the ggplot2 package in R. The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great looking plots and impressive flexibility have made it a popular amongst R coders. The syntax is rather different from other R graphics packages allowing users to produce very creative plots with relatively small amounts of code. It uses the characteristic “+” operator that allows existing plot elements to be manipulated even after the plot has been created and displayed. This is a rather different approach from other graphics packages and allows a much more dynamic approach to plotting.
Though this blog post has been created for crime data, the principles can be extended to analysis of many different data sets.
Before I begin there are two items to cover:
svg()
function on the grDevices package on R.The data used in this plotting tutorial was from the data.gov.uk website. It contains monthly crime data for the preceding 24 months, in this case from January 2011 to December 2012 (“31/12/12 BTP-Dec-2012”) for crimes on trains collected by the British Transport Police. It contains counts of crime by crime type, information about the station where the crime occurred, the operator, the area, passenger numbers, the address, and whether the crime occurred on the station or the train. This data set has 106,084 rows – large enough to be useful but not immediately unwieldy; it also has problems with missing values in some columns. We are interested in how the number of crimes committed in the Crime.Count column values varies with other variables in the table.
#We load some packages
# Our plotting tool
require(ggplot2)
# For arranging the plots
require(gridExtra)
# For manipulating the plot scales
require(scales)
# For generting our svg files
require(grDevices)
options("stringsAsFactors" = TRUE)
# Path to the folder holding the data csv
path <- "C:\\...\\"
Read the latest data in
btpData <- read.csv(file = paste(path, "BTP-Dec-2012.csv", sep = ""), header = TRUE)
names(btpData)[1] <- "Crime_Type"
The dimensions of the table …
dim(btpData)
# [1] 106084 12
We use the head()
function to view the data first 6 rows of the data
head(btpData)
Crime_Type Station_Name Month Year Crime.Count Location_Type
1 Criminal damage and arson Abbey Road 9 2012 1 STATION
2 Anti Social Behaviour Abbey Road 12 2011 1 STATION
3 Public Disorder and Weapons Abbey Road 2 2012 1 STATION
4 Public Disorder and Weapons Abbey Road 5 2012 1 ON TRAIN
5 Robbery Abbey Road 5 2012 1 ON TRAIN
6 Vehicle crime Abbey Road 1 2012 1 STATION
Operator OperatorURL Area
1 Transport for London http://www.tfl.gov.uk/ London Underground
2 Transport for London http://www.tfl.gov.uk/ London Underground
3 Transport for London http://www.tfl.gov.uk/ London Underground
4 Transport for London http://www.tfl.gov.uk/ London Underground
5 Transport for London http://www.tfl.gov.uk/ London Underground
6 Transport for London http://www.tfl.gov.uk/ London Underground
Passenger.numbers Address Type
1 NULL NULL STATION
2 NULL NULL STATION
3 NULL NULL STATION
4 NULL NULL STATION
5 NULL NULL STATION
6 NULL NULL STATION
The summary()
function allows us to look a data summary. Almost half of the data has missing Address
fields and 4919 of the rows have missing Passenger.numbers
values. There is some justification that these data points should be removed and we drop them shortly.
summary(btpData)
Crime_Type Station_Name Month
All crime and ASB :40106 Victoria : 886 Min. : 1.000
Other Theft :20066 Waterloo : 616 1st Qu.: 4.000
Anti Social Behaviour :11152 Liverpool Street: 570 Median : 7.000
Violent crime : 9211 Leeds : 518 Mean : 6.576
Public Disorder and Weapons: 8814 Euston : 517 3rd Qu.:10.000
Criminal damage and arson : 5417 London Bridge : 516 Max. :12.000
(Other) :11318 (Other) :102461
Year Crime.Count Location_Type Operator
Min. :2011 Min. : 1.000 ON TRAIN:41204 Transport for London:26944
1st Qu.:2011 1st Qu.: 1.000 STATION :64880 South Eastern Trains: 9130
Median :2012 Median : 1.000 South West Trains : 8745
Mean :2012 Mean : 2.141 Southern Railway : 6818
3rd Qu.:2012 3rd Qu.: 2.000 Northern Rail : 6611
Max. :2012 Max. :106.000 First Great Western : 6342
(Other) :41494
OperatorURL Area
http://www.tfl.gov.uk/ :26944 London North :19740
http://www.southeasternrailway.co.uk/: 9130 London South :27648
http://www.southwesttrains.co.uk : 8745 London Underground:23660
http://www.southernrailway.com : 6818 North East : 8405
http://www.northernrail.org/ : 6611 North West : 9227
http://www.firstgreatwestern.co.uk/ : 6342 Scotland : 5015
(Other) :41494 Wales and Western :12389
Passenger.numbers Address
NULL : 4919 NULL :49843
9039907 : 518 FAIRFIELD STREET,MANCHESTER,M1 2PB : 353
32853942: 439 SMALLBROOK QUEESWAY,BIRMINGHAM,B2 4QA: 337
10273542: 353 MELTON STREET,EUSTON,NW1 2HS : 329
38357578: 344 YORK ROAD,WATERLOO,SE1 8SW : 312
8540695 : 337 STATION HILL,READING,RG1 1LZ : 306
(Other) :99174 (Other) :54604
Type
LU STATION:23798
METROSTN : 391
STATION :80146
SUBWAY : 169
TRAMLINK : 1580
The data has now been reloaded with the option("stringsAsFactors" = FALSE)
, we want to retain the character variables to prevent unexpected behaviour from the variables. First we want to append the date for the end of the month onto each data item.
Regardless of the software you use for data analysis, date formats can be a little challenging to work with. In our data set, we have Month and Year information, but we want to create column for date which will be the last day of the month since this is when the data was valid for. This can be rather tricky; my original code was verbose but ran okay in speed terms, after a few iterations, it has been reduced to very few lines but runs like a tortoise! The main problem is that I use the seq()
function which is not vectorized. I include the tortoise version because it is easier to see what it happening with fewer code lines. I first create a dates
vector by using the month and year information but setting the day to 1, the first day of the month. I then use the seq()
function to step forward 1 month and subtract 1 day which takes me back to the last day of the original month.
# Creating the dates column
dates <- as.Date(paste(btpData$Year, btpData$Month, "1", sep = "-"))
dates <- sapply(dates, function(x){
seq(x, by = "1 month", length.out = 2)[2] - 1
})
btpData$Date <- as.Date(dates)
Now we do some basic formatting on the data, making sure that some names are correctly formatted, and drop any rows that have no passenger number information.
# Rename some items
btpData$Crime_Type[btpData$Crime_Type == "Other crime"] <- "Other Crime"
btpData$Crime_Type[btpData$Crime_Type == "Vehicle crime"] <- "Vehicle Crime"
btpData$Crime_Type[btpData$Crime_Type == "Violent crime"] <- "Violent Crime"
btpData$Crime_Type[btpData$Crime_Type == "All crime and ASB"] <- "All Crime and ASB"
btpData$Crime_Type[btpData$Crime_Type == "Criminal damage and arson"] <- "Criminal Damage and Arson"
btpData$Passenger.numbers <- as.numeric(btpData$Passenger.numbers)
# Perhaps looking at the number crime per passenger is sensible
btpData$Count.Prop <- with(btpData, Crime.Count/Passenger.numbers)
btpData$Operator[btpData$Operator == "First Transpennine Express"] <- "First TransPennine Express"
# We drop the data with no Passenger.numbers data
btpData <- subset(btpData, !is.na(Passenger.numbers))
We can start by looking at crime counts by crime type, date, and location. Here we use the aggregate()
function to pivot the table to give the output we want. In addition we specify formulas to carry out the aggregation, so that Crime.Count ~ Crime_Type + Date
, means that we want the aggregation of the Crime.Count
by Crime_Type
and Date
.
# Aggregating by crime type and location type, and then by date
aggCrime1 <- aggregate(Crime.Count ~ Crime_Type + Location_Type, data = btpData, FUN = sum, na.rm = TRUE)
aggCrime2 <- aggregate(Crime.Count ~ Crime_Type + Date, data = btpData, FUN = sum, na.rm = TRUE)
head(aggCrime1)
Crime_Type Location_Type Crime.Count
1 All Crime and ASB ON TRAIN 38927
2 Anti Social Behaviour ON TRAIN 4296
3 Burglary ON TRAIN 3
4 Criminal Damage and Arson ON TRAIN 1898
5 Drugs ON TRAIN 345
6 Other Crime ON TRAIN 982
head(aggCrime2)
Crime_Type Date Crime.Count
1 All Crime and ASB 2011-01-31 4544
2 Anti Social Behaviour 2011-01-31 801
3 Burglary 2011-01-31 29
4 Criminal Damage and Arson 2011-01-31 302
5 Drugs 2011-01-31 299
6 Other Crime 2011-01-31 154
We can look at the crime type counts with date. I will now unpack a little of what we use in the plotting function below. The I()
function is the “as is” function, so in this case it “tells” qplot()
that you really want the size to be .7
. This function is very useful and comes up time and again in specifying various items in R, but mostly in formulas. The scale_y_log10()
function allows you to specify a log-scale for the y axis. The scale_x_date()
function allows the manipulation of the format of the date axis, in this case in 2 month intervals.
qplot(data = aggCrime2, x = Date, y = Crime.Count, color = Crime_Type, geom = "line",
ylab = "Crime Count\n", xlab = "\nDate", size = I(.7)) +
geom_point(size = I(2.5), shape = 17) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) + scale_x_date(labels = date_format("%m/%y"),
breaks = date_breaks(width = "2 month"))
Next we create a line plot of how Crime.Count
changes with Month for the Years that we posses data.
# Creating the plotting data set
monthYear = aggregate(Crime.Count ~ Month + Year, data = subset(btpData,
Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
monthYear$Month = as.factor(monthYear$Month)
monthYear$Year = as.factor(monthYear$Year)
This is a rather simple plot in comparison to the plots that came before. The main difference here is that we use the main ggplot()
function to specify the main context items of the plot and add the geometries laterusing the geom_line()
and geom_point()
functions.
ggplot(monthYear, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(legend.position = "top") + xlab("\nMonth") +
ylab("\nCrime Count")
Now even though this blog is really about plotting I couldn’t help but notice the item below.
aggregate(Crime.Count ~ Year, data = monthYear, FUN = function(x)sign(diff(x)))
Year Crime.Count.1 Crime.Count.2 Crime.Count.3 Crime.Count.4 Crime.Count.5
1 2011 1 1 -1 1 1
2 2012 1 1 -1 1 -1
Crime.Count.6 Crime.Count.7 Crime.Count.8 Crime.Count.9 Crime.Count.10
1 1 -1 -1 1 -1
2 1 1 -1 1 -1
Crime.Count.11
1 -1
2 -1
So what is it and what does it mean? What I have done is used the aggregate()
function, but instead of doing the usual sum over the variable, in this case Year
, I take the monthYear
data that I have already pre-aggregated in terms of Year
and Month
and used the diff()
function to subtract the Crime.Count
of successive months and take the sign of the resulting number. I then print the table below which shows remarkable consistency in the movements of the crime data in successive months. Even though we only have two years of data, we can say something about this pattern since it repeats across many levels. We know that from Jan to Feb, the crime count is likely to increase and March to April the crime count will probably decrease. Before even doing any statistical analysis, we can start getting a feel for the various items in the data set. How many software packages allow such versatility in programming syntax when drilling into your data?
Next we look at the counts data. The chart on the left below shows the crime count for each area, however, the crime count per passenger tells a different story.
AreaYear = aggregate(Crime.Count ~ Area + Year, data = subset(btpData,
Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
AreaYearProp = aggregate(Count.Prop ~ Area + Year, data = subset(btpData,
Crime_Type == "All Crime and ASB"), FUN = sum, na.rm = TRUE)
AreaYear$Year = ordered(AreaYear$Year)
AreaYearProp$Year = ordered(AreaYear$Year)
ay1 = ggplot(AreaYear, aes(x=Area, y=Crime.Count, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "top") + xlab("\nArea") + ylab("\nCrime Count")
theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "top")
ay2 = ggplot(AreaYearProp, aes(x=Area, y=Count.Prop, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "top") +
xlab("\nArea") + ylab("\nCount Proportion")
grid.arrange(ay1, ay2, ncol = 2)
Next we look at some bar plots of Crime.Count
with Crime_Type
. In this plot we use the grid.arrange()
function from the gridExtra
package to arrange the plots in a single row. We specify the bar plot using geom = "bar"
, and you will notice that the bars are filled according to Location_Type
, by using the fill="Location_Type"
. The axis.text.x
item shows off some of the flexibility of the package allowing rotation of the axis text. We can also change the default text in the legend by using the scale_fill_discrete()
function. The bar chart on the left shows crime count on a log scale, notice that position="dodge"
, indicating that the bars should be arranged side by side. The chart on the right indicates the proportions by using the position="fill"
option.
ch1 = qplot(x = Crime_Type, y = Crime.Count, data = aggCrime1, geom = "bar",
fill = Location_Type, position = "dodge", xlab = "\nCrime Type", ylab = "Crime Count (log scale)\n") +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "top") + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
scale_fill_discrete(name="Location Type\t", labels = c("ON TRAIN\t", "ON STATION"))
ch2 = qplot(x = Crime_Type, y = Crime.Count, data = aggCrime1, geom = "bar",
fill = Location_Type, position = "fill", xlab = "\nCrime Type", ylab = "Crime Count Proportions\n") +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "top") +
scale_fill_discrete(name="Location Type\t", labels = c("ON TRAIN\t", "ON STATION"))
grid.arrange(ch1, ch2, ncol = 2)
We can plot a similar plots for the Area variable.
locData = aggregate(Crime.Count ~ Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
lc1 = ggplot(locData, aes(x=Crime_Type, y=Crime.Count, fill=Area)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "none") +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) + xlab("Crime Type") + ylab("Crime Type")
lc2 = ggplot(locData, aes(x=Crime_Type, y=Crime.Count, fill=Area)) +
geom_bar(position = "fill") + theme(axis.text.x =
element_text(angle = 25, hjust = 1), legend.position = "left") + xlab("Crime Type") + ylab("Crime Type")
grid.arrange(lc1, lc2, ncol = 2)
Next we add more variables into the mix and create plot where we can inspect three variables. From the two plots below, it looks like the underground is a comparatively safe place to travel despite the high counts. Here note the use of the facet_grid()
function to plot the charts, one for each area.
# First the plot for counts
mayData = aggregate(Crime.Count ~ Month + Area + Year,
data = subset(btpData, Crime_Type == "All Crime and ASB"),
FUN = sum, na.rm = TRUE)
mayData$Year = ordered(mayData$Year)
mayData$Month = ordered(mayData$Month)
# This is a fantastic little plot showing how crime varies with month over different areas
ggplot(mayData, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(legend.position = "top") +
facet_grid(. ~ Area) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
# Now the plot for proportions
mayData = aggregate(Count.Prop ~ Month + Area + Year,
data = subset(btpData, Crime_Type == "All Crime and ASB"),
FUN = sum, na.rm = TRUE)
mayData$Year = ordered(mayData$Year)
mayData$Month = ordered(mayData$Month)
# The facet plot again
ggplot(mayData, aes(x=Month, y=Count.Prop, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(legend.position = "top") +
facet_grid(. ~ Area) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
Here we repeat the plots for Operator instead of Area. To present the plots in a tabular manner, we use facet_wrap()
rather than facet_grid()
. We are wrapping here since use of grid; wrapping allows us to make use of the rectangular grid when extending plots by one variable.
# Creating the data set
moyData <- aggregate(Crime.Count ~ Month + Operator + Year,
data = subset(btpData, Crime_Type == "All Crime and ASB"),
FUN = sum, na.rm = TRUE)
moyData$Year <- ordered(moyData$Year)
moyData$Month <- ordered(moyData$Month)
moyData$Crime.Count <- as.numeric(moyData$Crime.Count)
# Remove operators with insufficient data
moyData <- subset(moyData, !Operator %in% c("First Scotrail", "Privately Owned"))
# We wrap rather than grid
# Chart for the counts
ggplot(moyData, aes(x=Month, y=Crime.Count, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(legend.position = "top") +
facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("\nCrime Count")
# Creating the data set
moyData <- aggregate(Count.Prop ~ Month + Operator + Year,
data = subset(btpData, Crime_Type == "All Crime and ASB"),
FUN = sum, na.rm = TRUE)
moyData$Year <- ordered(moyData$Year)
moyData$Month <- ordered(moyData$Month)
# Remove operators with insufficient data
moyData <- subset(moyData, !Operator %in% c("First Scotrail", "Privately Owned"))
# Plot for the proportions
ggplot(moyData, aes(x=Month, y=Count.Prop, colour=Year, group=Year)) +
geom_line() + geom_point() + theme(legend.position = "top") +
facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("\nCrime Count Proportions")
We can re-plot Crime Count and highlight Crime Type in the plot. In terms of count.
opData = aggregate(Crime.Count ~ Crime_Type + Operator, data = btpData2, FUN = sum, na.rm = TRUE)
ggplot(opData, aes(x=Crime_Type, y=Crime.Count, color=Crime_Type, group = Operator)) +
geom_line() + geom_point() + facet_wrap( ~ Operator) +
theme(axis.text.x = element_blank(), legend.title = element_blank(),
legend.position = "top") + xlab("\nCrime Type") + ylab("Crime Count\n")+
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
And then for proportions
opData = aggregate(Count.Prop ~ Crime_Type + Operator, data = btpData2, FUN = sum, na.rm = TRUE)
ggplot(opData, aes(x=Crime_Type, y=Count.Prop, color=Crime_Type, group = Operator)) +
geom_line() + geom_point() + facet_wrap( ~ Operator) +
theme(axis.text.x = element_blank(), legend.title = element_blank(),
legend.position = "top") + xlab("\nCrime Type") + ylab("Crime Count\n")+
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
Here, we have a different representation using line plots for counts.
locData = aggregate(Crime.Count ~ Month + Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
locData$Month = ordered(locData$Month)
ggplot(locData, aes(x=Month, y=Crime.Count, color=Area, group=Area)) +
geom_line() + geom_point() + facet_wrap( ~ Crime_Type) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) + xlab("\nMonth") + ylab("Crime Count \n")
For count proportions
locData = aggregate(Count.Prop ~ Month + Crime_Type + Area, data = btpData, FUN = sum, na.rm = TRUE)
locData$Month = ordered(locData$Month)
ggplot(locData, aes(x=Month, y=Count.Prop, color=Area, group=Area)) +
geom_line() + geom_point() + facet_wrap( ~ Crime_Type) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
theme(legend.position = "top", legend.title = element_blank()) +
ylab("Count Proportions \n") + xlab("\nMonth")
Now we do a box plot; this time instead of looking at counts with time, we look at count proportions with Crime_Type
and Operator
.
# Plot crime type facetted by operator
btpData2 <- btpData
btpData2$Year <- ordered(btpData2$Year)
btpData2$Month <- ordered(btpData2$Month)
btpData2 <- subset(btpData2, !Operator %in% c("Eurostar", "First Capital Connect",
"First Scotrail", "Heathrow Express", "Island Line", "Midland Metro",
"Privately Owned", "Stobart Rail", "First ScotRail",
"Strathclyde Partnership for Transport", "Tyne And Wear Metro (nexus)"))
ggplot(btpData2, aes(x=Crime_Type, y=Count.Prop, fill=Crime_Type, group=Crime_Type)) +
geom_boxplot() + theme(legend.position = "top",
axis.text.x = element_blank(), legend.title = element_blank()) +
xlab("\nCrime Type") + ylab("Count Proportions\n") +
facet_wrap( ~ Operator, nrow = 4) + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
Using ggplot2 can clearly produce very interesting and informative graphics. There are few statistical programs like R that have a great potential to change the way that analysis is presented and carried out. I hope that you have gained some useful information from this blog entry.