Recently, I wanted to create a map similar to James Cheshire’s crime map of London,
which shows the most common crimes commited in a rectangular grid of points laid
over London. Instead of using a rectangular grid, I wanted to use hexbins, but
it turns out that ggplot
needs a bit of prodding to do anything other than
simply count the number of observations in each bin.
At the time I couldn’t find a good tutorial on writing custom hexbin functions, so this post is a reasonably thorough explanation of what I’ve made work.
Getting Spatial Data
To get these examples up and running, you’ll need some spatial data. I’m going
to use something similar to the data used in the original crime map of London:
crimes reported to the Metropolitan Police Department (of London) from May 2012
to April 2013. You can get the exact same extract I have used from the excellent
data.police.uk
website by following this link.
The files come in one-month increments, each in their own folder. There are a
number of interesting columns, but for our purposes we only need the coordinates
and the type of crime. I’ve appended a quick bit of code to concatenate all of
them together and extract the relevant columns at the end of this post.
You will also need a fairly recent (post version 2.0) of ggplot2
. To use
geom_hex
and related functions, you will also require the hexbin
package,
which I’ve loaded explicitly here. I’m personally a fan of dplyr
and the %>%
operator, but its use here is purely to make working with the data easier.
library(dplyr, warn.conflicts = FALSE)
library(hexbin)
library(ggplot2)
The data looks as follows:
# load("path/to/crimes.rda")
head(crimes, 10)
## # A tibble: 10 x 5
## Year Month Longitude Latitude Crime
## <int> <int> <dbl> <dbl> <fctr>
## 1 2012 5 0.135866 51.58734 Anti-social behaviour
## 2 2012 5 0.134947 51.58806 Anti-social behaviour
## 3 2012 5 0.135866 51.58734 Anti-social behaviour
## 4 2012 5 0.134947 51.58806 Anti-social behaviour
## 5 2012 5 0.137065 51.58367 Anti-social behaviour
## 6 2012 5 0.134947 51.58806 Anti-social behaviour
## 7 2012 5 0.140192 51.58231 Anti-social behaviour
## 8 2012 5 0.134947 51.58806 Anti-social behaviour
## 9 2012 5 0.134947 51.58806 Anti-social behaviour
## 10 2012 5 0.134947 51.58806 Anti-social behaviour
Creating a Basic Spatial Hexbin Map
To start with, we can produce a traditional hexbin overlaid on a basemap (that
is, using something like Open Street Map or Google Maps as a “background”). The
ggmap
package is a handy way to get a basemap into your ggplot
maps. You can
get a basemap for London (as a ggplot
object) as follows:
library(ggmap)
# These are approximate coordinates for London.
london.ish <- c(`left` = -0.25, `bottom` = 51.4,
`right` = 0, `top` = 51.6)
london.map <- ggmap::get_map(location = london.ish,
maptype = "toner", source = "stamen")
basemap <- ggmap::ggmap(london.map) +
coord_equal() +
labs(x = NULL, y = NULL) +
theme(axis.text = element_blank())
# The data set actually contains a lot of crime reports from outside
# of London itself. We're not interested in these observations.
crimes <- crimes %>%
filter(Longitude > london.ish[1],
Longitude < london.ish[3],
Latitude > london.ish[2],
Latitude < london.ish[4])
Which allows us to create a nice, traditional hexbin map:
basic.hexmap <- basemap +
geom_hex(aes(x = Longitude, y = Latitude,
fill = cut(..value.., c(0, 100, 250, 500, 1000,
1500, 2000, 2500, Inf))),
colour = NA, bins = 30, alpha = 0.75,
data = crimes) +
scale_fill_brewer(palette = "OrRd",
labels = c("<100", "100-250", "250-500",
"500-1000", "1000-1500",
"1500-2000", "2000-2500",
">2500")) +
labs(fill = NULL,
title = "Reported Crimes in London, by Location (2012-2013)")
print(basic.hexmap)
I’ve binned the count data so that the map is a little more informative. (You
can read more about using the cut
function for this in my earlier post on hexbins.)
If you look carefully you might notice that there are very few crimes in the
City of London — I’m assuming that this is because they have their own police
force.
The geom_hex
geometry works by using stat_bin_hex
to transform the source
data frame into another data frame with binned data. (This is where the magic
..value..
aesthetic that I used above comes from: it refers to a column in the
data frame that stat_bin_hex
outputs.) In fact you can use the statistic
directly to do some interesting things. Say, by using dots instead of hexagons,
but still maintaining the hexagonal tiling:
dot.hexmap <- basemap +
stat_bin_hex(aes(x = Longitude, y = Latitude, fill = NULL,
colour = cut(..value.., c(0, 100, 250, 500, 1000,
1500, 2000, 2500, Inf))),
geom = "point", size = 3, bins = 30, alpha = 0.75,
data = crimes) +
scale_colour_brewer(palette = "OrRd",
labels = c("<100", "100-250", "250-500",
"500-1000", "1000-1500",
"1500-2000", "2000-2500",
">2500")) +
labs(colour = NULL,
title = "Reported Crimes in London, by Location (2012-2013)")
print(dot.hexmap)
In fact this is precisely what the original crime map of London does — only
with rectangular binning instead of hexagonal binning. Since Cheshire is a
self-admitted ggplot2 user himself,
I’m assuming the original map used stat_bin_2d
.
Note that stat_bin_hex
gets a little confused when you use a different
geometry, which is why I’ve explicitly set fill = NULL
here. You get strange
legends otherwise.
Using Custom Binning Functions
But what if we want to do something other than just count up the number of
crimes in each hexagon? Suppose that in our data set, each row corresponded to
an arbitrary number of crimes of a given type at a given location. In other
words, each observation has a weight accompanying it. We can’t really use
stat_bin_hex
for this, but ggplot
provides another, more powerful interface
to computing statistics on a hexagonal grid: stat_summary_hex
.
The stat_summary_hex
statistic expects two new things. First, it requires that
we explicitly identify the third dimension our data is organized by, the z
aesthetic. And second, it requires us to pass a function that will be used to
compute the value at each hexagon in the grid from this z
aesthetic. This also
means that the stat_bin_hex
statistic we used above is equivalent to running
stat_summary_hex
with fun = sum
and a z = 1L
aesthetic.
For example, suppose each row in our crime dataset referred to between one and three individual crimes at a given location. We could aggregate into a traditional hexbin map as follows:
crimes$Count <- sample(1:3, nrow(crimes), replace = TRUE)
summary.hexmap <- basemap +
stat_summary_hex(aes(x = Longitude, y = Latitude, z = Count,
fill = cut(..value..,
c(0, 100, 250, 500, 1000,
1500, 2000, 2500, Inf))),
fun = sum,
colour = NA, bins = 30, alpha = 0.75,
data = crimes) +
scale_fill_brewer(palette = "OrRd",
labels = c("<100", "100-250", "250-500",
"500-1000", "1000-1500",
"1500-2000", "2000-2500",
">2500")) +
labs(fill = NULL,
title = "Reported Crimes in London, by Location (2012-2013)")
print(summary.hexmap)
When stat_summary_hex
Fails, Use hexbin
Directly
There some limitations to stat_summary_hex
. For instance, suppose we were
just interested in non-drug crimes. The intuitive approach will fail:
broken.hexmap <- basemap +
stat_summary_hex(aes(x = Longitude, y = Latitude, z = Crime,
fill = cut(..value..,
c(0, 100, 250, 500, 1000,
1500, 2000, 2500, Inf))),
fun = function(z) sum(z != "Drugs"),
colour = NA, bins = 30, alpha = 0.75,
data = crimes) +
scale_fill_brewer(palette = "OrRd",
labels = c("<100", "100-250", "250-500",
"500-1000", "1000-1500",
"1500-2000", "2000-2500",
">2500")) +
labs(fill = NULL,
title = "Reported Crimes in London, by Location (2012-2013)")
print(broken.hexmap)
What’s going on here? It turns out that ggplot
’s implementation (which you
can see here)
uses tapply
, so it actually creates one geometry for each level of our
Crime
factor. You can see this in that the alpha setting seems to be ignored.
In fact if you look closely you can see the layer below in some places.
Thankfully, ggplot
is flexible enough for us to completely circumvent this
problem. Internally, geom_hex
calls out to the hexbin
function through stat = "hex"
and then draws hexagons according to the resulting data frame. We can
work around this by computing the intermediate statistic ourselves, and then
pass stat = "identity"
to geom_hex
instead (this is similar to how you would
disable geom_bar
’s binning if you were starting out with aggregate data).
So what we really need is to join our data set with a column indicating which
hexagon a given row would fall in. That would allow us to compute arbitrary
contents for the hexagons from the original data set. Thankfully, the hexbin
function provides a straightforward way of doing this by passing IDs = TRUE
(otherwise the cID
slot is optional, since it can be very large).
The hexbin
function itself returns an S4 object with a number of other
interesting fields, chiefly those that can be used to create a data frame
containing the hexagon coordinates.
crimes.hex <- hexbin::hexbin(crimes$Longitude, crimes$Latitude,
xbins = 30, IDs = TRUE)
# Deconstruct the relevant hexbin information into a dataset:
hexagons <- data.frame(hexbin::hcell2xy(crimes.hex),
cell = crimes.hex@cell,
count = crimes.hex@count)
crimes$cell = crimes.hex@cID
nondrug.crime.hexagons <- crimes %>%
filter(Crime != "Drugs") %>%
group_by(cell) %>%
summarise(crimes = n()) %>%
ungroup() %>%
# Cut here, instead of in the ggplot call (as above).
mutate(crime.level = cut(crimes, c(0, 100, 250, 500,
1000, 1500, 2000,
2500, Inf),
labels = c("<100", "100-250",
"250-500", "500-1000",
"1000-1500", "1500-2000",
"2000-2500", ">2500"))) %>%
right_join(hexagons, by = "cell") %>%
select(cell, x, y, count, crimes, crime.level)
head(nondrug.crime.hexagons, 10)
## # A tibble: 10 x 6
## cell x y count crimes crime.level
## <int> <dbl> <dbl> <int> <int> <fctr>
## 1 1 -0.2499860 51.40001 77 77 <100
## 2 2 -0.2416534 51.40001 117 116 100-250
## 3 3 -0.2333209 51.40001 54 50 <100
## 4 4 -0.2249883 51.40001 26 26 <100
## 5 5 -0.2166557 51.40001 29 29 <100
## 6 6 -0.2083232 51.40001 87 86 <100
## 7 7 -0.1999906 51.40001 136 132 100-250
## 8 8 -0.1916580 51.40001 336 318 250-500
## 9 9 -0.1833255 51.40001 23 23 <100
## 10 10 -0.1749929 51.40001 102 100 <100
We can now plot the nondrug.crime.hexagons
data frame directly, using
geom_hex
.
manual.hexmap <- basemap +
geom_hex(aes(x = x, y = y, fill = crime.level),
stat = "identity", colour = NA, alpha = 0.75,
data = nondrug.crime.hexagons) +
scale_fill_brewer(palette = "OrRd") +
labs(fill = NULL,
title = "Non-drug Crimes in London, by Location (2012-2013)")
print(manual.hexmap)
Using this approach it is possible to compute arbitrary data for each hex bin.
Postscript: Cleaning up the Crime Data
This snippet of code should extract all of the relevant data from the raw, unzipped data extract:
library(dplyr, warn.conflicts = FALSE)
# Point R at the unzipped files:
setwd("~/Downloads/b9490c7923e00e0e56ac3d2aa9f2b2a968935b15.zip Folder/")
csv.files <- data_frame(
month = c(5:12, 1:4),
year = c(rep(2012L, times = 8), rep(2013L, times = 4))
) %>% mutate(
fmt = paste0(year, ifelse(month < 10, "-0", "-"), month),
file = paste0(fmt, "/", fmt, "-metropolitan-street.csv")
) %>% apply(1, function(row) {
cat("Reading file:", row['file'], "\n")
raw.data <- read.csv(
row['file'], stringsAsFactors = FALSE, na.strings = "",
colClasses = c("NULL", "NULL", "NULL", "NULL", "numeric", "numeric",
"NULL", "NULL", "NULL", "factor", "NULL", "NULL")
) %>% mutate(Year = as.integer(row['year']),
Month = as.integer(row['month'])) %>%
select(Year, Month, Longitude, Latitude, Crime = Crime.type)
raw.data
})
crimes <- do.call(rbind, csv.files) %>% tbl_df()
save(crimes, file = "~/Code/R/data/london-crimes/crimes.rda",
compress = "xz")
For reference, the following is the output of my sessionInfo()
at the time of
writing:
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin14.3.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] ggmap_2.6.1 ggplot2_2.1.0 hexbin_1.27.1 dplyr_0.5.0 stringr_1.0.0
## [6] digest_0.6.10 knitr_1.13
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 magrittr_1.5 maps_3.1.1
## [4] munsell_0.4.3 colorspace_1.2-6 geosphere_1.5-5
## [7] lattice_0.20-33 rjson_0.2.15 R6_2.1.2
## [10] jpeg_0.1-8 plyr_1.8.4 tools_3.3.1
## [13] grid_3.3.1 gtable_0.2.0 png_0.1-7
## [16] DBI_0.5 assertthat_0.1 tibble_1.1
## [19] RJSONIO_1.3-0 reshape2_1.4.1 mapproj_1.2-4
## [22] formatR_1.4 codetools_0.2-14 evaluate_0.9
## [25] sp_1.2-3 stringi_1.1.1 RgoogleMaps_1.2.0.7
## [28] scales_0.4.0 proto_0.3-10
Since this post relies on certain implementation details of ggplot2
and
hexbin
, it may break with future versions of either library.