Unconstant Conjunction A personal blog

Custom Hexbin Functions with ggplot

Recently, I wanted to create a map similar to James Cheshire’s crime map of London, which shows the most common crimes commited in a rectangular grid of points laid over London. Instead of using a rectangular grid, I wanted to use hexbins, but it turns out that ggplot needs a bit of prodding to do anything other than simply count the number of observations in each bin.

At the time I couldn’t find a good tutorial on writing custom hexbin functions, so this post is a reasonably thorough explanation of what I’ve made work.

Getting Spatial Data

To get these examples up and running, you’ll need some spatial data. I’m going to use something similar to the data used in the original crime map of London: crimes reported to the Metropolitan Police Department (of London) from May 2012 to April 2013. You can get the exact same extract I have used from the excellent data.police.uk website by following this link. The files come in one-month increments, each in their own folder. There are a number of interesting columns, but for our purposes we only need the coordinates and the type of crime. I’ve appended a quick bit of code to concatenate all of them together and extract the relevant columns at the end of this post.

You will also need a fairly recent (post version 2.0) of ggplot2. To use geom_hex and related functions, you will also require the hexbin package, which I’ve loaded explicitly here. I’m personally a fan of dplyr and the %>% operator, but its use here is purely to make working with the data easier.

library(dplyr, warn.conflicts = FALSE)
library(hexbin)
library(ggplot2)

The data looks as follows:

# load("path/to/crimes.rda")
head(crimes, 10)
## # A tibble: 10 x 5
##     Year Month Longitude Latitude                 Crime
##    <int> <int>     <dbl>    <dbl>                <fctr>
## 1   2012     5  0.135866 51.58734 Anti-social behaviour
## 2   2012     5  0.134947 51.58806 Anti-social behaviour
## 3   2012     5  0.135866 51.58734 Anti-social behaviour
## 4   2012     5  0.134947 51.58806 Anti-social behaviour
## 5   2012     5  0.137065 51.58367 Anti-social behaviour
## 6   2012     5  0.134947 51.58806 Anti-social behaviour
## 7   2012     5  0.140192 51.58231 Anti-social behaviour
## 8   2012     5  0.134947 51.58806 Anti-social behaviour
## 9   2012     5  0.134947 51.58806 Anti-social behaviour
## 10  2012     5  0.134947 51.58806 Anti-social behaviour

Creating a Basic Spatial Hexbin Map

To start with, we can produce a traditional hexbin overlaid on a basemap (that is, using something like Open Street Map or Google Maps as a “background”). The ggmap package is a handy way to get a basemap into your ggplot maps. You can get a basemap for London (as a ggplot object) as follows:

library(ggmap)

# These are approximate coordinates for London.
london.ish <- c(`left` = -0.25, `bottom` = 51.4,
                `right` = 0, `top` = 51.6)
london.map <- ggmap::get_map(location = london.ish,
                             maptype = "toner", source = "stamen")

basemap <- ggmap::ggmap(london.map) +
    coord_equal() +
    labs(x = NULL, y = NULL) +
    theme(axis.text = element_blank())

# The data set actually contains a lot of crime reports from outside
# of London itself. We're not interested in these observations.
crimes <- crimes %>%
    filter(Longitude > london.ish[1],
           Longitude < london.ish[3],
           Latitude > london.ish[2],
           Latitude < london.ish[4])

Which allows us to create a nice, traditional hexbin map:

basic.hexmap <- basemap +
    geom_hex(aes(x = Longitude, y = Latitude,
                 fill = cut(..value.., c(0, 100, 250, 500, 1000,
                                         1500, 2000, 2500, Inf))),
             colour = NA, bins = 30, alpha = 0.75,
             data = crimes) +
    scale_fill_brewer(palette = "OrRd",
                      labels = c("<100", "100-250", "250-500",
                                 "500-1000", "1000-1500",
                                 "1500-2000", "2000-2500",
                                 ">2500")) +
    labs(fill = NULL,
         title = "Reported Crimes in London, by Location (2012-2013)")

print(basic.hexmap)
![plot of chunk unnamed-chunk-4](./images/r-b6720d/unnamed-chunk-4-1.png)

I’ve binned the count data so that the map is a little more informative. (You can read more about using the cut function for this in my earlier post on hexbins.) If you look carefully you might notice that there are very few crimes in the City of London — I’m assuming that this is because they have their own police force.

The geom_hex geometry works by using stat_bin_hex to transform the source data frame into another data frame with binned data. (This is where the magic ..value.. aesthetic that I used above comes from: it refers to a column in the data frame that stat_bin_hex outputs.) In fact you can use the statistic directly to do some interesting things. Say, by using dots instead of hexagons, but still maintaining the hexagonal tiling:

dot.hexmap <- basemap +
    stat_bin_hex(aes(x = Longitude, y = Latitude, fill = NULL,
                     colour = cut(..value.., c(0, 100, 250, 500, 1000,
                                               1500, 2000, 2500, Inf))),
                 geom = "point", size = 3, bins = 30, alpha = 0.75,
                 data = crimes) +
    scale_colour_brewer(palette = "OrRd",
                        labels = c("<100", "100-250", "250-500",
                                   "500-1000", "1000-1500",
                                   "1500-2000", "2000-2500",
                                   ">2500")) +
    labs(colour = NULL,
         title = "Reported Crimes in London, by Location (2012-2013)")

print(dot.hexmap)
![plot of chunk unnamed-chunk-5](./images/r-b6720d/unnamed-chunk-5-1.png)

In fact this is precisely what the original crime map of London does — only with rectangular binning instead of hexagonal binning. Since Cheshire is a self-admitted ggplot2 user himself, I’m assuming the original map used stat_bin_2d.

Note that stat_bin_hex gets a little confused when you use a different geometry, which is why I’ve explicitly set fill = NULL here. You get strange legends otherwise.

Using Custom Binning Functions

But what if we want to do something other than just count up the number of crimes in each hexagon? Suppose that in our data set, each row corresponded to an arbitrary number of crimes of a given type at a given location. In other words, each observation has a weight accompanying it. We can’t really use stat_bin_hex for this, but ggplot provides another, more powerful interface to computing statistics on a hexagonal grid: stat_summary_hex.

The stat_summary_hex statistic expects two new things. First, it requires that we explicitly identify the third dimension our data is organized by, the z aesthetic. And second, it requires us to pass a function that will be used to compute the value at each hexagon in the grid from this z aesthetic. This also means that the stat_bin_hex statistic we used above is equivalent to running stat_summary_hex with fun = sum and a z = 1L aesthetic.

For example, suppose each row in our crime dataset referred to between one and three individual crimes at a given location. We could aggregate into a traditional hexbin map as follows:

crimes$Count <- sample(1:3, nrow(crimes), replace = TRUE)
summary.hexmap <- basemap +
    stat_summary_hex(aes(x = Longitude, y = Latitude, z = Count,
                         fill = cut(..value..,
                                    c(0, 100, 250, 500, 1000,
                                      1500, 2000, 2500, Inf))),
                     fun = sum,
                     colour = NA, bins = 30, alpha = 0.75,
                     data = crimes) +
    scale_fill_brewer(palette = "OrRd",
                      labels = c("<100", "100-250", "250-500",
                                 "500-1000", "1000-1500",
                                 "1500-2000", "2000-2500",
                                 ">2500")) +
    labs(fill = NULL,
         title = "Reported Crimes in London, by Location (2012-2013)")

print(summary.hexmap)
![plot of chunk unnamed-chunk-6](./images/r-b6720d/unnamed-chunk-6-1.png)

When stat_summary_hex Fails, Use hexbin Directly

There some limitations to stat_summary_hex. For instance, suppose we were just interested in non-drug crimes. The intuitive approach will fail:

broken.hexmap <- basemap +
    stat_summary_hex(aes(x = Longitude, y = Latitude, z = Crime,
                         fill = cut(..value..,
                                    c(0, 100, 250, 500, 1000,
                                      1500, 2000, 2500, Inf))),
                     fun = function(z) sum(z != "Drugs"),
                     colour = NA, bins = 30, alpha = 0.75,
                     data = crimes) +
    scale_fill_brewer(palette = "OrRd",
                      labels = c("<100", "100-250", "250-500",
                                 "500-1000", "1000-1500",
                                 "1500-2000", "2000-2500",
                                 ">2500")) +
    labs(fill = NULL,
         title = "Reported Crimes in London, by Location (2012-2013)")

print(broken.hexmap)
![plot of chunk unnamed-chunk-7](./images/r-b6720d/unnamed-chunk-7-1.png)

What’s going on here? It turns out that ggplot’s implementation (which you can see here) uses tapply, so it actually creates one geometry for each level of our Crime factor. You can see this in that the alpha setting seems to be ignored. In fact if you look closely you can see the layer below in some places.

Thankfully, ggplot is flexible enough for us to completely circumvent this problem. Internally, geom_hex calls out to the hexbin function through stat = "hex" and then draws hexagons according to the resulting data frame. We can work around this by computing the intermediate statistic ourselves, and then pass stat = "identity" to geom_hex instead (this is similar to how you would disable geom_bar’s binning if you were starting out with aggregate data).

So what we really need is to join our data set with a column indicating which hexagon a given row would fall in. That would allow us to compute arbitrary contents for the hexagons from the original data set. Thankfully, the hexbin function provides a straightforward way of doing this by passing IDs = TRUE (otherwise the cID slot is optional, since it can be very large).

The hexbin function itself returns an S4 object with a number of other interesting fields, chiefly those that can be used to create a data frame containing the hexagon coordinates.

crimes.hex <- hexbin::hexbin(crimes$Longitude, crimes$Latitude,
                           xbins = 30, IDs = TRUE)

# Deconstruct the relevant hexbin information into a dataset:
hexagons <- data.frame(hexbin::hcell2xy(crimes.hex),
                       cell = crimes.hex@cell,
                       count = crimes.hex@count)

crimes$cell = crimes.hex@cID
nondrug.crime.hexagons <- crimes %>%
    filter(Crime != "Drugs") %>%
    group_by(cell) %>%
    summarise(crimes = n()) %>%
    ungroup() %>%
    # Cut here, instead of in the ggplot call (as above).
    mutate(crime.level = cut(crimes, c(0, 100, 250, 500,
                                       1000, 1500, 2000,
                                       2500, Inf),
                             labels = c("<100", "100-250",
                                        "250-500", "500-1000",
                                        "1000-1500", "1500-2000",
                                        "2000-2500", ">2500"))) %>%
    right_join(hexagons, by = "cell") %>%
    select(cell, x, y, count, crimes, crime.level)

head(nondrug.crime.hexagons, 10)
## # A tibble: 10 x 6
##     cell          x        y count crimes crime.level
##    <int>      <dbl>    <dbl> <int>  <int>      <fctr>
## 1      1 -0.2499860 51.40001    77     77        <100
## 2      2 -0.2416534 51.40001   117    116     100-250
## 3      3 -0.2333209 51.40001    54     50        <100
## 4      4 -0.2249883 51.40001    26     26        <100
## 5      5 -0.2166557 51.40001    29     29        <100
## 6      6 -0.2083232 51.40001    87     86        <100
## 7      7 -0.1999906 51.40001   136    132     100-250
## 8      8 -0.1916580 51.40001   336    318     250-500
## 9      9 -0.1833255 51.40001    23     23        <100
## 10    10 -0.1749929 51.40001   102    100        <100

We can now plot the nondrug.crime.hexagons data frame directly, using geom_hex.

manual.hexmap <- basemap +
    geom_hex(aes(x = x, y = y, fill = crime.level),
             stat = "identity", colour = NA, alpha = 0.75,
             data = nondrug.crime.hexagons) +
    scale_fill_brewer(palette = "OrRd") +
    labs(fill = NULL,
         title = "Non-drug Crimes in London, by Location (2012-2013)")

print(manual.hexmap)
![plot of chunk unnamed-chunk-9](./images/r-b6720d/unnamed-chunk-9-1.png)

Using this approach it is possible to compute arbitrary data for each hex bin.

Postscript: Cleaning up the Crime Data

This snippet of code should extract all of the relevant data from the raw, unzipped data extract:

library(dplyr, warn.conflicts = FALSE)

# Point R at the unzipped files:
setwd("~/Downloads/b9490c7923e00e0e56ac3d2aa9f2b2a968935b15.zip Folder/")

csv.files <- data_frame(
    month = c(5:12, 1:4),
    year = c(rep(2012L, times = 8), rep(2013L, times = 4))
) %>% mutate(
    fmt = paste0(year, ifelse(month < 10, "-0", "-"), month),
    file = paste0(fmt, "/", fmt, "-metropolitan-street.csv")
) %>% apply(1, function(row) {
    cat("Reading file:", row['file'], "\n")
    raw.data <- read.csv(
        row['file'], stringsAsFactors = FALSE, na.strings = "",
        colClasses = c("NULL", "NULL", "NULL", "NULL", "numeric", "numeric",
                       "NULL", "NULL", "NULL", "factor", "NULL", "NULL")
    ) %>% mutate(Year = as.integer(row['year']),
                 Month = as.integer(row['month'])) %>%
        select(Year, Month, Longitude, Latitude, Crime = Crime.type)
    raw.data
})

crimes <- do.call(rbind, csv.files) %>% tbl_df()

save(crimes, file = "~/Code/R/data/london-crimes/crimes.rda",
     compress = "xz")

For reference, the following is the output of my sessionInfo() at the time of writing:

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin14.3.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] ggmap_2.6.1   ggplot2_2.1.0 hexbin_1.27.1 dplyr_0.5.0   stringr_1.0.0
## [6] digest_0.6.10 knitr_1.13   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6         magrittr_1.5        maps_3.1.1         
##  [4] munsell_0.4.3       colorspace_1.2-6    geosphere_1.5-5    
##  [7] lattice_0.20-33     rjson_0.2.15        R6_2.1.2           
## [10] jpeg_0.1-8          plyr_1.8.4          tools_3.3.1        
## [13] grid_3.3.1          gtable_0.2.0        png_0.1-7          
## [16] DBI_0.5             assertthat_0.1      tibble_1.1         
## [19] RJSONIO_1.3-0       reshape2_1.4.1      mapproj_1.2-4      
## [22] formatR_1.4         codetools_0.2-14    evaluate_0.9       
## [25] sp_1.2-3            stringi_1.1.1       RgoogleMaps_1.2.0.7
## [28] scales_0.4.0        proto_0.3-10

Since this post relies on certain implementation details of ggplot2 and hexbin, it may break with future versions of either library.

comments powered by Disqus