This post is meant to be a short intro on how to create visualizations like the following using R and ggplot2:
Update (February 6, 2017): I’ve updated the content of this post to
be much more modern, taking advantage of developments in the spatial package
ecosystem and in the capabilities of ggplot2
.
To get through this tutorial, you’ll need the following packages:
suppressMessages({
library(rgeos)
library(maptools)
library(ggplot2)
})
The first thing you need to get your hands on is some representation of the
polygons on a map. There are many formats for these polygons, but one of the
more common ones is the .shp
format used by ArcGIS. The maptools
package
enables opening and manipulating objects created from this format, so you’ll
need to get your hands on both that and its prerequisite, rgeos
.
Update (February 6, 2017): In the past I recommended gpclib
, but
the rgeos
library is now the preferred approach. It also has a more permissive
license.
A nice scale to work at in Canada is the level of the forward sortation areas (FSAs) defined by the first three digits of a postal code. Until very recently the exact borders of these areas was data you had to pay for, but now Statistics Canada makes it available on the Census website, here.
Fair warning: it’s quite a large file, and will take some time to downlaod. Of course, all of the FSAs in Canada is a bit too much to deal with, so I’ll focus just on those that encompass the city of Toronto, which all begin with the letter “M”.
# fsas <- maptools::readShapeSpatial("path/to/gfsa000b11a_e.shp")
toronto.fsas <- fsas[substr(fsas$CFSAUID, 1, 1) == 'M',]
# The shapefile takes up quite a lot of memory, and we no longer need it all,
# so you can delete it if you like:
rm(fsas)
We have to turn the object that maptools
produces into something that can be
plotted in the ggplot2
paradigm, which reguires a data.frame
object. To do
this, we can use the ggplot2::fortify()
function:
data <- ggplot2::fortify(toronto.fsas, region = "CFSAUID")
# Turn the generated "id" chr column into a correctly-labelled factor variable
data$fsa <- factor(data$id)
data$id <- NULL
Toronto is actually pretty great about making lots of interesting data available online. They provide the number of licensed dogs and cats in the city by FSA here, although it needs to be cleaned up a little before you can use it directly. It needs to be merged into the map data:
# pets <- read.csv("path/to/pets.csv")
# Merge the map data and the pet data into something we'll plot
plot.data <- merge(data, pets, by = "fsa")
Now we can use ggplot2
to plot the polygons, and fill them with a gradient
based on the number of dogs. Note that you must group
the polygons, otherwise
they might not be drawn out in the correct order (try omitting it and see).
Another key parameter is the coord_equal()
coordinate modification: since
we’re dealing with map data, if the scaling of the x and y axis are not the
same the map will be arbitrarily distorted.
I’m glossing over the details of map projections here — in this case, the city is small enough that a latitude = longitude presentation will look just fine.
ggplot(plot.data, aes(x = long, y = lat, group = group, fill = dog)) +
geom_polygon() +
coord_equal()
It’s that easy! Of course, the default ggplot2
theme is really meant for
presenting data, and might not look that great with a map. The built-in
theme_void()
will strip away most of this extraneous framing, although I tend
to tweak it a little as well.
Update (February 6, 2017): Bin your spatial data, and use viridis
.
I strongly believe that you usually want to bin data for choropleth maps, since
it can be very difficult to judge fine colour differences. The
ggplot2::cut_number()
function will find bins roughly equal in size, which is
a good place to start.
Unfortunately, the built-in colours for discrete scales are horrible not only
for a final presentation, but also for your own exploratory work. You’re better
off using one of the ColorBrewer palettes (available through the built-in
scale_fill_brewer()
function), and you should also check out the carefully
tuned palettes from the viridis
package.
Now let’s try this again, with these changes:
library(viridis)
ggplot(plot.data, aes(x = long, y = lat, group = group,
fill = cut_number(dog, 5))) +
geom_polygon(color = "grey10", size = 0.2) +
coord_equal() +
viridis::scale_fill_viridis(discrete = TRUE) +
labs(title = "Registered Dogs (Toronto, 2013)",
fill = NULL) +
theme_void() +
theme(legend.position = "bottom",
panel.background = element_rect(fill = NA, colour = "#cccccc"))
Which looks much better. Normally we’d be done here, but the sage advice of Randall Munroe is worth keeping in mind:
So we can improve the meaningfulness of this map by adjusting for population. Luckily, Statistics Canada actually publishes population data at the FSA-level, and even more luckily they provide data on the number of private dwellings that are occupied by their usual owners. This last number is particularly interesting because it is the level at which you would expect pet ownership to occur. You can find the data here.
As with most StatsCan-provided CSV files, you will need to clean it up a bit before using it.
# pop <- read.csv("path/to/fsa_pop.csv")
# We don't really need the rest of this...
pop$population <- NULL
pop$indian <- NULL
pop$dwellings <- NULL
# Merge the data into the dataset we're already using
plot.data <- merge(plot.data, pop, by = "fsa")
# Calculate the density of dogs per private dwelling
plot.data$dog.density <- plot.data$dog / plot.data$dwellings.usual
And now we can plot our population-adjusted map using the theme from before:
dog.map <- ggplot(plot.data, aes(x = long, y = lat, group = group,
fill = cut_number(dog.density, 5))) +
geom_polygon(color = "gray10", size = 0.2) +
coord_equal() +
viridis::scale_fill_viridis(discrete = TRUE) +
labs(title = "Registered Dogs per Private Dwelling (Toronto, 2013)",
fill = NULL) +
theme_void() +
theme(legend.position = "bottom",
panel.background = element_rect(fill = NA, colour = "#cccccc"))
dog.map
Notice that the numbers are quite small. We might want to convert to per-1000 units later.
Style Points: Better Longitude & Latitude Labels
In case you need to explicitly show the coordinates of your spatial data, we can
ensure that they are formatted nicely by passing a custom formatting function to
the labels
option on the x and y scale. These functions must take a vector of
inputs and return a vector of labels, and we can write one to handle each of
longitude and latitude quite easily:
# Converts a numeric array to an array of latitude labels
latitude <- function(x)
paste0(format(x, digits = 3), "° ", ifelse(x > 0, "N", "S"))
# Converts a numeric array to an array of longitude labels
longitude <- function(x)
paste0(format(x, digits = 3), "° ", ifelse(x > 0, "E", "W"))
dog.map +
scale_y_continuous(labels = latitude) +
scale_x_continuous(labels = longitude) +
theme(axis.text = element_text(size = 8, colour = "#333333"))
Style Points: Adding a Compass Rose
Update (February 6, 2017): Before ggsn
existed, I used to do this
manually by creating a grob from a vector image using the grImport
package.
This is no longer necessary.
The ggsn
package makes it possible to
add scale bars and compass rose symbols to ggplot
objects.
library(ggsn)
dog.map +
ggsn::north(plot.data, location = "bottomright", symbol = 10)
Style Points: Adding a Drop Shadow
There’s a little trick for adding a drop shadow: we just draw the polygons filled with grey and shifted slightly southeast, and then draw the actual map polygons with the correct fill on top (which I learned here).
ggplot(plot.data, aes(x = long, y = lat, group = group,
fill = cut_number(dog.density, 5))) +
geom_polygon(aes(x = long + 0.005, y = lat - 0.002),
color = "grey50", size = 0.2, fill = "grey50") +
geom_polygon(color = "gray10", size = 0.2) +
coord_equal() +
viridis::scale_fill_viridis(discrete = TRUE) +
labs(title = "Registered Dogs per Private Dwelling (Toronto, 2013)",
fill = NULL) +
theme_void() +
theme(legend.position = "bottom",
panel.background = element_rect(fill = NA, colour = "#cccccc"))
Finishing Touches: Better Fonts and Nicer Labels
Update (February 6, 2017): As of ggplot2
version 2.2, subtitles and
captions are officially supported.
For standalone graphics, it is conventional to point to your data sources
somewhere, often below. You can do this fairly easily with the caption
label.
I also like to add authorship and licensing information.
In order to get “nicer” labels, I’m also going to bin the data manually.
Adopting Arial Narrow
as the font-of-the-day, the image at the top of this
post can be generated as follows:
plot.data$dogs.per.1000 <- factor(
cut(plot.data$dog.density * 1000, c(0, 30, 45, 60, 90, 120)),
labels = c("Under 30", "30 to 44", "45 to 59", "60 to 89", "90 to 120")
)
ggplot(plot.data, aes(x = long, y = lat, group = group, fill = dogs.per.1000)) +
geom_polygon(aes(x = long + 0.005, y = lat - 0.002),
color = "grey50", size = 0.2, fill = "grey50") +
geom_polygon(color = "grey10", size = 0.2) +
coord_equal() +
# Add the compass rose.
ggsn::north(plot.data, location = "bottomleft", symbol = 10) +
viridis::scale_fill_viridis(option = "viridis", discrete = TRUE) +
# Set the labels to NULL so that no space is allocated for them
labs(x = NULL, y = NULL, fill = NULL,
title = "Registered Dogs per Toronto Home",
subtitle = "Density per 1000 Private Dwellings, 2013",
caption = paste("A. Jacobs :: unconj.ca",
"Data: Toronto Animal Services (2014) & Statistics Canada (2013)",
"You may redistribute this graphic under the terms of the CC-by-SA license.",
sep = "\n")) +
theme_void() +
theme(text = element_text(family = "Arial Narrow", size = 8),
plot.title = element_text(size = 12, face = "bold"),
plot.margin = unit(c(0, 0.25, 0.0, 0.25), "in"),
panel.border = element_rect(fill = NA, colour = "#cccccc"),
legend.text = element_text(size = 8),
legend.position = c(0.9, 0.25))
Which looks pretty good to me!