For those coming to R from other languages, the way R users import code from
packages can seem very odd. For instance, if you are used to
Python, this is the general pattern of “using code from elsewhere”:
import math
import numpy as np
from random import randint
# Usage:
math.floor(3.2)
np.array(...)
randint(1, 10)
Meanwhile, in R, you’re likely to see
library(dplyr)
data_frame(x = 1, y = "A") %>%
mutate(z = TRUE)
… with the subtext being that users must simply know that the data_frame(),
mutate(), and %>% functions are actually all from the dplyr package. Why is R
so unusual here?
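For the record, R does offer Python-style explicit qualification via the ::
operator, though it is far less common in everyday code. A quick sketch of the
same snippet fully qualified (the pipe is dropped here, since qualifying an
operator is awkward):
# Explicitly qualified calls name the source package, Python-style:
df <- dplyr::data_frame(x = 1, y = "A")
dplyr::mutate(df, z = TRUE)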
Continue Reading →
Recently I wanted to send some Shiny usage data from R to a certain metrics
server. Since R includes write.socket() and friends for opening arbitrary
network sockets, it seemed at the outset that this
would be quite simple. However, I ran into an interesting roadblock along the
way.
It turns out that R’s socket API only supports TCP connections, which you can
confirm by looking at the source code
– and in this case I needed to send UDP
packets instead. This was a little surprising to me, since most other languages
would include UDP support out of the box; it’s a core internet protocol, after
all. For whatever reason, this seems not to be the case with R, and even after
searching CRAN and GitHub I wasn’t able to find an existing package that
provides UDP socket support.
To remedy this, I put together a simple way to write messages to UDP sockets
from R.
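As a rough illustration of what a workaround can look like, here is a minimal
sketch that shells out to netcat rather than using R's TCP-only socket API;
the nc binary, hostname, and port are all assumptions for illustration:
# Send a single UDP datagram by shelling out to netcat ("nc"), since base R's
# socket functions are TCP-only. Assumes `nc` is on the PATH; the host and
# port below are placeholders.
send_udp <- function(msg, host, port) {
  system2("nc", args = c("-u", "-w1", host, as.character(port)), input = msg)
}
send_udp("shiny_sessions:1", "metrics.example.com", 8125)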
Continue Reading →
In the past few years the APIs of the wildly popular tidyverse packages have
coalesced around the pipe operator (%>%). R users familiar with this ecosystem
of packages are now accustomed to writing “pipelines” where code is expressed as
a series of steps composed with the pipe operator.
For example, plenty of data science code basically boils down to:
data() %>% transform() %>% summarise()
(This is even more evident if you subscribe to Wickham & Grolemund’s view that
models are a “low-dimensional summary of your data”.)
There are plenty of resources on how to use pipes to rewrite your R code. There
are fewer on the implications of pipes for how R functions and package APIs are
designed. What should package authors keep in mind, knowing that their users
are likely to use their functions in pipelines?
My own experience has led me to compile four principles for writing pipe-
friendly APIs for R packages:
- Only one argument is going to be “piped” into your functions, so you should
  design them to accommodate this. The first argument should be what you’d
  expect to be piped in; all other arguments should be parameters with
  meaningful default values. If you think you’ll always need more than one
  argument, consider wrapping them up in a lightweight S3 class.
- Think carefully about the output of your functions, since this will be what
  is passed down the pipeline. Strive to always return a single type of output
  (a data frame, a numeric vector, etc.). Never return NULL, because very few
  functions can take NULL as an input. (To signify empty values, use zero-row
  data frames or vectors with NA.)
- Prefer “pure” functions – i.e. those that have no “side effects” that mutate
  global state – whenever possible. Users think about pipelines as a linear
  progression of steps. The transparency of pure functions makes them easy to
  reason about in this fashion.
- When your functions must have side effects (e.g. when printing or plotting),
  return the original object (instead of the conventional NULL) so that
  pipelines can continue; a sketch of this pattern follows the list.
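To make the last principle concrete, here is a minimal sketch; the report()
function is hypothetical, invented for illustration:
library(magrittr)
# A side-effecting function that prints a summary of its input, then returns
# the original object (invisibly) so the pipeline can continue.
# (report() is a hypothetical name, not from any package.)
report <- function(df, digits = 2) {
  print(summary(df), digits = digits)  # the side effect
  invisible(df)                        # pass the input along unchanged
}
mtcars %>%
  report() %>%
  subset(mpg > 20)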
I’ve found that being explicit about these design goals has improved the quality
of my own package interfaces; perhaps they can be of use to others.
Continue Reading →
Recently I was working on an R package that had historically used both rjson
and jsonlite to serialize R objects to JSON (before sending them off to an
API). In this case we wanted to remove the rjson dependency, which was only
used in a few places.
The most noticeable hiccup I encountered while porting the code was during the
encoding of parameter lists generated in R, which looked something like
params <- list(key1 = "param", key2 = NULL,
key3 = c("paired", "params"))
In this case, rjson produced exactly what our API was looking for:
cat(rjson::toJSON(params))
#> {"key1":"param","key2":null,"key3":["paired","params"]}
But by default the jsonlite package will behave very differently:
jsonlite::toJSON(params)
#> {"key1":["param"],"key2":{},"key3":["paired","params"]}
Continue Reading →
In the process of working on a recent choropleth piece for work, I discovered
that it’s easy to stumble when moving spatial data out of R and onto the web.
It’s a poorly-documented reality that many web-based mapping libraries
(including both D3 and Leaflet) expect GeoJSON data to be in EPSG:4326, and it
is by no means a given that your spatial data will
start off in this projection.
If you’re like me and do your spatial data pre-processing in R before exporting
to GeoJSON, you may have to re-project your data before these libraries will
handle them properly. Thankfully, this is fairly easy to do with the modern
spatial packages.
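For instance, with the sf package the re-projection itself is a one-liner via
st_transform(); a minimal sketch, with placeholder file names:
library(sf)
# Read the source data, re-project it to EPSG:4326 (WGS 84), and export as
# GeoJSON for use with D3 or Leaflet.
counties <- st_read("counties.shp")
counties_wgs84 <- st_transform(counties, crs = 4326)
st_write(counties_wgs84, "counties.geojson", driver = "GeoJSON")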
Continue Reading →