Unconstant Conjunction A personal blog

For and Against data.table

I was reminded by Megan Stodel’s recent post – “Three reasons why I use data.table” – of the enduring data.table vs. dplyr argument.

To Stodel’s points in favour of data.table, which I would summarise as (1) performance, (2) good design, and (3) that it makes you feel a bit iconoclastic in these tidyverse times, I would add the following:

Both of these are highly valued in production environments, which happens to be the focus of my current work.

However, I would usually advocate against using data.table unless you really, really do need its performance benefits.

Much of the tibble/data.table debate seems to be over people's preferred syntax,1 or the degree to which base R, data.table, or the tidyverse produces more understandable code – and, relatedly, which of them is easier to teach to newcomers. All of these are valuable discussions.

However, I’d like to articulate another reason why I’m always cautious about adopting data.table: modify-by-reference can lead to surprising and non-idiomatic R code.

Every data.table user probably knows that the reason the package is fast is that it discards R’s usual copy-on-modify behaviour, instead modifying objects “by reference”.2 It is worth explaining in some detail what this means.

Copy-on-Modify vs. Modify-by-Reference

R users are accustomed to the fact that objects will not change unless you reassign to them with the <- operator. And so we can look at code like

df <- data.frame(x = rnorm(100), y = runif(100) * 5)
custom_summary(df)
df <- create_features(df)

and expect that the call to custom_summary() will not modify df, while the call to create_features() clearly does. We would know this without needing to look at the source code for these functions, even if they were given unhelpful names like fn1() and fn2(). Now suppose that custom_summary() looked like the following:

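custom_summary <- function(df) {
  # Derive a summary column, then cap x to [-1e-4, 1e-4].
  df$summary_measure <- df$x * floor(df$y)
  df$x <- ifelse(df$x < -1e-4, -1e-4, df$x)
  df$x <- ifelse(df$x > 1e-4, 1e-4, df$x)
  # Present a summary of the cleaned-up data.
  summary(df)
}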

That is, the function cleans up the data a little before presenting a summary. Even though the function is modifying df internally, we know that these changes will not “leak” out of the scope of the function. This is what is meant by copy-on-modify: R will copy objects as necessary to prevent modifications from manifesting in the original.
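To make the "no leak" claim concrete, here is a small check using only base R. The seed and the `before` snapshot are my own additions for illustration, and the final `summary(df)` call in the function is an assumption about how the summary is presented.

```r
# Data frame version of custom_summary(), as described above.
custom_summary <- function(df) {
  df$summary_measure <- df$x * floor(df$y)
  df$x <- ifelse(df$x < -1e-4, -1e-4, df$x)
  df$x <- ifelse(df$x > 1e-4, 1e-4, df$x)
  summary(df)  # assumed: the summary is simply summary() of the cleaned data
}

set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(x = rnorm(100), y = runif(100) * 5)
before <- df  # snapshot of the caller's data for comparison

invisible(custom_summary(df))

# The caller's df has the same columns and the same values as before the call.
identical(df, before)
#> [1] TRUE
names(df)
#> [1] "x" "y"
```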

Now suppose that for performance reasons we decide to translate this code to use data.table objects. The result might look like the following:

library(data.table)

df <- data.table(x = rnorm(100), y = runif(100) * 5)
df[, y := floor(y)]
df[, feature1 := x * 2 + y]
df[, feature2 := x * x]

Even though there are no further uses of <-, these last lines all modify df in-place. This is what is meant by “modify-by-reference”. We could also move these into a new create_features() function, and this function would not need to use <- to modify the original data.table object.
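As a rough sketch of what that function might look like (the body is just the three lines above wrapped in a function; the argument name dt and the invisible() return are my own choices):

```r
library(data.table)

# Hypothetical create_features(): the same three := steps, inside a function.
create_features <- function(dt) {
  dt[, y := floor(y)]
  dt[, feature1 := x * 2 + y]
  dt[, feature2 := x * x]
  invisible(dt)  # returned only for convenience; the caller's table is already changed
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.table(x = rnorm(100), y = runif(100) * 5)

create_features(df)  # note: no assignment with <-

# The new columns are present on the caller's object anyway.
names(df)  # "x", "y", "feature1", "feature2"
```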

A naive refactor could be carried out for the summary function, as well:

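custom_summary <- function(df) {
  # Each of these lines modifies the caller's data.table in place.
  df[, summary_measure := x * floor(y)]
  df[x < -1e-4, x := -1e-4]
  df[x > 1e-4, x := 1e-4]
  # Present a summary of the cleaned-up data.
  summary(df)
}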

This new version introduces a subtle bug: the capping of x now happens in the caller's object, so any later code may come to rely on it. If the call to custom_summary() is later removed or commented out, df$x may have larger or smaller entries than expected (since they are no longer capped). This is a direct consequence of modify-by-reference, and could never happen with native data.frame or tibble objects.
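The leak is easy to observe directly. In the sketch below, the seed is arbitrary and the closing summary(df) call is my assumption about how the function presents its result:

```r
library(data.table)

# The naive data.table refactor of custom_summary().
custom_summary <- function(df) {
  df[, summary_measure := x * floor(y)]
  df[x < -1e-4, x := -1e-4]
  df[x > 1e-4, x := 1e-4]
  summary(df)  # assumed final step
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.table(x = rnorm(100), y = runif(100) * 5)
range_before <- range(df$x)  # roughly standard-normal spread

invisible(custom_summary(df))  # looks like a read-only call

# Both side effects are now visible in the caller's object:
range(df$x)                       # capped to [-1e-4, 1e-4]
"summary_measure" %in% names(df)  # the helper column has appeared
```

Anything downstream that reads df$x now sees the capped values, whether or not it was written with that in mind.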

It’s true that this is a contrived example, but I think it’s a useful illustration nonetheless. Experienced data.table users might argue that functions should “clean up” after themselves if they make use of modify-by-reference features, or perhaps that sometimes-surprising side effects are merely the cost of using the package to begin with. The authors of data.table present an analogous scenario in one of the vignettes; they suggest that the function call copy() on its input before modifying it. Yet copying is likely to erase the performance benefits of using data.table objects in the first place.
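For completeness, a defensive version along the lines of the copy() suggestion might look like this. It is only a sketch: the function body otherwise matches the naive refactor, and the final summary(df) is assumed.

```r
library(data.table)

# Defensive variant: take a deep copy before using := on the input.
custom_summary <- function(df) {
  df <- copy(df)  # the local df is now a separate table
  df[, summary_measure := x * floor(y)]
  df[x < -1e-4, x := -1e-4]
  df[x > 1e-4, x := 1e-4]
  summary(df)  # assumed final step
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.table(x = rnorm(100), y = runif(100) * 5)
x_before <- df$x + 0  # plain numeric copy of the column for comparison

invisible(custom_summary(df))

# The caller's table keeps its original columns and values.
names(df)
identical(df$x, x_before)
```

The price is one full copy of the table per call, which is exactly the cost that modify-by-reference was meant to avoid.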

Reference Semantics are Not Idiomatic in User Code

As Bob Rudis remarks, what constitutes “idiomatic” R can vary, but in practice there are very few aspects of base R that have “reference semantics” – i.e. that break copy-on-modify rules in favour of modify-by-reference.

An important set of exceptions to this general rule are the functions that load packages, print to the console, or plot. All of these modify important objects in place (i.e. without using <-): the search path, the console, and the graphics device. What these exceptions have in common is precisely that they manipulate well-known global objects that are baked into R itself.

In contrast, user-created objects – the usual vectors, data frames, and functions – almost never have reference semantics. This is not to say that reference semantics are categorically unavailable to you. There are explicit language features that have them, namely environments, reference classes, and external pointers, and functions can reach outside their own scope with the <<- operator. Yet it is exceedingly rare to see these in the usual data analysis code; they are mostly used by packages to implement design patterns that would otherwise be very difficult in a copy-on-modify world.
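The contrast is easy to see with an environment, using nothing but base R. The two bump functions below are made-up names for illustration:

```r
# A list follows copy-on-modify: the function changes its own copy.
bump_list <- function(l) {
  l$count <- l$count + 1
  invisible(l)
}

# An environment has reference semantics: the function changes the original.
bump_env <- function(e) {
  e$count <- e$count + 1
  invisible(e)
}

l <- list(count = 0)
e <- new.env()
e$count <- 0

bump_list(l)  # no assignment, so the caller's list is unchanged
bump_env(e)   # no assignment, yet the caller's environment is changed

l$count
#> [1] 0
e$count
#> [1] 1
```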

You can see echoes of this in Hadley Wickham’s warning to users about R6 classes (which have reference semantics) in the new edition of Advanced R:

[I]f you use R6 it’s very easy to create a non-idiomatic API that will feel very odd to native R users, and will have surprising pain points because of the reference semantics.

It’s my guess that concern about these pain points is the primary reason why dplyr never adopted the modify-by-reference approach, even though it has obviously been shown by data.table to be more performant.

Does This Mean You Should Use dplyr?

To be clear, data.table has its place. All of the arguments in its favour that I mentioned at the outset still stand. And reference semantics are not the only reason for the performance of data.table operations.

Nor does this mean that I would wholeheartedly recommend using dplyr and friends at every opportunity. Most data wrangling can be done in base R,3 and I often argue that this is the right choice, for instance, inside packages.

If anything, what I would suggest is that if you are going to use data.table, try to ensure that you write an API that exposes as little as possible of its reference semantics. Avoid creating situations that will surprise users expecting <- as the signal for modification.
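One way this could look in practice is a function that uses data.table internally but behaves like an ordinary copy-on-modify function from the outside. This is a sketch rather than a recommendation of a specific design: the function name and feature columns are illustrative, and it relies on as.data.table() making a copy of a data.frame input, at the cost of that one copy.

```r
library(data.table)

# Hypothetical wrapper: data.table inside, ordinary R semantics outside.
create_features <- function(df) {
  dt <- as.data.table(df)  # assumed to copy a data.frame input; df itself is not touched
  dt[, y := floor(y)]
  dt[, feature1 := x * 2 + y]
  dt[, feature2 := x * x]
  dt[]  # return the new table; the trailing [] makes it print normally
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(100), y = runif(100) * 5)
y_before <- df$y

features <- create_features(df)  # <- is the only signal of modification

names(df)        # still just "x" and "y"
names(features)  # includes feature1 and feature2
```

Callers who want the new columns must assign the result, which is exactly what a reader of idiomatic R would expect.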

  1. Atrebas has produced an excellent side-by-side comparison of some common analysis patterns with data.table and dplyr. ↩︎

  2. I’m a bit critical of this language because all R objects are, at the C level, pointers/references. Copy-on-modify is actually a contract adhered to by the parts of R written in C; it is not enforced by the language and can be broken. This is how it was possible to actually implement data.table in the first place.

    Better terms are “reference semantics” or “modify-by-reference”, because they separate behaviour from implementation. ↩︎

  3. For instance, Jozef Hajnala has a neat collection of articles demonstrating how common dplyr and data.table patterns can be implemented in base R. ↩︎
