I was reminded by Megan Stodel’s recent post – “Three reasons why I use data.table” – of the enduring data.table vs. dplyr argument.
To Stodel’s points in favour of data.table, which I would summarise as (1) performance, (2) good design, and (3) that it makes you feel a bit iconoclastic in these tidyverse times, I would add the following:
- A very stable, actively-maintained, and battle-tested codebase. This is also true of both dplyr and base R, of course, but it is valuable nonetheless.
- Zero third-party dependencies. In contrast, dplyr has, at the time of writing, 22 direct or indirect package dependencies. In combination with the choice of C over Rcpp, this leads to very fast compilation and installation times.
Both of these are highly valued in production environments, which happens to be the focus of my current work.
However, I would usually advocate against using data.table unless you really, really do need its performance benefits.
Much of the tibble/data.table debate seems to be over people’s preferred syntax,1 or the degree to which base R, data.table, or the tidyverse produces more understandable code – and, relatedly, whether it is easier to teach to newcomers. All of these are valuable discussions.
However, I’d like to articulate another reason why I’m always cautious about adopting data.table: modify-by-reference can lead to surprising and non-idiomatic R code.
Every data.table user probably knows that the reason the package is fast is that it discards R’s usual copy-on-modify behaviour, instead modifying objects “by reference”.2 It is worth explaining in some detail what this means.
Copy-on-Modify vs. Modify-by-Reference
R users are accustomed to the fact that objects will not change unless you
reassign to them with the <-
operator. And so we can look at code like
df <- data.frame(x = rnorm(100), y = runif(100) * 5)
custom_summary(df)
df <- create_features(df)
and expect that the call to custom_summary() will not modify df, while the call to create_features() clearly does. We would know this without needing to look at the source code for these functions, even if they were given unhelpful names like fn1() and fn2(). Now suppose that custom_summary() looked like the following:
custom_summary <- function(df) {
  df$summary_measure <- df$x * floor(df$y)
  df$x <- ifelse(df$x < -1e-4, -1e-4, df$x)
  df$x <- ifelse(df$x > 1e-4, 1e-4, df$x)
  summary(df)
}
That is, the function cleans up the data a little before presenting a summary.
Even though the function is modifying df
internally, we know that these
changes will not “leak” out of the scope of the function. This is what is meant
by copy-on-modify: R will copy objects as necessary to prevent modifications
from manifesting in the original.
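One way to watch this machinery at work is base R’s tracemem(), which prints a message whenever the traced object is duplicated. A small demonstration (touch() is just a throwaway helper for this example, and the output assumes an R build with memory profiling enabled, which is the default):
x <- data.frame(a = 1:3)
tracemem(x)          # start reporting whenever x is duplicated

touch <- function(df) {
  df$a <- df$a * 2   # tracemem prints here: df is copied before being modified
  invisible(df)
}

touch(x)
x$a                  # still 1 2 3 -- the caller's object is untouched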
Now suppose that for performance reasons we decide to translate this code to use data.table objects. The result might look like the following:
df <- data.table(x = rnorm(100), y = runif(100) * 5)
custom_summary(df)
df[, y := floor(y)]
df[, feature1 := x * 2 + y]
df[, feature2 := x * x]
Even though there are no further uses of <-, these last lines all modify df in-place. This is what is meant by “modify-by-reference”. We could also move these into a new create_features() function, and this function would not need to use <- to modify the original data.table object.
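A minimal sketch of what that might look like (simply reusing the three := lines above):
# Sketch only: no return value or reassignment is needed, because `:=`
# modifies the caller's data.table in place.
create_features <- function(df) {
  df[, y := floor(y)]
  df[, feature1 := x * 2 + y]
  df[, feature2 := x * x]
  invisible(df)
}

create_features(df)  # df gains feature1 and feature2, despite no `<-`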
A naive refactor could be carried out for the summary function, as well:
custom_summary <- function(df) {
  df[, summary_measure := x * floor(y)]
  df[x < -1e-4, x := -1e-4]
  df[x > 1e-4, x := 1e-4]
  summary(df)
}
This new version introduces a subtle bug: the capping of df$x now leaks out of the function, and downstream code can silently come to rely on it. If the call to custom_summary() is later removed or commented out, df$x may have larger or smaller entries than expected (since they are no longer capped). This is a direct consequence of modify-by-reference, and could never happen with native data.frame or tibble objects.
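To see the leak concretely (re-creating df so that the feature columns added earlier don’t get in the way):
df <- data.table(x = rnorm(100), y = runif(100) * 5)
range(df$x)     # typically somewhere around -2.5 to 2.5
s <- custom_summary(df)
range(df$x)     # now clamped to [-1e-4, 1e-4]: the "summary" changed the data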
It’s true that this is a contrived example, but I think it’s a useful illustration nonetheless. Experienced data.table users might argue that functions should “clean up” after themselves if they make use of modify-by-reference features, or perhaps that sometimes-surprising side effects are merely the cost of using the package to begin with. The authors of data.table present an analogous scenario in one of the vignettes; they suggest that the function call copy() on its input before modifying it. Yet this is likely to erase the performance benefits of using data.table objects in the first place.
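For concreteness, a defensive version along those lines might look like the following sketch; the copy() call is exactly what makes it safe, and also what makes it potentially expensive for large tables:
custom_summary <- function(df) {
  df <- copy(df)  # deep copy, so the `:=` below no longer touch the caller's df
  df[, summary_measure := x * floor(y)]
  df[x < -1e-4, x := -1e-4]
  df[x > 1e-4, x := 1e-4]
  summary(df)
}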
Reference Semantics are Not Idiomatic in User Code
As Bob Rudis remarks, what constitutes “idiomatic” R can vary, but in practice there are very few aspects of base R that have “reference semantics” – i.e. that break copy-on-modify rules in favour of modify-by-reference.
An important set of exceptions to this general rule comprises the functions that load packages, print to the console, or plot. All of these modify important objects in place (that is, without using <-): the search path, the console, and the graphics device. What these exceptions have in common is precisely that they manipulate well-known global objects that are baked into R itself.
In contrast, user-created objects – the usual vectors, data frames, and functions – almost never have reference semantics. This is not to say that you are categorically unable to use reference semantics yourself. There are even explicit language features with reference semantics – environments, reference classes, and external pointers – and functions can modify variables outside their own scope with the <<- operator. Yet it is exceedingly rare to see these in everyday data analysis code; they are mostly used by packages to implement design patterns that would otherwise be very difficult in a copy-on-modify world.
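Environments are probably the most familiar of these; a quick illustration (the increment() helper is made up for the example):
e <- new.env()
e$count <- 0

increment <- function(env) {
  env$count <- env$count + 1  # environments are not copied, so this changes e itself
}

increment(e)
e$count  # now 1, even though e was never reassigned with `<-`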
You can see echoes of this in Hadley Wickham’s warning to users about R6 classes (which have reference semantics) in the new edition of Advanced R:
[I]f you use R6 it’s very easy to create a non-idiomatic API that will feel very odd to native R users, and will have surprising pain points because of the reference semantics.
It’s my guess that concern about these pain points is the primary reason why dplyr never adopted the modify-by-reference approach, even though it has obviously been shown by data.table to be more performant.
Does This Mean You Should Use dplyr?
To be clear, data.table has its place. All of the arguments in favour of using it that I mentioned at the outset remain in good standing. And reference semantics are not the only reason for the performance of data.table operations.
Nor does this mean that I would wholeheartedly recommend using dplyr and friends at every opportunity. Most data wrangling can be done in base R,3 and I often argue that this is the right choice, for instance, inside packages.
If anything, what I would suggest is that if you are going to use data.table, try to ensure that you write an API that exposes as little as possible of its reference semantics. Avoid creating situations that will surprise users expecting <- as the signal for modification.
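In practice, that can be as simple as giving your functions value semantics: copy internally, return the result, and let callers reassign. A sketch (add_features() is a made-up name, and the approach assumes the copy is affordable):
add_features <- function(df) {
  df <- copy(df)              # keep `:=` as an internal detail
  df[, feature1 := x * 2 + y]
  df[, feature2 := x * x]
  df[]                        # return the new table, leaving the input untouched
}

df <- add_features(df)  # nothing about df changes unless the caller reassigns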
1. Atrebas has produced an excellent side-by-side comparison of some common analysis patterns with data.table and dplyr. ↩︎
2. I’m a bit critical of this language because all R objects are, at the C level, pointers/references. Copy-on-modify is actually a contract adhered to by the parts of R written in C; it is not enforced by the language and can be broken. This is how it was possible to implement data.table in the first place. Better terms are “reference semantics” or “modify-by-reference”, because they separate behaviour from implementation. ↩︎
3. For instance, Jozef Hajnala has a neat collection of articles demonstrating how common dplyr and data.table patterns can be implemented in base R. ↩︎