Author’s note: this is a lightly modified version of the talk I gave at the GTA
R User’s Group in May of this year. You can find the original slides
here.
Unfortunately, the talk was not recorded.
As I have noted before,
most resources for R package authors are pitched at those writing open-source
packages — usually hosted on GitHub, and with the goal
of ending up on CRAN.
These are valuable resources, and reflect the healthy free and open-source
(FOSS) R package ecosystem. But it is not the whole story. Many R users,
especially those working as data scientists in industry, can and should be
writing packages for internal use within their company or organisation.
Yet there is comparatively little out there about how to actually put together
high-quality packages in these environments.
This post is my attempt to address that gap.
At work we have more than 50 internal R packages,
and over the last two years I have been heavily involved in building up the
culture and tooling that make managing those packages possible.
I’ll focus on three major themes: code, tooling, and culture.
Continue Reading →
Have you ever noticed something like the following when you’re installing an R
package?
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
...
configure: creating ./config.status
config.status: creating src/Makevars
On Windows and macOS, users generally install pre-built binary packages, but on
other platforms (notably Linux), packages are built from source, including any
native C, C++, or Fortran code. Often it’s enough to use the default R settings
to build these packages, but sometimes you might need to know a bit more about a
user’s system in order to get things working correctly.
For this reason, R permits packages to include a ./configure shell script that
performs these checks before the package is installed.
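If the configure script supports options, users can also pass flags through to
it at install time via base R’s install.packages(). Here is a minimal sketch;
the package name and the --with-somelib flag are hypothetical:

# Pass options through to a source package's ./configure script.
# "somepkg" and --with-somelib are placeholders for illustration.
install.packages(
  "somepkg",
  type = "source",
  configure.args = "--with-somelib=/opt/somelib"
)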
Continue Reading →
I was reminded by Megan Stodel’s recent post – “Three reasons why I use
data.table” – of the enduring
data.table vs. dplyr argument.
To Stodel’s points in favour of data.table, which I would summarise as (1)
performance, (2) good design, and (3) that it makes you feel a bit iconoclastic
in these tidyverse times, I would add the following:
- A very stable, actively-maintained, and battle-tested codebase. This is also
  true of both dplyr and base R, of course, but it is valuable nonetheless.
- Zero third-party dependencies. In contrast, dplyr has, at the time of
  writing, 22 direct or indirect package dependencies (a figure you can check
  yourself; see the snippet after this list). In combination with data.table’s
  choice of C over Rcpp, this leads to very fast compilation and installation
  times.
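If you want to verify that dependency count for yourself, here is a quick
sketch using base R’s tools package; the exact number will depend on the CRAN
snapshot you query:

# Count dplyr's direct and indirect "strong" dependencies
# (Depends, Imports, LinkingTo) against the current CRAN snapshot.
db <- available.packages()
deps <- tools::package_dependencies("dplyr", db = db, recursive = TRUE)
length(deps[["dplyr"]])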
Both of these are highly valued in production environments, which happens to be
the focus of my current work.
However, I would usually advocate against using data.table unless you
really, really do need its performance benefits.
Continue Reading →
There is a wealth of excellent resources for R users on creating and
maintaining R packages, which remain the best way to share your code within
your organisation or with the larger community. However, almost all of these
resources focus on open-source packages, tools, and workflows. As a result,
they tend to skim over more corporate issues like copyright assignment.
Yet many R users are working on their code in proprietary environments, creating
packages for internal use by their company. That code is closed source, with the
copyright belonging to their organisation. If you fall into this category, how
should you communicate that in your R package? This post is intended to provide
a very clear answer to that question.
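As a taste of the mechanics involved, the usethis package can scaffold a
proprietary license declaration. This is one possible approach, not necessarily
the full answer the post arrives at, and the copyright holder below is a
placeholder:

# Writes "License: file LICENSE" to DESCRIPTION and creates a LICENSE
# file naming the copyright holder. "Example Corp" is a placeholder.
usethis::use_proprietary_license(copyright_holder = "Example Corp")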
Continue Reading →
I’m a fan of Bryan Cantrill’s argument that we ought to think about “platform
values” when assessing technologies, highlighted in a recent talk,
but explained more thoroughly in an earlier one on his experience with the node.js community.
In my reading, he argues that many of the values a programming language or
platform may hold (such as performance, security, or expressiveness) are in
conflict with one another, and that the platform inevitably emphasises some of
these values over others. These decisions are a reflection of explicit or
implicit “platform values”.
Cantrill illustrates this with a series of examples, but, no surprise, the R
platform does not make his shortlist. I couldn’t resist trying to cook up my own
taxonomy of platform values for the R language and its community.
More broadly, though, Cantrill believes that the values of a platform affect
the projects that adopt it, and that conflicts can arise between a project’s
values and those of its chosen platform. This strongly echoes my own
experience in the R community, and I’ve gotten a measure of clarity for future
and existing projects by learning to articulate these conflicts.
Continue Reading →