If you’re doing any kind of empirical work in Economics, you probably have a huge, messy folder containing a mix of:

- Data files (`.csv`, `.dta`, `.xlsx`, etc.) in various states of merge-ness and cleanliness.
- Scripts for creating graphs & figures, producing summary statistics, and computing models. Probably written for Stata, R, or the Pandas data stack[^1].
- Files containing written work. These are usually `.doc(x)` files, but I’ve seen lots of LaTeX lately as well, and being a plain-text format, this is a huge boon to reproducible research.
A really simple research workflow (start with data, make some figures, make some summary statistics, and run some models) might look like the following:

*[Figure: a dependency diagram of the example project, running from `data.csv` through the cleaning, table, and figure scripts to `paper.pdf`.]*

But of course that structure is not at all clear when looking at the `.zip` file you send your coauthor.
(In case anyone is wondering about the file extensions, the `.do` extensions are for Stata, which is a popular statistics package among economists[^2], and the `.R` extensions are for R, which is a very popular programming language/statistics package for everyone else. R probably makes the nicer graphs of the two[^3].)
I have a good friend who deals with all of this by creating a single `.do` file to run all of the other Stata, R, and `pdflatex` scripts[^4], and then simply commenting out the sections she doesn’t want to run. I used to do essentially the same thing with a Bash script, and back when I used R exclusively I used knitr (which I’ll discuss a little bit at the end).
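For concreteness, a master script of that sort might look something like this (a hypothetical Bash sketch, using the filenames from the workflow diagram above):

```bash
#!/usr/bin/env bash
# Master "build" script: runs every step, top to bottom, every time.
# To skip a step, comment out its line by hand.
stata-se -b do "cleanup.do"    # clean the raw data
stata-se -b do "sumstats.do"   # summary statistics table
stata-se -b do "estimates.do"  # estimation results table
Rscript "make_figs.R"          # figures
pdflatex "paper.tex"           # compile the paper
```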
The problem with these approaches (besides being somewhat inelegant, except possibly in the case of knitr) is that they don’t really incorporate the notion of dependencies. If you look back at the diagram above, you’ll see that if, say, you modify the script that produces the first table, then everything that relies on that table must be re-run or re-compiled. So in this example, `table1-data.tex`, `paper.tex`, and `paper.pdf` (and, importantly, only those things) would need to be updated. That kind of interdependency is very hard to capture in a simple master “build” script.
An alternative that I’ve been toying with recently is to use the venerable Unix tool “make”[^5]. Make requires that you write a small file (almost always called `Makefile`, with no extension) outlining the “targets” your project produces and the “dependencies” each of them requires. The syntax looks like the following:
```makefile
VARIABLE = 0

# Comment
[target]: [dependency1 dependency2 ...]
	[command to run] $(VARIABLE) [etc]
```

(Note that each command line must be indented with a literal tab character; spaces won’t do.)
So for our project above, you might have a simple `Makefile` containing:
```makefile
# First, clean the data using the cleanup script:
data-cleaned.dta: data.csv cleanup.do
	stata-se -b do "cleanup.do"

# Next, produce the two tables from the cleaned data:
table1-data.tex: data-cleaned.dta sumstats.do
	stata-se -b do "sumstats.do"

table2-data.tex: data-cleaned.dta estimates.do
	stata-se -b do "estimates.do"

# Next, produce the three figures. The '%' character can be used as a wildcard:
f%.pdf: data-cleaned.dta make_figs.R
	Rscript "make_figs.R"

# Finally, produce the paper (which relies on complete figures and tables):
paper.pdf: paper.tex f1.pdf f2.pdf f3.pdf table1-data.tex table2-data.tex
	pdflatex "paper.tex"
```
As you can see, the structure of “targets” and “dependencies” is very explicit. If we want to produce one (or more) of the targets, we can call Make with, for example:

```
$ make f1.pdf
```
This will check the rule for producing `f1.pdf`, which relies on `data-cleaned.dta` and `make_figs.R`. Since there is a rule for the former, it will try to produce `data-cleaned.dta` first, then produce `f1.pdf`.
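On a fresh copy of the project (just the raw data and the scripts), that session would look something like this, since Make echoes each command as it runs it:

```
$ make f1.pdf
stata-se -b do "cleanup.do"
Rscript "make_figs.R"
```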
What happens if we call `make f1.pdf` again? Make checks whether the file already exists and is newer than its dependencies (it is; we just created it), and then exits without doing any redundant work.
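With GNU Make, that second run looks something like this (the exact wording varies a little between versions):

```
$ make f1.pdf
make: 'f1.pdf' is up to date.
```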
This is all quite convenient, but hardly worth switching to Make for. But wait! It gets better. Suppose we edit the `cleanup.do` file and ask Make to produce `f1.pdf` again. Make will check whether `f1.pdf` exists. It does (we just made it). Next, it will check whether the files that were used to produce `f1.pdf` have been updated since `f1.pdf` was created. They have not. Next, Make will check whether any of the files used to make *those* files have been updated. And therein lies the magic of Make’s dependency-based structure: Make will realize that `cleanup.do` is newer than `data-cleaned.dta`, so it will re-generate `data-cleaned.dta`, which in turn triggers the re-generation of `f1.pdf`, because these are the files that depend on `cleanup.do`.
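You can watch Make reach this conclusion without actually re-running anything by using the `-n` (dry-run) flag, which prints the commands Make would execute:

```
$ touch cleanup.do   # simulate editing the cleanup script
$ make -n f1.pdf
stata-se -b do "cleanup.do"
Rscript "make_figs.R"
```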
By convention, most Makefiles also include an `all` target to create the “default” output (for us, the final paper), and a `clean` target that removes all generated files. So in this case we should probably add the following to the top of the file:

```makefile
all: paper.pdf

clean:
	rm *.pdf table*-data.tex data-cleaned.dta

.PHONY: clean all
```
There are a couple of neat features of Make on display here[^6]. Notice that not all targets require dependencies (in this case, `clean`), and that not all targets have to specify a command (in this case, `all` does not). In addition, I’ve added the special `.PHONY` target, which tells Make that the `clean` and `all` targets don’t actually produce files called `clean` and `all`.
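This matters more than it might seem. Without the `.PHONY` line, a stray file named `clean` in the project directory would make the `clean` target appear up to date, so `make clean` would silently do nothing:

```
$ touch clean
$ make clean
make: 'clean' is up to date.
```

With `.PHONY` declared, Make runs the `rm` command regardless of what files happen to exist.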
Now, to check that everything works from scratch:

```
$ make clean
$ make
```

When you don’t specify a target, Make looks for the first target in the file (that doesn’t start with `.`). Since we added `all` at the top, it processes this target.
## More Examples: Reproducible Data Sources
In the first example, I sort of assumed you just started with your `data.csv`. More often than not, however, you have to make it from a variety of sources, and part of doing reproducible research is documenting your sources in the code. So you might have something that can be visualized with:

*[Figure: a diagram of the data pipeline, from a downloaded CSV and a scraped website to `data-merged.dta`.]*

One way you might encode this is with:
```makefile
# Get first data source from the internet
data-1.dta: csv_to_dta.py
	wget http://www.somedatasource.org/datsets/awesome.csv
	python csv_to_dta.py --output=data-1.dta awesome.csv

# Scrape the second data source using some Python script
data-2.dta: scrape_site.py
	python scrape_site.py --output=data-2.dta "http://www.somedatasource.org/info/results.html"

# Merge the datasets using a Stata script
data-merged.dta: data-1.dta data-2.dta merge.do
	stata-se -b do "merge.do"
```
Wget is a useful tool for downloading files from the command line, by the way. Also, while you don’t have to write web scrapers (or file format converters) in Python, it’s probably the easiest way to do so.
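One wrinkle in the first rule above: the `wget` download re-runs every time the rule fires, even if `awesome.csv` is already sitting on disk. A sketch of one possible refinement, giving the download its own target so it only happens when the file is missing:

```makefile
# Download the raw CSV only if we don't already have it:
awesome.csv:
	wget http://www.somedatasource.org/datsets/awesome.csv

# Convert it, re-running only when the CSV or the converter changes:
data-1.dta: awesome.csv csv_to_dta.py
	python csv_to_dta.py --output=data-1.dta awesome.csv
```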
## More Examples: Making Your Project (More) Portable
One problem you might encounter with Makefiles is hard-coding the programs used to run the scripts. For example, while Stata is called `stata-se` on my machine, it may be simply `stata` on yours. A good way of simplifying the find-and-replace problem this entails is to define a variable at the top of the file, say `STATA`, that anyone can use to adapt the file to their environment with minimal effort, e.g.
```makefile
# Change this to the location of your Stata executable:
STATA = stata-se

...

# First, clean the data using the cleanup script:
data-cleaned.dta: data.csv cleanup.do
	$(STATA) -b do "cleanup.do"

# Next, produce the two tables from the cleaned data:
table1-data.tex: data-cleaned.dta sumstats.do
	$(STATA) -b do "sumstats.do"

table2-data.tex: data-cleaned.dta estimates.do
	$(STATA) -b do "estimates.do"

...
```
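Handily, because `STATA` is an ordinary Make variable, a coauthor can also override it for a single run from the command line, without editing the file at all, e.g.:

```
$ make STATA=stata-mp all
```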
And so on with Python, etc. If your research uses the same script over and over again (say, a web scraper), but another researcher might want to swap in a different file (say, replacing `scrape.py` with `fastscrape.py`), then you could have something like:
```makefile
PYTHON = python
SCRAPER = fastscrape.py

data-1.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=data-1.dta --source="http://www.somedatasoure.org/datasets/awesome1.dta"

data-2.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=data-2.dta --source="http://www.somedatasoure.org/datasets/awesome2.dta"
```
And so on.
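As a small aside, Make’s automatic variable `$@` expands to the name of the current target, so you can avoid typing each output filename twice; a sketch of the same two rules:

```makefile
data-1.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=$@ --source="http://www.somedatasoure.org/datasets/awesome1.dta"

data-2.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=$@ --source="http://www.somedatasoure.org/datasets/awesome2.dta"
```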
## A Word About Make vs. Knitr
Knitr is a tool that allows you to embed runnable R code in LaTeX or Markdown files, so that the code is run and the figures/tables/etc. are produced when the document is compiled (LaTeX to `.pdf`, Markdown to `.html`). It’s a widely-used tool for literate programming and research papers. I think this is really neat, but eventually I switched away from it and towards things like Make, because literate programming with knitr is somewhat limited to linear approaches and might not reflect more modular projects. It also tends to stall if part of your project takes a while to run (Matlab simulations, anyone?), which can be a pain in the ass when you need to wait 10 minutes to see what the new paragraph you wrote actually looks like in the final PDF. This is exactly the same problem IPython notebooks have: great for a step-by-step explanation, but not so great for a paper where things might actually take a while to run.
[^1]: It is possible that you are using Excel for all of this, but if that’s the case you’re probably not interested in this article to begin with.

[^2]: On a related Stata note, if you’re trying to output tables and regression results from Stata, the user-written esttab package can be quite useful. Running `esttab` or `estout` with the `booktabs` and `fragment` options works very nicely in my opinion.

[^3]: At least in conjunction with ggplot2, anyway. `install.packages("ggplot2")`, please.

[^4]: Pro Stata tip: you can use `shell` to execute Unix-style commands from within Stata. So, for example, to compile a downstream LaTeX document you can use `shell pdflatex paper.tex`.

[^5]: Make is available on all Unix-like systems (Mac OS X, Linux, BSDs), and there is also a version available for Windows.

[^6]: Also, note that `rm` is the Unix command for removing files, and the `*` character serves as a wildcard when using that program.