If you’re doing any kind of empirical work in Economics, you probably have a huge, messy folder containing a mix of
- Data files (.xlsx, etc.) in various states of merge-ness and cleanliness.
- Scripts for creating graphs & figures, producing summary statistics, and computing models. Probably written for Stata, R, or the Pandas data stack [1].
- Files containing written work. These are usually .doc(x) files, but I’ve seen lots of LaTeX lately as well, which, being a plain-text format, is a huge boon to reproducible research.
A really simple research workflow (start with data, make some figures, make some summary statistics, and run some models) might look like the following:
But of course that’s not clear when looking at the
.zip file you send your
(In case anyone is wondering about the file extensions, the .do extensions are
for Stata, which is a popular statistics package among economists [2], and the
.R extensions are for R, which is a very popular programming
language/statistics package for everyone else. R probably makes the nicer graphs
of the two [3].)
I have a good friend who deals with all of this by creating a single master
script to run all of the other Stata, R, and pdflatex scripts [4], and then simply
commenting out the sections she doesn’t want to run. I used to do essentially
the same thing with a Bash script, and back when I used R exclusively I used
knitr (which I’ll discuss a little bit at the end).
The problem with these approaches (besides being somewhat inelegant, except
possibly in the case of knitr) is that they don’t really incorporate the notion
of dependencies. If you look back at the diagram above, you’ll see that if,
say, you modify the script to produce the tables, then everything that relies
on the tables must be compiled/re-run. So in this example, the tables and
paper.pdf (and, importantly, only those things) would
need to be updated. That kind of interdependency is very hard to capture in a
simple master “build” script.
An alternative that I’ve been toying with recently is to use the venerable Unix
tool “make” [5]. Make requires that you write a small file (almost always called
Makefile, with no extension) outlining the “targets” your project produces,
and the “dependencies” each of them requires. The syntax looks like the
following (each command line must start with a tab character):

    VARIABLE = 0

    # Comment
    [target]: [dependency1 dependency2 ...]
    	[command to run] $(VARIABLE)
    	[etc]
So for our project above, you might have a simple Makefile like this:
    # First, clean the data using the cleanup script:
    data-cleaned.dta: data.csv cleanup.do
    	stata-se -b do "cleanup.do"

    # Next, produce the two tables from the cleaned data:
    table1-data.tex: data-cleaned.dta sumstats.do
    	stata-se -b do "sumstats.do"

    table2-data.tex: data-cleaned.dta estimates.do
    	stata-se -b do "estimates.do"

    # Next, produce the three figures. The '%' character can be used as a wildcard:
    f%.pdf: data-cleaned.dta make_figs.R
    	Rscript "make_figs.R"

    # Finally, produce the paper (which relies on complete figures and tables):
    paper.pdf: paper.tex f1.pdf f2.pdf f3.pdf table1-data.tex table2-data.tex
    	pdflatex "paper.tex"
As you can see, the structure of “targets” and “dependencies” is very explicit. If we want to produce one (or more) of the targets we can call Make with, for example:
$ make f1.pdf
This will check the rule for producing f1.pdf, which relies on
data-cleaned.dta and make_figs.R. Since there is a rule for the former, it
will try to produce data-cleaned.dta first, then produce f1.pdf.
What happens if we call make f1.pdf again? Make checks if the file exists (it
does, we just created it), and then exits without doing any redundant work.
This is all quite convenient, but hardly worth switching to Make for. But wait!
It gets better. Suppose we edit the cleanup.do file and ask Make to produce
f1.pdf again. Make will check if f1.pdf exists. It does (we just made it).
Next, it will check if the files that were used to produce f1.pdf have been
updated since f1.pdf was created. They were not. Next, Make will check if any
of the files used to make those files were updated since f1.pdf. And therein
lies the magic of Make’s dependency-based structure: Make will realize that
cleanup.do is newer than f1.pdf, and it will re-generate data-cleaned.dta
as a result before re-generating f1.pdf, because these are the files that
depend on cleanup.do.
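If you’d like to watch this happen without having Stata or R installed, here is a minimal sketch in a throwaway directory. The file names mirror the example above, but the cp recipes are stand-ins I’ve made up for the real stata-se and Rscript commands, and the one-second sleep is only there so the “edit” gets a clearly newer timestamp.

```shell
#!/bin/sh
# Throwaway copy of the example project. The cp recipes are stand-ins
# (assumptions) for the real Stata and R commands.
set -e
cd "$(mktemp -d)"

echo "raw data"    > data.csv
echo "cleanup v1"  > cleanup.do
echo "figure code" > make_figs.R

# Write a two-rule Makefile; \t is the tab that must start each recipe line.
printf 'data-cleaned.dta: data.csv cleanup.do\n\tcp data.csv data-cleaned.dta\n' > Makefile
printf 'f1.pdf: data-cleaned.dta make_figs.R\n\tcp data-cleaned.dta f1.pdf\n' >> Makefile

first=$(make f1.pdf)    # nothing exists yet, so both recipes run
second=$(make f1.pdf)   # f1.pdf is already up to date, so no recipe runs

sleep 1                 # make sure the "edit" gets a later timestamp
touch cleanup.do        # pretend we edited the cleanup script
third=$(make f1.pdf)    # data-cleaned.dta is rebuilt, then f1.pdf

printf 'First run:\n%s\n\nSecond run:\n%s\n\nAfter editing cleanup.do:\n%s\n' \
    "$first" "$second" "$third"
```

Make echoes each command it runs, so the first and third runs should list both cp commands, while the second should list none.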
By convention, most Makefiles also include an all target to create the
“default” output (for us, the final paper), and a clean target that removes
all generated files. So in this case we should probably add the following to the
top of the file:
    all: paper.pdf

    clean:
    	rm *.pdf table*-data.tex data-cleaned.dta

    .PHONY: clean all
There are a couple of neat features of Make on display here [6]. Notice that not
all targets require dependencies (in this case, clean), and that not all
targets have to specify a command (in this case, all does not). In addition,
I’ve added the special .PHONY target, which tells Make that the clean and
all targets don’t actually produce files called clean or all.
Now, to check if everything works from scratch:
    $ make clean
    $ make
When you don’t specify a target for Make, it looks for the first target
specified (that doesn’t start with .). Since we added all at the top, it
will process this target.
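Before running the full build, it can be handy to ask Make what it would do. The sketch below uses the -n (dry run) flag in a scratch directory. The cp recipe is a placeholder I’ve assumed in place of pdflatex, so only the target names match the example.

```shell
#!/bin/sh
# Scratch project with the conventional all/clean targets. cp is a
# placeholder (assumption) for the real pdflatex call.
set -e
cd "$(mktemp -d)"
echo "draft" > paper.tex

printf 'all: paper.pdf\n\n'                                  > Makefile
printf 'paper.pdf: paper.tex\n\tcp paper.tex paper.pdf\n\n' >> Makefile
printf 'clean:\n\trm -f paper.pdf\n\n'                      >> Makefile
printf '.PHONY: all clean\n'                                >> Makefile

# No target given: Make picks "all" (the first target), and with -n it
# only prints the commands instead of running them.
plan=$(make -n)
if [ -e paper.pdf ]; then built_by_dry_run=yes; else built_by_dry_run=no; fi

make                     # now actually build paper.pdf
if [ -e paper.pdf ]; then built_by_make=yes; else built_by_make=no; fi

make clean               # and remove it again
echo "dry run would run: $plan"
```

The dry run is a cheap way to confirm that the first target really is the one you meant before kicking off a long Stata job.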
More Examples: Reproducible Data Sources
In the first example, I sort of assumed you just started with your data.csv file.
More often than not, however, you have to make it from a variety of sources, and
part of doing reproducible research is documenting your sources in the code. So
you might have something that can be visualized with
One way you might encode this is with:
    # Get first data source from the internet
    data-1.dta: csv_to_dta.py
    	wget http://www.somedatasource.org/datsets/awesome.csv
    	python csv_to_dta.py --output=data-1.dta awesome.csv

    # Scrape the second data source using some Python script
    data-2.dta: scrape_site.py
    	python scrape_site.py --output=data-2.dta "http://www.somedatasource.org/info/results.html"

    # Merge the datasets using a Stata script
    data-merged.dta: data-1.dta data-2.dta merge.do
    	stata-se -b do "merge.do"
Wget is a useful tool for downloading files from the command line, by the way. Also, while you don’t have to write web scrapers (or file format converters) in Python, it’s probably the easiest way to do so.
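One thing to keep in mind with rules like these: Make only looks at local files and their timestamps, so it has no way of knowing that the data on the website has changed. Here is a small sketch of what that means in practice, where the echo recipe is a stand-in I’ve assumed for the real scraper call:

```shell
#!/bin/sh
# Scratch directory; "echo scraped" stands in (assumption) for running
# the real scrape_site.py against the website.
set -e
cd "$(mktemp -d)"
echo "scraper code" > scrape_site.py

printf 'data-2.dta: scrape_site.py\n\techo scraped > data-2.dta\n' > Makefile

first=$(make data-2.dta)    # file is missing, so the recipe runs
second=$(make data-2.dta)   # file exists and the script is unchanged: nothing runs

rm data-2.dta               # force a fresh "scrape" by deleting the target
third=$(make data-2.dta)    # the recipe runs again

printf 'First:\n%s\nSecond:\n%s\nThird:\n%s\n' "$first" "$second" "$third"
```

So if the source data is updated, you have to delete the downloaded file (or run make clean) to get Make to fetch it again.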
More Examples: Making Your Project (More) Portable
One problem you might encounter with Makefiles is hard-coding in the programs to
compile the scripts. For example, while Stata is called stata-se on my machine,
it may be simply stata on your machine. A good way of simplifying the
find-and-replace problem this entails is to define a variable at the top of the
Makefile, such as STATA, that anyone can use to adapt the file to their environment
with minimal effort, e.g.
    # Change this to the location of your Stata executable:
    STATA = stata-se

    ...

    # First, clean the data using the cleanup script:
    data-cleaned.dta: data.csv cleanup.do
    	$(STATA) -b do "cleanup.do"

    # Next, produce the two tables from the cleaned data:
    table1-data.tex: data-cleaned.dta sumstats.do
    	$(STATA) -b do "sumstats.do"

    table2-data.tex: data-cleaned.dta estimates.do
    	$(STATA) -b do "estimates.do"

    ...
And so on with Python, etc. If your research works with the same script over and
over again (say, a web scraper), but another researcher might replace just this
file (say, replacing fastscrape.py), then you could have:
    PYTHON = python
    SCRAPER = fastscrape.py

    data-1.dta: $(SCRAPER)
    	$(PYTHON) $(SCRAPER) --output=data-1.dta --source="http://www.somedatasoure.org/datasets/awesome1.dta"

    data-2.dta: $(SCRAPER)
    	$(PYTHON) $(SCRAPER) --output=data-2.dta --source="http://www.somedatasoure.org/datasets/awesome2.dta"
And so on.
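A nice side effect of using variables is that nobody has to edit the Makefile at all: a variable assigned on the make command line takes precedence over the assignment in the file. A minimal sketch, using -n so that Stata never actually has to run (the input files here are empty placeholders I’ve assumed for the demo):

```shell
#!/bin/sh
# Scratch directory with empty placeholder inputs (assumption).
set -e
cd "$(mktemp -d)"
touch data.csv cleanup.do

# Same shape as the rule above; \t is the tab that starts the recipe line.
printf 'STATA = stata-se\n\n' > Makefile
printf 'data-cleaned.dta: data.csv cleanup.do\n\t$(STATA) -b do "cleanup.do"\n' >> Makefile

default_cmd=$(make -n)               # uses the value from the Makefile
override_cmd=$(make -n STATA=stata)  # the command-line value wins

echo "default:  $default_cmd"
echo "override: $override_cmd"
```

In day-to-day use that just means typing something like make STATA=stata on a machine where the executable has a different name.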
A Word About Make vs. Knitr
Knitr is a tool that allows you to embed runnable R code in LaTeX or Markdown
files so that the code is run and the figures/tables/etc are produced when the
document is compiled (LaTeX to .pdf, Markdown to .html). It’s a widely-used
tool for literate programming and research papers. I think this is really neat,
but eventually I switched away from it and towards using things like Make
because literate programming/knitr is sort of limited to linear approaches, and
might not reflect more modular projects. It also tends to hang if part of your
project takes a while to run (Matlab simulations, anyone?), which can be a pain
in the ass when you need to wait 10 minutes to see what the new paragraph you
wrote actually looks like in the final PDF. This is exactly the same problem
IPython notebooks have: great for a step-by-step explanation, but not so
great for a paper where it might actually take a while for things to run.
[1] It is possible that you are using Excel for all of this, but if that’s the case you’re probably not interested in this article to begin with.
[2] On a related Stata note, if you’re trying to output tables and regression results from Stata, the user-written esttab package can be quite useful. Running it with the booktabs and fragment options works very nicely in my opinion.
[3] At least in conjunction with ggplot2, anyway. install.packages("ggplot2"), please.
[4] Pro Stata tip: you can use shell to execute Unix-style commands from within Stata. So, for example, to compile a downstream LaTeX document you can use shell pdflatex paper.tex.
[5] Make is available on all Unix-like systems (Mac OS X, Linux, BSDs), and there is also a version available for Windows.
[6] Also, note that rm is the Unix command for removing files, and the * character serves as a wildcard when using that program.