Unconstant Conjunction
A personal blog

Managing Complex Research Workflows with Make

If you’re doing any kind of empirical work in Economics, you probably have a huge, messy folder containing a mix of

  • Data files (.csv, .dta, .xlsx, etc.) in various states of merge-ness and cleanliness.
  • Scripts for creating graphs & figures, producing summary statistics, and estimating models. Probably written for Stata, R, or the Pandas data stack[1].
  • Files containing written work. These are usually .doc(x) files, but I’ve seen lots of LaTeX lately as well, which, being a plain-text format, is a huge boon to reproducible research.

A really simple research workflow (start with data, make some figures, make some summary statistics, and run some models) might look like the following:

An Econ Workflow

But of course that’s not clear when looking at the .zip file you send your coauthor.

(In case anyone is wondering about the file extensions, the .do extensions are for Stata, which is a popular statistics package among economists[2], and the .R extensions are for R, which is a very popular programming language/statistics package for everyone else. R probably makes the nicer graphs of the two[3].)

I have a good friend who deals with all of this by creating a single .do file to run all of the other Stata, R, and pdflatex scripts[4], and then simply commenting out the sections she doesn’t want to run. I used to do essentially the same thing with a Bash script, and back when I used R exclusively I used knitr (which I’ll discuss a little bit at the end).

The problem with these approaches (besides being somewhat inelegant, except possibly in the case of knitr) is that they don’t really incorporate the notion of dependencies. If you look back at the diagram above, you’ll see that if, say, you modify the script that produces a table, then everything that relies on that table must be re-compiled or re-run. So in this example, table1-data.tex and paper.pdf (and, importantly, only those files) would need to be updated. That kind of interdependency is very hard to capture in a simple master “build” script.

An alternative that I’ve been toying with recently is to use the venerable Unix tool “make”[5]. Make requires that you write a small file (almost always called Makefile, with no extension) outlining the “targets” your project produces and the “dependencies” each of them requires. The syntax looks like the following:

VARIABLE = 0
# Comment
[target]: [dependency1 dependency2 ...]
	[command to run] $(VARIABLE) [etc]

So for our project above, you might have a simple Makefile containing

# First, clean the data using the cleanup script:
data-cleaned.dta: data.csv cleanup.do
	stata-se -b do "cleanup.do"

# Next, produce the two tables from the cleaned data:
table1-data.tex: data-cleaned.dta sumstats.do
	stata-se -b do "sumstats.do"

table2-data.tex: data-cleaned.dta estimates.do
	stata-se -b do "estimates.do"

# Next, produce the three figures. The '%' character can be used as a wildcard:
f%.pdf: data-cleaned.dta make_figs.R
	Rscript "make_figs.R"

# Finally, produce the paper (which relies on complete figures and tables):
paper.pdf: paper.tex f1.pdf f2.pdf f3.pdf table1-data.tex table2-data.tex
	pdflatex "paper.tex"

As you can see, the structure of “targets” and “dependencies” is very explicit. If we want to produce one (or more) of the targets we can call Make with, for example:

$ make f1.pdf

This will check the rule for producing f1.pdf, which relies on data-cleaned.dta and make_figs.R. Since there is a rule for the former, it will try to produce data-cleaned.dta first, then produce f1.pdf.

What happens if we call make f1.pdf again? Make checks whether the file already exists and is newer than its dependencies (it is; we just created it), and then exits without doing any redundant work.

This is all quite convenient, but hardly worth switching to Make for. But wait! It gets better. Suppose we edit the cleanup.do file and then ask Make to produce f1.pdf again. Make will check whether f1.pdf exists; it does, since we just made it. Next, it will check whether any of f1.pdf’s direct dependencies are newer than f1.pdf; they are not. But Make applies the same check, recursively, to the files used to make those files, and therein lies the magic of its dependency-based structure: Make will notice that cleanup.do is newer than data-cleaned.dta, re-generate data-cleaned.dta, and then (since data-cleaned.dta is now newer than f1.pdf) re-generate f1.pdf, because these are exactly the files that depend on cleanup.do.
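Under the hood, this freshness check is just a comparison of file modification times, something you can reproduce yourself with the shell’s -nt (“newer than”) test, which bash and most other shells support. Here’s a little sketch that imitates what Make does for the cleanup.do -> data-cleaned.dta -> f1.pdf chain (the files here are empty stand-ins, of course):

```shell
#!/bin/sh
# Imitate Make's timestamp logic for: cleanup.do -> data-cleaned.dta -> f1.pdf
cd "$(mktemp -d)"

touch cleanup.do
sleep 1                  # space the timestamps out by at least a second
touch data-cleaned.dta
sleep 1
touch f1.pdf

# Everything is fresh: no dependency is newer than its target.
[ cleanup.do -nt data-cleaned.dta ] || echo "data-cleaned.dta is up to date"

# Now 'edit' the cleanup script...
sleep 1
touch cleanup.do

# ...and staleness propagates down the chain, exactly as Make sees it:
[ cleanup.do -nt data-cleaned.dta ] && echo "must rebuild data-cleaned.dta"
touch data-cleaned.dta   # 'rebuilding' it refreshes its timestamp, so...
[ data-cleaned.dta -nt f1.pdf ] && echo "must rebuild f1.pdf"
```

Make performs exactly this comparison, recursively, for every target in the dependency graph, which is why editing one upstream file triggers only the rebuilds that are actually needed.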

By convention, most Makefiles also include an all target to create the “default” output (for us, the final paper), and a clean target that removes all generated files. So in this case we should probably add the following to the top of the file:

all: paper.pdf

clean:
	rm -f *.pdf table*-data.tex data-cleaned.dta

.PHONY: clean all

There are a couple of neat features of Make on display here[6]. Notice that not all targets require dependencies (in this case, clean), and that not all targets have to specify a command (in this case, all does not). In addition, I’ve added the special .PHONY target, which tells Make that the all and clean targets don’t actually produce files called all and clean.

Now, let’s check that everything works from scratch:

$ make clean
$ make

When you don’t specify a target, Make looks for the first target in the file whose name doesn’t start with a period. Since we added all at the top, it processes that target.
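To see this rule in action, consider a hypothetical two-target Makefile; a bare make builds only first, and second is never touched unless you ask for it by name:

```make
# 'make' with no arguments runs the first target whose name doesn't start with '.'
first:
	@echo building first

second:
	@echo building second
```

This is why the all target conventionally goes at the top of the file.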

More Examples: Reproducible Data Sources

In the first example, I sort of assumed you just started with your data.csv. More often than not, however, you have to make it from a variety of sources, and part of doing reproducible research is documenting your sources in the code. So you might have something that can be visualized with

A Data Workflow

One way you might encode this is with

# Get first data source from the internet
data-1.dta: csv_to_dta.py
	wget http://www.somedatasource.org/datasets/awesome.csv
	python csv_to_dta.py --output=data-1.dta awesome.csv

# Scrape the second data source using some Python script
data-2.dta: scrape_site.py
	python scrape_site.py --output=data-2.dta "http://www.somedatasource.org/info/results.html"

# Merge the datasets using a Stata script
data-merged.dta: data-1.dta data-2.dta merge.do
	stata-se -b do "merge.do"

Wget is a useful tool for downloading files from the command line, by the way. Also, while you don’t have to write web scrapers (or file format converters) in Python, it’s probably the easiest way to do so.

More Examples: Making Your Project (More) Portable

One problem you might encounter with Makefiles is hard-coding the programs used to run your scripts. For example, while Stata is called stata-se on my machine, it may be simply stata on yours. A good way of simplifying the find-and-replace problem this entails is to define a variable at the top of the file, say STATA, that anyone can use to adapt the file to their environment with minimal effort, e.g.

# Change this to the location of your Stata executable:
STATA = stata-se

...

# First, clean the data using the cleanup script:
data-cleaned.dta: data.csv cleanup.do
	$(STATA) -b do "cleanup.do"

# Next, produce the two tables from the cleaned data:
table1-data.tex: data-cleaned.dta sumstats.do
	$(STATA) -b do "sumstats.do"

table2-data.tex: data-cleaned.dta estimates.do
	$(STATA) -b do "estimates.do"

...

And so on with Python, etc. If your project runs the same script over and over again (say, a web scraper), and another researcher might want to swap in a replacement for just that file (say, fastscrape.py instead of scrape.py), then you could have something like

PYTHON = python
SCRAPER = fastscrape.py

data-1.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=data-1.dta --source="http://www.somedatasource.org/datasets/awesome1.dta"

data-2.dta: $(SCRAPER)
	$(PYTHON) $(SCRAPER) --output=data-2.dta --source="http://www.somedatasource.org/datasets/awesome2.dta"

And so on.
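One more nice consequence of using variables like this: Make lets you override any variable from the command line for a single run, so a collaborator doesn’t have to edit the Makefile at all. A self-contained sketch (the throwaway Makefile below is just a stand-in for the real one):

```shell
#!/bin/sh
# Build a tiny scratch Makefile to demonstrate command-line overrides.
# printf supplies the required leading tab in the recipe line.
cd "$(mktemp -d)"
printf 'STATA = stata-se\ncheck:\n\t@echo "would run: $(STATA) -b do cleanup.do"\n' > Makefile

make check              # prints: would run: stata-se -b do cleanup.do
make check STATA=stata  # prints: would run: stata -b do cleanup.do
```

Variables set on the command line take precedence over assignments inside the Makefile, which is exactly what you want for one-off runs on someone else’s machine.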

A Word About Make vs. Knitr

Knitr is a tool that allows you to embed runnable R code in LaTeX or Markdown files, so that the code is run and the figures/tables/etc. are produced when the document is compiled (LaTeX to .pdf, Markdown to .html). It’s a widely-used tool for literate programming and research papers. I think this is really neat, but I eventually switched away from it towards things like Make, because literate programming with knitr is sort of limited to linear approaches and might not suit more modular projects. It also tends to hang if part of your project takes a while to run (Matlab simulations, anyone?), which can be a pain in the ass when you need to wait 10 minutes to see what the new paragraph you wrote actually looks like in the final PDF. This is exactly the problem IPython notebooks have: great for a step-by-step explanation, but not so great for a paper where things might actually take a while to run.


  1. It is possible that you are using Excel for all of this, but if that’s the case you’re probably not interested in this article to begin with.
  2. On a related Stata note, if you’re trying to output tables and regression results from Stata, the user-written esttab package can be quite useful. Running esttab or estout with the booktabs and fragment options works very nicely in my opinion.
  3. At least in conjunction with ggplot2, anyway. install.packages("ggplot2"), please.
  4. Pro Stata tip: you can use shell to execute Unix-style commands from within Stata. So, for example, to compile a downstream LaTeX document you can use shell pdflatex paper.tex.
  5. Make is available on all Unix-like systems (Mac OS X, Linux, BSDs), and there is also a version available for Windows.
  6. Also, note that rm is the Unix command for removing files, and the * character serves as a wildcard when using that program.