Recently I had to use an older dataset that hadn’t been nicely sanitized into something that Stata or Pandas could understand. In particular, I was working with the NHANES I Epidemiologic Follow-up Study1, which has data available only in the original tape format from the 1980s and early 1990s. An individual record from one of the smaller files looks like this (scroll right for the full effect):
9220809 511 12 0112 996102410232091442411 1 2 9 486 21400520 230031142750215185031486 0 03 42750486 051850
Where every number might be a single variable or part of a multi-column variable. For example, the first two digits are the year the individual died, followed by a five-digit identification number. In other words, it’s impossible to decipher without digging into the original documentation. This kind of data is very hard to read into a statistics package (or Excel, for the less ambitious) in this format — but never to fear! It turns out to be quite simple to parse these files in Python and output something that more high-level tools can understand.
For reference, I’m going to be using the public data files and documentation provided by the Centers for Disease Control on their website. In particular, I’ll use the data from the 1992 NHEFS Mortality survey and its accompanying documentation.
Extracting a simple variable
What we really want to do is convert the column(s) that correspond to a
particular variable into something that we can move around in Python (or output
so we can use it with another program). The documentation for the
field, for example, looks like this::
Tape Field Pos. Size Variable Description and Codes 52-53 2 Education 00 ... No formal education 01-08 ... Years of elementary school 09 ... 1 year of high school 10 ... 2 years of high school 11 ... 3 years of high school 12 ... 4 years of high school 13 ... 1 year of college 14 ... 2 years of college 15 ... 3 years of college 16 ... 4 years of college 17 ... 5 or more years of college 99 ... Not stated (Note: This information is only available for deaths occurring in 1989 and later. For earlier years, this field will be blank.)
Parsing a line with this kind of documentation in Python is really quite easy if
you make use of slicing. It’s likely that you don’t need to differentiate
between ‘Not stated’ and ‘missing due to being before 1989’, so the
99 entries are the same from your perspective. Since a CSV file usually
represents missing data with an empty string, we can extract the
variable from a line in the tape with:
educ = line[51:53] if educ == ' ' or educ == '99': educ = '' elif educ == '00': educ = '0' else: educ = educ.lstrip('0')
Although there are of other ways to accomplish this:
# If you need to remove the leading zeros and don't mind converting str -> int -> str educ = line[51:53] educ = str(int(educ)) if educ != ' ' and educ != '99' else '' # Using `replace`; leaves in the leading zeros educ = line[51:53].replace(' ', '').replace('99', '')
Another simple example might be the
Tape Field Pos. Size Variable Description and Codes 59 1 Sex 1 ... Male 2 ... Female
which can be extracted from a line and converted to a dummy variable for female participants with
# Represent with 0 = male, 1 = female female = int(line) - 1
Converting categorical variables to dummy variables
Often it may be of interest to convert a categorical variable like
Tape Field Pos. Size Variable Description and Codes 62 1 Race Recode #1 (*) 1 ... White 2 ... Races other than white or black 3 ... Black
Into a series of dummy variables (i.e.
other_race). This is
also fairly easy to do with some ternary operators:
white = 1 if line == '1' else 0 black = 1 if line == '3' else 0 other_race = 1 if line == '2' else 0
As a slightly more complex example, consider the entries for marital status:
Tape Field Pos. Size Variable Description and Codes 77 1 Marital Status 1 ... Never married, single 2 ... Married 3 ... Divorced 4 ... Widowed 8 ... Marital status not on certificate 9 ... Marital status not stated
We might want the dummies for
be empty (as opposed to zero) when we have either of the last two conditions.
if line == '8' or line == '9': married = single = divorced = widowed = '' else: single = '1' if line == '1' else '0' married = '1' if line == '2' else '0' divorced = '1' if line == '3' else '0' widowed = '1' if line == '4' else '0'
The full example
If we stuff all of this into a single function that will extract variables from a line, it might look something like the following:
def convert_record_to_csv(line): # Identifier used by the survey seqnum = line[2:7].lstrip('0') # Year and month of death (day is not always provided) y_death = '19%s' % line[:2] m_death = line[54:56] # Sex female = int(line) - 1 # Marriage if line == '8' or line == '9': married = single = divorced = widowed = '' else: single = '1' if line == '1' else '0' married = '1' if line == '2' else '0' divorced = '1' if line == '3' else '0' widowed = '1' if line == '4' else '0' # Education educ = line[51:53] educ = educ if educ != ' ' and educ != '99' else '' # Autopsy autopsy = '1' if line == '1' else '0' autopsy = autopsy if line != '8' and line != '9' else '' # Race white = 1 if line == '1' else 0 black = 1 if line == '3' else 0 other_race = 1 if line == '2' else 0 return ', '.join([str(x) for x in (seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed)]) + '\n'
We can then write a pretty simple script to convert the raw tape data (provided
by the CDC as
N92mort.txt) into a
with open('N92mort.txt', 'r') as source, open('out.csv', 'w') as dest: # Write column names, as per CSV convention dest.write('seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed\n') # Loop over each line in the source file for i, line in enumerate(source): try: dest.write(convert_record_to_csv(line)) except Exception as e: # It may be helpful for debugging to know which line the parser failed at print 'Failed at record %d.' % i raise e
Of course, I’m not parsing every single variable in the data set (there are over 400 columns in the mortality data alone, and more than 4000 in the interview data), but in principle this same method could be used.
National Center for Health Statistics. Data File Documentation, NHANES I Epidemiologic Followup Study (NHEFS), National Center for Health Statistics. Hyattsville, Maryland. ↩︎