Recently I had to use an older dataset that hadn’t been nicely sanitized into something that Stata or Pandas could understand. In particular, I was working with the NHANES I Epidemiologic Follow-up Study1, which has data available only in the original tape format from the 1980s and early 1990s. An individual record from one of the smaller files looks like this (scroll right for the full effect):
9220809 511 12 0112 996102410232091442411 1 2 9 486 21400520 230031142750215185031486 0 03 42750486 051850
Where every number might be a single variable or part of a multi-column variable. For example, the first two digits are the year the individual died, followed by a five-digit identification number. In other words, it’s impossible to decipher without digging into the original documentation. This kind of data is very hard to read into a statistics package (or Excel, for the less ambitious) in this format — but never to fear! It turns out to be quite simple to parse these files in Python and output something that more high-level tools can understand.
For reference, I’m going to be using the public data files and documentation provided by the Centers for Disease Control on their website. In particular, I’ll use the data from the 1992 NHEFS Mortality survey and its accompanying documentation.
Extracting a simple variable
What we really want to do is convert the column(s) that correspond to a
particular variable into something that we can move around in Python (or output
so we can use it with another program). The documentation for the education
field, for example, looks like this::
Tape Field
Pos. Size Variable Description and Codes
52-53 2 Education
00 ... No formal education
01-08 ... Years of elementary school
09 ... 1 year of high school
10 ... 2 years of high school
11 ... 3 years of high school
12 ... 4 years of high school
13 ... 1 year of college
14 ... 2 years of college
15 ... 3 years of college
16 ... 4 years of college
17 ... 5 or more years of college
99 ... Not stated
(Note: This information is only available for deaths
occurring in 1989 and later. For earlier years, this
field will be blank.)
Parsing a line with this kind of documentation in Python is really quite easy if
you make use of slicing. It’s likely that you don’t need to differentiate
between ‘Not stated’ and ‘missing due to being before 1989’, so the blank
and
99
entries are the same from your perspective. Since a CSV file usually
represents missing data with an empty string, we can extract the education
variable from a line in the tape with:
educ = line[51:53]
if educ == ' ' or educ == '99':
educ = ''
elif educ == '00':
educ = '0'
else:
educ = educ.lstrip('0')
Although there are of other ways to accomplish this:
# If you need to remove the leading zeros and don't mind converting str -> int -> str
educ = line[51:53]
educ = str(int(educ)) if educ != ' ' and educ != '99' else ''
# Using `replace`; leaves in the leading zeros
educ = line[51:53].replace(' ', '').replace('99', '')
Another simple example might be the sex
variable:
Tape Field
Pos. Size Variable Description and Codes
59 1 Sex
1 ... Male
2 ... Female
which can be extracted from a line and converted to a dummy variable for female participants with
# Represent with 0 = male, 1 = female
female = int(line[58]) - 1
Converting categorical variables to dummy variables
Often it may be of interest to convert a categorical variable like race
:
Tape Field
Pos. Size Variable Description and Codes
62 1 Race Recode #1 (*)
1 ... White
2 ... Races other than white or black
3 ... Black
Into a series of dummy variables (i.e. white
, black
, other_race
). This is
also fairly easy to do with some ternary operators:
white = 1 if line[61] == '1' else 0
black = 1 if line[61] == '3' else 0
other_race = 1 if line[61] == '2' else 0
As a slightly more complex example, consider the entries for marital status:
Tape Field
Pos. Size Variable Description and Codes
77 1 Marital Status
1 ... Never married, single
2 ... Married
3 ... Divorced
4 ... Widowed
8 ... Marital status not on certificate
9 ... Marital status not stated
We might want the dummies for married
, widowed
, divorced
, and single
to
be empty (as opposed to zero) when we have either of the last two conditions.
if line[76] == '8' or line[76] == '9':
married = single = divorced = widowed = ''
else:
single = '1' if line[76] == '1' else '0'
married = '1' if line[76] == '2' else '0'
divorced = '1' if line[76] == '3' else '0'
widowed = '1' if line[76] == '4' else '0'
The full example
If we stuff all of this into a single function that will extract variables from a line, it might look something like the following:
def convert_record_to_csv(line):
# Identifier used by the survey
seqnum = line[2:7].lstrip('0')
# Year and month of death (day is not always provided)
y_death = '19%s' % line[:2]
m_death = line[54:56]
# Sex
female = int(line[58]) - 1
# Marriage
if line[76] == '8' or line[76] == '9':
married = single = divorced = widowed = ''
else:
single = '1' if line[76] == '1' else '0'
married = '1' if line[76] == '2' else '0'
divorced = '1' if line[76] == '3' else '0'
widowed = '1' if line[76] == '4' else '0'
# Education
educ = line[51:53]
educ = educ if educ != ' ' and educ != '99' else ''
# Autopsy
autopsy = '1' if line[83] == '1' else '0'
autopsy = autopsy if line[83] != '8' and line[83] != '9' else ''
# Race
white = 1 if line[61] == '1' else 0
black = 1 if line[61] == '3' else 0
other_race = 1 if line[61] == '2' else 0
return ', '.join([str(x) for x in (seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed)]) + '\n'
We can then write a pretty simple script to convert the raw tape data (provided
by the CDC as N92mort.txt
) into a .csv
file:
with open('N92mort.txt', 'r') as source, open('out.csv', 'w') as dest:
# Write column names, as per CSV convention
dest.write('seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed\n')
# Loop over each line in the source file
for i, line in enumerate(source):
try:
dest.write(convert_record_to_csv(line))
except Exception as e:
# It may be helpful for debugging to know which line the parser failed at
print 'Failed at record %d.' % i
raise e
Of course, I’m not parsing every single variable in the data set (there are over 400 columns in the mortality data alone, and more than 4000 in the interview data), but in principle this same method could be used.
-
National Center for Health Statistics. Data File Documentation, NHANES I Epidemiologic Followup Study (NHEFS), National Center for Health Statistics. Hyattsville, Maryland. ↩︎