Parsing tape data with Python

Recently I had to use an older dataset that hadn’t been nicely sanitized into something that Stata or Pandas could understand. In particular, I was working with the NHANES I Epidemiologic Follow-up Study¹, which has data available only in the original tape format from the 1980s and early 1990s. An individual record from one of the smaller files looks like this (scroll right for the full effect):

9220809          511     12           0112         996102410232091442411  1 2      9                                                         486 21400520   230031142750215185031486 0                                                                                                                                                           03 42750486 051850

Where every number might be a single variable or part of a multi-column variable. For example, the first two digits are the year the individual died, followed by a five-digit identification number. In other words, it’s impossible to decipher without digging into the original documentation. This kind of data is very hard to read into a statistics package (or Excel, for the less ambitious) in this format — but never to fear! It turns out to be quite simple to parse these files in Python and output something that more high-level tools can understand.

For reference, I’m going to be using the public data files and documentation provided by the Centers for Disease Control on their website. In particular, I’ll use the data from the 1992 NHEFS Mortality survey and its accompanying documentation.

Extracting a simple variable

What we really want to do is convert the column(s) that correspond to a particular variable into something that we can move around in Python (or output so we can use it with another program). The documentation for the education field, for example, looks like this::

   Tape  Field
   Pos.  Size   Variable Description and Codes

52-53      2    Education

                  00        ...  No formal education
                  01-08     ...  Years of elementary school
                  09        ...  1 year of high school
                  10        ...  2 years of high school
                  11        ...  3 years of high school
                  12        ...  4 years of high school
                  13        ...  1 year of college
                  14        ...  2 years of college
                  15        ...  3 years of college
                  16        ...  4 years of college
                  17        ...  5 or more years of college
                  99        ...  Not stated

                  (Note: This information is only available for deaths
                  occurring in 1989 and later.  For earlier years, this
                  field will be blank.)

Parsing a line with this kind of documentation in Python is really quite easy if you make use of slicing. It’s likely that you don’t need to differentiate between ‘Not stated’ and ‘missing due to being before 1989’, so the blank and 99 entries are the same from your perspective. Since a CSV file usually represents missing data with an empty string, we can extract the education variable from a line in the tape with:

educ = line[51:53]
if educ == '  ' or educ == '99':
    educ = ''
elif educ == '00':
    educ = '0'
else:
    educ = educ.lstrip('0')

Although there are of other ways to accomplish this:

# If you need to remove the leading zeros and don't mind converting str -> int -> str
educ = line[51:53]
educ = str(int(educ)) if educ != '  ' and educ != '99' else ''

# Using `replace`; leaves in the leading zeros
educ = line[51:53].replace(' ', '').replace('99', '')

Another simple example might be the sex variable:

   Tape  Field
   Pos.  Size   Variable Description and Codes

   59      1    Sex

                  1         ...  Male
                  2         ...  Female

which can be extracted from a line and converted to a dummy variable for female participants with

# Represent with 0 = male, 1 = female
female = int(line[58]) - 1

Converting categorical variables to dummy variables

Often it may be of interest to convert a categorical variable like race:

   Tape  Field
   Pos.  Size   Variable Description and Codes

   62      1    Race Recode #1 (*)

                  1         ...  White
                  2         ...  Races other than white or black
                  3         ...  Black

Into a series of dummy variables (i.e. white, black, other_race). This is also fairly easy to do with some ternary operators:

white = 1 if line[61] == '1' else 0
black = 1 if line[61] == '3' else 0
other_race = 1 if line[61] == '2' else 0

As a slightly more complex example, consider the entries for marital status:

   Tape  Field
   Pos.  Size   Variable Description and Codes

   77      1    Marital Status

                  1         ...  Never married, single
                  2         ...  Married
                  3         ...  Divorced
                  4         ...  Widowed
                  8         ...  Marital status not on certificate
                  9         ...  Marital status not stated

We might want the dummies for married, widowed, divorced, and single to be empty (as opposed to zero) when we have either of the last two conditions.

if line[76] == '8' or line[76] == '9':
    married = single = divorced = widowed = ''
else:
    single = '1' if line[76] == '1' else '0'
    married = '1' if line[76] == '2' else '0'
    divorced = '1' if line[76] == '3' else '0'
    widowed = '1' if line[76] == '4' else '0'

The full example

If we stuff all of this into a single function that will extract variables from a line, it might look something like the following:

def convert_record_to_csv(line):
    # Identifier used by the survey
    seqnum = line[2:7].lstrip('0')
    
    # Year and month of death (day is not always provided)
    y_death = '19%s' % line[:2]
    m_death = line[54:56]
    
    # Sex
    female = int(line[58]) - 1

    # Marriage
    if line[76] == '8' or line[76] == '9':
        married = single = divorced = widowed = ''
    else:
        single = '1' if line[76] == '1' else '0'
        married = '1' if line[76] == '2' else '0'
        divorced = '1' if line[76] == '3' else '0'
        widowed = '1' if line[76] == '4' else '0'

    # Education
    educ = line[51:53]
    educ = educ if educ != '  ' and educ != '99' else ''

    # Autopsy
    autopsy = '1' if line[83] == '1' else '0'
    autopsy = autopsy if line[83] != '8' and line[83] != '9' else ''

    # Race
    white = 1 if line[61] == '1' else 0
    black = 1 if line[61] == '3' else 0
    other_race = 1 if line[61] == '2' else 0

    return ', '.join([str(x) for x in (seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed)]) + '\n'

We can then write a pretty simple script to convert the raw tape data (provided by the CDC as N92mort.txt) into a .csv file:

with open('N92mort.txt', 'r') as source, open('out.csv', 'w') as dest:
    
    # Write column names, as per CSV convention
    dest.write('seqnum, y_death, m_death, female, educ, autopsy, white, black, other_race, single, married, divorced, widowed\n')
    
    # Loop over each line in the source file
    for i, line in enumerate(source):
        try:
            dest.write(convert_record_to_csv(line))
        except Exception as e:
            # It may be helpful for debugging to know which line the parser failed at
            print 'Failed at record %d.' % i
            raise e

Of course, I’m not parsing every single variable in the data set (there are over 400 columns in the mortality data alone, and more than 4000 in the interview data), but in principle this same method could be used.

National Center for Health Statistics. Data File Documentation, NHANES I Epidemiologic Followup Study (NHEFS), National Center for Health Statistics. Hyattsville, Maryland. ↩︎