6.8. PUMS Census data

The U.S. Census website contains numerous kinds of demograohic data about the United States. In this section we focus on data collected in surveys conducted during the decennial census, which is distinct from the more frequently collected American Community Survey data. More especifically, we will confine our attention to the Census 2000 PUMS data. PUMS stands for Public Use Microdata Samples. It is called Microdata because the the records contain information about one person or one household.

A raw record looks like this:

000101000000300010 01400 70 9997
01400 70 9997 9997010100370010000000108190622

This is a person record; one line of data has been broken up into several lines for readability. Values for a large number of statistical variables are coded in each line. To read such a line you need a code book or data dictionary. This is stored on the PUMS website in Excel files. For example, for the 5% data the data dictionary is available here and the entire set of technical documents (including decoding for complicated fields like the occupation and place of birth) here. Note these documents cover all the code points for 5 per cent data, but the per cent code points are a superset of the one percent code points, so these documents cover both data sets.

The data dictionary contains entries like the following:

D OCCSOC5 7 227 233
T Occupation (SOC) for 5% file
V        00-0000  Not in universe (under 16 or last wrk =2)
R        10-0000... 99-9999   Occupation SOC Code

This tells us where the occupation information is stored in a record. It tells us that positions 227-233 contain the occupation code. It tells us that code 00-0000 is reserved for occupations that have no code, and codes 10-0000 through 99-9999 stand for occupation codes explained elsewhere. Translated into a Python splice, that’s 226:233, so:

line = 'P000011701000140010000103100101100000101470101001000\
       12099999904200000 000101000000300010     01400  70    \
       9997    9997202020202020220000020011000000020031000010 \
       01400  70    9997    9997010100370010000000108190622    \


>>> occupation = line[226:233]
>>> occupation

And another Excel file from the technical docs collection (5%_PUMS_occupation.xls) tells you this stands for Registered Nurse.

Obviously this is very low level data in pretty raw form. And it will take quite a lot of programming to extract it.

Now who needs this kind of data? The Census website document entitled What PUMS Data Users Need to Know tells us the following:

The PUMS files should be used by people who are looking for data tables that are not presented by the Census Bureau in the pretabulated products available through American FactFinder.

These files can be used to extract custom data for particular population groups (e.g., veterans, college students) or when it is not pos- sible to get particular data categories from the standard tables (e.g., families with income between 90 and 99 percent of the official poverty threshold). While it is possible to request that the Census Bureau produce custom tabulations for a fee, the PUMS files provide a much less expensive—and often faster—way to get the data.

One common group of users of the PUMS files are academic researchers interested in modeling relationships between the variables collected as part of the ACS. Another common group of users are researchers working in government and business looking at either characteristics that are not usually cross-tabulate

To help explore the PUMS data, we have provided a Python script for creating a Pandas data frame Ultimately, that data frame can be used to produce a CSV of a PUMS file or subset of it for export to some other program. Below, we give some guidance on how to use it.

6.8.1. Using the read_census_data script

  1. Download the data dictionary from the US Census FTP site.
  2. Download some PUMS data files from the website. For example, Alabama’s ASCII format data
  3. Download read__in_census_data.py. After reading in the data dictionary, this can be used to extract data from the Alabama PUMS file.

Then proceed as follows. First read in the data dictionary and store it in a customized object called a CensusInfo object.

from read_in_census_data import CensusInfo, DataFrameWrapper

one_percent_data_file = 'revisedpums1_alabama_01.txt'
data_dictionary = '5%_PUMS_record_layout.xls'
ci = CensusInfo (data_dictionary)

Using the CensusInfo method print_var_info we can now get information about the various variables in a record. For instance:

>>> ci.print_var_info('RELATE',ci.record_types[1],ci.file_types[1])
Person Record RELATE 1% file
   {'RT': 'P', 'BEG': 17, 'LEN': 2, 'DESCRIPTION': 'Relationship'}

     01  Householder
     02  Husband/wife
     03  Natural born son/daughter
     04  Adopted son/daughter
     05  Stepson/Stepdaughter
     06  Brother/sister
     07  Father/mother
     08  Grandchild
     09  Parent-in-law
     10  Son-in-law/daughter-in-law
     11  Other relative
     12  Brother-in-law/sister-in-law
     13  Nephew/niece
     14  Grandparent
     15  Uncle/aunt
     16  Cousin
     17  Roomer/boarder
     18  Housemate/roommate
     19  Unmarried partner
     20  Foster child
     21  Other nonrelative
     22  Institutionalized GQ person
     23  Noninstitutionalized GQ person

We cabn also look at a sample record:

>>> line = ci.get_sample_record_from_db (one_percent_data_file)

Opening revisedpums1_alabama_01.txt
  000101000000300010     01400  70    9997    9997202020202020220000020011000000020031000010
    01400  70    9997    9997010100370010000000108190622     313029-1111201

To get information about a DB variable, 3 things must be specfied: file type (1 per cent or 5 per cent), record type (person or household) and the variable name (column name):

>>> (var0,record_type, file_type) = ('RELATE',ci.record_types[1],ci.file_types[1])

To get available information about a value of a variable in a specific record (usually an opaque code), we first get the value of this variable for that record:

>>> val = ci.get_db_val(line, var0, record_type)
>>> var_values_info = ci.variable_values_dictionary[record_type][(var0,file_type)]

Printing some of the info just retrieved:

>>> print var0,  record_type, val, var_values_info[val]

RELATE Person Record 01 Householder
Person Record education =>
   EDUC 5% file {'RT': 'P', 'BEG': 53, 'LEN': 2,
                 'DESCRIPTION': 'Educational Attainment'}

>>> ci.print_var_info ('education', ci.person_record)

     Educational Attainment
     00  Not in universe (Under 3 years)
     01  No schooling completed
     02  Nursery school to 4th grade
     03  5th grade or 6th grade
     04  7th grade or 8th grade
     05  9th grade
     06  10th grade
     07  11th grade
     08  12th grade, no diploma
     09  High school graduate
     10  Some college, but less than 1 year
     11  One or more years of college, no degree
     12  Associate degree
     13  Bachelor's degree
     14  Master's degree
     15  Professional degree
     16  Doctorate degree

>>> ci.print_var_info ('race', ci.person_record)

Person Record race =>
   RACE3 5% file {'RT': 'P', 'BEG': 41, 'LEN': 2, 'DESCRIPTION': 'Race Recode 3'}

     Race Recode 3
     01  Some other race alone
     02  Other Pacific Islander alone
     03  Other Pacific Islander; Some other race
     04  Samoan alone
     05  Guamanian or Chamorro alone
     06  Native Hawaiian alone
     07  Other Asian alone
     08  Other Asian; Some other race
     09  Other Asian; Other Pacific Islander
     10  Vietnamese alone
     11  Korean alone
     12  Japanese alone
     13  Not used
     14  Not used
     15  Filipino alone
     16  Filipino; Some other race
     17  Filipino; Other Pacific Islander
     18  Filipino; Native Hawaiian
     19  Filipino; Japanese
     20  Chinese alone
     21  Chinese; Some other race
     22  Chinese; Native Hawaiian
     23  Chinese; Other Asian
     24  Chinese; Vietamese
     25  Chinese; Japanese
     26  Chinese; Filipino
     27  Not used
     28  Asian Indian alone
     29  Asian Indian; Some other race
     30  Not used
     31  Asian Indian; Other Asian
     32  American Indian and Alaska Native alone
     33  American Indian and Alaska Native; Some other race
     34  Not used
     35  Not used
     36  American Indian and Alaska Native; Asian Indian
     37  Black or African American alone
     38  Black or African American; Some other race
     39  Black or African American; Other Pacific Islander
     40  Black or African American; Other Asian
     41  Black or African American; Korean
     42  Black or African American; Japanese
     43  Black or African American; Filipino
     44  Black or African American; Chinese
     45  Black or African American; Asian Indian
     46  Black or African American; American Indian and Alaska Native
     47  White alone
     48  White; Some other race
     49  White; Other Pacific Islander
     50  White; Samoan
     51  White; Guamanian or Chamorro
     52  White; Natve Hawaiian
     53  White; Other Asian
     54  Not used
     55  White; Vietnamese
     56  White; Korean
     57  White; Japanese
     58  Not used
     59  White; Filipino
     60  Not used
     61  White; Filipino; Native Hawaiian
     62  White; Chinese
     63  White; Chinese; Native Hawaiian
     64  Not used
     65  White; Asian Indian
     66  White; American Indian and Alaska Native
     67  White; American Indian and Alaska Native; Some other race
     68  White; Black or African American
     69  White; Black or African American; Some other race
     70  White; Black or African American; American Indian and Alaska Native
     71  All other race combinations

>>> ci.print_var_info ('income', ci.person_record)

Person Record income =>
   INCWS 5% file {'RT': 'P', 'BEG': 244, 'LEN': 6,
                  'DESCRIPTION': 'Wage/Salary Income in 1999'}

     Wage/Salary Income in 1999
     ('000000', ' ')  No/none
     ('000001', '174999')  $1 to $174,999
     175000  Topcode
     175000+  State mean of topcoded values
      blank  Not in universe (Under 15 years)

A particularly important field is the serial_no, which connects person records to their households, and also persons to other persons in the same household:

>>> ci.print_var_info('serial_no',ci.person_record)
Person Record serial_no =>
   SERIALNO 5% file {'RT': 'P', 'BEG': 2, 'LEN': 7,
                         'Housing/Group Quarters (GQ) Unit Serial Number'}

     Housing/Group Quarters (GQ) Unit Serial Number
     ('0000001', 9999999.0)  Unique identifier assigned within state

>>> ci.print_var_info('serial_no',ci.household_record)
Housing Unit Record serial_no =>
   SERIALNO 5% file {'RT': 'H', 'BEG': 2, 'LEN': 7,
                         'Housing/Group Quarters (GQ) Unit Serial Number'}

     Housing/Group Quarters (GQ) Unit Serial Number
     ('0000001', 9999999.0)  Unique identifier assigned within state

There are fields which belong to one kind of record bu not the other. For instance:

>>> ci.print_var_info('num_persons',ci.household_record)
Housing Unit Record num_persons =>
    PERSONS 5% file {'RT': 'H', 'BEG': 106, 'LEN': 2,
                         'Number of person records following this housing record'}

     Number of person records following this housing record
     ('2', '97')  Number of persons in household
      0  Vacant unit
      1  Householder living alone or any person in group quarters

>>> ci.print_var_info('num_persons',ci.person_record)
except Exception as e:
     print e.message

num_persons=>PERSONS is not a known variable for a Person Record

Finally we show how to select the information from some subset of the full set of columns, storing them in a DataFrameWrapper to be saved to a file that can be read by pandas.

>>> df = DataFrameWrapper(ci)
>>> df.fill_frame ([], ci.person_record,

Note that the household serial_no field is included automatically, because this is the column that let us link this to other records (to other persons in the same household, so that we can aggregate information about households.

>>> df.header
['serial_no', 'race', 'education', 'income', 'gender', 'age', 'relationship']

Order of the original rows is preserved so the first two person records are about two people in the same household.

>>> df.rows[:5]
[['0000117', '47', '12', '047300', '1', '31', '01'],
 ['0000117', '47', '12', '024300', '2', '25', '02'],
 ['0000117', '47', '00', '      ', '1', '00', '03'],
 ['0000127', '47', '05', '000000', '1', '22', '01'],
 ['0000127', '47', '10', '022000', '2', '35', '02']]

We save the data from these 5 columns to a file:

>>> df.save_frame('alabama_pums_extract.csv',header=True)

>>> import pandas as pd
>>> import numpy as np
>>> dt_dict ={'serial_no':object,'race':object,'education':object,
>>> p_df = pd.read_csv('alabama_pums_extract.csv', dtype=dt_dict)

Note that the default is to use the header line to name cols. The read_csv function has a explicit header argument, but that is used only for selecting cols or renaming cols. Since all the data looks like numbers, including the serial numbers and codes, we we specify dat types for all the codes that aren’t really intended to be treated as numbers. We let pandas try to figure out the types for :samp`income` and age, which genuinely are numbers.

Just for practice now, we save this file. The version we’re saving is identical to the one we just read in.

>>> p_df.to_csv('alabama_pums_extract2.csv',index=False)
>>> p_df

<class 'pandas.core.frame.DataFrame>
Int64Index: 44487 entries, 0 to 44486
Data columns (total 7 columns):
serial_no       44487  non-null values
race            44487  non-null values
education       44487  non-null values
income          35178  non-null values
gender          44487  non-null values
age             44487  non-null values
relationship    44487  non-null values
dtypes: float64(1), int64(1), object(5)

6.9. Appendix: Data types

>>> p_df.dtypes
serial_no        object
race             object
education        object
income          float64
gender           object
age               int64
relationship     object
dtype: object

Notice income and age are integers and floats respectively, That’s good because we’re going to want to do aggregation operations like mean and sum on income

>>> df2 = pd.DataFrame([['1','2'],['34','56']],columns= ['h','t'])
>>> df2.dtypes
h    object
t    object
dtype: object
0     1
1    34
Name: h, dtype: int64
df2['h'] = df2['h'].astype(int)
h     int64
t    object
dtype: object