6.5. Data Frames

We use the baby names data set published by the U.S. Social Security Administration:

http://www.ssa.gov/oact/babynames/limits.html

The same files are included in the pydata book data, among the Chapter 2 data, in ch02/names:

Marks-MacBook-Pro:names gawron$ cd ch02/names
Marks-MacBook-Pro:names gawron$ head yob1880.txt

Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288

6.5.1. Using Pandas

An ipython session:

In [2]: import pandas as pd

In [3]: names1880 = pd.read_csv('names/yob1880.txt',names=['name','sex','births'])

In [4]: names1880
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 3 columns):
name      2000  non-null values
sex       2000  non-null values
births    2000  non-null values
dtypes: int64(1), object(2)

In [5]: names1880[:10]
Out[5]:
        name sex  births
0       Mary   F    7065
1       Anna   F    2604
2       Emma   F    2003
3  Elizabeth   F    1939
4     Minnie   F    1746
5   Margaret   F    1578
6        Ida   F    1472
7      Alice   F    1414
8     Bertha   F    1320
9      Sarah   F    1288
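
The names argument in step [3] is needed because the file has no header row. As a minimal sketch, assuming an in-memory string can stand in for the first three lines of yob1880.txt (the variable names sample and frame are made up for illustration), the same call can be tried without the data files:

```python
import io
import pandas as pd

# In-memory stand-in for the first three lines of yob1880.txt.
sample = io.StringIO("Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n")

# The file has no header row, so names= supplies the column labels.
frame = pd.read_csv(sample, names=['name', 'sex', 'births'])

print(frame.columns.tolist())   # ['name', 'sex', 'births']
print(frame.shape)              # (3, 3)
print(frame['births'].sum())    # 11672
```

Without names=, read_csv would treat the first data line as the header.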

We illustrate a simple data aggregation operation:

In [6]: names1880.groupby('sex').births.sum()
Out[6]:
sex
F       90993
M      110493
Name: births, dtype: int64
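
To see what the groupby call in step [6] does, here is a small sketch on a made-up table with the same three columns. The counts are illustrative values, not real data, and the names toy, by_sex, f_total and m_total are hypothetical:

```python
import pandas as pd

# Hypothetical miniature table; the counts are made up.
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'William'],
    'sex':    ['F', 'F', 'M', 'M'],
    'births': [10, 20, 30, 40],
})

# Split the rows by the value of 'sex', then sum 'births' within each group.
by_sex = toy.groupby('sex')['births'].sum()

# The same totals computed by hand with boolean masks.
f_total = toy[toy['sex'] == 'F']['births'].sum()
m_total = toy[toy['sex'] == 'M']['births'].sum()

print(by_sex['F'], f_total)   # 30 30
print(by_sex['M'], m_total)   # 70 70
```

The groupby version scales to any number of groups without writing one mask per group.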

The groupby command in step [6] summed all the male and female name counts for the year 1880. So much for 1880. Each of the other years is stored in its own file. Combining Python’s IO capabilities with some pandas data frame building tools gives a useful view of the data across all years:

In [7]: years = range(1880,2011)

In [8]: pieces = []

In [9]: columns = ['name','sex','births']

In [10]: for year in years:
   ....:     path = 'names/yob%d.txt' % year
   ....:     frame = pd.read_csv(path,names=columns)
   ....:     frame['year'] = year
   ....:     pieces.append(frame)
   ....:

In [11]: names = pd.concat(pieces, ignore_index=True)

names is now a table whose rows combine name statistics from all years:

In [12]: names
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
name      1690784  non-null values
sex       1690784  non-null values
births    1690784  non-null values
year      1690784  non-null values
dtypes: int64(2), object(2)

The first few rows are from 1880. We can examine these using the head method:

In [13]: names.head()
Out[13]:
        name sex  births  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880

The most recent file in the data set is for 2010. Its rows come last in the table, and we can examine the final few using the tail method:

In [14]: names.tail()
Out[14]:
              name sex  births  year
1690779    Zymaire   M       5  2010
1690780     Zyonne   M       5  2010
1690781  Zyquarius   M       5  2010
1690782      Zyran   M       5  2010
1690783      Zzyzx   M       5  2010
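
The loop-and-concatenate pattern of steps [7] through [11] can be tried without the data files. The following is a sketch only: the dictionary fake_files is a hypothetical stand-in for the per-year files, and its contents are made up.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two per-year files; the contents are made up.
fake_files = {
    1880: "Mary,F,10\nJohn,M,20\n",
    1881: "Mary,F,30\nJohn,M,40\nAnna,F,5\n",
}

columns = ['name', 'sex', 'births']
pieces = []
for year, text in sorted(fake_files.items()):
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame['year'] = year          # tag every row with the year of its file
    pieces.append(frame)

# ignore_index=True discards each piece's own row numbers and renumbers
# the combined rows from 0.
combined = pd.concat(pieces, ignore_index=True)

print(len(combined))               # 5
print(combined.index.tolist())     # [0, 1, 2, 3, 4]
print(combined['year'].tolist())   # [1880, 1880, 1881, 1881, 1881]
```

Without ignore_index=True, the row labels of each piece would be kept, so labels such as 0 and 1 would appear more than once in the combined table.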

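Now let’s aggregate the data, summing the name counts for each combination of year and gender. The result is a pivot table, with one row per year and one column per gender. (Recent pandas releases spell the keyword arguments index and columns; older releases used rows and cols.)

In [20]: total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')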

In [21]: total_births.tail()
Out[21]:
sex         F        M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382

In [22]: total_births.plot(title='Total births by year')
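
To see what the pivot table of step [20] contains, here is a sketch on a made-up table. It assumes a pandas release that uses the keyword names index and columns; the data and the names toy, table and same are hypothetical.

```python
import pandas as pd

# Hypothetical long-format table; the counts are made up.
toy = pd.DataFrame({
    'name':   ['Mary', 'John', 'Anna', 'Mary', 'John'],
    'sex':    ['F', 'M', 'F', 'F', 'M'],
    'births': [10, 20, 5, 30, 40],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# One row per year, one column per sex, each cell the sum of 'births'.
table = toy.pivot_table(values='births', index='year', columns='sex',
                        aggfunc='sum')

# The same numbers via groupby on both keys, then moving 'sex' to the columns.
same = toy.groupby(['year', 'sex'])['births'].sum().unstack('sex')

print(table.loc[1880, 'F'])                    # 15
print(table.loc[1881, 'M'])                    # 40
print((table.values == same.values).all())     # True
```

A pivot table is thus a two-key groupby whose result is laid out as a grid, which is the shape the plot method expects: one line per column.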

The plot command in step [22] may not work unless ipython is started up with the matplotlib pylab interface running:

~gawron ipython --pylab

If you are not running ipython or the pylab interface, the following plain python commands will work instead of total_births.plot:

>>> import matplotlib.pyplot as plt

>>> fig = plt.figure(1,figsize=(8,8))

>>> ax1 = fig.add_subplot(111)
>>> fig.subplots_adjust(top=0.9,left=0.2)
>>> ax1.set_ylabel('births')
>>> ax1.set_xlabel('year')
>>> (p1,) = ax1.plot(total_births.index,total_births.F,color='pink',label='F')
>>> (p2, ) = ax1.plot(total_births.index,total_births.M,color='blue',label='M')
>>> ax1.set_title('Total births by sex and year')

>>> ax1.legend((p2,p1),('M','F'),loc='upper left')
>>> fig.show()

This is a lot more lines of code. The advantage of going this route is that many features of the graph can be customized. Note that colors have been set for the two lines, a title has been given to the graph, and a legend labeling the two lines has been placed in the upper left-hand corner. Additionally, labels for the x and y axes have been set. The last step is to show the figure with the show command.

Both the ipython command and the standard python commands should result in something like the following graph.

../_images/total_births_figure_1.png

6.5.2. The need for data frames

Why doesn’t one of the “native” Python data structures fit the bill?

  1. Lists can’t have two dimensions, really. You can represent the information in a table in a list of lists, but the resulting structure is too clunky for data analysis purposes. This point applies to both data frames and arrays:

    >>> from numpy import array
    >>> A = array([[1,2,3],[4,5,6]])
    >>> A[:,2]
    array([3, 6])
    

    The list representation doesn’t give any easy way to access column 2 (the third column, counting from 0).

  2. Pandas data frames provide easy access to all the efficiencies of arrays, for example vectorized arithmetic operations.

  3. Data frames are for data analysis. So they come with lots of data analysis bells and whistles, like simple plotting facilities.
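
The first two points can be made concrete with a short sketch. The column labels 'a', 'b' and 'c' and all variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6]]

# With a list of lists, a column has to be collected row by row.
col_from_list = [row[2] for row in rows]

# An array slices out the same column directly.
A = np.array(rows)
col_from_array = A[:, 2]

# A data frame adds labels, so the column can be selected by name.
df = pd.DataFrame(rows, columns=['a', 'b', 'c'])
col_from_frame = df['c']

# Vectorized arithmetic: one expression applies to every element.
doubled = df['c'] * 2

print(col_from_list)              # [3, 6]
print(col_from_array.tolist())    # [3, 6]
print(col_from_frame.tolist())    # [3, 6]
print(doubled.tolist())           # [6, 12]
```

The list version needs an explicit loop for both the column access and the arithmetic; the array and data frame versions express each as a single operation.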