6.6. Aggregation

In this section we look at data aggregation operations on data frames.

6.7. Split, apply, combine


Split-apply-combine paradigm with the mean operation

A simple aggregation example. Let’s make up some data:

In [1]: import pandas as pd

In [2]: dd= {'key1': ['a','a','b','b','a'],

In [2]: df = pd.DataFrame(dd)

The resulting is a 5x4 (5 rows, 4 cols) data frame that looks like this:

In [2]: df

We now group the data by key1:

In [2]: grouped = df['data1'].groupby(df['key1'])

This is a special data structure that essentially stores the information necessary for the split step of split/apply/combine:

In [2]: grouped

In [2]: grouped.head()

Although the information about key1 seems to be missing, it’s actually available in the index, where we also can see the groups the 5 rows have been split into two groups.

In [2]: grouped.index

The computational purpose of this object is to provide a structure that can be looped over efficiently to execute the next apply step. This can be seen by actually iterating over a group and printing the result:

for name, group in grouped:
    print name
    print group

which yields:

0    0.074217
1   -0.058098
4    0.654327
Name: data1, dtype: float64
2    0.262827
3   -0.026915
Name: data1, dtype: float64

We can take the mean of the grouped object, to get the mean values for the various row groups defined by key1:

In [24]: grouped.mean()
a       0.223482
b       0.117956
dtype: float64

Thus, for example, 0.223482 is the mean value of the a groups, because

0.223482 = \frac{0.074212 + -.058098 + .654327}{3}

And this step represents the apply and combine parts of split-apply-combine.

The mean operation is just one example of an aggregation operation. In general an aggregation operation is an operation that can produce a single number from a column of numbers. Examples include but are not limited to

  1. sum
  2. count
  3. min
  4. sum

In the next section, we look at a more realistic example.

6.7.1. Federal Election Commission DB

See Chapter 9 of the pydata book. Federal Election Commission DB http://www.fex.gov/disclosure/PDownload.do

gets you a 150 MB .csv file named:


avaliable in the ch09 folder of the pydata book download.

6.7.2. Reading in the data

Assuming the data from the pydata book has been downloaded into directory location stored in the name pydata_dir:

In [8]: import pandas as pd
In [9]: import os.path

In [10]: working_chapter = os.path.join(pydata_dir,'ch09')

In [11]: fec = pd.read_csv(os.path.join(working_chapter,'P00000001-ALL.csv'))

We choose that subset of the data containing Mitt Romney and Barack Obama as candidates:

In [12]: fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

Looking at just 5 columns and the first few rows of the fec_mrbo table we have:

In [13]: fec_mrbo.head()[['cmte_id','cand_id','cand_nm','contbr_nm','contbr_city']]
Out [13]: cmte_id    cand_id       cand_nm           contbr_nm contbr_city
411  C00431171  P80003353  Romney, Mitt  ELDERBAUM, WILLIAM         DPO
412  C00431171  P80003353  Romney, Mitt  ELDERBAUM, WILLIAM         DPO
413  C00431171  P80003353  Romney, Mitt    CARLSEN, RICHARD         APO
414  C00431171  P80003353  Romney, Mitt      DELUCA, PIERRE         APO
415  C00431171  P80003353  Romney, Mitt    SARGENT, MICHAEL         APO

In the next section, we look at aggregating contributions by candidate and by state.

6.7.3. Aggregating by state

Next two steps: Aggregate donations by candidate and state:

In [14]: grouped = fec_mrbo.groupby(['cand_nm','contbr_st'])
In [15]: totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)

Note that two states are quite low in their donation totals: V has 2 donations to Romney and ND has 578, but they total to under 100K:

In [16]: grouped.size()

Filter out states with very small contrib totals:

In [17]: totals = totals[totals.sum(1) > 100000]

Turn into percentages for visualization:

In [18]: percent = totals.div(totals.sum(1),axis=0)