# 6.6. Aggregation¶

In this section we look at data aggregation operations on data frames.

# 6.7. Split, apply, combine¶

A simple aggregation example. Let’s make up some data:

```
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: dd = {'key1': ['a', 'a', 'b', 'b', 'a'],
   ...:       'key2': ['one', 'two', 'one', 'two', 'one'],
   ...:       'data1': np.random.randn(5),
   ...:       'data2': np.random.randn(5)}

In [4]: df = pd.DataFrame(dd)
```

The result is a 5x4 (5 rows, 4 cols) data frame that looks like this:

```
In [2]: df
```

We now group the data by *key1*:

```
In [2]: grouped = df['data1'].groupby(df['key1'])
```

This is a special data structure that essentially stores the information
necessary for the *split* step of split/apply/combine:

```
In [2]: grouped
In [2]: grouped.head()
```

Although the information about *key1* seems to be missing,
it is actually stored in the grouping: the 5 rows have been
split into two groups, **a** and **b**, as the group mapping
shows.

```
In [2]: grouped.groups
```

The computational purpose of this object is
to provide a structure that can be looped over efficiently
to execute the next *apply* step. This can be seen by
iterating over the groups and printing each one:

```
for name, group in grouped:
    print(name)
    print(group)
```

which yields:

```
a
0 0.074217
1 -0.058098
4 0.654327
Name: data1, dtype: float64
b
2 0.262827
3 -0.026915
Name: data1, dtype: float64
```

We can take the mean of the *grouped* object, to get the mean values
for the various row groups defined by *key1*:

```
In [24]: grouped.mean()
Out[24]:
key1
a 0.223482
b 0.117956
dtype: float64
```

Thus, for example, 0.223482 is the mean value of the **a** group,
because it is the average of the three *data1* values in rows 0, 1, and 4
shown above: (0.074217 - 0.058098 + 0.654327)/3 = 0.223482.

This step represents the *apply* and *combine* parts of split-apply-combine.
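The whole split/apply/combine pipeline can be checked on a small, fixed data set. The values below are made up (not the random data above) so that the groupby result can be compared against a mean computed by hand:

```python
import pandas as pd

# Fixed, non-random values so the result is reproducible.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 10.0, 20.0, 3.0]})

# Split on key1, apply mean, combine into one value per group.
means = df['data1'].groupby(df['key1']).mean()

# The same 'a' result computed by hand: rows 0, 1, and 4 belong to 'a'.
manual_a = (1.0 + 2.0 + 3.0) / 3

print(means)
```

Here `means['a']` is 2.0, which agrees with the hand-computed average of the three *data1* values in group **a**.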

The *mean* operation is just one example
of an aggregation operation. In general, an aggregation
operation is any operation that reduces a column of numbers
to a single number. Examples include
but are not limited to

- sum
- count
- min
- max

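Several such aggregations can be applied at once with `agg`; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1.0, 3.0, 5.0]})

# Apply several aggregation functions to each group in one call;
# the result has one row per group and one column per function.
stats = df.groupby('key1')['data1'].agg(['sum', 'count', 'min', 'max'])
print(stats)
```

Each named function collapses a group's *data1* column to a single number, exactly as in the definition above.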
In the next section, we look at a more realistic example.

## 6.7.1. Federal Election Commission DB¶

See Chapter 9 of the pydata book. Downloading the Federal Election Commission DB from http://www.fec.gov/disclosure/PDownload.do

gets you a 150 MB .csv file named:

```
P00000001-ALL.csv
```

It is also available in the ch09 folder of the pydata book download.

## 6.7.2. Reading in the data¶

Assuming the data from the pydata book has been downloaded into
the directory whose location is stored in the name *pydata_dir*:

```
In [8]: import pandas as pd
In [9]: import os.path
In [10]: working_chapter = os.path.join(pydata_dir,'ch09')
In [11]: fec = pd.read_csv(os.path.join(working_chapter,'P00000001-ALL.csv'))
```

We choose that subset of the data containing Mitt Romney and Barack Obama as candidates:

```
In [12]: fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
```
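`isin` produces a boolean mask that is true for rows whose value is in the given list; indexing with that mask keeps only those rows. The same pattern on a toy frame (invented amounts, not the FEC data):

```python
import pandas as pd

toy = pd.DataFrame({'cand_nm': ['Obama, Barack', 'Paul, Ron',
                                'Romney, Mitt', 'Obama, Barack'],
                    'contb_receipt_amt': [25.0, 50.0, 100.0, 10.0]})

# Boolean mask: True where cand_nm is one of the two candidates.
mask = toy.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])
subset = toy[mask]
print(subset)
```

Only the three Obama/Romney rows survive; the `'Paul, Ron'` row is dropped.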

Looking at just 5 columns and the first few rows of the fec_mrbo table we have:

```
In [13]: fec_mrbo.head()[['cmte_id','cand_id','cand_nm','contbr_nm','contbr_city']]
Out[13]:
        cmte_id    cand_id       cand_nm           contbr_nm contbr_city
411 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO
412 C00431171 P80003353 Romney, Mitt ELDERBAUM, WILLIAM DPO
413 C00431171 P80003353 Romney, Mitt CARLSEN, RICHARD APO
414 C00431171 P80003353 Romney, Mitt DELUCA, PIERRE APO
415 C00431171 P80003353 Romney, Mitt SARGENT, MICHAEL APO
```

In the next section, we look at aggregating contributions by candidate and by state.

## 6.7.3. Aggregating by state¶

The next two steps aggregate donations by candidate and state:

```
In [14]: grouped = fec_mrbo.groupby(['cand_nm','contbr_st'])
In [15]: totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
```
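Summing the two-key group produces a Series with a (cand_nm, contbr_st) MultiIndex; `unstack(0)` pivots the first level (the candidate) into columns, and `fillna(0)` replaces the missing candidate/state combinations with zero. A minimal sketch with invented candidates and amounts:

```python
import pandas as pd

toy = pd.DataFrame({'cand_nm':           ['A', 'A', 'B', 'A'],
                    'contbr_st':         ['CA', 'NY', 'CA', 'CA'],
                    'contb_receipt_amt': [10.0, 20.0, 5.0, 30.0]})

g = toy.groupby(['cand_nm', 'contbr_st'])

# Sum per (candidate, state), then pivot candidates into columns;
# B made no NY donations, so that cell is filled with 0.
t = g.contb_receipt_amt.sum().unstack(0).fillna(0)
print(t)
```

The result has one row per state and one column per candidate, the same shape as *totals* above.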

Note that two states are quite low in their donation totals: V has 2 donations to Romney and ND has 578, but both total under 100K, as the group sizes show:

```
In [16]: grouped.size()
```

Filter out states with very small contribution totals:

```
In [17]: totals = totals[totals.sum(1) > 100000]
```

Turn the totals into percentages for visualization:

```
In [18]: percent = totals.div(totals.sum(1),axis=0)
```
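`div` with `axis=0` divides each row of *totals* by that row's sum, so every row of *percent* adds up to 1. A quick check on invented totals (the numbers are made up for illustration):

```python
import pandas as pd

totals = pd.DataFrame({'Obama, Barack': [300000.0, 150000.0],
                       'Romney, Mitt':  [100000.0,  50000.0]},
                      index=['CA', 'NY'])

# Divide each row by its row sum: each cell becomes the candidate's
# share of that state's total.
percent = totals.div(totals.sum(1), axis=0)
print(percent)
```

For example, the CA row becomes 0.75 / 0.25, and each row sums to 1, which is what makes the result suitable for a stacked or normalized plot.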