7.2. Pivot Tables

7.2.1. Introducing the pivot table

Suppose, as an aid to interpreting the numbers in the 'births' column, we we are interest in the average popularity of a female name. Then we want to find the mean births for all female names.

We might proceed as follows. Pick a year, say 1881.

names1881 = names[names['year']  == 1881]
namesfemale1881 = names1881[names1881['sex'] == 'F']
namesfemale1881['births'].mean()
98.03304904051173

We might characterize this code as follows:

  1. Split step: Line 1 splits off a group of rows, namesfemale1881 using one of the possible values in sex column. We’ll call that column the indexing column.

  2. Apply step: Line 2 applies an aggregation function ( .mean() ) to the 'births' column of those rows; call births the values column.

In this example we performed the split step on one of the possible values in the indexing column, then performed apply step on one group of rows. As we saw with cross-tabulation. we frequently want to perform these two steps for all possible values in the indexing column.

And of course that leads to the need for combining: we want to combine the results in a new Data Structure, indexed by the values that defined our groups In our example, the indexing column is 'sex' with possible values 'F' and 'M’, so we’d like a new DataFrame with a single column containing the 'births' means for those groups.

Summarizing: we need to split the DataFrame into groups of rows (using an index column), apply (applying an aggregation function to a values column), and combine (into a DataFrame). These are exactly the steps we executed for cross-tabulation. The difference in this case is that instead of just counting the groups of rows, we want to apply the aggregation function to the values column 'births' . It turns out this variation on the three steps comes up often enough to merit a name.

All three steps can be performed using pivot table method.

pt = names1881.pivot_table(values= 'births',index='sex',aggfunc='mean')
pt
births
sex
F 98.033049
M 101.051153

In fact, mean is the default aggregation function, and the first and second arguments of the .pivot_table() method are always the values column and the index column, so we get the same result by writing:

pt = names1881.pivot_table('births','sex')
pt
births
sex
F 98.033049
M 101.051153

It is worth emphasizing the importance of the combine step; the .pivot_table() method combines its results into a DataFrame. That means we can leverage all our knowledge of how DataFrames work in using it, for example, by plotting the results. One of the key takeaways about pandas is that almost all the functions and methods that apply to DataFrames and Series return either a DataFrame or a Series. By understanding the properties of what’s returned, we can use it more effectively in the next analytical step.

As desired, the values in the 'sex' column now index the new DataFrame:

pt.index
Index(['F', 'M'], dtype='object', name='sex')

Note the original name of the indexing column has been preserved, as the above output shows. It is stored in the name attribute of the DataFrame index:

pt.index.name
'sex'

As with any DataFrame, .loc[ ] is used to access individual rows using names from the index.

# Any expression containing this is a KeyError
# pt['sex']
# because 'sex' is not the name of a column in `pt`.
pt.loc['F']
births    98.033049
Name: F, dtype: float64

We have just found the mean female and male births for 1881.

Suppose we want total births instead of mean births. It can still be done with a pivot table, and we still need to group by sex, and apply an aggregation functiomn to the values column, but we need to change our aggregation function from mean to sum.

names1881.pivot_table('births','sex',aggfunc='sum')
births
sex
F 91955
M 100748

So now we know the total number of male and female births in 1881 (at least those accounted for our data sample).

Continuing along these lines, let’s move back to the names DataFrame: Suppose we want to find birth totals by gender and by year. That is, we want a new DataFrame in which the index value for a row is a year and the two columns show the total births for 'F' and 'M' in that year.

So want the following pivot table:

total_births = names.pivot_table('births','year', columns='sex', aggfunc='sum')
total_births.head()
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445

Of course we could just as easily have the index values be genders and the columns be years.

total_births2 = names.pivot_table('births','sex', columns='year', aggfunc='sum')
#only showing the first 10 of 131 columns
total_births2.iloc[:,:10]
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
sex
F 90993 91955 107851 112322 129021 133056 144538 145983 178631 178369
M 110493 100748 113687 104632 114445 107802 110785 101412 120857 110590

The difference between these two pivot table computations lies entirely in the combine step.

Under the hood both DataFrames require computing the same row groups and applying 'sum' to the births column in each group.

In fact, total_births2 is the transpose of total_births.

import numpy as np
np.all(total_births2.T == total_births)
True

Thus in total_births, we find the 2006 birth totals with .loc[]:

total_births.loc[2006]
sex
F    1896468
M    2050234
Name: 2006, dtype: int64

In total_births2, we access the same 2006 birth totals the way we access any column.

total_births2[2006]
sex
F    1896468
M    2050234
Name: 2006, dtype: int64

Since total_births and total_births2 contain essentially the same information, let’s summarize what we learned about the .pivot_table() method with total_births.

How was total_births created? By three steps

  1. Splitting the rows of name into groups (one for each year, gender pair).

  2. Applying the aggregation function sum to each group

  3. Combining the results into a single DataFrame total_births.

This is the now familiar the split/apply/combine strategy.

7.2.1.1. Cross tab versus pivot_table

Compare total_births2 with the output of pd.crosstab, which we use to get the joint distribution counts for two attributes; although the rows and columns are the same, the numbers are different.

total_births2.iloc[:,:5]
year 1880 1881 1882 1883 1884
sex
F 90993 91955 107851 112322 129021
M 110493 100748 113687 104632 114445
crosstab_sex_year = pd.crosstab(names['sex'],names['year'])
crosstab_sex_year.iloc[:,:10]
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
sex
F 942 938 1028 1054 1172 1197 1282 1306 1474 1479
M 1058 997 1099 1030 1125 1097 1110 1067 1177 1111
crosstab_sex_year[1880]['F']
942

What does

crosstab_sex_year[1880]['F']

mean?

It represents how many times the names DataFrame pairs 1880 in the 'year' column with ‘F’ in the 'sex' column, and in names, that amounts to computing how many distinct female names there were in 1880.

In this example pd.crosstab combined two columns of the names DataFrame to compute the crosstab DataFrame crosstab_sex_year. In general, pd.crosstab takes any two sequences of equal length and computes how many times unique pairings occur.

[1,0,1,0,0,1,1]
[b,b,b,a,b,a,a]

For example, in the two arrays seen here, a 1 from the first sequence is paired with a b from the second twice. The results are summarized in the cross-tabulation DataFrame.

pd.crosstab(np.array([1,0,1,0,0,1,1]),np.array(['b','b','b','a','b','a','a']))
col_0 a b
row_0
0 1 2
1 2 2

If the two sequences are columns from a DataFrame, then the counts tell us how many rows contain each possible pairing of values. So each pair of values defines a group of rows (the rows with 'year' 1881 and 'sex' ‘F’, for example, in names). Note there’s no values column providing numbers to perform an aggregation operation on. All we do is compute the row groups and count their sizes.

Now there is one curve ball, mentioned here because you may come across code that uses it: pd.crosstab has an optional values argument. In lieu of just counting the rows in a group, we can apply an aggregation operation to the values column of the row groups.

For example we can use crosstab to recreate total_births2, the DataFrame we created above with pivot_table. Here is the DataFrame we created with pivot_table.

total_births2.iloc[:,:10]
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
sex
F 90993 91955 107851 112322 129021 133056 144538 145983 178631 178369
M 110493 100748 113687 104632 114445 107802 110785 101412 120857 110590

And here is the same DataFrame using crosstab.

# Specify sum as the aggregation function
# and births as the numerical value column for sum to apply to
ct_table = pd.crosstab(names['sex'],names['year'], values= names['births'],aggfunc='sum')
ct_table.iloc[:,:10]
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
sex
F 90993 91955 107851 112322 129021 133056 144538 145983 178631 178369
M 110493 100748 113687 104632 114445 107802 110785 101412 120857 110590

So it is possible to use crosstab to create a pivot_table type DataFrame; however, the general convention is that it is clearer to use pivot_table, the more specific tool, to create what most people know as a pivot table.

7.2.1.1.1. Plotting

All data frames have a plot method; for a simple DataFrame like a pivot table the the plot method draws a very simple picture.

Consider plotting total_births, the pivot table created in a previous section. Here’s the DataFrame. Note the index and the two columns:

total_births
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445
... ... ...
2006 1896468 2050234
2007 1916888 2069242
2008 1883645 2032310
2009 1827643 1973359
2010 1759010 1898382

131 rows × 2 columns

And here is the default plot, which puts the index on the x-axis and plots the two columns as separate line plots.

from matplotlib import pyplot as plt
ax = total_births.plot(title='Total births by year')
#plt.show()
../_images/output_57_0.png

7.2.1.1.2. Alternative plotting script

The above graph is great and often what we want is just to take a quick look at the relationships in the data, and the default DataFrame plot will do exactly the right thing with no customization. It’s helpful to know that pandas is using a Python package called matplotlib to draw the graph above, and we can do the same ourselves, with a lot more lines of code, but also with a lot more customization.

import matplotlib.pyplot as plt
fig = plt.figure(1,figsize=(8,8))
ax1 = fig.add_subplot(111)
fig.subplots_adjust(top=0.9,left=0.2)
ax1.set_ylabel('births')

ax1.set_xlabel('year')
(p1,) = ax1.plot(total_births.index,total_births['F'],color='salmon',label='F')
(p2, ) = ax1.plot(total_births.index,total_births['M'],color='blue',label='M')
ax1.set_title('Total births by sex and year')

ax1.legend((p2,p1),('M','F'),loc='upper left',title='sex')
plt.show()
../_images/output_60_0.png

The basic requirement for the matplotlib.plot(...) function is that you need to supply two sequences of the same length giving the x- and y- coordinates respectively of the plot. We can optionally specify plot attributes like line style, color, and label (for use in the legend) for each line plotted.

There are also other plot functions which could provide alternative views of the relationships.

For example, we could do this as a bar plot. The focus below is on lines 9 and 10 which draw the bar plots for the two genders.

import matplotlib.pyplot as plt
fig = plt.figure(1,figsize=(8,8))
ax1 = fig.add_subplot(111)
fig.subplots_adjust(top=0.9,left=0.2)
ax1.set_ylabel('births')

ax1.set_xlabel('year')
width=.1
ax1.bar(total_births.index,total_births['F'],color='salmon',label='F',width=width)
ax1.bar(total_births.index+width,total_births['M'],color='blue',label='M',alpha=.1)
ax1.set_title('Total births by sex and year')

ax1.legend(loc='upper left',title='sex')
plt.show()
../_images/output_63_0.png

7.2.1.2. Cross tab vs pivot table revisited

When do we do use a pivot table, when cross-tabulation?

The answer is that we use cross-tabulation whenever counting is involved, even normalized counting (percentages).

Recall the problem we solved with cross-tabulation in Part One of the Pandas Intro, getting the joint distribution of complaint types and agencies:

import os.path
#How to break up long strings into multiline segments
#Note the use of "line continued" character \
data_url = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'pandas/datasets/311-service-requests.csv'
# Some columns are of mixed types.  This is OK.  But we have to set
# low_memory=False
complaints = pd.read_csv(data_url,low_memory=False)
three = ['DOT', "DOP", 'NYPD']
pt00 = complaints[complaints.Agency.isin(three)]
# only show the first 15 of many rows
pd.crosstab(pt00['Complaint Type'], pt00['Agency']).iloc[:15]
Agency DOP DOT NYPD
Complaint Type
Agency Issues 0 20 0
Animal Abuse 0 0 164
Bike Rack Condition 0 7 0
Bike/Roller/Skate Chronic 0 0 32
Blocked Driveway 0 0 4590
Bridge Condition 0 20 0
Broken Muni Meter 0 2070 0
Bus Stop Shelter Placement 0 14 0
Compliment 0 1 0
Curb Condition 0 66 0
DOT Literature Request 0 123 0
Derelict Vehicle 0 0 803
Disorderly Youth 0 0 26
Drinking 0 0 83
Ferry Complaint 0 4 0

In thinking about how we used pd.crosstab to solve this problem, it’s worth considering whether we could do this with a pivot table.

Pivot_table construction begins the same way cross-tabulation does: it splits the rows into groups such that each group contains all the rows for one agency/compliant type say, ‘NYPD’/‘Anmal Abuse’; it then applies the aggregation function to each group.

There is an aggregation function'count' which does we want. It simply computes the number of rows in each group.

There is one wrinkle, and this is the awkward part: the pivot table function needs an argument for the values column (in other aggregation operations this argument gives the numerical values to which the aggregation operation applies). When the aggregation operation is 'count' any valid column will serve.

So we pick one at random, say "Status". The result is almost exactly what we computed with crosstab, though with slightly more obscure code:

three = ['DOT', "DOP", 'NYPD']
pt00 = complaints[complaints.Agency.isin(three)]
pt0 = pt00.pivot_table(values='Status', index='Complaint Type' ,
                       columns = 'Agency', aggfunc='count')
pt0.iloc[:15]
Agency DOP DOT NYPD
Complaint Type
Agency Issues NaN 20.0 NaN
Animal Abuse NaN NaN 164.0
Bike Rack Condition NaN 7.0 NaN
Bike/Roller/Skate Chronic NaN NaN 32.0
Blocked Driveway NaN NaN 4590.0
Bridge Condition NaN 20.0 NaN
Broken Muni Meter NaN 2070.0 NaN
Bus Stop Shelter Placement NaN 14.0 NaN
Compliment NaN 1.0 NaN
Curb Condition NaN 66.0 NaN
DOT Literature Request NaN 123.0 NaN
Derelict Vehicle NaN NaN 803.0
Disorderly Youth NaN NaN 26.0
Drinking NaN NaN 83.0
Ferry Complaint NaN 4.0 NaN

The difference is the appearance of NaN in all those places where there were no rows in the group to count.

All in all, it is an improvement in clarity and informativeness to use crosstab for this task. As we shall see below, there are cases where cross tabulation can be extended, using other aggregation operations, to build DataFrames that are more easily understood as pivot_tables. The rule of thumb is to use cross tabulation where the operation applied to groups is counting.

Summarizing: this is the kind of problem to use cross-tabulation for. It involves counting.

Even though it is possible to we call the .pivot_table() method on complaints, using 'count' as the aggregation operation, this is not best practice when counting is involved.

7.2.1.2.1. Using pivot instead of groupby

As we discussed in part one, splitting (or grouping) is the first of the three operations in the split/apply/combine strategy.

We used the nba dataset to illustrate, because it has some very natural groupings.

We reload the NBA dataset.

import pandas as pd
nba_file_url = 'https://gawron.sdsu.edu/python_for_ss/course_core/data/nba.csv'
nba_df = pd.read_csv(nba_file_url)

Some of the questions we answered there could have been done with a pivot table:

For example, let’s use a pivot table to answer the question about average weight of centers.

nba_pt_wt = nba_df.pivot_table('Weight','Position')
print(nba_pt_wt)
print()
print(f"And the answer is: {nba_pt_wt.loc['C','Weight']:.1f}")
              Weight
Position
C         254.205128
PF        240.430000
PG        189.478261
SF        221.776471
SG        206.686275

And the answer is: 254.2

The numbers here are means because mean is the default aggregation function. Similarly, consider the question:

What position earns the highest salary on average?

The pivot table answer, going all the way to the highest valued position:

# Get pivot table; sort values for the Salary column, get first element of index
nba_df.pivot_table('Salary','Position')['Salary'].sort_values(ascending=False).index[0]
'C'

The point is that pivot_table, like pd.crosstab, executes the split-apply-combine strategy in one step and so it will often be more convenient to call pivot_table rather than groupby.

The .groupby() method is the first step in creating a pivot table; hence, we can always emulate the effect of .pivot_table() by grouping and then apply an aggregation function.

However, if we’re after only one aggregation operation for a particular grouping, there really isn’t an advantage to using groupby. We may as well use pivot_table.

7.2.1.3. New Stocks dataset: More pivot table examples using Time

  1. Adding columns to enable new kinds of grouping.

  2. “Melting” different value columns into 1 value column (in order to set up a pivot).

  3. Centering and scaling data.

import pandas as pd
import os.path
import urllib.request
import os.path
from matplotlib import pyplot as plt

url_dir = 'https://gawron.sdsu.edu/python_for_ss/course_core/data'
time1 = pd.read_csv(os.path.join(url_dir,'stocks.csv'))
time1
Unnamed: 0 date AA GE IBM MSFT
0 0 1990-02-01 00:00:00 4.98 2.87 16.79 0.51
1 1 1990-02-02 00:00:00 5.04 2.87 16.89 0.51
2 2 1990-02-05 00:00:00 5.07 2.87 17.32 0.51
3 3 1990-02-06 00:00:00 5.01 2.88 17.56 0.51
4 4 1990-02-07 00:00:00 5.04 2.91 17.93 0.51
... ... ... ... ... ... ...
5467 5467 2011-10-10 00:00:00 10.09 16.14 186.62 26.94
5468 5468 2011-10-11 00:00:00 10.30 16.14 185.00 27.00
5469 5469 2011-10-12 00:00:00 10.05 16.40 186.12 26.96
5470 5470 2011-10-13 00:00:00 10.10 16.22 186.82 27.18
5471 5471 2011-10-14 00:00:00 10.26 16.60 190.53 27.27

5472 rows × 6 columns

To get the idea of this data set try this:

Plotting challenge: Plot IBM and GE stock prices for February. Try to place the two plots side by side.

Note: even though the values in the 'date' column are strings, they’ve very well chosen strings for our purposes. Normal string sorting will reproduce chronological order.

There are no NaN values in the data.

len(time1), len(time1.dropna())
(5472, 5472)
time_feb = time1[(time1['date'] >='1990-02-01 00:00:00') & (time1['date'] < '1990-03-01 00:00:00') ]
# Subplot array is 1x2 1 row with 2 cols.
(fig, (ax1,ax2))  = plt.subplots(1,2)
# More room between subplots
fig.tight_layout()
# Rotate the xaxis labels 45 degrees to get them to display nicer.
time_feb.plot('date',"IBM",rot=45,ax=ax1)
time_feb.plot('date',"GE",rot=45,ax=ax2)
<AxesSubplot:xlabel='date'>
../_images/output_86_1.png

Task: Create a pivot table which shows the mean prices by month for ‘IBM’.

We’d like to look at things month by month. We need to make grouping by month easy. While we’re at it, let’s also enable year by year aggregation. The structure of the entries in the 'date' column makes both things easy.

time1['year'] = time1['date'].apply(lambda x: int(x[:4]))
time1['month'] = time1['date'].apply(lambda x: x[:7])

Note that adding columns is not the only way to facilitate different kinds of grouping. You can also use groupby, which takes an arbitrary function to assign labels, or you can simply pass groupby a sequence of labels you’ve created. These options are described in the pandas documentation; see the pandas_doc_notes notebook for examples.

Here is time1 with the new columns.

time1
Unnamed: 0 date AA GE IBM MSFT year month
0 0 1990-02-01 00:00:00 4.98 2.87 16.79 0.51 1990 1990-02
1 1 1990-02-02 00:00:00 5.04 2.87 16.89 0.51 1990 1990-02
2 2 1990-02-05 00:00:00 5.07 2.87 17.32 0.51 1990 1990-02
3 3 1990-02-06 00:00:00 5.01 2.88 17.56 0.51 1990 1990-02
4 4 1990-02-07 00:00:00 5.04 2.91 17.93 0.51 1990 1990-02
... ... ... ... ... ... ... ... ...
5467 5467 2011-10-10 00:00:00 10.09 16.14 186.62 26.94 2011 2011-10
5468 5468 2011-10-11 00:00:00 10.30 16.14 185.00 27.00 2011 2011-10
5469 5469 2011-10-12 00:00:00 10.05 16.40 186.12 26.96 2011 2011-10
5470 5470 2011-10-13 00:00:00 10.10 16.22 186.82 27.18 2011 2011-10
5471 5471 2011-10-14 00:00:00 10.26 16.60 190.53 27.27 2011 2011-10

5472 rows × 8 columns

Now it’s easy to create a pivot table which shows the month-by-month mean prices for IBM. Note that IBM is tha value column and month the indexing column.

time1.pivot_table('IBM','month')
IBM
month
1990-02 17.781579
1990-03 18.466818
1990-04 18.767500
1990-05 20.121818
1990-06 20.933810
... ...
2011-06 164.875455
2011-07 178.204500
2011-08 169.194348
2011-09 170.245714
2011-10 182.405000

261 rows × 1 columns

Task: Create a DataFrame which has a single row for each month showing the mean prices for ‘IBM’, “GE’ and ‘MSFT’ for that month in separate columns.

It would be nice to have one DataFrame which gave the month by month means for all three stocks in time1; this is essentially a pivot_table task. The issue is that each pivot_table command can only apply .mean() to one values column; and we have three.

We could do three pivots and concatenate columnwise. An alternative pandas offers is “melting”; conceptually, melt is the inverse of groupby. We unfold 30 day rows for April into 90 rows, 30 for each stock, by replacing the original three stock value columns with one values column and adding a variable column to keep track of what stock had each value:

time1_melt = pd.melt(time1, id_vars=['month'], value_vars=['AA', 'GE','IBM'])
time1_melt[time1_melt['month'] == '1990-04']
month variable value
41 1990-04 AA 5.18
42 1990-04 AA 5.17
43 1990-04 AA 5.13
44 1990-04 AA 5.15
45 1990-04 AA 5.04
46 1990-04 AA 5.06
47 1990-04 AA 5.11
48 1990-04 AA 5.20
49 1990-04 AA 5.25
50 1990-04 AA 5.27
51 1990-04 AA 5.26
52 1990-04 AA 5.21
53 1990-04 AA 5.15
54 1990-04 AA 5.10
55 1990-04 AA 5.04
56 1990-04 AA 5.07
57 1990-04 AA 5.10
58 1990-04 AA 5.12
59 1990-04 AA 5.14
60 1990-04 AA 5.07
5513 1990-04 GE 2.99
5514 1990-04 GE 3.04
5515 1990-04 GE 3.01
5516 1990-04 GE 3.01
5517 1990-04 GE 3.01
5518 1990-04 GE 3.02
5519 1990-04 GE 3.01
5520 1990-04 GE 3.03
5521 1990-04 GE 3.09
5522 1990-04 GE 3.12
5523 1990-04 GE 3.13
5524 1990-04 GE 3.11
5525 1990-04 GE 3.08
5526 1990-04 GE 3.06
5527 1990-04 GE 3.02
5528 1990-04 GE 3.01
5529 1990-04 GE 3.01
5530 1990-04 GE 3.02
5531 1990-04 GE 2.99
5532 1990-04 GE 2.99
10985 1990-04 IBM 18.41
10986 1990-04 IBM 18.54
10987 1990-04 IBM 18.52
10988 1990-04 IBM 18.45
10989 1990-04 IBM 18.41
10990 1990-04 IBM 18.34
10991 1990-04 IBM 18.41
10992 1990-04 IBM 18.49
10993 1990-04 IBM 18.62
10994 1990-04 IBM 19.28
10995 1990-04 IBM 19.32
10996 1990-04 IBM 19.10
10997 1990-04 IBM 18.97
10998 1990-04 IBM 19.01
10999 1990-04 IBM 19.01
11000 1990-04 IBM 18.93
11001 1990-04 IBM 19.01
11002 1990-04 IBM 18.91
11003 1990-04 IBM 18.67
11004 1990-04 IBM 18.95

Now we can pivot into the DataFrame we want:

time1_month_pivot = time1_melt.pivot_table('value','month','variable')
time1_month_pivot
variable AA GE IBM
month
1990-02 5.043684 2.873158 17.781579
1990-03 5.362273 2.963636 18.466818
1990-04 5.141000 3.037500 18.767500
1990-05 5.278182 3.160000 20.121818
1990-06 5.399048 3.275714 20.933810
... ... ... ...
2011-06 15.390000 18.295000 164.875455
2011-07 15.648000 18.500500 178.204500
2011-08 12.353478 15.878261 169.194348
2011-09 11.209524 15.557143 170.245714
2011-10 9.778000 15.735000 182.405000

261 rows × 3 columns

What stock gave the best return over the time period in the data?

To answer this question we essentially want to perform a delta calculation: subtract the value on the first day from the value on the last day. The issue is what measure of value should we use when stock share prices differ significantly in scale (compare IBM and MSFT in Feb 1990 in time1). We need some way of scaling the price differences so as to make them comparable.

We will present one way of doing this without a great deal of discussion because this question is essentially a pretext for showing you how to center and scale data.

Strategy: Center and scale each stock column to show stock price in standard deviation units. (Plot the three resulting stock price lines, for fun). Delta calculation: For each stock, compute the difference between the first and last days in the data in standard deviation units.

The stock that is the furthest above its starting point at the end of the period is the winner. Note that the “center” of the centering is arbitrary for this calculation. We choose to make it the mean because that is a standard choice. But in the end we’re just computing the difference in values beween the last day and the first day, and the choice of center will have no effect on that difference.

(x_last - c) - (x_first - c) = x_last - x_first

What does matter is the choice of the units we use to represent stock values. We choose standard deviation units because that is a reasonable way to standardize the prices of stocks.

The next cell actually centers and scales each column.

# If you used time1 you would need to tinker with both the plot command
# and the delta calculation below, because the DataFrame index is not a sequence of time points.
# df = time1[['AA','GE','IBM']]
df = time1_month_pivot
mn,std = df.mean(),df.std()
# centering (the numerator)  & scaling (the denominator)
new_df=(df-mn)/std

How did that work? Well, pandas sensibly decides that in a normally structured DataFrame, it doesn’t make sense to take the mean or standard deviation of a row (each column is a different kind of thing), so mean and standard deviation calculations are automatically applied to columns.

mn
variable
AA     17.445346
GE     17.947286
IBM    66.919186
dtype: float64

When we subtract this from df the numbers are automatically aligned with columns, and

df - mn

automatically turns into elementwise subtraction of the mean from each column.

Note that for the cell above to work we should (minimally) have a DataFrame with all numerical columns and no NaN issues (or at least NaN issues that have been thought through in advance). Our pivot table meets these criteria: It is all numerical with no NaNs. Because there are no NaNs in the original DataFrame time1, we could have just as easily have used the three value columns in time1 for centering and scaling, but we would end up with a DataFrame with no temporal index and no time related columns.

Here is the result of centering the pivot table:

new_df
variable AA GE IBM
month
1990-02 -1.288710 -1.418216 -1.172091
1990-03 -1.255604 -1.409703 -1.155745
1990-04 -1.278598 -1.402754 -1.148573
1990-05 -1.264343 -1.391229 -1.116268
1990-06 -1.251783 -1.380342 -1.096900
... ... ... ...
2011-06 -0.213580 0.032714 2.336573
2011-07 -0.186770 0.052048 2.654514
2011-08 -0.529118 -0.194660 2.439593
2011-09 -0.647991 -0.224871 2.464671
2011-10 -0.796747 -0.208138 2.754710

261 rows × 3 columns

The brief code snippet above is in this case equivalent to the for-loop in the cell below, which might be useful code if you try to do this on a subset of the columns in a DataFrame.

df = time1_month_pivot
new_dict = {}
for col in df.columns:
    col_series = df[col]
    mn, std = col_series.mean(),col_series.std()
    # centering (the numerator)  & scaling (the denominator)
    new_dict[col] = (col_series - mn)/std
new_df = pd.DataFrame(new_dict)
new_df
AA GE IBM
month
1990-02 -1.288710 -1.418216 -1.172091
1990-03 -1.255604 -1.409703 -1.155745
1990-04 -1.278598 -1.402754 -1.148573
1990-05 -1.264343 -1.391229 -1.116268
1990-06 -1.251783 -1.380342 -1.096900
... ... ... ...
2011-06 -0.213580 0.032714 2.336573
2011-07 -0.186770 0.052048 2.654514
2011-08 -0.529118 -0.194660 2.439593
2011-09 -0.647991 -0.224871 2.464671
2011-10 -0.796747 -0.208138 2.754710

261 rows × 3 columns

new_df.plot()
<AxesSubplot:xlabel='month'>
../_images/output_110_1.png

Delta calculation: The winner, as the picture suggests, is IBM.

# (Values on last day) - (Values on first day)
(new_df.iloc[-1] - new_df.iloc[0])
variable
AA     0.491963
GE     1.210078
IBM    3.926800
dtype: float64

Of course it always matters what time period you choose: Same steps as above over different time period.

Different answer for what the best stock is.

time2 = time1[time1['year']< 2008]
time2_melt = pd.melt(time2, id_vars=['month'], value_vars=['AA', 'GE','IBM'])
time2_month_pivot = time2_melt.pivot_table('value','month','variable')
df = time2_month_pivot
mn,std = df.mean(),df.std()
# centering (the numerator)  & scaling (the denominator)
new_df=(df-mn)/std

Revised plot.

new_df.plot()
<AxesSubplot:xlabel='month'>
../_images/output_116_1.png

Revised delta calculation:

(new_df.iloc[-1] - new_df.iloc[0])
variable
AA     2.892419
GE     2.507379
IBM    2.496016
dtype: float64

7.3. Merging

In this section we introduce pandas merging function, which merges the data in two different DataFrames into one DataFrame.

We will look at an example that illustrates the simplest case, the case where the indexes of the two DatFrames agree. The operation is very much like the following operation in numpy:

import numpy as np
## Prepare a1
a1 = np.ones((3,4))
a1[1] =np.zeros(4,)
print(a1)
## Prepare a2
a2 = np.zeros((3,5))
a2[1] = np.ones((5,))
print(a2)
# Merge
np.concatenate([a1,a2],axis=1)
[[1. 1. 1. 1.]
 [0. 0. 0. 0.]
 [1. 1. 1. 1.]]
[[0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0.]]
array([[1., 1., 1., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 0., 0., 0., 0., 0.]])

We concatenate the rows of the first array with the row sof the second, ending up with a new array with 9 columns.

The pandas merge operation will be very similar, except that in pandas merging, the order of the rows won’t matter. The indexes of the two DataFrames are used to align the rows.

We begin by loading the first of our two DataFrames, which contains LifeSatisfaction data on many countries.

import pandas as pd
import importlib.util
import os.path
#from google.colab import drive

notebook_lifesat_url0 = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
oecd_file = 'oecd_bli_2015.csv'
oecd_url= f'{notebook_lifesat_url0}{oecd_file}'
# Clean up some number formatting issues '1,000' => 1000
# Unicode characters that can only be interpreted with an encoding
oecd_bli = pd.read_csv(oecd_url, thousands=',',encoding='utf-8')
# Focus on a subset of of the daty
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]

This table contains economic and social statistics for people in a number of countries. The INEQUALITY attribute contains statistics for subpopulations like low/high income, men/women. Since we won’t be looking at those sub-populations in this section, the first step after reading in the data is to reduce the table to those rows containing statistics about the total population.

The data in this big table is stored in an interesting and very popular format. Let’s understand that before moving on. First there are facts about 36 distinct countries. One of the names in the Country column (OECD - Total) is a label under which totals for all the countries will be aggregated.

countries = oecd_bli['Country'].unique()
print(f'{len(countries)} countries in data')
print (countries)
37 countries in data
['Australia' 'Austria' 'Belgium' 'Canada' 'Czech Republic' 'Denmark'
 'Finland' 'France' 'Germany' 'Greece' 'Hungary' 'Iceland' 'Ireland'
 'Italy' 'Japan' 'Korea' 'Luxembourg' 'Mexico' 'Netherlands' 'New Zealand'
 'Norway' 'Poland' 'Portugal' 'Slovak Republic' 'Spain' 'Sweden'
 'Switzerland' 'Turkey' 'United Kingdom' 'United States' 'Brazil' 'Chile'
 'Estonia' 'Israel' 'Russia' 'Slovenia' 'OECD - Total']

The cell below shows what happens when we zoom in on one country, Poland. The table contains a number of rows with information about Poland, each with a different value in the Indicator column. That is the name of some statistic about Poland. The numerical value for that statistic is in the Value column and the unit for that statistic and the unit is in the UNIT CODE (or Unit) column. So the first row printed out tells us that 3.2% of all households in Poland are dwellings without basic facilities, an indicator of substantial poverty.

pol = oecd_bli[oecd_bli["Country"]=="Poland"]
pol.iloc[:,:12]
LOCATION Country INDICATOR Indicator MEASURE Measure INEQUALITY Inequality Unit Code Unit PowerCode Code PowerCode
21 POL Poland HO_BASE Dwellings without basic facilities L Value TOT Total PC Percentage 0 units
130 POL Poland HO_HISH Housing expenditure L Value TOT Total PC Percentage 0 units
239 POL Poland HO_NUMR Rooms per person L Value TOT Total RATIO Ratio 0 units
348 POL Poland IW_HADI Household net adjusted disposable income L Value TOT Total USD US Dollar 0 units
531 POL Poland IW_HNFW Household net financial wealth L Value TOT Total USD US Dollar 0 units
640 POL Poland JE_EMPL Employment rate L Value TOT Total PC Percentage 0 units
825 POL Poland JE_JT Job security L Value TOT Total PC Percentage 0 units
936 POL Poland JE_LTUR Long-term unemployment rate L Value TOT Total PC Percentage 0 units
1121 POL Poland JE_PEARN Personal earnings L Value TOT Total USD US Dollar 0 units
1306 POL Poland SC_SNTWS Quality of support network L Value TOT Total PC Percentage 0 units
1473 POL Poland ES_EDUA Educational attainment L Value TOT Total PC Percentage 0 units
1584 POL Poland ES_STCS Student skills L Value TOT Total AVSCORE Average score 0 units
1769 POL Poland ES_EDUEX Years in education L Value TOT Total YR Years 0 units
1880 POL Poland EQ_AIRP Air pollution L Value TOT Total MICRO_M3 Micrograms per cubic metre 0 units
1989 POL Poland EQ_WATER Water quality L Value TOT Total PC Percentage 0 units
2100 POL Poland CG_TRASG Consultation on rule-making L Value TOT Total AVSCORE Average score 0 units
2209 POL Poland CG_VOTO Voter turnout L Value TOT Total PC Percentage 0 units
2394 POL Poland HS_LEB Life expectancy L Value TOT Total YR Years 0 units
2505 POL Poland HS_SFRH Self-reported health L Value TOT Total PC Percentage 0 units
2690 POL Poland SW_LIFS Life satisfaction L Value TOT Total AVSCORE Average score 0 units
2869 POL Poland PS_SFRV Assault rate L Value TOT Total PC Percentage 0 units
2980 POL Poland PS_REPH Homicide rate L Value TOT Total RATIO Ratio 0 units
3091 POL Poland WL_EWLH Employees working very long hours L Value TOT Total PC Percentage 0 units
3202 POL Poland WL_TNOW Time devoted to leisure and personal care L Value TOT Total HOUR Hours 0 units

We can use the pivot method to recast the data into a much easier to grasp format. The key point is that each country and INDICATOR determines a specific value. So let’s have one row for each country with one column for each INDICATOR, and in that column we’ll place the VALUE associated with that country and that indicator. It’s as easy as this:

oecd_bli_pt = oecd_bli.pivot_table(index="Country", columns="Indicator", values="Value")
# Looking at one row
oecd_bli_pt.loc['Poland']
Indicator
Air pollution                                   33.00
Assault rate                                     1.40
Consultation on rule-making                     10.80
Dwellings without basic facilities               3.20
Educational attainment                          90.00
Employees working very long hours                7.41
Employment rate                                 60.00
Homicide rate                                    0.90
Household net adjusted disposable income     17852.00
Household net financial wealth               10919.00
Housing expenditure                             21.00
Job security                                     7.30
Life expectancy                                 76.90
Life satisfaction                                5.80
Long-term unemployment rate                      3.77
Personal earnings                            22655.00
Quality of support network                      91.00
Rooms per person                                 1.10
Self-reported health                            58.00
Student skills                                 521.00
Time devoted to leisure and personal care       14.20
Voter turnout                                   55.00
Water quality                                   79.00
Years in education                              18.40
Name: Poland, dtype: float64

Looking back to compare one value in the original DF, with the derived data in the pivot:

poland = oecd_bli[oecd_bli['Country']=='Poland']
poland[poland['Indicator']  == 'Job security']['Value']
825    7.3
Name: Value, dtype: float64
oecd_bli_pt.loc['Poland'].loc['Job security']
7.3

What is weird about what we just did with pivot_table?

The example above is not an outlier. For every country/indicator pair, the value in the 'Value' column in the original DataFrame is exactly the same as the value in the corresponding row and column of the pivot table. It appears as if no aggregation operation has taken place!

Actually appearances are deceiving. What has happened is that in this particular data set all the row groups we get by pairing country and indicator are of size 1; the default aggregation operation is mean and when we take the mean of a single number we get that number back.

So what is the point of creating a pivot table then? The point is that we’ve restructured the data into a more comprehensible
(and more easily manipulated) format; and that is one of the most important uses of the pivot table operation. Sometimes we want to collapse multiple rows into one just because the data is easier to work with that way. Compare the original DataFrame with the pivot table and see if you don’t agree that this data set is a case in point.

In future exercises with this data, we’re going to take particular interest in the Life satisfaction score, a kind of general “quality of life” or “happiness” score computed from a formula combining the indicators in this data.

That is, it is a value computed from the other values in the same row. In a regression context, we would think of this as the dependent variable, the variable whose value we try to compute as some function of the others.

oecd_bli_pt["Life satisfaction"].head()
Country
Australia    7.3
Austria      6.9
Belgium      6.9
Brazil       7.0
Canada       7.3
Name: Life satisfaction, dtype: float64

Elsewhere, on the world wide web, with help from Google, we find data about GDP (“gross domestic product”) here.

This is the second of the two DataFrames we will merge.

This is a very simple data frame with columns containing various useful country-specific pieces of information. Since there is one row per country, we have made the 'Country’ column the index.

# Downloaded data from http://goo.gl/j1MSKe (=> imf.org) to github
gdp_file = "gdp_per_capita.csv"
gdp_url = f'{notebook_lifesat_url0}{gdp_file}'
gdp_per_capita = pd.read_csv(gdp_url, thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
# Make "Country" the index column.  We are going to merge data on this column.
gdp_per_capita.set_index("Country", inplace=True)
#gdp_per_capita.head(2)
gdp_per_capita.index
Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
       ...
       'United States', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe', 'International Monetary Fund, World Economic Outlook Database, April 2016'], dtype='object', name='Country', length=190)

Note that this lines up reasonably well with the index of the pivot table we created above.

oecd_bli_pt.index
Index(['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Chile',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany',
       'Greece', 'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Korea', 'Luxembourg', 'Mexico', 'Netherlands', 'New Zealand', 'Norway',
       'OECD - Total', 'Poland', 'Portugal', 'Russia', 'Slovak Republic',
       'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
       'United Kingdom', 'United States'],
      dtype='object', name='Country')

We now engage in the great magic, the single most important operation by which information is created, the merge. We are going to take the quality of life data, which is indexed by country, and the GDP data, now also indexed by country, and merge rows, producing one large table which contains all the rows and columns of the oecd_bli table, as well as a new GDP per Capita column.

full_country_stats = pd.merge(left=oecd_bli_pt, right=gdp_per_capita, left_index=True, right_index=True)
full_country_stats.sort_values(by="GDP per capita", inplace=True)
len(oecd_bli_pt),len(gdp_per_capita),len(full_country_stats)
(37, 190, 36)

Note that the gdp data cotained many rows not contained in the life satisfaction data index. Those rows were ignored.

We have a new DataFrame combining the columns of the two merged DataFrames:

print(len(oecd_bli_pt.columns),len(gdp_per_capita.columns),len(full_country_stats.columns))
full_country_stats.head(2)
24 6 30
Air pollution Assault rate Consultation on rule-making Dwellings without basic facilities Educational attainment Employees working very long hours Employment rate Homicide rate Household net adjusted disposable income Household net financial wealth Housing expenditure Job security Life expectancy Life satisfaction Long-term unemployment rate Personal earnings Quality of support network Rooms per person Self-reported health Student skills Time devoted to leisure and personal care Voter turnout Water quality Years in education Subject Descriptor Units Scale Country/Series-specific Notes GDP per capita Estimates Start After
Country
Brazil 18.0 7.9 4.0 6.7 45.0 10.41 67.0 25.5 11664.0 6844.0 21.0 4.6 73.7 7.0 1.97 17177.0 90.0 1.6 69.0 402.0 14.97 79.0 72.0 16.3 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 8669.998 2014.0
Mexico 30.0 12.8 9.0 4.2 37.0 28.83 61.0 23.4 13085.0 9056.0 21.0 4.9 74.6 6.7 0.08 16193.0 77.0 1.0 66.0 417.0 13.89 63.0 67.0 14.4 Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, curren... 9009.280 2015.0

We print out a sub DataFrame containing just the GDP per capita and Life Satisfaction columns; we look at the "United States" and following rows:

full_country_stats[["GDP per capita", 'Life satisfaction']].loc["United States":]
GDP per capita Life satisfaction
Country
United States 55805.204 7.2
Norway 74822.106 7.4
Switzerland 80675.308 7.5
Luxembourg 101994.093 6.9

In a future exercise we will try to run a linear regression analysis to predict Life Satisfaction.

As a first step, it might be interesting to see how well we could do just using GDP, though a close look at the rows above shows no linear predictor will be perfect: An increase in GDP does not necessarily predict an increase in life satisfaction.

Summing up what we did in this section. We loaded two Data frames and merged them on their indexes, in effect concatenating the rows of the first dataFrame with the rows of the second.

In order to execute the merge we had to make sure their indexes contained the same sort of things. We modified he first dataset with a pivot table operation that made 'Countries' the index. We modified the second dataset by promoting the 'Countries' column to be the index.

7.3.1. Another kind of merging: Join

We look at a second Merge example involving a slightly different operation join.

We will merge data about U.S. Covid case numbers with geographical data, to allow visualizing the data on maps.

import pandas as pd
from datetime import datetime


def geoid2code(geoid):
    return int(geoid[4:])

# this data set has cumulative stats
nyt_github_covid_cumulative = 'https://raw.githubusercontent.com/nytimes/'\
                        'covid-19-data/master/us-counties.csv'
nyt_github_covid_rolling_avg = 'https://raw.githubusercontent.com/nytimes/'\
                        'covid-19-data/master/rolling-averages/us-counties.csv'
df = pd.read_csv(nyt_github_covid_rolling_avg,converters=dict(geoid=geoid2code))
#df.rename(columns={'geoid': 'GEOID'},inplace=True)

start, end = datetime.fromisoformat(df['date'].min()),\
               datetime.fromisoformat(df['date'].max())

One row per county day pair containing various kind of Covide case data.

df2 = df[(df['county']=='Cook') & (df['state']=='Illinois')]
df2
date geoid county state cases cases_avg cases_avg_per_100k deaths deaths_avg deaths_avg_per_100k
4 2020-01-24 17031 Cook Illinois 1 0.14 0.00 0 0.00 0.00
6 2020-01-25 17031 Cook Illinois 0 0.14 0.00 0 0.00 0.00
9 2020-01-26 17031 Cook Illinois 0 0.14 0.00 0 0.00 0.00
14 2020-01-27 17031 Cook Illinois 0 0.14 0.00 0 0.00 0.00
19 2020-01-28 17031 Cook Illinois 0 0.14 0.00 0 0.00 0.00
... ... ... ... ... ... ... ... ... ... ...
1760590 2021-09-25 17031 Cook Illinois 0 865.71 16.81 0 8.86 0.17
1763838 2021-09-26 17031 Cook Illinois 0 865.71 16.81 0 8.86 0.17
1767086 2021-09-27 17031 Cook Illinois 1924 786.86 15.28 21 8.57 0.17
1770334 2021-09-28 17031 Cook Illinois 486 748.00 14.52 17 10.29 0.20
1773582 2021-09-29 17031 Cook Illinois 764 723.00 14.04 4 9.29 0.18

615 rows × 10 columns

7.3.2. Adding Lat Longs to the data

The Geographical info is stored in a column named 'geoid' (note the lower case).

df[:5]
date geoid county state cases cases_avg cases_avg_per_100k deaths deaths_avg deaths_avg_per_100k
0 2020-01-21 53061 Snohomish Washington 1 0.14 0.02 0 0.0 0.0
1 2020-01-22 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0
2 2020-01-23 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0
3 2020-01-24 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0
4 2020-01-24 17031 Cook Illinois 1 0.14 0.00 0 0.0 0.0

The 'geoid' column contains FIPS geographical codes that we can use to make maps visualizing the geographic distribution of the Covid data.

The problem is many of the programs designed for that purpose want lat/long coordinates.

Solution: we go to the census.gov site to find mappings from geocodes to lat/longs. We turn this new data into a pandas DataFrame. We then join the new DataFrame to our old one.

#Normally you'd get this data at census.giv from a compressed file.
true_url = 'https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2021_Gazetteer/'\
      '2021_Gaz_counties_national.zip'
# To simplify things Ive uncompressed it and copied it here, uncompressed,
url = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'\
      '202### Doing the join1_Gaz_counties_national.txt'
# The file uses tabs, not "," as a separator.  `pd.read_csv` still works if you
# tell it that.
codes = pd.read_csv(url,sep='\t')
# last column name misparsed, many spaces added.  data cleanup
long = codes.columns[-1]
codes.rename(columns={long: long.strip()},inplace=True)

Note that the new data set has both a GEOID column and LAT and LONG columns; note the case of ‘GEOID’.

In fact, the LAT/LONGs are located at the centers of the GEOID regions, so each GEOID value is associated with exactly one LAT/LONG pair.

codes[:5]
USPS GEOID ANSICODE NAME ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
0 AL 1001 161526 Autauga County 1539634184 25674812 594.456 9.913 32.532237 -86.646440
1 AL 1003 161527 Baldwin County 4117656514 1132955729 1589.836 437.437 30.659218 -87.746067
2 AL 1005 161528 Barbour County 2292160149 50523213 885.008 19.507 31.870253 -85.405104
3 AL 1007 161529 Bibb County 1612188717 9572303 622.470 3.696 33.015893 -87.127148
4 AL 1009 161530 Blount County 1670259090 14860281 644.891 5.738 33.977358 -86.566440

In addition, each county has a unique GEOID.

codes[(codes['NAME']=='King County')&(codes['USPS']=='WA')]
USPS GEOID ANSICODE NAME ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
2970 WA 53033 1531933 King County 5479337396 496938967 2115.584 191.869 47.490552 -121.833977

Going back from the GEOID gets us the same row.

codes[(codes['GEOID']==53033)]
USPS GEOID ANSICODE NAME ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
2970 WA 53033 1531933 King County 5479337396 496938967 2115.584 191.869 47.490552 -121.833977

7.3.3. Doing the join

In this section we explain the join operation that we’ll use to merge the lat-long info in the codes table with the Covid data.

The idea is to join on GEOID. Since each GEOID identifies a unique LAT/LONG pair, let’s zero in on just the columns we want to merge into the Covid data.

#Make the subtable we're going to join to.
geoid_lat_long = codes[['GEOID','INTPTLAT','INTPTLONG']]
geoid_lat_long[:5]
GEOID INTPTLAT INTPTLONG
0 1001 32.532237 -86.646440
1 1003 30.659218 -87.746067
2 1005 31.870253 -85.405104
3 1007 33.015893 -87.127148
4 1009 33.977358 -86.566440

We promote the GEOID column to be an index.

geoid_lat_long_gind = geoid_lat_long.set_index('GEOID')
geoid_lat_long_gind[:5]
INTPTLAT INTPTLONG
GEOID
1001 32.532237 -86.646440
1003 30.659218 -87.746067
1005 31.870253 -85.405104
1007 33.015893 -87.127148
1009 33.977358 -86.566440

Let’s prepare some test rows we’ll check to see the join goes right.

Here are some rows from the Covid data that all share their GEOID (because they contain data for the same county on different days). After the join, we’d like all these rows linked to the same Lat/Long info.

df[:4]
date geoid county state cases cases_avg cases_avg_per_100k deaths deaths_avg deaths_avg_per_100k
0 2020-01-21 53061 Snohomish Washington 1 0.14 0.02 0 0.0 0.0
1 2020-01-22 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0
2 2020-01-23 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0
3 2020-01-24 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0

Here we do the join, and we see it has the desired effect.

The new DataFrame has two new LAT/LONG columns. Rows with the same GEOID value have the same LAT/LONG values.

new_df = df.join(geoid_lat_long_gind,on='geoid')
new_df[:4]
date geoid county state cases cases_avg cases_avg_per_100k deaths deaths_avg deaths_avg_per_100k INTPTLAT INTPTLONG
0 2020-01-21 53061 Snohomish Washington 1 0.14 0.02 0 0.0 0.0 48.054913 -121.765038
1 2020-01-22 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0 48.054913 -121.765038
2 2020-01-23 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0 48.054913 -121.765038
3 2020-01-24 53061 Snohomish Washington 0 0.14 0.02 0 0.0 0.0 48.054913 -121.765038

Let’s summarize what we did.

  1. We had two columns (INTPTLAT,INTPTLONG) in the join DataFrame (geoid_lat_long_gind) that we wanted to distribute to the rows of the Covid df (df).

  2. We identified a column in df that we were going to merge on ('geoid'). This column is like an indexing column in a pivot or cross-tabulation. It divides the rows of df into groups. Each group should get the same INTPTLAT, INTPTLONG values.

  3. We promoted the geoid column in geoid_lat_long_gind to be the index. Note the name of this promoted column didn’t matter ('geoid' and 'GEOID' are different). What mattered is that both 'geoid' and 'GEOID' contained the same value set, geoid codes.

  4. We did the join, creating a new data frame with the same number of rows as df, but two new columns.

len(df),len(new_df),len(df.columns),len(new_df.columns)
(1774204, 1774204, 10, 12)

We join one DataFrame df with another join DataFrame.

What makes a join different from a merge is that there has to be a column in df that is joined on; and the values in that column can all be found in the index of the join DataFrame.

The data in the join DataFrame is then distributed to the rows of df according to the row groups defined by the joined-on column.

We do not show it here, but it is possible to join on more than one column, just as it is possible to use more than one column for grouping in a cross-tabulation or a pivot table.

7.3.3.1. Summary and conclusion

In Part two of our introduction to pandas we focused on pivoting and merging.

Both these topics fall under the general heading of restructuring data. Pivot tables collapse multiple rows of the input dfata, often making it easier to understand; often, but not always, that involves performing some aggregation operation on groups of rows.

Merging operations combine data from two DataFrames (often from two different data sources). Merging results in DataFrame with a different number of columns than either of the inputs. This counts as restructuring. But where pivoting and cross tabulation are essentially ways of computing staistical summaries, merging has the potential creating new information and adding value. Adding GDP numbers to the life satisfaction data opened up new ways of constructing a data model. Adding lat longs to the Covid data opened the door to new kinds of visualization.