**Chapter 1 – The Machine Learning landscape**

_This is the code used to generate some of the figures in chapter 1._

# Setup

First, let's make sure this notebook works in both Python 2 and Python 3, import a few common modules, make Matplotlib plot figures inline, and prepare a function to save the figures:

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import numpy.random as rnd
import os

# to make this notebook's output stable across runs
rnd.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "fundamentals"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
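
`save_fig` writes into `images/fundamentals` under the project root, and `plt.savefig` does not create missing folders, so that directory has to exist before the first figure is saved. Here is a minimal sketch of creating it, assuming the same folder layout as above. The helper name `ensure_fig_dir` is hypothetical and not part of the notebook, and the demo uses a temporary folder as a stand-in for the project root.

```python
import os
import tempfile

def ensure_fig_dir(root=".", chapter_id="fundamentals"):
    """Create <root>/images/<chapter_id> if it is missing and return its path."""
    path = os.path.join(root, "images", chapter_id)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Demo in a throwaway folder so nothing is written next to the notebook
demo_root = tempfile.mkdtemp()
fig_dir = ensure_fig_dir(demo_root)
print(os.path.isdir(fig_dir))
```

In the notebook itself you would call `ensure_fig_dir(PROJECT_ROOT_DIR, CHAPTER_ID)` once before the first call to `save_fig`.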

# Load and prepare Life satisfaction data

Before starting this exercise, create a folder called `datasets` in the folder where this notebook is stored. We will place various data files in that folder, and the code below is written to look there. We will also create subfolders inside `datasets`; if you want, you can create the first one right away. It's called `lifesat`.

For example, suppose your notebook is stored on a Unix-like system in the folder `/Users/fred/Desktop/school/python_for_ss`. That folder has a subfolder called `datasets`, which in turn has a subfolder called `lifesat`. So the full path to the `lifesat` data looks like this:

```
/Users/fred/Desktop/school/python_for_ss/datasets/lifesat
```

The OECD (Organisation for Economic Co-operation and Development) statistics website offers all kinds of economic statistics on countries in downloadable form, including a very popular stripped-down spreadsheet format called ".csv" (for comma-separated values). You will download a local copy. The particular dataset we want is the BLI ("Better Life Index") data.

It contains a number of economic and social variables used to estimate "Quality of Life" in communities large and small. Try this now so you have a local copy of the data: visit [here](http://stats.oecd.org/index.aspx?DataSetCode=BLI) and choose the year 2015 so you get numbers like the ones below. Then pull down the export menu and choose CSV format. Download the file into the `lifesat` subfolder of the `datasets` folder and save it as `oecd_bli_2015.csv`, the file name the code below expects.

In [2]:
import pandas as pd

# Download CSV from http://stats.oecd.org/index.aspx?DataSetCode=BLI
datapath = "datasets/lifesat/"

oecd_bli = pd.read_csv(datapath+"oecd_bli_2015.csv", thousands=',')
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]

This table contains economic and social statistics for people in a number of countries.  The `INEQUALITY` attribute
is for looking at subpopulations like low/high income, men/women.  Since we won't be looking at those
sub-populations in this exercise, the first step after reading in the data is to reduce the table to those 
rows containing statistics about the total population.
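
The filtering step above uses a boolean mask: comparing a column to `"TOT"` gives one True/False per row, and indexing the table with that mask keeps only the True rows. Here is a small sketch on a made-up table. The countries and values, and the codes other than `TOT`, are placeholders for illustration and not taken from the OECD file.

```python
import pandas as pd

# Toy rows (made-up values) laid out like the OECD table
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "INEQUALITY": ["TOT", "MN", "TOT", "WMN"],  # non-TOT codes are placeholders
    "Value": [1.0, 2.0, 3.0, 4.0],
})

mask = toy["INEQUALITY"] == "TOT"   # one boolean per row
totals = toy[mask]                  # keep only the rows where mask is True
print(totals)
```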

The data in this big table is stored in an interesting and very popular format. Let's understand that before moving on. First, the `Country` column takes 37 distinct values: 36 countries plus `OECD - Total`, a label under which totals for all the countries are aggregated.

In [3]:
countries = set(oecd_bli['Country'])
print(len(countries), 'countries in data')
print(countries)

37 countries in data
set(['Canada', 'Turkey', 'Italy', 'Czech Republic', 'Luxembourg', 'France', 'Slovak Republic', 'Ireland', 'Norway', 'Israel', 'Australia', 'Iceland', 'Slovenia', 'Germany', 'Chile', 'Belgium', 'Spain', 'Netherlands', 'Denmark', 'Poland', 'Finland', 'OECD - Total', 'United States', 'Sweden', 'Korea', 'Japan', 'Switzerland', 'New Zealand', 'Russia', 'Brazil', 'Estonia', 'Portugal', 'Mexico', 'United Kingdom', 'Austria', 'Greece', 'Hungary'])


The cell below shows what happens when we zoom in on one country, Poland. The table contains a number of rows with information about Poland, each with a different value in the `INDICATOR` column (code name) and the `Indicator` column (English name). That value names some statistic about Poland. The numerical value of the statistic is in the `Value` column, and its unit is in the `Unit Code` (or `Unit`) column. So the first row printed tells us that the `Dwellings without basic facilities` indicator for Poland is 3.2%, a statistic that serves as an indicator of poverty.

In [4]:
pol = oecd_bli[oecd_bli["Country"]=="Poland"]
pol

Unnamed: 0,LOCATION,Country,INDICATOR,Indicator,MEASURE,Measure,INEQUALITY,Inequality,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
21,POL,Poland,HO_BASE,Dwellings without basic facilities,L,Value,TOT,Total,PC,Percentage,0,units,,,3.2,,
130,POL,Poland,HO_HISH,Housing expenditure,L,Value,TOT,Total,PC,Percentage,0,units,,,21.0,E,Estimated value
239,POL,Poland,HO_NUMR,Rooms per person,L,Value,TOT,Total,RATIO,Ratio,0,units,,,1.1,,
348,POL,Poland,IW_HADI,Household net adjusted disposable income,L,Value,TOT,Total,USD,US Dollar,0,units,,,17852.0,E,Estimated value
531,POL,Poland,IW_HNFW,Household net financial wealth,L,Value,TOT,Total,USD,US Dollar,0,units,,,10919.0,,
640,POL,Poland,JE_EMPL,Employment rate,L,Value,TOT,Total,PC,Percentage,0,units,,,60.0,,
825,POL,Poland,JE_JT,Job security,L,Value,TOT,Total,PC,Percentage,0,units,,,7.3,,
936,POL,Poland,JE_LTUR,Long-term unemployment rate,L,Value,TOT,Total,PC,Percentage,0,units,,,3.77,,
1121,POL,Poland,JE_PEARN,Personal earnings,L,Value,TOT,Total,USD,US Dollar,0,units,,,22655.0,,
1306,POL,Poland,SC_SNTWS,Quality of support network,L,Value,TOT,Total,PC,Percentage,0,units,,,91.0,,


We can use the `pivot` method to recast the data into a format that is much easier to grasp. The key point is that each combination of country and indicator determines a single value. So let's have one row for each country and one column for each `Indicator`, and in each cell we'll place the `Value` associated with that country and that indicator. It's as easy as this:

In [5]:

oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
oecd_bli.head(2)

Country,Air pollution,Assault rate,Consultation on rule-making,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Homicide rate,Household net adjusted disposable income,Household net financial wealth,...,Long-term unemployment rate,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
Australia,13.0,2.1,10.5,1.1,76.0,14.02,72.0,0.8,31588.0,47657.0,...,1.08,50449.0,92.0,2.3,85.0,512.0,14.41,93.0,91.0,19.4
Austria,27.0,3.4,7.1,1.0,83.0,7.61,72.0,0.4,31173.0,49887.0,...,1.19,45199.0,89.0,1.6,69.0,500.0,14.46,75.0,94.0,17.0
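
To see the reshaping in isolation, here is a small sketch of `pivot` on a made-up table with one row per (country, indicator) pair. The country names and numbers are invented for illustration. The second half shows why the earlier filtering to `TOT` rows matters: `pivot` needs each (index, column) pair to occur only once.

```python
import pandas as pd

# One row per (Country, Indicator) pair; all values are made up
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["Employment rate", "Life satisfaction"] * 2,
    "Value": [60.0, 5.8, 72.0, 7.3],
})

# One row per country, one column per indicator
wide_df = long_df.pivot(index="Country", columns="Indicator", values="Value")
print(wide_df)

# Repeating a (Country, Indicator) pair makes the reshape ambiguous
dup_df = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup_df.pivot(index="Country", columns="Indicator", values="Value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)
```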


In the exercise ahead, we're going to take particular interest in the `Life satisfaction` score, a kind of general "quality of life" or "happiness" score. In the OECD data it is a self-reported rating on a scale from 0 to 10 rather than a value computed from the other indicators.

In [6]:
oecd_bli["Life satisfaction"].head()

Country
Australia    7.3
Austria      6.9
Belgium      6.9
Brazil       7.0
Canada       7.3
Name: Life satisfaction, dtype: float64

Notice that when we print the `Life satisfaction` column, the country names are printed too. This is because `Country` was made the index of the new table when we used the `pivot` method. Think of the index column (or columns) as providing a unique name for each row.
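
Because the index names each row, a row can be looked up by its label with `.loc`. Here is a minimal sketch that rebuilds the first two entries of the output above by hand as a standalone `Series`, rather than reading them from the data file:

```python
import pandas as pd

# First two entries from the output above, rebuilt by hand
scores = pd.Series(
    [7.3, 6.9],
    index=pd.Index(["Australia", "Austria"], name="Country"),
    name="Life satisfaction",
)

print(scores.loc["Austria"])    # look up a row by its index label
print(scores.index.is_unique)   # each label names exactly one row
```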

# Load and prepare GDP per capita data

Elsewhere on the web, with help from Google, we find data about GDP ("gross domestic product") per capita
[here](http://goo.gl/j1MSKe). Hit the download button and save the file as `gdp_per_capita.csv` in the same
directory as the previous data.

In [7]:
# Download data from http://goo.gl/j1MSKe (=> imf.org)
gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
# Make "Country" the index column.  We are going to merge data on this column.
gdp_per_capita.set_index("Country", inplace=True)
gdp_per_capita.head(2)

Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,GDP per capita,Estimates Start After
Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.383,2010.0
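
The `read_csv` options used above can be tried on a tiny in-memory file. This is a sketch under assumptions: the two rows below are invented and only mimic the layout of the real file (tab-separated, commas as thousands separators, `n/a` for missing values).

```python
import io
import pandas as pd

# Invented tab-separated text standing in for the real download
raw = "Country\t2015\nA\t1,234.5\nB\tn/a\n"

toy_gdp = pd.read_csv(io.StringIO(raw), delimiter="\t",
                      thousands=",", na_values="n/a")
toy_gdp = toy_gdp.rename(columns={"2015": "GDP per capita"})
toy_gdp = toy_gdp.set_index("Country")
print(toy_gdp)
```

Here `thousands=","` lets `1,234.5` be parsed as a number, and `na_values="n/a"` turns the missing entry into `NaN` instead of leaving it as text.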


We now come to the `merge`, one of the most important operations for combining information from different sources.
We take the quality-of-life data, which is indexed by country, and the GDP data, which is now also indexed
by country, and merge rows with matching index values. The result is one large table with a row for each country that
appears in both tables, containing all the columns of the `oecd_bli` table plus the columns of the GDP table, including the new `GDP per capita` column.

In [8]:
full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)
full_country_stats.sort_values(by="GDP per capita", inplace=True)
full_country_stats

Country,Air pollution,Assault rate,Consultation on rule-making,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Homicide rate,Household net adjusted disposable income,Household net financial wealth,...,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education,Subject Descriptor,Units,Scale,Country/Series-specific Notes,GDP per capita,Estimates Start After
Brazil,18.0,7.9,4.0,6.7,45.0,10.41,67.0,25.5,11664.0,6844.0,...,14.97,79.0,72.0,16.3,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",8669.998,2014.0
Mexico,30.0,12.8,9.0,4.2,37.0,28.83,61.0,23.4,13085.0,9056.0,...,13.89,63.0,67.0,14.4,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",9009.28,2015.0
Russia,15.0,3.8,2.5,15.1,94.0,0.16,69.0,12.8,19292.0,3412.0,...,14.97,65.0,56.0,16.0,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",9054.914,2015.0
Turkey,35.0,5.0,5.5,12.7,34.0,40.86,50.0,1.2,14095.0,3251.0,...,13.42,88.0,62.0,16.4,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",9437.372,2013.0
Hungary,15.0,3.6,7.9,4.8,82.0,3.19,58.0,1.3,15442.0,13277.0,...,15.04,62.0,77.0,17.6,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",12239.894,2015.0
Poland,33.0,1.4,10.8,3.2,90.0,7.41,60.0,0.9,17852.0,10919.0,...,14.2,55.0,79.0,18.4,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",12495.334,2014.0
Chile,46.0,6.9,2.0,9.4,57.0,15.42,62.0,4.4,14533.0,17733.0,...,14.41,49.0,73.0,16.5,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",13340.905,2014.0
Slovak Republic,13.0,3.0,6.6,0.6,92.0,7.02,60.0,1.2,17503.0,8663.0,...,14.99,59.0,81.0,16.3,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",15991.736,2015.0
Czech Republic,16.0,2.8,6.8,0.9,92.0,6.98,68.0,0.8,18404.0,17299.0,...,14.98,59.0,85.0,18.1,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",17256.918,2015.0
Estonia,9.0,5.5,3.3,8.1,90.0,3.3,68.0,4.8,15167.0,7680.0,...,14.9,64.0,79.0,17.5,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",17288.083,2014.0
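
Here is a small sketch of the same index-on-index merge using two made-up tables. All labels and numbers are invented for illustration. With its default settings `pd.merge` performs an inner join, so a label that appears in only one table (such as an aggregate row with no GDP entry) is dropped from the result.

```python
import pandas as pd

# Made-up tables, both indexed by Country
left = pd.DataFrame(
    {"Life satisfaction": [7.0, 6.0, 6.5]},
    index=pd.Index(["A", "B", "Aggregate"], name="Country"),
)
right = pd.DataFrame(
    {"GDP per capita": [300.0, 100.0, 200.0]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

# Match rows on the index of both tables (inner join by default)
merged = pd.merge(left=left, right=right, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged)
```

Only `A` and `B` survive the merge, and after sorting `B` comes first because its GDP value is lower.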


In [9]:
full_country_stats[["GDP per capita", 'Life satisfaction']].loc["United States"]

GDP per capita       55805.204
Life satisfaction        7.200
Name: United States, dtype: float64