9.5. Visualizing Geographical Data

9.5.1. Loading the data

import os
import pandas as pd
HOUSING_PATH = "data"
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
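The plotting cells later in this section call a save_fig helper and reference a PROJECT_ROOT_DIR constant, neither of which is defined here. A minimal sketch of what they might look like, assuming figures are written to an images/ directory under the project root:

import os
import matplotlib.pyplot as plt

PROJECT_ROOT_DIR = "."  # assumed project layout
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")  # assumed output directory

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Save the current matplotlib figure as images/<fig_id>.png
    os.makedirs(IMAGES_PATH, exist_ok=True)
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)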

9.5.2. Eyeballing the data

Let’s look at some statewide housing data for California.

Each row represents one district in California. The districts range in size from 1 household to over 6,000. They vary in location, income level, and of course, median house value.

housing = load_housing_data()
housing.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
print(housing.describe())
          longitude      latitude  housing_median_age   total_rooms
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(11,8))
save_fig("histograms")
plt.show()
Saving figure histograms
[figure: histograms of the housing attributes]

The median income data has been scaled to unfamiliar units, and it turns out to be capped at both ends: the highest value is about 15 and the lowest is about 0.5. This may create some issues, but let’s work with it.
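A quick check (not in the original cells) confirms the caps:

print(housing["median_income"].min(), housing["median_income"].max())
# 0.4999 15.0001, matching the describe() output above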

9.5.3. Transforming data: Binning

Despite the capping, the median income of the California districts in our data has a right-skewed, roughly Pareto-like distribution: most districts have modest incomes, with a long tail of richer ones.

housing["median_income"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1165a4bd0>
[figure: histogram of median_income]

Let’s create a column that separates income into 5 categories. We’re not actually going to use these categories when visualizing our data, but we are going to use them when sampling it.

First we scale down the existing income numbers by dividing each by 1.5 and taking the ceiling, so for example 7.5 / 1.5 = 5.0. Then we do some capping of our own: values less than 5 are kept as-is, and anything greater than 5 becomes 5. This is done with the where method.

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
housing["income_cat"].value_counts()
3.0    7236
2.0    6581
4.0    3639
5.0    2362
1.0     822
Name: income_cat, dtype: int64
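As an aside, an arguably more idiomatic way to build the same five buckets is pd.cut; the sketch below is equivalent to the ceil-and-cap approach above but is not used in the rest of this section:

# Same five income buckets, built with pd.cut instead of ceil + where
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])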

So the histogram for our new column looks as follows.

housing["income_cat"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x117741650>
[figure: histogram of income_cat]

Stratified sampling is a statistical sampling technique that attempts to guarantee that a sample reflects the true proportions of given attributes. This is especially useful when the sample is small enough that the chances of drawing a non-representative sample are fairly high. We’ve taken the first step, which is to divide the population into strata. Now we sample from each stratum in proportion to its size.

Here’s a toy example of how the data structures work.

from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
print(sss.get_n_splits(X, y))
print(sss)
3
StratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5,
            train_size=None)
for train_index, test_index in sss.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
      X_train, X_test = X[train_index], X[test_index]
      y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2] TEST: [3 0]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [0 2] TEST: [3 1]

The second argument to the split method represents a column of values for an attribute whose proportions we want to preserve in samples.

When creating sss we passed in n_splits=3, so we got three different samples; for each, the proportion of 0s and 1s in y reflects what it is in the data set as a whole (50/50).
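A quick check (not in the original cells) confirms that each test fold preserves the 50/50 class balance:

for train_index, test_index in sss.split(X, y):
    print(np.bincount(y[test_index]))   # [1 1] each time: one 0 and one 1 per test fold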

Now let’s try this on our data, but we’ll just do one split, asking for the proportions of the various income levels as represented in our new column to be accurately preserved.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let’s compare stratified sampling to ordinary random sampling to see the benefits of stratified sampling.

from sklearn.model_selection import train_test_split

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=4)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props
Overall Random Stratified Rand. %error Strat. %error
1.0 0.039826 0.039729 0.039729 -0.243309 -0.243309
2.0 0.318847 0.320010 0.318798 0.364686 -0.015195
3.0 0.350581 0.352955 0.350533 0.677170 -0.013820
4.0 0.176308 0.170300 0.176357 -3.407530 0.027480
5.0 0.114438 0.117006 0.114583 2.243861 0.127011

Now that we’ve used the income_cat values to help us sample our data, let’s drop the column: it was derived from median_income, so it is redundant and no longer needed.

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

9.5.4. Putting the data on a map

housing = strat_train_set.copy()

Now comes the interesting part. Our data contains lat/long coordinates for each housing district; each coordinate pair represents the geographic center of its district. As a start, let’s do a simple scatter plot of all the districts in our data. Note the familiar shape that emerges.

housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")
Saving figure bad_visualization_plot
[figure: bad_visualization_plot]
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
Saving figure better_visualization_plot
[figure: better_visualization_plot]

In the following plot command, the c parameter selects the column whose values determine each point’s color via the colormap, and the s parameter determines the size of each plotted point. Thus a district’s median_house_value determines its color, and its population (divided by 100) determines its size.

# c is the attribute we'll map onto colors, s is the attribute we'll represent with circle size.
housing.plot(kind="scatter", x="longitude", y="latitude",
    s=housing['population']/100, label="population",
    c="median_house_value", cmap=plt.get_cmap("jet"),
    colorbar=True, alpha=0.4, figsize=(10,7),
)
plt.legend()
save_fig("housing_prices_scatterplot")
plt.show()
Saving figure housing_prices_scatterplot
[figure: housing_prices_scatterplot]

Next we include a map image showing the boundaries of California. The map image is a rectangle, while the coordinates in our scatter plots are lat/longs. To scale and position the map image so that it is consistent with our scatter plot, we need the lat/long coordinates of two of its four corners, the lower left and the upper right. Those are given in the extent parameter in the order [longitude of the left edge, longitude of the right edge, latitude of the bottom edge, latitude of the top edge]. Normally we would use a geographic toolkit like basemap (the mpl_toolkits.basemap add-on to matplotlib), which provides map polygons with various kinds of boundaries drawn, along with the associated lat/long information; we’ll illustrate that in the next example. For now we illustrate the basic idea with a very simple map, which still adds a considerable amount of value to our visualization.

# This is an excellent discussion of the ideas involved in a plot with a colorbar:
# https://stackoverflow.com/questions/19816820/how-to-retrieve-colorbar-instance-from-figure-in-matplotlib
# What's called a "scalar mappable" in that discussion might also be called a scalar-valued function,
# a function that assigns a scalar value to each x,y point being plotted. Our scalar mappable in
# this example is NOT the image of CA, but the housing prices assigned to each point in the scatter plot.
import matplotlib.image as mpimg
california_img = mpimg.imread(os.path.join(PROJECT_ROOT_DIR, 'prepared_images', 'california.png'))
# c is the attribute we'll map onto colors, s is the attribute we'll represent with circle size.
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=True, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# Creating the colorbar separately (i.e. plotting with colorbar=False) causes some sort of scaling issue;
# not sure why, so the colorbar customization below is left commented out.
#cbar = plt.colorbar(ax=ax)
#cbar.ax.set_yticks([int(round(v/1000)) for v in tick_values])
#cbar.ax.set_yticklabels(["${0:d}k".format(int(round(v/1000))) for v in tick_values])
#cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
Saving figure california_housing_prices_plot
[figure: california_housing_prices_plot]

9.5.5. A Kernel Density Estimation example

We are going to draw a map that uses a probability estimate of a geographically linked variable. When a probability estimate is appropriate, and reasonably accurate, it is a very powerful tool; the problem is to know when a probability estimate is appropriate.

Let’s think about some basics. In the previous example we looked at housing prices, an eminently geographic property that is a number: what is the price of house A, or in our case, the median price of houses in district A? In this example we’re going to look at species distributions, an eminently geographic property that could also be expressed as a number: what is the population of species A in district 1? The question is how meaningful that number is. Natural events can raise and lower the population, and species members move around from day to day, often in large numbers, so any data we collect is likely to become outdated very quickly. More interestingly, species population numbers are likely to change from place to place in predictable ways: there is a general tendency for the species population of district 1 to be similar to that of nearby districts. What we have is a variable that behaves like a continuous function over locations. These features suggest that it is profitable to think of species distribution as a probability distribution conditioned on location.

Notice how the housing price example differs. Housing prices are much more stable from day to day. They are also much less continuous: the housing prices of one district are not a very good predictor of the housing prices of a neighboring district. Both housing prices and species populations can change dramatically over short distances (natural barriers can have dramatic effects in both cases), but dramatic changes over short distances are much more the norm for housing prices. All this suggests that for predicting housing prices we want the actual prices of actual houses, or as close to that as we can get. With species, data about actual sightings of actual species members is still the ground truth, but what we can reliably conclude from a sighting is best expressed as an increase in our chances of seeing species members, that is, a probability distribution.

And that is the kind of thing we know how to draw a picture for. We are going to draw what’s called a contour plot. A contour plot shows the value of a continuous function on a 2D grid. In the simplest case, we draw lines through the places where the function takes the same value. Those lines will generally be connected curves because the function is continuous. We can color-code the lines to show the direction in which the function value increases, and if we draw many lines and use a continuous colormap we get a full picture of the peaks and valleys of the function.
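As a tiny illustration of the idea, separate from the species example below, here is a contour plot of a simple smooth function on a grid:

import numpy as np
import matplotlib.pyplot as plt

# Evaluate a smooth "peak" on a 2D grid and draw filled contours
x = np.linspace(-3, 3, 200)
y = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2) / 2)
plt.contourf(X, Y, Z, levels=25, cmap="Reds")
plt.colorbar(label="function value")
plt.show()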

First we get some raw data about species distributions. In particular, we’re going to look at distributions for ‘Bradypus Variegatus’ (the brown-throated sloth) and ‘Microryzomys Minutus’ (the forest small rice rat).

This data consists of recorded sightings of each species at particular lat/long coordinates. We fit a Gaussian kernel density estimate to these points, which defines a continuous function over lat/long coordinates: the value this function returns for a coordinate reflects the estimated probability of seeing a member of the species at that latitude and longitude. Then we draw a contour plot of the function. We will use a “Reds” colormap: a sequential colormap in which increasing darkness and saturation of the color indicates a higher numerical value. Basically, the redder the shade, the higher the probability. You can learn more about matplotlib colormaps and find other maps to try here.
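The core of the computation is scikit-learn’s KernelDensity estimator. Here is a minimal sketch of the API on toy 2-D points (not the species data):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
points = rng.normal(size=(100, 2))            # toy 2-D samples
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(points)
log_density = kde.score_samples(points[:5])   # log of the estimated density at each point
print(np.exp(log_density))                    # the densities themselves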

For those interested in more technical details, here are some facts about the computation:

  1. What we are actually going to use is a probability density function, which assigns probabilities to spatial regions rather than to individual points.

  2. The “fitting” of a probability function to a set of points is called kernel density estimation. Broadly speaking, the observed points are smoothed out to assign values to all the points on the map, not just the ones we know about, using the principle that points that are close together have similar probabilities. We use the ‘haversine’ metric, appropriate for lat/long (spherical) coordinates, to define “closeness”: it gives the great-circle distance between two points given their lat/long coordinates (see the sketch after this list). We use the Ball Tree algorithm for the neighbor computations underlying the kernel density estimation because it works well with the haversine metric and geographic data.
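To make the metric concrete, here is a small sketch of the haversine distance; the example coordinates are arbitrary California points, not part of the species data:

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius=6371.0):
    # Great-circle distance in km between two points given in degrees
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return 2 * radius * np.arcsin(np.sqrt(a))

print(haversine_km(37.88, -122.23, 34.26, -118.49))   # about 520 km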

# Author: Jake Vanderplas <jakevdp@cs.washington.edu>
#
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_species_distributions
from sklearn.datasets.species_distributions import construct_grids
from sklearn.neighbors import KernelDensity

# if basemap is available, we'll use it.
# otherwise, we'll improvise later...
try:
    from mpl_toolkits.basemap import Basemap
    basemap = True
except ImportError:
    basemap = False

The cell above does the basic imports. Note that an import of Basemap from the mpl_toolkits.basemap package is attempted, and a flag is set in case that import fails. More on this below.

# Get matrices/arrays of species IDs and locations
data = fetch_species_distributions()
species_names = ['Bradypus Variegatus', 'Microryzomys Minutus']

# The only attributes we care about in `training` are lat and long.
Xtrain = np.vstack([data['train']['dd lat'],
                    data['train']['dd long']]).T
# Microryzomys Minutus gets coded as 1,
# Bradypus Variegatus as 0
ytrain = np.array([d.decode('ascii').startswith('micro')
                  for d in data['train']['species']], dtype='int')
Xtrain *= np.pi / 180.  # Convert lat/long to radians

# Set up the data grid for the contour plot
xgrid, ygrid = construct_grids(data)
# To keep the kernel density evaluation reasonably fast we
# consistently use every 5th grid point
X, Y = np.meshgrid(xgrid[::5], ygrid[::5][::-1])
# We'll make use of the fact that coverages[6] has measurements at all
# land points.  This will help us decide between land and water.
land_reference = data.coverages[6][::5, ::5]
# -9999 means water.  Make a Boolean grid identifying land.
land_mask = (land_reference > -9999).ravel()

xy = np.vstack([Y.ravel(), X.ravel()]).T
xy = xy[land_mask]
xy *= np.pi / 180.

The sklearn species distribution docs are here. This dataset is appropriate for conservation studies, and includes species distribution records and 14 environmental variables. In this section we’ll just explore the relationship between the distribution facts and locations, and throw away the rest.

data.coverages[6].shape
(1592, 1212)

The cell above gets the data and does some preprocessing. Let’s look at the result first.

print(xy.shape)
xy[:5]
(28008, 2)
array([[ 0.41015237, -1.35699349],
       [ 0.41015237, -1.35263017],
       [ 0.41015237, -1.33081355],
       [ 0.41015237, -1.32645023],
       [ 0.41015237, -1.32208691]])

xy is a 28,008 × 2 array of the grid points that lie on land; the numbers are lat/long coordinates converted from degrees to radians.
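As a quick sanity check (not part of the original cells), we can convert a row back to degrees:

print(np.degrees(xy[0]))   # about [ 23.5 -77.75 ]: (lat, long) in degrees, near the top edge of the map grid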

print(ytrain[:5])
ois = [(i, cls) for (i, cls) in enumerate(ytrain) if cls == 1]
zis = [(i, cls) for (i, cls) in enumerate(ytrain) if cls == 0]
print(len(ytrain))
print(ois[:5], len(ois))
print(zis[:5], len(zis))
[1 1 1 1 1]
1624
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)] 698
[(88, 0), (89, 0), (90, 0), (91, 0), (92, 0)] 926

ytrain tells us the species for each training point (0 is ‘Bradypus Variegatus’, 1 is ‘Microryzomys Minutus’). There are 1,624 species/location records in the data: 698 for ‘Microryzomys Minutus’ and 926 for ‘Bradypus Variegatus’.
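A more compact way to get the same counts, as a small aside:

print(np.bincount(ytrain))   # [926 698]: 926 Bradypus Variegatus, 698 Microryzomys Minutus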

# Plot map of South America with distributions of each species
fig = plt.figure()
fig.subplots_adjust(left=0.05, right=0.95, wspace=0.05)

for i in range(2):
    plt.subplot(1, 2, i + 1)

    # construct a kernel density estimate of the distribution
    print(" - computing KDE in spherical coordinates")
    kde = KernelDensity(bandwidth=0.04, metric='haversine',
                        kernel='gaussian', algorithm='ball_tree')
    kde.fit(Xtrain[ytrain == i])

    # evaluate only on the land: -9999 indicates ocean
    Z = -9999 + np.zeros(land_mask.shape[0])
    Z[land_mask] = np.exp(kde.score_samples(xy))
    Z = Z.reshape(X.shape)

    # plot contours of the density
    levels = np.linspace(0, Z.max(), 25)
    plt.contourf(X, Y, Z, levels=levels, cmap=plt.cm.Reds)
    plt.title(species_names[i])
plt.show()
- computing KDE in spherical coordinates
- computing KDE in spherical coordinates
[figure: KDE contour plots for the two species]

The cell above trains the KDE model on the species distribution data and draws the contour plots. You can see the lightly drawn outline of South America in both species plots. It’s there because land_mask restricts our data to points on land, and since we have species data from all over South America, our points capture the shape of the entire land mass. Next we’ll redraw the picture, adding basemap to draw an outline map of South America coordinated with our data, including coastlines and country borders, to help locate the contour masses. This picture will help us tell a story about our data. For example, we’ll see that Bradypus Variegatus has populations along the eastern Amazon, where Microryzomys Minutus does not; however, the habitation area of Microryzomys Minutus extends much further south, along the entire Peruvian coast. To learn more about basemap, look here and also here.

for i in range(2):
    plt.subplot(1, 2, i + 1)

    # construct a kernel density estimate of the distribution
    print(" - computing KDE in spherical coordinates")
    kde = KernelDensity(bandwidth=0.04, metric='haversine',
                        kernel='gaussian', algorithm='ball_tree')
    kde.fit(Xtrain[ytrain == i])

    # evaluate only on the land: -9999 indicates ocean
    Z = -9999 + np.zeros(land_mask.shape[0])
    Z[land_mask] = np.exp(kde.score_samples(xy))
    Z = Z.reshape(X.shape)

    # plot contours of the density
    levels = np.linspace(0, Z.max(), 25)
    plt.contourf(X, Y, Z, levels=levels, cmap=plt.cm.Greens)
    plt.title(species_names[i])
    if basemap:
        print(" - plot coastlines using basemap")
        m = Basemap(projection='cyl', llcrnrlat=Y.min(),
                    urcrnrlat=Y.max(), llcrnrlon=X.min(),
                    urcrnrlon=X.max(), resolution='c')
        m.drawcoastlines()
        m.drawcountries()
    else:
        print(" - plot coastlines from coverage")
        plt.contour(X, Y, land_reference,
                    levels=[-9999], colors="k",
                    linestyles="solid")
        plt.xticks([])
        plt.yticks([])

    plt.title(species_names[i])

plt.show()
- computing KDE in spherical coordinates
- plot coastlines using basemap
- computing KDE in spherical coordinates
- plot coastlines using basemap
[figure: species KDE contour plots with basemap coastlines and country borders]

The code block under if basemap draws the coastlines and country borders using basemap. The else block runs when basemap is not available and instead draws boundaries from the land_reference coverage grid, which basically traces black lines along the lightly shaded boundaries of our first picture. You can see what that looks like even if you do have basemap by changing the if basemap line to if not basemap. It’s not bad, but it is much harder to understand how the colors relate to actual locations.

9.5.6. Human population (more kernel density estimation)

In this example, we will do something very similar to the last example, but for humans.

import pandas as pd
filename = '../chapter3/data/worldcitiespop.txt'
data = pd.read_csv(filename)

Let’s plot the raw coordinates, with one pixel per position.

plt.plot(data.Longitude, data.Latitude, ',')
[<matplotlib.lines.Line2D at 0x212cef90>]
[figure: scatter of raw city coordinates, tracing the world map]

We easily recognize the world map!

locations = data[['Longitude','Latitude']].values

Now, we set up a world map projection with basemap (mpl_toolkits.basemap).

from mpl_toolkits.basemap import Basemap
m = Basemap(projection='mill', llcrnrlat=-65, urcrnrlat=85, llcrnrlon=-180, urcrnrlon=180)

The name m is now bound to a Basemap object that, when called like a function with a longitude and latitude, returns the corresponding Cartesian (projected) coordinates. We get the Cartesian coordinates of the map’s corners.

x0, y0 = m(-180, -65)
x1, y1 = m(180, 85)

Let’s create a histogram of the human density. First, we retrieve the population of every city.

population = data.Population

We project the geographical coordinates into Cartesian coordinates.

x, y = m(locations[:,0], locations[:,1])

We handle cities that do not have a population value by assigning them a default weight.

weights = population.copy()
weights[np.isnan(weights)] = 1000

We create the 2D histogram with NumPy.

h, _, _ = np.histogram2d(x, y, bins=(np.linspace(x0, x1, 500), np.linspace(y0, y1, 500)), weights=weights)
h[h == 0] = 1  # avoid log(0) in the next step

We smooth the logarithm of this histogram with a Gaussian filter.

import scipy.ndimage.filters
z = scipy.ndimage.filters.gaussian_filter(np.log(h.T), 1)

We draw the coastlines on the world map, as well as the filtered human density.

plt.figure(figsize=(10,6))
m.drawcoastlines()
m.imshow(z, origin='lower', extent=[x0, x1, y0, y1], cmap=plt.get_cmap('Reds'))
<matplotlib.image.AxesImage at 0x6e417d0>
[figure: smoothed human density drawn over the world map coastlines]