# 8.5. Visualizing Geographical Data¶

## 8.5.1. Loading the data¶

```
import pandas as pd
HOUSING_PATH = "data"
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
```

## 8.5.2. Eyeballing the data¶

Let’s look at some statewide housing data for California.

Each row represents one district in California. The districts range in size from 1 household to over 6,000. They vary in location, income level, and of course, median house value.

```
housing = load_housing_data()
housing.head()
```

longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|

0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |

1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |

2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |

3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |

4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |

```
housing.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
```

```
housing["ocean_proximity"].value_counts()
```

```
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
```

```
print(housing.describe())
```

longitude latitude housing_median_age total_rooms count 20640.000000 20640.000000 20640.000000 20640.000000 mean -119.569704 35.631861 28.639486 2635.763081 std 2.003532 2.135952 12.585558 2181.615252 min -124.350000 32.540000 1.000000 2.000000 25% -121.800000 33.930000 18.000000 1447.750000 50% -118.490000 34.260000 29.000000 2127.000000 75% -118.010000 37.710000 37.000000 3148.000000 max -114.310000 41.950000 52.000000 39320.000000 total_bedrooms population households median_income count 20433.000000 20640.000000 20640.000000 20640.000000 mean 537.870553 1425.476744 499.539680 3.870671 std 421.385070 1132.462122 382.329753 1.899822 min 1.000000 3.000000 1.000000 0.499900 25% 296.000000 787.000000 280.000000 2.563400 50% 435.000000 1166.000000 409.000000 3.534800 75% 647.000000 1725.000000 605.000000 4.743250 max 6445.000000 35682.000000 6082.000000 15.000100 median_house_value count 20640.000000 mean 206855.816909 std 115395.615874 min 14999.000000 25% 119600.000000 50% 179700.000000 75% 264725.000000 max 500001.000000

```
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(11,8))
save_fig("histograms")
plt.show()
```

```
Saving figure histograms
```

The median income data has been scaled to some strange units. More than
that, it turns out that it is **capped** at both ends. The highest value
is 15 and the lowest is .5. This may create some issues, but let’s try
and work with it.

## 8.5.3. Transforming data: Binning¶

Despite the capping, the median income of the CA districts in our data has a typical Pareto’s Law distribution.

```
housing["median_income"].hist()
```

```
<matplotlib.axes._subplots.AxesSubplot at 0x1165a4bd0>
```

Let’s create a column that separates income into 5 categories. We’re not actually going to use these categories in visualizing our data, but we are going use them in sampling it.

First we scale down the existing income numbers by dividing each by 1.5,
for example 7.5/1.5 = 5.0. Then we do some capping of our own. We’ll
save all values less than 5 as is and anything greater than 5 is 5. This
is done using the `where`

method.

```
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
housing["income_cat"].value_counts()
```

```
3.0 7236
2.0 6581
4.0 3639
5.0 2362
1.0 822
Name: income_cat, dtype: int64
```

So the histogram for our new column looks as follows.

```
housing["income_cat"].hist()
```

```
<matplotlib.axes._subplots.AxesSubplot at 0x117741650>
```

Stratified sampling is a statistical sampling technique which attempts to guarantee that a sample represents the true proportions of given attributes. This is especially useful when the sample size is small enough so that the chances of a non-representative sample are fairly high. We’ev taken the first step, which is to divide the population into sample groups. Now we sample from each group in proportion to its size.

Here’s a toy example of how the data structures work.

```
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
print(sss.get_n_splits(X, y))
print(sss)
```

```
3
StratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5,
train_size=None)
```

```
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
```

```
TRAIN: [1 2] TEST: [3 0]
TRAIN: [0 2] TEST: [1 3]
TRAIN: [0 2] TEST: [3 1]
```

The second argument to the `split`

method represents a column of
values for an attribute whose proportions we want to preserve in
samples.

When creating `sss`

we passed in `n_splits=3`

, so we got three
different samples; for each, the proportion of 0 and 1’s in `y`

reflects what it is in the data set as a whole (50/50).

Now let’s try this on our data, but we’ll just do one split, asking for the proportions of the various income levels as represented in our new column to be accurately preserved.

```
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
```

```
ss = split.split(housing, housing["income_cat"])
```

Let’s compare stratified sampling to ordinary random sampling to see the benefits of stratified sampling.

```
from sklearn.model_selection import train_test_split
def income_cat_proportions(data):
return data["income_cat"].value_counts() / len(data)
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=4)
compare_props = pd.DataFrame({
"Overall": income_cat_proportions(housing),
"Stratified": income_cat_proportions(strat_test_set),
"Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
```

```
compare_props
```

Overall | Random | Stratified | Rand. %error | Strat. %error | |
---|---|---|---|---|---|

1.0 | 0.039826 | 0.039729 | 0.039729 | -0.243309 | -0.243309 |

2.0 | 0.318847 | 0.320010 | 0.318798 | 0.364686 | -0.015195 |

3.0 | 0.350581 | 0.352955 | 0.350533 | 0.677170 | -0.013820 |

4.0 | 0.176308 | 0.170300 | 0.176357 | -3.407530 | 0.027480 |

5.0 | 0.114438 | 0.117006 | 0.114583 | 2.243861 | 0.127011 |

Now that we’ve used the `income_cat`

values to help us sample our
data, let’s drop them, since they’re clearly redundant.

```
for set in (strat_train_set, strat_test_set):
set.drop("income_cat", axis=1, inplace=True)
```

## 8.5.4. Putting the data on a map¶

```
housing = strat_train_set.copy()
```

Now comes the interesting idea. Our data contains lat/long coordinates for each housing district. Each coordinate represent the geographic center of the district. As a start, let’s do a simple scatter plot of all the districts in our data. Note the familiar shape that emerges.

```
housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")
```

```
Saving figure bad_visualization_plot
```

```
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
```

```
Saving figure better_visualization_plot
```

In the following `plot`

command, the `c`

parameter is telling us
which column value to use to determine the color given by the colormap.
The `s`

parameter is determining the *size* of the point being
plotted. Thus the value of the `median_house_value`

for a point will
determine its color, and the value in the `population`

column (divided
by 100) will determine its size.

```
# c is the attribute we'll map onto colors, s is the attribute we'll represent with circle size.
housing.plot(kind="scatter", x="longitude", y="latitude",
s=housing['population']/100, label="population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=True, alpha=0.4, figsize=(10,7),
)
plt.legend()
save_fig("housing_prices_scatterplot")
plt.show()
```

```
Saving figure housing_prices_scatterplot
```

Next we include a map-image showing the boundaries of Californa. The
map-image is a polygon with 4 corners (a rectangle). The coordinates in
our scatter plots are Lat/Longs. To scale and position our map-image to
be consistent with out scatterplot, we need the Lat long coordinates for
two of the the 4 corners, the lower left and the upper right.
Those are given in `extent`

parameter in the order
LongitudeLowerLeft, LongitudeUpperRight, LatitudeLowerLeft,
LatitudeUpperRight. Normally we would use a geographic tool kit like
`basemap`

(included as part of `matplotlib`

) that provides
geographic map polygons with boundaries of various kinds drawn, as well
as the associated Lat Long info. We’ll illustrate that in the next
example. For now we illustrate the basic idea with a evry simple map,
which still adds a considerable amount of value to our visualization.

```
# This is a excellent discussion of the ideas involved in a plot with a colorbar
# https://stackoverflow.com/questions/19816820/how-to-retrieve-colorbar-instance-from-figure-in-matplotlib
# What's called a "scalar mappable" in that discussion might also be called a scalar-valued function,
# a function that assigns a scalar value to each x,y point being plotted. Our scalar mappable in
# this example is NOT the image of CA, but the housing prices assigned tp each point in the scatter plot.
import matplotlib.image as mpimg
california_img=mpimg.imread(os.path.join(PROJECT_ROOT_DIR,'prepared_images','california.png'))
# c is the attribute we'll map onto colors, s is the attribute we'll represent with circle size.
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=housing['population']/100, label="Population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=True, alpha=0.4,
)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# Creating the plot later (doing the plot with colorbar=False) creates some sort of scaling issues.
# Only not sure why.
#cbar = plt.colorbar(ax=ax)
#cbar.ax.set_yticks([int(round(v/1000)) for v in tick_values])
#cbar.ax.set_yticklabels(["${0:d}k".format(int(round(v/1000))) for v in tick_values])
#cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
```

```
Saving figure california_housing_prices_plot
```

## 8.5.5. A Kernel Density Estimation example¶

We are going to draw a map that uses a probability estimate of a geographically linked variable. When a probability estimate is appropriate, and reasonably accurate, it is a very powerful tool; the problem is to know when a probability estimate is appropriate.

Let’s think about some basics. In the previous example we looked at data about housing prices, an eminently geographic property that is a number. What is the price of House A? Or in our particular case, the median price of houses in District A? In this example we’re going to look at species distributions, an eminently geographic property that could also be expressed as a number. What is the population of species A in district 1? The question is: How meaningful is that number? Since natural events can raise and lower the population, since species members move around from day to to day, often in large numbers, any data we collect is likely to be outdated very quickly. More interestingly, species population numbers are likely to change from place to place in predictable ways: There is a general tendency for the species population of district 1 to be similar to that of nearby districts. What we have is a variable that behaves like a continuous function over locations. These features suggest that it is profitable to think of species distribution as a probability distribution conditioned by location. Notice how the housing price example differs. Housing prices are much more stable from day to day. They are also much less continuous. The housing prices of one district are not a very good predictor of the housing prices of a neighboring district. Both housing prices and species population numbers can change dramatically over short distances – | natural barriers can have dramatic effects in both cases – but dramatic changes over short distances are much more the norm for housing prices. All this suggests that what we want for predicting housing prices is the actual prices of actual houses, and if not that, the closer to that that we can get, the better. With species, data about actual sightings of actual species members is still the ground truth, but what we can reliably conclude from a sighting is best expressed as an increase in our chances of seeing species members, a probability distribution.

And that is the kind of thing we know how to draw a picture for. We are
going to draw what’s called a **contour** plot. A contour plot tells us
the value of a continuous function on a 2D grid. In the simplest case,
what we draw is lines telling us the places where the value of the
function is the same. Those lines will generally be connected because
the function is continuous. We can color code the lines to tell us the
direction in which the function value increases, and if we draw lots of
lines and use a continuous color map we get a full picture of the peaks
and valleys of the function.

First we get some raw data about species distribution. In particular we’re going to look at distributions for ‘Bradypus Variegatus’ (the brown-throated sloth)

and ‘Microryzomys Minutus’ (the forest small rice rat)

This data consists of raw numbers for species population at particular
lat/long coordinates. Then we **fit** a certain kind of probability
distribution called a **Gaussian** to it. That defines a continuous
function over lat/long coordinates. The value returned for any
coordinate by that function – which is called a probability function –
is the probability of seeing a member of species at that latitude and
longitude. Then we draw a **contour plot** of the function. We will use
a “reds” colormap: this is a sequential colormap in which increasing
darkness and saturation of the color increases a higher numerical value.
Basically, the redder the shade, the higher the probability. You can
learn more about `matplotlib`

colormaps and find other maps to try
here.

For those interested in more technical details, here are some facts about the computation:

What we are actually going to use is a

**probability density**function which assigns probabilities to spatial regions, not to points.The “fitting” of a probability function to set of points is called a

**kernel density estimation.**Broadly speaking, the probability values are smoothed to assign values to all the points on the map, not just the ones we know about, using the principle that points that are close together have similar probabilities. We use the ‘haversine’ metric appropriate for lat/long coordinates (and spherical coordinates) to define “closeness”. It basically gives us the distance between points given their lat-long coordinates. We use the**Ball Tree**algorithm for kernel density estimation because it is the most appropriate for geographic applications.

```
# Author: Jake Vanderplas <jakevdp@cs.washington.edu>
#
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_species_distributions
from sklearn.datasets.species_distributions import construct_grids
from sklearn.neighbors import KernelDensity
# if basemap is available, we'll use it.
# otherwise, we'll improvise later...
try:
from mpl_toolkits.basemap import Basemap
basemap = True
except ImportError:
basemap = False
```

The cell above does the basic imports. Note that an import is
**attempted** for a module named `Basemap`

, but a flag is set in case
that import fails. More on this below.

```
# Get matrices/arrays of species IDs and locations
data = fetch_species_distributions()
species_names = ['Bradypus Variegatus', 'Microryzomys Minutus']
# The only attributes we care about in `training` are lat and long.
Xtrain = np.vstack([data['train']['dd lat'],
data['train']['dd long']]).T
# Microryzomys Minutus gets coded as 1,
# Bradypus Variegatus as 0
ytrain = np.array([d.decode('ascii').startswith('micro')
for d in data['train']['species']], dtype='int')
Xtrain *= np.pi / 180. # Convert lat/long to radians
# Set up the data grid for the contour plot
xgrid, ygrid = construct_grids(data)
# To help kernel density estimation take a reasonable length of time we
# consistently use every 5th data point
X, Y = np.meshgrid(xgrid[::5], ygrid[::5][::-1])
# We'll make use of the fact that coverages[6] has measurements at all
# land points. This will help us decide between land and water.
land_reference = data.coverages[6][::5, ::5]
# -9999 means water. Make a Boolean grid identifying land.
land_mask = (land_reference > -9999).ravel()
xy = np.vstack([Y.ravel(), X.ravel()]).T
xy = xy[land_mask]
xy *= np.pi / 180.
```

The `sklearn`

species distrubution docs are
here.
This is data appropriate for conservation studies, including data about
species distrubutions and 14 environmental variables. In this section
we’ll just explore the relationship between the distribution facts and
locations, and throw away the rest.

```
data.coverages[6].shape
```

```
(1592, 1212)
```

The cell above gets the data and does some preprocessing. Let’s look at the result first.

```
print xy.shape
xy [:5]
```

```
(28008, 2)
```

```
array([[ 0.41015237, -1.35699349],
[ 0.41015237, -1.35263017],
[ 0.41015237, -1.33081355],
[ 0.41015237, -1.32645023],
[ 0.41015237, -1.32208691]])
```

`xy`

is a 28,008 x 2 array. The numbers are lat long coordinates
converted to radians (basically, degrees are converted to radians).

```
print ytrain[:5]
ois = [(i,cls) for (i,cls) in enumerate(ytrain) if cls == 1]
zis = [(i,cls) for (i,cls) in enumerate(ytrain) if cls == 0]
print len(ytrain)
print ois[:5], len(ois)
print zis[:5], len(zis)
```

```
[1 1 1 1 1]
1624
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)] 698
[(88, 0), (89, 0), (90, 0), (91, 0), (92, 0)] 926
```

`ytrain`

tells us our species (0 is ‘Bradypus Variegatus’, 1 is
‘Microryzomys Minutus’. There are 1,624 species location tuples in the
data, 698 ‘Microryzomys Minutus’ and 926 ‘Bradypus Variegatus’,.

```
# Plot map of South America with distributions of each species
fig = plt.figure()
fig.subplots_adjust(left=0.05, right=0.95, wspace=0.05)
for i in range(2):
plt.subplot(1, 2, i + 1)
# construct a kernel density estimate of the distribution
print(" - computing KDE in spherical coordinates")
kde = KernelDensity(bandwidth=0.04, metric='haversine',
kernel='gaussian', algorithm='ball_tree')
kde.fit(Xtrain[ytrain == i])
# evaluate only on the land: -9999 indicates ocean
Z = -9999 + np.zeros(land_mask.shape[0])
Z[land_mask] = np.exp(kde.score_samples(xy))
Z = Z.reshape(X.shape)
# plot contours of the density
levels = np.linspace(0, Z.max(), 25)
plt.contourf(X, Y, Z, levels=levels, cmap=plt.cm.Reds)
plt.title(species_names[i])
plt.show()
```

```
- computing KDE in spherical coordinates
- computing KDE in spherical coordinates
```

The cell above trains the KDE model on the species distribution data and
draws the contour plots. You can see the lightly drawn outline of South
America in both species plots. It’s there because `land_mask`

restricts our data to points on land, and since we have species data
from all over South America, our points capture the shape of the entire
land mass. Next we’ll redraw the picture adding `basemap`

to draw an
outline map of South America, coordinated with our data, including major
rivers and country borders, to help locate the contour masses. This
picture will help us tell a story our data. For example, we’ll see that
*Bradypus Variegatus* has populations along the eastern Amazon, where
*Microryzomus Minutus* does not. However the habitation area of
*Microryzomus Minutus* extends much further south, along the entire
Peruvian coast. To learn more about `basemap`

, look
here and also
here.

```
for i in range(2):
plt.subplot(1, 2, i + 1)
# construct a kernel density estimate of the distribution
print(" - computing KDE in spherical coordinates")
kde = KernelDensity(bandwidth=0.04, metric='haversine',
kernel='gaussian', algorithm='ball_tree')
kde.fit(Xtrain[ytrain == i])
# evaluate only on the land: -9999 indicates ocean
Z = -9999 + np.zeros(land_mask.shape[0])
Z[land_mask] = np.exp(kde.score_samples(xy))
Z = Z.reshape(X.shape)
# plot contours of the density
levels = np.linspace(0, Z.max(), 25)
plt.contourf(X, Y, Z, levels=levels, cmap=plt.cm.Greens)
plt.title(species_names[i])
if basemap:
print(" - plot coastlines using basemap")
m = Basemap(projection='cyl', llcrnrlat=Y.min(),
urcrnrlat=Y.max(), llcrnrlon=X.min(),
urcrnrlon=X.max(), resolution='c')
m.drawcoastlines()
m.drawcountries()
else:
print(" - plot coastlines from coverage")
plt.contour(X, Y, land_reference,
levels=[-9999], colors="k",
linestyles="solid")
plt.xticks([])
plt.yticks([])
plt.title(species_names[i])
plt.show()
```

```
- computing KDE in spherical coordinates
- plot coastlines using basemap
- computing KDE in spherical coordinates
- plot coastlines using basemap
```

The codeblock under `if basemap`

draws the coastlines and country
borders using `basemap`

. The `else`

block is executed if `basemap`

is not available and just draws boundaries based on the locations in the
data, which basically just draws black lines along the lightly shaded
boundaries of our first picture. You can see what that looks like even
if you do have `basemap`

by changing the `if basemap`

line to
`if not basemap`

. It’s not bad, but it’s not nearly as easy to
understand how the colors relate to actual locations.

## 8.5.6. Human population (more kernel density estimation)¶

In this example, we will do something very similar to the last example, but for humans.

```
import pandas as pd
filename = '../chapter3/data/worldcitiespop.txt'
data = pd.read_csv(filename)
```

Let’s plot the raw coordinates, with one pixel per position.

```
plot(data.Longitude, data.Latitude, ',')
```

```
[<matplotlib.lines.Line2D at 0x212cef90>]
```

We easily recognize the world map!

```
locations = data[['Longitude','Latitude']].as_matrix()
```

Now, we load a world map with `matplotlib.basemap`

.

```
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='mill', llcrnrlat=-65, urcrnrlat=85, llcrnrlon=-180, urcrnrlon=180)
```

The name `m`

is now set to a Python function that returns Cartesian
coordinates when given Lat/Longs. We get the Cartesian coordinates of
the map’s corners.

```
x0, y0 = m(-180, -65)
x1, y1 = m(180, 85)
```

Let’s create a histogram of the human density. First, we retrieve the population of every city.

```
population = data.Population
```

We project the geographical coordinates into cartesian coordinates.

```
x, y = m(locations[:,0], locations[:,1])
```

We handle cities which do not have a population value.

```
weights = population.copy()
weights[isnan(weights)] = 1000
```

We create the 2D histogram with NumPy.

```
h, _, _ = histogram2d(x, y, bins=(linspace(x0, x1, 500), linspace(y0, y1, 500)), weights=weights)
h[h==0] = 1
```

We filter this histogram with a Gaussian kernel.

```
import scipy.ndimage.filters
z = scipy.ndimage.filters.gaussian_filter(log(h.T), 1)
```

We draw the coast lines in the world map, as well as the filtered human density.

```
figure(figsize=(10,6))
m.drawcoastlines()
m.imshow(z, origin='lower', extent=[x0,x1,y0,y1], cmap=get_cmap('Reds'))
```

```
<matplotlib.image.AxesImage at 0x6e417d0>
```