6.2. More on two-dimensional arrays

In this section we discuss some of the uses of 2D arrays, focusing on their role in representing mathematical relationships. We also cover some more advanced aspects of the data type.

The need for 2D arrays is obvious if you’ve taken a linear algebra class. They correspond to the mathematical object called a matrix. One application for matrices is in solving systems of equations, but this really only scratches the surface. A more fundamental application for a data scientist is they can represent all that is known about a single data set. Generally each row represents an item (an individual or event in the data), and the entry in each column its value for a particular attribute. For example let’s say each room in a hotel has lamps, tables, chairs, and beds, but in varying numbers. We might represent the inventory of items in 5-room hotel with a 5x4 matrix (5 rows, 4 columns), as follows:

6  3 4 1
5  2 3 2
8  3 6 2
5  1 3 1
10 4 7 2

So the first row represents a room with 6 lamps, 3 tables, 4 chairs, and 1 bed. Now if we represent the cost of each item as 1D cost array (or vector) to use the mathematical term,

40 175 90 450,

where the costs are ordered in the same way as our columns above: lamps costs, table costs, chair costs, and bed costs. Then we can compute the per room costs as follows:

room_matrix = \
np.array(
[[6,  3, 4, 1],
[5,  2, 3, 2],
[8,  3, 6, 2],
[5,  1, 3, 1],
[10, 4, 7, 2]])

cost_vector = np.array([40, 175, 90, 450])
print room_matrix
print cost_vector
print room_matrix.dot(cost_vector)
[[ 6  3  4  1]
 [ 5  2  3  2]
 [ 8  3  6  2]
 [ 5  1  3  1]
 [10  4  7  2]]
[ 40 175  90 450]
[1575 1720 2285 1095 2630]

The mathematical name for the dot method, which computes the per room cost, is matrix multiplication.

Arrays have many attributes.

print a.shape
print a.ndim
print a.dtype.name
print a.itemsize
print a.size
print type(a)
print np.ndarray
(3, 5)
2
int64
8
15
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
[6 7 8]
<type 'numpy.ndarray'>
b = np.array([6, 7, 8])
print b
print type(b)
b
[6 7 8]
<type 'numpy.ndarray'>
array([6, 7, 8])

A sequence of sequences can be used to define a 2D array, but since the inner sequences are rows of a table, they must all be the same length.

b = np.array( [ (1.5,2,3), (4,5,6) ] )
b
array([[ 1.5,  2. ,  3. ],
       [ 4. ,  5. ,  6. ]])

Arrays can also contain complex numbers, but remember it takes two quantities to define a single complex number:

c = np.array( [ [1,2], [3,4] ], dtype=complex )
c
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

A very convenient way to fill an array is to start with an array containing all 0’s or 1’s and then update the contents:

X = np.zeros( (3,4) )
X
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

Indexing works as it does with ordinary Python sequences, except that now there are two dimensions to worry about

X[0,0]
0.0
X[1,2]
0.0
X[3,0]
---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-16-fb9c09f08361> in <module>()
----> 1 X[3,0]


IndexError: index 3 is out of bounds for axis 0 with size 3

Assigment works as it does with Python sequences, except that it now works in two dimensions

X[1,2] = 3
X
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  3.,  0.],
       [ 0.,  0.,  0.,  0.]])
X = np.zeros( (3,4) )
ctr = 0
(rows,cols) = X.shape
for i in range(rows):
    for j in range(cols):
        ctr += 1
        X[i,j] += ctr*10
X
array([[  10.,   20.,   30.,   40.],
       [  50.,   60.,   70.,   80.],
       [  90.,  100.,  110.,  120.]])

A much faster and more natural way to create the same array:

X = np.arange(10,130,10).reshape((3,4))
X
array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

The following expression gives the second row of X

X[1,:]
array([50, 60, 70, 80])

The following expression gives the 3rd column of X:

X[:,2]
array([ 30,  70, 110])

Note that columns and rows are just 1D arrays; nothing identifies them as row vectors or column vectors.

As a convenience, numpy lets you skip the : when you want a row. In other words, the next two expressions are synonyms. Note that you need the : if you want a column.

print X[1,:]
print X[1]
[ 50.  60.  70.  80.]
[ 50.  60.  70.  80.]
print X[1,:], X[1,:].shape
print X[:,1]
[50 60 70 80] (4,)
[ 20  60 100]

Next, let’s get part of a column or row. The first two elements of the second column of X, followed by the first two elements of the second row of X:

print X
print X[1,:2]
print X[:2,1]
[[ 10  20  30  40]
 [ 50  60  70  80]
 [ 90 100 110 120]]
[50 60]
[20 60]

Now let’s get a 2x2 subarray, consisting of the upper left hand corner of X, followed by the central part of the last two rows.

print X
print X[:2,:2]
print X[1:,1:3]
[[ 10  20  30  40]
 [ 50  60  70  80]
 [ 90 100 110 120]]
[[10 20]
 [50 60]]
[[ 60  70]
 [100 110]]

Array transposition is sometimes a nice way to get to the 2D array you really want. The transposition of an array M is called M.T, and the definition is that

M.T[i,j] = M[j,i]

So if M is an m x n array, then M.T is an n x m array. Look at X.T and verify these definitions. The mth row of X becomes the mth column of X.T. The nth column of X becomes the nth row of X.T.

print X
print
print X.T
print X[1,2], X.T[2,1]
[[ 10  20  30  40]
 [ 50  60  70  80]
 [ 90 100 110 120]]

[[ 10  50  90]
 [ 20  60 100]
 [ 30  70 110]
 [ 40  80 120]]
70 70

There are also 3D arrays, which have a third dimension; each position along the third dimension defines a 2D array.

A =np.arange(24).reshape((2,3,4))
A
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In a 2D array, a single index slice is a row or column (a 1D array). In a 3D array, single index slice is a 2D array. As the examples below suggest, 3D arrays are kind of a hard data structure to think about, and they are generally used for very specific applications, such as representing color images (3 layers of 2D representation, one for each red-green-blue color intensity).

A[1,:,:]
array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
A[:,:,3]
array([[ 3,  7, 11],
       [15, 19, 23]])
A[:,2,:]
array([[ 8,  9, 10, 11],
       [20, 21, 22, 23]])

6.2.1. 3. Elementwise arithmetic operations

We noted above an important fact about arrays which distinguishes them from lists: Arithmetic operations can be performed on arrays just as if they were numbers. We discuss that a bit more explicitly in this section.

First note that arithmetic operations such as addition and multiplication between two arrays are well defined, if the arrays meet certain constraints. If they have the same shape, addition and multiplication are always well-defined.

import numpy as np
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
print a
print b
[20 30 40 50]
[0 1 2 3]
a + b
array([20, 31, 42, 53])
a * b
array([  0,  30,  80, 150])
a - b
array([20, 29, 38, 47])

In the next example we illustrate broadcasting. Elementwise operations are extended to apply between arrays and objects that are not arrays or between arrays that are not the same shape. o for example, 2 * a (a is the array above) returns an array that contains all the elements of a multiplied by 2

2 * a
array([ 40,  60,  80, 100])

This works by “broadcasting” 2 into an array the same size as a and then doing elementwise mutliplication on the two arrays.

Elementwise arithmetic also works with 2D arrays, often in a fairly intutive fashion.

A = np.array( [[1,1],
            [0,1]] )
B = np.array( [[2,0],
            [3,4]] )
A * B  # elementwise product
array([[2, 0],
       [0, 4]])

You can multiply two arrays of the same shape together as we did with a and b above, but you can’t rely on broadcasting to figure out what to do with mismatched array sizes. In general, adding arrays of different sizes together will fail:

c = np.array([2,3])
c * a
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-3-e4d7d0499d5f> in <module>()
      1 c = np.array([2,3])
----> 2 c * a


ValueError: operands could not be broadcast together with shapes (2,) (4,)

But sometimes there is a sensible way to apply the operation along some dimension of a 2D array. Consider:

a = np.array( [20,30,40,50] )
c = np.arange(8).reshape((2,4))
c
array([[0, 1, 2, 3],
      [4, 5, 6, 7]])
a + c
array([[20, 31, 42, 53],
      [24, 35, 46, 57]])

The same broadcast principle applies to Boolean tests, which are converted to elementwise operations, resulting in an array of Booleans. We return to this point below.

a < 35
array([ True,  True, False, False], dtype=bool)

6.2.2. Elementwise Boolean operations, Masking

We noted above that the idea of elementwise operations extends to Boolean tests.

In general, applying a Boolean test to an array returns an array of truth-values of the same size as the orginal array. Just as

3 + X

adds 3 to every element of array X, so

X > 2

returns an array of truth-values which tells us which elements of X are greater than 2.

import numpy as np
X = np.array([1,5,2,7,9,4,3,-6])
print X
print X >= 3
[ 1  5  2  7  9  4  3 -6]
[False  True False  True  True  True  True False]

One of the most important uses of Boolean arrays is that they can index other arrays. This will always work when a Boolean array is the same shape as the array it indexes. So in the simplest case:

X[X>=3]
array([5, 7, 9, 4, 3])

This returns exactly the members of X that are greater than or equal to 3.

We could do the same with a list comprehension, but the array computation above is much faster:

[x for x in X if x >= 3]
[5, 7, 9, 4, 3]

The reason that X[X>=3] works has to do with a basic fact of array indexing we haven’t really made explicit yet. Any Boolean array Y of the right length can be used to index an array X if X and Y are the same length.

A hokey but essential example. Suppose we have an array of length 8 and we want to access the first, third, and seventh elements. Here’s how to do it with a Boolean array:

X = np.arange(8) + 1
Y = np.array([True,False,True,False,False,False,True,False])
print X
X[Y]
[1 2 3 4 5 6 7 8]
array([1, 3, 7])

Boolean arrays can also be used to count the number of elements in an array that satisfy some constraint. The Boolean True is treated as 1 with arithmetic operations and the Boolean False as 0, so summing a Boolean array counts the number of Trues. The following expression correctly counts the number of elements in X that are greater than or equal to 3.

sum(X>=3)
5

In this case, though, using the array operation is about an order of magnitude less efficient than the natural Pythonic idiom.

len([x for x in X if x >3])
5

More on array operations and efficiency below. It’s a tricky subject.

Boolean operations work more or less as expected on 2D arrays:

import numpy as np
y = np.arange(35).reshape(5,7)
y>3
array([[False, False, False, False,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True]], dtype=bool)

Note that this Boolean array preserves the 5x7 shape of y.

(y>3).shape
(5, 7)

But if we use it to index elements of y, the result has to be a 1D array:

y[y>3]
array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
       21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])

Now suppose we have an array of data that has 4 columns, but we only want to keep the first, second and fourth columns. Here’s a simple way to do that using a Boolean array:

X = np.arange(8).reshape((2,4)) + 1
print X
print X[:,np.array([True,True,False,True])]
[[1 2 3 4]
 [5 6 7 8]]
[[1 2 4]
 [5 6 8]]

Now Boolean arrays and Boolean masking is a great feature but it will occasionally cause you grief if you forget that something is an array and that a Boolean test will often return an array of truth values instead of a truth value.

Python will return an error if you try to use an array where a Boolean value is expected. The most common error of this sort is in an if-clause.

if y:
    print 'hi'
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-27-512f0bc47840> in <module>()
----> 1 if y:
      2     print 'hi'


ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

And the result of any Boolean test that can be interpreted as an elementwise operation will raise the same error:

if y == 0:
    print 'hi'
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-31-3e5bb3ed66c7> in <module>()
----> 1 if y == 0:
      2     print 'hi'


ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The is test, however, will not raise this error, because it cannot be interpreted as an elementwise operation:

if y is 3:
    print 'Hi'

The flip side of this is that you must know to expect different kinds of results with == and is where arrays are concerned.

print y is 3
False
print y == 3
[[False False False  True False False False]
 [False False False False False False False]
 [False False False False False False False]
 [False False False False False False False]
 [False False False False False False False]]

6.2.3. Data in the form of arrays

Many python modules that provide data do so in array form. As an example, we load the famous iris data set due to Edgar Anderson and made famous in a paper by Ronald Fisher.

from sklearn.datasets import load_iris
data = load_iris()
features = data['data']
target = data['target']

The variable features is a numpy array:

print features.shape
print features.ndim
print features.dtype.name
print features.size
print type(features)
print features[:10]
(150, 4)
2
float64
600
<type 'numpy.ndarray'>
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]

The last thing printed is the first 10 rows of features. This array represents data about irises. Each row represents a different iris and gives 4 measurements for that exemplar. So there are 150 iris exemplars; with 4 measurements for each, that’s 600 items data in the array (features.size). The data is used for classification studies.

Let’s apply what we’ve just learnt to pick a ceratin subset of the data. The first number in each row represents the sepal length of that particular iris exemplar. Let’s find all the irises in the data set whose sepal length is 5.

First we get a vector for the first column, and use it to create a Boolean vector identifying which rows have 5.0 in the first column.

first_col = features[:,0]
print first_col == 5.0
[False False False False  True False False  True False False False False
 False False False False False False False False False False False False
 False  True  True False False False False False False False False  True
 False False False False  True False False  True False False False False
 False  True False False False False False False False False False False
  True False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False]

Then we use the Boolean array just constructed to index the rows, combined with a : to indicate we want all columns.

features[first_col == 5.0,:]
array([[ 5. ,  3.6,  1.4,  0.2],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5. ,  3.5,  1.3,  0.3],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 5. ,  3.3,  1.4,  0.2],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5. ,  2.3,  3.3,  1. ]])

So there are only 10 plants out of 150 that have sepal lengths of 5.

Of course we can do this all in one step, with the same result:

features[features[:,0] == 5.0,:]
array([[ 5. ,  3.6,  1.4,  0.2],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5. ,  3.5,  1.3,  0.3],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 5. ,  3.3,  1.4,  0.2],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5. ,  2.3,  3.3,  1. ]])

Since we’re indexing rows here, the : can also be left out:

features[features[:,0] == 5.0]
array([[ 5. ,  3.6,  1.4,  0.2],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5. ,  3.5,  1.3,  0.3],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 5. ,  3.3,  1.4,  0.2],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5. ,  2.3,  3.3,  1. ]])

A more interesting use of Boolean array indexing is to find all the irises of a particular class, using the variable target defined when we loaded the data; target is an array containing the class of each iris in the data set:

print type(target)
print target.shape
print set(target)
print features[0], target[0]
print features[90], target[90]
<type 'numpy.ndarray'>
(150,)
set([0, 1, 2])
[ 5.1  3.5  1.4  0.2] 0
[ 5.5  2.6  4.4  1.2] 1

As the printouts indicate , target is a 1D array (a vector) of length 150, containing only the values 0,1, and 2. These are the three classes to which an iris can belong. There are exactly as many entries in the target array as there are rows in the features array. For any iris, its row-index in the features array is the index of its class in the target. Above we printed the features and target class for irises 0 and 90.

We can take advantage of this structure to find all the irises of class 1 very efficiently. Note that target == 1 is a Boolean array of length 150.

target==1
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False], dtype=bool)

To find the irises of class 1, we simply use that Boolean array to specify the rows we want in the features array:

features[target==1,:]
array([[ 7. ,  3.2,  4.7,  1.4],
       [ 6.4,  3.2,  4.5,  1.5],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 5.5,  2.3,  4. ,  1.3],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 5.7,  2.8,  4.5,  1.3],
       [ 6.3,  3.3,  4.7,  1.6],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.6,  2.9,  4.6,  1.3],
       [ 5.2,  2.7,  3.9,  1.4],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 6. ,  2.2,  4. ,  1. ],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 5.6,  2.9,  3.6,  1.3],
       [ 6.7,  3.1,  4.4,  1.4],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5.8,  2.7,  4.1,  1. ],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 5.6,  2.5,  3.9,  1.1],
       [ 5.9,  3.2,  4.8,  1.8],
       [ 6.1,  2.8,  4. ,  1.3],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.8,  4.7,  1.2],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.8,  2.8,  4.8,  1.4],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6. ,  2.9,  4.5,  1.5],
       [ 5.7,  2.6,  3.5,  1. ],
       [ 5.5,  2.4,  3.8,  1.1],
       [ 5.5,  2.4,  3.7,  1. ],
       [ 5.8,  2.7,  3.9,  1.2],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 5.6,  3. ,  4.1,  1.3],
       [ 5.5,  2.5,  4. ,  1.3],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 5.8,  2.6,  4. ,  1.2],
       [ 5. ,  2.3,  3.3,  1. ],
       [ 5.6,  2.7,  4.2,  1.3],
       [ 5.7,  3. ,  4.2,  1.2],
       [ 5.7,  2.9,  4.2,  1.3],
       [ 6.2,  2.9,  4.3,  1.3],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 5.7,  2.8,  4.1,  1.3]])

Now suppose we wanted to just see the sepal lengths (column 0) of the target class 1 irises, and compare them to the sepal lengths of the target class 0 irises, to see if sepal lengths provided a good way of telling the two classes apart. We could do:

print features[target==1,0]
print features[target==0,0]
[ 7.   6.4  6.9  5.5  6.5  5.7  6.3  4.9  6.6  5.2  5.   5.9  6.   6.1  5.6
  6.7  5.6  5.8  6.2  5.6  5.9  6.1  6.3  6.1  6.4  6.6  6.8  6.7  6.   5.7
  5.5  5.5  5.8  6.   5.4  6.   6.7  6.3  5.6  5.5  5.5  6.1  5.8  5.   5.6
  5.7  5.7  6.2  5.1  5.7]
[ 5.1  4.9  4.7  4.6  5.   5.4  4.6  5.   4.4  4.9  5.4  4.8  4.8  4.3  5.8
  5.7  5.4  5.1  5.7  5.1  5.4  5.1  4.6  5.1  4.8  5.   5.   5.2  5.2  4.7
  4.8  5.4  5.2  5.5  4.9  5.   5.5  4.9  4.4  5.1  5.   4.5  4.4  5.   5.1
  4.8  5.1  4.6  5.3  5. ]

And though there is some overlap, we see that there is a tendency for the column-0 value be higher in target 1 than it is target class 0.

6.2.4. Array operations

Some operations are elementwise, some are not. The ones that are not are the ones it makes sense to apply to an array as a whole. Let’s cook up an array:

a = np.array([[1.0, 2.0], [3.0, 4.0], [8.0,5.0]])
a
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 8.,  5.]])

Now we’ll transpose it. That is, we’ll switch the rows and columns. Instead of a 3x2 array, we get a 2x3 array.

b = a.transpose()
b
array([[ 1.,  3.,  8.],
       [ 2.,  4.,  5.]])

So transposing an array operation. It applies to an array as a whole, not element by element, and returns an array as a result.

To access the same item in b, we now hae to swap the row and column indices

print a[2,0]
print b[0,2]
8.0
8.0

Another example of an array operation is the matrix product, or dot product. This is a mathematically important operation that takes two arrays and returns an array. You can look up its definition in Wikipedia, if you’re interested.

a.dot(b)
array([[  5.,  11.,  18.],
       [ 11.,  25.,  44.],
       [ 18.,  44.,  89.]])

The moral: In programming with arrays, it is important to know which operations can be applied elementwise, and which cannot. The distinction is not always completely predictable. We saw above that == is a Boolean operation that applies elementwise to an array, but is, another Boolean operation, does not.

6.2.5. Cosine

One way of measuring the similarity between two arrays x and y. is to take their cosine. The name for this similarity measure is quite appropriate. The cosine of x and y is a measure of the geometric cosine of the angle between them. It is 1 when that angle is 0 (i.e., x and y are identical in direction), and -1 when the vectors point in opposite directions, and 0 when the vectors are orthogonal (the do not share a component in any direction). It is computed by taking the dot product of the unit vectors pointing in the same direction as x and y. To do this we divide x and y by their Euclidean length (LA.norm in the code below) and take the dot product of the results.

import numpy as np
import numpy.linalg as LA
x = np.array([3,4])
y = np.array([4,3])

def find_unit_vector(x):
    return x/LA.norm(x,ord=2)

def cosine (x,y):
    return (find_unit_vector(x)).dot(find_unit_vector(y))

The maximum possible value cosine can have for two vectors is 1.0. The maximum possible value is achieved when you compare a vector x with itself x. So nothing can be more similar to x than x is to itself. Whew. That’s reassuring.

cosine(x,x)
1.0
cosine(x,-x)
-1.0
cosine(x,y)
0.95999999999999996
LA.norm(find_unit_vector(x))
1.0

6.2.6. Why use arrays?

import timeit
import numpy as np

py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
                              number=10000)

np_sec = timeit.timeit('na.dot(na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)

print("Normal Python: %f sec" % py_sec)

print("NumPy: %f sec" % np_sec)
Normal Python: 0.840633 sec
NumPy: 0.019618 sec

Arrays provide all kinds of indexing convenience for accessing data that is genuinely tabular. Tabular data is data that has good reason for being represented in a table; that is, there is good reason to think of the rows as a unit, and good reason to think of the columns as units. And therefore, there is good reason to think we might want to apply some computational operations to columns and rows. We’ll explore those ideas a bit more in a future assignment.

All kinds of data is fundamentally tabular, as we’ll see. But although that’s very important, it’s probably not the main reason why so many scientists, engineers, and data analysts use numpy and other programming tools like matlab and R that provide tables, or matrices (the mathematical name for a similar concept), as a fundamental data structure. The main reason is computational efficiency.

Having a single data structure with predictable types and elementwise basic arithmetic operations makes it possible to write very efficient ways of performing more complex operations. The above code snippet demonstrates this by providing a timing comparison between two ways of computing teh sum of the squares between 0 and 999 inclusive; the first is used in computing the variable normal_py_sec, is the ordinary Python way of computing such a sum: Cook up a container of theose integers, square each of them and sum the result. The second, stored in the varibale np_sec, uses numpy arrays. There is actually a very important mathematical operation dot product that multiplies all the corresponding elements of two vectors (elemntwise) and sums the result. Using this precompiled fast math operation we speed the code up by a factor of about 80 on my machine, and your results, on your various home machines will be different, but in the same ballpark.

Note that numpy arrays in and of themselves don’t help. In the code snippet below, we multiply the na array times itself and sum the result. This code is actually slower than the basic Python code, because it pays the overhead of creating the numpy array but doesn’t exploit it by using the proper optimized operation. The lesson here is that coding so as to take maximum advantage of the efficiencies of numpy can be tricky, but if you have a good reason to be efficient, it’s definitely worth some experimentation. That starts with knowing about the kinds of builtin array specific functions numpy offers.

naive_np_sec = timeit.timeit('sum(na*na)',
                             setup="import numpy as np; na=np.arange(1000)",
                             number=10000)
print("Naive NumPy: %f sec" % naive_np_sec)
Naive NumPy: 1.382197 sec