6.2. More on two-dimensional arrays¶
In this section we discuss some of the uses of 2D arrays, focusing on their role in representing mathematical relationships. We also cover some more advanced aspects of the data type.
The need for 2D arrays is obvious if you’ve taken a linear algebra class. They correspond to the mathematical object called a matrix. One application for matrices is in solving systems of equations, but this really only scratches the surface. A more fundamental application for a data scientist is they can represent all that is known about a single data set. Generally each row represents an item (an individual or event in the data), and the entry in each column its value for a particular attribute. For example let’s say each room in a hotel has lamps, tables, chairs, and beds, but in varying numbers. We might represent the inventory of items in 5-room hotel with a 5x4 matrix (5 rows, 4 columns), as follows:
6 3 4 1
5 2 3 2
8 3 6 2
5 1 3 1
10 4 7 2
So the first row represents a room with 6 lamps, 3 tables, 4 chairs, and 1 bed. Now if we represent the cost of each item as 1D cost array (or vector) to use the mathematical term,
40 175 90 450,
where the costs are ordered in the same way as our columns above: lamps costs, table costs, chair costs, and bed costs. Then we can compute the per room costs as follows:
room_matrix = \
np.array(
[[6, 3, 4, 1],
[5, 2, 3, 2],
[8, 3, 6, 2],
[5, 1, 3, 1],
[10, 4, 7, 2]])
cost_vector = np.array([40, 175, 90, 450])
print room_matrix
print cost_vector
print room_matrix.dot(cost_vector)
[[ 6 3 4 1]
[ 5 2 3 2]
[ 8 3 6 2]
[ 5 1 3 1]
[10 4 7 2]]
[ 40 175 90 450]
[1575 1720 2285 1095 2630]
The mathematical name for the dot
method, which computes the per
room cost, is matrix multiplication.
Arrays have many attributes.
print a.shape
print a.ndim
print a.dtype.name
print a.itemsize
print a.size
print type(a)
print np.ndarray
(3, 5)
2
int64
8
15
<type 'numpy.ndarray'>
<type 'numpy.ndarray'>
[6 7 8]
<type 'numpy.ndarray'>
b = np.array([6, 7, 8])
print b
print type(b)
b
[6 7 8]
<type 'numpy.ndarray'>
array([6, 7, 8])
A sequence of sequences can be used to define a 2D array, but since the inner sequences are rows of a table, they must all be the same length.
b = np.array( [ (1.5,2,3), (4,5,6) ] )
b
array([[ 1.5, 2. , 3. ],
[ 4. , 5. , 6. ]])
Arrays can also contain complex numbers, but remember it takes two quantities to define a single complex number:
c = np.array( [ [1,2], [3,4] ], dtype=complex )
c
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]])
A very convenient way to fill an array is to start with an array containing all 0’s or 1’s and then update the contents:
X = np.zeros( (3,4) )
X
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
Indexing works as it does with ordinary Python sequences, except that now there are two dimensions to worry about
X[0,0]
0.0
X[1,2]
0.0
X[3,0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-16-fb9c09f08361> in <module>()
----> 1 X[3,0]
IndexError: index 3 is out of bounds for axis 0 with size 3
Assigment works as it does with Python sequences, except that it now works in two dimensions
X[1,2] = 3
X
array([[ 0., 0., 0., 0.],
[ 0., 0., 3., 0.],
[ 0., 0., 0., 0.]])
X = np.zeros( (3,4) )
ctr = 0
(rows,cols) = X.shape
for i in range(rows):
for j in range(cols):
ctr += 1
X[i,j] += ctr*10
X
array([[ 10., 20., 30., 40.],
[ 50., 60., 70., 80.],
[ 90., 100., 110., 120.]])
A much faster and more natural way to create the same array:
X = np.arange(10,130,10).reshape((3,4))
X
array([[ 10, 20, 30, 40],
[ 50, 60, 70, 80],
[ 90, 100, 110, 120]])
The following expression gives the second row of X
X[1,:]
array([50, 60, 70, 80])
The following expression gives the 3rd column of X
:
X[:,2]
array([ 30, 70, 110])
Note that columns and rows are just 1D arrays; nothing identifies them as row vectors or column vectors.
As a convenience, numpy
lets you skip the :
when you want a row.
In other words, the next two expressions are synonyms. Note that you
need the :
if you want a column.
print X[1,:]
print X[1]
[ 50. 60. 70. 80.]
[ 50. 60. 70. 80.]
print X[1,:], X[1,:].shape
print X[:,1]
[50 60 70 80] (4,)
[ 20 60 100]
Next, let’s get part of a column or row. The first two elements of the
second column of X
, followed by the first two elements of the second
row of X
:
print X
print X[1,:2]
print X[:2,1]
[[ 10 20 30 40]
[ 50 60 70 80]
[ 90 100 110 120]]
[50 60]
[20 60]
Now let’s get a 2x2 subarray, consisting of the upper left hand corner
of X
, followed by the central part of the last two rows.
print X
print X[:2,:2]
print X[1:,1:3]
[[ 10 20 30 40]
[ 50 60 70 80]
[ 90 100 110 120]]
[[10 20]
[50 60]]
[[ 60 70]
[100 110]]
Array transposition is sometimes a nice way to get to the 2D array you
really want. The transposition of an array M
is called M.T
, and
the definition is that
M.T[i,j] = M[j,i]
So if M
is an m
x n
array, then M.T
is an n
x m
array. Look at X.T
and verify these definitions. The m
th row
of X
becomes the m
th column of X.T
. The n
th column
of X
becomes the n
th row of X.T
.
print X
print
print X.T
print X[1,2], X.T[2,1]
[[ 10 20 30 40]
[ 50 60 70 80]
[ 90 100 110 120]]
[[ 10 50 90]
[ 20 60 100]
[ 30 70 110]
[ 40 80 120]]
70 70
There are also 3D arrays, which have a third dimension; each position along the third dimension defines a 2D array.
A =np.arange(24).reshape((2,3,4))
A
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
In a 2D array, a single index slice is a row or column (a 1D array). In a 3D array, single index slice is a 2D array. As the examples below suggest, 3D arrays are kind of a hard data structure to think about, and they are generally used for very specific applications, such as representing color images (3 layers of 2D representation, one for each red-green-blue color intensity).
A[1,:,:]
array([[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]])
A[:,:,3]
array([[ 3, 7, 11],
[15, 19, 23]])
A[:,2,:]
array([[ 8, 9, 10, 11],
[20, 21, 22, 23]])
6.3. Elementwise arithmetic operations¶
We noted above an important fact about arrays which distinguishes them from lists: Arithmetic operations can be performed on arrays just as if they were numbers. We discuss that a bit more explicitly in this section.
First note that arithmetic operations such as addition and multiplication between two arrays are well defined, if the arrays meet certain constraints. If they have the same shape, addition and multiplication are always well-defined.
import numpy as np
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
print a
print b
[20 30 40 50]
[0 1 2 3]
a + b
array([20, 31, 42, 53])
a * b
array([ 0, 30, 80, 150])
a - b
array([20, 29, 38, 47])
In the next example we illustrate broadcasting. Elementwise operations are extended to apply between arrays and objects that are not arrays or between arrays that are not the same shape. o for example, 2 * a (a is the array above) returns an array that contains all the elements of a multiplied by 2
2 * a
array([ 40, 60, 80, 100])
This works by “broadcasting” 2 into an array the same size as a and then doing elementwise mutliplication on the two arrays.
Elementwise arithmetic also works with 2D arrays, often in a fairly intutive fashion.
A = np.array( [[1,1],
[0,1]] )
B = np.array( [[2,0],
[3,4]] )
A * B # elementwise product
array([[2, 0],
[0, 4]])
You can multiply two arrays of the same shape together as we did with
a
and b
above, but you can’t rely on broadcasting to figure out
what to do with mismatched array sizes. In general, adding arrays
of different sizes together will fail:
c = np.array([2,3])
c * a
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-e4d7d0499d5f> in <module>()
1 c = np.array([2,3])
----> 2 c * a
ValueError: operands could not be broadcast together with shapes (2,) (4,)
But sometimes there is a sensible way to apply the operation along some dimension of a 2D array. Consider:
a = np.array( [20,30,40,50] )
c = np.arange(8).reshape((2,4))
c
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
a + c
array([[20, 31, 42, 53],
[24, 35, 46, 57]])
The same broadcast principle applies to Boolean tests, which are converted to elementwise operations, resulting in an array of Booleans. We return to this point below.
a < 35
array([ True, True, False, False], dtype=bool)
6.3.1. Elementwise Boolean operations, Masking¶
We noted above that the idea of elementwise operations extends to Boolean tests.
In general, applying a Boolean test to an array returns an array of truth-values of the same size as the orginal array. Just as
3 + X
adds 3 to every element of array X
, so
X > 2
returns an array of truth-values which tells us which elements of X
are greater than 2.
import numpy as np
X = np.array([1,5,2,7,9,4,3,-6])
print X
print X >= 3
[ 1 5 2 7 9 4 3 -6]
[False True False True True True True False]
One of the most important uses of Boolean arrays is that they can index other arrays. This will always work when a Boolean array is the same shape as the array it indexes. So in the simplest case:
X[X>=3]
array([5, 7, 9, 4, 3])
This returns exactly the members of X that are greater than or equal to 3.
We could do the same with a list comprehension, but the array computation above is much faster:
[x for x in X if x >= 3]
[5, 7, 9, 4, 3]
The reason that X[X>=3]
works has to do with a basic fact of array
indexing we haven’t really made explicit yet. Any Boolean array Y of the
right length can be used to index an array X if X and Y are the same
length.
A hokey but essential example. Suppose we have an array of length 8 and we want to access the first, third, and seventh elements. Here’s how to do it with a Boolean array:
X = np.arange(8) + 1
Y = np.array([True,False,True,False,False,False,True,False])
print X
X[Y]
[1 2 3 4 5 6 7 8]
array([1, 3, 7])
Boolean arrays can also be used to count the number of elements in
an array that satisfy some constraint. The Boolean True
is treated as
1 with arithmetic operations
and the Boolean False
as 0, so summing a Boolean array counts the
number of Trues. The following expression correctly counts the number of
elements in X
that are greater than or equal to 3.
sum(X>=3)
5
In this case, though, using the array operation is about an order of magnitude less efficient than the natural Pythonic idiom.
len([x for x in X if x >3])
5
More on array operations and efficiency below. It’s a tricky subject.
Boolean operations work more or less as expected on 2D arrays:
import numpy as np
y = np.arange(35).reshape(5,7)
y>3
array([[False, False, False, False, True, True, True],
[ True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True]], dtype=bool)
Note that this Boolean array preserves the 5x7 shape of y
.
(y>3).shape
(5, 7)
But if we use it to index elements of y
, the result has to be a 1D
array:
y[y>3]
array([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])
Now suppose we have an array of data that has 4 columns, but we only want to keep the first, second and fourth columns. Here’s a simple way to do that using a Boolean array:
X = np.arange(8).reshape((2,4)) + 1
print X
print X[:,np.array([True,True,False,True])]
[[1 2 3 4]
[5 6 7 8]]
[[1 2 4]
[5 6 8]]
Now Boolean arrays and Boolean masking is a great feature but it will occasionally cause you grief if you forget that something is an array and that a Boolean test will often return an array of truth values instead of a truth value.
Python will return an error if you try to use an array where a Boolean
value is expected. The most common error of this sort is in an
if
-clause.
if y:
print 'hi'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-512f0bc47840> in <module>()
----> 1 if y:
2 print 'hi'
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
And the result of any Boolean test that can be interpreted as an elementwise operation will raise the same error:
if y == 0:
print 'hi'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-3e5bb3ed66c7> in <module>()
----> 1 if y == 0:
2 print 'hi'
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The is
test, however, will not raise this error, because it cannot
be interpreted as an elementwise operation:
if y is 3:
print 'Hi'
The flip side of this is that you must know to expect different kinds of
results with ==
and is
where arrays are concerned.
print y is 3
False
print y == 3
[[False False False True False False False]
[False False False False False False False]
[False False False False False False False]
[False False False False False False False]
[False False False False False False False]]
6.4. More on arrays¶
6.4.1. Data in the form of arrays¶
Many python modules that provide data do so in array form. As an example, we load the famous iris data set due to Edgar Anderson and made famous in a paper by Ronald Fisher.
from sklearn.datasets import load_iris
data = load_iris()
features = data['data']
target = data['target']
The variable features
is a numpy
array:
print features.shape
print features.ndim
print features.dtype.name
print features.size
print type(features)
print features[:10]
(150, 4)
2
float64
600
<type 'numpy.ndarray'>
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]
[ 5.4 3.9 1.7 0.4]
[ 4.6 3.4 1.4 0.3]
[ 5. 3.4 1.5 0.2]
[ 4.4 2.9 1.4 0.2]
[ 4.9 3.1 1.5 0.1]]
The last thing printed is the first 10 rows of features
. This array
represents data about irises. Each row represents a different iris and
gives 4 measurements for that exemplar. So there are 150 iris exemplars;
with 4 measurements for each, that’s 600 items data in the array
(features.size
). The data is used for classification studies.
Let’s apply what we’ve just learnt to pick a ceratin subset of the data. The first number in each row represents the sepal length of that particular iris exemplar. Let’s find all the irises in the data set whose sepal length is 5.
First we get a vector for the first column, and use it to create a Boolean vector identifying which rows have 5.0 in the first column.
first_col = features[:,0]
print first_col == 5.0
[False False False False True False False True False False False False
False False False False False False False False False False False False
False True True False False False False False False False False True
False False False False True False False True False False False False
False True False False False False False False False False False False
True False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False True False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False]
Then we use the Boolean array just constructed to index the rows,
combined with a :
to indicate we want all columns.
features[first_col == 5.0,:]
array([[ 5. , 3.6, 1.4, 0.2],
[ 5. , 3.4, 1.5, 0.2],
[ 5. , 3. , 1.6, 0.2],
[ 5. , 3.4, 1.6, 0.4],
[ 5. , 3.2, 1.2, 0.2],
[ 5. , 3.5, 1.3, 0.3],
[ 5. , 3.5, 1.6, 0.6],
[ 5. , 3.3, 1.4, 0.2],
[ 5. , 2. , 3.5, 1. ],
[ 5. , 2.3, 3.3, 1. ]])
So there are only 10 plants out of 150 that have sepal lengths of 5.
Of course we can do this all in one step, with the same result:
features[features[:,0] == 5.0,:]
array([[ 5. , 3.6, 1.4, 0.2],
[ 5. , 3.4, 1.5, 0.2],
[ 5. , 3. , 1.6, 0.2],
[ 5. , 3.4, 1.6, 0.4],
[ 5. , 3.2, 1.2, 0.2],
[ 5. , 3.5, 1.3, 0.3],
[ 5. , 3.5, 1.6, 0.6],
[ 5. , 3.3, 1.4, 0.2],
[ 5. , 2. , 3.5, 1. ],
[ 5. , 2.3, 3.3, 1. ]])
Since we’re indexing rows here, the :
can also be left out:
features[features[:,0] == 5.0]
array([[ 5. , 3.6, 1.4, 0.2],
[ 5. , 3.4, 1.5, 0.2],
[ 5. , 3. , 1.6, 0.2],
[ 5. , 3.4, 1.6, 0.4],
[ 5. , 3.2, 1.2, 0.2],
[ 5. , 3.5, 1.3, 0.3],
[ 5. , 3.5, 1.6, 0.6],
[ 5. , 3.3, 1.4, 0.2],
[ 5. , 2. , 3.5, 1. ],
[ 5. , 2.3, 3.3, 1. ]])
A more interesting use of Boolean array indexing is to find all the
irises of a particular class, using the variable target
defined when
we loaded the data; target
is an array containing the class of each
iris in the data set:
print type(target)
print target.shape
print set(target)
print features[0], target[0]
print features[90], target[90]
<type 'numpy.ndarray'>
(150,)
set([0, 1, 2])
[ 5.1 3.5 1.4 0.2] 0
[ 5.5 2.6 4.4 1.2] 1
As the printouts indicate , target
is a 1D array (a vector) of
length 150, containing only the values 0,1, and 2. These are the three
classes to which an iris can belong. There are exactly as many entries
in the target
array as there are rows in the features
array. For
any iris, its row-index in the features
array is the index of its
class in the target
. Above we printed the features and target class
for irises 0 and 90.
We can take advantage of this structure to find all the irises of class
1 very efficiently. Note that target == 1
is a Boolean array of
length 150.
target==1
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False], dtype=bool)
To find the irises of class 1, we simply use that Boolean array to
specify the rows we want in the features
array:
features[target==1,:]
array([[ 7. , 3.2, 4.7, 1.4],
[ 6.4, 3.2, 4.5, 1.5],
[ 6.9, 3.1, 4.9, 1.5],
[ 5.5, 2.3, 4. , 1.3],
[ 6.5, 2.8, 4.6, 1.5],
[ 5.7, 2.8, 4.5, 1.3],
[ 6.3, 3.3, 4.7, 1.6],
[ 4.9, 2.4, 3.3, 1. ],
[ 6.6, 2.9, 4.6, 1.3],
[ 5.2, 2.7, 3.9, 1.4],
[ 5. , 2. , 3.5, 1. ],
[ 5.9, 3. , 4.2, 1.5],
[ 6. , 2.2, 4. , 1. ],
[ 6.1, 2.9, 4.7, 1.4],
[ 5.6, 2.9, 3.6, 1.3],
[ 6.7, 3.1, 4.4, 1.4],
[ 5.6, 3. , 4.5, 1.5],
[ 5.8, 2.7, 4.1, 1. ],
[ 6.2, 2.2, 4.5, 1.5],
[ 5.6, 2.5, 3.9, 1.1],
[ 5.9, 3.2, 4.8, 1.8],
[ 6.1, 2.8, 4. , 1.3],
[ 6.3, 2.5, 4.9, 1.5],
[ 6.1, 2.8, 4.7, 1.2],
[ 6.4, 2.9, 4.3, 1.3],
[ 6.6, 3. , 4.4, 1.4],
[ 6.8, 2.8, 4.8, 1.4],
[ 6.7, 3. , 5. , 1.7],
[ 6. , 2.9, 4.5, 1.5],
[ 5.7, 2.6, 3.5, 1. ],
[ 5.5, 2.4, 3.8, 1.1],
[ 5.5, 2.4, 3.7, 1. ],
[ 5.8, 2.7, 3.9, 1.2],
[ 6. , 2.7, 5.1, 1.6],
[ 5.4, 3. , 4.5, 1.5],
[ 6. , 3.4, 4.5, 1.6],
[ 6.7, 3.1, 4.7, 1.5],
[ 6.3, 2.3, 4.4, 1.3],
[ 5.6, 3. , 4.1, 1.3],
[ 5.5, 2.5, 4. , 1.3],
[ 5.5, 2.6, 4.4, 1.2],
[ 6.1, 3. , 4.6, 1.4],
[ 5.8, 2.6, 4. , 1.2],
[ 5. , 2.3, 3.3, 1. ],
[ 5.6, 2.7, 4.2, 1.3],
[ 5.7, 3. , 4.2, 1.2],
[ 5.7, 2.9, 4.2, 1.3],
[ 6.2, 2.9, 4.3, 1.3],
[ 5.1, 2.5, 3. , 1.1],
[ 5.7, 2.8, 4.1, 1.3]])
Now suppose we wanted to just see the sepal lengths (column 0) of the target class 1 irises, and compare them to the sepal lengths of the target class 0 irises, to see if sepal lengths provided a good way of telling the two classes apart. We could do:
print features[target==1,0]
print features[target==0,0]
[ 7. 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5. 5.9 6. 6.1 5.6
6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6. 5.7
5.5 5.5 5.8 6. 5.4 6. 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5. 5.6
5.7 5.7 6.2 5.1 5.7]
[ 5.1 4.9 4.7 4.6 5. 5.4 4.6 5. 4.4 4.9 5.4 4.8 4.8 4.3 5.8
5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5. 5. 5.2 5.2 4.7
4.8 5.4 5.2 5.5 4.9 5. 5.5 4.9 4.4 5.1 5. 4.5 4.4 5. 5.1
4.8 5.1 4.6 5.3 5. ]
And though there is some overlap, we see that there is a tendency for the column-0 value be higher in target 1 than it is target class 0.
6.4.2. Array-level operations¶
Some operations are elementwise, some are not. The ones that are not are array-level operations, the ones it makes sense to apply to an array as a whole. Let’s cook up an array:
a = np.array([[1.0, 2.0], [3.0, 4.0], [8.0,5.0]])
a
array([[ 1., 2.],
[ 3., 4.],
[ 8., 5.]])
Now we’ll transpose it. That is, we’ll switch the rows and columns. Instead of a 3x2 array, we get a 2x3 array.
b = a.transpose()
b
array([[ 1., 3., 8.],
[ 2., 4., 5.]])
So transposing is an array-level operation. It applies to an array as a whole, not element by element, and returns an array as a result.
To access the same item in b
, we now hae to swap the row and column
indices
print a[2,0]
print b[0,2]
8.0
8.0
Another example of an array operation is the matrix product, or dot product. This is a mathematically important operation that takes two arrays and returns an array. You can look up its definition in Wikipedia, if you’re interested.
a.dot(b)
array([[ 5., 11., 18.],
[ 11., 25., 44.],
[ 18., 44., 89.]])
The moral: In programming with arrays, it is important to know which
operations can be applied elementwise, and which cannot. The distinction
is not always completely predictable. We saw above that ==
is a
Boolean operation that applies elementwise to an array, but is
,
another Boolean operation, does not.
6.4.3. Cosine: An array level operation¶
As a more mathematical example of an array-level operation, we’ll use cosine. That is, cosine as an operation on two 1-dimensional arrays (or vectors), often thought of as a measure of their similarity.
One way of measuring the similarity between two arrays x
and y
.
is to take their cosine. The name for this similarity measure is quite
appropriate. The cosine of x
and y
is a measure of the geometric
cosine of the angle between them. It is 1 when that angle is 0 (i.e.,
x
and y
are identical in direction), and -1 when the vectors
point in opposite directions, and 0 when the vectors are orthogonal (they
do not share a component in any direction). It is computed by taking the
dot product of the unit vectors pointing in the same direction as x
and y
. To do this we divide x
and y
by their Euclidean
length (LA.norm
in the code below) and take the dot product of the
results.
import numpy as np
import numpy.linalg as LA
x = np.array([3,4])
y = np.array([4,3])
def find_unit_vector(x):
return x/LA.norm(x,ord=2)
def cosine (x,y):
return (find_unit_vector(x)).dot(find_unit_vector(y))
It is clear from these definitions that it makes no sense to think of cosine being applied elementwise. It is not an operation on scalars.
cosine(x,x)
1.0
cosine(x,-x)
-1.0
cosine(x,y)
0.95999999999999996
LA.norm(find_unit_vector(x))
1.0
6.4.4. Why use arrays?¶
import timeit
import numpy as np
py_sec = timeit.timeit('sum(x*x for x in xrange(1000))',
number=10000)
np_sec = timeit.timeit('na.dot(na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
print("Normal Python: %f sec" % py_sec)
print("NumPy: %f sec" % np_sec)
Normal Python: 0.840633 sec
NumPy: 0.019618 sec
Arrays provide all kinds of indexing convenience for accessing data that is genuinely tabular. Tabular data is data that has good reason for being represented in a table; that is, there is good reason to think of the rows as a unit, and good reason to think of the columns as units. And therefore, there is good reason to think we might want to apply some computational operations to columns and rows. We’ll explore those ideas a bit more in a future assignment.
All kinds of data is fundamentally tabular, as we’ll see. But although
that’s very important, it’s probably not the main reason why so many
scientists, engineers, and data analysts use numpy
and other
programming tools like matlab
and R
that provide tables, or
matrices (the mathematical name for a similar concept), as a
fundamental data structure. The main reason is computational efficiency.
Having a single data structure with predictable types and elementwise
basic arithmetic operations makes it possible to write very efficient
ways of performing more complex operations. The above code snippet
demonstrates this by providing a timing comparison between two ways of
computing teh sum of the squares between 0 and 999 inclusive; the first
is used in computing the variable normal_py_sec
, is the ordinary
Python way of computing such a sum: Cook up a container of theose
integers, square each of them and sum the result. The second, stored in
the varibale np_sec
, uses numpy arrays. There is actually a very
important mathematical operation dot product that multiplies all the
corresponding elements of two vectors (elemntwise) and sums the result.
Using this precompiled
fast math operation we speed the code up by a
factor of about 80 on my machine, and your results, on your various home
machines will be different, but in the same ballpark.
Note that numpy arrays in and of themselves don’t help. In the code
snippet below, we multiply the na
array times itself and sum the
result. This code is actually slower than the basic Python code,
because it pays the overhead of creating the numpy
array but doesn’t
exploit it by using the proper optimized operation. The lesson here is
that coding so as to take maximum advantage of the efficiencies of
numpy
can be tricky, but if you have a good reason to be efficient,
it’s definitely worth some experimentation. That starts with knowing
about the kinds of builtin array specific functions numpy
offers.
naive_np_sec = timeit.timeit('sum(na*na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
print("Naive NumPy: %f sec" % naive_np_sec)
Naive NumPy: 1.382197 sec