6.1. Numpy and Arrays

This section provides a very cursory introduction to Numpy, enough to set up some of the basic concepts used in pandas. Numpy is a vast toolbox with a host of powerful mathematical features. In this section we have the modest goal of introducing arrays and array operations, trying to present some basic methods for creating and accessing arrays.

For more advanced students a very useful introduction to the capabilities of numpy is provided in Nicholas Rougier’s github tutorial, which mixes code and prose.

For students already familiar with Matlab, there are a number of strong similarities, but there are also some important differences. The article Numpy for Matlab users does a good job of explaining the relationship. The gist of the discussion is that it is possible to do things the Matlab way using the numpy matrix class, but there is a price to be paid, especially when using a non-standard Python module (third-party software, often indispensable), and it may be worth it to do things the Python way (using arrays).

6.1.1. Making ranges

We first introduce the range function, which is useful on its own as a Python programming tool; its relevance here is its strong connection with list and array slices.

x = range(15)
x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
range(1,5)
[1, 2, 3, 4]
range(5,100)
[5, 6, 7, 8, 9, 10, ..., 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
This time we create a sequence that counts up from 5 by 5. The third argument specifies the size of the counting steps.

range(5,100,5)
[5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

The counting steps may also be negative, in which case we count down from the index specified by the first argument, down to but not including the second index. So in this case the second argument must be less than the first.

range(10,0,-1)
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

To include 0 in the result, the second argument must be one step further down, that is, -1.

range(10,-1,-1)

Which gives:

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

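The result of range is always a list. (This is true in Python 2, which the examples in this section use. In Python 3, range returns a lazy range object, and list(range(...)) produces the corresponding list.)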

type(x)
list

The way range works is the same as the way slices work. The first, second, and third arguments of a slice all do the same things as the corresponding arguments of range.

L = range(10,-1,-1)
L[2:8]
[8, 7, 6, 5, 4, 3]

In particular, slices also take a third “step” argument.

L[8:2:-1]

gives

[2, 3, 4, 5, 6, 7]
>>> L
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

So an efficient Pythonic way to reverse a list is:

>>> L[::-1]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
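
The correspondence between range arguments and slice arguments can be checked directly. The following is a minimal sketch, assuming Python 3 (where range is lazy and has to be wrapped in list() to get a list); the names by_slice and by_range are illustrative only.

```python
# Python 3 sketch: range is lazy, so list() materializes it.
L = list(range(10, -1, -1))          # [10, 9, ..., 0]

# A slice [start:stop:step] selects exactly the positions that
# range(start, stop, step) enumerates.
start, stop, step = 8, 2, -1
by_slice = L[start:stop:step]
by_range = [L[i] for i in range(start, stop, step)]

print(by_slice)                      # [2, 3, 4, 5, 6, 7]
print(by_slice == by_range)          # True

# Reversal: default start and stop, step of -1.
print(L[::-1] == list(range(11)))    # True
```

The list comprehension is only there to spell out what the slice does; in practice the slice is the shorter and faster form.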

6.1.2. Numpy arrays

We import the numpy module to demonstrate the use of arrays:

>>> import numpy as np

Arrays are columns of numbers. Actually they don't have to be numbers; they can also be strings; but all the items are generally of the same data type. Conceptually, there is nothing more to the idea of arrays than there is to the idea of lists. They are data structures containing items in sequence, like the following:

>>> x = np.array([1.0,2.,3.1])

Like lists you can access them by index:

>>> x[2]
3.1

So why do we need arrays in addition to lists? One reason is space. We can save a great deal of space storing sequences if we know that all the items in the sequence are of the same data type. Another reason is time; mathematical operations can be made much more efficient if they are performed on sequences of uniform type. So the one-type restriction on arrays is quite helpful, in light of the fact that there are people out there doing massive amounts of number crunching involving very large arrays.
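
The space argument can be made concrete with a rough measurement. This is only a sketch, assuming Python 3 on CPython; the list figure is an estimate (container size plus the size of each element object), not an exact accounting, and the names as_list and as_array are illustrative.

```python
import sys
import numpy as np

n = 1000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)   # dtype fixed so the sizes are predictable

# The array stores n items of 8 bytes each in one contiguous block.
print(as_array.itemsize)   # 8
print(as_array.nbytes)     # 8000

# A list stores references to separate Python int objects; this rough
# estimate adds the container size to the size of every element.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
print(list_bytes > as_array.nbytes)   # True on CPython
```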

A large part of why arrays provide such massive gains in efficiency is vectorization of operations. The fancy mathematical term for a column of numbers is a vector. To vectorize an operation means to generalize it from an operation on numbers to an operation on vectors. When you load numpy, vectorized versions of all the basic arithmetic operations are defined. For example, consider addition:

>>> x = np.array([1.0,2.,3.1])
>>> y = np.array([-1.0,-2.,2.9])
>>> x + y
array([ 0.,  0.,  6.])

The result of adding array x and array y is a new array whose $i$th element is the sum of $x[i]$ and $y[i]$.
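
To make the elementwise rule explicit, the sketch below (assuming Python 3 and a recent NumPy) computes the same sum with a loop and compares the two results. It also shows that + on plain lists means something different, namely concatenation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.1])
y = np.array([-1.0, -2.0, 2.9])

# Vectorized addition: one expression, no explicit loop.
vector_sum = x + y

# The same computation written element by element.
loop_sum = np.array([x[i] + y[i] for i in range(len(x))])
print(np.allclose(vector_sum, loop_sum))   # True

# With plain lists, + means concatenation, not elementwise addition.
print([1.0, 2.0] + [3.0, 4.0])             # [1.0, 2.0, 3.0, 4.0]
```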

Similar generalizations apply to all the 2-place arithmetic operations. So why should ordinary working data scientists care about arrays? One answer of course is that efficiency usually ends up mattering, even when you think it won’t. But there is a simpler answer that has immediate consequences even for beginners. Vectorization provides us with a lot of programming conveniences that make for clearer, more concise code.

We return to vectorization and the conveniences it offers below and in the visualization chapter. For now let us consider some more methods of making arrays.

The simplest way to create an array is just to pass a list to the np.array function, as we did in our first example above.

>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<type 'numpy.ndarray'>

Then there is arange, a close cousin of range.

import numpy as np
a = np.arange(15)
a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Instead of returning a list, arange returns an array. Here is another example with arange, using integers and a step argument.

>>> np.arange(10, 30, 5)
array([10, 15, 20, 25])

An important feature of arange, distinguishing it from range, is that none of the arguments, including the step argument, need to be integers:

>>> x = np.arange(0, 10, 0.01)
>>> x
array([ 0.  ,  0.01,  0.02,  0.03,  0.04,  0.05,  0.06,  0.07,  0.08,
        0.09,  0.1 ,  0.11,  0.12,  0.13,  0.14,  0.15,  0.16,  0.17,
        0.18,  0.19,  0.2 ,  0.21,  0.22,  0.23,  0.24,  0.25,  0.26,
        0.27,  0.28,  0.29,  0.3 ,  0.31,  0.32,  0.33,  0.34,  0.35,
        0.36,  0.37,  0.38,  0.39,  0.4 ,  0.41,  0.42,  0.43,  0.44,
        0.45,  0.46,  0.47,  0.48,  0.49,  0.5 ,  0.51,  0.52,  0.53,

        ....

        9.54,  9.55,  9.56,  9.57,  9.58,  9.59,  9.6 ,  9.61,  9.62,
        9.63,  9.64,  9.65,  9.66,  9.67,  9.68,  9.69,  9.7 ,  9.71,
        9.72,  9.73,  9.74,  9.75,  9.76,  9.77,  9.78,  9.79,  9.8 ,
        9.81,  9.82,  9.83,  9.84,  9.85,  9.86,  9.87,  9.88,  9.89,
        9.9 ,  9.91,  9.92,  9.93,  9.94,  9.95,  9.96,  9.97,  9.98,  9.99])

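Some other differences between arrays and lists are demonstrated with the following attributes. Here a has first been reshaped into a table with 3 rows and 5 columns, an operation discussed in the next section.

>>> a = np.arange(15).reshape(3, 5)
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> a.itemsize
8
>>> a.size
15
>>> type(a)
<type 'numpy.ndarray'>
>>> np.ndarray
<type 'numpy.ndarray'>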
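
The attributes are related to one another in a simple way, which the following sketch spells out. It assumes Python 3 and a recent NumPy, uses reshape (introduced in the next section) to get a 3 by 5 table, and fixes the dtype explicitly so that the numbers do not depend on the platform.

```python
import numpy as np

# dtype is given explicitly so the result does not depend on the platform.
a = np.arange(15, dtype=np.int64).reshape(3, 5)

print(a.shape)       # (3, 5): 3 rows, 5 columns
print(a.ndim)        # 2: number of dimensions
print(a.dtype.name)  # int64
print(a.itemsize)    # 8: bytes per element
print(a.size)        # 15: total number of elements

# size is the product of the entries of shape; nbytes is size * itemsize.
print(a.size == a.shape[0] * a.shape[1])   # True
print(a.nbytes)                            # 120
```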

6.1.3. Two-dimensional arrays

The real power of arrays emerges when we look at more complicated examples representing tabular information. Lists are always 1-dimensional; they are simple sequences, with only one direction to go when looking for the next item. Arrays can have more than one dimension.

We start with a simple sequence of length 15; we now reshape it into a 3 by 5 table (3 rows, 5 columns):

a = np.arange(15).reshape(3, 5)    # 3 rows, 5 columns
print a
b = np.arange(15).reshape(5, 3)    # same data as 5 rows, 3 columns
print b
print b.transpose()                # rows and columns of b swapped

This results in

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
[[ 0  3  6  9 12]
 [ 1  4  7 10 13]
 [ 2  5  8 11 14]]
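
Two properties of reshape and transpose are worth checking by hand. The sketch below, assuming Python 3 and a recent NumPy, confirms that transposing swaps the row and column indices, and that reshape refuses a shape whose element count does not match. The name bt is illustrative.

```python
import numpy as np

b = np.arange(15).reshape(5, 3)
bt = b.transpose()

print(b.shape, bt.shape)        # (5, 3) (3, 5)

# Transposing swaps row and column indices.
print(bt[2, 4] == b[4, 2])      # True

# reshape only works if the total number of elements is preserved.
try:
    np.arange(15).reshape(4, 4)
except ValueError as err:
    print("reshape failed:", err)
```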

To create a table (or 2-dimensional array), we pass np.array a list of lists (or, as in this example, a list of tuples). Each of the embedded sequences specifies one row of the table:

>>> b = np.array( [ (1.5,2,3), (4,5,6) ] )
>>> b
array([[ 1.5,  2. ,  3. ],
       [ 4. ,  5. ,  6. ]])

An important feature of arrays, distinguishing them from lists, is that all the data items are generally of the same data type. The data type is often deduced from the arguments passed in, or it can be specified explicitly with the optional dtype argument, which must specify a legal numpy data type. That includes all the basic Python number types.

>>> c = np.array( [ [1,2], [3,4] ], dtype=complex )
>>> c
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])
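
The deduction of the data type can be observed through the dtype attribute. This is a small sketch, assuming Python 3 and a recent NumPy; the variable names are illustrative. Note that the exact width of the default integer type depends on the platform, so only its kind is printed.

```python
import numpy as np

ints = np.array([[1, 2], [3, 4]])                    # all integers
mixed = np.array([(1.5, 2, 3), (4, 5, 6)])           # one float promotes every item
forced = np.array([[1, 2], [3, 4]], dtype=complex)   # type requested explicitly

print(ints.dtype.kind)    # i (signed integer; exact width is platform dependent)
print(mixed.dtype)        # float64
print(forced.dtype)       # complex128
```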

Two other important ways of creating arrays are the ones and zeros functions. We create a 3 by 4 array containing only the float 0.

>>> np.zeros( (3,4) )
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

We create a 3 by 4 array containing all 1’s.

>>> A = np.ones( (3,4) )
>>> A
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
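
Both functions accept the same optional dtype argument as np.array. The following sketch, assuming Python 3 and a recent NumPy, shows the default float type, an integer variant, and one way to build a constant table by scaling an array of ones. The names Z, ones_int, and sevens are illustrative.

```python
import numpy as np

Z = np.zeros((3, 4))                    # default dtype is float64
ones_int = np.ones((3, 4), dtype=int)   # integer ones instead of floats

print(Z.dtype, Z.shape)                 # float64 (3, 4)
print(ones_int.sum())                   # 12

# A constant table can be built by scaling an array of ones.
sevens = 7 * np.ones((3, 4))
print(sevens[0].tolist())               # [7.0, 7.0, 7.0, 7.0]
```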

6.1.4. Indexing into arrays

We create an array using arange and an elementwise application of cubing, to get an array with the first ten perfect cubes (starting from 0).

>>> a = np.arange(10)**3
>>> a
array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729])

This is a one-dimensional array whose data can be accessed just like a list:

>>> a[2]
8
>>> a[2:5]
array([ 8, 27, 64])

Note that the way this array was created is peculiar to arrays. You can “cube” an array, but you can’t cube a list:

>>> L = range(10)
>>> L**3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

We can also make assignments to array positions, just as we do with lists. In this example, we use a slice with a step to affect every other position in a range of positions.

>>> a[:6:2] = -1000    # equivalent to a[0:6:2] = -1000; from the start up to (but not including) position 6, set every 2nd element to -1000
>>> a
array([-1000,     1, -1000,    27, -1000,   125,   216,   343,   512,   729])
>>> a[::-1]            # a reversed
array([  729,   512,   343,   216,   125, -1000,    27, -1000,     1, -1000])

Again, what is happening here is peculiar to arrays. You can’t assign a non-sequence in a slice assignment to a list:

>>> L[::2] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must assign iterable to extended slice

One can loop through arrays, as one loops through lists. We first rebuild the array of cubes, since the assignments above changed some of its entries:

a = np.arange(10)**3
for i in a:
    print i**(1/3.),
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

But this is often unnecessary. One can often perform mathematical operations on arrays as if they were numbers:

a ** (1/3.)
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In fact, performing the arithmetic operation directly on the array is much more efficient than using the loop. We discuss some of the possibilities opened up by this fact in the next section.
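
The equivalence of the loop and the direct array operation can be verified as follows. The sketch assumes Python 3 and a recent NumPy. Because floating-point cube roots are only approximately whole numbers (newer versions may display values such as 3.9999999999999996), the comparison uses np.allclose and the display rounds the result. The names loop_roots and vector_roots are illustrative.

```python
import numpy as np

a = np.arange(10) ** 3

# Loop version: one Python-level power operation per element.
loop_roots = [v ** (1 / 3.) for v in a]

# Vectorized version: a single expression over the whole array.
vector_roots = a ** (1 / 3.)

print(np.allclose(loop_roots, vector_roots))         # True

# Round before display, since the roots are only approximately integers.
print(np.round(vector_roots).astype(int).tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```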

6.1.5. Efficient array creation

The following way of making a 2D array is very general and much faster than using a loop. Just write a function of two arguments which for any i and j returns the value you want to place at (i,j) in the array. You can also specify a type for the elements of the array, using a known Python type or special numpy types for numbers. For example, numpy has the following types for integers, providing data elements of different memory sizes:

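np.int, np.int16, np.int32, np.int64

(np.int was only an alias for the built-in Python int; it has been removed from recent NumPy releases, where int should be used instead.)

import numpy as np
def f(i,j):
    return 10*i+j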

# make a 5x4 array [20 elements] `b` using the function `f`
# where b[i,j] = f(i,j)
b = np.fromfunction(f,(5,4),dtype=int)
b
array([[ 0,  1,  2,  3],
       [10, 11, 12, 13],
       [20, 21, 22, 23],
       [30, 31, 32, 33],
       [40, 41, 42, 43]])
print b[2,3]
print b[2,:]
23
[20 21 22 23]
b[0:5, 1]                       # each row in the second column of b
array([ 1, 11, 21, 31, 41])
b[ : ,1]                        # equivalent to the previous example
array([ 1, 11, 21, 31, 41])
b[1:3, : ]                      # each column in the second and third row of b
array([[10, 11, 12, 13],
       [20, 21, 22, 23]])
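
One detail behind the speed of fromfunction is that f is not called once per cell. NumPy calls it a single time with whole arrays of row and column indices, so f has to be written with operations that work on arrays. The sketch below, assuming Python 3 and a recent NumPy, makes this visible by building the same table from explicit index arrays obtained with np.indices.

```python
import numpy as np

def f(i, j):
    # i and j arrive as whole index arrays, not single integers
    return 10 * i + j

b = np.fromfunction(f, (5, 4), dtype=int)

# Equivalent construction with explicit index arrays.
i, j = np.indices((5, 4))
print(np.array_equal(b, 10 * i + j))   # True

print(b[2, 3])            # 23
print(b[:, 1].tolist())   # [1, 11, 21, 31, 41]
print(b[1:3, :].shape)    # (2, 4)
```

A function that uses, say, an if statement on i or j would therefore not work as written, because the test would be applied to an entire array at once.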