3.4.4. Dictionaries

Dictionaries store mappings between many kinds of data types. Suppose we are studying the Enron email network, keeping track of who emailed who. For we each employee, we want to keep a list who they emailed. The following picture captures the kind of information we want:

../_images/enron_toy_graph.png

How do we represent this kind of visual information in a way that allows us to access it in a Python program? What’s crucial here is the names and the fact that there are links between some of them (but not all of them). Whalley sends emails to two individuals, Mike Grigsby and Louise Kitchen; Grigsby sends emails to three indivduals, Lousie Kitchen, Scott Neal, and Kenneth Lay.

This is the kind of information that could very naturally be stored in a dictionary. The dictionary capturing just this little subpart of the Enron email graph would look like this:

1
2
3
4
5
6
>>> enron_network

{'Mike Grigsby':  ['Louise Kitchen', 'Scott Neal', 'Kenneth Lay'],
 'Greg Whalley':  ['Mike Grigsby', 'Louise Kitchen'],

}

Each dictionary entry represents the email sent by one employee. Thus, the first entry tells us Mike Grigsby sent email to Louise Kitchen, Scott Neal, and Kenneth Lay. Each employee is represented by a string that uniquely identifies them (their names). Thus, this dictionary relates strings to lists of strings (Pythonistas say it maps strings to lists of strings). The strings before the colons are the keys, and the lists after the colon are the values. Any hashable object can be the key of a dictionary; we’re not going to define hashable here, but the concept hashable has a close connection to the concept immutable; lists of strings are mutable and cannot be dictionary keys, while tuples of strings are immutable and can be; strings are also immutable and can also be dictionary keys. In this course, we will limit our use of dictionary keys to strings, tuples of strings, and numbers.

Most often our dictionary keys will be strings. We will use all kinds of objects as values.

>>> X = {'x':42, 'y': 3.14, 'z':7}
>>> Y = {1:2, 3:4}
>>> Z = {}

There is an alternative syntax for creating dictionaries:

>>> L = dict(x=42,y=3.14,z=7)

We will return to this feature when we discuss keyword arguments for functions, which this is a special case of.

Anything can be a value in a dictionary, including dictionaries and lists. Dictionaries can also be elements in lists:

>>> Double = [{'x':42,'y': 3.14, 'z':7},{1:2,3:4},{'w':14}]
>>> Double[0]
{'y': 3.1400000000000001, 'x': 42, 'z': 7}

This means to retrieve the value 42 for the key x, we can do:

>>> FirstDict = Double[0]
>>> FirstDict['x']
42

But Python allows any expression which has a Dictionary as its value to be followed by [key]. It is much more convenient and Pythonic to do this in one step:

>>> Double[0]['x']
42

3.4.4.1. Bundling information

A dictionary can be used to collect various attributes of an object. For example:

dd = dict(name = 'Mark Gawron',
          language = 'Python',
          favorite_tv_show = 'Monty Python',
          favorite_dessert = 'Apple pie')

The information can now be retrieved when convenient:

>>> dd['language']
'Python'

This kind of use, essentially representing some object with a dictionary, starts us down the road that leads to Python classes, discussed in Section Classes and class attributes, but it is low cost and fairly common.

Finally dictionaries can be update, though an assignment statement. So we can take dd and change the favorite desert after a visit to a good Mexican restaurant:

1
2
3
4
5
6
>>> dd['favorite_dessert'] = 'Flan'
>>> dd
dict(name = 'Mark Gawron',
     language = 'Python',
     favorite_tv_show = 'Monty Python',
     favorite_dessert = 'Flan')

3.4.4.2. From list to dictionary

It often happens that we have data in a format convenient for providing one kind of information, but we need it for a slightly different purpose. Suppose we have the words of English arranged according to frequency rank, from most frequent to least frequent. Such a list looks like this:

rank_list = ['the', 'be', 'of', 'and', 'a', 'in', 'to', 'have', 'it',
               'to', 'for', 'I', 'that', 'you', 'he', 'on', 'with', 'do',
               'at', 'by', 'not', 'this', 'but', 'from', 'they', 'his', ...]

This is convenient if you want to know things like what the 6th most frequent word of English is:

>>> rank_list[5]  # 6th word on list has index 5
'in'

But suppose you want to go in the opposite direction? Given a word, you want to know what its frequency rank is. Python provides a built-in method for finding the index of any item in a list, so we could equally well do:

>>> rank_list.index('in')
5

But this is actually a little inefficient: Python has to search through the list, comparing each item with ‘in’, until it finds the right one. If Python starts from the beginning of the list each time, that will take quite a lot of comparisons for words at the end of the list.

What this example illustrates is a very basic but extremely important computational concept. You want your information to be indexed in a way convenient for the kind of questions you’re going to ask. A phone book is an extremely useful format if you know a name and you need a phone number. Similarly, Webster’s Dictionary is wonderful if you have a word and want to know what the meaning is. In both cases, alphabetically sorting the entries give us an efficient way of accessing the information. But the phone book is fairly inconvenient if you have a phone number and want to know who has it, and Webster’s is quite hard to use if you have a meaning in mind and are hunting for the word to express it, or even if you have a word, and are looking for its rhyme.

If you have a set of strings, and you want to get particular kinds of information about each, dictionaries are the way to go: Dictionary look ups use an indexing scheme much like alphabetization to efficiently retrieve values for keys. In the example of the rank list, if our main interest is in going from words to ranks, what we want to do is convert the list into a dictionary. We want to convert rank_list into something we’ll call rank_dict, which looks like this:

1
2
3
4
5
{'the':0, 'be':1, 'of':2, 'and':3, 'a':4, 'in':5,
 'to':6,  'have':7, 'it':8, 'to':9, 'for':10, 'I':11,
 'that':12, 'you':13, 'he':14, 'on':15, 'with:16', 'do':17,
 'at':18, 'by':19, 'not':20, 'this':21, 'but':22, 'from':23,
  'they':24, 'his':25, ...]

One way to produce rank_dict is with this code snippet:

rank_dict = dict()
for (i, word) in enumerate(rank_list):
    rank_dict[word] = (i+1)

The first line creates an empty dictionary. Here is what enumerate does to a list:

>>> list(enumerate(['a','b','c']))
[(0,'a'),(1,'b'),(2,'c')]

Essentially, what enumerate does is pair each element in the list with its index, which is exactly the information we want for our dictionary.

3.4.4.3. Combining lists into a dictionary

It is frequently the case that the meaning of a particular data item is defined by its position in a sequence of things. The i th thing in that sequence is associated with a particular entity xi and the ith thing in some other sequence might also be associated with xi. For example, a company database might store information about employees in separate files, but the i th line of each file is reserved for information about the employee with id number i.

As a concrete example, suppose we downloaded information about word frequencies in the British National Corpus, a very large sample of English, likely to give very reliable counts, and it came in two files, with contents that looked like this:

File 1

File 2

a

2186369

abandon

424

abortion

147

about

5256

about

14455

above

213

above

1071

above

1288

abroad

394

abruptly

114

absence

594

absent

150

absolute

348

absolutely

578

absorb

268

abbey

111

ability

1046

able

3045

abolish

174

abolition

115

abnormal

80

absorption

93

abstract

160

That is, the i th numbered line in File 2 gives the frequency of the i th English word in File 1. The easiest way to read this data into Python produces two lists, word_list and freq_list. Now suppose we want to be able to go conveniently from a word to its frequency. The right thing to do is to make a dictionary. Here is one way to do that, using code like the code we’ve already seen:

freq_dict = dict()
for (i, word) in enumerate(word_list):
   freq_dict[word] = freq_list[i]

Look at this code and make sure you understand it. We use the enumerate function to give us the index of each word in the list as we see it, then associate that with the i th frequency in freq_list.

But Python provides a faster (more Pythonic) way to do this. As with list and str, the name of the Python type dict is also a function for producing dictionaries. In fact, we used that convention to create an empty dictionary in the code snippet above. But dict can do more than just create empty dictionaries. Given a list of pairs, it produces a dictionary whose keys are the first members of the pairs and whose values are corresponding second members:

>>> L = [('a',1), ('b',2), ('c',3)]
>>> dd = dict(L)
>>> dd
{'a':1, 'b':2, 'c':3}

Unfortunately, we have two lists, not one, and neither is a list of pairs. Not to worry. Python also provides an easy way to create the list we want. The function is called zip, and it takes two lists of the same length and returns a list of pairs:

>>> L_left = ['a','b','c']
>>> L_right = [1, 2, 3]
>>> zip(L_left, L_right)
[('a',1), ('b',2), ('c',3)]

Thus in this instance zip returns the same list L we saw above.

One of the key uses of zip is to join two sets of items into a list of pairs, which can then be turned into a dictionary using the type function dict:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
>>> from string import ascii_lowercase

>>> ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'

>>> range(1, 27)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26]

>>> zip(ascii_lowercase, range(1,27))
[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6), ('g', 7),
('h', 8), ('i', 9), ('j', 10), ('k', 11), ('l', 12), ('m', 13), ('n', 14),
('o', 15), ('p', 16), ('q', 17), ('r', 18), ('s', 19), ('t', 20), ('u', 21),
('v', 22), ('w', 23), ('x', 24), ('y', 25), ('z', 26)]
>>> dict(zip(ascii_lowercase, range(1,27)))
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4, 'g': 7, 'f': 6, 'i': 9,
'h': 8, 'k': 11, 'j': 10, 'm': 13, 'l': 12, 'o': 15, 'n': 14,
'q': 17, 'p': 16, 's': 19, 'r': 18, 'u': 21, 't': 20, 'w': 23,
'v': 22, 'y': 25, 'x': 24, 'z': 26}

Given these features, it is quite simple to produce the frequency dictionary we want from the lists defined above:

>>> freq_dict = dict(zip(word_list,freq_list))

This is a frequently used Python idiom which will come in handy.

3.4.4.4. A text-based example

Note

This section uses the data module example_string which can be found here. [Add ipython notebook]

We illustrate some code here for computing word counts in a string. This is included here as a very clear example of when dictionaries should be used: We have a set of strings (words in our vocabulary) and information that we want to keep about each string (the word’s count). We need to update that information in a loop (see Section Loops), as we look at each word in the string:

1
2
3
4
5
6
7
8
9
from example_string import example_string

count_dict = dict()

for word in example_string.split():
    if word in count_dict:
       count_dict[word] += 1
    else:
       count_dict[word] = 1

And then we have:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
>>> count_dict
{'all': 3,
 'consider': 2,
  'dance': 1,
  'better,': 1,
  'sakes,': 1,
  'pleasant,': 1,
  'four': 3,
   'go': 1,

    ...

   'Lizzy': 2,
   'Jane,': 1,

     ...

    'Kitty,': 2
  }

Let’s see what’s going on in the code. [Explain]

3.4.4.5. Dictionary operators and methods

If d is a dictionary then:

len(d)

Return the number of items in the d.

d[key]

Return the item of d with key key.

In general, d[key] returns an error if key is a a key in d. See Python docs for ways to customize this (__missing__() method).

d[key] = value

Set d[key] to value.

del d[key]

Remove d[key] from d. Raises a KeyError if key is not in d.

key in d

Return True if d has a key key, else :samp: False.

key not in d

Equivalent to :samp: not key in d.

d.clear()

Remove all items from the dictionary.

d.copy()

Return a shallow copy of the dictionary.

d.get(key[, default])

Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.

d.items()

Return a list of the dictionary’s (key, value) pairs.

d.keys()

Return a list of d’s keys. See the note for dict.items().

d.pop(key[, default])

If key is in the dictionary, remove it and return its value, else return default. If default is not given and key is not in the dictionary, a KeyError is raised.

d.popitem()

Remove and return an arbitrary (key, value) pair from the dictionary.

popitem() is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, calling popitem() raises a KeyError.

d.setdefault(key[, default])

If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.

d.update([other])

Update the dictionary with the key/value pairs from other, overwriting existing keys. Return None.

update() accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two). If keyword arguments are specified, the dictionary is then updated with those key/value pairs: d.update(red=1, blue=2).

d.values()

Return a copy of the dictionary’s list of values. See the note for dict.items().

3.4.4.6. Python collections module

Introduce Counters and defaultdicts.

Run though the above example, simplifying through the use of Counters.

A counter is a special kind of dictionary defined in the Python collections module:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Counters can be initialized with any sequence, and they will count the token occurrences in that sequence. For example:

>>> c = Counter('gallahad')
>>> c
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})

They can also be initialized directly with count information from a dictionary:

>>> c = Counter({'red': 4, 'blue': 2})      # a new counter from a mapping
>>> c = Counter(cats=4, dogs=8)             # a new counter from keyword args

A Counter is a kind of dictionary, but it does not behave entirely like a standard dictionary. The count of a missing element is 0; there are no key errors, as there would be with a standard Python dictionary:

>>> c = Counter(['eggs', 'ham'])
>>> c['bacon']
0

See the Python docs for more features.

Now we said a counter can initialized directly with any sequence; the correct term is any iterable, roughly, anything that can be iterated through with a for loop. But caution must be exercised when initializing counters this way, to guarantee that the right things are being counted. For example, if what is desired is the word counts, it won’t work to simply initilize a Counter with a file handle, even though a file handle is an iterable:

1
2
3
4
5
6
>>> from collections import Counter
>>> from urllib2 import  urlopen
>>> book_url = 'http://www.gutenberg.org/ebooks/1342.txt.utf-8' # Pride & Prejudice URL
>>> freq = Counter(urlopen(book_url).read()) # Open, read, ct :  One line
>>> len(freq)
88

Hmm, shorter book than one might think. Let’s look at it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
>>> freq
Counter({' ': 113925, 'e': 70339, 't': 47283, 'a': 42155, '
         o': 41138, 'n': 38428, 'i': 36272, 'h': 33881, 's': 33292,
        'r': 33289, 'd': 22246, 'l': 21283, 'u': 15440, '\n': 13426,
        '\r': 13426, 'm': 13401, 'c': 13395, 'y': 12654, 'f': 12177,
        'w': 11922, 'g': 10160, ',': 9278, 'p': 8386, 'b': 8249, '.': 6396,
        'v': 5811, '"': 3554, 'k': 3241, 'I': 2674, 'M': 1723,
        ';': 1539, '-': 1191, 'B': 1114, 'z': 933, 'T': 875, 'x': 865,
        'E': 856, '_': 808, 'L': 788, "'": 748, 'H': 701, 'C': 666,
        'W': 651, 'j': 638, 'q': 637, 'D': 597, 'S': 578, 'A': 539,
        '!': 500, '?': 462, 'Y': 379, 'J': 332, 'P': 297, 'N': 294,
        'G': 283, 'O': 241, 'F': 206, 'R': 177, ':': 154, 'K': 101,
        '1': 83, 'U': 68, '*': 58, '(': 38, ')': 38, '2': 35, '3': 33,
        '4': 30, 'V': 29, '5': 29, '0': 26, '/': 26, '8': 18, '6': 18,
        '9': 17, '7': 12, 'Z': 5, '$': 2, '@': 2, 'X': 2, '[': 2, ']': 2,
        '\xbb': 1, '\xbf': 1, '\xef': 1, '#': 1, '%': 1, 'Q': 1})

What happened and why? The counter counts occurrences of the elements of whatever sequence is passed in. The elements of a string are characters, so the counter counts character tokens, and we end up with a fairly representative sample of the letter frequencies of the English.

We can turn this into word frequencues by passing n a sequence whose elements are words. We turn the string read from the Gutenberg URL into a list of words before submitting it to a counter:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
>>> freq2 = Counter(urlopen(book_url).read().split())
>>> len(freq2)
13638
>>> freq2.most_common(50)
[('the', 4205), ('to', 4121), ('of', 3662), ('and', 3309), ('a', 1945),
('her', 1858), ('in', 1813), ('was', 1796), ('I', 1740), ('that', 1419),
('not', 1356), ('she', 1306), ('be', 1209), ('his', 1167), ('had', 1126),
('as', 1119), ('with', 1040), ('he', 1038), ('for', 1003), ('you', 987),
('it', 952), ('have', 817), ('is', 806), ('Mr.', 766), ('at', 751),
('on', 647), ('by', 633), ('but', 609), ('my', 583), ('were', 542),
('all', 529), ('so', 519), ('which', 509), ('been', 501), ('him', 500),
('could', 497), ('from', 480), ('they', 472), ('very', 466), ('would', 457),    ('no', 428), ('their', 408), ('your', 399), ('Elizabeth', 398),
('will', 386), ('what', 384), ('such', 371), ('this', 359), ('or', 357),
('an', 348)]

Interestingly the following also works, in the sense of not giving an error:

1
2
3
4
5
6
7
8
>>> freq3 = Counter(urlopen(book_url))
>>> freq3.most_common(10)
[('\r\n', 2394), ('                          * * * * *\r\n', 6),
('them."\r\n', 3), ('it.\r\n', 3), ('them.\r\n', 3),
('family.\r\n', 2), ('do."\r\n', 2),
('between Mr. Darcy and herself.\r\n', 2),
('almost no restrictions whatsoever.  You may copy it, give it away or\r\n', 2),
('together.\r\n', 2)]

We will not be able to understand what is happening here until we understand the type of object produced by the urllib2 function open:

>>> urlopen(book_url)
<addinfourl at 4303001288 whose fp = <socket._fileobject object at 0x1004a33d0>>

This creates an instance of a class which behaves very like the final type of container introduced in Section on More Python types, a file-like object. Although the data is being streamed through a data port on your computer (called a socket), which is in turn connected to a web server on the internet, the urlopen function produces an object that behaves just like a file stream connected to a file on your computer. This is one of the powerful features of Python basic types. Coupled with a programmer’s abilities to define customized “types” (called classes), types can effectively be customzied customized into objects that fill your particular data needs, but they are still defined for the same methods that work with a basic type. We saw on example of that above with Counters, which extend and specialize dictionaries; likewise, the openurl function from the urllib2 module returns a form of file-like object. We defer further discussion of file like objects until the Section an Files and file IO streams.