3.4.4. Dictionaries

Dictionaries store mappings between many kinds of data types. Suppose we are studying the Enron email network, keeping track of who emailed who. For we each employee, we want to keep a list who they emailed. This is the kind of information that would be stored in a dictionary. It would look like this:

1>>> enron_network
2
3{'Mike Grigsby':  ['Louise Kitchen', 'Scott Neal', 'Kenneth Lay', ... ],
4 'Greg Whalley':  ['Mike Grigsby', 'Louise Kitchen;, ... ],
5  ...
6
7}

Each entry represents the email sent by one employee. Thus, the first entry tells us Mike Grigsby sent email to Louise Kitchen, Scott Neal, Kenneth Lay, etcetera. Each employee is represented by a string that uniquely identifies them (their names). Thus, this dictionary relates strings to lists of strings (Pythonistas say it maps strings to lists of strings). The strings before the colons are the keys, and the lists after the colon are the values. Any immutable type can be the key of a dictionary.

For our purposes we will most often use strings as the ‘keys’ of a dictionary. We will use all kinds of things as values.

>>> X = {'x':42, 'y': 3.14, 'z':7}
>>> Y = {1:2, 3:4}
>>> Z = {}
Make some assignments into a dict
>>> L = dict(x=42,y=3.14,z=7)

The following is the way to do the same with integer keys.

>>> M = dict([[1, 2],[3, 4]])
>>> X['x']
42
>>> X['y']
3.1400000000000001
>>> M[1]
2
The following raises a KeyError:
>>> X['w']
...
KeyError: 'w'

Dictionaries and lists can be mixed. We saw above that a list can be a value in a dictionary. Dictionaries can also be elements in lists:

>>> Double = [{'x':42,'y': 3.14, 'z':7},{1:2,3:4},{'w':14}]
>>> Double[0]
{'y': 3.1400000000000001, 'x': 42, 'z': 7}

This means to retrieve the value 42 for the key x, we can do:

>>> FirstDict = Double[0]
>>> FirstDict['y']
42

But Python allows any expression which has a Dictionary as its value to be followed by [key]. It is much more convenient and Pythonic to do this in one step:

>>> Double[0]['x']
42

3.4.4.1. Bundling information

A dictionary can be used to collect various attributes of an object. For example:

dd = dict(name = 'Mark Gawron',
          language = 'Python',
          favorite_tv_show = 'Monty Python',
          favorite_desert = 'Apple pie')

The information can now be retrieved when convenient:

>>> dd['language']
'Python'

This kind of use, essentially representing some object with a dictionary, starts us down the road that leads to Python classes, discussed in Section Classes and class attributes, but it is low cost and fairly common.

3.4.4.2. From list to dictionary

It often happens that we have data in a format convenient for providing one kind of information, but we need it for a slightly different purpose. Suppose we have the words of English arranged according to frequency rank, from most frequent to least frequent. Such a list looks like this:

rank_list = ['the', 'be', 'of', 'and', 'a', 'in', 'to', 'have', 'it',
               'to', 'for', 'I', 'that', 'you', 'he', 'on', 'with', 'do',
               'at', 'by', 'not', 'this', 'but', 'from', 'they', 'his', ...]

This is convenient if you want to know things like what the 6th most frequent word of English is:

>>> rank_list[5]  # 6th word on list has index 5
'in'

But suppose you want to go in the opposite direction? Given a word, you want to know what its frequency rank is. Python provides a built-in method for finding the index of any item in a list, so we could equally well do:

>>> rank_list.index('in')
5

But this is actually a little inefficient: Python has to search through the list, comparing each item with ‘in’, until it finds the right one. If Python starts from the beginning of the list each time, that will take quite a lot of comparisons for low-ranked words.

What this example illustrates is a very basic but extremely important computational concept. You want your information to be indexed in a way convenient for the kind of questions you’re going to ask. A phone book is an extremely useful format if you know a name and you need a phone number. Similarly, Webster’s Dictionary is wonderful if you have a word and want to know what the meaning is. In both cases, alphabetically sorting the entries give us an efficient way of accessing the information. But the phone book is fairly inconvenient if you have a phone number and want to know who has it, and Webster’s is quite hard to use if you have a meaning in mind and are hunting for the word to express it, or even if you have a word, and are looking for its rhyme.

If you have a set of strings, and you want to get particular kinds of information about each, dictionaries are the way to go: Dictionary look ups use an indexing scheme much like alphabetization to efficiently retrieve values for keys. In the example of the rank list, if our main interest is in going from words to ranks, what we want to do is convert the list into a dictionary. We want to convert rank_list into something we’ll call rank_dict, which looks like this:

1{'the':0, 'be':1, 'of':2, 'and':3, 'a':4, 'in':5,
2 'to':6,  'have':7, 'it':8, 'to':9, 'for':10, 'I':11,
3 'that':12, 'you':13, 'he':14, 'on':15, 'with:16', 'do':17,
4 'at':18, 'by':19, 'not':20, 'this':21, 'but':22, 'from':23,
5  'they':24, 'his':25, ...]

One way to produce rank_dict is with this code snippet:

rank_dict = dict()
for (i, word) in enumerate(rank_list):
    rank_dict[word] = (i+1)

The first line creates and empty dictionary. Here is what enumerate does to a list:

>>> list(enumerate(['a','b','c']))
[(0,'a'),(1,'b'),(2,'c')]

What enumerate returns Essentially, what enumerate does is pair each element in the list with its index, which is exactly the information we want for our dictionary.

3.4.4.3. Combining lists into a dictionary

It is frequently the case that the meaning of a particular data item is defined by its position in a sequence of things. The i th thing in that sequence is associated with a particular entity xi and the ith thing in some other sequence might also be associated with xi. For example, a company database might store information about employees in separate files, but the i th line of each file is reserved for information about the employee with id number i.

As a concrete example, suppose we downloaded information about word frequencies in the British National Corpus, a very large sample of English, likely to give very reliable counts, and it came in two files, with contents that looked like this:

\begin{array}[t]{lr}
\text{File 1} & \text{File 2}\\
\hline
a & 2186369  \\
abandon & 4249 \\
abbey & 1110 \\
ability & 10468 \\
able & 30454 \\
abnormal & 809 \\
abolish & 1744 \\
abolition & 1154 \\
abortion & 1471 \\
about & 52561 \\
about & 144554 \\
above & 2139 \\
above & 10719 \\
above & 12889 \\
abroad & 3941 \\
abruptly & 1146 \\
absence & 5949 \\
absent & 1504 \\
absolute & 3489 \\
absolutely & 5782 \\
absorb & 2684 \\
absorption & 932 \\
abstract & 1605 \\
\end{array}

That is, the i th numbered line in File 2 gives the frequency of the i th English word in File 1. The easiest way to read this data into Python produces two lists, word_list and freq_list. Now suppose we want to be able to go conveniently from a word to its frequency. The right thing to do is to make a dictionary. Here is one way to do that, using code like the code we’ve already seen:

freq_dict = dict()
for (i, word) in enumerate(word_list):
   freq_dict[word] = freq_list[i]

Look at this code and make sure you understand it. We use the enumerate function to give us the index of each word in the list as we see it, then associate that with the i th frequency in freq_list.

But Python provides a faster (more Pythonic) way to do this. As with list and str, the name of the Python type dict is also a function for producing dictionaries. In fact, we used that convention to create an empty dictionary in the code snippet above. But dict can do more than just create empty dictionaries. Given a list of pairs, it produces a dictionary whose keys are the first members of the pairs and whose values are corresponding second members:

>>> L = [('a',1), ('b',2), ('c',3)]
>>> dd = dict(L)
>>> dd
{'a':1, 'b':2, 'c':3}

Unfortunately, we have two lists, not one, and neither is a list of pairs. Not to worry. Python also provides an easy way to create the list we want. The function is called zip, and it takes two lists of the same length and returns a list of pairs:

>>> L_left = ['a','b','c']
>>> L_right = [1, 2, 3]
>>> zip(L_left, L_right)
[('a',1), ('b',2), ('c',3)]

Thus in this instance zip returns the same list L we saw above.

Given these feature, it is quite simple to produce the frequency dictionary we want:

>>> freq_dict = dict(zip(word_list,freq_list))

This is a frequently used Python idiom which will come in handy.

3.4.4.4. A text-based example

Note

This section uses the data module example_string which can be found here. [Add ipython notebook]

We illustrate some code here for computing word counts in a string. This is included here as a very clear example of when dictionaries should be used: We have a set of strings (words in our vocabulary) and information that we want to keep about each string (the word’s count). We need to update that information in a loop (see Section Loops), as we look at each word in the string:

1from example_string import example_string
2
3count_dict = dict()
4
5for word in example_string.split():
6    if word in count_dict:
7       count_dict[word] += 1
8    else:
9       count_dict[word] = 1

And then we have:

 1>>> count_dict
 2{'all': 3,
 3 'consider': 2,
 4  'dance': 1,
 5  'better,': 1,
 6  'sakes,': 1,
 7  'pleasant,': 1,
 8  'four': 3,
 9   'go': 1,
10
11    ...
12
13   'Lizzy': 2,
14   'Jane,': 1,
15
16     ...
17
18    'Kitty,': 2
19  }

Let’s see what’s going on in the code. [Explain]

3.4.4.5. Dictionary operators and methods

If d is a dictionary then:

len(d)

Return the number of items in the d.

d[key]

Return the item of d with key key.

In general, d[key] returns an error if key is a a key in d. See Python docs for ways to customize this (__missing__() method).

d[key] = value

Set d[key] to value.

del d[key]

Remove d[key] from d. Raises a KeyError if key is not in d.

key in d

Return True if d has a key key, else :samp: False.

key not in d

Equivalent to :samp: not key in d.

d.clear()

Remove all items from the dictionary.

d.copy()

Return a shallow copy of the dictionary.

d.get(key[, default])

Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.

d.items()

Return a list of the dictionary’s (key, value) pairs.

d.keys()

Return a list of d’s keys. See the note for dict.items().

d.pop(key[, default])

If key is in the dictionary, remove it and return its value, else return default. If default is not given and key is not in the dictionary, a KeyError is raised.

d.popitem()

Remove and return an arbitrary (key, value) pair from the dictionary.

popitem() is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, calling popitem() raises a KeyError.

d.setdefault(key[, default])

If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.

d.update([other])

Update the dictionary with the key/value pairs from other, overwriting existing keys. Return None.

update() accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two). If keyword arguments are specified, the dictionary is then updated with those key/value pairs: d.update(red=1, blue=2).

d.values()

Return a copy of the dictionary’s list of values. See the note for dict.items().

3.4.4.6. Python collections module

Introduce Counters and defaultdicts.

Run though the above example, simplifying through the use of Counters.

A counter is a special kind of dictionary defined in the Python collections module:

 1>>> from collections import Counter
 2>>> # Tally occurrences of words in a list
 3>>> cnt = Counter()
 4>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
 5...     cnt[word] += 1
 6>>> cnt
 7Counter({'blue': 3, 'red': 2, 'green': 1})
 8
 9>>> # Find the ten most common words in Hamlet
10>>> import re
11>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
12>>> Counter(words).most_common(10)
13[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
14 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Counters can be initialized with any sequence, and they will count the token occurrences in that sequence. For example:

>>> c = Counter('gallahad')
>>> c
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})

They can also be initialized directly with count information from a dictionary:

>>> c = Counter({'red': 4, 'blue': 2})      # a new counter from a mapping
>>> c = Counter(cats=4, dogs=8)             # a new counter from keyword args

A Counter is a kind of dictionary, but it does not behave entirely like a standard dictionary. The count of a missing element is 0; there are no key error occurs, as would the case with a standard Python dictionary:

>>> c = Counter(['eggs', 'ham'])
>>> c['bacon']
0

See the Python docs for more features.

Now we said a counter can initialized directly with any sequence; the correct term is any iterable, roughly, anything that can be iterated through with a for loop. But caution must be exercised when initializing counters this way, to guarantee that the right things are being counted. For example, if what is desired is the word counts, it won’t work to simply initilize a Counter with a file handle, even though a file handle is an iterable:

 1>>> from collections import Counter
 2>>> tr = Counter(open('../anatomy/pride_and_prejudice.txt','r'))
 3>>> len(tr)
 411003
 5>>> tr.most_common(10)
 6[('\r\n', 2394), ('                          * * * * *\r\n', 6),
 7 ('them."\r\n', 3), ('it.\r\n', 3), ('them.\r\n', 3),
 8 ('family.\r\n', 2), ('do."\r\n', 2),
 9 ('between Mr. Darcy and herself.\r\n', 2),
10 ('almost no restrictions whatsoever.  You may copy it, give it away or\r\n', 2), ('together.\r\n', 2)]

What happened and why?