3.4.4. Dictionaries¶
Dictionaries store mappings between many kinds of data types. Suppose we are studying the Enron email network, keeping track of who emailed who. For each employee, we want to keep a list who they emailed. This is the kind of information that would be stored in a dictionary. It could look like this:
1>>> enron_network
2
3{'Mike Grigsby': ['Louise Kitchen', 'Scott Neal', 'Kenneth Lay', ... ],
4 'Greg Whalley': ['Mike Grigsby', 'Louise Kitchen;, ... ],
5 ...
6
7}
Each entry represents the email sent by one employee. Thus, the first entry tells us Mike Grigsby sent email to Louise Kitchen, Scott Neal, Kenneth Lay, etcetera. Each employee is represented by a string that uniquely identifies them (their names). Thus, this dictionary relates strings to lists of strings (Pythonistas say it maps strings to lists of strings). The strings before the colons are the keys, and the lists after the colon are the values. Any immutable type can be the key of a dictionary.
For our purposes we will most often use strings as the ‘keys’ of a dictionary. We will use all kinds of things as values.
>>> X = {'x':42, 'y': 3.14, 'z':7}
>>> Y = {1:2, 3:4}
>>> Z = {}
- Make some assignments into a dict
>>> L = dict(x=42,y=3.14,z=7)
The following is the way to do the same with integer keys.
>>> M = dict([[1, 2],[3, 4]])>>> X['x'] 42 >>> X['y'] 3.1400000000000001 >>> M[1] 2
The following raises a KeyError:
>>> X['w']
...
KeyError: 'w'
Dictionaries and lists can be mixed. We saw above that a list can be a value in a dictionary. Dictionaries can also be elements in lists:
>>> Double = [{'x':42,'y': 3.14, 'z':7},{1:2,3:4},{'w':14}]
>>> Double[0]
{'y': 3.1400000000000001, 'x': 42, 'z': 7}
This means to retrieve the value 42
for the key x
,
we can do:
>>> FirstDict = Double[0]
>>> FirstDict['x']
42
But Python allows any expression which has a Dictionary as its value
to be followed by [key]
. It is much more
convenient and Pythonic to do this in one step:
>>> Double[0]['x']
42
3.4.4.1. Bundling information¶
A dictionary can be used to collect various attributes of an object. For example:
dd = dict(name = 'Mark Gawron',
language = 'Python',
favorite_tv_show = 'Monty Python',
favorite_desert = 'Apple pie')
The information can now be retrieved when convenient:
>>> dd['language']
'Python'
This kind of use, essentially representing some object with a dictionary, ilustrates how dictionaries serve as Python’s generic tool for representing special purpose data. Eventually as the kinds of computation associated with the data become more complicated, we might eventually want to move to Python classes, discussed in Section Classes and class attributes, but in many contexts dictionaries can work very well as a simpler lower overhead alternative.
3.4.4.2. From list to dictionary¶
It often happens that we have data in a format convenient for providing one kind of information, but we need it for a slightly different purpose. Suppose we have the words of English arranged according to frequency rank, from most frequent to least frequent. Such a list looks like this:
rank_list = ['the', 'be', 'of', 'and', 'a', 'in', 'to', 'have', 'it',
'to', 'for', 'I', 'that', 'you', 'he', 'on', 'with', 'do',
'at', 'by', 'not', 'this', 'but', 'from', 'they', 'his', ...]
This is convenient if you want to know things like what the 6th most frequent word of English is:
>>> rank_list[5] # 6th word on list has index 5
'in'
But suppose you want to go in the opposite direction? Given a word,
you want to know what its frequency rank is. Python provides
a built-in method
for finding the index of any item
in a list, so we could equally well do:
>>> rank_list.index('in')
5
But this is actually a little inefficient: Python has to search through the list, comparing each item with ‘in’, until it finds the right one. If Python starts from the beginning of the list each time, that will take quite a lot of comparisons for low-ranked words.
What this example illustrates is a very basic but extremely important computational concept. You want your information to be indexed in a way convenient for the kind of questions you’re going to ask. A phone book is an extremely useful format if you know a name and you need a phone number. Similarly, Webster’s Dictionary is wonderful if you have a word and want to know what the meaning is. In both cases, alphabetically sorting the entries give us an efficient way of accessing the information. But the phone book is fairly inconvenient if you have a phone number and want to know who has it, and Webster’s is quite hard to use if you have a meaning in mind and are hunting for the word to express it, or even if you have a word, and are looking for its rhyme.
If you have a set of strings, and you want to get particular
kinds of information about each, dictionaries
are the way to go: Dictionary look ups
use an indexing scheme much like alphabetization
to efficiently retrieve values for keys.
In the example of the rank list, if our main
interest is in going from words to ranks,
what we want to do is convert the list into a dictionary.
We want to convert rank_list
into something
we’ll call rank_dict
, which looks like this:
1{'the':0, 'be':1, 'of':2, 'and':3, 'a':4, 'in':5,
2 'to':6, 'have':7, 'it':8, 'to':9, 'for':10, 'I':11,
3 'that':12, 'you':13, 'he':14, 'on':15, 'with:16', 'do':17,
4 'at':18, 'by':19, 'not':20, 'this':21, 'but':22, 'from':23,
5 'they':24, 'his':25, ...]
If we have rank_list
and rank_list
, One way to produce rank_dict
is with this code snippet:
rank_dict = dict()
for (i, word) in enumerate(rank_list):
rank_dict[word] = (i+1)
The first line creates and empty dictionary.
Here is what enumerate
does to a list:
>>> list(enumerate(['a','b','c']))
[(0,'a'),(1,'b'),(2,'c')]
What enumerate
returns
Essentially, what enumerate
does is pair each element in the list
with its index, which is exactly the information we want for our dictionary.
3.4.4.3. Combining lists into a dictionary¶
It is frequently the case that the meaning of a particular data item is defined by its position in a sequence of things. For example, the twelfth thing in a sequence might be the weight of a particular data sample and the twelfth thing in some other sequence might be the volume of that data sample. Or a company database might store personal information about employees in separate files, but the n*th line of each file is reserved for information about the employee with id number *n.
As a concrete example, suppose we downloaded information about word frequencies in the British National Corpus, a very large sample of English, likely to give very reliable counts, and it came in two files, with contents that looked like this:
That is, the i th numbered line in File 2 gives the frequency of the
i th English word in File 1. The easiest way
to read this data into Python produces two lists,
word_list
and freq_list
. Now suppose we want
to be able to go conveniently from a word to its frequency.
The right thing to do is to make a dictionary. Here is one way
to do that, using code like the code we’ve already seen:
freq_dict = dict()
for (i, word) in enumerate(word_list):
freq_dict[word] = freq_list[i]
Look at this code and make sure you understand it. We use the
enumerate function to give us the index of each word in the list
as we see it, then associate that with the i th frequency in freq_list
.
But Python provides a faster (more Pythonic) way to do
this. As with list
and str
, the name of the Python
type dict
is also a function for producing dictionaries.
In fact, we used that convention to create an empty
dictionary in the code snippet above. But dict
can
do more than just create empty dictionaries. Given a list of
pairs, it produces a dictionary whose keys are the first
members of the pairs and whose values are corresponding
second members:
>>> L = [('a',1), ('b',2), ('c',3)]
>>> dd = dict(L)
>>> dd
{'a':1, 'b':2, 'c':3}
Unfortunately, we have two lists, not one, and neither is
a list of pairs. Not to worry. Python also provides an easy
way to create the list we want. The function is called
zip
, and it takes two lists of the same length
and returns a list of pairs:
>>> L_left = ['a','b','c']
>>> L_right = [1, 2, 3]
>>> zip(L_left, L_right)
[('a',1), ('b',2), ('c',3)]
Thus in this instance zip
returns the same list
L
we saw above.
Given these feature, it is quite simple to produce the frequency dictionary we want:
>>> freq_dict = dict(zip(word_list,freq_list))
This is a frequently used Python idiom which will come in handy.
3.4.4.4. A text-based example¶
Note
This section uses the data module example_string
which can be found here. [Add ipython notebook]
We illustrate some code here for computing word counts in a string. This is included here as a very clear example of when dictionaries should be used: We have a set of strings (words in our vocabulary) and information that we want to keep about each string (the word’s count). We need to update that information in a loop (see Section Loops), as we look at each word in the string:
1from example_string import example_string
2
3count_dict = dict()
4
5for word in example_string.split():
6 if word in count_dict:
7 count_dict[word] += 1
8 else:
9 count_dict[word] = 1
And then we have:
1>>> count_dict
2{'all': 3,
3 'consider': 2,
4 'dance': 1,
5 'better,': 1,
6 'sakes,': 1,
7 'pleasant,': 1,
8 'four': 3,
9 'go': 1,
10
11 ...
12
13 'Lizzy': 2,
14 'Jane,': 1,
15
16 ...
17
18 'Kitty,': 2
19 }
Let’s see what’s going on in the code. [Explain]
3.4.4.5. Dictionary operators and methods¶
If d
is a dictionary then:
len(d)
Return the number of items in the d.
d[key]
Return the item of d with key key.
In general,
d[key]
returns an error ifkey
is a a key in d. See Python docs for ways to customize this (__missing__()
method).
d[key] = value
Set
d[key]
tovalue
.
del d[key]
Remove
d[key]
fromd
. Raises aKeyError
ifkey
is not ind
.
key in d
Return
True
ifd
has a keykey
, elseFalse
.
key not in d
Equivalent to
not key in d.
d.clear()
Remove all items from the dictionary.
d.copy()
Return a shallow copy of the dictionary.
d.get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
d.items()
Return a list of the dictionary’s (key, value) pairs.
d.keys()
Return a list of d’s keys. See the note for dict.items().
d.pop(key[, default])
If key is in the dictionary, remove it and return its value, else return default. If default is not given and key is not in the dictionary, a KeyError is raised.
d.popitem()
Remove and return an arbitrary
(key, value)
pair from the dictionary.
popitem()
is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, callingpopitem()
raises aKeyError
.
d.setdefault(key[, default])
If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.
d.update([other])
Update the dictionary with the key/value pairs from other, overwriting existing keys. Return None.
update()
accepts either another dictionary object or an iterable of key/value pairs (as tuples or other iterables of length two). If keyword arguments are specified, the dictionary is then updated with those key/value pairs:d.update(red=1, blue=2)
.
d.values()
Return a copy of the dictionary’s list of values. See the note for dict.items().
3.4.4.6. Python collections module¶
Introduce Counters and defaultdicts.
Run though the above example, simplifying through the use of Counters.
A counter is a special kind of dictionary defined in the Python collections module:
1>>> from collections import Counter
2>>> # Tally occurrences of words in a list
3>>> cnt = Counter()
4>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
5... cnt[word] += 1
6>>> cnt
7Counter({'blue': 3, 'red': 2, 'green': 1})
8
9>>> # Find the ten most common words in Hamlet
10>>> import re
11>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
12>>> Counter(words).most_common(10)
13[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
14 ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
Counters can be initialized with any sequence, and they will count the token occurrences in that sequence. For example:
>>> c = Counter('gallahad')
>>> c
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})
They can also be initialized directly with count information from a dictionary:
>>> c = Counter({'red': 4, 'blue': 2}) # a new counter from a mapping
>>> c = Counter(cats=4, dogs=8) # a new counter from keyword args
A Counter is a kind of dictionary, but it does not behave entirely like a standard dictionary. The count of a missing element is 0; there are no key error occurs, as would the case with a standard Python dictionary:
>>> c = Counter(['eggs', 'ham'])
>>> c['bacon']
0
See the Python docs for more features.
Now we said a counter can initialized directly with any sequence; the correct term is any iterable, roughly, anything that can be iterated through with a for loop. But caution must be exercised when initializing counters this way, to guarantee that the right things are being counted. For example, if what is desired is the word counts, it won’t work to simply initilize a Counter with a file handle, even though a file handle is an iterable:
1>>> from collections import Counter
2>>> tr = Counter(open('../anatomy/pride_and_prejudice.txt','r'))
3>>> len(tr)
411003
5>>> tr.most_common(10)
6[('\r\n', 2394), (' * * * * *\r\n', 6),
7 ('them."\r\n', 3), ('it.\r\n', 3), ('them.\r\n', 3),
8 ('family.\r\n', 2), ('do."\r\n', 2),
9 ('between Mr. Darcy and herself.\r\n', 2),
10 ('almost no restrictions whatsoever. You may copy it, give it away or\r\n', 2), ('together.\r\n', 2)]
What happened and why?