Computational Linguistics

Python Dictionaries

Basic Dictionary Access

You need to store various counts. For example, you need to store the number of times each word co-occurs with each tag in the training data.

There are various ways you could do this. But the deciding factor is this: You are basically building a V x T matrix, where V is vocabulary size and T is number of tags. The matrix will be sparse. Most words won't co-occur with most tags in the training data.

The natural way to deal with such a sparse matrix in Python is to use a dictionary.

Let's assume we store the data as follows. The keys for the dictionary are words. The values are dictionaries. To each word we associate a dictionary which gives for each tag the count of the number of times the word co-occurs with that tag:

>>> wordtag['dance']
{'VBZ': 1, 'NN1': 1}
That is, the dictionary associated with 'dance' tells us we have seen it once as a 'VBZ' (tensed verb) and once as an 'NN1' (common noun).
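To make the sparsity point concrete, here is a small sketch written for Python 3. The words, tags, and counts are invented for illustration, and the names vocab, tagset, full_cells, and stored_cells are not from the notes. It compares the number of cells in a full V x T matrix with the number of counts the dictionary of dictionaries actually stores.

```python
# Toy word/tag counts stored as a dictionary of dictionaries.
# These particular words and counts are made up for illustration.
wordtag = {
    'dance': {'VBZ': 1, 'NN1': 1},
    'the': {'AT0': 5},
    'walk': {'NN1': 2},
}

vocab = set(wordtag)                                          # the V words
tagset = {tag for tags in wordtag.values() for tag in tags}   # the T tags

full_cells = len(vocab) * len(tagset)                         # cells in a dense V x T matrix
stored_cells = sum(len(tags) for tags in wordtag.values())    # counts we actually store

print(full_cells, stored_cells)
```

Even in this tiny example fewer than half the cells are stored; with a realistic vocabulary the gap would be far larger.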

Just to be clear in the following discussion, let's empty the dictionary:

>>> wordtag={}
This statement binds wordtag to a new, empty dictionary (a hash table or association list in other languages).

Now let's assign something to it:

>>> wordtag['dance']={'VBZ': 1, 'NN1': 1}
>>> wordtag
{'dance': {'VBZ': 1, 'NN1': 1}}
So now we have a dictionary of dictionaries, defined for just the one word 'dance'.

Now we ask the dictionary about the count of some arbitrary word/tag pair:

>>> wordtag['walk']['VBZ']
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'walk'
Of course we get an error. Python has very good error/exception handling. This kind of error is important to know about if you're using dictionaries. It's called a 'KeyError', as the last line of the error message tells us. In Python, errors too are objects and always belong to classes. An error of the KeyError class is raised whenever we ask a dictionary for the value of a key it does not contain.
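Because a KeyError is an ordinary object, a program can catch it and inspect it. The following is a minimal sketch, written for Python 3. The variable names count and missing_key are illustrative only.

```python
wordtag = {'dance': {'VBZ': 1, 'NN1': 1}}

try:
    count = wordtag['walk']['VBZ']
except KeyError as err:
    # err is an instance of the KeyError class;
    # err.args[0] holds the key that was not found.
    missing_key = err.args[0]
    count = 0

print(missing_key, count)
# KeyError is a subclass of the more general LookupError.
print(issubclass(KeyError, LookupError))
```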

Suppose we now try to ASSIGN a value using the variable assignment operator ('=').

>>> wordtag['walk']['VBZ']= 1
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'walk'
Same error. Why? Because this says, (a) find the value for key 'walk' in dictionary wordtag; that should be a dictionary. (b) Then assign to that dictionary the value 1 for the key 'VBZ'. But we never get to step (b) because step (a) raises a KeyError. So we have to do this:
>>> wordtag['walk']={}
This assigns an empty dictionary as the value of 'walk' in wordtag.

And now we can do the assignment without error:

>>> wordtag['walk']['VBZ']= 1
And now we can do other assignments to 'walk':
>>> wordtag['walk']['NN1']= 1
And now when we ask for the dictionary associated with 'walk' we get what's expected.
>>> wordtag['walk']
{'VBZ': 1, 'NN1': 1}
And we can now fetch embedded values with a single query:
>>> wordtag['walk']['VBZ']
1

So that's basically how things work. And of course this gives us some problems to solve when we write a program requiring random access and random updates to a dictionary or dictionaries.
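The two steps above (create the inner dictionary, then assign into it) can be wrapped in a small helper. This is only a sketch; the function name set_count is made up for illustration. Note the membership test: assigning {} unconditionally would wipe out any counts already stored for a known word.

```python
def set_count(wordtag, word, tag, count):
    """Store count for (word, tag), creating the inner dictionary only if needed."""
    if word not in wordtag:
        # First time we see this word: give it an empty tag dictionary.
        wordtag[word] = {}
    wordtag[word][tag] = count

wordtag = {'dance': {'VBZ': 1, 'NN1': 1}}
set_count(wordtag, 'walk', 'VBZ', 1)    # new word: inner dictionary is created
set_count(wordtag, 'dance', 'NN1', 2)   # known word: its 'VBZ' count is left alone
print(wordtag)
```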

 

Random Dictionary Access

Now we are set up to see the basic problem.

In general we have to access wordtag throughout the computation, sometimes to update a piece of information (increment a count), sometimes just to look it up. In either case, for an arbitrary word W and an arbitrary tag T, we risk two kinds of error:

  1. Unknown word: we've never seen W before, and wordtag has no dictionary entry for it;
  2. Unknown word/tag combo: we've seen W, but never paired with T in the training data.
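The two cases can be told apart with the in operator. The sketch below, written for Python 3, uses a hypothetical helper name, lookup_problem, and 'AJ0' stands in for any tag never seen with 'dance' in this toy dictionary.

```python
def lookup_problem(wordtag, word, tag):
    """Say which kind of KeyError wordtag[word][tag] would raise, or None if it is safe."""
    if word not in wordtag:
        return 'unknown word'
    if tag not in wordtag[word]:
        return 'unknown word/tag combination'
    return None

wordtag = {'dance': {'VBZ': 1, 'NN1': 1}}
case1 = lookup_problem(wordtag, 'walk', 'VBZ')    # 'walk' has no entry at all
case2 = lookup_problem(wordtag, 'dance', 'AJ0')   # 'dance' is known, the tag is not
case3 = lookup_problem(wordtag, 'dance', 'NN1')   # both keys are present
print(case1, case2, case3)
```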

What do we do?

 

Get: The solution

The Python dictionary method 'get' is set up to do just what we want.

First some simple examples. Consider the following simple dictionary.

>>> Dict
{'c': 3}
>>> Dict['a']
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'a'
As expected. Now instead of using Dict['a'], we use the NEARLY synonymous 'get' method.
>>> Dict.get('a',0)
0
No error. This looks up the value associated with key 'a' in dictionary 'Dict', just as Dict['a'] does, but the second argument is what to return if the key is undefined. So 0 is returned in this case.

Now we check Dict again:

>>> Dict['a']
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'a'
Nope. Still undefined. To define the value we have to explicitly assign what 'get' returns.
>>> Dict['a']=Dict.get('a',0)
>>> Dict['a']
0

Python provides a single method called setdefault that performs both retrieval and setting. Like get, it takes a default, which is the value returned if the given key has no value; unlike get, setdefault also stores that default under the key in the dictionary.

So

>>> Dict.setdefault('a',0)
0
has the same effect on Dict as:
>>> Dict['a']=Dict.get('a',0)
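To see the difference between the two methods side by side, here is a small sketch written for Python 3. It starts from a fresh dictionary, and the snapshot variable names are illustrative only.

```python
Dict = {'c': 3}

value1 = Dict.get('a', 0)           # returns the default, stores nothing
after_get = dict(Dict)              # copy of Dict at this point

value2 = Dict.setdefault('a', 0)    # returns the default AND stores it
after_setdefault = dict(Dict)       # copy of Dict at this point

value3 = Dict.setdefault('c', 0)    # key already defined: returns 3, changes nothing

print(after_get)
print(after_setdefault)
print(value3)
```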

Both Dict.get('a',0) and Dict.setdefault('a',0) differ from Dict['a'] in their exception-raising behavior. But there is a more fundamental difference. Neither Dict.get('a',0) nor Dict.setdefault('a',0) makes sense with assignment syntax, because each is a function call rather than a reference to a slot in the dictionary:

>>> Dict.get('a',0) = 0
SyntaxError: can't assign to function call

What about incrementing? Add 1 to a count that may not yet be defined:

>>> Dict['a']=Dict.get('a',0) + 1
This is the idiom for incrementing. Notice that this works because 'get' returns the default 0 only when necessary; it always returns the value that's there if there already is one:
>>> Dict['b']=1
>>> Dict.get('b',0)
1
>>> Dict['b']=Dict.get('b',0) + 1
>>> Dict.get('b',0)
2
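Here is the same idiom at work in a loop, as a minimal sketch written for Python 3. The list of tags is made-up data, and the name counts is illustrative.

```python
tags = ['NN1', 'VBZ', 'NN1', 'NN1']   # made-up sequence of tags

counts = {}
for tag in tags:
    # get supplies 0 the first time a tag is seen, the stored count afterwards.
    counts[tag] = counts.get(tag, 0) + 1

print(counts)
```

No tag needs to be defined in advance; each one enters the dictionary the first time it is counted.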

Notice that the highly idiomatic += statement is of no help if you have no guarantee that a key will be defined:

>>> Dict['d']+=1
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'd'
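One way to keep using += is to make sure the key is defined first, for instance with setdefault. This is a sketch under that assumption, written for Python 3, using a fresh dictionary named counts with made-up keys.

```python
counts = {'b': 2}

counts.setdefault('z', 0)   # 'z' is undefined, so it is set to 0
counts['z'] += 1            # now += is safe

counts.setdefault('b', 0)   # 'b' is already defined, so nothing changes
counts['b'] += 1

print(counts)
```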
 

Dictionaries of dictionaries

'get' IS enough to solve the dictionary of dictionaries version of our problem. But it takes some care. Suppose wordtag once again contains only the entry for 'dance':

>>> wordtag
{'dance': {'VBZ': 1, 'NN1': 1}}
Now we ask the dictionary about the count of some arbitrary word/tag pair using get:
>>> wordtag['walk'].get('VBZ',0)
Traceback (most recent call last):
  File "", line 1, in ?
KeyError: 'walk'
The syntax shows what the problem is. wordtag['walk'] SHOULD return a dictionary. Then the get method would be defined. Instead wordtag['walk'] raises the usual KeyError because wordtag has no entry for 'walk'. To avoid the error, we have to call get twice:
>>> wordtag.get('walk',{}).get('VBZ',0)
0
The first call returns the empty dictionary '{}' (the 2nd argument of 'get'), for which get is defined, so the 2nd call now succeeds, returning the fallback value 0. Pretty hideous. The following is more readable and gets the same thing done:
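If the double call is needed in many places, one option is to hide it inside a small function. This is just a sketch, written for Python 3; the name count_of is made up, and 'AJ0' stands in for any tag not seen with 'dance' here.

```python
def count_of(wordtag, word, tag):
    """Return the count for (word, tag), or 0 if either key is missing."""
    return wordtag.get(word, {}).get(tag, 0)

wordtag = {'dance': {'VBZ': 1, 'NN1': 1}}
seen = count_of(wordtag, 'dance', 'NN1')          # both keys present
unseen_tag = count_of(wordtag, 'dance', 'AJ0')    # known word, unknown tag
unseen_word = count_of(wordtag, 'walk', 'VBZ')    # unknown word

print(seen, unseen_tag, unseen_word)
print(wordtag)   # lookups with get leave the dictionary unchanged
```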
>>> wordtag['walk']= wordtag.get('walk',{})
>>> wordtag['walk'].get('VBZ',0)
0
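The first line is an assignment statement which sets wordtag['walk'] to its existing value if there is one and to the empty dictionary otherwise, which then allows the second line to succeed. Note that, unlike the one-line version, this adds an entry for 'walk' to wordtag. And for incrementing values this gives the somewhat readable: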
>>> wordtag['walk']= wordtag.get('walk',{})
>>> wordtag['walk']['VBZ'] = wordtag['walk'].get('VBZ',0) +1
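Putting the two lines in a function gives a way to build the whole table from training data. The following is a minimal sketch written for Python 3; the function name add_observation and the tiny list of (word, tag) pairs are invented for illustration.

```python
def add_observation(wordtag, word, tag):
    """Add 1 to the count for (word, tag), creating entries as needed."""
    wordtag[word] = wordtag.get(word, {})
    wordtag[word][tag] = wordtag[word].get(tag, 0) + 1

# Made-up training data: a list of (word, tag) pairs.
tagged = [('dance', 'VBZ'), ('dance', 'NN1'), ('walk', 'VBZ'), ('walk', 'VBZ')]

wordtag = {}
for word, tag in tagged:
    add_observation(wordtag, word, tag)

print(wordtag)
```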
End of discussion.