5.6. Files and file IO streams

Note

This section has an ipython notebook.

It also makes use of the UTF-8 text version of Pride and Prejudice, downloaded from Project Gutenberg. You may visit Project Gutenberg and learn your way around (which will be needed for some of our future assignments), and download Pride and Prejudice from there, or, for now, you can download it here.

Here’s one way to open a file and read in all of the contents:

file_str = open('pride_and_prejudice.txt','r').read()

The variable file_str is now a single string containing the entire novel. Let’s look at some statistics:

>>> len(file_str)
717569
>>> len(file_str.split())
124588
>>> file_lines = open('pride_and_prejudice.txt','r').readlines()
>>> len(file_lines)
13426

Why are the lengths for these containers so different? We address this in the next section.

5.6.1. File stream objects

Let’s consider the python type tree again:

[Figure: Python container type tree]

Notice the tree contains something more than containers. The next level up from container is iterable, and the only example given of a non-container iterable is a file-like-object. A file-like-object is what the open function called above returns:

>>> file_obj = open('pride_and_prejudice.txt','r')
>>> file_obj
<open file 'pride_and_prejudice.txt', mode 'r' at 0x1002dced0>

The read and readlines methods are methods of file-like objects. The read method returns the entire file as a single string, and the readlines method returns the entire file as a list of lines (each line is a string). There is no method on file-like-objects that returns a list of words, but strings have a method, split, that splits them into words (roughly; we discuss shortcomings of split in Section Regression and Classification). This accounts for the three different lengths above: len(file_str) counts characters, len(file_str.split()) counts words, and len(file_lines) counts lines. In summary:

Method      For                 Returns
-------------------------------------------------------
read        file-like-object    string
split       string              list of strings (words)
readlines   file-like-object    list of strings (lines)
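
To see the three operations side by side, here is a minimal sketch that uses a small temporary file instead of the novel. The sample text, the file name sample.txt, and the variable names are assumptions made for illustration, and the code is written for Python 3 rather than the Python 2 used in the sessions of this section.

```python
import os
import tempfile

# Hypothetical three-line sample standing in for the novel.
sample_text = "Great news\nMonty Python lives!\nHooray!\n"

# Write the sample to a temporary file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w") as out_handle:
    out_handle.write(sample_text)

with open(sample_path, "r") as in_handle:
    text = in_handle.read()        # one string; its length counts characters

with open(sample_path, "r") as in_handle:
    lines = in_handle.readlines()  # list of lines, each keeping its newline

words = text.split()               # list of whitespace-separated words

print(len(text))    # number of characters
print(len(words))   # number of words
print(len(lines))   # number of lines
```

The same data gives three different lengths because each container holds a different kind of element: characters, words, or lines.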

An iterable is something very like a container. It has a set of elements that can be iterated through in sequence. The difference between ordinary containers like lists and tuples and iterables like file-like-objects is that file-like-objects work by establishing a link to some data resource out in the world; in the case of a file on disk, that resource is an input-output stream. On demand, a portion of that data can be transferred into the Python process and worked with; but unlike a container, that resource can be exhausted. Once all the data in a file-like-object stream has been transferred, the stream is empty and will yield no more data. If the stream was linked to a file, the file is of course still there. But to get at the data again, a new file-like-object must be created with another open command. Here’s a demonstration:

>>> ofh = open('foo.txt','w')
>>> print >> ofh, 'Great news'
>>> print >> ofh, "Monty Python lives!"
>>> print >> ofh, 'Hooray!'
>>> ofh.close()

We opened a file for writing through the stream ofh. Then we wrote three lines to it. Now let’s read them back:

>>> ifh = open('foo.txt','r')
>>> line1 = ifh.readline()
>>> line1
'Great news\n'
>>> line2 = ifh.readline()
>>> line2
'Monty Python lives!\n'
>>> line3 = ifh.readline()
>>> line3
'Hooray!\n'
>>> line4 = ifh.readline()
>>> line4
''

Each successive call to readline returns the next line in the file. So a file-like-object maintains a state tracking where it is in the file. After the last bit of data is read, each successive call to readline returns the empty string. The stream is exhausted.

Iterables may be iterated through in a loop. In the case of a file-like-object, the iteration through the file will be line by line. So in:

for line in fh:

line will be one line of the file if fh is a file-like-object. The line will end with a newline character (\n), except possibly the last line of a file that does not end with a newline.
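
Here is a small sketch of such a loop, assuming a hypothetical three-line file created on the spot (the file name and variable names are illustrative, and the code is written for Python 3). It also shows that a second loop over the same stream finds nothing left to read.

```python
import os
import tempfile

# Create a hypothetical three-line file to iterate over.
sample_path = os.path.join(tempfile.mkdtemp(), "foo_sample.txt")
with open(sample_path, "w") as ofh:
    ofh.write("Great news\nMonty Python lives!\nHooray!\n")

fh = open(sample_path, "r")

seen_lines = []
for line in fh:
    # Each line still carries its trailing newline; rstrip removes it.
    seen_lines.append(line.rstrip("\n"))

# The stream is now exhausted, so a second pass yields no lines at all.
second_pass = [line for line in fh]

fh.close()

print(seen_lines)
print(second_pass)
```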

The way iterables work is that they have a next method; each call to next returns the appropriate data item for the iterable’s current state and updates the state. The code implementing a for loop calls that method each time through the loop. Thus the relevant generalization about iterables is that they can be iterated through. We can get almost the same behavior as we got with readline by directly using the next method:

>>> ifh = open('foo.txt','r')
>>> ifh.next()
'Great news\n'
>>> ifh.next()
'Monty Python lives!\n'
>>> ifh.next()
'Hooray!\n'
>>> ifh.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The difference is that when we’ve run out of lines, calling next again raises a StopIteration exception instead of returning the empty string. The inner code of a for loop uses this property to know when to stop iterating and exit the loop.
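
To make the connection with for loops concrete, here is a rough sketch of what a for loop does behind the scenes, written as an explicit while loop. It is an illustration, not the interpreter's actual implementation. The sample file and variable names are assumptions, and the built-in function next(ifh) is used in place of ifh.next(), because the method form exists only in Python 2 while the built-in works in Python 3 as well.

```python
import os
import tempfile

# Create a hypothetical three-line file to read back.
sample_path = os.path.join(tempfile.mkdtemp(), "foo_sample.txt")
with open(sample_path, "w") as ofh:
    ofh.write("Great news\nMonty Python lives!\nHooray!\n")

ifh = open(sample_path, "r")
collected = []
while True:
    try:
        line = next(ifh)      # ask the stream for its next item
    except StopIteration:
        break                 # the stream is exhausted: leave the loop
    collected.append(line)    # this is where the loop body would go
ifh.close()

print(collected)
```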

What this shows us is that file streams are resources. They can be used up, at which point they no longer yield the content resource they provided when they were active. Files are resources in another sense; open files (active file streams) actually use up operating system resources that can be exhausted. With this in mind, it is a good idea to close any file that has been opened:

>>> ifh.close()

5.6.2. Word counts revisited

Here’s something we’ve seen before, opening a file for reading and getting word counts. We’ll take a fresh look now that we understand file-like-objects better:

from collections import Counter
ctr = Counter()      # maps each word type to its token count
token_ctr = 0        # running total of word tokens

with open('pride_and_prejudice.txt','r') as file_handle:
    # Iterating over the file-like-object yields one line at a time.
    for line in file_handle:
        line_words = line.strip().split()
        for word in line_words:
            token_ctr += 1
            ctr[word] += 1

This uses the with statement. with introduces a code block within which the file stream is active. One of the benefits of using a with statement block is that the file is automatically closed for you when you exit the block.
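
The automatic closing can be checked with the closed attribute of the file object. The following is a minimal sketch on a hypothetical two-line file (the file name and variable names are illustrative, and the code is written for Python 3).

```python
import os
import tempfile

# Create a hypothetical two-line file.
sample_path = os.path.join(tempfile.mkdtemp(), "with_sample.txt")
with open(sample_path, "w") as ofh:
    ofh.write("Great news\nHooray!\n")

with open(sample_path, "r") as file_handle:
    first_line = file_handle.readline()
    open_inside = not file_handle.closed   # True: the stream is active here

closed_after = file_handle.closed          # True: leaving the block closed it

# Reading from the closed stream fails with a ValueError.
try:
    file_handle.readline()
    read_failed = False
except ValueError:
    read_failed = True

print(open_inside, closed_after, read_failed)
```

Note that the name file_handle is still bound after the block ends; it simply refers to a stream that can no longer be read.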

Returning to the effects of the code snippet above, we discover:

>>> len(ctr)
13638
>>> token_ctr
124588

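What happened? len(ctr) is the number of distinct keys in the counter, so it gives the vocabulary size (the number of word types), which is 13,638. The value stored for each type is the number of times it occurs: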

>>> ctr.most_common(10)
[('the', 4205), ('to', 4121), ('of', 3660), ('and', 3309),
 ('a', 1945), ('her', 1858), ('in', 1813), ('was', 1796),
 ('I', 1740), ('that', 1419)]

The number of word tokens is 124,588. Counting tokens of word types like the and to is what gets us the raw numbers like 4205 and 4121, and summing those counts over all the word types gives back the token total:

>>> sum(ctr.values())
124588

This code makes use of Counters, introduced in Section Dictionaries. There is an even simpler way to use a counter:

ctr2 = Counter(open('pride_and_prejudice.txt','r'))

In this case however, we iterate through the file-like-object as in a for loop, which means we iterate line by line. So what the counter will be counting will be lines:

>>> ctr2.most_common(10)
[('\r\n', 2394),
 ('                          * * * * *\r\n', 6),
 ('them."\r\n', 3),
 ('it.\r\n', 3),
 ('them.\r\n', 3),
 ('family.\r\n', 2), ('do."\r\n', 2),
 ('between Mr. Darcy and herself.\r\n', 2),
 ('almost no restrictions whatsoever.  You may copy it, give it away or\r\n', 2),
 ('together.\r\n', 2)]

What would we be counting if we did the following?

ctr3 = Counter(open('pride_and_prejudice.txt','r').read())

And finally, what about the following?

ctr4 = Counter(open('pride_and_prejudice.txt','r').read().split())
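
One way to explore these questions without loading the whole novel is to try all three forms on a tiny file and inspect the keys of each counter. The sketch below assumes a hypothetical three-line file; the file name and variable names are illustrative, and the code is written for Python 3.

```python
import os
import tempfile
from collections import Counter

# Create a hypothetical three-line file with some repetition in it.
sample_path = os.path.join(tempfile.mkdtemp(), "tiny.txt")
with open(sample_path, "w") as ofh:
    ofh.write("a b\na b\nc\n")

# Form 1: pass the file-like-object itself to Counter.
with open(sample_path, "r") as fh:
    line_counts = Counter(fh)

# Form 2: pass the single string returned by read().
with open(sample_path, "r") as fh:
    char_counts = Counter(fh.read())

# Form 3: pass the list returned by read().split().
with open(sample_path, "r") as fh:
    word_counts = Counter(fh.read().split())

# Sorting the keys makes it easy to see what kind of item was counted.
print(sorted(line_counts))
print(sorted(char_counts))
print(sorted(word_counts))
```

In each case Counter counts whatever items it gets by iterating over its argument, so the answer depends only on what iterating over a file-like-object, a string, or a list produces.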

5.6.3. Other kinds of file-like objects

The kind of behavior illustrated above with Python’s default file streams can be replicated in a variety of input situations. For example, we saw in Section Dictionaries that we could download material from the web through a file-like object:

>>> from collections import Counter
>>> from urllib2 import urlopen
>>> book_url = 'http://www.gutenberg.org/ebooks/1342.txt.utf-8' # Pride & Prejudice URL
>>> handle = urlopen(book_url)
>>> handle
<addinfourl at 4303001288 whose fp = <socket._fileobject object at 0x1004a33d0>>

Note

The code above can be downloaded here.

This can now be accessed the same way we accessed a file:

>>> handle.next()
'\xef\xbb\xbfThe Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n'
>>> handle.next()
'\r\n'
>>> handle.readline()
'This eBook is for the use of anyone anywhere at no cost and with\r\n'
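
The same interface can be tried out without a network connection. As an illustration that goes slightly beyond the example above, the standard library's io.StringIO wraps an in-memory string in a file-like object. The sketch below is written for Python 3, where urllib2.urlopen has become urllib.request.urlopen; the sample text and variable names are assumptions.

```python
import io

# Hypothetical stand-in for downloaded text; no network access is needed.
fake_download = io.StringIO(u"first line\nsecond line\nthird line\n")

first = next(fake_download)               # same role as handle.next() above
second = fake_download.readline()         # readline picks up where next left off
rest = [line for line in fake_download]   # a loop consumes whatever remains

# Like a file stream, the in-memory stream is now exhausted.
after_end = fake_download.readline()

print(repr(first))
print(repr(second))
print(rest)
print(repr(after_end))
```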