5.6. Files and file IO streams
Note
This section has an accompanying IPython notebook.
It also makes use of the UTF-8 text version of Pride and Prejudice, downloaded from
Project Gutenberg. You may visit Project Gutenberg
and learn your way around (which will be needed for some
of our future assignments) and download Pride and Prejudice from there,
or, for now, you can download it here.
Here’s one way to open a file and read in all of the contents:
file_str = open('pride_and_prejudice.txt','r').read()
The variable file_str is now a single string containing the entire novel. Let’s look at some statistics:
>>> len(file_str)
717569
>>> len(file_str.split())
124588
>>> file_lines = open('pride_and_prejudice.txt','r').readlines()
>>> len(file_lines)
13426
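As a quick check that nothing was lost along the way (a small sketch reusing the variables defined above), joining the lines returned by readlines back together recovers the single string returned by read:

>>> ''.join(file_lines) == file_str
True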
Why are the lengths for these containers so different? We address this in the next section.
5.6.1. File stream objects
Let’s consider the Python type tree again:

Notice the tree contains something more than containers. The next level up from container is iterable, and the only example given of a non-container iterable is a file-like-object. A file-like-object is what the open function called above returns:
>>> file_obj = open('pride_and_prejudice.txt','r')
>>> file_obj
<open file 'pride_and_prejudice.txt', mode 'r' at 0x1002dced0>
The read and readlines methods are methods of file-like objects. The read method returns the entire file as a single string, and the readlines method returns the entire file as a list of lines (each line is a string). There is no method on file-like-objects that returns a list of words, but there is a method on strings, split, that splits them into words (roughly; we discuss the shortcomings of split in Section Regression and Classification):
Method      For                 Returns
---------------------------------------------------------
read        file-like-object    string
split       string              list of strings (words)
readlines   file-like-object    list of strings (lines)
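For example, here is split applied to a short sample string (a quick illustration, not drawn from the code above):

>>> 'It is a truth universally acknowledged.'.split()
['It', 'is', 'a', 'truth', 'universally', 'acknowledged.']

Notice that split only breaks on whitespace, so the period stays attached to the last word; this is one of the shortcomings taken up later.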
An iterable is something very like a container. It has a set of elements that can be iterated through in sequence. The difference between ordinary containers like lists and tuples and iterables like file-like-objects is that file-like-objects work by establishing a link to some data resource out in the world; in the case of a file on disk, that resource is an input-output stream. On demand, a portion of that data can be transferred into the Python process and worked with; but unlike a container, that resource can be exhausted. Once all the data in a file-like-object stream has been transferred, the stream is empty and will yield no more data. If the stream was linked to a file, the file is of course still there. But to get at the data again, a new file-like-object must be created with another open command. Here’s a demonstration:
>>> ofh = open('foo.txt','w')
>>> print >> ofh, 'Great news'
>>> print >> ofh, "Monty Python lives!"
>>> print >> ofh, 'Hooray!'
>>> ofh.close()
We opened a file for writing through the stream ofh. Then we wrote three lines to it. Now let’s read them back:
>>> ifh = open('foo.txt','r')
>>> line1 = ifh.readline()
>>> line1
'Great news\n'
>>> line2 = ifh.readline()
>>> line2
'Monty Python lives!\n'
>>> line3 = ifh.readline()
>>> line3
'Hooray!\n'
>>> line4 = ifh.readline()
>>> line4
''
Each successive call to readline returns the next line in the file. So a file-like-object maintains state tracking where it is in the file. After the last bit of data is read, each successive call to readline returns the empty string. The stream is exhausted.
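To get the lines back, we open the file again; the new file-like-object starts reading from the beginning (continuing with the foo.txt file written above):

>>> ifh = open('foo.txt','r')
>>> ifh.readline()
'Great news\n'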
Iterables may be iterated through in a loop. In the case of a file-like-object, the iteration through the file will be line by line. So in:
for line in fh:
the variable line will be a line of the file if fh is a file-like-object. The line will end with a newline character (\n).
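For example, looping over the foo.txt stream from above and printing each line’s repr shows the lines with their trailing newlines (a small sketch reusing that file):

>>> for line in open('foo.txt','r'):
...     print repr(line)
...
'Great news\n'
'Monty Python lives!\n'
'Hooray!\n'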
The way iterables work is that they have a next method; each call to next returns the appropriate data item for the iterable’s current state and updates the state. The code implementing a for loop calls that method each time through the loop. Thus the relevant generalization about iterables is that they can be iterated through. We can get almost the same behavior as we got with readline by directly using the next method:
>>> ifh = open('foo.txt','r')
>>> ifh.next()
'Great news\n'
>>> ifh.next()
'Monty Python lives!\n'
>>> ifh.next()
'Hooray!\n'
>>> ifh.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The difference is that when we’ve run out of lines, calling next again raises a StopIteration exception. The inner code of a for loop uses this exception to know when to stop iterating and exit the loop.
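To make that concrete, here is roughly what a for loop over a file-like-object does behind the scenes (a sketch of the idea, not the interpreter’s actual code):

ifh = open('foo.txt','r')
while True:
    try:
        line = ifh.next()      # what each pass of "for line in ifh" does
    except StopIteration:      # raised once the stream is exhausted
        break
    print repr(line)           # the body of the loop
ifh.close()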
What this shows us is that file streams are resources. They can be used up, at which point they no longer yield the content they provided while they were active. Files are resources in another sense as well: open files (active file streams) use up operating system resources, which can also be exhausted. With this in mind, it is a good idea to close any file that has been opened:
>>> ifh.close()
5.6.2. Word counts revisited
Here’s something we’ve seen before, opening a file for reading and getting word counts. We’ll take a fresh look now that we understand file-like-objects better:
from collections import Counter
ctr = Counter()
token_ctr = 0

with open('pride_and_prejudice.txt','r') as file_handle:
    for line in file_handle:
        line_words = line.strip().split()
        for word in line_words:
            token_ctr += 1
            ctr[word] += 1
This uses the with statement. with introduces a code block within which the file stream is active. One of the benefits of using a with statement block is that the file is automatically closed for you when you exit the block.
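Without with, we would have to arrange for the close ourselves, typically with a try/finally block. The following is a sketch of roughly equivalent behavior (not the code the with statement literally executes):

from collections import Counter
ctr = Counter()
token_ctr = 0

file_handle = open('pride_and_prejudice.txt','r')
try:
    for line in file_handle:
        for word in line.strip().split():
            token_ctr += 1
            ctr[word] += 1
finally:
    file_handle.close()    # runs even if the loop raises an exception

Either way, the file ends up closed and the counts come out the same.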
Returning to the effects of the code snippet above, we discover:
>>> len(ctr)
13638
>>> token_ctr
124588
What happened? len(ctr) gives the vocabulary size, 13,638; the distinct word types are the things we’re counting:
>>> ctr.most_common(10)
[('the', 4205), ('to', 4121), ('of', 3660), ('and', 3309),
('a', 1945), ('her', 1858), ('in', 1813), ('was', 1796),
('I', 1740), ('that', 1419)]
The number of word tokens is 124,588. Counting tokens of word types like the and to is what gets us the raw numbers like 4205 and 4121:
>>> sum(ctr.values())
124588
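Individual counts can also be looked up by key; since ctr is a Counter, a word that never occurred simply has a count of 0 (a quick illustration; 'zyzzyva' is just a stand-in for a word that does not appear in the novel):

>>> ctr['the']
4205
>>> ctr['zyzzyva']
0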
This code makes use of Counters, introduced in Section Dictionaries. There is an even simpler way to use a counter:
ctr2 = Counter(open('pride_and_prejudice.txt','r'))
In this case, however, we iterate through the file-like-object as in a for loop, which means we iterate line by line. So what the counter will be counting is lines:
>>> ctr2.most_common(10)
[('\r\n', 2394),
 (' * * * * *\r\n', 6),
 ('them."\r\n', 3),
 ('it.\r\n', 3),
 ('them.\r\n', 3),
 ('family.\r\n', 2),
 ('do."\r\n', 2),
 ('between Mr. Darcy and herself.\r\n', 2),
 ('almost no restrictions whatsoever. You may copy it, give it away or\r\n', 2),
 ('together.\r\n', 2)]
What would we be counting if we did the following?
ctr3 = Counter(open('pride_and_prejudice.txt','r').read())
And finally, what about the following?
ctr4 = Counter(open('pride_and_prejudice.txt','r').read().split())
5.6.3. Other kinds of file-like objects
The kind of behavior illustrated above with Python’s default file streams can be replicated in a variety of input situations. For example, we saw in the Section on Dictionaries that we could download material from the web through a file-like object:
>>> from collections import Counter
>>> from urllib2 import urlopen
>>> book_url = 'http://www.gutenberg.org/ebooks/1342.txt.utf-8'  # Pride & Prejudice URL
>>> handle = urlopen(book_url)
>>> handle
<addinfourl at 4303001288 whose fp = <socket._fileobject object at 0x1004a33d0>>
Note
The code above can be downloaded here.
This can now be accessed the same way we accessed a file:
>>> handle.next()
'\xef\xbb\xbfThe Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n'
>>> handle.next()
'\r\n'
>>> handle.readline()
'This eBook is for the use of anyone anywhere at no cost and with\r\n'
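Since the handle behaves like a local file stream, the word-counting loop from earlier works on it unchanged. Here is a sketch (it assumes the connection is still open; remember that this stream, too, is exhausted as we read it, and that we have already consumed its first three lines above):

>>> ctr_web = Counter()
>>> for line in handle:
...     for word in line.strip().split():
...         ctr_web[word] += 1
...
>>> handle.close()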