4.6. Files and file IO streams
This section has an accompanying IPython notebook.
It also makes use of the UTF-8 text version of Pride and Prejudice, downloaded from
Project Gutenberg. You may visit Project Gutenberg
and learn your way around (which will be needed for some
of our future assignments) and download Pride and Prejudice from there,
or, for now, you can download it directly.
Here’s one way to open a file and read in all of the contents:
file_str = open('pride_and_prejudice.txt','r').read()
file_str is now a single string containing
the entire novel. Let’s look at some statistics:
>>> len(file_str)
717569
>>> len(file_str.split())
124588
>>> file_lines = open('pride_and_prejudice.txt','r').readlines()
>>> len(file_lines)
13426
Why are the lengths for these containers so different? We address this in the next section.
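As a preview: the three calls count different units (characters, whitespace-separated words, and lines). A small sketch with a made-up three-line string standing in for the novel shows the same pattern:

```python
# A made-up three-line string standing in for the novel's text.
text = "It is a truth\nuniversally acknowledged\nthat...\n"

n_chars = len(text)               # characters, newlines included
n_words = len(text.split())       # whitespace-separated words
n_lines = len(text.splitlines())  # lines
```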
4.6.1. File stream objects
Let’s consider the python type tree again:
Notice the tree contains something more than containers. The next
level up from container is iterable, and the only
example given of a non-container iterable is a file-like-object.
A file-like-object is what the
open command called above returns:
>>> file_obj = open('pride_and_prejudice.txt','r')
>>> file_obj
<open file 'pride_and_prejudice.txt', mode 'r' at 0x1002dced0>
The read and readlines methods are methods of file-like-objects.
The read method returns the
entire file as a single string; the
readlines method returns the entire
file as a list of lines (each line is a string).
There is no method on file-like-objects that returns
a list of words, but there is a method on strings,
split, that splits them into words (roughly;
we discuss shortcomings of
split in Section
Regression and Classification):
Method     For               Returns
-------------------------------------------------------
read       file-like-object  string
split      string            list of strings (words)
readlines  file-like-object  list of strings (lines)
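A minimal sketch of all three rows of the table, using a throwaway two-line file (the name sample.txt is made up for illustration):

```python
# Create a small two-line file to run the three methods on.
out = open('sample.txt', 'w')
out.write("Great news\nMonty Python lives!\n")
out.close()

file_str = open('sample.txt', 'r').read()         # read: one string
file_words = file_str.split()                     # split: list of words
file_lines = open('sample.txt', 'r').readlines()  # readlines: list of lines
```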
An iterable is something
very like a container. It has a set of elements that can be iterated
through in sequence. The difference between ordinary
containers like lists and tuples and iterables like
file-like-objects is that file-like-objects work by establishing
a link to some data resource out in the world;
in the case of a file on disk that resource is
an input-output stream. On demand
a portion of that data can be transferred into the Python process
and worked with; but unlike a container, that resource can be exhausted.
Once all the data in a file-like-object stream has been transferred,
the stream is empty and will yield no more data. If the stream was
linked to a file, the file is of course still there. But to get
at the data again, a new file-like-object must be created with
the open command. Here's a demonstration:
>>> ofh = open('foo.txt','w')
>>> print >> ofh, 'Great news'
>>> print >> ofh, "Monty Python lives!"
>>> print >> ofh, 'Hooray!'
>>> ofh.close()
We opened a file for writing through the stream ofh.
Then we wrote three lines to it. Now let's read them back:
>>> ifh = open('foo.txt','r')
>>> line1 = ifh.readline()
>>> line1
'Great news\n'
>>> line2 = ifh.readline()
>>> line2
'Monty Python lives!\n'
>>> line3 = ifh.readline()
>>> line3
'Hooray!\n'
>>> line4 = ifh.readline()
>>> line4
''
Each call to readline returns the next line in the
file. So a file-like-object maintains a state
tracking where it is in the file.
After the last bit of data is read, each successive
call to readline returns the empty string.
The stream is exhausted.
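Exhaustion can be sketched with a throwaway file (foo.txt, as above):

```python
# Write a small file, then drain its read stream twice.
out = open('foo.txt', 'w')
out.write("Great news\nHooray!\n")
out.close()

ifh = open('foo.txt', 'r')
first_pass = ifh.readlines()   # transfers everything; the stream is now empty
second_pass = ifh.readlines()  # nothing left: an empty list
ifh.close()

# The file is still on disk; a fresh stream yields the data again.
fresh_pass = open('foo.txt', 'r').readlines()
```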
Iterables may be iterated through in a loop. In the case of a file-like-object, the iteration through the file will be line by line. So in:
for line in fh:
line will be a line of the file if
fh is a file-like-object.
The line will end with a
newline character ('\n').
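For instance, looping over a small throwaway file collects its lines, newlines and all:

```python
out = open('foo.txt', 'w')
out.write("Great news\nMonty Python lives!\nHooray!\n")
out.close()

lines = []
fh = open('foo.txt', 'r')
for line in fh:          # iteration over a file is line by line
    lines.append(line)   # each line keeps its trailing '\n'
fh.close()
```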
The way iterables work is that they have a
next method; each call to
next returns the appropriate data
item for the iterable's current state and updates the state.
The code implementing a
for loop calls that method each time through
the loop. Thus the relevant generalization about iterables is that
they can be iterated through. We can get almost the same behavior
as we got with
readline by directly using the
next method:
>>> ifh = open('foo.txt','r')
>>> ifh.next()
'Great news\n'
>>> ifh.next()
'Monty Python lives!\n'
>>> ifh.next()
'Hooray!\n'
>>> ifh.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
The difference is that when we've run out of lines, calling
next again raises a
StopIteration error. The inner code of a for
loop uses this property to know when to stop iterating and exit the loop.
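The transcript above uses Python 2's next method on the file object; the same exhaustion behavior can be sketched with the built-in next function (which also works in Python 3, where the method is spelled __next__):

```python
out = open('foo.txt', 'w')
out.write("Hooray!\n")
out.close()

ifh = open('foo.txt', 'r')
first = next(ifh)      # the one and only line
try:
    next(ifh)          # the stream is exhausted
    exhausted = False
except StopIteration:  # exactly what a for loop catches internally
    exhausted = True
ifh.close()
```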
What this shows us is that file streams are resources. They can be used up, at which point they no longer yield the content they provided when they were active. Files are resources in another sense; open files (active file streams) actually use up operating system resources that can be exhausted. With this in mind, it is a good idea to close any file that has been opened.
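A minimal sketch of the open-use-close discipline; the stream's closed attribute reports whether it still holds its resource:

```python
fh = open('foo.txt', 'w')   # the stream now holds an OS-level resource
was_open = not fh.closed
fh.write("Great news\n")
fh.close()                  # release the resource as soon as you are done
now_closed = fh.closed
```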
4.6.2. Word counts revisited
Here’s something we’ve seen before, opening a file for reading and getting word counts. We’ll take a fresh look now that we understand file-like-objects better:
from collections import Counter
ctr = Counter()
token_ctr = 0

with open('pride_and_prejudice.txt','r') as file_handle:
    for line in file_handle:
        line_words = line.strip().split()
        for word in line_words:
            token_ctr += 1
            ctr[word] += 1
This uses the with statement, which establishes
a code block within which the file stream is active. One of the
benefits of using a
with statement block is that
the file is automatically closed for you when you exit the block.
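The automatic close can be checked directly with the stream's closed attribute (sample.txt is a throwaway file for illustration):

```python
out = open('sample.txt', 'w')
out.write("Great news\n")
out.close()

with open('sample.txt', 'r') as file_handle:
    open_inside = not file_handle.closed  # stream is active inside the block
# Leaving the block closes the stream, with no explicit close() call.
closed_after = file_handle.closed
```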
Returning to the effects of the code snippet above, we discover:
>>> len(ctr)
13638
>>> token_ctr
124588
What happened? The vocabulary size (the number of distinct word types) is 13,638; that is what len(ctr) counts:
>>> ctr.most_common(10)
[('the', 4205), ('to', 4121), ('of', 3660), ('and', 3309), ('a', 1945),
 ('her', 1858), ('in', 1813), ('was', 1796), ('I', 1740), ('that', 1419)]
The number of word tokens is 124,588. Counting tokens of word types like the and to is what gets us the raw numbers like 4205 and 4121:
>>> sum(ctr.values())
124588
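The relationship between the two numbers (types versus tokens) can be checked on any small made-up sample:

```python
from collections import Counter

words = "the cat and the dog and the bird".split()
ctr_small = Counter(words)
n_types = len(ctr_small)            # distinct word types
n_tokens = sum(ctr_small.values())  # total word tokens
```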
This code makes use of Counters, introduced in Section dictionaries. There is an even simpler way to use a counter:
ctr2 = Counter(open('pride_and_prejudice.txt','r'))
In this case however, we iterate through the file-like-object
as in a
for loop, which means we iterate line by line.
So what the counter will be counting will be lines:
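A sketch of the line-counting behavior on a small throwaway file:

```python
from collections import Counter

out = open('sample.txt', 'w')
out.write("spam\nspam\neggs\n")
out.close()

# Counter iterates through its argument; a file iterates line by line.
line_ctr = Counter(open('sample.txt', 'r'))
```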
>>> ctr2.most_common(10)
[('\r\n', 2394),
 (' * * * * *\r\n', 6),
 ('them."\r\n', 3),
 ('it.\r\n', 3),
 ('them.\r\n', 3),
 ('family.\r\n', 2),
 ('do."\r\n', 2),
 ('between Mr. Darcy and herself.\r\n', 2),
 ('almost no restrictions whatsoever. You may copy it, give it away or\r\n', 2),
 ('together.\r\n', 2)]
What would we be counting if we did the following?
ctr3 = Counter(open('pride_and_prejudice.txt','r').read())
And finally, what about the following?
ctr4 = Counter(open('pride_and_prejudice.txt','r').read().split())
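You can check your answers to both questions on a small made-up string before trying the novel:

```python
from collections import Counter

text = "to be or not to be"
char_ctr = Counter(text)          # what feeding read() to Counter counts
word_ctr = Counter(text.split())  # what feeding read().split() counts
```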
4.6.3. Other kinds of file-like objects
The kind of behavior illustrated above with Python's default file streams can be replicated in a variety of input situations. For example, we saw in the Section on dictionaries that we could download material from the web through a file-like object:
>>> from collections import Counter
>>> from urllib2 import urlopen
>>> book_url = 'http://www.gutenberg.org/ebooks/1342.txt.utf-8'  # Pride & Prejudice URL
>>> handle = urlopen(book_url)
>>> handle
<addinfourl at 4303001288 whose fp = <socket._fileobject object at 0x1004a33d0>>
The code above can also be downloaded.
This can now be accessed the same way we accessed a file:
>>> handle.next()
'\xef\xbb\xbfThe Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n'
>>> handle.next()
'\r\n'
>>> handle.readline()
'This eBook is for the use of anyone anywhere at no cost and with\r\n'
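File-like behavior is not limited to disk files and network connections; the standard library's io.StringIO class wraps an in-memory string in the same interface. A sketch (no network needed):

```python
import io

# An in-memory file-like object over a two-line string.
handle = io.StringIO(u"line one\nline two\n")
first = handle.readline()         # behaves like a file's readline
rest = [line for line in handle]  # and iterates line by line
```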