11.2. Using HTMLParser

Once you have your hands on some raw HTML downloaded from the web, you will probably want to parse it with an HTML parser and then extract some piece of content you know is there.

In this section we demonstrate one of the easiest things to do, which is simply to extract all the text. This simple code snippet, which is taken from Brad Dayley’s Python Phrasebook, can be downloaded here:

import HTMLParser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):
        
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)
    

#Create instance of HTML parser
lParser = parseText()

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"
#Feed HTML file into parser
lParser.feed(urllib.urlopen(thisurl).read())
lParser.close()
for item in urlText:
    print item

What is happening here is quite simple. The first few lines define a custom parseText class as a specialization of the HTMLParser.HTMLParser class defined in Python’s HTMLParser module, which is imported in line 1.

The HTMLParser documentation invites users to define their own specializations, anticipating that they will want to implement their own data handling by redefining the handle_data method. This is done in lines 9-11. The parser calls handle_data with each run of text it finds between tags, not with one line at a time. A run consisting of a single line break arrives as '\n', so the method simply checks whether the data is a bare newline and, if it isn’t, appends it to the global list urlText.
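As a rough illustration of how handle_data gets called, the sketch below feeds a small made-up HTML string to a parser instead of a downloaded page. The class name TextCollector, the chunks attribute, and sample_html are assumptions introduced for this example and are not part of the original snippet. The try/except import is there only so the sketch also runs on Python 3, where the module is named html.parser. The sketch stores the text on the parser instance rather than in a global list, and it skips any whitespace-only run, which is a slightly broader test than comparing against '\n'.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class TextCollector(HTMLParser):
    """Hypothetical variant of parseText that keeps its results on the instance."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called with each run of text found between tags, not once per line.
        # Skip runs that contain only whitespace (broader than data != '\n').
        if data.strip():
            self.chunks.append(data)


# Made-up sample document, so the example needs no network access.
sample_html = "<html><body><h1>Title</h1>\n<p>Some <b>bold</b> text.</p></body></html>"

collector = TextCollector()
collector.feed(sample_html)
collector.close()

for chunk in collector.chunks:
    print(repr(chunk))
```

Printing with repr makes it visible that each item is a run of text between two tags rather than a line of the file.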

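Line 15 creates an instance of the class. Line 19 opens the URL using urllib.urlopen, described in the last section, reads the page, and feeds its contents to the parser. The close method tells the parser that no more input is coming, so that any buffered data is processed; the text that was found is then available in urlText.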
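The roles of feed and close can be sketched without a network connection by handing the parser a document in several pieces, roughly as it might arrive from a slow download. The pieces list and the PieceCollector class are hypothetical names made up for this illustration, and the import fallback assumes the Python 3 module name html.parser. Because a single run of text may reach handle_data in more than one call when the input is split, the sketch joins the collected pieces before using them.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name


class PieceCollector(HTMLParser):
    """Hypothetical parser that records every piece of text it is handed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up input, split in the middle of a word and in the middle of a tag.
pieces = ["<p>Hello, wor", "ld</", "p><p>Goodbye</p>"]

parser = PieceCollector()
for piece in pieces:
    # feed may be called repeatedly; incomplete markup is held until more arrives.
    parser.feed(piece)
# close signals the end of the input so that anything still buffered is processed.
parser.close()

text = "".join(parser.chunks)
print(text)
```

In the original snippet the whole page is read and passed to feed in one call, so close has little left to do, but calling it is still the documented way to tell the parser that the input is complete.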