11.1. Using urllib

A URL, or Uniform Resource Locator, is an address on the World Wide Web. In a Python program that address takes the form of a string such as:

http://www-rohan.sdsu.edu
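Because a URL is just a string, it can help to see the parts it is made of. The following is a minimal sketch, assuming Python 3 and its standard urllib.parse module; the longer URL is the one used as the example in this section, and the variable name parts is only illustrative.

```python
from urllib.parse import urlsplit

# Split the example URL into its named components.
parts = urlsplit("http://www-rohan.sdsu.edu/~gawron/index.html")

print(parts.scheme)  # protocol used to reach the server: http
print(parts.netloc)  # host name of the server: www-rohan.sdsu.edu
print(parts.path)    # location of the page on that server: /~gawron/index.html
```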

The Python urllib module opens a communication link with a URL. The link can then be used to download the raw content of a web page. The code below uses the Python 2 interface, urllib.urlopen.

import urllib

thisurl = "http://www-rohan.sdsu.edu/~gawron/index.html"

handle = urllib.urlopen(thisurl)

html_gunk =  handle.read()

If no errors have occurred, the variable handle (line 5) is set to a file-like object that represents the communication link set up with the website. Besides the page content, it provides information about the response, such as the headers the server sent. As data consumers, we will mostly be interested in just downloading an entire web page and extracting information from it. This is done by setting the variable html_gunk to the result of calling the read method on handle (line 7).
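The listing above is written for Python 2. In Python 3 the urlopen function lives in urllib.request, and read returns bytes rather than a string, so the result has to be decoded. The following is a minimal sketch under those assumptions; fetch_html, fallback_encoding, and timeout are hypothetical names chosen for illustration, and falling back to UTF-8 when the server declares no character set is an assumption, not a rule.

```python
import urllib.request


def fetch_html(url, fallback_encoding="utf-8", timeout=10):
    """Download the resource at url and return its content as text."""
    # urlopen returns a file-like response object; the with block closes it.
    with urllib.request.urlopen(url, timeout=timeout) as handle:
        raw = handle.read()  # bytes in Python 3
        # Prefer the character set declared in the response headers, if any.
        encoding = handle.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes instead of raising an exception.
    return raw.decode(encoding, errors="replace")
```

With this helper, the two steps of the original listing collapse into one call, html_gunk = fetch_html(thisurl), and html_gunk is again an ordinary string.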

The variable html_gunk is now set to the string downloaded from the web page, which contains markup written in HyperText Markup Language (HTML). HTML is the language most commonly used to specify the content and appearance of web pages. A web page written in HTML often has a URL ending in the extension ".htm" or ".html". The first few characters of our example look like this:

>>> html_gunk[:150]
'<html>\n<head>\n<title>Jean Mark Gawron</title>\n</head>\n
<!-- comment wow bgcolor="   #69301C" pr "#2060a0"  #778899\n<body bgcolor="#A91609"
text="#ffff'

Most people downloading data from the web are not interested in looking at HTML. They are interested only in extracting the content expressed in it. This is the job of the HTML parser discussed in the next section.
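As a rough preview of what such extraction can look like, here is a minimal sketch that pulls the page title out of an HTML string. It assumes Python 3 and the standard library's html.parser module; the parser covered in the next section may well be a different one, and TitleParser and extract_title are hypothetical names. The sample string is the start of the output shown above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep only text that appears between <title> and </title>.
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html_text):
    """Return the title text of an HTML string, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip()


sample = "<html>\n<head>\n<title>Jean Mark Gawron</title>\n</head>\n"
print(extract_title(sample))  # Jean Mark Gawron
```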