11.3. Simple scraping

In the previous sections we introduced the two basic steps of web scraping: downloading some data and parsing it into text. The downside of doing it that way was that we threw away all the HTML and ended up with plain strings. In many cases that gives you far more text than you want. For example, you will get all the ad copy, or, more likely, all the code (in various web frameworks) that loads the ad copy. Sifting through this to find what you want can be quite tedious and, if significant amounts of data are being downloaded, impractical.

In this section we’ll look at a different approach: using an HTML parser to build a tree that preserves the structure of the original HTML, and then searching that tree for the data you want. For this approach, we’ll use a somewhat easier-to-use HTML parser, provided by the lxml module. We’ll start with an example from the book Python for Data Analysis. Here’s the code for that example:

from lxml.html import parse
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3
from pandas.io.parsers import TextParser

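def _unpack(row, kind='td'):
    # Collect every <kind> element (by default <td>) anywhere under this row.
    elts = row.findall('.//%s' % kind)
    # .text holds only the text before an element's first child;
    # it is None for a cell that begins with nested markup.
    return [val.text for val in elts]

def parse_options_data(table):
    # The first row supplies the column names; the remaining rows are the data.
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    # TextParser converts the list of rows into a pandas DataFrame.
    return TextParser(data, names=header).get_chunk()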

if __name__ == '__main__':
    #parsed = parse('http://finance.yahoo.com/q/op?s=AAPL+Options')
    #parsed = parse('http://www-rohan.sdsu.edu/~gawron')

    #parsed = parse('http://www.lajollasurf.org/cgi-bin/plottide.pl')
    url = 'http://www.ezfshn.com/tides/usa/california/san%20diego'
    parsed = parse(url)

    doc = parsed.getroot()

    links = doc.findall('.//a')

    links_sub_list = links[15:20]
    lnk = links_sub_list[0]

    sample_url = lnk.get('href')

    sample_display_text = lnk.text_content()

    tables = doc.findall('.//table')
    ## Look at tables,  find a table of interest
    #puts = tables[9]
    ## Ditto
    #calls = tables[13]
    dt = tables[0]

    rows = dt.findall('.//tr')

    headers = _unpack(rows[0],kind='th')

    row_vals =  _unpack(rows[1],kind='td')

    #call_data = parse_options_data(calls)
    tide_data = parse_options_data(dt)

    # Show the first ten rows; print(...) with one argument works in Python 2 and 3.
    print(tide_data[:10])
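
As an aside, the tree-searching idea can be tried out without a network connection. The following is a minimal sketch, not the book’s code. It assumes a small, made-up, well-formed table, and it uses the standard library’s xml.etree.ElementTree as a stand-in for lxml, since both accept the same findall path expressions. Real pages are rarely well-formed XML, which is one reason to prefer lxml for actual scraping. The names SNIPPET, unpack and table_to_records are illustrative only.

```python
import xml.etree.ElementTree as ET

# A small, well-formed table standing in for a downloaded page (made-up data).
SNIPPET = """
<html><body>
  <table>
    <tr><th>Time</th><th>Height</th></tr>
    <tr><td>03:12</td><td>4.1</td></tr>
    <tr><td>09:47</td><td>1.3</td></tr>
  </table>
</body></html>
"""

def unpack(row, kind='td'):
    """Return the text of every <kind> cell found under row."""
    return [cell.text for cell in row.findall('.//%s' % kind)]

def table_to_records(table):
    """Treat the first row as the header and pair it with each later row."""
    rows = table.findall('.//tr')
    header = unpack(rows[0], kind='th')
    return [dict(zip(header, unpack(r))) for r in rows[1:]]

# Parse the string into a tree, then search the tree for the table.
root = ET.fromstring(SNIPPET)
table = root.findall('.//table')[0]
records = table_to_records(table)
print(records)
```

Each record is a plain dictionary keyed by the header cells. The listing above hands the same header and rows to TextParser instead, to get a pandas DataFrame.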

The download begins as before.