{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic HTML Parser" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import HTMLParser\n", "import urllib\n", "\n", "urlText = []\n", "\n", "#Define HTML Parser\n", "class parseText(HTMLParser.HTMLParser):\n", "    \n", "    def handle_data(self, data):\n", "        if data != '\\n':\n", "            urlText.append(data)\n", "    \n", "\n", "#Create instance of HTML parser\n", "lParser = parseText()\n", "\n", "thisurl = \"http://www-rohan.sdsu.edu/~gawron/index.html\"\n", "#Feed HTML file into parser\n", "html_gook = urllib.urlopen(thisurl).read()\n", "lParser.feed(html_gook)\n", "lParser.close()\n", "#for item in urlText:\n", "#    print item\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newsfeeds" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import sys\n", "import feedparser\n", "from bs4 import BeautifulSoup\n", "#from bs4 import get_text as clean_html\n", "import urllib\n", "#import HTMLParser\n", "\n", "\n", "def parseHtml(html):\n", "    return BeautifulSoup(html).contents\n", "\n", "def get_feedparser_feed(FEED_URL):\n", "\n", "    fp = feedparser.parse(FEED_URL)\n", "\n", "    if fp and fp.entries and fp.entries[0]:\n", "        print \"Fetched %s entries from '%s'\" % (len(fp.entries), fp.feed.title)\n", "    else:\n", "        print 'No entries parsed!'\n", "        sys.exit()\n", "    return fp\n", "\n", "    ## TODO: Look at fp.status for a 404.\n", "    ## There may be page content but the page you asked for may be gone.\n", "    ## Look at fp.feed.summary for a lot of URLs.\n", "    \n", "def get_blog_posts(fp):\n", "    global feed_dict, blog_posts\n", "    \n", "    blog_posts = []\n", "    for e in fp.entries:\n", "        try:\n", "            content = e.content[0]\n", "        except AttributeError:\n", "            content = e.summary_detail\n", "        feed_dict = {'title': e.title,\n", "                     'content': parseHtml(content.value),\n", " 
'link': e.links[0].href}\n", "        blog_posts.append(feed_dict)\n", "    return blog_posts\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at what a real newsfeed looks like. Here's [the O'Reilly Radar newsfeed](http://feeds.feedburner.com/oreilly/radar/atom)." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fetched 15 entries from 'O'Reilly Radar - Insight, analysis, and research about emerging technologies'\n" ] } ], "source": [ "FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom'\n", "fp = get_feedparser_feed(FEED_URL)\n", "blog_posts = get_blog_posts(fp)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'content': [
    \n", "
  1. Japanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed 'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. The pulses respond to human touch, so that - when interrupted - the hologram's pixels can be manipulated in mid-air.
  2. \n", "
  3. Google Cloud Vision API -- classifies images into thousands of categories (e.g., \"boat,\" \"lion,\" \"Eiffel Tower\"), detects faces with associated emotions, and recognizes printed words in many languages.
  4. \n", "
  5. Not Even Close: The State of Computer Security (Vimeo) -- hilarious James Mickens talk with the best description ever.
  6. \n", "
  7. 20 Product Prioritization Techniques: A Map and Guided Tour -- excellent collection of techniques for ordering possible product work.
  8. \n", "
\n", "
\n", " \n", "
\"\"],\n", " 'link': u'http://feedproxy.google.com/~r/oreilly/radar/atom/~3/la8O5g2zYk4/four-short-links-3-december-2015.html',\n", " 'title': u'Four short links: 3 December 2015'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The content strings are in the `content` attribute, but not as strings." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print type(blog_posts[3]['content'])\n", "len(blog_posts[3]['content'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get to something closer to a string" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[
    \n", "
  1. Japanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed 'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. The pulses respond to human touch, so that - when interrupted - the hologram's pixels can be manipulated in mid-air.
  2. \n", "
  3. Google Cloud Vision API -- classifies images into thousands of categories (e.g., \"boat,\" \"lion,\" \"Eiffel Tower\"), detects faces with associated emotions, and recognizes printed words in many languages.
  4. \n", "
  5. Not Even Close: The State of Computer Security (Vimeo) -- hilarious James Mickens talk with the best description ever.
  6. \n", "
  7. 20 Product Prioritization Techniques: A Map and Guided Tour -- excellent collection of techniques for ordering possible product work.
  8. \n", "
\n", "
\n", " \n", "
\"\"]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]['content'][0].contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which looks like a list of 9 strings, but isn't." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(blog_posts[3]['content'][0].contents[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can if we want get the text string at this point as follows:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "u'\\nJapanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed \\'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. 
The pulses respond to human touch, so that - when interrupted - the hologram\\'s pixels can be manipulated in mid-air.\\nGoogle Cloud Vision API -- classifies images into thousands of categories (e.g., \"boat,\" \"lion,\" \"Eiffel Tower\"), detects faces with associated emotions, and recognizes printed words in many languages.\\nNot Even Close: The State of Computer Security (Vimeo) -- hilarious James Mickens talk with the best description ever.\\n20 Product Prioritization Techniques: A Map and Guided Tour -- excellent collection of techniques for ordering possible product work.\\n\\n\\n \\n'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]['content'][0].contents[0].text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But this text is actually a list of news teasers, and by immediately turning it into strings, we've made it look like connected text! In general, trying to get to strings too soon is the wrong way to use Soup objects. Instead, take advantage of the structure. Print a blog post:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
    \n", "
  1. Japanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed 'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. The pulses respond to human touch, so that - when interrupted - the hologram's pixels can be manipulated in mid-air.
  2. \n", "
  3. Google Cloud Vision API -- classifies images into thousands of categories (e.g., \"boat,\" \"lion,\" \"Eiffel Tower\"), detects faces with associated emotions, and recognizes printed words in many languages.
  4. \n", "
  5. Not Even Close: The State of Computer Security (Vimeo) -- hilarious James Mickens talk with the best description ever.
  6. \n", "
  7. 20 Product Prioritization Techniques: A Map and Guided Tour -- excellent collection of techniques for ordering possible product work.
  8. \n", "
\n", "
\n", " \n", "
\"\"\n" ] } ], "source": [ "print blog_posts[3]['content'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a `BeautifulSoup` representation of an HTML list. That list is a structured object that has already been parsed for you. You don't have to figure out where the list elements start and end, and you don't have to worry about whether every list element actually ends with `</li>` (they frequently don't). To get all the list elements of the HTML list, do:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[
  • Japanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed 'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. The pulses respond to human touch, so that - when interrupted - the hologram's pixels can be manipulated in mid-air.
  • ,\n", "
  • Google Cloud Vision API -- classifies images into thousands of categories (e.g., \"boat,\" \"lion,\" \"Eiffel Tower\"), detects faces with associated emotions, and recognizes printed words in many languages.
  • ,\n", "
  • Not Even Close: The State of Computer Security (Vimeo) -- hilarious James Mickens talk with the best description ever.
  • ,\n", "
  • 20 Product Prioritization Techniques: A Map and Guided Tour -- excellent collection of techniques for ordering possible product work.
  • ]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]['content'][0]('li')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First list element:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "
  • Japanese Scientists Create Touchable Holograms (Reuters) -- Using femtosecond laser technology, the researchers developed 'Fairy Lights, a system that can fire high-frequency laser pulses that last one millionth of one billionth of a second. The pulses respond to human touch, so that - when interrupted - the hologram's pixels can be manipulated in mid-air.
  • " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]['content'][0]('li')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List of links inside the first element:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[Japanese Scientists Create Touchable Holograms]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blog_posts[3]['content'][0]('li')[0]('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And so on. As long as we know something about the structure, we can get any piece of information we need." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing HTML from the internet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we understand a little of how Beautiful Soup works, let's go back to a realistic example of downloading HTML from the internet, grabbing the HTML strings from the feed we parsed with `feedparser`." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "e = fp.entries[0]\n", "content = e.content[0]\n", " \n", "raw_html = content.value\n", "c_html = parseHtml(raw_html)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "unicode" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each feed entry is a JSON-style representation of the content: a Python dict with string keys and dictionary,\n", "# list, and string values.\n", "type(raw_html)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(list, bs4.element.Tag)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(c_html),type(c_html[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# The content attribute of e is a list of dictionaries, each a JSON-style object. The value attribute holds the content string.\n", "e.content[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Among other things, HTML character entities are converted to Unicode to give something close to a plain unicode string.\n", "# Links and images removed. 
The content string is left.\n", "c_html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "type(c_html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# c_html is a plain list, so take the text of its first element\n", "c_html[0].text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Beautiful Soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Beautiful Soup` module is an HTML parser like the one provided by Python's `HTMLParser` module, but it is considerably more powerful \n", "and flexible. The following examples are from http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html\n", "\n", "A Beautiful Soup constructor takes an XML or HTML document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory.\n", "\n", "If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if there's something wrong with the document, Beautiful Soup uses heuristics to figure out a reasonable structure for the data structure.\n", "\n", "### Parsing HTML\n", "\n", "Use the BeautifulSoup class to parse an HTML document. Here are some of the things that BeautifulSoup knows:\n", "\n", " Some tags can be nested (
    ) and some can't (

    ).\n", " Table and list tags have a natural nesting order. For instance, tags go inside tags, not the other way around.\n", " The contents of a