Screen scraping 2

Using HTMLParser

Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.

handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.

handle_startendtag(tag, attrs): called for empty tags such as <br/>; the default implementation calls handle_starttag and then handle_endtag.

handle_endtag(tag): called when an end tag is found.

handle_data(data): called for textual data.

handle_charref(ref): called for character references of the form &#ref;

handle_entityref(name): called for entity references of the form &name;

handle_decl(decl): called for declarations of the form <!...>

handle_pi(data): called for processing instructions.
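To see which callback fires for which piece of markup, here is a minimal sketch that records every event. It is written for Python 3, where the class lives in html.parser rather than the HTMLParser module; the EventLogger class and the sample markup are hypothetical, made up for illustration. Note that in Python 3, convert_charrefs must be set to False, otherwise character and entity references are folded into handle_data.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each handler call as a tuple."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref and
        # handle_entityref active in Python 3.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default start-then-end behaviour.
        self.events.append(("startendtag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


# Made-up sample markup that touches every handler at least once.
logger = EventLogger()
logger.feed('<!DOCTYPE html><?php x ?><p class="a">A&amp;B&#65;<br/></p>')
logger.close()

for event in logger.events:
    print(event)
```

Printing the recorded tuples is a quick way to check how a given page is split into events before writing a real scraper.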

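from urllib import urlopen  # Python 2 module layout
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    # State flags: are we currently inside an <h2> or an <a href=...>?
    in_h2 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # (name, value) pairs -> dict for easy lookup
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            # Start collecting the link text and remember its target.
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        # Link text may arrive in several pieces, so accumulate it.
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            # Only report links that appear inside an <h2> heading.
            if self.in_h2 and self.in_link:
                print '%s (%s)' % (''.join(self.chunks), self.url)
            self.in_link = False

text = urlopen("http://www.python.org/community/jobs/").read()
parser = Scraper()
parser.feed(text)   # parse the downloaded page
parser.close()      # flush any buffered data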
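The same idea can be tried without a network connection. The following is a sketch in Python 3 (module html.parser) that collects the results in a list instead of printing them; the HeadingLinkCollector name, the links attribute and the SAMPLE_HTML string are assumptions made up for this example, not part of the original script.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class HeadingLinkCollector(HTMLParser):
    """Hypothetical variant of Scraper that stores (text, url) pairs."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.links = []  # collected (link text, href) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self.in_h2 = True
        if tag == "a" and "href" in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs["href"]

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) arrives as separate chunks.
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        if tag == "a":
            # Keep only links that sit inside an <h2> heading.
            if self.in_h2 and self.in_link:
                self.links.append(("".join(self.chunks), self.url))
            self.in_link = False


# Made-up sample page: two headline links and one ordinary link.
SAMPLE_HTML = """
<h2><a href="/jobs/1">Python <b>Developer</b></a></h2>
<p><a href="/about">About</a></p>
<h2><a href="/jobs/2">Data Engineer</a></h2>
"""

collector = HeadingLinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()

for title, url in collector.links:
    print("%s (%s)" % (title, url))
```

Returning data instead of printing makes the parser easier to test and reuse; the link in the paragraph is skipped because it is not inside an h2 element.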

 

posted @ 2012-05-22 22:19  小楼