Screen scraping 2
Using HTMLParser
Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.
handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.
handle_startendtag(tag, attrs): called for empty tags such as <br/>; the default implementation calls handle_starttag and then handle_endtag.
handle_endtag(tag): called when an end tag is found.
handle_data(data): called for textual data.
handle_charref(ref): called for character references of the form &#ref;
handle_entityref(name): called for entity references of the form &name;
handle_decl(decl): called for declarations of the form <!...>
handle_pi(data): called for processing instructions.
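To see when these callbacks fire, here is a minimal sketch that records each event for a small HTML string. It assumes Python 3, where the class lives in html.parser rather than in the HTMLParser module; the EventLogger class and the sample markup are made up for illustration. In Python 3, handle_charref and handle_entityref are only called when convert_charrefs is False, so the sketch turns that option off.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records every callback as a tuple."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref and handle_entityref
        # active; with the Python 3 default (True) references are folded
        # into the text passed to handle_data instead.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, ref):
        self.events.append(("charref", ref))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


logger = EventLogger()
logger.feed('<a href="/jobs">Jobs &amp; more &#33;</a>')
logger.close()
for event in logger.events:
    print(event)
```

The start tag arrives with its attributes as a list of (name, value) pairs, which is why the scraper in this post converts attrs with dict() before looking up href.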
# Python 2 code: prints the text and URL of every link found inside an h2 heading.
from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    in_h2 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            # Start collecting the link text and remember its target.
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            # Only report links that were closed inside an h2.
            if self.in_h2 and self.in_link:
                print '%s (%s)' % (''.join(self.chunks), self.url)
            self.in_link = False

text = urlopen("http://www.python.org/community/jobs/").read()
parser = Scraper()
parser.feed(text)
parser.close()
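The listing above targets Python 2. As a rough sketch of the same idea under Python 3, the version below imports HTMLParser from html.parser and collects its matches in a results list instead of printing them. The results attribute and the sample markup are assumptions added for illustration, so the example runs without network access.

```python
from html.parser import HTMLParser


class Scraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.results = []  # (title, url) pairs; not part of the original code

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self.in_h2 = True
        if tag == "a" and "href" in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs["href"]

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        if tag == "a":
            # Keep only links that were closed inside an h2 heading.
            if self.in_h2 and self.in_link:
                self.results.append(("".join(self.chunks), self.url))
            self.in_link = False


# Made-up sample markup standing in for the downloaded page.
sample = (
    '<h2><a href="/jobs/1">Python Developer</a></h2>'
    '<p><a href="/about">About</a></p>'
    '<h2><a href="/jobs/2">Data &amp; Tools Engineer</a></h2>'
)
parser = Scraper()
parser.feed(sample)
parser.close()
for title, url in parser.results:
    print("%s (%s)" % (title, url))
```

To run this against a live page, the text could be fetched with urlopen from urllib.request and decoded to str before calling feed; the right encoding depends on the page and would have to be checked.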
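Author: Shane
Source: http://bluescorpio.cnblogs.com
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be displayed prominently on the page; otherwise the author reserves the right to pursue legal action.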