Screen scraping 3
Use BeautifulSoup
# Python 2 code (urllib.urlopen and the print statement)
from urllib import urlopen
from bs4 import BeautifulSoup as BS

# Download the page and parse it
text = urlopen("http://www.python.org/community/jobs/").read()
soup = BS(text.decode('gbk', 'ignore'))

jobs = set()
for header in soup('h2'):
    links = header('a', 'reference')
    if not links: continue    # skip headers without a matching link
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key = lambda s: s.lower()))

Notes:
The set eliminates duplicates, and the names are printed in case-insensitive sorted order.
soup('h2') returns a list of all h2 elements.
header('a', 'reference') returns a list of the a elements under the header that have the class reference.
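The same extraction can be tried offline. Below is a minimal Python 3 sketch, assuming bs4 is installed. The sample markup, the example.com links, and the function name extract_jobs are hypothetical stand-ins for the live jobs page, whose layout may no longer match. The parser is named explicitly as "html.parser" here, which is a choice made for this sketch and not something the original code does.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the jobs page
SAMPLE_HTML = """
<html><body>
<h2><a class="reference" href="http://example.com/b">beta Corp</a></h2>
<h2><a class="reference" href="http://example.com/a">Alpha Inc</a></h2>
<h2><a class="reference" href="http://example.com/a">Alpha Inc</a></h2>
<h2>No link here</h2>
<h2><a href="http://example.com/c">Unclassed link</a></h2>
</body></html>
"""


def extract_jobs(html):
    """Return 'name (url)' strings, de-duplicated and sorted case-insensitively."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = set()  # a set drops duplicate entries
    for header in soup("h2"):  # calling the soup is shorthand for find_all
        # a elements under this header whose class is "reference"
        links = header("a", "reference")
        if not links:
            continue  # header has no matching link
        link = links[0]
        jobs.add("%s (%s)" % (link.string, link["href"]))
    return sorted(jobs, key=str.lower)


if __name__ == "__main__":
    print("\n".join(extract_jobs(SAMPLE_HTML)))
```

With this sample, the duplicate "Alpha Inc" entry collapses into one, and the two headers without a reference-class link are skipped.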
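Author: Shane
Source: http://bluescorpio.cnblogs.com
The copyright of this article is shared by the author and Cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be given in a prominent position on the page; otherwise the author reserves the right to pursue legal liability.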