Python Crawler Development [Part 1] [The BeautifulSoup4 Parser]

CSS selectors: BeautifulSoup4

Beautiful Soup is also an HTML/XML parser. Its main job is likewise to parse HTML/XML documents and extract data from them.
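To make "parse and extract" concrete, here is a minimal sketch. The HTML snippet and its ids and class names are made up for illustration, and the built-in "html.parser" backend is an assumption chosen so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example only.
html_doc = """
<html><head><title>Demo page</title></head>
<body>
  <p class="intro">Hello, <b>world</b>!</p>
  <a href="/first" id="link1">First</a>
  <a href="/second" id="link2">Second</a>
</body></html>
"""

# Parse the markup into a tree (assumption: the standard-library parser backend).
soup = BeautifulSoup(html_doc, "html.parser")

# Navigate by tag name, then take the text content.
title_text = soup.title.get_text()

# select() takes a CSS selector and returns a list of matching tags.
intro_text = soup.select("p.intro")[0].get_text()
links = [a["href"] for a in soup.select("a")]

# select_one() returns the first match (or None when nothing matches).
second_label = soup.select_one("#link2").get_text()

print(title_text)    # Demo page
print(intro_text)    # Hello, world!
print(links)         # ['/first', '/second']
print(second_label)  # Second
```

select() always returns a list, so an empty result has to be checked before indexing into it.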

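Install with pip: pip install beautifulsoup4 (the example further down passes 'lxml' as the parser, which additionally requires pip install lxml)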

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Tool	Speed	Ease of use	Installation
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate
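The ease-of-use column is easier to judge with the same task done both ways. The sketch below pulls link targets and labels out of a small snippet, once with a regular expression and once with BeautifulSoup. The markup is hypothetical, "html.parser" is an assumed backend, and the example makes no attempt to measure speed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this comparison.
html_doc = (
    '<ul>'
    '<li><a href="/jobs/1">Engineer</a></li>'
    '<li><a href="/jobs/2">Designer</a></li>'
    '</ul>'
)

# Regular expression: no dependency, but the pattern is tied to the exact
# attribute order, quoting style and whitespace of this snippet.
regex_links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html_doc)

# BeautifulSoup: the selector describes the structure instead of the raw text,
# so it keeps working if attributes are reordered or whitespace changes.
soup = BeautifulSoup(html_doc, "html.parser")
bs_links = [(a["href"], a.get_text()) for a in soup.select("li a")]

print(regex_links)  # [('/jobs/1', 'Engineer'), ('/jobs/2', 'Designer')]
print(bs_links)     # [('/jobs/1', 'Engineer'), ('/jobs/2', 'Designer')]
```

Both approaches give the same result here. The difference shows up once the markup varies, because the regular expression must then be rewritten while the selector usually does not.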

Scraping the Tencent recruitment page with BeautifulSoup4

URL: http://hr.tencent.com/position.php?&start=10#a

# bs4_tencent.py

import json  # the results are stored as JSON
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def tencent():
    url = 'http://hr.tencent.com/'
    request = Request(url + 'position.php?&start=10#a')
    response = urlopen(request)
    resHtml = response.read()

    html = BeautifulSoup(resHtml, 'lxml')

    # Select the table rows with CSS selectors
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        # The first cell holds the link to the position; the rest are plain text
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII escaping so non-ASCII text is written as UTF-8
    line = json.dumps(items, ensure_ascii=False)

    with open('tencent.json', 'w', encoding='utf-8') as output:
        output.write(line)


if __name__ == "__main__":
    tencent()
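The row-extraction step can be tried without any network access. The sketch below runs the same kind of selectors against a made-up table. The sample HTML, the position_detail.php links, the parse_rows helper and the "html.parser" backend are all assumptions for illustration and only imitate the column layout the script above expects.

```python
import json
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: one header row plus rows alternating "even" and "odd".
SAMPLE_HTML = """
<table>
  <tr class="h"><td>Position</td><td>Category</td><td>Openings</td><td>Location</td><td>Published</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td><td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-08-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Product Designer</a></td><td>Design</td><td>1</td><td>Beijing</td><td>2018-08-02</td></tr>
</table>
"""

BASE_URL = "http://hr.tencent.com/"


def parse_rows(html_text):
    """Hypothetical helper: turn each data row into a dict."""
    soup = BeautifulSoup(html_text, "html.parser")
    items = []
    # A comma-separated selector matches either class in a single query.
    for row in soup.select("tr.even, tr.odd"):
        cells = row.select("td")
        link = cells[0].select_one("a")
        items.append({
            "name": link.get_text(strip=True),
            # urljoin resolves the relative href against the base URL.
            "detailLink": urljoin(BASE_URL, link["href"]),
            "catalog": cells[1].get_text(strip=True),
            "recruitNumber": cells[2].get_text(strip=True),
            "workLocation": cells[3].get_text(strip=True),
            "publishTime": cells[4].get_text(strip=True),
        })
    return items


items = parse_rows(SAMPLE_HTML)
# ensure_ascii=False keeps non-ASCII characters readable in the output.
print(json.dumps(items, ensure_ascii=False, indent=2))
```

Note that tr.even matches any row whose class list contains "even", whereas tr[class="even"] in the script requires the class attribute to be exactly that string. The header row is skipped in both cases because its class is neither "even" nor "odd".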

Posted 2018-08-11 19:37 by Nice1949
