Python Web Scraping, Part 1: The BeautifulSoup4 Parser

CSS selectors: BeautifulSoup4

Beautiful Soup is another HTML/XML parser; its main purpose is likewise to parse HTML/XML documents and extract data from them.

Install with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
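As a quick illustration of "parse, then extract", here is a minimal sketch for Python 3. The HTML snippet and its contents are made up for this example, and it uses the `html.parser` backend that ships with Python so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented purely for illustration.
html_doc = """
<html>
  <head><title>Job list</title></head>
  <body>
    <p class="intro">Open positions:</p>
    <a href="/jobs/1">Backend developer</a>
    <a href="/jobs/2">Data analyst</a>
  </body>
</html>
"""

# "html.parser" is part of the standard library, so no extra parser is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# Extract the page title, one paragraph by class, and every link.
title = soup.title.get_text()
intro = soup.find("p", class_="intro").get_text()
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(intro)
for text, href in links:
    print(text, href)
```

`find` returns the first match (or `None`), while `find_all` returns a list, so real scraping code would normally check for a missing element before calling `get_text()`.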

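Scraping tool    Speed          Difficulty of use    Installation difficulty
Regex            Fastest        Hard                 None (built in)
BeautifulSoup    (not given)    Easiest              Easy
lxml             (not given)    Easy                 Moderate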
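To make the "difficulty of use" column more concrete, the sketch below pulls the same links out of one small HTML string twice, once with a regular expression and once with BeautifulSoup. The HTML string and the regex pattern are assumptions written for this illustration, not something taken from the table's source, and the example says nothing about relative speed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this comparison.
html_doc = (
    '<ul>'
    '<li><a href="/a">First</a></li>'
    '<li><a class="x" href="/b">Second</a></li>'
    '</ul>'
)

# Regex approach: no dependency, but the pattern has to anticipate
# every attribute layout that may appear inside the <a> tag.
pattern = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', re.S)
regex_links = pattern.findall(html_doc)

# BeautifulSoup approach: the parser handles the tag syntax,
# the selector only states what is wanted.
soup = BeautifulSoup(html_doc, "html.parser")
bs_links = [(a["href"], a.get_text()) for a in soup.select("a[href]")]

print(regex_links)
print(bs_links)
```

Both lists hold the same (href, text) pairs here, but the regex version would need to be revisited if, say, the attributes used single quotes, whereas the parser-based version would not.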

 

Scraping the Tencent experienced-hire recruitment page with BeautifulSoup4

URL: http://hr.tencent.com/position.php?&start=10#a
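Before the full script, here is a small offline sketch (Python 3) of the CSS selectors it relies on. The table markup is hypothetical, invented to resemble a listing whose rows alternate between the classes "even" and "odd"; the real page may differ. It also shows two points worth knowing: `tr[class="even"]` compares the whole class attribute, while `tr.even` matches a single class among several, and a comma-separated selector list returns rows in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the row classes mimic an alternating listing table.
html_doc = """
<table>
  <tr class="h"><td>Position</td><td>Category</td></tr>
  <tr class="even"><td><a href="detail.php?id=1">Engineer A</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="detail.php?id=2">Designer B</a></td><td>Design</td></tr>
  <tr class="even highlight"><td><a href="detail.php?id=3">Analyst C</a></td><td>Product</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute selector: the class attribute must be exactly "even".
exact = soup.select('tr[class="even"]')

# Class selector: "even" only has to be one of the row's classes.
by_class = soup.select("tr.even")

# Selector list: even and odd rows together, in document order.
rows = soup.select("tr.even, tr.odd")
names = [row.select("td a")[0].get_text() for row in rows]
categories = [row.select("td")[1].get_text() for row in rows]
first_link = rows[0].select_one("td a")["href"]

print(len(exact), len(by_class), names, categories, first_link)
```

Selecting the two classes separately and concatenating the results also works, but it lists all "even" rows before all "odd" rows rather than keeping the order of the page.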

# bs4_tencent.py
# Written for Python 3 (urllib2 from Python 2 became urllib.request).

import json    # the results are stored in JSON format
import urllib.request

from bs4 import BeautifulSoup


def tencent():
    url = 'http://hr.tencent.com/'
    request = urllib.request.Request(url + 'position.php?&start=10#a')
    response = urllib.request.urlopen(request)
    resHtml = response.read()

    # The 'lxml' parser must be installed separately: pip install lxml
    html = BeautifulSoup(resHtml, 'lxml')

    # Build the CSS selectors: job rows alternate between class "even" and "odd"
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        # Column 0 holds the job title as a link; columns 1-4 hold plain text
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation    # was extracted but never stored
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII escaping so the text is written as readable UTF-8
    line = json.dumps(items, ensure_ascii=False)

    with open('tencent.json', 'w', encoding='utf-8') as output:
        output.write(line)


if __name__ == "__main__":
    tencent()
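The last step of the script, dumping with `ensure_ascii=False` and writing UTF-8, can be tried in isolation. The sketch below is written for Python 3; the record and the file name `tencent_demo.json` are made up for the illustration, and the file goes into a temporary directory.

```python
import json
import os
import tempfile

# Made-up record standing in for one scraped job posting.
items = [{"name": "后台开发工程师", "workLocation": "深圳"}]

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(items)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(items, ensure_ascii=False)

# Write the readable form as UTF-8, then read it back to confirm it round-trips.
path = os.path.join(tempfile.mkdtemp(), "tencent_demo.json")
with open(path, "w", encoding="utf-8") as output:
    output.write(readable)

with open(path, encoding="utf-8") as source:
    restored = json.load(source)

print(escaped)
print(readable)
print(restored == items)
```

Both forms decode to the same data; `ensure_ascii=False` only affects how readable the file is to a person, which is why the encoding has to be stated explicitly when the file is opened.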

 

 

 

posted @ 2018-08-11 19:37  Nice1949