Web Scraping with Python: Using PhantomJS to Collect Taobao/Tmall Product Content
1. Introduction
I have recently been studying the Scrapy crawler framework and trying to use it to write a small program that collects information from web pages. I ran into quite a few small problems along the way, and feedback is very welcome.
This article explains how to combine Scrapy with PhantomJS to collect Tmall product content. It defines a custom entry in DOWNLOADER_MIDDLEWARES that collects content from dynamic pages which require JavaScript to render. After reading a lot of material on DOWNLOADER_MIDDLEWARES, the summary is: this approach is simple to use, but it blocks the framework, so performance is poor. Some sources mention that a custom DOWNLOADER_HANDLER or scrapyjs can avoid blocking the framework; interested readers can look into those on their own, as they are not covered here.
2. Implementation
2.1 Environment requirements
Prepare the Python development and runtime environment with the following steps:
- Python: download from the official site, install, and set up the environment variables (this article uses Python 3.5.1)
- lxml: download the .whl file matching your Python version from the official library page, then run "pip install <path to .whl file>" on the command line
- Scrapy: run "pip install Scrapy" on the command line (see 《Scrapy的第一次运行测试》 for details)
- selenium: run "pip install selenium" on the command line
- PhantomJS: download from the official site
The steps above show two kinds of installation: 1) installing a wheel package already downloaded locally; 2) downloading and installing remotely with the Python package manager. Note: package versions must match your Python version.
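Before moving on, a minimal sanity check (assuming everything above installed cleanly) is to import each package and print its version; PhantomJS itself can be verified with `phantomjs --version` on the command line:

```python
# all three imports must succeed if the environment is ready
import lxml.etree
import scrapy
import selenium

print(lxml.etree.LXML_VERSION)   # lxml version as a tuple
print(scrapy.__version__)
print(selenium.__version__)
```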
2.2 Development and testing
First, find the page to collect. This article uses a Tmall product page, https://world.tmall.com/item/526449276263.htm, as a simple example.
Then start writing code. Unless noted otherwise, the commands below are run from the command line.
1) Create the Scrapy project tmSpider
```
E:\python-3.5.1>scrapy startproject tmSpider
```
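startproject generates the standard Scrapy skeleton. The layout below reflects the project template of the Scrapy versions current at the time of writing; newer releases also add a middlewares.py:

```
tmSpider/
    scrapy.cfg           # deploy configuration
    tmSpider/            # the project's Python package
        __init__.py
        items.py
        pipelines.py
        settings.py      # edited in the next step
        spiders/
            __init__.py
```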
2) Edit settings.py
- set ROBOTSTXT_OBEY to False;
- disable Scrapy's default user-agent downloader middleware;
- register the custom DOWNLOADER_MIDDLEWARES.
The configuration looks like this:
```python
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    # custom middleware that renders pages through PhantomJS
    'tmSpider.middlewares.middleware.CustomMiddlewares': 543,
    # disable the stock user-agent middleware; on Scrapy >= 1.0 its path is
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
```
3) Create a middlewares folder under the project package (an empty __init__.py keeps it importable as a regular package), then create middleware.py inside it with the following code:
```python
# -*- coding: utf-8 -*-
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse

import tmSpider.middlewares.downloader as downloader

class CustomMiddlewares(object):
    def process_request(self, request, spider):
        # fetch the page through PhantomJS and return a ready-made response,
        # bypassing Scrapy's built-in downloader
        url = str(request.url)
        dl = downloader.CustomDownloader()
        content = dl.VisitPersonPage(url)
        return HtmlResponse(url, status=200, body=content)

    def process_response(self, request, response, spider):
        # drop suspiciously short responses;
        # IgnoreRequest is an exception, so it must be raised, not returned
        if len(response.body) == 100:
            raise IgnoreRequest("body length == 100")
        else:
            return response
```
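Note that process_request as written constructs a new CustomDownloader, and therefore launches a fresh PhantomJS process, for every single request, which compounds the blocking problem mentioned in the introduction. A minimal variation (a sketch, reusing the CustomDownloader from the next step) keeps one instance alive for the whole crawl:

```python
# -*- coding: utf-8 -*-
from scrapy.http import HtmlResponse
import tmSpider.middlewares.downloader as downloader

class CustomMiddlewares(object):
    def __init__(self):
        # one shared PhantomJS instance instead of one per request
        self.dl = downloader.CustomDownloader()

    def process_request(self, request, spider):
        url = str(request.url)
        content = self.dl.VisitPersonPage(url)
        return HtmlResponse(url, status=200, body=content)
```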
4) Write a page content downloader with selenium and PhantomJS. In the same middlewares folder created in the previous step, create downloader.py with the following code:
```python
# -*- coding: utf-8 -*-
import time
from selenium import webdriver
import selenium.webdriver.support.ui as ui

class CustomDownloader(object):
    def __init__(self):
        # use any browser you wish
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        cap["phantomjs.page.settings.loadImages"] = True
        cap["phantomjs.page.settings.disk-cache"] = True
        cap["phantomjs.page.customHeaders.Cookie"] = 'SINAGLOBAL=3955422793326.2764.1451802953297; '
        # adjust executable_path to wherever PhantomJS is installed
        self.driver = webdriver.PhantomJS(executable_path='F:/phantomjs/bin/phantomjs.exe', desired_capabilities=cap)
        self.wait = ui.WebDriverWait(self.driver, 10)

    def VisitPersonPage(self, url):
        print('Loading page .....')
        self.driver.get(url)
        time.sleep(1)
        # scroll to the bottom so the lazily loaded product details render
        js = "var q=document.documentElement.scrollTop=10000"
        self.driver.execute_script(js)
        time.sleep(5)
        # encode to GBK bytes (suits a Chinese Windows console)
        content = self.driver.page_source.encode('gbk', 'ignore')
        print('Page loaded .....')
        return content

    def __del__(self):
        self.driver.quit()
```
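The downloader can be smoke-tested on its own before wiring it into Scrapy. Run this from the project root so the tmSpider package is importable (the URL is the sample product page from above):

```python
# standalone test of the PhantomJS-based downloader
import tmSpider.middlewares.downloader as downloader

dl = downloader.CustomDownloader()
content = dl.VisitPersonPage('https://world.tmall.com/item/526449276263.htm')
print(len(content))   # size in bytes of the rendered HTML
```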
5) Create the spider module
In the project directory E:\python-3.5.1\tmSpider, run:
```
E:\python-3.5.1\tmSpider>scrapy genspider tmall 'tmall.com'
```
After it runs, a tmall.py file is generated automatically under E:\python-3.5.1\tmSpider\tmSpider\spiders. The parse function in that file processes the page content returned by the Scrapy downloader. The page information can be collected in either of two ways:
- use xpath or regular expressions to pull the required fields out of response.body (see the sketch after the code below), or
- use a content extractor obtained through the GooSeeker API, which converts all fields in one step with no hand-written xpath (for how to obtain a content extractor, see "python使用xslt提取网页数据"); the code is as follows:
```python
# -*- coding: utf-8 -*-
import time
import scrapy
import tmSpider.gooseeker.gsextractor as gsextractor

class TmallSpider(scrapy.Spider):
    name = "tmall"
    allowed_domains = ["tmall.com"]
    start_urls = (
        'https://world.tmall.com/item/526449276263.htm',
    )

    # current Unix timestamp, truncated to whole seconds
    def getTime(self):
        current_time = str(time.time())
        m = current_time.find('.')
        current_time = current_time[0:m]
        return current_time

    def parse(self, response):
        html = response.body
        print("----------------------------------------------------------------------------")
        extra = gsextractor.GsExtractor()
        extra.setXsltFromAPI("0a3898683f265e7b28991e0615228baa", "淘宝天猫_商品详情30474", "tmall", "list")
        result = extra.extract(html)
        print(str(result).encode('gbk', 'ignore').decode('gbk'))
        #file_name = 'F:/temp/淘宝天猫_商品详情30474_' + self.getTime() + '.xml'
        #open(file_name, "wb").write(result)
```
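For the first of the two options above, a parse built on Scrapy's own selectors would look roughly like the sketch below. The XPath expressions are placeholders for illustration only; Tmall's real markup has to be inspected in a browser first:

```python
def parse(self, response):
    # hypothetical XPath expressions -- replace with the page's actual structure
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//span[@class="price"]/text()').extract_first()
    yield {'title': title, 'price': price}
```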
6) Start the spider
In the project directory E:\python-3.5.1\tmSpider, run:

```
E:\python-3.5.1\tmSpider>scrapy crawl tmall
```
The extracted results are printed to the console.
Note that the command above starts only one spider at a time. What if you want to start several at once? That requires a custom launcher module: create runcrawl.py under spiders with the following code:
```python
# -*- coding: utf-8 -*-
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from tmall import TmallSpider
...
# CrawlerRunner.crawl expects the spider class (plus constructor kwargs),
# not an already-built spider instance
runner = CrawlerRunner()
runner.crawl(TmallSpider, domain='tmall.com')
...
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
```
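To actually launch several spiders at once (the point of this launcher), the pattern from the Scrapy documentation queues one crawl per spider class and stops the reactor once all of them finish. OtherSpider below is a hypothetical second spider:

```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

from tmall import TmallSpider
# from othersite import OtherSpider   # hypothetical second spider

runner = CrawlerRunner()
runner.crawl(TmallSpider)
# runner.crawl(OtherSpider)
d = runner.join()                      # fires when every queued crawl is done
d.addBoth(lambda _: reactor.stop())
reactor.run()
```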
Run runcrawl.py to start the crawl; the results are printed to the console as before.
3. Outlook
After implementing the crawler by calling PhantomJS from a custom DOWNLOADER_MIDDLEWARES, I was stuck for a long time on the problem of blocking the framework and kept looking for a way around it. Next I plan to investigate scrapyjs, Splash, and other ways of driving a browser to see whether they can solve this problem effectively.
4. Related documents
5. GooSeeker open-source code
1. GooSeeker open-source Python web crawler on GitHub
6. Revision history
1. 2016-07-04: V1.0
Original article (Chinese): https://segmentfault.com/a/1190000005866893