Scraping Dynamic Web Pages with PhantomJS + Selenium

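I used to drive Firefox with Selenium to scrape dynamic pages, but Firefox updates frequently, and after an update the webdriver often fails to start. To get around this I switched to PhantomJS + Selenium.
Using PhantomJS is not very different from using a regular browser.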

1. First, download PhantomJS

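PhantomJS supports all the major platforms; see its download page. Pick a directory to unpack it into, for example D:\phantomjs.
The phantomjs executable is in the bin directory, so you can add D:\phantomjs\bin to the PATH environment variable. If you do not add it to PATH, Selenium has to be given the path to the executable when it starts PhantomJS.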
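Which of the two setups applies can be checked from Python before the driver is started. The following is a minimal sketch, assuming the example install directory D:\phantomjs\bin mentioned above; the helper name `find_phantomjs` is hypothetical and not part of Selenium.

```python
import os
import shutil


def find_phantomjs(fallback_dir=r"D:\phantomjs\bin"):
    """Return a path to the phantomjs executable, or None if it is not found.

    fallback_dir is the example install directory from this post;
    change it to wherever PhantomJS was actually unpacked.
    """
    # shutil.which searches the directories listed in PATH.
    on_path = shutil.which("phantomjs")
    if on_path:
        return on_path
    # Not on PATH: look in the explicit directory instead.
    for name in ("phantomjs.exe", "phantomjs"):
        candidate = os.path.join(fallback_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If the function returns a path, it can be passed as `executable_path` when creating the driver; if it returns None, PhantomJS still needs to be installed.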

2. Drive PhantomJS from Selenium

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

## PhantomJS can be configured through capabilities
#cap = webdriver.DesiredCapabilities.PHANTOMJS    # webdriver's default capabilities for PhantomJS
#cap["phantomjs.page.settings.resourceTimeout"] = 5000    # resource load timeout, in milliseconds
#cap["phantomjs.page.settings.loadImages"] = False    # whether to load images
#driver = webdriver.PhantomJS(desired_capabilities=cap)

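# If phantomjs is not on PATH, pass the path to the executable explicitly
# (raw string, so that "\b" is not read as a backspace escape)
#driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs\bin\phantomjs.exe")
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)    # page load timeout, in seconds
#driver.set_script_timeout(5)    # timeout for page JavaScript, in seconds; exceeding either limit raises TimeoutException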

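## Stop loading the page once the timeout is hit.
## Some pages keep loading other resources even after the data you want has arrived.
try:
    driver.get("http://www.tvmao.com")    # driver.get needs a full URL, including the scheme
except TimeoutException:
    driver.execute_script("window.stop()")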

## Once you have the page source, you can save it and then parse the data
page_source = driver.page_source    # page_source is a property, not a method

############
#
# Data parsing goes here
#
############
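The script leaves the parsing step open. As one possible illustration, here is a minimal sketch that pulls link targets out of a page source string using only the standard library's `html.parser`. The names `LinkCollector` and `extract_links` and the sample HTML are made up for this example; a real scraper would most likely target site-specific elements instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_source):
    """Return all link targets found in page_source, in document order."""
    collector = LinkCollector()
    collector.feed(page_source)
    return collector.links


# Hypothetical sample standing in for driver.page_source.
sample = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'
print(extract_links(sample))  # prints ['/a', '/b']
```

In the script, `extract_links(page_source)` would be called on the string obtained from the driver.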

For the options PhantomJS accepts, see the official documentation.
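Page settings are handed to the driver as capability keys carrying the `phantomjs.page.settings.` prefix, as in the commented-out lines of the script. Below is a minimal sketch of building such a dictionary. The helper name `build_phantomjs_caps` is hypothetical, and only the two settings used in this post are shown.

```python
def build_phantomjs_caps(base_caps, page_settings):
    """Return a copy of base_caps with PhantomJS page settings added.

    Each key in page_settings gets the "phantomjs.page.settings." prefix,
    matching the capability names used in the script.
    """
    caps = dict(base_caps)  # copy, so the shared defaults are not mutated
    for name, value in page_settings.items():
        caps["phantomjs.page.settings." + name] = value
    return caps


# base_caps would normally be webdriver.DesiredCapabilities.PHANTOMJS;
# a plain dict stands in for it here so the sketch runs without Selenium.
caps = build_phantomjs_caps(
    {"browserName": "phantomjs"},
    {"resourceTimeout": 5000, "loadImages": False},
)
```

The result can then be passed as `desired_capabilities=caps` to `webdriver.PhantomJS`, assuming a Selenium release that still ships the PhantomJS driver. Working on a copy keeps the class-level `DesiredCapabilities.PHANTOMJS` dictionary unchanged for any other driver created later.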

posted @ 2016-10-18 00:37 Bencakes
