Scrapy and Selenium
How do you scrape JavaScript-generated pages with Scrapy?
Using Scrapy with WebKit to scrape JavaScript-generated pages: http://www.cnblogs.com/Safe3/archive/2011/10/19/2217965.html
pip install -U selenium
Selenium IDE
http://docs.seleniumhq.org/projects/ide/
Download the server separately, from: http://selenium-release.storage.googleapis.com/2.40/selenium-server-standalone-2.40.0.jar
java -jar selenium-server-standalone-2.40.0.jar
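Before pointing a spider at the server, it can help to confirm that it is actually accepting connections. The sketch below is only an illustration, assuming Python 3 and the default Selenium Server port 4444; the helper name wait_for_port is hypothetical and not part of Selenium or Scrapy.

```python
import socket
import time


def wait_for_port(host="localhost", port=4444, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (hypothetical usage): block for up to 30 seconds after starting the jar.
# if not wait_for_port():
#     raise RuntimeError("Selenium Server is not reachable on port 4444")
```

This only checks that the port is open; it does not verify that the listener is really a Selenium Server.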
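Now let's go through it step by step:
1. First, change to the directory on your machine that contains the Selenium Server jar and run it with java -jar xxx.jar. The server automatically listens on local port 4444.
2. Following my previous post, "How to Connect to a Server Without a Public IP", set up a remote mapping between local port 4444 and port 4444 on the server.
3. Write the Python program using the Scrapy framework. I won't walk through a complete example here, since there are many online, such as this one: https://gist.github.com/1045108. I'll only cover a few key points:
a) Call selenium from Python like this: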
self.sel = selenium("localhost", 4444, "*firefox", "http://example.com/")
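However, a bare "*firefox" may fail to locate the Firefox executable. In that case you can specify the Firefox path explicitly, for example: "*firefox D:/Program Files/Mozilla Firefox/firefox.exe".
b) Get the HTML after Firefox has finished rendering the page: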
sel = self.sel
sel.open(response.url)
sel.wait_for_page_to_load(10000)
html = sel.get_eval("selenium.browserbot.getCurrentWindow().document.getElementsByTagName('html')[0].innerHTML")
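For point a), the browser string and the other constructor arguments can be assembled in one place so that the Firefox path is easy to override. This is a small sketch, not part of the Selenium client: the helper names rc_browser_string and rc_client_args are hypothetical, and the defaults simply mirror the values used above.

```python
def rc_browser_string(browser="*firefox", executable_path=None):
    """Build the browser start string passed as the third selenium() argument."""
    if executable_path:
        # "*firefox <path>" pins the browser binary, as described in point a).
        return "{0} {1}".format(browser, executable_path)
    return browser


def rc_client_args(base_url, host="localhost", port=4444, executable_path=None):
    """Return the positional arguments for selenium(host, port, browser, url)."""
    browser = rc_browser_string(executable_path=executable_path)
    return (host, port, browser, base_url)


# Example (hypothetical usage inside the spider's __init__):
# self.sel = selenium(*rc_client_args("http://example.com/"))
```

Keeping the path as a parameter avoids hard-coding a Windows-specific location in the spider itself.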
from selenium import selenium
from scrapy.spider import BaseSpider
from scrapy.http import Request
import time
import lxml.html

class SeleniumSprider(BaseSpider):
    name = "selenium"
    allowed_domains = ['selenium.com']
    start_urls = ["http://localhost"]

    def __init__(self, **kwargs):
        print kwargs
        self.sel = selenium("localhost", 4444, "*firefox", "http://selenium.com/")
        self.sel.start()

    def parse(self, response):
        sel = self.sel
        sel.open("/index.aspx")
        sel.click("id=radioButton1")
        sel.select("genderOpt", "value=male")
        sel.type("nameTxt", "irfani")
        sel.click("link=Submit")
        time.sleep(1)  # wait a second for page to load
        root = lxml.html.fromstring(sel.get_html_source())
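The fixed time.sleep(1) in the spider above may be too short on a slow page and wasteful on a fast one. One possible alternative is to poll for a condition with a timeout. The sketch below assumes Python 3; the helper name wait_until and the locator in the usage comment are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep briefly between checks instead of one long fixed wait.
        time.sleep(interval)


# Example (hypothetical locator "id=result"), replacing time.sleep(1) in parse():
# if wait_until(lambda: sel.is_element_present("id=result")):
#     root = lxml.html.fromstring(sel.get_html_source())
```

The condition is checked at least once, so a page that is already ready returns immediately.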
References:
http://networkedblogs.com/F9Eph
https://pypi.python.org/pypi/selenium
http://docs.seleniumhq.org/download/
http://yupengyan.com/scrapy-and-selenium.html