[Crawler] The Ultimate Weapons: phantomJS + selenium

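Legend has it in the martial-arts world that whoever holds the Heavenly Sword and the Dragon Saber rules them all. Web crawling has two ultimate weapons of its own. Used together they are very hard to stop: static site or dynamic site, they handle both.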

phantomJS

http://phantomjs.org/
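A headless browser. What is a headless browser? Think of it as a browser without a user interface: the program can read the pages it loads, but a person cannot see them (there is no window to look at).
Download and install: http://phantomjs.org/download.html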

selenium

http://selenium-python.readthedocs.io/getting-started.html
It can drive a browser directly (open the browser, visit a page, read the page's contents, and so on).
Install command:
pip install selenium

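Combining phantomJS with selenium is like, in Dota, giving Earthshaker a Blink Dagger, Spectre a Radiance, or Bristleback an Octarine Core. When a site defeats everything else you try, bring out these two.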
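As a rough sketch of how the two fit together: selenium creates a driver backed by the phantomJS executable, tells it to load a page, and then reads back what the browser rendered. The helper name `fetch_page_info` is invented here for illustration, and the executable path is only an assumption about where phantomJS is installed. Note that `webdriver.PhantomJS` is available only in Selenium 3.x and earlier.

```python
def fetch_page_info(driver, url):
    """Load url in an already-created WebDriver and return basic page info."""
    driver.get(url)  # the browser fetches the page and runs its JavaScript
    return {
        "title": driver.title,        # text of the <title> element
        "url": driver.current_url,    # final URL after any redirects
        "html": driver.page_source,   # HTML as rendered by the browser
    }


def open_phantomjs(executable_path="/usr/local/bin/phantomjs"):
    """Create a PhantomJS-backed driver (assumes Selenium 3.x is installed)."""
    from selenium import webdriver  # imported lazily so the helper above stays importable
    return webdriver.PhantomJS(executable_path=executable_path)


# Hypothetical usage:
#   driver = open_phantomjs()
#   try:
#       info = fetch_page_info(driver, "http://phantomjs.org/")
#       print(info["title"])
#   finally:
#       driver.quit()  # always shut the headless browser down
```

Keeping the driver creation separate from the page-reading helper means the same helper should work with any other browser driver selenium supports.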

http://www.tianyancha.com/search/%E7%99%BE%E5%BA%A6%20%E6%9D%8E%E5%BD%A6%E5%AE%8F?checkFrom=searchBox
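Tianyancha has gone to extraordinary lengths to block crawlers and even hires dedicated anti-crawler engineers. That is going pretty far.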
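The percent-encoded part of the URL above is simply the search keyword "百度 李彦宏" encoded as UTF-8 and URL-quoted. A minimal sketch using only the standard library (the helper name `build_search_url` is made up for this illustration):

```python
from urllib.parse import quote, unquote

BASE = "http://www.tianyancha.com/search/"


def build_search_url(keyword):
    """Percent-encode keyword (as UTF-8) and place it in the search URL."""
    return BASE + quote(keyword) + "?checkFrom=searchBox"


url = build_search_url("百度 李彦宏")
print(url)

# Decoding the path segment recovers the original keyword
print(unquote("%E7%99%BE%E5%BA%A6%20%E6%9D%8E%E5%BD%A6%E5%AE%8F"))
```

The space in the keyword becomes `%20`, and each Chinese character becomes three percent-encoded bytes.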

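import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver

# Path to the phantomJS executable; on Windows this should point to the .exe file.
# Note: webdriver.PhantomJS is available only in Selenium 3.x and earlier.
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')


def search(keyword):
    # Percent-encode the keyword so it can be placed in the URL path
    url_keyword = urllib.parse.quote(keyword)
    url = "http://www.tianyancha.com/search/" + url_keyword + "?checkFrom=searchBox"
    print(url)
    # Let phantomJS load the page and run its JavaScript
    driver.get(url)
    # Parse the rendered HTML (the html5lib parser must be installed)
    bsObj = BeautifulSoup(driver.page_source, "html5lib")
    print(bsObj)
    # Company names sit in <span> elements carrying this Angular binding attribute
    company_list = bsObj.find_all("span", attrs={"ng-bind-html": "node.name | trustHtml"})
    for company in company_list:
        print(company.get_text())


if __name__ == '__main__':
    search("阿里巴巴马云")  # keyword: "Alibaba Jack Ma"
    driver.quit()  # shut down the phantomJS process when done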
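One thing the script leaves implicit: on a JavaScript-rendered page, `page_source` read immediately after `get()` may not contain the search results yet. Selenium ships its own `WebDriverWait` for this; the sketch below is a hand-rolled, standard-library stand-in that polls an arbitrary condition, not selenium's implementation. The function name `wait_for` and the marker string in the usage comment are assumptions for illustration.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Hypothetical usage with the driver from the script above
# (the marker string is a guess at what shows the results have rendered):
#   driver.get(url)
#   wait_for(lambda: "node.name | trustHtml" in driver.page_source)
#   bsObj = BeautifulSoup(driver.page_source, "html5lib")
```

Polling with a deadline keeps the crawler from parsing a half-rendered page while still failing loudly if the content never appears.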

posted @ 2018-05-20 23:15 快乐多巴胺
