Scraping Sina News with selenium + BeautifulSoup + phantomjs

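1. Download phantomjs and add the directory that contains phantomjs.exe to the PATH environment variable. Alternatively, copy phantomjs.exe into a directory that is already on PATH. I use Anaconda, so I copied phantomjs.exe into the Anaconda3 folder (Anaconda3 is already on PATH).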
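A quick way to check that step 1 worked is to ask Python where it finds the executable. This is a minimal sketch that uses only the standard library; the helper name `find_phantomjs` is my own and is not part of selenium or phantomjs.

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the executable found on PATH, or None."""
    # shutil.which searches the PATH directories and, on Windows, also tries
    # the PATHEXT extensions such as .exe.
    return shutil.which(name)


if __name__ == "__main__":
    location = find_phantomjs()
    if location is None:
        print("phantomjs is not on PATH")
    else:
        print("phantomjs found at", location)
```

If the function returns None, either fix PATH or pass the full path of phantomjs.exe to the driver explicitly.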

2. Install selenium with pip, using the command pip install selenium. Anaconda already ships with BeautifulSoup, so nothing else needs to be installed for it.

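3. Scrape the data. The goal is to scrape the text of every news article under the 公司新闻 (Company News) column of Sina News. The column page shows the articles as a list (see the screenshot). We first collect the link of each article, then open each link and extract the text.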

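Because the list is loaded by JavaScript, BeautifulSoup alone cannot find the links in the raw HTML, so selenium + phantomjs is used here. The approach is: simulate a click on the 公司新闻 (Company News) button to enter that column, collect the links of all the articles on the page, then simulate a click on 下一页 (next page) and repeat on each following page.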
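The paging idea described above can be sketched independently of any browser. This is only an illustration under my own assumptions, not the implementation used in this post: `crawl_pages`, `extract_links` and `go_next` are hypothetical names, the loop is iterative rather than recursive, and it reads the current page before moving on. In a real run, `extract_links` would parse `driver.page_source` and `go_next` would click 下一页.

```python
def crawl_pages(extract_links, go_next, max_pages=5):
    """Collect unique links page by page.

    extract_links() returns the links on the current page.
    go_next() moves to the next page and returns False when it cannot.
    Both are hypothetical callables supplied by the caller.
    """
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        for link in extract_links():
            if link not in seen:  # de-duplicate across all pages
                seen.add(link)
                ordered.append(link)
        # Stop at the page limit, or when there is no further page.
        if page == max_pages or not go_next():
            break
    return ordered
```

Keeping the set of seen links outside the per-page loop removes duplicates across pages, not just within one page.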

Below is a rough implementation:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import time

def get_links(driver):
    '''
    Scrape the article links page by page and write them to the txt file.
    '''
    t1 = time.time()
    try:
        driver.find_element(By.LINK_TEXT, "下一页").click()  # click "next page" after each page is scraped
    except NoSuchElementException:
        time.sleep(1)
        # Sometimes there is no "下一页" (next page) link; try "下5页" (next 5 pages) instead.
        driver.find_element(By.LINK_TEXT, "下5页").click()
    time.sleep(1)
    # I don't know how to get the href values directly from selenium, so the
    # webdriver's page_source is passed to BeautifulSoup, which parses the page.
    bs = BeautifulSoup(driver.page_source, 'html.parser')
    links = []
    # Use a regular expression to find all the links we need.
    for i in bs.findAll('a', href=re.compile(r"http://finance\.sina\.com\.cn/chanjing/gsnews/.")):
        link = i.get('href')
        if link not in links:  # drop duplicate links
            links.append(link)
            f.write(link + '\n')  # f is the output file opened at module level
    t2 = time.time()
    page_num = bs.find('span', {'class': 'pagebox_num_nonce'}).text  # current page number
    page_num = int(page_num)
    if page_num > 4:
        return
    print('Finished page %d in %d seconds' % (page_num, t2 - t1))
    get_links(driver)
    
def get_text(links, path):
    '''
    Extract the article text. The first argument is the list of links;
    the second is the open file that the text is saved to.
    '''
    n = 0
    for link in links:
        html = urlopen(link)
        bsObj = BeautifulSoup(html, 'html.parser')
        temp = ''
        try:
            for p in bsObj.find("div", {'id': re.compile('artibody')}).findAll('p'):
                temp = temp + p.text.strip()  # join all the paragraphs together
            print(temp[:31])
            path.write(temp + '\n')
            n += 1
            print('Finished article %d' % n)
            print('\n')
        except (AttributeError, UnicodeEncodeError):  # crude: skip any article that fails
            continue
            
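if True:  # The links are saved to a file, so the job runs in two steps: first scrape the links, then scrape the text.
    f = open(r'E:\hei.txt', 'w')
    # If phantomjs.exe is not on PATH, its full path can be passed to PhantomJS() instead.
    # webdriver.PhantomJS was removed in Selenium 4, so this needs Selenium 3.x.
    driver = webdriver.PhantomJS()
    driver.get("http://finance.sina.com.cn/chanjing/")
    driver.find_element(By.LINK_TEXT, "公司新闻").click()
    time.sleep(2)
    get_links(driver)
    f.close()
    driver.quit()  # quit() also shuts down the phantomjs process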
    
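if True:  # second step: scrape the article text
    with open(r'E:\hei.txt') as f:  # the link file written in the first step
        links = [link.strip() for link in f]
    with open(r'E:\heiii.txt', 'w', encoding='utf-8') as xl:
        get_text(links, xl)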
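The code above hands the page source to BeautifulSoup in order to read the href values. One possible way to read them from selenium itself is `get_attribute("href")` on each element. The sketch below has not been run against the Sina page, and `collect_hrefs` is a hypothetical helper name; the locator string "css selector" is the value of selenium's `By.CSS_SELECTOR`, written out so the function has no imports.

```python
def collect_hrefs(driver, prefix):
    """Return the unique href values of <a> elements that start with prefix."""
    links = []
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    for a in driver.find_elements("css selector", "a[href]"):
        href = a.get_attribute("href")
        if href and href.startswith(prefix) and href not in links:
            links.append(href)
    return links
```

With a live driver the call might look like `collect_hrefs(driver, "http://finance.sina.com.cn/chanjing/gsnews/")`, which would replace the regular expression filter with a plain prefix check.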

posted @ 2016-01-20 14:04 木羊羊羊
