Python Web Scraping: Scraping a Novel

No preamble; let's get straight to it.

The site I'm scraping today is Qidian (起点中文网), and the content is a single novel.

First, import the libraries:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Then open the URL and assign the response to a variable:

html=urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")  # URL of the novel's first chapter
bsObj=BeautifulSoup(html,"html.parser")                                                   # create the BeautifulSoup object, naming the parser explicitly

First, try scraping the novel text on this page:

firstChapter=bsObj.find("div",{"class":"read-content"})                                 # find is a method of the BeautifulSoup object; the attribute filter is a dict
print(firstChapter.get_text())

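The find method can also be combined with regular expressions, which is mostly useful when scraping resources such as images and videos.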
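As a small illustration of the regular-expression case, here is a sketch that runs on an inline HTML snippet rather than the real site. The markup and the image paths are made up for the example; the only point is that a compiled pattern can be passed as an attribute filter.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the real page.
html = """
<div>
  <img src="/img/cover.jpg">
  <img src="/img/logo.png">
  <img src="/img/banner.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <img> tags whose src attribute ends in .jpg.
jpg_images = soup.find_all("img", {"src": re.compile(r"\.jpg$")})
jpg_urls = [img["src"] for img in jpg_images]

print(jpg_urls)
```

The same filter works with find if only the first match is needed.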

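All of the content for this scrape sits in a single box (a div) whose class attribute is read-content, so find is enough. If the text on a page were spread across several boxes, findAll would be the right choice; it returns a list-like collection, which you iterate over in a loop to print each item.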
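To make the difference concrete, here is a minimal sketch on a made-up HTML snippet with two read-content boxes. It uses find_all, the current name for findAll in Beautiful Soup 4; the markup itself is an assumption for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the text split across two boxes.
html = """
<div class="read-content"><p>First part.</p></div>
<div class="sidebar">Not novel text.</div>
<div class="read-content"><p>Second part.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag (or None if nothing matches).
first = soup.find("div", {"class": "read-content"})
first_text = first.get_text()

# find_all returns every matching tag, so we loop over the result.
boxes = soup.find_all("div", {"class": "read-content"})
all_texts = [box.get_text() for box in boxes]

for text in all_texts:
    print(text)
```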

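Putting the code together and running it shows that the chapter text can be scraped. But that covers only one chapter of the novel, so how do we scrape the chapters that follow?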

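From the steps above, all we need is the URL of the next chapter. First, wrap the part that prints the text in a function, so that each time we get a new address we can print the corresponding text: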

def writeNovel(html):
    bsObj=BeautifulSoup(html,"html.parser")
    chapter=bsObj.find("div",{"class":"read-content"})   # the box that holds the chapter text
    print(chapter.get_text())

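The remaining question is how to get the next chapter's URL. Inspecting the page structure shows that the "next chapter" button is really an a tag with the id j_chapterNext, so the next chapter's URL can be read from that tag.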
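An href read from a tag is not always a complete URL. As a hedged sketch, the standard library's urljoin can resolve it against the address of the current page. The href values below are hypothetical placeholders, not values copied from the site.

```python
from urllib.parse import urljoin

# The page we are currently on (the first chapter).
current_url = "http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html"

# Hypothetical protocol-relative href (starts with //, no scheme).
protocol_relative = "//read.qidian.com/chapter/exampleBook/exampleChapter"
next_url = urljoin(current_url, protocol_relative)

# Hypothetical relative href (just a file name).
relative = "next.html"
sibling_url = urljoin(current_url, relative)

print(next_url)
print(sibling_url)
```

urljoin borrows the missing scheme or directory from the current page's URL, so the same call handles both forms.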

Repackaging the function and tidying up gives:

from urllib.request import urlopen
from bs4 import BeautifulSoup
def writeNovel(html):
    bsObj=BeautifulSoup(html,"html.parser")
    chapter=bsObj.find("div",{"class":"read-content"})
    print(chapter.get_text())
    bsoup=bsObj.find("a",{"id":"j_chapterNext"})     # the "next chapter" link
    html2="http:"+bsoup.get('href')+".html"          # build the next chapter's URL from the link's href
    return urlopen(html2)                            # hand back the opened next page for the next call

html=urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")

i=1
while(i<10):
    html=writeNovel(html)
    i=i+1
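The loop above assumes every page has a next-chapter link. As a rough sketch of a more defensive version, the function below stops when the content box or the link is missing. To keep it runnable without network access, it reads from a made-up in-memory "site"; PAGES, crawl, and the fetch parameter are all hypothetical names for this example.

```python
from bs4 import BeautifulSoup

# Made-up pages standing in for the real site: address -> HTML.
PAGES = {
    "/c1": '<div class="read-content">Chapter one</div>'
           '<a id="j_chapterNext" href="/c2">next</a>',
    "/c2": '<div class="read-content">Chapter two</div>',  # last chapter: no link
}

def crawl(start, fetch, max_chapters=10):
    """Follow next-chapter links from start, collecting each chapter's text."""
    texts = []
    url = start
    for _ in range(max_chapters):
        soup = BeautifulSoup(fetch(url), "html.parser")
        content = soup.find("div", {"class": "read-content"})
        if content is None:
            break  # page layout is not what we expected
        texts.append(content.get_text())
        next_link = soup.find("a", {"id": "j_chapterNext"})
        if next_link is None or not next_link.get("href"):
            break  # no further chapter to follow
        url = next_link.get("href")
    return texts

chapters = crawl("/c1", PAGES.__getitem__)
print(chapters)
```

Against the real site, fetch would be a function that downloads the page, for example one built on urlopen.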

Write the text to a text file:

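from urllib.request import urlopen
from bs4 import BeautifulSoup
def writeNovel(html):
    bsObj=BeautifulSoup(html,"html.parser")
    chapter=bsObj.find("div",{"class":"read-content"})
    print(chapter.get_text())
    fo=open("novel.text","a",encoding="utf-8")       # append mode, so each chapter is added to the end; utf-8 for the Chinese text
    fo.write(chapter.get_text())
    fo.close()                                       # close() must be called with parentheses, otherwise the file is not closed
    bsoup=bsObj.find("a",{"id":"j_chapterNext"})     # the "next chapter" link
    html2="http:"+bsoup.get('href')+".html"
    return urlopen(html2)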

html=urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")  

i=1
while(i<8):
    html=writeNovel(html)
    i=i+1
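An alternative way to handle the file, shown here only as a sketch: a with block closes the file automatically, even if writing fails. The helper name append_chapter and the temporary output directory are assumptions for this example; the chapter strings are placeholders.

```python
import os
import tempfile

def append_chapter(path, text):
    """Append one chapter's text to the file, followed by a blank line."""
    # "a" appends to the end; utf-8 keeps non-ASCII text intact on any platform.
    with open(path, "a", encoding="utf-8") as fo:
        fo.write(text)
        fo.write("\n\n")

# Write into a temporary directory so the example leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "novel.text")
append_chapter(out_path, "Chapter one text")
append_chapter(out_path, "Chapter two text")

with open(out_path, encoding="utf-8") as f:
    saved = f.read()

print(saved)
```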

posted @ 2017-07-10 20:49 kkdf
