Crawling a Table of Contents

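I originally wanted to find a site to practice regular expressions on, but somewhere along the way I ended up using bs (BeautifulSoup) instead.
I'll treat it as a refresher, since I haven't written this kind of code in a long time.
The script crawls the table of contents of a novel and saves it to a txt file.
Fetching the chapter text itself would only take a few more lines of code, but I was worried it would run too long, so I only crawled the table of contents.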

""" 
    2019/10/2
    version: 1.0.0
    by Zeronera
    Simple crawl of a novel's table of contents
"""
import requests
from bs4 import BeautifulSoup

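def getHTMLText(url):
    """
        General-purpose boilerplate for fetching a web page
    """
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text
    except requests.RequestException as e:
        # Return an empty page rather than an error message that would be parsed as HTML
        print("request failed:", e)
        return ""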

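def main():
    url = "http://www.nitianxieshen.com/index.html"
    html = getHTMLText(url)
    soup = BeautifulSoup(html, 'lxml')
    # The table of contents lives in <div id="content">
    dl = soup.find('div', {'id': 'content'})
    if dl is None:
        # Fetch failed or the page layout changed; nothing to write
        print("table of contents not found")
        return
    # Each <div class="container"> holds one section: an <h2> heading plus <li> chapter titles
    chapter_list = dl.find_all('div', {'class': 'container'})
    with open("nitianxieshen.txt", 'w', encoding='utf-8') as f:
        for chapter in chapter_list:
            big_title = chapter.find('h2').text
            f.write(big_title + '\n')
            title_list = chapter.find_all('li')
            for i in title_list:
                f.write('   ' + i.text + '\n')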

if __name__ == "__main__":
    main()
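
Since the original plan was to practice regular expressions, here is a rough regex-only sketch of the same idea that also pulls out each chapter's link, which is the first of the "few more lines" needed to fetch chapter text. It is an assumption, not something checked against the real page, that every chapter is an `<li>` wrapping a single `<a href="...">`. The names `CHAPTER_RE`, `TAG_RE` and `extract_chapters` are made up for this illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup: assumes each chapter is an <li> wrapping one <a href="...">.
CHAPTER_RE = re.compile(
    r'<li[^>]*>\s*<a[^>]*\bhref="([^"]+)"[^>]*>(.*?)</a>\s*</li>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any HTML tag, used to strip markup nested inside the link text.
TAG_RE = re.compile(r'<[^>]+>')


def extract_chapters(html, base_url):
    """Return (title, absolute_url) pairs for every chapter link found in html."""
    chapters = []
    for href, raw_title in CHAPTER_RE.findall(html):
        title = TAG_RE.sub('', raw_title).strip()  # drop nested tags such as <span>
        chapters.append((title, urljoin(base_url, href)))  # resolve relative links
    return chapters


if __name__ == "__main__":
    # Made-up sample page, not copied from the real site.
    sample = '''
    <div class="container">
      <h2>Volume 1</h2>
      <ul>
        <li><a href="/1.html">Chapter 1</a></li>
        <li><a href="2.html"><span>Chapter 2</span></a></li>
      </ul>
    </div>
    '''
    for title, link in extract_chapters(sample, "http://www.nitianxieshen.com/index.html"):
        print(title, link)
```

Each returned URL could then be passed to getHTMLText to download that chapter. Regex is fragile against markup changes (single-quoted attributes, for example, would be missed), which is probably why the script above drifted toward BeautifulSoup.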

posted @ 2019-10-02 20:38  Zeronera