Crawler: requests + BeautifulSoup to download Romance of the Three Kingdoms from a classical poetry site

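2019/1/25, early morning. From the Luffy Xuecheng (路飞学城) crawler course: using the requests and BeautifulSoup libraries to download the classic novel Romance of the Three Kingdoms (《三国演义》) from a Chinese classical poetry website.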

BeautifulSoup is a commonly used HTML parsing library. It parses page data so you can pull out just the data you want. The basic methods are listed below:

Attributes and methods:

soup = BeautifulSoup(html_text, 'lxml')  # html_text is the HTML text of the response object

1. Look up by tag name:

　　- soup.a looks up the a tag in the object; it returns only the first match (or None if there is no a tag)
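
A minimal sketch of tag-name lookup. The HTML snippet and its hrefs are made up for illustration, and `html.parser` is used instead of `lxml` only so the example runs without an extra install:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the real site
html = """
<div class="nav">
  <a href="/book/sanguoyanyi.html" title="first">Romance of the Three Kingdoms</a>
  <a href="/book/shuihuzhuan.html" title="second">Water Margin</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a        # only the first <a> in document order
missing = soup.table       # no <table> in the document, so this is None

print(first_link.name)     # tag name of the match
print(first_link["href"])  # href of the first link
print(missing)
```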

2. Get attributes:

　　- soup.a.attrs returns all attributes of the first a tag found, as a dict

　　- soup.a.attrs['href'] returns the value of the href attribute of that a tag

　　- soup.a['href'] does the same thing (shorthand, commonly used)
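
A short sketch of the three attribute accessors, again on made-up markup. One detail worth knowing: `class` is a multi-valued attribute, so BeautifulSoup returns it as a list, and `tag.get(...)` is a way to avoid a `KeyError` when an attribute may be absent:

```python
from bs4 import BeautifulSoup

# Hypothetical tag for illustration
soup = BeautifulSoup(
    '<a href="/book/1.html" class="title main" id="x">text</a>', "html.parser"
)
tag = soup.a

all_attrs = tag.attrs          # dict of every attribute on the tag
href_long = tag.attrs["href"]  # explicit form
href_short = tag["href"]       # shorthand, same value
no_title = tag.get("title")    # None instead of KeyError for a missing attribute

print(all_attrs)
print(href_long, href_short, no_title)
```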

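3. Get content:

　　- soup.a.string returns the text of the a tag only when the tag has a single child (a string, or one nested tag that itself has a .string); if the tag has several children it returns None

　　- soup.a.text returns all the text under the a tag: its own text plus the text of every nested tag, concatenated

　　- soup.a.get_text() returns the same thing; .text is a property that calls get_text() with default arguments, while get_text() also accepts separator and strip parameters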
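
The difference between the three is easiest to see side by side. A small sketch on made-up tags:

```python
from bs4 import BeautifulSoup

simple = BeautifulSoup("<a>Chapter 1</a>", "html.parser").a
nested = BeautifulSoup("<a>Chapter <b>1</b> <i>intro</i></a>", "html.parser").a

simple_string = simple.string   # single string child, so the text comes back
nested_string = nested.string   # several children, so this is None
nested_text = nested.text       # all text under the tag, concatenated
# get_text() can strip each piece and join the pieces with a separator
nested_joined = nested.get_text(separator="|", strip=True)

print(repr(simple_string), repr(nested_string))
print(repr(nested_text), repr(nested_joined))
```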

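4. find: find the first tag that matches

　　- soup.find('a') finds and returns the first a tag

　　- soup.find('a', title='xxx') also filters on an attribute value; as above, it returns only the first match

　　Filters can target any attribute, including class (written class_ because class is a Python keyword) and id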
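
A sketch of `find` with the different kinds of filter, on made-up markup. Note that `find` returns `None` rather than raising when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = (
    '<div>'
    '<a id="home" class="nav" title="Home" href="/">Home</a>'
    '<a class="nav book" title="Books" href="/book/">Books</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                     # first <a> in the document
by_title = soup.find("a", title="Books")   # filter on an attribute
by_class = soup.find("a", class_="book")   # matches any one of the tag's classes
by_id = soup.find(id="home")               # tag name is optional
nothing = soup.find("a", title="nope")     # no match gives None

print(first["href"], by_title["href"], by_class["href"], by_id["href"], nothing)
```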

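5. findAll: find all tags that match (the A must be uppercase; soup.findall raises an error because it is treated as a lookup for a tag named findall. findAll is the legacy spelling, and find_all is the current name for the same method)

　　- soup.findAll('a') finds all a tags in the object and returns them as a list

　　- soup.findAll(['a', 'b']) finds all a and b tags; note that the names go in a list, because soup.findAll('a', 'b') means a tags whose class is b

　　- soup.findAll('a', title='xxx') finds all a tags in the object whose title is xxx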
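
A sketch contrasting the call forms, using `find_all` (the current name for `findAll`) on made-up markup. It shows in particular that a second positional string is matched against the `class` attribute, not treated as a second tag name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = "<p><a class='b' title='x'>1</a><a title='y'>2</a><b>3</b></p>"
soup = BeautifulSoup(html, "html.parser")

all_a = soup.find_all("a")               # every <a>
a_and_b = soup.find_all(["a", "b"])      # every <a> and <b>, in document order
a_class_b = soup.find_all("a", "b")      # <a> tags whose class is "b"
a_title_x = soup.find_all("a", title="x")

print(len(all_a), len(a_and_b), len(a_class_b), len(a_title_x))
```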

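6. Select content with a CSS selector

　　- soup.select() suits people who know CSS selectors well; I don't, but I still find it very handy. It always returns a list

　　- soup.select('.xxx li') selects every li tag nested anywhere under an element whose class is xxx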
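
A sketch of `select` on markup shaped like a chapter listing. The class name `book-mulu` is the one the script later in this post selects on; the hrefs and the second list are made up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a chapter listing
html = """
<div class="book-mulu"><ul>
<li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
<li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
<ul><li><a href="/other">Other</a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.select(".book-mulu li")        # <li> tags anywhere under .book-mulu
links = soup.select(".book-mulu a")         # <a> tags anywhere under .book-mulu
first_link = soup.select_one(".book-mulu a")  # first match only, or None
direct = soup.select("ul > li > a")         # ">" restricts to direct children

print(len(items), len(links), first_link["href"], len(direct))
```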

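7. There are also child, descendant, parent, ancestor, and sibling nodes

　　- soup.a.contents returns a list of the a tag's direct children; children of children are not included

　　- soup.a.children yields the same direct children, but as an iterator instead of a list

　　- soup.a.descendants iterates over every node under the a tag recursively: its children, their children, and so on

　　- soup.a.parent returns the parent node of the a tag

　　- soup.a.parents iterates over all ancestors of the a tag, up to the top of the document

　　- soup.a.next_siblings iterates over all nodes that come after the a tag at the same level (same parent); soup.a.next_sibling returns only the next one

　　- soup.a.previous_siblings iterates over all nodes that come before the a tag at the same level; soup.a.previous_sibling returns only the previous one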
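
The navigation attributes are easier to follow on a concrete tree. A sketch on made-up markup, written without whitespace between tags because whitespace between tags also counts as a (text) sibling node:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags on purpose
html = "<ul><li>one</li><li><a>two</a><span>x</span></li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul
second = ul.contents[1]                    # the middle <li>

child_count = len(ul.contents)             # direct children only
children_same = list(ul.children) == ul.contents
descendant_count = len(list(ul.descendants))  # tags and strings, recursively
parent_name = second.parent.name
ancestor_names = [p.name for p in second.parents]
after_first = [s.get_text() for s in ul.li.next_siblings]
before_second = [s.get_text() for s in second.previous_siblings]
next_one = second.next_sibling.get_text()  # singular form: just one node

print(child_count, descendant_count, ancestor_names, after_first, before_second)
```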

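Goal: fetch the classic books from the poetry site and write them to files

A title (that is, a chapter name) and its content do not live under the same URL, so first parse out the titles and the URL inside each title, then send a second request to each URL to get the corresponding content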
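
The first half of that flow, turning a listing page into (title, absolute URL) pairs, can be sketched offline. This is only an illustration: `extract_links` is a hypothetical helper, the sample HTML is made up, and `urljoin` is swapped in for plain string concatenation because it also copes with hrefs that are already absolute:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.shicimingju.com"


def extract_links(html, selector, base_url=BASE_URL):
    """Return (text, absolute_url) for every matching <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.select(selector)
        if a.get("href")
    ]


# Made-up listing page; a real run would pass response.text instead
sample = (
    '<div class="book-mulu"><ul>'
    '<li><a href="/book/sanguoyanyi/1.html"> Chapter 1 </a></li>'
    '<li><a>no link</a></li>'
    '</ul></div>'
)
chapters = extract_links(sample, ".book-mulu a")
print(chapters)
```

Each URL in the result is then requested in turn to get the chapter text.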

# -*- coding: utf-8 -*-
# ------ wei、yu ------

# Goal: crawl all the historical classics on shicimingju.com

import requests
from bs4 import BeautifulSoup
import os
import time
import random

# Create the output directory (no error if it already exists)
os.makedirs('History Books', exist_ok=True)


def _book_list(url):
    book_list_response = requests.get(url=url, headers=headers)
    if book_list_response.status_code == 200:
        book_list_text = book_list_response.text

        # Use BeautifulSoup to parse out the book titles and the URL for each title
        soup = BeautifulSoup(book_list_text, 'lxml')
        bookInfo = soup.select('.bookmark-list a')
        book_list_response.close()
        return bookInfo
    else:
        print('Request failed, please try again')
        exit()


def _title_list(url):
    title_list_response = requests.get(url=url, headers=headers)
    if title_list_response.status_code == 200:
        title_list_text = title_list_response.text

        # Use BeautifulSoup to parse out the chapter list and the content URL for each chapter
        soup = BeautifulSoup(title_list_text, 'lxml')
        titleList = soup.select('.book-mulu a')
        title_list_response.close()
        return titleList
    else:
        print('Request failed, please try again')
        exit()


def _content(url):
    content_response = requests.get(url=url, headers=headers)
    if content_response.status_code == 200:
        content_text = content_response.text

        # Use BeautifulSoup to parse out the text of the chapter
        soup = BeautifulSoup(content_text, 'lxml')
        content1 = soup.find('div', class_='chapter_content')
        content_response.close()
        # find() returns None when the div is missing; return an empty chapter instead of crashing
        return content1.get_text() if content1 is not None else ''
    else:
        print('Request failed, please try again')
        exit()


# Set the URL and the request headers
home_page_url = 'http://www.shicimingju.com/book/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/65.0.3325.181 Safari/537.36'
}
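while 1:
    try:
        # First call _book_list on the home page URL to get the list of books
        booklists = _book_list(home_page_url)

        # Get each book title and its full URL
        for info in booklists:
            bookName = info.string  # book title
            bookUrl = 'http://www.shicimingju.com' + info['href']  # full URL for the book

            # Open one output file per book; "with" closes it even if a request fails midway
            fileInfo = "History Books/" + str(bookName) + '.txt'
            with open(fileInfo, 'w', encoding='utf-8') as fp:
                for i in range(random.randint(0, 5)):
                    print('\033[1;34m Requesting data.....\033[m')
                    time.sleep(0.2)
                print('\033[1;31m Request succeeded, starting download.....\033[m')
                time.sleep(0.2)

                # Next call _title_list on the book URL to get the chapter titles
                titlelists = _title_list(bookUrl)

                # Get each chapter name and its full URL
                for title in titlelists:
                    titleName = title.string  # chapter name
                    titleUrl = 'http://www.shicimingju.com' + title['href']  # full URL for the chapter

                    # Next call _content on the chapter URL to get the chapter text
                    content = _content(titleUrl)
                    try:
                        fp.write(str(titleName) + "：" + content + '\n\n')
                        print('\033[34m %s \033[m \033[31m %s \033[m written..' % (bookName, titleName))
                        time.sleep(0.1)
                    except OSError:
                        print('\n \033[1;31m Write failed, skipping.. \033[m \n')
                        print('\033[1;31m Trying to continue.. \033[m')
                        time.sleep(2)
        print('Download complete, exiting')
        break
    except requests.RequestException as err:
        # Catch only requests errors here: a bare "except:" would also catch the
        # SystemExit raised by exit(), so the loop would restart even after a clean finish.
        # I don't know why a 10053 error shows up here. A web search suggested it is
        # some kind of network problem, and I never solved it. With this loop, any error
        # rewrites all the files from scratch, which is a pain. It worked fine when I
        # tested at the office and only started happening at home.
        print('\n Error 10053! \n ', err)
        continue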
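
One way to soften the "any error restarts everything" problem is to retry the single call that failed instead of the whole loop. The sketch below is an assumption about how that could look, not part of the original script: `retry_call` is a hypothetical helper, and a simulated flaky function stands in for a real request so the example runs offline.

```python
import time


def retry_call(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `attempts` tries have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
    raise last_error  # every attempt failed: re-raise the last error


# Simulated flaky call: fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = retry_call(flaky, attempts=5, exceptions=(ConnectionError,))
print(result, calls["count"])
```

In the script above, a call such as `_content(titleUrl)` could be wrapped as `retry_call(lambda: _content(titleUrl), delay=2, exceptions=(requests.RequestException,))`, so that one dropped connection costs one chapter request rather than a full restart.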

posted @ 2019-01-25 13:22 微雨丶