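Study progress 10: Python web crawler
The first case study in learning web crawlers is a novel crawler.
The novel crawler starts by parsing the source of the novel's index page, where the link to each chapter's content can be found.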
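As a rough illustration of that parsing step, the sketch below pulls chapter links out of a small hand-written HTML string instead of a live page. The sample markup, the function name `extract_chapter_links`, and the chapter paths are assumptions made for this example; the real page layout may differ. It also uses `urllib.parse.urljoin` to build absolute URLs, which is a substitute for plain string concatenation.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample markup, shaped like the chapter list the crawler expects.
SAMPLE_HTML = (
    '<dl><dt><i class="icon"></i>正文</dt>'
    '<dd><a href="/69509/1.html">Chapter 1</a></dd>'
    '<dd><a href="/69509/2.html">Chapter 2</a></dd>'
    '</dl>'
)


def extract_chapter_links(html, base_url):
    """Return (absolute_url, title) pairs found in the chapter list of html."""
    # re.S lets '.' match newlines in case the list spans several lines.
    blocks = re.findall(
        r'<dl><dt><i class="icon"></i>正文</dt>(.*?)</dl>', html, re.S
    )
    if not blocks:
        return []
    pairs = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', blocks[0])
    # urljoin resolves each href against the index page URL.
    return [(urljoin(base_url, href), title) for href, title in pairs]


links = extract_chapter_links(SAMPLE_HTML, 'http://www.92kshu.cc/69509/')
for chapter_url, chapter_title in links:
    print(chapter_title, chapter_url)
```

Returning an empty list when the chapter block is missing avoids the `IndexError` that indexing `[0]` on an empty `findall` result would raise.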
Crawler code:
import re

import requests

url = 'http://www.92kshu.cc/69509/'
response = requests.get(url)
response.encoding = 'gbk'
html = response.text

# The book title comes from the og:novel:book_name meta tag
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]

# Extract the chapter list block, then each chapter's link and title
dl = re.findall(r'<dl><dt><i class="icon"></i>正文</dt>(.*?)</dl>', html)[0]
print(dl)
chapter_info_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', dl)
print(chapter_info_list)

# Write every chapter into one UTF-8 text file; the with block closes it at the end
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    for chapter_url, chapter_title in chapter_info_list:
        chapter_url = "http://www.92kshu.cc%s" % chapter_url
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'gbk'
        chapter_html = chapter_response.text
        chapter_content = re.findall(r'<div class="chapter">(.*?)><br>', chapter_html)[0]
        # Strip the paragraph tags from the chapter text
        chapter_content = chapter_content.replace('<p>', '').replace('</p>', '')
        fb.write(chapter_title)
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url)
Crawler output: