Web Crawler | A Hands-On Introduction with Qiushibaike (Source Code Included)
coding by real mind writing by genuine heart
Analysis
Target site: https://www.qiushibaike.com/hot/
Inspecting the site: watch how the URL changes as you page through. Visiting page 1, then page 2, then page 3 gives:

https://www.qiushibaike.com/hot/page/1/
https://www.qiushibaike.com/hot/page/2/
https://www.qiushibaike.com/hot/page/3/

Lining these URLs up, everything stays the same except the trailing number, which is the page number.
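That pattern can be generated directly from the page number; a minimal sketch:

```python
# Only the trailing number in the URL changes, so the listing URLs
# can be built from the page number alone.
base = 'https://www.qiushibaike.com/hot/page/{}/'
urls = [base.format(page) for page in range(1, 4)]
print(urls[0])  # https://www.qiushibaike.com/hot/page/1/
```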
Chrome is the recommended browser: its extension ecosystem is rich, and its built-in DevTools are very friendly to crawler developers.
Analyzing the page source
Right-click the page and choose "Inspect", then look for patterns. One tip: after locating an element in the Elements panel, switch to the Network panel and find the same element in the raw response, because the source shown in Elements may already have been reworked by JavaScript.
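To act on that tip in code, you can simply search the server-sent response text for the class you found in DevTools. A minimal sketch; the helper name and the canned HTML fragment are illustrative, not taken from the site:

```python
def appears_in_raw_html(html: str, needle: str) -> bool:
    """True if the marker exists in the server-sent HTML, i.e. before
    any JavaScript has a chance to rewrite the DOM."""
    return needle in html

# In real use you would pass requests.get(url, headers=...).text here;
# a canned fragment stands in for the live response.
sample = '<a href="/article/1" class="contentHerf"><div class="content"><span>...</span></div></a>'
print(appears_in_raw_html(sample, 'contentHerf'))  # True
```

If this prints False for a class you can see in Elements, the element was injected by JavaScript and requests alone will not see it.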
Looking at the source, we find that every joke is wrapped in an `<a ... class="contentHerf">` element that contains a `<div class="content">`. These elements carry more than one usable attribute; here we hook onto the `contentHerf` class (note that "Herf" is the site's own spelling).
Which tool fits this kind of structure best? BeautifulSoup.
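The nested classes map directly onto a CSS selector that BeautifulSoup can evaluate. A self-contained sketch; the HTML fragment below is invented to mimic the structure, not copied from the site:

```python
from bs4 import BeautifulSoup

# Illustrative fragment mimicking what DevTools shows for each joke.
html = '''
<a href="/article/1" class="contentHerf">
  <div class="content"><span>第一则笑话</span></div>
</a>
<a href="/article/2" class="contentHerf">
  <div class="content"><span>第二则笑话</span></div>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
# .contentHerf .content span = a <span> inside .content inside .contentHerf
jokes = [joke.string for joke in soup.select('.contentHerf .content span')]
print(jokes)
```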
Coding
Based on the analysis above, a first version of the code:
```python
import requests
from bs4 import BeautifulSoup
import time
import re

headers = {
    # The User-Agent masquerades as a real browser; the value describes the engine and version.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Host': 'www.qiushibaike.com',
    # The Cookie holds your session identity and helps get past cookie-based anti-crawling checks.
    'Cookie': '_ga=GA1.2.2026142502.1558849033; gr_user_id=5d0a35ad-3eb6-4037-9b4d-bbc5e22c9b9f; grwng_uid=9bd612b3-7d0b-4a08-a4e1-1707e33f6995; _qqq_uuid_="2|1:0|10:1617119039|10:_qqq_uuid_|56:NjUxYWRiNDFhZTYxMjk4ZGM3MTgwYjkxMGJjNjViY2ZmZGUyNDdjMw==|fdce75d742741575ef41cd8f540465fb97b5d18891a9abb0849b3a09c530f7ee"; _xsrf=2|6d1ed4a0|7de9818067dac3b8a4e624fdd75fc972|1618129183; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1617119039,1617956477,1618129185; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1618129185; ff2672c245bd193c6261e9ab2cd35865_gr_session_id=fd4b35b4-86d1-4e79-96f4-45bcbcbb6524; ff2672c245bd193c6261e9ab2cd35865_gr_session_id_fd4b35b4-86d1-4e79-96f4-45bcbcbb6524=true'
}

for page in range(1, 11):  # page numbers start at 1, as observed in the URLs
    url = f'https://www.qiushibaike.com/hot/page/{page}/'  # an f-string fills in the changing part; str.format works too
    req = requests.get(url, headers=headers)
    html = req.text

    soup = BeautifulSoup(html, 'lxml')
    for joke in soup.select('.contentHerf .content span'):
        if joke.string is not None:
            joke_data = f'笑话一则:{joke.string.strip()}\n\n'  # prefix each joke with "笑话一则:" ("a joke:")
            # Open in append-binary mode so earlier pages are not overwritten.
            with open('../txt_file/joke.txt', 'ab') as f:
                pattern = re.compile('查看全文')  # the "view full text" label on truncated jokes
                jok = re.sub(pattern, '这里被替换了,嘻嘻!', joke_data)  # "replaced here, hehe!"
                f.write(jok.encode('utf-8'))
    time.sleep(1)  # throttle between pages
```
Checking the scraped output
The boxed part of the screenshot above was substituted by the regular expression; the original text there was "查看全文" ("view full text").
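The substitution itself is a plain `re.sub`; a minimal reproduction (the sample string is made up):

```python
import re

# A truncated joke whose trailing "查看全文" ("view full text") label
# we want to replace before saving.
joke_data = '笑话一则:他说查看全文才能看到结局\n\n'
cleaned = re.sub('查看全文', '这里被替换了,嘻嘻!', joke_data)
print(cleaned)
```

Since the pattern is a fixed literal with no `.` in it, the `re.S` flag used in the full script changes nothing here.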
Code optimization
```python
import requests
from bs4 import BeautifulSoup
import re
import time
from requests.exceptions import RequestException


def get_url_html(page):
    """Fetch one listing page; return its HTML, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Host': 'www.qiushibaike.com',
        'Cookie': '_ga=GA1.2.2026142502.1558849033; gr_user_id=5d0a35ad-3eb6-4037-9b4d-bbc5e22c9b9f; grwng_uid=9bd612b3-7d0b-4a08-a4e1-1707e33f6995; _qqq_uuid_="2|1:0|10:1617119039|10:_qqq_uuid_|56:NjUxYWRiNDFhZTYxMjk4ZGM3MTgwYjkxMGJjNjViY2ZmZGUyNDdjMw==|fdce75d742741575ef41cd8f540465fb97b5d18891a9abb0849b3a09c530f7ee"; _xsrf=2|6d1ed4a0|7de9818067dac3b8a4e624fdd75fc972|1618129183; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1617119039,1617956477,1618129185; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1618129185; ff2672c245bd193c6261e9ab2cd35865_gr_session_id=fd4b35b4-86d1-4e79-96f4-45bcbcbb6524; ff2672c245bd193c6261e9ab2cd35865_gr_session_id_fd4b35b4-86d1-4e79-96f4-45bcbcbb6524=true'
    }

    url = f'https://www.qiushibaike.com/hot/page/{page}/'
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            return req.text
        return None
    except RequestException:
        return None


def main():
    for page in range(2, 5):
        html = get_url_html(page)
        if html is None:
            continue  # skip pages that failed to load
        soup = BeautifulSoup(html, 'lxml')
        for joke in soup.select('.contentHerf .content span'):
            if joke.string is not None:
                joke_data = f'笑话一则:{joke.string.strip()}\n\n'
                jok = re.sub('查看全文', '这里被替换了,嘻嘻!', joke_data)
                with open('../txt_file/joke.txt', 'ab') as f:
                    f.write(jok.encode('utf-8'))
        time.sleep(1)  # throttle between pages


if __name__ == '__main__':
    main()
```
Summary
The requests library and its exceptions module
The BeautifulSoup parsing library
The standard-library re module
The time module
Plain-text file storage