Python Web Scraping Exercise 2: Douban Books
Douban blocks plain crawler requests, so a simple workaround is needed before any data can be extracted. The site's robots rules are here:
https://book.douban.com/robots.txt
Press F12 to open the browser's developer tools, find the User-Agent string in the request headers, and add it to the scraping program.
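Before writing the full scraper, it can help to confirm that the header actually makes a difference. The sketch below is my own check, not part of the original post, and the exact status codes Douban returns may vary; it simply compares a bare request with one that sends a browser User-Agent:

import requests

url = 'https://book.douban.com/series/39838?page=1'
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}

# Request without any headers: Douban usually refuses this with a non-200 status.
bare = requests.get(url, timeout=30)
print('without User-Agent:', bare.status_code)

# The same request sent with a browser User-Agent normally succeeds (200).
with_ua = requests.get(url, headers=ua, timeout=30)
print('with User-Agent:   ', with_ua.status_code)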
import requests
import re

for i in range(1, 5):
    url = 'https://book.douban.com/series/39838?page=' + str(i)
    # Pretend to be a normal browser so Douban does not reject the request.
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}
    print('Page ' + str(i) + ':', url)
    r = requests.get(url, headers=ua, timeout=30)
    # Capture everything between title= and onclick in each book link.
    pat = 'title=(.*?)onclick'
    til = re.compile(pat, re.S).findall(r.text)
    print(til)
    print('---------------')
The output looks rather messy, so let's clean it up a bit:
import requests
import re

for i in range(1, 5):
    url = 'https://book.douban.com/series/39838?page=' + str(i)
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}
    print('Page ' + str(i) + ':', url)
    r = requests.get(url, headers=ua, timeout=30)
    pat = 'title=(.*?)onclick'
    til = re.compile(pat, re.S).findall(r.text)
    # print(til)
    # Print one title per line instead of dumping the whole list at once.
    for item in til:
        print(str(item))
    print('--------------')
That looks much better.
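One optional extra step, not in the original post: the captured strings still carry the double quotes and spaces surrounding the title="..." attribute value, so stripping them gives cleaner output. A sketch built on the same script:

import requests
import re

ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}

for i in range(1, 5):
    url = 'https://book.douban.com/series/39838?page=' + str(i)
    r = requests.get(url, headers=ua, timeout=30)
    til = re.compile('title=(.*?)onclick', re.S).findall(r.text)
    for item in til:
        # Remove the leading/trailing spaces and double quotes captured by the pattern.
        print(item.strip(' "'))
    print('--------------')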