Python crawler practice 2: Douban Books

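Douban blocks crawlers, so a simple workaround for its anti-crawler check is needed before any data can be extracted. Its robots.txt is here: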

https://book.douban.com/robots.txt
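A robots.txt file can also be checked from code rather than read by eye. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_RULES`, the `my-crawler` agent name, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration and are not the real contents of Douban's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real contents of Douban's robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == '__main__':
    # 'my-crawler' and the example.com URLs are placeholders.
    print(is_allowed(SAMPLE_RULES, 'my-crawler', 'https://example.com/private/page'))
    print(is_allowed(SAMPLE_RULES, 'my-crawler', 'https://example.com/public/page'))
```

To check a live site instead of a sample string, `RobotFileParser` also offers `set_url()` followed by `read()`, which download the file before `can_fetch()` is called.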

 

Site: 豆瓣读书 / Douban Books (douban.com)

Press F12 to open the browser's developer tools, find the User-Agent among the request headers, and add it to the crawler.
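To see what adding the header actually changes, a request can be prepared without being sent and its headers inspected. This is only a sketch: `build_request` is a hypothetical helper name, and nothing here contacts Douban. Without a custom header, requests identifies itself as `python-requests/<version>`, which is easy for a site to recognise as a script.

```python
import requests

# The browser User-Agent string used in this post (copied from the developer tools).
UA = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}


def build_request(url, headers=None):
    """Prepare a GET request without sending it, so its headers can be inspected."""
    session = requests.Session()
    return session.prepare_request(requests.Request('GET', url, headers=headers))


if __name__ == '__main__':
    url = 'https://book.douban.com/series/39838?page=1'
    # Library default: starts with "python-requests/"
    print(build_request(url).headers['User-Agent'])
    # With the custom header: the browser string above
    print(build_request(url, UA).headers['User-Agent'])
```

Passing `headers=ua` to `requests.get`, as the crawler in this post does, attaches the same header to the request that is actually sent.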

 

 

 

import requests
import re

for i in range(1, 5):  # pages 1 to 4 of the series
    url = 'https://book.douban.com/series/39838?page=' + str(i)
    # Browser User-Agent copied from the developer tools (F12)
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}

    print('Page ' + str(i) + ':', url)
    r = requests.get(url, headers=ua, timeout=30)

    # Lazily capture everything between "title=" and "onclick"; re.S lets "." match newlines
    pat = 'title=(.*?)onclick'
    til = re.compile(pat, re.S).findall(r.text)
    print(til)
    print('---------------')

 

 

 

The output looks rather messy, so let's clean it up:

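import requests
import re

for i in range(1, 5):  # pages 1 to 4 of the series
    url = 'https://book.douban.com/series/39838?page=' + str(i)
    # Browser User-Agent copied from the developer tools (F12)
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}

    print('Page ' + str(i) + ':', url)
    r = requests.get(url, headers=ua, timeout=30)

    # Lazily capture everything between "title=" and "onclick"; re.S lets "." match newlines
    pat = 'title=(.*?)onclick'
    til = re.compile(pat, re.S).findall(r.text)
    # print(til)

    # Print each match on its own line instead of dumping the whole list
    for item in til:
        print(item)
        print('--------------')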

 

 

 

That's much better.
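Because the pattern `title=(.*?)onclick` captures everything between `title=` and `onclick`, each match keeps the attribute's quotes and any trailing whitespace. A tighter pattern can capture only the text inside the quotes. The sketch below runs on a made-up HTML snippet. `SAMPLE_HTML` and `extract_titles` are assumptions for illustration, and the real Douban markup may be structured differently.

```python
import re

# Hypothetical markup for illustration; the real Douban page may be structured differently.
SAMPLE_HTML = '''
<a href="https://example.com/book/1" title="First Book"
   onclick="track(1)">First Book</a>
<a href="https://example.com/book/2" title="Second Book"
   onclick="track(2)">Second Book</a>
'''


def extract_titles(html):
    """Return title attribute values that are directly followed by an onclick attribute."""
    # Capture only the text inside the quotes, then allow whitespace before onclick.
    pattern = re.compile(r'title="([^"]*)"\s*onclick')
    return [title.strip() for title in pattern.findall(html)]


if __name__ == '__main__':
    for title in extract_titles(SAMPLE_HTML):
        print(title)
```

For anything beyond a quick exercise, an HTML parser is usually more robust than a regular expression, since small changes in attribute order would break the pattern.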

 

posted @ 2021-08-18 16:27  徐海建