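Scraping the Douban Movie Top 250
Goal
Practice web scraping on the Douban chart and learn how to extract information from static pages.
Douban Movie Top 250: https://movie.douban.com/top250
Code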
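import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Return an empty string so the caller parses an empty page
        # rather than an error message.
        return ''


if __name__ == '__main__':
    i = 0  # running rank across all pages
    # The list is paginated 25 films per page: start=0, 25, ..., 225.
    urls = ['https://movie.douban.com/top250?start=' + str(n) + '&filter='
            for n in range(0, 250, 25)]
    for url in urls:
        r = getHTMLText(url)
        soup = BeautifulSoup(r, 'html.parser')
        titles = soup.select('div.hd a')            # title links
        rates = soup.select('span.rating_num')      # rating scores
        pics = soup.select('img[width="100"]')      # poster thumbnails
        for title, rate, pic in zip(titles, rates, pics):
            data = {'title': list(title.stripped_strings),
                    'rate': rate.get_text(),
                    'pic': pic.get('src')}
            i += 1
            # File name: rank, first title, rating ('分' means "points").
            fileName = str(i) + '_' + data['title'][0] + ' ' + data['rate'] + '分.jpg'
            pic1 = requests.get(data['pic'], timeout=30)
            # The G:\test\ folder must already exist.
            with open('G:\\test\\' + fileName, 'wb') as photo:
                photo.write(pic1.content)
            print(data)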
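The script above builds its page URLs and poster file names by string concatenation and writes into a fixed folder. As a minimal sketch of the same two steps using only the standard library, the helpers below generate the ten page URLs and produce file names that are safe on Windows. The names `page_urls`, `safe_filename`, and `poster_path` are hypothetical and not part of the original script, and the plain `.jpg` suffix is an assumption made to keep the example ASCII-only.

```python
import re
from pathlib import Path
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return one URL per result page (start=0, 25, ..., 225 by default)."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"
        for start in range(0, total, page_size)
    ]


def safe_filename(rank, title, rate, suffix=".jpg"):
    """Build '<rank>_<title> <rate><suffix>', replacing characters
    that Windows does not allow in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{rank}_{cleaned} {rate}{suffix}"


def poster_path(out_dir, rank, title, rate):
    """Create out_dir if it does not exist and return the poster's full path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / safe_filename(rank, title, rate)
```

With these helpers, the main loop could iterate over `page_urls()` and open `poster_path('G:\\test', i, data['title'][0], data['rate'])` for writing, so a missing output folder or a title containing a character such as `:` would no longer cause `open` to fail.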
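Results
Author: 九命猫幺
Blog source: http://www.cnblogs.com/yongestcat/
Reposting is welcome; please credit the source.
If you found this post useful and it helped your learning, please click the Recommend button at the lower right.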