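Scraping the Top 500 Five-Star Books on Dangdang
Open the ranking page at http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-1 and the book information is easy to locate in the page source.
The hard part is writing a regular expression that pulls each field out of it.
Here is the code: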
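import json
import re

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}


def write_item_to_file(item):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese text readable.
    # The with block closes the file, so no explicit f.close() is needed.
    with open('book.txt', 'a', encoding='UTF-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')


def request_html(url):
    # Return the page text, or None if the request fails or the status is not 200.
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:  # a normal response returns 200
            return response.text
    except requests.RequestException:
        pass
    return None


def parse_html(html):
    # Raw strings keep \d and \s intact; re.S lets "." match newlines,
    # so one pattern can span a whole <li> block.
    # The price prefix accepts either the &yen; entity or a literal ¥,
    # since the source may contain either form.
    pattern = re.compile(
        r'<li>.*?list_num.*?(\d+)\.</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">'
        r'.*?class="star">.*?class="tuijian">(.*?)</span>'
        r'.*?class="publisher_info">.*?target="_blank">(.*?)</a>'
        r'.*?class="biaosheng">.*?<span>(.*?)</span></div>'
        r'.*?<p><span\sclass="price_n">(?:&yen;|¥)(.*?)</span>.*?</li>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'id': item[0],
            'image': item[1],
            'title': item[2],
            'recommend': item[3],
            'author': item[4],
            'times': item[5],
            'price': item[6]
        }


def get_titles(page):
    url = 'http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-' + str(page)
    html = request_html(url)
    if html is None:  # skip pages that could not be fetched
        print('Failed to fetch page ' + str(page))
        return
    items = parse_html(html)  # extract only the fields we want
    for item in items:
        write_item_to_file(item)
        print(item['id'] + '.' + item['title'])


if __name__ == "__main__":
    # Pages 1 to 25; at 20 books per page (assumed) this covers the Top 500.
    for i in range(1, 26):
        get_titles(i)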
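To see how the lazy groups and re.S work together, here is a minimal sketch on a made-up HTML fragment. SAMPLE_HTML, SAMPLE_PATTERN and extract are hypothetical names used only for illustration, and the fragment is a simplified stand-in, not the real Dangdang markup.

```python
import re

# Hypothetical two-item fragment; the real page has many more tags per <li>.
SAMPLE_HTML = """
<li>
  <div class="list_num red">1.</div>
  <div class="name"><a title="Book A">Book A</a></div>
</li>
<li>
  <div class="list_num">2.</div>
  <div class="name"><a title="Book B">Book B</a></div>
</li>
"""

# re.S makes "." match newlines, so ".*?" can skip across lines.
# The "?" keeps each ".*" lazy, so a match stops at the nearest </li>
# instead of swallowing every item up to the last one.
SAMPLE_PATTERN = re.compile(
    r'<li>.*?list_num.*?(\d+)\.</div>.*?title="(.*?)">.*?</li>',
    re.S,
)


def extract(html):
    """Return a list of (rank, title) tuples found in html."""
    return SAMPLE_PATTERN.findall(html)


if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

Dropping re.S from this sketch makes the pattern match nothing, because each item spans several lines. A pattern like this is tightly coupled to the page layout, so it will probably need adjusting whenever the site changes its markup.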
Summary:
1. How to use yield
2. json.dumps() converts a Python data structure into a JSON string
3. Writing regular expressions
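The first two points can be sketched in isolation as follows. The helper names parse_rows and to_json_line and the sample rows are assumptions made up for this example; they are not part of the scraper itself.

```python
import json


def parse_rows(rows):
    """Yield one dict per (rank, title) row instead of building a full list."""
    for rank, title in rows:
        yield {'id': rank, 'title': title}


def to_json_line(item):
    """Serialize a dict to one JSON line, keeping non-ASCII characters readable."""
    return json.dumps(item, ensure_ascii=False) + '\n'


if __name__ == "__main__":
    rows = [('1', '活着'), ('2', 'Book B')]  # made-up sample rows
    items = parse_rows(rows)  # nothing has run yet; items is a generator
    for item in items:        # each loop step resumes the generator once
        print(to_json_line(item), end='')
    # With the default ensure_ascii=True, Chinese characters are
    # written as \uXXXX escapes instead of readable text.
    print(json.dumps({'title': '活着'}))
```

Because the function uses yield, each record can be written to the file as soon as it is parsed, without holding the whole page's results in a list first.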