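Scrapy: persistent storage from the terminal
Notes
- Based on a terminal command:
- Requirement: only the return value of the parse method can be stored, and it is written to a local file: scrapy crawl dou -o ./douban.csv
- Note: the output file type must be one of 'json', 'jsonlines', 'jsonl', 'jl', 'csv', 'xml', 'marshal', 'pickle'
- Advantage: concise and efficient
- Disadvantage: fairly limited (data can only be written to a file with one of the suffixes listed above)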
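The suffix restriction in the note above can be sketched as a small check. This is only an illustration: `infer_feed_format` and `SUPPORTED_FORMATS` are hypothetical names, the set is copied from these notes rather than read from Scrapy, and Scrapy's own lookup may work differently.

```python
from pathlib import Path

# Formats copied from the note above (an assumption: not queried from Scrapy itself).
SUPPORTED_FORMATS = {"json", "jsonlines", "jsonl", "jl", "csv", "xml", "marshal", "pickle"}


def infer_feed_format(output_path: str) -> str:
    """Hypothetical helper: return the export format implied by the file suffix."""
    ext = Path(output_path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {ext!r}")
    return ext


if __name__ == "__main__":
    # The path used in the command from the notes.
    print(infer_feed_format("./douban.csv"))
```

With a path such as douban.txt the helper raises ValueError, which mirrors the limitation listed under "Disadvantage".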
Code
import scrapy
from douban.items import DoubanItem  # not needed for the terminal-based export


class DouSpider(scrapy.Spider):
    name = "dou"
    # allowed_domains = ["www.douban.com"]
    start_urls = ["https://www.douban.com/doulist/113652271/"]

    def parse(self, response):
        all_data = []  # holds every parsed record
        div_ = response.xpath('/html/body/div[3]/div[1]/div/div[1]')
        div_list = div_.xpath('./div[contains(@class, "doulist-item")]')
        for div in div_list:
            title = div.xpath('./div/div[2]/div[2]/a/text()')[0].extract()
            content = div.xpath('./div/div[2]/div[4]/text()').extract_first()
            dic = {
                'title': title,
                'content': content
            }
            all_data.append(dic)
        # The returned list is what `scrapy crawl dou -o ./douban.csv` writes to the file
        return all_data
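To picture what ends up in douban.csv, here is a rough standard-library approximation of turning a list of dicts like all_data into CSV text. It does not use Scrapy's exporter, `rows_to_csv` is a hypothetical helper, and the sample record is made up for illustration, so details such as column order or quoting may differ from the real output.

```python
import csv
import io


def rows_to_csv(items):
    """Serialize a list of dicts to CSV text, using the first item's keys as the header."""
    if not items:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up record standing in for what parse() returns.
    sample = [{"title": "Example title", "content": "Example content"}]
    print(rows_to_csv(sample))
```

Each dict becomes one row and the keys become the header row, which is why parse needs to return a flat structure of this kind for the CSV format.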