夜雨JC

  博客园  :: 首页  :: 新随笔  :: 联系 :: 订阅 订阅  :: 管理

 

1、目标地址 http://quotes.toscrape.com

 

 

 

将页面的文章内容和作者爬下来,并保存到json文件里面。

 

下面代码:

用到的工具:scrapy ,xpath选择器,json,codecs编码

爬虫代码:

 

class ScrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']

    start_urls = [
        'http://quotes.toscrape.com'
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item["text"] = quote.xpath('./span[@class="text"]/text()').get()
            item['author'] = quote.xpath('./span/small[@class="author"]/text()').get()
            yield item

        next_page = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
        if next_page and len(next_page) > 0:
            yield response.follow(next_page, self.parse)

 

在items.py 中添加数据

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    pass

 

定义pipelines: 保存到quotes.json文件中

import json
import codecs

class
QuotePipeline(object): def __init__(self): self.file = codecs.open('quotes.json', 'wb', encoding='utf-8') pass def process_item(self, item, spider): lines = json.dumps(dict(item), ensure_ascii=False) + "\n" self.file.write(lines) return item def close_spider(self, spider): self.file.close() pass

 

之后执行

  

scrapy crawl toscrape

 

爬下来的数据:

 

 

 

 

posted on 2019-12-09 15:43  夜雨JC  阅读(350)  评论(0编辑  收藏  举报