Scrapy crawler mini example 1

1.items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyspiderItem(scrapy.Item):
    brief = scrapy.Field()
    quote = scrapy.Field()

2.myspider.py

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MyspiderItem


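# Define a spider class
class MyspiderSpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Domains the spider is allowed to crawl (requests to other domains are filtered out).
    # This must be a bare domain name, not a full URL.
    allowed_domains = ['market.douban.com']
    # URL the crawl starts from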
    start_urls = ['https://market.douban.com/book/?utm_campaign=book_freyr_section&utm_source=douban&utm_medium=pc_web&page=1&page_num=18&']

    def parse(self, response):
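        # Use Scrapy's built-in XPath support to select the root node of every book
        book_list = response.xpath('//div[@class="book-brief"]')
        # bookItem = []
        # Iterate over the matched root nodes; xpath() always returns a list
        for each in book_list:
            item = MyspiderItem()
            # extract() converts the matched results into Unicode strings;
            # without extract() the result is a list of XPath selector objects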
            brief = each.xpath('./h3/text()').extract()
            quote = each.xpath('./div[@class="book-quote"]/p/text()').extract()

            item['brief'] = brief[0]
            item['quote'] = quote[0]

            # Hand each item to the pipeline as soon as it is produced
            yield item  # newly added line
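
A note on allowed_domains: Scrapy expects bare domain names there, not full URLs, and its offsite filtering warns about and ignores entries that look like URLs. The sketch below shows one way to derive the domain from the start URL with the standard library; domain_of is a hypothetical helper name, not part of Scrapy or of the original project.

```python
from urllib.parse import urlparse

# Start URL copied from the spider above
start_url = ('https://market.douban.com/book/?utm_campaign=book_freyr_section'
             '&utm_source=douban&utm_medium=pc_web&page=1&page_num=18&')


def domain_of(url):
    """Return only the host part of a URL (hypothetical helper)."""
    return urlparse(url).netloc


allowed_domains = [domain_of(start_url)]
print(allowed_domains)  # ['market.douban.com']
```

In a spider this small it is simpler to type the domain by hand; the helper is only meant to show which part of the URL belongs in the list.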

3.settings.py

# xxxxxxxxxxxxx

ITEM_PIPELINES = {
   'mySpider.pipelines.MyspiderPipeline': 300,
}

# xxxxxxxxxxxx

4.pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class MyspiderPipeline(object):
    # Optional: the class initializer
    def __init__(self):
        # Create the output file, opened in binary mode
        self.filename = open('myspider.json', 'wb+')

    # Required: called to process every item
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False, indent=2)
        self.filename.write(jsontext.encode('utf-8'))
        return item

    # Optional: called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
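
As a possible alternative to the binary-mode file above, here is a minimal sketch of a pipeline that opens the file in text mode with an explicit encoding and writes one JSON object per line, so the output can be read back line by line. The class name JsonLinesPipeline, the path argument, and the sample item are assumptions made for illustration; they are not part of the original project. The usage at the bottom calls the methods by hand with a plain dict, so it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline(object):  # hypothetical name, for illustration only
    def __init__(self, path='myspider.jl'):
        # Text mode with an explicit encoding: write() takes str, not bytes
        self.file = open(path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One compact JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


# Usage outside Scrapy, with made-up sample data
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jl')
pipeline = JsonLinesPipeline(demo_path)
pipeline.process_item({'brief': 'sample title', 'quote': 'sample quote'}, spider=None)
pipeline.process_item({'brief': '书名', 'quote': '引文'}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding='utf-8') as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```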

Problems encountered:

self.filename = open('myspider.json', 'wb+')  (note the 'wb+' mode)
At first I wrote only 'w', and Python raised an error:

TypeError: write() argument must be str, not bytes
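
The error can be reproduced without Scrapy. The sketch below writes the same kind of encoded JSON to a temporary file, first in text mode and then in binary mode; the sample dictionary and the file name are made up for the demonstration.

```python
import json
import os
import tempfile

# encode('utf-8') turns the JSON text into bytes, as in process_item
payload = json.dumps({'brief': 'sample title'}, ensure_ascii=False).encode('utf-8')
path = os.path.join(tempfile.mkdtemp(), 'demo.json')

# Text mode ('w') accepts only str, so writing bytes fails
error = None
try:
    with open(path, 'w') as f:
        f.write(payload)
except TypeError as exc:
    error = str(exc)
print(error)  # the TypeError message quoted above

# Binary mode ('wb+') accepts the encoded bytes
with open(path, 'wb+') as f:
    f.write(payload)

with open(path, 'rb') as f:
    written = f.read()
```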

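After searching on Baidu, I found that 'w' opens the file for writing in text mode, while 'wb+' opens it in binary mode. A text-mode file accepts only str, so passing it the bytes returned by encode('utf-8') raises the error above.
When a file is opened in text mode, reading converts every "\r\n" to "\n", and on Windows writing converts every "\n" to "\r\n".
(\r is a carriage return: when printed, it moves the cursor back to the start of the current line, so whatever is output next overwrites what was already there.)
When a file is opened in binary mode, no such conversion takes place on either read or write.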
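
The newline conversion can be observed on any platform with a small sketch. It passes newline='\r\n' to open() to force the translation that Windows applies by default when writing in text mode; the file name is an arbitrary choice for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

# Text mode, forcing the Windows-style translation on write
with open(path, 'w', newline='\r\n') as f:
    f.write('a\nb\n')

# Binary read shows what actually landed on disk
with open(path, 'rb') as f:
    raw = f.read()           # b'a\r\nb\r\n'

# Text read (default newline=None) converts '\r\n' back to '\n'
with open(path, 'r') as f:
    text = f.read()          # 'a\nb\n'

# Binary write and read: bytes pass through unchanged
with open(path, 'wb') as f:
    f.write(b'a\nb\n')
with open(path, 'rb') as f:
    untouched = f.read()     # b'a\nb\n'

print(raw, repr(text), untouched)
```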

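I have never studied text file I/O systematically, so I am writing down this much for now as a reminder to myself, and I will dig deeper the next time it comes up.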

posted @ 2018-12-04 09:34 Miss-Gao
