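Learning the Scrapy Framework (5): Defining Data Structures and Multi-Page Crawling

Defining data structures and multi-page crawling

Defining the data structure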

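In items.py, define the content you want to extract (that is, define the data structure). For example, I extract the user name (name), the avatar link (face_src), the number of "funny" votes (funny) and the number of comments (comment), so I create four fields. scrapy.Field is in fact just a subclass of dict: declaring a field registers a key on the item without assigning a value, and the value is filled in later, once the data has been extracted.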

# User name
name = scrapy.Field()
# Avatar link
face_src = scrapy.Field()
# Number of "funny" votes
funny = scrapy.Field()
# Number of comments
comment = scrapy.Field()

Using the defined data structure in the spider file to store data

First import the item class:

from myfirst_scrapyDemo.items import MyfirstScrapydemoItem

It is then used much like a dictionary:

item = MyfirstScrapydemoItem()

item['name'] = name
item['face_src'] = face_src
item['funny'] = funny
item['comment'] = comment
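
To make the dictionary-like behaviour more concrete, here is a minimal plain-Python sketch. It does not import Scrapy: Field, SimpleItem and DemoItem are simplified stand-ins written for illustration only, and the real scrapy.Item is implemented differently. The sketch only assumes the behaviour described above, namely that declared fields become the allowed keys and that assigning to an undeclared key is rejected.

```python
class Field(dict):
    """Stand-in for scrapy.Field: nothing more than a dict for field metadata."""


class SimpleItem:
    """Hypothetical, simplified imitation of an item class (not Scrapy's code)."""
    fields = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Collect every class attribute that is a Field into cls.fields.
        cls.fields = {k: v for k, v in vars(cls).items() if isinstance(v, Field)}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        # Only keys that were declared as fields may be assigned.
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class DemoItem(SimpleItem):
    name = Field()
    funny = Field()


item = DemoItem()
item['name'] = 'alice'      # declared field: accepted

rejected = False
try:
    item['age'] = 3         # undeclared field: rejected
except KeyError:
    rejected = True

print(sorted(DemoItem.fields), item['name'], rejected)
```

The point of declaring fields up front is that a misspelled key fails immediately instead of silently creating a new column in the output.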

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyfirstScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # User name
    name = scrapy.Field()
    # Avatar link
    face_src = scrapy.Field()
    # Number of "funny" votes
    funny = scrapy.Field()
    # Number of comments
    comment = scrapy.Field()

import scrapy
from myfirst_scrapyDemo.items import MyfirstScrapydemoItem


class SpiderQiushiSpider(scrapy.Spider):
    # Spider name
    name = 'spider_QiuShi'
    # Allowed domains; the list may hold several entries and restricts the spider to pages under them
    allowed_domains = ['www.qiushibaike.com']
    # Start URL
    start_urls = ['http://www.qiushibaike.com/text']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # items = []
        for li in div_list:
            # Create a fresh item for every entry so that items yielded earlier are not overwritten
            item = MyfirstScrapydemoItem()
            # Older style: index the selector list, then call extract()
            # name = li.xpath('.//h2/text()')[0].extract().strip('\n').strip(' ')
            # face_src = "https:" + li.xpath('.//div[1]/a[1]/img/@src')[0].extract().split('?')[0]
            # funny = li.xpath('.//div[@class="stats"]/span[1]/i/text()')[0].extract()
            # comment = li.xpath('.//div[@class="stats"]/span[2]/a/i/text()')[0].extract()
            # extract_first() returns None when nothing matches, so guard before calling string methods
            name = li.xpath('.//h2/text()').extract_first(default='').strip()
            src = li.xpath('.//div[1]/a[1]/img/@src').extract_first()
            face_src = ("https:" + src.split('?')[0]) if src else None
            funny = li.xpath('.//div[@class="stats"]/span[1]/i/text()').extract_first()
            comment = li.xpath('.//div[@class="stats"]/span[2]/a/i/text()').extract_first()
            # Alternative: collect plain dicts in a list
            # item = {
            #     "name": name,
            #     "link": face_src
            # }
            # items.append(item)
            item['name'] = name
            item['face_src'] = face_src
            item['funny'] = funny
            item['comment'] = comment
            # print(item)
            yield item
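
One detail is worth spelling out when items are yielded from a loop: if a single item object is created before the loop and mutated on every pass, anything that keeps a reference to the yielded objects ends up seeing only the last values. The plain-Python sketch below uses ordinary dicts in place of Scrapy items (an assumption made for illustration; parse_shared and parse_fresh are hypothetical names) to show the difference.

```python
def parse_shared(rows):
    item = {}                 # one object, reused for every row
    for row in rows:
        item['name'] = row
        yield item


def parse_fresh(rows):
    for row in rows:
        item = {}             # a new object for every row
        item['name'] = row
        yield item


rows = ['a', 'b', 'c']

# list() keeps a reference to every yielded object.
shared_names = [i['name'] for i in list(parse_shared(rows))]
fresh_names = [i['name'] for i in list(parse_fresh(rows))]

print(shared_names)   # ['c', 'c', 'c']
print(fresh_names)    # ['a', 'b', 'c']
```

Whether the shared version causes visible problems in a real crawl depends on when each item is consumed, which is why creating the item inside the loop is the safer habit.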

Multi-page crawling

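Use yield to hand scrapy.Request(url) back to the engine. The engine passes the request to the scheduler, and once the response has been downloaded the spider's parse method is called again.

The program contains two yield statements. I like to think of yield as an interrupt (strictly speaking, interrupts are a CPU concept): its job is to hand over control. Because parse contains yield, it is a generator, and Scrapy iterates over it. In this program, once the data has been packed into the item, yield hands the item to the engine, which passes it on to the item pipeline; when the next value is requested, parse resumes from the line after the yield. That explains the first yield.

The second yield is slightly more involved. This statement uses a callback, and the callback is parse, that is, the current method. Every response triggers parse again, which yields another request, so the crawl keeps looping. Without a stop condition the spider would go on requesting page after page; if I removed the if from this program, it would have no stopping point of its own.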

yield scrapy.Request(url=url,callback=self.parse)
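
The two kinds of yield can be imitated in plain Python. The sketch below is a toy model, not Scrapy's engine: FakeRequest and run are hypothetical names invented for the example, and the only thing assumed from the text above is that yielded items go one way (towards the pipeline) and yielded requests go another (towards the scheduler).

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""
    url: str


def parse(page):
    """Generator callback: yields an item first, then a follow-up request."""
    yield {'name': f'user-from-{page}'}       # first yield: an item
    yield FakeRequest(url=f'{page}/next')     # second yield: a request


def run(callback, page):
    """Toy 'engine': pulls values out of the generator and routes them by type."""
    pipeline_items, scheduled = [], []
    for result in callback(page):             # each step resumes parse after its last yield
        if isinstance(result, FakeRequest):
            scheduled.append(result.url)      # would be passed to the scheduler
        else:
            pipeline_items.append(result)     # would be passed to the item pipeline
    return pipeline_items, scheduled


items, urls = run(parse, 'text')
print(items)   # [{'name': 'user-from-text'}]
print(urls)    # ['text/next']
```

The generator never calls the pipeline or the scheduler itself; it only produces values, and whoever iterates over it decides what to do with each one.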

import scrapy
from myfirst_scrapyDemo.items import MyfirstScrapydemoItem


class SpiderQiushiSpider(scrapy.Spider):
    # Spider name
    name = 'spider_QiuShi'
    # Allowed domains; the list may hold several entries and restricts the spider to pages under them
    allowed_domains = ['www.qiushibaike.com']
    # Start URL
    start_urls = ['http://www.qiushibaike.com/text']

    # Multi-page crawling: URL template and current page number
    url = 'https://www.qiushibaike.com/text/page/{}/'
    page = 1

    def parse(self, response):
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        # items = []
        for li in div_list:
            # Create a fresh item for every entry so that items yielded earlier are not overwritten
            item = MyfirstScrapydemoItem()
            # Older style: index the selector list, then call extract()
            # name = li.xpath('.//h2/text()')[0].extract().strip('\n').strip(' ')
            # face_src = "https:" + li.xpath('.//div[1]/a[1]/img/@src')[0].extract().split('?')[0]
            # funny = li.xpath('.//div[@class="stats"]/span[1]/i/text()')[0].extract()
            # comment = li.xpath('.//div[@class="stats"]/span[2]/a/i/text()')[0].extract()
            # extract_first() returns None when nothing matches, so guard before calling string methods
            name = li.xpath('.//h2/text()').extract_first(default='').strip()
            src = li.xpath('.//div[1]/a[1]/img/@src').extract_first()
            face_src = ("https:" + src.split('?')[0]) if src else None
            funny = li.xpath('.//div[@class="stats"]/span[1]/i/text()').extract_first()
            comment = li.xpath('.//div[@class="stats"]/span[2]/a/i/text()').extract_first()
            # Alternative: collect plain dicts in a list
            # item = {
            #     "name": name,
            #     "link": face_src
            # }
            # items.append(item)
            item['name'] = name
            item['face_src'] = face_src
            item['funny'] = funny
            item['comment'] = comment
            # print(item)
            yield item

        # Multi-page crawling: request the next page until page 6 has been reached
        if self.page < 6:
            self.page += 1
            url = self.url.format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
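
To check which pages the counter and the stop condition actually produce, the pagination logic can be simulated without any network access. This is only a sketch: ToySpider and crawl are hypothetical names, the tuples stand in for real items and requests, and the URL template and the limit of 6 are copied from the spider above.

```python
from collections import deque

URL_TEMPLATE = 'https://www.qiushibaike.com/text/page/{}/'
MAX_PAGE = 6   # same limit as the check in the spider; the counter starts at 1


class ToySpider:
    def __init__(self):
        self.page = 1

    def parse(self, url):
        yield ('item', url)                        # pretend something was scraped
        if self.page < MAX_PAGE:                   # stop condition
            self.page += 1
            yield ('request', URL_TEMPLATE.format(self.page), self.parse)


def crawl(spider, start_url):
    """Toy scheduler loop: follows the callbacks, fetches nothing."""
    queue = deque([(start_url, spider.parse)])
    visited = []
    while queue:
        url, callback = queue.popleft()
        visited.append(url)
        for result in callback(url):
            if result[0] == 'request':
                queue.append((result[1], result[2]))
    return visited


visited = crawl(ToySpider(), 'http://www.qiushibaike.com/text')
for url in visited:
    print(url)
```

The simulation visits the start URL and then pages 2 through 6, six URLs in total; the last call to parse finds the counter already at 6 and yields no further request, so the queue drains and the loop ends.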

Posted on 2021-01-29 22:34 by 逐梦的猫
