Log levels and passing parameters between requests in the Scrapy framework


Scrapy log levels

- When a spider is run with scrapy crawl spiderFileName, the output printed in the terminal is Scrapy's log information.

  - Types of log messages:

        ERROR : general errors

        WARNING : warnings

        INFO : general informational messages

        DEBUG : debugging messages

       

  - Specifying which log messages are output:

    In the settings.py configuration file, add

    LOG_LEVEL = '<desired log level>' so that only messages of that level and above are printed.

    LOG_FILE = 'log.txt' writes the log messages to the specified file instead of the terminal. (A concrete example follows below.)
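For example, a minimal settings.py snippet combining the two options might look like the following; the level chosen here is just one possible value.

# settings.py -- illustrative log configuration
LOG_LEVEL = 'ERROR'     # only ERROR-level messages are printed
LOG_FILE = 'log.txt'    # the log is written to log.txt instead of the terminal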

Passing parameters between requests

- In some cases the data we want is not all on one page. For example, when scraping a movie site, the movie's name and rating sit on the first-level listing page, while the other details we want are on its second-level detail page. In that situation we need to pass parameters along with the request; a minimal sketch of the pattern is given right after the example description below.

- Example: scrape the movie site https://www.4567tv.tv/frim/index6.html, taking the movie name, genre, and rating from the first-level page and the release date, director, and runtime from the second-level detail page. (The simplified code below only collects the name and one description field, but the mechanism is the same.)
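The key mechanism is the meta argument of scrapy.Request: any dict passed as meta travels with the request and becomes available as response.meta inside the callback. Below is a minimal, self-contained sketch of the pattern; the spider name, URLs, and XPath expressions are made up purely for illustration.

import scrapy

class ExampleSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'example'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for li in response.xpath('//li'):
            # fill in the fields that are available on the listing page
            item = {'name': li.xpath('./a/text()').extract_first()}
            detail_url = response.urljoin(li.xpath('./a/@href').extract_first())
            # attach the partially filled item to the follow-up request via meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                 meta={'item': item})

    def parse_detail(self, response):
        # recover the item that was passed along with the request
        item = response.meta['item']
        item['desc'] = response.xpath('//p/text()').extract_first()
        yield item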

Spider file: movie.py

# -*- coding: utf-8 -*-
import scrapy

from ..items import MovieItem
class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.mv.com']
    start_urls = ['https://www.4567tv.tv/frim/index6.html']

    # receive the data passed along by the request that triggered this callback
    def detail_parse(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item

    def parse(self, response):
        li_list = response.xpath('//div[@class="stui-pannel_bd"]/ul/li')
        for li in li_list:
            name = li.xpath(".//h4[@class='title text-overflow']/a/text()").extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('.//h4[@class="title text-overflow"]/a/@href').extract_first()
            item = MovieItem()
            item["name"] = name
            # meta is a dict; every key/value pair in it is passed on to the specified callback
            yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={'item': item})

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    desc = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):
    f = None

    def open_spider(self, spider):
        # called once when the spider starts; open the output file here
        print('Spider started!')
        self.f = open('./movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields
        name = item['name']
        desc = item['desc']
        self.f.write(name + ':' + desc + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider closes; release the file handle
        print('Spider finished!')
        self.f.close()

settings.py:

ITEM_PIPELINES = {
   'Movie.pipelines.MoviePipeline': 300,
}
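With everything above in place, the example should be runnable with scrapy crawl movie from inside the project directory; each scraped movie then ends up as a name:description line in movie.txt. Setting LOG_LEVEL = 'ERROR' in settings.py, as described in the first section, keeps the terminal output down to errors only.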

Example: scraping NetEase News (网易新闻)

Spider file: wangyi.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import WangyiItem
import re

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.yu.com']
    start_urls = ['https://temp.163.com/special/00804KVA/cm_guonei.js?callback=data_callback']
    def detail_parse(self, response):
        # recover the shared item and attach this article's first paragraph to it
        item = response.meta["item"]
        detail_msg = response.xpath('//div[@class="post_text"]/p[1]/text()').extract_first()
        item["detail_msg"] = detail_msg
        # print(item)
        yield item

    def parse(self, response):
        str_msg = response.text
        # the titles and article urls live inside a JS callback, so pull them out with regex
        title_list = re.findall('"title":.*', str_msg)
        docurl_list = re.findall('"docurl":"(.*)"', str_msg)

        title_lst = []
        for title in title_list:
            # the JS file is gbk-encoded; undo the mis-decoding, then keep the part after "title":
            title = title.encode('iso-8859-1').decode('gbk').split(":")[-1]
            title_lst.append(title)

        # a single item holding the whole title list is shared by every detail request
        item = WangyiItem()
        item["title"] = title_lst

        for docurl in docurl_list:
            yield scrapy.Request(url=docurl, callback=self.detail_parse, meta={"item": item})

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    detail_msg = scrapy.Field()

Pipeline file: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WangyiPipeline(object):
    f = None
    l = 0  # counter pairing each detail response with its title, in arrival order

    def open_spider(self, spider):
        print("Spider started")
        self.f = open("./wangyi.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        title_list = item["title"]
        detail_msg = item["detail_msg"]
        # assumes detail responses arrive in the same order as the titles in the JS file
        if self.l < len(title_list):
            title = title_list[self.l]
            self.f.write(title + ":" + detail_msg + "\n")
            self.l += 1
        return item

    def close_spider(self, spider):
        print("Spider finished")
        self.f.close()

settings.py:

ITEM_PIPELINES = {
   'Wangyi.pipelines.WangyiPipeline': 300,
}

Note: the tricky part of this example is that the news titles are loaded dynamically. A global search of the page source for the title markup shows that the data actually lives in a JS file, and the content is then extracted from that file with regular expressions.
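To make that extraction step concrete, here is a minimal standalone sketch run against a made-up stand-in for the data_callback(...) response; the real cm_guonei.js file has the same shape but is gbk-encoded and much longer, and the non-greedy patterns used here are slightly tighter than the ones in the spider above.

import re

# hypothetical stand-in for the body of the JS callback response
sample_js = 'data_callback([{"title":"Some headline","docurl":"https://news.163.com/article.html"}])'

titles = re.findall(r'"title":"(.*?)"', sample_js)
docurls = re.findall(r'"docurl":"(.*?)"', sample_js)

print(titles)    # ['Some headline']
print(docurls)   # ['https://news.163.com/article.html']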

 
