Web scraping -- Scrapy request meta passing, POST requests, and cookie handling
1. Passing data between requests in Scrapy
When to use: if the data to be scraped is not all on one page (e.g. a list page plus a detail page), you must pass data along with the follow-up request.
How: yield scrapy.Request(url, callback, meta)
    - callback: the function that parses the follow-up response
    - meta: a dict used to hand data to that callback
Spider file steps:
1. Import the item class: from moviepro.items import MovieproItem
2. Instantiate the item in the first parse: item = MovieproItem()
3. Pass it along manually: yield scrapy.Request(url=datile_url, callback=self.datile_parse, meta={'item': item})
4. In the detail-page parse, first recover the item: item = response.meta['item']
5. Hand the finished item to the pipeline: yield item

    import scrapy
    from moviepro.items import MovieproItem

    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.doubiyang.cc/frim/index1.html']

        def parse(self, response):
            li_list = response.xpath('/html/body/div[1]/div/div[2]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./a/@title').extract_first()
                datile_url = 'https://www.doubiyang.cc/' + li.xpath('./a/@href').extract_first() + '#desc'
                item = MovieproItem()
                item['title'] = title
                item['datile_url'] = datile_url
                # attach the half-filled item to the request via meta
                yield scrapy.Request(url=datile_url, callback=self.datile_parse, meta={'item': item})

        def datile_parse(self, response):
            # recover the item that was attached to this request
            item = response.meta['item']
            desc = response.xpath('//div[@class="stui-content__desc"]/text()').extract()
            item['desc'] = desc
            yield item
items.py:

    import scrapy

    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        datile_url = scrapy.Field()
        desc = scrapy.Field()
settings.py:

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    LOG_LEVEL = 'ERROR'        # only show errors in the log
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
        'moviepro.pipelines.MovieproPipeline': 300,
    }
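settings.py enables moviepro.pipelines.MovieproPipeline, but the original notes never show that file. A minimal sketch of what it could look like (the field names follow MovieproItem; the output filename movies.txt is an assumption made up for this example):

```python
# pipelines.py -- a minimal sketch of the MovieproPipeline enabled in settings.py

class MovieproPipeline:
    def open_spider(self, spider):
        # called once when the spider starts; open the output file here
        self.fp = open('movies.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item carries the title, datile_url and desc fields filled by the spider
        title = item['title']
        desc = ''.join(item['desc']).strip()
        self.fp.write(f'{title}\t{desc}\n')
        return item  # return the item so any later pipelines also receive it

    def close_spider(self, spider):
        # called once when the spider finishes; release the file handle
        self.fp.close()
```

Returning the item from process_item matters: pipelines run in priority order, and each one receives what the previous one returned.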
2. POST requests and cookie handling in Scrapy
Sending a POST request:
1. Override the parent class's start_requests(self) method.
2. Inside it, yield scrapy.FormRequest(url, callback, formdata) instead of scrapy.Request.
    - FormRequest sends the POST: yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)
    - callback: the function that parses the response
    - formdata: a dict holding the POST parameters
Cookie handling: Scrapy handles cookies automatically by default, so no extra code is needed to keep a session alive across requests.
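Since cookie handling is on by default, nothing has to be configured; if you want to verify which cookies Scrapy actually sends and receives, two related settings can be adjusted in settings.py (the values shown are illustrative, not required):

```python
# settings.py -- cookie-related Scrapy settings
COOKIES_ENABLED = True   # the default: Scrapy keeps a cookie jar per spider automatically
COOKIES_DEBUG = True     # default is False; True logs every Cookie / Set-Cookie header
```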
1. Override the start_requests method  2. Send the POST with FormRequest: yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def start_requests(self):
        for url in self.start_urls:
            data = {
                'kw': 'cat'
            }
            # manual POST requests are sent with FormRequest
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
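For reference, FormRequest URL-encodes the formdata dict into the POST body (and sets the Content-Type header to application/x-www-form-urlencoded), the same encoding the standard library produces:

```python
from urllib.parse import urlencode

# FormRequest encodes the formdata dict the same way urlencode does
data = {'kw': 'cat'}
body = urlencode(data)
print(body)  # kw=cat
```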