Learning Web Scraping: Crawling Maoyan Movies with Scrapy
Steps
1. Create the project (run the following three commands in a cmd or shell window)
scrapy startproject moviesinfo
cd moviesinfo
scrapy genspider maoyanm maoyan.com
The generated file structure looks like this:
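For reference, a freshly generated project should look roughly like this (the standard Scrapy skeleton; the exact file list may vary slightly with the Scrapy version):

    moviesinfo/
        scrapy.cfg
        moviesinfo/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                maoyanm.py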
2. Edit the relevant files
maoyanm.py
# -*- coding: utf-8 -*-
import scrapy
from moviesinfo.items import MoviesinfoItem


class MaoyanmSpider(scrapy.Spider):
    name = 'maoyanm'
    allowed_domains = ['maoyan.com']
    # One listing page per 30 movies, paged via the offset query parameter
    start_urls = ['https://maoyan.com/films?showType=3&offset={}'.format((n - 1) * 30)
                  for n in range(1, 500)]

    def parse(self, response):
        # Collect the detail-page links on each listing page
        urls = response.xpath('//dd/div[2]/a/@href').extract()
        for url in urls:
            yield scrapy.Request('https://maoyan.com' + url, callback=self.parseContent)
            # print('https://maoyan.com' + url)

    def parseContent(self, response):
        names = response.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()
        ennames = response.xpath('//div[@class="ename ellipsis"]/text()').extract()
        movietype = response.xpath('//li[@class="ellipsis"][1]/text()').extract()
        movietime = response.xpath('//li[@class="ellipsis"][2]/text()').extract()
        releasetime = response.xpath('//li[@class="ellipsis"][3]/text()').extract()
        print(str(names[0]) + str(ennames[0]), movietype, movietime, releasetime)
        # Instantiate the item and fill in the fields
        movieItem = MoviesinfoItem()
        movieItem['name'] = str(names[0]) + ' ' + str(ennames[0])
        movieItem['movietype'] = movietype[0]
        movieItem['movietime'] = movietime[0].replace('\n', '').replace(" ", "")
        movieItem['releasetime'] = releasetime[0]
        yield movieItem
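For clarity, the start_urls list comprehension simply steps the offset query parameter by 30 per listing page; the first few generated URLs look like this:

>>> ['https://maoyan.com/films?showType=3&offset={}'.format((n - 1) * 30) for n in range(1, 4)]
['https://maoyan.com/films?showType=3&offset=0',
 'https://maoyan.com/films?showType=3&offset=30',
 'https://maoyan.com/films?showType=3&offset=60']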
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MoviesinfoItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    movietype = scrapy.Field()
    movietime = scrapy.Field()
    releasetime = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class MoviesinfoPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.f = open('movies.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # Write one JSON object per line
        data = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(data)
        return item
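Since the pipeline writes one JSON object per line, the results can be loaded back later with a few lines of Python, for example:

import json

# Read the JSON-lines file written by MoviesinfoPipeline
with open('movies.json', encoding='utf-8') as f:
    movies = [json.loads(line) for line in f if line.strip()]
print(len(movies), 'movies scraped')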
settings.py
# Find this block in settings.py and uncomment it
ITEM_PIPELINES = {
    'moviesinfo.pipelines.MoviesinfoPipeline': 300,
}
Change the User-Agent (optional)
Install fake_useragent (run the following command in a cmd or shell window)
pip install fake_useragent
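A quick way to check that the install worked is to print a random User-Agent from the library (this is the same attribute the middleware below looks up):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # prints a randomly chosen User-Agent string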
middlewares.py
# Append the following code at the end of middlewares.py
import random
from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    # Swap in a random User-Agent for every request
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())
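Note that the middleware only takes effect once it is registered in settings.py; a minimal sketch (the priority value 543 and disabling the built-in UserAgentMiddleware follow the usual convention, adjust as needed):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'moviesinfo.middlewares.RandomUserAgentMiddleware': 543,
    # disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
RANDOM_UA_TYPE = 'random'   # read by the middleware via crawler.settings.get()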
3. Run the spider (run the following command in a cmd or shell window)
scrapy crawl maoyanm
Wait for the crawl to finish...
P.S. The crawl did not cover all pages as expected; it turned out that beyond a certain page the listing pages stop rendering. I still need to learn some anti-scraping countermeasures to deal with this, or find sites with weaker anti-scraping protection to practice on.
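As a first step before tackling real anti-scraping measures, slowing the crawl down in settings.py sometimes helps; a sketch (the values here are guesses, not tested against Maoyan):

# settings.py -- throttle the crawl
DOWNLOAD_DELAY = 1                    # wait 1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # limit parallel requests to the site
RETRY_TIMES = 3                       # retry failed pages a few times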
References:
https://www.cnblogs.com/zhaopanpan/articles/9339784.html
https://www.bilibili.com/video/av19057145
https://www.bilibili.com/video/av27782740
https://www.bilibili.com/video/av30272877
Changing the User-Agent in Scrapy:
https://blog.csdn.net/sinat_41701878/article/details/80295600
https://blog.csdn.net/dta0502/article/details/82666421
https://blog.csdn.net/weixin_42260204/article/details/81087402
https://www.cnblogs.com/cnkai/p/7401343.html