Preface:
The requests + BeautifulSoup style of crawling runs into problems with performance, fast data storage, and unified management of multiple crawlers as the work grows, which is why I switched to a crawler framework: Scrapy!
What is Scrapy?
Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a wide range of programs such as data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has broad uses: data mining, monitoring, and automated testing.
Scrapy Features
----Asynchronous page downloading via the Twisted library
----HTML parsed into selector objects
----Proxy support
----Delayed downloads
----URL de-duplication
----Configurable crawl depth and breadth
...........................
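Most of these features are switched on through settings. Below is a minimal, illustrative settings.py sketch (the values are only examples, not recommendations from the original post):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 16      # requests handled concurrently by the Twisted-based downloader
DOWNLOAD_DELAY = 2            # delayed downloading: wait 2 seconds between requests to the same site
DEPTH_LIMIT = 3               # maximum crawl depth (0 means unlimited)
DEPTH_PRIORITY = 1            # a positive value, together with FIFO queues, shifts the crawl toward breadth-first order
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'   # the default URL de-duplication filter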
Scrapy Architecture
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture looks roughly as follows.
Scrapy consists mainly of the following components:
Scrapy uses an asynchronous, decoupled design, so every component talks only to the Engine; components never interact with each other directly.
Spider: the Spider decides the start URLs, builds Request objects and hands them to the Engine; the Engine also passes the Downloader's responses (Response) back to the Spider for parsing.
Engine: the operational hub of the Scrapy architecture, responsible for centrally coordinating the other four components.
Scheduler: acts as a queue. The Engine puts Request objects into the Scheduler, where they wait their turn; the Engine later gets Request objects back out of the Scheduler and hands them to the Downloader.
Downloader: the Downloader returns Response objects to the Engine, and the Engine passes those Response objects on to the Spider for parsing.
Pipeline: formats the parsed data and stores it in the database.
As users of the framework, we only need to care about the Spider and Pipeline parts.
Scrapy Workflow
The Scrapy run flow is roughly as follows:
1. Spider
Spider entry point ---> get start_url ---> build Request object ---> Engine
Engine ---> put Request object ---> Scheduler
2. Scheduler and Downloader
Engine gets a Request object from the Scheduler ---> passes the Request object ---> Downloader
Downloader fetches the page content ---> Engine (the Downloader hands the returned Response back)
3. Item Pipeline
Engine passes the Response ---> to the Spider for parsing ---> back to the Engine
Engine ---> parsed data ---> Item Pipeline ---> database
So, as users of the framework, we only need to care about the Spider and Pipeline parts.
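To make that flow concrete, here is a minimal spider sketch; the site and the CSS selectors are those of the official Scrapy tutorial site and are only placeholders:

import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: start_urls become Requests, the Engine/Scheduler/Downloader fetch them,
    each Response comes back to parse(), and every yielded dict is handed to the Item Pipeline."""
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }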
Scrapy Installation & Usage
Installation
1. Linux
pip install scrapy
2. Windows
2.1: Download Twisted
Twisted-18.7.0-cp36-cp36m-win_amd64.whl: cp36 is the CPython interpreter version (3.6), and amd64 is the Windows architecture (64-bit).
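Then install the downloaded wheel with pip before installing Scrapy (the file name below is just the example above; use the wheel that matches your interpreter):

pip install Twisted-18.7.0-cp36-cp36m-win_amd64.whl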
2.2: Install Scrapy
pip install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Create a crawler project
scrapy startproject projectname                  # create a Scrapy project
cd projectname
scrapy genspider [-t template] <name> <domain>   # create a spider application
scrapy genspider -t basic le le.com              # create spider No.1
scrapy genspider -t xmlfeed bestseller bestseller.com.cn   # create spider No.2
scrapy list                                      # list the spider applications
scrapy crawl <spider_name> --nolog               # run a single spider; --nolog suppresses the log output

Edit settings.py:
ROBOTSTXT_OBEY = False   # whether to obey the robots protocol
Readers are strongly advised to respect the crawling rules; if ROBOTSTXT_OBEY = True and you cannot get a response, you really should call the site owner and talk it over!
Spider

# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy
from scrapy.http import Request
from selenium import webdriver

from CrawlerProject.items import JobSalaryItem


class SalarySpider(scrapy.Spider):
    # pages = job_element.xpath('//*[@id="hidTotalPage"]/@value').extract()
    name = 'SalarySpider'
    # allowed_domains = ['search.51job.com']  # domain
    # Full-time jobs, sorted by date
    # The engine throws the start URLs into the scheduler; after scheduling:
    # engine -> scheduler (runs start_requests) -> downloader -> download & parse -> scheduler -> main loop (multi-threaded)
    # -> response -> passed to parse() -> back in the spider file -> parsed by parse()
    # 51job
    url_template = 'https://search.51job.com/list/00000,000000,0000,00,9,99,{},2,{}.html'
    # url_template = 'https://search.51job.com/list/010000%252C020000%252C040000,000000,0000,01,9,99,{},2,{}.html'
    # Liepin
    liepin_start_url = 'https://www.liepin.com/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=&degradeFlag=0&compIds=&subIndustry=&jobKind=2&industries=&compscale=&key={}&siTag=k_cloHQj_hyIn0SLM9IfRg~vGtZA3d8zZ8ik_pvODBhKQ&d_sfrom=search_fp&d_ckId=cf1fc02f18a96dffb9997a0da53ee041&d_curPage=0&d_pageSize=40&d_headId=c2ea869f999a160a5f36f9aad4e48776&curPage={}'
    queue = None

    def __init__(self, queue=None, keyword=' ', inter='', pages=50, **kwargs):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        # driver = webdriver.Chrome(chrome_options=chrome_options)
        super().__init__(None, **kwargs)
        self.queue = queue
        self.pages = int(pages)
        self.keyword = keyword.upper()
        self.inter = inter

    def start_requests(self):
        # Liepin or 51job?
        if self.inter == "猎聘网":
            url_lipin = self.liepin_start_url.format(parse.quote(parse.unquote(self.keyword)), 1)
            yield Request(url_lipin, dont_filter=True, meta={'page': 1})
        else:
            url_51job = self.url_template.format(parse.quote(parse.quote(self.keyword)), 1)
            yield Request(url_51job, dont_filter=True, meta={'page': 1})

    # Parse the list page and extract the detail links
    def parse(self, response):
        meta_info = response.meta
        curr_page = meta_info['page']
        self.print_log('crawling page {}...'.format(meta_info['page']))
        if self.inter == "猎聘网":
            job_elements = response.xpath('//div[@class="sojob-result "]/ul/li/@data-info')
            print("=======", job_elements)
            for job_element in job_elements:
                job_code = job_element.root
                self.item = self.parse_info(response, job_code)
                job_item = self.item
                yield job_item
            yield self.to_next_page(response, curr_page)
        else:
            # keep the selectors (no .extract()) so create_item() can run relative XPath on each element
            job_elements = response.xpath('//div[@class="j_joblist"]/div[@class="e"]')
            print("------>", job_elements)
            for x in range(0, len(job_elements)):
                self.item = self.create_item(job_elements[x])
                if self.item is not None and self.item != '':
                    job_item = self.item
                    print(job_item)
                    yield job_item
            yield self.to_next_page(response, curr_page)

    def parse_info(self, response, job_code):
        url = '//ul/li[@data-info="' + job_code + '"]'
        name = response.xpath(url + '//div[@class="job-info"]/h3/a/text()').extract_first()
        condition = response.xpath(url + '//div[@class="job-info"]/p/@title').extract_first()
        post_date = response.xpath(
            url + '//div[@class="job-info"]/p[@class="time-info clearfix"]//time/@title'
        ).extract_first()
        company = response.xpath(
            url + '//div[@class="company-info nohover"]/p[@class="company-name"]/a/text()').extract_first()
        industry_a = response.xpath(
            url + '//div[@class="company-info nohover"]/p[@class="field-financing"]/span/text()'
        ).extract_first()
        industry_na = response.xpath(
            url + '//div[@class="company-info nohover"]/p[@class="field-financing"]/span/a/text()'
        ).extract_first()
        details = response.xpath(url + '//div[@class="job-info"]/h3/a/@href').extract_first()
        if condition and len(condition.split('_')) == 4:
            salary = condition.split('_')[0]
            workplace = condition.split('_')[1]
            workseniority = condition.split('_')[-1]
        else:
            salary = ''
            workplace = ''
            workseniority = ''
        if industry_a:
            industry = industry_a.strip()
        elif industry_na:
            industry = industry_na.strip()
        else:
            industry = '其他'
        job_item = JobSalaryItem(
            name=name.strip() if name else '',
            company=company if company else '',
            workseniority=workseniority if workseniority else '',
            industry=industry,
            salary=salary if salary else '',
            workplace=workplace if workplace else '',
            post_date=post_date if post_date else '',
            detail=details if details else '',
            company_natrue='',
            people_num='',
            source=self.inter,
            key=self.keyword,
            citylevel='',
        )
        return job_item

    def create_item(self, job_element):
        name = job_element.xpath('a/p[@class="t"]/span[1]/text()').extract_first()
        workseniority = job_element.xpath('a/p[@class="info"]/span[2]/text()').extract_first()
        industry = job_element.xpath('div/p[@class="int at"]/text()').extract_first()
        workplace = job_element.xpath('a/p[@class="info"]/span[2]/text()').extract_first()
        company = job_element.xpath('div/a/text()').extract_first()
        salary = job_element.xpath('a/p[@class="info"]/span[1]/text()').extract_first()
        post_date = job_element.xpath('a/p[@class="t"]/span[2]/text()').extract_first()
        detail = job_element.xpath('a/@href').extract_first()
        company_natrues = job_element.xpath('div/p[@class="dc at"]/text()').extract_first()
        if (salary is not None) and (workseniority is not None) and (workplace is not None) and (
                company_natrues is not None):
            workpl = workplace.split('|')
            if workpl:
                workpl = workpl[0].strip()
            workseniority = re.findall(r'\d+年以上经验|\d+年经验|\d+-\d{1,2}年经验|无需经验|在校生/应届生', workseniority)
            if workseniority:
                workseniority = workseniority[0].strip()
            company_natrue = ''
            company_natrue1 = re.findall('民营公司|合资|外资|上市公司|事业单位|国企|创业公司', company_natrues)
            if company_natrue1:
                company_natrue = company_natrue1[0].strip()
            people_num = re.findall(r'\d+-\d+人|少于\d+人|\d+人以上', company_natrues)
            if people_num:
                people_num = people_num[0].strip()
            job_item = JobSalaryItem(
                name=name,
                company=company if company else '',
                workseniority=workseniority if workseniority else '',
                industry=industry if industry else '',
                post_date=post_date if post_date else '',
                workplace=workpl if workpl else '',
                salary=salary if salary else '',
                company_natrue=company_natrue,
                people_num=people_num if people_num else '',
                detail=detail if detail else '',
                source=self.inter,
                key=self.keyword,
                citylevel='',
            )
            return job_item

    def to_next_page(self, response=None, page_num=None):
        """Send the request for the next page."""
        print('moving to the next page........')
        if page_num >= self.pages:
            self.print_log("crawl task done [page {} >= {}]".format(page_num, self.pages))
            return None
        if self.inter == "猎聘网":
            url_links = self.liepin_start_url.format(self.keyword, page_num)
            page_num = page_num + 1
            next_request = Request(url_links, dont_filter=True, meta={'page': page_num})
            return next_request if next_request else None
        else:
            url_links = self.url_template.format(parse.quote(parse.quote(self.keyword)), page_num)
            links_len = len(url_links)
            if links_len == 0:
                self.print_log("crawl task done [only one page of data]")
            elif links_len == 1 and page_num > 1:
                self.print_log("crawl task done [reached the last page {}]".format(page_num))
            else:
                page_num = page_num + 1
                next_url = response.urljoin(url_links)
                return Request(next_url, dont_filter=True, meta={'page': page_num})

    def print_log(self, msg):
        if self.queue:
            self.queue.put(msg)
        else:
            print(msg)

    def close(self, spider, reason):
        # self.browser.close()
        print('crawl finished')
Pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonItemExporter


# Used to store the scraped data
class MyspiderPipeline(object):

    def open_spider(self, spider):
        # 1. open the file and 2. create the exporter once, when the spider starts
        # (opening the file in "wb" mode inside process_item would truncate it for every item)
        self.file = open("data.json", "wb")
        self.exporter = JsonItemExporter(file=self.file)
        # 3. start the exporter
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # 4. export each item
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # 5. finish exporting and 6. close the file when the spider closes
        self.exporter.finish_exporting()
        self.file.close()
Configuring Scrapy

# -*- coding: utf-8 -*-

# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Set the USER_AGENT
"""
Chrome:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
Firefox:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1"
"""
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36'

# Obey robots.txt rules
# Whether to obey the ROBOTS protocol
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middlewares
SPIDER_MIDDLEWARES = {
    # 'myspider.middlewares.MyspiderSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Log level: the higher the level, the less output
# DEBUG / INFO / WARNING / ERROR
LOG_LEVEL = "ERROR"

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    # 'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
}

# Log file
# LOG_FILE = "51job.log"

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the pipeline here in the settings file
# Priority range 0--1000; the smaller the value, the higher the priority
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pagination

# -*- coding: utf-8 -*-
import scrapy
from second_hand.items import SecondHandItem


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    allowed_domains = ['maitian.cn']
    current_url = 'http://bj.maitian.cn/esfall/R1/PG{}'
    current_page = 1
    start_urls = [current_url.format(current_page)]

    # Parse the listing page
    def parse(self, response):
        # 1. Get the list of houses on the Maitian page
        house_list = response.xpath('//div[@class="list_wrap"]/ul/li/div[@class="list_title"]')
        # Has pagination finished?
        if not house_list:
            return
        for house in house_list:
            item = SecondHandItem()
            item["title"] = house.xpath("./h1/a/text()").extract_first()
            item["address"] = house.xpath("./p//text()").extract_first()
            item["price"] = house.xpath('./div[@class="the_price"]/ol/text()').extract_first()
            yield item
        # Keep sending requests for the following listing pages
        self.current_page += 1
        next_url = self.current_url.format(self.current_page)   # format() returns a new string
        yield scrapy.Request(
            next_url,
            callback=self.parse
        )
---------------

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonItemExporter


class SecondHandPipeline(object):

    def open_spider(self, spider):
        self.file = open("01ershou.json", "wb")
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
Passing values with meta
If we need to merge data coming from several different pages into a single record,
we can use the meta argument when yielding a Request object to pass values on to the next callback.

# -*- coding: utf-8 -*-
import scrapy
from second_hand.items import SecondHandItem


class MaitianSpider(scrapy.Spider):
    name = 'maitian'
    allowed_domains = ['maitian.cn']
    current_url = 'http://bj.maitian.cn/esfall/R1/PG{}'
    current_page = 1
    start_urls = [current_url.format(current_page)]

    # Parse the listing page
    def parse(self, response):
        # 1. Get the list of houses on the Maitian page
        house_list = response.xpath('//div[@class="list_wrap"]/ul/li/div[@class="list_title"]')
        # Has pagination finished?
        if not house_list:
            return
        for house in house_list:
            item = SecondHandItem()
            item["title"] = house.xpath("./h1/a/text()").extract_first()
            item["address"] = house.xpath("./p//text()").extract_first()
            item["price"] = house.xpath('./div[@class="the_price"]/ol/text()').extract_first()
            detail_url = 'http://bj.maitian.cn' + house.xpath('./h1/a/@href').extract_first()
            print(detail_url)
            # Attach the item and fetch the agent's name from the detail page
            yield scrapy.Request(
                detail_url,
                callback=self.parse_detail,
                meta={"house": item}
            )

    def parse_detail(self, response):
        item = response.meta["house"]
        agent_name = response.xpath(
            '/html/body/section[5]/div[2]/div/div/div[1]/div[2]/dl/dd/text()').extract_first()
        item["agent"] = agent_name
        print(item)
        yield item
The Pipeline component
The Pipeline serializes and stores the scraped data; usage is shown below. Pipelines are global: every spider that returns an item object will pass through every registered pipeline.
How do we tell spiders apart inside a pipeline so that each one can be handled differently?

def process_item(self, item, spider):
    """
    Called while the spider is scraping data
    :param item: the object yielded by the spider
    :param spider: the spider object, e.g. obj = JandanSpider()
    :return:
    """
    if spider.name == 'jandan':
        print(item)
    # Returning item passes it on to the next pipeline's process_item method, chaining them together!
    return item
0. Register the pipeline in the project's settings.py

ITEM_PIPELINES = {
    'sp2.pipelines.Sp2Pipeline': 300,   # register the pipeline; 300 is the priority, the smaller the value the earlier it runs
}
1. Yield Sp2Item() objects in the spider

from ..items import Sp2Item

yield Sp2Item(url=url, text=text)   # yielding an Item object hands the extracted content to the ItemPipeline component!
2. Define the fields the spider yields in items.py

import scrapy


class Sp2Item(scrapy.Item):
    url = scrapy.Field()    # define a field
    text = scrapy.Field()
3. Implement the storage logic for the scraped data in pipelines.py

class Sp2Pipeline(object):
    def __init__(self):
        self.f = None

    def process_item(self, item, spider):
        """
        Called while the spider is scraping data
        :param item: the object yielded by the spider
        :param spider: the spider object, e.g. obj = JandanSpider()
        :return:
        """
        print(item)
        return item

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization time to create the pipeline object
        :param crawler:
        :return:
        """
        return cls()

    def open_spider(self, spider):
        """
        Called when the spider starts
        :param spider:
        :return:
        """
        print('spider started!!')

    def close_spider(self, spider):
        """
        Called when the spider closes
        :param spider:
        :return:
        """
        print('spider finished')
Scrapy middleware
Like Django, Scrapy has middleware, which lets you apply uniform operations while Scrapy requests and downloads pages, for example modifying request headers to add a crawler proxy, or setting the response decoding.
0. Register the middleware in the project's settings.py

SPIDER_MIDDLEWARES = {
    'sp33.middlewares.Sp33SpiderMiddleware': 3,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'sp33.middlewares.Sp33DownloaderMiddleware': 543,
}
1. Spider middleware
Spider middleware sits between the Spider component and the Engine.

from scrapy import signals


class Sp33SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        # 0. Called when the spider is created!
        s = cls()
        # Extend spider_opened through a signal
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # 3. Called after the download finishes, before the response is handled by parse()
        print('----------------------------------------------------process_spider_input')
        return None

    def process_spider_output(self, response, result, spider):
        print('------------------------------------------------------process_spider_output')
        # 4. Called after the response has been handled by parse()
        # Must return an iterable of Request or Item objects
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when an exception is raised.
        # Return None to let the following middlewares handle the exception,
        # or an iterable of Response or Item objects to hand to the scheduler or pipeline.
        pass

    def process_start_requests(self, start_requests, spider):
        print('-------------------------------------------------process_start_requests')
        # 2. Called with start_requests when the spider starts
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        # 1. The signal we registered, fired when the spider opens
        print('-------------------------------------------------------spider_opened')
        spider.logger.info('Spider opened: %s' % spider.name)
2. Downloader middleware
Downloader middleware sits between the Engine and the Downloader component.

from scrapy import signals


class Sp33DownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        """
        Called for each request that goes through the downloader middleware.
        Must either:
        - return None: continue processing this request
        - return a Response object: it is handed to the process_response chain
        - return a Request object: stop the middleware chain; the request is rescheduled for download
        - raise IgnoreRequest: process_exception() of installed downloader middleware is called,
          and eventually Request.errback
        """
        return None

    def process_response(self, request, response, spider):
        """
        Called with the response returned from the downloader.
        Must either:
        - return a Response object: passed on to the other middlewares' process_response
        - return a Request object: stop the middleware chain; the request is rescheduled for download
        - raise IgnoreRequest: Request.errback is called
        """
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or process_request() (from another
        downloader middleware) raises an exception.
        Must either:
        - return None: continue processing this exception in the following middlewares
        - return a Response object: stops the process_exception() chain
        - return a Request object: stops the chain; the request is rescheduled for download
        """
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Crawler disguise strategies
Typical server-side anti-crawling strategies mainly cover the HTML/data rendering mechanism, cookies, User-Agent checks, proxies/IPs, and request frequency.
For every measure there is a counter-measure, so the crawler's job is to defeat these anti-crawling strategies, i.e. to disguise itself as a browser.
1. Setting cookie information
Some sites require cookies, so we need to carry the right cookie information with every request to the server.
With the requests module we had to use a Session object to keep the cookies.
In the Scrapy framework, once login succeeds, Scrapy keeps the cookies automatically and maintains the session.

import scrapy
from scrapy import Request, FormRequest


class A51jobSpider(scrapy.Spider):
    name = '51job'
    allowed_domains = ['yaozh.com']
    # 1. GET request for the login page
    start_urls = ['https://www.yaozh.com/login/']

    def parse(self, response):
        # 2. Send the login request (POST) with the user credentials
        form_data = {"username": "xiaogouzi", "pwd": "123.com"}
        yield FormRequest.from_response(
            response=response,
            method="POST",
            formxpath='//*[@id="login_pc"]',
            formdata=form_data,
            callback=self.parse_login
        )

    def parse_login(self, response):
        print(response.body.decode("utf-8"))
        # 3. Send a GET request for the member page
        member_url = "https://www.yaozh.com/member/"
        yield Request(member_url, callback=self.parse_member)

    def parse_member(self, response):
        # 4. Check the personal information we got back
        with open("member.html", "wb") as f:
            f.write(response.body)
2. Combining Scrapy with Selenium
Many sites generate their HTML dynamically by executing JavaScript after the browser sends the request.
Scrapy cannot execute the JavaScript in the responses it receives, so we would have to spend a lot of effort analyzing the site's backend behaviour.
This leads to a familiar situation: after analyzing the page structure in the F12 dev tools for ages, the HTML we crawl back does not match what we see in the browser.
Combining Scrapy with Selenium means writing a Scrapy downloader middleware that lets Selenium send the request, uses the browser driver to render the JavaScript in the response, and hands the browser-rendered response back to the Engine, which then passes it on to the Spider.
The data we crawl then matches the HTML we see in the browser, which greatly improves the crawler's usefulness.

DOWNLOADER_MIDDLEWARES = {
    'CrawlerProject.middlewares.SeleniumMiddleware': 543,
}
The middleware:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware(object):

    def process_request(self, request, spider):
        url = request.url
        spider.browser.get(url)
        # scroll the page so that lazily loaded content gets rendered
        js1 = "document.documentElement.scrollTop = 600"
        spider.browser.execute_script(js1)
        time.sleep(0.5)
        js2 = "document.documentElement.scrollTop = 6000"
        spider.browser.execute_script(js2)
        time.sleep(0.5)
        # js3 = "document.documentElement.scrollTop = 6600"
        # spider.browser.execute_script(js3)
        # time.sleep(3)
        html = spider.browser.page_source
        # return the browser-rendered page instead of letting Scrapy download it
        return HtmlResponse(url=url, body=html, request=request, encoding='utf-8')
3. Configuring USER_AGENT and a proxy
If we send a large number of requests at a high rate, the server will recognize us as a crawler and throttle our IP.
We can disguise the crawler by using a random USER_AGENT and setting request.meta['proxy'] = proxy, which keeps the server from recognizing that the requests come from a bot; a sketch is shown below.
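A minimal downloader-middleware sketch (not from the original project): the RandomUserAgentProxyMiddleware name, the User-Agent list, and the proxy addresses are all placeholders. Register it in DOWNLOADER_MIDDLEWARES just like the Selenium middleware above.

import random


class RandomUserAgentProxyMiddleware(object):
    """Pick a random User-Agent and proxy for every outgoing request."""

    # placeholder lists -- fill in real values, or load them from settings
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1',
    ]
    PROXIES = [
        'http://127.0.0.1:8888',   # placeholder proxy address
        'http://127.0.0.1:9999',   # placeholder proxy address
    ]

    def process_request(self, request, spider):
        # rotate the User-Agent header
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        # HttpProxyMiddleware (enabled by default) reads the proxy from request.meta
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None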
Extending and customizing Scrapy
Customizing URL de-duplication
Scrapy ships with URL de-duplication enabled by default (the default filter records request fingerprints and can persist them to a file), so we can also write our own dedup filter and keep the tens of thousands of URL records in an in-memory database with much lower I/O wait (see the Redis sketch after the example below).
1. Define the de-duplication rule

class RepeatUrl:
    def __init__(self):
        self.visited_url = set()    # kept in the memory of the current process

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization time
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited
        :param request:
        :return: True means it has been visited; False means it has not
        """
        print('=============================================================' + request.url)
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the crawl finishes
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log the duplicate
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
2. Register it in the project's settings.py

DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'
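As suggested above, the same interface can be backed by an in-memory database such as Redis instead of a Python set. A rough sketch, assuming a local Redis instance and the redis-py package; the class name and key are placeholders, and you would register it with DUPEFILTER_CLASS just like RepeatUrl:

import redis


class RedisRepeatUrl:
    """Same dupefilter interface as above, but the visited set lives in Redis,
    so it survives restarts and can be shared by several crawler processes."""

    def __init__(self):
        self.r = redis.Redis(host='127.0.0.1', port=6379, db=0)   # assumed local Redis instance
        self.key = 'myspider:visited_urls'                        # arbitrary key name

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # SADD returns 0 if the member already existed, 1 if it was newly added
        return self.r.sadd(self.key, request.url) == 0

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        print('repeat', request.url)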
Custom extensions based on Scrapy's built-in signals
Scrapy is a highly extensible framework. Much like Django's signals, Scrapy exposes many signal hooks so that we can plug custom extensions into any stage of the crawl.
engine_started
engine_stopped
spider_opened
spider_idle
spider_closed
spider_error
request_scheduled
request_dropped
response_received
response_downloaded
item_scraped
item_dropped
1. Register the extension in settings.py
EXTENSIONS = {
    'sp2.extends.MyExtension': 1,   # path of the custom extension : priority
}
2. The extension itself

from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        # register the spider_opened signal with Scrapy
        crawler.signals.connect(ext.opened, signal=signals.spider_opened)
        # register the spider_closed signal with Scrapy
        crawler.signals.connect(ext.closed, signal=signals.spider_closed)
        return ext

    def opened(self, spider):
        print('########################### spider opened ###########################')

    def closed(self, spider):
        print('########################### spider closed ###########################')

# Available signals:
# engine_started = object()
# engine_stopped = object()
# spider_opened = object()
# spider_idle = object()
# spider_closed = object()
# spider_error = object()
# request_scheduled = object()
# request_dropped = object()
# response_received = object()
# response_downloaded = object()
# item_scraped = object()
# item_dropped = object()
Extending Scrapy's commands
1. Create a directory (any name, e.g. commands) at the same level as spiders.
2. Inside it, create a crawlall.py file (the file name becomes the custom command).

# crawlall.py
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
3. Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py.
4. Run scrapy crawlall in the project directory to start all the spiders at once.
Example project
1. Crawling hardware price information

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from selenium import webdriver


class HardwarespiderSpider(scrapy.Spider):
    name = 'HardwareSpider'
    allowed_domains = ['detail.zol.com.cn']
    start_urls = ["https://detail.zol.com.cn/server/1.html"]
    url_zol = "https://detail.zol.com.cn/server/{}.html"

    def __init__(self, queue=None, keyword=' ', inter='', pages=50, **kwargs):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        browser = webdriver.Chrome(chrome_options=chrome_options)
        self.browser = browser
        super().__init__(None, **kwargs)
        self.queue = queue
        self.pages = int(pages)
        self.keyword = keyword.upper()
        self.inter = inter

    def parse_item(self, response):
        data = {}
        title = response.xpath('div[@class="pro-intro"]/h3/a/text()').extract_first()
        data["name"] = title
        intro = response.xpath('div[@class="pro-intro"]/ul/li')
        for row in intro:
            para_name = row.xpath('span/text()').extract_first().strip(":")
            para_value = row.css('li::attr(title)').extract_first()
            data[para_name] = para_value
        price = response.xpath('div[@class="price-box"]/span/b[@class="price-type"]/text()').extract_first()
        data["price"] = price
        print(data)

    def start_requests(self):
        yield Request(self.url_zol.format(1), dont_filter=True, meta={'page': 1})

    def parse(self, response):
        meta_info = response.meta
        curr_page = meta_info['page']
        print("crawling page %s-------" % curr_page)
        if curr_page < 50:
            job_elements = response.xpath('//div[@class="list-item clearfix"]')
            for item in job_elements:
                self.parse_item(item)
            curr_page += 1
            yield Request(self.url_zol.format(curr_page), dont_filter=True, meta={'page': curr_page})