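Python Web Scraping: The Scrapy Framework

Scrapy Commands

Spider commands

1. Create a project:

scrapy startproject <project_name>

2. Create a spider:

cd <project_name>

scrapy genspider <spider_name> <allowed_domain>

3. Run the spider:

scrapy crawl <spider_name>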

CrawlSpider commands

1. Create a project:

scrapy startproject <project_name>

2. Create a CrawlSpider:

cd <project_name>

scrapy genspider -t crawl <spider_name> <allowed_domain>

3. Run the CrawlSpider:

scrapy crawl <spider_name>

Common settings.py options

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof a browser User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

What Scrapy Is

Scrapy is an application framework written for crawling websites and extracting structured data from them.

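Scrapy's Workflow

Flow:

1. The spider's start URLs are built into request objects --> spider middleware --> engine --> scheduler
2. The scheduler hands over a request --> engine --> downloader middleware --> downloader
3. The downloader sends the request and receives the response --> downloader middleware --> engine --> spider middleware --> spider
4. The spider extracts URLs and builds new request objects --> spider middleware --> engine --> scheduler, and step 2 repeats
5. The spider extracts data --> engine --> pipeline, which processes and saves the data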
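
The five steps above can be pictured with a toy loop. This is only a sketch in plain Python, not how Scrapy is implemented: the fake site, `downloader`, `spider_parse`, and `run_engine` are all made-up names, and the middlewares are left out.

```python
from collections import deque

# Toy stand-ins for Scrapy's components; every name here is hypothetical.
FAKE_SITE = {
    "http://example.com/page1": {"data": "item-1", "next": "http://example.com/page2"},
    "http://example.com/page2": {"data": "item-2", "next": None},
}


def downloader(request_url):
    """Pretend to fetch a URL and return a 'response' dict."""
    return FAKE_SITE[request_url]


def spider_parse(response):
    """Return (extracted item, follow-up URL or None)."""
    return response["data"], response["next"]


def run_engine(start_urls):
    scheduler = deque(start_urls)       # step 1: start requests enter the scheduler
    stored = []                         # what the "pipeline" has saved
    while scheduler:
        url = scheduler.popleft()       # step 2: scheduler -> engine -> downloader
        response = downloader(url)      # step 3: response -> engine -> spider
        item, next_url = spider_parse(response)
        if next_url:
            scheduler.append(next_url)  # step 4: new request goes back to the scheduler
        stored.append(item)             # step 5: data -> engine -> pipeline
    return stored


print(run_engine(["http://example.com/page1"]))
```

The loop ends when the scheduler has no requests left, which is also when a real crawl finishes.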

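Notes:

The Chinese labels in the diagram were added afterward to make it easier to follow
The green lines in the diagram show the flow of data
Note where the middlewares sit in the diagram; their position determines what they can do
Note the engine's position: the modules are independent of one another and interact only with the engine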

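What Each Scrapy Module Does

The role of each module in Scrapy:

Engine: passes data and signals between the different modules
Scheduler: implements a queue that stores the request objects sent by the engine
Downloader: sends the requests handed over by the engine, receives the responses, and returns them to the engine
Spider: processes the responses sent by the engine, extracts data and URLs, and hands them back to the engine
Pipeline: processes the data passed on by the engine, for example by storing it
Downloader middleware: a customizable download extension, for example for setting a proxy IP
Spider middleware: can customize requests and filter responses; its role overlaps with that of the downloader middleware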
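
As an illustration of the proxy use case mentioned for downloader middleware, here is a minimal sketch. The proxy addresses and the class name `RandomProxyMiddleware` are assumptions for illustration; the one Scrapy-specific fact it relies on is that the built-in HttpProxyMiddleware reads the proxy from `request.meta['proxy']`. A `SimpleNamespace` stands in for a real request so the snippet runs on its own.

```python
import random
from types import SimpleNamespace

# Hypothetical proxy addresses; replace them with proxies you actually control.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


class RandomProxyMiddleware:
    """Downloader middleware sketch: attach a random proxy to every request."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks the proxy up from meta['proxy'].
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # None tells Scrapy to continue handling the request normally


# Stand-in for a scrapy.Request so the sketch runs without starting a crawl.
fake_request = SimpleNamespace(meta={})
RandomProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real project the class would live in middlewares.py and be switched on through the DOWNLOADER_MIDDLEWARES setting in settings.py.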

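Structure of a Scrapy Project

Three built-in objects

request: the request object
response: the response object
item: the data object

Five components

spider
pipeline
scheduler
downloader
engine

Two middlewares

Downloader middleware and spider middleware. A downloader middleware is written by implementing these two hook methods:

process_request(self, request, spider)
process_response(self, request, response, spider)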

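Scrapy Project Development Workflow

Create a project

scrapy startproject <project_name>

Example: scrapy startproject mySpider

Create a spider

cd <project_name>

scrapy genspider <spider_name> <allowed_domain>

Example:

cd mySpider

scrapy genspider itcast itcast.cn

Data modeling

Middleware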

Spider file (itcast.py)

import scrapy

class ItcastSpider(scrapy.Spider):  # inherits from scrapy.Spider
    # spider name
    name = 'itcast'
    # domains the spider is allowed to crawl
    allowed_domains = ['itcast.cn']
    # URLs to start crawling from
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data extraction method; receives the response handed over by the engine
    def parse(self, response):
        # a Scrapy response object supports xpath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # to get the actual text, first group the page by <li> element
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # build one dict per teacher
            item = {}
            # locate elements with Scrapy's xpath selectors, then read the result with extract() or extract_first()
            item['name'] = li.xpath('.//h3/text()').extract_first()  # teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # teacher's level
            item['text'] = li.xpath('.//p/text()').extract_first()  # teacher's bio
            print(item)

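Notes:

The parts you need to change are allowed_domains, start_urls, and parse()

Locating elements and extracting data and attribute values:

response.xpath() returns a list-like object (a SelectorList) whose elements are Selector objects; it behaves like a list but has a few extra methods
extract(): returns a list of strings
extract_first(): returns the first string in the list, or None if the list is empty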

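Common attributes of the response object

response.url: URL of the current response
response.request.url: URL of the request that produced the response
response.headers: response headers
response.request.headers: headers of the request that produced the response
response.body: response body, i.e. the HTML, as bytes
response.status: HTTP status code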

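Saving Data

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

Each key is the pipeline class to use, written as a dotted path: the first part is the project package, the second is the module file, and the third is the pipeline class defined in it.

Each value sets the pipeline's order: the smaller the number, the earlier the pipeline runs. Values are conventionally kept within 1000.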
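
The ordering rule can be checked with a few lines of plain Python. The three pipeline paths below are hypothetical and exist only to show how the values translate into an execution order.

```python
# Hypothetical pipeline paths, used only to illustrate the ordering rule.
ITEM_PIPELINES = {
    "myspider.pipelines.SavePipeline": 400,
    "myspider.pipelines.CleanPipeline": 100,
    "myspider.pipelines.DedupPipeline": 300,
}

# Items pass through the pipelines in ascending order of their values.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path.rsplit(".", 1)[-1], ITEM_PIPELINES[path])
```

The order in which the entries are written in the dict does not matter; only the numbers do.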

Run Scrapy

From the project directory, run:

scrapy crawl <spider_name>

Example: scrapy crawl itcast

Using Scrapy

User-Agent and UA pools

Change or add in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof a browser User-Agent
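
The setting above fixes a single User-Agent. For a UA pool, one common approach is a small downloader middleware that picks a different string per request. The sketch below is an assumption about how that could look, not code from this project: the pool contents and the class name `RandomUserAgentMiddleware` are made up, and a `SimpleNamespace` replaces the real request so the snippet runs standalone.

```python
import random
from types import SimpleNamespace

# A small, hypothetical pool; in practice list the browser strings you want to rotate.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: choose a User-Agent for each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # let Scrapy continue processing the request


# Stand-in for a scrapy.Request; a real request's headers also support item assignment.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

Like any downloader middleware, it only takes effect once it is listed in DOWNLOADER_MIDDLEWARES in settings.py.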

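Cookies and cookie pools

A fixed cookie suits sites whose cookies live a long time (common on loosely built sites) and where the amount of data is small enough to fetch everything before the cookie expires.

Method 1: override the spider's start_requests method and return requests that carry the cookies parameter to the engine

In the spider file:

def start_requests(self):  # override start_requests
    # cookie string captured from the browser's network panel
    cookies_str = '...'
    # convert the cookie string to a dict; split on the first '=' only, since values may contain '='
    cookies_dict = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies_str.split('; ')}
    yield scrapy.Request(  # hand the request with cookies to the engine
        self.start_urls[0],
        callback=self.parse,
        cookies=cookies_dict
    )

Note:

In Scrapy, do not put cookies in headers; Request takes a dedicated cookies parameter that accepts the cookies as a dict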
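
Converting a copied cookie string into the dict that the cookies parameter expects is plain string handling. The helper name `cookie_string_to_dict` and the sample string below are made up for illustration; the point of the sketch is to split each pair on the first '=' only, because cookie values may themselves contain '='.

```python
def cookie_string_to_dict(cookie_str):
    """Turn 'a=1; b=2' (as copied from the browser) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_str.split("; "):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # split on the first '=' only
        cookies[name] = value
    return cookies


sample = "sessionid=abc123; token=a=b=c; theme=dark"
print(cookie_string_to_dict(sample))
```

The resulting dict can be passed straight to scrapy.Request(..., cookies=...).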

Method 2: send a POST request with scrapy.FormRequest(); suits sites whose cookies change frequently

import scrapy

class Login2Spider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        # hidden form fields that must be posted back together with the credentials
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        # build the POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "username",
                "password": "***"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        print(response.body)

Note:

Setting COOKIES_DEBUG = True in settings.py prints the cookies being sent and received to the terminal

IPs and IP pools

Captchas

middlewares.py

import logging

from scrapy.exceptions import IgnoreRequest

# find_captcha() and solve_captcha() are placeholders for your own detection and solving code

class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests that are marked with meta key
        captcha = find_captcha(response)
        if not captcha:  # it might not have captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            # response.meta is a shortcut to request.meta, so the callback can read these keys
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        # retry page for new captcha; cap the retries to prevent an endless loop
        retries = request.meta.get('captcha_retries', 0)
        if retries >= self.max_retries:
            logging.warning('max retries for captcha reached for {}'.format(request.url))
            raise IgnoreRequest
        request.meta['captcha_retries'] = retries + 1
        request.dont_filter = True  # let the retried request bypass the duplicate filter
        return request

import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    def parse(self, response):
        url = ''  # url that requires captcha
        yield Request(url, callback=self.parse_captchad, meta={'solve_captcha': True},
                      errback=self.parse_fail)

    def parse_captchad(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to retrieve captcha in 5 tries :(
        # do stuff
        pass

See also: How to Set Up Scrapy to Handle Captchas

Pagination requests
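
A pagination request usually comes down to two things: reading the href of the "next page" link and resolving it against the current URL before yielding a new request. The sketch below shows only the URL resolution with the standard library; the function name `next_page_url` and the example URLs are made up. Inside a spider, response.urljoin() performs the same resolution, and the result would typically be passed to scrapy.Request with callback=self.parse.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve the href of a 'next page' link against the current page URL."""
    if not next_href:
        return None  # no link found: this was the last page
    return urljoin(current_url, next_href)


print(next_page_url("http://example.com/list/page1.html", "page2.html"))
print(next_page_url("http://example.com/list/page1.html", "/list/page3.html"))
print(next_page_url("http://example.com/list/page9.html", None))
```

Returning None on the last page gives the spider a simple stop condition.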

Data modeling (items)

Define the fields to extract in items.py:

import scrapy

class MyspiderItem(scrapy.Item):
    name = scrapy.Field()   # teacher's name
    title = scrapy.Field()  # teacher's title
    desc = scrapy.Field()   # teacher's bio

Import and instantiate the item in the spider file; after that it is used just like a dict

itcast.py:

from myspider.items import MyspiderItem   # import the Item; mind the path
...
    def parse(self, response):

        item = MyspiderItem()  # ready to use once instantiated

        item['name'] = node.xpath('./h3/text()').extract_first()
        item['title'] = node.xpath('./h4/text()').extract_first()
        item['desc'] = node.xpath('./p/text()').extract_first()

        print(item)

In the line from myspider.items import MyspiderItem, make sure the import path for the item is correct, and ignore the error that PyCharm flags

Rule of thumb for Python import paths: imports start from wherever the program is run

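Cleaning, deduplicating, and saving data (pipelines)

Pipelines clean and save data; you can define several pipelines, each with its own job

Deduplicating data

Cleaning data

Cleaning before storage

Cleaning after storage

Saving data

One spider

Multiple spiders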

import json

from itemadapter import ItemAdapter
from pymongo import MongoClient

class ItcastspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itcast':
            self.file = open('./itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itcast':
            self.file.close()

class ItcspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itc':
            self.file = open('./itc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itc':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itc':
            self.file.close()

class itMongoPipeline(object):
    def open_spider(self, spider):
        if spider.name == 'itcast':
            con = MongoClient()
            self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # insert_one() replaces Collection.insert(), which PyMongo 4 removed;
            # dict(item) is harmless if an earlier pipeline already converted the item
            self.collection.insert_one(dict(item))
        return item

Enabling the pipelines:

Enable the pipelines in settings.py

......
ITEM_PIPELINES = {
   'itcastspider.pipelines.ItcastspiderPipeline': 300,  # 300 is the priority; the smaller the value, the earlier the pipeline runs
   'itcastspider.pipelines.ItcspiderPipeline': 301,
   'itcastspider.pipelines.itMongoPipeline': 400,
}
......

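Notes

Pipelines must be enabled in settings before they are used.
In the settings entry, the key is the pipeline's location (so a pipeline can live anywhere in the project) and the value is its distance from the engine; the closer it is, the earlier data passes through it, so pipelines with smaller values run first
Different pipelines can handle data from different spiders, told apart by the spider.name attribute
Different pipelines can apply different processing to one or more spiders, for example one for cleaning and one for saving
A single pipeline class can also handle data from several spiders, again told apart by spider.name
With several pipelines, process_item must return item; otherwise the next pipeline receives None
Every pipeline must define process_item; otherwise it cannot receive and process items
process_item(self, item, spider): processes the item; it receives the item and the spider, where spider is the spider that produced the item
If an earlier pipeline has already converted the item to a dict, it does not need converting again; whether another pipeline ran first depends on the pipeline settings, where a smaller value means earlier execution.
open_spider(self, spider): runs once when the spider opens
close_spider(self, spider): runs once when the spider closes
These two methods are commonly used for database work: open the connection when the spider opens and close it when the spider closes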
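
The call order of the three hooks, and why process_item has to return the item, can be seen in a toy driver. This is a sketch only: `run_pipelines` is a made-up stand-in for what Scrapy does internally, and `CleanPipeline` and `SavePipeline` are illustrative classes, not part of this project.

```python
class CleanPipeline:
    def open_spider(self, spider):
        print("CleanPipeline opened")

    def process_item(self, item, spider):
        item = dict(item)
        item["name"] = item["name"].strip()
        return item  # returning the item passes it on to the next pipeline

    def close_spider(self, spider):
        print("CleanPipeline closed")


class SavePipeline:
    def open_spider(self, spider):
        self.saved = []  # stands in for a file handle or database connection

    def process_item(self, item, spider):
        self.saved.append(item)
        return item

    def close_spider(self, spider):
        print("saved", len(self.saved), "items")


def run_pipelines(pipelines, items, spider=None):
    """Toy driver that mimics the order in which the hooks are called."""
    for p in pipelines:
        p.open_spider(spider)    # once, when the spider opens
    for item in items:
        for p in pipelines:      # each item visits the pipelines in priority order
            item = p.process_item(item, spider)
    for p in pipelines:
        p.close_spider(spider)   # once, when the spider closes


save = SavePipeline()
run_pipelines([CleanPipeline(), save], [{"name": "  Alice "}, {"name": "Bob"}])
print(save.saved)
```

If CleanPipeline.process_item lacked its return statement, SavePipeline would receive None instead of the cleaned item.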

Saving data to MongoDB

itcast.py

......
    def parse(self, response):
        ...
        yield item  # the spider must yield the item to the engine for the pipelines to receive it
......

pipelines.py

from pymongo import MongoClient

class MongoPipeline(object):
    def open_spider(self, spider):
        # these are MongoDB's default host and port, so on a local machine both can be omitted
        con = MongoClient(host='127.0.0.1', port=27017)
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed;
        # dict(item) is harmless if an earlier pipeline already converted the item to a dict
        self.collection.insert_one(dict(item))
        return item

Enable the pipeline in settings.py

......
ITEM_PIPELINES = {
    'itcastspider.pipelines.MongoPipeline': 500,  # the smaller the value, the earlier it runs; itcastspider is the project name
}
......

Start MongoDB

MongoDB --> bin --> double-click mongod.exe

Check that the data was stored in MongoDB

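Saving data to MySQL

Middleware

process_request() intercepts requests on their way from the engine to the downloader

process_response() intercepts responses on their way from the downloader to the engine

scrapy shell

scrapy shell [url to fetch]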

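CrawlSpider

CrawlSpider compared with Spider

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),

Rule defines a link extraction rule; the rules are collected in a tuple, hence the surrounding () and the trailing comma

The automatic link extraction and request sending is implemented in the parse method that CrawlSpider overrides, so do not override parse yourself in a CrawlSpider

follow: whether to keep applying the LinkExtractor to the responses of the links it extracted. When omitted, it defaults to False if a callback is given and to True otherwise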
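
The allow argument is a regular expression that is matched against each link's URL. The sketch below imitates only that filtering step with the standard re module, on a made-up list of links; it is not LinkExtractor itself, which also handles extraction, deduplication, and its other options.

```python
import re

allow = re.compile(r"Items/")

# Hypothetical links found on a page.
links = [
    "http://example.com/Items/1.html",
    "http://example.com/about.html",
    "http://example.com/shop/Items/42",
]

# A link is kept when the pattern is found anywhere in its URL.
matched = [url for url in links if allow.search(url)]
print(matched)
```

Because the pattern is searched rather than anchored, a short pattern such as Items/ matches at any depth of the path.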

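Usage:

scrapy-redis

Install:

pip3 install scrapy_redis

scrapy-redis source code

Redis management tool (Windows)

How it works

Source code walkthrough

Running it

mycrawler_redis.py

Set redis_key to the Redis key that the start URLs will be pushed to

redis_key = 'py'

In Python 3, filter() returns an iterator, so wrap it in list()

self.allowed_domains = list(filter(None, domain.split(',')))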
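
Why the list() call is needed can be checked directly. The domain string below is made up; filter(None, ...) drops the empty strings that split(',') produces for a doubled or trailing comma.

```python
domain = "example.com,,example.org"

lazy = filter(None, domain.split(","))                   # Python 3: a lazy iterator, not a list
allowed_domains = list(filter(None, domain.split(",")))  # a real list with empty entries removed

print(type(lazy).__name__)
print(allowed_domains)
```

Without list(), allowed_domains would be a one-shot iterator that is empty after its first pass.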

REDIS_URL = "redis://127.0.0.1:6379"

Double-click redis-server.exe and then redis-cli.exe to start them

cd to the directory that contains mycrawler_redis.py

scrapy runspider mycrawler_redis.py

2021-06-26

In redis-cli.exe, push the start URL into the Redis database

Result:

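scrapy_splash

Splash documentation

Splash documentation (English) / Splash documentation (Chinese)

Concept

scrapy_splash is a Scrapy component for working with Splash, a JavaScript rendering service. Splash is a lightweight browser with an HTTP API, implemented in Python and Lua and built on Twisted and Qt.

What it does: loads a page's JavaScript the way a browser would and returns the page source once rendering has finished, similar to Selenium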

cd "C:\Program Files\Docker\Docker"

DockerCli.exe -SwitchDaemon

DISM /Online /Enable-Feature /All /FeatureName:Microsoft-Hyper-V

https://github.com/docker/toolbox/releases

Installing Docker for Windows on Win10, and fixes for some common problems

Registry mirrors

{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ]
}

Enable Hyper-V (from cmd, using DISM)

DISM /Online /Enable-Feature /All /FeatureName:Microsoft-Hyper-V

https://docs.docker.com/docker-for-windows/install/

https://yeasy.gitbook.io/docker_practice/install/mirror

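Project deployment (remote)

Scrapyd

Gerapy

Log output

Scrapy Hands-on Projects

robots.txt and UA in practice

Cookies in practice

Logging in to gitee with a cookie

1. Create the gitee project

scrapy startproject giteeproject

cd giteeproject
scrapy genspider giteespider gitee.com

2. Edit the gitee project

giteespider.py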

import scrapy


class GiteeSpider(scrapy.Spider):
    name = 'giteespider'  # must match the name passed to scrapy crawl
    # allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/profile/account_information']

    # override start_requests
    def start_requests(self):
        url = self.start_urls[0]
        temp = 'cookie string copied from a logged-in gitee session'
        # split the cookie string into key/value pairs, on the first '=' only
        cookies = {data.split('=', 1)[0]: data.split('=', 1)[1] for data in temp.split('; ')}
        # hand the request with cookies to the engine
        yield scrapy.Request(
            url=url,
            callback=self.parse,  # parse is the default callback, so this argument can be omitted
            cookies=cookies
        )

    def parse(self, response):
        title = response.xpath('//div[@class="user-info"]/a/text()').extract_first()
        print(title)

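settings.py

Uncomment (or add) and set ROBOTSTXT_OBEY, USER_AGENT, and LOG_LEVEL:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
USER_AGENT = 'Mozilla/5.0'  # spoof a browser User-Agent
LOG_LEVEL = "WARNING"  # minimum log level to print

The other files need no changes

3. Run the gitee project

scrapy crawl giteespider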

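Logging in to GitHub with a POST request

Test site: the GitHub login page

Approach

Open the GitHub login page, press F12 to open the developer tools, tick Network --> Preserve log, and click the Sign in button

You can see that the POST request goes to https://github.com/session and carries the username, password, and other parameters

Work out which parameters change: only authenticity_token, timestamp, and timestamp_secret vary between logins; the rest stay the same

Getting the values: start by searching the page itself; all three values can be found in the source of the login page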
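
Values like these usually sit in hidden input elements of the login form. The sketch below collects such fields with the standard library's HTMLParser; the form markup and its values are made up, and the real page will differ. In a spider the same values would normally be read with response.xpath().

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")


# Made-up login form; the real page's markup and values will differ.
login_html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok123">
  <input type="hidden" name="timestamp" value="1624000000000">
  <input type="hidden" name="timestamp_secret" value="sec456">
  <input type="text" name="login">
</form>
"""

parser = HiddenInputParser()
parser.feed(login_html)
print(parser.fields)
```

Only the hidden fields are collected; the visible login field is left for the spider to fill in.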

Create the GitHub spider project

scrapy startproject githubProject

cd githubProject

scrapy genspider githubSpider github.com

Fill in the code

In githubSpider.py:

import scrapy


class GithubspiderSpider(scrapy.Spider):
    name = 'githubSpider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # pull the values the POST must carry out of the login page source
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        # print(f'{authenticity_token}\n{timestamp}\n{timestamp_secret}')
        yield scrapy.FormRequest(  # send the login form with FormRequest
            'https://github.com/session',
            formdata={
                'commit': 'Sign in',
                'authenticity_token': authenticity_token,
                'login': 'your GitHub username',
                'password': 'your GitHub password',
                'webauthn-support': 'supported',
                'webauthn-iuvpaa-support': 'supported',
                'timestamp': timestamp,
                'timestamp_secret': timestamp_secret,
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # response.text is the decoded body; the word 'email' is used here as a sign of a logged-in page
        if 'email' in response.text:
            print('yes')
        else:
            print('error')

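Change or add the corresponding settings in settings.py:

USER_AGENT = 'Mozilla/5.0'  # spoof a browser User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

Run the GitHub spider project

scrapy crawl githubSpider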

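Logging in to gitee with a POST request (unfinished)

Press Ctrl+Shift+N to open an incognito window, go to the gitee login page, press F12 to open the developer tools, and tick Preserve log under Network

Enter your username and password, click the login button, and watch the Network panel: https://gitee.com/login receives a POST request carrying the username and password, followed by a 302 redirect

Log out and repeat the login: the authenticity_token and encrypt_data[user[password]] values in the login request change between attempts

IPs in practice

Items in practice

Pipelines in practice

Saving itcast teacher information to MongoDB

Target site

Source code

itcast.py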

import scrapy
from itcastspider.items import ItcastspiderItem

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    # allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajavaee']

    def parse(self, response):
        teachers = response.xpath('//div[@class="maincon"]/ul/li')
        for node in teachers:
            # temp={}
            item = ItcastspiderItem()
            item['name'] = node.xpath('.//div[@class="main_bot"]//text()').extract()
            item['desc'] = node.xpath('.//div[@class="main_mask"]//text()').extract()
            yield item

items.py

import scrapy

class ItcastspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    desc = scrapy.Field()

pipelines.py

from itemadapter import ItemAdapter
from pymongo import MongoClient

class MongoPipeline(object):
    def open_spider(self, spider):
        con = MongoClient()  # host and port can be omitted on a local machine
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # convert the item object to a dict
        item = dict(item)
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.collection.insert_one(item)
        return item

settings.py

ROBOTSTXT_OBEY = False
LOG_LEVEL = "WARNING"

ITEM_PIPELINES = {
   'itcastspider.pipelines.MongoPipeline': 200,
}

Saving data to MySQL

Middleware in practice

scrapy_redis in practice


Posted 2021-06-17 21:19 by Veryl
