day26 - Advanced web crawling

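5. Sending requests in code: crawling every page of a site
    Example 4: crawl all pages (project choutiAll) by sending requests manually, with start_urls = ['https://dig.chouti.com/r/pic/hot/1']
    Goal: parse every link title under the chouti picture section.
    # A URL template shared by all pages (%d is replaced by the page number, pageNum)
    url = 'https://dig.chouti.com/r/pic/hot/%d'
    The key step is calling parse again for each following page: yield scrapy.Request(url=url,callback=self.parse)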

# -*- coding: utf-8 -*-
import scrapy
from choutiAllPro.items import ChoutiallproItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    #allowed_domains = ['www.ddd.com']
    start_urls = ['https://dig.chouti.com/r/pic/hot/1']

    # URL template shared by all pages (%d is replaced by pageNum)
    url = 'https://dig.chouti.com/r/pic/hot/%d'
    pageNum = 1
    
    def parse(self, response):
        div_list = response.xpath('//div[@class="content-list"]/div')
        for div in div_list:
            title = div.xpath('./div[3]/div[1]/a/text()').extract_first()
            item = ChoutiallproItem()
            item['title'] = title
            
            yield item
        
        # Request the next page, if there is one
        if self.pageNum < 120:  # assume there are only 120 pages
            self.pageNum += 1
            url = self.url % self.pageNum
            # print(url)
            # Send the request manually. Each response yields one more request,
            # so parse is called again for every page; without yield no further request is sent.
            yield scrapy.Request(url=url, callback=self.parse)

chouti.py
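
As a side note to the listing above, the page URLs can also be produced up front from the same template instead of being built one at a time inside parse. This is a minimal sketch in plain Python; the helper name page_urls and the small page range are assumptions made for illustration, not part of the original project.

```python
def page_urls(template, first_page, last_page):
    """Yield one URL per page number, first_page..last_page inclusive."""
    for page in range(first_page, last_page + 1):
        # The %d placeholder in the template is filled with the page number
        yield template % page


urls = list(page_urls('https://dig.chouti.com/r/pic/hot/%d', 1, 3))
for u in urls:
    print(u)
```

Inside a spider, each of these URLs could be passed to scrapy.Request in start_requests, which avoids keeping a mutable pageNum counter on the class.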


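    //text() returns the text of an element and of all its descendants; /text() returns only the element's own direct text
    Scrapy handles cookies automatically, so later requests carry the cookies set by earlier responses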
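
To illustrate the /text() versus //text() note above without needing Scrapy, here is a small sketch that mimics both behaviours with the standard library's ElementTree. The helper names direct_text and all_text and the sample HTML are assumptions for illustration; Scrapy's own selectors evaluate real XPath instead.

```python
import xml.etree.ElementTree as ET


def direct_text(elem):
    """Text nodes that are direct children of elem, like XPath ./text()."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]


def all_text(elem):
    """Text nodes of elem and of every descendant, like XPath .//text()."""
    return list(elem.itertext())


div = ET.fromstring('<div>Hello <b>big</b> world</div>')
print(direct_text(div))  # ['Hello ', ' world']
print(all_text(div))     # ['Hello ', 'big', ' world']
```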
    
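    Example 5: Baidu Translate, sending a POST request, with cookies handled by Scrapy (project postPro)
    Override the parent class method:    
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata={'kw':'dog'})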

# -*- coding: utf-8 -*-
import scrapy

# Goal: send a POST request to each URL in start_urls
class PostSpider(scrapy.Spider):
    name = 'post'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://fanyi.baidu.com/sug']
    
    # A method of the Spider parent class: sends a request for each URL in start_urls in turn
    def start_requests(self):
        for url in self.start_urls:
            # yield scrapy.Request(url=url, callback=self.parse)  # sends a GET request by default
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata={'kw':'dog'})  # formdata carries the form parameters

    def parse(self, response):
        print(response.text)  # the result is a JSON string

post.py
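
What formdata amounts to on the wire can be sketched with the standard library alone: the dictionary is URL-encoded into the request body and sent with the POST method. The helper name build_form_post is an assumption for illustration, and the request is only built here, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, formdata):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(formdata).encode('utf-8')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return Request(url, data=body, headers=headers, method='POST')


req = build_form_post('https://fanyi.baidu.com/sug', {'kw': 'dog'})
print(req.get_method(), req.data)
```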

      
    Example 6: logging in (Douban Movies) with a POST request (project loginPro)
    A successful login returns the cookies that later requests need

# -*- coding: utf-8 -*-
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://accounts.douban.com/login']
    
    def start_requests(self):
        data = {
            'source':    'movie',
            'redir':    'https://movie.douban.com/',
            'form_email':    '15027900535',
            'form_password':    'bobo@15027900535',
            'login':    '登录',
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata=data)
    
    def getPageText(self,response):
        page_text = response.text
        with open('./douban.html','w',encoding='utf-8') as fp:
            fp.write(page_text)
            print('over')
    
    def parse(self, response):
        # Fetch the current user's profile page (if it shows user info, the cookie was sent; otherwise the login page comes back)
        url = 'https://www.douban.com/people/185687620/'
        yield scrapy.Request(url=url,callback=self.getPageText)
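
The reason the second request in the spider above is recognised as logged in is that the cookies from the login response are sent back in a Cookie header. Scrapy does this automatically; the sketch below only shows the idea with the standard library. The helper name cookie_header and the cookie values are made up for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie_values):
    """Turn Set-Cookie header values into the value of a Cookie request header."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # attributes such as Path are parsed but not sent back
    return '; '.join(f'{name}={morsel.value}' for name, morsel in jar.items())


# Hypothetical Set-Cookie values, as a login response might return them
header = cookie_header(['sessionid=abc123; Path=/', 'lang=en; Path=/'])
print(header)
```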


 
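6. Scrapy core components: the 5 core components
    Workflow summary:
    The engine calls the spider's start_requests method, which wraps each URL in start_urls (and each URL yielded later) in a request object. The engine hands these request objects to the scheduler, which removes duplicates
and places them in its queue. The scheduler then dispatches requests to the downloader, which fetches the pages from the internet. Each downloaded response goes back through the engine to the spider,
which parses it (the parse method is called), packs the parsed data into item objects and submits them to the pipeline, where they are persisted.
    Note: the scheduler holds a queue and deduplicates request objects.
    1. Engine: controls the data flow between all components and triggers their methods
    2. Scheduler: receives requests from the engine, pushes them onto a queue and drops duplicate URLs
    3. Downloader: downloads page content and returns it to the spider (the spider file in a Scrapy project)
    4. Spider (spiders): does the actual work, parsing the downloaded page data
    5. Pipeline: persists the data
    Internet: where the downloader fetches pages from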
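
The data flow described above can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented (Scrapy is asynchronous and far more elaborate); it only shows the roles. The function run_crawl, the fake SITE dictionary and the callbacks are assumptions made for illustration.

```python
from collections import deque


def run_crawl(start_urls, download, parse, pipeline):
    """Toy engine loop: scheduler queue + dedup, downloader, spider, pipeline."""
    queue = deque()  # the scheduler's queue
    seen = set()     # the scheduler's record of URLs already queued

    def schedule(url):
        if url not in seen:  # deduplication happens in the scheduler
            seen.add(url)
            queue.append(url)

    for url in start_urls:
        schedule(url)
    while queue:
        url = queue.popleft()
        page = download(url)             # downloader
        for result in parse(url, page):  # spider
            if isinstance(result, str):
                schedule(result)         # a new URL goes back to the scheduler
            else:
                pipeline(result)         # an item goes to the pipeline
    return seen


# A fake two-page "site": each page lists the links it contains
SITE = {'/1': ['/2', '/1'], '/2': ['/1']}
stored = []
visited = run_crawl(
    ['/1'],
    download=lambda url: SITE[url],
    parse=lambda url, links: [{'page': url}] + links,
    pipeline=stored.append,
)
print(stored)
```

Although the two pages link to each other, every page is downloaded only once because the scheduler drops URLs it has already seen.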

    Downloader middleware (sits between the engine and the downloader): can be used to switch proxy IPs
    Example 7: using a proxy middleware (project dailiPro)
    Write daili.py; in middlewares.py, edit the process_request method of DailiproDownloaderMiddleware
        def process_request(self, request, spider):
            # request is the intercepted request object
            request.meta['proxy'] = "https://151.106.15.3:1080"
            return None
     Enable it in DOWNLOADER_MIDDLEWARES in settings (lines 55-57)

# -*- coding: utf-8 -*-
import scrapy


class DailiSpider(scrapy.Spider):
    name = 'daili'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
       page_text = response.text
       with open('daili.html','w',encoding='utf-8') as fp:
           fp.write(page_text)

daili.py

# -*- coding: utf-8 -*-
from scrapy import signals


class DailiproDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # request is the intercepted request object
        request.meta['proxy'] = "https://151.106.15.3:1080"
        # request.meta={"https":"151.106.15.3:1080"}  # not recommended
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

middlewares.py
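
The middleware above always uses one fixed proxy. A common variation is to pick a proxy per request from a pool. The sketch below shows only that idea; the class name RandomProxyMiddleware, the FakeRequest stand-in and the proxy addresses (taken from a documentation-only IP range) are assumptions, not part of the original project.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a proxy per request."""

    def __init__(self, proxies, rng=None):
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Same hook as in the listing above: set the proxy on the intercepted request
        request.meta['proxy'] = self.rng.choice(self.proxies)
        return None  # None lets the request continue to the downloader


class FakeRequest:
    """Stand-in for a Scrapy request; only the meta dict is needed here."""

    def __init__(self):
        self.meta = {}


proxies = ['http://203.0.113.10:8080', 'http://203.0.113.11:8080']
middleware = RandomProxyMiddleware(proxies)
req = FakeRequest()
result = middleware.process_request(req, spider=None)
print(req.meta['proxy'])
```

In a real project the pool would have to be supplied when Scrapy builds the middleware, for example from a setting read in from_crawler; that wiring is left out here.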

 1 # -*- coding: utf-8 -*-
 2 
 3 # Scrapy settings for dailiPro project
 4 #
 5 # For simplicity, this file contains only settings considered important or
 6 # commonly used. You can find more settings consulting the documentation:
 7 #
 8 #     https://doc.scrapy.org/en/latest/topics/settings.html
 9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
11 
12 BOT_NAME = 'dailiPro'
13 
14 SPIDER_MODULES = ['dailiPro.spiders']
15 NEWSPIDER_MODULE = 'dailiPro.spiders'
16 
17 
18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 #USER_AGENT = 'dailiPro (+http://www.yourdomain.com)'
20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
21 # Obey robots.txt rules
22 ROBOTSTXT_OBEY = False
23 
24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 #CONCURRENT_REQUESTS = 32
26 
27 # Configure a delay for requests for the same website (default: 0)
28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
29 # See also autothrottle settings and docs
30 #DOWNLOAD_DELAY = 3
31 # The download delay setting will honor only one of:
32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 #CONCURRENT_REQUESTS_PER_IP = 16
34 
35 # Disable cookies (enabled by default)
36 #COOKIES_ENABLED = False
37 
38 # Disable Telnet Console (enabled by default)
39 #TELNETCONSOLE_ENABLED = False
40 
41 # Override the default request headers:
42 #DEFAULT_REQUEST_HEADERS = {
43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
44 #   'Accept-Language': 'en',
45 #}
46 
47 # Enable or disable spider middlewares
48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
49 #SPIDER_MIDDLEWARES = {
50 #    'dailiPro.middlewares.DailiproSpiderMiddleware': 543,
51 #}
52 
53 # Enable or disable downloader middlewares
54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
55 DOWNLOADER_MIDDLEWARES = {
56     'dailiPro.middlewares.DailiproDownloaderMiddleware': 543,
57 }
58 
59 # Enable or disable extensions
60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
61 #EXTENSIONS = {
62 #    'scrapy.extensions.telnet.TelnetConsole': None,
63 #}
64 
65 # Configure item pipelines
66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
67 #ITEM_PIPELINES = {
68 #    'dailiPro.pipelines.DailiproPipeline': 300,
69 #}
70 
71 # Enable and configure the AutoThrottle extension (disabled by default)
72 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
73 #AUTOTHROTTLE_ENABLED = True
74 # The initial download delay
75 #AUTOTHROTTLE_START_DELAY = 5
76 # The maximum download delay to be set in case of high latencies
77 #AUTOTHROTTLE_MAX_DELAY = 60
78 # The average number of requests Scrapy should be sending in parallel to
79 # each remote server
80 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 # Enable showing throttling stats for every response received:
82 #AUTOTHROTTLE_DEBUG = False
83 
84 # Enable and configure HTTP caching (disabled by default)
85 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 #HTTPCACHE_ENABLED = True
87 #HTTPCACHE_EXPIRATION_SECS = 0
88 #HTTPCACHE_DIR = 'httpcache'
89 #HTTPCACHE_IGNORE_HTTP_CODES = []
90 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 
92 #DEBUG  INFO  ERROR  WARNING
93 #LOG_LEVEL = 'ERROR'
94 
95 LOG_FILE = 'log.txt'

settings.py


     
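7. Log settings
Log levels: DEBUG  INFO  WARNING  ERROR
In settings, uncomment LOG_LEVEL = 'ERROR' to output only logs at level ERROR and above
LOG_FILE = 'log.txt' writes the log to a file; see the settings.py in section 6 above for the configuration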
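
Scrapy's logging is built on Python's standard logging module, so the effect of a level threshold can be tried out in isolation. The sketch below is plain Python and writes to an in-memory stream instead of log.txt; the logger name and messages are made up for illustration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger('log_level_demo')
logger.setLevel(logging.ERROR)  # same idea as LOG_LEVEL = 'ERROR'
logger.propagate = False        # keep the demo output out of the root logger
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
logger.addHandler(handler)

logger.debug('request scheduled')
logger.info('spider opened')
logger.warning('slow response')
logger.error('download failed')

output = stream.getvalue()
print(output)
```

Only the last message passes the threshold; the DEBUG, INFO and WARNING messages are dropped.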


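8. Passing data between requests: when the data to scrape is spread over several pages
  Open issue: the regex did not take effect (unresolved)
Example 8: passing data between requests, scraping movie detail data (project moviePro)
  Put values from different pages into the same item (name and actor)
  Send the request manually with yield
  Passing data: the meta argument of Request carries a value to the callback specified in the request; the callback reads it back from the response,
item = response.meta['item']. The list page supplies name and the second-level detail page supplies actor
  yield scrapy.Request(url=url,callback=self.getSencodPageText,meta={'item':item})
  
  def getSencodPageText(self,response):
    # 2. Receive the item object passed along by Request
    item = response.meta['item']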

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.dy2018.com/html/gndy/dyzz/']
    # Parses the data on a movie detail page
    def getSencodPageText(self,response):
        # 2. Receive the item object passed along by Request
        item = response.meta['item']
        actor = response.xpath('//*[@id="Zoom"]/p[16]/text()').extract_first()
        item['actor'] = actor
        
        yield item
        
    def parse(self, response):
        print(response.text)
        table_list = response.xpath('//div[@class="co_content8"]/ul/table')
        for table in table_list:
            url = "https://www.dy2018.com"+table.xpath('./tbody/tr[2]/td[2]/b/a/@href').extract_first()  # href is relative, so prepend the site origin
            name = table.xpath('./tbody/tr[2]/td[2]/b/a/text()').extract_first()
            print(url)
            item = MovieproItem()  # instantiate the item object
            item['name']=name
            
            # 1. Have Request pass the item object to getSencodPageText via meta
            yield scrapy.Request(url=url,callback=self.getSencodPageText,meta={'item':item})  # send the request manually

movie.py
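
The listing above builds the detail URL by string concatenation, which raises a TypeError when extract_first() returns None. A more defensive way to resolve a relative href is sketched below with the standard library; the helper name absolute_url and the sample href values are assumptions for illustration.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against the page it was found on; None if href is missing."""
    if not href:
        return None
    return urljoin(page_url, href)


page = 'https://www.dy2018.com/html/gndy/dyzz/'
print(absolute_url(page, '/i/100000.html'))  # hypothetical root-relative href
print(absolute_url(page, 'index_2.html'))    # hypothetical page-relative href
print(absolute_url(page, None))
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.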



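9. Using CrawlSpider: link extractors & rules
CrawlSpider can crawl every page of a site! (key point)
Example 9: using CrawlSpider to crawl all pages of the Qiushibaike picture section (project crawlPro)
Note: create the spider with scrapy genspider -t crawl qiubai www.xxx.com
    How to also get the link to the first page? Note that allow extracts the links that match the regex: link1 = LinkExtractor(allow=r'/pic/$')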

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']
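    # Link extractor (extracts the pagination links): extracts the specified links from the page source of the start URL
    # allow: a regular expression. Every link in the page source that matches it is extracted
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
    #href="/pic/page/5?s=5144132"
    
    link1 = LinkExtractor(allow=r'/pic/$')  # extracts every link that matches the regex
    #href="/pic/"
    rules = (
        # Rule: the pages behind the links found by the link extractor are parsed by the given callback
        # follow=True: keep applying the link extractor to the pages behind the links it extracted; with follow=False it is applied only to the start_urls pages, so only a few results appear.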
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

qiubai.py
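
Which links the two allow patterns above would pick up can be checked with the re module before running the spider. This sketch only tests the regular expressions against sample URLs; it does not use LinkExtractor itself, and the third URL is an assumed non-matching example.

```python
import re

PAGE_LINK = re.compile(r'/pic/page/\d+\?s=\d+')  # pattern used by link
FIRST_PAGE_LINK = re.compile(r'/pic/$')          # pattern used by link1


def matching_links(pattern, urls):
    """Return the URLs in which the pattern is found."""
    return [u for u in urls if pattern.search(u)]


urls = [
    'https://www.qiushibaike.com/pic/',
    'https://www.qiushibaike.com/pic/page/5?s=5144132',
    'https://www.qiushibaike.com/text/',
]
print(matching_links(PAGE_LINK, urls))
print(matching_links(FIRST_PAGE_LINK, urls))
```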


    
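10. Distributed crawling: several machines crawl the same site together (key point)
Install the Redis package in PyCharm (the example below imports scrapy_redis)

Example 10: distributed crawling of chouti section 42 (project redisPro)
# Crawl the URLs of all images in chouti section 42
Items are submitted to the Redis-backed pipeline
In settings.py: ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}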

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisPro.items import RedisproItem
# 0. Import the RedisCrawlSpider class
# 1. Change the spider's parent class to RedisCrawlSpider
# 2. Replace start_urls with the redis_key attribute
# 3. Write the parsing code
# 4. Submit items to the pipeline provided by scrapy-redis (in settings.py: ITEM_PIPELINES = {
#     'scrapy_redis.pipelines.RedisPipeline': 400
# })
# 5. Submit every request generated by the spider to the scheduler provided by scrapy-redis (settings.py, lines 95-100)
# 6. Specify in the settings which Redis database stores the scraped data (settings.py, lines 105-108)
# 7. Edit the Redis config file (redis.windows.conf): protected-mode no   #bind 127.0.0.1
# 8. Run the spider file: scrapy runspider xxx.py
# 9. Push a start URL into the scheduler queue
class ChoutiSpider(RedisCrawlSpider):
    name = 'chouti'
    #allowed_domains = ['www.xxx.com']
    #start_urls = ['http://www.xxx.com/']
    # Name of the scheduler queue: the start URL is pushed into the queue with this name
    redis_key = "chouti"
    
    rules = (
        Rule(LinkExtractor(allow=r'/r/news/hot/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        
        imgUrl_list =  response.xpath('//div[@class="news-pic"]/img/@src').extract()
        for url in imgUrl_list:
            item = RedisproItem()
            item['url'] = url
            
            yield item

chouti.py

  1 # -*- coding: utf-8 -*-
  2 
  3 # Scrapy settings for redisPro project
  4 #
  5 # For simplicity, this file contains only settings considered important or
  6 # commonly used. You can find more settings consulting the documentation:
  7 #
  8 #     https://doc.scrapy.org/en/latest/topics/settings.html
  9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 11 
 12 BOT_NAME = 'redisPro'
 13 
 14 SPIDER_MODULES = ['redisPro.spiders']
 15 NEWSPIDER_MODULE = 'redisPro.spiders'
 16 
 17 
 18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 19 #USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
 20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
 21 # Obey robots.txt rules
 22 ROBOTSTXT_OBEY = False
 23 
 24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 25 #CONCURRENT_REQUESTS = 32
 26 
 27 # Configure a delay for requests for the same website (default: 0)
 28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
 29 # See also autothrottle settings and docs
 30 #DOWNLOAD_DELAY = 3
 31 # The download delay setting will honor only one of:
 32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 33 #CONCURRENT_REQUESTS_PER_IP = 16
 34 
 35 # Disable cookies (enabled by default)
 36 #COOKIES_ENABLED = False
 37 
 38 # Disable Telnet Console (enabled by default)
 39 #TELNETCONSOLE_ENABLED = False
 40 
 41 # Override the default request headers:
 42 #DEFAULT_REQUEST_HEADERS = {
 43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 44 #   'Accept-Language': 'en',
 45 #}
 46 
 47 # Enable or disable spider middlewares
 48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 49 #SPIDER_MIDDLEWARES = {
 50 #    'redisPro.middlewares.RedisproSpiderMiddleware': 543,
 51 #}
 52 
 53 # Enable or disable downloader middlewares
 54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 55 #DOWNLOADER_MIDDLEWARES = {
 56 #    'redisPro.middlewares.RedisproDownloaderMiddleware': 543,
 57 #}
 58 
 59 # Enable or disable extensions
 60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
 61 #EXTENSIONS = {
 62 #    'scrapy.extensions.telnet.TelnetConsole': None,
 63 #}
 64 
 65 # Configure item pipelines
 66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 67 ITEM_PIPELINES = {
 68     'scrapy_redis.pipelines.RedisPipeline': 400
 69 
 70 #    'redisPro.pipelines.RedisproPipeline': 300,
 71 
 72 }
 73 
 74 # Enable and configure the AutoThrottle extension (disabled by default)
 75 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
 76 #AUTOTHROTTLE_ENABLED = True
 77 # The initial download delay
 78 #AUTOTHROTTLE_START_DELAY = 5
 79 # The maximum download delay to be set in case of high latencies
 80 #AUTOTHROTTLE_MAX_DELAY = 60
 81 # The average number of requests Scrapy should be sending in parallel to
 82 # each remote server
 83 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 84 # Enable showing throttling stats for every response received:
 85 #AUTOTHROTTLE_DEBUG = False
 86 
 87 # Enable and configure HTTP caching (disabled by default)
 88 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 89 #HTTPCACHE_ENABLED = True
 90 #HTTPCACHE_EXPIRATION_SECS = 0
 91 #HTTPCACHE_DIR = 'httpcache'
 92 #HTTPCACHE_IGNORE_HTTP_CODES = []
 93 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
 94 
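 95 # Use the deduplication filter provided by scrapy-redis
 96 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 97 # Use the scheduler provided by scrapy-redis
 98 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 99 # Keep the queue in Redis so the crawl can be paused and resumed
100 SCHEDULER_PERSIST = True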
101 
102 
103 
104 
105 REDIS_HOST = '192.168.12.65'
106 REDIS_PORT = 6379
107 #REDIS_ENCODING = 'utf-8'
108 #REDIS_PARAMS = {'password':'123456'}

settings.py
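
The shared deduplication filter configured above lets several machines avoid requesting the same page twice, because all of them check one common set of request fingerprints. The sketch below shows that idea in plain Python. The fingerprint function is a deliberately simplified assumption, not the algorithm Scrapy or scrapy-redis actually uses, and an in-memory set stands in for the set kept in Redis.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Simplified request fingerprint: method + URL with sorted query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))
    return hashlib.sha1(f'{method.upper()} {canonical}'.encode('utf-8')).hexdigest()


class SharedDupeFilter:
    """All workers consult the same fingerprint set (a Redis set in practice)."""

    def __init__(self):
        self.fingerprints = set()  # in-memory stand-in for the shared store

    def seen_before(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SharedDupeFilter()
first = dupefilter.seen_before('GET', 'https://dig.chouti.com/r/news/hot/1')
second = dupefilter.seen_before('GET', 'https://dig.chouti.com/r/news/hot/1')
print(first, second)
```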


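In the Redis config file, comment out line 56 (bind 127.0.0.1) and on line 75 change protected-mode to no
To run:
1. Start the Redis server: go to the Redis directory and run redis-server ./redis.windows.conf in cmd
2. Start the Redis client: redis-cli

3. Run the spider file: in cmd, change to the directory F:\Python自动化21期\3.Django&项目\day26 爬虫1104\课上代码及笔记\scrapy项目\redisPro\redisPro\spiders,
then run scrapy runspider chouti.py; it will pause, listening for a start URL

4. In Redis (redis-cli):
lpush chouti https://dig.chouti.com/r/news/hot/1    after this, the spider's cmd window starts crawling

5. View the scraped data in Redis
keys *    (chouti:items should be listed)
lrange chouti:items 0 -1 

To delete the data, in redis-cli:
flushall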
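
Steps 4 and 5 above can also be done from Python rather than from redis-cli. The sketch below keeps the two helpers independent of any particular client: they only call lpush and lrange on whatever client object is passed in. The FakeRedis class is an in-memory stand-in written for this example, and the assumption that items under chouti:items are stored as JSON strings should be checked against your own scrapy-redis configuration.

```python
import json


class FakeRedis:
    """In-memory stand-in for the list commands used below."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lrange(self, key, start, end):
        values = self.lists.get(key, [])
        stop = len(values) if end == -1 else end + 1
        return values[start:stop]


def seed_start_url(client, redis_key, url):
    """Step 4: push the start URL onto the list the spider is listening on."""
    return client.lpush(redis_key, url)


def read_items(client, items_key):
    """Step 5: read every stored item, assuming each one is a JSON string."""
    return [json.loads(raw) for raw in client.lrange(items_key, 0, -1)]


client = FakeRedis()
seed_start_url(client, 'chouti', 'https://dig.chouti.com/r/news/hot/1')
# Simulate what the pipeline would store; the image URL is made up
client.rpush('chouti:items', json.dumps({'url': 'https://example.com/a.jpg'}))
items = read_items(client, 'chouti:items')
print(items)
```

With the redis package installed, a real client created with redis.Redis(host=..., port=6379) could be passed to the same two helpers in place of FakeRedis.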

  
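Recap (18:40-18:50), summary of answers:
1. Two crawler modules: requests and urllib
2. Role of the robots protocol: it stops the honest but not the dishonest; a common anti-crawling measure
3. Use the Yundama captcha-solving service or manual recognition. Note: captchas are also an anti-crawling measure used by portal sites
4. Three parsing methods: xpath, BeautifulSoup, regex
5. selenium (can execute JS code) with PhantomJS or headless Chrome
6. Important! Data encryption (the downloaded data is ciphertext) and dynamically loaded data (Pear Video)
token: the value of rkey at login
7. Five: spider, engine, scheduler, downloader, pipeline
8. Spider/CrawlSpider/RedisCrawlSpider
9. The 10 steps summarized above; try them yourself (keep a distributed-crawl sample)
10. Not covered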




Put the content you want to capture inside parentheses

posted @ 2018-11-28 22:09 yuyou123
