I. Basics
- The Scrapy framework
    Introduction: a large, full-featured crawling framework.
    Installation:
        - Windows:
            Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
            pip3 install wheel
            pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
            pip3 install pywin32
            pip3 install scrapy
        - Linux:
            pip3 install scrapy
    Usage (side by side with Django, for comparison):
        Django:
            # create a project
            django-admin startproject mysite
            cd mysite
            # create apps
            python manage.py startapp app01
            python manage.py startapp app02
            # start the project
            python manage.py runserver
        Scrapy:
            # create a project
            scrapy startproject xdb
            cd xdb
            # create spiders
            scrapy genspider chouti chouti.com
            scrapy genspider cnblogs cnblogs.com
            # run a spider
            scrapy crawl chouti
    1. Create a project
        scrapy startproject <project name>

        <project name>/
            <project name>/
                - spiders/          # spider files
                    - chouti.py
                    - cnblogs.py
                    ....
                - items.py          # persistence
                - pipelines.py      # persistence
                - middlewares.py    # middleware
                - settings.py       # configuration (crawler)
            scrapy.cfg              # configuration (deployment)
    2. Create spiders
        cd <project name>
        scrapy genspider chouti chouti.com
        scrapy genspider cnblogs cnblogs.com
    3. Run a spider
        scrapy crawl chouti
        scrapy crawl chouti --nolog    # --nolog suppresses the log output
    Summary:
        - HTML parsing: xpath
        - Issuing further requests: yield a Request object

II. Example: crawling Chouti

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.http.response.html import HtmlResponse  # type of `response`; only useful for IDE hints
# Optional workaround for console encoding errors (e.g. on Windows):
# import sys,os,io
# sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response,type(response)) # a response object
        # print(response.text)
        """
        The BeautifulSoup equivalent of the first xpath below:
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.text,'html.parser')
        content_list = soup.find('div',attrs={'id':'content-list'})
        """
        # Among all descendants, find the div with id=content-list, then its item divs
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        with open('news.log', mode='a+', encoding='utf-8') as f:
            for item in item_list:
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                if href is None:
                    continue  # no link in this item; nothing to record
                print(href, (text or '').strip())  # extract_first() returns None when nothing matches
                f.write(href+'\n')

        # Follow the pagination links and parse each page with this same method
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page,callback=self.parse) # e.g. https://dig.chouti.com/all/hot/recent/2
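The spider above builds each next-page URL by prepending a fixed host. A somewhat more robust option is to resolve the href against the URL of the page it came from; inside a spider, `response.urljoin(href)` does this for you. The sketch below shows the same resolution with only the standard library. The base URL and hrefs are illustrative assumptions, not values taken from the site.

```python
from urllib.parse import urljoin

# Assumed example values; in a spider the base would be response.url.
base_url = "https://dig.chouti.com/all/hot/recent/1"
hrefs = ["/all/hot/recent/2", "https://dig.chouti.com/all/hot/recent/3"]

def resolve_links(base, links):
    """Resolve relative or absolute hrefs against the page they came from."""
    return [urljoin(base, link) for link in links]

next_pages = resolve_links(base_url, hrefs)
for url in next_pages:
    print(url)
```

Unlike string concatenation, this keeps working when an href is already absolute.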

III. Key points

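    Part 1: the Scrapy framework
        1. Scrapy depends on Twisted
            Internally, it uses an event-loop mechanism to crawl concurrently.
            Before (sequential; each request blocks until it finishes):
                import requests

                url_list = ['http://www.baidu.com','http://www.baidu.com','http://www.baidu.com',]
                
                for item in url_list:
                    response = requests.get(item)
                    print(response.text)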
            
            Now (concurrent, with Twisted):
                from twisted.web.client import getPage  # deprecated in newer Twisted releases
                from twisted.internet import defer, reactor

                # Part 1: the agent starts accepting tasks
                def callback(contents):
                    print(contents)

                deferred_list = [] # one Deferred per pending request
                url_list = ['http://www.bing.com', 'https://segmentfault.com/','https://stackoverflow.com/' ]
                for url in url_list:
                    deferred = getPage(bytes(url, encoding='utf8')) # returns a Deferred immediately, without blocking
                    deferred.addCallback(callback)
                    deferred_list.append(deferred)


                # Part 2: once the agent has finished every task, stop
                dlist = defer.DeferredList(deferred_list)

                def all_done(arg):
                    reactor.stop()

                dlist.addBoth(all_done)

                # Part 3: tell the agent to start working (run the event loop)
                reactor.run()
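To make the event-loop idea concrete without a network connection, here is a minimal sketch that swaps Twisted for the standard library's asyncio and replaces the real downloads with a timed wait. `fake_download` and `crawl` are hypothetical names introduced only for this illustration; the structure (schedule everything, then wait for all of it) mirrors the DeferredList pattern above.

```python
import asyncio
import time

async def fake_download(url, delay):
    """Stand-in for a network request: waits without blocking the loop."""
    await asyncio.sleep(delay)
    return f"body of {url}"

async def crawl(urls, delay=0.3):
    # Schedule every download at once; the event loop interleaves the waits.
    tasks = [fake_download(url, delay) for url in urls]
    return await asyncio.gather(*tasks)

url_list = ["http://www.bing.com", "https://segmentfault.com/", "https://stackoverflow.com/"]
started = time.perf_counter()
bodies = asyncio.run(crawl(url_list))
elapsed = time.perf_counter() - started
print(bodies, round(elapsed, 2))
```

Because the three waits overlap, the total time is close to one delay rather than three, which is the benefit Scrapy gets from Twisted.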
        
        2. Scrapy
            Commands:
                scrapy startproject xx 
                cd xx 
                scrapy genspider chouti chouti.com 
                
                scrapy crawl chouti --nolog 
            
            Writing a spider:
                def parse(self,response):
                    # 1. The response
                    # response wraps all the data related to the response:
                    #   - response.text
                    #   - response.encoding
                    #   - response.body
                    #   - response.request  # the request that produced this response; it holds the URL to visit and the callback to run once the download completes
                    # 2. Parsing
                    # response.xpath('//div[@href="x1"]/a').extract_first()
                    # response.xpath('//div[@href="x1"]/a').extract()
                    # response.xpath('//div[@href="x1"]/a/text()').extract()
                    # response.xpath('//div[@href="x1"]/a/@href').extract()
                    tag_list = response.xpath('//div[@href="x1"]/a')
                    for tag in tag_list:
                        tag.xpath('.//p/text()').extract_first()
                        
                    # 3. Issuing further requests
                    # yield Request(url='xxxx',callback=self.parse)
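The two-step pattern above (select a list of nodes, then run a relative query inside each one) can be tried without Scrapy. The sketch below uses the standard library's ElementTree as a stand-in, which supports only a subset of XPath: it has no `text()` or `@href` steps, so text and attributes are read with `.text` and `.get()`. The markup is a hypothetical, well-formed fragment shaped like the list page from the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the list page parsed above.
html = """
<html><body>
  <div id="content-list">
    <div class="item"><a href="/link/1"> First story </a></div>
    <div class="item"><a href="/link/2">Second story</a></div>
  </div>
</body></html>
"""

root = ET.fromstring(html.strip())

# Step 1: one query from the document root selects every item node.
items = root.findall(".//div[@id='content-list']/div[@class='item']")

# Step 2: a relative query (leading ".") searches only inside each item.
rows = []
for item in items:
    link = item.find(".//a")
    rows.append((link.get("href"), link.text.strip()))

print(rows)
```

In a spider, the equivalent inner queries would be `item.xpath('.//a/@href')` and `item.xpath('.//a/text()')`, as in the Chouti example.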

IV. Persistence

Today's topics: Scrapy
    - Persistence: pipelines/items
    
    - Deduplication
    
    - Cookies
    
    - Component flow:
        - Downloader middleware
        
    - Depth
    
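Details:

    1. Persistence
        Shortcomings of the current approach (writing the file inside parse):
            - There is no way to open the connection once when the spider starts and close it once when the spider closes.
            - There is no clear division of labor: parsing and persistence are mixed in one method.
        pipeline/items
            a. First write the pipeline class
                class XdbPipeline(object):
                    def process_item(self, item, spider):
                        return item
                        
            b. Write the Item class
                class XdbItem(scrapy.Item):
                    href = scrapy.Field()
                    title = scrapy.Field()
                            
            c. Configure (in settings.py)
                ITEM_PIPELINES = {
                   'xdb.pipelines.XdbPipeline': 300,
                }
            
            d. In the spider, every yield of an item triggers one call to process_item.
                
                yield <Item object>    # e.g. XdbItem(href=href, title=text)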
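As a rough mental model of steps c and d, the toy driver below calls `process_item` on each pipeline in ascending order of its `ITEM_PIPELINES` number (lower numbers run first) and stops the chain for an item when a pipeline raises `DropItem`. This is a simplified sketch, not Scrapy's actual implementation: `run_pipelines`, the three pipeline classes, and the local `DropItem` stand-in are all hypothetical names used only for illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class UpperTitlePipeline:
    def process_item(self, item, spider):
        item["title"] = item["title"].upper()
        return item  # hand the item to the next pipeline

class SkipEmptyHrefPipeline:
    def process_item(self, item, spider):
        if not item["href"]:
            raise DropItem("missing href")  # later pipelines are skipped
        return item

class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item

def run_pipelines(item_pipelines, items, spider=None):
    """Toy driver: call process_item in ascending priority order per item."""
    ordered = [p for p, _ in sorted(item_pipelines.items(), key=lambda kv: kv[1])]
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue  # this item is discarded; move on to the next one

collector = CollectPipeline()
item_pipelines = {collector: 300, SkipEmptyHrefPipeline(): 200, UpperTitlePipeline(): 100}
yielded = [{"href": "/link/1", "title": "first"}, {"href": "", "title": "second"}]
run_pipelines(item_pipelines, yielded)
print(collector.seen)
```

Only the first item reaches the collector; the second is dropped at priority 200 and never arrives at priority 300.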
        
        Writing a pipeline:
            from scrapy.exceptions import DropItem

            class FilePipeline(object):

                def __init__(self,path):
                    self.f = None
                    self.path = path

                @classmethod
                def from_crawler(cls, crawler):
                    """
                    Called at initialization to create the pipeline object.
                    :param crawler:
                    :return:
                    """
                    print('File.from_crawler')
                    # HREF_FILE_PATH is a custom setting defined in settings.py
                    path = crawler.settings.get('HREF_FILE_PATH')
                    return cls(path)

                def open_spider(self,spider):
                    """
                    Called when the spider starts running.
                    :param spider:
                    :return:
                    """
                    print('File.open_spider')
                    self.f = open(self.path,'a+')

                def process_item(self, item, spider):
                    # The old approach reopened the file for every item:
                    # f = open('xx.log','a+')
                    # f.write(item['href']+'\n')
                    # f.close()
                    print('File',item['href'])
                    self.f.write(item['href']+'\n')
                    
                    # return item      # hand the item to the next pipeline's process_item
                    raise DropItem()   # later pipelines' process_item methods will not run

                def close_spider(self,spider):
                    """
                    Called when the spider closes.
                    :param spider:
                    :return:
                    """
                    print('File.close_spider')
                    self.f.close()


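        Note: pipelines are shared by all spiders in the project; to customize behavior for one particular spider, check the spider argument yourself (for example, spider.name).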
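One way to act on that note is to branch on `spider.name` inside each hook. The sketch below is a minimal, hypothetical pipeline (`ChoutiOnlyFilePipeline` is not part of the original project) that persists items only for the spider named chouti and passes everything else through. To keep it runnable without Scrapy, the hooks are driven by hand and `SimpleNamespace` objects stand in for real spiders.

```python
import os
import tempfile
from types import SimpleNamespace

class ChoutiOnlyFilePipeline:
    """Writes hrefs to a file, but only for the spider named 'chouti'."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        if spider.name == "chouti":
            self.f = open(self.path, "a+", encoding="utf-8")

    def process_item(self, item, spider):
        if spider.name == "chouti":
            self.f.write(item["href"] + "\n")
        return item  # other spiders' items pass through untouched

    def close_spider(self, spider):
        if self.f is not None:
            self.f.close()
            self.f = None

# Drive the hooks by hand; SimpleNamespace stands in for real spider objects.
path = os.path.join(tempfile.mkdtemp(), "news.log")
for name in ("chouti", "cnblogs"):
    spider = SimpleNamespace(name=name)
    pipeline = ChoutiOnlyFilePipeline(path)
    pipeline.open_spider(spider)
    pipeline.process_item({"href": "https://example.com/" + name}, spider)
    pipeline.close_spider(spider)

with open(path, encoding="utf-8") as fh:
    written = fh.read().splitlines()
print(written)
```

Returning the item unchanged for other spiders keeps the rest of the pipeline chain working for them.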