Scrapy: Basic Concepts and Usage

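Framework

  A framework is essentially a half-finished project: a highly reusable project template that already integrates many capabilities (high-performance asynchronous downloading, queues, distributed crawling, parsing, persistence, and so on).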

Installation

Linux:

pip3 install scrapy  # use pip or pip3, whichever one belongs to your Python 3 installation

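Windows:

a. Install wheel

pip3 install wheel

b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

c. Change to the directory that contains the downloaded file and install Twisted

pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl  # cp35 is the Python version tag (CPython 3.5); pick the file that matches your interpreter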
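Since the wheel has to match the local interpreter, it can help to see what the parts of the filename mean. The sketch below splits a wheel filename into its fields and compares the Python tag with the running interpreter. The helper names `parse_wheel_filename` and `current_python_tag` are made up for illustration, and the comparison assumes CPython and a simple (non-compound) Python tag.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # A wheel name has 5 fields, or 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_python_tag():
    """Return the tag of the running interpreter, e.g. 'cp35' (assumes CPython)."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


info = parse_wheel_filename("Twisted-17.1.0-cp35-cp35m-win_amd64.whl")
print("wheel tag:", info["python_tag"], "| interpreter tag:", current_python_tag())
print("matches this interpreter:", info["python_tag"] == current_python_tag())
```

If the two tags differ, pip will normally refuse the file as unsupported on this platform, so a wheel built for the right Python version is needed.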

d. Install pywin32

pip3 install pywin32

e. Install Scrapy

pip3 install scrapy
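After the installation steps, a quick way to confirm that the packages are visible to the interpreter is to ask the import system whether it can find them. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


for module_name in ("scrapy", "twisted"):
    status = "installed" if is_installed(module_name) else "missing"
    print(module_name + ": " + status)
```

Run it with the same Python that pip3 installs into; a "missing" result usually means pip and the interpreter point at different installations.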

Usage

Creating a project

scrapy startproject xxoo  # xxoo is the project name

Creating a spider file

  First change into the project directory

cd xxoo  # xxoo is the project name

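  Then create the spider file

scrapy genspider ooxx www.xxoo.com  # ooxx is the spider name, www.xxoo.com is the start URL

  The spider file is created automatically in the spiders folder.

  After running the commands above, the project has the following file structure: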

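xxoo/
|-- xxoo/
|   |-- spiders/          # where spider files live; can hold multiple spiders
|   |   |-- __init__.py
|   |   |-- ooxx.py       # the spider file created above
|   |-- __init__.py
|   |-- items.py          # item definitions; used together with pipelines
|   |-- middlewares.py    # middlewares
|   |-- pipelines.py      # pipelines: receive the parsed data and persist it
|   |-- settings.py       # project settings
|-- scrapy.cfg            # Scrapy's own configuration file; best not to edit it casually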
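To confirm that a generated project matches the layout above, the file list can be checked with a short script. This is only a sketch: the helper `missing_files` is hypothetical, and the paths assume the project and spider names used in this post (xxoo and ooxx).

```python
from pathlib import Path

# Files from the layout above, relative to the outer project directory.
EXPECTED_FILES = [
    "scrapy.cfg",
    "xxoo/__init__.py",
    "xxoo/items.py",
    "xxoo/middlewares.py",
    "xxoo/pipelines.py",
    "xxoo/settings.py",
    "xxoo/spiders/__init__.py",
    "xxoo/spiders/ooxx.py",
]


def missing_files(project_root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains the outer xxoo folder.
    missing = missing_files("xxoo")
    print("layout OK" if not missing else "missing: " + ", ".join(missing))
```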

  The code in the spider file ooxx.py:

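# -*- coding: utf-8 -*-
import scrapy


# Scrapy provides several spider base classes; Spider is one of them.
# A spider crawls pages and parses the data in them.
class OoxxSpider(scrapy.Spider):  # The class name is derived from the spider name: the capitalized name followed by "Spider"
    name = 'ooxx'  # Spider name; used to locate and run this spider
    allowed_domains = ['www.xxoo.com']  # Allowed domains
    start_urls = ['https://www.xxoo.com/']  # List of start URLs; filled in from the URL given when the spider was created, and can be changed

    # Parsing callback: response is the response object for a start URL
    def parse(self, response):
        print(response)
        print(response.text)  # Response body as a string
        print(response.body)  # Response body as bytes
        response.xpath('//title/text()')  # Example XPath expression; replace it with your own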
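The naming rule mentioned in the comment on the class line can be written out as a small function. This is an approximation for illustration, not Scrapy's own code: the function name `spider_class_name` is hypothetical, and the handling of underscores (each word capitalized, then joined) is an assumption.

```python
def spider_class_name(spider_name):
    """Build a class name such as 'OoxxSpider' from a spider name such as 'ooxx'."""
    # Assumption: underscores separate words, and each word is capitalized.
    words = spider_name.split("_")
    return "".join(word.capitalize() for word in words) + "Spider"


print(spider_class_name("ooxx"))           # OoxxSpider
print(spider_class_name("douban_movies"))  # DoubanMoviesSpider
```

The class name is only a convention; Scrapy locates a spider through its name attribute, not through the class name.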

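  allowed_domains is often commented out. When it is left active, Scrapy's offsite filtering drops follow-up requests whose host is neither one of the allowed domains nor a subdomain of one. Images and other page resources are frequently served from a different domain, so requests for them would be filtered out; this is why allowed_domains is commonly commented out.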
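The domain restriction can be pictured as a simple host check. The sketch below is a simplified stand-in written with the standard library, not Scrapy's actual filtering code; the function `is_allowed` and the image host `img.cdn-example.com` are made up for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    if not allowed_domains:
        # An empty (or commented-out) list means no domain restriction.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed = ["www.xxoo.com"]
print(is_allowed("https://www.xxoo.com/page/2", allowed))        # True
print(is_allowed("https://img.cdn-example.com/a.jpg", allowed))  # False
print(is_allowed("https://img.cdn-example.com/a.jpg", []))       # True
```

The second call shows the typical problem: an image hosted elsewhere fails the check, while with no restriction (third call) it passes.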

  start_urls may contain several URLs; parse() is called once for each URL's response. In practice start_urls usually holds a single URL, typically the site's home page.

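Running

  Run the following command in a terminal (cmd):

scrapy crawl ooxx  # ooxx is the spider name

  Running the command above prints the spider's output together with log messages. Usually only the WARNING and ERROR level messages are of interest.

scrapy crawl ooxx --nolog  # print only the results and suppress log output, which also reduces CPU usage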
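If the goal is to keep WARNING and ERROR messages while hiding the rest, rather than silencing logging entirely, Scrapy's LOG_LEVEL setting can be used. A minimal settings.py fragment, assuming the default project layout:

```python
# settings.py (fragment)
# Only emit log messages at WARNING level or above (WARNING, ERROR, CRITICAL).
LOG_LEVEL = "WARNING"
```

The same level can be chosen for a single run with scrapy crawl ooxx -L WARNING, which leaves settings.py untouched.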

 

  Configuring settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for firstblood project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'firstblood'

SPIDER_MODULES = ['firstblood.spiders']
NEWSPIDER_MODULE = 'firstblood.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Spoof the User-Agent so that requests appear to come from a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# Obey robots.txt rules
# When ROBOTSTXT_OBEY is True, robots.txt is obeyed; when it is False, it is ignored
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# False: cookies are not handled; True: cookies are handled. With this line commented out,
# cookies are handled by default, which costs resources on every request and slows the crawl.
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'firstblood.middlewares.FirstbloodSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'firstblood.middlewares.FirstbloodDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'firstblood.pipelines.FirstbloodPipeline': 300,  # 300 is the priority; the lower the number, the higher the priority
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
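The priority numbers in ITEM_PIPELINES decide the order in which pipelines receive each item: lower numbers come first. The sketch below illustrates that ordering with plain Python. Apart from FirstbloodPipeline, the pipeline class names are hypothetical, and `pipeline_order` is an illustrative helper rather than a Scrapy function.

```python
# Hypothetical pipelines; only FirstbloodPipeline appears in the settings file above.
ITEM_PIPELINES = {
    "firstblood.pipelines.SaveToDatabasePipeline": 400,
    "firstblood.pipelines.FirstbloodPipeline": 300,
    "firstblood.pipelines.CleanTextPipeline": 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by their priority number, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
    print(position, path)
```

With these values, an item would pass through CleanTextPipeline first, then FirstbloodPipeline, and finally SaveToDatabasePipeline.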

 

posted @ 2019-01-14 15:37  AKA绒滑服贵