Scrapy crawler notes (creating and running a new project)
For the initial installation, see: Scrapy crawler notes (installation)
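1. Once the installation is confirmed to be working, create the new project from the Windows command prompt (cmd)
First, pick a folder of your own (mine is E:\study\python_anaconda_pf\MyProject\scrapy_study) and create a project in it. My project is named movie_250
Hold Shift and right-click an empty area of the folder, then choose "Open command window here". Alternatively, cd to this folder in cmd
Enter the command scrapy startproject movie_250
The folder now contains an automatically generated subfolder named after the project. This subfolder is called the "project folder"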
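2. Open PyCharm, locate this folder, and look at its directory structure (everything is generated automatically, so there is no need to rename anything)
What each file means:
scrapy.cfg is the project's configuration file. Its default content is: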
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = movie_250.settings

[deploy]
#url = http://localhost:6800/
project = movie_250
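Apart from the comments, it declares two things:
The default settings file is the settings module inside the project package (movie_250.settings)
The project name is movie_250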
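To make those two declarations concrete, here is a minimal sketch that reads the same content with Python's standard configparser module. The content is embedded as a string so the sketch runs without a project on disk. This is only an illustration of the file's structure, not how Scrapy itself loads the file.

```python
import configparser

# The default scrapy.cfg content shown above (comments shortened).
CFG_TEXT = """
# Automatically created by: scrapy startproject

[settings]
default = movie_250.settings

[deploy]
#url = http://localhost:6800/
project = movie_250
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# 1) Where the settings module lives
settings_module = parser.get("settings", "default")
# 2) The project name
project_name = parser.get("deploy", "project")

print(settings_module)
print(project_name)
```

Lines starting with # are treated as comments, which is why the commented-out url entry does not appear as an option.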
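items.py defines the items the crawler collects, which you can think of as the fields to be scraped. You fill it in yourself, following the pattern in the generated template: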
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Movie250Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
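Add your fields following the commented name example and leave everything else unchanged
Alternatively, rewrite the whole file as follows (there is essentially no difference)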
from scrapy import Item, Field


class Movie250Item(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
Remember the Movie250Item class (other files will import it). It inherits from the Item class in the Scrapy module
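pipelines.py literally means "pipelines". It handles the data the crawler produces. In real projects it is mainly used for cleaning data, writing it to a database, saving it to storage, and similar tasks
The default code is: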
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Movie250Pipeline(object):
    def process_item(self, item, spider):
        return item
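The process_item method takes three parameters. You can ignore self and spider for now. The middle one, item, is the data returned by your own spider file, Movie_250_spider.py
The comment also notes that the pipeline has to be registered in the ITEM_PIPELINES setting in the settings file. That is covered later with a concrete example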
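A pipeline class is plain Python, so its behaviour can be tried without running a crawl. The sketch below is a hypothetical cleaning pipeline (the class name CleanTitlePipeline and the title field are assumptions for illustration). It strips whitespace from one field and returns the item so that later pipelines can keep processing it.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline: trims whitespace from a 'title' field."""

    def process_item(self, item, spider):
        # item is whatever the spider yielded; dicts and scrapy Items
        # both support .get() and item[...] access.
        title = item.get("title")
        if isinstance(title, str):
            item["title"] = title.strip()
        # Returning the item passes it on to the next pipeline.
        return item


# Try it with a plain dict standing in for a scraped item.
pipeline = CleanTitlePipeline()
cleaned = pipeline.process_item({"title": "  Example Movie \n"}, spider=None)
print(cleaned)
```

Passing spider=None is enough here only because this sketch never touches the spider argument.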
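settings.py holds the configuration for the crawler project, such as request headers, whether to obey robots.txt rules, and download delays. The default code is: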
# -*- coding: utf-8 -*-

# Scrapy settings for movie_250 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'movie_250'

SPIDER_MODULES = ['movie_250.spiders']
NEWSPIDER_MODULE = 'movie_250.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'movie_250 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'movie_250.middlewares.Movie250SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'movie_250.middlewares.Movie250DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'movie_250.pipelines.Movie250Pipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
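At beginner level, the settings you are most likely to touch are overriding the request headers and enabling the pipeline. These are covered later with concrete examples
middlewares.py literally means "middlewares". It is fairly complex and not needed at this stage, so it is not covered here
The two __init__.py files contain no executable code. They only mark their folders as Python packages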
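As a sketch of what those two changes might look like, the fragment below simply uncomments the corresponding entries from the default settings.py shown above. The values are the template defaults, not recommendations. Real projects usually adjust the headers to suit the target site.

```python
# Fragment for movie_250/settings.py: the template entries, uncommented.

# Override the default request headers sent with every request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable the pipeline defined in pipelines.py. The number sets the order
# when several pipelines are enabled: lower values run first.
ITEM_PIPELINES = {
    "movie_250.pipelines.Movie250Pipeline": 300,
}
```

The dictionary key is the import path of the pipeline class, so it has to match the package and class names in pipelines.py exactly.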
3. Manually create a new .py file in the spiders folder. The suggested name is <project name>_spider.py
This file is where the crawling rules are written
4. There are two ways to run the program
Method 1: run it from the command line inside the project folder (that is, the top-level movie_250 folder)
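scrapy crawl <spider name>
Note that <spider name> is the value of the name attribute defined in the spider class, which is not necessarily the same as the project name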
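Method 2: typing the Method 1 command on every run is tedious, and any output is hard to read in the console, so write a main.py instead
Create main.py inside the second-level movie_250 folder (this folder is the module/package) and put the following in it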
from scrapy import cmdline

# Replace spider_name with the name attribute defined in your spider class.
cmdline.execute("scrapy crawl spider_name".split())
After that, just run this file each time
5. The complete directory structure looks like this: