Installing and Introducing the Scrapy Framework
1. Installing Scrapy
1.1 Upgrade Python's packaging tools first
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools
1.2 Install the third-party libraries
pip install pywin32 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install constantly -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install queuelib -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install lxml -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install six -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install parsel==1.6.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install itemloaders==1.0.1 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install incremental==21.3.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install pyopenssl -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install Twisted==21.2.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install Scrapy==2.4.1 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
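The same pins can also be kept in a requirements.txt so the environment is reproducible (a sketch using the packages and versions listed above):

```text
pywin32
constantly
queuelib
lxml
six
parsel==1.6.0
itemloaders==1.0.1
incremental==21.3.0
pyopenssl
Twisted==21.2.0
Scrapy==2.4.1
```

Then everything installs in one step: pip install -r requirements.txt -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com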
1.3 Add the scrapy command to the PATH environment variable
D:\Program Files\python39\Scripts>scrapy
Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
2. Creating a Spider Project
2.1 Create a spider project (myscrapy) from the command line
scrapy startproject myscrapy
2.2 Import the project into the PyCharm IDE and select a Python interpreter (a virtual environment is recommended)
# A brief overview of the project's initial layout
|-myscrapy
  |-spiders          # Holds spider files; create them with the scrapy genspider command
    |-__init__.py
  |-__init__.py
  |-items.py         # Defines the fields to be extracted
  |-middlewares.py   # Middleware
  |-pipelines.py     # Defines pipelines for processing and storing data
  |-settings.py      # Global settings
|-scrapy.cfg         # Project configuration file
2.3 Create a spider named funddata
1) In the spiders directory, create a spider file from the command line
scrapy genspider funddata fund.eastmoney.com
2) After the command succeeds, a file named funddata.py is generated in the spiders directory
# -*- coding: utf-8 -*-
# @Time    : 2021/5/23 23:00
# @Author  : chinablue
# @File    : funddata.py

import scrapy


class FunddataSpider(scrapy.Spider):
    # Spider name
    name = 'funddata'
    # Domains the spider is allowed to crawl
    allowed_domains = ['fund.eastmoney.com']
    # Initial request URLs
    start_urls = ['http://fund.eastmoney.com/']

    # Method that parses the downloaded page
    def parse(self, response):
        pass
2.4 Create a main.py file and run it
1) To run the project conveniently from PyCharm, create a main.py file in the project root
# -*- coding: utf-8 -*-
# @Time    : 2021/5/23 23:07
# @Author  : chinablue
# @File    : main.py

import os
import sys

from scrapy.cmdline import execute

# Add the project root to Python's import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running on the command line: scrapy crawl funddata
execute(["scrapy", "crawl", "funddata"])
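The sys.path.append(...) line adds the project's root directory to Python's import path so the crawl command can resolve the project no matter where main.py is launched from. A quick sketch of what that expression computes (the path below is hypothetical):

```python
import os

# Suppose main.py lives at this (hypothetical) absolute path:
main_py = "/projects/myscrapy/main.py"

# os.path.dirname() strips the file name, leaving the project root,
# which main.py then appends to sys.path
project_root = os.path.dirname(main_py)
print(project_root)  # /projects/myscrapy
```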
2) Modify the settings.py file
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
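Two other settings that are commonly adjusted in settings.py at this stage; the values below are illustrative, not required by the project:

```python
# settings.py -- example tweaks (values are illustrative)

# Identify the crawler; replace with your own User-Agent string
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Wait (in seconds) between requests to reduce load on the target site
DOWNLOAD_DELAY = 1
```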