Scrapy: installation, basic usage, and persistent storage
Commands
scrapy startproject <project_name>
cd project_name (enter the project directory)
scrapy genspider <spider_name> <start_url>
scrapy crawl <spider_name> / scrapy crawl <spider_name> --nolog
Installation
Linux:
pip3 install scrapy
Windows:
a. pip3 install wheel
b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
c. Go to the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
d. pip3 install pywin32
e. pip3 install scrapy
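A quick sanity check that the install succeeded (not part of the original steps, just a suggestion):

import scrapy
print(scrapy.__version__)  # should print the installed Scrapy version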
Basic usage
1. Create a project: scrapy startproject <project_name>
Project structure:
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg    the project's main configuration file (the actual crawler settings live in settings.py)
items.py      data models for structured data, similar to Django's Model
pipelines.py  data persistence handling
settings.py   configuration file: recursion depth, concurrency, download delay, etc.
spiders/      spider directory: create spider files here and write the parsing rules
2. Create a spider application:
cd project_name (enter the project directory)
scrapy genspider <spider_name> <start_url> (e.g. scrapy genspider qiubai www.qiushibaike.com)
3. Write the spider file: after step 2 finishes, a spider .py file named after the application is generated in the project's spiders directory, with a skeleton like the sketch below.
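After genspider runs, the generated file (spiders/qiubai.py in this example) looks roughly like the following; the exact template varies slightly between Scrapy versions:

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        # parsing logic for each downloaded response goes here
        pass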
4. Edit the relevant settings in settings.py:
Changes (line numbers refer to the default settings.py):
line 19: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'  # spoof the request carrier identity
line 22: ROBOTSTXT_OBEY = False  # ignore / do not obey the robots.txt protocol
5. Run the spider: scrapy crawl <spider_name>
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        lst = response.xpath('//*[@id="content-left"]/div')
        all_data = []
        for el in lst:
            title = el.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = el.xpath('./a[1]/div/span[1]').extract_first()
            # collect dicts so the data can be exported from the command line
            all_data.append({'title': title, 'content': content})
        return all_data
Run the spider:
scrapy crawl <spider_name>: runs the spider and prints the execution log
scrapy crawl <spider_name> --nolog: runs the spider without printing the log
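Besides the scrapy crawl command, a spider can also be started from a plain Python script through Scrapy's CrawlerProcess API. A minimal sketch, assuming it is run from the project root so get_project_settings() can find settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings, schedule the 'qiubai' spider, and block until it finishes
process = CrawlerProcess(get_project_settings())
process.crawl('qiubai')
process.start()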
Persistent storage with the Scrapy framework
1. Persistent storage via terminal command
- Make sure the spider's parse method returns an iterable object (usually a list of dicts); that return value can then be written to a file in a chosen format via a terminal command for persistence.
Export the scraped data to a file in the chosen format:
scrapy crawl <spider_name> -o xxx.json
scrapy crawl <spider_name> -o xxx.xml
scrapy crawl <spider_name> -o xxx.csv
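The same export can also be configured once in settings.py instead of being passed on every run. A sketch assuming Scrapy 2.1 or newer, which supports the FEEDS setting (the output path is an arbitrary example):

# settings.py
FEEDS = {
    'output/qiubai.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}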
2. Pipeline-based persistent storage (efficient and convenient)
myspider.py
import scrapy
from scrapy1.items import Scrapy1Item
class MyspiderSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        lst = response.xpath('//*[@id="content-left"]/div')
        for el in lst:
            title = el.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = el.xpath('./a[1]/div/span[1]').extract_first()
            item = Scrapy1Item()
            item['title'] = title
            item['content'] = content
            yield item
items.py
import scrapy


class Scrapy1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
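Scrapy Item objects behave like dicts restricted to the declared fields; a small illustration with made-up values:

item = Scrapy1Item(title='some title', content='some content')
print(item['title'])          # dict-style access
item['content'] = 'edited'    # assignment works the same way
# item['author'] = 'x'        # would raise KeyError: only declared Fields are allowed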
pipelines.py
import json

import pymysql
from redis import Redis


class Scrapy1Pipeline(object):
    def process_item(self, item, spider):
        return item


class Mypipeline(object):
    # write each item to a local text file
    def __init__(self):
        self.f = None

    def open_spider(self, spider):
        print('1111111111111111111')
        self.f = open('data.txt', 'w', encoding='utf8')

    def close_spider(self, spider):
        print('2222222222222222222222')
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(item['title'] + '\n' + item['content'] + '\n')
        return item


class Mypipeline_mysql(object):
    # insert each item into a MySQL table via pymysql
    def __init__(self):
        self.conn = None
        self.cursor = None

    def open_spider(self, spider):
        print('3333333333333333')
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='xiubai', charset='utf8')

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        self.conn.close()
        print('44444444444444444444444')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # parameterized query: pymysql handles quoting/escaping of the values
            self.cursor.execute('insert into xiubai values (%s, %s)',
                                (item['title'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item


class Mypipeline_redis(object):
    # push each item onto a Redis list as a JSON string
    def __init__(self):
        self.conn = None

    def open_spider(self, spider):
        print('555555555555555555555')
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'title': item['title'],
            'content': item['content'],
        }
        self.conn.lpush('xiubai', json.dumps(dic))
        return item
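The MySQL pipeline above assumes the database and a two-column table already exist. A one-off setup sketch with pymysql; the column names (title, content) are assumptions, not taken from the original:

import pymysql

# adjust host/user/password to match the pipeline configuration above
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', charset='utf8')
cursor = conn.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS xiubai CHARACTER SET utf8')
cursor.execute('CREATE TABLE IF NOT EXISTS xiubai.xiubai (title VARCHAR(255), content TEXT)')
conn.commit()
cursor.close()
conn.close()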
settings.py
BOT_NAME = 'scrapy1'

SPIDER_MODULES = ['scrapy1.spiders']
NEWSPIDER_MODULE = 'scrapy1.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'scrapy1.pipelines.Scrapy1Pipeline': 300,
    # the number is the priority: the lower the value, the earlier the pipeline runs
    'scrapy1.pipelines.Mypipeline': 301,
    'scrapy1.pipelines.Mypipeline_redis': 302,
    'scrapy1.pipelines.Mypipeline_mysql': 303,
}
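To verify that the Redis pipeline actually stored data, the list can be read back with the same redis client (the key name xiubai is taken from the pipeline above):

import json
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# lrange returns every entry in the list; each one was stored as a JSON string
for raw in conn.lrange('xiubai', 0, -1):
    print(json.loads(raw))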