Using MongoDB as storage for a Scrapy novel crawler

I. Background: While learning MongoDB, I decided to modify my Scrapy novel crawler, which originally used MySQL for storage, to store its data in MongoDB instead.

II. Process:

1. Install MongoDB

(1) Configure the yum repo

(python) [root@DL ~]# vi /etc/yum.repos.d/mongodb-org-4.0.repo

[mongodb-org]
name=MongoDB Repository
baseurl=http://mirrors.aliyun.com/mongodb/yum/redhat/7Server/mongodb-org/4.0/x86_64/
gpgcheck=0
enabled=1

(2) Install with yum

(python) [root@DL ~]# yum -y install mongodb-org

(3) Start the mongod service

(python) [root@DL ~]# systemctl start mongod
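
Optionally, the service can be verified and set to start on boot (plain systemctl usage, nothing specific to this setup):

(python) [root@DL ~]# systemctl status mongod
(python) [root@DL ~]# systemctl enable mongod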

(4) Enter the MongoDB shell

(python) [root@DL ~]# mongo
MongoDB shell version v4.0.20

...

To enable free monitoring, run the following command: db.enableFreeMonitoring()
To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---
>
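
From this prompt a quick sanity check can be run, for example:

> db.version()
> show dbs

The novels database used below does not need to be created by hand; MongoDB creates databases and collections automatically on the first insert.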

(5) Install the pymongo module

(python) [root@DL ~]# pip install pymongo
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pymongo
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/d0/819074b92295149e1c677836d72def88f90814d1efa02199370d8a70f7af/pymongo-3.11.0-cp38-cp38-manylinux2014_x86_64.whl (530kB)
     |████████████████████████████████| 532kB 833kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.11.0
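
As a quick connectivity check, a throwaway sketch (not part of the project code; it assumes the default localhost:27017 instance started above):

(python) [root@DL ~]# python
>>> from pymongo import MongoClient
>>> conn = MongoClient('localhost', 27017)    # same connection parameters the pipeline will use
>>> conn.server_info()['version']             # raises ServerSelectionTimeoutError if mongod is not reachable
>>> conn.list_database_names()                # 'novels' only appears here after the first insert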

2. Modify pipelines.py

(python) [root@localhost xbiquge_w]# vi xbiquge/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
import time
from twisted.enterprise import adbapi
from pymongo import MongoClient

class XbiqugePipeline(object):
    conn = MongoClient('localhost', 27017)
    #conn = MongoClient('mongodb://ccl:123456@192.168.0.81/admin')    #if the MongoDB instance has authentication enabled, pass the credentials like this (ccl is the user name, 123456 the password, admin the authentication database)
    db = conn.novels    #connection object db for the novels database
    #name_novel = ''

    #class initialization, not needed here
    #def __init__(self):

    #spider opened
    #def open_spider(self, spider):
    #    return

    #empty the novel's collection before a new crawl
    def clearcollection(self, name_collection):
        myset = self.db[name_collection]
        myset.remove()    #deprecated in pymongo 3.x but still available; delete_many({}) is the modern equivalent

    def process_item(self, item, spider):
        #if self.name_novel == '':
        self.name_novel = item['name']
        self.url_firstchapter = item['url_firstchapter']
        self.name_txt = item['name_txt']

        exec('self.db.' + self.name_novel + '.insert_one(dict(item))')    #insert the item into a collection named after the novel, i.e. self.db[self.name_novel].insert_one(dict(item))
        return item

    #read the novel's chapters back from the database and write them to a txt file
    def content2txt(self, dbname, firsturl, txtname):
        myset = self.db[dbname]
        record_num = myset.find().count()    #number of chapters stored for this novel
        print(record_num)
        counts = record_num
        url_c = firsturl
        start_time = time.time()    #start time of the extraction, for timing
        f = open(txtname + ".txt", mode='w', encoding='utf-8')    #open <novel name>.txt in write mode
        for i in range(counts):    #one iteration per chapter
            record_m = myset.find({"url": url_c}, {"content": 1, "by": 1, "_id": 0})
            record_content_c2a0 = ''
            for item_content in record_m:
                record_content_c2a0 = item_content["content"]    #the chapter's content
            #record_content = record_content_c2a0.replace(u'\xa0', u'')    #strip the special character \xc2\xa0
            record_content = record_content_c2a0
            #print(record_content)
            f.write('\n')
            f.write(record_content + '\n')
            f.write('\n\n')
            url_ct = myset.find({"url": url_c}, {"next_page": 1, "by": 1, "_id": 0})    #query object holding the next chapter's link
            for item_url in url_ct:
                url_c = item_url["next_page"]    #the next chapter's URL becomes url_c for the next iteration
        f.close()
        print(time.time() - start_time)
        print(txtname + ".txt" + " has been generated!")
        return

    #spider closed: call content2txt to generate the txt file
    def close_spider(self, spider):
        self.content2txt(self.name_novel, self.url_firstchapter, self.name_txt)
        return
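
As the boilerplate comment at the top of pipelines.py reminds us, the pipeline only receives items if it is registered in the project's settings.py. For this project the entry would look like the following (300 is just the usual priority value):

ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}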

 

3. Modify the spider

(python) [root@localhost xbiquge_w]# vi xbiquge/spiders/sancun.py

# -*- coding: utf-8 -*-
import scrapy
from xbiquge.items import XbiqugeItem
from xbiquge.pipelines import XbiqugePipeline

class SancunSpider(scrapy.Spider):
    name = 'sancun'
    allowed_domains = ['www.xbiquge.la']
    #start_urls = ['http://www.xbiquge.la/10/10489/']
    url_ori = "http://www.xbiquge.la"
    url_firstchapter = "http://www.xbiquge.la/10/10489/4534454.html"
    name_txt = "./novels/三寸人间"

    pipeline = XbiqugePipeline()
    pipeline.clearcollection(name)    #empty the novel's collection; a MongoDB collection is the equivalent of a MySQL table
    item = XbiqugeItem()
    item['id'] = 0         #new id field, to make querying easier
    item['name'] = name
    item['url_firstchapter'] = url_firstchapter
    item['name_txt'] = name_txt

    def start_requests(self):
        start_urls = ['http://www.xbiquge.la/10/10489/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dl = response.css('#list dl dd')     #extract the chapter-link entries
        for dd in dl:
            self.url_c = self.url_ori + dd.css('a::attr(href)').extract()[0]   #build each chapter's full URL
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
            yield scrapy.Request(self.url_c, callback=self.parse_c)    #yield a request handled by parse_c, which collects each chapter's URL, previous-page link, next-page link and content
            #print(self.url_c)

    def parse_c(self, response):
        #item = XbiqugeItem()
        #item['name'] = self.name
        #item['url_firstchapter'] = self.url_firstchapter
        #item['name_txt'] = self.name_txt
        self.item['id'] += 1
        self.item['url'] = response.url
        self.item['preview_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[1]
        self.item['next_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[3]
        title = response.css('.con_top::text').extract()[4]
        contents = response.css('#content::text').extract()
        text = ''
        for content in contents:
            text = text + content
        #print(text)
        self.item['content'] = title + "\n" + text.replace('\15', '\n')     #combine the chapter title and body into content; \15 is the octal escape for ^M (carriage return) and is replaced with a newline
        yield self.item     #yield the Item so it is passed to the pipelines module
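
Once a crawl has finished, the stored chapters can be spot-checked from Python. A minimal sketch, assuming the spider above (its documents go into a collection named sancun, after the spider/novel name):

>>> from pymongo import MongoClient
>>> db = MongoClient('localhost', 27017).novels
>>> db.sancun.count_documents({})                                    # number of chapters stored
>>> db.sancun.find_one({}, {"url": 1, "next_page": 1, "_id": 0})     # spot-check one document's link fields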

4. Modify items.py

(python) [root@DL xbiquge_w]# vi xbiquge/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    url_firstchapter = scrapy.Field()
    name_txt = scrapy.Field()
    url = scrapy.Field()
    preview_page = scrapy.Field()
    next_page = scrapy.Field()
    content = scrapy.Field()
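
With the three files in place, the crawl is started from the project directory in the usual Scrapy way. Note that content2txt opens ./novels/三寸人间.txt for writing but does not create directories, so the novels/ directory must already exist:

(python) [root@DL xbiquge_w]# scrapy crawl sancun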

 

III. Summary

Compared with MySQL, MongoDB makes crawler storage noticeably simpler: there is no table schema or SQL to maintain, and each scraped Item can be inserted directly as a dict.

 
