A Crawler Project Based on the Scrapy Framework (Part 1)
Scrapy is pronounced ['skræpi:].
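I. References
1. Official documentation (Chinese translation): https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
2. simplified-scrapy, a simple and easy-to-use crawler framework
3. Installing and getting started with the Scrapy crawler framework: https://www.jianshu.com/p/6bc5a4641629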
II. Using simplified-scrapy
1. Install the simplified-scrapy package
pip install simplified-scrapy
2. Write and run a Python file
from simplified_scrapy.core.spider import Spider


class ScrapydSpider(Spider):
    name = 'scrapyd-spider'  # spider name
    start_urls = ['http://www.scrapyd.cn/']  # entry URLs to start crawling from
    # models = ['auto_main','auto_obj']  # extraction models to use

    def urlFilter(self, url):
        return url.find('/jiaocheng/') > 0  # crawl filter: only collect tutorial pages

    # Optional MongoDB-backed stores, left commented out
    # from simplified_scrapy.core.mongo_objstore import MongoObjStore
    # obj_store = MongoObjStore(name,{'host':'127.0.0.1','port':27017})
    # from simplified_scrapy.core.mongo_urlstore import MongoUrlStore
    # url_store = MongoUrlStore(name,{"multiQueue":True})
    # from simplified_scrapy.core.mongo_htmlstore import MongoHtmlStore
    # html_store = MongoHtmlStore(name)

    # Custom data-extraction method
    def extract(self, url, html, models, modelNames):
        try:
            html = self.removeScripts(html)  # strip script content; optional
            lstA = self.listA(html, url["url"])  # extract the links on the page
            data = []
            ele = self.getElementByTag("h1", html)  # title
            if ele:
                title = ele.text
                ele = self.getElementByClass("cont", html, "</h1>")  # body
                if ele:
                    content = ele.innerHtml
                    ele = self.getElementsByTag("span", html, 'class="title-2"', 'class="cont"')  # author and time
                    author = None
                    time = None
                    if ele and len(ele) > 1:
                        time = ele[0].text
                        author = ele[1].text
                    # A record is added only when both title and body were found
                    data.append({"Url": url["url"], "Title": title, "Content": content, "Author": author, "Time": time})
            return [{"Urls": lstA, "Data": data}]  # hand the results back to the framework for processing
        except Exception as e:
            print(e)


from simplified_scrapy.simplified_main import SimplifiedMain  # entry point
SimplifiedMain.startThread(ScrapydSpider())  # start the spider
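The urlFilter method above relies on how str.find behaves, so it can help to try the same check in isolation. The following is a minimal sketch in plain Python that does not use simplified-scrapy at all; the function name keep_url and the sample links are made up for illustration.

```python
def keep_url(url, marker='/jiaocheng/'):
    """Return True when `marker` occurs in `url` at an index greater than 0.

    Mirrors the `url.find(marker) > 0` test used in urlFilter: str.find
    returns -1 when the marker is absent and 0 when the URL starts with it,
    so both of those cases are rejected.
    """
    return url.find(marker) > 0


# Hypothetical sample links, used only to illustrate the filter.
links = [
    'http://www.scrapyd.cn/jiaocheng/140.html',
    'http://www.scrapyd.cn/doc/',
    '/jiaocheng/relative.html',
]
kept = [link for link in links if keep_url(link)]
print(kept)
```

Only the first link passes. The third one is rejected because the marker sits at index 0; if a match at the very start of the string should also count, `marker in url` would be the check to use instead.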
3. By default, the extracted data is stored in JSON format in a folder named data, in the same directory as the script.
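To inspect the results after a crawl, the stored files can be read back with the standard library. This is only a sketch: the helper name load_records is hypothetical, and it assumes that each file in the data folder holds either one JSON document or one JSON value per line. The layout the framework actually writes may differ, so check a file by hand first.

```python
import json
from pathlib import Path


def load_records(folder='data'):
    """Collect JSON values from every file directly inside `folder`.

    Assumption: each file is either a single JSON document or has one
    JSON value per line. Lines that are not valid JSON are skipped.
    """
    records = []
    root = Path(folder)
    if not root.is_dir():
        return records  # folder does not exist, e.g. nothing crawled yet
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(encoding='utf-8', errors='replace')
        try:
            records.append(json.loads(text))  # whole file is one document
        except json.JSONDecodeError:
            # Fall back to treating the file as one JSON value per line.
            for line in text.splitlines():
                line = line.strip()
                if not line:
                    continue
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # skip lines that are not valid JSON
    return records


if __name__ == '__main__':
    for record in load_records():
        print(record)
```

Run from the directory that contains the spider script, this prints whatever records it can parse from the data folder, and prints nothing if the folder is missing.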