Building your own crawler search engine on a Mac: scrapy + redis + elasticsearch + kibana

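1. Introduction

My previous attempt, based on Apache Nutch, hit a dead end. An older Nutch release might have worked, but running an outdated version felt awkward, so I gave up on it.

The Elasticsearch and Kibana parts of that setup can be reused, though. The only change is replacing Nutch with scrapy + scrapy_redis.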

2. A basic Scrapy spider

Let's start directly from the official Scrapy example. This project is forked from scrapy/quotesbot.

My own code is at: https://github.com/gfzheng/quotesbot.git

git clone https://github.com/gfzheng/quotesbot.git

cd quotesbot

scrapy list

scrapy crawl toscrape-xpath

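The official example runs without problems.

3. Adding distributed crawling with scrapy-redis

Installing Redis is skipped here.

Command to start Redis: redis-server

See the example in https://github.com/rmax/scrapy-redis (https://github.com/rmax/scrapy-redis/tree/master/example-project).

Configure scrapy-redis by adding the following to settings.py: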

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300
}
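
To illustrate what a shared duplicates filter does, here is a minimal sketch. It is not scrapy-redis's actual code: it assumes a simplified fingerprint (SHA-1 over the method and URL only) and uses a hypothetical in-memory FakeRedisSet in place of a real Redis set. The point is only that every spider process checks the same set before scheduling a request.

```python
import hashlib


class FakeRedisSet:
    """Hypothetical in-memory stand-in for a Redis set (illustration only)."""

    def __init__(self):
        self._members = set()

    def sadd(self, member):
        # Mimics Redis SADD: 1 if the member was added, 0 if it already existed.
        if member in self._members:
            return 0
        self._members.add(member)
        return 1


def fingerprint(method, url):
    """Simplified request fingerprint (assumption: method and URL only)."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


def request_seen(shared_set, method, url):
    """Return True if an equivalent request was already recorded in the shared set."""
    return shared_set.sadd(fingerprint(method, url)) == 0
```

With a real Redis set behind it, a second spider process asking about the same URL would get True and skip the request.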

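(Optional) Change pipelines.py to the following. Note that the pipeline only runs if it is also listed in ITEM_PIPELINES: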

from datetime import datetime

class QuotesbotPipeline(object):
    def process_item(self, item, spider):
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item

Change the spider to:

# -*- coding: utf-8 -*-
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

from scrapy_redis.spiders import RedisCrawlSpider
#from scrapy_redis.spiders import RedisSpider


class ToScrapeSpiderCrawl(RedisCrawlSpider):
    """Spider that reads urls from redis queue (toscrape:start_urls)."""
    name = 'toscrape-crawl'
    redis_key = 'toscrape:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        # scrapy runspider -a domain=quotes.toscrape.com ./spiders/toscrape-crawl.py
        domain = kwargs.pop('domain', '')
        # list() is needed on Python 3, where filter() returns a one-shot iterator.
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(ToScrapeSpiderCrawl, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract(),
                'url': response.url
            }
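
The domain handling in __init__ can be tried in isolation. The sketch below uses a hypothetical helper name, parse_domains, which is not part of the project. It shows how a comma-separated -a domain=... value becomes a list, and why the result should be materialized with list() on Python 3.

```python
def parse_domains(domain_arg):
    """Split a comma-separated domain argument, dropping empty entries."""
    # list(...) materializes the result; a bare filter object in Python 3
    # is a one-shot iterator and would be empty on a second pass.
    return list(filter(None, domain_arg.split(",")))
```

An empty argument yields an empty list, which means no domain restriction is applied.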

4. Running the distributed crawler

Start the Redis server and connect to it with redis-cli. The keys * command shows the data stored in Redis:

$ redis-cli

127.0.0.1:6379> keys *

1) "toscrape-xpath:items"

127.0.0.1:6379> exit
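
The same list can also be read from Python. This is a minimal sketch that assumes RedisPipeline is using its default JSON serialization and that the client object offers an lrange(key, start, stop) method, as redis-py's client does. load_items is a hypothetical helper name.

```python
import json


def load_items(client, key, start=0, stop=-1):
    """Read JSON-encoded items from a Redis list without removing them."""
    # lrange with 0..-1 returns the whole list; each entry is one serialized item.
    return [json.loads(raw) for raw in client.lrange(key, start, stop)]
```

Assuming redis-py is installed and Redis runs on the default host and port, load_items(redis.Redis(), "toscrape-xpath:items") would return the scraped quotes as dicts.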

The lpush command adds data under a key. Here it pushes a start URL onto the toscrape:start_urls list:

>lpush toscrape:start_urls http://quotes.toscrape.com
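
Seeding can also be scripted instead of typed into redis-cli. The sketch below assumes a client object with an lpush(key, *values) method, as provided by redis-py. seed_start_urls is a hypothetical helper name, not part of scrapy-redis.

```python
def seed_start_urls(client, key, urls):
    """Push start URLs onto the Redis list the spider reads from.

    Returns the list length reported by Redis, or 0 if nothing was pushed.
    """
    urls = list(urls)
    if not urls:
        return 0
    return client.lpush(key, *urls)
```

Assuming redis-py is installed, seed_start_urls(redis.Redis(), "toscrape:start_urls", ["http://quotes.toscrape.com"]) has the same effect as the lpush command above.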

Now the crawler can be run:

scrapy runspider -a domain=quotes.toscrape.com ./spiders/toscrape-crawl.py

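Here the -a argument passes domain to the spider, which restricts crawling to the given domains.

At this point the distributed crawler is set up.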

5. Adding full-text search with Elasticsearch

Installing and running ES is skipped here.

To save crawled items directly into ES, use: https://github.com/knockrentals/scrapy-elasticsearch

pip install ScrapyElasticSearch

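Usage (configure settings.py). If ITEM_PIPELINES is already defined, add this entry to the existing dict, because a second assignment would overwrite the first: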

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_INDEX_DATE_FORMAT = '%Y-%m'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'  # Custom unique key
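
To make these settings concrete, here is a rough sketch of how they could map to an index name and a document id. It rests on two assumptions that I have not checked against the library's code: that the formatted date is appended to ELASTICSEARCH_INDEX with a hyphen, and that the unique key value is hashed with SHA-1 to form the document id. index_name and document_id are hypothetical helper names.

```python
import hashlib
from datetime import datetime


def index_name(base, date_format, when):
    """Append a date suffix to the base index name (assumed separator: '-')."""
    return f"{base}-{when.strftime(date_format)}"


def document_id(unique_value):
    """Derive a stable id from the unique key value (assumed: SHA-1 hex digest)."""
    return hashlib.sha1(unique_value.encode("utf-8")).hexdigest()


# Example: an item crawled in August 2017 would land in a monthly index.
example_index = index_name("scrapy", "%Y-%m", datetime(2017, 8, 17))
```

Under these assumptions, re-crawling the same url updates the existing document instead of adding a duplicate, and a wildcard such as scrapy* covers every monthly index.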

After making these changes, run the crawler again.

In Kibana, add an index pattern: scrapy*

The crawled items can then be searched there!
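
The same index can also be queried without Kibana, through Elasticsearch's REST API. The sketch below only builds the request and does not send it. It assumes ES listens on http://localhost:9200 and that the indices are matched by scrapy-*. It searches the text field produced by the spider, and build_search_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_search_request(text, index="scrapy-*", host="http://localhost:9200"):
    """Build (but do not send) a full-text match query on the 'text' field."""
    body = {"query": {"match": {"text": text}}}
    return urllib.request.Request(
        url=f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With ES running, passing the returned request to urllib.request.urlopen would send the query, and the JSON response would list the matching items.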

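6. Summary

In short, the whole crawl-and-search system consists of Scrapy, Redis, Elasticsearch and Kibana.

The basic flow is: start all four services, lpush the start_urls into Redis, and search the crawled articles in Kibana.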

posted @ 2017-08-17 17:13 Zen_Garden
