Crawling new and second-hand home listings for every city in mainland China on Fang.com (房天下) with a scrapy-redis distributed crawler


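Crawling approach

1. Open https://www.fang.com/SoufunFamily.htm, parse every province and city, and collect each city's home-page link.
2. Analysis shows that each city's new-home URL is the home-page link with newhouse inserted into the host name and house/s/ appended to the path, while the second-hand URL is the home-page link with esf inserted into the host name.
Taking Shanghai as an example:
Home page: https://sh.fang.com/
New homes: https://sh.newhouse.fang.com/house/s/
Second-hand homes: https://sh.esf.fang.com
With these two rules, the new and second-hand listings of every city can be crawled.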
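
The URL rule above can be sketched as a small helper. This is a minimal sketch, assuming every city home page has the form https://<city>.fang.com/; the function name derive_city_urls is hypothetical and not part of the original project.

```python
def derive_city_urls(city_url):
    """Derive the new-home and second-hand listing URLs from a city home URL.

    Hypothetical helper; assumes the home URL looks like https://<city>.fang.com/.
    """
    # Normalise to exactly one trailing slash so the path can be appended safely.
    base = city_url.rstrip("/") + "/"
    # Insert the extra host label after the first dot only.
    newhouse_url = base.replace(".", ".newhouse.", 1) + "house/s/"
    esf_url = base.replace(".", ".esf.", 1)
    return newhouse_url, esf_url


newhouse_url, esf_url = derive_city_urls("https://sh.fang.com/")
print(newhouse_url)  # https://sh.newhouse.fang.com/house/s/
print(esf_url)       # https://sh.esf.fang.com/
```

Normalising the trailing slash first means the helper gives the same result whether or not the scraped link ends with "/".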

1. Create the project

scrapy startproject fang
cd fang
scrapy genspider fangtianxia "fang.com"

2. Define the item fields to scrape

import scrapy


class FangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    province = scrapy.Field()
    city_name = scrapy.Field()
    house_name = scrapy.Field()
    size = scrapy.Field()
    address = scrapy.Field()
    tel = scrapy.Field()
    price = scrapy.Field()
    type = scrapy.Field()

3. Write the spider: parse the data and schedule follow-up requests

# -*- coding: utf-8 -*-
import scrapy

from fang.items import FangItem


class FangtianxiaSpider(scrapy.Spider):
    name = 'fangtianxia'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        tr_id = None
        province = None
        trs = response.xpath("//div[@class='outCont']//tr")
        # Collect the new-home and second-hand links for every city in every province
        for tr in trs:
            new_tr_id = tr.xpath("@id").get()
            if tr_id != new_tr_id:
                tr_id = new_tr_id
                province = tr.xpath("./td[2]//text()").get()
            citys = tr.xpath("./td[3]/a")
            for city in citys:
                city_name = city.xpath("text()").get()
                city_url = city.xpath("@href").get()
                # Assumes city_url ends with "/", e.g. https://sh.fang.com/
                city_newhouse_url = city_url.replace(".", ".newhouse.", 1) + "house/s/"
                city_esf_url = city_url.replace(".", ".esf.", 1)
                yield scrapy.Request(city_newhouse_url, callback=self.parse_newhouse,
                                     meta={"info": (province, city_name)})
                yield scrapy.Request(city_esf_url, callback=self.parse_esf, meta={"info": (province, city_name)})

    def parse_newhouse(self, response):
        province, city_name = response.meta["info"]
        house_type = "新房"  # "new home"; stored as-is in the item
        houses = response.xpath("//div[@id='newhouse_loupai_list']/ul/li[@id]")
        for house in houses:
            # default="" avoids an AttributeError when the name node is missing
            house_name = house.xpath(".//div[@class='nlcd_name']/a/text()").get(default="").strip()
            size = house.xpath(".//div[@class='house_type clearfix']/a/text()").getall()
            size = ",".join(size)
            address = house.xpath(".//div[@class='address']/a/@title").get()
            tel = house.xpath(".//div[@class='tel']/p//text()").getall()
            tel = "".join(tel)
            price = house.xpath(".//div[@class='nhouse_price']/*/text()").getall()
            price = " ".join(price)
            item = FangItem(province=province, city_name=city_name, house_name=house_name, size=size, address=address,
                            tel=tel, price=price, type=house_type)
            yield item
        # Follow the next page, if there is one
        next_url = response.xpath("//a[@class='active']/following-sibling::a[1]/@href").get()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse_newhouse, meta={"info": (province, city_name)})

    def parse_esf(self, response):
        # Second-hand listings are scraped the same way as new homes in parse_newhouse
        pass
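
The way parse carries the province name forward across table rows can be illustrated without Scrapy. The following is a minimal sketch, assuming (as the spider does) that rows sharing the same id belong to one province and that only the first of them holds the province name; the function name assign_provinces and the sample rows are made up for illustration.

```python
def assign_provinces(rows):
    """Yield (province, city) pairs from (row_id, province_cell, cities) tuples.

    Rows that share a row_id belong to the same province; only the first
    such row is assumed to carry the province name.
    """
    current_id = None
    province = None
    for row_id, province_cell, cities in rows:
        if row_id != current_id:
            # A new row id marks the start of a new province block.
            current_id = row_id
            province = province_cell.strip()
        for city in cities:
            yield province, city


# Made-up sample rows, not scraped data.
sample_rows = [
    ("row_1", "ProvinceA", ["City1", "City2"]),
    ("row_1", " ", ["City3"]),
    ("row_2", "ProvinceB", ["City4"]),
]
print(list(assign_provinces(sample_rows)))
```

City3 sits in a row whose province cell is blank, yet it still inherits ProvinceA because its row id has not changed.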

4. Save the scraped data to a JSON file

from scrapy.exporters import JsonLinesItemExporter


class FangPipeline:
    # Called when the spider is opened
    def open_spider(self, spider):
        print("Spider started...")
        fileName = "fang.json"
        self.fp = open(fileName, "wb")  # the exporter writes bytes, so open in binary mode
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    # Called for every item the spider yields
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.fp.close()  # flush and release the output file
        print("Spider finished")
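
The file written by this pipeline holds one JSON object per line (JSON Lines), not a single JSON array, so it has to be read back line by line. The sketch below uses only the standard json module as a stand-in for Scrapy's exporter to show the format; the helper names and the sample items are assumptions for illustration.

```python
import io
import json


def export_json_lines(items, fp):
    """Write each item as one UTF-8 encoded JSON object per line."""
    for item in items:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        line = json.dumps(item, ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))


def read_json_lines(fp):
    """Parse a binary JSON Lines stream back into a list of dicts."""
    return [json.loads(line.decode("utf-8")) for line in fp if line.strip()]


# Made-up sample items; an in-memory buffer stands in for fang.json.
buffer = io.BytesIO()
export_json_lines(
    [{"city_name": "上海", "type": "新房"}, {"city_name": "北京", "type": "二手房"}],
    buffer,
)
buffer.seek(0)
records = read_json_lines(buffer)
print(len(records), records[0]["city_name"])
```

Because each line is independent, new records can be appended as the crawl runs and a partially written file is still readable up to the last complete line.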

5. Configure settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

6. Run the spider

scrapy crawl fangtianxia

Extension: turning the single-machine crawler into a distributed one

1. Install scrapy-redis

## Install scrapy-redis:
pip3 install scrapy-redis -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

2. Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider

3. Remove start_urls = ['https://www.fang.com/SoufunFamily.htm'] and add a redis_key instead

    # start_urls = ['https://www.fang.com/SoufunFamily.htm']
    # The start URL must be stored in Redis as a list entry:
    # LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm
    redis_key = "sfw:start_url"

4. Add the following to the settings file

# 1: Use the scrapy_redis duplicate filter instead of Scrapy's built-in one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 2: Use the scheduler reimplemented by scrapy-redis
# instead of Scrapy's built-in scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 3: Allow pausing and resuming a crawl, similar to JOBDIR: the request queue is kept
# in Redis instead of being cleared, so the next run continues from that queue
SCHEDULER_PERSIST = True
# 4: Choose the task queue mode (one of three):
# SpiderPriorityQueue is the scrapy-redis default
# (requests are ordered by their own priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue: first in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack: last in, first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# 5: This pipeline stores the items collected by every crawler node centrally in Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# 6: Host IP of the Redis server
REDIS_HOST = '127.0.0.1'  # replace with the remote server's IP when Redis runs on another machine

# Port of the Redis server
REDIS_PORT = 6379

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
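
The difference between the three queue modes listed in these settings comes down to the order in which pending requests are popped. The sketch below illustrates that ordering with plain Python containers; it is a simplified, in-memory analogy and not how scrapy-redis stores requests in Redis. The function names and task labels are made up, and the priority rule assumed here is Scrapy's convention that a higher priority value is served first.

```python
import heapq
from collections import deque


def pop_order_fifo(tasks):
    """Queue mode: first in, first out."""
    queue = deque(tasks)
    return [queue.popleft() for _ in range(len(queue))]


def pop_order_lifo(tasks):
    """Stack mode: last in, first out."""
    stack = list(tasks)
    return [stack.pop() for _ in range(len(stack))]


def pop_order_priority(tasks_with_priority):
    """Priority mode: higher priority first, insertion order breaks ties."""
    heap = [
        (-priority, index, task)
        for index, (task, priority) in enumerate(tasks_with_priority)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


print(pop_order_fifo(["a", "b", "c"]))                          # ['a', 'b', 'c']
print(pop_order_lifo(["a", "b", "c"]))                          # ['c', 'b', 'a']
print(pop_order_priority([("a", 0), ("b", 10), ("c", 5)]))      # ['b', 'c', 'a']
```

In crawling terms, the FIFO mode tends toward breadth-first traversal, while the LIFO mode tends toward depth-first traversal.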

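5. Push the start URL into Redis

    # The start URL must be stored in Redis as a list entry
    LPUSH sfw:start_url https://www.fang.com/SoufunFamily.htm

6. Start the spider. The scraped items then appear in Redis (with RedisPipeline's default settings, under the key fangtianxia:items).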

posted @ 2020-05-31 09:32  她的开呀