Python beginner's study notes: crawling the WeChat Mini Program community with Scrapy's -t crawl template

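Commands

Create the project with scrapy startproject [project name]. The project here is named jxnsh, and Scrapy prints the following hint once it is created: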

You can start your first spider with:
cd jxnsh
scrapy genspider example example.com

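To create the spider file, first change into the project directory.

Normally you would then run scrapy genspider [spider name] [target domain].

Here, run scrapy genspider -t crawl [spider name] [target domain] instead; the -t option selects the template the spider is generated from, in this case crawl.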

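Finally, create a launch file in the project root so the spider can be started by running a script.

Without a launch file, you would start the spider by running this command in the same cmd window:

scrapy crawl wxapp_spider

The spider here is named wxapp_spider; replace it with the name of the spider you created.

As an aside, scrapy shell [target URL] opens an interactive session that is handy for testing regular expressions and XPath expressions.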

Example: start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl wxapp_spider".split())

Running this file starts the spider.

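settings.py

Disable the robots.txt protocol, that is,

change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.

Configure the request headers (DEFAULT_REQUEST_HEADERS),

and uncomment the ITEM_PIPELINES setting so that the pipeline is enabled.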
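
The three changes above could look roughly like the fragment below. This is only a sketch, not the author's actual file: the User-Agent string is a placeholder assumption, and the pipeline path assumes the project name jxnsh and the class name JxnshPipeline used elsewhere in these notes.

```python
# settings.py (fragment): a sketch of the three settings touched in these notes

# Do not download or honor robots.txt for this crawl.
ROBOTSTXT_OBEY = False

# Default headers sent with every request. The User-Agent value is a
# placeholder assumption; substitute the string your own browser sends.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Enable the project's pipeline. The number is its order: pipelines with
# lower values run first. 300 is the value used in the generated template.
ITEM_PIPELINES = {
    "jxnsh.pipelines.JxnshPipeline": 300,
}
```

If ITEM_PIPELINES stays commented out, the spider still runs, but the items never reach the pipeline and nothing is written to disk.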

items.py is configured as follows

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class JxnshItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    article_content = scrapy.Field()

This is the wxapp_spider spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jxnsh.items import JxnshItem

class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

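    rules = (
        # List pages: follow their pagination links; no callback, just keep crawling.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d+'), follow=True),
        # Article pages: parse them, but do not follow the links found on them.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )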

    def parse_detail(self, response):
        # The article title is the text of the <h1 class="ph"> element.
        title = response.xpath('//h1[@class="ph"]/text()').get()
        # Collect every text node inside the article body and join them into one string.
        article_content = response.xpath('//td[@id="article_content"]//text()').getall()
        article_content = "".join(article_content).strip()
        item = JxnshItem(title=title, article_content=article_content)
        yield item
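
The allow patterns in the rules are ordinary regular expressions that LinkExtractor searches for in each extracted URL, so they can be checked outside Scrapy with the standard re module. The sketch below uses patterns modeled on the spider's two rules; the classify helper and the sample URLs are hypothetical and exist only for illustration.

```python
import re

# Patterns modeled on the spider's two rules.
LIST_PAGE = re.compile(r'.+mod=list&catid=2&page=\d+')
ARTICLE_PAGE = re.compile(r'.+article-.+\.html')


def classify(url):
    """Return which rule would pick up the URL: 'list', 'article', or None.

    The checks run in the same order as the rules, because when several
    rules match one link, CrawlSpider uses the first matching rule.
    """
    if LIST_PAGE.search(url):
        return "list"
    if ARTICLE_PAGE.search(url):
        return "article"
    return None


# Hypothetical URLs, for illustration only.
samples = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3",
    "http://www.wxapp-union.com/article-5798-1.html",
    "http://www.wxapp-union.com/forum.php",
]

for url in samples:
    print(classify(url), url)
```

A URL that matches neither pattern is simply ignored by the spider, which is what keeps the crawl limited to the list pages and the articles they link to.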

Next is pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter

class JxnshPipeline(object):
    def __init__(self):
        self.fp = open("wxapp_content.json","wb")
        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Scrapy calls close_spider(spider) on the pipeline, so the spider argument is required.
    def close_spider(self, spider):
        self.fp.close()
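
JsonLinesItemExporter writes one JSON object per line, so despite its .json extension the output file is in JSON Lines format, and calling json.load on the whole file fails as soon as it holds more than one item. A minimal sketch of reading such a file back follows; read_json_lines is a hypothetical helper, and the sample records are hand-written stand-ins, not real crawl output.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a JSON Lines file: one JSON object per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Hand-written sample records with the same two fields as JxnshItem.
sample = [
    {"title": "示例标题", "article_content": "first article"},
    {"title": "second title", "article_content": "second article"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    # Write the records in the same layout the pipeline produces:
    # UTF-8, non-ASCII characters kept as-is, one object per line.
    with open(path, "w", encoding="utf-8") as fp:
        for obj in sample:
            fp.write(json.dumps(obj, ensure_ascii=False) + "\n")
    loaded = read_json_lines(path)

print(loaded == sample)
```

To read the real output, call read_json_lines("wxapp_content.json") from the directory where the spider was run.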

Source code: 1172118044.top/file/download.html

posted @ 2020-02-18 22:03 jswf
