Flask server + coroutines + crawler + UI automation

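My company needed a long-running crawler: it has to stay attached to the target site permanently and pull the site's data once a day. Part of the data comes as downloadable spreadsheets, which I fetch by clicking through the page with UI automation. The rest is the information for nearly every image on the site, close to a million records, which I crawl with scrapy. Finally, when to crawl and which day's data to crawl are decided by incoming requests, so a server is needed; I used flask for that. The server has to start listening while the UI automation session stays attached at the same time, which is where coroutines come in.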

1. Flask + coroutines

The overall control logic lives here.

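from gevent import monkey
monkey.patch_all()  # patch first, so the imports below see the patched standard library
from spy_oo.spiders.oo_ui import UI
import queue
import subprocess
from flask import Flask, request
import gevent


que = queue.Queue()  # tasks pushed by the Flask view, consumed by ui()


def ui():
    op = UI()
    while True:
        if not que.empty():
            # a task looks like "dow=12", i.e. <type>=<date>
            k_v = que.get().split("=")
            if len(k_v) == 2:
                type_, date_ = k_v
            else:
                type_, date_ = '', ''
            print("crawl type: %s, date: %s" % (type_, date_))
            if "spy" in type_:
                print("crawling...")
                # scrapy's cmdline.execute() ends with sys.exit() and the Twisted
                # reactor cannot be restarted, so run each crawl in a child process
                subprocess.run('scrapy crawl oppo'.split())
            elif "dow" in type_:
                print("downloading...")
                op.download()
        else:
            # nothing queued: keep the UI session attached to the site
            op.hang_out()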


def server():
    app = Flask(__name__)

    @app.route("/***")
    def logic1():
        arg = str(request.query_string, encoding="utf-8")
        if arg:
            que.put(arg)
        else:
            arg = 'query string missing, e.g. dow=12'
        return str(arg)

    app.run(debug=True, use_reloader=False)


if __name__ == '__main__':
    g1 = gevent.spawn(ui)
    g2 = gevent.spawn(server)
    g1.join()
    g2.join()

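# Alternative: serve the app through gevent's WSGI server instead of app.run()
# (pass the app object itself, not the result of app.run()):
# from gevent import pywsgi
# pywsgi.WSGIServer(('127.0.0.1', 5000), app).serve_forever()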
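
The dispatch loop above can also be written without polling `que.empty()`. What follows is only a minimal sketch, not the author's code: `parse_task`, `run_worker`, `FakeUI`, and the `idle_timeout` / `max_idle` parameters are hypothetical names introduced for illustration, and the real `UI` class is replaced by a stub that records calls. The idea is to block on `get(timeout=...)` and run the keep-alive step only when the wait times out. It uses the plain standard-library queue; under `monkey.patch_all()` the blocking wait should cooperate with other greenlets, though that is worth verifying in your own setup.

```python
import queue


def parse_task(raw):
    """Split a raw query string such as 'dow=12' into (type_, date_)."""
    parts = raw.split("=")
    # Same fallback as the original loop: anything but exactly one '=' is ignored.
    return tuple(parts) if len(parts) == 2 else ("", "")


def run_worker(tasks, ui, crawl, idle_timeout=1.0, max_idle=None):
    """Consume tasks; stop after max_idle consecutive timeouts (None = run forever)."""
    idle = 0
    while max_idle is None or idle < max_idle:
        try:
            raw = tasks.get(timeout=idle_timeout)
        except queue.Empty:
            ui.hang_out()  # nothing to do: keep the UI session alive
            idle += 1
            continue
        idle = 0
        type_, date_ = parse_task(raw)
        if "spy" in type_:
            crawl(date_)
        elif "dow" in type_:
            ui.download()


class FakeUI:
    """Hypothetical stand-in for the real UI class; it only records calls."""

    def __init__(self):
        self.calls = []

    def download(self):
        self.calls.append("download")

    def hang_out(self):
        self.calls.append("hang_out")


# Small demonstration with three queued requests, one of them malformed.
tasks = queue.Queue()
for raw in ("dow=12", "spy=2021-12-22", "garbage"):
    tasks.put(raw)

fake_ui = FakeUI()
crawled = []
run_worker(tasks, fake_ui, crawled.append, idle_timeout=0.01, max_idle=1)
```

In the demonstration the malformed request is dropped silently, just as in the original loop. In the real service `max_idle` would stay `None` so the worker never exits.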

2. Part: crawler

Crawling is done with scrapy. This is the spider, which is also the core of the crawl.

import scrapy
import json
from jsonpath import jsonpath
from . import url, data, oppo_cookies, service_id
from ..items import SpyOoItem


class Spider(scrapy.Spider):
    """
    default data
    """
    name = '***'
    allowed_domains = ['***']
    start_urls = ['***', ]

    def start_requests(self):
        """
        get response
        """
        for s_id in service_id.values():
            data["service_id"] = str(s_id)
            yield scrapy.FormRequest(
                url=url,
                formdata=data,
                cookies=oppo_cookies,
                callback=self.parse
            )

    def parse(self, response):
        """
        parse the message from response
        """

        # extract message from response
        s = json.loads(response.text)
        name_ = jsonpath(s, '$..list[0].service_name')[0]
        pics_id = jsonpath(s, '$..list[*].pic_info[0].magazine_id')
        pics_name = jsonpath(s, '$..list[*].magazine_name')
        play_start_time = jsonpath(s, '$..list[*].play_start_time')
        create_time = jsonpath(s, '$..list[*].pic_info[0].create_time')

        # mapping the value and yield it
        for a, b, c, d in zip(pics_id, pics_name, play_start_time, create_time):
            item = SpyOoItem()
            item["service_name"] = name_
            item["id"] = a
            item["name"] = b
            item["play_time"] = c
            item["upload_time"] = d
            yield item
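
One thing to watch in `parse`: the four jsonpath queries return independent lists, so if any entry has no `pic_info[0]`, the id and create-time lists come out shorter than the name list and `zip` pairs values from different entries. The sketch below shows a per-entry alternative using only the standard library. It is an illustration under assumptions, not the site's real schema: the `data` → `list` nesting, the sample values, and the `extract_items` helper are all made up for the example.

```python
import json


def extract_items(payload):
    """Build one record per entry of payload['data']['list'] (assumed layout)."""
    entries = payload.get("data", {}).get("list", [])
    if not entries:
        return []
    service_name = entries[0].get("service_name")
    items = []
    for entry in entries:
        pics = entry.get("pic_info") or []
        if not pics:
            continue  # skip entries without picture info instead of misaligning
        first = pics[0]
        items.append({
            "service_name": service_name,
            "id": first.get("magazine_id"),
            "name": entry.get("magazine_name"),
            "play_time": entry.get("play_start_time"),
            "upload_time": first.get("create_time"),
        })
    return items


# Made-up sample response: the second entry has no picture info.
sample = json.loads("""
{"data": {"list": [
  {"service_name": "svc", "magazine_name": "A", "play_start_time": "t1",
   "pic_info": [{"magazine_id": 1, "create_time": "c1"}]},
  {"service_name": "svc", "magazine_name": "B", "play_start_time": "t2",
   "pic_info": []},
  {"service_name": "svc", "magazine_name": "C", "play_start_time": "t3",
   "pic_info": [{"magazine_id": 3, "create_time": "c3"}]}
]}}
""")

records = extract_items(sample)
```

Here entry "C" keeps its own id. With four separate lists zipped together, id 3 would have been paired with name "B".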

3. Part: UI automation

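The pages themselves are crawled through UI automation, done entirely with the seliky library. The tags and elements belong to the target site, so I can't show them here. This part is fairly simple, and I trust most readers can manage it. The seliky operations are covered in my automation column.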

posted @ 2021-12-22 11:27 测神
