A brief note on the internals of scrapyd_client deploy

scrapyd_client deploy packages the spider project you have developed and pushes it to the scrapyd server. Because of how Python packaging works, a spider project may have third-party dependencies, so depending on the command-line options, scrapyd_client builds the application into an egg either with or without those dependencies bundled.

Egg packaging

When building the egg, scrapyd_client deploy first checks whether the spider project contains a setup.py; if it does not, one is generated automatically.

  • setup.py template

The settings placeholder is filled with the settings module of the spider project; the scrapy entry point is effectively a Scrapy extension hook.

_SETUP_PY_TEMPLATE = """
# Automatically created by: scrapyd-deploy
 
from setuptools import setup, find_packages
 
setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    entry_points = {'scrapy': ['settings = %(settings)s']},
)
""".lstrip()
  • Building the egg
    Whether or not dependencies are bundled decides the build: when dependencies are included, an "uber egg" is produced based on uberegg; otherwise the default setuptools build is used.
    The build command
    is effectively python setup.py bdist_egg or python setup.py bdist_uberegg.
def _build_egg(opts):
    closest = closest_scrapy_cfg()
    os.chdir(os.path.dirname(closest))
    if not os.path.exists("setup.py"):
        settings = get_config().get("settings", "default")
        _create_default_setup_py(settings=settings)
    d = tempfile.mkdtemp(prefix="scrapydeploy-")
    o = open(os.path.join(d, "stdout"), "wb")
    e = open(os.path.join(d, "stderr"), "wb")
 
    if opts.include_dependencies:
        _log("Including dependencies from requirements.txt")
        if not os.path.isfile("requirements.txt"):
            _fail("Error: Missing requirements.txt")
        command = "bdist_uberegg"
    else:
        command = "bdist_egg"
 
    retry_on_eintr(
        check_call,
        [sys.executable, "setup.py", "clean", "-a", command, "-d", d],
        stdout=o,
        stderr=e,
    )
    o.close()
    e.close()
 
    egg = glob.glob(os.path.join(d, "*.egg"))[0]
    return egg, d
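The command-selection branch in _build_egg can be isolated into a small sketch (choose_build_command is a hypothetical helper, not part of scrapyd-client):

```python
# Minimal sketch of the branch above: with dependencies we need a
# requirements.txt and build an uberegg (requires the uberegg package
# installed); otherwise a plain setuptools bdist_egg is used.
import os

def choose_build_command(include_dependencies):
    if include_dependencies:
        if not os.path.isfile("requirements.txt"):
            raise SystemExit("Error: Missing requirements.txt")
        return "bdist_uberegg"
    return "bdist_egg"

print(choose_build_command(False))  # bdist_egg
```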
  • Pushing to the scrapyd server
    I will only cover the HTTP handling here; the core is a POST of the egg data to the addversion.json endpoint.
def _upload_egg(target, eggpath, project, version):
    with open(eggpath, "rb") as f:
        eggdata = f.read()
    data = {
        "project": project,
        "version": version,
        "egg": ("project.egg", eggdata),
    }
    body, content_type = encode_multipart_formdata(data)
    url = _url(target, "addversion.json")
    headers = {
        "Content-Type": content_type,
        "Content-Length": str(len(body)),
    }
    req = Request(url, body, headers)
    _add_auth_header(req, target)
    _log('Deploying to project "%s" in %s' % (project, url))
    return _http_post(req)
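To make the request shape concrete, here is a hand-rolled sketch of the multipart body that addversion.json expects: two plain fields (project, version) plus an egg file part. This is an assumption-laden illustration; the real client uses its own encode_multipart_formdata helper, and "myproject" and the egg bytes below are dummy values.

```python
# Sketch of a multipart/form-data body for addversion.json. Plain fields
# become simple form parts; a (filename, bytes) tuple becomes a file part.
import uuid

def encode_multipart(fields):
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode())
        if isinstance(value, tuple):  # (filename, bytes) -> file part
            filename, data = value
            lines.append(
                f'Content-Disposition: form-data; name="{name}"; '
                f'filename="{filename}"'.encode()
            )
            lines.append(b"")
            lines.append(data)
        else:  # plain form field
            lines.append(
                f'Content-Disposition: form-data; name="{name}"'.encode()
            )
            lines.append(b"")
            lines.append(value.encode())
    lines.append(f"--{boundary}--".encode())
    body = b"\r\n".join(lines) + b"\r\n"
    return body, f"multipart/form-data; boundary={boundary}"

body, content_type = encode_multipart(
    {
        "project": "myproject",
        "version": "1.0",
        "egg": ("project.egg", b"\x00dummy-egg-bytes"),
    }
)
```

The resulting body and content_type map directly onto the body and Content-Type header built in _upload_egg above.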

Notes

The above is a brief explanation of how scrapyd_client deploy works; a follow-up will cover the internal runtime together with scrapyd's spider scheduling.

References

https://github.com/scrapy/scrapyd-client
https://packaging.python.org/en/latest/specifications/entry-points/

posted on 2024-07-07 05:48 荣锋亮