A brief note on scrapyd's internal handling when running an egg

I previously gave a brief walkthrough of scrapyd-client's deploy handling; this note looks at what scrapyd does internally when it runs the deployed egg.

Internal handling

  • Activating the egg module
    scrapyd wraps its own helper for this; the activated egg's settings module is added to the runner's execution environment:
import os

import pkg_resources


def activate_egg(eggpath):
    """Activate a Scrapy egg file. This is meant to be used from egg runners
    to activate a Scrapy egg file. Don't use it from other code as it may
    leave unwanted side effects.
    """
    try:
        d = next(pkg_resources.find_distributions(eggpath))
    except StopIteration:
        raise ValueError("Unknown or corrupt egg")
    d.activate()
    settings_module = d.get_entry_info('scrapy', 'settings').module_name
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_module)
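
For context, the reason get_entry_info('scrapy', 'settings') works is that the egg uploaded by scrapyd-deploy is built from a setup.py that registers the project's settings module under the "scrapy" entry-point group. A minimal sketch of such a setup.py (the project name here is a placeholder):

# Sketch of the setup.py behind a deployable Scrapy egg; scrapyd-deploy
# generates something along these lines when packaging the project.
from setuptools import setup, find_packages

setup(
    name="myproject",            # placeholder project name
    version="1.0",
    packages=find_packages(),
    entry_points={"scrapy": ["settings = myproject.settings"]},
)
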
  • Where activate_egg is used
    It is called from the runner when setting up the project environment (the runner is what actually executes a spider job; scrapyd feeds it jobs from a queue):
import os
import sys
from contextlib import contextmanager

from scrapy.utils.misc import load_object

from scrapyd.config import Config

# Uses the activate_egg helper shown above.


@contextmanager
def project_environment(project):
    config = Config()
    eggstorage_path = config.get(
        'eggstorage', 'scrapyd.eggstorage.FilesystemEggStorage'
    )
    eggstorage_cls = load_object(eggstorage_path)
    eggstorage = eggstorage_cls(config)
 
    eggversion = os.environ.get('SCRAPYD_EGG_VERSION', None)
    version, eggfile = eggstorage.get(project, eggversion)
    if eggfile:
        activate_egg(eggfile.name)
        eggfile.close()
 
    assert 'scrapy.conf' not in sys.modules, "Scrapy settings already loaded"
    yield
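
A rough sketch of what the default FilesystemEggStorage lookup amounts to, assuming the standard layout where eggs live under eggs_dir/<project>/<version>.egg and the newest version is used when none is requested (the real class sorts versions more carefully and returns an open file object):

# Rough sketch only, not scrapyd's actual implementation.
import os
from glob import glob

def find_egg(eggs_dir, project, version=None):
    if version is not None:
        path = os.path.join(eggs_dir, project, f"{version}.egg")
        return (version, path) if os.path.exists(path) else (None, None)
    candidates = sorted(glob(os.path.join(eggs_dir, project, "*.egg")))
    if not candidates:
        return None, None
    latest = candidates[-1]
    return os.path.splitext(os.path.basename(latest))[0], latest

print(find_egg("eggs", "myproject"))  # placeholder eggs_dir and project
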

With the project environment active, the runner then simply hands off to the execute() entry point provided by scrapy.cmdline:

def main():
    project = os.environ['SCRAPY_PROJECT']
    with project_environment(project):
        from scrapy.cmdline import execute
        execute()
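
To make the mechanism concrete, this is a sketch of invoking the runner by hand the way the launcher does; the project and spider names are placeholders, SCRAPY_PROJECT selects the project and SCRAPYD_EGG_VERSION optionally pins an egg version:

# Sketch: launching the runner manually, mimicking what the launcher does.
import os
import subprocess
import sys

env = dict(os.environ)
env["SCRAPY_PROJECT"] = "myproject"            # placeholder project name
# env["SCRAPYD_EGG_VERSION"] = "1623456789"    # optional: pin an egg version

subprocess.run(
    [sys.executable, "-m", "scrapyd.runner", "crawl", "myspider"],
    env=env,
    check=True,
)
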
  • Running the job
    Mechanically, running a job is just a Python CLI invocation: the launcher spawns the runner as a child process and passes the crawl arguments along:
    # Methods of scrapyd's Launcher service: _wait_for_project polls the
    # spider queue, _spawn_process starts the runner for the next message.
    def _wait_for_project(self, slot):
        poller = self.app.getComponent(IPoller)
        poller.next().addCallback(self._spawn_process, slot)
 
    def _spawn_process(self, message, slot):
        e = self.app.getComponent(IEnvironment)
        message.setdefault('settings', {})
        message['settings'].update(e.get_settings(message))
        msg = native_stringify_dict(message, keys_only=False)
        project = msg['_project']
        args = [sys.executable, '-m', self.runner, 'crawl']
        args += get_crawl_args(msg)
        env = e.get_environment(msg, slot)
        env = native_stringify_dict(env, keys_only=False)
        pp = ScrapyProcessProtocol(project, msg['_spider'], msg['_job'], env, args)
        pp.deferred.addBoth(self._process_finished, slot)
        reactor.spawnProcess(pp, sys.executable, args=args, env=env)
        self.processes[slot] = pp
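
get_crawl_args is what flattens the queued message into 'scrapy crawl' arguments. A simplified sketch of that translation (not scrapyd's exact implementation): spider arguments become -a key=value pairs and settings become -s key=value pairs.

# Simplified sketch of the message-to-arguments translation.
def crawl_args(message):
    msg = dict(message)
    args = [msg.pop("_spider")]
    msg.pop("_project", None)
    settings = msg.pop("settings", {})
    for key, value in msg.items():
        args += ["-a", f"{key}={value}"]
    for key, value in settings.items():
        args += ["-s", f"{key}={value}"]
    return args

print(crawl_args({
    "_project": "myproject", "_spider": "myspider", "_job": "abc123",
    "settings": {"LOG_LEVEL": "INFO"},
}))
# ['myspider', '-a', '_job=abc123', '-s', 'LOG_LEVEL=INFO']
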

Notes

At its core, scrapyd loads an egg by locating its distribution with pkg_resources.find_distributions and then activating it; job execution is driven by the spider queue, which the launcher polls at a fixed interval (poll_interval). For reference, the default configuration:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port   = 6800
username    =
password    =
prefix_header = x-forwarded-prefix
debug       = off
runner      = scrapyd.runner
jobstorage  = scrapyd.jobstorage.MemoryJobStorage
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
spiderqueue = scrapyd.spiderqueue.SqliteSpiderQueue
webroot     = scrapyd.website.Root
eggstorage  = scrapyd.eggstorage.FilesystemEggStorage
 
[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
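
The component entries here (eggstorage, spiderqueue, launcher, webroot, and so on) are dotted paths that scrapyd resolves into classes at startup, in the same way project_environment resolved the egg storage above. A minimal sketch of that kind of resolution, using importlib instead of scrapy's load_object:

# Minimal sketch of resolving a dotted path from the config into a class;
# scrapyd itself uses scrapy.utils.misc.load_object for this.
from importlib import import_module

def load_dotted(path):
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)

# Resolving this particular path requires scrapyd to be installed.
eggstorage_cls = load_dotted("scrapyd.eggstorage.FilesystemEggStorage")
print(eggstorage_cls)
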

References

https://packaging.pypa.io/en/stable/
https://zopeinterface.readthedocs.io/en/latest/README.html
https://pypi.org/project/packaging/
https://github.com/scrapy/scrapyd
https://scrapyd.readthedocs.io/en/stable/overview.html
