A brief look at how scrapyd runs an egg internally
An earlier post briefly covered the scrapyd-client deploy flow; this one walks through what scrapyd does internally when it runs a deployed egg.
Internal processing
- Activating the egg module
scrapyd ships a small helper of its own for this: activating an egg adds the distribution to the running process's import path and records the project's settings module in the environment:
import os
import pkg_resources

def activate_egg(eggpath):
    """Activate a Scrapy egg file. This is meant to be used from egg runners
    to activate a Scrapy egg file. Don't use it from other code as it may
    leave unwanted side effects.
    """
    try:
        d = next(pkg_resources.find_distributions(eggpath))
    except StopIteration:
        raise ValueError("Unknown or corrupt egg")
    d.activate()  # adds the distribution to sys.path
    settings_module = d.get_entry_info('scrapy', 'settings').module_name
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_module)
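The 'scrapy'/'settings' entry point read above is baked into the egg at deploy time. A minimal sketch of the kind of setup.py that scrapyd-client generates for a project (myproject is a placeholder name):

from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    # get_entry_info('scrapy', 'settings') resolves this entry point
    entry_points={'scrapy': ['settings = myproject.settings']},
)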
- Where activate_egg is used
It is called from the runner when it prepares the project environment (the runner is what actually executes a spider job; scrapyd feeds it from a poll-driven queue):
import os
import sys
from contextlib import contextmanager

from scrapy.utils.misc import load_object

from scrapyd.config import Config
from scrapyd.eggutils import activate_egg

@contextmanager
def project_environment(project):
    config = Config()
    eggstorage_path = config.get(
        'eggstorage', 'scrapyd.eggstorage.FilesystemEggStorage'
    )
    eggstorage_cls = load_object(eggstorage_path)
    eggstorage = eggstorage_cls(config)

    eggversion = os.environ.get('SCRAPYD_EGG_VERSION', None)
    version, eggfile = eggstorage.get(project, eggversion)
    if eggfile:
        activate_egg(eggfile.name)
        eggfile.close()

    assert 'scrapy.conf' not in sys.modules, "Scrapy settings already loaded"
    yield
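The default FilesystemEggStorage keeps eggs under eggs_dir as <project>/<version>.egg, and get() falls back to the newest version when none is pinned. A quick sketch of using it directly, with a placeholder project name (the constructor and get() signature mirror their use in project_environment above):

from scrapyd.config import Config
from scrapyd.eggstorage import FilesystemEggStorage

storage = FilesystemEggStorage(Config())
# version=None should resolve to the latest stored version
version, eggfile = storage.get('myproject', None)
if eggfile:
    print(version, eggfile.name)
    eggfile.close()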
Once the environment is set up, scrapy's own cmdline entry point does the actual execution:

def main():
    project = os.environ['SCRAPY_PROJECT']
    with project_environment(project):
        from scrapy.cmdline import execute
        execute()

if __name__ == '__main__':
    main()  # makes python -m scrapyd.runner work
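Putting the two pieces together, the child process scrapyd spawns is roughly equivalent to running the module by hand with the right environment variables set. A hedged sketch, with placeholder project/spider names:

import os
import subprocess
import sys

env = dict(os.environ)
env['SCRAPY_PROJECT'] = 'myproject'    # read by main() above
env['SCRAPYD_EGG_VERSION'] = '1_0'     # optional: pin a specific egg version
subprocess.run(
    [sys.executable, '-m', 'scrapyd.runner', 'crawl', 'myspider'],
    env=env, check=True,
)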
- Job execution
Mechanically, this is just a Python CLI invocation with the crawl arguments passed through; the following methods are excerpted from scrapyd.launcher.Launcher:
def _wait_for_project(self, slot):
    poller = self.app.getComponent(IPoller)
    poller.next().addCallback(self._spawn_process, slot)

def _spawn_process(self, message, slot):
    e = self.app.getComponent(IEnvironment)
    message.setdefault('settings', {})
    message['settings'].update(e.get_settings(message))
    msg = native_stringify_dict(message, keys_only=False)
    project = msg['_project']
    args = [sys.executable, '-m', self.runner, 'crawl']
    args += get_crawl_args(msg)
    env = e.get_environment(msg, slot)
    env = native_stringify_dict(env, keys_only=False)
    pp = ScrapyProcessProtocol(project, msg['_spider'], msg['_job'], env, args)
    pp.deferred.addBoth(self._process_finished, slot)
    reactor.spawnProcess(pp, sys.executable, args=args, env=env)
    self.processes[slot] = pp
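These methods fire when a job reaches the front of the spider queue; jobs enter the queue through the schedule.json webservice listed in the config below. A short sketch of scheduling a run over HTTP (project/spider names are placeholders):

import requests

resp = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}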
Notes
At its core, scrapyd loads an egg by locating the distribution with pkg_resources.find_distributions and then activating it; job execution is driven by the spider queue, which the poller checks on a fixed interval (poll_interval). The default configuration, for reference:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port = 6800
username =
password =
prefix_header = x-forwarded-prefix
debug = off
runner = scrapyd.runner
jobstorage = scrapyd.jobstorage.MemoryJobStorage
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
spiderqueue = scrapyd.spiderqueue.SqliteSpiderQueue
webroot = scrapyd.website.Root
eggstorage = scrapyd.eggstorage.FilesystemEggStorage
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
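These defaults come from scrapyd's bundled default_scrapyd.conf and can be overridden by a scrapyd.conf found on the search path; the same Config class used in project_environment above reads them. A small sketch, assuming Config exposes the usual typed getters:

from scrapyd.config import Config

config = Config()
print(config.get('runner', 'scrapyd.runner'))
print(config.getint('http_port', 6800))
print(config.getfloat('poll_interval', 5.0))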
References
https://packaging.pypa.io/en/stable/
https://zopeinterface.readthedocs.io/en/latest/README.html
https://pypi.org/project/packaging/
https://github.com/scrapy/scrapyd
https://scrapyd.readthedocs.io/en/stable/overview.html