Dalian AI Computing Platform (Huawei Ascend AI platform), high-performance computing (HPC): the ssh launch mode of the dstart scheduler does not work
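According to Huawei's official documentation:
https://support.huawei.com/enterprise/zh/doc/EDOC1100228705/d1f5a239#ZH-CN_TOPIC_0000001212004449
if no launch agent is specified with --mca plm_rsh_agent, MPI on the HPC is launched over ssh by default. In practice this did not work, and the job failed with the following errors: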
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
[the two messages above repeat many times, interleaved, in the original output]
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:     dlhpcshare-agent-37
  target node: dlhpcshare-agent-25

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[dlhpcshare-agent-37:2299732] 22 more processes have sent help message help-errmgr-base.txt / no-path
[dlhpcshare-agent-37:2299732] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
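With output this repetitive, it can help to tally the distinct ssh messages before drawing conclusions. The following is only a minimal Python sketch, assuming the mpirun stderr has already been captured into a string; the function name tally_ssh_errors and the sample text are made up for illustration.

```python
from collections import Counter

# Substrings taken from the error output shown above.
SSH_ERROR_PATTERNS = {
    "askpass_missing": "ssh_askpass: exec(",
    "host_key_failed": "Host key verification failed.",
}


def tally_ssh_errors(stderr_text: str) -> Counter:
    """Count how often each known ssh failure message occurs in the text."""
    counts = Counter()
    for name, needle in SSH_ERROR_PATTERNS.items():
        counts[name] = stderr_text.count(needle)
    return counts


# Hypothetical sample; a real run would pass the captured stderr of mpirun.
sample = (
    "ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory\n"
    "Host key verification failed.\n"
    "Host key verification failed.\n"
)
print(tally_ssh_errors(sample))
```

Because it uses plain substring counting, the sketch also works when the log has been pasted as a single long line.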
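Conclusion:
This HPC platform has not set up passwordless ssh authentication between its compute nodes, so the nodes cannot authenticate to one another over ssh, which produces the errors above. The "Host key verification failed" lines additionally show that the nodes' host keys are not already trusted, and the ssh_askpass lines show that ssh had no way to prompt interactively. ssh therefore cannot be used for communication between compute nodes on this HPC; child nodes still have to be launched by specifying a launch agent with --mca plm_rsh_agent.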
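The conclusion above can be checked and acted on with a small script. This is a hedged Python sketch, not the platform's official procedure: the helper names are hypothetical, and /path/to/dstart is a placeholder for the launch agent, whose real path should be taken from the Huawei documentation. It relies only on the standard ssh options BatchMode and ConnectTimeout and on Open MPI's plm_rsh_agent MCA parameter.

```python
import shlex
import subprocess
from typing import List, Optional


def ssh_probe_cmd(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting.

    BatchMode=yes disables password prompts and host key confirmation
    prompts, so the command only succeeds if ssh works non-interactively.
    """
    return ["ssh", "-o", "BatchMode=yes",
            "-o", f"ConnectTimeout={timeout_s}", host, "true"]


def can_ssh_noninteractively(host: str, timeout_s: int = 5) -> bool:
    """Return True if `ssh host true` succeeds without any interaction."""
    try:
        result = subprocess.run(ssh_probe_cmd(host, timeout_s),
                                capture_output=True, text=True,
                                timeout=timeout_s + 5)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def mpirun_cmd(program: str, np: int, agent: Optional[str] = None) -> List[str]:
    """Build an mpirun command line, optionally overriding the launch agent."""
    cmd = ["mpirun", "-np", str(np)]
    if agent is not None:
        # plm_rsh_agent replaces the default ssh launcher.
        cmd += ["--mca", "plm_rsh_agent", agent]
    cmd.append(program)
    return cmd


# Placeholder agent path (assumption); ./a.out is a hypothetical MPI binary.
print(shlex.join(mpirun_cmd("./a.out", 4, agent="/path/to/dstart")))
```

Calling can_ssh_noninteractively("dlhpcshare-agent-25") from another compute node would be expected to return False on a platform in the state described above, which is the cue to pass an agent to mpirun_cmd rather than rely on the ssh default.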
=========================================================
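This blog is a record of the author's personal study and is not guaranteed to be original work. Some posts include the source address of reposted material, and some are compiled from several online sources; there are bound to be omissions where attribution is missing. If anything infringes your rights, please contact the author.
Unless otherwise marked, posts are original and are licensed under CC 4.0 BY-SA.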
Tags:
Inspur computing platform
posted on 2023-08-25 12:04 Angry_Panda