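Running on an Open MPI Cluster
With deployment done, my code runs correctly and the work really is spread across the cluster. While trying out a variety of programs, I ran into an error: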
$ ~/OpenMpi/bin/mpiexec -np 10 ~/NetWorkTest
My rank is 2
My rank is 7
My rank is 0
My rank is 3
My rank is 6
My rank is 8
My rank is 4
My rank is 1
My rank is 5
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18656,1],2]
  Exit code:    14
--------------------------------------------------------------------------
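What caused this? It turned out that I had accidentally commented out the MPI_Finalize() call. The output shows that one process, [[18656,1],2], returned early with a non-zero exit code (14), and the master process, mpiexec, caught it.
This reflects a fact: if one process in the job exits with an error, the whole set of processes is brought down with it.
After I added the MPI_Finalize() call back, the error went away.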
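The abort-on-failure behaviour seen above can be imitated without MPI at all. The following is only a rough sketch using plain POSIX fork/wait/kill, not how Open MPI is actually implemented; `run_job`, `bad_rank`, and `bad_code` are hypothetical names made up for this illustration. A toy launcher starts several workers, and as soon as one of them exits with a non-zero code it terminates the rest and reports that code.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64

/* Hypothetical toy launcher (NOT Open MPI code).
 * Starts nprocs workers; the worker whose rank equals bad_rank exits
 * immediately with bad_code (pass bad_rank = -1 for a healthy job).
 * Returns 0 if every worker succeeded, otherwise the first failing
 * worker's exit code (or 1 if that worker was killed by a signal). */
int run_job(int nprocs, int bad_rank, int bad_code)
{
    pid_t pids[MAX_PROCS];
    int alive[MAX_PROCS] = {0};
    int started = 0;
    int job_status = 0;

    assert(nprocs > 0 && nprocs <= MAX_PROCS);

    for (int rank = 0; rank < nprocs; rank++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            job_status = 1;          /* treat a failed launch as a job failure */
            break;
        }
        if (pid == 0) {              /* worker process */
            if (rank == bad_rank)
                _exit(bad_code);     /* simulated faulty rank */
            sleep(1);                /* simulated useful work */
            _exit(0);
        }
        pids[rank] = pid;
        alive[rank] = 1;
        started++;
    }

    /* If launching failed part-way, stop the workers already started. */
    if (job_status != 0)
        for (int r = 0; r < nprocs; r++)
            if (alive[r])
                kill(pids[r], SIGTERM);

    for (int remaining = started; remaining > 0; remaining--) {
        int status;
        pid_t done = wait(&status);
        if (done < 0)
            break;

        for (int r = 0; r < nprocs; r++)
            if (alive[r] && pids[r] == done)
                alive[r] = 0;        /* this worker has been reaped */

        int failed = !WIFEXITED(status) || WEXITSTATUS(status) != 0;
        if (failed && job_status == 0) {
            /* First failure: remember its code and abort everyone else. */
            job_status = WIFEXITED(status) ? WEXITSTATUS(status) : 1;
            for (int r = 0; r < nprocs; r++)
                if (alive[r])
                    kill(pids[r], SIGTERM);
        }
    }
    return job_status;
}
```

Calling `run_job(10, 2, 14)` mirrors the run above: the worker with rank 2 exits with code 14, the remaining workers are terminated, and the launcher returns 14. With `bad_rank` set to -1 every worker finishes and the launcher returns 0.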