Installing and configuring Slurm 19.05 from source on Ubuntu 18.04
Notes:
- The version available from apt on Ubuntu 18.04 is only Slurm 17, and it fails at runtime with errors such as "srun: fatal: ../../../src/api/step_launch.c:1038 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy". This appears to be a bug in that release's source, and since apt offers nothing newer than Slurm 17, Slurm has to be installed from source. The log:
srun: fatal: ../../../src/api/step_launch.c:1038 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy
slurm.pl: 1 / 40 failed, log is in projects/seame/exp/mono0a/log/acc.9.*.log
- This article only covers installing and configuring a single node that acts as both the Slurm control node and the compute node.
Environment:
- Host: dell-PowerEdge-T630
- OS: Ubuntu 18.04.2 LTS (GNU/Linux 4.18.0-25-generic x86_64)
Source download:
- Download page: https://www.schedmd.com/downloads.php
- Slurm 19.05: https://download.schedmd.com/slurm/slurm-19.05.1-2.tar.bz2
Installation:
step1: install the required packages:
$ apt-get update
$ apt-get install git gcc make ruby ruby-dev libpam0g-dev libmariadb-client-lgpl-dev libmysqlclient-dev
$ gem install fpm
- fpm is installed via gem; it is used later to package the build output into a .deb, so do not skip it.
- munge also needs to be installed; the details are not covered here, but a minimal sketch follows below.
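A minimal sketch for installing and verifying munge through apt (package and service names are the Ubuntu 18.04 defaults; adjust if your environment differs):
$ apt-get install munge libmunge2 libmunge-dev
$ systemctl enable munge
$ systemctl start munge
$ munge -n | unmunge
- munge -n | unmunge encodes a credential locally and decodes it again; if it reports STATUS: Success (0), munge is working. The Ubuntu package normally generates /etc/munge/munge.key on install; it must be owned by the munge user and not be readable by others.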
step2: clone the mknoxnv ubuntu-slurm repository
$ cd /storage
$ git clone https://github.com/mknoxnv/ubuntu-slurm.git
step3: download the source tarball, build, and install
$ cd /storage
$ wget https://download.schedmd.com/slurm/slurm-19.05.1-2.tar.bz2
$ tar xvjf slurm-19.05.1-2.tar.bz2
$ cd slurm-19.05.1-2
$ ./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
$ make
$ make contrib
$ make install
make install by itself does not produce a working installation here; instead, the staged build tree in /tmp/slurm-build is packaged with fpm and then installed as a .deb with dpkg:
$ cd ..
$ fpm -s dir -t deb -v 1.0 -n slurm-19.05.1 --prefix=/usr -C /tmp/slurm-build .
$ dpkg -i slurm-19.05.1_1.0_amd64.deb
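A quick sanity check that the packaged binaries landed under /usr (illustrative; these commands do not need any Slurm configuration yet):
$ which slurmctld slurmd srun
$ slurmd -V
$ srun --version
- Both version commands should report slurm 19.05.1-2.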
step4: create the slurm user and set the ownership of the relevant directories:
$ useradd slurm
$ mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
$ chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
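To confirm the ownership matches what slurmctld and slurmd will expect (a simple check only):
$ ls -ld /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
- Each line should show slurm as the owner.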
step5: set up the service files and configuration
Copy the service files and the example slurm.conf:
$ cd /storage
$ cp ubuntu-slurm/slurmctld.service /etc/systemd/system/
$ cp ubuntu-slurm/slurmd.service /etc/systemd/system/
$ cp ubuntu-slurm/slurm.conf /etc/slurm
Enable and start the services:
$ systemctl daemon-reload
$ systemctl enable slurmctld
$ systemctl start slurmctld
$ systemctl enable slurmd
$ systemctl start slurmd
At this point the services usually fail to start, most often because the configuration file is incomplete or a parameter is set incorrectly (a troubleshooting sketch follows below). The next part describes the relevant configuration options and how to write them; the parameters covered there are not exhaustive.
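If a service does not come up, the following commands usually show why (a general troubleshooting sketch, not specific to this machine):
$ systemctl status slurmctld slurmd
$ journalctl -u slurmctld -u slurmd -e
$ slurmctld -D -vvv
$ slurmd -D -vvv
$ tail -n 50 /var/log/slurmctld.log /var/log/slurmd.log
- Running the daemons in the foreground with -D -vvv prints configuration errors directly to the terminal, which is usually the fastest way to spot a wrong path or parameter in slurm.conf.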
Configuration:
- In this setup, ControlMachine must be set to the machine's hostname (an IP address also works).
- SlurmUser must be slurm.
- The directories referenced by StateSaveLocation, SlurmdSpoolDir, SlurmctldPidFile, SlurmdPidFile, SlurmctldLogFile, and SlurmdLogFile must match the directories created earlier. Different guides use different paths; whichever you choose, the configuration file has to agree with what actually exists on your machine.
- For SelectType and SelectTypeParameters, see the official documentation: https://slurm.schedmd.com/
- Compute node declaration: NodeName=dell-PowerEdge-T630 CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4. Run slurmd -C to print the hardware parameters of the current machine, as shown in the sketch below.
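slurmd -C already prints a ready-made NodeName line based on the detected hardware; the output below is illustrative (RealMemory and UpTime will differ on your machine):
$ slurmd -C
NodeName=dell-PowerEdge-T630 CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=257000
UpTime=3-04:12:36
The complete slurm.conf used on this node: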
ClusterName=compute-cluster
ControlMachine=dell-PowerEdge-T630
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
PluginDir=/usr/lib/slurm
ProctrackType=proctrack/linuxproc
CacheGroups=0
ReturnToService=2
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/affinity
TaskPluginParam=Sched
KillOnBadExit=1
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
FastSchedule=1
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
# COMPUTE NODES
GresTypes=gpu
NodeName=dell-PowerEdge-T630 CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
PartitionName=mipitalk Nodes=ALL Default=YES MaxTime=INFINITE State=UP
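Because the node is declared with GresTypes=gpu and Gres=gpu:4, slurmd also expects a gres.conf under /etc/slurm. A minimal sketch, assuming the four GPUs are NVIDIA cards exposed as /dev/nvidia0 through /dev/nvidia3 (adjust the device paths to your hardware):
$ cat /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia[0-3]
After restarting both services, a quick end-to-end check (illustrative):
$ systemctl restart slurmctld slurmd
$ sinfo
$ srun -N1 hostname
- sinfo should list the mipitalk partition with the node in the idle state, and srun should print the hostname, confirming that jobs can be scheduled and launched.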