Installing a Slurm Cluster on Linux
Installation Planning
SLURM (Simple Linux Utility for Resource Management) is an open-source, high-performance, scalable cluster management and job scheduling system, widely used on large compute clusters and supercomputers. It manages a cluster's compute resources (CPUs, memory, GPUs, and so on) and schedules jobs according to user requirements, improving overall cluster utilization.
Master (control) node:
- 172.16.45.29 (920)
Compute nodes:
- 172.16.45.2 (920)
- 172.16.45.4 (920)
This guide uses CentOS 8 and similar rpm-based Linux distributions as the example.
Create Accounts
#! Remove the database
yum remove mariadb-server mariadb-devel -y
#! Remove Slurm and Munge
yum remove slurm munge munge-libs munge-devel -y
#! Remove the users
userdel -r slurm
userdel -r munge
#! Create the users
export MUNGEUSER=1051
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1052
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
#! Passwordless SSH login # run on the control node https://builtin.com/articles/ssh-without-password
ssh-keygen
#! Copy the public key to the compute nodes
ssh-copy-id 172.16.45.2
ssh-copy-id 172.16.45.4
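A quick check that passwordless login works from the control node (the hostnames returned will vary in your environment):
ssh 172.16.45.2 hostname
ssh 172.16.45.4 hostname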
Munge
Munge is an authentication service for creating and validating user credentials, used mainly in large high-performance computing (HPC) clusters. It is designed to be highly scalable and to provide secure, reliable authentication in complex cluster environments.
What Munge Does
- Authentication
Munge allows a process to authenticate another local or remote process within a group of hosts that share common users (UIDs) and groups (GIDs). These hosts form a security realm that shares a secret key.
- Security realms
Munge manages trust between hosts by defining security realms. Hosts within the same realm trust one another, while hosts in different realms require additional authentication.
- Simplified identity management
Munge simplifies identity management in an HPC cluster: with Munge, administrators can avoid configuring complex SSH keys or Kerberos on every node.
How Munge Works
Munge authenticates by generating and validating credentials. When one process needs to access another, it requests a credential from the Munge service; the service verifies the requester's identity and issues a credential containing the requester's UID, GID, and other metadata. The target process then validates this credential to confirm the requester's identity.
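As a small illustration (assuming munged is already running, which the installation steps below set up), a credential can be created and decoded locally; the decoded output shows the UID/GID embedded in it:
munge -n | unmunge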
Advantages of Munge
- High performance: Munge is designed to handle large volumes of authentication requests.
- Scalability: Munge scales easily to large clusters.
- Security: Munge provides multiple security mechanisms to prevent unauthorized access.
- Ease of use: Munge is relatively simple to configure and manage.
Installation
#! All nodes
yum install epel-release -y
yum install munge munge-libs munge-devel -y
Generate the secret key on the management node
yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
scp /etc/munge/munge.key root@172.16.45.2:/etc/munge
scp /etc/munge/munge.key root@172.16.45.4:/etc/munge
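Optionally confirm the key is byte-identical on every node before starting munge (a sketch, assuming root SSH access to the compute nodes):
md5sum /etc/munge/munge.key
ssh 172.16.45.2 md5sum /etc/munge/munge.key
ssh 172.16.45.4 md5sum /etc/munge/munge.key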
#! All nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
#! Test from the master node
# munge -n
MUNGE:AwQFAAD9xUgg77lK2Ts72xayqCe4IETD9sp4ZEJD8ZTCbDekcojBef1fveBK8YweUi/7ImJMUdw3rO+gl3P02K5cHJAJX0Xq74rhW+1EgZgJZcIxHy4Z3qmsPWk4rVzhJfKGgUQ=:
# munge -n | munge
MUNGE:AwQFAACLbOsTGZWeENLUthY0WyyVWQ1HVEBbGIWEAobpAaLI2T1oMbHKjMO6zOvCTIKZcEPB/0CBhYxbpekFQwK7jeN7RMIxuZ+9dZFUF6jLEh0gbiLIpvgL1z3kGGwZNR+FMR6D/b1pUFPL4Mt9QQd4zjAIOvVnWCoXyE3XTfI64ZIbGJCZypMRj6nD7G2zgEVQ+v23vSPb81mnfC7ne1FaLIdNu9Iy8ZsESaxXJDrVoKFf/3Nax+Iw/LvauIbjF/Ps/Ok6aDcIAoPbOFWfbO7L2rovQzHt/3ABwwzH4yOGDdj9aWyqcyuqegDp/d8l6iJ7TIg=:
# munge -n | ssh 172.16.45.2 unmunge
Authorized users only. All activities may be monitored and reported.
STATUS: Success (0)
ENCODE_HOST: ??? (172.16.45.29)
ENCODE_TIME: 2024-12-10 16:16:55 +0800 (1733818615)
DECODE_TIME: 2024-12-10 16:16:52 +0800 (1733818612)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
Install Slurm
#! All nodes
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad perl-ExtUtils-MakeMaker perl-devel gcc mariadb-devel pam-devel rpm-build -y
wget https://download.schedmd.com/slurm/slurm-24.05.4.tar.bz2
rpmbuild -ta slurm-24.05.4.tar.bz2
cd /root/rpmbuild/RPMS/aarch64/
yum --nogpgcheck localinstall * -y
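A quick sanity check that the Slurm packages were installed (optional):
rpm -qa | grep slurm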
mkdir -p /var/log/slurm/
chown slurm: /var/log/slurm/
# vi /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=Donau(172.16.45.29)
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rabbitmq-node1 NodeAddr=172.16.45.2 CPUs=128 State=UNKNOWN
NodeName=gczxagenta2 NodeAddr=172.16.45.4 CPUs=128 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
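slurm.conf must be identical on all nodes. One way to distribute it from the control node (a sketch, assuming root SSH access to the compute nodes):
scp /etc/slurm/slurm.conf root@172.16.45.2:/etc/slurm/
scp /etc/slurm/slurm.conf root@172.16.45.4:/etc/slurm/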
Control node
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
Compute nodes
mkdir /var/spool/slurm
chown slurm: /var/spool/slurm
chmod 755 /var/spool/slurm
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
Test the configuration on all nodes:
# slurmd -C # confirm there is no error output
NodeName=rabbitmq-node1 CPUs=128 Boards=1 SocketsPerBoard=128 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=514413
UpTime=12-07:19:32
# yum install ntp -y
# chkconfig ntpd on
# ntpdate pool.ntp.org
# systemctl start ntpd
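The ntp commands above keep cluster clocks in sync, which Munge and Slurm both rely on. On CentOS 8 the ntp package may be missing from the default repositories; chrony is a common alternative (a sketch):
yum install chrony -y
systemctl enable --now chronyd
chronyc makestep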
Compute nodes
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
# The management node has not been started yet, so errors at this point are expected.
Install MariaDB on the Master Node
yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
mysql
MariaDB[(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB[(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB[(none)]> FLUSH PRIVILEGES;
MariaDB[(none)]> CREATE DATABASE slurm_acct_db;
MariaDB[(none)]> quit;
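To confirm the slurm account can reach the new database with the password granted above (optional check):
mysql -u slurm -p1234 -e "SHOW DATABASES;"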
# vi /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
# systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
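Optionally verify that the new InnoDB settings took effect after the restart (1024M should be reported as 1073741824 bytes):
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"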
# vim /etc/slurm/slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db
# chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurm/slurmdbd.log
chown slurm: /var/log/slurm/slurmdbd.log
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
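If slurmctld or slurmdbd fails to start, their logs are the quickest place to look; scontrol ping also reports whether the controller is responding (optional checks):
scontrol ping
tail -n 20 /var/log/slurm/slurmctld.log
tail -n 20 /var/log/slurm/slurmdbd.log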
Verification
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 2 idle gczxagenta2,rabbitmq-node1
# srun -N2 -l /bin/hostname
0: gczxagenta2
1: rabbitmq-node1
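A batch submission also makes a good end-to-end test; the script name and output path below are arbitrary examples (a sketch):
cat > test_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --output=test_%j.out
srun hostname
EOF
sbatch test_job.sh
squeue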
Common Pitfalls
- fatal error: EXTERN.h: running yum -y install perl-devel usually resolves this.
- Do not deploy the management node and a compute node on the same machine.
- munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
munge can be strict about log directory and file permissions at startup; for example, set them to 755.
- error: auth_p_get_host: Lookup failed for 172.16.45.34
Add IP-to-hostname mappings to the hosts file, for example:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.45.29 Donau
172.16.45.18 rabbitmq-node2
172.16.45.2 rabbitmq-node1
172.16.45.34 Donau2
172.16.45.4 gczxagenta2
- error: Configured MailProg is invalid: this error can be ignored.
- _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults: this message can be ignored.
- srun: error: Task launch for StepId=12.0 failed on node : Invalid node
Check whether node IPs or names are duplicated; see the scontrol sketch below.
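For node-related errors such as the last one, scontrol can show a node's state and bring a drained or down node back online (the node name below is one of this cluster's compute nodes, as an example):
scontrol show node rabbitmq-node1
scontrol update NodeName=rabbitmq-node1 State=RESUME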