Installing a Slurm Cluster on Linux

Installation Planning

SLURM (Simple Linux Utility for Resource Management) is an open-source, high-performance, scalable cluster management and job scheduling system widely used on large compute clusters and supercomputers. It manages a cluster's compute resources (CPUs, memory, GPUs, and so on) and schedules jobs according to user requirements, improving overall cluster utilization.

  • master (control) node:

    • 172.16.45.29 (920)
  • compute nodes:

    • 172.16.45.2 (920)
    • 172.16.45.4 (920)

This walkthrough uses CentOS 8 as the example; other RPM-based Linux distributions are similar.

Create Accounts


#! Remove the database
yum remove mariadb-server mariadb-devel -y

#! Remove Slurm and Munge
yum remove slurm munge munge-libs munge-devel -y

#! Remove the users
userdel -r slurm
userdel -r munge

#! Create the users
export MUNGEUSER=1051
groupadd -g $MUNGEUSER munge
useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
export SLURMUSER=1052
groupadd -g $SLURMUSER slurm
useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm

#! Passwordless SSH login # run on the control node https://builtin.com/articles/ssh-without-password
ssh-keygen 

#! Copy the key to the compute nodes
ssh-copy-id 172.16.45.2
ssh-copy-id 172.16.45.4
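
Before moving on, it is worth confirming that passwordless login actually works; a quick check against the two compute nodes listed above:

for h in 172.16.45.2 172.16.45.4; do ssh $h hostname; done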

Munge

Munge is an authentication service for creating and validating user credentials, used mainly in large high-performance computing (HPC) clusters. It is designed to be highly scalable and to provide secure, reliable authentication in complex cluster environments.

https://github.com/dun/munge

What Munge Does

  • Authentication

Munge lets a process authenticate another local or remote process within a group of hosts that share the same users (UIDs) and groups (GIDs). These hosts form a security realm that shares a secret key.

  • Security realms

Munge manages trust between hosts through security realms: hosts within the same realm trust one another, while hosts in different realms require additional authentication.

  • Simplified identity management

Munge simplifies identity management in an HPC cluster: administrators can avoid configuring complicated SSH keys or Kerberos on every node.

How Munge Works

Munge performs authentication by generating and validating credentials. When one process needs to authenticate to another, it asks the munged daemon for a credential; munged verifies the requester's identity and issues a credential containing the requester's UID, GID, and some additional metadata. The receiving process then validates that credential to confirm the requester's identity.
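
For example, once munged is running (installation below), a credential can be minted and validated locally in a single pipeline:

munge -n | unmunge   # encode an empty credential, then decode and verify it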

Munge's Advantages

  • High performance: Munge is designed to handle large volumes of authentication requests.
  • Scalability: Munge extends easily to large clusters.
  • Security: Munge provides several security mechanisms to prevent unauthorized access.
  • Ease of use: Munge is relatively simple to configure and manage.

Installation

#! All nodes
yum install epel-release -y
yum install munge munge-libs munge-devel -y

Generate the secret key on the management node

yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
# The dd below regenerates the key from /dev/urandom, overwriting the one
# created above; either create-munge-key or dd alone is sufficient.
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

scp /etc/munge/munge.key root@172.16.45.2:/etc/munge
scp /etc/munge/munge.key root@172.16.45.4:/etc/munge
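#! Optional: verify the key is byte-identical on every node before starting munge
md5sum /etc/munge/munge.key
ssh root@172.16.45.2 md5sum /etc/munge/munge.key
ssh root@172.16.45.4 md5sum /etc/munge/munge.key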
#! All nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
#! Test on the master node
# munge -n
MUNGE:AwQFAAD9xUgg77lK2Ts72xayqCe4IETD9sp4ZEJD8ZTCbDekcojBef1fveBK8YweUi/7ImJMUdw3rO+gl3P02K5cHJAJX0Xq74rhW+1EgZgJZcIxHy4Z3qmsPWk4rVzhJfKGgUQ=:
# munge -n | munge
MUNGE:AwQFAACLbOsTGZWeENLUthY0WyyVWQ1HVEBbGIWEAobpAaLI2T1oMbHKjMO6zOvCTIKZcEPB/0CBhYxbpekFQwK7jeN7RMIxuZ+9dZFUF6jLEh0gbiLIpvgL1z3kGGwZNR+FMR6D/b1pUFPL4Mt9QQd4zjAIOvVnWCoXyE3XTfI64ZIbGJCZypMRj6nD7G2zgEVQ+v23vSPb81mnfC7ne1FaLIdNu9Iy8ZsESaxXJDrVoKFf/3Nax+Iw/LvauIbjF/Ps/Ok6aDcIAoPbOFWfbO7L2rovQzHt/3ABwwzH4yOGDdj9aWyqcyuqegDp/d8l6iJ7TIg=:
# munge -n | ssh 172.16.45.2  unmunge

Authorized users only. All activities may be monitored and reported.
STATUS:          Success (0)
ENCODE_HOST:     ??? (172.16.45.29)
ENCODE_TIME:     2024-12-10 16:16:55 +0800 (1733818615)
DECODE_TIME:     2024-12-10 16:16:52 +0800 (1733818612)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0

Install Slurm

#! All nodes
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad perl-ExtUtils-MakeMaker perl-devel  gcc mariadb-devel  pam-devel rpm-build -y

wget https://download.schedmd.com/slurm/slurm-24.05.4.tar.bz2

rpmbuild -ta slurm-24.05.4.tar.bz2

cd /root/rpmbuild/RPMS/aarch64/

yum --nogpgcheck localinstall * -y
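
The block above can simply be repeated on every node; alternatively, build the RPMs once on the control node and push them out, a sketch using the compute-node IPs from the plan:

for h in 172.16.45.2 172.16.45.4; do
  ssh root@$h mkdir -p /tmp/slurm-rpms
  scp /root/rpmbuild/RPMS/aarch64/*.rpm root@$h:/tmp/slurm-rpms/
  ssh root@$h "yum --nogpgcheck localinstall /tmp/slurm-rpms/*.rpm -y"
done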

mkdir -p /var/log/slurm/
chown slurm: /var/log/slurm/

# vi /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=Donau(172.16.45.29)
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rabbitmq-node1 NodeAddr=172.16.45.2 CPUs=128 State=UNKNOWN
NodeName=gczxagenta2 NodeAddr=172.16.45.4 CPUs=128 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
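
As the header comment says, slurm.conf must be identical on all nodes; copy it out from the control node once it is written:

scp /etc/slurm/slurm.conf root@172.16.45.2:/etc/slurm/
scp /etc/slurm/slurm.conf root@172.16.45.4:/etc/slurm/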

Control node

mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log

Compute nodes

mkdir /var/spool/slurm
chown slurm: /var/spool/slurm
chmod 755 /var/spool/slurm
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log

Test the configuration on all nodes:

# slurmd -C # confirm there is no error output
NodeName=rabbitmq-node1 CPUs=128 Boards=1 SocketsPerBoard=128 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=514413
UpTime=12-07:19:32

Keep the clocks on all nodes synchronized; Munge credentials carry a TTL (300 seconds in the output above) and fail to decode if clocks drift too far apart:

# yum install ntp -y
# chkconfig ntpd on
# ntpdate pool.ntp.org
# systemctl start ntpd
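
Note that CentOS 8 dropped the ntp package from its default repositories in favor of chrony; an equivalent setup with chrony would be:

yum install chrony -y
systemctl enable --now chronyd
chronyc sources   # confirm time sources are reachable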

Compute nodes

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
# The management node is not running yet, so an error here is expected.

Install MariaDB on the master node

yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
mysql
MariaDB[(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB[(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB[(none)]> FLUSH PRIVILEGES;
MariaDB[(none)]> CREATE DATABASE slurm_acct_db;
MariaDB[(none)]> quit;
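
To confirm the grant works, connect as the slurm user (the password 1234 matches the GRANT statement above):

mysql -u slurm -p1234 -e "SHOW DATABASES;"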

# vi  /etc/my.cnf.d/innodb.cnf

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

# systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb

# vim /etc/slurm/slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db

chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurm/slurmdbd.log
chown slurm: /var/log/slurm/slurmdbd.log

systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd

systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
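
Note that the slurm.conf above still has AccountingStorageType=accounting_storage/none, so slurmctld will not actually record anything through slurmdbd. To use the database, change it to accounting_storage/slurmdbd (with AccountingStorageHost pointing at the control node) and register the cluster, whose name must match ClusterName in slurm.conf:

sacctmgr add cluster cluster
sacctmgr show cluster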

Verification

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle gczxagenta2,rabbitmq-node1
# srun -N2 -l /bin/hostname
0: gczxagenta2
1: rabbitmq-node1
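
Beyond srun, a minimal batch script exercises sbatch and the queue end to end (the file name and job name here are arbitrary):

# cat > hello.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out
srun hostname
EOF
# sbatch hello.sh
# squeue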

A Ton of Pitfalls

  • fatal error: EXTERN.h: running yum -y install perl-devel usually fixes this

  • Do not deploy the management node and a compute node on the same machine

  • munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
    At startup munged can be strict about permissions on the log path; set them to, for example, 755.

  • error: auth_p_get_host: Lookup failed for 172.16.45.34

Add IP-to-hostname mappings to the hosts file, for example:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.45.29 Donau
172.16.45.18 rabbitmq-node2
172.16.45.2 rabbitmq-node1
172.16.45.34 Donau2
172.16.45.4 gczxagenta2
  • error: Configured MailProg is invalid. This error needs no action.

  • _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults. This message needs no action.

  • srun: error: Task launch for StepId=12.0 failed on node : Invalid node

Check whether any node IPs, etc., are duplicated; the sketch below helps inspect what the controller sees.
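
A quick way to audit the addresses and states the controller has recorded for each node:

scontrol show nodes | grep -E 'NodeName|NodeAddr|State'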
