Setting Up a Slurm Cluster

Reference: https://www.wanghaiqing.com/article/911f5d98-b68a-4daa-8db6-ee2052ec8275/

 

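Slurm is an open-source job scheduler for Linux and Unix, used by many of the world's supercomputers. Its main functions are:

1. Allocating compute-node resources to users so that they can run jobs;

2. Providing a framework for starting, executing, and monitoring jobs (usually parallel jobs) on the set of allocated nodes;

3. Arbitrating resource contention by managing a queue of pending jobs.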

 

Slurm architecture

 

Environment

Role            IP           Hostname  OS          Specs
Control node    172.18.7.31  master    CentOS 7.9  8 cores, 16 GB
Compute node 1  172.18.7.32  node01    CentOS 7.9  8 cores, 32 GB
Compute node 2  172.18.7.33  node02    CentOS 7.9  8 cores, 32 GB

 

1. Base environment (run on all machines unless noted otherwise)

Disable the firewall and SELinux

systemctl stop firewalld
systemctl disable firewalld
sed -i -e 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
setenforce 0

 

Switch to the Aliyun yum mirrors

rm -rf /etc/yum.repos.d/*
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
yum clean all
yum makecache fast -y

 

Or use the company's internal CentOS 7 mirror

rm -rf /etc/yum.repos.d/*
cat > /etc/yum.repos.d/centos7.repo << EOF
[base]
name=base
baseurl=http://172.18.0.61/centos7/base
enabled=1
gpgcheck=0
[extras]
name=extras
baseurl=http://172.18.0.61/centos7/extras
enabled=1
gpgcheck=0
[updates]
name=updates
baseurl=http://172.18.0.61/centos7/updates
enabled=1
gpgcheck=0
[epel]
name=epel
baseurl=http://172.18.0.61/centos7/epel
enabled=1
gpgcheck=0
EOF
yum clean all
yum makecache fast -y

 

Set the hostname. Hostnames must be unique (run the matching command on each machine)

hostnamectl set-hostname master
hostnamectl set-hostname node01
hostnamectl set-hostname node02

 

Add the nodes to /etc/hosts

cat >> /etc/hosts << EOF
172.18.7.31 master
172.18.7.32 node01
172.18.7.33 node02
EOF
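
Because `cat >> /etc/hosts` appends unconditionally, running the step twice leaves duplicate entries. A minimal Python sketch of an idempotent variant follows; the `NODES` inventory and both function names are illustrative assumptions that mirror the environment table, not part of the original procedure.

```python
# Hypothetical inventory mirroring the environment table (hostname -> IP).
NODES = {
    "master": "172.18.7.31",
    "node01": "172.18.7.32",
    "node02": "172.18.7.33",
}


def render_hosts(nodes):
    """Return /etc/hosts lines for the inventory, rejecting duplicate IPs."""
    ips = list(nodes.values())
    if len(set(ips)) != len(ips):
        raise ValueError("duplicate IP address in inventory")
    return ["{} {}".format(ip, name) for name, ip in nodes.items()]


def missing_entries(nodes, hosts_text):
    """Return the inventory lines that are not yet present in hosts_text."""
    # Normalize whitespace so "ip   name" and "ip name" compare equal.
    present = {" ".join(line.split()) for line in hosts_text.splitlines()}
    return [line for line in render_hosts(nodes) if line not in present]


if __name__ == "__main__":
    with open("/etc/hosts") as f:
        for line in missing_entries(NODES, f.read()):
            print(line)  # append only these lines to /etc/hosts
```

Only the lines printed by the script still need to be appended, so the step can be repeated safely.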

 

Speed up SSH logins

echo "UseDNS no" >> /etc/ssh/sshd_config
systemctl restart sshd

 

Install packages

yum -y install net-tools wget vim ntpdate chrony htop glances nfs-utils rpcbind python3

 

Time synchronization with ntpdate

# Option 1: public time server
ntpdate time1.aliyun.com
echo "*/5 * * * * /usr/sbin/ntpdate time1.aliyun.com" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc
# Option 2: internal time server
ntpdate 172.18.0.162
echo "*/5 * * * * /usr/sbin/ntpdate 172.18.0.162" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc

 

 

Configure passwordless SSH

# Run on the control node
echo y| ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node01
ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node02

 

 

2. Configure Munge (run on all machines unless noted otherwise)

 

Create the Munge user

The munge user must have the same UID and GID on the master node and on every compute node, and Munge must be installed on all nodes.

groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
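
The UID/GID requirement can be checked mechanically. The sketch below is an assumption-laden helper rather than part of the original procedure: it expects the output of `getent passwd munge` collected from each node (for example over SSH) and reports the nodes whose IDs differ from 1108:1108.

```python
def parse_passwd_entry(line):
    """Extract (uid, gid) from one passwd-format line."""
    fields = line.strip().split(":")
    return int(fields[2]), int(fields[3])


def inconsistent_nodes(entries, expected=(1108, 1108)):
    """entries maps node name -> passwd line; return nodes whose IDs differ."""
    return sorted(
        node for node, line in entries.items()
        if parse_passwd_entry(line) != expected
    )


if __name__ == "__main__":
    # Sample data; in practice collect `getent passwd munge` from each node.
    sample = {
        "master": "munge:x:1108:1108:Munge:/var/lib/munge:/sbin/nologin",
        "node01": "munge:x:1108:1108:Munge:/var/lib/munge:/sbin/nologin",
        "node02": "munge:x:997:995:Munge:/var/lib/munge:/sbin/nologin",
    }
    print(inconsistent_nodes(sample))
```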

 

Feed the entropy pool

# Install rng-tools
yum install -y rng-tools
# Use /dev/urandom as the entropy source
rngd -r /dev/urandom
sed -i 's#^ExecStart.*#ExecStart=/sbin/rngd -f -r /dev/urandom#g' /usr/lib/systemd/system/rngd.service
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
systemctl status rngd

 

 

Install Munge. Munge is an authentication service that validates the UID and GID of processes on local or remote hosts.

yum install munge munge-libs munge-devel -y

 

Create the cluster-wide key on the master node

# Run on the control node
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

 

Copy the key to every compute node

# Run on the control node
scp -p /etc/munge/munge.key root@node01:/etc/munge
scp -p /etc/munge/munge.key root@node02:/etc/munge
# Run on the compute nodes
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
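
A quick way to confirm the ownership and mode of the copied key is sketched below. The function name `key_problems` is hypothetical; the check simply mirrors the `chown munge:` and `chmod 400` commands above.

```python
import os
import pwd
import stat


def key_problems(path, owner_uid):
    """Return a list of problems with the key file at path (empty if fine)."""
    st = os.stat(path)
    problems = []
    if st.st_uid != owner_uid:
        problems.append("wrong owner")
    # Any permission bit for group or others is too permissive for a key.
    if stat.S_IMODE(st.st_mode) & 0o077:
        problems.append("accessible by group or others")
    return problems


if __name__ == "__main__":
    munge_uid = pwd.getpwnam("munge").pw_uid
    print(key_problems("/etc/munge/munge.key", munge_uid))
```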

 

Start Munge on all nodes

systemctl restart munge
systemctl enable munge
systemctl status munge

 

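Test Munge: verify that each compute node can authenticate against the control node

# Generate a credential locally
munge -n
# Decode locally
munge -n | unmunge
# Decode remotely on a compute node
munge -n | ssh node01 unmunge
# Benchmark Munge credential throughput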
remunge
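
The remote decode test can be repeated for every compute node. The following sketch assumes passwordless SSH from the control node is already configured and that `unmunge` prints a `STATUS:` line as usual; `check_node` and `unmunge_ok` are illustrative names.

```python
import subprocess


def unmunge_ok(output):
    """True if unmunge output reports a successful decode."""
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "STATUS":
            return value.strip().startswith("Success")
    return False


def check_node(node):
    """Encode a credential locally and decode it on node over SSH."""
    proc = subprocess.run(
        "munge -n | ssh {} unmunge".format(node),
        shell=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode == 0 and unmunge_ok(proc.stdout)


if __name__ == "__main__":
    for node in ("node01", "node02"):
        print(node, "ok" if check_node(node) else "FAILED")
```

A failure here usually points at a key mismatch or clock skew between the nodes, which is why the key copy and time synchronization steps come first.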

 

3. Configure Slurm (run on all machines unless noted otherwise)

 

Create the Slurm user

groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

 

Install the Slurm build dependencies

yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel http-parser-devel json-c-devel libjwt libjwt-devel -y

 

Build and install Slurm

# Download index: https://download.schedmd.com/slurm/
wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2
rpmbuild -ta --with mysql --with slurmrestd --with jwt slurm-22.05.3.tar.bz2
cd /root/rpmbuild/RPMS/x86_64/
yum localinstall -y slurm-*

The --with slurmrestd option builds slurmrestd, which provides the REST API.

 

Configure Slurm on the control node

# Run on the control node
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
cat > /etc/slurm/slurm.conf << EOF
ClusterName=cluster
# SlurmctldHost=master
ControlMachine=master
ControlAddr=172.18.7.31
#
SlurmctldDebug=info
SlurmdDebug=debug3
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
# Fix Mentioned Error
# TaskPluginParam=Sched
TaskPluginParam=verbose
# TIMERS
#InactiveLimit=0
#KillWait=30
#ResumeTimeout=600
MinJobAge=172800
#OverTimeLimit=0
#SlurmctldTimeout=12
#SlurmdTimeout=300
#Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
AccountingStorageHost=master
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
# Fix Mentioned Error
# AccountingStoreJobComment=YES
AccountingStoreFlags=job_comment
#JobCompHost=localhost
#JobCompPass=123456
#JobCompPort=3306
#JobCompType=jobcomp/mysql
#JobCompUser=root
#JobAcctGatherFrequency=1
#JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
MaxNodeCount=1000
TreeWidth=65533
# COMPUTE NODES
NodeName=master,node[01-02] CPUs=4 RealMemory=6000 State=UNKNOWN
PartitionName=compute Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy,root
EOF
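
The `NodeName` and `Nodes` values use Slurm's bracketed range notation. On a node with Slurm installed, `scontrol show hostnames 'node[01-02]'` expands such an expression; the sketch below is a simplified Python approximation of that expansion (plain ranges and comma lists only), useful for scripting against the node list.

```python
import re


def expand_hostlist(expr):
    """Expand a simple expression such as 'master,node[01-02]'."""
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"(.*)\[(.+)\](.*)", part)
        if not m:
            hosts.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 01 -> 01, 02
            for i in range(int(lo), int(hi) + 1):
                hosts.append("{}{:0{}d}{}".format(prefix, i, width, suffix))
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("master,node[01-02]"))
```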

 

Copy the configuration files from the control node to the compute nodes

# Run on the control node
scp /etc/slurm/*.conf node01:/etc/slurm/
scp /etc/slurm/*.conf node02:/etc/slurm/

 

Create the spool and log directories on the control and compute nodes and set their ownership

mkdir -p /var/spool/slurm
chown slurm: /var/spool/slurm
mkdir -p /var/log/slurm
chown slurm: /var/log/slurm

 

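Configure Slurm accounting on the control node. Accounting records capture information about every job step that Slurm runs. They can be written to a text file or to a database, but a text file keeps growing, so the simplest approach is to store the records in MySQL.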

 

Install MySQL 5.7 on CentOS 7 with yum (with a custom data directory)

 

Create the Slurm database user

# mysql5.7
grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'Slurm*1234' with grant option;
# mysql8.0
CREATE USER 'slurm'@'%' identified with mysql_native_password by 'Slurm*1234';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%';
flush privileges;

 

Configure slurmdbd.conf

# Run on the control node
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
cat > /etc/slurm/slurmdbd.conf << 'EOF'
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=172.18.7.31
DbdHost=master
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=172.18.0.191
StorageUser=slurm
StoragePass=Slurm*1234
StorageLoc=slurm_acct_db # database name; slurmdbd creates the database automatically
StoragePort=3306
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
EOF
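
Before starting slurmdbd it can help to confirm that the file parses as intended, including the trailing `#` comment on the `StorageLoc` line. The sketch below is a deliberately simple parser; the `REQUIRED` tuple is an assumption based on the keys used in this file, not an official list, and Slurm's real parser is case-insensitive and more lenient.

```python
REQUIRED = (
    "DbdHost", "SlurmUser", "StorageType",
    "StorageHost", "StorageUser", "StoragePass", "StorageLoc",
)


def parse_conf(text):
    """Parse Key=Value lines, dropping '#' comments and blank lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            conf[key.strip()] = value.strip()
    return conf


def missing_keys(conf, required=REQUIRED):
    """Return the required keys that the parsed file does not define."""
    return [key for key in required if key not in conf]


if __name__ == "__main__":
    with open("/etc/slurm/slurmdbd.conf") as f:
        print(missing_keys(parse_conf(f.read())))
```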

 

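Set permissions

# Run on the control node
chown slurm: /etc/slurm/slurmdbd.conf
# slurmdbd.conf contains the database password and must be readable by SlurmUser only
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurm.conf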

 

Add the JWT key on the controller (in the StateSaveLocation directory)

mkdir -p /var/spool/slurm/ctld
dd if=/dev/random of=/var/spool/slurm/ctld/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/ctld/jwt_hs256.key
chmod 0600 /var/spool/slurm/ctld/jwt_hs256.key
# chown root:root /etc/slurm
chmod 0755 /var/spool/slurm/ctld
chown slurm:slurm /var/spool/slurm/ctld
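
For reference, the key creation step can also be expressed in Python. This is only a sketch of an equivalent to the `dd` and `chmod 0600` commands, with a hypothetical function name; ownership still has to be handed to the slurm user afterwards, as in the `chown` commands above.

```python
import os
import secrets


def write_jwt_key(path, nbytes=32):
    """Create path with mode 0600 and fill it with random bytes."""
    # O_EXCL refuses to overwrite an existing key, which would
    # invalidate every token that has already been issued.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(secrets.token_bytes(nbytes))


if __name__ == "__main__":
    write_jwt_key("/var/spool/slurm/ctld/jwt_hs256.key")
```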

 

Start the services

# Start slurmdbd on the control node
systemctl restart slurmdbd
systemctl enable slurmdbd
systemctl status slurmdbd
# Start slurmctld on the control node
systemctl restart slurmctld
systemctl enable slurmctld
systemctl status slurmctld
# Start slurmd on the compute nodes
systemctl restart slurmd
systemctl enable slurmd
systemctl status slurmd
# If a service fails to start, run it in the foreground to see the error
slurmdbd -Dvvv
slurmctld -Dvvv
slurmd -Dvvv
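
Once the daemons are running, a TCP connect test against the ports set in slurm.conf (6817 for slurmctld, 6818 for slurmd, 6819 for slurmdbd) gives a quick health check. A minimal sketch, assuming the hostnames from /etc/hosts resolve; `CHECKS` and the function names are illustrative.

```python
import socket

# (host, port) pairs taken from the slurm.conf shown earlier.
CHECKS = [
    ("master", 6819),  # slurmdbd
    ("master", 6817),  # slurmctld
    ("node01", 6818),  # slurmd
    ("node02", 6818),  # slurmd
]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(checks=CHECKS):
    """Map 'host:port' to True/False for every configured check."""
    return {"{}:{}".format(h, p): port_open(h, p) for h, p in checks}


if __name__ == "__main__":
    for target, ok in report().items():
        print(target, "open" if ok else "closed")
```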

 

4. Verify the Slurm cluster

Create a user

useradd zkxy
echo 123456 | passwd --stdin zkxy

 

Check the Slurm cluster

# Can be run on the control node or on any compute node
# Show the cluster
sinfo
scontrol show partition
scontrol show node
# Submit a job
srun -N2 hostname
scontrol show jobs
# List jobs
squeue -a
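
The `sinfo` output can also be consumed by a script. The sketch below assumes it is fed the output of `sinfo -N -h -o "%N %T"` (one "node state" pair per line) and flags nodes that are not usable; the set of states treated as unhealthy is an assumption chosen for illustration.

```python
import re

# States treated as unhealthy in this sketch (an assumption, adjust as needed).
BAD_STATES = {"down", "drained", "draining", "fail", "failing", "unknown"}


def unhealthy_nodes(sinfo_output):
    """Map node name -> reported state for nodes in an unhealthy state."""
    bad = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        node, state = parts
        # Drop flag suffixes such as '*' (not responding).
        base = re.match(r"[A-Za-z_]*", state).group(0).lower()
        if base in BAD_STATES:
            bad[node] = state
    return bad


if __name__ == "__main__":
    print(unhealthy_nodes("node01 idle\nnode02 down*\n"))
```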

 

Create another user

useradd whq
echo whq | passwd --stdin whq

 

Run the Slurm REST API (slurmrestd must not run as root or as SlurmUser)

cat > /etc/slurm/slurmrestd.conf << 'EOF'
include /etc/slurm/slurm.conf
AuthType=auth/jwt
EOF
chown slurm:slurm /etc/slurm/slurmrestd.conf
su - whq
slurmrestd -f /etc/slurm/slurmrestd.conf 0.0.0.0:6688 -a jwt -s openapi/v0.0.36
slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688

 

Create a systemd service

cat > /usr/lib/systemd/system/slurmrestd.service <<EOF
[Unit]
Description=slurmrestd service
After=network.service
[Service]
Type=simple
User=whq
Group=whq
WorkingDirectory=/usr/sbin
ExecStart=/usr/sbin/slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688
Restart=always
ProtectSystem=full
PrivateDevices=yes
PrivateTmp=yes
NoNewPrivileges=true
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl stop slurmrestd
systemctl restart slurmrestd
systemctl enable slurmrestd
systemctl status slurmrestd

 

Get a token (lifespan is in seconds; the default is 1800 and the maximum is 99999999999)

scontrol token lifespan=999999999 username=whq
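
`scontrol token` prints the token as a `SLURM_JWT=<token>` line. The sketch below extracts the token and decodes its payload to inspect the expiry; it does not verify the signature, and the helper names are illustrative rather than part of any Slurm API.

```python
import base64
import json


def token_from_scontrol(output):
    """Extract the token from 'SLURM_JWT=<token>' output."""
    for line in output.splitlines():
        if line.startswith("SLURM_JWT="):
            return line.split("=", 1)[1].strip()
    raise ValueError("no SLURM_JWT line found")


def jwt_claims(token):
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))


def lifetime_seconds(token):
    """Return exp - iat, i.e. the lifespan the token was issued with."""
    claims = jwt_claims(token)
    return claims["exp"] - claims["iat"]
```

Checking `lifetime_seconds` against the requested lifespan is a simple way to confirm that the `lifespan=` argument took effect.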

 

If a node is down with Reason=Not responding and restarting the services does not help, try the following commands

scontrol update NodeName=node01 State=RESUME
scontrol update NodeName=node02 State=RESUME
scontrol update NodeName=node03 State=RESUME
scontrol update NodeName=node04 State=RESUME
scontrol update NodeName=node05 State=RESUME
scontrol update NodeName=node06 State=RESUME
scontrol update NodeName=node07 State=RESUME
scontrol update NodeName=node08 State=RESUME
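
`NodeName` also accepts a comma-separated list or a range expression such as `node[01-08]`, so the repeated commands can be collapsed into one call. A small sketch that builds that command from a list of node names (the function name is hypothetical):

```python
import subprocess


def resume_command(nodes):
    """Build one scontrol call that resumes every node in nodes."""
    if not nodes:
        raise ValueError("no nodes given")
    return ["scontrol", "update",
            "NodeName={}".format(",".join(nodes)), "State=RESUME"]


if __name__ == "__main__":
    cmd = resume_command(["node01", "node02"])
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)  # requires Slurm admin privileges
```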

 

Test the API from Python

import requests

url = 'http://172.18.0.115:6688/slurm/v0.0.36/ping'
headers = {
    'X-SLURM-USER-NAME': 'whq',
    # Token printed by `scontrol token`
    'X-SLURM-USER-TOKEN': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2MzQxOTM4NzMsImlhdCI6MTYzNDE5MjA3Mywic3VuIjoid2hxIn0.82HpB4ss96Iw7o9JAzDp8WGRfFWDOCbPzx-J3Y5nK_U',
}
response = requests.get(url, headers=headers, timeout=10)
print(response.text)

 

Slurm REST API references

https://app.swaggerhub.com/apis/rherrick/slurm-rest_api/0.0.35#/

 

https://slurm.schedmd.com/SLUG20/REST_API.pdf
https://slurm.schedmd.com/SLUG19/REST_API.pdf

 


 

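Remote calls to the REST API were occasionally slow, while calls from inside the cluster were fast. Putting nginx in front of slurmrestd, listening on port 16688 and proxying to port 6688, sped up the remote calls.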

yum install nginx -y
cat > /etc/nginx/conf.d/slurm.conf << 'EOF'
upstream backend {
server 127.0.0.1:6688;
}
server {
listen 16688;
server_name localhost;
location / {
proxy_pass http://backend;
}
}
EOF
systemctl restart nginx
systemctl enable nginx
systemctl status nginx

 

References

https://www.cnblogs.com/liu-shaobo/p/13285839.html
https://blog.csdn.net/kongxx/article/details/52550653
https://www.jianshu.com/p/c7cf800656dc
https://www.jianshu.com/p/e560b19dbd3e

 

 

JWT references

 

https://slurm.schedmd.com/jwt.html
https://elwe.rhrk.uni-kl.de/documentation/jwt.html

 

Slurm user manual (Chinese)

 

https://docs.slurm.cn/users/

 

 

There are two ways to let ordinary users run jobs: AllowAccounts and AllowGroups. AllowGroups is recommended.

 

AllowAccounts enables account-based access control for the partition

AccountingStorageEnforce=limits
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy

 

The account named after AllowAccounts= has to be created first. The steps are:

# List clusters
sacctmgr list cluster
# The cluster name must match ClusterName in slurm.conf; skip this step if that cluster already exists
sacctmgr add cluster cluster
# Add the account. It must be created in the matching cluster, i.e. the ClusterName from slurm.conf.
# root is not covered automatically; for root to run jobs the partition must set AllowAccounts=zkxy,root
sacctmgr add account name=zkxy cluster=cluster
# List accounts
sacctmgr list account
# Add a user to the account and assign a QOS
sacctmgr add user name=admin account=zkxy qos=normal cluster=cluster
# List associations
sacctmgr list assoc
# Run a job
srun -N2 hostname
# The user must exist on every node
useradd admin
echo 123456 | passwd --stdin admin
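
Whether a user can submit to the partition depends on the associations that `sacctmgr` reports. The sketch below assumes pipe-delimited output from `sacctmgr -n -P list assoc format=cluster,account,user` and checks a user against an AllowAccounts list; the function names are illustrative.

```python
def user_accounts(assoc_output, cluster="cluster"):
    """Map user -> set of accounts from 'cluster|account|user' rows."""
    result = {}
    for line in assoc_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3:
            continue
        clus, account, user = fields
        # Rows with an empty user describe the account itself.
        if clus == cluster and user:
            result.setdefault(user, set()).add(account)
    return result


def may_use_partition(user, accounts_by_user, allow_accounts):
    """True if the user belongs to at least one allowed account."""
    return bool(accounts_by_user.get(user, set()) & set(allow_accounts))


if __name__ == "__main__":
    sample = "cluster|root|\ncluster|root|root\ncluster|zkxy|\ncluster|zkxy|admin\n"
    accounts = user_accounts(sample)
    print(may_use_partition("admin", accounts, ["zkxy"]))
    print(may_use_partition("root", accounts, ["zkxy"]))
```

With this sample data, root is only admitted once the root account is added to the allowed list, which matches the note about AllowAccounts=zkxy,root.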

 

AllowGroups enables per-group access control for the partition. (It does not work on its own here; it has to be combined with AllowAccounts, which limits its usefulness.)
AccountingStorageEnforce=limits
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowGroups=edauser AllowAccounts=zkxy

The edauser group after AllowGroups= is a group name from /etc/group.
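
To see which users an AllowGroups entry covers, the group's member list in /etc/group can be read directly. A minimal sketch with a hypothetical helper name; note that users who have the group only as their primary group do not appear in this member field.

```python
def group_members(group_text, group):
    """Return the supplementary members of group from /etc/group text."""
    for line in group_text.splitlines():
        fields = line.strip().split(":")
        # Format: name:password:gid:member1,member2,...
        if len(fields) == 4 and fields[0] == group:
            return [m for m in fields[3].split(",") if m]
    raise KeyError(group)


if __name__ == "__main__":
    with open("/etc/group") as f:
        print(group_members(f.read(), "edauser"))
```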

 

mkdir /data2
mount -t nfs -o nolock,nfsvers=3 172.18.0.21:/mnt/UserDataTemp/UserTemp /data2
echo "172.18.0.21:/mnt/UserDataTemp/UserTemp /data2 nfs defaults 0 0" >> /etc/fstab

 

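If slurmdbd reports the following error at startup, add the settings below to the [mysqld] section of the MySQL configuration and restart MySQL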

slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
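
A short script can confirm that the settings are actually present in the MySQL configuration file. The sketch below uses `configparser` and compares against the values listed above; it assumes a plain INI-style file, so directives such as `!includedir` placed before the first section, or keys written with hyphens instead of underscores, are not handled.

```python
import configparser

# Values taken from the [mysqld] block above.
RECOMMENDED = {
    "innodb_buffer_pool_size": "1024M",
    "innodb_log_file_size": "64M",
    "innodb_lock_wait_timeout": "900",
}


def read_mysqld_settings(cnf_text):
    """Return the [mysqld] section of a my.cnf-style text as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return {}
    return dict(parser["mysqld"])


def missing_settings(settings, wanted=RECOMMENDED):
    """Return the wanted settings that are absent or have another value."""
    return {k: v for k, v in wanted.items() if settings.get(k) != v}


if __name__ == "__main__":
    with open("/etc/my.cnf") as f:
        print(missing_settings(read_mysqld_settings(f.read())))
```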

 

 

 

Reference

https://www.cnblogs.com/dahu-daqing/p/12693334.html
Posted by iStudyGuy