Setting Up a Slurm Cluster
Reference: https://www.wanghaiqing.com/article/911f5d98-b68a-4daa-8db6-ee2052ec8275/
Slurm is an open-source workload scheduler for Linux and Unix, used by many of the world's supercomputers. Its main functions are:
1. Allocate compute-node resources to users so they can run their work;
2. Provide a framework for starting, executing, and monitoring work (typically parallel jobs) on the set of allocated nodes;
3. Arbitrate resource contention by managing a queue of pending jobs.
Slurm architecture
Slurm consists of a central slurmctld daemon on the control node, a slurmd daemon on every compute node, and an optional slurmdbd daemon that stores accounting data in a database.
Environment
Server | IP | Hostname | OS | Spec |
---|---|---|---|---|
Control node | 172.18.7.31 | master | CentOS 7.9 | 8 cores, 16 GB |
Compute node 1 | 172.18.7.32 | node01 | CentOS 7.9 | 8 cores, 32 GB |
Compute node 2 | 172.18.7.33 | node02 | CentOS 7.9 | 8 cores, 32 GB |
1. Base environment (run on all machines unless noted otherwise)
Disable the firewall and SELinux
systemctl stop firewalld
systemctl disable firewalld
sed -i -e 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
Switch to the Aliyun mirrors
rm -rf /etc/yum.repos.d/*
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
yum clean all
yum makecache fast -y
Or use the company's internal CentOS 7 mirror
rm -rf /etc/yum.repos.d/*
cat > /etc/yum.repos.d/centos7.repo << EOF
[base]
name=base
baseurl=http://172.18.0.61/centos7/base
enabled=1
gpgcheck=0
[extras]
name=extras
baseurl=http://172.18.0.61/centos7/extras
enabled=1
gpgcheck=0
[updates]
name=updates
baseurl=http://172.18.0.61/centos7/updates
enabled=1
gpgcheck=0
[epel]
name=epel
baseurl=http://172.18.0.61/centos7/epel
enabled=1
gpgcheck=0
EOF
yum clean all
yum makecache fast -y
Set the hostnames; they must be unique (run the matching command on each machine)
hostnamectl set-hostname master
hostnamectl set-hostname node01
hostnamectl set-hostname node02
Configure /etc/hosts
cat >> /etc/hosts << EOF
172.18.7.31 master
172.18.7.32 node01
172.18.7.33 node02
EOF
Speed up SSH logins by disabling reverse DNS lookups
echo "UseDNS no" >> /etc/ssh/sshd_config
systemctl restart sshd
Install common packages
yum -y install net-tools wget vim ntpdate chrony htop glances nfs-utils rpcbind python3
Time synchronization with ntpdate (Munge requires the node clocks to be in sync)
# Public NTP server
ntpdate time1.aliyun.com
echo "*/5 * * * * /usr/sbin/ntpdate time1.aliyun.com" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc
# Internal NTP server
ntpdate 172.18.0.162
echo "*/5 * * * * /usr/sbin/ntpdate 172.18.0.162" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc
Configure passwordless SSH
# Run on the control node
echo y| ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node01
ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node02
2. Configure Munge (run on all machines unless noted otherwise)
Create the munge user
The munge user must have the same UID and GID on the control node and on every compute node, and Munge must be installed on all nodes.
groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
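Since the UID and GID must match everywhere, a quick consistency check can be run from the control node once passwordless SSH is set up; this is a minimal sketch using the hostnames from the table above:
# All three lines of output should show the same uid/gid
id munge
for h in node01 node02; do ssh $h id munge; done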
Feed the entropy pool
# Install rng-tools
yum install -y rng-tools
# Use /dev/urandom as the entropy source
rngd -r /dev/urandom
sed -i 's#^ExecStart.*#ExecStart=/sbin/rngd -f -r /dev/urandom#g' /usr/lib/systemd/system/rngd.service
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
systemctl status rngd
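To confirm the entropy pool is healthy (a value well above 1000 is fine):
cat /proc/sys/kernel/random/entropy_avail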
Install Munge. Munge is an authentication service that verifies the UID and GID of processes on local or remote hosts.
yum install munge munge-libs munge-devel -y
Create the cluster-wide key on the control node
# Run on the control node
/usr/sbin/create-munge-key -r
# Optionally regenerate the key from /dev/urandom (this overwrites the key created above)
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
Copy the key to all compute nodes
# Run on the control node
scp -p /etc/munge/munge.key root@node01:/etc/munge
scp -p /etc/munge/munge.key root@node02:/etc/munge
# Run on the compute nodes
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
Start Munge on all nodes
systemctl restart munge
systemctl enable munge
systemctl status munge
Test Munge: verify credentials between each compute node and the control node
# Generate a credential locally
munge -n
# Decode it locally
munge -n | unmunge
# Decode remotely on a compute node
munge -n | ssh node01 unmunge
# Benchmark Munge credential throughput
remunge
3. Configure Slurm (run on all machines unless noted otherwise)
Create the slurm user
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
Install Slurm build dependencies
yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel http-parser-devel json-c-devel libjwt libjwt-devel -y
Build and install Slurm
# Download page:
https://download.schedmd.com/slurm/
wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2
rpmbuild -ta --with mysql --with slurmrestd --with jwt slurm-22.05.3.tar.bz2
cd /root/rpmbuild/RPMS/x86_64/
yum localinstall -y slurm-*
The --with slurmrestd option enables the REST API daemon.
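After installing, it is worth confirming the packages and version on every node:
# Should list the slurm-* packages and report version 22.05.3
rpm -qa | grep slurm
slurmd --version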
Configure Slurm on the control node
# Run on the control node
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
cat > /etc/slurm/slurm.conf << EOF
ClusterName=cluster
# SlurmctldHost=master
ControlMachine=master
ControlAddr=172.18.7.31
#
SlurmctldDebug=info
SlurmdDebug=debug3
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
# Fix Mentioned Error
# TaskPluginParam=Sched
TaskPluginParam=verbose
# TIMERS
#InactiveLimit=0
#KillWait=30
#ResumeTimeout=600
MinJobAge=172800
#OverTimeLimit=0
#SlurmctldTimeout=12
#SlurmdTimeout=300
#Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
AccountingStorageHost=master
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
# Fix Mentioned Error
# AccountingStoreJobComment=YES
AccountingStoreFlags=job_comment
#JobCompHost=localhost
#JobCompPass=123456
#JobCompPort=3306
#JobCompType=jobcomp/mysql
#JobCompUser=root
#JobAcctGatherFrequency=1
#JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
MaxNodeCount=1000
TreeWidth=65533
# COMPUTE NODES
NodeName=master,node[01-02] CPUs=4 RealMemory=6000 State=UNKNOWN
PartitionName=compute Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy,root
EOF
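Note that the CPUs and RealMemory values in the NodeName line must not exceed the node's real hardware (the table above lists 8 cores per node, while the example declares CPUs=4, which simply leaves cores unused). Running slurmd -C on a node prints a NodeName line with the detected values, which can be pasted into slurm.conf:
# Run on each node after the RPMs are installed
slurmd -C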
Copy the configuration files to the compute nodes
# Run on the control node
scp /etc/slurm/*.conf node01:/etc/slurm/
scp /etc/slurm/*.conf node02:/etc/slurm/
Create the spool and log directories on the control and compute nodes
mkdir -p /var/spool/slurm
chown slurm: /var/spool/slurm
mkdir -p /var/log/slurm
chown slurm: /var/log/slurm
Configure Slurm accounting on the control node. Accounting records capture information about jobs and job steps; they can be written to a plain text file or to a database. The text file grows without bound, so the simplest durable option is MySQL.
Install MySQL 5.7 on CentOS 7 via yum (with a custom data directory if desired)
Create the slurm database user
# mysql5.7
grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'Slurm*1234' with grant option;
# mysql8.0
CREATE USER 'slurm'@'%' identified with mysql_native_password by 'Slurm*1234';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%';
flush privileges;
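Before starting slurmdbd, it helps to confirm the account can log in from the control node (assumes the mysql client is installed; host and password match the slurmdbd.conf below):
mysql -h 172.18.0.191 -u slurm -p'Slurm*1234' -e 'SHOW GRANTS;'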
Configure slurmdbd.conf
# Run on the control node
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
cat > /etc/slurm/slurmdbd.conf << 'EOF'
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=172.18.7.31
DbdHost=master
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=172.18.0.191
StorageUser=slurm
StoragePass=Slurm*1234
# Database name; slurmdbd creates it automatically
StorageLoc=slurm_acct_db
StoragePort=3306
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
EOF
Set permissions
# Run on the control node; slurmdbd refuses to start if slurmdbd.conf is readable by group or others
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurm.conf
Add the JWT key on the controller (in the StateSaveLocation directory)
mkdir -p /var/spool/slurm/ctld
dd if=/dev/random of=/var/spool/slurm/ctld/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/ctld/jwt_hs256.key
chmod 0600 /var/spool/slurm/ctld/jwt_hs256.key
# chown root:root /etc/slurm
chmod 0755 /var/spool/slurm/ctld
chown slurm:slurm /var/spool/slurm/ctld
Start the services
# Start slurmdbd on the control node
systemctl restart slurmdbd
systemctl enable slurmdbd
systemctl status slurmdbd
# Start slurmctld on the control node
systemctl restart slurmctld
systemctl enable slurmctld
systemctl status slurmctld
# Start slurmd on the compute nodes
systemctl restart slurmd
systemctl enable slurmd
systemctl status slurmd
# If a service fails to start, run the daemon in the foreground to see the error
slurmdbd -Dvvv
slurmctld -Dvvv
slurmd -Dvvv
4. Verify the Slurm cluster
Create a test user
useradd zkxy
echo 123456 | passwd --stdin zkxy
Check the cluster
# Can be run on the control node or any compute node
# View the cluster
sinfo
scontrol show partition
scontrol show node
# Submit a job
srun -N2 hostname
scontrol show jobs
# View the queue
squeue -a
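srun runs work interactively; batch jobs are normally submitted with sbatch. A minimal batch script for this cluster might look like this (job name and output path are arbitrary):
cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --output=%j.out
srun hostname
EOF
sbatch test.sh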
Create a regular user (it will run slurmrestd)
useradd whq
echo whq | passwd --stdin whq
Run the Slurm REST API (it must not run as root or as the SlurmUser)
cat > /etc/slurm/slurmrestd.conf << 'EOF'
include /etc/slurm/slurm.conf
AuthType=auth/jwt
EOF
chown slurm:slurm /etc/slurm/slurmrestd.conf
su - whq
# Run in the foreground for testing; the two invocations below are equivalent
slurmrestd -f /etc/slurm/slurmrestd.conf 0.0.0.0:6688 -a jwt -s openapi/v0.0.36
slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688
Create a systemd service
cat > /usr/lib/systemd/system/slurmrestd.service <<EOF
[Unit]
Description=slurmrestd service
After=network.service
[Service]
Type=simple
User=whq
Group=whq
WorkingDirectory=/usr/sbin
ExecStart=/usr/sbin/slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688
Restart=always
ProtectSystem=full
PrivateDevices=yes
PrivateTmp=yes
NoNewPrivileges=true
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl stop slurmrestd
systemctl restart slurmrestd
systemctl enable slurmrestd
systemctl status slurmrestd
Obtain a token (the default lifespan is 1800 seconds; the maximum is 99999999999)
scontrol token lifespan=999999999 username=whq
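scontrol token prints SLURM_JWT=<token>; the token goes in the X-SLURM-USER-TOKEN header. A quick check from the shell, assuming slurmrestd is listening on port 6688 as configured above:
# Export SLURM_JWT into the environment, then ping the REST API
export $(scontrol token username=whq)
curl -s -H "X-SLURM-USER-NAME: whq" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" http://localhost:6688/slurm/v0.0.36/ping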
If a node is stuck in the down state with Reason=Not responding and restarting the services does not help, try:
scontrol update NodeName=node[01-08] State=RESUME
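To confirm the nodes came back, and to see the recorded reason for any that are still down:
sinfo -R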
Test the API from Python
import requests

# Ping the REST API via slurmrestd; replace the token with one from `scontrol token`
url = 'http://172.18.0.115:6688/slurm/v0.0.36/ping'
headers = {
    'X-SLURM-USER-NAME': 'whq',
    'X-SLURM-USER-TOKEN': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2MzQxOTM4NzMsImlhdCI6MTYzNDE5MjA3Mywic3VuIjoid2hxIn0.82HpB4ss96Iw7o9JAzDp8WGRfFWDOCbPzx-J3Y5nK_U',
}
response = requests.get(url, headers=headers)
print(response.text)
Slurm REST API references
https://app.swaggerhub.com/apis/rherrick/slurm-rest_api/0.0.35#/
https://slurm.schedmd.com/SLUG20/REST_API.pdf
https://slurm.schedmd.com/SLUG19/REST_API.pdf
Remote calls to the REST interface are occasionally slow, while calls from inside the cluster are fast. Forwarding port 6688 through an nginx listener on port 16688 works around this and speeds up remote API calls.
yum install nginx -y
cat > /etc/nginx/conf.d/slurm.conf << 'EOF'
upstream backend {
server 127.0.0.1:6688;
}
server {
listen 16688;
server_name localhost;
location / {
proxy_pass http://backend;
}
}
EOF
systemctl restart nginx
systemctl enable nginx
systemctl status nginx
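A quick way to confirm the proxy is forwarding is an unauthenticated request; it should return an authentication error from slurmrestd rather than a connection failure:
curl -v http://localhost:16688/slurm/v0.0.36/ping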
References
https://www.cnblogs.com/liu-shaobo/p/13285839.html
https://blog.csdn.net/kongxx/article/details/52550653
https://www.jianshu.com/p/c7cf800656dc
https://www.jianshu.com/p/e560b19dbd3e
JWT references
https://slurm.schedmd.com/jwt.html
https://elwe.rhrk.uni-kl.de/documentation/jwt.html
Slurm user manual (Chinese)
https://docs.slurm.cn/users/
Regular users can be allowed to run jobs in two ways: AllowAccounts and AllowGroups; AllowGroups is recommended.
AllowAccounts: account-based access control for a partition
AccountingStorageEnforce=limits
…
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy
The account named in AllowAccounts= must be created first; the steps are:
# List clusters
sacctmgr list cluster
# The cluster name must match ClusterName in slurm.conf; if that cluster already exists, skip this step
sacctmgr add cluster cluster
# Add an account; it must be created in the cluster named by ClusterName in slurm.conf.
# Adding root as an account alone does not help; the partition must also set AllowAccounts=zkxy,root
sacctmgr add account name=zkxy cluster=cluster
# List accounts
sacctmgr list account
# Add a user to the account and assign a QOS
sacctmgr add user name=admin account=zkxy qos=normal cluster=cluster
# List associations
sacctmgr list assoc
# Run a test job
srun -N2 hostname
# The OS user must exist on every node
useradd admin
echo 123456 | passwd --stdin admin
AllowGroups: group-based access control for a partition (this method is not independent; it must be combined with AllowAccounts, which limits its usefulness)
AccountingStorageEnforce=limits
…
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowGroups=edauser AllowAccounts=zkxy
The edauser group after AllowGroups= is an ordinary group name from /etc/group.
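For example, to create the group and add an existing user to it (the names here are just illustrations):
groupadd edauser
usermod -aG edauser whq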
Mount shared storage over NFS (all nodes)
mkdir /data2
mount -t nfs -o nolock,nfsvers=3 172.18.0.21:/mnt/UserDataTemp/UserTemp /data2
echo "172.18.0.21:/mnt/UserDataTemp/UserTemp /data2 nfs defaults 0 0" >> /etc/fstab
If slurmdbd reports the following error at startup:
slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout
add the recommended InnoDB settings to the MySQL configuration (typically /etc/my.cnf) and restart MySQL:
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
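After restarting MySQL, confirm the new value is active:
mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"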
Reference
https://www.cnblogs.com/dahu-daqing/p/12693334.html