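Setting Up a Slurm Cluster
Reference: https://www.wanghaiqing.com/article/911f5d98-b68a-4daa-8db6-ee2052ec8275/
Slurm is an open-source job scheduler for Linux and Unix, used by many of the world's supercomputers. Its main functions are:
1. Allocating compute-node resources to users so that they can run jobs;
2. Providing a framework for starting, executing, and monitoring work (usually parallel jobs) on the set of allocated nodes;
3. Arbitrating contention for resources by managing a queue of pending jobs.
Slurm Architecture
Environment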
Server | IP | Hostname | OS | Specs |
---|---|---|---|---|
Control node | 172.18.7.31 | master | CentOS 7.9 | 8 cores, 16 GB |
Compute node 1 | 172.18.7.32 | node01 | CentOS 7.9 | 8 cores, 32 GB |
Compute node 2 | 172.18.7.33 | node02 | CentOS 7.9 | 8 cores, 32 GB |
I. Base Environment (run on every machine unless noted otherwise)
Disable the firewall and SELinux

    systemctl stop firewalld
    systemctl disable firewalld
    sed -i -e 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
    setenforce 0

Switch to the Aliyun mirrors

    rm -rf /etc/yum.repos.d/*
    curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
    curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
    yum clean all
    yum makecache fast -y

Alternatively, use the company's internal CentOS 7 mirror

    rm -rf /etc/yum.repos.d/*
    cat > /etc/yum.repos.d/centos7.repo << EOF
    [base]
    name=base
    baseurl=http://172.18.0.61/centos7/base
    enabled=1
    gpgcheck=0
    [extras]
    name=extras
    baseurl=http://172.18.0.61/centos7/extras
    enabled=1
    gpgcheck=0
    [updates]
    name=updates
    baseurl=http://172.18.0.61/centos7/updates
    enabled=1
    gpgcheck=0
    [epel]
    name=epel
    baseurl=http://172.18.0.61/centos7/epel
    enabled=1
    gpgcheck=0
    EOF
    yum clean all
    yum makecache fast -y

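Set the hostname. Hostnames must be unique (run the matching command on each machine)

    hostnamectl set-hostname master   # on the control node
    hostnamectl set-hostname node01   # on compute node 1
    hostnamectl set-hostname node02   # on compute node 2
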
Configure /etc/hosts

    cat >> /etc/hosts << EOF
    172.18.7.31 master
    172.18.7.32 node01
    172.18.7.33 node02
    EOF

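Appending with `cat >>` adds duplicate lines if the step is run a second time. The following is a minimal Python sketch of an idempotent variant, assuming the three hosts from the environment table. The helper name `missing_hosts_lines` is made up for illustration, and the script only prints the lines that would still need to be appended; it does not modify `/etc/hosts`.

```python
# Hosts taken from the environment table of this article.
CLUSTER_HOSTS = {
    "172.18.7.31": "master",
    "172.18.7.32": "node01",
    "172.18.7.33": "node02",
}


def missing_hosts_lines(hosts_text, wanted=CLUSTER_HOSTS):
    """Return the 'IP name' lines from `wanted` that hosts_text does not contain yet."""
    present = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on any whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            ip, names = fields[0], fields[1:]
            for name in names:
                present.add((ip, name))
    return [f"{ip} {name}" for ip, name in wanted.items() if (ip, name) not in present]


# Sample input; on a real machine this would be the content of /etc/hosts.
sample = "127.0.0.1 localhost\n172.18.7.31 master\n"
for line in missing_hosts_lines(sample):
    print(line)
```

On a real node the text would come from reading `/etc/hosts`, and only the returned lines would be appended.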
Speed up SSH logins

    echo "UseDNS no" >> /etc/ssh/sshd_config
    systemctl restart sshd

Install packages

    yum -y install net-tools wget vim ntpdate chrony htop glances nfs-utils rpcbind python3

Time synchronization with ntpdate (use either the public or the internal time server)

    # Public time server
    ntpdate time1.aliyun.com
    echo "*/5 * * * * /usr/sbin/ntpdate time1.aliyun.com" >> /var/spool/cron/root
    timedatectl set-timezone Asia/Shanghai
    hwclock --systohc
    # Internal time server
    ntpdate 172.18.0.162
    echo "*/5 * * * * /usr/sbin/ntpdate 172.18.0.162" >> /var/spool/cron/root
    timedatectl set-timezone Asia/Shanghai
    hwclock --systohc

Configure passwordless SSH

    # Run on the control node
    echo y | ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node01
    ssh-copy-id -i ~/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@node02

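II. Configure Munge (run on every machine unless noted otherwise)
Create the Munge user
The munge user must have the same UID and GID on the master node and on all compute nodes, and Munge must be installed on every node.

    groupadd -g 1108 munge
    useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
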
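The requirement that the UID and GID match on every node can be checked by comparing the `munge` passwd entry of each machine. The sketch below is one possible way to do that in Python. It works on sample strings; collecting the real lines (for example with `getent passwd munge` over SSH) is left out, and the function names are hypothetical.

```python
def uid_gid(passwd_line):
    """Extract (uid, gid) from one passwd-format line (name:x:uid:gid:...)."""
    fields = passwd_line.strip().split(":")
    return int(fields[2]), int(fields[3])


def inconsistent_nodes(lines_by_node, reference="master"):
    """Return the nodes whose (uid, gid) differs from the reference node."""
    expected = uid_gid(lines_by_node[reference])
    return sorted(
        node for node, line in lines_by_node.items() if uid_gid(line) != expected
    )


# Sample data for illustration only; node02 deliberately has different IDs.
collected = {
    "master": "munge:x:1108:1108:Munge Uid 'N' Gid Emporium:/var/lib/munge:/sbin/nologin",
    "node01": "munge:x:1108:1108:Munge Uid 'N' Gid Emporium:/var/lib/munge:/sbin/nologin",
    "node02": "munge:x:997:995:Runs Uid 'N' Gid Emporium:/var/run/munge:/sbin/nologin",
}
print(inconsistent_nodes(collected))
```

An empty list means every node agrees with the control node.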
Populate the entropy pool

    # Install
    yum install -y rng-tools
    # Use /dev/urandom as the entropy source
    rngd -r /dev/urandom
    sed -i 's#^ExecStart.*#ExecStart=/sbin/rngd -f -r /dev/urandom#g' /usr/lib/systemd/system/rngd.service
    systemctl daemon-reload
    systemctl start rngd
    systemctl enable rngd
    systemctl status rngd

Deploy Munge. Munge is an authentication service that validates the UID and GID of processes on local or remote hosts.

    yum install munge munge-libs munge-devel -y

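Create the global key on the master node. The same key is used by the whole cluster.

    # Run on the control node
    /usr/sbin/create-munge-key -r
    # The dd below overwrites the key created above with 1024 random bytes
    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
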
Copy the key to every compute node

    # Run on the control node
    scp -p /etc/munge/munge.key root@node01:/etc/munge
    scp -p /etc/munge/munge.key root@node02:/etc/munge
    # Run on the compute nodes
    chown munge: /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key

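After the copy, the key file should not be readable by group or others. The following Python sketch shows one way to check that. The function name `key_file_problems` and the 32-byte minimum length are assumptions chosen for illustration, and the demo runs against a temporary file rather than `/etc/munge/munge.key`.

```python
import os
import stat
import tempfile


def key_file_problems(path, min_bytes=32):
    """Return a list of human-readable problems found with a key file."""
    problems = []
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    if mode & 0o077:  # any permission bit set for group or others
        problems.append(f"mode {mode:04o} is accessible by group or others")
    if st.st_size < min_bytes:  # min_bytes is an illustrative threshold
        problems.append(f"only {st.st_size} bytes long")
    return problems


# Demo on a throwaway file standing in for the real key.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024))
os.chmod(tmp.name, 0o644)
print(key_file_problems(tmp.name))  # reports the loose mode
os.chmod(tmp.name, 0o400)
print(key_file_problems(tmp.name))  # no problems left
os.remove(tmp.name)
```

On a cluster node the path would be `/etc/munge/munge.key`, and the check would need to run as root or as the munge user.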
Start Munge on all nodes

    systemctl restart munge
    systemctl enable munge
    systemctl status munge

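Test the Munge service by verifying each compute node against the control node

    # Generate a credential locally
    munge -n
    # Decode it locally
    munge -n | unmunge
    # Verify a compute node by decoding remotely
    munge -n | ssh node01 unmunge
    # Benchmark Munge credentials
    remunge
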
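When there are many nodes, reading the `unmunge` output by eye gets tedious. Below is a minimal Python sketch that parses the `KEY: value` lines printed by `unmunge` and checks the `STATUS` field. The sample text is written by hand for illustration, not captured from a real run, and the function names are hypothetical.

```python
def parse_unmunge(output):
    """Turn the 'KEY:   value' lines printed by unmunge into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as timestamps contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def credential_ok(output):
    """True if the decoded credential reports a successful status."""
    return parse_unmunge(output).get("STATUS", "").startswith("Success")


# Hand-written sample in the layout unmunge uses.
sample = """STATUS:           Success (0)
ENCODE_HOST:      master (172.18.7.31)
ENCODE_TIME:      2022-08-01 10:00:00 +0800 (1659319200)
UID:              root (0)
GID:              root (0)
LENGTH:           0
"""
print(credential_ok(sample))
print(parse_unmunge(sample)["ENCODE_HOST"])
```

The input for each node would be the text returned by `munge -n | ssh <node> unmunge`.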
III. Configure Slurm (run on every machine unless noted otherwise)
Create the Slurm user

    groupadd -g 1109 slurm
    useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

Install the Slurm build dependencies

    yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel http-parser-devel json-c-devel libjwt libjwt-devel -y

Build and install Slurm

    # Download page: https://download.schedmd.com/slurm/
    wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2
    rpmbuild -ta --with mysql --with slurmrestd --with jwt slurm-22.05.3.tar.bz2
    cd /root/rpmbuild/RPMS/x86_64/
    yum localinstall -y slurm-*

The --with slurmrestd option builds in support for the REST API.
Configure Slurm on the control node

    # Run on the control node
    cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
    cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
    cat > /etc/slurm/slurm.conf << EOF
    ClusterName=cluster
    # SlurmctldHost=master
    ControlMachine=master
    ControlAddr=172.18.7.31
    # SlurmctldDebug=info
    SlurmdDebug=debug3
    GresTypes=gpu
    MpiDefault=none
    ProctrackType=proctrack/cgroup
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/spool/slurm
    SlurmUser=slurm
    StateSaveLocation=/var/spool/slurm/ctld
    SwitchType=switch/none
    TaskPlugin=task/affinity,task/cgroup
    # Fix Mentioned Error
    # TaskPluginParam=Sched
    TaskPluginParam=verbose
    # TIMERS
    #InactiveLimit=0
    #KillWait=30
    #ResumeTimeout=600
    MinJobAge=172800
    #OverTimeLimit=0
    #SlurmctldTimeout=12
    #SlurmdTimeout=300
    #Waittime=0
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    # LOGGING AND ACCOUNTING
    AccountingStorageEnforce=limits
    AccountingStorageHost=master
    AccountingStoragePort=6819
    AccountingStorageType=accounting_storage/slurmdbd
    # Fix Mentioned Error
    # AccountingStoreJobComment=YES
    AccountingStoreFlags=job_comment
    #JobCompHost=localhost
    #JobCompPass=123456
    #JobCompPort=3306
    #JobCompType=jobcomp/mysql
    #JobCompUser=root
    #JobAcctGatherFrequency=1
    #JobAcctGatherType=jobacct_gather/linux
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
    MaxNodeCount=1000
    TreeWidth=65533
    # COMPUTE NODES
    # Adjust CPUs and RealMemory to the actual hardware (slurmd -C prints the detected values)
    NodeName=master,node[01-02] CPUs=4 RealMemory=6000 State=UNKNOWN
    PartitionName=compute Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy,root
    EOF

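A partition that names nodes without a matching NodeName line is an easy mistake to make when the node range changes (this article uses both node[01-02] and node[01-08]). The Python sketch below expands simple host lists such as `master,node[01-02]` and reports partition nodes that no NodeName line defines. It is a simplified illustration, not Slurm's own parser: it assumes exact-case keys, one bracket group per name with nothing after the bracket, and the function names are made up.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host list such as 'master,node[01-02]'."""
    hosts = []
    # Split on commas that are not inside [...].
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        m = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if not m:
            hosts.append(part)
            continue
        prefix, ranges = m.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 01
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


def parse_line(line):
    """Split 'Key=Value Key=Value' into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def undefined_partition_nodes(conf_text):
    """Map each partition name to the nodes it uses that no NodeName line defines."""
    lines = [l.split("#", 1)[0].strip() for l in conf_text.splitlines()]
    defined = set()
    for line in lines:
        if line.startswith("NodeName="):
            defined.update(expand_hostlist(parse_line(line)["NodeName"]))
    problems = {}
    for line in lines:
        if line.startswith("PartitionName="):
            kv = parse_line(line)
            nodes = kv.get("Nodes", "")
            if nodes == "ALL":
                continue
            missing = [h for h in expand_hostlist(nodes) if h not in defined]
            if missing:
                problems[kv["PartitionName"]] = missing
    return problems


conf = """
NodeName=master,node[01-02] CPUs=4 RealMemory=6000 State=UNKNOWN
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
"""
print(expand_hostlist("master,node[01-02]"))
print(undefined_partition_nodes(conf))
```

With the sample text, the partition refers to node03 through node08, which have no NodeName entry, so they are reported.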
Copy the configuration files from the control node to the compute nodes

    # Run on the control node
    scp /etc/slurm/*.conf node01:/etc/slurm/
    scp /etc/slurm/*.conf node02:/etc/slurm/

Set directory ownership on the control and compute nodes

    mkdir -p /var/spool/slurm
    chown slurm: /var/spool/slurm
    mkdir -p /var/log/slurm
    chown slurm: /var/log/slurm

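Configure Slurm accounting on the control node. Accounting records collect information about every job step. They can be written to a plain text file or to a database, but a text file keeps growing without bound, so the simplest approach is to store the records in MySQL.
Install MySQL 5.7 on CentOS 7 with yum (with a modified data directory)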
Create the Slurm database user

    # MySQL 5.7
    grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'Slurm*1234' with grant option;
    # MySQL 8.0
    CREATE USER 'slurm'@'%' identified with mysql_native_password by 'Slurm*1234';
    GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%';
    flush privileges;

Write the slurmdbd.conf file

    # Run on the control node
    cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
    cat > /etc/slurm/slurmdbd.conf << 'EOF'
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    DbdAddr=172.18.7.31
    DbdHost=master
    SlurmUser=slurm
    DebugLevel=verbose
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    StorageType=accounting_storage/mysql
    StorageHost=172.18.0.191
    StorageUser=slurm
    StoragePass=Slurm*1234
    # Database name; slurmdbd creates the database automatically
    StorageLoc=slurm_acct_db
    StoragePort=3306
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key
    EOF

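Set ownership and permissions

    # Run on the control node
    chown slurm: /etc/slurm/slurmdbd.conf
    # slurmdbd.conf holds the database password, so keep it private to the slurm user
    chmod 600 /etc/slurm/slurmdbd.conf
    chown slurm: /etc/slurm/slurm.conf
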
Add the JWT key on the controller (in the StateSaveLocation directory)

    mkdir -p /var/spool/slurm/ctld
    dd if=/dev/random of=/var/spool/slurm/ctld/jwt_hs256.key bs=32 count=1
    chown slurm:slurm /var/spool/slurm/ctld/jwt_hs256.key
    chmod 0600 /var/spool/slurm/ctld/jwt_hs256.key
    # chown root:root /etc/slurm
    chmod 0755 /var/spool/slurm/ctld
    chown slurm:slurm /var/spool/slurm/ctld

Start the services

    # Start slurmdbd on the control node
    systemctl restart slurmdbd
    systemctl enable slurmdbd
    systemctl status slurmdbd
    # Start slurmctld on the control node
    systemctl restart slurmctld
    systemctl enable slurmctld
    systemctl status slurmctld
    # Start slurmd on the compute nodes
    systemctl restart slurmd
    systemctl enable slurmd
    systemctl status slurmd
    # If a service fails to start, run the daemon in the foreground to see why
    slurmdbd -Dvvv
    slurmctld -Dvvv
    slurmd -Dvvv

IV. Verify the Slurm Cluster
Create a user

    useradd zkxy
    echo 123456 | passwd --stdin zkxy

Check the cluster

    # Can be run on the control node or on any compute node
    # Show the cluster
    sinfo
    scontrol show partition
    scontrol show node
    # Submit a job
    srun -N2 hostname
    scontrol show jobs
    # Show the job queue
    squeue -a

Create another user

    useradd whq
    echo whq | passwd --stdin whq

Run the Slurm REST API (slurmrestd must not run as root or as SlurmUser)

    cat > /etc/slurm/slurmrestd.conf << 'EOF'
    include /etc/slurm/slurm.conf
    AuthType=auth/jwt
    EOF
    chown slurm:slurm /etc/slurm/slurmrestd.conf
    su - whq
    slurmrestd -f /etc/slurm/slurmrestd.conf 0.0.0.0:6688 -a jwt -s openapi/v0.0.36
    slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688

Create a systemd service

    cat > /usr/lib/systemd/system/slurmrestd.service <<EOF
    [Unit]
    Description=slurmrestd service
    After=network.service
    [Service]
    Type=simple
    User=whq
    Group=whq
    WorkingDirectory=/usr/sbin
    ExecStart=/usr/sbin/slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688
    Restart=always
    ProtectSystem=full
    PrivateDevices=yes
    PrivateTmp=yes
    NoNewPrivileges=true
    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload
    systemctl stop slurmrestd
    systemctl restart slurmrestd
    systemctl enable slurmrestd
    systemctl status slurmrestd

Get a token (the default is lifespan=1800 seconds; the maximum is 99999999999)

    scontrol token lifespan=999999999 username=whq

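To see when a token expires, the payload of the JWT can be decoded locally. The sketch below is a minimal Python illustration that reads the `exp` claim without verifying the signature, so it is only suitable for inspection. It assumes the token is either the bare JWT or the `SLURM_JWT=...` line printed by `scontrol token`. The helper names and the sample token are made up for the example.

```python
import base64
import json
import time


def jwt_payload(token):
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    if token.startswith("SLURM_JWT="):
        token = token[len("SLURM_JWT="):]
    payload_b64 = token.strip().split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_left(token, now=None):
    """Seconds until the token's exp claim; negative if already expired."""
    now = time.time() if now is None else now
    return jwt_payload(token)["exp"] - now


def _b64(obj):
    """Helper for the demo only: encode a dict the way JWT segments are encoded."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hand-built sample token with a 1800 second lifespan; the signature is a dummy.
sample = ".".join([
    _b64({"alg": "HS256", "typ": "JWT"}),
    _b64({"exp": 1800, "iat": 0, "sun": "whq"}),
    "signature-not-checked",
])
print(jwt_payload(sample))
print(seconds_left("SLURM_JWT=" + sample, now=600))
```

A negative result from `seconds_left` means the token has expired and a new one has to be requested.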
If a node is in the down state with Reason=Not responding and restarting the services does not help, try the following commands

    scontrol update NodeName=node01 State=RESUME
    scontrol update NodeName=node02 State=RESUME
    scontrol update NodeName=node03 State=RESUME
    scontrol update NodeName=node04 State=RESUME
    scontrol update NodeName=node05 State=RESUME
    scontrol update NodeName=node06 State=RESUME
    scontrol update NodeName=node07 State=RESUME
    scontrol update NodeName=node08 State=RESUME

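Rather than resuming every node by hand, the list of down nodes can be taken from `sinfo`. The following Python sketch builds one `scontrol update` command per down node. It assumes input in the two-column form produced by `sinfo -N -h -o "%N %T"` (node name, state), uses hand-written sample text, and only prints the commands instead of running them. The function name is hypothetical.

```python
def resume_commands(sinfo_output):
    """Build one `scontrol update` argument list per node reported as down."""
    commands = []
    seen = set()
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        node, state = fields
        # A trailing '*' marks a node that is not responding.
        if state.rstrip("*") == "down" and node not in seen:
            seen.add(node)
            commands.append(["scontrol", "update", f"NodeName={node}", "State=RESUME"])
    return commands


# Hand-written sample; node01 appears twice, as it would in two partitions.
sample = """node01 down*
node01 down*
node02 idle
node03 down
node04 mixed
"""
for cmd in resume_commands(sample):
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` on the control node once the output has been reviewed.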
Test the API from Python

    import requests
    # Ping endpoint of the slurmrestd instance
    url = 'http://172.18.0.115:6688/slurm/v0.0.36/ping'
    # User name and the JWT obtained with `scontrol token`
    headers = {
        'X-SLURM-USER-NAME': 'whq',
        'X-SLURM-USER-TOKEN': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2MzQxOTM4NzMsImlhdCI6MTYzNDE5MjA3Mywic3VuIjoid2hxIn0.82HpB4ss96Iw7o9JAzDp8WGRfFWDOCbPzx-J3Y5nK_U',
    }
    response = requests.get(url, headers=headers)
    print(response.text)

slurm-rest_api reference
https://app.swaggerhub.com/apis/rherrick/slurm-rest_api/0.0.35#/
https://slurm.schedmd.com/SLUG20/REST_API.pdf
https://slurm.schedmd.com/SLUG19/REST_API.pdf
Remote calls to the REST interface are occasionally slow, while calls from inside the cluster are fast. Proxying port 6688 through nginx on port 16688 speeds up the API calls.

    yum install nginx -y
    cat > /etc/nginx/conf.d/slurm.conf << 'EOF'
    upstream backend {
        server 127.0.0.1:6688;
    }
    server {
        listen 16688;
        server_name localhost;
        location / {
            proxy_pass http://backend;
        }
    }
    EOF
    systemctl restart nginx
    systemctl enable nginx
    systemctl status nginx

References
https://www.cnblogs.com/liu-shaobo/p/13285839.html
https://blog.csdn.net/kongxx/article/details/52550653
https://www.jianshu.com/p/c7cf800656dc
https://www.jianshu.com/p/e560b19dbd3e
JWT references
https://slurm.schedmd.com/jwt.html
https://elwe.rhrk.uni-kl.de/documentation/jwt.html
Slurm user manual (Chinese)
https://docs.slurm.cn/users/
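Letting regular users run jobs. There are two mechanisms, AllowAccounts and AllowGroups; AllowGroups is the recommended one.
AllowAccounts: enable account-based access control for the partition

    AccountingStorageEnforce=limits
    …
    PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy
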
The account named after AllowAccounts= has to be created manually. The steps are:

    # List clusters
    sacctmgr list cluster
    # The cluster name must match ClusterName in slurm.conf. If that cluster already exists, it does not need to be created again
    sacctmgr add cluster cluster
    # Add the account. It must be created in the matching cluster, that is, the ClusterName from slurm.conf.
    # Adding root to the account is not enough here; the partition must set AllowAccounts=zkxy,root
    sacctmgr add account name=zkxy cluster=cluster
    # List accounts
    sacctmgr list account
    # Add a user to the account and assign the user a QOS
    sacctmgr add user name=admin account=zkxy qos=normal cluster=cluster
    # List associations
    sacctmgr list assoc
    # Run a job
    srun -N2 hostname
    # The user has to exist on every node
    useradd admin
    echo 123456 | passwd --stdin admin

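Whether a user ended up in an allowed account can be read from the association list. The Python sketch below parses pipe-delimited text in the form printed by `sacctmgr -n -P list assoc format=cluster,account,user` and compares the user's accounts with an AllowAccounts value. It is a simplified check written for illustration with hand-made sample data and hypothetical function names; Slurm's own decision depends on the account a job is actually submitted under.

```python
def accounts_of(user, assoc_output, cluster="cluster"):
    """Return the set of accounts the user is associated with in one cluster."""
    accounts = set()
    for line in assoc_output.splitlines():
        fields = line.strip().split("|")
        # Expected columns: cluster, account, user (user is empty for account rows).
        if len(fields) == 3 and fields[0] == cluster and fields[2] == user:
            accounts.add(fields[1])
    return accounts


def in_allowed_account(user, assoc_output, allow_accounts, cluster="cluster"):
    """True if any of the user's accounts appears in the AllowAccounts list."""
    allowed = {a.strip() for a in allow_accounts.split(",") if a.strip()}
    return bool(accounts_of(user, assoc_output, cluster) & allowed)


# Hand-written sample in the cluster|account|user layout.
sample = """cluster|root|
cluster|root|root
cluster|zkxy|
cluster|zkxy|admin
"""
print(accounts_of("admin", sample))
print(in_allowed_account("admin", sample, "zkxy,root"))
print(in_allowed_account("whq", sample, "zkxy,root"))
```

In the sample, `admin` belongs to `zkxy` and passes the check, while `whq` has no association and does not.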
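AllowGroups: enable group-based access control for the partition (this mechanism does not work on its own; it has to be combined with AllowAccounts, which limits its usefulness)

    AccountingStorageEnforce=limits
    …
    PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowGroups=edauser AllowAccounts=zkxy

The edauser group after AllowGroups= is a group name from the /etc/group file.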

    mkdir /data2
    mount -t nfs -o nolock,nfsvers=3 172.18.0.21:/mnt/UserDataTemp/UserTemp /data2
    echo "172.18.0.21:/mnt/UserDataTemp/UserTemp /data2 nfs defaults 0 0" >> /etc/fstab

If slurmdbd reports the following error on startup

    slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout

set these values in the MySQL configuration:

    [mysqld]
    innodb_buffer_pool_size=1024M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900

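The three settings can also be checked before restarting slurmdbd. Below is a minimal Python sketch that reads the `[mysqld]` section of a my.cnf style text and lists settings that are missing or lower than the values shown above. The thresholds are the ones from this article, not values read from slurmdbd itself. The sketch assumes underscore-style keys and a file without `!includedir` lines, and the function names are made up.

```python
import configparser

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# Values taken from the [mysqld] snippet in this article.
WANTED = {
    "innodb_buffer_pool_size": "1024M",
    "innodb_log_file_size": "64M",
    "innodb_lock_wait_timeout": "900",
}


def to_number(value):
    """Convert '1024M', '1G' or '900' to an integer."""
    value = value.strip().upper()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def low_settings(cnf_text, wanted=WANTED):
    """Return {key: current value} for settings that are missing or too low."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    section = parser["mysqld"] if parser.has_section("mysqld") else {}
    low = {}
    for key, minimum in wanted.items():
        current = section.get(key)
        if current is None or to_number(current) < to_number(minimum):
            low[key] = current
    return low


# Sample configuration with one low value and one missing value.
cnf = "[mysqld]\ninnodb_buffer_pool_size=128M\ninnodb_lock_wait_timeout=900\n"
print(low_settings(cnf))
```

An empty dict means all three settings are at or above the values from the article.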
Reference
https://www.cnblogs.com/dahu-daqing/p/12693334.html