Slurm Test Environment Setup
1. Machine plan
Host:
HPC_Slurm_Main:192.168.141.135
Clients:
HPC_Slurm_Client01:192.168.141.136
HPC_Slurm_Client02:192.168.141.137
HPC_Slurm_Client03:192.168.141.138
2. Set hostnames: edit /etc/hosts and /etc/hostname on every machine
192.168.141.136 node1-nfs
192.168.141.137 node2-nfs
192.168.141.138 node3-nfs
192.168.141.135 control1-nfs
192.168.141.136 node1
192.168.141.137 node2
192.168.141.138 node3
192.168.141.135 control1
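The entries above can be checked with a small lookup helper. This is only a sketch: the function name `lookup_host` is hypothetical and not part of the original setup. It reads a hosts-format file and prints the IP mapped to a given name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look up NAME in a hosts-format FILE and print its IP.
# Returns non-zero if the name is not listed in the file.
lookup_host() {
    local file=$1 name=$2
    awk -v n="$name" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            # field 1 is the IP, the remaining fields are names/aliases
            for (i = 2; i <= NF; i++)
                if ($i == n) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "$file"
}
```

For example, `lookup_host /etc/hosts control1-nfs` should print 192.168.141.135 on a machine where the entries above are in place.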
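3. NFS deployment
3.1 Server: sudo apt-get install nfs-kernel-server
Add the following line to /etc/exports:
/home/xxx/software *(insecure,rw,sync,no_root_squash)
sudo /etc/init.d/nfs-kernel-server restart && sudo systemctl enable nfs-kernel-server
Verify: sudo exportfs -rv
3.2 Client: sudo apt-get install nfs-common
a. Mount the NFS share on the client at boot: add the following line to /etc/fstab to mount software permanently (the remote path must match the directory exported in /etc/exports):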
control1-nfs:/software /software nfs defaults 0 0
(Temporary test option (not recommended): sudo mount -t nfs control1-nfs:/home/jose/software /home/jose/software)
b. Unmount on the client: sudo umount /software
sudo reboot
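After the reboot, it is worth confirming that the share was actually mounted. The sketch below assumes a Linux client where /proc/mounts is available. The function name `is_nfs_mounted` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if MOUNTPOINT is listed with an NFS filesystem
# type in a mounts-format file (defaults to /proc/mounts).
is_nfs_mounted() {
    local mountpoint=$1 mounts=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        # columns: device, mount point, filesystem type, options, ...
        $2 == mp && ($3 == "nfs" || $3 == "nfs4") { found = 1 }
        END { exit !found }
    ' "$mounts"
}
```

Usage on a client: `is_nfs_mounted /software && echo "NFS share is mounted"`.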
4. Munge deployment
1. useradd -m munge
2. apt install munge
Host:
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key # Create the cluster-wide shared key on the master node
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
chown -R munge: /var/lib/munge
chown -R munge: /var/run/munge
chown -R munge: /var/log/munge
scp /etc/munge/munge.key jose@node1:/etc/munge/
scp /etc/munge/munge.key jose@node2:/etc/munge/
scp /etc/munge/munge.key jose@node3:/etc/munge/
systemctl start munge
systemctl enable munge
Permission settings (important):
sudo chmod 1775 /etc/munge
sudo chmod 0600 /etc/munge/munge.key
# If munge.key has the wrong owner, run the following command
sudo chown munge: /etc/munge/munge.key
Client:
sudo apt install rng-tools5
sudo rngd -r /dev/urandom
sudo chmod 700 /etc/munge
sudo chown -R munge: /etc/munge
sudo chown -R munge: /var/lib/munge
sudo chown -R munge: /var/run/munge
sudo chown -R munge: /var/log/munge
sudo systemctl start rngd
sudo systemctl start munge
sudo systemctl enable rngd
sudo systemctl enable munge
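Because munged is strict about key permissions, a quick check on each node can save debugging time. The following is a minimal sketch, assuming GNU coreutils `stat`. The function name `key_mode_ok` is hypothetical. It accepts only the two modes used in the steps above (400 and 600).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if KEYFILE has mode 400 or 600,
# i.e. no permission bits for group or others.
key_mode_ok() {
    local keyfile=$1 mode
    # stat -c '%a' prints the octal permission bits (GNU coreutils)
    mode=$(stat -c '%a' "$keyfile") || return 2
    case "$mode" in
        400|600) return 0 ;;
        *) echo "unexpected mode $mode on $keyfile" >&2; return 1 ;;
    esac
}
```

Usage: `sudo bash -c "$(declare -f key_mode_ok); key_mode_ok /etc/munge/munge.key"`. Once the service is running, `munge -n | unmunge` tests a credential locally, and `munge -n | ssh node1 unmunge` tests that node1 holds the same key.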
5. Slurm deployment
Host:
sudo apt install slurm-wlm -y
sudo apt install slurmctld -y
sudo chmod +r /usr/share/doc/slurmctld/slurm-wlm-configurator.html
Client:
sudo apt install slurmd -y
sudo slurmd -c
sudo slurmd -D -s
Host:
cd /usr/share/doc/slurmctld && python3 -m http.server
Open: http://192.168.141.135:8000/slurm-wlm-configurator.easy.html
Paste the generated content into the configuration file: /etc/slurm/slurm.conf
sudo mkdir /var/spool/slurmd
sudo mkdir /var/spool/slurmctld
# Start slurmd; its log file is `/var/log/slurmd.log`
sudo systemctl start slurmd
# Start slurmctld; its log file is `/var/log/slurmctld.log`
sudo systemctl start slurmctld
# Check the status of slurmd
sudo systemctl status slurmd
# Check the status of slurmctld
sudo systemctl status slurmctld
# In slurm.conf, change ProctrackType=proctrack/cgroup to ProctrackType=proctrack/pgid
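For orientation, the configurator output for this cluster might reduce to something like the fragment written by the sketch below. It is illustrative only: the function name `write_sample_slurm_conf`, the cluster name `testcluster`, the partition name `debug`, and `CPUs=1` are assumptions rather than values from the original notes. Take the real CPU count from the configurator or from the node itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an illustrative slurm.conf fragment to OUT.
# Host names, spool directories and log paths follow the notes above;
# ClusterName, CPUs and the partition name are placeholders (assumptions).
write_sample_slurm_conf() {
    local out=$1
    cat > "$out" <<'EOF'
# Placeholder cluster name (assumption)
ClusterName=testcluster
SlurmctldHost=control1
AuthType=auth/munge
ProctrackType=proctrack/pgid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
SlurmdLogFile=/var/log/slurmd.log
SlurmctldLogFile=/var/log/slurmctld.log
# CPUs=1 is a placeholder; use the real value for each node
NodeName=node[1-3] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF
}
```

Writing to a scratch path first, for example `write_sample_slurm_conf /tmp/slurm.conf.sample`, makes it easy to compare the result with the generated file before touching /etc/slurm/slurm.conf.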
6. Slurm MySQL
sudo apt-get install mysql-server libmysqlclient-dev -y
Create the corresponding user in MySQL:
$ mysql -u root -p
create user 'slurm'@'localhost' identified by '2023@Slurm';
grant all on slurm_acct_db.* to 'slurm'@'localhost';
# scontrol update NodeName=<node> State=RESUME
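To find which nodes the resume command above applies to, the node states reported by `sinfo` can be filtered. This is a sketch under the assumption that states are read in the `%N %T` format. The function name `nodes_needing_resume` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "NODE STATE" lines on stdin (for example from
# `sinfo -h -N -o '%N %T'`) and print each node whose state starts with
# "down" or "drain", one per line, without duplicates.
nodes_needing_resume() {
    awk '$2 ~ /^(down|drain)/ { print $1 }' | sort -u
}
```

Usage: `sinfo -h -N -o '%N %T' | nodes_needing_resume` lists the candidates. Each one can then be passed to `sudo scontrol update NodeName=<node> State=RESUME` once the cause of the failure has been fixed.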