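PBS (Portable Batch System) provides resource management and job scheduling for compute clusters. It has three main branches: OpenPBS, PBS Pro, and Torque.
For Slurm cluster deployment, see: https://www.cnblogs.com/liu-shaobo/p/13285839.html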
I. Base environment
1. Hostnames and IP addresses
Control node: 192.168.1.11 m1
Compute node: 192.168.1.12 c1
Compute node: 192.168.1.13 c2
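The hostnames m1, c1 and c2 are used by later steps (the NFS mount, ssh-copy-id, and the $pbsserver setting), so every node has to resolve them. The following is a minimal sketch, assuming name resolution goes through /etc/hosts rather than DNS; the function name print_cluster_hosts is hypothetical.

```shell
#!/bin/sh
# Print /etc/hosts entries for the three nodes listed above.
# Assumption: name resolution is done through /etc/hosts rather than DNS.
print_cluster_hosts() {
    cat <<'EOF'
192.168.1.11 m1
192.168.1.12 c1
192.168.1.13 c2
EOF
}

# Preview the entries; to apply them, run as root on every node:
#   print_cluster_hosts >> /etc/hosts
print_cluster_hosts
```

Printing the entries first makes it easy to review them before appending to a system file.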
2. Host configuration
OS: CentOS 7.6 x86_64
CPU: 4 cores
Memory: 4 GB
3. Disable the firewall
# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables
4. Adjust resource limits
# vim /etc/security/limits.conf
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited
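Entries in limits.conf only take effect for sessions started after the change, so it can help to log in again and compare the session limits with the values above. The following is a minimal sketch using the shell's ulimit builtin; the function name show_session_limits is hypothetical.

```shell
#!/bin/sh
# Report the limits of the current shell session so they can be compared
# with the values written to /etc/security/limits.conf above.
show_session_limits() {
    echo "nofile_soft=$(ulimit -S -n)"
    echo "nofile_hard=$(ulimit -H -n)"
    echo "core_soft=$(ulimit -S -c)"
}

show_session_limits
```

In a fresh login session with the settings above applied, both nofile values would be expected to read 1000000.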
5. Set the CST time zone
# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
Synchronize with an NTP server
# ntpdate 210.72.145.44
# yum install ntp -y
# systemctl start ntpd
# systemctl enable ntpd
Install the EPEL repository
# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm
6. Install NFS
# yum install -y nfs-utils rpcbind
Edit the /etc/exports file
# mkdir /software
# cat /etc/exports
/software *(rw,async,insecure,no_root_squash)
Start NFS
# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind
Mount the NFS share on the compute nodes
# yum install -y nfs-utils
# mkdir /software
# mount m1:/software /software
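A mount made with the mount command does not survive a reboot. The following is a minimal sketch of how the same mount could be made persistent through /etc/fstab; the function name nfs_fstab_line and the choice of mount options (defaults,_netdev) are assumptions, not part of the original steps.

```shell
#!/bin/sh
# Print an /etc/fstab line that mounts the NFS export at boot.
# Arguments: server hostname, exported path, local mount point.
# "_netdev" marks the filesystem as requiring the network.
nfs_fstab_line() {
    printf '%s:%s %s nfs defaults,_netdev 0 0\n' "$1" "$2" "$3"
}

# Preview the entry for this cluster; to apply it on a compute node (as root):
#   nfs_fstab_line m1 /software /software >> /etc/fstab && mount -a
nfs_fstab_line m1 /software /software
```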
7. Configure passwordless SSH on the management node
# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2
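II. Deploy the Torque management node
Torque consists of four services:
pbs_server: the resource manager's server; it dispatches and reclaims jobs according to the list of available node resources supplied by the scheduler;
pbs_mom: the execution daemon on each compute node; it runs jobs and monitors the node's resource usage;
trqauthd: the authorization daemon; Torque client commands use it to establish trusted connections to pbs_server;
pbs_sched: the job scheduler;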
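Which of these daemons runs where follows from the rest of this guide: the management node enables pbs_server, trqauthd and pbs_sched, and each compute node enables pbs_mom and trqauthd. The following is a minimal sketch of a status check built on that mapping; the function names are hypothetical, and the check assumes the systemd units are already installed.

```shell
#!/bin/sh
# Map a node role to the Torque daemons this guide enables on it.
torque_services_for_role() {
    case "$1" in
        management) echo "pbs_server trqauthd pbs_sched" ;;
        compute)    echo "pbs_mom trqauthd" ;;
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Print the systemd state of each daemon expected for the given role.
# Requires systemctl, so run it on a node after the services are installed.
check_torque_services() {
    for svc in $(torque_services_for_role "$1"); do
        printf '%s: %s\n' "$svc" "$(systemctl is-active "$svc" 2>/dev/null)"
    done
}

# Usage on the management node:  check_torque_services management
# Usage on a compute node:       check_torque_services compute
```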
1. Install dependencies
# yum install -y libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++ hwloc hwloc-devel
2. Install Torque
# wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
# tar zxvf torque-6.1.3.tar.gz
# cd torque-6.1.3
# ./configure --prefix=/usr/local/torque --with-scp
# make -j4
# make install
Build the installation packages needed by the compute nodes; this generates five executable scripts
# make packages
# libtool --finish /usr/local/torque/lib
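Of the generated packages, the compute nodes in this guide only install the mom and clients ones. The following is a minimal sketch that copies those two scripts into a shared directory so the compute nodes can reach them; the function name stage_compute_packages and the destination path in the usage comment are assumptions.

```shell
#!/bin/sh
# Copy the two packages the compute nodes need into a shared directory.
# $1: Torque build directory (where "make packages" was run)
# $2: destination directory, e.g. a path on the NFS share
stage_compute_packages() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for kind in mom clients; do
        name="torque-package-$kind-linux-x86_64.sh"
        if [ ! -f "$src/$name" ]; then
            echo "missing package: $src/$name" >&2
            return 1
        fi
        cp "$src/$name" "$dest/" || return 1
        chmod +x "$dest/$name" || return 1
    done
}

# Example, run from the build directory (the destination is an assumption):
#   stage_compute_packages "$PWD" /software/torque-packages
```

Failing early on a missing package avoids a partially populated share.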
3. Configure the Torque server
Add environment variables
# . /etc/profile.d/torque.sh
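The step above sources /etc/profile.d/torque.sh, but the content of that file is not shown. The following is a minimal sketch of one way to create it, assuming the file only needs to put the bin and sbin directories under the --prefix used earlier on PATH; the function name write_torque_profile is hypothetical.

```shell
#!/bin/sh
# Write a profile snippet that puts the Torque binaries on PATH.
# $1: file to write (this guide sources /etc/profile.d/torque.sh)
# $2: install prefix, matching --prefix passed to ./configure
write_torque_profile() {
    cat > "$1" <<EOF
export PATH=$2/bin:$2/sbin:\$PATH
EOF
}

# Example (as root on the management node):
#   write_torque_profile /etc/profile.d/torque.sh /usr/local/torque
#   . /etc/profile.d/torque.sh
```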
Initialize serverdb
# qterm
# ./torque.setup root
4. Start the Torque server
# qterm
# systemctl enable pbs_server
# systemctl start pbs_server
# systemctl enable trqauthd
# systemctl start trqauthd
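III. Deploy the Torque compute nodes
1. Install the client
Copy the installation packages from the torque build directory to each compute node, or place them in the NFS directory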
# ./torque-package-mom-linux-x86_64.sh --install
# ./torque-package-clients-linux-x86_64.sh --install
2. Configure the client
# vim /var/spool/torque/mom_priv/config
$pbsserver m1
$logevent 225
$loglevel 4
$usecp m1:/data /data
3. Start the client
# systemctl enable pbs_mom
# systemctl start pbs_mom
# systemctl enable trqauthd
# systemctl start trqauthd
Make sure the server_name file contains the management node's hostname
# cat /var/spool/torque/server_name
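In the other direction, pbs_server on the management node reads its list of compute nodes from the server_priv/nodes file, one line per node. The steps in this guide do not show that file, so the following is a minimal sketch under the assumption that it lives at /var/spool/torque/server_priv/nodes and that each node offers 4 cores; the function name torque_nodes_entries is hypothetical.

```shell
#!/bin/sh
# Print server_priv/nodes entries: one "hostname np=<cores>" line per node.
# $1: cores per node; remaining arguments: compute node hostnames.
torque_nodes_entries() {
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# Preview for this cluster (two 4-core compute nodes). To apply on m1 as root:
#   torque_nodes_entries 4 c1 c2 > /var/spool/torque/server_priv/nodes
# and then restart pbs_server so that it rereads the file.
torque_nodes_entries 4 c1 c2
```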
Check the status of each node
# qnodes
c1
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux c1 4.19.0-6.el7.ucloud.x86_64 #1 SMP Wed Feb 12 07:32:16 UTC 2020 x86_64,sessions=44684,nsessions=1,nusers=1,idletime=142984,totmem=3873444kb,availmem=3429928kb,physmem=3873444kb,ncpus=4,loadave=0.00,gres=,netload=358955336,state=free,varattr= ,cpuclock=Fixed,macaddr=52:54:00:ba:9e:8b,version=6.1.3,rectime=1597632442,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003

c2
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux c2 4.19.0-6.el7.ucloud.x86_64 #1 SMP Wed Feb 12 07:32:16 UTC 2020 x86_64,nsessions=0,nusers=0,idletime=2262,totmem=3873444kb,availmem=3553212kb,physmem=3873444kb,ncpus=4,loadave=0.01,gres=,netload=61440494,state=free,varattr= ,cpuclock=Fixed,macaddr=52:54:00:e9:4a:a6,version=6.1.3,rectime=1597632467,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003
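The full qnodes listing is verbose when only the node state matters. The following is a minimal sketch that condenses it to one line per node, assuming the usual layout in which node names start in the first column and attributes are indented; the function name summarize_node_states is hypothetical.

```shell
#!/bin/sh
# Read qnodes output on stdin and print "<node> <state>" for each node.
# A node name is a line that starts in column 1; its attributes are indented.
summarize_node_states() {
    awk '
        /^[A-Za-z0-9]/ { node = $1 }                    # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 }   # indented state attribute
    '
}

# Usage on the management node:
#   qnodes | summarize_node_states
```

Matching on the separate "state" attribute line avoids picking up the state= field embedded in the status string.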
IV. Configure the scheduler on the management node
1. Start the scheduler
# cp contrib/systemd/pbs_sched.service /usr/lib/systemd/system/
# systemctl enable pbs_sched
# systemctl start pbs_sched
2. Configure the queue
# qmgr -c 'create queue batch'
# qmgr -c 'set server default_queue=batch'
# qmgr -c 'set server query_other_jobs=true'
# qmgr -c 'set queue batch queue_type=execution'
# qmgr -c 'set queue batch started=true'
# qmgr -c 'set queue batch enabled=true'
# qmgr -c 'set queue batch resources_default.nodes=1'
# qmgr -c 'set server scheduling=true'
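qmgr also accepts directives on standard input, so the same queue settings can be kept in one place and applied in a single call. The following is a minimal sketch of that approach; the function name queue_directives is hypothetical, and the directives are the ones used above.

```shell
#!/bin/sh
# Print the qmgr directives used above for an execution queue named $1.
queue_directives() {
    q=$1
    cat <<EOF
create queue $q
set queue $q queue_type=execution
set queue $q started=true
set queue $q enabled=true
set queue $q resources_default.nodes=1
set server default_queue=$q
set server query_other_jobs=true
set server scheduling=true
EOF
}

# Preview the directives; to apply them on the management node (as root):
#   queue_directives batch | qmgr
# To review the resulting configuration afterwards:
#   qmgr -c 'print server'
queue_directives batch
```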
3. Test (set up passwordless SSH to the compute nodes, and run as a regular user)
$ echo "sleep 30" | qsub
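For anything beyond a one-liner, jobs are usually submitted as a script with #PBS directives. The following is a minimal sketch that writes such a script; the function name write_test_job, the job name sleep_test, and the requested resources are illustrative assumptions.

```shell
#!/bin/sh
# Write a small Torque job script to the file given as $1.
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
#PBS -N sleep_test
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -j oe
# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
cd "$PBS_O_WORKDIR" || exit 1
echo "job $PBS_JOBID running on $(hostname)"
sleep 30
EOF
}

# Usage, as a regular user on the management node:
#   write_test_job test_job.sh && qsub test_job.sh
```

The -j oe directive merges standard error into the standard output file, which keeps the result of a test job in a single file.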
View job information
# qstat -a
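In the qstat listing, the S column shows each job's state as a single letter. The following is a minimal sketch that translates those letters into words, based on the states described in the qstat manual page; the function name describe_job_state is hypothetical.

```shell
#!/bin/sh
# Translate the one-letter job state shown in the "S" column of qstat.
describe_job_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        E) echo "exiting" ;;
        C) echo "completed" ;;
        H) echo "held" ;;
        W) echo "waiting" ;;
        T) echo "being moved" ;;
        S) echo "suspended" ;;
        *) echo "unknown" ;;
    esac
}

# Example: describe_job_state R
```

For the sleep 30 test job, the state would be expected to go from Q to R and then to C once the job finishes.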