Environment Setup Notes: Installing a Job Scheduling System (CentOS 7)

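This follows the same procedure as the previous post on the CentOS installation. That attempt failed, and I have somewhat arbitrarily blamed the failure on an incompatibility between torque and CentOS 8.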

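1. Update the yum repositories:

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos7.repo

(Find a mirror that works for you; my own update procedure is described in the blog post "BUG-yum源更新", i.e. "BUG: updating the yum source".)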

yum clean all

yum makecache

2. Install gcc and the other dependencies

yum install gcc gcc-c++ tcl-devel tk-devel make -y

yum clean all

3. Download torque

cd /path

mkdir Ypipe

cd Ypipe

mkdir soft

cd soft

wget https://src.fedoraproject.org/lookaside/pkgs/torque/torque-6.1.1.1.tar.gz

-bash: wget: command not found

wget https://src.fedoraproject.org/lookaside/pkgs/torque/torque-6.1.1.1.tar.gz/sha512/74ff683f56d04a4d08774896c9f9875c68aa2cacfe6c1c8c65246da52396443d3f7497bc8a6a1f06d357f52c65153fc9db00692f514ac30279e4c765547d98c0/torque-6.1.1.1.tar.gz
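The second download URL above carries a SHA-512 digest in its path, so the tarball can be checked after download. A minimal sketch, assuming the digest in the URL is the checksum of the archive itself; verify_sha512 is a hypothetical helper name, not part of torque:

```shell
# Compare the SHA-512 digest of a file with an expected value.
# Returns 0 on a match, non-zero otherwise.
verify_sha512() {
    local file=$1 expected=$2 actual
    actual=$(sha512sum "$file" 2>/dev/null | awk '{print $1}')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage, with the digest copied from the URL path:
# verify_sha512 torque-6.1.1.1.tar.gz \
#   74ff683f56d04a4d08774896c9f9875c68aa2cacfe6c1c8c65246da52396443d3f7497bc8a6a1f06d357f52c65153fc9db00692f514ac30279e4c765547d98c0 \
#   && echo "checksum OK"
```

If the check fails, it is probably safer to download again than to continue with ./configure.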

tar -xzf torque-6.1.1.1.tar.gz

cd torque-6.1.1.1

./configure

configure: error: TORQUE needs libxml2-devel in order to build

yum install libxml2-devel openssl-devel gcc gcc-c++ boost-devel libtool -y
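The configure failure above only shows up one missing dependency at a time. A small pre-check can list missing build tools in one pass. This is a sketch under assumptions: missing_commands is a hypothetical helper, and it only checks commands on PATH, so header packages such as libxml2-devel still have to be checked by configure itself.

```shell
# Print every command from the argument list that is not found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage before running ./configure:
# missing_commands gcc g++ make libtool
```

An empty result means all listed tools were found.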

./configure

make

 make install

Installation succeeded!

#####################################################

#####################################################

yum install vim

mv /etc/hosts /etc/hosts-bak

vim /etc/hosts

ip1 node1

ip2 node2
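Editing /etc/hosts by hand on every machine is easy to get wrong, so the entries can also be appended idempotently. A minimal sketch; add_host_entry is a hypothetical helper, and the address-to-name pairing in the usage lines is an assumption, since the post only gives ip1/ip2 placeholders:

```shell
# Append "ip name" to a hosts file unless a non-comment line
# already lists that exact hostname.
add_host_entry() {
    local file=$1 ip=$2 name=$3
    if awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage (assumed pairing of addresses and names):
# add_host_entry /etc/hosts 192.168.70.179 node1
# add_host_entry /etc/hosts 192.168.70.178 node2
```

Running it twice leaves the file unchanged, which avoids the need to move /etc/hosts aside first.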

make packages # build the install packages for all compute nodes

 

useradd usr_name

torque.setup usr_name

$ pbsnodes

It reported an error, so I rebooted.

$ pbsnodes
socket_connect_unix failed: 15137
pbsnodes: cannot connect to server 179.70.168.192.in-addr.arpa, error=15137 (could not connect to trqauthd)

service pbs_server start

Server node configuration

cp pbs_mom pbs_sched pbs_server trqauthd /etc/init.d/

$ for i in pbs_mom pbs_sched pbs_server trqauthd;do chkconfig --add $i; chkconfig $i on; service $i start; done
Note: Forwarding request to 'systemctl enable pbs_mom.service'.
Created symlink from /etc/systemd/system/multi-user.target.wants/pbs_mom.service to /usr/lib/systemd/system/pbs_mom.service.
Starting pbs_mom (via systemctl): [ OK ]
Starting pbs_sched (via systemctl): [ OK ]
Note: Forwarding request to 'systemctl enable pbs_server.service'.
Created symlink from /etc/systemd/system/multi-user.target.wants/pbs_server.service to /usr/lib/systemd/system/pbs_server.service.
Starting pbs_server (via systemctl): [ OK ]
Note: Forwarding request to 'systemctl enable trqauthd.service'.
Created symlink from /etc/systemd/system/multi-user.target.wants/trqauthd.service to /usr/lib/systemd/system/trqauthd.service.
Starting trqauthd (via systemctl): [ OK ]

 

Check the number of threads on the node:

less /proc/cpuinfo

 

This node has two CPUs.
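Paging through /proc/cpuinfo works, but the counts can be read off directly. A sketch with two hypothetical helpers: one counts logical processors (threads), the other counts distinct physical packages, assuming the usual x86 "processor" and "physical id" fields are present:

```shell
# Number of logical processors (one "processor" line each).
count_processors() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Number of physical CPU packages (distinct "physical id" values).
count_sockets() {
    grep '^physical id' "${1:-/proc/cpuinfo}" | sort -u | wc -l | tr -d ' '
}

# Usage on the node itself:
# count_processors
# count_sockets
```

The distinction matters for the np value in the nodes file, which refers to job slots rather than physical CPUs.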

Configure the compute nodes:

vi /var/spool/torque/server_priv/nodes
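The post does not show what goes into the nodes file. In torque each line is a hostname followed by optional attributes such as np. A minimal sketch that generates such a file; write_nodes_file is a hypothetical helper, and the node names and np=16 in the usage line are taken from the pbsnodes output later in this post:

```shell
# Write one "hostname np=N" line per node name into the given file,
# replacing any previous content.
write_nodes_file() {
    local file=$1 np=$2 name
    shift 2
    : > "$file"
    for name in "$@"; do
        printf '%s np=%s\n' "$name" "$np" >> "$file"
    done
}

# Usage on the server node:
# write_nodes_file /var/spool/torque/server_priv/nodes 16 node1 node2
```

pbs_server presumably needs a restart before it picks up a changed nodes file.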

 

 

$ vi /etc/sysconfig/network # change the hostname

vim /etc/profile # add the lines below, then source the file

TORQUE=/opt/torque
MAUI=/opt/maui
if [ "$(id -u)" -eq 0 ]; then
PATH=$PATH:$TORQUE/bin:$TORQUE/sbin:$MAUI/sbin:$MAUI/bin
else
PATH=$PATH:$TORQUE/bin:$MAUI/bin
fi
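Because /etc/profile is sourced on every login, and possibly more than once per session, plain appends can pile up duplicate PATH entries. A sketch of the same snippet with a guard; path_append is a hypothetical helper, and the /opt/torque and /opt/maui prefixes are simply the ones used above:

```shell
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;
        *) PATH="$PATH:$1" ;;
    esac
}

TORQUE=/opt/torque
MAUI=/opt/maui
path_append "$TORQUE/bin"
path_append "$MAUI/bin"
# Only root also gets the sbin directories.
if [ "$(id -u)" -eq 0 ]; then
    path_append "$TORQUE/sbin"
    path_append "$MAUI/sbin"
fi
export PATH
```

Sourcing the file repeatedly then leaves PATH unchanged after the first time.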

 

##################

Compute node configuration:

Copy the files to the compute nodes:

for i in 192.168.70.179 192.168.70.178;do
scp /chenlab/Ypipe/soft/torque-6.1.1.1/contrib/init.d/{pbs_mom,trqauthd} root@$i:/etc/init.d/
scp /chenlab/Ypipe/soft/torque-6.1.1.1/{torque-package-clients-linux-x86_64.sh,torque-package-mom-linux-x86_64.sh} root@$i:/root
done

Run the following steps on each node individually:

cd /root

./torque-package-clients-linux-x86_64.sh --install

./torque-package-mom-linux-x86_64.sh --install

####################

vi /var/spool/torque/server_name

# set this to the hostname of the login node

 

vi /var/spool/torque/mom_priv/config
$pbsserver 179 # note: hostname running pbs_server
$logevent 255 # bitmap of which events to log
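Since the same two-line file is needed on every compute node, it can be written by a small function instead of an editor session. A minimal sketch; write_mom_config is a hypothetical helper, and it assumes the usual torque convention that directives in mom_priv/config start with a dollar sign:

```shell
# Write a minimal pbs_mom config pointing at the given server host.
# The single quotes keep "$pbsserver" and "$logevent" literal.
write_mom_config() {
    local file=$1 server=$2
    printf '$pbsserver %s\n$logevent 255\n' "$server" > "$file"
}

# Usage on a compute node, with the server hostname as the argument:
# write_mom_config /var/spool/torque/mom_priv/config 179
```

The server argument should match what is stored in /var/spool/torque/server_name.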

$ for i in trqauthd pbs_mom; do chkconfig --add $i; chkconfig $i on; service $i start; done
Note: Forwarding request to 'systemctl enable trqauthd.service'.
Starting trqauthd (via systemctl): [ OK ]
Note: Forwarding request to 'systemctl enable pbs_mom.service'.
Starting pbs_mom (via systemctl): [ OK ]

 

vim /etc/profile # add the same lines as on the login node, then source the file

Reboot.

Done. Let's check the result.

 

 pbsnodes

Unable to communicate with 192.168

Unable to communicate with 192.168.
pbsnodes: cannot connect to server 192.168., error=111 (Connection refused)

 

My guess is that the firewall is the cause.

systemctl stop firewalld.service

That was not it.
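A "Connection refused" (errno 111) generally means the connection attempt was actively rejected, most often because nothing is listening on that port, so probing the port directly can narrow things down before touching the firewall. A rough sketch; port_open is a hypothetical helper that relies on bash's /dev/tcp redirection and the coreutils timeout command, and 15001 is assumed here to be the default pbs_server port:

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
port_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Usage from a compute node, with the server address as the first argument:
# port_open 192.168.70.179 15001 && echo "pbs_server reachable" || echo "not reachable"
```

If the probe fails even with firewalld stopped, the server daemon itself is the more likely suspect.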

==============================================

cd Ypipe/soft/torque-6.1.1.1
ll
cd src
ll
cd tools
ll
cd init.d
ll
cp pbs /etc/init.d/
cd /etc/init.d/
ll
chkconfig –add pbs # en dash instead of "--", not a valid option
chkconfig –list | grep pbs # same en dash problem
chkconfig -–add pbs # still contains an en dash
chkconfig --add pbs
chkconfig –list | grep pbs # en dash again
chkconfig --list | grep pbs
history | less

==============================

service pbs start

service pbs stop

service pbs restart

Restarting PBS
Stopping PBS
Starting PBS
pbs_server port already bound: Address already in use
PBS server
pbs_mom: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/var/spool/torque/mom_priv/mom.lock' - another mom running
cannot lock '/var/spool/torque/mom_priv/mom.lock' - another mom running
PBS mom
pbs_sched: LOG_ERROR::Address already in use (98) in main, bind
PBS sched

=====================

I searched online again for a while, but nothing I found solved the problem.

=====================

I changed a bunch of miscellaneous settings, then rebooted.

 

$ service pbs start
Starting PBS
pbs_server port already bound: Address already in use
PBS server
pbs_mom: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/var/spool/torque/mom_priv/mom.lock' - another mom running
cannot lock '/var/spool/torque/mom_priv/mom.lock' - another mom running
PBS mom
pbs_sched: LOG_ERROR::Address already in use (98) in main, bind
PBS sched

mv mom.lock mom.lock- # after the error above, moved the lock file out of the way

$ service pbs restart
Restarting PBS
Stopping PBS
Starting PBS
pbs_server port already bound: Address already in use
PBS server
pbs_mom: LOG_ERROR::pbs_mom, server port = 15002, errno = 98 (Address already in use), already in use
server port = 15002, errno = 98 (Address already in use), already in use
PBS mom
pbs_sched: LOG_ERROR::Address already in use (98) in main, bind
PBS sched

Still errors.

$ ps -e |grep pbs_
2000 ? 00:00:01 pbs_mom
2003 ? 00:00:00 pbs_server
2026 ? 00:00:00 pbs_server
2045 ? 00:00:00 pbs_sched

$ kill -9 2000 2003 2045
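Copying PIDs by hand makes it easy to miss one; in the listing above, PID 2026 (a second pbs_server) is not in the kill command, which may be why the "port already bound" message persists afterwards. A sketch that pulls the PIDs out of ps -e output; pbs_pids is a hypothetical helper and the daemon names are the four used in this post:

```shell
# Read "ps -e" output on stdin and print the PID of every torque daemon.
pbs_pids() {
    awk '$NF ~ /^(pbs_mom|pbs_server|pbs_sched|trqauthd)$/ { print $1 }'
}

# Usage: list the PIDs first, then stop them. A plain kill (SIGTERM)
# gives the daemons a chance to clean up; kill -9 is a last resort.
# ps -e | pbs_pids
# kill $(ps -e | pbs_pids)
```

After that, ps -e | grep pbs_ should come back empty before the services are started again.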

$ service pbs restart
Restarting PBS
Stopping PBS
Starting PBS
pbs_server port already bound: Address already in use
PBS server
PBS mom
PBS sched

$ pbsnodes
node1
state = free
power_state = Running
np = 16
ntype = cluster
status = opsys=linux,uname=Linux node1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,nsessions=0,nusers=0,idletime=824,totmem=135691344kb,availmem=131037940kb,physmem=131497044kb,ncpus=56,loadave=0.00,gres=,netload=34949482,state=free,varattr= ,cpuclock=Fixed,macaddr=ac:1f:6b:3a:80:94,version=6.1.1.1,rectime=1614222200,jobs=
mom_service_port = 15002
mom_manager_port = 15003

node2
state = down
power_state = Running
np = 16
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003

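It seems to be OK now: node1 reports state = free, although node2 still shows down.

Switch to the user account and test: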

 

$ qsub -W queue=normal zsleep.sh
qsub: submit error (Queue is not enabled MSG=queue is disabled: user fuyuan@chen179, queue normal)

The queue is not enabled. Fine.
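For reference, a queue is normally switched on through qmgr by setting its enabled and started attributes. The sketch below only prints the commands so they can be reviewed first; queue_enable_cmds is a hypothetical helper, and it assumes the queue (here "normal", from the error message) already exists on the server:

```shell
# Print the qmgr commands that enable and start the named queue.
queue_enable_cmds() {
    local q=$1
    printf 'qmgr -c "set queue %s enabled = true"\n' "$q"
    printf 'qmgr -c "set queue %s started = true"\n' "$q"
}

# Usage as root on the server node:
# queue_enable_cmds normal        # review the commands
# queue_enable_cmds normal | sh   # run them
```

This has not been tested against the setup in this post.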

 

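Enabling the queue is covered in the next chapter; once everything has been tested end to end, I will post an update on my official account (公众号).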

 

posted on 2021-02-23 17:05  Yuan-SW-F(abysw)