LSF: OpenLava Job Limits
1. Limiting the number of jobs per host
Set a policy for rd2 in the lsb.hosts file.
1) Before the change
82 [fhu@rd2 11:38:04 ~]$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
rd1 unavail - 1 0 0 0 0 0
rd2 ok - 2 0 0 0 0 0
rd3 ok - 2 0 0 0 0 0
2) Add a policy for rd2 to lsb.hosts, setting MXJ (the maximum number of jobs on the host) to 0 and JL/U (the maximum number of jobs per user) to 0
# lsb.hosts file
Begin Host
HOST_NAME MXJ JL/U r1m pg ls tmp DISPATCH_WINDOW # Keywords
#host0 1 1 3.5/4.5 15/ 12/15 0 () # Example
#host1 () 2 3.5 15/18 12/ 0/ (5:19:00-1:8:30 20:00-8:30)
#host2 () () 3.5/5 18 15 () () # Example
default ! () () () () () () # Example
rd2 0 0 () () () () () # Example
End Host
3) After the change, run badmin reconfig to apply the configuration; rd2 no longer accepts jobs
84 [fhu@rd2 11:40:15 ~]$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
rd1 unavail - 1 0 0 0 0 0
rd2 closed 0 0 0 0 0 0 0
rd3 ok - 2 0 0 0 0 0
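Checking the table by eye works for three hosts; for a larger cluster the same check can be scripted. The following is a minimal sketch, assuming the whitespace-separated column layout shown in the bhosts output above. The function names parse_bhosts and hosts_accepting_jobs are hypothetical helpers for illustration, not part of OpenLava.

```python
# Sample text copied from the bhosts output above (after the rd2 policy is applied).
BHOSTS_OUTPUT = """\
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
rd1 unavail - 1 0 0 0 0 0
rd2 closed 0 0 0 0 0 0 0
rd3 ok - 2 0 0 0 0 0
"""


def parse_bhosts(text):
    """Return one dict per host, keyed by the header column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def hosts_accepting_jobs(rows):
    """Names of hosts whose STATUS column is 'ok' (hypothetical helper)."""
    return [row["HOST_NAME"] for row in rows if row["STATUS"] == "ok"]


rows = parse_bhosts(BHOSTS_OUTPUT)
print(hosts_accepting_jobs(rows))
```

In practice the text would come from running bhosts rather than from a hard-coded string; the sample is embedded here only to keep the sketch self-contained.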
4) As shown, all jobs in the RUN state are on rd3
101 [fhu@rd2 11:42:55 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
872 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
873 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
874 fhu PEND normal rd2 *_sleep.py Jun 22 11:42
875 fhu PEND normal rd2 *_sleep.py Jun 22 11:42
876 fhu PEND normal rd2 *_sleep.py Jun 22 11:42
5) Once the rd2 entry is commented out and badmin reconfig is run to apply the change, PEND jobs are immediately dispatched to rd2
102 [fhu@rd2 11:48:06 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
872 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
873 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
874 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 11:42
875 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 11:42
876 fhu PEND normal rd2 *_sleep.py Jun 22 11:42
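To confirm where jobs ended up without reading each row, the running jobs can be tallied per execution host. This is a rough sketch, assuming the bjobs layout shown above, in which EXEC_HOST is the sixth field on RUN lines and is blank on PEND lines, and assuming job names contain no spaces. running_jobs_per_host is a hypothetical helper name.

```python
from collections import Counter

# Sample text copied from the bjobs output above (after rd2 is re-enabled).
BJOBS_OUTPUT = """\
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
872 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
873 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 11:42
874 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 11:42
875 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 11:42
876 fhu PEND normal rd2 *_sleep.py Jun 22 11:42
"""


def running_jobs_per_host(text):
    """Count RUN jobs per EXEC_HOST; PEND lines have no EXEC_HOST and are skipped."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and fields[2] == "RUN":
            counts[fields[5]] += 1
    return counts


print(dict(running_jobs_per_host(BJOBS_OUTPUT)))
```

For this sample the tally is two running jobs on rd3 and two on rd2, which matches the MAX value of 2 that bhosts reports for each host.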
2. Limiting the number of jobs per queue
Set QJOB_LIMIT, the maximum number of jobs in the queue, for the normal queue in the lsb.queues file.
1) Before the change (bqueues)
104 [fhu@rd2 13:40:37 ~]$ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
normal 30 Open:Active - - - - 0 0 0 0
106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
986 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
987 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
988 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
989 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
990 fhu PEND normal rd2 *_sleep.py Jun 22 13:42
2) Set QJOB_LIMIT = 1
# lsb.queues file
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
NICE = 20
#QJOB_LIMIT = 60 # maximum number of jobs in this queue
QJOB_LIMIT = 1 # job limit of the queue
#UJOB_LIMIT = 5 # maximum number of jobs per user
#PJOB_LIMIT = 2 # maximum number of jobs per processor
#RUN_WINDOW = 5:19:00-1:8:30 20:00-8:30
#r1m = 0.7/2.0 # loadSched/loadStop
#r15m = 1.0/2.5
#pg = 4.0/8
#ut = 0.2
#io = 50/240
#CPULIMIT = 180/apple # CPU usage limit for jobs
#FILELIMIT = 20000
#MEMLIMIT = 5000 # jobs bigger than this (5M) will be niced
#DATALIMIT = 20000 # jobs data segment limit
#STACKLIMIT = 2048
#CORELIMIT = 20000
#PROCLIMIT = 5 # job processor limit
#USERS = all # which users may submit; UserGroup and User names from lsb.users are allowed, separated by spaces
#HOSTS = all # which hosts jobs may be dispatched to; HostGroup and Host names from lsb.hosts are allowed
#PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#ROUND_ROBIN_POLICY = y
#FAIRSHARE = USER_SHARES[[G1,1] [G2,1]]
#HOSTS_SHARES = [all, 5]
DESCRIPTION = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue
3) Run badmin reconfig to apply the change; the queue can now have only 1 job running at a time
119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1095 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:44
1096 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1097 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1098 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1099 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1100 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
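The RUN/PEND split in the two listings can be reproduced with simple arithmetic. The sketch below is a deliberately simplified model, not OpenLava's scheduler: it assumes the number of running jobs is the smallest of the jobs submitted, the free job slots on available hosts, and the queue limit (if one is set). expected_split is a hypothetical helper name.

```python
def expected_split(submitted, free_slots, queue_limit=None):
    """Return (running, pending) under a simplified model of slot and queue limits.

    queue_limit=None means QJOB_LIMIT is not set for the queue.
    """
    running = min(submitted, free_slots)
    if queue_limit is not None:
        running = min(running, queue_limit)
    return running, submitted - running


# Before the change: 5 jobs, rd2 and rd3 each offer 2 slots (rd1 is unavail).
print(expected_split(submitted=5, free_slots=4))

# After QJOB_LIMIT = 1: 6 jobs submitted, same 4 slots.
print(expected_split(submitted=6, free_slots=4, queue_limit=1))
```

Under these assumptions the first case gives 4 running and 1 pending, and the second gives 1 running and 5 pending, which agrees with the bjobs listings above.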
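3. Limiting the number of jobs per user
A) Method 1: per-user configuration
Set MAX_JOBS for the user in the User section of the lsb.users file; this controls the number of jobs that user may run across all hosts.
1) Before the change, user fhu can run 4 jobs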
106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
986 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
987 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
988 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
989 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
990 fhu PEND normal rd2 *_sleep.py Jun 22 13:42
2) Set MAX_JOBS=1
# lsb.users file
Begin User
USER_NAME MAX_JOBS JL/P
#develop@ 20 8
#support 50 -
fhu 1 -
End User
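When many users have entries, it can help to read the limits back programmatically before running badmin reconfig. Here is a minimal sketch, assuming the simple three-column User section shown above, where lines starting with # are comments and - means no limit. parse_user_limits is a hypothetical helper, and the sketch does not attempt to cover every syntax lsb.users may allow.

```python
# Sample text copied from the User section above.
LSB_USERS = """\
Begin User
USER_NAME MAX_JOBS JL/P
#develop@ 20 8
#support 50 -
fhu 1 -
End User
"""


def parse_user_limits(text):
    """Map each active user entry to its MAX_JOBS value (None when unlimited)."""
    limits = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line == "Begin User":
            in_section = True
            continue
        if line == "End User":
            in_section = False
            continue
        fields = line.split()
        if not in_section or fields[0] == "USER_NAME" or len(fields) < 2:
            continue  # outside the section, header row, or malformed line
        value = fields[1]
        limits[fields[0]] = None if value in ("-", "()") else int(value)
    return limits


print(parse_user_limits(LSB_USERS))
```

Only fhu appears in the result for this sample, because the develop@ and support entries are commented out.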
3) After badmin reconfig takes effect, user fhu has only 1 job running across all hosts
119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1095 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:44
1096 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1097 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1098 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1099 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1100 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1101 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
4) After setting MAX_JOBS=3 and applying it with badmin reconfig, user fhu has 3 jobs running across all hosts
120 [fhu@rd2 14:13:20 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1308 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 14:12
1309 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 14:12
1310 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 14:12
1311 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1312 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1313 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1314 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
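B) Method 2: configuration for all users
Set UJOB_LIMIT in lsb.queues; this controls the number of jobs each user may run in the queue.
1) Before the change, user fhu can run 4 jobs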
106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
986 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
987 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:42
988 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
989 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 13:42
990 fhu PEND normal rd2 *_sleep.py Jun 22 13:42
2) Set UJOB_LIMIT=1
# lsb.queues file
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
NICE = 20
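#QJOB_LIMIT = 60 # maximum number of jobs in this queue
#QJOB_LIMIT = 1 # job limit of the queue; disabled here, otherwise it would cap the whole queue at 1 running job
UJOB_LIMIT = 1 # maximum number of jobs per user
#PJOB_LIMIT = 2 # maximum number of jobs per processor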
#RUN_WINDOW = 5:19:00-1:8:30 20:00-8:30
#r1m = 0.7/2.0 # loadSched/loadStop
#r15m = 1.0/2.5
#pg = 4.0/8
#ut = 0.2
#io = 50/240
#CPULIMIT = 180/apple # CPU usage limit for jobs
#FILELIMIT = 20000
#MEMLIMIT = 5000 # jobs bigger than this (5M) will be niced
#DATALIMIT = 20000 # jobs data segment limit
#STACKLIMIT = 2048
#CORELIMIT = 20000
#PROCLIMIT = 5 # job processor limit
#USERS = all # which users may submit; UserGroup and User names from lsb.users are allowed, separated by spaces
#HOSTS = all # which hosts jobs may be dispatched to; HostGroup and Host names from lsb.hosts are allowed
#PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#ROUND_ROBIN_POLICY = y
#FAIRSHARE = USER_SHARES[[G1,1] [G2,1]]
#HOSTS_SHARES = [all, 5]
DESCRIPTION = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue
3) After badmin reconfig takes effect, user fhu has only 1 job running across all hosts
119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1095 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 13:44
1096 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1097 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1098 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1099 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1100 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
1101 fhu PEND normal rd2 *_sleep.py Jun 22 13:44
4) After setting UJOB_LIMIT=3 and applying it with badmin reconfig, user fhu has 3 jobs running across all hosts
120 [fhu@rd2 14:13:20 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1308 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 14:12
1309 fhu RUN normal rd2 rd2 *_sleep.py Jun 22 14:12
1310 fhu RUN normal rd2 rd3 *_sleep.py Jun 22 14:12
1311 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1312 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1313 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
1314 fhu PEND normal rd2 *_sleep.py Jun 22 14:12
C) Note
- When MAX_JOBS in lsb.users and UJOB_LIMIT in lsb.queues are both configured, the smaller of the two values takes effect.
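The rule in the note can be written out as a small function. This is only a sketch of the rule as stated here, with None standing in for a limit that is not configured; effective_user_limit is a hypothetical name, not an OpenLava API.

```python
def effective_user_limit(max_jobs=None, ujob_limit=None):
    """Return the per-user job limit that applies, or None if neither is set.

    max_jobs   -- MAX_JOBS from lsb.users (None when not configured)
    ujob_limit -- UJOB_LIMIT from lsb.queues (None when not configured)
    """
    configured = [value for value in (max_jobs, ujob_limit) if value is not None]
    return min(configured) if configured else None


# Both configured: the smaller value applies.
print(effective_user_limit(max_jobs=3, ujob_limit=1))
# Only one configured: that value applies.
print(effective_user_limit(ujob_limit=3))
```

For example, with MAX_JOBS=3 and UJOB_LIMIT=1 the user is held to 1 running job in that queue, and raising only one of the two values has no effect until the other is raised as well.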
This blog is for reference and study only; parts of it draw on other authors' posts.