
LSF: Job Limits in OpenLava

1. Limiting the number of jobs per host

Set a policy for rd2 in the lsb.hosts file.

1) Before the change

82 [fhu@rd2 11:38:04 ~]$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
rd1                unavail         -      1      0      0      0      0      0
rd2                ok              -      2      0      0      0      0      0
rd3                ok              -      2      0      0      0      0      0

 

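2) Add a policy for rd2 to lsb.hosts: set MXJ (the maximum number of jobs on the host) to 0 and JL/U (the maximum number of jobs per user on the host) to 0

# lsb.hosts file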

Begin Host
HOST_NAME     MXJ JL/U   r1m    pg    ls     tmp  DISPATCH_WINDOW  # Keywords
#host0        1    1   3.5/4.5  15/   12/15  0      ()             # Example
#host1       ()   2     3.5  15/18   12/    0/  (5:19:00-1:8:30 20:00-8:30)
#host2        ()   ()   3.5/5   18    15     ()     ()             # Example
default       !   ()     ()    ()    ()     ()     ()              # Example
rd2           0   0     ()    ()    ()     ()     ()               # rd2 accepts no jobs
End Host
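
To check which limits are actually active in a Host section like the one above, the table can be read programmatically. The following is a minimal sketch, not an OpenLava tool: the function name `parse_host_section` is hypothetical, and it assumes that everything after `#` is a comment and that active rows contain no values with embedded spaces.

```python
HOST_SECTION = """\
Begin Host
HOST_NAME     MXJ JL/U   r1m    pg    ls     tmp  DISPATCH_WINDOW  # Keywords
#host0        1    1   3.5/4.5  15/   12/15  0      ()             # Example
default       !   ()     ()    ()    ()     ()     ()              # Example
rd2           0   0     ()    ()    ()     ()     ()               # Example
End Host
"""


def parse_host_section(text):
    """Return {host_name: {column: value}} for the rows of a Host section."""
    hosts = {}
    columns = None
    inside = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and commented-out rows
        if not line:
            continue
        if line == "Begin Host":
            inside, columns = True, None
        elif line == "End Host":
            inside = False
        elif inside and columns is None:
            columns = line.split()           # first active row holds the keywords
        elif inside:
            values = line.split()
            hosts[values[0]] = dict(zip(columns[1:], values[1:]))
    return hosts


hosts = parse_host_section(HOST_SECTION)
print(hosts["rd2"]["MXJ"], hosts["rd2"]["JL/U"])  # 0 0
```

Commented-out rows such as `#host0` are skipped, so only the `default` and `rd2` entries are returned.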

 

3) After the change, run badmin reconfig to apply the configuration; rd2 then stops accepting jobs

84 [fhu@rd2 11:40:15 ~]$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
rd1                unavail         -      1      0      0      0      0      0
rd2                closed          0      0      0      0      0      0      0
rd3                ok              -      2      0      0      0      0      0
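
The bhosts listing above can also be checked from a script. This is a rough sketch that works on captured text only; the helper names `parse_bhosts` and `hosts_accepting_jobs` are hypothetical, and it assumes whitespace-separated columns exactly as printed here.

```python
BHOSTS_OUTPUT = """\
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
rd1                unavail         -      1      0      0      0      0      0
rd2                closed          0      0      0      0      0      0      0
rd3                ok              -      2      0      0      0      0      0
"""


def parse_bhosts(text):
    """Turn bhosts output into a list of {column: value} dicts, one per host."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def hosts_accepting_jobs(rows):
    """Names of hosts whose STATUS is 'ok' (assumed here to mean open for dispatch)."""
    return [row["HOST_NAME"] for row in rows if row["STATUS"] == "ok"]


rows = parse_bhosts(BHOSTS_OUTPUT)
print(hosts_accepting_jobs(rows))  # ['rd3']
```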

 

4) As shown, every job in the RUN state is on rd3

101 [fhu@rd2 11:42:55 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
872     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
873     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
874     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
875     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
876     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42

 

5) Once the rd2 entry is commented out and badmin reconfig is run to apply the change, PEND jobs are dispatched to rd2 right away

102 [fhu@rd2 11:48:06 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
872     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
873     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
874     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
875     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
876     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
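
To confirm where jobs ended up without reading the table by eye, the bjobs listing above can be tallied per execution host. This is a sketch under stated assumptions: the helper names are hypothetical, job names contain no spaces, and PEND rows are assumed to be the only rows with an empty EXEC_HOST column.

```python
from collections import Counter

BJOBS_OUTPUT = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
872     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
873     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 11:42
874     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
875     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 11:42
876     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 11:42
"""


def parse_bjobs(text):
    """Parse bjobs output into dicts with the leading columns of each row."""
    jobs = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        job = dict(zip(["JOBID", "USER", "STAT", "QUEUE", "FROM_HOST"], fields))
        # Pending jobs have no execution host yet, so that column is blank.
        job["EXEC_HOST"] = None if job["STAT"] == "PEND" else fields[5]
        jobs.append(job)
    return jobs


def running_per_host(jobs):
    """Count RUN jobs per execution host."""
    return Counter(job["EXEC_HOST"] for job in jobs if job["STAT"] == "RUN")


jobs = parse_bjobs(BJOBS_OUTPUT)
print(dict(running_per_host(jobs)))  # {'rd3': 2, 'rd2': 2}
```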

 

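2. Limiting the number of jobs per queue

Set QJOB_LIMIT, the maximum number of jobs in the queue, for the normal queue in the lsb.queues file.

1) Before the change (bqueues and bjobs output)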

104 [fhu@rd2 13:40:37 ~]$ bqueues
QUEUE_NAME     PRIO      STATUS      MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
normal          30    Open:Active      -    -    -    -     0     0     0     0


106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
986     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
987     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
988     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
989     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
990     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:42

 

2) Set QJOB_LIMIT = 1

# lsb.queues file

Begin Queue
QUEUE_NAME   = normal
PRIORITY     = 30
NICE         = 20
#QJOB_LIMIT   = 60              # maximum number of jobs in this queue
QJOB_LIMIT   = 1               # job limit of the queue
#UJOB_LIMIT   = 5               # maximum number of jobs per user
#PJOB_LIMIT   = 2               # maximum number of jobs per processor
#RUN_WINDOW   = 5:19:00-1:8:30 20:00-8:30
#r1m         = 0.7/2.0        # loadSched/loadStop
#r15m         = 1.0/2.5
#pg           = 4.0/8
#ut           = 0.2
#io           = 50/240
#CPULIMIT     = 180/apple      # CPU usage limit for a job
#FILELIMIT    = 20000
#MEMLIMIT     = 5000           # jobs bigger than this (5M) will be niced
#DATALIMIT    = 20000          # jobs data segment limit
#STACKLIMIT   = 2048
#CORELIMIT    = 20000
#PROCLIMIT    = 5              # job processor limit
#USERS        = all            # users allowed to submit; accepts UserGroup and User names from lsb.users, separated by spaces
#HOSTS        = all            # hosts that jobs may be sent to; accepts HostGroup and Host names from lsb.hosts
#PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#ROUND_ROBIN_POLICY = y
#FAIRSHARE = USER_SHARES[[G1,1] [G2,1]]
#HOSTS_SHARES = [all, 5]
DESCRIPTION  = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue
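
Because most parameters in a Queue section are commented out, it is easy to misread which ones are in force. The sketch below lists only the active `KEY = value` pairs. It is an illustration, not part of OpenLava: `parse_queues` is a hypothetical helper, and it assumes `#` always starts a comment and a trailing backslash continues a value on the next line.

```python
QUEUE_SECTION = r"""Begin Queue
QUEUE_NAME   = normal
PRIORITY     = 30
NICE         = 20
#QJOB_LIMIT   = 60
QJOB_LIMIT   = 1               # job limit of the queue
#UJOB_LIMIT   = 5
DESCRIPTION  = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue
"""


def parse_queues(text):
    """Return one dict of active parameters per Begin Queue / End Queue block."""
    queues, current, carry = [], None, ""
    for raw in text.splitlines():
        line = carry + raw.split("#", 1)[0].strip()  # drop comments
        carry = ""
        if line.endswith("\\"):
            carry = line[:-1].rstrip() + " "         # join with the next line
            continue
        if not line:
            continue
        if line == "Begin Queue":
            current = {}
        elif line == "End Queue":
            queues.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return queues


normal = parse_queues(QUEUE_SECTION)[0]
print(normal["QJOB_LIMIT"])  # 1
```

Commented-out entries such as `#UJOB_LIMIT` do not appear in the result, which makes it clear that only QJOB_LIMIT limits this queue.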

 

3) After badmin reconfig takes effect, only 1 job in the queue can be running at a time

119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
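
The RUN and PEND counts seen so far can be reproduced with a little arithmetic. The function below is a toy model, not OpenLava's scheduler: `count_running` is a hypothetical name, and it assumes jobs simply fill the available host slots (the MAX column of bhosts) up to an optional queue-wide cap.

```python
def count_running(n_jobs, host_slots, queue_limit=None):
    """Return (running, pending) for n_jobs under a simplified slot model.

    host_slots maps host name to its job slot limit; unavailable hosts
    are left out. queue_limit stands in for QJOB_LIMIT (None = not set).
    """
    capacity = sum(host_slots.values())
    if queue_limit is not None:
        capacity = min(capacity, queue_limit)
    running = min(n_jobs, capacity)
    return running, n_jobs - running


# Default setup: rd2 and rd3 each take 2 jobs, rd1 is unavailable.
print(count_running(5, {"rd2": 2, "rd3": 2}))                 # (4, 1)
# rd2 limited to 0 jobs in lsb.hosts.
print(count_running(5, {"rd2": 0, "rd3": 2}))                 # (2, 3)
# QJOB_LIMIT = 1 on the queue, six jobs submitted.
print(count_running(6, {"rd2": 2, "rd3": 2}, queue_limit=1))  # (1, 5)
```

The three results match the bjobs listings for the unrestricted case, the rd2 host limit, and the QJOB_LIMIT = 1 case.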

 

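3. Limiting the number of jobs per user

I) Method 1: configuration for a specific user

Set MAX_JOBS in the User section of the lsb.users file to cap the number of jobs a given user can run across all hosts.

1) Before the change, user fhu can run 4 jobs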

106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
986     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
987     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
988     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
989     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
990     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:42

 

2) Set MAX_JOBS=1

# lsb.users file

Begin User
USER_NAME       MAX_JOBS        JL/P
#develop@        20              8
#support         50              -
fhu              1               -
End User

 

3) After badmin reconfig takes effect, user fhu has only 1 job running across all hosts

119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1101    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44

 

4) After setting MAX_JOBS=3 and applying it with badmin reconfig, user fhu has 3 jobs running across all hosts

120 [fhu@rd2 14:13:20 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1308    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
1309    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
1310    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 14:12
1311    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1312    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1313    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1314    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12

 

II) Method 2: configuration for all users

Set UJOB_LIMIT in lsb.queues to cap the number of jobs each user can run in the queue.

1) Before the change, user fhu can run 4 jobs

106 [fhu@rd2 13:42:52 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
986     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
987     fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:42
988     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
989     fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 13:42
990     fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:42

 

2) Set UJOB_LIMIT=1

# lsb.queues file

Begin Queue
QUEUE_NAME   = normal
PRIORITY     = 30
NICE         = 20
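#QJOB_LIMIT   = 60              # maximum number of jobs in this queue
#QJOB_LIMIT   = 1               # queue-wide limit left unset so that UJOB_LIMIT is the binding limit
UJOB_LIMIT   = 1               # maximum number of jobs per user
#PJOB_LIMIT   = 2               # maximum number of jobs per processor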
#RUN_WINDOW   = 5:19:00-1:8:30 20:00-8:30
#r1m         = 0.7/2.0        # loadSched/loadStop
#r15m         = 1.0/2.5
#pg           = 4.0/8
#ut           = 0.2
#io           = 50/240
#CPULIMIT     = 180/apple      # CPU usage limit for a job
#FILELIMIT    = 20000
#MEMLIMIT     = 5000           # jobs bigger than this (5M) will be niced
#DATALIMIT    = 20000          # jobs data segment limit
#STACKLIMIT   = 2048
#CORELIMIT    = 20000
#PROCLIMIT    = 5              # job processor limit
#USERS        = all            # users allowed to submit; accepts UserGroup and User names from lsb.users, separated by spaces
#HOSTS        = all            # hosts that jobs may be sent to; accepts HostGroup and Host names from lsb.hosts
#PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#ROUND_ROBIN_POLICY = y
#FAIRSHARE = USER_SHARES[[G1,1] [G2,1]]
#HOSTS_SHARES = [all, 5]
DESCRIPTION  = For normal low priority jobs, running only if hosts are \
lightly loaded.
End Queue

 

3) After badmin reconfig takes effect, user fhu has only 1 job running across all hosts

119 [fhu@rd2 13:44:39 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1095    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 13:44
1096    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1097    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1098    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1099    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1100    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44
1101    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 13:44

 

4) After setting UJOB_LIMIT=3 and applying it with badmin reconfig, user fhu has 3 jobs running across all hosts

120 [fhu@rd2 14:13:20 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1308    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
1309    fhu     RUN   normal     rd2         rd2         *_sleep.py Jun 22 14:12
1310    fhu     RUN   normal     rd2         rd3         *_sleep.py Jun 22 14:12
1311    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1312    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1313    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12
1314    fhu     PEND  normal     rd2                     *_sleep.py Jun 22 14:12

 

III) Note

  • When both MAX_JOBS in lsb.users and UJOB_LIMIT in lsb.queues are configured, the smaller of the two values takes effect
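
The rule in this note comes down to taking a minimum. A minimal sketch follows, where `effective_user_limit` is a hypothetical helper and `None` is this sketch's own convention for a limit that is not configured:

```python
def effective_user_limit(max_jobs=None, ujob_limit=None):
    """Return the per-user job limit that applies, or None if neither is set.

    max_jobs   -- MAX_JOBS for the user in lsb.users (None = not configured)
    ujob_limit -- UJOB_LIMIT of the queue in lsb.queues (None = not configured)
    """
    configured = [v for v in (max_jobs, ujob_limit) if v is not None]
    return min(configured) if configured else None


print(effective_user_limit(max_jobs=3, ujob_limit=1))  # 1
print(effective_user_limit(max_jobs=1, ujob_limit=3))  # 1
print(effective_user_limit(ujob_limit=3))              # 3
```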
posted on 2022-06-22 16:25  凉城旧巷