[Original] Kubernetes Jobs (Short-Lived Tasks) in Practice

I. What is a Job?

A Job handles short-lived, one-off batch tasks: the task runs exactly once, the container exits when it finishes, and the Job guarantees that one or more Pods of the batch terminate successfully.

II. Use Cases for Jobs

Jobs fit workloads that execute a one-off task and then stop rather than running continuously, for example AI model training, batch computation, and data analysis.

III. Jobs in Practice

1. Non-parallel Jobs (a single Pod)

A Job of this type creates only one Pod, and the Pod template's restartPolicy supports only Never and OnFailure, not Always. If Always is specified, as shown below, creation fails with an error straight away:

[root@k8s-master job]# cat job-one.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod
spec:
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Always
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f  job-one.yaml 
The Job "job-demo-onepod" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: "OnFailure", "Never"   

If you delete a running Pod while the Job has not finished, the Job automatically starts a new Pod to continue the task until it completes.

[root@k8s-master job]# cat job-one-testrestart.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-testrestart
spec:
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-restart
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "touch /tmp/healthy;sleep 1000000"                     # 让pod较长时间处于运行状态
[root@k8s-master job]# kubectl get pod
NAME                                    READY   STATUS             RESTARTS   AGE
job-demo-onepod-testrestart-rpvhh       1/1     Running            0          3m23s

Now delete the Pod: the Job removes the old Pod and spins up a new one in its place, as shown below.

[root@k8s-master job]# kubectl get pod
NAME                                    READY   STATUS             RESTARTS   AGE
job-demo-onepod-testrestart-5js9m       1/1     Running            0          6s
job-demo-onepod-testrestart-rpvhh       1/1     Terminating        0          4m36s

What about a Pod that has already finished running; will a new one be started after it is deleted? Let's experiment:

[root@k8s-master job]# kubectl get pod
NAME                                    READY   STATUS             RESTARTS   AGE
job-demo-onepod-kmm74                   0/1     Completed          0          22m   
job-demo-onepod-testrestart-5js9m       1/1     Running            0          107s
[root@k8s-master job]# kubectl delete pod job-demo-onepod-kmm74
pod "job-demo-onepod-kmm74" deleted
[root@k8s-master job]# kubectl get pod
NAME                                    READY   STATUS             RESTARTS   AGE
job-demo-onepod-testrestart-5js9m       1/1     Running            0          3m27s

The Pod job-demo-onepod-kmm74 had already finished its task and its status had become Completed; after it is deleted, it is gone for good and no new Pod is started. In short: as long as the Job has not completed, deleting its Pod keeps triggering new Pods that carry the Job on until it reaches the Completed state.
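
Completion can also be checked at the Job level instead of watching Pods. As a small sketch, kubectl can block until the Job's Complete condition becomes true:

kubectl wait --for=condition=complete job/job-demo-onepod-testrestart --timeout=120s

This returns as soon as the Job records the Complete condition, or exits with an error after the timeout.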

2. Jobs with a Fixed Completion Count

Set .spec.completions without setting .spec.parallelism: Pods are created one after another until .spec.completions of them have ended successfully.

[root@k8s-master job]# cat completions-pod-job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 5    # the Job must run 5 Pods to completion
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml 
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl  get pod | grep job-demo
job-demo-comppod-b2v2n                  0/1     Completed           0          11s
job-demo-comppod-ptfqp                  0/1     Completed           0          18s
job-demo-comppod-xrkm9                  0/1     ContainerCreating   0          3s
[root@k8s-master job]# kubectl  get pod | grep job-demo   
job-demo-comppod-b2v2n                  0/1     Completed          0          69s
job-demo-comppod-jxhqd                  0/1     Completed          0          54s
job-demo-comppod-p8q7t                  0/1     Completed          0          42s
job-demo-comppod-ptfqp                  0/1     Completed          0          76s
job-demo-comppod-xrkm9                  0/1     Completed          0          61s

Five Pods ran, and all of them completed and exited successfully. In some scenarios, however, you want to specify how many Pods run in parallel at a time; that calls for a parallel Job with a fixed completion count (section 4 below).

With .spec.completions set, will the Job still run normally if .spec.parallelism is set to 0? Let's verify:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 1
  parallelism: 0     # set to 0
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl get job 
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/1                      5m59s
[root@k8s-master job]# kubectl describe job job-demo-comppod 
Name:           job-demo-comppod
Namespace:      default
Selector:       controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
Labels:         controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
                job-name=job-demo-comppod
Annotations:    <none>
Parallelism:    0
Completions:    1
Pods Statuses:  0 Running / 0 Succeeded / 0 Failed               
Pod Template:
  Labels:  controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
           job-name=job-demo-comppod
  Containers:
   counter:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      for i in 9 8 7 6 5 4 3 2 1; do echo $i; done
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:           <none>

No Pod executed the task. Remove the parallelism field and verify again:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml 
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get pod |grep demo
job-demo-comppod-gm9xs                  0/1     Completed          0          39s

The Pod finished the task in 39s. So setting .spec.parallelism to 0 suspends the Job: no Pod is ever created to execute it.
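
Because .spec.parallelism is mutable on a live Job, a Job suspended this way does not need to be recreated: patching parallelism back to a positive value resumes it. A minimal sketch, reusing the Job name from the example above:

kubectl patch job job-demo-comppod -p '{"spec":{"parallelism":1}}'

Once the patch lands, the controller starts creating Pods again and the Job proceeds to completion.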

3. Parallel Jobs with a Work Queue

Set .spec.parallelism without setting .spec.completions. The Job is considered successful once all Pods have ended and at least one of them has succeeded.

[root@k8s-master job]# cat parallelism_pod.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-parall
spec:
  parallelism: 3   # run 3 Pods in parallel
  template:
    metadata:
      name: job-demo-para
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f parallelism_pod.yaml 
job.batch/job-demo-parall created
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn                   0/1     ContainerCreating   0          41s
job-demo-parall-b7x27                   0/1     Completed           0          41s
job-demo-parall-rqmtk                   0/1     Completed           0          41s
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn                   0/1     ContainerCreating   0          51s
job-demo-parall-b7x27                   0/1     Completed           0          51s
job-demo-parall-rqmtk                   0/1     Completed           0          51s
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn                   0/1     ContainerCreating   0          54s
job-demo-parall-b7x27                   0/1     Completed           0          54s
job-demo-parall-rqmtk                   0/1     Completed           0          54s

All three Pods were created at the same time (identical ages), confirming that the Pods run in parallel.
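
Note that the "work queue" in this pattern lives outside the Job: every Pod runs the same consumer and exits once the queue is drained. As a hedged sketch, assuming a Redis service reachable at the hostname redis with work items pushed onto a list named jobqueue (neither exists in the cluster used above), a consumer Job might look like:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-queue            # hypothetical name
spec:
  parallelism: 3                  # three consumers drain the queue together
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: consumer
        image: redis              # provides redis-cli
        command:
        - "/bin/sh"
        - "-c"
        # pop items until the queue is empty, then exit successfully
        - 'while item=$(redis-cli -h redis lpop jobqueue) && [ -n "$item" ]; do echo "processing $item"; done'

Each Pod exits 0 when lpop returns nothing more, so the Job succeeds once the queue is empty.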

Does the task still execute if .spec.completions is explicitly set to 0? Let's verify; delete job-demo-comppod first, then create:

[root@k8s-master job]# cat completions-pod-job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 0
  parallelism: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/0           0s         119s

119Spod未执行完任务,去掉.spec.completions再验证下,先删除job-demo-comppod,然后再创建

[root@k8s-master job]# cat completions-pod-job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  parallelism: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   1/1           19s        20s

The Pod finished the task in 20s. We can therefore conclude that setting .spec.completions to 0 likewise leaves the Job hanging, with no Pod ever executing the task.

4. Parallel Jobs with a Fixed Completion Count

Set both .spec.completions and .spec.parallelism, so that multiple Pods work through the queue at the same time. Delete the old Job, then create a new one that also fixes how many Pods run in parallel:

[root@k8s-master job]# kubectl delete job job-demo-comppod 
job.batch "job-demo-comppod" deleted
[root@k8s-master job]# cat completions-pod-job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 5     # this Job must run 5 Pods to completion
  parallelism: 2     # run 2 Pods in parallel at a time
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     ContainerCreating   0          2s
job-demo-comppod-8tfch                  0/1     ContainerCreating   0          2s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     ContainerCreating   0          5s
job-demo-comppod-8tfch                  0/1     ContainerCreating   0          5s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     ContainerCreating   0          7s
job-demo-comppod-8tfch                  0/1     ContainerCreating   0          7s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     ContainerCreating   0          10s
job-demo-comppod-8tfch                  0/1     Completed           0          10s
job-demo-comppod-9dwwh                  0/1     ContainerCreating   0          2s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          14s
job-demo-comppod-8tfch                  0/1     Completed           0          14s
job-demo-comppod-9dwwh                  0/1     ContainerCreating   0          6s
job-demo-comppod-w67g7                  0/1     ContainerCreating   0          0s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          17s
job-demo-comppod-8tfch                  0/1     Completed           0          17s
job-demo-comppod-9dwwh                  0/1     ContainerCreating   0          9s
job-demo-comppod-w67g7                  0/1     ContainerCreating   0          3s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          20s
job-demo-comppod-6xzj4                  0/1     ContainerCreating   0          2s
job-demo-comppod-8tfch                  0/1     Completed           0          20s
job-demo-comppod-9dwwh                  0/1     Completed           0          12s
job-demo-comppod-w67g7                  0/1     ContainerCreating   0          6s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          25s
job-demo-comppod-6xzj4                  0/1     ContainerCreating   0          7s
job-demo-comppod-8tfch                  0/1     Completed           0          25s
job-demo-comppod-9dwwh                  0/1     Completed           0          17s
job-demo-comppod-w67g7                  0/1     Completed           0          11s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          27s
job-demo-comppod-6xzj4                  0/1     ContainerCreating   0          9s
job-demo-comppod-8tfch                  0/1     Completed           0          27s
job-demo-comppod-9dwwh                  0/1     Completed           0          19s
job-demo-comppod-w67g7                  0/1     Completed           0          13s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed           0          29s
job-demo-comppod-6xzj4                  0/1     ContainerCreating   0          11s
job-demo-comppod-8tfch                  0/1     Completed           0          29s
job-demo-comppod-9dwwh                  0/1     Completed           0          21s
job-demo-comppod-w67g7                  0/1     Completed           0          15s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4                  0/1     Completed          0          30s
job-demo-comppod-6xzj4                  0/1     Completed          0          12s
job-demo-comppod-8tfch                  0/1     Completed          0          30s
job-demo-comppod-9dwwh                  0/1     Completed          0          22s
job-demo-comppod-w67g7                  0/1     Completed          0          16s

The run above shows that at any given moment the Job keeps at most 2 Pods executing; Pods already in the Completed state do not count toward that limit.
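
Instead of repeatedly running kubectl get pod by hand, the Pod rotation can be followed live using the job-name label that the controller stamps on every Pod (visible in the describe outputs earlier), for example:

kubectl get pods -l job-name=job-demo-comppod --watch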

Does the task still run normally when parallelism is greater than completions? Let's verify with completions kept at 5 and parallelism raised to 6 (both values appear in the describe output below); delete job-demo-comppod first, then create:

[root@k8s-master job]# kubectl delete job job-demo-comppod
job.batch "job-demo-comppod" deleted

[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/5           8h         8h
[root@k8s-master job]# kubectl describe job job-demo-comppod 
Name:           job-demo-comppod
Namespace:      default
Selector:       controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
Labels:         controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
                job-name=job-demo-comppod
Annotations:    <none>
Parallelism:    6
Completions:    5
Start Time:     Mon, 06 Jul 2020 20:29:07 +0800
Pods Statuses:  5 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
           job-name=job-demo-comppod
  Containers:
   counter:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      for i in 9 8 7 6 5 4 3 2 1; do echo $i; done
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-lqgdw
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-hw5ql
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-tn5nz
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-8qfv8
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-pprd7
[root@k8s-master job]# kubectl get pod |grep demo
job-demo-comppod-8qfv8                  0/1     ContainerCreating   0          8h
job-demo-comppod-hw5ql                  0/1     ContainerCreating   0          8h
job-demo-comppod-lqgdw                  0/1     ContainerCreating   0          8h
job-demo-comppod-pprd7                  0/1     ContainerCreating   0          8h
job-demo-comppod-tn5nz                  0/1     ContainerCreating   0 

After verification: the Pods are created successfully and counted as running (only 5 of them, not 6, since a Job never runs more Pods than the completions still outstanding), but in this test the task never ran to completion.

5. Setting a Timeout on a Job

To put a hard time limit on a Job, so that its Pods are terminated once the limit elapses even if they have not finished, specify activeDeadlineSeconds:

[root@k8s-master job]# cat job-one-testrestart.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-timesecondtest
spec:
  activeDeadlineSeconds: 10   # set a 10-second timeout
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-time
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "touch /tmp/healthy;sleep 1000000"
[root@k8s-master job]# kubectl get pod |grep one 
job-demo-onepod-timesecondtest-pwjrg    1/1     Terminating        0          26s

[root@k8s-master job]# kubectl get pod |grep one

After creation, because the container's sleep far exceeds 10 seconds, the Pod enters the Terminating state once the 10-second deadline passes and is then deleted; since the restart policy is Never, the container is not restarted. activeDeadlineSeconds suits any scenario that needs a hard timeout.
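
The Job also records why it stopped: once the deadline passes, the controller marks the Job Failed with reason DeadlineExceeded. A sketch of reading that back with JSONPath:

kubectl get job job-demo-onepod-timesecondtest -o jsonpath='{.status.conditions[0].type}: {.status.conditions[0].reason}'

which should print something like Failed: DeadlineExceeded.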

6. Limiting the Number of Retries After a Job Fails

[root@k8s-master job]# cat job-one-testrestart.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-timesecondtest
spec:
  backoffLimit: 3      # number of retries after the Job fails
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-time
        image: busybox
        command:
        - "bin/sh"
        - "*********"               # 将启动命名设置为异常
        - "touch /tmp/healthy;sleep 1000000"
[root@k8s-master job]# kubectl get pod |grep one
job-demo-onepod-timesecondtest-6sttm    0/1     Error              0          4m1s
job-demo-onepod-timesecondtest-hpnz6    0/1     Error              0          4m22s
job-demo-onepod-timesecondtest-jcfgn    0/1     Error              0          3m17s

Note that each retry creates a brand-new Pod rather than restarting an existing one (the restart policy here is Never), which is why three failed Pods are listed above.
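
If the intent is to restart the container in place instead, set restartPolicy: OnFailure: the kubelet then restarts the container inside the same Pod and the RESTARTS column climbs, rather than new Pods piling up. A sketch under that assumption (hypothetical Job name, same deliberately broken command as above):

apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onfailure        # hypothetical name
spec:
  backoffLimit: 3                 # retry budget before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure    # restart the container in the same Pod
      containers:
      - name: counter-time
        image: busybox
        command:
        - "/bin/sh"
        - "*********"             # deliberately invalid, as in the example above
        - "touch /tmp/healthy;sleep 1000000"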

IV. Summary

This article introduced the usage of Jobs in K8S. In a real production environment, continuous (non-one-off) workloads should be managed with a stateless workload (Deployment) or a stateful workload (StatefulSet); for tasks that only need to run once, a Job-type workload is the right choice.

About the author: a product manager in the cloud computing / container / Docker / K8s / Serverless space, learning a bit of technology to design better products.

posted @ 2020-07-04 15:45  lightinglei