[Original] Kubernetes Jobs (short-lived tasks) in practice
I. What is a Job?
A Job handles short-lived, one-off batch tasks: the container exits once its task is done, so the task runs to completion exactly once, and the Job guarantees that one or more Pods in the batch terminate successfully.
II. When to use a Job
A Job fits workloads that run once to completion rather than continuously, such as AI model training, batch computation, and data analysis.
III. Jobs in practice
1. Non-parallel Pods
A Job of this kind creates only one Pod, and its restartPolicy only supports Never and OnFailure, not Always. If you specify Always, as shown below, creation fails with an error:
[root@k8s-master job]# cat job-one.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod
spec:
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Always
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f job-one.yaml
The Job "job-demo-onepod" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: "OnFailure", "Never"
If you delete a running Pod while the Job has not yet finished, the Job automatically starts a new Pod to carry on until the task completes:
[root@k8s-master job]# cat job-one-testrestart.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-testrestart
spec:
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-restart
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "touch /tmp/healthy;sleep 1000000"  # keep the Pod running for a long time
[root@k8s-master job]# kubectl get pod
NAME                                READY   STATUS    RESTARTS   AGE
job-demo-onepod-testrestart-rpvhh   1/1     Running   0          3m23s
Delete the Pod now, and you can see the old Pod being removed from the Job while a new one is started in its place:
[root@k8s-master job]# kubectl get pod
NAME                                READY   STATUS        RESTARTS   AGE
job-demo-onepod-testrestart-5js9m   1/1     Running       0          6s
job-demo-onepod-testrestart-rpvhh   1/1     Terminating   0          4m36s
Does a Pod that has already finished get restarted after deletion? Let's try:
[root@k8s-master job]# kubectl get pod
NAME READY STATUS RESTARTS AGE
job-demo-onepod-kmm74 0/1 Completed 0 22m
job-demo-onepod-testrestart-5js9m 1/1 Running 0 107s
[root@k8s-master job]# kubectl delete pod job-demo-onepod-kmm74
pod "job-demo-onepod-kmm74" deleted
[root@k8s-master job]# kubectl get pod
NAME READY STATUS RESTARTS AGE
job-demo-onepod-testrestart-5js9m 1/1 Running 0 3m27s
The Pod job-demo-onepod-kmm74 had finished its task and reached the Completed status; deleting it removes it for good and no replacement Pod is started. In short: while a Job is unfinished, deleting its Pods keeps spawning new ones until the Job reaches Completed; once a Pod has completed, deleting it has no further effect.
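The deletion behavior above can be modeled with a small simulation. This is a simplified sketch of the controller's decision, not actual Kubernetes code; the function and return values are hypothetical:

```python
# Simplified model of how the Job controller reacts to Pod deletion:
# a Pod deleted before it succeeds is replaced; a Completed Pod is not.

def reconcile(succeeded: bool, pod_deleted: bool) -> str:
    """Decide the controller's next action for a single-Pod Job."""
    if succeeded:
        # A success is already recorded; deleting the Pod changes nothing.
        return "job-complete"
    if pod_deleted:
        # No success recorded yet, so a replacement Pod is created.
        return "create-new-pod"
    return "wait"

print(reconcile(succeeded=False, pod_deleted=True))   # a running Pod was deleted
print(reconcile(succeeded=True, pod_deleted=True))    # a Completed Pod was deleted
```

The key point is that the Job tracks recorded successes, not live Pods, which is why deleting a Completed Pod does not trigger a restart.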
2. Jobs with a fixed completion count
Set .spec.completions without setting .spec.parallelism: Pods are created until .spec.completions of them have finished successfully.
[root@k8s-master job]# cat completions-pod-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 5  # run 5 Pods to completion
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get pod | grep job-demo
job-demo-comppod-b2v2n   0/1   Completed           0   11s
job-demo-comppod-ptfqp   0/1   Completed           0   18s
job-demo-comppod-xrkm9   0/1   ContainerCreating   0   3s
[root@k8s-master job]# kubectl get pod | grep job-demo
job-demo-comppod-b2v2n   0/1   Completed   0   69s
job-demo-comppod-jxhqd   0/1   Completed   0   54s
job-demo-comppod-p8q7t   0/1   Completed   0   42s
job-demo-comppod-ptfqp   0/1   Completed   0   76s
job-demo-comppod-xrkm9   0/1   Completed   0   61s
Five Pods ran and all exited successfully. In some scenarios, though, you want to control how many Pods run in parallel at a time; that calls for a parallel Job with a fixed completion count.
First, if .spec.completions is set, can the Job still run with .spec.parallelism set to 0? Let's verify:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 1
  parallelism: 0  # set to 0
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/1                      5m59s
[root@k8s-master job]# kubectl describe job job-demo-comppod
Name:           job-demo-comppod
Namespace:      default
Selector:       controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
Labels:         controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
                job-name=job-demo-comppod
Annotations:    <none>
Parallelism:    0
Completions:    1
Pods Statuses:  0 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=a0b677be-6ea1-4e09-8b1c-d45048cb9f57
           job-name=job-demo-comppod
  Containers:
   counter:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      bin/sh
      -c
      for i in 9 8 7 6 5 4 3 2 1; do echo $i; done
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:           <none>
No Pod ran the task. Remove parallelism and verify again:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get pod |grep demo
job-demo-comppod-gm9xs   0/1   Completed   0   39s
This time the Pod finished the task within 39s. So setting parallelism to 0 leaves the Job suspended.
3. Parallel Jobs with a work queue
Set .spec.parallelism without setting .spec.completions: the Job is considered successful once all Pods have finished and at least one of them succeeded.
[root@k8s-master job]# cat parallelism_pod.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-parall
spec:
  parallelism: 3  # run 3 Pods in parallel
  template:
    metadata:
      name: job-demo-para
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f parallelism_pod.yaml
job.batch/job-demo-parall created
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn   0/1   ContainerCreating   0   41s
job-demo-parall-b7x27   0/1   Completed           0   41s
job-demo-parall-rqmtk   0/1   Completed           0   41s
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn   0/1   ContainerCreating   0   51s
job-demo-parall-b7x27   0/1   Completed           0   51s
job-demo-parall-rqmtk   0/1   Completed           0   51s
[root@k8s-master job]# kubectl get pod| grep para
job-demo-parall-76jtn   0/1   ContainerCreating   0   54s
job-demo-parall-b7x27   0/1   Completed           0   54s
job-demo-parall-rqmtk   0/1   Completed           0   54s
All three Pods were created at the same moment (identical AGE values), confirming that they run in parallel.
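The "work queue" idea behind this Job type can be sketched in Python. This is an illustrative analogue with an in-process queue and threads standing in for Pods, not anything Kubernetes itself runs; with a real Job the Pods would typically consume from an external queue such as Redis or RabbitMQ:

```python
import queue
import threading

# Fill a shared queue with work items; each "Pod" drains it until empty.
work = queue.Queue()
for item in range(9, 0, -1):
    work.put(item)

results = []
lock = threading.Lock()

def worker(name: str) -> None:
    """Analogue of one parallel Pod: pull items until the queue is empty, then exit."""
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            return  # queue drained -> this "Pod" completes
        with lock:
            results.append((name, item))

# parallelism: 3 -> three workers run concurrently
pods = [threading.Thread(target=worker, args=(f"pod-{i}",)) for i in range(3)]
for p in pods:
    p.start()
for p in pods:
    p.join()

print(len(results))  # all 9 items processed exactly once
```

The Job succeeds once every worker has exited and the queue is empty, which mirrors the "all Pods finished, at least one succeeded" rule above.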
Next, if .spec.completions is set to 0, can the task still complete? Delete job-demo-comppod first, then create:
[root@k8s-master job]# cat completions-pod-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 0
  parallelism: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/0           0s         119s
119Spod未执行完任务,去掉.spec.completions再验证下,先删除job-demo-comppod,然后再创建
[root@k8s-master job]# cat completions-pod-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  parallelism: 1
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
[root@k8s-master job]# kubectl create -f completions-pod-job.yaml
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   1/1           19s        20s
This time the Pod finished the task within 20s. So setting .spec.completions to 0 also leaves the Job suspended, with no Pods running the task.
4. Parallel Jobs with a fixed completion count
Set both .spec.completions and .spec.parallelism, and multiple Pods process the work queue concurrently. Delete the previous Job, then create a new one that specifies how many Pods run in parallel at a time:
[root@k8s-master job]# kubectl delete job job-demo-comppod
job.batch "job-demo-comppod" deleted
[root@k8s-master job]# cat completions-pod-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-comppod
spec:
  completions: 5  # this Job needs 5 Pods to finish successfully
  parallelism: 2  # run 2 Pods in parallel at a time
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "for i in 9 8 7 6 5 4 3 2 1; do echo $i; done"
job.batch/job-demo-comppod created
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   ContainerCreating   0   2s
job-demo-comppod-8tfch   0/1   ContainerCreating   0   2s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   ContainerCreating   0   5s
job-demo-comppod-8tfch   0/1   ContainerCreating   0   5s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   ContainerCreating   0   7s
job-demo-comppod-8tfch   0/1   ContainerCreating   0   7s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   ContainerCreating   0   10s
job-demo-comppod-8tfch   0/1   Completed           0   10s
job-demo-comppod-9dwwh   0/1   ContainerCreating   0   2s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   14s
job-demo-comppod-8tfch   0/1   Completed           0   14s
job-demo-comppod-9dwwh   0/1   ContainerCreating   0   6s
job-demo-comppod-w67g7   0/1   ContainerCreating   0   0s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   17s
job-demo-comppod-8tfch   0/1   Completed           0   17s
job-demo-comppod-9dwwh   0/1   ContainerCreating   0   9s
job-demo-comppod-w67g7   0/1   ContainerCreating   0   3s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   20s
job-demo-comppod-6xzj4   0/1   ContainerCreating   0   2s
job-demo-comppod-8tfch   0/1   Completed           0   20s
job-demo-comppod-9dwwh   0/1   Completed           0   12s
job-demo-comppod-w67g7   0/1   ContainerCreating   0   6s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   25s
job-demo-comppod-6xzj4   0/1   ContainerCreating   0   7s
job-demo-comppod-8tfch   0/1   Completed           0   25s
job-demo-comppod-9dwwh   0/1   Completed           0   17s
job-demo-comppod-w67g7   0/1   Completed           0   11s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   27s
job-demo-comppod-6xzj4   0/1   ContainerCreating   0   9s
job-demo-comppod-8tfch   0/1   Completed           0   27s
job-demo-comppod-9dwwh   0/1   Completed           0   19s
job-demo-comppod-w67g7   0/1   Completed           0   13s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   29s
job-demo-comppod-6xzj4   0/1   ContainerCreating   0   11s
job-demo-comppod-8tfch   0/1   Completed           0   29s
job-demo-comppod-9dwwh   0/1   Completed           0   21s
job-demo-comppod-w67g7   0/1   Completed           0   15s
[root@k8s-master job]# kubectl get pod |grep job-demo
job-demo-comppod-5svj4   0/1   Completed           0   30s
job-demo-comppod-6xzj4   0/1   Completed           0   12s
job-demo-comppod-8tfch   0/1   Completed           0   30s
job-demo-comppod-9dwwh   0/1   Completed           0   22s
job-demo-comppod-w67g7   0/1   Completed           0   16s
Throughout the run, the Job keeps at most 2 Pods executing at any moment (Pods already in the Completed state do not count toward that limit).
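The "at most 2 running" behavior follows from how the controller sizes the active Pod set. A minimal sketch of that sizing rule (a simplified model, not the real controller code, which also accounts for failed and terminating Pods):

```python
def desired_active(parallelism: int, completions: int, succeeded: int) -> int:
    """How many Pods the Job controller wants running right now:
    never more than parallelism, and never more than the work still remaining."""
    remaining = completions - succeeded
    return max(0, min(parallelism, remaining))

# completions: 5, parallelism: 2 -- walk through the successes seen above
for done in range(6):
    print(done, desired_active(parallelism=2, completions=5, succeeded=done))
# 2 active Pods until only one completion remains, then 1, then 0
```

The same rule explains the earlier experiment: with parallelism set to 0 the desired Pod count is always 0, so the Job hangs.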
Can the Job complete normally when parallelism is greater than completions? Delete job-demo-comppod first, then create it again to verify:
[root@k8s-master job]# kubectl delete job job-demo-comppod
job.batch "job-demo-comppod" deleted
[root@k8s-master job]# kubectl get job
NAME               COMPLETIONS   DURATION   AGE
job-demo-comppod   0/5           8h         8h
[root@k8s-master job]# kubectl describe job job-demo-comppod
Name:           job-demo-comppod
Namespace:      default
Selector:       controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
Labels:         controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
                job-name=job-demo-comppod
Annotations:    <none>
Parallelism:    6
Completions:    5
Start Time:     Mon, 06 Jul 2020 20:29:07 +0800
Pods Statuses:  5 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=dfea8576-9f3e-4b00-bd72-1c884f9e420c
           job-name=job-demo-comppod
  Containers:
   counter:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      bin/sh
      -c
      for i in 9 8 7 6 5 4 3 2 1; do echo $i; done
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-lqgdw
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-hw5ql
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-tn5nz
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-8qfv8
  Normal  SuccessfulCreate  8h    job-controller  Created pod: job-demo-comppod-pprd7
[root@k8s-master job]# kubectl get pod |grep demo
job-demo-comppod-8qfv8   0/1   ContainerCreating   0   8h
job-demo-comppod-hw5ql   0/1   ContainerCreating   0   8h
job-demo-comppod-lqgdw   0/1   ContainerCreating   0   8h
job-demo-comppod-pprd7   0/1   ContainerCreating   0   8h
job-demo-comppod-tn5nz   0/1   ContainerCreating   0
In this test the Pods were created successfully, but the Job never ran to completion.
5. Setting a timeout for a Job
To give a Job a timeout so that its Pods are terminated once the deadline passes, even if they have not finished, set activeDeadlineSeconds:
[root@k8s-master job]# cat job-one-testrestart.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-timesecondtest
spec:
  activeDeadlineSeconds: 10  # set the timeout to 10s
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-time
        image: busybox
        command:
        - "bin/sh"
        - "-c"
        - "touch /tmp/healthy;sleep 1000000"
[root@k8s-master job]# kubectl get pod |grep one
job-demo-onepod-timesecondtest-pwjrg   1/1   Terminating   0   26s
[root@k8s-master job]# kubectl get pod |grep one
Because the container sleeps far longer than the 10-second deadline, the Pod enters the Terminating state 10 seconds after creation and is then deleted; with restartPolicy set to Never it is not restarted. activeDeadlineSeconds suits any scenario that needs an upper bound on Job runtime.
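The deadline rule itself is simple. A sketch of the check (simplified; the real controller compares the Job's start time against activeDeadlineSeconds on each sync and marks the Job failed with reason DeadlineExceeded):

```python
def past_deadline(start_time: float, now: float, active_deadline_seconds: int) -> bool:
    """True once the Job has been active longer than its deadline; at that point
    all of its running Pods are terminated regardless of restartPolicy."""
    return (now - start_time) >= active_deadline_seconds

print(past_deadline(start_time=0.0, now=5.0, active_deadline_seconds=10))   # False: still within the deadline
print(past_deadline(start_time=0.0, now=12.0, active_deadline_seconds=10))  # True: Pods get terminated
```

Note the clock starts when the Job becomes active, not when each individual Pod starts, so retries also count against the same deadline.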
6. Limiting container retries after a Job fails
[root@k8s-master job]# cat job-one-testrestart.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-demo-onepod-timesecondtest
spec:
  backoffLimit: 3  # number of retries after the Job fails
  template:
    metadata:
      name: job-demo
    spec:
      restartPolicy: Never
      containers:
      - name: counter-time
        image: busybox
        command:
        - "bin/sh"
        - "*********"  # deliberately broken startup command
        - "touch /tmp/healthy;sleep 1000000"
[root@k8s-master job]# kubectl get pod |grep one
job-demo-onepod-timesecondtest-6sttm   0/1   Error   0   4m1s
job-demo-onepod-timesecondtest-hpnz6   0/1   Error   0   4m22s
job-demo-onepod-timesecondtest-jcfgn   0/1   Error   0   3m17s
Note that each retry creates a brand-new Pod rather than restarting the existing one, which is why three failed Pods appear above.
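Between those retries Kubernetes waits with an exponential back-off: 10s, 20s, 40s and so on, capped at six minutes. A small sketch of that schedule (illustrative; the helper name is made up and the real controller also resets the back-off counter under certain conditions):

```python
def backoff_delays(backoff_limit: int, base: int = 10, cap: int = 360) -> list:
    """Delay in seconds before each retry: doubles every time, capped at 6 minutes."""
    return [min(base * 2 ** i, cap) for i in range(backoff_limit)]

print(backoff_delays(3))   # delays before the retries of a backoffLimit: 3 Job
print(backoff_delays(8))   # later retries hit the 360s cap
```

This is why the failed Pods above show AGE values roughly a minute apart rather than being created back to back.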
IV. Summary
This article walked through Job usage in K8s. In a real production environment, continuous (non-one-off) workloads belong in a stateless workload (Deployment) or stateful workload (StatefulSet); for tasks that only need to run once to completion, a Job workload is the right choice.
About the author: a product manager working on cloud computing, containers, Docker, K8s, and Serverless, learning a bit of technology to design better products.