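Common Pod states, failure causes, and using probes

Pod concepts

A Pod represents a unit of deployment: a single instance of an application in Kubernetes.

Pods that run a single container: think of the Pod as a wrapper around one container. Kubernetes does not manage containers directly; it manages Pods.

Pods that run multiple containers: a Pod can encapsulate several containers that together form a single application. The containers in a Pod are scheduled as a unit onto the same physical or virtual machine in the cluster. They share resources and dependencies and can communicate with each other.

A Pod wraps these containers, their network resources, and their storage resources into a single manageable entity. Each Pod is assigned a unique IP address, and the containers in the Pod share one network namespace (netns), including the IP address and the port space. Containers in the same Pod talk to each other over localhost. When a container talks to other Pods, it does so through these shared network resources. A Pod can also define multiple volumes, and every container in the Pod can mount them.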
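As a rough illustration of the shared network and storage described above, here is a minimal sketch of a two-container Pod. The Pod name, container names, and volume name are hypothetical and chosen only for this example.

```yaml
# Sketch only: names (shared-demo, web, writer, shared-data) are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo
spec:
  volumes:
  - name: shared-data          # one volume, mounted by both containers
    emptyDir: {}
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html
  - name: writer
    image: busybox:1.28
    # Writes a page into the shared volume, then idles so the container stays up.
    command: ['sh', '-c', 'echo hello from the writer > /data/index.html && sleep 3600']
    volumeMounts:
    - name: shared-data
      mountPath: /data
```

Because both containers live in the same network namespace, a request to http://localhost:80 from inside the writer container should reach nginx in the web container.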

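The Pod is the most important concept in Kubernetes. Every Pod has a special Pause container, known as the "root container". The image for the Pause container is part of the Kubernetes platform. Besides the Pause container, each Pod contains one or more closely related application containers. The application containers join the namespaces created by the Pause container, so they all live in the same namespaces and can share the network.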

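Pod status

Unschedulable: the Pod cannot be scheduled because kube-scheduler found no suitable node.

PodScheduled: the scheduling condition of the Pod. When kube-scheduler starts working on the Pod it has not yet been assigned to a node; once a suitable node has been selected, the data in etcd is updated and the Pod is bound to that node.

Failed: all containers in the Pod have terminated and at least one of them terminated in failure, so the Pod is not working normally.

Unknown: the current state of the Pod cannot be obtained for some reason, usually a communication error with the node the Pod runs on.

Initialized: all init containers in the Pod have completed.

ContainerCreating: the container is being created.

Running: the containers in the Pod have been created and started.

Ready: the containers in the Pod are able to serve requests.

Completed: the containers ran to completion.

Error: an error occurred while the Pod was starting.

NodeLost: the node the Pod runs on is unreachable.

Waiting: the Pod is waiting to start.

Terminating: the Pod is being destroyed.

CrashLoopBackOff: a container in the Pod keeps exiting after it starts, and kubelet restarts it with an increasing back-off delay.

ErrImagePull: pulling the image failed, for example because of a timeout or because the download was forcibly terminated.

ImagePullBackOff: the node failed to pull the image, and kubelet is backing off before it retries.

Pending: the Pod is being created, but not all of its containers have been created yet. For a Pod stuck in this state, check things such as whether the storage it depends on can be mounted with the right permissions.

InvalidImageName: the node cannot parse the image name, so the image cannot be pulled.

ImageInspectError: the image cannot be inspected, usually because it is incomplete.

ErrImageNeverPull: the pull policy forbids pulling the image (imagePullPolicy is Never) and the image is not present on the node.

RegistryUnavailable: the image registry is unavailable, for example because of a network problem or because Harbor is down.

CreateContainerConfigError: the container configuration used by kubelet cannot be created, for example because a referenced ConfigMap or Secret does not exist.

CreateContainerError: creating the container failed.

RunContainerError: the container failed to run, for example because no PID 1 process was started in the container.

ContainersNotInitialized: the init containers of the Pod have not finished.

ContainersNotReady: the containers of the Pod are not ready yet.

ContainerCreating: the containers of the Pod are being created.

PodInitializing: the Pod is being initialized.

DockerDaemonNotReady: the Docker service on the node is not running.

NetworkPluginNotReady: the network plugin is not running.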

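imagePullPolicy: the image pull policy.

Always: always pull the image from the registry. This is the default when the image tag is latest or no tag is given.

Never: only use a local image and never pull.

IfNotPresent: prefer the local image and pull from the registry only when it is missing. This is the default for any tag other than latest.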
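A minimal sketch of where the field sits in a container spec. The Pod name and the pinned image tag are assumptions made for this example only.

```yaml
# Sketch only: pull-policy-demo and the nginx:1.25 tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pull-policy-demo
spec:
  containers:
  - name: web
    image: nginx:1.25
    # One of Always, IfNotPresent, Never.
    # With Never, the image must already exist on the node,
    # otherwise the container fails with ErrImageNeverPull.
    imagePullPolicy: IfNotPresent
```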

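restartPolicy (container restart policy):

When a container exits, or is killed after a configured probe fails, kubelet decides what to do next based on the Pod's restartPolicy:

Always: when a container terminates, Kubernetes restarts it automatically. This is the default, and it is the policy used by Pods managed by a ReplicationController, ReplicaSet, or Deployment.

OnFailure: when a container fails (it stops with a non-zero exit code), Kubernetes restarts it automatically.

Never: the container is never restarted, regardless of how it exited. Typically used with a Job or CronJob.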
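To make the policies above concrete, here is a small sketch of a Job whose Pod template uses restartPolicy: Never. The Job name, container name, and command are hypothetical.

```yaml
# Sketch only: restart-policy-demo and the command are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: restart-policy-demo
spec:
  backoffLimit: 2              # the Job controller retries with new Pods at most twice
  template:
    spec:
      restartPolicy: Never     # kubelet never restarts the container in place
      containers:
      - name: task
        image: busybox:1.28
        command: ['sh', '-c', 'echo running once && exit 0']
```

With Never, a failed container is not restarted inside the same Pod; any retry comes from the Job controller creating a new Pod. Switching to OnFailure would instead let kubelet restart the container in the existing Pod.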

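Init containers

Understanding init containers

A Pod can contain multiple containers in which the application runs, and it can also have one or more init containers that start before the application containers.

Init containers are very similar to regular containers, except for two points:

They always run to completion.
Each one must complete successfully before the next one starts.

If an init container of a Pod fails, kubelet restarts it repeatedly until it succeeds. However, if the Pod's restartPolicy is "Never" and an init container fails, Kubernetes marks the whole Pod as failed.

To give a Pod init containers, add the initContainers field to the Pod spec. The field is an array of Container objects and sits next to the containers array at the same level. See the Container section of the API reference for details.

The status of the init containers is returned in status.initContainerStatuses as an array of container statuses (similar to status.containerStatuses).

Differences from regular containers

Init containers support all the fields and features of application containers, including resource limits, volumes, and security settings. However, resource requests and limits for init containers are handled slightly differently.

Init containers also do not support lifecycle, livenessProbe, readinessProbe, or startupProbe, because they must run to completion before the Pod can become ready.

If a Pod specifies multiple init containers, they run one at a time, in order. Each init container must succeed before the next one runs. Once all init containers have completed, Kubernetes initializes the application containers and runs them as usual.

What init containers are for

1. Preparing the runtime environment of the application container in advance, for example generating the configuration files it needs and putting them in place, or checking data permissions, data integrity, software versions, and other prerequisites.

2. Preparing the data the application needs before it starts, for example downloading it from OSS or copying it from another location.

3. Checking that the services it depends on are reachable.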
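The first two uses listed above can be sketched with an init container that writes a file into a shared emptyDir volume before the application container starts. All names here are hypothetical, and the echo command stands in for a real download or generation step.

```yaml
# Sketch only: init-prepare-demo, prepare-content and web-root are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: init-prepare-demo
spec:
  volumes:
  - name: web-root
    emptyDir: {}
  initContainers:
  - name: prepare-content
    image: busybox:1.28
    # Stand-in for "generate config" or "download data": write one file, then exit 0.
    command: ['sh', '-c', 'echo "generated by init container" > /work/index.html']
    volumeMounts:
    - name: web-root
      mountPath: /work
  containers:
  - name: web
    image: nginx
    volumeMounts:
    - name: web-root           # nginx serves the file the init container prepared
      mountPath: /usr/share/nginx/html
```

The volume is the hand-off point: the filesystems of the two containers are otherwise separate, so anything the init container prepares has to land in a volume the application container also mounts.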

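Characteristics of init containers:

1. A Pod can have multiple application containers and also multiple init containers, but each init container and application container has its own isolated runtime environment.

2. Init containers start before the application containers.

3. The application containers run only after the init containers have succeeded.

4. If a Pod has multiple init containers, they run one by one from top to bottom, and all of them must succeed before the application containers run.

5. Init containers do not support probes (they exit once initialization is done and never run again).

When to use init containers

For example:

The following defines a simple Pod with two init containers. The first waits for myservice to come up and the second waits for mydb. Once both init containers have completed, the Pod starts the application container from its spec section.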

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]

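The pause container

The Pause container, also called the Infra container, is the base container of a Pod. Its image is only a few hundred KB, it is configured in kubelet, and its main job is to provide network communication among the containers in a Pod.

When the Infra container is created it initializes a network namespace; the other containers then join it and share its network. So for two containers A and B in one Pod, the following holds:

1. Containers A and B can communicate directly over localhost.

2. Containers A and B see the same network interfaces, IP addresses, and listening ports.

3. The Pod has exactly one IP address, the one belonging to the Pod's network namespace (initialized and created by the Infra container).

4. Every Pod in a Kubernetes environment has its own IP address (provided enough addresses are available), and this IP is shared internally by all containers in the Pod.

5. When the Pod is deleted, the Infra container is deleted along with it and its IP is reclaimed.

Namespaces shared through the Pause container:

1. NET namespace: the containers in a Pod share one network namespace, meaning the same IP and port space.

2. IPC namespace: the containers in a Pod can communicate using System V IPC or POSIX message queues.

3. UTS namespace: the containers in a Pod share one hostname. The MNT, PID, and User namespaces are not shared by default.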

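Probes

A probe is a periodic diagnostic that kubelet performs on a container to make sure the Pod stays in a running state. To perform the diagnostic, kubelet calls a handler implemented by the container, also called a hook.

There are three types of handler:

ExecAction:

Runs the specified command inside the container. The diagnostic succeeds if the command exits with status code 0.

TCPSocketAction:

Performs a TCP check against the container's IP address on the specified port. The diagnostic succeeds if the port is open.

HTTPGetAction:

Performs an HTTP GET request against the container's IP address on the specified port and path. The diagnostic succeeds if the response status code is greater than or equal to 200 and less than 400.

Each probe produces one of three results:

Success: the container passed the diagnostic.

Failure: the container failed the diagnostic.

Unknown: the diagnostic itself failed, so no action is taken.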
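The three handler types map onto the exec, tcpSocket, and httpGet fields of a probe. The sketch below puts one of each on a single container purely to show the syntax; the Pod name and the choice of which probe uses which handler are arbitrary assumptions.

```yaml
# Sketch only: handler-demo and the handler-to-probe pairing are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: handler-demo
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    startupProbe:
      tcpSocket:               # TCPSocketAction: succeeds if port 80 accepts a connection
        port: 80
    readinessProbe:
      httpGet:                 # HTTPGetAction: succeeds on a status code in [200, 400)
        path: /index.html
        port: 80
    livenessProbe:
      exec:                    # ExecAction: succeeds if the command exits with 0
        command: ['sh', '-c', 'test -f /usr/share/nginx/html/index.html']
```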

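Probe types

1. startupProbe: startup probe, introduced in Kubernetes v1.16

Determines whether the application inside the container has finished starting. If a startup probe is configured, all other probes are disabled until the startupProbe succeeds. If the startupProbe fails, kubelet kills the container and the container is handled according to its restart policy. If a container does not define a startup probe, the default state is Success.

2. livenessProbe: liveness probe

Checks whether the container is running. If the liveness probe fails, kubelet kills the container and the container is subject to its restart policy. If a container does not define a liveness probe, the default state is Success. The livenessProbe controls whether the container is restarted.

3. readinessProbe: readiness probe

If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of every Service that matches the Pod. The readiness state before the initial delay defaults to Failure. If a container does not define a readiness probe, the default state is Success. The readinessProbe controls whether the Pod is added to a Service.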

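Common probe configuration parameters

Probes have a number of configuration fields that give precise control over the behavior of liveness and readiness checks:

Official documentation: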

https://kubernetes.io/zh-cn/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

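initialDelaySeconds: 120

Initial delay: the number of seconds kubelet waits before performing the first probe. Defaults to 0; the minimum is 0.

periodSeconds: 60

Probe interval: how often, in seconds, kubelet performs the probe. Defaults to 10; the minimum is 1.

timeoutSeconds: 5

Timeout of a single probe: the number of seconds after which the probe times out. Defaults to 1; the minimum is 1.

successThreshold: 1

Number of retries to go from failure to success: the minimum number of consecutive successes for the probe to be considered successful after having failed. Defaults to 1 and must be 1 for liveness and startup probes; the minimum is 1.

failureThreshold: 3

Number of retries to go from success to failure: when a probe fails after the Pod has started, Kubernetes tries failureThreshold times before giving up. For a liveness probe, giving up means restarting the container; for a readiness probe, giving up means the Pod is marked as not ready. Defaults to 3; the minimum is 1.

The time it takes for a probe to declare failure is approximately:

initialDelaySeconds + failureThreshold * periodSeconds = Total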
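Plugging the sample values listed above into the formula gives a worked example. The fragment below is a sketch of a probe block that would sit under a container entry; the httpGet target is an assumption added only to make the fragment complete.

```yaml
# Sketch only: the httpGet path and port are assumptions; the timing values
# are the sample values from the parameter list above.
livenessProbe:
  httpGet:
    path: /index.html
    port: 80
  initialDelaySeconds: 120     # no probing during the first 120s
  periodSeconds: 60            # then one probe every 60s
  timeoutSeconds: 5            # each probe may take up to 5s
  successThreshold: 1
  failureThreshold: 3          # three consecutive failures trigger a restart
# Approximate worst-case detection time from the formula:
#   120 + 3 * 60 = 300 seconds
```

The formula is an estimate: the exact moment depends on when the first probe fires and on how long each probe takes to time out.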

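HTTP probe configuration parameters:

An HTTP probe accepts additional fields under httpGet:

host:

The host name to connect to. Defaults to the Pod's IP. You can also set "Host" in the HTTP headers instead.

scheme: HTTP

The scheme used to connect to the host (HTTP or HTTPS). Defaults to HTTP.

path: /monitor/index.html

The path to request on the HTTP server.

httpHeaders:

Custom HTTP headers to set in the request. HTTP allows repeated headers.

port: 80

The number or name of the container port to access. A number must be between 1 and 65535.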
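Put together, the httpGet fields above might look like the following fragment, which would sit under a container entry. The Host header value is a placeholder chosen for this sketch.

```yaml
# Sketch only: www.example.com is a placeholder host name.
readinessProbe:
  httpGet:
    scheme: HTTP
    path: /monitor/index.html
    port: 80
    httpHeaders:               # sets the Host header instead of using the host field
    - name: Host
      value: www.example.com
```

Leaving host unset keeps the probe pointed at the Pod's own IP, while the Host header lets a name-based virtual host on the server pick the right site.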

Examples:

1. Checking a Pod with an httpGet livenessProbe

The detection time is:

initialDelaySeconds + failureThreshold * periodSeconds = Total

5 + 3 * 10 = 35s

root@deploy:/health-check# vim http-livenessprobe.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        livenessProbe:
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

View the Pods

root@deploy:/health-check# kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
web-deployment-6cb7795f8c-bqmdh   1/1     Running   0          9m39s
web-deployment-6cb7795f8c-w8qlb   1/1     Running   0          9m39s

Change the path in the livenessProbe to a wrong URL path

        livenessProbe:
          httpGet:
            path: /test/index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

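After about 30 seconds the container is restarted for the first time.

After the third restart the diagnostic is judged to have failed, and the Pod status becomes CrashLoopBackOff (restart back-off).

View the detailed runtime information of the Pod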

root@deploy:/health-check# kubectl describe pods web-deployment-5d747cdcd-gw8gm

2. Checking a Pod with an httpGet readinessProbe

root@deploy:/health-check# cat http-readinessprobe.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        readinessProbe:
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

Simulate a failure by changing the URL in httpGet path

        readinessProbe:
          httpGet:
            path: /tmp/index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

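After a few rounds of checks (failureThreshold is 3 here), the readiness probe judges the container as failed, and the Pod's IP is no longer added to the endpoints of the Service.

Checking with an httpGet startupProbe

At startup, the Pod is checked with an HTTP GET request to the URL path given in httpGet path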

root@deploy:/health-check# cat http-startupprobe.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        startupProbe:
          httpGet:
            path: /tmp/index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web
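The manifest above points the startup probe at /tmp/index.html, which the stock nginx image does not serve, so it demonstrates the failure case. For contrast, here is a sketch of how a startup probe is often sized for a slow-starting application; the path and the thresholds are assumptions, not values from this article.

```yaml
# Sketch only: path and thresholds are assumptions for a slow-starting app.
startupProbe:
  httpGet:
    path: /index.html
    port: 80
  periodSeconds: 10
  failureThreshold: 30         # startup budget: up to 30 * 10 = 300s
livenessProbe:
  httpGet:
    path: /index.html
    port: 80
  periodSeconds: 10
  failureThreshold: 3          # takes over only after the startup probe succeeds
```

The idea is to give startup a generous budget without loosening the liveness probe, since the liveness probe stays disabled until the startup probe has succeeded once.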

3. Checking a Pod with a tcpSocket livenessProbe

root@deploy:/health-check# cat tcp-livenessprobe.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

Simulate a failure by changing the tcpSocket port from 80 to 81

        livenessProbe:
          tcpSocket:
            port: 81
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

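The livenessProbe check fails and the container is restarted for the first time

After three rounds of checks, the Pod status is set to CrashLoopBackOff

4. Checking a Pod with a tcpSocket readinessProbe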

root@deploy:/health-check# cat tcp-readinessprobe.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        readinessProbe:
          tcpSocket:
            port: 81
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

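The probe fails, so the Service gets no Pod IPs in its endpoints

5. Using a command (exec) probe

Simulate checking the running state of the Pod with an exec command probe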

root@deploy:/health-check# cat cmd-livenessprobe.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        livenessProbe:
          exec:
            command: ["curl","http://localhost:8080"]
          initialDelaySeconds: 10
          periodSeconds: 2
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

Once it is running, the probe fails (nginx listens on port 80, not 8080) and the container is restarted
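For comparison, a passing variant of the exec probe might look like the fragment below. It assumes the curl binary is available in the image, and it adds the -f flag so that curl exits non-zero on HTTP error responses as well as on connection failures.

```yaml
# Sketch only: assumes curl exists in the container image.
livenessProbe:
  exec:
    # -f: exit non-zero on HTTP 4xx/5xx; -s: no progress output
    command: ["curl", "-fs", "http://localhost:80"]
  initialDelaySeconds: 10
  periodSeconds: 2
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
```

Without -f, curl returns exit code 0 for a 404 response, so an exec probe built on plain curl only detects that the port is unreachable, not that the page is missing.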

6. Combining the three probes

Create a Deployment with startupProbe, livenessProbe, and readinessProbe

root@deploy:/health-check# cat all.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 80
        startupProbe:
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 3
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

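Simulate a startupProbe failure by changing its path to a wrong URL

Startup fails: the startupProbe check fails

Simulate a livenessProbe failure by changing its path to a wrong URL

The first start succeeds, then the livenessProbe check fails and the container is restarted

Simulate a readinessProbe failure by changing its path to a wrong URL

The readinessProbe check fails and the endpoint addresses of the svc are emptied

postStart and preStop

postStart and preStop handlers

postStart:

Runs immediately after the container is created, without waiting for the service in the container to start.

If postStart fails, the container is killed and the Pod does not finish starting.

preStop:

Runs before the container is stopped.

Example:

Modify index.html right after the container starts and right before the container is stopped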

root@deploy:/health-check# cat deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command: ["/bin/sh","-c","echo 'hello' > /usr/share/nginx/html/index.html"]
          preStop:
            exec:
              command: ["/bin/sh","-c","echo 'goodbye' > /usr/share/nginx/html/index.html"]
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  type: NodePort
  selector:
    app: web

View the NodePort of the svc

root@deploy:/health-check# kubectl get svc
NAME      TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
web-svc   NodePort   10.100.200.2   <none>        80:46874/TCP   2m38s

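Pod termination flow

Delete the Pod: the Pod is set to the "Terminating" state, is removed from the Endpoints list of its Services, and no longer receives client requests.

Run preStop: the preStop hook, if defined, is executed. Kubernetes then sends SIGTERM (the normal termination signal) to the containers in the Pod to stop the main process; the signal tells the container that it will be shut down soon.

terminationGracePeriodSeconds: 60

Optional termination grace period. If a grace period is set, Kubernetes waits until it expires; otherwise it waits at most 30s.

View the detailed help for terminationGracePeriodSeconds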

root@deploy:/health-check# kubectl explain Deployment.spec.template.spec.terminationGracePeriodSeconds
KIND:     Deployment
VERSION:  apps/v1
terminationGracePeriodSeconds <integer>

root@deploy:/health-check# cat deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command: ["/bin/sh","-c","echo 'hello' > /usr/share/nginx/html/index.html"]
          preStop:
            exec:
              command: ["/bin/sh","-c","echo 'goodbye' > /usr/share/nginx/html/index.html"]

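The time Kubernetes waits is called the graceful termination period; it is 30 seconds by default. Note that the grace period countdown runs in parallel with the preStop hook and the SIGTERM signal, so Kubernetes may not wait for the preStop hook to finish: if the main process has still not exited when the period expires (30 seconds by default), the Pod is forcibly terminated. SIGKILL is then sent to the Pod, and the Pod is deleted.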
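Since the preStop hook shares the grace period with the shutdown itself, the two are usually sized together. The following is a sketch under assumed values: the Pod name, the 10-second sleep, and the use of nginx -s quit are illustrative choices, not taken from this article.

```yaml
# Sketch only: graceful-demo, the sleep length and the quit command are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  terminationGracePeriodSeconds: 60   # total budget for preStop plus shutdown
  containers:
  - name: web
    image: nginx
    lifecycle:
      preStop:
        exec:
          # Wait briefly so endpoint removal can propagate, then ask nginx
          # to finish in-flight requests and exit.
          command: ["/bin/sh", "-c", "sleep 10; nginx -s quit"]
```

The hook here is meant to use only a small part of the 60 seconds, which leaves the rest of the budget for the process to exit before SIGKILL is sent.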

posted @ 2022-08-09 17:30 PunchLinux