Kubernetes Advanced Scheduling

1 Node taints and pod tolerations versus node affinity

  

  1.1 First, an introduction to node taints and pod taint tolerations

    A taint is a property of a node; a toleration is a property of a pod. A pod can be scheduled onto a node only if its tolerations cover the node's taints.

 

  1.2 Comparing the use cases of taints/tolerations and node affinity

    Taints work from the node side: by adding taints to existing nodes you repel certain pods from being scheduled there. Node affinity works from the pod side: the pod definition states explicitly which nodes the pod may (or may not) be scheduled onto.

 

2 Getting to know node taints and pod tolerations

 

  2.1 Viewing a cluster node's taints
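
The node description below was presumably produced with kubectl describe (this post uses k as an alias for kubectl):

k describe node master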

Name:               master
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=master
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"b6:6a:dc:5d:74:7e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.16.70.6
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 21 Dec 2020 11:40:57 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule
  • The master node carries a node-role.kubernetes.io/master:NoSchedule taint
  • Pods without a toleration for this taint cannot normally be scheduled onto the node; only system-level pods are scheduled onto it

  2.2 Displaying a pod's tolerations
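
The tolerations below come from describing one of the kube-proxy pods (the pod name is specific to this cluster, so substitute your own):

k describe po kube-proxy-z6nwk -n kube-system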

Name:                 kube-proxy-z6nwk
Namespace:            kube-system

......

QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:          <none>
  • The Tolerations section above shows this system pod's taint tolerations

 

  2.3 Understanding taint effects

   Every taint on a node is associated with an effect:

    • NoSchedule: a pod that does not tolerate the taint cannot be scheduled onto the node
    • PreferNoSchedule: the scheduler tries to avoid the node, but will still place the pod there if it cannot be scheduled anywhere else
    • NoExecute: also affects pods already running on the node; any running pod without a matching toleration is evicted (see the sketch below)
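
For reference, the same taint key and value can be attached with any of the three effects; these commands are hypothetical and simply mirror the syntax shown in section 2.4 below:

k taint node node01 node-type=product:NoSchedule        ## repel new pods without the toleration
k taint node node01 node-type=product:PreferNoSchedule  ## soft version: avoid the node if possible
k taint node node01 node-type=product:NoExecute         ## additionally evict running pods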

  2.4 Adding or removing a node taint

## add a taint

k taint node node01 node-type=product:NoSchedule

## remove a taint

k taint node node01 node-type:NoSchedule-
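
You can confirm the change by describing the node, e.g.:

k describe node node01 | grep Taints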

 

  2.5 Create a few pods and observe the effect
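
The original commands that created the test pods are not shown; something along these lines would produce five plain busybox pods:

for i in 1 2 3 4 5; do k run test$i --image=busybox -- sleep 999999; done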

[root@node01 wxm]# k get po test{1,2,3,4,5} -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
test1   1/1     Running   0          2m41s   10.244.1.126   node02   <none>           <none>
test2   1/1     Running   0          2m26s   10.244.1.127   node02   <none>           <none>
test3   1/1     Running   0          2m21s   10.244.1.128   node02   <none>           <none>
test4   1/1     Running   0          2m15s   10.244.1.129   node02   <none>           <none>
test5   1/1     Running   0          2m10s   10.244.1.130   node02   <none>           <none>
  • All the pods were scheduled onto node02, because node01 now carries the taint

  2.6 Adding a taint toleration to a pod

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  selector:
    matchLabels:
      app: prod
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      tolerations:
      - key: node-type
        operator: Equal
        value: product
        effect: NoSchedule
  • Adding the toleration allows the pods to be scheduled onto nodes that carry the matching taint; it does not restrict them to those nodes (see the sketch below)
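
To actually dedicate the product nodes to these pods, a common pattern combines the toleration with a matching node label plus a nodeSelector; a sketch, assuming node01 has also been labeled with k label node node01 node-type=product:

    spec:
      nodeSelector:
        node-type: product
      tolerations:
      - key: node-type
        operator: Equal
        value: product
        effect: NoSchedule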

  2.7 Observing the tolerations on the new pods

Name:           prod-7c8c7f9b47-xcbkt
Namespace:      default
     ........
Tolerations:     node-type=product:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
     ........
  • This pod has three tolerations. The first tolerates the node-type=product:NoSchedule taint, allowing the pod to be scheduled onto nodes that carry it
  • The second and third apply when the node becomes not-ready or unreachable: the pod tolerates either condition for 300s. If the node is still in that state after the 300s have passed, the pod is evicted and rescheduled onto another node (the window can be tuned, as sketched below)
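
Those two tolerations are added automatically by the DefaultTolerationSeconds admission plugin; to use a different window, declare the toleration explicitly in the pod spec, e.g.:

      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60   ## evict after one minute instead of five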

 

3 Getting to know node affinity

 

  3.1 Add an availability-zone label and a dedicated/shared tier label to the two cluster nodes

k label node node01 availability-zone=zone1
k label node node02 availability-zone=zone2
k label node node01 share-type=dedicated
k label node node02 share-type=shared
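
The labels can be verified with kubectl's -L (label columns) flag, e.g.:

k get nodes -L availability-zone,share-type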

  

  3.2 Then deploy a Deployment whose node-affinity preferences give zone1 a weight of 80 and dedicated a weight of 20

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  selector:
    matchLabels:
      app: prod
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: availability-zone
                operator: In
                values:
                - zone1
          - weight: 20
            preference:
              matchExpressions:
              - key: share-type
                operator: In
                values:
                - dedicated

 

  3.3 Observe where these pods were scheduled

[root@node01 Chapter16]# k get po -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
prod-5ffd8886-4646n   1/1     Running   0          52m   10.244.2.130   node01   <none>           <none>
prod-5ffd8886-c8p4v   1/1     Running   0          52m   10.244.2.129   node01   <none>           <none>
prod-5ffd8886-crhtp   1/1     Running   0          52m   10.244.1.156   node02   <none>           <none>
prod-5ffd8886-hrn2b   1/1     Running   0          52m   10.244.2.131   node01   <none>           <none>
prod-5ffd8886-x4qv6   1/1     Running   0          52m   10.244.2.132   node01   <none>           <none>
  • Most of the deployment's pods were scheduled onto node01, which matches expectations: node01 carries the availability-zone=zone1 label, the preference with the highest weight
  • You may wonder why one pod still ended up on node02. The scheduler also applies a spreading priority that tries not to place all pods of one set on a single node, so that the service keeps running if that node fails

 

4 Getting to know inter-pod affinity

 

  4.1 What is inter-pod affinity, and what is it used for?

    Inter-pod affinity lets you schedule a pod onto the node where pods carrying a given label already run. For example, with a backend pod deployed on node01, we can ask that the frontend pods be placed on the same machine as the backend, which noticeably cuts the network latency between them.

    

  4.2 An example: step one, deploy a backend pod

k run backend -l app=backend --image=busybox -- sleep 999999

  4.3 Then deploy five frontend pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  selector:
    matchLabels:
      app: prod
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: backend

  4.4 Verify that the pods landed on the same machine

NAME                   READY   STATUS    RESTARTS   AGE    IP             NODE     NOMINATED NODE   READINESS GATES
backend                1/1     Running   0          12m    10.244.1.157   node02   <none>           <none>
prod-bdd66cb75-7bcmj   1/1     Running   0          118s   10.244.1.162   node02   <none>           <none>
prod-bdd66cb75-r69q7   1/1     Running   0          118s   10.244.1.161   node02   <none>           <none>
prod-bdd66cb75-r8fjn   1/1     Running   0          118s   10.244.1.159   node02   <none>           <none>
prod-bdd66cb75-twphc   1/1     Running   0          118s   10.244.1.158   node02   <none>           <none>
prod-bdd66cb75-vkrm8   1/1     Running   0          118s   10.244.1.160   node02   <none>           <none>

  As expected, all the frontend pods were scheduled onto the same node as the backend
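
The topologyKey determines what "the same place" means: kubernetes.io/hostname means the same node. A broader domain works as well; for instance, reusing the availability-zone node label from section 3.1 would only require the frontend pods to share a zone with the backend (a sketch):

      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: availability-zone
            labelSelector:
              matchLabels:
                app: backend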

 

5 Relaxing pod affinity from required to preferred, and pod anti-affinity

  

  5.1 Turn the earlier hard requirement into a preference: the scheduler will prefer to co-locate the pods, but may place them on other nodes when necessary, as shown below

[root@node01 Chapter16]# cat frontend-podaffinity-host-2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  selector:
    matchLabels:
      app: prod
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: backend

 

  5.2 See how the frontend pods are scheduled

[root@node01 Chapter16]# k get po -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
backend                 1/1     Running   0          37m   10.244.1.157   node02   <none>           <none>
prod-7cd8bf84c4-862qw   1/1     Running   0          11s   10.244.1.165   node02   <none>           <none>
prod-7cd8bf84c4-db6j5   1/1     Running   0          11s   10.244.1.164   node02   <none>           <none>
prod-7cd8bf84c4-dzsq7   1/1     Running   0          11s   10.244.2.133   node01   <none>           <none>
prod-7cd8bf84c4-fhds5   1/1     Running   0          11s   10.244.1.163   node02   <none>           <none>
prod-7cd8bf84c4-mkps2   1/1     Running   0          11s   10.244.1.166   node02   <none>           <none>
  • Most of the pods were scheduled onto the same machine as backend
  • One pod still went to node01, which is exactly the benefit of not making the rule mandatory
  • So prefer the soft (preferred) form over the hard (required) form unless co-location is strictly necessary

  

  5.3 Use pod anti-affinity to spread pods across different nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  selector:
    matchLabels:
      app: prod
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: prod
  • With anti-affinity, a pod cannot be scheduled onto a node that already runs a pod matching the label selector; here the selector matches the deployment's own app=prod label, so the replicas repel each other
  • The keyword is podAntiAffinity

  5.4 Check the scheduling result

NAME                    READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
prod-76d8477ff8-5jvv5   1/1     Running   0          3m10s   10.244.2.141   node01   <none>           <none>
prod-76d8477ff8-bb2t9   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
prod-76d8477ff8-gtj2t   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
prod-76d8477ff8-hwskt   1/1     Running   0          3m10s   10.244.1.175   node02   <none>           <none>
prod-76d8477ff8-hzz4j   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
  • The cluster has only two schedulable nodes, so only two pods can run; the rest stay Pending, as expected (a softer variant is sketched below)
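
If all five replicas should run even on a two-node cluster, the anti-affinity can be made preferred instead of required, mirroring section 5.1 (a sketch):

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: prod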

  5.5 Describe one of the Pending pods

  

QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  29s (x5 over 4m54s)  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules
  • The scheduling failure reason matches expectations