Using GPUs in Kubernetes

When a pod requests a GPU via nvidia.com/gpu: "1", the card cannot be shared: a node with 8 GPUs can schedule at most 8 such pods. Plain Docker, by contrast, lets containers share a card.

To share GPUs, use the same mechanism Docker does: set the NVIDIA_VISIBLE_DEVICES environment variable on the container instead of requesting nvidia.com/gpu.
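For comparison, the Docker-style sharing this mimics; a minimal sketch, assuming nvidia-container-runtime is installed (the CUDA image tag is illustrative). Multiple containers started this way all see the same card:
shell> docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=3 nvidia/cuda:11.0-base nvidia-smi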

Environment

To pin workloads to a dedicated GPU server, first taint the designated GPU node:
shell> kubectl taint node lgy-dev-gpu-k8s-node7-105 server_type=gpu:NoSchedule
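To confirm the taint is in place:
shell> kubectl describe node lgy-dev-gpu-k8s-node7-105 | grep -i taint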
  • Prepare the Kubernetes Deployment manifest
# cat pengyun-python-test111.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: pengyun-python-test111
  name: pengyun-python-test111
  namespace: pengyun
spec:
  replicas: 1
  selector:
    matchLabels:
      name: pengyun-python-test111
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: pengyun-python-test111
    spec:
      containers:
      - image: harbor.k8s.moviebook.cn/pengyun/dev/000003-pengyun/python_dev:20220106140120
        imagePullPolicy: IfNotPresent
        name: pengyun-python-test111
        resources:
          limits:
            cpu: "1"
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 2Gi
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:   # tolerate the GPU taint set above
      - effect: NoSchedule
        key: server_type
        operator: Equal
        value: gpu
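The nvidia.com/gpu resource in the limits above is only schedulable if the NVIDIA device plugin is running on the node; one way to check what the node advertises (node name from the taint step above):
shell> kubectl describe node lgy-dev-gpu-k8s-node7-105 | grep nvidia.com/gpu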
  • Apply the manifest
shell> kubectl apply -f pengyun-python-test111.yaml
  • Check pod creation
Confirm the pod was scheduled onto the tainted node:
shell> kubectl get pod -n pengyun -o wide | grep test111

  • Check the GPU inside the pod
shell> kubectl exec -it -n pengyun pengyun-python-test111-d7d895867-txtxp -- bash
shell> nvidia-smi
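Because the limits request nvidia.com/gpu: "1", nvidia-smi inside the pod should list exactly one card: the device plugin exposes only the GPU allocated to this pod.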

  • Check the GPUs on the host
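On the GPU node itself, nvidia-smi lists all of the node's cards, one of which is now held by the pod above:
shell> nvidia-smi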

To share a GPU across pods in Kubernetes, add the env setting instead; do not put nvidia.com/gpu: "1" in the resources limits. Note that this bypasses the scheduler's GPU accounting entirely: there is no memory or compute isolation, so the pods share the card cooperatively.

  • Example manifest: 10 pods sharing the GPU at device index 3, configured as follows
# cat deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ainpack
  namespace: kube-system
  labels:
    app: ainpack
spec:
  replicas: 10
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: ainpack
  template: # define the pods specifications
    metadata:
      labels:
        app: ainpack
    spec:
      nodeSelector:
        gpushare: "true"
      containers:
      - name: ainpack
        image: yz.xxx.com/base/python_dev:20211012102640 
        imagePullPolicy: IfNotPresent
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "3"
      tolerations:
      - effect: NoSchedule
        key: server_type
        operator: Equal
        value: gpu-A100
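The nodeSelector and toleration above assume the target node carries the gpushare=true label and a server_type=gpu-A100:NoSchedule taint; a sketch of the prerequisites, using the node name visible in the output below (skip any step already done):
# kubectl label node slave-gpu-109 gpushare=true
# kubectl taint node slave-gpu-109 server_type=gpu-A100:NoSchedule
# kubectl apply -f deploy.yaml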
  • Check the created GPU pods
# kubectl get pod -n kube-system -o wide | grep ainpack
ainpack-7dcf955d4d-5c7c2                  1/1     Running   0          9m16s   172.30.25.18     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-9njb8                  1/1     Running   0          9m14s   172.30.25.20     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-hc8j4                  1/1     Running   0          9m14s   172.30.25.19     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-hrfq2                  1/1     Running   0          9m16s   172.30.25.17     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-jjd22                  1/1     Running   0          9m12s   172.30.25.23     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-jzzrd                  1/1     Running   0          9m12s   172.30.25.22     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-rvv97                  1/1     Running   0          9m16s   172.30.25.14     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-sk8xc                  1/1     Running   0          9m16s   172.30.25.15     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-vxg2b                  1/1     Running   0          9m13s   172.30.25.21     slave-gpu-109   <none>           <none>
ainpack-7dcf955d4d-wl47s                  1/1     Running   0          9m16s   172.30.25.16     slave-gpu-109   <none>           <none>

  • Check which GPU the pods are using
# kubectl exec -it -n kube-system ainpack-7dcf955d4d-sk8xc -- bash

# nvidia-smi
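Inside the container the shared card shows up renumbered as device 0; to match it against the host unambiguously, compare UUIDs with a standard nvidia-smi query:
# nvidia-smi --query-gpu=index,uuid --format=csv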

  • Log in to the GPU host and check the card at device index 3: it is the same device the pods see, which confirms the GPU is being shared
# nvidia-smi 
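If workloads are running, the processes from all ten pods show up under that one card; a per-process view restricted to GPU 3:
# nvidia-smi -i 3 --query-compute-apps=pid,process_name,used_memory --format=csv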
