I. Introduction to GPU Virtualization

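A GPU is a PCIe device built for matrix computation and is typically used for parallel workloads such as decoding, rendering, and scientific computing. Each scenario uses the GPU differently and relies on a different acceleration library. The GPU virtualization discussed in this article targets scientific computing, where the acceleration library is NVIDIA CUDA.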

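From the user's point of view, GPU virtualization falls into two broad types: virtualization at the virtual machine level and virtualization at the container level. VM-level virtualization exposes a physical GPU to multiple KVM virtual machines, each of which installs its own driver. This keeps GPU functionality complete inside every VM while also isolating and sharing GPU resources; the only drawback is a relatively high resource overhead. Container-level virtualization can follow two approaches. The first is to bring the GPU under cgroup management, but no mature proposal exists yet and this is unlikely to materialize in the short term. The second is to wrap the GPU driver: key driver interfaces (such as GPU memory allocation and CUDA thread creation) are wrapped and intercepted, and the wrapper limits how much compute a user process may consume. The drawback of this approach is that compatibility depends on the vendor driver, but the solution as a whole is lightweight and the performance overhead is very small. GPUManager is a container-level solution of this second kind. This article covers how GPUManager works and how to deploy it.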

II. GPUManager Architecture

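GPUManager is a GPU virtualization solution that runs on k8s. Before looking at its architecture, it helps to review how k8s supports heterogeneous resources. Starting with version 1.6, k8s carried Nvidia GPU code in-tree, but it lacked GPU scheduling support and could not be used in real production environments. To meet the growing demand for heterogeneous resources (such as GPU, Infiniband, and FPGA), the community introduced Extended Resources and Device Plugins in version 1.8, which support scheduling and mapping of heterogeneous resources in an out-of-tree fashion.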

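GPUManager is a container-level GPU virtualization solution developed in-house at Tencent. Besides being compatible with the GPU resource management features of the official Nvidia plugin, it adds fragmented-resource scheduling, GPU topology-aware scheduling, and GPU resource quotas, so that a GPU can be split into smaller pieces at the container level. In principle it relies only on a wrapper library and Linux dynamic library linking to enforce upper limits on GPU compute and GPU memory.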

In terms of engineering design, GPUManager consists of three parts: the CUDA wrapper library vcuda, the k8s device plugin gpu-manager-daemonset, and the k8s scheduler plugin gpu-quota-admission.

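vcuda is a wrapper library around nvidia-ml and libcuda. It intercepts the CUDA calls made by user programs inside a container and thereby limits how much GPU compute and GPU memory the processes in that container can use.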
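To make the interception idea concrete, here is a minimal Python sketch of a wrapper that checks a memory cap before forwarding an allocation call. It is only an analogy: the real vcuda library wraps C functions in libcuda and nvidia-ml, and the names `QuotaExceeded`, `MemoryQuotaWrapper`, and `fake_driver_alloc` are hypothetical and do not come from the project.

```python
class QuotaExceeded(Exception):
    """Raised when an allocation would exceed the container's memory cap."""


class MemoryQuotaWrapper:
    """Wraps an allocation function and enforces an upper bound on total bytes."""

    def __init__(self, real_alloc, limit_bytes):
        self._real_alloc = real_alloc  # the underlying "driver" call
        self._limit = limit_bytes      # cap for this container
        self._used = 0                 # bytes handed out so far

    def alloc(self, size):
        # Check the quota first; forward the call only if it still fits.
        if self._used + size > self._limit:
            raise QuotaExceeded(f"request of {size} bytes exceeds the cap")
        handle = self._real_alloc(size)
        self._used += size
        return handle

    @property
    def used(self):
        return self._used


def fake_driver_alloc(size):
    """Hypothetical stand-in for the real driver's allocation entry point."""
    return ("buffer", size)


MIB = 1024 * 1024
wrapper = MemoryQuotaWrapper(fake_driver_alloc, limit_bytes=256 * MIB)
first = wrapper.alloc(200 * MIB)  # fits under the 256 MiB cap
try:
    wrapper.alloc(100 * MIB)      # would push usage to 300 MiB
    second_allowed = True
except QuotaExceeded:
    second_allowed = False
```

The user program keeps calling what it believes is the normal allocation function; the wrapper sits in between, which is why no change to the application is needed.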

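gpu-manager-daemonset is a standard k8s device plugin that implements GPU topology awareness and device and driver mapping. GPUManager supports two modes, shared and exclusive. When a workload's tencent.com/vcuda-core request is greater than 0 and less than 100, it is scheduled in shared mode, which prefers to pack fragmented resources onto a single card. When the request is a multiple of 100, it is scheduled in exclusive mode: gpu-manager-daemonset builds a topology tree of the GPU cards from the GPU topology and picks the best structure (the leaf nodes with the shortest distance between them) for allocation. Note that GPUManager only schedules requests that are below 100 or an integer multiple of 100; requests such as 150 or 220 are not supported. Each GPU card has 100 units of resource in total, so a request is either a fraction of one card (between 0 and 1) or a whole number of cards. GPU memory is allocated in units of 256 MiB.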
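The request rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the plugin's code: `scheduling_mode` and `pick_card_for_shared` are hypothetical names, and the best-fit rule is just one simple way to "pack fragments onto one card"; the actual allocator in gpu-manager may choose differently.

```python
def scheduling_mode(vcuda_core):
    """Return 'shared' or 'exclusive' for a tencent.com/vcuda-core request."""
    if vcuda_core <= 0:
        raise ValueError("vcuda-core must be positive")
    if vcuda_core < 100:
        return "shared"      # a fraction of one card
    if vcuda_core % 100 == 0:
        return "exclusive"   # one or more whole cards
    raise ValueError(f"{vcuda_core} is neither below 100 nor a multiple of 100")


def pick_card_for_shared(free_units, request):
    """Best-fit choice: the card with the least free capacity that still fits.

    free_units maps a card index to its unallocated vcuda-core units.
    Returns the chosen index, or None if no card can hold the request.
    """
    candidates = [(free, idx) for idx, free in free_units.items() if free >= request]
    if not candidates:
        return None
    return min(candidates)[1]


mode_half = scheduling_mode(50)
mode_two_cards = scheduling_mode(200)
chosen = pick_card_for_shared({0: 100, 1: 40, 2: 70}, request=30)
```

With the sample data, card 1 is chosen because it is the fullest card that can still take 30 units, which leaves the empty card 0 available for a later exclusive request.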

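gpu-quota-admission is a k8s scheduler extender that implements the scheduler's predicates interface. When kube-scheduler schedules a Pod that requests tencent.com/vcuda-core, the predicates phase calls the predicates interface of gpu-quota-admission to filter and bind nodes. gpu-quota-admission also provides GPU resource pool scheduling, which addresses per-namespace quotas for different GPU models.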
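As a rough picture of what a predicates (filter) step does, the sketch below splits candidate nodes into those that pass and those that fail, after a namespace quota check. Everything here is an assumption made for illustration: the function name, the data shapes, and the reason strings are hypothetical and are not taken from gpu-quota-admission.

```python
def filter_nodes(free_core_by_node, pod_core_request, namespace_quota_left):
    """Split nodes into those that can host the pod and those that cannot.

    free_core_by_node maps a node name to its unallocated vcuda-core units.
    Returns (passed, failed), where failed maps a node name to a reason.
    """
    if pod_core_request > namespace_quota_left:
        # The quota is checked once, before looking at individual nodes.
        reason = "namespace GPU quota exceeded"
        return [], {name: reason for name in free_core_by_node}

    passed, failed = [], {}
    for name, free in free_core_by_node.items():
        if free >= pod_core_request:
            passed.append(name)
        else:
            failed[name] = f"only {free} vcuda-core units free"
    return passed, failed


nodes = {"node-a": 20, "node-b": 100}
passed, failed = filter_nodes(nodes, pod_core_request=50, namespace_quota_left=200)
```

kube-scheduler then continues its normal scoring on the nodes that passed, so the extender only narrows the candidate set.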

The overall GPUManager architecture is as follows (the architecture diagram is not reproduced here):

III. GPUManager Deployment

## GitHub
gpu-admission: https://github.com/tkestack/gpu-admission
gpu-manager: https://github.com/tkestack/gpu-manager

1. Driver installation

Reference: https://www.cnblogs.com/deny/p/16305945.html

2. Deployment

1) Deploy the gpu-quota-admission service

kubectl apply -f gpu-admission.yaml

The file contents are as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
         "QuotaConfigMapName": "gpuquota",
         "QuotaConfigMapNamespace": "kube-system",
         "GPUModelLabel": "gaia.tencent.com/gpu-model",
         "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
     }

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
  - port: 3456
    protocol: TCP
    targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
            weight: 1
      containers:
      - env:
        - name: LOG_LEVEL
          value: "4"
        - name: EXTRA_FLAGS
          value: --incluster-mode=true
        image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
        imagePullPolicy: IfNotPresent
        name: gpu-quota-admission
        ports:
        - containerPort: 3456
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
      - command:
        - sh
        - -c
        - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config
          /etc/kubernetes/'
        image: busybox
        imagePullPolicy: Always
        name: init-kube-config
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /root/gpu-quota-admission/
          name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: gpu-quota-admission
        name: config

2) Deploy gpu-manager-daemonset

kubectl apply -f gpu-manager.yaml

The file contents are as follows:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    name: gpu-manager-ds

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # Only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr

3) Label the GPU nodes with nvidia-device-enable=enable

kubectl label node <node-name> nvidia-device-enable=enable

4) Verify that gpu-manager-daemonset has been scheduled onto the GPU nodes

kubectl get pods -n kube-system
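If you would rather check the result programmatically than read it by eye, a small parser like the following can pick out the daemonset Pods. It is a sketch that assumes the usual NAME / READY / STATUS / RESTARTS / AGE column layout; the sample output and the Pod name suffixes are made up for illustration.

```python
def gpu_manager_pods(kubectl_output):
    """Return {pod_name: status} for gpu-manager-daemonset Pods."""
    pods = {}
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Expect NAME READY STATUS RESTARTS AGE; skip the header and short lines.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if name.startswith("gpu-manager-daemonset-"):
            pods[name] = status
    return pods


# Hypothetical output of `kubectl get pods -n kube-system`.
sample = """\
NAME                           READY   STATUS    RESTARTS   AGE
gpu-manager-daemonset-abcde    1/1     Running   0          5m
gpu-quota-admission-xyz        1/1     Running   0          9m
gpu-manager-daemonset-fghij    0/1     Pending   0          5m
"""

statuses = gpu_manager_pods(sample)
not_running = [name for name, status in statuses.items() if status != "Running"]
```

There should be one gpu-manager-daemonset Pod per labeled GPU node, and all of them should report Running.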

3. Custom scheduler

1) Prepare the scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following contents:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "PodFitsHostPorts"
    },
    {
      "name": "PodFitsResources"
    },
    {
      "name": "NoDiskConflict"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "HostName"
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}

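The host name and port in "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler" point at the gpu-quota-admission Service deployed earlier. Change them only if you have special requirements; otherwise the value works as written.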
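The scheduler calls an extender at its urlPrefix followed by the verb name. The short Python sketch below derives that endpoint from a trimmed copy of the policy, which is handy when testing connectivity to the Service by hand. The helper name `filter_endpoints` is hypothetical.

```python
import json

# Trimmed copy of the extender section from scheduler-policy-config.json.
policy_text = """
{
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "filterVerb": "predicates"
    }
  ]
}
"""


def filter_endpoints(policy):
    """Join each extender's urlPrefix and filterVerb into the URL that is called."""
    urls = []
    for extender in policy.get("extenders", []):
        verb = extender.get("filterVerb")
        if verb:
            urls.append(extender["urlPrefix"].rstrip("/") + "/" + verb)
    return urls


endpoints = filter_endpoints(json.loads(policy_text))
```

For the policy above, the filter request goes to the /scheduler/predicates path on port 3456 of the gpu-quota-admission Service.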

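2) Modify the scheduler manifest

If k8s was deployed with kubeadm, the scheduler runs as a Pod. kubelet continuously watches the manifest files and automatically restarts the Pod to load the new configuration when a file changes, so we only need to edit the scheduler's manifest. Back it up first, outside the manifests directory, because kubelet treats the files in that directory as static Pod manifests:

cp /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/kube-scheduler.yaml.bak

Add two lines under the command key. These flags belong to the legacy scheduler Policy API, which works with the v1.19.8 scheduler used here; later Kubernetes releases removed it in favor of the scheduler configuration file.

--policy-config-file=/etc/kubernetes/scheduler-policy-config.json
--use-legacy-policy-config=true

The modified file looks like this: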

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json              # added
    - --use-legacy-policy-config=true                                                # added
    image: 10.2.57.16:5000/kubernetes/kube-scheduler:v1.19.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json              # added: mount the policy file
      name: policyconfig
      readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet                                   # changed so the host-network Pod can resolve in-cluster Service names
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: policyconfig
status: {}

The change takes effect automatically after you save and exit.

You can confirm it with the following command:

[root@cri3dp1 manifests]# kubectl -n kube-system get pod | grep sch
kube-scheduler-cri3dp1                       1/1     Running   0          141m

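In the output, find the Pod named kube-scheduler-XXX and check its AGE column to see whether it has just started. If it has, the scheduler configuration has been updated.

4. View the GPU node information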

[root@cri3dp1 manifests]# kubectl describe node k8s-node3
.........
Capacity:
  cpu:                       20
  ephemeral-storage:         958487280Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    65492456Ki
  pods:                      110
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  96
Allocatable:
  cpu:                       20
  ephemeral-storage:         883341875786
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    65390056Ki
  pods:                      110
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  96
.........
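To read these numbers, recall that one card is 100 vcuda-core units and one vcuda-memory unit is 256 MiB. A small Python sketch of the conversion follows; `describe_capacity` is a hypothetical helper name.

```python
CORE_UNITS_PER_CARD = 100  # one physical card = 100 vcuda-core units
MEMORY_UNIT_MIB = 256      # one vcuda-memory unit = 256 MiB


def describe_capacity(vcuda_core, vcuda_memory):
    """Convert node capacity units into whole cards and GiB of GPU memory."""
    cards = vcuda_core // CORE_UNITS_PER_CARD
    memory_gib = vcuda_memory * MEMORY_UNIT_MIB / 1024
    return cards, memory_gib


# Values taken from the node description above.
cards, memory_gib = describe_capacity(vcuda_core=100, vcuda_memory=96)
```

For this node the result is one card with 24 GiB of GPU memory.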

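IV. Testing the Solution

The test uses the TensorFlow framework. The test image ships with benchmarks such as MNIST, CIFAR-10, and AlexNet, so you can pick whichever suits your needs.

Test steps:

1. Use the TensorFlow framework with the MNIST dataset for verification. TensorFlow image:

ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2

2. Create a test workload with the following YAML: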

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "50"
            tencent.com/vcuda-memory: "32"

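3. Enter the test container (it runs in the default namespace; if you changed the test YAML, specify the namespace accordingly)

kubectl exec -it $(kubectl get pods -l k8s-app=vcuda-test -o name | cut -d '/' -f2) -- bash

4. Run a test command; choose the training framework and dataset that fit your needs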

a. Mnist

cd /data/tensorflow/mnist && time python convolutional.py

b. AlexNet

cd /data/tensorflow/alexnet && time python alexnet_benchmark.py

c. Cifar10

cd /data/tensorflow/cifar10 && time python cifar10_train.py

5. On the physical host, run nvidia-smi pmon -s u -d 1 to watch GPU resource usage

V. Using GPUs in Pods

YAML examples are given below:

1) Using one full card on a P4 device:

apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: gpu
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "100"
      requests:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "100"

2) An application using 0.3 of a card and 5 GiB of GPU memory:

apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: gpu
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "30"
        tencent.com/vcuda-memory: "20"
      requests:
        cpu: "4"
        memory: 8Gi
        tencent.com/vcuda-core: "30"
        tencent.com/vcuda-memory: "20"
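The mapping from "0.3 of a card, 5 GiB" to the values "30" and "20" is plain arithmetic: 100 core units per card and 256 MiB per memory unit. The Python sketch below generates the resources section from human-friendly inputs; `gpu_resources` is a hypothetical helper, and it sets requests equal to limits as in the examples above.

```python
def gpu_resources(cards, memory_gib=None):
    """Build the GPU part of a container's resources from cards and GiB."""
    core_units = round(cards * 100)  # 100 vcuda-core units per card
    if core_units <= 0:
        raise ValueError("cards must be positive")
    if core_units > 100 and core_units % 100 != 0:
        raise ValueError("more than one card must be a whole number of cards")

    gpu = {"tencent.com/vcuda-core": str(core_units)}
    if memory_gib is not None:
        # GPU memory is allocated in 256 MiB units.
        units, remainder = divmod(memory_gib * 1024, 256)
        if remainder:
            raise ValueError("memory must be a multiple of 256 MiB")
        gpu["tencent.com/vcuda-memory"] = str(int(units))

    # Same values under limits and requests, matching the YAML examples.
    return {"limits": dict(gpu), "requests": dict(gpu)}


fractional = gpu_resources(cards=0.3, memory_gib=5)
whole_card = gpu_resources(cards=1)
```

Calling `gpu_resources(cards=0.3, memory_gib=5)` reproduces the second example, and `gpu_resources(cards=1)` reproduces the first.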

posted @ 2022-05-24 15:33 by 流年晕开时光
