Deploying Tencent vCUDA (gpu-manager)
References: https://cloud.tencent.com/developer/article/1685122
https://blog.csdn.net/o0haidee0o/article/details/119407372
https://www.jianshu.com/p/7d795bc226c7
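I. Introduction to GPU Virtualization
A GPU is a PCIe device built for matrix computation. It is typically used for parallel workloads such as video decoding, rendering, and scientific computing. Each scenario uses the GPU differently and relies on a different acceleration library. The GPU virtualization discussed in this article targets scientific computing, where the acceleration library is NVIDIA CUDA.
From the user's perspective, GPU virtualization falls into two broad types: virtualization at the virtual-machine level and virtualization at the container level. VM-level virtualization exposes the GPU hardware to multiple KVM virtual machines, each of which installs its own driver. This keeps GPU functionality complete inside each VM while isolating and sharing GPU resources; its only drawback is relatively high overhead. Container-level virtualization has two possible approaches. The first is to bring the GPU under cgroup management, but no mature proposal exists yet and this is unlikely to happen in the short term. The second is to wrap the GPU driver: key driver interfaces (such as GPU memory allocation and CUDA thread creation) are wrapped and intercepted, and the interception layer limits how much compute the user process may consume. The drawback is that compatibility depends on the vendor's driver, but the approach is lightweight and costs very little performance. GPUManager is a container-level solution of the second kind. This article covers how GPUManager works and how to deploy it.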
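II. GPUManager Architecture
GPUManager is a GPU virtualization solution that runs on Kubernetes (k8s). Before looking at its architecture, consider how k8s supports heterogeneous resources. Starting with version 1.6, the k8s in-tree code included NVIDIA GPU support, but it had no GPU scheduling and could not be used in production. To meet the growing demand for heterogeneous resources (GPU, InfiniBand, FPGA, and so on), the community proposed Extended Resources and Device Plugins in version 1.8, which support scheduling and mapping of heterogeneous resources out of tree.
GPUManager is a container-level GPU virtualization solution developed in-house by Tencent. Besides staying compatible with the GPU resource management of the official NVIDIA plugin, it adds fragment-aware scheduling, GPU topology-aware scheduling, and GPU resource quotas, so that GPU resources can be split into fractions at the container level. In principle it relies only on a wrapper library and Linux dynamic-library linking to enforce upper limits on GPU compute and GPU memory.
In engineering terms, GPUManager consists of three parts: the CUDA wrapper library vcuda, the k8s device plugin gpu-manager-daemonset, and the k8s scheduler plugin gpu-quota-admission.
vcuda is a wrapper around the nvidia-ml and libcuda libraries. It intercepts the CUDA calls made by user programs inside a container and thereby limits how much GPU compute and GPU memory the processes in that container can use.
gpu-manager-daemonset is a standard k8s device plugin that implements GPU topology awareness and device and driver mapping. GPUManager supports a shared mode and an exclusive mode. When a workload's tencent.com/vcuda-core request is greater than 0 and less than 100, it is scheduled in shared mode, which prefers to pack fragments onto a single card. When the request is a multiple of 100, it is scheduled in exclusive mode: gpu-manager-daemonset builds a topology tree of the GPU cards from the GPU topology and picks the best placement (the leaf nodes with the shortest distance between them). Note that GPUManager only schedules requests that are either below 100 or an integer multiple of 100; values such as 150 or 220 are not supported. Each GPU card provides 100 units, so a request is either a fraction of one card (between 0 and 1) or a whole number of cards. GPU memory is allocated in units of 256 MiB.
gpu-quota-admission is a k8s scheduler extender that implements the scheduler's predicates interface. When kube-scheduler schedules a Pod that requests tencent.com/vcuda-core, the predicates phase calls the predicates endpoint of gpu-quota-admission to filter and bind nodes. gpu-quota-admission also provides GPU resource-pool scheduling, which handles per-namespace quotas for different GPU models.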
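The request rules for tencent.com/vcuda-core described above can be written down as a small check. This is only a sketch of the rules as stated in this article, not code from gpu-manager; the function name classify_vcuda_core is made up for illustration.

```python
def classify_vcuda_core(request: int) -> str:
    """Classify a tencent.com/vcuda-core request by the rules in this article.

    100 units correspond to one physical GPU card.
    Hypothetical helper, not part of gpu-manager.
    """
    if isinstance(request, bool) or not isinstance(request, int) or request <= 0:
        raise ValueError("vcuda-core must be a positive integer")
    if request < 100:
        return "shared"     # a fraction of one card
    if request % 100 == 0:
        return "exclusive"  # one or more whole cards
    raise ValueError(f"{request} is above 100 but not a multiple of 100")


for value in (30, 100, 200):
    print(value, classify_vcuda_core(value))
```

A request such as 150 raises ValueError here, which mirrors the statement that non-multiples of 100 above one card cannot be scheduled.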
The overall GPUManager design is as follows:
[Figure: overall GPUManager architecture]
III. Deploying GPUManager
GitHub repositories:
gpu-admission: https://github.com/tkestack/gpu-admission
gpu-manager: https://github.com/tkestack/gpu-manager
1. Driver installation
Reference: https://www.cnblogs.com/deny/p/16305945.html
2. Deployment
1) Deploy the gpu-quota-admission service
kubectl apply -f gpu-admission.yaml
The contents of gpu-admission.yaml are as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-quota-admission
  namespace: kube-system
data:
  gpu-quota-admission.config: |
    {
      "QuotaConfigMapName": "gpuquota",
      "QuotaConfigMapNamespace": "kube-system",
      "GPUModelLabel": "gaia.tencent.com/gpu-model",
      "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
    }
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-quota-admission
  namespace: kube-system
spec:
  ports:
    - port: 3456
      protocol: TCP
      targetPort: 3456
  selector:
    k8s-app: gpu-quota-admission
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: gpu-quota-admission
  name: gpu-quota-admission
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: gpu-quota-admission
  template:
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      namespace: kube-system
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: Exists
              weight: 1
      containers:
        - env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: --incluster-mode=true
          image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
          imagePullPolicy: IfNotPresent
          name: gpu-quota-admission
          ports:
            - containerPort: 3456
              protocol: TCP
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
            requests:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - mountPath: /root/gpu-quota-admission/
              name: config
      dnsPolicy: ClusterFirstWithHostNet
      initContainers:
        - command:
            - sh
            - -c
            - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config /etc/kubernetes/'
          image: busybox
          imagePullPolicy: Always
          name: init-kube-config
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /root/gpu-quota-admission/
              name: config
      priority: 2000000000
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: gpu-manager
      serviceAccountName: gpu-manager
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      volumes:
        - configMap:
            defaultMode: 420
            name: gpu-quota-admission
          name: config
2) Deploy gpu-manager-daemonset
kubectl apply -f gpu-manager.yaml
The contents of gpu-manager.yaml are as follows:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: gpu-manager
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-manager-metric
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    name: gpu-manager-ds
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run node has gpu device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
3) Label each GPU node with nvidia-device-enable=enable (replace *.*.*.* below with the node's name)
kubectl label node *.*.*.* nvidia-device-enable=enable
4) Verify that gpu-manager-daemonset has been rolled out to the GPU nodes
kubectl get pods -n kube-system
3. Custom scheduler
1) Prepare the custom scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following contents:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsHostPorts" },
    { "name": "PodFitsResources" },
    { "name": "NoDiskConflict" },
    { "name": "MatchNodeSelector" },
    { "name": "HostName" }
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}
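The host name and port in "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler" point at the gpu-quota-admission Service deployed above. Change them only if your setup requires it; otherwise the value works as written.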
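Because kube-scheduler reads this file at startup, it may be worth checking the JSON before copying it into place. The sketch below parses the policy text and lists each extender's host, port, and filter verb. The helper name check_policy is made up for illustration, and the embedded text is a shortened copy of the policy above.

```python
import json
from urllib.parse import urlparse

# Shortened copy of the policy file from this section.
POLICY_TEXT = """
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [{ "name": "PodFitsResources" }],
  "extenders": [
    {
      "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ]
}
"""


def check_policy(text: str) -> list:
    """Return (host, port, filterVerb) for every extender, or raise ValueError."""
    policy = json.loads(text)  # malformed JSON raises json.JSONDecodeError
    if policy.get("kind") != "Policy":
        raise ValueError("kind must be 'Policy'")
    extenders = policy.get("extenders") or []
    if not extenders:
        raise ValueError("no extenders configured")
    endpoints = []
    for extender in extenders:
        url = urlparse(extender["urlPrefix"])
        if url.scheme not in ("http", "https") or not url.hostname or url.port is None:
            raise ValueError("urlPrefix needs a scheme, host and port")
        endpoints.append((url.hostname, url.port, extender.get("filterVerb")))
    return endpoints


print(check_policy(POLICY_TEXT))
```

To check the real file, read /etc/kubernetes/scheduler-policy-config.json and pass its contents to check_policy.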
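2) Modify the scheduler's manifest file
If k8s was deployed with kubeadm, the scheduler runs as a Pod. kubelet watches the manifest files, and when one changes it automatically restarts the Pod to load the new configuration. So all we need to do here is edit the scheduler's manifest file.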
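Back up the manifest first. Keep the backup outside /etc/kubernetes/manifests, because kubelet treats every non-hidden file in that directory as a static Pod manifest:
cp /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/kube-scheduler.yaml.bak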
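Add the following two lines under the command key. Both flags belong to the legacy scheduler Policy API, which was removed in Kubernetes v1.23; the manifest in this article uses v1.19.8: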
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
- --use-legacy-policy-config=true
The modified file looks like this:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - command:
        - kube-scheduler
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --bind-address=127.0.0.1
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
        - --port=0
        - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json  # added
        - --use-legacy-policy-config=true  # added
      image: 10.2.57.16:5000/kubernetes/kube-scheduler:v1.19.8
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 8
        httpGet:
          host: 127.0.0.1
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      name: kube-scheduler
      resources:
        requests:
          cpu: 100m
      startupProbe:
        failureThreshold: 24
        httpGet:
          host: 127.0.0.1
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-policy-config.json  # added: mount the policy file
          name: policyconfig
          readOnly: true
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet  # changed: DNS policy
  priorityClassName: system-node-critical
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: FileOrCreate
      name: kubeconfig
    - hostPath:  # added: host path of the policy file
        path: /etc/kubernetes/scheduler-policy-config.json
        type: FileOrCreate
      name: policyconfig
status: {}
The change takes effect automatically once the file is saved.
You can confirm it with the following command:
[root@cri3dp1 manifests]# kubectl -n kube-system get pod | grep sch
kube-scheduler-cri3dp1   1/1   Running   0   141m
In the output, find the Pod named kube-scheduler-XXX and check its AGE column to see whether it was just started. If it was, the scheduler configuration has been updated.
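If the scheduler Pod does not come back up, a first thing to check is whether the manifest carries all of the settings from step 2). The sketch below does a plain text match rather than parsing YAML, so it only catches missing lines, not indentation mistakes; the helper name missing_settings is made up for illustration.

```python
# Settings that step 2) adds to kube-scheduler.yaml.
REQUIRED_SETTINGS = (
    "--policy-config-file=/etc/kubernetes/scheduler-policy-config.json",
    "--use-legacy-policy-config=true",
    "mountPath: /etc/kubernetes/scheduler-policy-config.json",
    "dnsPolicy: ClusterFirstWithHostNet",
)


def missing_settings(manifest_text: str) -> list:
    """Return the required settings that do not appear in the manifest text."""
    return [item for item in REQUIRED_SETTINGS if item not in manifest_text]


# A deliberately incomplete fragment: the second flag was forgotten.
sample = """
    - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
    - mountPath: /etc/kubernetes/scheduler-policy-config.json
  dnsPolicy: ClusterFirstWithHostNet
"""

print(missing_settings(sample))
```

For the real file, pass the contents of /etc/kubernetes/manifests/kube-scheduler.yaml; an empty list means every setting is present.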
4. Inspect the GPU node
[root@cri3dp1 manifests]# kubectl describe node k8s-node3
.........
Capacity:
  cpu:                       20
  ephemeral-storage:         958487280Ki
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    65492456Ki
  pods:                      110
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  96
Allocatable:
  cpu:                       20
  ephemeral-storage:         883341875786
  hugepages-1Gi:             0
  hugepages-2Mi:             0
  memory:                    65390056Ki
  pods:                      110
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  96
.........
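The two tencent.com values in the output above are expressed in GPUManager units rather than cards or bytes. Using the unit sizes given earlier (100 core units per card, 256 MiB per memory unit), a short conversion makes them readable. The function name describe_capacity is made up for illustration.

```python
CORE_UNITS_PER_CARD = 100   # one GPU card is 100 vcuda-core units
MIB_PER_MEMORY_UNIT = 256   # one vcuda-memory unit is 256 MiB


def describe_capacity(vcuda_core: int, vcuda_memory: int) -> dict:
    """Convert node capacity from GPUManager units to cards and GiB."""
    return {
        "cards": vcuda_core / CORE_UNITS_PER_CARD,
        "memory_gib": vcuda_memory * MIB_PER_MEMORY_UNIT / 1024,
    }


# Values reported for k8s-node3 above.
print(describe_capacity(100, 96))  # {'cards': 1.0, 'memory_gib': 24.0}
```

So this node exposes one GPU card with 24 GiB of GPU memory to the scheduler.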
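IV. Testing
The tests use the TensorFlow framework. The test image ships with the MNIST, CIFAR-10, and AlexNet benchmarks, so you can pick whichever test suits your needs.
Test steps:
1. Verify with the TensorFlow framework and the MNIST dataset. TensorFlow image: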
ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
2. Create a test workload with the following YAML file:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
        - command:
            - sleep
            - 360000s
          env:
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
          image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
          imagePullPolicy: IfNotPresent
          name: tensorflow-test
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"
            requests:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"
3. Enter the test container (it runs in the default namespace; if you changed the test YAML, specify the namespace as needed)
kubectl exec -it `kubectl get pods -l k8s-app=vcuda-test -o name | cut -d '/' -f2` -- bash
4. Run a test command; choose the training benchmark and dataset you need
a. Mnist
cd /data/tensorflow/mnist && time python convolutional.py
b. AlexNet
cd /data/tensorflow/alexnet && time python alexnet_benchmark.py
c. Cifar10
cd /data/tensorflow/cifar10 && time python cifar10_train.py
5. On the physical host, watch GPU utilization with the nvidia-smi pmon -s u -d 1 command
V. Using GPUs in Pods
YAML examples follow:
1) An application using one full P4 card:
apiVersion: v1
kind: Pod
...
spec:
  containers:
    - name: gpu
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
          tencent.com/vcuda-core: "100"
        requests:
          cpu: "4"
          memory: 8Gi
          tencent.com/vcuda-core: "100"
2) An application using 0.3 of a card and 5 GiB of GPU memory:
apiVersion: v1
kind: Pod
...
spec:
  containers:
    - name: gpu
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
          tencent.com/vcuda-core: "30"
          tencent.com/vcuda-memory: "20"
        requests:
          cpu: "4"
          memory: 8Gi
          tencent.com/vcuda-core: "30"
          tencent.com/vcuda-memory: "20"
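The numbers in both examples follow from the unit sizes: 0.3 of a card is 30 core units, and 5 GiB is 20 memory units of 256 MiB. The sketch below derives the resource entries from a card fraction and a memory size. The helper name vcuda_resources is made up for illustration, and rounding memory up to the next whole unit is an assumption of this sketch, not documented gpu-manager behavior.

```python
import math
from typing import Optional

CORE_UNITS_PER_CARD = 100   # one GPU card is 100 vcuda-core units
MIB_PER_MEMORY_UNIT = 256   # one vcuda-memory unit is 256 MiB


def vcuda_resources(cards: float, memory_gib: Optional[float] = None) -> dict:
    """Build the tencent.com/* resource entries for a Pod spec."""
    core = round(cards * CORE_UNITS_PER_CARD)
    above_one_card = core > CORE_UNITS_PER_CARD
    if core <= 0 or (above_one_card and core % CORE_UNITS_PER_CARD != 0):
        raise ValueError("request a fraction of one card or a whole number of cards")
    resources = {"tencent.com/vcuda-core": str(core)}
    if memory_gib is not None:
        # Assumption: round up so the Pod never gets less memory than asked for.
        units = math.ceil(memory_gib * 1024 / MIB_PER_MEMORY_UNIT)
        resources["tencent.com/vcuda-memory"] = str(units)
    return resources


print(vcuda_resources(1))       # whole card, as in example 1)
print(vcuda_resources(0.3, 5))  # 0.3 card and 5 GiB, as in example 2)
```

The same dictionary goes under both limits and requests, as in the YAML above.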