Installing the NVIDIA GPU Operator on y-k8s¶
Note
A previous pass at installing the NVIDIA GPU Operator was done on the z-k8s cluster of my private-cloud architecture. At that time I had not yet set up NVIDIA Virtual GPU (vGPU), so the complete Nvidia Tesla P10 GPU card was attached directly via OVMF-based passthrough of the GPU and NVMe storage.
To better simulate large-scale GPU Kubernetes, I redeployed a Kubernetes cluster (y-k8s) (built with the Kubespray quickstart) to run a Machine Learning Atlas on multiple NVIDIA Virtual GPU (vGPU) devices.
This article is a second pass at installing the NVIDIA GPU Operator.
Preparation¶
Before installing the NVIDIA GPU Operator, make sure the Kubernetes cluster (y-k8s) meets the following requirements:
Kubernetes worker nodes have a container engine configured, such as Docker CE/EE, cri-o, or containerd. (Note: the NVIDIA GPU Operator configures the NVIDIA container runtime on nodes automatically, so there is no need to manually install the NVIDIA Container Toolkit for containerd; a standard container runtime installation is sufficient.) Clusters deployed with the Kubespray quickstart use containerd by default.
Node Feature Discovery (NFD) is deployed on every node: by default the NVIDIA GPU Operator deploys the NFD master and worker automatically.
On Kubernetes 1.13 and 1.14, the KubeletPodResources feature gate must be enabled on the kubelet; from Kubernetes 1.15 onward it is enabled by default.
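For clusters still on 1.13/1.14, the feature gate can be switched on in the kubelet configuration file; a minimal sketch (the config file path is an assumption and varies by distribution):

```yaml
# /var/lib/kubelet/config.yaml -- path may differ on your distro
featureGates:
  KubeletPodResources: true
```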
In addition, confirm the following:
Every hypervisor host backing an NVIDIA Virtual GPU (vGPU)-accelerated Kubernetes worker VM must first have the NVIDIA vGPU Host Driver version 12.0 (or later) installed:
The NVIDIA vGPU License Server must be installed and serve all Kubernetes VM nodes
A private registry must be deployed so the NVIDIA vGPU-specific driver container image can be uploaded
Every Kubernetes worker node must be able to reach the private registry
Git and Docker/Podman are needed to build the vGPU driver image from the source repository and push it to the private registry
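The build-and-push step can be sketched as follows. The registry address, driver version, and OS tag are placeholders to substitute with your own values; the repository layout follows NVIDIA's public driver container build repository:

```shell
# Placeholders -- replace with your private registry and vGPU version
PRIVATE_REGISTRY=registry.example.com:5000
VGPU_DRIVER_VERSION=525.60.13
OS_TAG=ubuntu22.04

# Clone NVIDIA's driver container build repository
git clone https://gitlab.com/nvidia/container-images/driver
cd driver/${OS_TAG}

# Build the vGPU guest driver image (the vGPU guest driver .run file
# from the NVIDIA vGPU software package must be copied in beforehand)
docker build \
  --build-arg DRIVER_TYPE=vgpu \
  --build-arg DRIVER_VERSION=${VGPU_DRIVER_VERSION} \
  -t ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG} .

# Push to the private registry reachable from every worker node
docker push ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG}
```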
Installing the NVIDIA GPU Operator¶
Install helm:
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Operands¶
Note
This step is skipped here: both workers in my y-k8s cluster already have NVIDIA Virtual GPU (vGPU) configured, and there is no node that must be excluded from deployment (the NVIDIA GPU Operator automatically deploys only to nodes that have a GPU).
By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the label feature.node.kubernetes.io/pci-10de.present=true, where 0x10de is the PCI vendor ID assigned to NVIDIA.
First, label the cluster nodes that have an NVIDIA GPU installed:
kubectl label nodes y-k8s-n-1 feature.node.kubernetes.io/pci-10de.present=true
To prevent operands from being deployed on a GPU worker node, label that node with nvidia.com/gpu.deploy.operands=false:
kubectl label nodes y-k8s-n-2 nvidia.com/gpu.deploy.operands=false
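To double-check how nodes ended up labeled, the two relevant label keys can be listed as extra columns (`-L` adds a column per label key):

```shell
# Show the NVIDIA PCI vendor label and the operands opt-out label per node
kubectl get nodes \
  -L feature.node.kubernetes.io/pci-10de.present \
  -L nvidia.com/gpu.deploy.operands
```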
Deploying the GPU Operator¶
There are multiple installation scenarios; the following is the common one. Choose the method that fits your setup:
Bare-metal/Passthrough on Ubuntu with the default configuration (good, exactly the scenario I need, since only one of my nodes has the Nvidia Tesla P10 GPU card attached via passthrough)
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Installation output:
NAME: gpu-operator-1690303523
LAST DEPLOYED: Wed Jul 26 00:45:31 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
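For a vGPU cluster (rather than the default passthrough path used above), the chart would instead be pointed at the private registry holding the vGPU driver image and at the ConfigMap carrying the license configuration; a sketch with placeholder registry, version, and ConfigMap name:

```shell
# Placeholders: adjust registry, driver version, and ConfigMap name
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.repository=registry.example.com:5000 \
  --set driver.version=525.60.13 \
  --set driver.licensingConfig.configMapName=licensing-config
```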
When finished, check:
kubectl get pods -n gpu-operator -o wide
The following pods are running:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-bksgp 1/1 Running 0 3m52s 10.233.89.132 y-k8s-n-1 <none> <none>
gpu-feature-discovery-wstlz 1/1 Running 0 3m52s 10.233.78.71 y-k8s-n-2 <none> <none>
gpu-operator-1690303523-node-feature-discovery-master-6f5b7rdpm 1/1 Running 0 7m46s 10.233.109.74 y-k8s-m-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-bblkw 1/1 Running 0 7m46s 10.233.109.75 y-k8s-m-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-dw5gs 1/1 Running 0 7m47s 10.233.93.213 y-k8s-m-3 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-l7hbw 1/1 Running 0 7m47s 10.233.89.129 y-k8s-n-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-nc2dh 1/1 Running 0 7m46s 10.233.78.67 y-k8s-n-2 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-sb5hn 1/1 Running 0 7m47s 10.233.121.11 y-k8s-m-2 <none> <none>
gpu-operator-56849f4cc-82vqm 1/1 Running 0 7m46s 10.233.78.66 y-k8s-n-2 <none> <none>
nvidia-container-toolkit-daemonset-g7sf9 1/1 Running 0 3m55s 10.233.89.131 y-k8s-n-1 <none> <none>
nvidia-container-toolkit-daemonset-tgjqk 1/1 Running 0 3m55s 10.233.78.69 y-k8s-n-2 <none> <none>
nvidia-cuda-validator-45ngk 0/1 Completed 0 2m35s 10.233.78.73 y-k8s-n-2 <none> <none>
nvidia-cuda-validator-wjqvw 0/1 Completed 0 2m22s 10.233.89.136 y-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-85qt9 1/1 Running 0 3m53s 10.233.89.133 y-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-sgnkt 1/1 Running 0 3m53s 10.233.78.72 y-k8s-n-2 <none> <none>
nvidia-device-plugin-daemonset-2p6hw 1/1 Running 0 3m54s 10.233.78.74 y-k8s-n-2 <none> <none>
nvidia-device-plugin-daemonset-bccw8 1/1 Running 0 3m54s 10.233.89.134 y-k8s-n-1 <none> <none>
nvidia-device-plugin-validator-c4lcq 0/1 Completed 0 70s 10.233.78.75 y-k8s-n-2 <none> <none>
nvidia-device-plugin-validator-h8v2q 0/1 Completed 0 66s 10.233.89.137 y-k8s-n-1 <none> <none>
nvidia-operator-validator-mrhp9 1/1 Running 0 3m55s 10.233.89.135 y-k8s-n-1 <none> <none>
nvidia-operator-validator-vkgx5 1/1 Running 0 3m54s 10.233.78.70 y-k8s-n-2 <none> <none>
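Once the validators complete, each GPU worker should advertise an nvidia.com/gpu resource in its allocatable capacity; a quick check against one of this cluster's workers (jsonpath requires escaping the dots in the resource name):

```shell
# Print the number of allocatable GPUs reported by the node
kubectl get node y-k8s-n-1 \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```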
Running a sample GPU application¶
CUDA VectorAdd¶
First, run a simple CUDA example provided by NVIDIA that adds two vectors:
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Output:
pod/cuda-vectoradd created
During the first pass at installing the NVIDIA GPU Operator I ran into problems here; see the original article for the troubleshooting.
Check:
kubectl get pods -o wide
Status output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cuda-vectoradd 0/1 Completed 0 24s 10.233.89.138 y-k8s-n-1 <none> <none>
Check the logs to see the computation result:
kubectl logs cuda-vectoradd
The output below shows that the NVIDIA GPU Operator is working correctly:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Note
If the pod is not deleted after the computation finishes, the GPU remains occupied and no new GPU workloads can be scheduled onto it. Delete pods in Completed state to release the GPU resource. See GPU node scheduling troubleshooting for details.
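Releasing the GPU amounts to deleting the finished pod; the second command sweeps all Completed (phase Succeeded) pods in the current namespace:

```shell
# Free the GPU held by the sample pod
kubectl delete pod cuda-vectoradd

# Or clean up every Completed pod in the current namespace
kubectl delete pods --field-selector=status.phase=Succeeded
```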
Next steps¶
What fun things come next?
Try deploying Stable Diffusion on Kubernetes to experience GPU-accelerated Textual Inversion image generation