Installing NVIDIA GPU Operator¶
Prerequisites¶
Before installing the NVIDIA GPU Operator, make sure the Kubernetes cluster ( Kubernetes cluster (z-k8s) ) meets the following requirements:
Kubernetes worker nodes have a container engine configured, such as Docker CE/EE, the cri-o container runtime, or the containerd runtime ( Installing NVIDIA Container Toolkit for containerd )
Node Feature Discovery (NFD) must be deployed on every node: by default the NVIDIA GPU Operator deploys the NFD master and worker automatically
On Kubernetes 1.13 and 1.14, the KubeletPodResources feature of the kubelet must be enabled; from Kubernetes 1.15 onward it is enabled by default
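On those older releases the feature gate can be enabled in the kubelet configuration, roughly as in this sketch (the file path and how the kubelet loads it depend on how the node was provisioned):
# /var/lib/kubelet/config.yaml (KubeletConfiguration) -- sketch, only needed on Kubernetes 1.13/1.14
featureGates:
  KubeletPodResources: true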
In addition, confirm the following:
Each hypervisor host running NVIDIA Virtual GPU (vGPU) accelerated Kubernetes worker-node VMs must first have the NVIDIA vGPU Host Driver version 12.0 (or later) installed
An NVIDIA vGPU License Server must be installed to serve all Kubernetes VM nodes
A private registry must be deployed so that the NVIDIA vGPU specific driver container image can be pushed to it
Every Kubernetes worker node must be able to reach the private registry
Git and Docker/Podman are needed to build the vGPU driver image (from the source repository) and push it to the private registry (a rough sketch of this workflow follows below)
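A hedged sketch of that build-and-push workflow; the repository URL, registry address, and image tag are assumptions, and the exact build arguments for the vGPU variant are described in NVIDIA's documentation:
# Sketch only: clone the NVIDIA driver container repository and build/push a vGPU driver image
# (registry.example.com and the tag are placeholders for your private registry)
git clone https://gitlab.com/nvidia/container-images/driver.git
cd driver
# ...prepare the vGPU guest driver package and build arguments per NVIDIA's instructions...
docker build -t registry.example.com/nvidia/driver:${VGPU_DRIVER_VERSION}-ubuntu20.04 .
docker push registry.example.com/nvidia/driver:${VGPU_DRIVER_VERSION}-ubuntu20.04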
Installing NVIDIA GPU Operator¶
Install helm:
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
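Optionally verify the installed client:
helm version --short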
Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Operands¶
By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the label feature.node.kubernetes.io/pci-10de.present=true, where 0x10de is the PCI vendor ID assigned to NVIDIA
First, label the nodes in the cluster that have an NVIDIA GPU installed:
kubectl label nodes z-k8s-n-1 feature.node.kubernetes.io/pci-10de.present=true
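To check which nodes carry the label:
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true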
To prevent operands from being deployed on a GPU worker node, label that node with nvidia.com/gpu.deploy.operands=false :
kubectl label nodes z-k8s-n-2 nvidia.com/gpu.deploy.operands=false
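To allow operands on that node again later, remove the label (note the trailing "-"):
kubectl label nodes z-k8s-n-2 nvidia.com/gpu.deploy.operands-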
Deploying the GPU Operator¶
There are several installation scenarios; the following is a common one, so choose the method that fits your setup
Bare-metal/Passthrough with the default configuration on Ubuntu (great, exactly the scenario I need, since I only have one node with a passthrough NVIDIA Tesla P10 GPU compute card)
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
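If the NVIDIA driver is already installed in the worker node's operating system, a common variation (sketch; check the chart values of your operator version) is to skip the operator-managed driver container:
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false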
After the installation completes, check:
kubectl get pods -n gpu-operator -o wide
You should see the following pods running:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-sc6z5 1/1 Running 0 77m 10.0.3.131 z-k8s-n-1 <none> <none>
gpu-operator-1673526262-node-feature-discovery-master-6594glnlk 1/1 Running 0 77m 10.0.1.130 z-k8s-m-2 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-2x4w9 1/1 Running 0 77m 10.0.5.172 z-k8s-n-3 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-7l8c5 1/1 Running 0 77m 10.0.4.133 z-k8s-n-2 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-7rn5w 1/1 Running 0 77m 10.0.7.75 z-k8s-n-4 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-c6g75 1/1 Running 0 77m 10.0.6.251 z-k8s-n-5 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-x52bp 1/1 Running 0 77m 10.0.3.147 z-k8s-n-1 <none> <none>
gpu-operator-7bd648f56b-h5bdp 1/1 Running 0 77m 10.0.5.86 z-k8s-n-3 <none> <none>
nvidia-container-toolkit-daemonset-sc7xl 1/1 Running 0 77m 10.0.3.150 z-k8s-n-1 <none> <none>
nvidia-cuda-validator-knjt4 0/1 Completed 0 77m 10.0.3.25 z-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-csqkq 1/1 Running 0 77m 10.0.3.63 z-k8s-n-1 <none> <none>
nvidia-device-plugin-daemonset-vldq6 1/1 Running 0 77m 10.0.3.85 z-k8s-n-1 <none> <none>
nvidia-device-plugin-validator-zcrjm 0/1 Completed 0 77m 10.0.3.1 z-k8s-n-1 <none> <none>
nvidia-operator-validator-qz2rh 1/1 Running 0 77m 10.0.3.166 z-k8s-n-1 <none> <none>
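You can also confirm that the GPU node now advertises the nvidia.com/gpu resource:
kubectl describe node z-k8s-n-1 | grep nvidia.com/gpu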
Running a Sample GPU Application¶
CUDA VectorAdd¶
First, run a simple CUDA sample provided by NVIDIA that adds two vectors:
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Output message:
pod/cuda-vectoradd created
Here I ran into a problem with the container starting up
Troubleshooting¶
Check the pod status:
kubectl get pods -o wide
It shows the pod is NotReady:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cuda-vectoradd 1/2 NotReady 0 9h 10.0.3.168 z-k8s-n-1 <none> <none>
Check why the pod failed to start:
kubectl describe pods cuda-vectoradd
The output shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 40s default-scheduler Successfully assigned default/cuda-vectoradd to z-k8s-n-1
Normal Pulled 38s kubelet Container image "busybox:1.31.1" already present on machine
Normal Created 37s kubelet Created container dns-probe
Normal Started 37s kubelet Started container dns-probe
Normal Pulled 31s kubelet Container image "quay.io/cilium/istio_proxy:1.10.6" already present on machine
Normal Created 30s kubelet Created container istio-init
Normal Started 30s kubelet Started container istio-init
Normal Pulled 30s kubelet Container image "nvidia/samples:vectoradd-cuda11.2.1" already present on machine
Normal Created 29s kubelet Created container cuda-vectoradd
Normal Started 29s kubelet Started container cuda-vectoradd
Normal Pulled 29s kubelet Container image "quay.io/cilium/istio_proxy:1.10.6" already present on machine
Normal Created 29s kubelet Created container istio-proxy
Normal Started 29s kubelet Started container istio-proxy
Warning Unhealthy 26s (x4 over 28s) kubelet Readiness probe failed: Get "http://10.0.3.168:15021/healthz/ready": dial tcp 10.0.3.168:15021: connect: connection refused
Note
Phew, it turns out this is normal: this NVIDIA CUDA sample simply exits once the computation finishes, so its service endpoint is unreachable (it is a one-shot run). To verify that CUDA works correctly, you only need to check the pod logs (see below)
Check the logs to see the computation result:
kubectl logs cuda-vectoradd
The following output indicates that the NVIDIA GPU Operator is installed correctly:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Troubleshooting a GPU Node Scheduling Problem¶
After rebooting z-k8s-n-1 once, I found that some of the gpu-operator containers were no longer running properly:
$ kubectl get pods -n gpu-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-sc6z5 1/1 Running 1 (9m49s ago) 14h 10.0.3.249 z-k8s-n-1 <none> <none>
gpu-operator-1673526262-node-feature-discovery-master-6594glnlk 1/1 Running 0 14h 10.0.1.130 z-k8s-m-2 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-2x4w9 1/1 Running 0 14h 10.0.5.172 z-k8s-n-3 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-7l8c5 1/1 Running 0 14h 10.0.4.133 z-k8s-n-2 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-7rn5w 1/1 Running 0 14h 10.0.7.75 z-k8s-n-4 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-c6g75 1/1 Running 0 14h 10.0.6.251 z-k8s-n-5 <none> <none>
gpu-operator-1673526262-node-feature-discovery-worker-x52bp 1/1 Running 1 (9m49s ago) 14h 10.0.3.65 z-k8s-n-1 <none> <none>
gpu-operator-7bd648f56b-h5bdp 1/1 Running 0 14h 10.0.5.86 z-k8s-n-3 <none> <none>
nvidia-container-toolkit-daemonset-sc7xl 1/1 Running 1 (9m49s ago) 14h 10.0.3.52 z-k8s-n-1 <none> <none>
nvidia-cuda-validator-qwq9d 0/1 Completed 0 7m42s 10.0.3.58 z-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-csqkq 1/1 Running 1 (9m48s ago) 14h 10.0.3.102 z-k8s-n-1 <none> <none>
nvidia-device-plugin-daemonset-vldq6 1/1 Running 1 (9m49s ago) 14h 10.0.3.193 z-k8s-n-1 <none> <none>
nvidia-device-plugin-validator-vz52p 0/1 UnexpectedAdmissionError 0 97s <none> z-k8s-n-1 <none> <none>
nvidia-operator-validator-qz2rh 0/1 Init:3/4 2 (109s ago) 14h 10.0.3.229 z-k8s-n-1 <none> <none>
Use kubectl describe pods to check why they did not start:
$ kubectl -n gpu-operator describe pods nvidia-operator-validator-qz2rh
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NodeNotReady 16m node-controller Node is not ready
Warning FailedCreatePodSandBox 13m (x2 over 13m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Warning FailedCreatePodSandBox 12m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3e375ba6472653f520eb42a19a99acc288aeca922852251b0fc13f4470eaea67": plugin type="cilium-cni" name="cilium" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Normal SandboxChanged 12m (x5 over 14m) kubelet Pod sandbox changed, it will be killed and re-created.
...
Why does failed to get sandbox runtime: no runtime for "nvidia" is configured appear?
Note
Checking z-k8s-n-1, it turned out that the NVIDIA GPU Operator installation modifies /etc/containerd/config.toml, wiping out the configuration I had revised earlier via containerd-config.path. It looks like the containerd configuration patch described in the official NVIDIA documentation may no longer be necessary: installing nvidia-container-toolkit appears to fix the configuration automatically. I will verify this guess when I get the chance
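For reference, the nvidia runtime entry that ends up in /etc/containerd/config.toml looks roughly like the sketch below (exact section names can vary between containerd versions); if this block is missing, containerd reports no runtime for "nvidia" is configured:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"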
The reason nvidia-device-plugin-validator did not start is that it found 0 available GPU devices and therefore could not run:
$ kubectl -n gpu-operator describe pods nvidia-device-plugin-validator-fg2gd
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning UnexpectedAdmissionError 2m51s kubelet Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
Why?
Referring to the article "What on earth is the UnexpectedAdmissionError pod status?" gave me some useful hints:
Because I am using passthrough GPU and NVMe storage with OVMF rather than NVIDIA Virtual GPU (vGPU), the virtual machine actually has only one GPU card
A passthrough GPU card can only be allocated to a single pod container; once it is allocated, a second pod that needs the GPU cannot be scheduled onto this node
Among the NVIDIA GPU Operator pods, nvidia-device-plugin-validator is a special pod: after every node startup it runs once and then enters the Completed state. Its job is to check that the worker node has an NVIDIA GPU device, and once the check passes the pod terminates itself. Another pod with a similar validation role is nvidia-cuda-validator, which likewise runs its check once at startup
Earlier, to verify the NVIDIA GPU Operator, I ran the simple CUDA sample that adds two vectors. That pod is not deleted automatically after it finishes and stays in the NotReady state. This is exactly the problem: the pod keeps holding the GPU device, so subsequent pods, such as Deploying Stable Diffusion on Kubernetes, cannot be scheduled. When I rebooted the GPU worker node, the CUDA vector-add sample pod was still there and grabbed the GPU device at startup, which left nvidia-device-plugin-validator unable to obtain a GPU device for its check, so validation failed
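A quick way to confirm that the GPU is exhausted is to compare the node's allocatable GPUs with what is already allocated:
kubectl describe node z-k8s-n-1 | grep -A 10 'Allocated resources'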
Solution
Delete the unused CUDA vector-add sample pod; you will then see nvidia-device-plugin-validator run to completion and enter the Completed state (the command is shown below). This also fixes the scheduling failure for Deploying Stable Diffusion on Kubernetes
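Deleting the sample pod is a single command:
kubectl delete pod cuda-vectoradd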
Next Steps¶
What fun things can we do next?
Try Deploying Stable Diffusion on Kubernetes and experience GPU-accelerated image generation with Textual Inversion