Installing the NVIDIA GPU Operator

Prerequisites

Before installing the NVIDIA GPU Operator, make sure the Kubernetes cluster ( Kubernetes cluster (z-k8s) ) meets the operator's prerequisites.

In addition, you also need to confirm the following:

  • Every hypervisor host that runs NVIDIA Virtual GPU (vGPU) accelerated Kubernetes worker node VMs must first have NVIDIA vGPU Host Driver version 12.0 (or later) installed

  • An NVIDIA vGPU License Server must be installed to serve all Kubernetes VM nodes

  • A private registry must be deployed so that the NVIDIA vGPU specific driver container image can be uploaded

  • Every Kubernetes worker node must be able to access the private registry

  • Git and Docker/Podman are needed to build the vGPU driver image from the source repository and push it to the private registry (a hedged example sketch follows below)
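
The build-and-push step above can look roughly like the following. This is only a hedged sketch: registry.example.com, the driver version, and the ./driver source directory are hypothetical placeholders, and the actual build arguments depend on the NVIDIA driver container sources you use.

Example (hypothetical): build and push the vGPU driver container image
# Hypothetical private registry and driver version; replace with your own values
REGISTRY=registry.example.com/nvidia
DRIVER_VERSION=525.60.13
# Build the image from the checked-out driver container sources (assumed to be in ./driver)
cd driver
podman build -t ${REGISTRY}/driver:${DRIVER_VERSION}-ubuntu22.04 .
# Push the image so that every Kubernetes worker node can pull it
podman push ${REGISTRY}/driver:${DRIVER_VERSION}-ubuntu22.04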

Install the NVIDIA GPU Operator

Install helm on Linux
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
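
As a quick sanity check, confirm that the helm binary is on the PATH and reports the expected version:

Verify the helm installation
helm version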
  • Add the NVIDIA Helm repository:

Add the NVIDIA repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update
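
Optionally, before installing you can list the gpu-operator chart versions available in the newly added repository:

List the available gpu-operator chart versions
helm search repo nvidia/gpu-operator --versions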

Operands

By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the label feature.node.kubernetes.io/pci-10de.present=true, where 0x10de is the PCI vendor ID assigned to NVIDIA.
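
To double-check on the worker node itself that a PCI device with NVIDIA's vendor ID 0x10de is actually present, a quick lspci query is enough (run directly on the node, not via kubectl):

Check for NVIDIA PCI devices on the worker node
lspci -nn | grep 10de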

  • First, label the nodes in the cluster that have an NVIDIA GPU installed:

Label the NVIDIA GPU worker node of the Kubernetes cluster
kubectl label nodes z-k8s-n-1 feature.node.kubernetes.io/pci-10de.present=true
  • To prevent operands from being deployed on a GPU worker node, label that node with nvidia.com/gpu.deploy.operands=false :

Label a GPU worker node of the Kubernetes cluster to disable operand deployment
kubectl label nodes z-k8s-n-2 nvidia.com/gpu.deploy.operands=false
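
To confirm that the labels took effect, list the nodes using the same label names as above:

Verify the GPU-related node labels
kubectl get nodes -L nvidia.com/gpu.deploy.operands -l feature.node.kubernetes.io/pci-10de.present=true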

Deploy the GPU Operator

There are multiple installation scenarios; the ones below are common, so pick the method that fits your environment.

  • Bare-metal/passthrough on Ubuntu with the default configuration (great, exactly the scenario I need, since I only have one node with a passthrough NVIDIA Tesla P10 GPU card)

Default Bare-metal/Passthrough configuration on Ubuntu, install the GPU Operator with helm
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator
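
Optionally, the Helm release itself can be checked before looking at the individual pods:

Check the gpu-operator Helm release
helm list -n gpu-operator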
  • After installation, check:

After installing the GPU Operator, check the nvidia gpu-operator related pods in the cluster
kubectl get pods -n gpu-operator -o wide

You can see the following pods running:

nvidia gpu-operator related pods in the cluster after installing the GPU Operator
NAME                                                              READY   STATUS    RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
gpu-feature-discovery-sc6z5                                       1/1     Running   0          77m   10.0.3.131   z-k8s-n-1   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-master-6594glnlk   1/1     Running   0          77m   10.0.1.130   z-k8s-m-2   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-2x4w9       1/1     Running   0          77m   10.0.5.172   z-k8s-n-3   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-7l8c5       1/1     Running   0          77m   10.0.4.133   z-k8s-n-2   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-7rn5w       1/1     Running   0          77m   10.0.7.75    z-k8s-n-4   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-c6g75       1/1     Running   0          77m   10.0.6.251   z-k8s-n-5   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-x52bp       1/1     Running   0          77m   10.0.3.147   z-k8s-n-1   <none>           <none>
gpu-operator-7bd648f56b-h5bdp                                     1/1     Running   0          77m   10.0.5.86    z-k8s-n-3   <none>           <none>
nvidia-container-toolkit-daemonset-sc7xl                          1/1     Running   0          77m   10.0.3.150   z-k8s-n-1   <none>           <none>
nvidia-cuda-validator-knjt4                                       0/1     Completed 0          77m   10.0.3.25    z-k8s-n-1   <none>           <none>
nvidia-dcgm-exporter-csqkq                                        1/1     Running   0          77m   10.0.3.63    z-k8s-n-1   <none>           <none>
nvidia-device-plugin-daemonset-vldq6                              1/1     Running   0          77m   10.0.3.85    z-k8s-n-1   <none>           <none>
nvidia-device-plugin-validator-zcrjm                              0/1     Completed 0          77m   10.0.3.1     z-k8s-n-1   <none>           <none>
nvidia-operator-validator-qz2rh                                   1/1     Running   0          77m   10.0.3.166   z-k8s-n-1   <none>           <none>

Run a sample GPU application

CUDA VectorAdd

  • First, you can run a simple CUDA sample provided by NVIDIA that adds two vectors:

Run a simple CUDA sample that adds two vectors
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
         nvidia.com/gpu: 1
EOF

Output:

Running the simple CUDA sample reports successful creation
pod/cuda-vectoradd created

Here I ran into what looked like a container startup problem.

Troubleshooting

  • Check the pod status:

Check the CUDA sample pod
kubectl get pods -o wide

You can see it is not ready (NotReady):

NAME                              READY   STATUS      RESTARTS      AGE     IP           NODE        NOMINATED NODE   READINESS GATES
cuda-vectoradd                    1/2     NotReady    0             9h      10.0.3.168   z-k8s-n-1   <none>           <none>
  • Check why the pod failed to start:

    kubectl describe pods cuda-vectoradd
    

The output shows:

Troubleshooting why the simple CUDA sample failed (actually it is fine, see below)
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  40s                default-scheduler  Successfully assigned default/cuda-vectoradd to z-k8s-n-1
  Normal   Pulled     38s                kubelet            Container image "busybox:1.31.1" already present on machine
  Normal   Created    37s                kubelet            Created container dns-probe
  Normal   Started    37s                kubelet            Started container dns-probe
  Normal   Pulled     31s                kubelet            Container image "quay.io/cilium/istio_proxy:1.10.6" already present on machine
  Normal   Created    30s                kubelet            Created container istio-init
  Normal   Started    30s                kubelet            Started container istio-init
  Normal   Pulled     30s                kubelet            Container image "nvidia/samples:vectoradd-cuda11.2.1" already present on machine
  Normal   Created    29s                kubelet            Created container cuda-vectoradd
  Normal   Started    29s                kubelet            Started container cuda-vectoradd
  Normal   Pulled     29s                kubelet            Container image "quay.io/cilium/istio_proxy:1.10.6" already present on machine
  Normal   Created    29s                kubelet            Created container istio-proxy
  Normal   Started    29s                kubelet            Started container istio-proxy
  Warning  Unhealthy  26s (x4 over 28s)  kubelet            Readiness probe failed: Get "http://10.0.3.168:15021/healthz/ready": dial tcp 10.0.3.168:15021: connect: connection refused

Note

Oops, it turns out this is normal: this NVIDIA CUDA sample simply exits once the computation finishes, so its service endpoint is unreachable (it is a one-shot run). Checking the pod log is enough to verify that CUDA works correctly (see below).

  • Check the logs to see the computation result:

Use kubectl logs to read the pod log and judge the computation result
kubectl logs cuda-vectoradd

The following output shows that the NVIDIA GPU Operator is installed and working correctly:

Use kubectl logs to read the pod log and judge the computation result
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Troubleshooting GPU node scheduling problems

  • After rebooting z-k8s-n-1 once, I found that some gpu-operator containers were no longer running properly:

After rebooting z-k8s-n-1 the nvidia-device-plugin-validator pod did not start
$ kubectl get pods -n gpu-operator -o wide
NAME                                                              READY   STATUS                     RESTARTS        AGE     IP           NODE        NOMINATED NODE   READINESS GATES
gpu-feature-discovery-sc6z5                                       1/1     Running                    1 (9m49s ago)   14h     10.0.3.249   z-k8s-n-1   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-master-6594glnlk   1/1     Running                    0               14h     10.0.1.130   z-k8s-m-2   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-2x4w9       1/1     Running                    0               14h     10.0.5.172   z-k8s-n-3   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-7l8c5       1/1     Running                    0               14h     10.0.4.133   z-k8s-n-2   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-7rn5w       1/1     Running                    0               14h     10.0.7.75    z-k8s-n-4   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-c6g75       1/1     Running                    0               14h     10.0.6.251   z-k8s-n-5   <none>           <none>
gpu-operator-1673526262-node-feature-discovery-worker-x52bp       1/1     Running                    1 (9m49s ago)   14h     10.0.3.65    z-k8s-n-1   <none>           <none>
gpu-operator-7bd648f56b-h5bdp                                     1/1     Running                    0               14h     10.0.5.86    z-k8s-n-3   <none>           <none>
nvidia-container-toolkit-daemonset-sc7xl                          1/1     Running                    1 (9m49s ago)   14h     10.0.3.52    z-k8s-n-1   <none>           <none>
nvidia-cuda-validator-qwq9d                                       0/1     Completed                  0               7m42s   10.0.3.58    z-k8s-n-1   <none>           <none>
nvidia-dcgm-exporter-csqkq                                        1/1     Running                    1 (9m48s ago)   14h     10.0.3.102   z-k8s-n-1   <none>           <none>
nvidia-device-plugin-daemonset-vldq6                              1/1     Running                    1 (9m49s ago)   14h     10.0.3.193   z-k8s-n-1   <none>           <none>
nvidia-device-plugin-validator-vz52p                              0/1     UnexpectedAdmissionError   0               97s     <none>       z-k8s-n-1   <none>           <none>
nvidia-operator-validator-qz2rh                                   0/1     Init:3/4                   2 (109s ago)    14h     10.0.3.229   z-k8s-n-1   <none>           <none>
  • Use kubectl describe pods to check why it did not start:

Use kubectl describe pods to check why nvidia-operator-validator-qz2rh did not start
$ kubectl -n gpu-operator describe pods nvidia-operator-validator-qz2rh
...
Events:
  Type     Reason                  Age                  From             Message
  ----     ------                  ----                 ----             -------
  Warning  NodeNotReady            16m                  node-controller  Node is not ready
  Warning  FailedCreatePodSandBox  13m (x2 over 13m)    kubelet          Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  12m                  kubelet          Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3e375ba6472653f520eb42a19a99acc288aeca922852251b0fc13f4470eaea67": plugin type="cilium-cni" name="cilium" failed (add): Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
  Normal   SandboxChanged          12m (x5 over 14m)    kubelet          Pod sandbox changed, it will be killed and re-created.
...

Why does failed to get sandbox runtime: no runtime for "nvidia" is configured appear?

Note

Checking z-k8s-n-1 showed that the NVIDIA GPU Operator installation modifies /etc/containerd/config.toml, which wiped out the configuration I had applied earlier via containerd-config.path. It looks like the containerd configuration patch described in the NVIDIA official documentation should no longer be necessary: installing nvidia-container-toolkit fixes the configuration automatically.

I will verify this guess when I get the chance.
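
A simple way to check whether the nvidia runtime is actually registered (a sketch; the first command runs on the worker node, the second against the cluster) is:

Check whether the nvidia container runtime is registered
# On the worker node: look for the nvidia runtime entry in the containerd configuration
grep -n nvidia /etc/containerd/config.toml
# Against the cluster: the GPU Operator creates a RuntimeClass named nvidia
kubectl get runtimeclass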

nvidia-device-plugin-validator did not start because it detected 0 available devices and therefore could not run:

kubectl describe pods shows nvidia-device-plugin-validator did not start because no device was detected
$ kubectl -n gpu-operator describe pods nvidia-device-plugin-validator-fg2gd
...
Events:
  Type     Reason                    Age    From     Message
  ----     ------                    ----   ----     -------
  Warning  UnexpectedAdmissionError  2m51s  kubelet  Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected

Why?

The article pod的状态出现UnexpectedAdmissionError是什么鬼? ("What is this UnexpectedAdmissionError pod status?") gave me some insight:

  • Because the GPU is passed through with OVMF (OVMF-based GPU and NVMe storage passthrough) rather than provided as an NVIDIA Virtual GPU (vGPU), the VM actually contains only a single GPU card

  • A passthrough GPU card can only be allocated to one pod container; once it has been allocated, a second pod that needs the GPU can no longer be scheduled onto this node (the node's GPU allocation can be checked as shown after this list)

  • Among the GPU Operator pods, nvidia-device-plugin-validator is a special one: you will see it end up in the Completed state after every node startup

    • nvidia-device-plugin-validator runs only once at node startup; its job is to verify that the worker node has an NVIDIA GPU device, and once the check passes the pod finishes

    • Another pod with a similar validation role is nvidia-cuda-validator, which also runs a one-time check at startup

  • Earlier, to verify the NVIDIA GPU Operator, I ran the simple CUDA sample that adds two vectors; that pod is not deleted automatically after it finishes and stays in the NotReady state. That is exactly the problem: the pod keeps hold of the GPU device, so subsequent pods, such as Deploying Stable Diffusion on Kubernetes, cannot be scheduled

  • After I rebooted the GPU worker node, the simple CUDA sample pod above, which had not been deleted, grabbed the GPU device again at startup, which is why nvidia-device-plugin-validator could not obtain a GPU device for its check and failed validation
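
The situation can be seen directly from the node's GPU capacity and allocation, for example:

Check the GPU capacity, allocatable and allocated counts on the GPU worker node
kubectl describe node z-k8s-n-1 | grep 'nvidia.com/gpu'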

Solution

  • Delete the unused simple CUDA sample pod (as shown below); after that you will see nvidia-device-plugin-validator run to completion and enter the Completed state

  • This also fixes the scheduling problem for Deploying Stable Diffusion on Kubernetes
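
The cleanup itself is just a kubectl delete of the leftover sample pod created earlier; afterwards the gpu-operator pods can be watched until the validators complete:

Delete the leftover CUDA sample pod and watch the gpu-operator pods recover
kubectl delete pod cuda-vectoradd
kubectl -n gpu-operator get pods -o wide -w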

Next steps

What fun things are there to try?

Play with Deploying Stable Diffusion on Kubernetes and experience GPU-accelerated Textual Inversion image generation.

References