Installing NVIDIA GPU Operator on y-k8s

Note

I previously installed the NVIDIA GPU Operator on the private-cloud z-k8s cluster. At that time I had not yet set up NVIDIA Virtual GPU (vGPU), so the complete Nvidia Tesla P10 GPU compute card was attached to the VM directly, as described in Using OVMF for passthrough of GPU and NVMe storage.

To better emulate a large-scale GPU Kubernetes environment, I redeployed a Kubernetes cluster (y-k8s) (built with the Kubespray quick start) to run a Machine Learning Atlas on multiple NVIDIA Virtual GPU (vGPU) instances.

This article is a second pass at installing the NVIDIA GPU Operator.

Preparation

Before installing the NVIDIA GPU Operator, make sure the Kubernetes cluster (y-k8s) meets the following requirements:

In addition, confirm the following:

Installing NVIDIA GPU Operator

Install helm on Linux
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
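
To confirm the binary is installed and on the PATH, a quick sanity check (not part of the original steps):

Verify the helm installation
helm version --short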
  • Add the NVIDIA Helm repository:

Add the NVIDIA repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update
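
Optionally verify that the gpu-operator chart is now visible in the repository (the chart version shown will vary):

Confirm the gpu-operator chart is available
helm search repo nvidia/gpu-operator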

Operands

Note

I skip this step: both workers in my y-k8s cluster already have NVIDIA Virtual GPU (vGPU) deployed, and there is no node that must be excluded from deployment (the NVIDIA GPU Operator automatically deploys to nodes that have GPUs).

By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the label feature.node.kubernetes.io/pci-10de.present=true, where 0x10de is the PCI vendor ID assigned to NVIDIA.
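
To see which nodes currently carry this label, a standard kubectl label-selector query works:

List GPU worker nodes identified by the NVIDIA PCI vendor ID label
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true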

  • First, label the nodes in the cluster that have NVIDIA GPUs installed:

Label the NVIDIA GPU worker nodes of the Kubernetes cluster
kubectl label nodes y-k8s-n-1 feature.node.kubernetes.io/pci-10de.present=true
  • To prevent operands from being deployed on a GPU worker node, label that node with nvidia.com/gpu.deploy.operands=false:

Label a GPU worker node of the Kubernetes cluster to disable operand deployment
kubectl label nodes y-k8s-n-2 nvidia.com/gpu.deploy.operands=false
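
To re-enable operand deployment on that node later, remove the label using kubectl's trailing-dash syntax:

Remove the label to re-enable operand deployment
kubectl label nodes y-k8s-n-2 nvidia.com/gpu.deploy.operands-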

Deploying GPU Operator

There are multiple installation scenarios; the following is a common one. Choose the method that suits your environment.

  • Bare-metal/Passthrough on Ubuntu with the default configuration (good, exactly the scenario I need, since I have only one node with a passed-through Nvidia Tesla P10 GPU compute card)

Bare-metal/Passthrough default configuration on Ubuntu: install GPU Operator with helm
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator

Installation output:

Bare-metal/Passthrough default configuration on Ubuntu: helm install GPU Operator, output
NAME: gpu-operator-1690303523
LAST DEPLOYED: Wed Jul 26 00:45:31 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
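
A common variant: when the NVIDIA driver is already installed on the worker nodes (for example, a vGPU guest driver baked into the VM image), the chart documents a driver.enabled value that skips the driver container. A sketch of that variant, assuming the same chart as above:

Install GPU Operator without the driver container (driver pre-installed on nodes)
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false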
  • After installation completes, check:

After installing GPU Operator, check the nvidia gpu-operator related pods in the cluster
kubectl get pods -n gpu-operator -o wide

The following pods are running:

nvidia gpu-operator related pods in the cluster after installing GPU Operator
NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE        NOMINATED NODE   READINESS GATES
gpu-feature-discovery-bksgp                                       1/1     Running     0          3m52s   10.233.89.132   y-k8s-n-1   <none>           <none>
gpu-feature-discovery-wstlz                                       1/1     Running     0          3m52s   10.233.78.71    y-k8s-n-2   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-master-6f5b7rdpm   1/1     Running     0          7m46s   10.233.109.74   y-k8s-m-1   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-worker-bblkw       1/1     Running     0          7m46s   10.233.109.75   y-k8s-m-1   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-worker-dw5gs       1/1     Running     0          7m47s   10.233.93.213   y-k8s-m-3   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-worker-l7hbw       1/1     Running     0          7m47s   10.233.89.129   y-k8s-n-1   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-worker-nc2dh       1/1     Running     0          7m46s   10.233.78.67    y-k8s-n-2   <none>           <none>
gpu-operator-1690303523-node-feature-discovery-worker-sb5hn       1/1     Running     0          7m47s   10.233.121.11   y-k8s-m-2   <none>           <none>
gpu-operator-56849f4cc-82vqm                                      1/1     Running     0          7m46s   10.233.78.66    y-k8s-n-2   <none>           <none>
nvidia-container-toolkit-daemonset-g7sf9                          1/1     Running     0          3m55s   10.233.89.131   y-k8s-n-1   <none>           <none>
nvidia-container-toolkit-daemonset-tgjqk                          1/1     Running     0          3m55s   10.233.78.69    y-k8s-n-2   <none>           <none>
nvidia-cuda-validator-45ngk                                       0/1     Completed   0          2m35s   10.233.78.73    y-k8s-n-2   <none>           <none>
nvidia-cuda-validator-wjqvw                                       0/1     Completed   0          2m22s   10.233.89.136   y-k8s-n-1   <none>           <none>
nvidia-dcgm-exporter-85qt9                                        1/1     Running     0          3m53s   10.233.89.133   y-k8s-n-1   <none>           <none>
nvidia-dcgm-exporter-sgnkt                                        1/1     Running     0          3m53s   10.233.78.72    y-k8s-n-2   <none>           <none>
nvidia-device-plugin-daemonset-2p6hw                              1/1     Running     0          3m54s   10.233.78.74    y-k8s-n-2   <none>           <none>
nvidia-device-plugin-daemonset-bccw8                              1/1     Running     0          3m54s   10.233.89.134   y-k8s-n-1   <none>           <none>
nvidia-device-plugin-validator-c4lcq                              0/1     Completed   0          70s     10.233.78.75    y-k8s-n-2   <none>           <none>
nvidia-device-plugin-validator-h8v2q                              0/1     Completed   0          66s     10.233.89.137   y-k8s-n-1   <none>           <none>
nvidia-operator-validator-mrhp9                                   1/1     Running     0          3m55s   10.233.89.135   y-k8s-n-1   <none>           <none>
nvidia-operator-validator-vkgx5                                   1/1     Running     0          3m54s   10.233.78.70    y-k8s-n-2   <none>           <none>
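
Once the validator pods have completed, each GPU node should advertise the nvidia.com/gpu resource, and the operator's ClusterPolicy custom resource should reach the ready state; both can be checked with standard kubectl queries:

Check the nvidia.com/gpu node resource and the ClusterPolicy state
kubectl describe node y-k8s-n-1 | grep -i 'nvidia.com/gpu'
kubectl get clusterpolicies.nvidia.com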

Running a sample GPU application

CUDA VectorAdd

  • First, run a simple CUDA sample provided by NVIDIA that adds two vectors:

Run a simple CUDA sample that adds two vectors
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
         nvidia.com/gpu: 1
EOF

Output:

The simple CUDA sample pod is created successfully
pod/cuda-vectoradd created

Problems encountered during my first practice of installing the NVIDIA GPU Operator, and their troubleshooting, are covered in the original article.

  • Check:

Check the CUDA sample pod
kubectl get pods -o wide

Status output:

Check the CUDA sample pod, output
NAME             READY   STATUS      RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
cuda-vectoradd   0/1     Completed   0          24s   10.233.89.138   y-k8s-n-1   <none>           <none>
  • Check the logs to see the computation result:

Use kubectl logs to get the pod's logs and verify the computation result
kubectl logs cuda-vectoradd

The following output shows that the NVIDIA GPU Operator is installed and working correctly:

Use kubectl logs to get the pod's logs and verify the computation result
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Note

If the pod is not deleted after the computation finishes, the GPU stays allocated and no new GPU workloads can be scheduled. Delete pods in Completed state to release the GPU resource. See Troubleshooting abnormal GPU node scheduling for details.
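
For this sample, releasing the GPU is a single standard command:

Delete the Completed pod to release the GPU
kubectl delete pod cuda-vectoradd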

Next steps

So what fun things can we do next?

Try deploying Stable Diffusion on Kubernetes to experience GPU-accelerated image generation with Textual Inversion.
