Installing the NVIDIA Device Plugin

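Note

You normally do not need to install the NVIDIA Device Plugin separately. Installing the NVIDIA GPU Operator directly deploys the device plugin automatically.

That said, some production environments deploy only the NVIDIA Device Plugin, typically together with custom plugins that feed GPU metrics into the monitoring platform the company already runs. This approach supplements a large company's existing monitoring stack, but in my view the development and maintenance cost is very high, and it makes it hard to take full advantage of NVIDIA's technical architecture and the open-source community. I do not particularly recommend it.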

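Note

In the "Integrating GPU observability into Kubernetes" setup, I took a step-by-step integration approach: instead of installing the NVIDIA GPU Operator, the steps in this article serve as one stage of the full procedure.

To use GPUs in Kubernetes, you need to install the NVIDIA Device Plugin. It runs as a DaemonSet that automatically exposes the number of GPUs on each node of the cluster and allows pods to run on those GPUs.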
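Once the plugin is running, each GPU node should advertise an `nvidia.com/gpu` resource in its allocatable resources. The following is a minimal sketch for spotting nodes that advertise no GPU. The helper name `nodes_without_gpu` is hypothetical (it is not part of kubectl or the plugin), and it assumes two-column input (node name, GPU count) such as the custom-columns query shown in the comment produces.

```shell
#!/usr/bin/env bash
# nodes_without_gpu: hypothetical helper, not part of kubectl or the plugin.
# Reads "NODE GPU" rows (with a header row) on stdin and prints the names of
# nodes that advertise no nvidia.com/gpu resource.
#
# Assumed input source:
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
nodes_without_gpu() {
  # Skip the header; kubectl prints "<none>" when the field is absent.
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}
```

Piping the kubectl query above into `nodes_without_gpu` should print nothing for the GPU worker nodes once the plugin pods are healthy; nodes without a GPU (for example control-plane nodes) are expected to be listed.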

Install helm on Linux
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
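The commands above hard-code the `amd64` tarball. As a rough sketch for mixed-architecture clusters, the download URL can be derived from the machine type. The function names `helm_arch` and `helm_url` are hypothetical, and only the two architectures listed are handled.

```shell
#!/usr/bin/env bash
# helm_arch: hypothetical helper that maps `uname -m` output to the
# architecture suffix used in Helm release tarball names.
helm_arch() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# helm_url: hypothetical helper that builds the download URL for a given
# Helm version (first argument) and machine type (second argument).
helm_url() {
  local version="$1" arch
  arch=$(helm_arch "$2") || return 1
  echo "https://get.helm.sh/helm-v${version}-linux-${arch}.tar.gz"
}
```

For example, `wget "$(helm_url 3.12.2 "$(uname -m)")"` fetches the tarball that matches the local machine. On arm64 the archive unpacks into `linux-arm64/` rather than `linux-amd64/`, so the `mv` step has to be adjusted accordingly.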
  • Add the nvidia-device-plugin helm repository:

Add the nvidia-device-plugin helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update
  • Deploy the NVIDIA Device Plugin:

Install nvidia-device-plugin with helm
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.13.0

Check after installation (in the namespace used by the helm command above):

Check the nvidia-device-plugin installation
kubectl get pods -n nvidia-device-plugin -o wide | grep nvidia-device-plugin
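On a cluster with many nodes it is easier to list only the pods that are not healthy. A minimal sketch follows; the helper `pods_not_running` is hypothetical and assumes the column layout of `kubectl get pods -o wide` for a single namespace (NAME, READY, STATUS, ..., NODE, NOMINATED NODE, READINESS GATES).

```shell
#!/usr/bin/env bash
# pods_not_running: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods -o wide` rows on stdin and prints
# "pod status node" for every pod whose STATUS is not Running.
#
# Assumed usage:
#   kubectl get pods -n nvidia-device-plugin -o wide | pods_not_running
pods_not_running() {
  # The RESTARTS column may contain spaces, e.g. "5 (108s ago)", so the
  # node name is addressed from the end of the row: it is third from last.
  awk '$1 != "NAME" && $3 != "Running" { print $1, $3, $(NF-2) }'
}
```

An empty result means every plugin pod reports Running; any line printed names the pod and the node to investigate.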

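Troubleshooting NVIDIA Device Plugin startup failures

After installing the NVIDIA Device Plugin, I found that the containers failed to start (the pods below belong to the release nvidia-device-plugin-1673515385):

nvidia-device-plugin containers failing to start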
nvidia-device-plugin-1673515385-gdfm4   0/1     CrashLoopBackOff   5 (108s ago)     5m8s   10.0.6.118      z-k8s-n-5   <none>           <none>
nvidia-device-plugin-1673515385-h89zm   0/1     CrashLoopBackOff   5 (113s ago)     5m8s   10.0.5.208      z-k8s-n-3   <none>           <none>
nvidia-device-plugin-1673515385-hrfxk   0/1     CrashLoopBackOff   5 (112s ago)     5m8s   10.0.4.224      z-k8s-n-2   <none>           <none>
nvidia-device-plugin-1673515385-kg7p7   0/1     CrashLoopBackOff   5 (103s ago)     5m8s   10.0.3.132      z-k8s-n-1   <none>           <none>
nvidia-device-plugin-1673515385-n8l9m   0/1     CrashLoopBackOff   5 (118s ago)     5m8s   10.0.7.110      z-k8s-n-4   <none>           <none>

Following "CrashLoopBackOff when running nvidia-device-plugin-daemonset", the command below can be used to collect information:

Use nvidia-container-cli to print to the console why the NVIDIA container fails to run
sudo nvidia-container-cli -k -d /dev/tty info

The output reports many missing runtime dependencies (the lines prefixed with W):

nvidia-container-cli console output, showing many missing dependencies
-- WARNING, the following logs are for debugging purposes only --

I0112 09:51:35.061763 1819575 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I0112 09:51:35.061960 1819575 nvc.c:350] using root /
I0112 09:51:35.061990 1819575 nvc.c:351] using ldcache /etc/ld.so.cache
I0112 09:51:35.062013 1819575 nvc.c:352] using unprivileged user 65534:65534
I0112 09:51:35.062072 1819575 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0112 09:51:35.062391 1819575 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0112 09:51:35.070168 1819576 nvc.c:278] loading kernel module nvidia
I0112 09:51:35.070908 1819576 nvc.c:282] running mknod for /dev/nvidiactl
I0112 09:51:35.074067 1819576 nvc.c:286] running mknod for /dev/nvidia0
I0112 09:51:35.074366 1819576 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0112 09:51:35.082837 1819576 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0112 09:51:35.083287 1819576 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0112 09:51:35.086078 1819576 nvc.c:296] loading kernel module nvidia_uvm
I0112 09:51:35.086365 1819576 nvc.c:300] running mknod for /dev/nvidia-uvm
I0112 09:51:35.086474 1819576 nvc.c:305] loading kernel module nvidia_modeset
I0112 09:51:35.086543 1819576 nvc.c:309] running mknod for /dev/nvidia-modeset
I0112 09:51:35.087461 1819577 rpc.c:71] starting driver rpc service
I0112 09:51:35.110666 1819578 rpc.c:71] starting nvcgo rpc service
I0112 09:51:35.115202 1819575 nvc_info.c:766] requesting driver information with ''
I0112 09:51:35.118564 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.60.13
I0112 09:51:35.118755 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.60.13
I0112 09:51:35.118841 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.60.13
I0112 09:51:35.118925 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I0112 09:51:35.119037 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.60.13
I0112 09:51:35.119230 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.60.13
I0112 09:51:35.119323 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.60.13
I0112 09:51:35.119432 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.60.13
I0112 09:51:35.119550 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.60.13
I0112 09:51:35.119645 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.60.13
I0112 09:51:35.119735 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.60.13
I0112 09:51:35.119914 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.60.13
I0112 09:51:35.120032 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.60.13
I0112 09:51:35.120152 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.60.13
I0112 09:51:35.120244 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.60.13
I0112 09:51:35.120349 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.60.13
I0112 09:51:35.120491 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.60.13
I0112 09:51:35.120619 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.60.13
I0112 09:51:35.120803 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.60.13
I0112 09:51:35.120966 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.60.13
I0112 09:51:35.121119 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.60.13
I0112 09:51:35.121219 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.60.13
I0112 09:51:35.121314 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.60.13
I0112 09:51:35.121409 1819575 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.60.13
I0112 09:51:35.121543 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.525.60.13
I0112 09:51:35.121632 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I0112 09:51:35.121754 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.60.13
I0112 09:51:35.121866 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.60.13
I0112 09:51:35.121961 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.60.13
I0112 09:51:35.122110 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.525.60.13
I0112 09:51:35.122205 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.525.60.13
I0112 09:51:35.122333 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.525.60.13
I0112 09:51:35.122424 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.525.60.13
I0112 09:51:35.122539 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.60.13
I0112 09:51:35.122651 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.525.60.13
I0112 09:51:35.122747 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.60.13
I0112 09:51:35.122838 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.60.13
I0112 09:51:35.122969 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.60.13
I0112 09:51:35.123196 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.525.60.13
I0112 09:51:35.123350 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.525.60.13
I0112 09:51:35.123461 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.525.60.13
I0112 09:51:35.123586 1819575 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.525.60.13
W0112 09:51:35.123645 1819575 nvc_info.c:399] missing library libnvidia-nscq.so
W0112 09:51:35.123664 1819575 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0112 09:51:35.123679 1819575 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0112 09:51:35.123693 1819575 nvc_info.c:399] missing library libvdpau_nvidia.so
W0112 09:51:35.123710 1819575 nvc_info.c:399] missing library libnvidia-ifr.so
W0112 09:51:35.123724 1819575 nvc_info.c:399] missing library libnvidia-cbl.so
W0112 09:51:35.123737 1819575 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0112 09:51:35.123751 1819575 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0112 09:51:35.123766 1819575 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0112 09:51:35.123779 1819575 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0112 09:51:35.123794 1819575 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0112 09:51:35.123808 1819575 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0112 09:51:35.123822 1819575 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0112 09:51:35.123835 1819575 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0112 09:51:35.123849 1819575 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0112 09:51:35.123863 1819575 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0112 09:51:35.123878 1819575 nvc_info.c:403] missing compat32 library libnvoptix.so
W0112 09:51:35.123892 1819575 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0112 09:51:35.124371 1819575 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0112 09:51:35.124428 1819575 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0112 09:51:35.124477 1819575 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0112 09:51:35.124534 1819575 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0112 09:51:35.124571 1819575 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0112 09:51:35.124635 1819575 nvc_info.c:425] missing binary nv-fabricmanager
W0112 09:51:35.124687 1819575 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/525.60.13/gsp.bin
I0112 09:51:35.124740 1819575 nvc_info.c:529] listing device /dev/nvidiactl
I0112 09:51:35.124756 1819575 nvc_info.c:529] listing device /dev/nvidia-uvm
I0112 09:51:35.124772 1819575 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0112 09:51:35.124786 1819575 nvc_info.c:529] listing device /dev/nvidia-modeset
I0112 09:51:35.124833 1819575 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0112 09:51:35.124875 1819575 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0112 09:51:35.124907 1819575 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0112 09:51:35.124922 1819575 nvc_info.c:822] requesting device information with ''
I0112 09:51:35.133070 1819575 nvc_info.c:713] listing device /dev/nvidia0 (GPU-794d1de5-b8c7-9b49-6fe3-f96f8fd98a19 at 00000000:09:00.0)
NVRM version:   525.60.13
CUDA version:   12.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA Graphics Device
Brand:          Tesla
GPU UUID:       GPU-794d1de5-b8c7-9b49-6fe3-f96f8fd98a19
Bus Location:   00000000:09:00.0
Architecture:   6.1
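
The warnings are easy to miss among the informational lines. A minimal sketch for extracting only the warnings is shown below. The helper `nvc_warnings` is hypothetical, and it assumes the debug log has been captured to a file first (`/tmp/nvc-debug.log` is a hypothetical path; the `-d` option takes a file path, so it can be used in place of `/dev/tty`).

```shell
#!/usr/bin/env bash
# nvc_warnings: hypothetical helper, not part of nvidia-container-cli.
# Reads the debug log on stdin, keeps only warning lines (prefixed with
# "W" plus a 4-digit date) and strips the "timestamp pid file:line]" prefix.
#
# Assumed usage:
#   sudo nvidia-container-cli -k -d /tmp/nvc-debug.log info
#   nvc_warnings < /tmp/nvc-debug.log
nvc_warnings() {
  sed -n 's/^W[0-9]\{4\} [^]]*] //p'
}
```

Applied to the log above, this would reduce the output to the `missing library`, `missing compat32 library`, `missing binary`, `missing firmware path`, and `missing ipc path` messages.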

Removing the NVIDIA Device Plugin

  • Check the releases with helm:

List the releases installed through helm (the release name must be given when uninstalling)
helm list -A
  • Remove the incorrectly deployed NVIDIA Device Plugin:

Use helm uninstall to remove the given release; the namespace must be specified if it is not the default one
helm uninstall nvidia-device-plugin-1673515385 -n kube-system
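
Because the release name is generated, it has to be looked up before it can be removed. A rough sketch for pulling the release name and namespace out of the `helm list -A` table is given below. The helper `plugin_releases` is hypothetical and assumes the default table output, where the first two columns are NAME and NAMESPACE.

```shell
#!/usr/bin/env bash
# plugin_releases: hypothetical helper, not part of helm.
# Reads the `helm list -A` table on stdin and prints "release namespace"
# for every row that mentions nvidia-device-plugin (in the release name
# or in the chart name).
#
# Assumed usage (prints the uninstall commands instead of running them;
# remove `echo` to actually uninstall):
#   helm list -A | plugin_releases | while read -r release ns; do
#     echo helm uninstall "$release" -n "$ns"
#   done
plugin_releases() {
  awk 'NR > 1 && /nvidia-device-plugin/ { print $1, $2 }'
}
```

Printing the commands first is a deliberate choice: it gives a chance to confirm that only the intended release is matched before anything is deleted.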

References