Installing the NVIDIA Container Toolkit for containerd

Prerequisites

Before deploying the NVIDIA Container Toolkit:

  • First, install the NVIDIA Linux driver in the OVMF virtual machine (a somewhat bumpy process; see my practice notes)

  • Check that the system meets the following requirements:

    • NVIDIA Linux driver version >= 418.81.07

    • Kernel version > 3.10

    • Docker >= 19.03

    • NVIDIA GPU architecture >= Kepler
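
The version requirements above can be checked with a few shell commands. A minimal sketch (the version_ge helper is my own, not from the NVIDIA docs; the driver version string is illustrative, on a real host take it from nvidia-smi):

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (compares with sort -V)
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Kernel must be newer than 3.10
version_ge "$(uname -r | cut -d- -f1)" "3.10" && echo "kernel OK"

# Driver must be >= 418.81.07; "525.60.13" is a sample value, on a real
# host use: nvidia-smi --query-gpu=driver_version --format=csv,noheader
version_ge "525.60.13" "418.81.07" && echo "driver OK"
```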

The containerd runtime

Install the containerd runtime by following the official containerd documentation; for example, I chose installing the official containerd binaries.

Note

NVIDIA Cloud Native Documentation: Installation Guide >> containerd describes how to install containerd on Ubuntu and is worth consulting.

  • One step of installing the official containerd binaries is generating the default configuration file (at the time I changed only a single parameter, SystemdCgroup = true). Following the NVIDIA manual, first generate the default configuration and then apply the patch:

Generate the default containerd configuration needed to bootstrap Kubernetes
sudo mkdir /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
  • To use the NVIDIA Container Runtime together with the containerd runtime, the following additional configuration is required: add nvidia as a runtime in the configuration, and use systemd as the cgroup driver.

Run the following command to create the containerd-config.patch file:

Create the containerd-config.patch patch file
cat <<EOF > containerd-config.patch
--- config.toml.orig    2020-12-18 18:21:41.884984894 +0000
+++ /etc/containerd/config.toml 2020-12-18 18:23:38.137796223 +0000
@@ -94,6 +94,15 @@
        privileged_without_host_devices = false
        base_runtime_spec = ""
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
    [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"
EOF

Note

The patch file provided by NVIDIA is in fact incompatible with the default config.toml generated by the official containerd binaries, so I edited the file by hand instead:

Edit /etc/containerd/config.toml
...
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
             privileged_without_host_devices = false
             runtime_engine = ""
             runtime_root = ""
             runtime_type = "io.containerd.runc.v1"
             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
               BinaryName = "/usr/bin/nvidia-container-runtime"
               SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
...
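
Before restarting containerd, a quick grep can confirm the hand edits landed. A minimal sketch (the CONFIG override is my own convenience so the check can be dry-run against a copy of the file):

```shell
# Sanity-check the edited config: the nvidia runtime must reference the
# nvidia-container-runtime binary, and SystemdCgroup = true should now
# appear for both the runc and nvidia runtimes.
CONFIG="${CONFIG:-/etc/containerd/config.toml}"
grep -q 'BinaryName = "/usr/bin/nvidia-container-runtime"' "$CONFIG" \
    && [ "$(grep -c 'SystemdCgroup = true' "$CONFIG")" -ge 2 ] \
    && echo "config OK"
```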
  • Restart containerd:

Restart containerd
sudo systemctl restart containerd
  • Test with the hello-world container:

Pull and run the hello-world container with ctr
sudo ctr image pull docker.io/library/hello-world:latest \
    && sudo ctr run --rm -t docker.io/library/hello-world:latest hello-world

Note

Note that the NVIDIA Container Toolkit is not installed at this point, so /usr/bin/nvidia-container-runtime does not exist yet and the nvidia runtime cannot work. The test above only shows that containerd itself is functional.

Install the NVIDIA Container Toolkit

Note

I have only practiced this installation on an Ubuntu Linux 22.04 virtual machine; for other operating systems, such as the RedHat Linux family, refer to the official NVIDIA Cloud Native Documentation: Installation Guide >> containerd.

  • Add the NVIDIA repository configuration and GPG key on the Ubuntu Linux 22.04 virtual machine:

Add the official NVIDIA repository configuration and GPG key on Ubuntu
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Note that the repository configuration actually written to /etc/apt/sources.list.d/nvidia-container-toolkit.list is:

deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
  • Run the installation:

Install the NVIDIA Container Toolkit on Ubuntu (using the official repository)
sudo apt update \
    && sudo apt install -y nvidia-container-toolkit
  • Check the installed packages:

Check the installed NVIDIA packages on Ubuntu
sudo apt list --installed "*nvidia*"
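
Besides the package list, it may help to confirm the toolkit's binaries are now on PATH. A small sketch (the list of binary names is my own, based on what the nvidia-container-toolkit packages typically ship):

```shell
# Report which NVIDIA toolkit binaries are present after installation;
# /usr/bin/nvidia-container-runtime in particular must exist, since the
# containerd config references it by absolute path.
for bin in nvidia-container-runtime nvidia-container-cli nvidia-ctk; do
    if command -v "$bin" >/dev/null 2>&1; then
        echo "found:   $bin"
    else
        echo "missing: $bin"
    fi
done
```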

Testing the installation

  • Run a GPU test container:

Run a test GPU container
sudo ctr image pull docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04

sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    cuda-11.6.2-base-ubuntu20.04 nvidia-smi

If there are no errors, the container output should look similar to the following:

Test GPU container output showing that the NVIDIA Container Toolkit was installed successfully
Wed Jan 11 17:12:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:09:00.0 Off |                    0 |
| N/A   29C    P8    10W / 150W |      0MiB / 23040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

References