Installing NVIDIA Container Toolkit for containerd
Prerequisites
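Before deploying NVIDIA Container Toolkit, first install the NVIDIA Linux driver in the OVMF virtual machine (this took some trial and error; see my practice notes).
Then check that the system meets these requirements:
NVIDIA Linux driver version >= 418.81.07
Kernel version > 3.10
Docker >= 19.03
NVIDIA GPU architecture >= Kepler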
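The driver requirement above can be checked mechanically. The following is a minimal sketch and not part of the original procedure: version_ge is a hypothetical helper name, and it only compares purely numeric dotted versions such as 418.81.07.

```shell
#!/bin/sh
# Hypothetical helper (not from NVIDIA): succeed when dotted version $1 >= $2.
# Assumes every component is a plain decimal number.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading component of each version (missing ones count as 0).
        x=${a%%.*}; y=${b%%.*}
        x=${x:-0}; y=${y:-0}
        # Drop the component just read; nothing is left when no dot remains.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
        if [ "$x" -gt "$y" ]; then return 0; fi
        if [ "$x" -lt "$y" ]; then return 1; fi
    done
    return 0
}

# Example with the driver version that appears in the nvidia-smi output
# at the end of this document.
if version_ge 525.60.13 418.81.07; then
    echo "driver version OK"
else
    echo "driver too old"
fi
```

On a machine where the driver is already installed, the first argument could come from nvidia-smi --query-gpu=driver_version --format=csv,noheader.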
containerd runtime
Install the containerd runtime by following the official containerd documentation. For example, I used the procedure in "Installing containerd from the official binaries".
Note
NVIDIA Cloud Native Documentation: Installation Guide >> containerd describes the steps for installing containerd on Ubuntu and is a useful reference.
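"Installing containerd from the official binaries" includes a step that generates the default configuration file (at the time I changed only one parameter, SystemdCgroup = true). Following the NVIDIA guide, first generate the default configuration, then apply a patch to it: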
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
To use the NVIDIA Container Runtime with the containerd runtime, some additional configuration is required: add nvidia as a runtime in the configuration and use systemd as the cgroup driver.
Run the following command to create the containerd-config.patch file:
cat <<EOF > containerd-config.patch
--- config.toml.orig 2020-12-18 18:21:41.884984894 +0000
+++ /etc/containerd/config.toml 2020-12-18 18:23:38.137796223 +0000
@@ -94,6 +94,15 @@
privileged_without_host_devices = false
base_runtime_spec = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+ SystemdCgroup = true
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+ privileged_without_host_devices = false
+ runtime_engine = ""
+ runtime_root = ""
+ runtime_type = "io.containerd.runc.v1"
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+ BinaryName = "/usr/bin/nvidia-container-runtime"
+ SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
EOF
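Note
The patch file provided by NVIDIA is actually incompatible with the default config.toml generated in "Installing containerd from the official binaries", so in practice I edited the file by hand: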
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = ""
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v1"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
...
Restart containerd:
sudo systemctl restart containerd
Test with the Docker hello-world container:
sudo ctr image pull docker.io/library/hello-world:latest \
&& sudo ctr run --rm -t docker.io/library/hello-world:latest hello-world
Note
At this point NVIDIA Container Toolkit has not been installed yet, so /usr/bin/nvidia-container-runtime does not exist and the plugin does not work yet. The test above only shows that containerd itself works.
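Whether the binary referenced by BinaryName is present can be confirmed directly. This is a small sketch; runtime_status is a hypothetical helper name, and it only tests that the path exists and is executable, not that the runtime actually works.

```shell
#!/bin/sh
# Hypothetical helper: print "present" if $1 is an executable file,
# otherwise print "missing".
runtime_status() {
    if [ -x "$1" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Before the toolkit is installed this is expected to report "missing".
runtime_status /usr/bin/nvidia-container-runtime
```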
Installing NVIDIA Container Toolkit
Note
I have only tried this on an Ubuntu Linux 22.04 virtual machine. For other operating systems, such as the RedHat Linux family, refer to the official NVIDIA Cloud Native Documentation: Installation Guide >> containerd.
Add the NVIDIA repository configuration and GPG key in the Ubuntu Linux 22.04 virtual machine:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Note that the repository configuration actually written to /etc/apt/sources.list.d/nvidia-container-toolkit.list is:
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
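The distribution value used in the repository command comes from sourcing /etc/os-release. The sketch below reproduces that derivation in isolation; distribution_of is a hypothetical helper name, and it assumes the file defines ID and VERSION_ID in shell-compatible syntax.

```shell
#!/bin/sh
# Hypothetical helper: print $ID$VERSION_ID from an os-release style file.
# The file is sourced in a subshell so its variables do not leak out.
distribution_of() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Show the value for the local system when the file is available.
if [ -r /etc/os-release ]; then
    distribution_of /etc/os-release
fi
```

Comparing this value with the path in the generated list file makes it easy to see which distribution directory the repository entry really points at.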
Run the installation:
sudo apt update \
&& sudo apt install -y nvidia-container-toolkit
Check the installed packages:
sudo apt list --installed '*nvidia*'
Testing the installation
Test a GPU container:
sudo ctr image pull docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
sudo ctr run --rm -t \
--runc-binary=/usr/bin/nvidia-container-runtime \
--env NVIDIA_VISIBLE_DEVICES=all \
docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
cuda-11.6.2-base-ubuntu20.04 nvidia-smi
If everything works, the container prints output similar to the following:
Wed Jan 11 17:12:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:09:00.0 Off | 0 |
| N/A 29C P8 10W / 150W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
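For scripted checks, the driver version can be pulled out of this banner. This is only a sketch that assumes the "Driver Version:" label appears as in the output above; driver_version is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi banner text on stdin and print the
# number that follows "Driver Version:".
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.][0-9.]*\).*/\1/p'
}

# Example using the banner line from the output above.
printf '%s\n' '| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |' \
    | driver_version
```

In practice the input would be the output of the ctr run command shown earlier, piped into driver_version.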