Installing the NVIDIA Virtual GPU Manager

Preparation

Prerequisites for installing the NVIDIA Virtual GPU Manager on the physical host:

  • The following packages are installed on the KVM server:

    • x86_64 GNU Compiler Collection (GCC)

    • Linux kernel headers

Install GCC and Linux kernel headers on the Ubuntu server
sudo apt install gcc linux-headers-$(uname -r)

Installing the Virtual GPU Manager Package for Linux KVM

Note

My setup uses an Nvidia Tesla P10 compute card on Ubuntu Linux 22.04. The official documentation provides Installing and Configuring the NVIDIA Virtual GPU Manager for Ubuntu, so I switched to following that part of the documentation for this walkthrough.

The official NVIDIA documentation is very thorough (and tedious); check your hardware and software environment carefully to find the section that best matches your setup.

  • Installation is very simple; it really amounts to running the NVIDIA host driver installer:

Install the vGPU Manager for Linux KVM on the host
chmod +x NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run
sudo sh ./NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run

The installer runs quickly; it compiles the kernel modules and completes the installation.

Warning

I noticed that in Installing and Configuring the NVIDIA Virtual GPU Manager for Ubuntu, the section Installing the Virtual GPU Manager Package for Ubuntu installs a .deb package, and after that installation lsmod | grep vfio shows the mdev module as well.

This is different from the result of installing the vGPU Manager for Linux KVM on my host here, which was confusing.

At this point, Was the vfio_mdev module removed from the 5.15 kernel? gave me a pointer: starting with kernel 5.15 the mdev module replaces vfio_mdev, and vfio can still be used through mdev on kernel 5.15.

Proxmox 7 vGPU – v2 provides detailed guidance.

  • The vGPU Manager for Linux KVM installation above adds symlinks under /etc/systemd/system/multi-user.target.wants, which effectively enables the following two vGPU services:

    nvidia-vgpud.service -> /lib/systemd/system/nvidia-vgpud.service
    nvidia-vgpu-mgr.service -> /lib/systemd/system/nvidia-vgpu-mgr.service
    

In practice, however, I found that nvidia-vgpud.service did not run correctly; see the "nvidia-vgpud and nvidia-vgpu-mgr services" section below.

  • Reboot the server, then check the vfio modules after the reboot:

Run lsmod to list the vfio-related modules
lsmod | grep vfio

Only two vfio-related modules show up here; the vfio_mdev module shown in the documentation is not present (reason: from kernel 5.15 on, mdev replaces vfio_mdev):

Run lsmod to list the vfio-related modules; vfio_mdev is not present
nvidia_vgpu_vfio       57344  0
mdev                   28672  1 nvidia_vgpu_vfio

Note that Verifying the Installation of the NVIDIA vGPU Software for Red Hat Enterprise Linux KVM or RHV (the official documentation referenced here) shows vfio_mdev, a kernel module from before 5.15; Ubuntu Linux 22.04 uses the 5.15 kernel series, where the mdev module has replaced vfio_mdev.

According to the documentation, lsmod on RHEL should show the vfio-related modules including vfio_mdev
nvidia_vgpu_vfio       27099  0
nvidia              12316924  1 nvidia_vgpu_vfio
vfio_mdev              12841  0
mdev                   20414  2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1       22342  0
vfio                   32331  3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
  • The driver loaded for the device can be checked with the following command:

Run lspci -vvvnnn to check driver details
lspci -vvvnnn -s 82:00.0 | grep -i kernel

The output shows that the nvidia driver is loaded:

Run lspci -vvvnnn to check driver details
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
  • At this point nvidia-smi shows just the single physical GPU:

Run nvidia-smi to check the GPU
sudo nvidia-smi

The output shows only one GPU:

nvidia-smi output showing a single GPU card
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03    Driver Version: 510.85.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   40C    P0    42W / 150W |     50MiB / 23040MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Preparing the KVM hypervisor: obtaining the GPU's BDF and domain

  • Get the physical GPU's PCI bus/device/function (BDF):

Get the GPU device's BDF
lspci | grep NVIDIA

The physical GPU device shows up as follows:

Get the GPU device's BDF
82:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P10] (rev a1)

The 82:00.0 in this output is the GPU's PCI device BDF.

  • Derive the GPU's full identifier from its PCI BDF: note that 82:00.0 has to be converted to 82_00_0 (the so-called transformed-bdf).

Use the transformed GPU BDF with virsh nodedev-list to get the full GPU identifier
virsh nodedev-list --cap pci | grep 82_00_0

The output is:

Use the transformed GPU BDF with virsh nodedev-list to get the full GPU identifier
pci_0000_82_00_0

Record the full PCI device identifier pci_0000_82_00_0 from this output; we will use this identifier string to obtain the GPU's domain, bus, slot and function as used by virsh.
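A small sketch (my own helper, not part of the NVIDIA documentation) that derives the transformed-bdf from the plain BDF instead of editing it by hand:

# derive the transformed-bdf (82:00.0 -> 82_00_0) and look up the virsh node device
bdf=82:00.0
transformed_bdf=$(echo "${bdf}" | tr ':.' '__')
virsh nodedev-list --cap pci | grep "${transformed_bdf}"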

  • Get the GPU device's full virsh configuration:

Use the full GPU identifier with virsh nodedev-dumpxml to get the complete GPU configuration (domain, bus, slot and function)
virsh nodedev-dumpxml pci_0000_82_00_0 | egrep 'domain|bus|slot|function'

Output:

Use the full GPU identifier with virsh nodedev-dumpxml to get the complete GPU configuration (domain, bus, slot and function)
    <domain>0</domain>
    <bus>130</bus>
    <slot>0</slot>
    <function>0</function>
      <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>

Record this output for later use (in particular the last <address> line).
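Note that virsh prints domain/bus/slot/function in decimal while the <address> element uses hexadecimal; a quick sketch confirming that decimal 130 is indeed bus 0x82:

# <bus>130</bus> in decimal corresponds to bus='0x82' in the <address> line
printf '0x%02x\n' 130    # prints 0x82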

Creating NVIDIA vGPUs on the KVM hypervisor

There are two ways to create NVIDIA vGPUs for a KVM hypervisor:

Legacy NVIDIA vGPU (time-sliced vGPU)

Warning

My first attempt on Ubuntu Linux 22.04 failed because the Nvidia Tesla P10 compute card ships with NVIDIA Virtual GPU (vGPU) support disabled. The configuration in this section only works after unlocking vGPU support with vgpu_unlock.

  • First change into the mdev_supported_types directory of the physical GPU; the full path of this directory is built from the domain, bus, slot and function obtained above.

Change into the physical GPU's mdev_supported_types directory
# obtained from the output of: virsh nodedev-dumpxml pci_0000_82_00_0 | egrep 'domain|bus|slot|function'
# <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>

domain=0000
bus=82
slot=00
function=0

cd /sys/class/mdev_bus/${domain}\:${bus}\:${slot}.${function}/mdev_supported_types/

Here I hit a problem: the /sys/class/mdev_bus/ directory did not exist, so there was no mdev_supported_types directory for the physical GPU to enter. Why? => The cause was found later: the Nvidia Tesla P10 compute card needs vgpu_unlock to unlock NVIDIA Virtual GPU (vGPU) support. A quick diagnostic check is sketched below.
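A quick diagnostic (a sketch of my own, not from the official docs) to run before going any further: if the directory is missing, mdev/vGPU support is not active on the physical GPU yet:

# check whether the kernel exposes any mdev-capable parent devices
if [ -d /sys/class/mdev_bus ]; then
    ls /sys/class/mdev_bus/
else
    echo "/sys/class/mdev_bus does not exist: vGPU (mdev) support is not active on this GPU"
fi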

Tracking this down for the device in question:

  • Check the GPU device details:

Check the GPU device with lspci -v
lspci -v -s 82:00.0

The output shows:

Check the GPU device with lspci -v
82:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P10] (rev a1)
        Subsystem: NVIDIA Corporation GP102GL [Tesla P10]
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 183, NUMA node 1, IOMMU group 80
        Memory at c8000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 3b000000000 (64-bit, prefetchable) [size=32G]
        Memory at 3b800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

Strange: my Tesla P10 card really is plugged into physical slot 3, so why did the earlier virsh nodedev-dumpxml output show slot=0x00? What is the relationship between the two? (The lspci "Physical Slot" is the chassis slot label reported by the platform firmware, while the virsh slot field is the PCI device number on bus 0x82, so the two numberings are unrelated.)

  • Check the vGPU status:

Check the vGPU status with nvidia-smi vgpu
nvidia-smi vgpu

The output shows only the physical GPU, with no vGPU instances:

Check the vGPU status with nvidia-smi vgpu
Wed Jun  7 23:32:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03              Driver Version: 510.85.03                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA Graphics Device     | 00000000:82:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

nvidia-vgpud and nvidia-vgpu-mgr services

  • Check the nvidia-vgpu-mgr service:

Check the status of the nvidia-vgpu-mgr service
systemctl status nvidia-vgpu-mgr.service

The nvidia-vgpu-mgr service is running normally:

nvidia-vgpu-mgr service status, healthy
 nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-06-08 23:29:10 CST; 8s ago
    Process: 12170 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 12171 (nvidia-vgpu-mgr)
      Tasks: 1 (limit: 464054)
     Memory: 260.0K
        CPU: 4ms
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             └─12171 /usr/bin/nvidia-vgpu-mgr

Jun 08 23:29:10 zcloud.staging.huatai.me systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jun 08 23:29:10 zcloud.staging.huatai.me systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jun 08 23:29:10 zcloud.staging.huatai.me nvidia-vgpu-mgr[12171]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
  • But checking the nvidia-vgpud service:

Check the status of the nvidia-vgpud service
systemctl status nvidia-vgpud.service

nvidia-vgpud failed to start:

nvidia-vgpud service failed to start
× nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2023-06-08 23:29:40 CST; 3s ago
    Process: 12179 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
    Process: 12181 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 12180 (code=exited, status=6)
        CPU: 35ms

Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: vGPU types: 613
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]:
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: pciId of gpu [0]: 0:82:0:0
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: GPU not supported by vGPU at PCI Id: 0:82:0:0 DevID: 0x10de / 0x1b39 / 0x10de / 0x1217
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: error: failed to send vGPU configuration info to RM: 6
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: PID file unlocked.
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: PID file closed.
Jun 08 23:29:40 zcloud.staging.huatai.me nvidia-vgpud[12180]: Shutdown (12180)
Jun 08 23:29:40 zcloud.staging.huatai.me systemd[1]: nvidia-vgpud.service: Main process exited, code=exited, status=6/NOTCONFIGURED
Jun 08 23:29:40 zcloud.staging.huatai.me systemd[1]: nvidia-vgpud.service: Failed with result 'exit-code'.

Why did nvidia-vgpud fail to start? error: failed to send vGPU configuration info to RM: 6

In Hacking NVidia Cards into their Professional Counterparts a user posted a comparison of the startup logs of a Tesla P4 and a GTX 1080 (the same GP104 die as the Tesla P4); unfortunately, the startup log of my Nvidia Tesla P10 matches the GTX 1080, which does not support vGPU. <= Indeed, it was later confirmed that the Nvidia Tesla P10, just like consumer cards, needs vgpu_unlock before NVIDIA Virtual GPU (vGPU) can be used.

I also asked GPT 3.5, which likewise replied: "according to the log, the nvidia-vgpud service failed to start because the GPU does not support vGPU", and it went on to claim that the NVIDIA Tesla P10 does not support vGPU at all and suggested upgrading to a Tesla P40.

Could my obscure Nvidia Tesla P10 really be one of those Tesla cards precisely cut down by NVIDIA's product segmentation? I refuse to accept that; help me up, I can still fight!

Note

nvidia-vgpud only runs normally after vgpu_unlock has unlocked the NVIDIA Virtual GPU (vGPU) capability.

  • nvidia-smi also provides a query mode:

Query the GPU with nvidia-smi -q
nvidia-smi -q
nvidia-smi -q output showing vGPU support
==============NVSMI LOG==============

Timestamp                                 : Fri Jun  9 00:12:17 2023
Driver Version                            : 510.85.03
CUDA Version                              : Not Found

Attached GPUs                             : 1
GPU 00000000:82:00.0
    Product Name                          : NVIDIA Graphics Device
    Product Brand                         : Tesla
    Product Architecture                  : Pascal
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0000000000000
    GPU UUID                              : GPU-794d1de5-b8c7-9b49-6fe3-f96f8fd98a19
    Minor Number                          : 0
    VBIOS Version                         : 86.02.4B.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x8200
    GPU Part Number                       : 000-00000-0000-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G000.0000.00.00
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x82
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1B3910DE
        Bus Id                            : 00000000:82:00.0
        Sub System Id                     : 0x121710DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 23040 MiB
        Reserved                          : 0 MiB
        Used                              : 50 MiB
        Free                              : 22989 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 2 MiB
        Free                              : 32766 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
            Double Bit
                Device Memory             : 0
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 40 C
        GPU Shutdown Temp                 : 95 C
        GPU Slowdown Temp                 : 92 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 18.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 75.00 W
        Max Power Limit                   : 150.00 W
    Clocks
        Graphics                          : 544 MHz
        SM                                : 544 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Applications Clocks
        Graphics                          : 1025 MHz
        Memory                            : 3008 MHz
    Default Applications Clocks
        Graphics                          : 1025 MHz
        Memory                            : 3008 MHz
    Max Clocks
        Graphics                          : 1531 MHz
        SM                                : 1531 MHz
        Memory                            : 3008 MHz
        Video                             : 1544 MHz
    Max Customer Boost Clocks
        Graphics                          : 1531 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

You can see that this GPU reports Host VGPU virtualization mode, with Host VGPU Mode shown as Non SR-IOV (i.e. not using Single Root I/O Virtualization).

  • Query the vGPU state further:

Query vGPUs with nvidia-smi vgpu -q
nvidia-smi vgpu -q

The output shows that no vGPUs are active yet:

nvidia-smi vgpu -q output showing no active vGPUs
GPU 00000000:82:00.0
    Active vGPUs                          : 0

Solution: use vgpu_unlock

Sure enough, the Nvidia Tesla P10 is a compute card on which NVIDIA has disabled vGPU, much like a consumer GPU, and vgpu_unlock is needed to unlock its vGPU capability. After applying vgpu_unlock, checking again shows the nvidia-vgpud service completing normally:

Status of nvidia-vgpud.service after vgpu_unlock, now running cleanly
○ nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sat 2023-06-10 00:14:29 CST; 1s ago
    Process: 3815 ExecStart=/opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
    Process: 3855 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 3819 (code=exited, status=0/SUCCESS)
        CPU: 449ms

Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: BAR1 Length: 0x4000
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: Frame Rate Limiter enabled: 0x1
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: Number of Displays: 1
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: Max pixels: 8847360
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: Display: width 4096, height 2160
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: License: NVIDIA-vComputeServer,9.0;Quadro-Virtual-DWS,5.0
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: PID file unlocked.
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: PID file closed.
Jun 10 00:14:29 zcloud.staging.huatai.me nvidia-vgpud[3839]: Shutdown (3839)
Jun 10 00:14:29 zcloud.staging.huatai.me systemd[1]: nvidia-vgpud.service: Deactivated successfully.

Continuing: creating NVIDIA vGPU devices for the KVM hypervisor

Creating mdev devices by hand

After changing into the physical GPU's mdev_supported_types directory:

Change into the physical GPU's mdev_supported_types directory
# obtained from the output of: virsh nodedev-dumpxml pci_0000_82_00_0 | egrep 'domain|bus|slot|function'
# <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>

domain=0000
bus=82
slot=00
function=0

cd /sys/class/mdev_bus/${domain}\:${bus}\:${slot}.${function}/mdev_supported_types/

Listing this directory shows device entries like these:

nvidia-156  nvidia-241  nvidia-284  nvidia-286  nvidia-46  nvidia-48  nvidia-50  nvidia-52  nvidia-54  nvidia-56  nvidia-58  nvidia-60  nvidia-62
nvidia-215  nvidia-283  nvidia-285  nvidia-287  nvidia-47  nvidia-49  nvidia-51  nvidia-53  nvidia-55  nvidia-57  nvidia-59  nvidia-61

So which type should we use?

Use the mdevctl types command to scan the mdev_supported_types directories and list the NVIDIA Virtual GPU (vGPU) device profiles
mdevctl types

This shows the names and configurations of the different vGPU profiles:

Use the mdevctl types command to scan the mdev_supported_types directories and list the NVIDIA Virtual GPU (vGPU) device profiles
0000:82:00.0
  nvidia-156
    Available instances: 12
    Device API: vfio-pci
    Name: GRID P40-2B
    Description: num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=12
  nvidia-215
    Available instances: 12
    Device API: vfio-pci
    Name: GRID P40-2B4
    Description: num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=12
  nvidia-241
    Available instances: 24
    Device API: vfio-pci
    Name: GRID P40-1B4
    Description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=24
  nvidia-283
    Available instances: 6
    Device API: vfio-pci
    Name: GRID P40-4C
    Description: num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=4096x2160, max_instance=6
  nvidia-284
    Available instances: 4
    Device API: vfio-pci
    Name: GRID P40-6C
    Description: num_heads=1, frl_config=60, framebuffer=6144M, max_resolution=4096x2160, max_instance=4
  nvidia-285
    Available instances: 3
    Device API: vfio-pci
    Name: GRID P40-8C
    Description: num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=4096x2160, max_instance=3
  nvidia-286
    Available instances: 2
    Device API: vfio-pci
    Name: GRID P40-12C
    Description: num_heads=1, frl_config=60, framebuffer=12288M, max_resolution=4096x2160, max_instance=2
  nvidia-287
    Available instances: 1
    Device API: vfio-pci
    Name: GRID P40-24C
    Description: num_heads=1, frl_config=60, framebuffer=24576M, max_resolution=4096x2160, max_instance=1
  nvidia-46
    Available instances: 24
    Device API: vfio-pci
    Name: GRID P40-1Q
    Description: num_heads=4, frl_config=60, framebuffer=1024M, max_resolution=5120x2880, max_instance=24
  nvidia-47
    Available instances: 12
    Device API: vfio-pci
    Name: GRID P40-2Q
    Description: num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=12
  nvidia-48
    Available instances: 8
    Device API: vfio-pci
    Name: GRID P40-3Q
    Description: num_heads=4, frl_config=60, framebuffer=3072M, max_resolution=7680x4320, max_instance=8
  nvidia-49
    Available instances: 6
    Device API: vfio-pci
    Name: GRID P40-4Q
    Description: num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=7680x4320, max_instance=6
  nvidia-50
    Available instances: 4
    Device API: vfio-pci
    Name: GRID P40-6Q
    Description: num_heads=4, frl_config=60, framebuffer=6144M, max_resolution=7680x4320, max_instance=4
  nvidia-51
    Available instances: 3
    Device API: vfio-pci
    Name: GRID P40-8Q
    Description: num_heads=4, frl_config=60, framebuffer=8192M, max_resolution=7680x4320, max_instance=3
  nvidia-52
    Available instances: 2
    Device API: vfio-pci
    Name: GRID P40-12Q
    Description: num_heads=4, frl_config=60, framebuffer=12288M, max_resolution=7680x4320, max_instance=2
  nvidia-53
    Available instances: 1
    Device API: vfio-pci
    Name: GRID P40-24Q
    Description: num_heads=4, frl_config=60, framebuffer=24576M, max_resolution=7680x4320, max_instance=1
  nvidia-54
    Available instances: 24
    Device API: vfio-pci
    Name: GRID P40-1A
    Description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=24
  nvidia-55
    Available instances: 12
    Device API: vfio-pci
    Name: GRID P40-2A
    Description: num_heads=1, frl_config=60, framebuffer=2048M, max_resolution=1280x1024, max_instance=12
  nvidia-56
    Available instances: 8
    Device API: vfio-pci
    Name: GRID P40-3A
    Description: num_heads=1, frl_config=60, framebuffer=3072M, max_resolution=1280x1024, max_instance=8
  nvidia-57
    Available instances: 6
    Device API: vfio-pci
    Name: GRID P40-4A
    Description: num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=1280x1024, max_instance=6
  nvidia-58
    Available instances: 4
    Device API: vfio-pci
    Name: GRID P40-6A
    Description: num_heads=1, frl_config=60, framebuffer=6144M, max_resolution=1280x1024, max_instance=4
  nvidia-59
    Available instances: 3
    Device API: vfio-pci
    Name: GRID P40-8A
    Description: num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=1280x1024, max_instance=3
  nvidia-60
    Available instances: 2
    Device API: vfio-pci
    Name: GRID P40-12A
    Description: num_heads=1, frl_config=60, framebuffer=12288M, max_resolution=1280x1024, max_instance=2
  nvidia-61
    Available instances: 1
    Device API: vfio-pci
    Name: GRID P40-24A
    Description: num_heads=1, frl_config=60, framebuffer=24576M, max_resolution=1280x1024, max_instance=1
  nvidia-62
    Available instances: 24
    Device API: vfio-pci
    Name: GRID P40-1B
    Description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=24
vGPU type vs. recommended use

  Type   Recommended use
  A      Virtual Applications (vApps)
  B      Virtual Desktops (vPC)
  C      AI / machine learning / training (vCS or vWS)
  Q      Virtual Workstations (vWS)

I have two allocation plans in mind:

Here I look up the 6 GB framebuffer profile P40-6Q, which I plan to use for Microsoft Flight Simulator:

Look up the type name under the physical GPU's mdev_supported_types directory
# P40-6Q is the GRID name of the 6GB-framebuffer profile found in the mdevctl types output
# Name: GRID P40-6Q

domain=0000
bus=82
slot=00
function=0

cd /sys/class/mdev_bus/${domain}\:${bus}\:${slot}.${function}/mdev_supported_types/
grep -l P40-6Q nvidia-*/name

The output shows:

Look up the type name under the physical GPU's mdev_supported_types directory
nvidia-50/name

Check how many instances this vGPU type can provide:

Check the number of instances that can be created for this vGPU type
cat nvidia-50/available_instances

The output is:

4

Note

The available_instances value decreases as vGPUs are allocated. For example, the Nvidia Tesla P10 can host four 6 GB instances; each time a P40-6Q mdev device is created, available_instances drops by 1 until it reaches 0. A quick way to see this for every profile at once is sketched below.
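A small sketch (assuming the 0000:82:00.0 parent address from above) that prints every profile together with its remaining available_instances:

cd /sys/class/mdev_bus/0000:82:00.0/mdev_supported_types/
# print type id, profile name and remaining instance count for each supported type
for t in nvidia-*; do
    printf '%-12s %-14s %s\n' "${t}" "$(cat ${t}/name)" "$(cat ${t}/available_instances)"
done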

  • A vGPU device is created by writing a random UUID into the create file of the chosen type directory:

Create a vGPU device
UUID=`uuidgen`
echo "$UUID" > nvidia-284/create

Now check the mdev devices:

Check the vGPU devices
# ls -lh /sys/bus/mdev/devices/
lrwxrwxrwx 1 root root 0 Jun 10 14:33 e991023e-0f0e-484a-8763-df6b6874b82e -> ../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/e991023e-0f0e-484a-8763-df6b6874b82e

A new symlink for the vGPU mediated device has appeared under /sys/bus/mdev/devices/.

Repeat this three more times to create four vGPU instances in total (or use a loop as sketched below).
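Instead of repeating the echo by hand, the whole set of four instances can be created in one loop; a sketch, assuming the same nvidia-284 profile used in the command above:

cd /sys/class/mdev_bus/0000:82:00.0/mdev_supported_types/
# write one random UUID into the create file per desired instance (4 in total)
for i in 1 2 3 4; do
    uuidgen > nvidia-284/create
done
ls -lh /sys/bus/mdev/devices/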

Check the vGPU (mdevctl) instances:

List all mdev devices
mdevctl list

The output shows that there are now four vGPUs:

List all mdev devices
e991023e-0f0e-484a-8763-df6b6874b82e 0000:82:00.0 nvidia-284
23501256-ff15-439a-98b1-e4f6d01e459f 0000:82:00.0 nvidia-284
58fe7cf4-e9de-41f4-ae4b-c424a2a81193 0000:82:00.0 nvidia-284
e19fa267-ff3a-4ce8-bcf6-6ae402871085 0000:82:00.0 nvidia-284

Managing mdev devices with mdevctl (create and destroy)

The manual method above requires creating and inspecting files in the /sys filesystem, which is tedious. The mdevctl tool provides a complete workflow for creating, inspecting and deleting vGPU devices. Here I redo the steps above, this time with mdevctl, which is much more convenient.

  • First, as before, use mdevctl types to list all the types the GPU supports and pick a suitable profile (not repeated here). From that list we can choose the profiles we need; for example I chose P40-12C and P40-6Q, which correspond to nvidia-286 and nvidia-50 respectively.

  • Earlier, following the official documentation, I used virsh nodedev-dumpxml to obtain the GPU's full configuration (domain, bus, slot and function). There is actually a simpler way: nvidia-smi can provide the same information directly, with a small transformation. Its output contains a Bus-Id of 00000000:82:00.0; dropping the leading four zeros gives the 0000:82:00.0 we actually want, as sketched below.
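A sketch of that shortcut using the nvidia-smi query interface; stripping the leading 0000 from the reported Bus-Id yields the parent address that mdevctl expects:

# e.g. 00000000:82:00.0 -> 0000:82:00.0
bus_id=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)
parent=${bus_id#0000}
echo "${parent}"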

  • Generate four random UUIDs:

    uuid -n 4
    

This prints four random UUIDs:

334852fe-079b-11ee-9fc7-77463608f467
3348556a-079b-11ee-9fc8-7fb0c612aedd
334855e2-079b-11ee-9fc9-83e0dccb6713
33485650-079b-11ee-9fca-8f6415d2734c

These will be used as the mdev device identifiers.
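The uuid tool comes from the separate uuid package; if it is not installed, uuidgen (from util-linux) in a loop gives the same result (a sketch):

for i in $(seq 4); do uuidgen; done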

  • Create the vGPU devices from the chosen profile with the mdevctl start command:

    mdevctl start -u 334852fe-079b-11ee-9fc7-77463608f467 -p 0000:82:00.0 -t nvidia-50
    mdevctl start -u 3348556a-079b-11ee-9fc8-7fb0c612aedd -p 0000:82:00.0 -t nvidia-50
    mdevctl start -u 334855e2-079b-11ee-9fc9-83e0dccb6713 -p 0000:82:00.0 -t nvidia-50
    mdevctl start -u 33485650-079b-11ee-9fca-8f6415d2734c -p 0000:82:00.0 -t nvidia-50
    
  • Running mdevctl list now shows the four vGPU devices:

    334855e2-079b-11ee-9fc9-83e0dccb6713 0000:82:00.0 nvidia-50
    33485650-079b-11ee-9fca-8f6415d2734c 0000:82:00.0 nvidia-50
    334852fe-079b-11ee-9fc7-77463608f467 0000:82:00.0 nvidia-50
    3348556a-079b-11ee-9fc8-7fb0c612aedd 0000:82:00.0 nvidia-50
    
  • To persist a device definition, simply use mdevctl define -a -u UUID, for example:

    mdevctl define -a -u 334855e2-079b-11ee-9fc9-83e0dccb6713
    mdevctl define -a -u 33485650-079b-11ee-9fca-8f6415d2734c
    mdevctl define -a -u 334852fe-079b-11ee-9fc7-77463608f467
    mdevctl define -a -u 3348556a-079b-11ee-9fc8-7fb0c612aedd
    

OK, it is that simple. The persisted definitions can be verified as sketched below.
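To confirm what has been persisted, mdevctl can list the defined (as opposed to merely running) devices; a sketch, using the -d/--defined flag:

# active devices:  mdevctl list
# defined devices: mdevctl list -d
mdevctl list -d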

  • Removing a vGPU device is just as simple with mdevctl stop -u UUID, for example:

    mdevctl stop -u 334855e2-079b-11ee-9fc9-83e0dccb6713
    

Adding vGPU devices to a virtual machine (failed)

Note

For this section I followed the SUSE documentation, but starting the VM failed, so I switched to the official NVIDIA manual NVIDIA Docs Hub > NVIDIA AI Enterprise > Red Hat Enterprise Linux with KVM Deployment Guide > Setting Up NVIDIA vGPU Devices; see the next section.

  • Get the GPU device's full virsh configuration (already done above):

Use the full GPU identifier with virsh nodedev-dumpxml to get the complete GPU configuration (domain, bus, slot and function)
virsh nodedev-dumpxml pci_0000_82_00_0 | egrep 'domain|bus|slot|function'

Already obtained earlier:

Use the full GPU identifier with virsh nodedev-dumpxml to get the complete GPU configuration (domain, bus, slot and function)
    <domain>0</domain>
    <bus>130</bus>
    <slot>0</slot>
    <function>0</function>
      <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>

So the configuration for the four vGPU devices we have now assembled is as follows:

Configuration for the four vGPUs, to be added to the VM that will use them
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='334855e2-079b-11ee-9fc9-83e0dccb6713'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='33485650-079b-11ee-9fca-8f6415d2734c'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x82' slot='0x00' function='0x1'/>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='334852fe-079b-11ee-9fc7-77463608f467'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x82' slot='0x00' function='0x2'/>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='3348556a-079b-11ee-9fc8-7fb0c612aedd'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x82' slot='0x00' function='0x3'/>
</hostdev>

Note

For the Q series (virtual workstations), set display='on'; for the C series (machine learning), set display='off'.

I ran into two errors here:

  • error: XML error: Attempted double use of PCI Address 0000:82:00.0 : caused by giving every vGPU device the same PCI address:

    <address type='pci' domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
    

After some trial and error, setting function= to a different value for each device resolves this.

  • error: unsupported configuration: graphics device is needed for attribute value 'display=on' in <hostdev> : I hit this when setting 'display=on' for the Q series; for now I changed it to 'display=off'.

After resolving those two errors I started the y-k8s-n-1 VM (with the four vGPUs above attached), and it failed with:

Error when starting the VM with four vGPUs attached
error: Failed to start domain 'y-k8s-n-1'
error: internal error: qemu unexpectedly closed the monitor: 2023-06-10T15:22:45.243840Z qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/334855e2-079b-11ee-9fc9-83e0dccb6713,display=off,bus=pci.130,multifunction=on,addr=0x0: warning: vfio 334855e2-079b-11ee-9fc9-83e0dccb6713: Could not enable error recovery for the device
2023-06-10T15:22:45.272339Z qemu-system-x86_64: -device vfio-pci,id=hostdev1,sysfsdev=/sys/bus/mdev/devices/33485650-079b-11ee-9fca-8f6415d2734c,display=off,bus=pci.130,addr=0x0.0x1: vfio 33485650-079b-11ee-9fca-8f6415d2734c: error getting device from group 126: Input/output error
Verify all devices in group 126 are bound to vfio-<bus> or pci-stub and not already in use

Adding vGPU devices to a virtual machine (partially successful)

Warning

There is one problem I have not yet solved: with one GPU split into four vGPUs, mdevctl starts the devices without any issue, but when several vGPUs are added to the same VM, adding them reports no error while starting the VM does.

Adding only a single vGPU to a VM, however, works fine.

  • Use virsh nodedev-dumpxml to print the full details of the mdev-capable device (essentially the XML counterpart of the mdevctl list output):

Get the PCI device's mdev XML configuration with virsh nodedev-dumpxml
virsh nodedev-dumpxml pci_0000_82_00_0

This prints the full output (earlier we filtered it):

Get the PCI device's mdev XML configuration with virsh nodedev-dumpxml
<device>
  <name>pci_0000_82_00_0</name>
  <path>/sys/devices/pci0000:80/0000:80:02.0/0000:82:00.0</path>
  <parent>pci_0000_80_02_0</parent>
  <driver>
    <name>nvidia</name>
  </driver>
  <capability type='pci'>
    <class>0x030200</class>
    <domain>0</domain>
    <bus>130</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1b39'>GP102GL [Tesla P10]</product>
    <vendor id='0x10de'>NVIDIA Corporation</vendor>
    <capability type='mdev_types'>
      <type id='nvidia-241'>
        <name>GRID P40-1B4</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-58'>
        <name>GRID P40-6A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-48'>
        <name>GRID P40-3Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-286'>
        <name>GRID P40-12C</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-56'>
        <name>GRID P40-3A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-46'>
        <name>GRID P40-1Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-284'>
        <name>GRID P40-6C</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-54'>
        <name>GRID P40-1A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-62'>
        <name>GRID P40-1B</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-52'>
        <name>GRID P40-12Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-60'>
        <name>GRID P40-12A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-50'>
        <name>GRID P40-6Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-156'>
        <name>GRID P40-2B</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-59'>
        <name>GRID P40-8A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-49'>
        <name>GRID P40-4Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-287'>
        <name>GRID P40-24C</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-57'>
        <name>GRID P40-4A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-47'>
        <name>GRID P40-2Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-285'>
        <name>GRID P40-8C</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-55'>
        <name>GRID P40-2A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-283'>
        <name>GRID P40-4C</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-53'>
        <name>GRID P40-24Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-215'>
        <name>GRID P40-2B4</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-61'>
        <name>GRID P40-24A</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
      <type id='nvidia-51'>
        <name>GRID P40-8Q</name>
        <deviceAPI>vfio-pci</deviceAPI>
        <availableInstances>0</availableInstances>
      </type>
    </capability>
    <iommuGroup number='80'>
      <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='1'/>
    <pci-express>
      <link validity='cap' port='0' speed='8' width='16'/>
      <link validity='sta' speed='2.5' width='16'/>
    </pci-express>
  </capability>
</device>

Here <iommuGroup> identifies a set of devices which, via the IOMMU and the PCI bus topology, are isolated from all other devices; the host drivers must not use these devices (they must be unbound) before they can be assigned to a guest VM.
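The members of an IOMMU group can be listed directly from sysfs; a sketch for group 80, the group number that lspci reported for this GPU earlier:

# every device node in IOMMU group 80 must be assigned (or left unused) together
ls -lh /sys/kernel/iommu_groups/80/devices/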

  • According to the mdevctl list output:

List all mdev devices
e991023e-0f0e-484a-8763-df6b6874b82e 0000:82:00.0 nvidia-284
23501256-ff15-439a-98b1-e4f6d01e459f 0000:82:00.0 nvidia-284
58fe7cf4-e9de-41f4-ae4b-c424a2a81193 0000:82:00.0 nvidia-284
e19fa267-ff3a-4ce8-bcf6-6ae402871085 0000:82:00.0 nvidia-284

Create the following files, vgpu_1.yaml through vgpu_4.yaml, one for each vGPU:

First vGPU device
<device>
    <parent>pci_0000_82_00_0</parent>
    <capability type="mdev">
        <type id="nvidia-50"/>
        <uuid>334855e2-079b-11ee-9fc9-83e0dccb6713</uuid>
    </capability>
</device>
Second vGPU device
<device>
    <parent>pci_0000_82_00_0</parent>
    <capability type="mdev">
        <type id="nvidia-50"/>
        <uuid>33485650-079b-11ee-9fca-8f6415d2734c</uuid>
    </capability>
</device>
Third vGPU device
<device>
    <parent>pci_0000_82_00_0</parent>
    <capability type="mdev">
        <type id="nvidia-50"/>
        <uuid>334852fe-079b-11ee-9fc7-77463608f467</uuid>
    </capability>
</device>
Fourth vGPU device
<device>
    <parent>pci_0000_82_00_0</parent>
    <capability type="mdev">
        <type id="nvidia-50"/>
        <uuid>3348556a-079b-11ee-9fc8-7fb0c612aedd</uuid>
    </capability>
</device>
  • Define the first vGPU device:

Define the first vGPU device
virsh nodedev-define vgpu_1.yaml

The output is:

Output of defining the first vGPU device
Node device 'mdev_334855e2_079b_11ee_9fc9_83e0dccb6713_0000_82_00_0' defined from 'vgpu_1.yaml'

Then define the second, third and fourth vGPUs:

Define the second, third and fourth vGPU devices
virsh nodedev-define vgpu_2.yaml
virsh nodedev-define vgpu_3.yaml
virsh nodedev-define vgpu_4.yaml
  • Check the mediated devices that are already active:

Show all active mdev devices (add --inactive to show inactive ones)
virsh nodedev-list --cap mdev
  • Set the vGPU devices to start automatically:

Set the vGPU devices to start automatically when the host boots
virsh nodedev-autostart mdev_334852fe_079b_11ee_9fc7_77463608f467_0000_82_00_0
virsh nodedev-autostart mdev_3348556a_079b_11ee_9fc8_7fb0c612aedd_0000_82_00_0
virsh nodedev-autostart mdev_334855e2_079b_11ee_9fc9_83e0dccb6713_0000_82_00_0
virsh nodedev-autostart mdev_33485650_079b_11ee_9fca_8f6415d2734c_0000_82_00_0
  • Add the vGPU devices to the VM:

Add the mdev devices to the virtual machine
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='334855e2-079b-11ee-9fc9-83e0dccb6713'/>
    </source>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='33485650-079b-11ee-9fca-8f6415d2734c'/>
    </source>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='334852fe-079b-11ee-9fc7-77463608f467'/>
    </source>
</hostdev>
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='3348556a-079b-11ee-9fc8-7fb0c612aedd'/>
    </source>
</hostdev>

Note that no detailed PCI address is specified here.

Unfortunately, the error is the same as before:

Starting the VM with four vGPUs attached still fails
error: Failed to start domain 'y-k8s-n-1'
error: internal error: qemu unexpectedly closed the monitor: 2023-06-10T16:13:50.247914Z qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/334855e2-079b-11ee-9fc9-83e0dccb6713,display=off,bus=pci.7,addr=0x0: warning: vfio 334855e2-079b-11ee-9fc9-83e0dccb6713: Could not enable error recovery for the device
2023-06-10T16:13:50.272484Z qemu-system-x86_64: -device vfio-pci,id=hostdev1,sysfsdev=/sys/bus/mdev/devices/33485650-079b-11ee-9fca-8f6415d2734c,display=off,bus=pci.8,addr=0x0: vfio 33485650-079b-11ee-9fca-8f6415d2734c: error getting device from group 126: Input/output error
Verify all devices in group 126 are bound to vfio-<bus> or pci-stub and not already in use

Check the system log with dmesg -T:

[Sun Jun 11 00:13:50 2023] [nvidia-vgpu-vfio] 334855e2-079b-11ee-9fc9-83e0dccb6713: vGPU migration disabled
[Sun Jun 11 00:13:50 2023] [nvidia-vgpu-vfio] 33485650-079b-11ee-9fca-8f6415d2734c: start failed. status: 0x0

You can see that the second vGPU already failed at start.

libvirt automatically assigned domain, bus, slot and function to the added vGPUs
     <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
       <source>
         <address uuid='334855e2-079b-11ee-9fc9-83e0dccb6713'/>
       </source>
       <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
     </hostdev>
     <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
       <source>
         <address uuid='33485650-079b-11ee-9fca-8f6415d2734c'/>
       </source>
       <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
     </hostdev>
     <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
       <source>
         <address uuid='334852fe-079b-11ee-9fc7-77463608f467'/>
       </source>
       <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
     </hostdev>
     <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
       <source>
         <address uuid='3348556a-079b-11ee-9fc8-7fb0c612aedd'/>
       </source>
       <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
     </hostdev>

But it looks like this approach still runs into a PCI device conflict.

Note

Following the NVIDIA documentation, even with the simplified configuration libvirt expands it into the configuration above, and the startup error remains unresolved.

But one vGPU per virtual machine works

Since y-k8s-n-1 fails to start with multiple vGPUs attached (in fact it is the second vGPU that triggers the "vfio-<bus> or pci-stub already in use" error), will attaching just a single vGPU to a VM work?

  • Revise y-k8s-n-1 again, adding only one section (a single vGPU):

Add a single vGPU to the y-k8s-n-1 VM
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='334855e2-079b-11ee-9fc9-83e0dccb6713'/>
    </source>
</hostdev>

Sure enough, this time virsh start y-k8s-n-1 works.

Since one vGPU per VM works, does adding the second vGPU to another VM also work? The answer: yes, it does.

  • Revise y-k8s-n-2 and add the second vGPU:

Add another vGPU (the second one) to the y-k8s-n-2 VM
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
    <source>
        <address uuid='33485650-079b-11ee-9fca-8f6415d2734c'/>
    </source>
</hostdev>

The second VM also starts normally.
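As an alternative to editing the domain XML by hand, the same <hostdev> fragment can be attached with virsh attach-device; a sketch, where vgpu-hostdev.xml is a hypothetical file containing just the fragment above:

# persist the hostdev fragment into the y-k8s-n-2 domain definition
virsh attach-device y-k8s-n-2 vgpu-hostdev.xml --config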

  • nvidia-smi now shows two vGPUs running:

nvidia-smi output after starting two VMs, each with one vGPU attached
Sun Jun 11 22:41:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03    Driver Version: 510.85.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   41C    P8    18W / 150W |  11474MiB / 23040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11656    C+G   vgpu                             5712MiB |
|    0   N/A  N/A     13194    C+G   vgpu                             5712MiB |
+-----------------------------------------------------------------------------+

Here you can see that 11474MiB (roughly 12 GB) of the physical GPU's 23040MiB framebuffer is in use, and there are two GPU processes, both named vgpu.

  • nvidia-smi vgpu now shows the detailed vGPU information:

nvidia-smi vgpu output after starting two VMs, each with one vGPU attached
Sun Jun 11 22:41:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03              Driver Version: 510.85.03                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA Graphics Device     | 00000000:82:00.0             |   0%       |
|      3251634243  GRID P40-6Q    | 10a1...  y-k8s-n-1           |      0%    |
|      3251634251  GRID P40-6Q    | b102...  y-k8s-n-2           |      0%    |
+---------------------------------+------------------------------+------------+

Here you can see the two VMs y-k8s-n-1 and y-k8s-n-2 each holding one GRID P40-6Q NVIDIA display device, i.e. two vGPUs in total.

Note

In other words, up to this point creating vGPUs and handing them out one per VM works, and they can be added to VMs; what remains unsolved is how to use multiple vGPUs in a single VM.

Another attempt at adding multiple vGPUs to one VM (success, with a caveat)

Please ensure all devices within the iommu_group are bound to their vfio bus driver Error mentions a detail that reminded me of an earlier exercise, passing through a GPU and NVMe storage with OVMF, where the PCIe passthrough configuration listed a technical requirement:

  • An IOMMU group is the smallest set of physical devices that can be passed through to a virtual machine.

  • An entire IOMMU group must be handed to a single VM as a whole. (A sketch for checking which group each mdev device belongs to follows below.)
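To see which IOMMU group each mdev device actually landed in, the iommu_group symlink under /sys/bus/mdev/devices can be read; a sketch:

# print "<mdev uuid> -> iommu group <N>" for every mediated device
for d in /sys/bus/mdev/devices/*; do
    printf '%s -> iommu group %s\n' "$(basename "${d}")" "$(basename "$(readlink "${d}"/iommu_group)")"
done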

I found Error when allocating multiple vGPUs in a single VM with Ubuntu KVM hypervisor, which describes exactly my situation, but the thread did not resolve it.

Earlier I created four vGPUs with mdevctl, which shows up in the system log:

dmesg contains IOMMU records showing the vGPU (mdev) devices being added
dmesg | grep -i -e DMAR -e IOMMU

The Adding to iommu group 123, Removing from iommu group 123 and then Adding to iommu group 123 entries at the end of the output are the traces of my earlier creating, deleting and re-creating the mdev devices.

dmesg contains IOMMU records showing the vGPU (mdev) devices being added
[Fri Jun  9 23:43:37 2023] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-73-generic root=UUID=caa4193b-9222-49fe-a4b3-89f1cb417e6a ro intel_iommu=on iommu=pt vfio-pci.ids=144d:a80a intel_pstate=enable processor.max_cstate=1 intel_idle.max_cstate=1 rd.driver.blacklist=nouveau,rivafb,nvidiafb,rivatv
[Fri Jun  9 23:43:37 2023] ACPI: DMAR 0x000000007B7E7000 000294 (v01 HP     ProLiant 00000001 HP   00000001)
[Fri Jun  9 23:43:37 2023] ACPI: Reserving DMAR table memory at [mem 0x7b7e7000-0x7b7e7293]
[Fri Jun  9 23:43:38 2023] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-73-generic root=UUID=caa4193b-9222-49fe-a4b3-89f1cb417e6a ro intel_iommu=on iommu=pt vfio-pci.ids=144d:a80a intel_pstate=enable processor.max_cstate=1 intel_idle.max_cstate=1 rd.driver.blacklist=nouveau,rivafb,nvidiafb,rivatv
[Fri Jun  9 23:43:38 2023] DMAR: IOMMU enabled
[Fri Jun  9 23:43:39 2023] DMAR: Host address width 46
[Fri Jun  9 23:43:39 2023] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[Fri Jun  9 23:43:39 2023] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[Fri Jun  9 23:43:39 2023] DMAR: DRHD base: 0x000000c7ffc000 flags: 0x1
[Fri Jun  9 23:43:39 2023] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[Fri Jun  9 23:43:39 2023] DMAR: RMRR base: 0x00000079174000 end: 0x00000079176fff
[Fri Jun  9 23:43:39 2023] DMAR: RMRR base: 0x000000791f4000 end: 0x000000791f7fff
[Fri Jun  9 23:43:39 2023] DMAR: RMRR base: 0x000000791de000 end: 0x000000791f3fff
[Fri Jun  9 23:43:39 2023] DMAR: RMRR base: 0x000000791cb000 end: 0x000000791dbfff
[Fri Jun  9 23:43:39 2023] DMAR: RMRR base: 0x000000791dc000 end: 0x000000791ddfff
[Fri Jun  9 23:43:39 2023] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[Fri Jun  9 23:43:39 2023] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[Fri Jun  9 23:43:39 2023] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1
[Fri Jun  9 23:43:39 2023] DMAR-IR: HPET id 0 under DRHD base 0xc7ffc000
[Fri Jun  9 23:43:39 2023] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[Fri Jun  9 23:43:39 2023] DMAR-IR: Enabled IRQ remapping in x2apic mode
[Fri Jun  9 23:43:40 2023] iommu: Default domain type: Passthrough (set via kernel command line)
[Fri Jun  9 23:43:40 2023] DMAR: No ATSR found
[Fri Jun  9 23:43:40 2023] DMAR: No SATC found
[Fri Jun  9 23:43:40 2023] DMAR: dmar0: Using Queued invalidation
[Fri Jun  9 23:43:40 2023] DMAR: dmar1: Using Queued invalidation
[Fri Jun  9 23:43:40 2023] pci 0000:00:00.0: Adding to iommu group 0
[Fri Jun  9 23:43:40 2023] pci 0000:00:01.0: Adding to iommu group 1
[Fri Jun  9 23:43:40 2023] pci 0000:00:01.1: Adding to iommu group 2
[Fri Jun  9 23:43:40 2023] pci 0000:00:02.0: Adding to iommu group 3
...
[Fri Jun  9 23:43:40 2023] pci 0000:ff:1f.2: Adding to iommu group 94
[Fri Jun  9 23:43:40 2023] DMAR: Intel(R) Virtualization Technology for Directed I/O
[Fri Jun  9 23:43:44 2023] pci 0000:04:10.0: Adding to iommu group 95
[Fri Jun  9 23:43:44 2023] pci 0000:04:10.4: Adding to iommu group 96
...
[Fri Jun  9 23:43:44 2023] pci 0000:04:13.2: Adding to iommu group 120
[Fri Jun  9 23:43:44 2023] pci 0000:04:13.1: Adding to iommu group 121
[Fri Jun  9 23:43:44 2023] pci 0000:04:13.3: Adding to iommu group 122
[Sat Jun 10 14:33:53 2023] vfio_mdev e991023e-0f0e-484a-8763-df6b6874b82e: Adding to iommu group 123
[Sat Jun 10 14:39:16 2023] vfio_mdev 58fe7cf4-e9de-41f4-ae4b-c424a2a81193: Adding to iommu group 124
[Sat Jun 10 14:39:25 2023] vfio_mdev e19fa267-ff3a-4ce8-bcf6-6ae402871085: Adding to iommu group 125
[Sat Jun 10 14:39:27 2023] vfio_mdev 23501256-ff15-439a-98b1-e4f6d01e459f: Adding to iommu group 126
[Sat Jun 10 20:47:23 2023] vfio_mdev e19fa267-ff3a-4ce8-bcf6-6ae402871085: Removing from iommu group 125
[Sat Jun 10 20:47:23 2023] vfio_mdev e19fa267-ff3a-4ce8-bcf6-6ae402871085: MDEV: detaching iommu
[Sat Jun 10 22:28:09 2023] vfio_mdev 58fe7cf4-e9de-41f4-ae4b-c424a2a81193: Removing from iommu group 124
[Sat Jun 10 22:28:09 2023] vfio_mdev 58fe7cf4-e9de-41f4-ae4b-c424a2a81193: MDEV: detaching iommu
[Sat Jun 10 22:28:17 2023] vfio_mdev 23501256-ff15-439a-98b1-e4f6d01e459f: Removing from iommu group 126
[Sat Jun 10 22:28:17 2023] vfio_mdev 23501256-ff15-439a-98b1-e4f6d01e459f: MDEV: detaching iommu
[Sat Jun 10 22:28:23 2023] vfio_mdev e991023e-0f0e-484a-8763-df6b6874b82e: Removing from iommu group 123
[Sat Jun 10 22:28:23 2023] vfio_mdev e991023e-0f0e-484a-8763-df6b6874b82e: MDEV: detaching iommu
[Sat Jun 10 22:33:39 2023] vfio_mdev 334852fe-079b-11ee-9fc7-77463608f467: Adding to iommu group 123
[Sat Jun 10 22:33:46 2023] vfio_mdev 3348556a-079b-11ee-9fc8-7fb0c612aedd: Adding to iommu group 124
[Sat Jun 10 22:33:52 2023] vfio_mdev 334855e2-079b-11ee-9fc9-83e0dccb6713: Adding to iommu group 125
[Sat Jun 10 22:33:59 2023] vfio_mdev 33485650-079b-11ee-9fca-8f6415d2734c: Adding to iommu group 126

The groups added here, group 123 through group 126, are the IOMMU groups of the four vGPU devices.

The corresponding entries can be found in the kernel's sysfs:

Inspect the devices in each iommu_group with ls
for iommu_group in {123..126};do ls -lh /sys/kernel/iommu_groups/${iommu_group}/devices/ | grep -v total;done

You can see that in the kernel all of these vGPU devices live under /sys/devices/pci0000:80/0000:80:02.0/0000:82:00.0/:

vGPU device details
lrwxrwxrwx 1 root root 0 Jun 15 08:53 334852fe-079b-11ee-9fc7-77463608f467 -> ../../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/334852fe-079b-11ee-9fc7-77463608f467
lrwxrwxrwx 1 root root 0 Jun 15 08:55 3348556a-079b-11ee-9fc8-7fb0c612aedd -> ../../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/3348556a-079b-11ee-9fc8-7fb0c612aedd
lrwxrwxrwx 1 root root 0 Jun 15 08:55 334855e2-079b-11ee-9fc9-83e0dccb6713 -> ../../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/334855e2-079b-11ee-9fc9-83e0dccb6713
lrwxrwxrwx 1 root root 0 Jun 14 02:16 33485650-079b-11ee-9fca-8f6415d2734c -> ../../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/33485650-079b-11ee-9fca-8f6415d2734c

I noticed that the systemctl status nvidia-vgpu-mgr output shown in the official Ubuntu documentation, Virtualisation with QEMU, differs from mine. That document gives an example of the log entries produced when a guest gets a vGPU passed through (showing the vGPU working):

Log example from the Ubuntu documentation of a vGPU being passed through successfully
$ systemctl status nvidia-vgpu-mgr
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2021-09-14 07:30:19 UTC; 3min 58s ago
    Process: 1559 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 1564 (nvidia-vgpu-mgr)
      Tasks: 1 (limit: 309020)
     Memory: 1.1M
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             └─1564 /usr/bin/nvidia-vgpu-mgr

Sep 14 07:30:19 node-watt systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Sep 14 07:30:19 node-watt systemd[1]: Started NVIDIA vGPU Manager Daemon.
Sep 14 07:30:20 node-watt nvidia-vgpu-mgr[1564]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started

# Entries when a guest gets a vGPU passed
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): gpu-pci-id : 0x4100
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): Framebuffer: 0x1dc000000
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1db4:0x1252
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: ######## vGPU Manager Information: ########
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: Driver Version: 470.68
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0xb0001)
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: (0x0): vGPU migration enabled
Sep 14 08:29:50 node-watt nvidia-vgpu-mgr[2866]: notice: vmiop_log: display_init inst: 0 successful

# Entries when a guest grabs a license
Sep 15 06:55:50 node-watt nvidia-vgpu-mgr[4260]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Sep 15 06:55:52 node-watt nvidia-vgpu-mgr[4260]: notice: vmiop_log: (0x0): vGPU license state: Licensed

# In the guest the card is then fully recognized and enabled
$ nvidia-smi -a | grep -A 2 "Licensed Product"
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Licensed

Checking the nvidia-vgpu-mgr log on my host, the service that previously started cleanly now shows error messages:

nvidia-vgpu-mgr error log on the host when adding vGPUs
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2023-06-10 00:13:12 CST; 5 days ago
    Process: 3760 ExecStart=/opt/vgpu_unlock/vgpu_unlock /usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 3764 (vgpu_unlock)
      Tasks: 11 (limit: 464054)
     Memory: 41.0M
        CPU: 5min 18.731s
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             ├─3764 /bin/python3 /opt/vgpu_unlock/vgpu_unlock -f /usr/bin/nvidia-vgpu-mgr
             └─3784 /usr/bin/nvidia-vgpu-mgr

Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Thread for engine 0x0 could not join with error 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x0. Error: 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Thread for engine 0x4 could not join with error 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x4. Error: 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Thread for engine 0x5 could not join with error 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x5. Error: 0x5
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_log: display_init failed for inst: 1
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_env_log: (0x1): vmiope_process_configuration failed with 0x1f
Jun 14 11:03:21 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: error: vmiop_env_log: (0x1): plugin_initialize failed  with error:0x1f
Jun 14 11:03:25 zcloud.staging.huatai.me nvidia-vgpu-mgr[26737]: notice: vmiop_log: (0x0): Srubbing completed but notification missed

It occurred to me that the Guest GRID driver package was not yet installed in my VM, and the VM was not configured to reach a License Server ( 安装NVIDIA license服务器 ). Could that be why the second vGPU could not be added?

On second thought, no: the vfio device errors appear while the VM is being initialized at startup, before the guest OS has even booted, so the Guest GRID software inside the guest cannot be a factor yet. Headache...
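
To confirm that timing, it helps to follow the vGPU manager log live while starting the VM and see exactly when the errors appear. A minimal sketch, assuming the domain is managed by libvirt under the name y-k8s-n-1 as in this setup:

# Terminal 1: follow the vGPU manager log in real time
journalctl -u nvidia-vgpu-mgr -f

# Terminal 2: start the VM and note when the vmiop errors show up
virsh start y-k8s-n-1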

  • I added the 4 vGPUs to the y-k8s-n-1 VM once more; startup still failed. This time I checked the output of journalctl -u nvidia-vgpu-mgr --no-pager :

nvidia-vgpu-mgr error log when starting a VM with multiple vGPUs attached
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: Stopping NVIDIA vGPU Manager Daemon...
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Deactivated successfully.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Unit process 3784 (nvidia-vgpu-mgr) remains running after unit stopped.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: Stopped NVIDIA vGPU Manager Daemon.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Consumed 5min 18.755s CPU time.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Found left-over process 3784 (nvidia-vgpu-mgr) in control group while starting unit. Ignoring.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jun 15 09:03:58 zcloud.staging.huatai.me systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jun 15 09:03:58 zcloud.staging.huatai.me bash[30237]: vgpu_unlock loaded.
Jun 15 09:03:58 zcloud.staging.huatai.me nvidia-vgpu-mgr[30237]: vgpu_unlock loaded.
Jun 15 09:03:58 zcloud.staging.huatai.me nvidia-vgpu-mgr[30253]: vgpu_unlock loaded.
Jun 15 09:03:58 zcloud.staging.huatai.me nvidia-vgpu-mgr[30253]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 334855e2-079b-11ee-9fc9-83e0dccb6713 GPU PCI id 00:82:00.0 config params vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: Successfully updated env symbols!
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): gpu-pci-id : 0x8200
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): Framebuffer: 0x164000000
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11ec
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: Driver Version: 510.85.03
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0xd0001)
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31019]: vgpu_unlock loaded.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: vgpu_unlock loaded.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 334855e2-079b-11ee-9fc9-83e0dccb6713 GPU PCI id 00:82:00.0 config params vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_env_log: Successfully updated env symbols!
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x0): vGPU migration enabled
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: op_type: 0xa0810115 failed.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): gpu-pci-id : 0x8200
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Framebuffer: 0x164000000
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11ec
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: Driver Version: 510.85.03
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: display_init inst: 0 successful
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0xd0001)
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: (0x1): Received start call from nvidia-vgpu-vfio module: mdev uuid 33485650-079b-11ee-9fca-8f6415d2734c GPU PCI id 00:82:00.0 config params vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_env_log: (0x1): pluginconfig: vgpu_type_id=50
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x1): gpu-pci-id : 0x8200
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x1): vgpu_type : Quadro
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x1): Framebuffer: 0x164000000
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x1): Virtual Device Id: 0x1b38:0x11ec
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: (0x1): FRL Value: 60 FPS
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: ######## vGPU Manager Information: ########
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: notice: vmiop_log: Driver Version: 510.85.03
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): init_device_instance failed for inst 1 with error 1 (multiple vGPUs in a VM not supported)
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Initialization: init_device_instance failed error 1
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Thread for engine 0x0 could not join with error 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x0. Error: 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Thread for engine 0x4 could not join with error 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x4. Error: 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Thread for engine 0x5 could not join with error 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: (0x1): Failed to free thread event for engine 0x5. Error: 0x5
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_log: display_init failed for inst: 1
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_env_log: (0x1): vmiope_process_configuration failed with 0x1f
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31016]: error: vmiop_env_log: (0x1): plugin_initialize failed  with error:0x1f
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): vGPU migration enabled
Jun 15 10:04:14 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: display_init inst: 0 successful
Jun 15 10:04:18 zcloud.staging.huatai.me nvidia-vgpu-mgr[31035]: notice: vmiop_log: (0x0): Srubbing completed but notification missed

Facepalm

It turns out the error log had been spelling it out all along: multiple vGPUs in a VM not supported
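
In hindsight, filtering the journal down to the error-level vmiop messages would have surfaced this line immediately; a small sketch against the same journal as above:

# Show only the vmiop error lines from the vGPU manager log
journalctl -u nvidia-vgpu-mgr --no-pager | grep "error: vmiop"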

Hardware restrictions on multiple vGPUs in a single VM

It turns out Virtual GPU Software R525 for Ubuntu Release Notes #Multiple vGPU Support documents hardware restrictions, and they are quite strict:

  • In the NVIDIA Pascal GPU Architecture generation (my Nvidia Tesla P10 GPU运算卡 ), the Tesla P40 has only 2 vGPU profiles that support assigning multiple vGPUs to one VM: P40-24Q and P40-24C (come on NVIDIA, 24C and 24Q are simply the entire P40 card)

  • Genuinely usable multi-vGPU support only begins with the NVIDIA Volta GPU Architecture and later, which allow multiple vGPUs of the full range of Q / C profiles to be assigned to the same VM (a quick way to check which profiles a card exposes is sketched below)

Sigh. After days of fiddling, it turns out my Nvidia Tesla P10 GPU运算卡 is simply too low-end to support multiple vGPUs in a single VM. Frustrating...
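
To see which vGPU profiles a card actually exposes (and therefore whether something like P40-24Q / P40-24C is even available), the supported mdev types can be enumerated from sysfs. A minimal sketch, assuming the GPU sits at PCI address 0000:82:00.0 as earlier in this document:

# List every supported vGPU (mdev) type with its product name and free instance count
for t in /sys/class/mdev_bus/0000:82:00.0/mdev_supported_types/*; do
    printf "%s  %s  available_instances=%s\n" \
        "$(basename "$t")" "$(cat "$t"/name)" "$(cat "$t"/available_instances)"
done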

Cleaning up and starting over

I am finally done wrestling with NVIDIA Virtual GPU (vGPU) ; these notes ended up long and winding, written in fits and starts...

I decided to split the Nvidia Tesla P10 GPU运算卡 into 2 NVIDIA Virtual GPU (vGPU) instances to build GPU Kubernetes; the consolidated steps are in vGPU快速起步 . Here I first clean up the vGPU environment left behind by the repeated experiments in this document, so I can start fresh:

Cleaning up the vGPU environment
# Remove the mdev profiles (persistent configuration)
mdevctl undefine -u 334855e2-079b-11ee-9fc9-83e0dccb6713
mdevctl undefine -u 33485650-079b-11ee-9fca-8f6415d2734c
mdevctl undefine -u 334852fe-079b-11ee-9fc7-77463608f467
mdevctl undefine -u 3348556a-079b-11ee-9fc8-7fb0c612aedd

# Remove (stop) the mdev devices
mdevctl stop -u 334855e2-079b-11ee-9fc9-83e0dccb6713
mdevctl stop -u 33485650-079b-11ee-9fca-8f6415d2734c
mdevctl stop -u 334852fe-079b-11ee-9fc7-77463608f467
mdevctl stop -u 3348556a-079b-11ee-9fc8-7fb0c612aedd

# Finally, list devices to confirm everything has been cleaned up
mdevctl list
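
The same cleanup can also be scripted instead of listing each UUID by hand. A rough sketch that stops every active mdev device and undefines every persisted one, assuming mdevctl list prints the UUID in its first column:

# Stop all active mdev devices
for uuid in $(mdevctl list | awk '{print $1}'); do
    mdevctl stop -u "$uuid"
done

# Undefine all persisted mdev devices
for uuid in $(mdevctl list -d | awk '{print $1}'); do
    mdevctl undefine -u "$uuid"
done

# Both lists should now be empty
mdevctl list
mdevctl list -d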

Warning

I am currently running the dual-vGPU setup built by following vGPU快速起步 (each vGPU gets 12 GB of framebuffer)
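
For reference, a dual 12 GB layout like that could be defined roughly as follows. This is only a sketch, not the actual vGPU快速起步 procedure; it assumes the GPU at 0000:82:00.0 exposes a profile named P40-12Q and uses uuidgen to make up the mdev UUIDs:

gpu=0000:82:00.0

# Look up the mdev type id whose product name contains P40-12Q (assumed to exist on this card)
type_id=$(basename "$(dirname "$(grep -l 'P40-12Q' /sys/class/mdev_bus/$gpu/mdev_supported_types/*/name)")")
echo "P40-12Q type: $type_id"

# Define and start two vGPU mdev devices with freshly generated UUIDs
for i in 1 2; do
    uuid=$(uuidgen)
    mdevctl define -u "$uuid" -p "$gpu" --type "$type_id"
    mdevctl start -u "$uuid"
done
mdevctl list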

nvidia-smi cleanup

After the cleanup above, following vGPU快速起步 I still hit an error when starting y-k8s-n-1 :

Error when starting the newly created VM with a single vGPU after cleaning the vGPU environment; the real cause was leftover state that nvidia-smi had not cleared
error: Failed to start domain 'y-k8s-n-1'
error: internal error: qemu unexpectedly closed the monitor: 2023-06-15T07:08:55.663867Z qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/3eb9d560-0b31-11ee-91a9-bb28039c61eb,display=off,bus=pci.7,addr=0x0: vfio 3eb9d560-0b31-11ee-91a9-bb28039c61eb: error getting device from group 123: Input/output error
Verify all devices in group 123 are bound to vfio-<bus> or pci-stub and not already in use
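
When libvirt/QEMU reports "error getting device from group ...", a first sanity check is whether the mdev device named in the error still exists and which IOMMU group it belongs to. A small sketch using the UUID and group number from the message above:

# Does the mdev device from the error message actually exist, and is it active?
ls -l /sys/bus/mdev/devices/
mdevctl list

# Which IOMMU group does it sit in? (the error above mentions group 123)
readlink /sys/bus/mdev/devices/3eb9d560-0b31-11ee-91a9-bb28039c61eb/iommu_group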

At this point I noticed that nvidia-smi vgpu still listed the 2 P40-6Q instances configured earlier (that VM start had failed, but the configuration lingered):

nvidia-smi vgpu still shows the 2 leftover P40-6Q instances
Thu Jun 15 15:12:03 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03              Driver Version: 510.85.03                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA Graphics Device     | 00000000:82:00.0             |   0%       |
|      3251634329  GRID P40-6Q    | 10a1...  y-k8s-n-1           |      0%    |
|      3251634341  GRID P40-6Q    | 10a1...  y-k8s-n-1           |      0%    |
+---------------------------------+------------------------------+------------+

Moreover, nvidia-smi also still showed one P40-6Q vGPU as allocated:

nvidia-smi still shows 1 leftover P40-6Q
Thu Jun 15 15:18:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03    Driver Version: 510.85.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:82:00.0 Off |                    0 |
| N/A   39C    P8    18W / 150W |   5762MiB / 23040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     31016    C+G   vgpu                             5712MiB |
+-----------------------------------------------------------------------------+
  • I ran systemctl restart nvidia-vgpu-mgr and then checked journalctl -u nvidia-vgpu-mgr , which indeed showed leftovers:

Restarting nvidia-vgpu-mgr reveals 4 leftover processes (from the 2 P40-6Q mdev devices used earlier)
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: Stopping NVIDIA vGPU Manager Daemon...
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Deactivated successfully.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Unit process 3784 (nvidia-vgpu-mgr) remains running after unit stopped.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Unit process 30232 (vgpu_unlock) remains running after unit stopped.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Unit process 30253 (nvidia-vgpu-mgr) remains running after unit stopped.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Unit process 31016 (nvidia-vgpu-mgr) remains running after unit stopped.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: Stopped NVIDIA vGPU Manager Daemon.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Consumed 18.441s CPU time.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Found left-over process 3784 (nvidia-vgpu-mgr) in control group while starting unit. Ignoring.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Found left-over process 30232 (vgpu_unlock) in control group while starting unit. Ignoring.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Found left-over process 30253 (nvidia-vgpu-mgr) in control group while starting unit. Ignoring.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: nvidia-vgpu-mgr.service: Found left-over process 31016 (nvidia-vgpu-mgr) in control group while starting unit. Ignoring.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jun 15 15:20:57 zcloud.staging.huatai.me systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jun 15 15:20:57 zcloud.staging.huatai.me bash[34344]: vgpu_unlock loaded.
Jun 15 15:20:57 zcloud.staging.huatai.me nvidia-vgpu-mgr[34344]: vgpu_unlock loaded.
Jun 15 15:20:57 zcloud.staging.huatai.me nvidia-vgpu-mgr[34360]: vgpu_unlock loaded.
Jun 15 15:20:57 zcloud.staging.huatai.me nvidia-vgpu-mgr[34360]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
  • Check the processes:

    ps aux | grep nvidia-vgpu-mgr
    

The corresponding PIDs are indeed still there:

ps shows the leftover nvidia-vgpu-mgr processes corresponding to the 2 previously used P40-6Q mdev devices
root        3784  0.0  0.0 429044  2228 ?        Ss   Jun10   0:01 /usr/bin/nvidia-vgpu-mgr
root       30232  0.0  0.0 446312 49452 ?        Sl   09:03   0:09 /bin/python3 /opt/vgpu_unlock/vgpu_unlock -f /usr/bin/nvidia-vgpu-mgr
root       30253  0.0  0.0 466376  9208 ?        Ssl  09:03   0:00 /usr/bin/nvidia-vgpu-mgr
root       34340  0.0  0.0 438116 49256 ?        Sl   15:20   0:00 /bin/python3 /opt/vgpu_unlock/vgpu_unlock -f /usr/bin/nvidia-vgpu-mgr
root       34360  0.0  0.0 474572  8456 ?        Ssl  15:20   0:00 /usr/bin/nvidia-vgpu-mgr
  • The problem is that the vGPUs backing the already-destroyed mdev devices are still active; nvidia-smi vgpu -q shows the details:

nvidia-smi vgpu -q shows that the vGPUs of the destroyed mdev devices are still active, so the resources are never released
GPU 00000000:82:00.0
    Active vGPUs                          : 2
    vGPU ID                               : 3251634329
        VM UUID                           : 10a12241-1e83-4b70-bc59-a33d7c6d063c
        VM Name                           : y-k8s-n-1
        vGPU Name                         : GRID P40-6Q
        vGPU Type                         : 50
        vGPU UUID                         : ed1f9055-0b20-11ee-90a2-c79b496fe3f9
        MDEV UUID                         : 334855e2-079b-11ee-9fc9-83e0dccb6713
        Guest Driver Version              : N/A
        License Status                    : N/A (Expiry: N/A)
        GPU Instance ID                   : N/A
        Accounting Mode                   : N/A
        ECC Mode                          : Disabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : N/A
        PCI
            Bus Id                        : 00000000:00:00.0
        FB Memory Usage
            Total                         : 6144 MiB
            Used                          : 0 MiB
            Free                          : 6144 MiB
        Utilization
            Gpu                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
    vGPU ID                               : 3251634341
        VM UUID                           : 10a12241-1e83-4b70-bc59-a33d7c6d063c
        VM Name                           : y-k8s-n-1
        vGPU Name                         : GRID P40-6Q
        vGPU Type                         : 50
        vGPU UUID                         : 00000000-0000-0000-0000-000000000000
        MDEV UUID                         : 33485650-079b-11ee-9fca-8f6415d2734c
        Guest Driver Version              : N/A
        License Status                    : N/A (Expiry: N/A)
        GPU Instance ID                   : N/A
        Accounting Mode                   : N/A
        ECC Mode                          : Disabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : N/A
        PCI
            Bus Id                        : 00000000:00:00.0
        FB Memory Usage
            Total                         : 6144 MiB
            Used                          : 0 MiB
            Free                          : 6144 MiB
        Utilization
            Gpu                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0

This points to nvidia-smi vgpu as the key: the leftover vGPU state needs to be cleaned up
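
A quicker way to spot such leftovers than reading the full query output is to filter nvidia-smi vgpu -q down to the identifying fields; for example:

# Summarize active vGPUs: count, vGPU IDs, backing mdev UUIDs and the VM they were attached to
nvidia-smi vgpu -q | grep -E "Active vGPUs|vGPU ID|MDEV UUID|VM Name"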

  • Checking nvidia-smi vgpu -h reveals a matching option, -caa :

    [-caa | --clear-accounted-apps]: Clears accounting information of the vGPU instance that have already terminated.
    

It can be used to clear the accounting information of vGPU instances that have already terminated

  • Clear the accounting information of the terminated vGPU instances:

Clearing the accounting information of terminated vGPU instances
nvidia-smi vgpu -caa

The leftover vGPU accounting information is reported as cleared:

Output of clearing the terminated vGPU instances' accounting information
Cleared Accounted PIDs for vGPU 3251634329
Cleared Accounted PIDs for vGPU 3251634341

But this did not solve the problem

  • Blunder: I tried echo 1 > /sys/class/mdev_bus/0000:82:00.0/reset , after which nvidia-smi could no longer detect the device at all:

    Unable to determine the device handle for GPU 0000:82:00.0: Unknown Error
    
  • I tried to rmmod the nvidia-related kernel modules, but they were reported as in use

  • I ran lsof | grep nvidia | awk '{print $2}' | sort -u to find all processes using nvidia and killed them, though one kernel thread [nvidia] could not be killed

  • At this point lsmod | grep nvidia showed that the modules were essentially no longer in use:

    nvidia_vgpu_vfio       57344  0
    nvidia              39174144  2
    mdev                   28672  1 nvidia_vgpu_vfio
    drm                   622592  4 drm_kms_helper,nvidia,mgag200
    

The kernel modules can then be unloaded one by one:

rmmod nvidia_vgpu_vfio
rmmod nvidia

With that, all nvidia-related modules are unloaded
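
Equivalently, modprobe -r can remove the modules in dependency order in one step; a sketch, assuming (as verified above) that nothing is using the modules any more:

# modprobe -r unloads nvidia_vgpu_vfio first and then the nvidia module it depends on
modprobe -r nvidia_vgpu_vfio nvidia
lsmod | grep nvidia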

  • Load the nvidia module again:

    # modprobe nvidia
    # lsmod | grep nvidia
    nvidia              39174144  0
    drm                   622592  4 drm_kms_helper,nvidia,mgag200
    

Now nvidia-smi no longer errors out, but it reports that no devices were found:

No devices were found
  • I went through vgpu_unlock again (to reinstall the driver and reload the kernel modules); afterwards the kernel modules were loaded again:

    nvidia_vgpu_vfio       57344  0
    nvidia              39145472  2
    mdev                   28672  1 nvidia_vgpu_vfio
    drm                   622592  4 drm_kms_helper,nvidia,mgag200
    
  • However, nvidia-smi still reports No devices were found

  • The output of lspci -v -s 82:00.0 shows nothing abnormal:

    82:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P10] (rev a1)
         Subsystem: NVIDIA Corporation GP102GL [Tesla P10]
         Physical Slot: 3
         Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 1, IOMMU group 80
         Memory at c8000000 (32-bit, non-prefetchable) [size=16M]
         Memory at 3b000000000 (64-bit, prefetchable) [size=32G]
         Memory at 3b800000000 (64-bit, prefetchable) [size=32M]
         Capabilities: [60] Power Management version 3
         Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
         Capabilities: [78] Express Endpoint, MSI 00
         Capabilities: [100] Virtual Channel
         Capabilities: [250] Latency Tolerance Reporting
         Capabilities: [128] Power Budgeting <?>
         Capabilities: [420] Advanced Error Reporting
         Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
         Capabilities: [900] Secondary PCI Express
         Kernel driver in use: nvidia
         Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
    

Note

I am giving up on this for now; no more fiddling. The simplest fix, really, is to reboot the server...
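
If rebooting is the way out, a quick post-reboot checklist built from the commands used earlier in this document confirms that the environment is back to a clean state:

# After reboot: kernel modules, physical GPU, vGPU list and mdev devices
lsmod | grep -E "nvidia|mdev"
nvidia-smi
nvidia-smi vgpu
mdevctl list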

References