Installing the NVIDIA Virtual GPU Guest Driver

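With NVIDIA Virtual GPU Manager installed on the physical host, and an NVIDIA Virtual GPU (vGPU) device added to the virtual machine through the Libvirt virtual machine manager, it is time to actually use the vGPU. That means installing the guest driver inside the VM.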

Preparation

Install GCC, the Linux kernel headers, and dkms in the Ubuntu guest VM
sudo apt install gcc linux-headers-$(uname -r) dkms

Installing the nvidia-linux-grid guest driver

  • Install the nvidia-linux-grid guest driver on Ubuntu:

Install the nvidia-linux-grid guest driver in the Ubuntu guest VM
sudo dpkg -i nvidia-linux-grid-510_510.85.02_amd64.deb
  • Then reboot the VM
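
After the reboot, one way to confirm that the guest driver is in place is to compare the version embedded in the package file name with the version the running driver reports. The following is only a sketch: the helper name deb_driver_version is hypothetical, and it assumes the package file name follows the nvidia-linux-grid-<branch>_<version>_<arch>.deb pattern seen in the dpkg command above. The nvidia-smi query only runs if nvidia-smi is present.

```shell
#!/bin/sh
# Hypothetical helper: extract the driver version from a package file name
# of the form nvidia-linux-grid-<branch>_<version>_<arch>.deb (assumed pattern).
deb_driver_version() {
    name=$(basename "$1" .deb)
    # Fields are separated by "_"; the version is the second field.
    printf '%s\n' "$name" | cut -d_ -f2
}

expected=$(deb_driver_version nvidia-linux-grid-510_510.85.02_amd64.deb)
echo "expected driver version: $expected"

# On a guest where the driver is loaded, nvidia-smi can report the running version.
if command -v nvidia-smi >/dev/null 2>&1; then
    actual=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    echo "loaded driver version:   $actual"
fi
```

If the two versions differ, or nvidia-smi reports no devices, the kernel module build through dkms is a reasonable first place to look.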

Configuring the license

  • In the Ubuntu VM, edit the /etc/nvidia/gridd.conf configuration:

Configure /etc/nvidia/gridd.conf in the VM to connect to the License Server
# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=192.168.6.248

# Description: Set License Server port number
# Data type: integer
# Format:  <port>, default is 7070
ServerPort=7070

# Description: Set Feature to be enabled
# Data type: integer
# Possible values:
#    0 => for unlicensed state
#    1 => for NVIDIA vGPU (Optional, autodetected as per vGPU type)
#    2 => for NVIDIA RTX Virtual Workstation
#    4 => for NVIDIA Virtual Compute Server
# All other values reserved
FeatureType=4
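
When the same settings have to be applied to several guests, the edit can be scripted. The sketch below is an illustration only: set_gridd_option is a hypothetical helper, it relies on GNU sed (as shipped with Ubuntu), and it assumes the values contain no "|" or "&" characters. The demo works on a temporary file rather than the real /etc/nvidia/gridd.conf.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a gridd.conf-style file.
# Rewrites an existing uncommented "KEY=" line, otherwise appends a new one.
set_gridd_option() {
    file=$1; key=$2; value=$3
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite the line in place (GNU sed -i).
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key missing: append it at the end of the file.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demo against a scratch file that mimics two lines of gridd.conf.
conf=$(mktemp)
printf 'ServerAddress=\nServerPort=7070\n' > "$conf"
set_gridd_option "$conf" ServerAddress 192.168.6.248
set_gridd_option "$conf" FeatureType 4
cat "$conf"
rm -f "$conf"
```

On a real guest the helper would be pointed at /etc/nvidia/gridd.conf and run with root privileges.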

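One problem came up here: a virtual machine that has no vGPU attached cannot start the nvidia-gridd service. So I went back to Installing NVIDIA Virtual GPU Manager and added a vGPU to the virtual machine y-k8s-n-1.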
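
A quick pre-check inside the guest can catch this situation before nvidia-gridd is started. The sketch below assumes the vGPU is exposed to the guest as a PCI device carrying NVIDIA's PCI vendor ID 10de; has_nvidia_device is a hypothetical helper that scans numeric lspci output.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any `lspci -n`-style line on stdin
# carries the NVIDIA PCI vendor ID (10de).
has_nvidia_device() {
    grep -qi ' 10de:'
}

if command -v lspci >/dev/null 2>&1 && lspci -n | has_nvidia_device; then
    echo "NVIDIA PCI device found"
else
    echo "no NVIDIA PCI device visible (or lspci missing)"
fi
```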

  • Start nvidia-gridd:

Configure the License Server IP and port and the license to request, then start nvidia-gridd
systemctl start nvidia-gridd

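The license type the client requests must be one that the License Server offers. For example, if the License Server only provides Quadro-Virtual-DWS but the client is configured with FeatureType=4 to request Virtual Compute Server, the log shows an error similar to the following after nvidia-gridd starts on the client: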

The license requested by nvidia-gridd must match a type the License Server provides, otherwise the client logs an error; the Quadro-Virtual-DWS license does, however, seem to be usable for Virtual Compute Server
Jun 16 00:14:46 y-k8s-n-1 systemd[1]: Starting NVIDIA Grid Daemon...
Jun 16 00:14:46 y-k8s-n-1 systemd[1]: Started NVIDIA Grid Daemon.
Jun 16 00:14:46 y-k8s-n-1 nvidia-gridd[23795]: Started (23795)
Jun 16 00:14:46 y-k8s-n-1 nvidia-gridd[23795]: vGPU Software package (0)
Jun 16 00:14:46 y-k8s-n-1 nvidia-gridd[23795]: Ignore service provider licensing
Jun 16 00:14:46 y-k8s-n-1 nvidia-gridd[23795]: Unable to fetch the client configuration token file
Jun 16 00:14:47 y-k8s-n-1 nvidia-gridd[23795]: Service provider detection complete.
Jun 16 00:14:47 y-k8s-n-1 nvidia-gridd[23795]: Calling load_byte_array(tra)
Jun 16 00:14:47 y-k8s-n-1 nvidia-gridd[23795]: Acquiring license. (Info: http://192.168.6.248:7070/request; NVIDIA Virtual Compute Server)
Jun 16 00:14:47 y-k8s-n-1 nvidia-gridd[23795]: Calling load_byte_array(tra)
Jun 16 00:14:48 y-k8s-n-1 nvidia-gridd[23795]: Failed to acquire/renew license from license server. (Info: http://192.168.6.248:7070/request; NVIDIA Virtual Compute Server - Error: [1,7E2,2,0[7000000B,0,702C7]] Requested feature was not found.)
Jun 16 00:14:48 y-k8s-n-1 nvidia-gridd[23795]: Calling load_byte_array(tra)
Jun 16 00:14:49 y-k8s-n-1 nvidia-gridd[23795]: License acquired successfully. (Info: http://192.168.6.248:7070/request, NVIDIA Virtual Compute Server; Expiry: 2023-6-16 16:14:59 GMT)

In practice, however, the Quadro-Virtual-DWS license seems to be interchangeable with Virtual Compute Server (the nvidia-gridd log shows that the license was acquired successfully in the end).
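
Because a failure line can be followed by a success line, as in the log above, it is the most recent attempt that matters. The sketch below classifies a stream of nvidia-gridd log lines on that basis. license_state is a hypothetical helper, and the two message patterns are taken from the log shown above; other driver versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-gridd log lines on stdin and print the
# outcome of the most recent license attempt: acquired, failed, or unknown.
license_state() {
    awk '
        /License acquired successfully/    { state = "acquired" }
        /Failed to acquire\/renew license/ { state = "failed" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Demo with the last two license-related lines from the log above.
license_state <<'EOF'
Jun 16 00:14:48 y-k8s-n-1 nvidia-gridd[23795]: Failed to acquire/renew license from license server. (Info: http://192.168.6.248:7070/request; NVIDIA Virtual Compute Server - Error: [1,7E2,2,0[7000000B,0,702C7]] Requested feature was not found.)
Jun 16 00:14:49 y-k8s-n-1 nvidia-gridd[23795]: License acquired successfully. (Info: http://192.168.6.248:7070/request, NVIDIA Virtual Compute Server; Expiry: 2023-6-16 16:14:59 GMT)
EOF
```

On the guest itself, the same helper could be fed from the journal, for example with journalctl -u nvidia-gridd --no-pager | license_state.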

At this point, the License Feature Usage view on the License Server shows that the license count has dropped by one, meaning that a license has been checked out by the vGPU client.

References