Integrating GPU Observability into Kubernetes

GPUs have become an important resource in Kubernetes environments, and we need to be able to collect GPU metrics through a unified monitoring stack such as Prometheus, so that GPU resources can be monitored just like traditional CPU resources.

Note

If the NVIDIA GPU Operator is already deployed in the Kubernetes cluster, DCGM-Exporter is installed automatically on the GPU nodes. Using the NVIDIA GPU Operator is therefore the deployment approach recommended by the official NVIDIA/dcgm-exporter GitHub project ("Note: Consider using the NVIDIA GPU Operator rather than DCGM-Exporter directly.").

In fact, I tried the Helm chart installation of DCGM-Exporter provided by the official NVIDIA/dcgm-exporter GitHub project, and it did not succeed:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

It failed with the following error:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "dcgm-exporter-1679911060" namespace: "default" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
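
The error indicates that the ServiceMonitor CRD from the Prometheus Operator is not present in the cluster. A minimal sketch of how one might confirm this before installing the chart (assuming kubectl access; the serviceMonitor.enabled value is taken from the chart's values file and should be verified against your chart version):

Check whether the ServiceMonitor CRD exists
# The dcgm-exporter chart creates a ServiceMonitor by default, which requires
# the Prometheus Operator CRDs (installed e.g. by kube-prometheus-stack)
kubectl get crd servicemonitors.monitoring.coreos.com

# If the CRD is missing, either deploy kube-prometheus-stack first, or
# disable the ServiceMonitor when installing dcgm-exporter
helm install --generate-name gpu-helm-charts/dcgm-exporter \
   --set serviceMonitor.enabled=false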

However, installing the NVIDIA GPU Operator solves this problem directly and automatically completes the dcgm-exporter installation on GPU nodes, as sketched below.
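
For reference, a minimal sketch of installing the GPU Operator from its official helm chart (repository and chart name as documented by NVIDIA; verify the options against the GPU Operator documentation for your version):

Install NVIDIA GPU Operator with helm (sketch)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# --wait blocks until the operator components are ready
helm install --wait --generate-name \
   -n gpu-operator --create-namespace \
   nvidia/gpu-operator

Since this article installs the driver natively on the nodes, the operator's containerized driver would typically be disabled with --set driver.enabled=false.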

Some production environments may not deploy the full NVIDIA GPU Operator (preferring their own solution). In that case DCGM-Exporter can still be deployed standalone, and this article follows the official documentation to implement that deployment mode: the driver is installed natively on the GPU nodes, i.e. neither the NVIDIA GPU Operator nor a containerized driver (running NVIDIA containers with Docker) is used.

NVIDIA driver

On the physical host or virtual machine (GPU passthrough), the NVIDIA driver must be installed; here the NVIDIA CUDA driver is installed from the Linux distribution package repositories.

Note

Depending on the operating system, the installation differs mainly between the RedHat Linux and Ubuntu Linux families, which use different package management methods.

Installing the NVIDIA driver on Ubuntu

  • Add the repository on Ubuntu:

Add the official NVIDIA repository on Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
  • Install the NVIDIA CUDA driver:

Install the CUDA driver from the official NVIDIA repository on Ubuntu
sudo apt-get -y install cuda-drivers
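
After installation (a reboot is usually required so the kernel module is loaded), a quick check that the driver works; the same check applies to the RHEL/CentOS installation below:

Verify the NVIDIA driver
# Lists the detected GPUs together with the driver and CUDA versions
nvidia-smi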

Installing the NVIDIA driver on RHEL/CentOS 7

  • Add the repository on RHEL/CentOS 7:

Add the official NVIDIA repository on RHEL/CentOS 7
# CentOS 7 may need the build toolchain installed; RHEL 7 usually has it already
sudo yum install -y tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-devel iptables firewalld vim bind-utils wget

# Install the EPEL repository
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

# On RHEL 7 some optional repositories must also be enabled (not needed on CentOS 7)
sudo subscription-manager repos --enable="rhel-*-optional-rpms" --enable="rhel-*-extras-rpms"  --enable="rhel-ha-for-rhel-*-server-rpms"

# Determine the distribution string used by the CUDA repository
distribution=$(. /etc/os-release;echo $ID`rpm -E "%{?rhel}%{?fedora}"`)

# Set up the repository
ARCH=$( /bin/arch )
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo

# Install kernel headers and development packages
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

# Clean the expired cache
sudo yum clean expire-cache
  • Install the NVIDIA CUDA driver:

Install the CUDA driver from the official NVIDIA repository on RHEL/CentOS 7
sudo yum -y install cuda-drivers

Installing a container runtime

NVIDIA supports several container runtimes: Docker, containerd, or CRI-O.

Any of these container runtimes is supported; after installing the runtime of your choice, you need to install the corresponding NVIDIA Container Toolkit.

Installing the NVIDIA Container Toolkit

NVIDIA provides support for multiple container runtimes, e.g. Docker, containerd, and CRI-O. Install the NVIDIA Container Toolkit that matches the container runtime actually used by your Kubernetes cluster; a hedged containerd example is sketched below.
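
As a sketch, for a cluster running containerd on Ubuntu the toolkit installation typically looks like the following (it assumes the NVIDIA Container Toolkit apt repository has already been configured per the toolkit installation guide, and that the toolkit version ships the nvidia-ctk helper):

Install the NVIDIA Container Toolkit for containerd (sketch)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd to use the NVIDIA runtime, then restart it
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd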

Installing Kubernetes

Deploy a vanilla Kubernetes cluster.

Installing the NVIDIA Device Plugin

Install the NVIDIA Device Plugin (standalone installation of nvidia-device-plugin):

  • Install helm:

Install helm on Linux
version=3.12.2
wget https://get.helm.sh/helm-v${version}-linux-amd64.tar.gz
tar -zxvf helm-v${version}-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
  • Add the nvidia-device-plugin helm repository:

Add the nvidia-device-plugin helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update
  • Deploy the NVIDIA Device Plugin:

Install nvidia-device-plugin with helm
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.13.0
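
Once the device plugin DaemonSet is running, a hedged sketch of confirming that GPU nodes now advertise the nvidia.com/gpu resource (jq is assumed to be installed):

Check that GPUs are advertised as allocatable resources
# Shows each node's allocatable nvidia.com/gpu count (null on non-GPU nodes)
kubectl get nodes -o json | \
   jq '.items[] | {node: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'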

Getting started

GPU observability

NVIDIA's GPU observability is also built on Prometheus: a complete collection pipeline with a time-series database storing the metrics, visualized through Grafana. In addition, Prometheus includes Alertmanager for creating and managing alerts. Prometheus uses kube-state-metrics (KSM) and Node Exporter to expose cluster-level metrics for Kubernetes API objects and node-level metrics (such as CPU utilization), respectively.

../../_images/prometheus_architecture.png

Prometheus architecture

To collect GPU observability data from Kubernetes, DCGM-Exporter is recommended: built on NVIDIA DCGM (Data Center GPU Manager), DCGM-Exporter exposes GPU metrics to Prometheus, which can then be visualized in Grafana. The DCGM-Exporter architecture makes full use of the Kubelet PodResources API and exposes GPU metrics in a format that Prometheus can scrape. It also includes a ServiceMonitor for exposing the endpoints.

Deploying Prometheus

The Prometheus deployment used by the GPU observability solution in NVIDIA's official documentation is in fact the community approach of deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3. NVIDIA makes a few adjustments:

Install helm with the official script
curl -LO https://git.io/get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
  • Add the Prometheus community helm chart repository:

Add the Prometheus community helm chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  • NVIDIA adjusts some parameters of the community chart (see below), so first export the values used by the chart (so they can be edited):

Dump the Prometheus Stack chart values with helm inspect values
helm inspect values prometheus-community/kube-prometheus-stack > kube-prometheus-stack.values
  • Expose the metrics port 30090 as a NodePort on every node (in practice, adjust this as needed):

Expose port 30090 as a NodePort
prometheus:

  ## Configuration for Prometheus service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Prometheus Service to listen on
    ##
    port: 9090

    ## To be used with a proxy extraContainer port
    targetPort: 9090

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30090

    ## Loadbalancer IP
    ## Only use if service.type is "LoadBalancer"
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

    ## Denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints
    ##
    externalTrafficPolicy: Cluster

    ## Service type
    ##
    type: NodePort

...

grafana:

  ## Passed to grafana subchart and used by servicemonitor below
  ##
  service:
    portName: http-web
    nodePort: 30080
    type: NodePort

...

alertmanager:

  ## Deploy alertmanager
  ##
  enabled: true
  ...
  ## Configuration for Alertmanager service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Alertmanager Service to listen on
    ##
    port: 9093
    ## To be used with a proxy extraContainer port
    ##
    targetPort: 9093
    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ...
    ## Service type
    ##
    type: NodePort
  • Set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false:

Set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false
# If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
# prometheus resource to be created with selectors based on values in the helm deployment,
# which will also match the servicemonitors created
#
serviceMonitorSelectorNilUsesHelmValues: false
  • Add a gpu-metrics job to additionalScrapeConfigs in the configMap:

Add a gpu-metrics job to additionalScrapeConfigs in the configMap
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
  • Finally, run the deployment using the customized values file:

Install kube-prometheus-stack with the customized helm chart values
helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --values /tmp/kube-prometheus-stack.values

Note

Editing the values file by hand as above is fairly tedious; in fact helm supports overriding values directly on the command line:

Install kube-prometheus-stack with values overridden on the command line
helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --set prometheus.service.type=NodePort \
   --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
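
Either way, once the stack is running, a hedged sketch of confirming that the gpu-metrics scrape job was picked up, using the Prometheus HTTP API through the NodePort configured above (replace <node-ip> with one of your node addresses; jq is assumed to be installed):

Check that the gpu-metrics scrape job is active
# Lists the scrape pool names of all active targets; "gpu-metrics" should appear
# once dcgm-exporter endpoints exist in the configured namespace
curl -s http://<node-ip>:30090/api/v1/targets | \
   jq -r '.data.activeTargets[].scrapePool' | sort -u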

Note

In fact I had already deployed Prometheus and Grafana on the Kubernetes cluster with Helm 3, and I had also mapped the Prometheus service port to a NodePort (by manually editing the deployment after installation), so there was no need to repeat the installation from the official documentation.

However, for the Prometheus and Grafana deployment with integrated GPU monitoring on the z-k8s Kubernetes cluster I did adopt NVIDIA's deployment approach; refer to that write-up.

(Optional) Installing NVIDIA DCGM (Data Center GPU Manager) and DCGM-Exporter standalone

Monitoring NVIDIA GPUs with Prometheus + Grafana can also be done by deploying NVIDIA DCGM (Data Center GPU Manager) and DCGM-Exporter directly on the physical host, i.e. running both programs under systemd.

The official DCGM-Exporter GitHub project also describes how to run dcgm-exporter with Docker.

Note

By comparison, I feel that NVIDIA's official manual better reflects the current deployment model, so I made a few adjustments on top of NVIDIA's official deployment approach.

Deploying DCGM

Note

Only dcgm-exporter needs to be deployed; there is no need to additionally install NVIDIA DCGM (Data Center GPU Manager) on the physical host.

Checking inside the dcgm-exporter container, I found that nvidia-dcgm is already installed there, it just does not appear to run as a service. See Monitor Your Computing System with Prometheus, Grafana, Alertmanager, and Nvidia DCGM:

# nv-hostengine
Started host engine version 2.4.6 using port number: 5555

# dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA Graphics Device                                         |
|        | PCI Bus ID: 00000000:09:00.0                                         |
|        | Device UUID: GPU-794d1de5-b8c7-9b49-6fe3-f96f8fd98a19                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

Note that nv-hostengine must be started inside the container before dcgmi discovery -l can be used to inspect the host's GPUs.

Inside the container you can run curl localhost:9400/metrics to fetch the GPU metrics; the same check can be run from outside the container, as sketched below.
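
A hedged sketch of running the same check from outside the container (the label selector matches the chart's NOTES output shown further below; adjust the namespace to wherever dcgm-exporter is deployed):

Fetch GPU metrics from a dcgm-exporter pod with kubectl exec (sketch)
# The namespace here is default; adjust to your deployment
POD_NAME=$(kubectl get pods -n default \
   -l "app.kubernetes.io/name=dcgm-exporter" -o jsonpath="{.items[0].metadata.name}")
kubectl exec -n default "$POD_NAME" -- curl -s localhost:9400/metrics | head -n 20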

  • Add the dcgm-exporter helm repo:

Add the dcgm-exporter helm repo
helm repo add gpu-helm-charts \
   https://nvidia.github.io/dcgm-exporter/helm-charts

helm repo update
  • Install the dcgm-exporter chart:

Install dcgm-exporter with helm
helm install \
   --generate-name \
   gpu-helm-charts/dcgm-exporter

An error may occur here because dcgm-exporter requires Kubernetes >= 1.19.0-0:

Installing DCGM-Exporter fails because the Kubernetes version is too old (an older dcgm-exporter must be installed)
Error: INSTALLATION FAILED: chart requires kubeVersion: >= 1.19.0-0 which is incompatible with Kubernetes v1.18.10

In that case, install an older version by pinning the helm chart version:

Install the dcgm-exporter chart at version 2.6.10:

Install a specific helm chart version
helm install --generate-name gpu-helm-charts/dcgm-exporter --version 2.6.10

Output of a successful installation:

Output of a successful installation of the pinned helm chart version
NAME: dcgm-exporter-1680364448
LAST DEPLOYED: Sat Apr  1 23:54:13 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1680364448" -o jsonpath="{.items[0].metadata.name}")
  kubectl -n default port-forward $POD_NAME 8080:9400 &
  echo "Visit http://127.0.0.1:8080/metrics to use your application"

third-party Profiling module error

In one production deployment, I deployed DCGM-Exporter first (nvidia-device-plugin was already installed on the system, but prometheus-stack had not yet been deployed). Strangely, the dcgm-exporter pods kept crashing:

# kubectl -n nvidia-gpu get pods
NAME                             READY   STATUS             RESTARTS   AGE
dcgm-exporter-1680885308-2ttq6   0/1     CrashLoopBackOff   241        20h
dcgm-exporter-1680885308-5rzsf   0/1     CrashLoopBackOff   0          20h
dcgm-exporter-1680885308-5w29s   0/1     CrashLoopBackOff   241        20h
dcgm-exporter-1680885308-68sv7   0/1     CrashLoopBackOff   0          119m
...

The kubelet log only showed the container repeatedly going into CrashLoopBackOff:

...
E0408 18:51:25.676318   41268 pod_workers.go:191] Error syncing pod 4c56555f-1b97-4d68-965b-af67cd99df48 ("dcgm-exporter-1680885308-68sv7_nvidia-gpu(4c56555f-1b97-4d68-965b-af67cd99df48)"), skipping: failed to "StartContainer" for "exporter" with CrashLoopBackOff: "back-off 1m20s restarting failed container=exporter pod=dcgm-exporter-1680885308-68sv7_nvidia-gpu(4c56555f-1b97-4d68-965b-af67cd99df48)"
...

Checking the pod logs showed that an error returned by a third-party profiling module was the cause:

# kubectl logs dcgm-exporter-1680885308-68sv7 -n nvidia-gpu
time="2023-04-08T10:50:50Z" level=info msg="Starting dcgm-exporter"
time="2023-04-08T10:50:50Z" level=info msg="DCGM successfully initialized!"
time="2023-04-08T10:50:51Z" level=info msg="Collecting DCP Metrics"
time="2023-04-08T10:50:51Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2023-04-08T10:50:53Z" level=fatal msg="Error watching fields: The third-party Profiling module returned an unrecoverable error"

A Korean-language post, GPU Operator on CentOS, suggests the fix is upgrading GPU Operator from v1.3.0 to v1.4.0.

However, with the same operating system and hardware (NVIDIA A100 Tensor Core GPU) and the same driver (Driver Version: 470.103.01, CUDA Version: 11.4), a cluster I had deployed earlier did not have this problem.

Note

This fault was later resolved. The Kubernetes cluster was deployed on servers rented from Alibaba Cloud, which already came with Alibaba Cloud's monitoring service (a modified prometheus-stack), so a DCGM-Exporter was already running on the host as a systemd service. However, Alibaba Cloud renamed the dcgm-exporter process to starship, so it went unnoticed that the same program was already running on the physical host. Since both starship and the dcgm-exporter DaemonSet I deployed collect data on a schedule, they could easily collect at the same time and conflict; a hedged check for such a pre-existing host-level exporter is sketched below.
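
A minimal sketch of how one might check for such a pre-existing host-level exporter before deploying the DaemonSet (process and unit names vary by vendor, as the starship renaming shows):

Check for an existing host-level DCGM / dcgm-exporter (sketch)
# Anything already listening on the default dcgm-exporter port 9400
# or the nv-hostengine port 5555 hints at a conflict
sudo ss -lntp | grep -E ':9400|:5555'

# Look for DCGM-related processes and systemd units under any name
ps aux | grep -E 'dcgm|nv-hostengine' | grep -v grep
systemctl list-units --type=service | grep -iE 'dcgm|gpu'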

Note

The Kubernetes project published a technical article in collaboration with NVIDIA, Third Party Device Metrics Reaches GA, which analyzes the NVIDIA GPU metrics collection approach in detail; I will consolidate further research in my notes on Kubernetes third-party device metrics.

Configuring Grafana

Among the shared Grafana dashboards there is a dedicated dashboard provided by NVIDIA, the NVIDIA DCGM Exporter Dashboard; add it using the dashboard JSON corresponding to the URL https://grafana.com/grafana/dashboards/12239:

  • Select the menu Dashboards >> Import

  • (This did not work for me) enter https://grafana.com/grafana/dashboards/12239 directly into the Import via grafana.com field and click Load

  • Or (the method I actually used) first download the JSON file for https://grafana.com/grafana/dashboards/12239 from the shared Grafana dashboards site, then click Load; a prompt like the following appears

../../_images/import_dcgm-exporter_dashboard.png
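
If the web form will not load the dashboard by URL, a hedged alternative is to download the dashboard JSON from the grafana.com API and import it as a file (the URL follows grafana.com's dashboard download scheme; the revision number is an assumption and newer revisions may exist):

Download the NVIDIA DCGM Exporter Dashboard JSON (sketch)
curl -fsSL -o dcgm-exporter-dashboard.json \
   https://grafana.com/api/dashboards/12239/revisions/1/download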

Configuring Prometheus

At this point Grafana is configured but still cannot get any dcgm-exporter data, because Prometheus has not yet been configured to scrape it (on port 9400).

prometheus 的配置文件中添加 scrape_config 配置(具体参考 prometheus_configuration_scrape_config ),采用 更新Kubernetes集群的Prometheus配置 :

helm supports the upgrade command, which updates a previously installed helm chart; in other words, configuration can be added or changed after the Prometheus deployment (which is more convenient than editing prometheus.yml directly).

Note

In the end I updated the cluster's Prometheus configuration this way, successfully integrating the gpu-metrics additionalScrapeConfigs configMap setting from NVIDIA's official documentation. With this in place, basic GPU monitoring starts to show up.

  • Dump the kube-prometheus-stack values:

Dump the Prometheus Stack chart values with helm inspect values
helm inspect values prometheus-community/kube-prometheus-stack > kube-prometheus-stack.values
  • Edit /tmp/kube-prometheus-stack.values and add a gpu-metrics job to additionalScrapeConfigs in the configMap:

Add a gpu-metrics job to additionalScrapeConfigs in the configMap (the namespace is set to default because of how dcgm-exporter was deployed)
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
  • Upgrade:

Upgrade with helm upgrade prometheus-community/kube-prometheus-stack
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values
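
Before opening Grafana, a quick hedged check that Prometheus is now scraping dcgm-exporter, using the Prometheus HTTP API on the NodePort configured earlier (replace <node-ip>; DCGM_FI_DEV_GPU_UTIL is one of the metrics in the default dcp-metrics-included.csv):

Query a DCGM metric through the Prometheus API (sketch)
# A non-empty result set means the gpu-metrics job is scraping successfully
curl -s "http://<node-ip>:30090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL" | \
   jq '.data.result | length'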

After the helm upgrade completes, the NVIDIA GPU Grafana dashboard shows that monitoring data is being collected successfully, similar to the following:

../../_images/nvidia_dcgm-exporter_grafana_1.png

NVIDIA GPU monitoring data collected via dcgm-exporter: temperature and power

../../_images/nvidia_dcgm-exporter_grafana_2.png

NVIDIA GPU monitoring data collected via dcgm-exporter: GPU clock frequency, GPU utilization, Tensor Core utilization, and framebuffer memory usage

References