pcm-exporter: Intel Performance Counter Monitor (Intel PCM) Prometheus exporter

Intel's PCM (Performance Counter Monitor) includes a Prometheus exporter named pcm-sensor-server, which can output Intel processor metrics in JSON or in the Prometheus text-based format. Intel also provides a Docker image for running it as a container.

Installation

  • On Ubuntu Linux, installing the pcm package also provides pcm-sensor-server:

Install Intel PCM on Ubuntu
sudo apt install pcm -y

Running

pcm-sensor-server accepts a few simple options, listed by pcm-sensor-server --help:

pcm-sensor-server options
Usage: pcm-sensor-server [OPTION]

Valid Options:
    -d                   : Run in the background
    -p portnumber        : Run on port <portnumber> (default port is 9738)
    -r|--reset           : Reset programming of the performance counters.
    -D|--debug level     : level = 0: no debug info, > 0 increase verbosity.
    -R|--real-time       : If possible the daemon will run with real time
                           priority, could be useful under heavy load to
                           stabilize the async counter fetching.
    -h|--help            : This information

As the help shows, on a heavily loaded server pcm-sensor-server can run with real-time priority to stabilize asynchronous counter fetching. It also accepts options to set the listening port and to run in the background.
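For example, the options above can be combined as follows (these invocations use only the flags documented in the help output; running them requires root privileges on a machine with supported hardware):

```shell
# Run in the foreground on the default port with verbose debug output
pcm-sensor-server -p 9738 -D 1

# Or run in the background with real-time priority for stable counter fetching
pcm-sensor-server -d --real-time
```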

Running pcm-sensor-server under the systemd process manager

/etc/systemd/system/pcm-exporter.service
[Unit]
Description=pcm-exporter
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=500
StartLimitBurst=5

[Service]
User=root
Group=root
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/sbin/pcm-sensor-server \
  -p 9738 \
  --real-time

[Install]
WantedBy=multi-user.target
  • Start pcm-exporter:

Start the pcm-exporter service
systemctl daemon-reload
systemctl enable --now pcm-exporter
systemctl status pcm-exporter

If all goes well, the service status output looks like the following; this example is from my HPE ProLiant DL360 Gen9 server with an Intel Xeon E5-2600 v3 series processor:

pcm-exporter service status
● pcm-exporter.service - pcm-exporter
     Loaded: loaded (/etc/systemd/system/pcm-exporter.service; disabled; vendor preset: enabled)
     Active: active (running) since Sat 2023-07-15 15:06:34 CST; 2min 35s ago
   Main PID: 295193 (pcm-sensor-serv)
      Tasks: 213 (limit: 464040)
     Memory: 14.5M
        CPU: 3.295s
     CGroup: /system.slice/pcm-exporter.service
             └─295193 /usr/sbin/pcm-sensor-server -p 9738 --real-time

Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: ERROR: QPI LL monitoring device (0:ff:9:2) is missing. The QPI statistics will be incomplete or missing.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Socket 1: 2 memory controllers detected with total number of 4 channels. 0 QPI ports detected. 0 M2M (mesh to memory) blocks detected. 2 Home Agents detected. 0 M3UPI blocks detected.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: using Linux resctrl driver for RDT metrics (L3OCC, LMB, RMB) because resctrl driver is mounted.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: can't create directory /sys/fs/resctrl/mon_groups/pcm47 error: No space left on device
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: can't create directory /pcm/sys/fs/resctrl/mon_groups/pcm47 error: No such file or directory
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: ERROR: RDT metrics (L3OCC,LMB,RMB) will not be available
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Disabling NMI watchdog since it consumes one hw-PMU counter.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Trying to use Linux perf events...
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Successfully programmed on-core PMU using Linux perf
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Starting plain HTTP server on http://localhost:9738/

Note that the available functionality is limited by hardware support: the Intel Xeon E5-2600 v3 series does not support Intel Resource Director Technology (RDT) (that arrived in the next generation, v4), and on this system the Intel QuickPath Interconnect monitoring device is missing, so the QPI statistics are incomplete. On a Xeon Platinum 8163 CPU @ 2.50GHz (Skylake), by contrast, the full, clean output looks like this:
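Before expecting the RDT metrics (L3OCC, LMB, RMB), you can check whether the CPU advertises RDT cache monitoring at all; on capable CPUs the kernel exposes cqm* feature flags in /proc/cpuinfo (a quick sketch — output varies by hardware):

```shell
# Collect any cqm (Cache QoS Monitoring, i.e. RDT) flags reported by the kernel
rdt_flags=$(grep -o 'cqm[a-z_]*' /proc/cpuinfo 2>/dev/null | sort -u)
if [ -n "$rdt_flags" ]; then
    echo "RDT monitoring flags: $rdt_flags"
else
    echo "no cqm flags advertised: RDT metrics will be unavailable"
fi
```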

pcm-exporter service output (Skylake processor)
Scheduler changed to SCHED_RR and priority to 1

=====  Processor information  =====
Linux arch_perfmon flag  : yes
Hybrid processor         : no
IBRS and IBPB supported  : yes
STIBP supported          : yes
Spec arch caps supported : no
Max CPUID level          : 22
CPU model number         : 85
Number of physical cores: 48
Number of logical cores: 96
Number of online logical cores: 96
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 24
Last level cache slices per socket: 24
Core PMU (perfmon) version: 4
Number of core PMU generic (programmable) counters: 3
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2500000000 Hz
IBRS enabled in the kernel   : no
STIBP enabled in the kernel  : no
Package thermal spec power: 165 Watt; Package minimum power: 87 Watt; Package maximum power: 363 Watt;

INFO: Linux perf interface to program uncore PMUs is present
Socket 0: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory) blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI blocks detected.
Socket 1: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory) blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI blocks detected.
INFO: using Linux resctrl driver for RDT metrics (L3OCC, LMB, RMB) because resctrl driver is mounted.

 Disabling NMI watchdog since it consumes one hw-PMU counter. To keep NMI watchdog set environment variable PCM_KEEP_NMI_WATCHDOG=1 (this reduces the core metrics set)
 Closed perf event handles
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max UPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 2 speed: 23.3 GBytes/second (10.4 GT/second)
Socket 1
Max UPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 2 speed: 23.3 GBytes/second (10.4 GT/second)
Starting plain HTTP server on http://localhost:9738/

At this point, opening http://192.168.6.200:9738 (my server's address) in a browser shows the PCM Sensor Server landing page, which includes example configurations for the InfluxDB time-series database and for Prometheus monitoring combined with the Grafana visualization platform.
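The metrics themselves can also be pulled directly over HTTP; a quick command-line check (the address is my server's, and the /metrics path for the Prometheus text format follows the PCM documentation — adjust both to your setup):

```shell
# Fetch the first few Prometheus-format samples from the exporter
curl -s http://192.168.6.200:9738/metrics | head -n 5
```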

Next we can set up Grafana dashboards for Intel PCM (Intel officially provides a very convenient Docker-based setup).
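On the Prometheus side, a minimal scrape configuration for this exporter could look like the following (the job name and target address are illustrative assumptions; Prometheus scrapes the /metrics path by default):

```yaml
scrape_configs:
  - job_name: 'pcm'
    static_configs:
      - targets: ['192.168.6.200:9738']
```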

Running in a Docker container

If you would rather not build from source or deploy manually, you can run it as a container:

Run pcm-exporter in a container
# Load the msr kernel module on the host (PCM reads model-specific registers)
modprobe msr

# GitHub Container Registry image
docker run -d --name pcm --privileged -p 9738:9738 ghcr.io/opcm/pcm

# or the Docker Hub image
docker run -d --name pcm --privileged -p 9738:9738 opcm/pcm
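After the container starts, the same endpoint should be reachable on the mapped port; a quick sanity check (the container name matches the docker run command above):

```shell
# Confirm the container came up and the HTTP server answers
docker logs pcm | tail -n 3
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9738/
```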

References