pcm-exporter: Intel Performance Counter Monitor (Intel PCM) Prometheus exporter
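Intel's PCM (Performance Counter Monitor) provides a Prometheus exporter (see Prometheus Exporters) named pcm-sensor-server, which outputs Intel processor metrics in either JSON or Prometheus (text-based) format. It can also be run as a container (see Docker Atlas).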
Installation
On Ubuntu Linux, installing the pcm package also provides pcm-sensor-server:
sudo apt install pcm -y
Running
pcm-sensor-server takes a few simple options, which pcm-sensor-server --help lists:
Usage: pcm-sensor-server [OPTION]
Valid Options:
-d : Run in the background
-p portnumber : Run on port <portnumber> (default port is 9738)
-r|--reset : Reset programming of the performance counters.
-D|--debug level : level = 0: no debug info, > 0 increase verbosity.
-R|--real-time : If possible the daemon will run with real time
priority, could be useful under heavy load to
stabilize the async counter fetching.
-h|--help : This information
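As the help text shows, on a heavily loaded server pcm-sensor-server can run with real-time priority (-R) to keep counter collection stable. There are also options for the listening port (-p) and for running in the background (-d).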
Running pcm-sensor-server under systemd
For convenience, following the Prometheus quick start, create a systemd service unit for pcm-exporter at /etc/systemd/system/pcm-exporter.service:
[Unit]
Description=pcm-exporter
Wants=network-online.target
After=network-online.target
StartLimitIntervalSec=500
StartLimitBurst=5
[Service]
User=root
Group=root
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/sbin/pcm-sensor-server \
-p 9738 \
--real-time
[Install]
WantedBy=multi-user.target
Start pcm-exporter:
systemctl daemon-reload
systemctl enable --now pcm-exporter
systemctl status pcm-exporter
If everything is working, the service status looks like the following. This example is from my HPE ProLiant DL360 Gen9 server with Intel Xeon E5-2600 v3 series processors:
● pcm-exporter.service - pcm-exporter
Loaded: loaded (/etc/systemd/system/pcm-exporter.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-07-15 15:06:34 CST; 2min 35s ago
Main PID: 295193 (pcm-sensor-serv)
Tasks: 213 (limit: 464040)
Memory: 14.5M
CPU: 3.295s
CGroup: /system.slice/pcm-exporter.service
└─295193 /usr/sbin/pcm-sensor-server -p 9738 --real-time
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: ERROR: QPI LL monitoring device (0:ff:9:2) is missing. The QPI statistics will be incomplete or missing.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Socket 1: 2 memory controllers detected with total number of 4 channels. 0 QPI ports detected. 0 M2M (mesh to memory) blocks detected. 2 Home Agents detected. 0 M3UPI blocks detected.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: using Linux resctrl driver for RDT metrics (L3OCC, LMB, RMB) because resctrl driver is mounted.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: can't create directory /sys/fs/resctrl/mon_groups/pcm47 error: No space left on device
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: INFO: can't create directory /pcm/sys/fs/resctrl/mon_groups/pcm47 error: No such file or directory
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: ERROR: RDT metrics (L3OCC,LMB,RMB) will not be available
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Disabling NMI watchdog since it consumes one hw-PMU counter.
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Trying to use Linux perf events...
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Successfully programmed on-core PMU using Linux perf
Jul 15 15:06:34 zcloud.staging.huatai.me pcm-sensor-server[295193]: Starting plain HTTP server on http://localhost:9738/
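Note that the available features depend on hardware support. For example, the Intel Xeon E5-2600 v3 series does not provide the Intel Resource Director Technology (RDT) metrics (the next generation, v4, is needed), and QPI (Intel QuickPath Interconnect) statistics are unavailable here because the QPI link-layer monitoring device was not detected. On a Xeon Platinum 8163 CPU @ 2.50GHz (Skylake), by contrast, the complete normal output looks like this: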
Scheduler changed to SCHED_RR and priority to 1
===== Processor information =====
Linux arch_perfmon flag : yes
Hybrid processor : no
IBRS and IBPB supported : yes
STIBP supported : yes
Spec arch caps supported : no
Max CPUID level : 22
CPU model number : 85
Number of physical cores: 48
Number of logical cores: 96
Number of online logical cores: 96
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 24
Last level cache slices per socket: 24
Core PMU (perfmon) version: 4
Number of core PMU generic (programmable) counters: 3
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2500000000 Hz
IBRS enabled in the kernel : no
STIBP enabled in the kernel : no
Package thermal spec power: 165 Watt; Package minimum power: 87 Watt; Package maximum power: 363 Watt;
INFO: Linux perf interface to program uncore PMUs is present
Socket 0: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory) blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI blocks detected.
Socket 1: 2 memory controllers detected with total number of 6 channels. 3 UPI ports detected. 2 M2M (mesh to memory) blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 3 M3UPI blocks detected.
INFO: using Linux resctrl driver for RDT metrics (L3OCC, LMB, RMB) because resctrl driver is mounted.
Disabling NMI watchdog since it consumes one hw-PMU counter. To keep NMI watchdog set environment variable PCM_KEEP_NMI_WATCHDOG=1 (this reduces the core metrics set)
Closed perf event handles
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max UPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 2 speed: 23.3 GBytes/second (10.4 GT/second)
Socket 1
Max UPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)
Max UPI link 2 speed: 23.3 GBytes/second (10.4 GT/second)
Starting plain HTTP server on http://localhost:9738/
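The RDT limitation noted earlier can be checked before starting the server. The sketch below assumes the kernel reports RDT monitoring support through /proc/cpuinfo flags such as cqm_llc, cqm_occup_llc, cqm_mbm_total, and cqm_mbm_local. Those flag names and the helper name rdt_flags are assumptions for illustration, not something PCM itself documents, so verify them on your kernel.

```shell
# Print the RDT-related CPU feature flags found in /proc/cpuinfo-style
# text read from stdin (space-separated, sorted, one line).
# The flag list is an assumption about how Linux names these features.
rdt_flags() {
  grep -m1 '^flags' \
    | tr ' \t' '\n\n' \
    | grep -E '^(cqm|cqm_llc|cqm_occup_llc|cqm_mbm_total|cqm_mbm_local|rdt_a)$' \
    | sort -u \
    | paste -sd ' ' -
}
```

A possible invocation is `rdt_flags < /proc/cpuinfo`. An empty result would suggest that the RDT metrics (L3OCC, LMB, RMB) are unlikely to be available, as on the E5-2600 v3 host above.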
At this point, opening http://192.168.6.200:9738 (my server's address) in a browser shows the PCM Sensor Server introduction page, which provides configuration examples for combining InfluxDB and Prometheus with Grafana.
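Besides the browser, the same port can be queried from the command line. The following is a minimal sketch, assuming (as I understand the PCM Sensor Server introduction page to describe) that the server picks the output format from the HTTP Accept header: application/json for JSON and text/plain; version=0.0.4 for Prometheus text. The helper name pcm_fetch is hypothetical, and the header behavior should be confirmed against your installed version.

```shell
# Fetch counters from a pcm-sensor-server instance in the requested format.
# Usage: pcm_fetch <json|prometheus> [base_url]
# The Accept-header values are an assumption taken from the server's
# introduction page; base_url defaults to the local default port 9738.
pcm_fetch() {
  format=$1
  base_url=${2:-http://localhost:9738}
  case "$format" in
    json)       accept='application/json' ;;
    prometheus) accept='text/plain; version=0.0.4' ;;
    *)
      echo "usage: pcm_fetch <json|prometheus> [base_url]" >&2
      return 2
      ;;
  esac
  curl --silent --show-error --fail -H "Accept: $accept" "$base_url/"
}
```

For example, `pcm_fetch json http://192.168.6.200:9738` would request JSON from the server used in this article, while `pcm_fetch prometheus` targets localhost.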
Next, we can configure Intel PCM Grafana (Intel officially provides a very simple way to run it with Docker).
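Before Grafana can show anything, Prometheus has to scrape the exporter. As a sketch only, the helper below prints a scrape job entry meant to be pasted under scrape_configs: in prometheus.yml. The job name pcm, the helper name pcm_scrape_job, and the /metrics path (Prometheus's default, which I assume pcm-sensor-server serves) are illustrative assumptions rather than values taken from this article.

```shell
# Print a Prometheus scrape job for one or more pcm-sensor-server targets.
# Usage: pcm_scrape_job host:port [host:port ...]
# The output is a YAML list item intended for the scrape_configs section.
pcm_scrape_job() {
  printf '%s\n' \
    '  - job_name: pcm' \
    '    metrics_path: /metrics' \
    '    static_configs:' \
    '      - targets:'
  for target in "$@"; do
    printf "          - '%s'\n" "$target"
  done
}
```

Running `pcm_scrape_job 192.168.6.200:9738` prints a job for the server used in this article. Review the output before merging it into an existing configuration.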
Running in a Docker container
If you would rather not compile, install, or deploy by hand, you can run it in a container:
modprobe msr
# GitHub Container repository
docker run -d --name pcm --privileged -p 9738:9738 ghcr.io/opcm/pcm
# Or the Docker Hub repository
docker run -d --name pcm --privileged -p 9738:9738 opcm/pc
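To confirm that the container is actually exporting data, one option is to count the sample lines in its Prometheus output. This is a minimal sketch: count_samples is a hypothetical helper that ignores comment lines (# HELP, # TYPE) and blank lines, and the /metrics endpoint in the usage note is an assumption about the server.

```shell
# Count sample lines in Prometheus text exposition read from stdin,
# skipping '#' comment lines and blank lines.
# '|| true' keeps the exit status zero when the count is 0.
count_samples() {
  grep -cv -e '^#' -e '^[[:space:]]*$' || true
}
```

Something like `curl -s http://localhost:9738/metrics | count_samples` should then print a non-zero number once the container is up. A result of 0 would suggest the counters could not be programmed, for example because the msr module is not loaded.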