IPMI Exporter¶
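Note
Prometheus Exporters includes ipmi_exporter, maintained by the Prometheus community, which exposes metrics collected over IPMI. An excellent ready-made Grafana dashboard is also available: Grafana Dashboard 15765: IPMI Exporter. Together they can monitor large server clusters and generate alerts.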
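ipmi_exporter exposes local IPMI metrics on the standard /metrics endpoint with no special configuration. Remote metrics are configured much like Blackbox Exporter (black-box probing of HTTP, HTTPS, DNS, TCP, ICMP and gRPC): the target and module URL parameters tell the exporter which IPMI device to query and which configuration module to use. A single exporter can serve metrics for thousands of IPMI devices this way.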
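As a minimal sketch of how the target and module parameters fit together, the snippet below builds the URL for querying one remote BMC through the exporter. It assumes the exporter serves remote targets on the /ipmi path (as I understand the upstream README) while /metrics serves the local host; the helper name remote_scrape_url and the addresses are hypothetical.

```python
from urllib.parse import urlencode


def remote_scrape_url(exporter: str, target: str, module: str = "default") -> str:
    """Build the URL that asks the exporter to query one remote BMC."""
    # Assumption: /ipmi is the endpoint for remote targets; /metrics is local only.
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/ipmi?{query}"


url = remote_scrape_url("127.0.0.1:9290", "10.1.2.23")
print(url)
```

The resulting URL can be passed to curl or wget to check a single device by hand before wiring it into Prometheus.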
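Local IPMI access needs no account or password; remote IPMI access requires the username and password of the IPMI device. The exporter also provides a blacklist option for excluding special OEM sensors that FreeIPMI does not support.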
Two example configuration files are provided: ipmi_local.yml for scraping metrics from the local host, and ipmi_remote.yml for scraping remote IPMI interfaces.
Note
The community ipmi_exporter listed under Prometheus Exporters uses freeipmi to access IPMI and collect server monitoring data.
An alternative is the Node Exporter ipmitool text plugin provided with Node Exporter.
Installation¶
Download a release package from Prometheus IPMI Exporter (GitHub), or build it yourself following the documentation there:
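version=1.6.1
# download the release tarball from the project's GitHub releases page
wget https://github.com/prometheus-community/ipmi_exporter/releases/download/v${version}/ipmi_exporter-${version}.linux-amd64.tar.gz
tar xfz ipmi_exporter-${version}.linux-amd64.tar.gz
cd ipmi_exporter-${version}.linux-amd64
sudo mv ipmi_exporter /usr/local/bin/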
Install freeipmi on the server (this provides the full set of GNU FreeIPMI commands, located in /usr/sbin):
sudo apt install freeipmi -y
Create an ipmi-exporter user. As in Prometheus Quick Start, this user is not allowed to log in:
sudo groupadd --system ipmi-exporter
sudo useradd -s /sbin/nologin --system -g ipmi-exporter ipmi-exporter
The ipmi-exporter user must be able to run the freeipmi commands through sudo without a password, so add the following to /etc/sudoers:
cat << 'EOF' > /tmp/ipmi-exporter.sudo
ipmi-exporter ALL = NOPASSWD: /usr/sbin/ipmimonitoring,\
                              /usr/sbin/ipmi-sensors,\
                              /usr/sbin/ipmi-dcmi,\
                              /usr/sbin/ipmi-raw,\
                              /usr/sbin/bmc-info,\
                              /usr/sbin/ipmi-chassis,\
                              /usr/sbin/ipmi-sel
EOF
cat /tmp/ipmi-exporter.sudo | sudo tee -a /etc/sudoers
# check that /etc/sudoers still parses after the append
sudo visudo -c
rm /tmp/ipmi-exporter.sudo
Create /etc/prometheus/ipmi_remote.yml:
modules:
  #cloudatlas:
  default:
    user: "some_user"
    pass: "secret_pw"
    privilege: "admin"
    driver: "LAN"
    collectors:
    - bmc
    - ipmi
    - dcmi
    - chassis
    - sel
    collector_cmd:
      bmc: sudo
      ipmi: sudo
      dcmi: sudo
      chassis: sudo
      sel: sudo
    custom_args:
      bmc:
      - "bmc-info"
      ipmi:
      - "ipmimonitoring"
      dcmi:
      - "ipmi-dcmi"
      chassis:
      - "ipmi-chassis"
      sel:
      - "ipmi-sel"
Create /etc/prometheus/ipmi_local.yml (for local collection):
modules:
  #cloudatlas:
  default:
    # Available collectors are bmc, ipmi, chassis, dcmi, sel, and sm-lan-mode
    collectors:
    - bmc
    - ipmi
    - dcmi
    - chassis
    - sel
    collector_cmd:
      bmc: sudo
      ipmi: sudo
      dcmi: sudo
      chassis: sudo
      sel: sudo
    custom_args:
      bmc:
      - "bmc-info"
      ipmi:
      - "ipmimonitoring"
      dcmi:
      - "ipmi-dcmi"
      chassis:
      - "ipmi-chassis"
      sel:
      - "ipmi-sel"
Note that every command allowed in /etc/sudoers is run through sudo here, following the ipmi_exporter example config.
Following the approach in Prometheus Quick Start, create /etc/systemd/system/ipmi_exporter.service for ipmi_exporter (this assumes /etc/prometheus/ipmi_remote.yml is used for remote IPMI access):
[Unit]
Description=ipmi_exporter
Wants=network-online.target
After=network-online.target
StartLimitIntervalSec=500
StartLimitBurst=5
[Service]
User=ipmi-exporter
Group=ipmi-exporter
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/ipmi_exporter \
--config.file=/etc/prometheus/ipmi_remote.yml \
--web.listen-address=0.0.0.0:9290
[Install]
WantedBy=multi-user.target
Start the ipmi_exporter service:
sudo systemctl daemon-reload
sudo systemctl enable --now ipmi_exporter
sudo systemctl status ipmi_exporter
debug¶
With ipmi_exporter running, the data returned by wget http://127.0.0.1:9290/metrics showed that no IPMI data had been collected (a metric value of 0 means the collector failed):
...
# HELP ipmi_up '1' if a scrape of the IPMI device was successful, '0' otherwise.
# TYPE ipmi_up gauge
ipmi_up{collector="bmc"} 0
ipmi_up{collector="chassis"} 0
ipmi_up{collector="dcmi"} 0
ipmi_up{collector="ipmi"} 0
...
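To spot this condition without reading the raw output by eye, the ipmi_up lines can be parsed directly. The sketch below is a minimal illustration that works on the sample text shown above; the function name failed_collectors is hypothetical, and fetching the text from the exporter is left out.

```python
import re
from typing import List

SAMPLE = """\
# HELP ipmi_up '1' if a scrape of the IPMI device was successful, '0' otherwise.
# TYPE ipmi_up gauge
ipmi_up{collector="bmc"} 0
ipmi_up{collector="chassis"} 0
ipmi_up{collector="dcmi"} 0
ipmi_up{collector="ipmi"} 0
"""

# Matches lines such as: ipmi_up{collector="bmc"} 0
IPMI_UP = re.compile(r'^ipmi_up\{collector="([^"]+)"\}\s+(\S+)$')


def failed_collectors(metrics_text: str) -> List[str]:
    """Return the collectors whose ipmi_up value is 0."""
    failed = []
    for line in metrics_text.splitlines():
        match = IPMI_UP.match(line.strip())
        if match and float(match.group(2)) == 0:
            failed.append(match.group(1))
    return failed


print(failed_collectors(SAMPLE))
```

An empty list means every collector reported a successful scrape.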
Checking the service with systemctl status ipmi_exporter shows permission denied errors:
● ipmi_exporter.service - ipmi_exporter
Loaded: loaded (/etc/systemd/system/ipmi_exporter.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2023-07-14 15:11:25 CST; 1min 51s ago
Main PID: 234488 (ipmi_exporter)
Tasks: 10 (limit: 464040)
Memory: 3.2M
CPU: 34ms
CGroup: /system.slice/ipmi_exporter.service
└─234488 /usr/local/bin/ipmi_exporter --config.file=/etc/prometheus/ipmi_local.yml --web.listen-address=0.0.0.0:9290
Jul 14 15:11:25 zcloud.staging.huatai.me systemd[1]: Started ipmi_exporter.
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=main.go:107 level=info msg="Starting ipmi_exporter" version="(version=1.6.1, branch=HEAD, revision=344b8b4a565a9ced936aad4d4ac9a29892515cba)"
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=config.go:243 level=info msg="Loaded config file" path=/etc/prometheus/ipmi_local.yml
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=main.go:172 level=info msg="Listening on" address=0.0.0.0:9290
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.415Z caller=collector_ipmi.go:151 level=error msg="Failed to collect sensor data" target=[local] error="error running ipmimonitoring: exit status 1: /usr/sbin/ipmi-sensors: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.420Z caller=collector_dcmi.go:53 level=error msg="Failed to collect DCMI data" target=[local] error="error running ipmi-dcmi: exit status 1: ipmi-dcmi: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.422Z caller=collector_bmc.go:53 level=error msg="Failed to collect BMC data" target=[local] error="error running bmc-info: exit status 1: bmc-info: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.425Z caller=collector_chassis.go:53 level=error msg="Failed to collect chassis data" target=[local] error="error running ipmi-chassis: exit status 1: ipmi-chassis: permission denied\n"
Note that these error messages are written each time the ipmi_exporter metrics endpoint is requested; they are the service's console output.
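I changed the ipmi-exporter account to allow login (/bin/bash), switched to it, and ran sudo bmc-info by hand, which worked correctly. That pointed to the settings in /etc/prometheus/ipmi_local.yml not taking effect, even though every command was configured to use sudo as in the ipmi_exporter example config.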
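The cause turned out to be simple:
I had configured a module named cloudatlas rather than default. A named module is used only when the Prometheus scrape specifies it; a plain curl request specifies no module, so the sudo settings in the cloudatlas module were never applied.
After simplifying the configuration and renaming the module to default, curl http://127.0.0.1:9290/metrics returned the complete IPMI output, including server temperatures, whether the fans are working properly, and more.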
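The failure can be pictured with a small model of module selection. This is a simplified sketch and not the exporter's actual code: it assumes that a request without a module parameter uses the module named default, and that the exporter falls back to built-in settings without sudo when no such module is defined. The names select_module and BUILTIN_DEFAULTS are hypothetical.

```python
# Simplified model of module selection; NOT the exporter's real implementation.
# Assumption: built-in defaults run the FreeIPMI commands directly, without sudo.
BUILTIN_DEFAULTS = {"collector_cmd": {}}


def select_module(modules, requested=None):
    """Return the settings applied to a request naming `requested` (or no module)."""
    name = requested or "default"
    return modules.get(name, BUILTIN_DEFAULTS)


# The configuration that caused the problem: only a `cloudatlas` module exists.
modules = {"cloudatlas": {"collector_cmd": {"bmc": "sudo"}}}

# Plain curl: no module parameter, so the sudo settings are not used.
print(select_module(modules))
# A scrape with module=cloudatlas would pick up the sudo settings.
print(select_module(modules, "cloudatlas"))
```

Renaming the module to default makes the first lookup return the sudo settings, which matches the fix described above.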
Configuring Prometheus¶
Local metrics¶
Scraping local metrics is simple: point Prometheus at the default metrics endpoint of the host running the exporter.
Edit /etc/prometheus/prometheus.yml and add a scrape_configs entry:
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  ...
  - job_name: ipmi
    scrape_interval: 1m
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: http
    static_configs:
    - targets:
      - 192.168.6.200:9290
Then restart the prometheus service; the new scrape target appears on the targets page.
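Remote metrics (not yet tested)¶
Remote metrics need two files: one listing the targets and one configuring the scrape (this assumes the DNS name of the remote host running the exporter is ipmi-exporter.internal.example.com):
Create /srv/ipmi_exporter/targets.yml: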
---
- targets:
  - 10.1.2.23
  - 10.1.2.24
  - 10.1.2.25
  - 10.1.2.26
  - 10.1.2.27
  - 10.1.2.28
  - 10.1.2.29
  - 10.1.2.30
  labels:
    job: ipmi_exporter
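For longer address ranges the target list can be generated rather than typed. The sketch below is one possible approach, assuming the BMC addresses are contiguous; it prints the same structure as JSON, which Prometheus file-based service discovery also accepts for files ending in .json. The function name bmc_targets is hypothetical.

```python
import ipaddress
import json


def bmc_targets(first: str, last: str, job: str = "ipmi_exporter") -> list:
    """Build one file_sd target group covering first..last inclusive."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    targets = [str(ipaddress.IPv4Address(n)) for n in range(start, end + 1)]
    return [{"targets": targets, "labels": {"job": job}}]


groups = bmc_targets("10.1.2.23", "10.1.2.30")
print(json.dumps(groups, indent=2))
```

Writing this output to a .json file and listing that file under file_sd_configs would be equivalent to maintaining the YAML list by hand.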
Add a scrape_configs entry to /etc/prometheus/prometheus.yml:
- job_name: ipmi
  params:
    module: ['default']
  scrape_interval: 1m
  scrape_timeout: 30s
  # remote targets are served on /ipmi; /metrics only covers the exporter's local host
  metrics_path: /ipmi
  scheme: http
  file_sd_configs:
  - files:
    - /srv/ipmi_exporter/targets.yml
    refresh_interval: 5m
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: ${1}
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: ${1}
    action: replace
  - separator: ;
    regex: .*
    target_label: __address__
    replacement: ipmi-exporter.internal.example.com:9290
    action: replace
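The three relabel rules are easier to follow when traced for a single target. The sketch below mimics their effect with a plain dictionary; it is an illustration of the rules above, not Prometheus code, and the function name apply_relabeling is hypothetical.

```python
def apply_relabeling(address: str, exporter: str) -> dict:
    """Trace the three relabel rules for one discovered target."""
    labels = {"__address__": address}
    # Rule 1: copy the discovered BMC address into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the BMC address as the `instance` label on the resulting series.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual HTTP request to the exporter instead of the BMC.
    labels["__address__"] = exporter
    return labels


result = apply_relabeling("10.1.2.23", "ipmi-exporter.internal.example.com:9290")
print(result)
```

Prometheus therefore always connects to the exporter, while each time series still carries the BMC address as its instance label.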
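Configuring Grafana¶
In Grafana, use Grafana Dashboard 15765: IPMI Exporter.
Once this is done, the dashboard shows power consumption monitoring for my HPE ProLiant DL360 Gen9 server (see HPE server monitoring):
Temperature monitoring:
Note
Fan speed was not collected.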