IPMI Exporter

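Among the Prometheus exporters there is a community-maintained ipmi_exporter that exposes metrics collected over IPMI, and a well-made Grafana dashboard exists for it: Grafana Dashboard 15765: IPMI Exporter. Together they can monitor large server fleets and drive alerting.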

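ipmi_exporter exposes local IPMI metrics on the standard /metrics endpoint with no special configuration. For remote metrics, the general setup closely resembles Blackbox Exporter (black-box probing over HTTP, HTTPS, DNS, TCP, ICMP and gRPC): the target and module URL parameters tell the exporter which IPMI device to query. A single exporter can serve metrics for thousands of IPMI devices.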
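
A minimal sketch of the two request styles described above. The helper name ipmi_remote_url is hypothetical, the exporter address 127.0.0.1:9290 and the device address 10.1.2.23 are placeholders, and the /ipmi path for remote targets is an assumption based on the upstream ipmi_exporter README; adjust it if your version differs.

```shell
# Hypothetical helper: build the URL for querying a remote IPMI device
# through an ipmi_exporter instance.
# Assumption: remote targets are served on /ipmi, local metrics on /metrics.
ipmi_remote_url() {
  exporter="$1"           # host:port of the exporter, e.g. 127.0.0.1:9290
  target="$2"             # address of the IPMI device (BMC)
  module="${3:-default}"  # module name from the exporter config file
  printf 'http://%s/ipmi?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# Local metrics need no parameters:
#   curl -s http://127.0.0.1:9290/metrics
# Remote metrics name the device and the module:
#   curl -s "$(ipmi_remote_url 127.0.0.1:9290 10.1.2.23)"
```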

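Local IPMI access needs no username or password; remote IPMI access requires the username and password of the IPMI device. The exporter also supports a blacklist for excluding special OEM sensors that FreeIPMI does not support.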

Two example configuration files are provided: ipmi_local.yml scrapes metrics from the local host, and ipmi_remote.yml scrapes remote IPMI interfaces.

Note

The community ipmi_exporter uses FreeIPMI to access IPMI and collect server monitoring data.

An alternative is the ipmitool text plugin for Node Exporter.

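Installation

Install ipmi_exporter
version=1.6.1
# Download the release tarball (URL follows the GitHub release naming of prometheus-community/ipmi_exporter)
wget https://github.com/prometheus-community/ipmi_exporter/releases/download/v${version}/ipmi_exporter-${version}.linux-amd64.tar.gz
tar xfz ipmi_exporter-${version}.linux-amd64.tar.gz
cd ipmi_exporter-${version}.linux-amd64
sudo mv ipmi_exporter /usr/local/bin/
  • Install freeipmi on the server (this provides the full set of GNU FreeIPMI commands under /usr/sbin):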

Install FreeIPMI on Ubuntu
sudo apt install freeipmi -y
  • Create an ipmi-exporter user; as in the Prometheus quick start, this account is not allowed to log in:

Set up the ipmi-exporter user account
sudo groupadd --system ipmi-exporter
sudo useradd -s /sbin/nologin --system -g ipmi-exporter ipmi-exporter
  • The ipmi-exporter user must be able to run the freeipmi commands through sudo without a password, so configure /etc/sudoers:

Configure sudo privileges for the ipmi-exporter user
cat << 'EOF' > /tmp/ipmi-exporter.sudo
ipmi-exporter ALL = NOPASSWD: /usr/sbin/ipmimonitoring,\
                              /usr/sbin/ipmi-sensors,\
                              /usr/sbin/ipmi-dcmi,\
                              /usr/sbin/ipmi-raw,\
                              /usr/sbin/bmc-info,\
                              /usr/sbin/ipmi-chassis,\
                              /usr/sbin/ipmi-sel
EOF

cat /tmp/ipmi-exporter.sudo | sudo tee -a /etc/sudoers
rm /tmp/ipmi-exporter.sudo
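
Appending to /etc/sudoers directly leaves no chance to catch a syntax error before it takes effect. As an alternative sketch, the same rule can be generated, checked with visudo -cf, and installed as a drop-in file under /etc/sudoers.d. This assumes /etc/sudoers includes that directory (the default on Ubuntu); sudoers_rule and the temporary file path are hypothetical names.

```shell
# Hypothetical helper: print one sudoers rule allowing USER to run the
# given command paths without a password.
sudoers_rule() {
  user="$1"; shift
  cmds=$(printf '%s, ' "$@")   # "cmd1, cmd2, ..., " with a trailing separator
  printf '%s ALL = NOPASSWD: %s\n' "$user" "${cmds%, }"
}

# Usage (run manually; requires root):
#   sudoers_rule ipmi-exporter /usr/sbin/ipmimonitoring /usr/sbin/ipmi-sensors \
#       /usr/sbin/ipmi-dcmi /usr/sbin/ipmi-raw /usr/sbin/bmc-info \
#       /usr/sbin/ipmi-chassis /usr/sbin/ipmi-sel > /tmp/ipmi-exporter.sudoers
#   sudo visudo -cf /tmp/ipmi-exporter.sudoers \
#     && sudo install -m 0440 -o root -g root /tmp/ipmi-exporter.sudoers /etc/sudoers.d/ipmi-exporter
```
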
  • Create /etc/prometheus/ipmi_remote.yml:

Create /etc/prometheus/ipmi_remote.yml
modules:
    #cloudatlas:
    default:
        user: "some_user"
        pass: "secret_pw"
        privilege: "admin"
        driver: "LAN"
        collectors:
        - bmc
        - ipmi
        - dcmi
        - chassis
        - sel
        collector_cmd:
            bmc: sudo
            ipmi: sudo
            dcmi: sudo
            chassis: sudo
            sel: sudo
        custom_args:
            bmc:
            - "bmc-info"
            ipmi:
            - "ipmimonitoring"
            dcmi:
            - "ipmi-dcmi"
            chassis:
            - "ipmi-chassis"
            sel:
            - "ipmi-sel"
  • Create /etc/prometheus/ipmi_local.yml (for local collection):

Create /etc/prometheus/ipmi_local.yml
modules:
        #cloudatlas:
        default:
            # Available collectors are bmc, ipmi, chassis, dcmi, sel, and sm-lan-mode
            collectors:
            - bmc
            - ipmi
            - dcmi
            - chassis
            - sel
            collector_cmd:
                bmc: sudo
                ipmi: sudo
                dcmi: sudo
                chassis: sudo
                sel: sudo
            custom_args:
                bmc:
                - "bmc-info"
                ipmi:
                - "ipmimonitoring"
                dcmi:
                - "ipmi-dcmi"
                chassis:
                - "ipmi-chassis"
                sel:
                - "ipmi-sel"

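Note that every command added to /etc/sudoers is run through sudo here, following the ipmi_exporter example config.

  • Following the Prometheus quick start, create /etc/systemd/system/ipmi_exporter.service for ipmi_exporter (this assumes /etc/prometheus/ipmi_remote.yml for remote IPMI access; point --config.file at /etc/prometheus/ipmi_local.yml instead for local collection):

systemd unit file /etc/systemd/system/ipmi_exporter.service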
[Unit]
Description=ipmi_exporter
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=500
StartLimitBurst=5

[Service]
User=ipmi-exporter
Group=ipmi-exporter
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/ipmi_exporter \
  --config.file=/etc/prometheus/ipmi_remote.yml \
  --web.listen-address=0.0.0.0:9290

[Install]
WantedBy=multi-user.target
  • Start the ipmi_exporter service:

Start the ipmi_exporter service
sudo systemctl daemon-reload
sudo systemctl enable --now ipmi_exporter
sudo systemctl status ipmi_exporter

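Debugging

After ipmi_exporter was up and running, fetching http://127.0.0.1:9290/metrics directly with wget showed that no IPMI data had been collected (a metric value of 0 indicates a failed collection):

Output of wget http://127.0.0.1:9290/metrics: ipmi_up is 0 for every collector, meaning collection failed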
...
# HELP ipmi_up '1' if a scrape of the IPMI device was successful, '0' otherwise.
# TYPE ipmi_up gauge
ipmi_up{collector="bmc"} 0
ipmi_up{collector="chassis"} 0
ipmi_up{collector="dcmi"} 0
ipmi_up{collector="ipmi"} 0
...
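
To spot failing collectors quickly, the ipmi_up lines can be filtered for a value of 0. A small sketch using awk; the function name failed_collectors is hypothetical, and it only assumes the ipmi_up{collector="..."} line format shown above.

```shell
# Hypothetical helper: read exporter output on stdin and print the name of
# every collector whose ipmi_up value is 0 (collection failed).
failed_collectors() {
  awk '/^ipmi_up[{]/ && $NF == 0 {
    split($1, q, "\"")   # q[2] holds the collector label value
    print q[2]
  }'
}

# Usage:
#   curl -s http://127.0.0.1:9290/metrics | failed_collectors
```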

systemctl status ipmi_exporter shows that the commands were denied permission:

ipmi_exporter collection fails because of insufficient permissions
● ipmi_exporter.service - ipmi_exporter
     Loaded: loaded (/etc/systemd/system/ipmi_exporter.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-07-14 15:11:25 CST; 1min 51s ago
   Main PID: 234488 (ipmi_exporter)
      Tasks: 10 (limit: 464040)
     Memory: 3.2M
        CPU: 34ms
     CGroup: /system.slice/ipmi_exporter.service
             └─234488 /usr/local/bin/ipmi_exporter --config.file=/etc/prometheus/ipmi_local.yml --web.listen-address=0.0.0.0:9290

Jul 14 15:11:25 zcloud.staging.huatai.me systemd[1]: Started ipmi_exporter.
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=main.go:107 level=info msg="Starting ipmi_exporter" version="(version=1.6.1, branch=HEAD, revision=344b8b4a565a9ced936aad4d4ac9a29892515cba)"
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=config.go:243 level=info msg="Loaded config file" path=/etc/prometheus/ipmi_local.yml
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=main.go:172 level=info msg="Listening on" address=0.0.0.0:9290
Jul 14 15:11:25 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:11:25.208Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.415Z caller=collector_ipmi.go:151 level=error msg="Failed to collect sensor data" target=[local] error="error running ipmimonitoring: exit status 1: /usr/sbin/ipmi-sensors: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.420Z caller=collector_dcmi.go:53 level=error msg="Failed to collect DCMI data" target=[local] error="error running ipmi-dcmi: exit status 1: ipmi-dcmi: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.422Z caller=collector_bmc.go:53 level=error msg="Failed to collect BMC data" target=[local] error="error running bmc-info: exit status 1: bmc-info: permission denied\n"
Jul 14 15:12:11 zcloud.staging.huatai.me ipmi_exporter[234488]: ts=2023-07-14T07:12:11.425Z caller=collector_chassis.go:53 level=error msg="Failed to collect chassis data" target=[local] error="error running ipmi-chassis: exit status 1: ipmi-chassis: permission denied\n"

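Note that these error messages are printed each time the ipmi_exporter metrics are requested; they are the console output of the service.

I changed the ipmi-exporter account to allow login (/bin/bash), switched to it, and ran sudo bmc-info by hand, which produced normal output. So I suspected that the /etc/prometheus/ipmi_local.yml configuration was not taking effect (even though every command was configured to use sudo, following the ipmi_exporter example config).

Oops, I found the cause:

I had configured a module named cloudatlas instead of default. That module is only used when a Prometheus scrape requests it explicitly; a plain curl does not specify a module, so the sudo settings in the cloudatlas module were never applied.

After simplifying the configuration and renaming the module to default, a plain curl http://127.0.0.1:9290/metrics returns the complete IPMI output, including server temperatures and whether the fans are working properly.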
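
The pitfall above can be caught before starting the service by checking that the config file defines a module named default. A rough sketch: has_default_module is a hypothetical helper that only looks for an indented, uncommented default: key; it is a text match, not a real YAML parse.

```shell
# Hypothetical helper: succeed if FILE contains an indented "default:" key
# that is not commented out. Text match only; it does not parse YAML.
has_default_module() {
  grep -Eq '^[[:space:]]+default:[[:space:]]*$' "$1"
}

# Usage:
#   has_default_module /etc/prometheus/ipmi_local.yml \
#     || echo "no default module: plain /metrics requests will not use your settings" >&2
```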

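Configure Prometheus

Local metrics

Collecting local metrics is simple: just configure the default metrics endpoint of the host running the exporter:

  • Edit /etc/prometheus/prometheus.yml and add a scrape_configs entry:

Add an ipmi scrape job to the scrape_configs section of prometheus.yml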
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
...
  - job_name: ipmi
    scrape_interval: 1m
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: http
    static_configs:
    - targets:
      - 192.168.6.200:9290
  • Then restart the prometheus service; the new scrape target appears on the targets page:

../../../../_images/prometheus_impi.png

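Remote metrics (not yet tested)

Remote metrics require two files: one lists the targets, the other configures the scrape address (this assumes the DNS name of the host running ipmi_exporter is ipmi-exporter.internal.example.com):

  • Configure /srv/ipmi_exporter/targets.yml:

Set the target addresses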
---
- targets:
  - 10.1.2.23
  - 10.1.2.24
  - 10.1.2.25
  - 10.1.2.26
  - 10.1.2.27
  - 10.1.2.28
  - 10.1.2.29
  - 10.1.2.30
  labels:
    job: ipmi_exporter
  • Add a scrape_configs entry to /etc/prometheus/prometheus.yml:

Remote scrape configuration in prometheus.yml
- job_name: ipmi
  params:
    module: ['default']
  scrape_interval: 1m
  scrape_timeout: 30s
  metrics_path: /ipmi  # remote targets are served on /ipmi; /metrics is the local endpoint
  scheme: http
  file_sd_configs:
  - files:
    - /srv/ipmi_exporter/targets.yml
    refresh_interval: 5m
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: ${1}
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: ${1}
    action: replace
  - separator: ;
    regex: .*
    target_label: __address__
    replacement: ipmi-exporter.internal.example.com:9290
    action: replace
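
With many BMCs, the target file is easier to generate than to edit by hand. The sketch below prints a file in the same layout as the targets.yml shown earlier in this section; make_targets is a hypothetical helper and the addresses are the placeholders from that example.

```shell
# Hypothetical helper: print a file_sd target list for the given job label
# and BMC addresses, in the same layout as targets.yml.
make_targets() {
  job="$1"; shift
  printf '%s\n' '---' '- targets:'
  for addr in "$@"; do
    printf '  - %s\n' "$addr"
  done
  printf '  labels:\n    job: %s\n' "$job"
}

# Usage:
#   make_targets ipmi_exporter 10.1.2.23 10.1.2.24 | sudo tee /srv/ipmi_exporter/targets.yml
```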

Configure Grafana

In Grafana, use Grafana Dashboard 15765: IPMI Exporter.

Once this is done, the power consumption of my HPE ProLiant DL360 Gen9 server is displayed:

../../../../_images/grafana_impi.png

Temperature monitoring:

../../../../_images/grafana_impi_temperatures.png

Note

Fan speed was not collected.

References