Private Cloud Monitoring¶

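Although this cloud computing cluster is virtualized with KVM on a single second-hand server, the number of servers and the complexity of the deployment keep growing, and I gradually found that without solid monitoring it is very hard to troubleshoot system problems and to detect failures in time: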

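An abnormal NTP service causes many unexpected errors in a distributed cluster, for example a Squid SSH proxy failing with kex_exchange_identification: Connection closed
An abnormal DNS service causes service calls to fail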
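
Both failure modes listed above can be probed from a shell before they surface as cluster errors. The following is a minimal sketch, assuming a systemd-based host where `timedatectl` and `getent` are available. The function names `check_ntp` and `check_dns` are hypothetical helpers, not part of any tool mentioned on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: probe time synchronization and name resolution.

# Succeeds only when systemd reports the system clock as synchronized.
check_ntp() {
    [ "$(timedatectl show --property=NTPSynchronized --value 2>/dev/null)" = "yes" ]
}

# Succeeds only when the given host name resolves through NSS (files, DNS, ...).
check_dns() {
    getent hosts "$1" >/dev/null 2>&1
}

# Intended use (left commented out so the block only defines the helpers):
# check_ntp || echo "clock is not synchronized" >&2
# check_dns grafana.cloud-atlas.io || echo "DNS lookup failed" >&2
```

Either helper returns a non-zero status on failure, so both can be wired into a cron job or an exporter's textfile collector script.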

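Physical Host Monitoring¶

Prometheus can monitor the Kubernetes cluster and, through the IPMI Exporter, can also collect basic data from the physical host such as temperature and CPU frequency, so on the physical host:

Deploy Cockpit on the physical host and use its Performance Co-Pilot integration plugin to collect and analyze low-level system performance data
(Optional) Deploy Prometheus and the IPMI Exporter standalone on the physical host (run by systemd) so that monitoring data is collected continuously
Deploy a standalone Grafana on the physical host to aggregate basic operational monitoring and to visualize Performance Co-Pilot data, for example monitoring data from the underlying Ceph, Gluster and ZFS
Send DingTalk messages through prometheus-webhook-dingtalk and notifications through WeChat; in addition, try connecting an SMS or voice gateway of my own for notifications

Another lightweight host monitoring option is Cockpit. Distributions already ship it integrated and it can be enabled quickly. It can also be tried as a way to provide the monitoring services described above for Prometheus, while offering server configuration management at the same time:

Enabling cockpit-pcp makes it possible to monitor Metrics, including the server temperature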

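Note

The ideal monitoring solution is OpenTelemetry: Prometheus and Performance Co-Pilot only provide Metrics monitoring, and the two are in fact very similar in layer and function, whereas OpenTelemetry integrates Traces, Metrics and Logs for complete software stack analysis. This is of course more complex and better suited to in-depth analysis of distributed clusters. However, OpenTelemetry focuses on generating, collecting and managing data; for a complete productized solution, SigNoz can be used, or Prometheus + Jaeger (distributed tracing) + Fluentd (log collection) can be combined to build one

Note

After comparing different host monitoring options, I selected the options above from the HPE server monitoring solutions to monitor the server cluster in combination

Deploying the Prometheus software stack on a physical host is very simple: the few steps of the Prometheus quick start are enough for an initial deployment

Note

For production environments, if the server operating system is CentOS 7, use the following combination: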

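Deploying Prometheus¶

Besides the Prometheus that runs inside the Kubernetes cluster (y-k8s), deployed there with kube-prometheus-stack, I deploy a Prometheus + Grafana set directly on the physical host. It provides basic monitoring, integrates Performance Co-Pilot for performance monitoring, and integrates Intel® Performance Counter Monitor (Intel® PCM) for in-depth performance analysis of the physical host's Intel CPU architecture.

Prepare the user account:

Add the prometheus user to the operating system¶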

sudo groupadd --system prometheus
sudo useradd -s /sbin/nologin --system -g prometheus prometheus

Create the prometheus directories in the operating system¶

sudo mkdir /var/lib/prometheus
for i in rules rules.d files_sd; do sudo mkdir -p /etc/prometheus/${i}; done

Download the latest prometheus binaries:

Install Prometheus on Ubuntu¶

mkdir -p /tmp/prometheus && cd /tmp/prometheus
curl -s https://api.github.com/repos/prometheus/prometheus/releases/latest | grep browser_download_url | grep linux-amd64 | cut -d '"' -f 4 | wget -qi -
tar xvf prometheus*.tar.gz
cd prometheus*/
sudo mv prometheus promtool /usr/local/bin/

The unpacked Prometheus package directory contains a sample configuration as well as the console libraries:

Simple configuration¶

sudo mkdir -p /etc/prometheus
sudo mv consoles/ console_libraries/ /etc/prometheus/
sudo mv prometheus.yml /etc/prometheus/prometheus.yml
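
Before the service is started or restarted, the configuration file can be checked with `promtool`, which was installed alongside the `prometheus` binary. The following is a minimal sketch, assuming `promtool` is on the PATH. The wrapper name `validate_prometheus_config` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate a Prometheus configuration file with promtool.
validate_prometheus_config() {
    local cfg="${1:-/etc/prometheus/prometheus.yml}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 1
    fi
    # "promtool check config" parses the file and the rule files it references.
    promtool check config "$cfg"
}
```

Called without an argument, the helper checks /etc/prometheus/prometheus.yml. A non-zero exit status means the file should be fixed before Prometheus is started or reloaded.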

Create the systemd service unit file for Prometheus, /etc/systemd/system/prometheus.service :

Prometheus systemd service unit file /etc/systemd/system/prometheus.service¶

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=500
StartLimitBurst=5

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
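
The unit runs as the prometheus user and keeps its TSDB under /data/prometheus, so that directory has to exist and be writable by that user, otherwise the service will most likely fail at startup. The following is a minimal sketch, assuming the directory has not already been prepared elsewhere. `ensure_owned_dir` is a hypothetical helper, and the `SUDO` variable exists only so that the function can also run unprivileged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a directory owned by the given user and group.
# SUDO defaults to "sudo"; set SUDO="" to run without privilege escalation.
ensure_owned_dir() {
    local dir="$1" owner="$2" group="$3"
    # install -d creates missing parent directories with default attributes,
    # then creates the final directory with the requested owner, group and mode.
    ${SUDO-sudo} install -d -o "$owner" -g "$group" -m 0755 "$dir"
}
```

For the unit above the call would be `ensure_owned_dir /data/prometheus prometheus prometheus`.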

Start the service:

Start Prometheus¶

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
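
Besides `systemctl status`, Prometheus exposes management endpoints that confirm it is actually serving. Because the unit passes `--web.enable-lifecycle`, the configuration can also be reloaded over HTTP. The following is a minimal sketch, assuming `curl` is installed and Prometheus listens on localhost:9090 as configured in the unit. The function names and the `PROM_URL` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around the Prometheus management endpoints.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Succeeds when both the health and the readiness endpoint answer with HTTP 2xx.
prometheus_ready() {
    curl -fsS --max-time 5 "${1:-$PROM_URL}/-/healthy" >/dev/null &&
    curl -fsS --max-time 5 "${1:-$PROM_URL}/-/ready" >/dev/null
}

# Asks Prometheus to re-read its configuration (needs --web.enable-lifecycle).
prometheus_reload() {
    curl -fsS --max-time 5 -X POST "${1:-$PROM_URL}/-/reload"
}
```

`prometheus_ready && echo ok` is a quick smoke test after a restart. `prometheus_reload` avoids a full restart after editing the configuration.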

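Deploying Grafana¶

The physical host runs Ubuntu Linux 22.04, and Grafana provides a convenient package repository, so the installation can be completed quickly

Add the APT source for the community (OSS) edition:

Install Grafana on Ubuntu¶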

sudo apt install -y apt-transport-https
sudo apt install -y software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

sudo apt update
sudo apt install grafana
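
`apt-key` is deprecated on recent Debian and Ubuntu releases. It still works on Ubuntu 22.04 but prints a warning. A commonly used alternative is a dedicated keyring file referenced with `signed-by`. The sketch below is not the procedure used above. It reuses the same key and repository URLs, and the keyring path /etc/apt/keyrings/grafana.gpg and the helper name `grafana_source_line` are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an APT source line pinned to one keyring file.
grafana_source_line() {
    local keyring="${1:-/etc/apt/keyrings/grafana.gpg}"
    echo "deb [signed-by=${keyring}] https://packages.grafana.com/oss/deb stable main"
}

# Intended use (needs root and network access, so left commented out):
# sudo mkdir -p /etc/apt/keyrings
# wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg >/dev/null
# grafana_source_line | sudo tee /etc/apt/sources.list.d/grafana.list
```

With this variant the key is trusted only for the Grafana repository instead of for every configured APT source.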

Start the service:

Start Grafana¶

sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server
sudo systemctl enable grafana-server.service

Reverse Proxy¶

Prometheus (port 9090) and Grafana (port 3000) both have a built-in web service. For convenient access, they are run behind an Nginx reverse proxy:

NGINX reverse proxy configuration for alertmanager¶

upstream alertmanager {
    server 192.168.6.115:30903;
}

server {
    listen 9093;
    #listen [::]:80;

    server_name alertmanager alertmanager.cloud-atlas.io;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://alertmanager;
    }
}

NGINX reverse proxy configuration for grafana¶

upstream grafana {
    server 192.168.6.115:30080;
}

server {
    listen 80;
    #listen [::]:80;

    server_name grafana grafana.cloud-atlas.io;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
    }
}

NGINX reverse proxy configuration for prometheus¶

upstream prometheus {
    server 192.168.6.115:30090;
}

server {
    listen 9092;
    #listen [::]:80;

    server_name prometheus prometheus.cloud-atlas.io;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://prometheus;
    }
}

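After starting NGINX, access the services by domain name; Prometheus and Alertmanager are reached on the ports their server blocks listen on:

grafana.cloud-atlas.io
prometheus.cloud-atlas.io:9092
alertmanager.cloud-atlas.io:9093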
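
The three server blocks can be tested on the proxy host itself, even before DNS for the names is in place, by sending the virtual host name in the `Host` header. The following is a minimal sketch, assuming `curl` is installed and NGINX listens on 127.0.0.1. `probe_vhost` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: request "/" from a local NGINX listener while
# presenting the virtual host name, and print the HTTP status code.
probe_vhost() {
    local host="$1" port="${2:-80}"
    curl -sS -o /dev/null --max-time 5 -w '%{http_code}\n' \
        -H "Host: ${host}" "http://127.0.0.1:${port}/"
}

# Intended use after "sudo nginx -t && sudo systemctl reload nginx":
# probe_vhost grafana.cloud-atlas.io 80
# probe_vhost prometheus.cloud-atlas.io 9092
# probe_vhost alertmanager.cloud-atlas.io 9093
```

A 2xx or 3xx status means that NGINX matched the server block and the upstream answered. A 502 usually points at the upstream address in the `upstream` block.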

Installing Prometheus / Grafana / Alertmanager only provides the framework for running monitoring. Comprehensive system monitoring requires the rich set of Prometheus exporters as well as third-party exporters:

Node Exporter¶

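Node Exporter is the first important exporter to deploy

Note

My Kubernetes cluster (y-k8s) uses NVIDIA Virtual GPU (vGPU) on an NVIDIA GPU to build a simulated GPU Kubernetes, and NVIDIA officially provides DCGM-Exporter for comprehensive GPU monitoring. However, because the GPU is deployed together with Kubernetes, I deploy the whole GPU management stack and the DCGM-Exporter based monitoring only inside the Kubernetes cluster (y-k8s), where Prometheus and Grafana are deployed with Helm 3, and not on the physical server.

Download and install the Node Exporter binary:

Install the Node Exporter binary¶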

version=1.6.1
wget https://github.com/prometheus/node_exporter/releases/download/v${version}/node_exporter-${version}.linux-amd64.tar.gz
tar xvfz node_exporter-${version}.linux-amd64.tar.gz
cd node_exporter-${version}.linux-amd64/
sudo mv node_exporter /usr/local/bin/

# Run directly
/usr/local/bin/node_exporter

Configure /etc/systemd/system/node_exporter.service :

Configure the Node Exporter service to run under systemd¶

[Unit]
Description=node_exporter
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=500
StartLimitBurst=5

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Start:

Start the node_exporter service with systemctl¶

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter

Edit /etc/prometheus/prometheus.yml to add scraping of Node Exporter data from the target servers:

Add the node scrape job to /etc/prometheus/prometheus.yml¶

...
scrape_configs:
...
  - job_name: "node"
    static_configs:
    - targets:
      - localhost:9100
      - 192.168.6.11:9100
      - 192.168.6.12:9100

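In Grafana Dashboards, search for and import the dashboard with ID 1860 ( Node Exporter Full )

PCM Monitoring¶

For the Intel CPU architecture, pcm-exporter (the Intel Performance Counter Monitor (Intel PCM) Prometheus exporter) provides detailed CPU monitoring. Combined with Intel PCM Grafana, it helps locate CPU-related faults and anomalies in day-to-day production:

Most recent mainstream distributions include an Intel PCM package. For example, installing pcm on Ubuntu Linux provides pcm-sensor-server :

Install Intel PCM on Ubuntu¶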

sudo apt install pcm -y

Although pcm-sensor-server can be run directly, for easier maintenance it is run through a systemd service unit, /etc/systemd/system/pcm-exporter.service :

/etc/systemd/system/pcm-exporter.service¶

[Unit]
Description=pcm-exporter
Wants=network-online.target
After=network-online.target

StartLimitIntervalSec=500
StartLimitBurst=5

[Service]
User=root
Group=root
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/sbin/pcm-sensor-server \
  -p 9738 \
  --real-time

[Install]
WantedBy=multi-user.target

Start pcm-exporter :

Start the pcm-exporter service¶

sudo systemctl daemon-reload
sudo systemctl enable --now pcm-exporter
sudo systemctl status pcm-exporter

In Grafana Dashboards, search for and import the dashboard with ID 17108 ( Grafana Dashboard 17108: Processor Counter Monitor (PCM) Dashboard )