Alibaba Cloud Prometheus Monitoring

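Alibaba Cloud offers its Application Real-Time Monitoring Service (ARMS), a monitoring product built on open-source Prometheus and the Grafana general-purpose visualization and analytics platform. Its main products, Prometheus monitoring and the Grafana service, build on the open-source software and add the following: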

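Customized Grafana dashboards that work out of the box
One-click installation of the full Prometheus stack from a web page (essentially the same as the community approach of deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3, but with all the steps bundled into a guided page)
A fix for the installation problem that users in mainland China (yes, unfortunately that is us) cannot reliably reach GitHub or Google repositories (the biggest obstacle to installing the community prometheus-stack)
Integrated plugins for Log Service (SLS) and CloudMonitor (CMS)
- A comparable capability can be built with the community's Thanos distributed time-series storage. Thanos is an open-source CNCF project (not part of the Grafana stack) and integrates very well with Prometheus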

Note

ARMS Kubernetes monitoring resembles the eBPF technology used by Cilium networking, and may likewise be a customization built on open source.

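Note

What follows is my own usage experience and architecture analysis, intended as research and reference material on the cloud product.

Alibaba Cloud provides a Prometheus monitoring deployment, which is in fact a heavily modified build of the community version: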

kubectl get all -n arms-prom

The output includes:

NAME                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/arms-prom-operator-ack-arms-prometheus-9db46f96c   1         1         1       2d4h
...

Inspect the ReplicaSet arms-prom-operator-ack-arms-prometheus-9db46f96c:

kubectl get replicaset arms-prom-operator-ack-arms-prometheus-9db46f96c -o yaml -n arms-prom

The output includes:

...
      containers:
      - args:
        - --port=9335
        - --yaml=/etc/config/prometheusDisk/prometheus.yaml
...
        volumeMounts:
        - mountPath: /etc/config/prometheusDisk
          name: prom-config
...
      volumes:
...
      - emptyDir: {}
        name: prom-config

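In other words, the node that runs this ReplicaSet's pod provides a node-local emptyDir volume named prom-config. Because no medium is set, the volume is backed by the node's disk rather than tmpfs. To find the pod and the node it runs on: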

kubectl get pods -A -o wide | grep arms-prom-operator-ack-arms-prometheus-9db46f96c

Inside the container, /etc/config/prometheusDisk is an empty directory mounted from the node's disk:

# df -h
Filesystem                Size      Used Available Use% Mounted on
...
/dev/nvme3n1              3.4T     17.5G      3.2T   1% /etc/config/prometheusDisk
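
The `/dev/nvme3n1` device in the `df` output above matches how Kubernetes backs an `emptyDir: {}` volume: by default it lives on the node's own storage, and it is tmpfs only when `medium: Memory` is set. A minimal sketch contrasting the two (the pod, container, and volume names are hypothetical and not taken from the ARMS deployment):

```yaml
# Hypothetical pod for illustration only; not part of the ARMS deployment.
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /data/disk      # typically reported by df as a node block device
      name: disk-backed
    - mountPath: /data/mem       # reported by df as tmpfs
      name: memory-backed
  volumes:
  - name: disk-backed
    emptyDir: {}                 # default medium: the node's disk
  - name: memory-backed
    emptyDir:
      medium: Memory             # tmpfs; usage counts against container memory
```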

Check the processes inside the container:

# ps aux | grep pro
    1 root      3h12 /entry --port=9335 --yaml=/etc/config/prometheusDisk/prometheus.yaml ...

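However, /etc/config/prometheusDisk/prometheus.yaml does not exist inside the container.

Modifying the Alibaba Cloud Prometheus configuration through a ConfigMap

The following YAML template, combined with your own scrape jobs, can be used to amend the Alibaba Cloud Prometheus configuration:

Merging into the Alibaba Cloud Prometheus configuration through a ConfigMap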

apiVersion: v1
data:
  promYaml: |-
    scrape_configs:
    - job_name: my-job
      ...
kind: ConfigMap
metadata:
  labels:
    target: arms
    type: prometheus-yaml
  name: arms-prom-prometheus-yaml
  namespace: arms-prom

Apply it:

kubectl apply -f arms-prom-prometheus.yaml
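
For illustration, the elided job body could be an ordinary Prometheus scrape job. The sketch below assumes the `promYaml` value accepts standard `scrape_configs` syntax. The job name, interval, and target address are hypothetical placeholders, while the ConfigMap name, namespace, and labels are the ones from the template above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: arms-prom-prometheus-yaml
  namespace: arms-prom
  labels:
    target: arms
    type: prometheus-yaml
data:
  promYaml: |-
    scrape_configs:
    - job_name: my-job                 # hypothetical job name
      scrape_interval: 30s             # hypothetical interval
      metrics_path: /metrics
      static_configs:
      - targets:
        - my-service.default.svc:8080  # hypothetical target
```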

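`web.enable-lifecycle` is not supported

Alibaba Cloud's Prometheus configuration disables reloading via HTTP POST:

According to "Updating a k8s Prometheus operator's configs to add a scrape target", Prometheus supports reloading its configuration at runtime without a restart: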

curl -X POST http://localhost:9090/-/reload

On the Alibaba Cloud platform, however, the request returns:

Lifecycle API is not enabled.

Note

Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied. A configuration reload is triggered by sending a SIGHUP to the Prometheus process or sending a HTTP POST request to the /-/reload endpoint (when the --web.enable-lifecycle flag is enabled). This will also reload any configured rule files.

See also: Correct way to update rules and configuration for a Prometheus installation on a Kubernetes cluster that was setup by prometheus-operator helm chart?
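
For comparison, on a self-managed Prometheus the reload endpoint only responds once the server is started with `--web.enable-lifecycle`. A minimal sketch of the relevant container arguments, assuming a stock `prom/prometheus` image rather than the ARMS agent (the Deployment name, labels, and config path are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-demo                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-demo
  template:
    metadata:
      labels:
        app: prometheus-demo
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus         # pin a version in practice
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --web.enable-lifecycle       # enables POST /-/reload and /-/quit
        ports:
        - containerPort: 9090
```

With the flag set, the `curl -X POST http://localhost:9090/-/reload` call shown earlier is accepted. Without it, sending SIGHUP to the Prometheus process remains the documented alternative.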

`starship` Agent

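Alibaba Cloud GPU servers also use DCGM-Exporter for NVIDIA GPU monitoring, but Alibaba Cloud has customized and packaged it as the starship Agent, which runs as a systemd service on the physical host. This service conflicts with the DaemonSet mode used when integrating GPU observability into Kubernetes, so only one of the two should run on a given node. Running both leads to dcgm-exporter "context deadline exceeded" errors.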

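Note

I am testing a setup where the starship Agent and DCGM-Exporter coexist: use DaemonSet nodeAffinity to install the DCGM-Exporter DaemonSet only on labeled nodes, and in the same way use node anti-affinity (a nodeAffinity rule with a NotIn or DoesNotExist operator, since Kubernetes has no separate nodeAntiAffinity field) to avoid nodes that run starship under systemd.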
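
A minimal sketch of such a DaemonSet, trimmed to the scheduling-relevant fields. The namespace and both node label keys are hypothetical, and the image tag is a placeholder. A real DCGM-Exporter deployment also needs GPU runtime settings that are omitted here:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring                # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:        # expressions within one term are ANDed
              - key: example.com/dcgm-exporter   # hypothetical opt-in label
                operator: In
                values: ["enabled"]
              - key: example.com/starship        # hypothetical label set on starship nodes
                operator: DoesNotExist
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:TAG   # replace TAG with a real version
        ports:
        - containerPort: 9400          # dcgm-exporter's default metrics port
          name: metrics
```

Under these assumptions, nodes would be opted in with `kubectl label node <node> example.com/dcgm-exporter=enabled`, and any node carrying the starship label is skipped even if it is also opted in.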

References

Alibaba Cloud Application Real-Time Monitoring Service (ARMS) official help documentation
