prometheus-webhook-dingtalk

timonwong / prometheus-webhook-dingtalk is a third-party Alertmanager Webhook Receiver recommended by the official Prometheus project. It sends alert notifications through DingTalk.

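Running prometheus-webhook-dingtalk with systemd

Note

On systems without a container runtime, you can download the binary executable directly and manage starting and stopping the service with a Systemd unit, which is just as convenient.

  • Officially built executables can be downloaded from the GitHub Releases of timonwong / prometheus-webhook-dingtalk, for example the AMD64 build prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz. Copy the program to the /opt directory; the configuration below assumes this location: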

    tar xfz prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
    mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /opt/prometheus-webhook-dingtalk
    
  • Create /etc/systemd/system/prometheus-webhook-dingtalk.service with the following content:

[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target
 
[Service]
Restart=on-failure
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml
 
[Install]
WantedBy=multi-user.target
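  • Take the configuration files config.yml and template.tmpl from prometheus-webhook-dingtalk templates and copy both files to the /opt/prometheus-webhook-dingtalk directory (remember to edit config.yml so that it points to the correct location of template.tmpl).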

  • Start the service:

    systemctl daemon-reload
    systemctl start prometheus-webhook-dingtalk
    systemctl enable prometheus-webhook-dingtalk
    ss -tnl | grep 8060
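
The final check above can be made slightly stricter. A plain grep 8060 also matches ports such as 18060, so the sketch below compares the port exactly. The helper name port_listening is hypothetical (it is not part of any tool), and it assumes the usual ss -tnl column layout in which the fourth column is the local address.

```shell
# Succeed if the `ss -tnl` output on stdin shows a listener on TCP port $1.
# Assumption: column 1 is the state and column 4 is "address:port".
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # The port is the text after the last colon, which also
            # handles IPv6 forms such as [::]:8060.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }   # exit status 0 only when a match was seen
    '
}
```

It could be used as: ss -tnl | port_listening 8060 && echo "prometheus-webhook-dingtalk is listening"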
    

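Running with Docker

Note

The original author of timonwong / prometheus-webhook-dingtalk no longer uses DingTalk (presumably after leaving Alibaba), so the project documentation is not well maintained and you have to work things out from clues in the project's issues. I tried ./contrib/k8s without success, and since I had no time to dig further, I run it the simplest way, with Docker, to meet the project's immediate needs.

Running prometheus-webhook-dingtalk with docker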
docker run -d --restart always -p 8060:8060 -v $PWD/config.yml:/etc/prometheus-webhook-dingtalk/config.yml \
    timonwong/prometheus-webhook-dingtalk --config.file=/etc/prometheus-webhook-dingtalk/config.yml \
    --web.listen-address=0.0.0.0:8060 --web.enable-ui --web.enable-lifecycle
Running prometheus-webhook-dingtalk with nerdctl (containerd runtime)
nerdctl run -d --restart always -p 8060:8060 -v $PWD/config.yml:/etc/prometheus-webhook-dingtalk/config.yml \
timonwong/prometheus-webhook-dingtalk --config.file=/etc/prometheus-webhook-dingtalk/config.yml --web.listen-address=0.0.0.0:8060 --web.enable-ui

Note

The parameters --web.listen-address=0.0.0.0:8060 --web.enable-ui --web.enable-lifecycle :

  • --web.listen-address=0.0.0.0:8060 listens on all network interfaces

  • --web.enable-ui enables the web UI, which makes it convenient to configure templates from a web page

  • --web.enable-lifecycle allows the configuration to be reloaded with curl -XPOST http://localhost:8060/-/reload
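
The reload call can be wrapped in a small shell helper. This is only a sketch: the function names reload_url and reload_webhook_config are hypothetical, and it assumes the service was started with --web.enable-lifecycle and is reachable on the host and port given (defaulting to localhost and 8060 as used in this document).

```shell
# Build the reload endpoint URL; host ($1) and port ($2) are optional.
reload_url() {
    printf 'http://%s:%s/-/reload\n' "${1:-localhost}" "${2:-8060}"
}

# POST to the reload endpoint. With -f, curl exits non-zero on an HTTP
# error status, so a failed reload is visible to the calling script.
reload_webhook_config() {
    curl -fsS -X POST "$(reload_url "$@")"
}
```

After editing config.yml, running reload_webhook_config (or reload_webhook_config 192.168.6.115 8060 from another machine) would ask the service to re-read its configuration.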

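Note

ctr lacks many of the higher-level features of docker, for example port forwarding, so the improved tool nerdctl is used instead.

Note

The parameter --restart always is used here, which makes nerdctl stop ineffective. The workaround is to use nerdctl rm -f XXXX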

config.yml: configuring the tokens used to access DingTalk
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  cloud_atlas_alert:
    url: https://oapi.dingtalk.com/robot/send?access_token=zzzzzzzzzzzz
    mention:
      mobiles: ['136xxxxxxxxx']
  sre_team_1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      mobiles: ['136xxxx8827', '139xxxx8325']
  sre_team_2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']

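Note

Here I simply configured 3 targets (the equivalent of on-call groups). Once the matching receivers are set up in the kube-prometheus-stack Alertmanager configuration, the webhook identifies the target from the path in the URL, and the corresponding DingTalk robot is notified.

Below I configure Alertmanager in kube-prometheus-stack and add a receiver tied to this webhook.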
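
The relation between a target name and its webhook URL can be sketched in shell. The function names below are hypothetical. The URL shape /dingtalk/<target>/send is taken from the receiver URL used later in this document.

```shell
# Build the Alertmanager webhook URL for one target.
# $1: base URL of the webhook service, $2: target name from config.yml
dingtalk_webhook_url() {
    # ${1%/} drops one trailing slash so the path is not doubled.
    printf '%s/dingtalk/%s/send\n' "${1%/}" "$2"
}

# Recover the target name from such a URL: the path segment between
# /dingtalk/ and /send. Prints nothing if the URL has another shape.
dingtalk_target_of() {
    printf '%s\n' "$1" | sed -n 's#^.*/dingtalk/\([^/]*\)/send$#\1#p'
}
```

For example, dingtalk_webhook_url http://192.168.6.115:8060 sre_team_1 gives the URL that an Alertmanager receiver for the sre_team_1 target would use.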

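kube-prometheus-stack configuration

In kube-prometheus-stack, add the corresponding receivers through the helm values.yaml to connect Alertmanager with prometheus-webhook-dingtalk:

A simple Alertmanager receiver configuration is enough to receive notifications; here it uses prometheus-webhook-dingtalk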
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'cloud_atlas_alert'
      routes:
      - receiver: 'cloud_atlas_alert'
        matchers:
          - alertname =~ "InfoInhibitor|Watchdog"
    receivers:
    - name: cloud_atlas_alert
      webhook_configs:
        - url: http://192.168.6.115:8060/dingtalk/cloud_atlas_alert/send
    templates:
    - '/etc/alertmanager/config/*.tmpl'

Then update the Prometheus configuration of the Kubernetes cluster:

Using helm upgrade prometheus-community/kube-prometheus-stack
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values

Once alertmanager.yaml has been updated, the DingTalk group robot receives a notification immediately.

Figure: alert notification received in DingTalk (alert_dingtalk.png)

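Prometheus web.external-url

By default, the Graph link in a notification uses the internal domain name of Prometheus, http://kube-prometheus-stack-1680-prometheus.prometheus:9090/graph , which usually cannot be reached from outside (though you could add DNS resolution for this name inside your company). A better solution is to pass the --web.external-url parameter to Prometheus (Alertmanager has the same parameter). For the kube-prometheus-stack used in "Deploying Prometheus and Grafana with integrated GPU monitoring on a Kubernetes cluster (z-k8s)", the configuration is changed as follows: according to f663fb6, the setting to change should be prometheus.prometheusSpec.externalUrl . (Yes, this reminded me that for the kube-prometheus-stack TSDB data retention time I had once passed the runtime parameter --storage.tsdb.retention.time=180d to prometheus.)

Configuring the external access URL of Prometheus in kube-prometheus-stack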
## Deploy a Prometheus instance
##
prometheus:
  enabled: true
  ...
  ## Settings affecting prometheusSpec
  ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#prometheusspec
  ##
  prometheusSpec:
    ...
    ## External URL at which Prometheus will be reachable.
    ##
    externalUrl: "http://prometheus.cloud-atlas.io:9090"
    ...
    ## How long to retain metrics
    ##
    retention: 180d

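Notifying multiple groups

According to issue #198 ("How do I configure sending to multiple groups?"), you can try sending DingTalk messages to multiple groups: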

- name: 'rx'
  webhook_configs:
  - url: 'http://monitor-alertmanager-webhook-dingtalk:8060/dingtalk/r1/send'
  - url: 'http://monitor-alertmanager-webhook-dingtalk:8060/dingtalk/r2/send'
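
When the list of groups grows, a receiver of this shape can be generated rather than typed by hand. The following shell sketch is only an illustration: the function name multi_target_receiver is hypothetical, and the output should still be reviewed before it is pasted into the Alertmanager configuration.

```shell
# Print an Alertmanager receiver with one webhook_configs entry per target.
# $1: receiver name, $2: base URL of the webhook service, rest: target names
multi_target_receiver() {
    name=$1
    base=${2%/}          # drop one trailing slash, if any
    shift 2
    printf '%s\n' "- name: '$name'"
    printf '%s\n' "  webhook_configs:"
    for target in "$@"; do
        printf '%s\n' "  - url: '$base/dingtalk/$target/send'"
    done
}
```

Calling multi_target_receiver rx http://monitor-alertmanager-webhook-dingtalk:8060 r1 r2 prints a receiver equivalent to the one shown above.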

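Accessing the settings page

prometheus-webhook-dingtalk provides a configuration page written in Node.js (see Node.js Atlas); refer to the prometheus-webhook-dingtalk FAQ for configuring templates.

References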