prometheus-webhook-dingtalk: DingTalk message too long

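After I set up prometheus-webhook-dingtalk and it had received a large number of Prometheus alerts, I noticed that from some point on no new alert notifications were arriving, which did not seem right.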

The Prometheus Alerts page showed many alerts in the Firing state, so alerts were still being sent out through Alertmanager. Why, then, were no DingTalk messages arriving?

prometheus-webhook-dingtalk runs as a systemd service here, so its service log can be checked with journalctl:

Run journalctl to check the prometheus-webhook-dingtalk service log
journalctl -u prometheus-webhook-dingtalk.service --no-pager

The service log shows that the DingTalk message is too long (resp_status=400 / respCode=460101):

prometheus-webhook-dingtalk log: the message body is too long (over 20000 bytes) and is rejected by the DingTalk server
Jun 08 10:05:46 iZ2zeav45krsh6sr8t9r4qZ prometheus-webhook-dingtalk[30365]: ts=2023-06-08T02:05:46.971Z caller=dingtalk.go:103 level=error component=web target=antgpu_sre msg="Failed to send notification to DingTalk" respCode=460101 respMsg="description: body 大小不合法;solution:请保持大小在 20000bytes 以内;"
Jun 08 10:05:46 iZ2zeav45krsh6sr8t9r4qZ prometheus-webhook-dingtalk[30365]: ts=2023-06-08T02:05:46.971Z caller=entry.go:26 level=info component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=10.233.76.68:36966 user_agent=Alertmanager/0.25.0 uri=http://172.21.45.22:8060/dingtalk/antgpu_sre/send resp_status=400 resp_bytes_length=27 resp_elapsed_ms=50.012778 msg="request complete"
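To spot this failure without reading the log by eye, the key=value fields of a log line can be parsed and checked for the error code. The following is only a minimal sketch: it assumes the log keeps the logfmt-style layout shown above, the helper names `parse_logfmt` and `is_body_too_large` are hypothetical, and `sample` is an abbreviated copy of the first log line.

```python
import shlex

# respCode reported by DingTalk in the log above for an oversized body
DINGTALK_BODY_TOO_LARGE = "460101"


def parse_logfmt(line: str) -> dict:
    """Collect key=value pairs from a logfmt-style line.

    Tokens without '=' (for example the journald timestamp prefix) are skipped.
    shlex keeps quoted values such as msg="..." together as one token.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def is_body_too_large(line: str) -> bool:
    """True if the line is an error entry carrying the 'body too large' code."""
    fields = parse_logfmt(line)
    return (
        fields.get("level") == "error"
        and fields.get("respCode") == DINGTALK_BODY_TOO_LARGE
    )


# Abbreviated from the first log line above (respMsg omitted)
sample = (
    'ts=2023-06-08T02:05:46.971Z caller=dingtalk.go:103 level=error '
    'component=web target=antgpu_sre '
    'msg="Failed to send notification to DingTalk" respCode=460101'
)

print(is_body_too_large(sample))
```

Piping the journalctl output through a check like this separates "message too long" rejections from other delivery errors.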

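Main causes of overlong messages:

  • A large number of alerts: Alertmanager groups them into a single notification, which makes the message too long

  • The inhibit_rules leave many duplicate alerts in place

  • Too many info-level messages by default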
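The effect of grouping on message size can be estimated with a small calculation. This is a rough sketch under stated assumptions: the message layout in `render_group` is a made-up template (not the real prometheus-webhook-dingtalk template), the alerts are synthetic, and how DingTalk actually counts the body size may differ. Only the 20000-byte figure comes from the respMsg in the log above.

```python
import json

# Limit quoted in the respMsg of the log above
DINGTALK_LIMIT_BYTES = 20000


def render_group(alerts):
    """Render one alert group as a single markdown text (hypothetical template)."""
    lines = [f"### {len(alerts)} alerts firing"]
    for a in alerts:
        lines.append(
            f"- **{a['alertname']}** [{a['severity']}] "
            f"{a['namespace']}: {a['summary']}"
        )
    return "\n".join(lines)


def body_size(alerts):
    """Size in bytes of a markdown-type message body carrying the whole group."""
    body = {
        "msgtype": "markdown",
        "markdown": {"title": "alerts", "text": render_group(alerts)},
    }
    return len(json.dumps(body, ensure_ascii=False).encode("utf-8"))


def make_alerts(n, severity="warning"):
    """Synthetic alerts, for illustration only."""
    return [
        {
            "alertname": f"ExampleAlert{i}",
            "severity": severity,
            "namespace": "monitoring",
            "summary": "x" * 80,
        }
        for i in range(n)
    ]


small = body_size(make_alerts(5))
large = body_size(make_alerts(300))
print(small <= DINGTALK_LIMIT_BYTES, large <= DINGTALK_LIMIT_BYTES)
```

A handful of alerts fits easily, while a few hundred alerts grouped into one notification exceed the limit, so reducing the number of alerts per group is what matters.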

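In the issue "Error- Message is too long #30", someone suggested modifying inhibit_rules to eliminate duplicate alerts, which seems workable to me. For example, the default configuration: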

Default inhibit_rules
     inhibit_rules:
       - source_matchers:
           - 'severity = critical'
         target_matchers:
           - 'severity =~ warning|info'
         equal:
           - 'namespace'
           - 'alertname'

Change it to:

inhibit_rules changed to inhibit only warning, filtering out info-level messages
inhibit_rules:
  - source_match:      
      severity: 'critical'
    target_match:      
      severity: 'warning'
    equal: ['alertname']
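To see how the two rule sets above differ, the inhibition decision can be modelled in a few lines. This is a simplified sketch, not Alertmanager's implementation: it handles only the `=` and `=~` operators, ignores the special case of an alert matching both sides of a rule, and the alerts are made-up examples. It follows the documented behaviour that a target alert is muted when a firing source alert has the same values for every label listed in `equal`, with a missing label treated as empty.

```python
import re


def matches(labels, matchers):
    """matchers: list of (label, op, value); '=~' is an anchored regex match."""
    for label, op, value in matchers:
        actual = labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def is_inhibited(target, firing, rule):
    """True if some firing source alert mutes `target` under `rule`."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(l, "") == target.get(l, "") for l in rule["equal"])
        for source in firing
    )


default_rule = {
    "source": [("severity", "=", "critical")],
    "target": [("severity", "=~", "warning|info")],
    "equal": ["namespace", "alertname"],
}
modified_rule = {
    "source": [("severity", "=", "critical")],
    "target": [("severity", "=", "warning")],
    "equal": ["alertname"],
}

# Made-up alerts for illustration
firing = [{"alertname": "ExampleAlert", "namespace": "a", "severity": "critical"}]
info_same_ns = {"alertname": "ExampleAlert", "namespace": "a", "severity": "info"}
warning_other_ns = {"alertname": "ExampleAlert", "namespace": "b", "severity": "warning"}

for name, rule in (("default", default_rule), ("modified", modified_rule)):
    print(name, is_inhibited(info_same_ns, firing, rule),
          is_inhibited(warning_other_ns, firing, rule))
```

In this model the modified rule mutes warning alerts with the same alertname across namespaces, but it no longer mutes info alerts. Those have to be dropped elsewhere, for example in the routing configuration.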

I took a simpler approach: drop the InfoInhibitor alert and receive only warning and critical alerts (the inhibit_rules configuration is left unchanged):

Alertmanager delivers only warning and critical alerts
   config:
     global:
       resolve_timeout: 5m
     inhibit_rules:
       - source_matchers:
           - 'severity = critical'
         target_matchers:
           - 'severity =~ warning|info'
         equal:
           - 'namespace'
           - 'alertname'
       - source_matchers:
           - 'severity = warning'
         target_matchers:
           - 'severity = info'
         equal:
           - 'namespace'
           - 'alertname'
       - source_matchers:
           - 'alertname = InfoInhibitor'
         target_matchers:
           - 'severity = info'
         equal:
           - 'namespace'
     route:
       group_by: ['namespace']
       group_wait: 30s
       group_interval: 5m
       repeat_interval: 24h
       routes:
       - "matchers":
         - "alertname = Watchdog"
         "receiver": "antgpu_sre"
       - "matchers":
         - "alertname = InfoInhibitor"
         "receiver": "null"
       - "matchers":
         - "alertname = etcdHighNumberOfFailedGRPCRequests"
         "receiver": "null"
       - "matchers":
         - "severity =~ warning|critical"
         "receiver": "antgpu_sre"
     receivers:
     - name: 'antgpu_sre'
       webhook_configs:
         - url: http://10.0.1.169:8060/dingtalk/antgpu_sre/send
     - name: 'null'
     templates:
     - '/etc/alertmanager/config/*.tmpl'
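The order of the child routes matters, because by default the first matching child route wins. Here is a minimal sketch of that selection for the `routes` list above. It is a simplified model rather than Alertmanager's code: it supports only `=` and `=~`, ignores `continue`, and returns `None` when nothing matches, since the root route's default receiver is not shown in this excerpt. The label sets passed in are made-up examples.

```python
import re

# Child routes copied from the configuration above, in order
ROUTES = [
    ([("alertname", "=", "Watchdog")], "antgpu_sre"),
    ([("alertname", "=", "InfoInhibitor")], "null"),
    ([("alertname", "=", "etcdHighNumberOfFailedGRPCRequests")], "null"),
    ([("severity", "=~", "warning|critical")], "antgpu_sre"),
]


def matches(labels, matchers):
    """matchers: list of (label, op, value); '=~' is an anchored regex match."""
    for label, op, value in matchers:
        actual = labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def pick_receiver(labels, routes=ROUTES, default=None):
    """Return the receiver of the first matching child route, else `default`."""
    for matchers, receiver in routes:
        if matches(labels, matchers):
            return receiver
    return default


print(pick_receiver({"alertname": "ExampleAlert", "severity": "critical"}))
print(pick_receiver({"alertname": "InfoInhibitor", "severity": "none"}))
print(pick_receiver({"alertname": "ExampleAlert", "severity": "info"}))
```

Because the `null` routes come before the severity route, InfoInhibitor and etcdHighNumberOfFailedGRPCRequests are discarded even when their severity would otherwise match, and info-level alerts never reach the `antgpu_sre` webhook through these child routes.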

References