prometheus-webhook-dingtalk
DingTalk message too long
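After configuring prometheus-webhook-dingtalk and receiving a large number of Prometheus alerts, I suddenly noticed that no new alert notifications had arrived since a certain point in time, which did not look right.
The Alerts page in Prometheus showed many alerts in the Firing state, which means the alerts had already been sent on through alertmanager. So why were no DingTalk messages arriving?
prometheus-webhook-dingtalk runs under systemd here, so check the service log with journalctl, the log tool of the systemd process manager: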
journalctl -u prometheus-webhook-dingtalk.service --no-pager
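The service log shows that the DingTalk message is too long (resp_status=400 / respCode=460101). The respMsg says the body size is invalid and must stay within 20000 bytes: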
Jun 08 10:05:46 iZ2zeav45krsh6sr8t9r4qZ prometheus-webhook-dingtalk[30365]: ts=2023-06-08T02:05:46.971Z caller=dingtalk.go:103 level=error component=web target=antgpu_sre msg="Failed to send notification to DingTalk" respCode=460101 respMsg="description: body 大小不合法;solution:请保持大小在 20000bytes 以内;"
Jun 08 10:05:46 iZ2zeav45krsh6sr8t9r4qZ prometheus-webhook-dingtalk[30365]: ts=2023-06-08T02:05:46.971Z caller=entry.go:26 level=info component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=10.233.76.68:36966 user_agent=Alertmanager/0.25.0 uri=http://172.21.45.22:8060/dingtalk/antgpu_sre/send resp_status=400 resp_bytes_length=27 resp_elapsed_ms=50.012778 msg="request complete"
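The log lines above use a key=value layout, so the failing fields can be pulled out with a few lines of script. This is only a minimal sketch: `parse_logfmt` is a hypothetical helper, not part of prometheus-webhook-dingtalk, and it assumes values are either bare words or double-quoted strings, as in the sample shown here.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict (hypothetical helper)."""
    fields = {}
    # shlex.split keeps double-quoted values such as msg="..." together
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without '=' (e.g. the journald prefix)
            fields[key] = value
    return fields


# Shortened copy of the first log line above (respMsg omitted for brevity)
line = (
    'ts=2023-06-08T02:05:46.971Z caller=dingtalk.go:103 level=error '
    'component=web target=antgpu_sre '
    'msg="Failed to send notification to DingTalk" respCode=460101'
)
fields = parse_logfmt(line)
print(fields["target"], fields["respCode"], fields["msg"])
```

Filtering on `level == "error"` and `respCode == "460101"` in this way makes it easy to see which target hit the size limit and when.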
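Main reasons the message becomes too long:
- A large number of alerts fire, and they are aggregated by group (the group_by setting of the Alertmanager route) into a single message that grows too long
- inhibit_rules: many duplicate alerts get through, and by default there are too many info-level messages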
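To make the first cause concrete, the sketch below estimates the size of one grouped notification as the number of alerts in the group grows. It rests on assumptions: the 20000-byte limit is taken from the respMsg in the log above, while the per-alert line layout, the sample alert names, and the helper names `render_text` and `body_size` are hypothetical and do not reproduce the real prometheus-webhook-dingtalk template.

```python
import json

# Limit taken from the respMsg in the service log above
DINGTALK_BODY_LIMIT = 20000  # bytes


def render_text(alerts):
    """Hypothetical rendering: one markdown bullet per alert in the group."""
    return "\n".join(
        f"- [{a['severity']}] {a['alertname']} ({a['namespace']}): {a['summary']}"
        for a in alerts
    )


def body_size(alerts):
    """UTF-8 size of a markdown-type message body carrying all alerts."""
    body = {
        "msgtype": "markdown",
        "markdown": {"title": "alerts", "text": render_text(alerts)},
    }
    return len(json.dumps(body, ensure_ascii=False).encode("utf-8"))


# 500 made-up alerts that all land in the same group
alerts = [
    {
        "alertname": "KubePodCrashLooping",
        "namespace": "default",
        "severity": "warning",
        "summary": f"Pod pod-{i} is crash looping",
    }
    for i in range(500)
]

print(body_size(alerts[:1]) <= DINGTALK_BODY_LIMIT)  # a single alert fits
print(body_size(alerts) <= DINGTALK_BODY_LIMIT)      # the whole group does not
```

The size grows roughly linearly with the number of alerts in a group, so either fewer alerts per group or fewer alerts overall are needed to stay under the limit.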
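In the issue "Error- Message is too long #30", someone suggested modifying inhibit_rules to eliminate duplicate alerts, which I think is workable. For example, the default configuration: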
inhibit_rules:
  - source_matchers:
      - 'severity = critical'
    target_matchers:
      - 'severity =~ warning|info'
    equal:
      - 'namespace'
      - 'alertname'
is changed to:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
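How such a rule removes duplicates can be illustrated with a small simulation. This is a simplified sketch of the inhibition idea, not Alertmanager's actual implementation: a target alert is muted when some firing alert matches the source side and both carry the same values for every label listed in equal. The rule below mirrors the default configuration shown above; the alert data and the names `matches` and `is_inhibited` are hypothetical.

```python
import re


def matches(alert, matchers):
    """True if every label fully matches its pattern (missing label = "")."""
    return all(
        re.fullmatch(pattern, alert.get(label, ""))
        for label, pattern in matchers.items()
    )


def is_inhibited(target, firing, rule):
    """Simplified inhibition check for one rule."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(l, "") == target.get(l, "") for l in rule["equal"])
        for source in firing
    )


# Mirrors: source 'severity = critical', target 'severity =~ warning|info'
RULE = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning|info"},
    "equal": ["namespace", "alertname"],
}

# Made-up alerts for illustration
critical = {"alertname": "KubeNodeNotReady", "namespace": "kube-system", "severity": "critical"}
dup_warning = {"alertname": "KubeNodeNotReady", "namespace": "kube-system", "severity": "warning"}
other_warning = {"alertname": "KubeNodeNotReady", "namespace": "default", "severity": "warning"}
firing = [critical, dup_warning, other_warning]

for alert in firing:
    print(alert["namespace"], alert["severity"], is_inhibited(alert, firing, RULE))
```

In this sketch only the warning that shares both namespace and alertname with the critical alert is muted. Shortening the equal list to just alertname, as in the modified rule, would mute the warning in the other namespace as well.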
I took a simpler approach: drop the InfoInhibitor alert and only receive warning and critical alerts (the inhibit_rules configuration is left unmodified):
config:
  global:
    resolve_timeout: 5m
  inhibit_rules:
    - source_matchers:
        - 'severity = critical'
      target_matchers:
        - 'severity =~ warning|info'
      equal:
        - 'namespace'
        - 'alertname'
    - source_matchers:
        - 'severity = warning'
      target_matchers:
        - 'severity = info'
      equal:
        - 'namespace'
        - 'alertname'
    - source_matchers:
        - 'alertname = InfoInhibitor'
      target_matchers:
        - 'severity = info'
      equal:
        - 'namespace'
  route:
    group_by: ['namespace']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 24h
    routes:
      - "matchers":
          - "alertname = Watchdog"
        "receiver": "antgpu_sre"
      - "matchers":
          - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
          - "alertname = etcdHighNumberOfFailedGRPCRequests"
        "receiver": "null"
      - "matchers":
          - "severity =~ warning|critical"
        "receiver": "antgpu_sre"
  receivers:
    - name: 'antgpu_sre'
      webhook_configs:
        - url: http://10.0.1.169:8060/dingtalk/antgpu_sre/send
    - name: 'null'
  templates:
    - '/etc/alertmanager/config/*.tmpl'
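The order of the child routes matters: assuming the usual first-match behaviour (no route sets continue), an alert goes to the receiver of the first child route whose matchers fit. The sketch below replays the four routes above on a few made-up alerts. It is an illustration only: `pick_receiver` is a hypothetical helper, every matcher is treated as a fully anchored regular expression, and an alert that matches no child route returns None here because the configuration above does not show a receiver on the top-level route.

```python
import re

# (label, pattern, receiver) in the same order as the routes above
ROUTES = [
    ("alertname", "Watchdog", "antgpu_sre"),
    ("alertname", "InfoInhibitor", "null"),
    ("alertname", "etcdHighNumberOfFailedGRPCRequests", "null"),
    ("severity", "warning|critical", "antgpu_sre"),
]


def pick_receiver(labels):
    """Return the receiver of the first matching child route, else None."""
    for label, pattern, receiver in ROUTES:
        if re.fullmatch(pattern, labels.get(label, "")):
            return receiver
    return None


# Made-up alerts for illustration
samples = [
    {"alertname": "Watchdog", "severity": "none"},
    {"alertname": "InfoInhibitor", "severity": "none"},
    {"alertname": "etcdHighNumberOfFailedGRPCRequests", "severity": "critical"},
    {"alertname": "KubePodCrashLooping", "severity": "warning"},
    {"alertname": "CPUThrottlingHigh", "severity": "info"},
]
for labels in samples:
    print(labels["alertname"], "->", pick_receiver(labels))
```

Because the etcdHighNumberOfFailedGRPCRequests route sits before the severity route, that alert is sent to the null receiver even when its severity is critical, and info-level alerts never reach antgpu_sre.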