Prometheus Info-level alert inhibition

After deploying prometheus-webhook-dingtalk, receiving alert notifications via DingTalk is very convenient. However, it also brought some annoyances:

I kept receiving info-level alerts about pods in the pending state:

The recurring info-level pod pending alert notification
[NONE] Info-level alert inhibition.
Description: This is an alert that is used to inhibit info alerts.
By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with
other alerts.
This alert fires whenever there's a severity="info" alert, and stops firing when another alert with a
severity of 'warning' or 'critical' starts firing on the same namespace.
This alert should be routed to a null receiver and configured to inhibit alerts with severity="info".
Graph: 📈
Details:


alertname: InfoInhibitor
alertstate: pending
container: pytorch
namespace: kubemaker
pod: lmm-split-images-n8-004-ptjob-worker-6
prometheus: default/kube-prometheus-stack-1681-prometheus

However, checking the Kubernetes cluster showed that the pod named in the pending alert was actually in the Running state (note that alertstate: pending in the details above is the state of the alert in Prometheus, not the pod phase).

Since the pod was already Running, why did I keep receiving these alert notifications?

InfoInhibitor

The idea behind inhibiting info alerts: on their own, info-level alerts are sometimes very "noisy", but combined with other alerts they provide useful correlating information:

  • CPUThrottlingHigh (CPU throttling) alerts are very common on the Polar Signals cluster (which runs Parca continuous profiling), but unless some other alert fires, the CPUThrottlingHigh warning is inhibited

    • This is because high CPU usage on its own does not mean the system has a problem, but combined with other abnormal metrics it may well be a clue to a latent issue

The alert above has no impact by itself; it merely serves as a workaround for a feature missing from alertmanager:

  • It fires whenever a severity="info" alert exists, and stops firing when another alert with severity warning or critical starts firing in the same namespace

    • My simple understanding: as long as an info-level alert exists, it keeps firing (even if that state has already ended), because Prometheus wants you to know that this state occurred

    • Once a more severe alert appears in the same namespace, the info-level notification stops, because Prometheus considers the notification's purpose served (see the rule sketch below)
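
Under the hood, InfoInhibitor is just an ordinary alerting rule over Prometheus' built-in ALERTS metric. The following is a paraphrased sketch of the rule shipped by kubernetes-mixin / kube-prometheus; check the exact expression in the version your stack deploys:

InfoInhibitor alerting rule (paraphrased sketch)
- alert: InfoInhibitor
  # active while some severity="info" alert is active in a namespace
  # and no warning/critical alert is firing in that same namespace
  expr: |-
    ALERTS{severity = "info"} == 1
      unless on (namespace)
    ALERTS{alertname != "InfoInhibitor", severity =~ "warning|critical", alertstate = "firing"} == 1
  labels:
    severity: none

This also explains the [NONE] prefix in the DingTalk message above: the alert's own severity is none, because it is never meant to reach a human directly; it only exists to drive an inhibit rule.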

Mitigation

Take the following steps to mitigate these (possibly) unnecessary alerts; a minimal sketch of the required Alertmanager fragments follows the list:

  • Configure the receiver for the severity="info" inhibit alert (InfoInhibitor) to be a null receiver

    • Note: be sure that receivers contains an entry with name: 'null', otherwise notifications may still be mis-delivered

  • And add an inhibit rule that suppresses severity="info" alerts
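
In plain Alertmanager terms, the steps above boil down to the three fragments sketched here (a sketch only; the complete kube-prometheus-stack example is below):

Minimal Alertmanager fragments: route InfoInhibitor to null and inhibit severity="info"
route:
  routes:
    - matchers:
        - "alertname = InfoInhibitor"
      receiver: "null"            # swallow the InfoInhibitor notification itself
inhibit_rules:
  - source_matchers:
      - 'alertname = InfoInhibitor'
    target_matchers:
      - 'severity = info'
    equal:
      - 'namespace'               # only mute info alerts in the same namespace
receivers:
  - name: "null"                  # a receiver with no *_configs discards notifications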

For a detailed configuration, see kube-prometheus/manifests/alertmanager-secret.yaml

For example, for prometheus-webhook-dingtalk deployed via kube-prometheus-stack, revise the values.yaml configuration:

Configure kube-prometheus-stack to route the severity="info" inhibit alert to the null receiver
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:
...
  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      # a firing critical alert mutes warning/info alerts with the same name and namespace
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      # a firing warning alert likewise mutes its info-level counterpart
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      # while InfoInhibitor is firing in a namespace, mute all severity="info" alerts there
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']   # one notification group per namespace
      group_wait: 30s           # delay before the first notification of a new group
      group_interval: 5m        # delay before notifying about new alerts in a group
      repeat_interval: 12h      # re-send interval for still-firing alerts
      receiver: "null"          # required: the root route must name a default receiver
      routes:
      - matchers:
        - "alertname = Watchdog"
        receiver: "Watchdog"          # heartbeat alert, routed to an empty receiver here
      - matchers:
        - "alertname = InfoInhibitor"
        receiver: "null"              # swallow the InfoInhibitor notification itself
      - matchers:
        - "severity = critical"
        receiver: cloud_atlas_alert   # critical alerts go to the DingTalk webhook
    receivers:
    - name: cloud_atlas_alert
      webhook_configs:
        - url: http://192.168.6.115:8060/dingtalk/cloud_atlas_alert/send
    - name: "null"
    - name: "Watchdog"
    templates:
    - '/etc/alertmanager/config/*.tmpl'
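
With this configuration in place, Watchdog and InfoInhibitor notifications are swallowed by their empty receivers, severity="info" alerts are muted whenever InfoInhibitor fires in the same namespace, and only severity="critical" alerts reach the cloud_atlas_alert DingTalk webhook. Roll the change out with whatever mechanism deployed the chart (for example, helm upgrade with the revised values.yaml).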

References