kube-prometheus-stack Alert Configuration

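As in "Prometheus监控应用实战" (Prometheus Monitoring in Practice), alert notifications are implemented by configuring Prometheus alerting rules. The difference is that kube-prometheus-stack lets you supply the rules through Helm values, which simplifies the configuration. This article records what worked in practice.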

Alert configuration entry points

The kube-prometheus-stack values.yaml contains the following PrometheusRules entry points:

Default PrometheusRules entry points in kube-prometheus-stack; because a standalone configuration file rules_disk_alert.yaml is used here, additionalPrometheusRulesMap: {} has to be commented out
## Deprecated way to provide custom recording or alerting rules to be deployed into the cluster.
##
# additionalPrometheusRules: []
#  - name: my-rule-file
#    groups:
#      - name: my_group
#        rules:
#        - record: my_record
#          expr: 100 * my_record

## Provide custom recording or alerting rules to be deployed into the cluster.
##
#additionalPrometheusRulesMap: {}
#  rule-name:
#    groups:
#    - name: my_group
#      rules:
#      - record: my_record
#        expr: 100 * my_record

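Custom monitoring rules can be added at this entry point in one of two ways. With a standalone configuration file, comment out additionalPrometheusRulesMap: {}. With a merged configuration, add the rules directly below the additionalPrometheusRulesMap: {} line (remember to remove the {}).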

Prometheus rule: DiskUsage

  • Configure rules_disk_alert.yaml :

Add a Prometheus rule that alerts on DiskUsage (all disk-related alerts can go into a group named disk_alert)
additionalPrometheusRulesMap:
  rule-name:
    groups:
    - name: disk_alert
      rules:
      - alert: DiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"}) - (node_filesystem_free_bytes{mountpoint="/"}) >= (node_filesystem_size_bytes{mountpoint="/"}) * 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage is too high (instance {{ $labels.instance }})"
          description: "Disk usage is too high (instance {{ $labels.instance }})."

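The rule above only monitors usage of the / directory. Every additional directory would need its own rule, which gets tedious. The following rule is an improvement:

Comprehensive host disk space check covering instance, device, and mount point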
- alert: "HostOutOfDiskSpace"
  expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
  for: 2m
  labels:
    severity: "warning"
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 20% left)"
    value: "{{ $value }}"

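This watches every mounted filesystem that is not read-only and raises an alert once usage exceeds 80% (less than 20% available).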
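The improved rule above is shown as a bare list item. As a minimal sketch, assuming it replaces the DiskUsage rule in the same disk_alert group and keeps the rule-name placeholder key from the earlier example, the complete rules_disk_alert.yaml might look like this:

```yaml
# rules_disk_alert.yaml (sketch): the HostOutOfDiskSpace rule nested under the chart's values key.
# The key "rule-name" and the group "disk_alert" are reused from the DiskUsage example above.
additionalPrometheusRulesMap:
  rule-name:
    groups:
    - name: disk_alert
      rules:
      - alert: HostOutOfDiskSpace
        # Fires when less than 20% of a filesystem is available and it is not mounted read-only
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 20% left)"
          value: "{{ $value }}"
```

The indentation matters here: the rule item sits under rules:, which sits under the named group, so a copy-pasted fragment has to be shifted to the same depth as the DiskUsage rule.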

Add the additional Prometheus alerting rules with helm upgrade
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values -f rules_disk_alert.yaml

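Notes

Using a standalone rules_disk_alert.yaml requires commenting out the corresponding additionalPrometheusRulesMap: {} , so I find it more convenient to edit values.yaml directly (only one configuration file is needed). That said, standalone files are a reasonable choice if you want to group different configurations by category. It mainly comes down to your operational habits.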
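If rules are split by category, one possible layout is one file per category, each contributing its own key under additionalPrometheusRulesMap. The sketch below assumes Helm's usual behaviour of merging map values across multiple -f files. The file name rules_memory_alert.yaml, the keys disk-alert and memory-alert, and the HostOutOfMemory rule are hypothetical illustrations and not part of the setup described above.

```yaml
# File 1: rules_disk_alert.yaml (key renamed from "rule-name" for illustration)
additionalPrometheusRulesMap:
  disk-alert:
    groups:
    - name: disk_alert
      rules:
      - alert: DiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"}) - (node_filesystem_free_bytes{mountpoint="/"}) >= (node_filesystem_size_bytes{mountpoint="/"}) * 0.8
        for: 1m
        labels:
          severity: warning
---
# File 2: rules_memory_alert.yaml (hypothetical second category, kept in a separate file)
additionalPrometheusRulesMap:
  memory-alert:
    groups:
    - name: memory_alert
      rules:
      - alert: HostOutOfMemory
        # Hypothetical threshold: less than 10% of memory available
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
```

Each file would then be passed to helm upgrade with its own -f flag, in the same way as rules_disk_alert.yaml in the command above. Distinct map keys keep the categories from overwriting each other.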
