Prometheus rule etcdDatabaseHighFragmentationRatio

I received the following alert about etcd (the distributed key-value store):

Alert: etcd database size in use is less than 50% of the actual allocated storage
Alerts Firing
[WARNING] etcd database size in use is less than 50% of the actual allocated storage.
Description: etcd cluster "kube-etcd": database size in use on instance 172.21.44.238:2381 is 19.18% of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space.
Graph: 📈
Details:


alertname: etcdDatabaseHighFragmentationRatio
endpoint: http-metrics
instance: 172.21.44.238:2381
job: kube-etcd
namespace: kube-system
prometheus: default/kube-prometheus-stack-1681-prometheus
service: kube-prometheus-stack-1681-kube-etcd

At first glance this alert puzzled me: if usage is below 50%, why alert at all? And why does it tell me to run defragmentation?

Inspecting the community kube-prometheus-stack chart (see "Customizing kube-prometheus-stack with helm") shows the following rule in templates/prometheus/rules-1.14/etcd.yaml:

etcd monitoring rule in kube-prometheus-stack
{{- if not (.Values.defaultRules.disabled.etcdDatabaseHighFragmentationRatio | default false) }}
    - alert: etcdDatabaseHighFragmentationRatio
      annotations:
{{- if .Values.defaultRules.additionalRuleAnnotations }}
{{ toYaml .Values.defaultRules.additionalRuleAnnotations | indent 8 }}
{{- end }}
{{- if .Values.defaultRules.additionalRuleGroupAnnotations.etcd }}
{{ toYaml .Values.defaultRules.additionalRuleGroupAnnotations.etcd | indent 8 }}
{{- end }}
        description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": database size in use on instance {{`{{`}} $labels.instance {{`}}`}} is {{`{{`}} $value | humanizePercentage {{`}}`}} of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space.'
        runbook_url: https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation
        summary: etcd database size in use is less than 50% of the actual allocated storage.
      expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes[5m])) < 0.5 and etcd_mvcc_db_total_size_in_use_in_bytes > 104857600
      for: 10m
      labels:
        severity: warning
      {{- if or .Values.defaultRules.additionalRuleLabels .Values.defaultRules.additionalRuleGroupLabels.etcd }}
        {{- with .Values.defaultRules.additionalRuleLabels }}
          {{- toYaml . | nindent 8 }}
        {{- end }}
        {{- with .Values.defaultRules.additionalRuleGroupLabels.etcd }}
          {{- toYaml . | nindent 8 }}
        {{- end }}
      {{- end }}
{{- end }}

The first part of the rule's Prometheus query is:

(last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes[5m])) < 0.5

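This computes the ratio of etcd_mvcc_db_total_size_in_use_in_bytes (the space logically in use) to etcd_mvcc_db_total_size_in_bytes (the space physically allocated) and checks whether it is below 50%.

The alert is sent only if, in addition, the in-use space is greater than 104857600 bytes (100 MiB) and the condition holds for 10 minutes (for: 10m). A low ratio means that more than half of the allocated database file is free space left behind by deleted or compacted data, which is why the alert asks for defragmentation rather than more disk.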
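The two conditions can be sketched in a few lines of Python to make the arithmetic concrete. This is only an illustration of the rule's expr, not how Prometheus evaluates it: it ignores last_over_time and the for: 10m hold period, and the function names and sample sizes are hypothetical.

```python
# Threshold taken from the rule's expr: 104857600 bytes = 100 MiB.
MIN_IN_USE_BYTES = 104857600


def fragmentation_alert(in_use_bytes: float, total_bytes: float,
                        max_ratio: float = 0.5) -> bool:
    """Return True when both conditions of the alert expression hold.

    in_use_bytes stands for etcd_mvcc_db_total_size_in_use_in_bytes,
    total_bytes for etcd_mvcc_db_total_size_in_bytes.
    """
    if total_bytes <= 0:
        # No allocated size reported, so no ratio can be computed.
        return False
    ratio = in_use_bytes / total_bytes
    return ratio < max_ratio and in_use_bytes > MIN_IN_USE_BYTES


def reclaimable_bytes(in_use_bytes: float, total_bytes: float) -> float:
    """Upper bound on the space a defragmentation could give back."""
    return max(total_bytes - in_use_bytes, 0)


# Hypothetical sizes, chosen only to illustrate the two outcomes.
GIB = 1024 ** 3
MIB = 1024 ** 2
fires = fragmentation_alert(0.1918 * GIB, GIB)   # 19.18% in use, about 196 MiB
quiet = fragmentation_alert(50 * MIB, GIB)       # low ratio, but under 100 MiB
```

With these sample values the first case satisfies both conditions, while the second has a low ratio but stays under the 100 MiB floor, so the rule would not fire for it.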

Note

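So far this alert does not fire often. I also observed that the in-use space of etcd_mvcc_db shrinks on its own through compaction, so by the time I look at it after an alert, the in-use space may be only about 50MB. I am ignoring this alert for now.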
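To re-check the numbers after an alert, one option is to read the two gauges from the etcd metrics output and compute the ratio by hand. The sketch below parses Prometheus text-format output that has already been fetched (for example from the /metrics path of the alerting instance, which is an assumption about this setup). The helper names and the sample values are hypothetical.

```python
from typing import Dict, Optional

IN_USE = "etcd_mvcc_db_total_size_in_use_in_bytes"
TOTAL = "etcd_mvcc_db_total_size_in_bytes"


def parse_db_sizes(metrics_text: str) -> Dict[str, float]:
    """Extract the two database size gauges from Prometheus text output."""
    sizes: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        # Drop a label set if present; assumes label values contain no spaces.
        name = parts[0].split("{", 1)[0]
        if name in (IN_USE, TOTAL):
            sizes[name] = float(parts[1])
    return sizes


def in_use_ratio(metrics_text: str) -> Optional[float]:
    """Return in-use / allocated, or None if a gauge is missing or zero."""
    sizes = parse_db_sizes(metrics_text)
    if IN_USE not in sizes or not sizes.get(TOTAL):
        return None
    return sizes[IN_USE] / sizes[TOTAL]


# Hypothetical sample: 50 MiB in use out of 250 MiB allocated.
SAMPLE = """\
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 2.62144e+08
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd_mvcc_db_total_size_in_use_in_bytes 5.24288e+07
"""
ratio = in_use_ratio(SAMPLE)
```

In this sample the ratio is well below 0.5, but the in-use size is under the 104857600-byte floor, which matches the situation described in the note where the alert has already cleared.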

etcd defragmentation