Prometheus “PrometheusRuleFailures” Alert

After working through Prometheus Info-level alert inhibition, the inhibition alert cleared, but I then received a PrometheusRuleFailures alert:

PrometheusRuleFailures alert
Alerts Firing
[CRITICAL] Prometheus is failing rule evaluations.
Description: Prometheus default/prometheus-kube-prometheus-stack-1681-prometheus-0 has failed to evaluate 30 rules in the last 5m.
Graph: 📈
Details:


alertname: PrometheusRuleFailures
container: prometheus
endpoint: http-web
instance: 10.233.76.71:9090
job: kube-prometheus-stack-1681-prometheus
namespace: default
pod: prometheus-kube-prometheus-stack-1681-prometheus-0
prometheus: default/kube-prometheus-stack-1681-prometheus
rule_group: /etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubelet.rules-9b578b57-68f0-4d5e-9899-f1f747f2040f.yaml;kubelet.rules
service: kube-prometheus-stack-1681-prometheus

Clicking the Graph link shows the query behind the alert:

PrometheusRuleFailures query
increase(prometheus_rule_evaluation_failures_total{job="kube-prometheus-stack-1681-prometheus",namespace="default"}[5m]) > 0

The query returns two series:

PrometheusRuleFailures query result
{container="prometheus", endpoint="http-web", instance="10.233.76.71:9090", job="kube-prometheus-stack-1681-prometheus", namespace="default", pod="prometheus-kube-prometheus-stack-1681-prometheus-0", rule_group="/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubelet.rules-9b578b57-68f0-4d5e-9899-f1f747f2040f.yaml;kubelet.rules", service="kube-prometheus-stack-1681-prometheus"}
{container="prometheus", endpoint="http-web", instance="10.233.76.71:9090", job="kube-prometheus-stack-1681-prometheus", namespace="default", pod="prometheus-kube-prometheus-stack-1681-prometheus-0", rule_group="/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubernetes-system-kubelet-c8833e9e-9fe3-4187-a8c3-9bdc00a245d3.yaml;kubernetes-system-kubelet", service="kube-prometheus-stack-1681-prometheus"}
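In both results the rule_group label carries the rule file path and the group name, joined by a semicolon. A minimal Python sketch for pulling the two apart, using the values copied from the results above (the helper name split_rule_group is hypothetical, and splitting on the last semicolon is an assumption about paths and group names that happens to hold for these two values):

```python
# rule_group label values copied from the two query results above.
RULE_GROUPS = [
    "/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/"
    "default-kube-prometheus-stack-1681-kubelet.rules-9b578b57-68f0-4d5e-9899-f1f747f2040f.yaml;"
    "kubelet.rules",
    "/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/"
    "default-kube-prometheus-stack-1681-kubernetes-system-kubelet-c8833e9e-9fe3-4187-a8c3-9bdc00a245d3.yaml;"
    "kubernetes-system-kubelet",
]


def split_rule_group(value):
    """Split a rule_group label value into (rule file path, group name)."""
    # Assumption: the last ';' separates the file path from the group name.
    path, _, group = value.rpartition(";")
    return path, group


for value in RULE_GROUPS:
    path, group = split_rule_group(value)
    print(f"group: {group}")
    print(f"  file: {path}")
```

The group names, kubelet.rules and kubernetes-system-kubelet, identify which rule groups to look at, and the paths show where the rule files sit inside the Prometheus pod.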

The two results mean that rule evaluation is failing in two rule groups.

Which rules are they?

Log in to the prometheus-kube-prometheus-stack-1681-prometheus-0 pod and inspect the two rule files. Both contain kubelet rules:

  • /etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubelet.rules-9b578b57-68f0-4d5e-9899-f1f747f2040f.yaml

/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubelet.rules-9b578b57-68f0-4d5e-9899-f1f747f2040f.yaml rule
histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))
      by (cluster, instance, le) * on(cluster, instance) group_left(node) kubelet_node_name{job="kubelet",
      metrics_path="/metrics"})

Enter it into Prometheus to verify:

The query fails with an error:

Rule query error output
Error executing query: found duplicate series for the match group {instance="172.21.44.202:10250"} on the right hand-side of the operation: [{__name__="kubelet_node_name", endpoint="https-metrics", instance="172.21.44.202:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="i-2ze6nk43mbc7xxpcb0ac", service="stable-kube-prometheus-sta-kubelet"}, {__name__="kubelet_node_name", endpoint="https-metrics", instance="172.21.44.202:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="i-2ze6nk43mbc7xxpcb0ac", service="kube-prometheus-stack-1681-kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side
  • /etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubernetes-system-kubelet-c8833e9e-9fe3-4187-a8c3-9bdc00a245d3.yaml

/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubernetes-system-kubelet-c8833e9e-9fe3-4187-a8c3-9bdc00a245d3.yaml
histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",
      metrics_path="/metrics"}[5m])) by (cluster, instance, le)) * on(cluster, instance)
      group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} >
      60

This rules file contains several queries. One of them, shown above, also looks suspicious, and running it directly returns an error:

/etc/prometheus/rules/prometheus-kube-prometheus-stack-1681-prometheus-rulefiles-0/default-kube-prometheus-stack-1681-kubernetes-system-kubelet-c8833e9e-9fe3-4187-a8c3-9bdc00a245d3.yaml
Error executing query: found duplicate series for the match group {instance="172.21.44.202:10250"} on the right hand-side of the operation: [{__name__="kubelet_node_name", endpoint="https-metrics", instance="172.21.44.202:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="i-2ze6nk43mbc7xxpcb0ac", service="stable-kube-prometheus-sta-kubelet"}, {__name__="kubelet_node_name", endpoint="https-metrics", instance="172.21.44.202:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="i-2ze6nk43mbc7xxpcb0ac", service="kube-prometheus-stack-1681-kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side
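Both errors name the same pair of kubelet_node_name series on the right-hand side. The Python sketch below is a simplified model of how on(cluster, instance) builds match groups, not Prometheus's actual implementation. It takes the two series from the error message and reports which match group holds more than one series and which labels differ between them. The function names are hypothetical.

```python
from collections import defaultdict

# Labels shared by both right-hand-side series in the error message.
BASE = {
    "__name__": "kubelet_node_name",
    "endpoint": "https-metrics",
    "instance": "172.21.44.202:10250",
    "job": "kubelet",
    "metrics_path": "/metrics",
    "namespace": "kube-system",
    "node": "i-2ze6nk43mbc7xxpcb0ac",
}
RIGHT_SIDE = [
    {**BASE, "service": "stable-kube-prometheus-sta-kubelet"},
    {**BASE, "service": "kube-prometheus-stack-1681-kubelet"},
]


def match_groups(series, on_labels):
    """Group series by the values of the labels listed in on(...)."""
    groups = defaultdict(list)
    for labels in series:
        # A missing label is treated as empty and left out of the key,
        # which is why the error shows only {instance=...}.
        key = tuple((name, labels[name]) for name in on_labels if labels.get(name))
        groups[key].append(labels)
    return dict(groups)


def duplicate_groups(series, on_labels):
    """Return only the match groups that contain more than one series."""
    return {k: v for k, v in match_groups(series, on_labels).items() if len(v) > 1}


def differing_labels(series):
    """List the label names whose values are not identical across the series."""
    names = set().union(*series)
    return sorted(n for n in names if len({s.get(n) for s in series}) > 1)


dups = duplicate_groups(RIGHT_SIDE, ["cluster", "instance"])
for key, members in dups.items():
    print(dict(key), "->", len(members), "series; differing labels:", differing_labels(members))
# prints: {'instance': '172.21.44.202:10250'} -> 2 series; differing labels: ['service']
```

The two series differ only in the service label (stable-kube-prometheus-sta-kubelet versus kube-prometheus-stack-1681-kubelet), so group_left(node) cannot pick a single right-hand series for the instance. Against a live server, a query along the lines of count by (cluster, instance) (kubelet_node_name{job="kubelet", metrics_path="/metrics"}) > 1 should list the affected instances.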

To be continued…