kube-prometheus-stack Persistent Volumes

When deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3, the default deployment keeps monitoring data in ephemeral emptyDir storage, so nothing survives a pod restart. At the time I found Deploying kube-prometheus-stack with persistent storage on Kubernetes Cluster and tried to build a PV/PVC for Prometheus based on it, without success.

Hand-editing the prometheus StatefulSet to add the PV/PVC does not actually work, and neither did editing nodeSelector to pin it to a specific server; the changes had no effect at all. Just as I was getting desperate I found [kube-prometheus-stack] [Help] Persistant Storage #186 and [prometheus-kube-stack] Grafana is not persistent #436: kube-prometheus-stack deploys these components through the Prometheus Operator, which reconciles the StatefulSets and reverts any manual edits. In fact, the kube-prometheus-stack.values file generated at the very beginning already contains a large number of configuration options (including storage) together with comments:

Dump the Prometheus Stack chart values with helm inspect values
helm inspect values prometheus-community/kube-prometheus-stack > kube-prometheus-stack.values
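The file is long, so a quick grep helps locate the storage-related keys (key names as they appear in the chart at the time of writing):

grep -n -E 'storage:|storageSpec:|persistence:' kube-prometheus-stack.values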

Inspecting the generated kube-prometheus-stack.values shows configuration such as the following:

Persistent storage configuration templates in kube-prometheus-stack.values (alertmanager and prometheus sections)
...
    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage: {}
    # volumeClaimTemplate:
    #   spec:
    #     storageClassName: gluster
    #     accessModes: ["ReadWriteOnce"]
    #     resources:
    #       requests:
    #         storage: 50Gi
    #     selector: {}
...
    ## Prometheus StorageSpec for persistent data
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storageSpec: {}
    ## Using PersistentVolumeClaim
    ##
    #  volumeClaimTemplate:
    #    spec:
    #      storageClassName: gluster
    #      accessModes: ["ReadWriteOnce"]
    #      resources:
    #        requests:
    #          storage: 50Gi
    #    selector: {}

    ## Using tmpfs volume
    ##
    #  emptyDir:
    #    medium: Memory

    # Additional volumes on the output StatefulSet definition.
    volumes: []

    # Additional VolumeMounts on the output StatefulSet definition.
    volumeMounts: []

hostPath storage volumes

Note

Note that alertmanager and prometheus share one storage directory here, but even when mounting the same directory, a separate PV/PVC pair must be configured for each component, because a PV and a PVC bind one-to-one (see Can Multiple PVCs Bind to One PV in OpenShift?).

I went with a simple hostPath storage deployment in Kubernetes:

Simple local hostPath storage volume configuration in kube-prometheus-stack.values (the example covers prometheus/alertmanager/thanos/grafana)
    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data-alert
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #     selector: {}
...
## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
##
grafana:
  enabled: true
  namespaceOverride: ""

  persistence:
    enabled: true
    type: pvc
    storageClassName: prometheus-data-grafana
    accessModes:
    - ReadWriteOnce
    size: 400Gi
    finalizers:
    - kubernetes.io/pvc-protection
...
    ## Storage is the definition of how storage will be used by the ThanosRuler instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data-thanos
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #   selector: {}
...
    ## Prometheus StorageSpec for persistent data
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storageSpec:
    ## Using PersistentVolumeClaim
    ##
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #    selector: {}

    ## Using tmpfs volume
    ##
    #  emptyDir:
    #    medium: Memory

    # Additional volumes on the output StatefulSet definition.
    volumes: []

    # Additional VolumeMounts on the output StatefulSet definition.
    volumeMounts: []

Note

Configuring accessModes: ["ReadWriteOnce"] means the volume can be mounted read-write by a single node only. The other two modes are ReadOnlyMany (many nodes may mount it read-only) and ReadWriteMany (many nodes may mount it read-write) - kubernetes persistent volume accessmode.
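To see which access modes and storage classes existing PVs advertise, a custom-columns query against the core PersistentVolume fields works:

kubectl get pv -o custom-columns=NAME:.metadata.name,MODES:.spec.accessModes,CLASS:.spec.storageClassName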

kube-prometheus-stack-pv.yaml creates the hostPath persistent volumes (the example covers prometheus/alertmanager/thanos/grafana)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv
  labels:
    type: local
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-alert
  labels:
    type: local
spec:
  storageClassName: prometheus-data-alert
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-thanos
  labels:
    type: local
spec:
  storageClassName: prometheus-data-thanos
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-grafana
  labels:
    type: local
spec:
  storageClassName: prometheus-data-grafana
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data/grafana-db"

Note

Only the PVs need to be created manually: kube-prometheus-stack's values.yaml already carries the PVC definitions, so the PVCs are created automatically.

  • Apply:

Create the kube-prometheus-stack PVs
kubectl apply -f kube-prometheus-stack-pv.yaml
  • Upgrade the release:

Run helm upgrade against prometheus-community/kube-prometheus-stack
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values
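After the upgrade, the PVCs created by the chart should show up as Bound to the PVs defined above (the prometheus namespace is the one used throughout this article):

kubectl get pv
kubectl get pvc -n prometheus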

Grafana persistent storage

While putting together kube-prometheus-stack-pv I was surprised to find that only prometheus, alertmanager and thanos have a corresponding storageSpec; there is no equivalent configuration entry for grafana.

As explained in [prometheus-kube-stack] Grafana is not persistent #436, Grafana is configured slightly differently: instead of something like prometheus.prometheusSpec.storageSpec, it uses grafana.persistence, because the upstream Grafana helm chart is a different chart:

Grafana persistent storage configuration in kube-prometheus-stack.values
## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
##
grafana:
  enabled: true
  namespaceOverride: ""

  persistence:
    enabled: true
    type: pvc
    storageClassName: prometheus-data-grafana
    accessModes:
    - ReadWriteOnce
    size: 400Gi
    finalizers:
    - kubernetes.io/pvc-protection
...

Note

The directory layout of Grafana's persistent volume differs from prometheus/alertmanager:

  • prometheus/alertmanager create a sub-directory under the PV directory to hold their data, e.g. a prometheus-db or alertmanager-db sub-directory under /prometheus/data

  • grafana stores its data directly in the PV directory, spread across several sub-directories, which looks rather messy. To coexist cleanly with prometheus/alertmanager, it is recommended to give the grafana PV one extra level of sub-directory, grafana-db (see the sketch below)
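As a rough illustration, with the hostPath paths used in this article the node ends up looking like this (the contents of grafana-db depend on the Grafana version and are only indicative):

/prometheus/data/            # shared hostPath backing the prometheus/alertmanager/thanos PVs
├── prometheus-db/           # created by the prometheus StatefulSet
├── alertmanager-db/         # created by the alertmanager StatefulSet
└── grafana-db/              # separate sub-directory backing the grafana PV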

Warning

Grafana's persistent volume directory differs from prometheus, so be sure to give grafana its own sub-directory (or a completely separate directory). Otherwise the grafana persistence directory overlaps the prometheus data directory, and because the grafana container changes the ownership of its data directory during initialization, prometheus will no longer be able to read and write the disk (data collection stops). I made exactly this blunder; see the follow-up troubleshooting notes on the kube-prometheus-stack Grafana persistent volume. Be careful!
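If you suspect this has happened, checking numeric ownership on the node makes it obvious: the grafana image chowns its data directory to uid/gid 472, while the upstream prometheus image runs as nobody (uid 65534):

ls -ln /prometheus/data /prometheus/data/grafana-db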

Troubleshooting

admissionWebhooks errors

Error message

Error output from helm upgrade prometheus-community/kube-prometheus-stack
client.go:540: [debug] Watching for changes to Job kube-prometheus-stack-1680-admission-create with timeout of 5m0s
client.go:568: [debug] Add/Modify event for kube-prometheus-stack-1680-admission-create: ADDED
client.go:607: [debug] kube-prometheus-stack-1680-admission-create: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for kube-prometheus-stack-1680-admission-create: MODIFIED
client.go:607: [debug] kube-prometheus-stack-1680-admission-create: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
upgrade.go:434: [debug] warning: Upgrade "kube-prometheus-stack-1680871060" failed: pre-upgrade hooks failed: timed out waiting for the condition
Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
helm.go:84: [debug] pre-upgrade hooks failed: timed out waiting for the condition
UPGRADE FAILED
main.newUpgradeCmd.func2
        helm.sh/helm/v3/cmd/helm/upgrade.go:200
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:255
runtime.goexit
        runtime/asm_amd64.s:1581

Following [stable/prometheus-operator] pre-upgrade hooks failed with prometheus-operator-admission: dial tcp 172.20.0.1:443: i/o timeout on EKS cluster #20480, I changed admissionWebhooks to false:

Disable the Prometheus Operator admissionWebhooks
## Manages Prometheus and Alertmanager components
##
prometheusOperator:
  enabled: true

  ## Prometheus-Operator v0.39.0 and later support TLS natively.
  ##
  tls:
    enabled: true
    # Value must match version names from https://golang.org/pkg/crypto/tls/#pkg-constants
    tlsMinVersion: VersionTLS13
    # The default webhook port is 10250 in order to work out-of-the-box in GKE private clusters and avoid adding firewall rules.
    internalPort: 10250

  ## Admission webhook support for PrometheusRules resources added in Prometheus Operator 0.30 can be enabled to prevent incorrectly formatted
  ## rules from making their way into prometheus and potentially preventing the container from starting
  admissionWebhooks:
    failurePolicy:
    ## The default timeoutSeconds is 10 and the maximum value is 30.
    timeoutSeconds: 10
    enabled: false

Disabling the Prometheus Operator admissionWebhooks has no direct impact, except that incorrectly formatted Prometheus rules are no longer rejected automatically.
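With the webhook disabled, it is worth validating rules yourself before applying them. A minimal sketch, assuming yq v4 is available and the PrometheusRule manifest is saved as my-rules.yaml (a hypothetical file name); the spec of a PrometheusRule already has the groups: layout that promtool expects:

yq '.spec' my-rules.yaml > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml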

In addition, with the Prometheus Operator admissionWebhooks enabled, every deployment automatically runs a pod such as kube-prometheus-stack-1681-admission-create-fdfs2, which ends up in the Completed state once the rule format check for the Prometheus deployment has finished. If this pod never completes successfully (for example, I once hit a Calico IP allocation failure caused by a bridge-plugin protocol issue in the containerd CNI plugin), refreshing the alertmanager / prometheus configuration becomes extremely slow.
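When the webhooks are enabled, a quick way to confirm the admission job finished is to look for its pod, which should be Completed rather than stuck Pending:

kubectl get pods -n prometheus | grep admission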

Scheduling failures

The prometheus pod failed to schedule:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  106s  default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling

It turned out that kube-prometheus-stack automatically creates a PVC (from the definition in the values) and binds it to the PV you created in advance. Do not create a PVC for that PV yourself: the manual PVC grabs the binding before kube-prometheus-stack can, and the deployment fails:

$ kubectl get pvc -A
NAMESPACE    NAME                                                                                                     STATUS    VOLUME                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus   kube-prometheus-stack-pvc                                                                                Pending   kube-prometheus-stack-pv   0                                           89m
prometheus   prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0   Pending                                                        prometheus-data   4m8s

$ kubectl get pv -A
NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS      REASON   AGE
kube-prometheus-stack-pv   10Gi       RWO            Retain           Available           prometheus-data            89m

However, even after removing the manually created PVC, the chart's PVC stayed in Pending:

$ kubectl get pv -A
NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS      REASON   AGE
kube-prometheus-stack-pv   10Gi       RWO            Retain           Available           prometheus-data            112m

$ kubectl get pvc -A
NAMESPACE    NAME                                                                                                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus   prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0   Pending                                      prometheus-data   27m

Why? Inspect it with kubectl describe pvc prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus.

I tried deleting the PVC prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0, but it was never recreated and scheduling kept failing:

$ kubectl describe pods prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  23m (x5 over 43m)   default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  19m                 default-scheduler  0/8 nodes are available: 8 persistentvolumeclaim "prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0" is being deleted. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  4m2s (x3 over 14m)  default-scheduler  0/8 nodes are available: 8 persistentvolumeclaim "prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0" not found. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

The fix is simple: delete both the PVC and the PV, then recreate the PV as described above. kube-prometheus-stack will regenerate the PVC and bind it to the specified PV.
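A sketch of the recovery steps, using the resource names from this example:

kubectl delete pvc prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus
kubectl delete pv kube-prometheus-stack-pv
kubectl apply -f kube-prometheus-stack-pv.yaml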

Pod startup failure

The pod was then scheduled correctly onto the target node i-0jl8d8r83kkf3yt5lzh7, and the PVC was bound to the PV. On the target server, a prometheus-db directory had been created automatically under /home/t4/prometheus/data.

But prometheus was crashing:

# kubectl get pods -A -o wide | grep prometheus | grep -v node-exporter
prometheus    alertmanager-kube-prometheus-stack-1681-alertmanager-0            2/2     Running             1          2m18s   10.233.127.15   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681-operator-5b7f7cdc78-xqtm5              1/1     Running             0          2m25s   10.233.127.13   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-grafana-fb4695b7-2qhpp           3/3     Running             0          33h     10.233.127.3    i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-kube-state-metrics-89f44fm2qbb   1/1     Running             0          2m25s   10.233.127.14   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    prometheus-kube-prometheus-stack-1681-prometheus-0                1/2     CrashLoopBackOff    1          2m17s   10.233.127.16   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>

Check the logs:

# kubectl -n prometheus logs prometheus-kube-prometheus-stack-1681-prometheus-0 -c prometheus
ts=2023-04-13T00:57:17.123Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.42.0, branch=HEAD, revision=225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)"
ts=2023-04-13T00:57:17.123Z caller=main.go:561 level=info build_context="(go=go1.19.5, platform=linux/amd64, user=root@c67d48967507, date=20230201-07:53:32)"
ts=2023-04-13T00:57:17.123Z caller=main.go:562 level=info host_details="(Linux 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 prometheus-kube-prometheus-stack-1681-prometheus-0 (none))"
ts=2023-04-13T00:57:17.123Z caller=main.go:563 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-04-13T00:57:17.123Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-04-13T00:57:17.124Z caller=query_logger.go:91 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0x7fff8871300f, 0xb}, 0x14, {0x3d8ba20, 0xc00102e280})
     /app/promql/query_logger.go:121 +0x3cd
main.main()
     /app/cmd/prometheus/main.go:618 +0x69d3

According to prometheus: Unable to create mmap-ed active query log #21, the service inside the prometheus container does not run as root. When the pod started on the target server i-0jl8d8r83kkf3yt5lzh7, the /home/t4/prometheus/data/prometheus-db directory was created owned by root, so the non-root process inside the container could not write to it.

A quick workaround is to log in to the target server i-0jl8d8r83kkf3yt5lzh7 and run:

sudo chmod 777 /home/t4/prometheus/data/prometheus-db
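A less permissive alternative (assuming the upstream prometheus image, which runs as nobody, uid/gid 65534) is to hand the directory to that user instead of making it world-writable:

sudo chown -R 65534:65534 /home/t4/prometheus/data/prometheus-db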

Note

A better solution is an initContainer (details to be filled in later; a rough sketch is given below). See Digitalocean kubernetes and volume permissions.

kube-prometheus-stack should also have a built-in way to handle this; to be added.
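A rough, unverified sketch of the initContainer approach: prometheus.prometheusSpec.initContainers is exposed in the chart values, so a root busybox container can fix ownership before prometheus starts. The volume name below follows the operator's prometheus-<name>-db naming pattern visible in the PVC names above, but it is a guess; verify it against the generated StatefulSet (kubectl get statefulset -n prometheus -o yaml) before relying on it.

prometheus:
  prometheusSpec:
    initContainers:
    - name: init-chown-data
      image: busybox:1.36
      # chown the data mount to nobody (65534), the user the prometheus container runs as
      command: ["sh", "-c", "chown -R 65534:65534 /prometheus"]
      securityContext:
        runAsUser: 0
      volumeMounts:
      # assumed volume name -- check the generated StatefulSet
      - name: prometheus-kube-prometheus-stack-1681-prometheus-db
        mountPath: /prometheus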

After that, everything came up normally:

# kubectl get pods -A -o wide | grep prometheus | grep -v node-exporter
prometheus    alertmanager-kube-prometheus-stack-1681-alertmanager-0            2/2     Running             1          16m     10.233.127.15   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681-operator-5b7f7cdc78-xqtm5              1/1     Running             0          16m     10.233.127.13   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-grafana-fb4695b7-2qhpp           3/3     Running             0          33h     10.233.127.3    i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-kube-state-metrics-89f44fm2qbb   1/1     Running             0          16m     10.233.127.14   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    prometheus-kube-prometheus-stack-1681-prometheus-0                2/2     Running             0          16m     10.233.127.16   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>

References