kube-prometheus-stack Persistent Volumes

When deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3, the default deployment keeps monitoring data in ephemeral emptyDir storage, so nothing survives a pod restart. At the time I found Deploying kube-prometheus-stack with persistent storage on Kubernetes Cluster and tried to build a PV/PVC for Prometheus based on it, without success.

Hand-editing the prometheus StatefulSet to add the PV/PVC does not actually work, and neither did editing nodeSelector to pin it to a specific server; the changes had no effect at all. Just as I was getting desperate I found [kube-prometheus-stack] [Help] Persistant Storage #186 and [prometheus-kube-stack] Grafana is not persistent #436: kube-prometheus-stack deploys these components through the Prometheus Operator, which reconciles the StatefulSets and reverts any manual edits. In fact, the kube-prometheus-stack.values file generated at the very beginning already contains a large number of configuration options (including storage) together with comments:

Dump the Prometheus Stack chart values with helm inspect values
helm inspect values prometheus-community/kube-prometheus-stack > kube-prometheus-stack.values
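The file is long, so a quick grep helps locate the storage-related keys (key names as they appear in the chart at the time of writing):

grep -n -E 'storage:|storageSpec:|persistence:' kube-prometheus-stack.values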

Inspecting the generated kube-prometheus-stack.values shows configuration such as the following:

Persistent storage configuration templates in kube-prometheus-stack.values (alertmanager and prometheus sections)
...
    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage: {}
    # volumeClaimTemplate:
    #   spec:
    #     storageClassName: gluster
    #     accessModes: ["ReadWriteOnce"]
    #     resources:
    #       requests:
    #         storage: 50Gi
    #     selector: {}
...
    ## Prometheus StorageSpec for persistent data
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storageSpec: {}
    ## Using PersistentVolumeClaim
    ##
    #  volumeClaimTemplate:
    #    spec:
    #      storageClassName: gluster
    #      accessModes: ["ReadWriteOnce"]
    #      resources:
    #        requests:
    #          storage: 50Gi
    #    selector: {}

    ## Using tmpfs volume
    ##
    #  emptyDir:
    #    medium: Memory

    # Additional volumes on the output StatefulSet definition.
    volumes: []

    # Additional VolumeMounts on the output StatefulSet definition.
    volumeMounts: []

hostPath storage volumes

Note

Note that alertmanager and prometheus share one storage directory here, but even when mounting the same directory, a separate PV/PVC pair must be configured for each component, because a PV and a PVC bind one-to-one (see Can Multiple PVCs Bind to One PV in OpenShift?).

I went with a simple hostPath storage deployment in Kubernetes:

Simple local hostPath storage volume configuration in kube-prometheus-stack.values (the example covers prometheus/alertmanager/thanos/grafana)
    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data-alert
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #     selector: {}
...
## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
##
grafana:
  enabled: true
  namespaceOverride: ""

  persistence:
    enabled: true
    type: pvc
    storageClassName: prometheus-data-grafana
    accessModes:
    - ReadWriteOnce
    size: 400Gi
    finalizers:
    - kubernetes.io/pvc-protection
...
    ## Storage is the definition of how storage will be used by the ThanosRuler instances.
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data-thanos
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #   selector: {}
...
    ## Prometheus StorageSpec for persistent data
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/storage.md
    ##
    storageSpec:
    ## Using PersistentVolumeClaim
    ##
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
    #    selector: {}

    ## Using tmpfs volume
    ##
    #  emptyDir:
    #    medium: Memory

    # Additional volumes on the output StatefulSet definition.
    volumes: []

    # Additional VolumeMounts on the output StatefulSet definition.
    volumeMounts: []

Note

Configuring accessModes: ["ReadWriteOnce"] means the volume can be mounted read-write by a single node only. The other two modes are ReadOnlyMany (many nodes may mount it read-only) and ReadWriteMany (many nodes may mount it read-write) - kubernetes persistent volume accessmode.
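To see which access modes and storage classes existing PVs advertise, a custom-columns query against the core PersistentVolume fields works:

kubectl get pv -o custom-columns=NAME:.metadata.name,MODES:.spec.accessModes,CLASS:.spec.storageClassName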

kube-prometheus-stack-pv.yaml creates the hostPath persistent volumes (the example covers prometheus/alertmanager/thanos/grafana)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv
  labels:
    type: local
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-alert
  labels:
    type: local
spec:
  storageClassName: prometheus-data-alert
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-thanos
  labels:
    type: local
spec:
  storageClassName: prometheus-data-thanos
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-grafana
  labels:
    type: local
spec:
  storageClassName: prometheus-data-grafana
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data/grafana-db"

Note

Only the PVs need to be created manually: kube-prometheus-stack's values.yaml already carries the PVC definitions, so the PVCs are created automatically.

  • Apply:

Create the kube-prometheus-stack PVs
kubectl apply -f kube-prometheus-stack-pv.yaml
  • Upgrade the release:

Run helm upgrade against prometheus-community/kube-prometheus-stack
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values
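After the upgrade, the PVCs created by the chart should show up as Bound to the PVs defined above (the prometheus namespace is the one used throughout this article):

kubectl get pv
kubectl get pvc -n prometheus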

Grafana persistent storage

While putting together kube-prometheus-stack-pv I was surprised to find that only prometheus, alertmanager and thanos have a corresponding storageSpec; there is no equivalent configuration entry for grafana.

As explained in [prometheus-kube-stack] Grafana is not persistent #436, Grafana is configured slightly differently: instead of something like prometheus.prometheusSpec.storageSpec, it uses grafana.persistence, because the upstream Grafana helm chart is a different chart:

Grafana persistent storage configuration in kube-prometheus-stack.values
## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
##
grafana:
  enabled: true
  namespaceOverride: ""

  persistence:
    enabled: true
    type: pvc
    storageClassName: prometheus-data-grafana
    accessModes:
    - ReadWriteOnce
    size: 400Gi
    finalizers:
    - kubernetes.io/pvc-protection
...

Note

The directory layout of Grafana's persistent volume differs from prometheus/alertmanager:

  • prometheus/alertmanager create a sub-directory under the PV directory to hold their data, e.g. a prometheus-db or alertmanager-db sub-directory under /prometheus/data

  • grafana stores its data directly in the PV directory, spread across several sub-directories, which looks rather messy. To coexist cleanly with prometheus/alertmanager, it is recommended to give the grafana PV one extra level of sub-directory, grafana-db (see the sketch below)
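As a rough illustration, with the hostPath paths used in this article the node ends up looking like this (the contents of grafana-db depend on the Grafana version and are only indicative):

/prometheus/data/            # shared hostPath backing the prometheus/alertmanager/thanos PVs
├── prometheus-db/           # created by the prometheus StatefulSet
├── alertmanager-db/         # created by the alertmanager StatefulSet
└── grafana-db/              # separate sub-directory backing the grafana PV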

Warning

Grafana's persistent volume directory differs from prometheus, so be sure to give grafana its own sub-directory (or a completely separate directory). Otherwise the grafana persistence directory overlaps the prometheus data directory, and because the grafana container changes the ownership of its data directory during initialization, prometheus will no longer be able to read and write the disk (data collection stops). I made exactly this blunder; see the follow-up troubleshooting notes on the kube-prometheus-stack Grafana persistent volume. Be careful!
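If you suspect this has happened, checking numeric ownership on the node makes it obvious: the grafana image chowns its data directory to uid/gid 472, while the upstream prometheus image runs as nobody (uid 65534):

ls -ln /prometheus/data /prometheus/data/grafana-db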

Troubleshooting

admissionWebhooks errors

Error message

Error output from helm upgrade prometheus-community/kube-prometheus-stack
client.go:540: [debug] Watching for changes to Job kube-prometheus-stack-1680-admission-create with timeout of 5m0s
client.go:568: [debug] Add/Modify event for kube-prometheus-stack-1680-admission-create: ADDED
client.go:607: [debug] kube-prometheus-stack-1680-admission-create: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for kube-prometheus-stack-1680-admission-create: MODIFIED
client.go:607: [debug] kube-prometheus-stack-1680-admission-create: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
upgrade.go:434: [debug] warning: Upgrade "kube-prometheus-stack-1680871060" failed: pre-upgrade hooks failed: timed out waiting for the condition
Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
helm.go:84: [debug] pre-upgrade hooks failed: timed out waiting for the condition
UPGRADE FAILED
main.newUpgradeCmd.func2
        helm.sh/helm/v3/cmd/helm/upgrade.go:200
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:255
runtime.goexit
        runtime/asm_amd64.s:1581

Following [stable/prometheus-operator] pre-upgrade hooks failed with prometheus-operator-admission: dial tcp 172.20.0.1:443: i/o timeout on EKS cluster #20480, I changed admissionWebhooks to false:

Disable the Prometheus Operator admissionWebhooks
## Manages Prometheus and Alertmanager components
##
prometheusOperator:
  enabled: true

  ## Prometheus-Operator v0.39.0 and later support TLS natively.
  ##
  tls:
    enabled: true
    # Value must match version names from https://golang.org/pkg/crypto/tls/#pkg-constants
    tlsMinVersion: VersionTLS13
    # The default webhook port is 10250 in order to work out-of-the-box in GKE private clusters and avoid adding firewall rules.
    internalPort: 10250

  ## Admission webhook support for PrometheusRules resources added in Prometheus Operator 0.30 can be enabled to prevent incorrectly formatted
  ## rules from making their way into prometheus and potentially preventing the container from starting
  admissionWebhooks:
    failurePolicy:
    ## The default timeoutSeconds is 10 and the maximum value is 30.
    timeoutSeconds: 10
    enabled: false

Disabling the Prometheus Operator admissionWebhooks has no direct impact, except that incorrectly formatted Prometheus rules are no longer rejected automatically.
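With the webhook disabled, it is worth validating rules yourself before applying them. A minimal sketch, assuming yq v4 is available and the PrometheusRule manifest is saved as my-rules.yaml (a hypothetical file name); the spec of a PrometheusRule already has the groups: layout that promtool expects:

yq '.spec' my-rules.yaml > /tmp/rules.yaml
promtool check rules /tmp/rules.yaml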

In addition, with the Prometheus Operator admissionWebhooks enabled, every deployment automatically runs a pod such as kube-prometheus-stack-1681-admission-create-fdfs2, which ends up in the Completed state once the rule format check for the Prometheus deployment has finished. If this pod never completes successfully (for example, I once hit a Calico IP allocation failure caused by a bridge-plugin protocol issue in the containerd CNI plugin), refreshing the alertmanager / prometheus configuration becomes extremely slow.
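When the webhooks are enabled, a quick way to confirm the admission job finished is to look for its pod, which should be Completed rather than stuck Pending:

kubectl get pods -n prometheus | grep admission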

Scheduling failures

The prometheus pod failed to schedule:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  106s  default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling

It turned out that kube-prometheus-stack automatically creates a PVC (from the definition in the values) and binds it to the PV you created in advance. Do not create a PVC for that PV yourself: the manual PVC grabs the binding before kube-prometheus-stack can, and the deployment fails:

$ kubectl get pvc -A
NAMESPACE    NAME                                                                                                     STATUS    VOLUME                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus   kube-prometheus-stack-pvc                                                                                Pending   kube-prometheus-stack-pv   0                                           89m
prometheus   prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0   Pending                                                        prometheus-data   4m8s

$ kubectl get pv -A
NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS      REASON   AGE
kube-prometheus-stack-pv   10Gi       RWO            Retain           Available           prometheus-data            89m

However, even after removing the manually created PVC, the chart's PVC stayed in Pending:

$ kubectl get pv -A
NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS      REASON   AGE
kube-prometheus-stack-pv   10Gi       RWO            Retain           Available           prometheus-data            112m

$ kubectl get pvc -A
NAMESPACE    NAME                                                                                                     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus   prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0   Pending                                      prometheus-data   27m

Why? Inspect it with kubectl describe pvc prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus.

I tried deleting the PVC prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0, but it was never recreated and scheduling kept failing:

$ kubectl describe pods prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  23m (x5 over 43m)   default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  19m                 default-scheduler  0/8 nodes are available: 8 persistentvolumeclaim "prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0" is being deleted. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  4m2s (x3 over 14m)  default-scheduler  0/8 nodes are available: 8 persistentvolumeclaim "prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0" not found. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

The fix is simple: delete both the PVC and the PV, then recreate the PV as described above. kube-prometheus-stack will regenerate the PVC and bind it to the specified PV.
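A sketch of the recovery steps, using the resource names from this example:

kubectl delete pvc prometheus-kube-prometheus-stack-1680-prometheus-db-prometheus-kube-prometheus-stack-1680-prometheus-0 -n prometheus
kubectl delete pv kube-prometheus-stack-pv
kubectl apply -f kube-prometheus-stack-pv.yaml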

Pod startup failure

The pod was then scheduled correctly onto the target node i-0jl8d8r83kkf3yt5lzh7, and the PVC was bound to the PV. On the target server, a prometheus-db directory had been created automatically under /home/t4/prometheus/data.

But prometheus was crashing:

# kubectl get pods -A -o wide | grep prometheus | grep -v node-exporter
prometheus    alertmanager-kube-prometheus-stack-1681-alertmanager-0            2/2     Running             1          2m18s   10.233.127.15   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681-operator-5b7f7cdc78-xqtm5              1/1     Running             0          2m25s   10.233.127.13   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-grafana-fb4695b7-2qhpp           3/3     Running             0          33h     10.233.127.3    i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-kube-state-metrics-89f44fm2qbb   1/1     Running             0          2m25s   10.233.127.14   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    prometheus-kube-prometheus-stack-1681-prometheus-0                1/2     CrashLoopBackOff    1          2m17s   10.233.127.16   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>

Check the logs:

# kubectl -n prometheus logs prometheus-kube-prometheus-stack-1681-prometheus-0 -c prometheus
ts=2023-04-13T00:57:17.123Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.42.0, branch=HEAD, revision=225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)"
ts=2023-04-13T00:57:17.123Z caller=main.go:561 level=info build_context="(go=go1.19.5, platform=linux/amd64, user=root@c67d48967507, date=20230201-07:53:32)"
ts=2023-04-13T00:57:17.123Z caller=main.go:562 level=info host_details="(Linux 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 prometheus-kube-prometheus-stack-1681-prometheus-0 (none))"
ts=2023-04-13T00:57:17.123Z caller=main.go:563 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-04-13T00:57:17.123Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-04-13T00:57:17.124Z caller=query_logger.go:91 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0x7fff8871300f, 0xb}, 0x14, {0x3d8ba20, 0xc00102e280})
     /app/promql/query_logger.go:121 +0x3cd
main.main()
     /app/cmd/prometheus/main.go:618 +0x69d3

According to prometheus: Unable to create mmap-ed active query log #21, the service inside the prometheus container does not run as root. When the pod started on the target server i-0jl8d8r83kkf3yt5lzh7, the /home/t4/prometheus/data/prometheus-db directory was created owned by root, so the non-root process inside the container could not write to it.

A quick workaround is to log in to the target server i-0jl8d8r83kkf3yt5lzh7 and run:

sudo chmod 777 /home/t4/prometheus/data/prometheus-db
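A less permissive alternative (assuming the upstream prometheus image, which runs as nobody, uid/gid 65534) is to hand the directory to that user instead of making it world-writable:

sudo chown -R 65534:65534 /home/t4/prometheus/data/prometheus-db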

Note

A better solution is an initContainer (details to be filled in later; a rough sketch is given below). See Digitalocean kubernetes and volume permissions.

kube-prometheus-stack should also have a built-in way to handle this; to be added.
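A rough, unverified sketch of the initContainer approach: prometheus.prometheusSpec.initContainers is exposed in the chart values, so a root busybox container can fix ownership before prometheus starts. The volume name below follows the operator's prometheus-<name>-db naming pattern visible in the PVC names above, but it is a guess; verify it against the generated StatefulSet (kubectl get statefulset -n prometheus -o yaml) before relying on it.

prometheus:
  prometheusSpec:
    initContainers:
    - name: init-chown-data
      image: busybox:1.36
      # chown the data mount to nobody (65534), the user the prometheus container runs as
      command: ["sh", "-c", "chown -R 65534:65534 /prometheus"]
      securityContext:
        runAsUser: 0
      volumeMounts:
      # assumed volume name -- check the generated StatefulSet
      - name: prometheus-kube-prometheus-stack-1681-prometheus-db
        mountPath: /prometheus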

After that, everything came up normally:

# kubectl get pods -A -o wide | grep prometheus | grep -v node-exporter
prometheus    alertmanager-kube-prometheus-stack-1681-alertmanager-0            2/2     Running             1          16m     10.233.127.15   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681-operator-5b7f7cdc78-xqtm5              1/1     Running             0          16m     10.233.127.13   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-grafana-fb4695b7-2qhpp           3/3     Running             0          33h     10.233.127.3    i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    kube-prometheus-stack-1681228346-kube-state-metrics-89f44fm2qbb   1/1     Running             0          16m     10.233.127.14   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>
prometheus    prometheus-kube-prometheus-stack-1681-prometheus-0                2/2     Running             0          16m     10.233.127.16   i-0jl8d8r83kkf3yt5lzh7   <none>           <none>

References