kube-prometheus-stack: troubleshooting after adding a persistent volume for Grafana

While setting up persistent volumes for kube-prometheus-stack I configured persistence for Grafana, and at the same time I also put an Nginx reverse proxy in front of it. Unexpectedly, once the domain name was enabled, Grafana running behind the reverse proxy could never be logged into (401 Unauthorized), so I had to roll back.

What I did not expect was that after rolling back the reverse-proxy setup (that is, removing the domain configuration from the kube-prometheus-stack values and re-applying the Prometheus configuration to the Kubernetes cluster), Grafana could no longer fetch any data from Prometheus, even after re-adding the data source (which passed "Save & test").
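
For the record, the rollback amounted to deleting the Grafana domain/root_url settings from the values file (server.domain and server.root_url under grafana.grafana.ini, going by the standard grafana subchart) and re-running the Helm upgrade. A minimal sketch; the release name and repo alias below are assumptions, not copied from my setup:

    # re-apply the chart after removing the Grafana domain settings from values.yaml
    helm upgrade -n prometheus kube-prometheus-stack-1681228346 \
        prometheus-community/kube-prometheus-stack -f values.yaml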

Troubleshooting

  • Check the grafana logs:

    kubectl -n prometheus logs kube-prometheus-stack-1681228346-grafana-849b55868d-7msvq -c grafana
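
The exact pod name changes on every rollout; if it is not at hand, the same logs can be pulled by label instead (a variant assuming the default labels set by the grafana subchart):

    kubectl -n prometheus logs -l app.kubernetes.io/name=grafana -c grafana --tail=200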
    

One odd thing stood out:

Grafana log: failed queries to Prometheus
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.81369208Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: Post \"http://10.233.29.215:9090/api/v1/query_range\": context canceled" remote_addr=140.205.147.128 traceID=
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.813720205Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: Post \"http://10.233.29.215:9090/api/v1/query_range\": context canceled" remote_addr=140.205.147.128 traceID=
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.813749822Z level=error msg="Request Completed" method=POST path=/api/ds/query status=500 remote_addr=140.205.147.128 time_ms=11 duration=11.443352ms size=116 referer="http://8.130.120.196/d/aIUcmJE4k/node-exporter-nodes?orgId=1&refresh=30s" handler=/api/ds/query
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.813759523Z level=error msg="Request Completed" method=POST path=/api/ds/query status=500 remote_addr=140.205.147.128 time_ms=16 duration=16.011315ms size=116 referer="http://8.130.120.196/d/aIUcmJE4k/node-exporter-nodes?orgId=1&refresh=30s" handler=/api/ds/query
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.814832629Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: Post \"http://10.233.29.215:9090/api/v1/query_range\": context canceled" remote_addr=140.205.147.128 traceID=
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:21:35.814875022Z level=error msg="Request Completed" method=POST path=/api/ds/query status=500 remote_addr=140.205.147.128 time_ms=3 duration=3.020887ms size=116 referer="http://8.130.120.196/d/aIUcmJE4k/node-exporter-nodes?orgId=1&refresh=30s" handler=/api/ds/query
logger=cleanup t=2023-04-20T04:27:40.729095741Z level=info msg="Completed cleanup jobs" duration=4.202254ms
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:31:59.810145747Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: Post \"http://10.233.29.215:9090/api/v1/query_range\": context canceled" remote_addr=140.205.147.128 traceID=
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:31:59.810192932Z level=error msg="Request Completed" method=POST path=/api/ds/query status=500 remote_addr=140.205.147.128 time_ms=20 duration=20.459938ms size=116 referer="http://8.130.120.196/d/aIUcmJE4k/node-exporter-nodes?orgId=1&refresh=30s" handler=/api/ds/query
logger=context userId=1 orgId=1 uname=admin t=2023-04-20T04:31:59.810891Z level=error msg="Internal server error" error="[plugin.downstreamError] failed to query data: Post \"http://10.233.29.215:9090/api/v1/query_range\": context canceled" remote_addr=140.205.147.128 traceID=
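
Before digging further it is worth checking whether Grafana can reach the Prometheus service at all, by querying it from inside the Grafana pod. A sketch reusing the pod name and service address from the logs above (whether wget is available depends on the image):

    kubectl -n prometheus exec -it kube-prometheus-stack-1681228346-grafana-849b55868d-7msvq -c grafana -- \
        wget -qO- 'http://10.233.29.215:9090/api/v1/query?query=up'
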
  • Check the prometheus logs:

    kubectl -n prometheus logs prometheus-kube-prometheus-stack-1681-prometheus-0 -c prometheus
    

Beyond the metrics scrape failures, why were there so many "write to WAL ... permission denied" errors:

Prometheus failure log
ts=2023-04-20T04:56:45.785Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=serviceMonitor/prometheus/kube-prometheus-stack-1681228346-prometheus-node-exporter/0 target=http://10.1.58.197:9100/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.792Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.72.4:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.792Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.126.4:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.797Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.64.9:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.801Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.127.24:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.822Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.90.10:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.829Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.122.10:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.837Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.105.18:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.839Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.102.13:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.849Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.124.5:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.849Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=gpu-metrics target=http://10.233.110.4:9400/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
ts=2023-04-20T04:56:45.857Z caller=scrape.go:1311 level=error component="scrape manager" scrape_pool=serviceMonitor/prometheus/kube-prometheus-stack-1681228346-prometheus-node-exporter/0 target=http://10.1.81.8:9100/metrics msg="Scrape commit failed" err="write to WAL: log samples: create new segment file: open /prometheus/wal/00001251: permission denied"
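
A permission-denied error on the WAL usually means the ownership of the files on the volume no longer matches the pod's securityContext. The expected uid/gid can be read from the StatefulSet (object name inferred from the pod name above):

    kubectl -n prometheus get statefulset prometheus-kube-prometheus-stack-1681-prometheus \
        -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'

With the chart defaults this should report runAsUser: 1000 and fsGroup: 2000, which matches the id output in the next step.
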
  • Exec into the prometheus container to inspect it:

    kubectl -n prometheus exec -it prometheus-kube-prometheus-stack-1681-prometheus-0 -c prometheus -- /bin/sh
    

Then go into the /prometheus directory, try to touch a file, and check the file permissions:

Checking file read/write inside the Prometheus data directory
/prometheus $ touch test
/prometheus $ ls -lh
total 68K
drwxr-xr-x    3 472      472         4.0K Apr 13 11:01 01GXX4CCR1JRADDYFGBDJ59N4H
drwxr-xr-x    3 472      472         4.0K Apr 15 15:03 01GY2PXCZP34CS4AMGY8HJ8M7D
drwxr-xr-x    3 472      472         4.0K Apr 17 23:05 01GY8Q9J8KFM79K9JGPB1240ZC
drwxr-xr-x    3 472      472         4.0K Apr 18 17:03 01GYAN15ANJVP5MKM9R610F819
drwxr-xr-x    3 472      472         4.0K Apr 19 09:02 01GYCBXE23WKK8Z9ST9TVB4HYH
drwxr-xr-x    3 472      472         4.0K Apr 19 17:01 01GYD7CBHDJGDYS79TJ1TDCNZ5
drwxr-xr-x    3 472      472         4.0K Apr 19 21:00 01GYDN2KDY5H5QQV4R8C4NXG9A
drwxr-xr-x    3 472      472         4.0K Apr 19 23:00 01GYDVYANXCK87PFPJ4GPKBNKG
drwxr-xr-x    3 472      472         4.0K Apr 19 23:01 01GYDVZK3PTXAQKX91JW8MTBBK
drwxr-xr-x    3 472      472         4.0K Apr 20 01:00 01GYE2T1ENAAR2E6Z6V5H8HDJP
drwxr-xr-x    2 472      472         4.0K Apr 20 01:00 chunks_head
-rw-r--r--    1 472      472            0 Apr 13 01:26 lock
-rw-r--r--    1 472      472        19.5K Apr 20 05:02 queries.active
-rw-r--r--    1 1000     2000           0 Apr 20 05:02 test
drwxr-xr-x    3 472      472         4.0K Apr 20 01:29 wal
/prometheus $ id
uid=1000 gid=2000 groups=2000
/prometheus $ cat /etc/passwd
root:x:0:0:root:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/bin/false
bin:x:2:2:bin:/bin:/bin/false
sys:x:3:3:sys:/dev:/bin/false
sync:x:4:100:sync:/bin:/bin/sync
mail:x:8:8:mail:/var/spool/mail:/bin/false
www-data:x:33:33:www-data:/var/www:/bin/false
operator:x:37:37:Operator:/var:/bin/false
nobody:x:65534:65534:nobody:/home:/bin/false
/prometheus $ ps aux | grep prometheus
    1 1000      4d13 /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.time=180d --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.external-url=http://kube-prometheus-stack-1681-prometheus.prometheus:9090 --web.route-prefix=/ --storage.tsdb.wal-compression --web.config.file=/etc/prometheus/web_config/web-config.yaml

So prometheus runs with uid 1000 — then why is everything in the directory owned by uid 472?
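
For comparison, uid 472 is the user the official Grafana image runs as, and the grafana subchart sets it in the deployment's securityContext, which can be confirmed the same way (deployment name inferred from the pod name):

    kubectl -n prometheus get deployment kube-prometheus-stack-1681228346-grafana \
        -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'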

Then I remembered that while configuring the kube-prometheus-stack persistent volumes this morning, I had noticed something strange about the Grafana persistence: Grafana did not create a separate subdirectory of its own, but scattered its data directly under the /prometheus/data directory, whereas Prometheus keeps its data in its own prometheus-data subdirectory. Grafana evidently assumed it owned the directory exclusively and chowned every subdirectory under it to its own runtime uid, 472.
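
The longer-term fix is to give Grafana storage that is entirely its own instead of a directory tree shared with Prometheus. A sketch of what that could look like via the grafana subchart's persistence settings (the PVC name and subPath are placeholders, not my actual configuration):

    helm upgrade -n prometheus kube-prometheus-stack-1681228346 \
        prometheus-community/kube-prometheus-stack -f values.yaml \
        --set grafana.persistence.enabled=true \
        --set grafana.persistence.existingClaim=grafana-data \
        --set grafana.persistence.subPath=grafana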

The fix is to move Grafana's data out into its own separate directory, and then restore the uid and gid of the Prometheus directory to 1000 and 2000.
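
Restoring the ownership itself is a one-liner on whatever backs the Prometheus volume (the path below is a placeholder for the actual PV directory):

    # run on the node / NFS server that backs the Prometheus PV
    chown -R 1000:2000 /path/to/prometheus-data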

Follow-up

This mishap cost my cluster a few hours of collected metrics, which is a lesson in itself. It also reminded me how important backup and recovery are (after all, a lot of time goes into building Grafana dashboards); monitoring data may be time-sensitive, but having it vanish entirely is still not acceptable.
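
As a first concrete step towards that, dashboards can be exported through Grafana's HTTP API. A sketch using the dashboard uid visible in the logs above; the address and token are placeholders:

    # list dashboards, then export one by uid
    curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
        'http://grafana.example.com/api/search?type=dash-db'
    curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
        'http://grafana.example.com/api/dashboards/uid/aIUcmJE4k' > node-exporter-nodes.json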

I am going to put the following into practice: