Deploying Prometheus and Grafana with GPU monitoring on a Kubernetes cluster (z-k8s)

The Prometheus community provides the kube-prometheus-stack helm chart, a complete set of Kubernetes manifests that bundles Grafana (a general-purpose visualization and analytics platform) dashboards and Prometheus rules together with documentation and scripts, making deployment via the Prometheus Operator straightforward. For GPU node monitoring, however, I recommend making a few revisions at deployment time (described in this article) so that everything can be set up in one pass. Alternatively, you can first complete "Deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3" and then follow "Updating the Prometheus configuration of a Kubernetes cluster".

helm3

  • helm makes deployment easy:

Install helm using the official script
curl -LO https://git.io/get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
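
After the script finishes, a quick sanity check confirms Helm 3 is on the PATH:

Verify the helm installation
helm version --short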

Install the NVIDIA GPU Operator

This section is still to be organized.
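
Until it is, here is a sketch of the usual helm-based GPU Operator installation as documented by NVIDIA (the gpu-operator namespace matches the scrape configuration used later in this article; adjust chart options for your cluster):

Install the NVIDIA GPU Operator with helm (sketch)
# add NVIDIA's helm repository and install the operator into its own namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
   --create-namespace --namespace gpu-operator

The operator deploys DCGM Exporter on the GPU nodes, which is what the gpu-metrics scrape job below collects from.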

Install Prometheus and Grafana

helm configuration

  • Add the Prometheus community helm chart repository:

Add the Prometheus community helm chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  • NVIDIA's guide adjusts a few parameters of the community chart, so first export the chart's default values (so they can be revised), as shown below:
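
A sketch of the export step (the output path /tmp/kube-prometheus-stack.values is the file passed to helm install later):

Export the default chart values
helm repo update
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values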

helm inspect values output: the Prometheus Stack chart values
prometheus:

  ## Configuration for Prometheus service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Prometheus Service to listen on
    ##
    port: 9090

    ## To be used with a proxy extraContainer port
    targetPort: 9090

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30090

    ## Loadbalancer IP
    ## Only use if service.type is "LoadBalancer"
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

    ## Denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints
    ##
    externalTrafficPolicy: Cluster

    ## Service type
    ##
    type: NodePort

...

grafana:

  ## Passed to grafana subchart and used by servicemonitor below
  ##
  service:
    portName: http-web
    nodePort: 30080
    type: NodePort

...

alertmanager:

  ## Deploy alertmanager
  ##
  enabled: true
  ...
  ## Configuration for Alertmanager service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Alertmanager Service to listen on
    ##
    port: 9093
    ## To be used with a proxy extraContainer port
    ##
    targetPort: 9093
    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ...
    ## Service type
    ##
    type: NodePort
  • Revision 1: expose the metrics port 30090 as a NodePort on every node (in practice only the type: ClusterIP line needs to be changed to type: NodePort; it is recommended to do this for the svc of stable-grafana (helm install also accepts --set grafana.service.type=NodePort, and adding nodePort pins the 80/30080 mapping), alertmanager (9093/30903), and prometheus (9090/30090)):

Revise the service type to NodePort
prometheus:

  ## Configuration for Prometheus service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Prometheus Service to listen on
    ##
    port: 9090

    ## To be used with a proxy extraContainer port
    targetPort: 9090

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30090

    ## Loadbalancer IP
    ## Only use if service.type is "LoadBalancer"
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

    ## Denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints
    ##
    externalTrafficPolicy: Cluster

    ## Service type
    ##
    type: NodePort

...

grafana:

  ## Passed to grafana subchart and used by servicemonitor below
  ##
  service:
    portName: http-web
    nodePort: 30080
    type: NodePort

...

alertmanager:

  ## Deploy alertmanager
  ##
  enabled: true
  ...
  ## Configuration for Alertmanager service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port for Alertmanager Service to listen on
    ##
    port: 9093
    ## To be used with a proxy extraContainer port
    ##
    targetPort: 9093
    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ...
    ## Service type
    ##
    type: NodePort

Note

I initially could not find where to change grafana service.type in kube-prometheus-stack.values. I later found it can be done by passing --set grafana.service.type=NodePort; on a closer reading of the values it turns out this is simply not configured by default, so it has to be added manually.

Other revisions:

defaultDashboardsTimezone: Asia/Shanghai
  • Revision 2: set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false:

Set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false
# If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
# prometheus resource to be created with selectors based on values in the helm deployment,
# which will also match the servicemonitors created
#
serviceMonitorSelectorNilUsesHelmValues: false
  • Revision 3: add a gpu-metrics job to additionalScrapeConfigs in the configuration (a quick verification sketch follows the block):

Add gpu-metrics to additionalScrapeConfigs
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
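
After kube-prometheus-stack is installed (see the Deployment section below), one way to verify that the gpu-metrics job was picked up is to query the Prometheus targets API through a temporary port-forward (a sketch; svc/prometheus-operated appears in the service listing later in this article):

Verify the gpu-metrics scrape job (sketch)
kubectl --namespace prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -c gpu-metrics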

Prepare storage

kube-prometheus-stack-pv.yaml: hostPath-backed PersistentVolumes (see "Deploying hostPath storage in Kubernetes")
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv
  labels:
    type: local
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-alert
  labels:
    type: local
spec:
  storageClassName: prometheus-data-alert
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-thanos
  labels:
    type: local
spec:
  storageClassName: prometheus-data-thanos
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-grafana
  labels:
    type: local
spec:
  storageClassName: prometheus-data-grafana
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data/grafana-db"

Note

You only need to create the PVs; the kube-prometheus-stack values.yaml provides the PVC configuration, and the PVCs are created automatically, as sketched below.
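
For reference, the relevant stanza in kube-prometheus-stack.values looks roughly like this (a sketch; the storageClassName must match the PV created above):

PVC configuration in kube-prometheus-stack.values (sketch)
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi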

  • Apply the PV manifest:

Create the kube-prometheus-stack PVs
kubectl apply -f kube-prometheus-stack-pv.yaml

Deploy

  • Run the deployment, using the customized values:

Install kube-prometheus-stack with the customized helm chart values (passing custom storage parameters via --set did not work; the correct approach is the "kube-prometheus-stack persistent volumes" method)
helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --values /tmp/kube-prometheus-stack.values
   #--set=alertmanager.persistentVolume.existingClaim=kube-prometheus-stack-pvc,server.persistentVolume.existingClaim=kube-prometheus-stack-pvc,grafana.persistentVolume.existingClaim=kube-prometheus-stack-pvc

Note

The persistent storage solution described in "kube-prometheus-stack persistent volumes" has been verified to work.

Output:

Output of installing kube-prometheus-stack with the customized helm chart values
NAME: kube-prometheus-stack-1680871060
LAST DEPLOYED: Fri Apr  7 20:38:00 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1680871060"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Note

When deploying on a production cluster, I hit the following error:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(ServiceMonitor.spec.endpoints[0]): unknown field "enableHttp2" in com.coreos.monitoring.v1.ServiceMonitor.spec.endpoints

This is similar to the issue "prometheus-kube-stack helm install results in unknown field "enableHttp2" #2633":

Found same error upgrading from old Prometheus installation.
Solution: uninstall prometheus, delete CRDs and install again.
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#uninstall-helm-chart

The cause was that an earlier prometheus-stack installation had been interrupted with ctrl-c, after which I ran helm uninstall to remove the release. According to the documentation, however, the CRDs are not cleaned up automatically, which likely caused the conflict. The monitoring CRDs need to be removed manually (an equivalent one-liner follows the list):

kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
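
An equivalent one-liner (a sketch) that removes all of the Prometheus Operator CRDs in one go:

Delete all monitoring.coreos.com CRDs
kubectl get crd -o name | grep monitoring.coreos.com | xargs kubectl delete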

Note

When deploying on a production cluster, I also ran into a scheduling failure:

kubectl --namespace prometheus get pods kube-prometheus-stack-1680962838-prometheus-node-exporter-5kk5q -o yaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-04-08T14:07:36Z"
    message: '0/12 nodes are available: 12 node(s) didn''t have free ports for the
      requested pod ports.'
    reason: Unschedulable

The cause was that this Kubernetes cluster runs on Alibaba Cloud with the managed Alibaba Cloud Prometheus monitoring already purchased, so a node-exporter was already running on every node and the port was already taken. The fix is to revise the custom values file kube-prometheus-stack.values from above:

...
## Deploy node exporter as a daemonset to all nodes
##
nodeExporter:
  enabled: false

Then redeploy. (In practice other problems remained, so I gave up on that approach.)

Note

If prometheus-stack is already deployed and you need to add DCGM-Exporter metrics collection, make the change via "Updating the Prometheus configuration of a Kubernetes cluster".

Note

When deploying from inside the GFW you will run into image download problems; pull the images elsewhere and import them onto the target nodes:

# pull
docker pull registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
docker pull registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2

# export
docker save -o kube-webhook-certgen.tar registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
docker save -o kube-state-metrics.tar registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2

# import
nerdctl -n k8s.io load < /tmp/kube-webhook-certgen.tar
nerdctl -n k8s.io load < /tmp/kube-state-metrics.tar
  • To pin the monitoring components to a designated monitoring node, use a node label:

    kubectl label nodes i-0jl8d8r83kkf3yt5lzh7 telemetry=prometheus
    

Then revise the deployments one by one, e.g. kubectl edit deployment stable-grafana, adding a nodeSelector under spec.template.spec:

spec:
  template:
    spec:
      nodeSelector:
        telemetry: prometheus
      containers:
      ...
  • Check the pods deployed in the prometheus namespace:

Check the kube-prometheus-stack pods
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1680871060"

The output looks similar to:

Output of checking the kube-prometheus-stack pods
NAME                                                              READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-1680-operator-df66d5c4c-8jqzj               1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-kube-state-metrics-865958g6ffz   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-6nwkp   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-6rk88   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-7jx92   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-dkqqs   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-dqmfc   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-h2rdq   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-h44wr   1/1     Running   0          3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-t655c   1/1     Running   0          3m59s

Checking the deployed Prometheus pods shows that node-exporter is running on every node and that Prometheus and Grafana are up (note: everything lives in the prometheus namespace).

Note

If you run into images that cannot be downloaded, see my hands-on notes in "Deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3".

Exposing the services

  • Check the services (svc):

Check the deployed services with kubectl get svc
kubectl get svc -n prometheus

Output:

kubectl get svc output: the grafana service is still ClusterIP and needs to be revised
NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                                       ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   14m
kube-prometheus-stack-1680-alertmanager                     NodePort    10.106.70.4      <none>        9093:30903/TCP               15m
kube-prometheus-stack-1680-operator                         ClusterIP   10.107.104.10    <none>        443/TCP                      15m
kube-prometheus-stack-1680-prometheus                       NodePort    10.101.120.210   <none>        9090:30090/TCP               15m
kube-prometheus-stack-1680871060-grafana                    ClusterIP   10.99.214.112    <none>        80/TCP                       15m
kube-prometheus-stack-1680871060-kube-state-metrics         ClusterIP   10.108.43.250    <none>        8080/TCP                     15m
kube-prometheus-stack-1680871060-prometheus-node-exporter   ClusterIP   10.110.33.129    <none>        9100/TCP                     15m
prometheus-operated                                         ClusterIP   None             <none>        9090/TCP                     14m

By default the prometheus and grafana services are ClusterIP and only reachable inside the cluster, so external access requires either a load balancer / Ingress (see "Kubernetes Load Balancer vs. Ingress") or NodePort (simpler). Above I followed NVIDIA's official deployment documentation and changed alertmanager and prometheus to NodePort, but not grafana, so below I manually switch grafana to NodePort as well.

  • Edit the stable-grafana service, changing type from ClusterIP to NodePort (or LoadBalancer); a non-interactive alternative is sketched after the command:

kubectl edit svc to change the ClusterIP type to NodePort
kubectl edit svc kube-prometheus-stack-1680871060-grafana -n prometheus
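
Alternatively, the same change can be made non-interactively with kubectl patch (a sketch using the service name from the listing above):

kubectl patch svc to switch the type to NodePort
kubectl patch svc kube-prometheus-stack-1680871060-grafana -n prometheus \
  -p '{"spec": {"type": "NodePort"}}'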

The final svc listing is:

kubectl get svc output: NodePort services are exposed on the nodes running them
NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                                       ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   166m
kube-prometheus-stack-1680-alertmanager                     NodePort    10.106.70.4      <none>        9093:30903/TCP               166m
kube-prometheus-stack-1680-operator                         ClusterIP   10.107.104.10    <none>        443/TCP                      166m
kube-prometheus-stack-1680-prometheus                       NodePort    10.101.120.210   <none>        9090:30090/TCP               166m
kube-prometheus-stack-1680871060-grafana                    NodePort    10.99.214.112    <none>        80:32427/TCP                 166m
kube-prometheus-stack-1680871060-kube-state-metrics         ClusterIP   10.108.43.250    <none>        8080/TCP                     166m
kube-prometheus-stack-1680871060-prometheus-node-exporter   ClusterIP   10.110.33.129    <none>        9100/TCP                     166m
prometheus-operated                                         ClusterIP   None             <none>        9090/TCP                     166m

However, the externally exposed port is random, which is inconvenient. As a temporary workaround I use an Nginx reverse proxy to pin a fixed external port and forward it to the random NodePort, which at least makes the services usable for now.

Port forwarding

Note

In my earlier work "Integrating GPU observability into Kubernetes" I used an Nginx reverse proxy in front of Grafana and ran into the "running Grafana behind a reverse proxy" problem: newer Grafana versions validate the client request origin and redirect address to block cross-site attacks, so the proxy headers must be set on Nginx, roughly as sketched below.
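
A minimal Nginx sketch of that setup (the listen port 3000 is arbitrary; the upstream address is the grafana NodePort from the service listing above):

Nginx reverse proxy for Grafana (sketch)
server {
    listen 3000;

    location / {
        # Grafana validates the request origin, so pass the original Host header through
        proxy_set_header Host $http_host;
        proxy_pass http://192.168.6.114:32427;
    }
}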

Alternatively, an Apache reverse proxy can be used (I already run an Apache WebDAV server to sync Joplin data over WebDAV).

When Prometheus/Grafana/Alertmanager are exposed via NodePort, the pods can run on any node in the cluster. For external access, a better approach is Kubernetes MetalLB load balancing combined with an Ingress, giving a complete cloud-style network.

However, to get up and running quickly I currently use the simplified NodePort exposure, plus iptables port forwarding on a gateway host (or a simple web reverse proxy) to make the services reachable from outside.

  • Check the NodePorts exposed by prometheus-stack:

Check the services' NodePorts
kubectl get svc -n prometheus | grep NodePort

Output:

Output of checking the services' NodePorts
kube-prometheus-stack-1680-alertmanager                     NodePort    10.106.70.4      <none>        9093:30903/TCP               2d1h
kube-prometheus-stack-1680-prometheus                       NodePort    10.101.120.210   <none>        9090:30090/TCP               2d1h
kube-prometheus-stack-1680871060-grafana                    NodePort    10.99.214.112    <none>        80:32427/TCP                 2d1h
  • Check which nodes the prometheus-stack pods landed on:

Check which nodes the prometheus pods are running on
kubectl get pods -n prometheus -o wide

The output shows:

Output: the prometheus pods are spread across 3 nodes
NAME                                                              READY   STATUS    RESTARTS       AGE    IP              NODE        NOMINATED NODE   READINESS GATES
alertmanager-kube-prometheus-stack-1680-alertmanager-0            2/2     Running   1 (2d2h ago)   2d2h   10.0.5.28       z-k8s-n-3   <none>           <none>
kube-prometheus-stack-1680-operator-df66d5c4c-8jqzj               1/1     Running   0              2d2h   10.0.4.178      z-k8s-n-2   <none>           <none>
kube-prometheus-stack-1680871060-grafana-6f5c7cb5-k2kw9           3/3     Running   0              2d2h   10.0.7.107      z-k8s-n-4   <none>           <none>
kube-prometheus-stack-1680871060-kube-state-metrics-865958g6ffz   1/1     Running   0              2d2h   10.0.7.187      z-k8s-n-4   <none>           <none>
kube-prometheus-stack-1680871060-prometheus-node-exporter-6nwkp   1/1     Running   0              2d2h   192.168.6.112   z-k8s-n-2   <none>           <none>
...
prometheus-kube-prometheus-stack-1680-prometheus-0                2/2     Running   0              2d2h   10.0.4.242      z-k8s-n-2   <none>           <none>
  • Check the nodes' IP addresses:

Check the nodes' IPs
kubectl get nodes -o wide
Output of checking the nodes' IPs
NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   266d   v1.25.3   192.168.6.101   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   264d   v1.25.3   192.168.6.102   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   264d   v1.25.3   192.168.6.103   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          264d   v1.25.3   192.168.6.111   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          264d   v1.25.3   192.168.6.112   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          264d   v1.25.3   192.168.6.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          264d   v1.25.3   192.168.6.114   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          264d   v1.25.3   192.168.6.115   <none>        Ubuntu 22.04.2 LTS   5.15.0-69-generic   containerd://1.6.6

The mapping, summarized:

prometheus-stack service NodePort mapping

Service         Gateway IP       Gateway Port   Node IP          NodePort
grafana         192.168.106.15   8080           192.168.6.114    32427
prometheus      192.168.106.15   9090           192.168.6.112    30090
alertmanager    192.168.106.15   9093           192.168.6.113    30903

  • Run the following port forwarding script on the gateway host (see the note on IP forwarding after the script):

Port forward the prometheus-stack service ports
local_host=192.168.106.15

dashboard_port=8443
grafana_port=8080
prometheus_port=9090
alertmanager_port=9093

k8s_dashboard_host=172.21.44.215
k8s_dashboard_port=32642

k8s_grafana_host=192.168.6.114
k8s_grafana_port=32427

k8s_prometheus_host=192.168.6.112
k8s_prometheus_port=30090

k8s_alertmanager_host=192.168.6.113
k8s_alertmanager_port=30903

iptables -t nat -A PREROUTING -p tcp --dport ${dashboard_port} -j DNAT --to-destination ${k8s_dashboard_host}:${k8s_dashboard_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_dashboard_host} --dport ${k8s_dashboard_port} -j SNAT --to-source ${local_host}

iptables -t nat -A PREROUTING -p tcp --dport ${grafana_port} -j DNAT --to-destination ${k8s_grafana_host}:${k8s_grafana_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_grafana_host} --dport ${k8s_grafana_port} -j SNAT --to-source ${local_host}

iptables -t nat -A PREROUTING -p tcp --dport ${prometheus_port} -j DNAT --to-destination ${k8s_prometheus_host}:${k8s_prometheus_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_prometheus_host} --dport ${k8s_prometheus_port} -j SNAT --to-source ${local_host}

iptables -t nat -A PREROUTING -p tcp --dport ${alertmanager_port} -j DNAT --to-destination ${k8s_alertmanager_host}:${k8s_alertmanager_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_alertmanager_host} --dport ${k8s_alertmanager_port} -j SNAT --to-source ${local_host}
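
Note that these DNAT/SNAT rules only work if IP forwarding is enabled on the gateway host (and plain iptables rules are not persisted across reboots):

Enable IP forwarding on the gateway host
sysctl -w net.ipv4.ip_forward=1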

Configuration updates

For configuration that needs adjusting later, use the approach from "Updating the Prometheus configuration of a Kubernetes cluster":

Apply changes with helm upgrade prometheus-community/kube-prometheus-stack
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
  --namespace prometheus --values kube-prometheus-stack.values

For example, updating the scrape configuration.

Persistent storage

Default configuration:

By default prometheus stores its data on an ephemeral emptyDir volume (lost when the pod is rescheduled)
...
        volumeMounts:
...
        - mountPath: /prometheus
          name: prometheus-kube-prometheus-stack-1681-prometheus-db
...
      volumes:
      - emptyDir: {}
        name: prometheus-kube-prometheus-stack-1681-prometheus-db

I initially created the storage PV/PVC following "Deploying kube-prometheus-stack with persistent storage on Kubernetes Cluster" above, but used the helm install parameter ``

Access and usage

Access the Grafana dashboard; the initial account is admin and the password is prom-operator. Change it immediately.
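
If the password was customized via values, it can be read back from the grafana secret (a sketch; I assume the secret follows the <release>-grafana naming of this deployment):

Read the Grafana admin password from the secret (sketch)
kubectl --namespace prometheus get secret kube-prometheus-stack-1680871060-grafana \
  -o jsonpath='{.data.admin-password}' | base64 --decode ; echo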

Then we can move on to "Grafana configuration quick start".

Improvement: Kubernetes Ingress controller

To get things working quickly I initially exposed the services via NodePort and simply set up "Running Grafana behind a reverse proxy"; later I will try to improve this into the Kubernetes Ingress controller approach.

References