Installing and Running Prometheus in Kubernetes

Note

This article walks step by step through manually configuring and running Prometheus in a Kubernetes cluster for cluster monitoring. Together with "Running Grafana in a Kubernetes cluster" it covers routine monitoring and troubleshooting of the cluster. Later, "Deploying Prometheus and Grafana in a Kubernetes cluster with Helm 3" automates deployment of the entire monitoring stack.

Prometheus publishes an official Prometheus docker image on Docker Hub, which can be used for this installation.

Prometheus Kubernetes Manifest Files

Create the Namespace and ClusterRole

  • First create a Kubernetes namespace for all monitoring components, so that the Prometheus Kubernetes deployment objects do not end up in the default namespace:

    kubectl create namespace monitoring
    

Prometheus uses the Kubernetes API to collect all available metrics from nodes, Pods, Deployments, and so on, so an RBAC policy granting read access needs to be created and bound to the monitoring namespace.

  • Create a clusterRole.yaml :
run_prometheus_in_k8s/clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring

The verbs: ["get", "list", "watch"] in the rules grant read access to nodes, services, pods, and ingresses, and the ClusterRole is then bound to the default ServiceAccount in the monitoring namespace.

  • Create the role with the following command:

    kubectl create -f clusterRole.yaml
    

On success it prints:

clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
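
As an optional sanity check, you can ask the API server whether the default ServiceAccount in the monitoring namespace now has read access, for example:

    # Both checks should answer "yes" once the ClusterRoleBinding is in place
    kubectl auth can-i list pods --as=system:serviceaccount:monitoring:default
    kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:default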

Create a Config Map to Externalize the Prometheus Configuration

  • Configuration files:
    • prometheus.yml handles everything related to Prometheus: scrape configuration, service discovery, storage, data retention, and so on
    • prometheus.rules contains all Prometheus alerting rules

By exposing the Prometheus configuration as a Kubernetes ConfigMap, you do not have to rebuild the Prometheus image whenever configuration is added or removed; simply update the ConfigMap and restart the Prometheus Pod for the changes to take effect.
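
For example, once the Deployment created later in this article (prometheus-deployment) is running, a configuration change could be rolled out roughly like this:

    # Apply the updated ConfigMap, then restart the Prometheus Pod so it re-reads the configuration
    kubectl apply -f config-map.yaml
    kubectl -n monitoring rollout restart deployment prometheus-deployment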

The config-map.yaml manifest contains both of the configuration files above:

run_prometheus_in_k8s/config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.rules: |-
    groups:
    - name: devopscube demo alert
      rules:
      - alert: High Pod Memory
        expr: sum(container_memory_usage_bytes) > 1
        for: 1m
        labels:
          severity: slack
        annotations:
          summary: High Memory Usage
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "alertmanager.monitoring.svc:9093"

    scrape_configs:
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_endpoints_name]
          regex: 'node-exporter'
          action: keep
      
      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics     
      
      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
      
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

      - job_name: 'kubernetes-cadvisor'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      
      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
  • Breakdown of the scrape configs:

    • kubernetes-apiservers scrapes all metrics from the API servers
    • kubernetes-nodes collects all Kubernetes node metrics
    • kubernetes-pods scrapes a pod's metrics if the pod's metadata is annotated with prometheus.io/scrape and prometheus.io/port (see the example annotations after the ConfigMap creation command below)
    • kubernetes-cadvisor collects all cAdvisor metrics
    • kubernetes-service-endpoints scrapes a service's pods if the service's metadata is annotated with prometheus.io/scrape and prometheus.io/port
  • prometheus.rules contains all the alerting rules

  • Now create the Config Map by running:

    kubectl create -f config-map.yaml
    

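For reference, the pod annotations that the kubernetes-pods job looks for would be declared roughly as below; the pod name, image, and port are placeholders for illustration only:

apiVersion: v1
kind: Pod
metadata:
  name: example-app                # hypothetical pod, for illustration only
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule in the kubernetes-pods job
    prometheus.io/port: "8080"     # rewritten into the scrape target address
    prometheus.io/path: "/metrics" # optional, overrides the default metrics path
spec:
  containers:
    - name: example-app
      image: example/app:latest    # placeholder image
      ports:
        - containerPort: 8080
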
Create the Prometheus Deployment

  • Create prometheus-deployment.yaml :
run_prometheus_in_k8s/prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
  labels:
    app: prometheus-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-server
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          #image: prom/prometheus
          image: prom/prometheus-linux-arm64
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      nodeSelector:
        kubernetes.io/arch: arm64
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
  
        - name: prometheus-storage-volume
          emptyDir: {}

Warning

No Kubernetes persistent volume is configured here; this will be completed in a follow-up. For production you must use persistent storage.
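
As a rough sketch of what that could look like, the emptyDir volume could later be replaced by a PersistentVolumeClaim along these lines (the claim name and size are illustrative assumptions, not part of this article's setup):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage          # hypothetical claim name
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                 # size chosen for illustration
# The storage volume in prometheus-deployment.yaml would then reference the claim:
#   - name: prometheus-storage-volume
#     persistentVolumeClaim:
#       claimName: prometheus-storage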

Note

Keep in mind that my lab environment comes from "Deploying ARM-architecture Kubernetes", so the ARM build of the Prometheus image, prom/prometheus-linux-arm64, must be used; on an x86 architecture simply use prom/prometheus.

The pod's deployment must also configure nodeSelector :

spec:
  containers:
    - name: prometheus
      #image: prom/prometheus
      image: prom/prometheus-linux-arm64
      ...
  nodeSelector:
    kubernetes.io/arch: arm64

If you are using a regular x86 environment, revise the above configuration to:

spec:
  containers:
    - name: prometheus
      image: prom/prometheus
      #image: prom/prometheus-linux-arm64
      ...
  #nodeSelector:
  #  kubernetes.io/arch: arm64
  • Create the deployment:

    kubectl create  -f prometheus-deployment.yaml
    
  • Once it is done, check:

    kubectl -n monitoring get pods -o wide
    

Output:

NAME                                     READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
prometheus-deployment-64d4b79f85-565jn   1/1     Running   0          24h   10.244.1.3   pi-worker1   <none>           <none>

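No Service exposing the Prometheus UI is created in this section yet; as a quick check you can, for example, port-forward to the Deployment and open http://localhost:9090 in a browser:

    kubectl -n monitoring port-forward deployment/prometheus-deployment 9090:9090
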
Set Up Kube State Metrics

The metrics exposed by the kube-state-metrics service are not available in a default setup, so make sure to deploy kube-state-metrics in order to monitor all Kubernetes API objects such as deployments , pods , jobs , cronjobs , and so on. See "Configuring Kube State Metrics in a Kubernetes cluster".