Updating the Prometheus Configuration of a Kubernetes Cluster¶
Note

After deploying DCGM-Exporter for GPU monitoring on top of 使用Helm 3在Kubernetes集群部署Prometheus和Grafana, the Prometheus configuration must be revised so that it scrapes metrics from the relevant nodes and ports.
After building a Prometheus monitoring stack with the Prometheus Operator (for example, 使用Helm 3在Kubernetes集群部署Prometheus和Grafana deploys it with the kube-prometheus-stack helm chart), you can monitor your own custom services by adding additional scrape configurations.
A so-called additional scrape config uses regular expressions to find matching services, locating a group of services by label, annotation, namespace, or name.
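For example, a minimal sketch of such a scrape config, which keeps only endpoints whose Service carries the prometheus.io/scrape annotation (the annotation name is a widespread community convention, not something mandated by this article):

- job_name: annotated-services
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose Service is annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Record the originating Service name as a stable label
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service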
Additional scrape configs are a powerful low-level mechanism, specified in the prometheusSpec section of the helm chart values or in the Prometheus configuration file. They are therefore well suited for building a platform-level monitoring mechanism while keeping control over the Prometheus installation and configuration.
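In the kube-prometheus-stack chart, this block is nested under prometheus.prometheusSpec in the values file, roughly:

prometheus:
  prometheusSpec:
    # Raw scrape configs appended verbatim to the generated Prometheus configuration
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      # ... scrape config body as shown below ...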
Note

The practice in this article builds on my 使用Helm 3在Kubernetes集群部署Prometheus和Grafana; the goal is 在Kuternetes集成GPU可观测能力.

使用Helm 3在Kubernetes集群部署Prometheus和Grafana does not include the additionalScrapeConfigs that 在Kuternetes集成GPU可观测能力 requires for GPU monitoring, so prometheus is currently unable to scrape the GPU metrics on port 9400. Below is the configMap required by 在Kuternetes集成GPU可观测能力:
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
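The relabel_configs step copies each pod's node name into a kubernetes_node label, so GPU metrics can be selected per node. For example, with dcgm-exporter's default GPU utilization metric (the node name here is hypothetical):

DCGM_FI_DEV_GPU_UTIL{kubernetes_node="gpu-node-1"}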
Get the current Prometheus helm release:

helm list -A | grep prometheus

Output:
stable    default    1    2023-03-29 22:10:58.12684326 +0800 CST    deployed    kube-prometheus-stack-45.8.0    v0.63.0
dcgm-exporter deployment status¶
Referring to the analysis in how prometheus get dcgm-exporter metrics? #106, the dcgm-exporter deployment consists of the following objects (see the sketch after this list for a quick way to inspect them):
Namespace
DaemonSet
Service
ServiceMonitor
ServiceAccount
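A quick way to check these objects (a sketch; the Namespace itself can be listed separately, and the name filter assumes the dcgm-exporter naming seen below):

kubectl get daemonset,service,servicemonitor,serviceaccount -A | grep dcgm-exporter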
Inspect the service deployed by 在Kuternetes集成GPU可观测能力, which includes dcgm-exporter-1680364448. Check it:

kubectl get svc dcgm-exporter-1680364448 -o yaml

The dcgm-exporter service configuration output:
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: dcgm-exporter-1680364448
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2023-04-01T15:54:13Z"
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter-1680364448
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 2.6.10
    helm.sh/chart: dcgm-exporter-2.6.10
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    ...
    manager: helm
    operation: Update
    time: "2023-04-01T15:54:13Z"
  name: dcgm-exporter-1680364448
  namespace: default
  resourceVersion: "6314410"
  selfLink: /api/v1/namespaces/default/services/dcgm-exporter-1680364448
  uid: fef9c429-4c9f-418b-ae62-c8012efc577b
spec:
  clusterIP: 10.233.18.35
  ports:
  - name: metrics
    port: 9400
    protocol: TCP
    targetPort: 9400
  selector:
    app.kubernetes.io/instance: dcgm-exporter-1680364448
    app.kubernetes.io/name: dcgm-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
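The Service exposes port 9400 under the name metrics, matching the DaemonSet's container port. A quick sanity check of the endpoint (a sketch using the service name above; run from a machine with cluster access):

# Forward the dcgm-exporter service port locally
kubectl -n default port-forward svc/dcgm-exporter-1680364448 9400:9400 &
# Raw GPU metrics in Prometheus exposition format
curl -s http://localhost:9400/metrics | head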
Inspect the dcgm-exporter daemonset configuration:

kubectl get ds dcgm-exporter-1680364448 -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: dcgm-exporter-1680364448
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2023-04-01T15:54:13Z"
  generation: 1
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter-1680364448
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 2.6.10
    helm.sh/chart: dcgm-exporter-2.6.10
  managedFields:
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      ...
    manager: helm
    operation: Update
    time: "2023-04-01T15:54:13Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:currentNumberScheduled: {}
        f:desiredNumberScheduled: {}
        f:numberAvailable: {}
        f:numberMisscheduled: {}
        f:numberReady: {}
        f:numberUnavailable: {}
        f:observedGeneration: {}
        f:updatedNumberScheduled: {}
    manager: kube-controller-manager
    operation: Update
    time: "2023-04-02T09:48:53Z"
  name: dcgm-exporter-1680364448
  namespace: default
  resourceVersion: "6988330"
  selfLink: /apis/apps/v1/namespaces/default/daemonsets/dcgm-exporter-1680364448
  uid: 43010398-556f-4db0-9d2a-4b544cc6d318
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter-1680364448
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter-1680364448
        app.kubernetes.io/name: dcgm-exporter
    spec:
      containers:
      - args:
        - -f
        - /etc/dcgm-exporter/dcp-metrics-included.csv
        env:
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        image: nvcr.io/nvidia/k8s/dcgm-exporter:2.4.6-2.6.10-ubuntu20.04
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9400
            scheme: HTTP
          initialDelaySeconds: 45
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9400
            scheme: HTTP
          initialDelaySeconds: 45
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          runAsNonRoot: false
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: dcgm-exporter-1680364448
      serviceAccountName: dcgm-exporter-1680364448
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
...
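To confirm the DaemonSet has placed an exporter pod on each GPU node, a quick check (a sketch based on the labels above):

kubectl -n default get pods -l app.kubernetes.io/name=dcgm-exporter -o wide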
Update the helm deployment¶
Following the approach in 在Kuternetes集成GPU可观测能力, export the values of the prometheus-community/kube-prometheus-stack chart:
helm inspect values prometheus-community/kube-prometheus-stack > kube-prometheus-stack.values
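To locate where to edit, a quick search in the exported file:

grep -n additionalScrapeConfigs kube-prometheus-stack.values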
In the configMap, add a gpu-metrics job to additionalScrapeConfigs. Note that the upstream example targets the gpu-operator namespace, while this deployment installed dcgm-exporter into default, so the namespace entry differs:
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
Apply the update:
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
--namespace prometheus --values kube-prometheus-stack.values
I hit an error here because I had forgotten that, in the 使用Helm 3在Kubernetes集群部署Prometheus和Grafana steps, I had already changed the prometheus / grafana services from ClusterIP to NodePort, which triggers the following error (it can be ignored):
Error: UPGRADE FAILED: cannot patch "stable-grafana" with kind Service: Service "stable-grafana" is invalid: spec.ports[0].nodePort: Forbidden: may not be used when `type` is 'ClusterIP' && cannot patch "stable-kube-prometheus-sta-alertmanager" with kind Service: Service "stable-kube-prometheus-sta-alertmanager" is invalid: spec.ports[0].nodePort: Forbidden: may not be used when `type` is 'ClusterIP' && cannot patch "stable-kube-prometheus-sta-prometheus" with kind Service:
Service "stable-kube-prometheus-sta-prometheus" is invalid: spec.ports[0].nodePort: Forbidden: may not be used when `type` is 'ClusterIP'
Note

helm upgrade requires 2 arguments, [RELEASE] [CHART], otherwise it fails:

Error: "helm upgrade" requires 2 arguments
Usage: helm upgrade [RELEASE] [CHART] [flags]

Somewhat counter-intuitively, [RELEASE] must be the NAME shown by helm list, while [CHART] uses the repo_name/path_to_chart format: here that is prometheus-community/kube-prometheus-stack, not prometheus-community/kube-prometheus-stack-45.9.1.
Note

helm upgrade pulls the chart package again, e.g. Get "https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-45.9.1/kube-prometheus-stack-45.9.1.tgz", so this method is heavyweight. I will look for a better update approach later; one option is sketched below.
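One lighter option, a sketch assuming the chart version from the log above, is to download the chart archive once and upgrade from the local file:

# Download the chart once to a local .tgz archive
helm pull prometheus-community/kube-prometheus-stack --version 45.9.1
# Upgrade from the local archive instead of fetching it again
helm upgrade kube-prometheus-stack-1681228346 ./kube-prometheus-stack-45.9.1.tgz \
    --namespace prometheus --values kube-prometheus-stack.values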
After the upgrade completes, the Grafana通用可视分析平台 dashboards show that GPU metrics are being collected.
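You can also confirm from the Prometheus side that the gpu-metrics job is live (a sketch; adjust the namespace and service name to your own release):

# Forward the Prometheus web port locally
kubectl port-forward svc/stable-kube-prometheus-sta-prometheus 9090:9090 &
# The gpu-metrics job should appear among the active scrape targets
curl -s http://localhost:9090/api/v1/targets | grep -c gpu-metrics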
Issues¶
For hosts with complex network interfaces (Calico网络 with multiple NICs), the host instance IP obtained by the scrape above may not be the interface IP you want. In practice, running the DCGM-Exporter DaemonSet with Kubernetes hostNetwork pins it to the host IP address, which reflects the deployment much more clearly; a sketch of the change follows.
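A sketch of that change in the DaemonSet pod spec (whether your dcgm-exporter chart version exposes this as a value, or whether you patch the DaemonSet directly, depends on your setup):

spec:
  template:
    spec:
      # Bind the exporter to the node's network namespace, so the scrape target IP is the host IP
      hostNetwork: true
      # Required so pods on the host network still resolve in-cluster DNS names
      dnsPolicy: ClusterFirstWithHostNet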