Deploying Prometheus and Grafana with integrated GPU monitoring on a Kubernetes cluster (z-k8s)¶
Note
For the GPU Kubernetes cluster built with "OVMF-based GPU and NVMe passthrough", this article combines "Deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3" and "Integrating GPU observability into Kubernetes" to implement monitoring for the "private cloud architecture" Kubernetes setup.
The Prometheus community provides the kube-prometheus-stack helm chart: a complete set of Kubernetes manifests that includes Grafana dashboards and Prometheus rules, combined with documentation and scripts, for convenient deployment through the Prometheus Operator. For GPU node monitoring, however, I recommend making a few revisions at deployment time (described in this article) so that everything comes up in one pass. Alternatively, you can first complete "Deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3" and then apply the changes via "Updating the Prometheus configuration of a Kubernetes cluster".
helm3¶
helm makes deployment convenient; install it with:
curl -LO https://git.io/get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
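A quick check that the client installed correctly:
helm version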
Install the NVIDIA GPU Operator¶
This part still needs to be written up properly.
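Until that is written up, here is a minimal sketch of installing the GPU Operator with helm; the repository URL and the gpu-operator namespace follow NVIDIA's public documentation, and chart options should be adjusted to the cluster:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
    --create-namespace --namespace gpu-operator \
    nvidia/gpu-operator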
Install Prometheus and Grafana¶
helm configuration¶
Add the Prometheus community helm chart repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
NVIDIA adjusts a few parameters of the community chart, so first export the chart's values so they can be revised:
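The export can be done with helm inspect values; here I assume the output path /tmp/kube-prometheus-stack.values, which matches the file passed to helm install later in this article:
helm repo update
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values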
Revision 1: expose the metrics port 30090 as a NodePort on every node. In practice this only requires changing the type: ClusterIP line to type: NodePort; it is recommended to do this for the svc of stable-grafana (helm install also supports passing --set grafana.service.type=NodePort; add a nodePort entry to map 80/30080), alertmanager (9093/30903) and prometheus (9090/30090):
prometheus:
  ## Configuration for Prometheus service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    ## Port for Prometheus Service to listen on
    ##
    port: 9090
    ## To be used with a proxy extraContainer port
    targetPort: 9090
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []
    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30090
    ## Loadbalancer IP
    ## Only use if service.type is "LoadBalancer"
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    ## Denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints
    ##
    externalTrafficPolicy: Cluster
    ## Service type
    ##
    type: NodePort
...
grafana:
  ## Passed to grafana subchart and used by servicemonitor below
  ##
  service:
    portName: http-web
    nodePort: 30080
    type: NodePort
...
alertmanager:
  ## Deploy alertmanager
  ##
  enabled: true
  ...
  ## Configuration for Alertmanager service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    ## Port for Alertmanager Service to listen on
    ##
    port: 9093
    ## To be used with a proxy extraContainer port
    ##
    targetPort: 9093
    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903
    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    ...
    ## Service type
    ##
    type: NodePort
Note
At first I could not find where to set grafana's service.type in kube-prometheus-stack.values. Later I found it can be done by passing --set grafana.service.type=NodePort; on re-reading the values I realized it is simply not set by default, so it has to be added by hand.
Other revisions:
defaultDashboardsTimezone: Asia/Shanghai
Revision 2: set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false:
# If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
# prometheus resource to be created with selectors based on values in the helm deployment,
# which will also match the servicemonitors created
#
serviceMonitorSelectorNilUsesHelmValues: false
Revision 3: in the configMap, add a gpu-metrics job to additionalScrapeConfigs:
# AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
# are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
# as specified in the official Prometheus documentation:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
# appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
# to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
# scrape configs are going to break Prometheus after the upgrade.
#
# The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
# port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
#
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
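Once the GPU Operator is running, you can verify that the dcgm-exporter endpoints this job will scrape actually exist; this sketch assumes the exporter runs in the gpu-operator namespace on its default port 9400:
kubectl -n gpu-operator get pods,svc -o wide | grep dcgm
# spot-check a metric from inside the cluster, e.g. GPU utilization:
# curl http://<dcgm-exporter-pod-ip>:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL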
Prepare storage¶
Create hostPath persistent volumes (see "Deploying hostPath storage in Kubernetes"):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv
  labels:
    type: local
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-alert
  labels:
    type: local
spec:
  storageClassName: prometheus-data-alert
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-thanos
  labels:
    type: local
spec:
  storageClassName: prometheus-data-thanos
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-prometheus-stack-pv-grafana
  labels:
    type: local
spec:
  storageClassName: prometheus-data-grafana
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/prometheus/data/grafana-db"
Note
Only the PVs need to be created; kube-prometheus-stack's values.yaml provides the PVC configuration and creates the PVCs automatically (a sketch of that configuration follows below).
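For reference, the PVC side in kube-prometheus-stack.values looks roughly like the sketch below; the storage class name matches the PV created above, the size is illustrative, and alertmanager and grafana have analogous settings (alertmanagerSpec.storage and grafana.persistence):
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-data
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi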
Apply the PV manifest:
kubectl apply -f kube-prometheus-stack-pv.yaml
Deploy¶
Run the deployment with the customized values:
helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values /tmp/kube-prometheus-stack.values
#--set=alertmanager.persistentVolume.existingClaim=kube-prometheus-stack-pvc,server.persistentVolume.existingClaim=kube-prometheus-stack-pvc,grafana.persistentVolume.existingClaim=kube-prometheus-stack-pvc
Note
The persistent-storage solution used here, "kube-prometheus-stack persistent volumes", has been verified to work.
The output:
NAME: kube-prometheus-stack-1680871060
LAST DEPLOYED: Fri Apr 7 20:38:00 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1680871060"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Note
When deploying on a production cluster I hit the following error:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(ServiceMonitor.spec.endpoints[0]): unknown field "enableHttp2" in com.coreos.monitoring.v1.ServiceMonitor.spec.endpoints
The situation resembles prometheus-kube-stack helm install results in unknown field “enableHttp2” #2633:
Found same error upgrading from old Prometheus installation.
Solution: uninstall prometheus, delete CRDs and install again.
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#uninstall-helm-chart
The cause was an earlier prometheus-stack installation that I interrupted with ctrl-c and then removed with helm uninstall stack. According to the documentation the CRDs are not cleaned up automatically, which likely caused the conflict; the monitoring CRDs have to be deleted by hand:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
Note
On a production cluster I also hit a scheduling failure:
kubectl --namespace prometheus get pods kube-prometheus-stack-1680962838-prometheus-node-exporter-5kk5q -o yaml
...
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-04-08T14:07:36Z"
message: '0/12 nodes are available: 12 node(s) didn''t have free ports for the
requested pod ports.'
reason: Unschedulable
The cause: the cluster was deployed on Alibaba Cloud with Alibaba's own Prometheus monitoring already purchased, so node-exporter had already been deployed on every node and the port was taken. The fix is to revise the custom values file kube-prometheus-stack.values described above:
...
## Deploy node exporter as a daemonset to all nodes
##
nodeExporter:
  enabled: false
and then redeploy. (In practice other problems remained, so I eventually gave up on that cluster.)
Note
If prometheus-stack is already deployed and DCGM-Exporter metrics collection needs to be added afterwards, this can be done via "Updating the Prometheus configuration of a Kubernetes cluster".
Note
When deploying behind the GFW (mainland China) you will run into image download failures; work around them by importing the images onto the target nodes:
# pull (on a host with registry access)
docker pull registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
docker pull registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
# export to tar archives
docker save -o kube-webhook-certgen.tar registry.k8s.io/ingress-nginx/kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
docker save -o kube-state-metrics.tar registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
# import on the target node
nerdctl -n k8s.io load < /tmp/kube-webhook-certgen.tar
nerdctl -n k8s.io load < /tmp/kube-state-metrics.tar
To pin the monitoring pods to a designated monitoring server, you can use a label:
kubectl label nodes i-0jl8d8r83kkf3yt5lzh7 telemetry=prometheus
and then revise the deployments one by one, e.g. kubectl edit deployment stable-grafana, adding a nodeSelector to the pod template spec:
spec:
  nodeSelector:
    telemetry: prometheus
  containers:
  ...
Check the pods deployed in the prometheus namespace:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1680871060"
The output looks like this:
NAME READY STATUS RESTARTS AGE
kube-prometheus-stack-1680-operator-df66d5c4c-8jqzj 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-kube-state-metrics-865958g6ffz 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-6nwkp 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-6rk88 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-7jx92 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-dkqqs 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-dqmfc 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-h2rdq 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-h44wr 1/1 Running 0 3m59s
kube-prometheus-stack-1680871060-prometheus-node-exporter-t655c 1/1 Running 0 3m59s
Checking the deployed pods shows that node-exporter is running on every node and that Prometheus and Grafana are up (note that they live in the prometheus namespace).
Note
If you run into images that cannot be downloaded, see my experience in "Deploying Prometheus and Grafana on a Kubernetes cluster with Helm 3".
Service exposure¶
Check the svc:
kubectl get svc -n prometheus
The output shows:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 14m
kube-prometheus-stack-1680-alertmanager NodePort 10.106.70.4 <none> 9093:30903/TCP 15m
kube-prometheus-stack-1680-operator ClusterIP 10.107.104.10 <none> 443/TCP 15m
kube-prometheus-stack-1680-prometheus NodePort 10.101.120.210 <none> 9090:30090/TCP 15m
kube-prometheus-stack-1680871060-grafana ClusterIP 10.99.214.112 <none> 80/TCP 15m
kube-prometheus-stack-1680871060-kube-state-metrics ClusterIP 10.108.43.250 <none> 8080/TCP 15m
kube-prometheus-stack-1680871060-prometheus-node-exporter ClusterIP 10.110.33.129 <none> 9100/TCP 15m
prometheus-operated ClusterIP None <none> 9090/TCP 14m
By default the prometheus and grafana services use ClusterIP and are only reachable inside the cluster. To access them from outside you need either a load balancer / Ingress (see "Kubernetes Load Balancer vs. Ingress") or NodePort (the simple option). Above I followed NVIDIA's official deployment guide and switched alertmanager and prometheus to NodePort, but did not change grafana, so below I manually switch grafana to NodePort as well.
Edit the stable-grafana service and change type from ClusterIP to NodePort (or LoadBalancer):
kubectl edit svc kube-prometheus-stack-1680871060-grafana -n prometheus
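Equivalently, a non-interactive patch achieves the same change (the service name below is the one generated by this particular install):
kubectl -n prometheus patch svc kube-prometheus-stack-1680871060-grafana \
    -p '{"spec": {"type": "NodePort"}}'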
The final svc list looks like this:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 166m
kube-prometheus-stack-1680-alertmanager NodePort 10.106.70.4 <none> 9093:30903/TCP 166m
kube-prometheus-stack-1680-operator ClusterIP 10.107.104.10 <none> 443/TCP 166m
kube-prometheus-stack-1680-prometheus NodePort 10.101.120.210 <none> 9090:30090/TCP 166m
kube-prometheus-stack-1680871060-grafana NodePort 10.99.214.112 <none> 80:32427/TCP 166m
kube-prometheus-stack-1680871060-kube-state-metrics ClusterIP 10.108.43.250 <none> 8080/TCP 166m
kube-prometheus-stack-1680871060-prometheus-node-exporter ClusterIP 10.110.33.129 <none> 9100/TCP 166m
prometheus-operated ClusterIP None <none> 9090/TCP 166m
However, the externally reachable port is then random, which is inconvenient. As a temporary workaround I use an Nginx reverse proxy to pin the external port and forward it to the random NodePort; see the sketch after the note below.
Port forwarding¶
Note
In my earlier practice in "Integrating GPU observability into Kubernetes" I put an Nginx reverse proxy in front of Grafana and hit the "Running Grafana behind a reverse proxy" problem: newer Grafana versions validate the client's request origin and callback address to block cross-site attacks, so the proxy headers must be set in Nginx.
An Apache reverse proxy also works (I already run an Apache WebDAV server for syncing Joplin data via WebDAV).
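As an illustration, a minimal Nginx server block that pins Grafana to a fixed port; this is only a sketch: the listen port 8080 and the backend 192.168.6.114:32427 are taken from the NodePort mapping worked out below, and the proxy headers are what the Grafana reverse-proxy requirement refers to:
server {
    listen 8080;
    location / {
        # Grafana validates the original Host header, so pass it through
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # NodePort of the grafana service on the node running the pod
        proxy_pass http://192.168.6.114:32427;
    }
}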
When Prometheus/Grafana/Alertmanager are exposed via NodePort, their pods may run on any node in the cluster. For external access the better approach is Kubernetes MetalLB load balancing combined with Ingress, which gives a complete cloud-style network.
However, to get something running quickly I currently use the simplified NodePort exposure, so a simple web reverse proxy, or iptables port forwarding, is enough to reach the services from outside.
Check the NodePorts exposed by prometheus-stack:
kubectl get svc -n prometheus | grep NodePort
The output shows:
kube-prometheus-stack-1680-alertmanager NodePort 10.106.70.4 <none> 9093:30903/TCP 2d1h
kube-prometheus-stack-1680-prometheus NodePort 10.101.120.210 <none> 9090:30090/TCP 2d1h
kube-prometheus-stack-1680871060-grafana NodePort 10.99.214.112 <none> 80:32427/TCP 2d1h
Check which nodes the prometheus-stack pods landed on:
kubectl get pods -n prometheus -o wide
The output shows:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-kube-prometheus-stack-1680-alertmanager-0 2/2 Running 1 (2d2h ago) 2d2h 10.0.5.28 z-k8s-n-3 <none> <none>
kube-prometheus-stack-1680-operator-df66d5c4c-8jqzj 1/1 Running 0 2d2h 10.0.4.178 z-k8s-n-2 <none> <none>
kube-prometheus-stack-1680871060-grafana-6f5c7cb5-k2kw9 3/3 Running 0 2d2h 10.0.7.107 z-k8s-n-4 <none> <none>
kube-prometheus-stack-1680871060-kube-state-metrics-865958g6ffz 1/1 Running 0 2d2h 10.0.7.187 z-k8s-n-4 <none> <none>
kube-prometheus-stack-1680871060-prometheus-node-exporter-6nwkp 1/1 Running 0 2d2h 192.168.6.112 z-k8s-n-2 <none> <none>
...
prometheus-kube-prometheus-stack-1680-prometheus-0 2/2 Running 0 2d2h 10.0.4.242 z-k8s-n-2 <none> <none>
Check the node IPs:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
z-k8s-m-1 Ready control-plane 266d v1.25.3 192.168.6.101 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-m-2 Ready control-plane 264d v1.25.3 192.168.6.102 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-m-3 Ready control-plane 264d v1.25.3 192.168.6.103 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-n-1 Ready <none> 264d v1.25.3 192.168.6.111 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-n-2 Ready <none> 264d v1.25.3 192.168.6.112 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-n-3 Ready <none> 264d v1.25.3 192.168.6.113 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-n-4 Ready <none> 264d v1.25.3 192.168.6.114 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
z-k8s-n-5 Ready <none> 264d v1.25.3 192.168.6.115 <none> Ubuntu 22.04.2 LTS 5.15.0-69-generic containerd://1.6.6
The resulting mapping:
| Service | Gateway IP | Gateway Port | Node IP | NodePort |
| --- | --- | --- | --- | --- |
| grafana | 192.168.106.15 | 8080 | 192.168.6.114 | 32427 |
| prometheus | 192.168.106.15 | 9090 | 192.168.6.112 | 30090 |
| alertmanager | 192.168.106.15 | 9093 | 192.168.6.113 | 30903 |
Run the following port-forwarding script on the gateway:
# gateway address and the fixed external ports it exposes
local_host=192.168.106.15
dashboard_port=8443
grafana_port=8080
prometheus_port=9090
alertmanager_port=9093

# node IP / NodePort pairs of the backend services (see the mapping table above)
k8s_dashboard_host=172.21.44.215
k8s_dashboard_port=32642
k8s_grafana_host=192.168.6.114
k8s_grafana_port=32427
k8s_prometheus_host=192.168.6.112
k8s_prometheus_port=30090
k8s_alertmanager_host=192.168.6.113
k8s_alertmanager_port=30903

# for each service: DNAT the fixed gateway port to the node's NodePort,
# then SNAT the forwarded traffic so replies return through the gateway
iptables -t nat -A PREROUTING -p tcp --dport ${dashboard_port} -j DNAT --to-destination ${k8s_dashboard_host}:${k8s_dashboard_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_dashboard_host} --dport ${k8s_dashboard_port} -j SNAT --to-source ${local_host}
iptables -t nat -A PREROUTING -p tcp --dport ${grafana_port} -j DNAT --to-destination ${k8s_grafana_host}:${k8s_grafana_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_grafana_host} --dport ${k8s_grafana_port} -j SNAT --to-source ${local_host}
iptables -t nat -A PREROUTING -p tcp --dport ${prometheus_port} -j DNAT --to-destination ${k8s_prometheus_host}:${k8s_prometheus_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_prometheus_host} --dport ${k8s_prometheus_port} -j SNAT --to-source ${local_host}
iptables -t nat -A PREROUTING -p tcp --dport ${alertmanager_port} -j DNAT --to-destination ${k8s_alertmanager_host}:${k8s_alertmanager_port}
iptables -t nat -A POSTROUTING -p tcp -d ${k8s_alertmanager_host} --dport ${k8s_alertmanager_port} -j SNAT --to-source ${local_host}
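Note that the DNAT rules only take effect if the gateway host has kernel IP forwarding enabled; if it is not already on, enable it first:
sysctl -w net.ipv4.ip_forward=1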
Configuration revisions¶
For configuration that needs adjusting later, use the approach from "Updating the Prometheus configuration of a Kubernetes cluster":
helm upgrade kube-prometheus-stack-1681228346 prometheus-community/kube-prometheus-stack \
--namespace prometheus --values kube-prometheus-stack.values
for example to update the scrape configuration.
Persistent storage¶
The default configuration:
...
volumeMounts:
...
- mountPath: /prometheus
name: prometheus-kube-prometheus-stack-1681-prometheus-db
...
volumes:
- emptyDir: {}
name: prometheus-kube-prometheus-stack-1681-prometheus-db
I originally built the storage PV/PVC following "Deploying kube-prometheus-stack with persistent storage on Kubernetes Cluster" above, but used the helm install parameter ``
Access and usage¶
Open the Grafana dashboard; the initial account is admin with password prom-operator, which you should change immediately.
Then we can move on to the "Grafana configuration quick start".
"Integrating GPU observability into Kubernetes" uses NVIDIA's official NVIDIA DCGM Exporter Dashboard, which can be imported directly to monitor my Nvidia Tesla P10 GPU card.
Kubernetes Ingress controller improvement¶
To move quickly I initially exposed the services via NodePort and simply deployed "Running Grafana behind a reverse proxy"; a later improvement is to switch to the "Kubernetes Ingress controller" approach.