metrics-server

Installation

Metrics Server can be installed either directly from a YAML manifest or via the official Metrics Server Helm chart.

YAML installation

Run the following command to install the latest version of Metrics Server:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
The installation output shows:
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
Check that the pod is running:
kubectl -n kube-system get pods -o wide | grep metrics-server
You can see a metrics-server pod running on one of the control-plane servers:
metrics-server-d9694457-r9tzk 0/1 Running 0 10m 10.233.93.204 y-k8s-m-3 <none> <none>
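Before relying on kubectl top, it can also be worth confirming that the aggregated metrics API is registered and reports Available (a quick sanity check, not part of the original install output):

```shell
# The APIService name matches the one created by components.yaml above;
# the AVAILABLE column should show True once the pod is ready.
kubectl get apiservice v1beta1.metrics.k8s.io
```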
Helm installation

Run the following commands to install Metrics Server via the Helm chart:
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server
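If extra runtime flags are needed (such as the --kubelet-insecure-tls flag discussed in the troubleshooting section), recent versions of the chart expose an args value. This is a sketch; verify the value name with `helm show values metrics-server/metrics-server` before using it:

```shell
# Pass additional container args via the chart's `args` value
# (assumed to exist in your chart version; check with `helm show values`).
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --set 'args={--kubelet-insecure-tls}'
```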
High-availability deployment

With replicas set to a value greater than 1, Metrics Server can be deployed in high-availability mode from either the YAML manifest or the Helm chart:

For Kubernetes v1.21+:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
For Kubernetes v1.19-1.21:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability.yaml
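After applying the HA manifest, you can confirm that multiple replicas are running and that a PodDisruptionBudget was created (resource names are assumed to match the manifest defaults; adjust for your cluster):

```shell
kubectl -n kube-system get deployment metrics-server   # READY should show 2/2 or more
kubectl -n kube-system get pdb metrics-server          # PodDisruptionBudget from the HA manifest
```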
kubectl top

Once Metrics Server is deployed, you can use kubectl top to observe how Nodes and Pods are doing:

Check node load:
kubectl top node
Troubleshooting

When I ran kubectl top node, I hit an error:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
Check the metrics-server pod logs:
kubectl -n kube-system logs metrics-server-d9694457-r9tzk
输出显示:
I0913 06:21:52.663592 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
E0913 06:21:54.203213 1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.117:10250/metrics/resource\": x509: certificate has expired or is not yet valid: current time 2023-09-13T06:21:54Z is after 2022-12-22T23:45:32Z" node="y-k8s-m-2"
I0913 06:21:54.207331 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0913 06:21:54.207374 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0913 06:21:54.207447 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0913 06:21:54.207501 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0913 06:21:54.207629 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0913 06:21:54.207662 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0913 06:21:54.208297 1 secure_serving.go:267] Serving securely on [::]:4443
I0913 06:21:54.208390 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0913 06:21:54.208422 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0913 06:21:54.208762 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
E0913 06:21:54.208992 1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.120:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.120 because it doesn't contain any IP SANs" node="y-k8s-n-2"
E0913 06:21:54.219369 1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.116:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.116 because it doesn't contain any IP SANs" node="y-k8s-m-1"
E0913 06:21:54.226210 1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.119:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.119 because it doesn't contain any IP SANs" node="y-k8s-n-1"
...
As the log shows, scraping fails because the certificates presented by the monitored nodes cannot be validated.

The metrics-server (GitHub) documentation describes the runtime flag --kubelet-insecure-tls, which skips verification of the CA certificates presented by the kubelets; however, this flag is not recommended for production use.

In "Metrics server throwing X509 error in logs, fails to return Node or Pod metrics #1025", the community's answer is that this happens because dev/non-prod Kubernetes distributions such as minikube do not provision the certificates needed for secure communication with the kubelet, so --kubelet-insecure-tls has to be added to the Metrics Server arguments. metrics-server (GitHub) refuses to enable this flag by default and instead asks the Kubernetes distributions to fix the issue. My cluster was deployed quickly with Kubespray, so the certificates in this area may indeed be problematic.
Manually edit the Deployment with kubectl -n kube-system edit deployment metrics-server and add the flag as follows:
...
    spec:
      containers:
      - args:
        - --kubelet-insecure-tls
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        imagePullPolicy: IfNotPresent
...
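The same change can also be made non-interactively with a JSON patch. This is a sketch that assumes metrics-server is the first (index 0) container in the pod template; verify before applying:

```shell
# Append --kubelet-insecure-tls to the existing args of container 0
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
```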
Usage

A simple check of the Nodes:
kubectl top node
The output looks something like:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
y-k8s-m-1 757m 18% 2787Mi 10%
y-k8s-m-2 841m 21% 3155Mi 11%
y-k8s-m-3 741m 18% 3731Mi 14%
y-k8s-n-1 583m 29% 4884Mi 63%
y-k8s-n-2 472m 23% 4258Mi 55%
Note that a CPU value such as 757m shown by top means 0.757 CPUs (1000m equals 1 CPU).
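As a quick sanity check on the millicore arithmetic, the conversion can be sketched in shell (variable names are illustrative):

```shell
# Convert a kubectl-style CPU quantity such as "757m" into whole cores.
# 1000m equals 1 CPU.
cpu="757m"
cores=$(echo "${cpu%m}" | awk '{ printf "%.3f", $1 / 1000 }')
echo "$cores"   # 0.757
```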
Check Pods:
kubectl top pod
You can sort by CPU (when no namespace is specified, default is used):
kubectl top pod --sort-by=cpu
NAME CPU(cores) MEMORY(bytes)
productpage-v1-58b4c9bff8-ksq9s 27m 107Mi
reviews-v3-589cb4d56c-z2j76 10m 165Mi
reviews-v2-5d99885bc9-jnzvq 8m 185Mi
reviews-v1-5896f547f5-8xpvj 7m 157Mi
ratings-v1-b8f8fcf49-snvfr 6m 53Mi
details-v1-6997d94bb9-l2cjl 5m 65Mi
Observe a specific namespace, sorted by CPU:
kubectl top pod --sort-by=cpu -n kube-system
NAME CPU(cores) MEMORY(bytes)
kube-apiserver-y-k8s-m-3 477m 1133Mi
kube-apiserver-y-k8s-m-2 172m 1047Mi
kube-apiserver-y-k8s-m-1 145m 932Mi
calico-node-bcwb7 70m 124Mi
calico-node-m7mfs 68m 128Mi
...
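A couple of other kubectl top variants are also worth knowing (these flags exist in current kubectl releases; verify with kubectl top pod --help):

```shell
kubectl top pod -A --sort-by=memory          # all namespaces, sorted by memory
kubectl top pod -n kube-system --containers  # per-container breakdown inside each pod
```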
Going further

kubectl top is convenient to use, but the information it presents is limited. The Kubernetes sig-cli community provides kubectl along with a series of related tools; among them, the resource-capacity plugin for krew (the kubectl plugin manager) offers multi-angle analysis of cluster resource usage.
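Assuming krew is already installed, the plugin can be installed and tried as follows (output columns depend on the plugin version):

```shell
kubectl krew install resource-capacity
# Show requests/limits alongside actual utilization, per pod
kubectl resource-capacity --pods --util
```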