metrics-server

Installation

Metrics Server can be installed directly from a YAML manifest, or via the official Metrics Server Helm chart.

Installing from YAML

  • Run the following command to install the latest Metrics Server release:

Install Metrics Server from the YAML manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

The installation output shows:

Output of installing Metrics Server from the YAML manifest
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
  • Check the running pod:

Check the installed metrics-server
kubectl -n kube-system get pods -o wide | grep metrics-server

You can see a single metrics-server instance running on one of the control-plane nodes (note that the READY column shows 0/1, i.e. the pod is not passing its readiness probe yet):

The installed metrics-server runs one instance
metrics-server-d9694457-r9tzk             0/1     Running   0                 10m    10.233.93.204    y-k8s-m-3   <none>           <none>

Installing with Helm

  • Run the following commands to install Metrics Server from the Helm chart:

Install Metrics Server with Helm
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server
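If the kubelets in your cluster serve certificates that Metrics Server cannot validate (see the troubleshooting section below), the chart allows passing extra container arguments. A minimal sketch, assuming the chart exposes an `args` value (verify against the chart's values for your version) and that installing into kube-system is what you want:

```shell
# Sketch: install Metrics Server into kube-system and skip kubelet
# certificate verification (NOT recommended for production)
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --set args={--kubelet-insecure-tls}
```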

High-availability deployment

By setting replicas to a value greater than 1, Metrics Server can be deployed in high-availability mode from either the YAML manifests or the Helm chart:

  • For Kubernetes v1.21+:

Deploy a highly available Metrics Server on Kubernetes v1.21+
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml

  • For Kubernetes v1.19-1.21:

Deploy a highly available Metrics Server on Kubernetes v1.19-1.21
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability.yaml
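After applying one of the manifests above, you can confirm that more than one replica is running. The label selector below assumes the `k8s-app: metrics-server` label used by the upstream manifests:

```shell
# Verify the Deployment scaled beyond one replica and list its pods
kubectl -n kube-system get deployment metrics-server
kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide
```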

kubectl top

Once Metrics Server is deployed, you can use kubectl top to observe how Nodes and Pods are doing:

  • Check node load:

Use kubectl top to observe Node load
kubectl top node

Troubleshooting

Running kubectl top node, I hit an error:

kubectl top node reports an error
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)

Check the metrics-server pod logs:

Check the metrics-server logs
kubectl -n kube-system logs metrics-server-d9694457-r9tzk

The output shows:

The metrics-server logs show certificate errors causing scrape failures
I0913 06:21:52.663592       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
E0913 06:21:54.203213       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.117:10250/metrics/resource\": x509: certificate has expired or is not yet valid: current time 2023-09-13T06:21:54Z is after 2022-12-22T23:45:32Z" node="y-k8s-m-2"
I0913 06:21:54.207331       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0913 06:21:54.207374       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0913 06:21:54.207447       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0913 06:21:54.207501       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0913 06:21:54.207629       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0913 06:21:54.207662       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0913 06:21:54.208297       1 secure_serving.go:267] Serving securely on [::]:4443
I0913 06:21:54.208390       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0913 06:21:54.208422       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0913 06:21:54.208762       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
E0913 06:21:54.208992       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.120:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.120 because it doesn't contain any IP SANs" node="y-k8s-n-2"
E0913 06:21:54.219369       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.116:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.116 because it doesn't contain any IP SANs" node="y-k8s-m-1"
E0913 06:21:54.226210       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.8.119:10250/metrics/resource\": x509: cannot validate certificate for 192.168.8.119 because it doesn't contain any IP SANs" node="y-k8s-n-1"
...

As you can see, scraping fails because the serving certificates of the monitored nodes cannot be validated.

The metrics-server (GitHub) documentation describes a relevant runtime flag:

  • The --kubelet-insecure-tls flag skips verifying the CA of the serving certificates presented by Kubelets; it is not recommended for production use.

In Metrics server throwing X509 error in logs, fails to return Node or Pod metrics #1025, the community's answer is that this happens because some k8s (dev/non-prod) distributions, such as minikube, do not provision certificates that allow secure communication with the Kubelet, so --kubelet-insecure-tls has to be added to the Metrics Server arguments.

metrics-server (GitHub) refuses to enable this flag by default and instead asks the k8s distributions to fix the issue. My cluster was quickly deployed with Kubespray, so the certificates there may indeed be problematic.

Manually edit the deployment with kubectl -n kube-system edit deployment metrics-server and add the flag as follows:

Edit metrics-server
...
     spec:
       containers:
       - args:
         - --kubelet-insecure-tls
         - --cert-dir=/tmp
         - --secure-port=4443
         - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
         - --kubelet-use-node-status-port
         - --metric-resolution=15s
         image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
         imagePullPolicy: IfNotPresent
...
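The same change can also be applied non-interactively with a JSON patch; a sketch, assuming the metrics-server container is the first container in the pod template and already has an args array:

```shell
# Append --kubelet-insecure-tls to the container's existing args
kubectl -n kube-system patch deployment metrics-server --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```

The Deployment controller then rolls out a new pod with the added flag.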

Usage

  • A basic check of the Nodes:

Use kubectl top to observe Node load
kubectl top node

The output looks similar to:

Example output of observing Node load with kubectl top
NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
y-k8s-m-1   757m         18%    2787Mi          10%
y-k8s-m-2   841m         21%    3155Mi          11%
y-k8s-m-3   741m         18%    3731Mi          14%
y-k8s-n-1   583m         29%    4884Mi          63%
y-k8s-n-2   472m         23%    4258Mi          55%

Note: a CPU value such as 757m shown by top means 0.757 CPU (1000m equals 1 CPU).
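The millicore notation can be converted to whole CPUs with a quick one-liner, e.g.:

```shell
# Convert a kubectl top millicore value (e.g. "757m") to CPUs
echo "757m" | awk '{ sub(/m$/, "", $1); printf "%.3f CPU\n", $1 / 1000 }'
# prints: 0.757 CPU
```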

  • Check pods:

Use kubectl top to observe pods
kubectl top pod

You can sort by cpu (with no namespace specified, the default namespace is used):

Use kubectl top to sort pods by cpu (default namespace)
kubectl top pod --sort-by=cpu
Output of sorting pods by cpu (default namespace)
NAME                              CPU(cores)   MEMORY(bytes)
productpage-v1-58b4c9bff8-ksq9s   27m          107Mi
reviews-v3-589cb4d56c-z2j76       10m          165Mi
reviews-v2-5d99885bc9-jnzvq       8m           185Mi
reviews-v1-5896f547f5-8xpvj       7m           157Mi
ratings-v1-b8f8fcf49-snvfr        6m           53Mi
details-v1-6997d94bb9-l2cjl       5m           65Mi

  • Observe a specific namespace, sorted by cpu:

Use kubectl top to sort pods in the kube-system namespace by cpu
kubectl top pod --sort-by=cpu -n kube-system
Output of sorting pods in the kube-system namespace by cpu
NAME                                      CPU(cores)   MEMORY(bytes)
kube-apiserver-y-k8s-m-3                  477m         1133Mi
kube-apiserver-y-k8s-m-2                  172m         1047Mi
kube-apiserver-y-k8s-m-1                  145m         932Mi
calico-node-bcwb7                         70m          124Mi
calico-node-m7mfs                         68m          128Mi
...
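kubectl top pod also supports sorting by memory and breaking usage down per container:

```shell
# Sort pods by memory instead of cpu
kubectl top pod --sort-by=memory -n kube-system

# Show usage for each container within the pods
kubectl top pod --containers -n kube-system
```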

Going further

Although kubectl top is convenient, the information it presents is limited in scope. The Kubernetes community sig-cli provides kubectl along with a series of related tools; among them, the resource-capacity plugin for krew (the kubectl plugin manager) offers a multi-angle analysis of cluster resource usage.
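Assuming krew is already installed, the plugin can be set up and run as follows (the --util flag pulls live utilization from Metrics Server; treat the exact flags as an assumption and check the plugin's help output):

```shell
# Install the resource-capacity plugin, then compare requests/limits
# against actual utilization reported by Metrics Server
kubectl krew install resource-capacity
kubectl resource-capacity --util --pods
```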

References