apiserver服务器无法连接端口2379

在升级了 ARM部署Kubernetes 管控服务器,意外发现 apiserver 运行异常:

kubectl get pods --all-namespaces -o wide

显示:

NAMESPACE     NAME                                 READY   STATUS                 RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
...
kube-system   kube-apiserver-pi-master1            0/1     CreateContainerError   23         110d    192.168.6.11   pi-master1   <none>           <none>
...
kube-system   kube-flannel-ds-arm64-dwq86          0/1     Evicted                0          89s     <none>         jetson       <none>           <none>
...
kube-system   kube-flannel-ds-jcbhq                1/1     Running                0          6h50m   192.168.6.10   jetson       <none>           <none>
...
kube-system   kube-proxy-62sgn                     1/1     Running                0          6h50m   192.168.6.10   jetson       <none>           <none>

不过管控服务器 pi-master1 上其他服务运行正常。

排查

  • 检查 apiserver 日志:

    kubectl logs kube-apiserver-pi-master1 -n kube-system
    

日志显示:

Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0312 03:29:07.332584       1 server.go:625] external host was not specified, using 192.168.6.11
I0312 03:29:07.333502       1 server.go:163] Version: v1.19.4
I0312 03:29:08.642808       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0312 03:29:08.642890       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0312 03:29:08.646102       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0312 03:29:08.646183       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0312 03:29:08.652661       1 client.go:360] parsed scheme: "endpoint"
I0312 03:29:08.652792       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
I0312 03:29:08.681074       1 client.go:360] parsed scheme: "endpoint"
I0312 03:29:08.681309       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
I0312 03:29:08.710556       1 client.go:360] parsed scheme: "passthrough"
I0312 03:29:08.710685       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://127.0.0.1:2379  <nil> 0 <nil>}] <nil> <nil>}
I0312 03:29:08.710716       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0312 03:29:08.713323       1 client.go:360] parsed scheme: "endpoint"
I0312 03:29:08.713425       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
...
...
E0312 03:29:19.225719       1 controller.go:152] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/192.168.6.11, ResourceVersion: 0, AdditionalErrorMsg:
...
...
W0312 03:48:55.984001       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
...
...
I0312 03:48:55.986358       1 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
...
...
I0312 03:48:55.993287       1 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
W0312 03:48:55.993714       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I0312 03:48:55.996428       1 controller.go:181] Shutting down kubernetes service endpoint reconciler
I0312 03:48:55.997715       1 controller.go:123] Shutting down OpenAPI controller
...

看起来最初的报错是有关 StorageError ,找不到key文件 /registry/masterleases/192.168.6.11

  • 观察 dmesg -T 输出,发现一个奇怪的报错:

    ...
    [Mon Mar 22 11:01:52 2021] RPC: fragment too large: 1246907721
    
  • 但是,既然 apiserver 创建失败,但是为何我还能使用 kubectl 命令来访问 apiserver 呢?例如,我使用:

    kubectl get nodes
    kubectl get pods -o wide -n kubesystem
    kubectl get pods -o wide -n kube-verify
    

上述命令都能够正常工作,并且我在回复 部署ARM架构Kubernetes 的Jetson节点时候,完全顺利实现了 从Kubernetes集群删除节点 并重新加回 Jetson 节点。

etcd错误

  • 检查 ps aux | grep apiserver 就可以看到,访问端口 2379 其实就是访问 etcd 服务:

    kube-apiserver --advertise-address=192.168.6.11 ... --etcd-servers=https://127.0.0.1:2379 ....
    

这说明apiserver访问etcd出现错误,很可能就是 etcd 错误导致的。

  • 再次检查 etcd 这个pod:

    kubectl get pods -n kube-system -o wide
    

看到:

NAME                                 READY   STATUS                 RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
...
etcd-pi-master1                      1/1     Running                18         110d    192.168.6.11   pi-master1   <none>           <none>
...
  • 检查 etcd 日志,并没有发现异常错误

kubelet错误

kubenetes #74262 提到可以检查kubelet日志:

journalctl -xeu kubelet

注意, journalctl 参数 -xeu 则日志不换行,有些过长日志不方便检查。可以加上 --no-pager 则直接屏幕打印,可以重定向到 |  vim - 检查

我的日志输出确实存在大量报错:

Mar 22 21:05:33 pi-master1 kubelet[1115805]: E0322 21:05:33.780814 1115805 pod_workers.go:191] Error syncing pod 1403dda53affdfee1130475a6d9609da ("kube-apiserver-pi-master1_kube-system(1403dda53affdfee1130475a6d9609da)"), skipping: failed to "StartContainer" for "kube-apiserver" with CreateContainerError: "Error response from daemon: Conflict. The container name \"/k8s_kube-apiserver_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_24\" is already in use by
container \"eced8537863007db9a915924785306b26238cabbd3feec0e5226e2ad0710807b\". You have to remove (or rename) that container to be able to reuse that name."

原来启动容器命名出现了重名,之前的 kube-apiserver-pi-master1 尚未清理导致无法启动相同名字的容器

  • 检查 docker ps --all 查看是否有同名残留:

    docker ps --all | grep kube-apiserver-pi-master1
    

可以看到:

8470235abd66        4958a2396b16                 "kube-apiserver --ad…"   10 days ago         Exited (0) 10 days ago                         k8s_kube-apiserver_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_23
eced85378630        4958a2396b16                 "kube-apiserver --ad…"   10 days ago         Up 10 days                                     k8s_kube-apiserver_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_24
07637544b672        k8s.gcr.io/pause:3.2         "/pause"                 10 days ago         Up 10 days                                     k8s_POD_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_17
46497b3b5724        k8s.gcr.io/pause:3.2         "/pause"                 10 days ago         Exited (0) 10 days ago                         k8s_POD_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_15
  • 尝试删除已经退出的容器:

    sudo docker rm 8470235abd66
    sudo docker rm 46497b3b5724
    
  • 果然,现在 apiserver 容器能够正常创建了:

    kubectl get pods -o wide -n kube-system
    

显示正常运行:

NAME                                 READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
...
kube-apiserver-pi-master1            1/1     Running   24         4h8m   192.168.6.11   pi-master1   <none>           <none>
  • 再次检查系统中存在退出但是没有清理的docker容器:

    sudo docker ps --all | grep Exited
    
  • 然后检查确定后做清理:

    sudo docker ps --all | grep Exited | awk '{print $1}' | xargs sudo docker rm
    

推测

我注意到kubernetes升级,下一个 kubernetes apiserver 容器命名是后缀是依次递增的,这点可以从 docker ps --all 看到:

k8s_kube-apiserver_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_23
k8s_kube-apiserver_kube-apiserver-pi-master1_kube-system_1403dda53affdfee1130475a6d9609da_24

也就是说如果有一个升级过程中 kube-apiserver...24 意外退出,则升级程序还会重新尝试用同一个容器名创建启动,这就在docker上导致冲突

看来连接 etcd 端口失败仅仅是表象,实际排查故障还是需要从 kubelet 日志着手。