Troubleshooting a failed kind cluster creation

On Asahi Linux (ARM architecture), I ran the following to create a kind multi-node cluster:

Create a kind cluster with 3 control-plane nodes and 5 worker nodes
export CLUSTER_NAME=dev
export reg_name='kind-registry'
kind create cluster --name "${CLUSTER_NAME}" --config kind-config.yaml
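
The kind-config.yaml referenced above is not reproduced here. As a rough sketch (only the 3 control-plane / 5 worker layout comes from the text; the exact file contents are my assumption), it would look something like this:

Sketch of kind-config.yaml (assumed contents)
cat << EOF > kind-config.yaml
# Assumed layout: three control-plane nodes and five workers,
# matching the eight "Preparing nodes" boxes in the output below
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
- role: worker
- role: worker
EOF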

This run hit an error I had never seen on CentOS 8 (x86_64): initialization of the control plane failed at startup:

kind cluster creation failure output
Creating cluster "dev" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦  
 ✓ Configuring the external load balancer ⚖ 
 ✓ Writing configuration 📜 
 ✗ Starting control-plane 🕹 
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged dev-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I1115 13:01:54.253925     148 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W1115 13:01:54.254738     148 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.25.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I1115 13:01:54.258151     148 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I1115 13:01:54.361535     148 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [dev-control-plane dev-external-load-balancer kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.8 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I1115 13:01:54.542204     148 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I1115 13:01:54.775997     148 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I1115 13:01:54.892570     148 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I1115 13:01:54.941145     148 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.8 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.8 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I1115 13:01:55.518737     148 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I1115 13:01:55.847113     148 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I1115 13:01:55.946026     148 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I1115 13:01:56.108316     148 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I1115 13:01:56.482291     148 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I1115 13:01:56.853648     148 kubelet.go:66] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I1115 13:01:56.931016     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.931206     148 certs.go:522] validating certificate period for CA certificate
I1115 13:01:56.931237     148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I1115 13:01:56.931241     148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I1115 13:01:56.931243     148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I1115 13:01:56.931245     148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I1115 13:01:56.931248     148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I1115 13:01:56.932437     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I1115 13:01:56.932447     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.932542     148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I1115 13:01:56.932547     148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.932549     148 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I1115 13:01:56.932551     148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I1115 13:01:56.932554     148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I1115 13:01:56.932556     148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.932559     148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.936399     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I1115 13:01:56.936466     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.937196     148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I1115 13:01:56.940429     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I1115 13:01:56.942909     148 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I1115 13:01:56.942940     148 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I1115 13:01:56.945122     148 loader.go:374] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I1115 13:01:56.957073     148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
I1115 13:01:57.957254     148 with_retry.go:242] Got a Retry-After 1s response for attempt 1 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:01:57.958756     148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
I1115 13:01:58.959139     148 with_retry.go:242] Got a Retry-After 1s response for attempt 2 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:01:58.959879     148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 0 milliseconds
...
I1115 13:06:17.506671     148 with_retry.go:242] Got a Retry-After 1s response for attempt 9 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:06:17.508007     148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_arm64.s:1172
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_arm64.s:1172

The hint here suggests troubleshooting with commands such as systemctl status kubelet, but kind runs its nodes as docker-in-docker containers, so there is no kubelet on the physical host; the check has to be run from inside the first-layer docker container, i.e. the kind node container.
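
For example, the node containers can be located on the host and entered like this (an illustration based on the <cluster>-control-plane naming visible in the logs, not part of the original write-up; note that kind tears the node containers down after a failed create unless --retain is passed, so this must be done while the create is still running or with --retain):

Locate and enter a kind node container
# On the physical host: list this cluster's kind node containers
docker ps --filter "name=${CLUSTER_NAME}"
# Open a shell inside the first control-plane node container
docker exec -it "${CLUSTER_NAME}-control-plane" bash
# Inside the node container: inspect the kubelet
systemctl status kubelet
journalctl -xeu kubelet --no-pager | tail -n 50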

Note the error at startup:

Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

The node image in use is kindest/node:v1.25.3.

Troubleshooting steps

  • Run the kind create cluster command again, and while the node container is up, immediately run docker exec -it <CONTAINER_ID> /bin/bash to get a shell inside it and check:

    systemctl status kubelet
    

This shows the following error:

Error output of systemctl status kubelet in the kind node container
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Tue 2022-11-15 13:43:07 UTC; 23s ago
       Docs: http://kubernetes.io/docs/
    Process: 267 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then create-kubelet-cgroup-v2; fi (code=exited, status=0/SUCCESS)
    Process: 275 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
   Main PID: 276 (kubelet)
      Tasks: 21 (limit: 5732)
     Memory: 31.4M
        CPU: 483ms
     CGroup: /kubelet.slice/kubelet.service
             └─276 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.2 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.8 --provider-id=kind://docker/kind/kind-control-plane --fail-swap-on=false --cgroup-root=/kubelet

Nov 15 13:43:30 kind-control-plane kubelet[276]: E1115 13:43:30.815420     276 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-kind-control-plane_kube-system(6d3dda2cad9846e0d48dbd5d5b9f59fc)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-kind-control-plane_kube-system(6d3dda2cad9846e0d48dbd5d5b9f59fc)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \\\"/pause\\\": stat /pause: operation not supported: unknown\"" pod="kube-system/kube-scheduler-kind-control-plane" podUID=6d3dda2cad9846e0d48dbd5d5b9f59fc
Nov 15 13:43:30 kind-control-plane kubelet[276]: E1115 13:43:30.907655     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.008610     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.043588     276 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kind-control-plane.1727c5e81e5fd8b3", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kind-control-plane", UID:"kind-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"kind-control-plane"}, FirstTimestamp:time.Date(2022, time.November, 15, 13, 43, 7, 696740531, time.Local), LastTimestamp:time.Date(2022, time.November, 15, 13, 43, 7, 696740531, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://kind-control-plane:6443/api/v1/namespaces/default/events": dial tcp 172.18.0.2:6443: connect: connection refused'(may retry after sleeping)
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.108853     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.209101     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.309521     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.409881     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.510402     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.611372     276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"

Here we can see that pod sandbox creation fails at startup (Failed to create sandbox for pod), and the cause is: runc create failed: unable to start container process: exec: "/pause": stat /pause: operation not supported: unknown.
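
The same failure can also be cross-checked from inside the node container with the crictl commands that the kubeadm output itself suggests above (CONTAINERID is a placeholder; the crictl pods line is my addition):

Cross-check with crictl inside the node container
# List Kubernetes containers through containerd's CRI socket
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause
# Inspect the logs of a failing container
crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID
# The /pause failure happens at sandbox creation, so the pod sandboxes are worth a look too
crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods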

Root cause analysis

This problem is related to the ARM architecture. See the workaround proposed in Failed to Create Cluster on M1 #2448:

As a workaround, having a Dockerfile:

FROM --platform=arm64 kindest/node:v1.21.1
RUN arch

building it:

docker build -t tempkind .

and using that image:

kind create cluster --image tempkind

Following that approach, I took the steps below:

  • Build an image pinned to the ARM platform architecture:

Build an ARM-architecture kind node image
cat << EOF > Dockerfile
FROM --platform=arm64 kindest/node:v1.25.3
RUN arch
EOF

docker build -t kindest/node:v1.25.3-arm64 .
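
To double-check that the rebuilt image really is arm64 before using it (this verification step is my addition, not in the original):

Verify the image architecture
docker image inspect kindest/node:v1.25.3-arm64 --format '{{.Os}}/{{.Architecture}}'
# expected output: linux/arm64
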
  • At this point, docker images shows the following images:

ARM-specific image
REPOSITORY        TAG                  IMAGE ID       CREATED             SIZE
kindest/node      v1.25.3-arm64        c94e1357ad6f   About an hour ago   824MB
kindest/node      v1.25.3              aa7084fa36af   13 days ago         824MB
...

  • Then create the cluster with this ARM64-specific image:

Create the kind cluster with the ARM image specified
kind create cluster --name dev --config kind-config.yaml --image kindest/node:v1.25.3-arm64
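
If the workaround takes effect, the create should get past the Starting control-plane step. A quick sanity check afterwards (my addition; kind names the kubeconfig context kind-<cluster name>, so kind-dev here):

Verify the cluster after creation
kubectl cluster-info --context kind-dev
# Nodes should become Ready and report arm64 via the kubernetes.io/arch label
kubectl get nodes -L kubernetes.io/arch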