Troubleshooting a kind cluster creation failure¶
On Asahi Linux (ARM architecture) I ran the following to create a kind multi-node cluster:
export CLUSTER_NAME=dev
export reg_name='kind-registry'
kind create cluster --name "${CLUSTER_NAME}" --config kind-config.yaml
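The kind-config.yaml referenced above is not reproduced in this note. For context only, a multi-node layout that matches the output below (an external load balancer plus eight nodes) could look roughly like this; the roles and counts here are an assumption, not the actual file:
cat << EOF > kind-config.yaml
# Hypothetical layout only -- the real kind-config.yaml is not shown in this note.
# Three control-plane nodes make kind provision an external load balancer;
# five workers bring the total to the eight nodes seen in the output below.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
- role: worker
- role: worker
EOF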
This failed with an error I had never hit on the CentOS 8 (x86_64) platform (control-plane initialization fails at startup):
Creating cluster "dev" ...
✓ Ensuring node image (kindest/node:v1.25.3) 🖼
✓ Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦
✓ Configuring the external load balancer ⚖
✓ Writing configuration 📜
✗ Starting control-plane 🕹
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged dev-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I1115 13:01:54.253925 148 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W1115 13:01:54.254738 148 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.25.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I1115 13:01:54.258151 148 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I1115 13:01:54.361535 148 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [dev-control-plane dev-external-load-balancer kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.8 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I1115 13:01:54.542204 148 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I1115 13:01:54.775997 148 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I1115 13:01:54.892570 148 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I1115 13:01:54.941145 148 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.8 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.8 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I1115 13:01:55.518737 148 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I1115 13:01:55.847113 148 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I1115 13:01:55.946026 148 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I1115 13:01:56.108316 148 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I1115 13:01:56.482291 148 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I1115 13:01:56.853648 148 kubelet.go:66] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I1115 13:01:56.931016 148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.931206 148 certs.go:522] validating certificate period for CA certificate
I1115 13:01:56.931237 148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I1115 13:01:56.931241 148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I1115 13:01:56.931243 148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I1115 13:01:56.931245 148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I1115 13:01:56.931248 148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I1115 13:01:56.932437 148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I1115 13:01:56.932447 148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.932542 148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I1115 13:01:56.932547 148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.932549 148 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I1115 13:01:56.932551 148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I1115 13:01:56.932554 148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I1115 13:01:56.932556 148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.932559 148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I1115 13:01:56.936399 148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I1115 13:01:56.936466 148 manifests.go:99] [control-plane] getting StaticPodSpecs
I1115 13:01:56.937196 148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I1115 13:01:56.940429 148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I1115 13:01:56.942909 148 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I1115 13:01:56.942940 148 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I1115 13:01:56.945122 148 loader.go:374] Config loaded from file: /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I1115 13:01:56.957073 148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
I1115 13:01:57.957254 148 with_retry.go:242] Got a Retry-After 1s response for attempt 1 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:01:57.958756 148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
I1115 13:01:58.959139 148 with_retry.go:242] Got a Retry-After 1s response for attempt 2 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:01:58.959879 148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 0 milliseconds
...
I1115 13:06:17.506671 148 with_retry.go:242] Got a Retry-After 1s response for attempt 9 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I1115 13:06:17.508007 148 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_arm64.s:1172
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_arm64.s:1172
The hint here is to troubleshoot with commands such as systemctl status kubelet, but kind runs as docker_in_docker containers, i.e. there is no kubelet on the physical host; you have to enter the first-layer Docker container (the node container) to run these checks.
Note the startup error (this time from a single-node cluster with the default name kind):
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.25.3) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✗ Starting control-plane 🕹
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
The node image in use is kindest/node:v1.25.3.
Troubleshooting steps¶
Run the kind create cluster command again and, while the node container is running, immediately execute docker exec -it <CONTAINER_ID> /bin/bash to log in to the container and check: systemctl status kubelet
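A minimal version of this check, assuming the default cluster whose control-plane container is named kind-control-plane (as in the error output above), could be:
# list the kind node containers
docker ps --filter "name=control-plane"
# open a shell inside the first-layer node container
docker exec -it kind-control-plane /bin/bash
# inside the node container, inspect the kubelet
systemctl status kubelet
journalctl -xeu kubelet
If the failed node containers are cleaned up before you can get in, kind create cluster --retain keeps them around for debugging.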
This shows the following error messages:
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2022-11-15 13:43:07 UTC; 23s ago
Docs: http://kubernetes.io/docs/
Process: 267 ExecStartPre=/bin/sh -euc if [ -f /sys/fs/cgroup/cgroup.controllers ]; then create-kubelet-cgroup-v2; fi (code=exited, status=0/SUCCESS)
Process: 275 ExecStartPre=/bin/sh -euc if [ ! -f /sys/fs/cgroup/cgroup.controllers ] && [ ! -d /sys/fs/cgroup/systemd/kubelet ]; then mkdir -p /sys/fs/cgroup/systemd/kubelet; fi (code=exited, status=0/SUCCESS)
Main PID: 276 (kubelet)
Tasks: 21 (limit: 5732)
Memory: 31.4M
CPU: 483ms
CGroup: /kubelet.slice/kubelet.service
└─276 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --node-ip=172.18.0.2 --node-labels= --pod-infra-container-image=registry.k8s.io/pause:3.8 --provider-id=kind://docker/kind/kind-control-plane --fail-swap-on=false --cgroup-root=/kubelet
Nov 15 13:43:30 kind-control-plane kubelet[276]: E1115 13:43:30.815420 276 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-scheduler-kind-control-plane_kube-system(6d3dda2cad9846e0d48dbd5d5b9f59fc)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-scheduler-kind-control-plane_kube-system(6d3dda2cad9846e0d48dbd5d5b9f59fc)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \\\"/pause\\\": stat /pause: operation not supported: unknown\"" pod="kube-system/kube-scheduler-kind-control-plane" podUID=6d3dda2cad9846e0d48dbd5d5b9f59fc
Nov 15 13:43:30 kind-control-plane kubelet[276]: E1115 13:43:30.907655 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.008610 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.043588 276 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kind-control-plane.1727c5e81e5fd8b3", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kind-control-plane", UID:"kind-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"kind-control-plane"}, FirstTimestamp:time.Date(2022, time.November, 15, 13, 43, 7, 696740531, time.Local), LastTimestamp:time.Date(2022, time.November, 15, 13, 43, 7, 696740531, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://kind-control-plane:6443/api/v1/namespaces/default/events": dial tcp 172.18.0.2:6443: connect: connection refused'(may retry after sleeping)
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.108853 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.209101 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.309521 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.409881 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.510402 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Nov 15 13:43:31 kind-control-plane kubelet[276]: E1115 13:43:31.611372 276 kubelet.go:2448] "Error getting node" err="node \"kind-control-plane\" not found"
Here we can see Failed to create sandbox for pod, caused by runc create failed: unable to start container process: exec: "/pause": stat /pause: operation not supported: unknown.
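As a cross-check from the container runtime side, the crictl commands suggested in the kubeadm output can also be run from the host through docker exec (a sketch, again assuming the default node container name):
# list all containers known to containerd inside the node; since sandbox creation fails, the kube pods may not show up at all
docker exec -it kind-control-plane crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
# once a failing container is identified, inspect its logs
docker exec -it kind-control-plane crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID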
Root cause analysis¶
This problem is related to the ARM architecture. See the workaround proposed in Failed to Create Cluster on M1 #2448:
As a workaround, having a Dockerfile:
FROM --platform=arm64 kindest/node:v1.21.1
RUN arch
building it:
docker build -t tempkind .
and using that image:
kind create cluster --image tempkind
Following that workaround, I performed the steps below.
Build a node image that explicitly targets the ARM platform:
cat << EOF > Dockerfile
FROM --platform=arm64 kindest/node:v1.25.3
RUN arch
EOF
docker build -t kindest/node:v1.25.3-arm64 .
Running docker images at this point shows the following images:
REPOSITORY TAG IMAGE ID CREATED SIZE
kindest/node v1.25.3-arm64 c94e1357ad6f About an hour ago 824MB
kindest/node v1.25.3 aa7084fa36af 13 days ago 824MB
...
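Optionally, before using the rebuilt image, confirm that it really is arm64 (a quick check using standard Docker CLI inspect templating):
# should print linux/arm64
docker image inspect --format '{{.Os}}/{{.Architecture}}' kindest/node:v1.25.3-arm64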
Then create the cluster from this ARM64-specific image:
kind create cluster --name dev --config kind-config.yaml --image kindest/node:v1.25.3-arm64
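If the workaround takes effect, cluster creation should now complete. A quick sanity check, assuming kubectl is installed (kind names the kubeconfig context kind-<cluster-name>, here kind-dev):
# confirm the API server is reachable and all nodes register
kubectl cluster-info --context kind-dev
kubectl get nodes -o wide --context kind-dev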