Troubleshooting kind creation failure on the X86 mobile cloud

Note

The reason my kind creation failed is the ZFS setup of the X86 mobile cloud: the underlying physical host uses the Docker ZFS storage driver. This requires that zfsutils-linux also be installed in the kind base image, otherwise handling the container filesystem fails. This bug was only fixed on November 1, 2022, so the latest git version has to be used instead of a release version (I was deploying in mid-January 2023).
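
A quick way to confirm that a host is affected is to query the storage driver that docker reports (a minimal check, assuming the docker CLI is available on the physical host):
# Print the storage driver used by the docker daemon; "zfs" means this issue applies
docker info --format '{{.Driver}}'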

The creation steps for kind on the X86 mobile cloud (a local docker-based simulated k8s cluster) are the same as for a kind multi-node cluster, but when running:

kind configuration for a cluster with 3 control-plane nodes and 5 worker nodes
export CLUSTER_NAME=dev
export reg_name='kind-registry'
kind create cluster --name "${CLUSTER_NAME}" --config kind-config.yaml
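
For reference, the kind-config.yaml used above is not reproduced in this note; a minimal sketch of a 3 control-plane / 5 worker layout (assuming the standard kind v1alpha4 config format) could look like this:
cat > kind-config.yaml <<'EOF'
# 3 control-plane nodes + 5 worker nodes
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
- role: worker
- role: worker
EOF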

Running the kind create command above produces the following error:

kind cluster creation times out while starting the control-plane node
Creating cluster "dev" ...
 ✓ Ensuring node image (kindest/node:v1.25.3) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦
 ✓ Configuring the external load balancer ⚖️
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged dev-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0118 06:47:07.363603     149 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W0118 06:47:07.365525     149 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.25.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0118 06:47:07.372106     149 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0118 06:47:07.451124     149 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [dev-control-plane dev-external-load-balancer kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.7 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0118 06:47:08.001760     149 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0118 06:47:08.092062     149 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0118 06:47:08.456417     149 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0118 06:47:08.658389     149 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.7 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.7 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0118 06:47:09.885620     149 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0118 06:47:10.065065     149 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0118 06:47:10.270102     149 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0118 06:47:10.343102     149 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0118 06:47:10.476121     149 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0118 06:47:10.615067     149 kubelet.go:66] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0118 06:47:10.772447     149 manifests.go:99] [control-plane] getting StaticPodSpecs
I0118 06:47:10.773090     149 certs.go:522] validating certificate period for CA certificate
I0118 06:47:10.773385     149 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0118 06:47:10.773469     149 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.773505     149 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0118 06:47:10.773532     149 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.773557     149 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.776092     149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0118 06:47:10.776109     149 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0118 06:47:10.776289     149 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0118 06:47:10.776297     149 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0118 06:47:10.776305     149 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0118 06:47:10.776312     149 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0118 06:47:10.776317     149 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0118 06:47:10.776321     149 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0118 06:47:10.776325     149 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0118 06:47:10.776940     149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0118 06:47:10.776950     149 manifests.go:99] [control-plane] getting StaticPodSpecs
I0118 06:47:10.777122     149 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0118 06:47:10.777526     149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0118 06:47:10.778158     149 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0118 06:47:10.778195     149 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0118 06:47:10.778751     149 loader.go:374] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0118 06:47:10.783452     149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 2 milliseconds
I0118 06:47:11.784001     149 with_retry.go:242] Got a Retry-After 1s response for attempt 1 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:47:11.786115     149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
I0118 06:47:12.786448     149 with_retry.go:242] Got a Retry-After 1s response for attempt 2 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:47:12.787568     149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 1 milliseconds
...
I0118 06:51:30.364631     149 with_retry.go:242] Got a Retry-After 1s response for attempt 8 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:51:30.367126     149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 2 milliseconds
I0118 06:51:31.367356     149 with_retry.go:242] Got a Retry-After 1s response for attempt 9 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:51:31.368374     149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s  in 0 milliseconds

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1594
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1594

In addition, the physical host's system log contains a large number of audit records, presumably related to systemd running inside the containers (the massive repetition looks abnormal):

The physical host's dmesg shows a large number of audit messages related to runc
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2215): arch=c000003e syscall=321 success=yes exit=28 a0=5 a1=7ffed8a26de0 a2=78 a3=7ffed8a26de0 items=0 ppid=24008 pid=24095 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="systemd" exe="/usr/lib/systemd/systemd" key=(null)
[Wed Jan 18 14:49:39 2023] audit: type=1327 audit(1674024579.870:2215): proctitle="/sbin/init"
[Wed Jan 18 14:49:39 2023] audit: type=1334 audit(1674024579.870:2216): prog-id=690 op=LOAD
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2216): arch=c000003e syscall=321 success=yes exit=16 a0=5 a1=c0001439d8 a2=78 a3=c0003097e0 items=0 ppid=30761 pid=30773 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
[Wed Jan 18 14:49:39 2023] audit: type=1327 audit(1674024579.870:2216): proctitle=72756E63002D2D726F6F74002F72756E2F636F6E7461696E6572642F72756E632F6B38732E696F002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6B38732E696F2F3665643565383762386162343234633235663038646465386435326239
[Wed Jan 18 14:49:39 2023] audit: type=1334 audit(1674024579.870:2217): prog-id=691 op=LOAD
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2217): arch=c000003e syscall=321 success=yes exit=18 a0=5 a1=c000143770 a2=78 a3=c000309828 items=0 ppid=30761 pid=30773 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)

Troubleshooting

Adding --retain -v 1 to kind create yields detailed information
kind create cluster --name dev --config kind-config.yaml --retain -v 1
kind export --name dev logs

It prints:

Exporting logs for cluster "dev" to:
/tmp/3866643061

The /tmp/3866643061 directory contains the log files of every node in the kind cluster
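
For example, the per-node sub-directories of the export can simply be listed to locate the interesting files (kubelet.log, journal.log, and so on):
ls /tmp/3866643061/
ls /tmp/3866643061/dev-control-plane/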

  • The /tmp/3866643061/dev-control-plane/kubelet.log log contains CNI initialization failure messages:

kubelet.log on the kind control-plane node shows failing certificate signing requests
$ cat kubelet.log | grep E0
Jan 18 07:43:50 dev-control-plane kubelet[193]: E0118 07:43:50.985166     193 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://dev-external-load-balancer:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": EOF
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.002200     193 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"dev-control-plane.173b577162e64787", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"dev-control-plane", UID:"dev-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"dev-control-plane"}, FirstTimestamp:time.Date(2023, time.January, 18, 7, 43, 51, 450951, time.Local), LastTimestamp:time.Date(2023, time.January, 18, 7, 43, 51, 450951, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://dev-external-load-balancer:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.007537     193 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.061065     193 kubelet.go:2034] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.105070     193 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.109113     193 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://dev-external-load-balancer:6443/api/v1/nodes\": EOF" node="dev-control-plane"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.161180     193 kubelet.go:2034] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.206135     193 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.255991     193 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubelet kubepods] doesn't exist"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.663207     273 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"dev-control-plane.173b5771c5f29376", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"dev-control-plane", UID:"dev-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"dev-control-plane"}, FirstTimestamp:time.Date(2023, time.January, 18, 7, 43, 52, 662201206, time.Local), LastTimestamp:time.Date(2023, time.January, 18, 7, 43, 52, 662201206, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://dev-external-load-balancer:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.664394     273 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.689512     273 kubelet.go:2034] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.736821     273 cri_stats_provider.go:452] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.764402     273 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.765647     273 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://dev-external-load-balancer:6443/api/v1/nodes\": EOF" node="dev-control-plane"
...
  • The docker containers have in fact started on the physical host:

    docker ps
    

The output shows:

CONTAINER ID   IMAGE                                COMMAND                  CREATED         STATUS         PORTS                       NAMES
19537801aa08   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes   127.0.0.1:46635->6443/tcp   dev-control-plane2
75f9a2d8dc9e   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes                               dev-worker3
bf960a2f24f5   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes   127.0.0.1:37711->6443/tcp   dev-control-plane
c81440eb69b3   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes                               dev-worker4
f2f81e25705f   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes                               dev-worker5
5d52f70acb69   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes                               dev-worker
acd0de1e4f4d   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes   127.0.0.1:41761->6443/tcp   dev-control-plane3
8369e5a5e853   kindest/node:v1.25.3                 "/usr/local/bin/entr…"   4 minutes ago   Up 4 minutes                               dev-worker2
f9ac6e6b606a   kindest/haproxy:v20220607-9a4d8d2a   "haproxy -sf 7 -W -d…"   4 minutes ago   Up 4 minutes   127.0.0.1:35931->6443/tcp   dev-external-load-balancer
  • Enter the dev-control-plane node to inspect it:

    docker exec -it bf960a2f24f5 /bin/bash
    

The disk mounts inside the underlying container are visible:

# df -h
Filesystem                                                                   Size  Used Avail Use% Mounted on
zpool-data/62fd3b4b5c3acb12b91336c2f358d369a5223f2657025f03a8b6b35a22f4d2ef  859G  504M  858G   1% /
tmpfs                                                                         64M     0   64M   0% /dev
shm                                                                           64M     0   64M   0% /dev/shm
zpool-data                                                                   860G  2.1G  858G   1% /var
tmpfs                                                                        7.8G  454M  7.4G   6% /run
tmpfs                                                                        7.8G     0  7.8G   0% /tmp
/dev/nvme0n1p2                                                                60G   27G   33G  45% /usr/lib/modules
tmpfs                                                                        5.0M     0  5.0M   0% /run/lock
  • Checking the journal.log of the dev-control-plane node reveals zfs filesystem errors:

journal.log on the kind control-plane node shows zfs-related errors inside the container (containerd already tries the zfs snapshotter plugin, but the zfs tools are missing in the container)
...
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144306425Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (zfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144339308Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144362667Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144379795Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144417134Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144733984Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.145156396Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="exec: \"zfs\": executable file not found in $PATH: \"zfs fs list -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset zpool-data\" => : skip plugin" type=io.containerd.snapshotter.v1
...

I checked inside the running dev-control-plane node and the zfs tools are indeed not installed, which shows that the latest kind version is required.
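
A minimal way to reproduce that check from the host (assuming the node container is still running) is:
# Returns nothing on the stock kindest/node:v1.25.3 image because the zfs userspace tools are missing
docker exec dev-control-plane which zfs || echo "zfs not installed in the node image"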

Fixing zfs

  • I tried adding the image parameter to request the latest image:

    kind create cluster --name dev --config kind-config.yaml --image kindest/node:latest
    

But it reports an error:

✗ Ensuring node image (kindest/node:latest) 🖼
ERROR: failed to create cluster: failed to pull image "kindest/node:latest": command "docker pull kindest/node:latest" failed with error: exit status 1
Command Output: Error response from daemon: manifest for kindest/node:latest not found: manifest unknown: manifest unknown
  • Since the kind GitHub repository has already fixed this issue, download the latest Dockerfile and build a local image. Note that this build requires BuildKit to be enabled, i.e. edit /etc/docker/daemon.json:

Modify /etc/docker/daemon.json to add the buildkit configuration
{
  "storage-driver": "zfs",
  "features": {
    "buildkit" : true
  }
}

Then restart the docker service.
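
A minimal restart-and-verify sequence (assuming docker is managed by systemd on this host) could be:
# Restart docker so that the new daemon.json takes effect
sudo systemctl restart docker

# Confirm the daemon is back and still using the zfs storage driver
systemctl is-active docker
docker info --format '{{.Driver}}'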

  • (Wrong: I assumed that building only the base image was enough, but in fact a complete kubernetes node image has to be built, see below.) Run the following script to fetch the latest Dockerfile and build the image

Build a node image containing the zfs tools (this approach is incorrect, please ignore it)
git clone git@github.com:jrwren/kind.git
cd kind/images/base/

# BuildKit must be used, otherwise the build fails with: the --chmod option requires BuildKit. Refer to https://docs.docker.com/go/buildkit/ to learn how to build images with BuildKit enabled
docker build -t kindest/node:v1.25.3-zfs .

Install the buildx plugin as follows (required, otherwise the build fails):

Install the buildx plugin for docker
mkdir -p ~/.docker/cli-plugins
cd ~/.docker/cli-plugins
wget -O docker-buildx https://github.com/docker/buildx/releases/download/v0.10.0/buildx-v0.10.0.linux-amd64
chmod 755 docker-buildx
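
Afterwards the plugin can be verified from the docker CLI (a quick sanity check):
docker buildx version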

Download the Kubernetes source code into $(go env GOPATH)/src/k8s.io/kubernetes, then build the image:

Download the Kubernetes source code and build the kind node image
mkdir -p $(go env GOPATH)/src/k8s.io
cd $(go env GOPATH)/src/k8s.io
git clone https://github.com/kubernetes/kubernetes

# Check out the current release v1.26.1, which is a git tag (read-only)
cd kubernetes
git checkout v1.26.1

# Run the build
kind build node-image
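
Once the build finishes, the result can be sanity-checked before recreating the cluster; kind build node-image tags the image as kindest/node:latest by default, and the --entrypoint override below is just a hypothetical spot-check for the zfs binary:
docker images kindest/node
docker run --rm --entrypoint which kindest/node:latest zfs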
  • Re-run the cluster creation:

kind configuration with 3 control-plane nodes and 5 worker nodes, using the freshly self-built image
kind create cluster --name dev --config kind-config.yaml --image kindest/node:latest

Note

While building the image, the container started during the build needs to reach github to download containerd, which is blocked by the GFW, so the proxy settings inside the container have to be sorted out (see Docker client Proxy):

Add to ~/.ssh/config a local port forward (bound to the real NIC interface) to the squid port (loopback interface) on the server outside the firewall
 Host parent-squid
     HostName <SERVER_IP>
     User huatai
     LocalForward 192.168.7.152:3128 127.0.0.1:3128
  • Then run the ssh command to bring up the port forwarding:

Run the ssh command to bring up the port forwarding
ssh parent-squid
  • On the Docker client, create or edit ~/.docker/config.json with the following JSON configuration:

Configuring ~/.docker/config.json on the Docker client injects the proxy settings into containers
{
 "proxies":
 {
   "default":
   {
     "httpProxy": "http://192.168.7.152:3128",
     "httpsProxy": "http://192.168.7.152:3128",
     "noProxy": "*.baidu.com,192.168.0.0/16,10.0.0.0/8"
   }
 }
}

Containers created from now on automatically get the proxy configuration injected, which speeds up downloads.
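
A quick way to confirm the injection works (a sketch; any small image will do) is to check the environment of a throwaway container:
docker run --rm busybox env | grep -i proxy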

New problem: cluster creation cannot find a matching log line

Creating the dev cluster with the custom zfs-enabled image above runs into a new problem:

Creating the dev cluster with the custom zfs-enabled image fails with a "could not find a matching log line" error
$ kind create cluster --name dev --config kind-config.yaml --image kindest/node:v1.25.3-zfs --retain -v 1
Creating cluster "dev" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.25.3-zfs present locally
  Ensuring node image (kindest/node:v1.25.3-zfs) 🖼
  Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
Stack Trace:
sigs.k8s.io/kind/pkg/errors.Errorf
        /home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/errors/errors.go:41
sigs.k8s.io/kind/pkg/cluster/internal/providers/common.WaitUntilLogRegexpMatches
        /home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/common/cgroups.go:84
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.createContainerWithWaitUntilSystemdReachesMultiUserSystem
        /home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/docker/provision.go:407
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.planCreation.func2
        /home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/docker/provision.go:115
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
        /home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/errors/concurrent.go:30
runtime.goexit
        /usr/lib/go/src/runtime/asm_amd64.s:1594

Others have run into this problem in could not find a line that matches “Reached target .*Multi-User System.*” #2460
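
Since the nodes were created with --retain, the failed node containers are still around, and their docker logs can show why systemd never reached multi-user.target (a debugging sketch, not part of the original fix):
docker ps -a --filter name=dev-
docker logs dev-control-plane 2>&1 | tail -n 50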