Troubleshooting kind Cluster Creation Failure on X86 Mobile Cloud¶
Note
The reason my kind cluster creation failed is X86 mobile cloud ZFS: the underlying physical host uses the Docker ZFS storage driver. This requires that zfsutils-linux also be installed in the kind base image, otherwise handling the container filesystem fails. This bug was only fixed on November 1, 2022, so the latest git version is needed rather than a release version (I was deploying in mid-January 2023).
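If you are unsure whether a host is affected, the storage driver can be checked directly with the standard Docker CLI (shown here only as a quick sanity check):
docker info --format '{{.Driver}}'
# expected output on the affected host: zfs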
The creation steps for X86 mobile cloud Kind (a local Docker-simulated k8s cluster) are the same as for a kind multi-node cluster, but running:
export CLUSTER_NAME=dev
export reg_name='kind-registry'
kind create cluster --name "${CLUSTER_NAME}" --config kind-config.yaml
produces the following error:
Creating cluster "dev" ...
✓ Ensuring node image (kindest/node:v1.25.3) 🖼
✓ Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦
✓ Configuring the external load balancer ⚖️
✓ Writing configuration 📜
✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged dev-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0118 06:47:07.363603 149 initconfiguration.go:254] loading configuration from "/kind/kubeadm.conf"
W0118 06:47:07.365525 149 initconfiguration.go:331] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.25.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0118 06:47:07.372106 149 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0118 06:47:07.451124 149 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [dev-control-plane dev-external-load-balancer kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.7 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0118 06:47:08.001760 149 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0118 06:47:08.092062 149 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0118 06:47:08.456417 149 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0118 06:47:08.658389 149 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.7 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [dev-control-plane localhost] and IPs [172.18.0.7 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0118 06:47:09.885620 149 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0118 06:47:10.065065 149 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0118 06:47:10.270102 149 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0118 06:47:10.343102 149 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0118 06:47:10.476121 149 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0118 06:47:10.615067 149 kubelet.go:66] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0118 06:47:10.772447 149 manifests.go:99] [control-plane] getting StaticPodSpecs
I0118 06:47:10.773090 149 certs.go:522] validating certificate period for CA certificate
I0118 06:47:10.773385 149 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0118 06:47:10.773469 149 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.773505 149 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0118 06:47:10.773532 149 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.773557 149 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0118 06:47:10.776092 149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0118 06:47:10.776109 149 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0118 06:47:10.776289 149 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0118 06:47:10.776297 149 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0118 06:47:10.776305 149 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0118 06:47:10.776312 149 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0118 06:47:10.776317 149 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0118 06:47:10.776321 149 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0118 06:47:10.776325 149 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0118 06:47:10.776940 149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0118 06:47:10.776950 149 manifests.go:99] [control-plane] getting StaticPodSpecs
I0118 06:47:10.777122 149 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0118 06:47:10.777526 149 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0118 06:47:10.778158 149 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0118 06:47:10.778195 149 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0118 06:47:10.778751 149 loader.go:374] Config loaded from file: /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0118 06:47:10.783452 149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 2 milliseconds
I0118 06:47:11.784001 149 with_retry.go:242] Got a Retry-After 1s response for attempt 1 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:47:11.786115 149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
I0118 06:47:12.786448 149 with_retry.go:242] Got a Retry-After 1s response for attempt 2 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:47:12.787568 149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 1 milliseconds
...
I0118 06:51:30.364631 149 with_retry.go:242] Got a Retry-After 1s response for attempt 8 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:51:30.367126 149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 2 milliseconds
I0118 06:51:31.367356 149 with_retry.go:242] Got a Retry-After 1s response for attempt 9 to https://dev-external-load-balancer:6443/healthz?timeout=10s
I0118 06:51:31.368374 149 round_trippers.go:553] GET https://dev-external-load-balancer:6443/healthz?timeout=10s in 0 milliseconds
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1594
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:154
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1594
In addition, the physical host's system log contains a large number of audit records, presumably related to systemd running inside the containers (the heavy repetition is likely abnormal):
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2215): arch=c000003e syscall=321 success=yes exit=28 a0=5 a1=7ffed8a26de0 a2=78 a3=7ffed8a26de0 items=0 ppid=24008 pid=24095 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="systemd" exe="/usr/lib/systemd/systemd" key=(null)
[Wed Jan 18 14:49:39 2023] audit: type=1327 audit(1674024579.870:2215): proctitle="/sbin/init"
[Wed Jan 18 14:49:39 2023] audit: type=1334 audit(1674024579.870:2216): prog-id=690 op=LOAD
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2216): arch=c000003e syscall=321 success=yes exit=16 a0=5 a1=c0001439d8 a2=78 a3=c0003097e0 items=0 ppid=30761 pid=30773 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
[Wed Jan 18 14:49:39 2023] audit: type=1327 audit(1674024579.870:2216): proctitle=72756E63002D2D726F6F74002F72756E2F636F6E7461696E6572642F72756E632F6B38732E696F002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6B38732E696F2F3665643565383762386162343234633235663038646465386435326239
[Wed Jan 18 14:49:39 2023] audit: type=1334 audit(1674024579.870:2217): prog-id=691 op=LOAD
[Wed Jan 18 14:49:39 2023] audit: type=1300 audit(1674024579.870:2217): arch=c000003e syscall=321 success=yes exit=18 a0=5 a1=c000143770 a2=78 a3=c000309828 items=0 ppid=30761 pid=30773 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
Investigation¶
Following ERROR: failed to create cluster: failed to init node with kubeadm #1437, add the --retain parameter when creating the kind cluster to get more detailed information:
kind create cluster --name dev --config kind-config.yaml --retain -v 1
kind export logs --name dev
The output is:
Exporting logs for cluster "dev" to:
/tmp/3866643061
The /tmp/3866643061 directory contains the log files for each node of the kind cluster.
The /tmp/3866643061/dev-control-plane/kubelet.log log shows CNI initialization failures:
$ cat kubelet.log | grep E0
Jan 18 07:43:50 dev-control-plane kubelet[193]: E0118 07:43:50.985166 193 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://dev-external-load-balancer:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": EOF
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.002200 193 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"dev-control-plane.173b577162e64787", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"dev-control-plane", UID:"dev-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"dev-control-plane"}, FirstTimestamp:time.Date(2023, time.January, 18, 7, 43, 51, 450951, time.Local), LastTimestamp:time.Date(2023, time.January, 18, 7, 43, 51, 450951, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://dev-external-load-balancer:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.007537 193 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.061065 193 kubelet.go:2034] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.105070 193 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.109113 193 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://dev-external-load-balancer:6443/api/v1/nodes\": EOF" node="dev-control-plane"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.161180 193 kubelet.go:2034] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.206135 193 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:51 dev-control-plane kubelet[193]: E0118 07:43:51.255991 193 kubelet.go:1397] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubelet kubepods] doesn't exist"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.663207 273 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"dev-control-plane.173b5771c5f29376", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"dev-control-plane", UID:"dev-control-plane", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"dev-control-plane"}, FirstTimestamp:time.Date(2023, time.January, 18, 7, 43, 52, 662201206, time.Local), LastTimestamp:time.Date(2023, time.January, 18, 7, 43, 52, 662201206, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://dev-external-load-balancer:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.664394 273 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.689512 273 kubelet.go:2034] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.736821 273 cri_stats_provider.go:452] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs\": stat failed on /var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs with error: no such file or directory" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.fuse-overlayfs"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.764402 273 kubelet.go:2448] "Error getting node" err="node \"dev-control-plane\" not found"
Jan 18 07:43:52 dev-control-plane kubelet[273]: E0118 07:43:52.765647 273 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://dev-external-load-balancer:6443/api/v1/nodes\": EOF" node="dev-control-plane"
...
Meanwhile, the Docker containers on the physical host have actually started:
docker ps
shows:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
19537801aa08 kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes 127.0.0.1:46635->6443/tcp dev-control-plane2
75f9a2d8dc9e kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes dev-worker3
bf960a2f24f5 kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes 127.0.0.1:37711->6443/tcp dev-control-plane
c81440eb69b3 kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes dev-worker4
f2f81e25705f kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes dev-worker5
5d52f70acb69 kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes dev-worker
acd0de1e4f4d kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes 127.0.0.1:41761->6443/tcp dev-control-plane3
8369e5a5e853 kindest/node:v1.25.3 "/usr/local/bin/entr…" 4 minutes ago Up 4 minutes dev-worker2
f9ac6e6b606a kindest/haproxy:v20220607-9a4d8d2a "haproxy -sf 7 -W -d…" 4 minutes ago Up 4 minutes 127.0.0.1:35931->6443/tcp dev-external-load-balancer
Enter the dev-control-plane node to inspect it:
docker exec -it bf960a2f24f5 /bin/bash
Inside the container you can see the underlying disk mounts:
# df -h
Filesystem Size Used Avail Use% Mounted on
zpool-data/62fd3b4b5c3acb12b91336c2f358d369a5223f2657025f03a8b6b35a22f4d2ef 859G 504M 858G 1% /
tmpfs 64M 0 64M 0% /dev
shm 64M 0 64M 0% /dev/shm
zpool-data 860G 2.1G 858G 1% /var
tmpfs 7.8G 454M 7.4G 6% /run
tmpfs 7.8G 0 7.8G 0% /tmp
/dev/nvme0n1p2 60G 27G 33G 45% /usr/lib/modules
tmpfs 5.0M 0 5.0M 0% /run/lock
Checking the dev-control-plane node's journal.log shows a zfs filesystem error:
...
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144306425Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (zfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144339308Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144362667Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144379795Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144417134Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.144733984Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
Jan 18 07:43:45 dev-control-plane containerd[108]: time="2023-01-18T07:43:45.145156396Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="exec: \"zfs\": executable file not found in $PATH: \"zfs fs list -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset zpool-data\" => : skip plugin" type=io.containerd.snapshotter.v1
...
From [zfs] Failed to create cluster #2163 you can see that the problem of building kind on ZFS was fixed on November 1, 2022; the fix was to add zfsutils-linux to base Dockerfile.
I checked inside the running dev-control-plane node and the zfs tools are indeed not installed, which means the latest kind version is required.
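For reference, the check inside the node container is a one-liner (illustrative commands, not from the original troubleshooting session; standard docker exec / dpkg usage):
docker exec dev-control-plane bash -c 'command -v zfs || echo "zfs binary not found"'
docker exec dev-control-plane dpkg -l zfsutils-linux   # reports no matching package on the unfixed image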
Fixing zfs¶
I tried adding an image parameter to specify the latest image:
kind create cluster --name dev --config kind-config.yaml --image kindest/node:latest
but it reports an error:
✗ Ensuring node image (kindest/node:latest) 🖼
ERROR: failed to create cluster: failed to pull image "kindest/node:latest": command "docker pull kindest/node:latest" failed with error: exit status 1
Command Output: Error response from daemon: manifest for kindest/node:latest not found: manifest unknown: manifest unknown
Since the kind GitHub repository has already fixed this problem, download the latest Dockerfile and build the image locally. Note that this build requires BuildKit to be enabled, i.e. edit /etc/docker/daemon.json:
{
"storage-driver": "zfs",
"features": {
"buildkit" : true
}
}
Restart the docker service:
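On a systemd-managed host this is typically:
sudo systemctl restart docker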
(Mistake: I thought building only the base image was enough, but in fact a complete kubernetes node image has to be built, see below.) Run the following to get the latest Dockerfile and build the image:
git clone git@github.com:jrwren/kind.git
cd kind/images/base/
# BuildKit is required; otherwise you get: the --chmod option requires BuildKit. Refer to https://docs.docker.com/go/buildkit/ to learn how to build images with BuildKit enabled
docker build -t kindest/node:v1.25.3-zfs .
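If editing /etc/docker/daemon.json is inconvenient, BuildKit can also be enabled just for this build via the standard environment variable:
DOCKER_BUILDKIT=1 docker build -t kindest/node:v1.25.3-zfs .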
Install the buildx plugin as follows (required, otherwise the build fails):
mkdir -p ~/.docker/cli-plugins
cd ~/.docker/cli-plugins
wget -O docker-buildx https://github.com/docker/buildx/releases/download/v0.10.0/buildx-v0.10.0.linux-amd64
chmod 755 docker-buildx
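A quick way to confirm the CLI sees the plugin:
docker buildx version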
Download the Kubernetes source code to $(go env GOPATH)/src/k8s.io/kubernetes, then build the node image:
mkdir -p $(go env GOPATH)/src/k8s.io
cd $(go env GOPATH)/src/k8s.io
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
# use the current release v1.26.1, which is a tag (immutable)
git checkout v1.26.1
# build the node image
kind build node-image
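By default kind build node-image tags the result as kindest/node:latest, which is why that tag is used when creating the cluster below; if you prefer an explicit tag, the command also accepts an --image option, e.g. (illustrative tag name):
kind build node-image --image kindest/node:v1.26.1-zfs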
Re-run cluster creation:
kind create cluster --name dev --config kind-config.yaml --image kindest/node:latest
Note
The containers started during the image build need to access github to download containerd, but this is blocked by the GFW, so the proxy settings inside the containers must be sorted out (see Docker client Proxy):
Deploy squid on a server outside the wall (see Squid quick start).
Edit ~/.ssh/config and add:
Host parent-squid
    HostName <SERVER_IP>
    User huatai
    LocalForward 192.168.7.152:3128 127.0.0.1:3128
Then run the ssh command to establish the port forward:
ssh parent-squid
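If no interactive shell is needed on the proxy host, the tunnel can be kept in the background with standard OpenSSH flags:
ssh -N -f parent-squid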
On the Docker client, create or edit ~/.docker/config.json with the following JSON configuration:
{
"proxies":
{
"default":
{
"httpProxy": "http://192.168.7.152:3128",
"httpsProxy": "http://192.168.7.152:3128",
"noProxy": "*.baidu.com,192.168.0.0/16,10.0.0.0/8"
}
}
}
New containers created from now on will automatically have the proxy configuration injected inside them, which speeds up the downloads.
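This can be spot-checked by inspecting the environment of a freshly created container (any small image will do; the variables should point at http://192.168.7.152:3128):
docker run --rm busybox env | grep -i proxy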
New Problem: Cluster Creation Cannot Find a Matching Log Line¶
Creating the dev cluster with the custom image above (which includes the zfs tools) runs into a new problem:
$ kind create cluster --name dev --config kind-config.yaml --image kindest/node:v1.25.3-zfs --retain -v 1
Creating cluster "dev" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.25.3-zfs present locally
✓ Ensuring node image (kindest/node:v1.25.3-zfs) 🖼
✗ Preparing nodes 📦 📦 📦 📦 📦 📦 📦 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
Stack Trace:
sigs.k8s.io/kind/pkg/errors.Errorf
/home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/errors/errors.go:41
sigs.k8s.io/kind/pkg/cluster/internal/providers/common.WaitUntilLogRegexpMatches
/home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/common/cgroups.go:84
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.createContainerWithWaitUntilSystemdReachesMultiUserSystem
/home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/docker/provision.go:407
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.planCreation.func2
/home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/cluster/internal/providers/docker/provision.go:115
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
/home/huatai/go/pkg/mod/sigs.k8s.io/kind@v0.17.0/pkg/errors/concurrent.go:30
runtime.goexit
/usr/lib/go/src/runtime/asm_amd64.s:1594
This problem has been reported before in could not find a line that matches "Reached target .*Multi-User System.*" #2460.
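Since --retain keeps the failed node containers around, a reasonable next step (a generic check, not taken from the issue above) is to look at the node container's own console output to see why systemd never reached the multi-user target:
docker logs dev-control-plane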