Running Kubernetes with Cilium Fully Replacing kube-proxy

Cilium provides a mode of operation that completely replaces kube-proxy. The simplest approach is to skip installing kube-proxy when bootstrapping the cluster with kubeadm.

Note

Cilium's kube-proxy replacement depends on the socket-LB feature, which requires Linux kernel v4.19.57, v5.1.16, v5.2.0 or newer. Linux kernels v5.3 and v5.8 add further features that let Cilium optimize its kube-proxy replacement implementation even more.
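
To confirm that a node meets this requirement, you can check the running kernel version on each node, for example:

uname -r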

Quick Start

  • Skip installing kube-proxy when initializing the cluster with kubeadm:

Skip installing kube-proxy during kubeadm cluster initialization
kubeadm init --skip-phases=addon/kube-proxy

Replacing an already-installed kube-proxy

For a Kubernetes cluster that already runs kube-proxy as a DaemonSet, remove kube-proxy with the commands below. Note: deleting kube-proxy will break connections to existing services and stop traffic until the Cilium replacement is fully installed.

Remove the kube-proxy DaemonSet from the Kubernetes cluster
kubectl -n kube-system delete ds kube-proxy
# Delete the configmap as well to avoid kube-proxy being reinstalled during a Kubeadm upgrade (works only for K8s 1.19 and newer)
kubectl -n kube-system delete cm kube-proxy
# Run on each node with root permissions:
iptables-save | grep -v KUBE | iptables-restore
  • Set up the Helm repository:

Add the cilium Helm repository
helm repo add cilium https://helm.cilium.io/
  • Run the following command to install:

Install Cilium as the kube-proxy replacement
#API_SERVER_IP=192.168.6.101
API_SERVER_IP=z-k8s-api.staging.huatai.me
# Kubeadm default is 6443
API_SERVER_PORT=6443
helm install cilium cilium/cilium --version 1.11.7 \
    --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost=${API_SERVER_IP} \
    --set k8sServicePort=${API_SERVER_PORT}

Here I ran into an error:

Error: INSTALLATION FAILED: cannot re-use a name that is still in use

The cause: the official documentation assumes a fresh initial installation of Cilium, i.e. Cilium is installed immediately after kube-proxy is removed. My sequence was different: I had already installed Cilium, then removed kube-proxy and tried to install Cilium again, which triggers this name conflict. As discussed in Cannot install kubernetes helm chart Error: cannot re-use a name that is still in use, the fix is to use helm's upgrade command instead of install, which lets the configuration be reapplied:

Replace kube-proxy with Cilium (using helm upgrade)
#API_SERVER_IP=192.168.6.101
API_SERVER_IP=z-k8s-api.staging.huatai.me
# Kubeadm default is 6443
API_SERVER_PORT=6443
helm upgrade cilium cilium/cilium --version 1.11.7 \
    --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost=${API_SERVER_IP} \
    --set k8sServicePort=${API_SERVER_PORT}

Now the replacement succeeds:

W0813 23:39:08.689475 1285915 warnings.go:70] spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[1].matchExpressions[0].key: beta.kubernetes.io/os is deprecated since v1.14; use "kubernetes.io/os" instead
Release "cilium" has been upgraded. Happy Helming!
NAME: cilium
LAST DEPLOYED: Sat Aug 13 23:39:06 2022
NAMESPACE: kube-system
STATUS: deployed
REVISION: 3
TEST SUITE: None
NOTES:
You have successfully installed Cilium with Hubble.

Your release version is 1.11.7.

For any further help, visit https://docs.cilium.io/en/v1.11/gettinghelp

Another possible fix, described in Cannot re-use a name that is still in use, is to first remove the release with helm uninstall and then run helm install again (not tried here).
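
A rough sketch of that alternative, untested here and simply reusing the same values as the install command above:

helm uninstall cilium --namespace kube-system
helm install cilium cilium/cilium --version 1.11.7 \
    --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost=${API_SERVER_IP} \
    --set k8sServicePort=${API_SERVER_PORT}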

  • Now we can check whether cilium is working properly on every node:

Check with kubectl that the cilium pods are running on every node
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

The output shows:

NAME           READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
cilium-2qcdd   1/1     Running   0          16m   192.168.6.113   z-k8s-n-3   <none>           <none>
cilium-4drkm   1/1     Running   0          17m   192.168.6.102   z-k8s-m-2   <none>           <none>
cilium-4xktc   1/1     Running   0          17m   192.168.6.101   z-k8s-m-1   <none>           <none>
cilium-5j2xb   1/1     Running   0          16m   192.168.6.112   z-k8s-n-2   <none>           <none>
cilium-d7mmq   1/1     Running   0          17m   192.168.6.114   z-k8s-n-4   <none>           <none>
cilium-fw9b5   1/1     Running   0          17m   192.168.6.115   z-k8s-n-5   <none>           <none>
cilium-t675t   1/1     Running   0          16m   192.168.6.103   z-k8s-m-3   <none>           <none>
cilium-tsntp   1/1     Running   0          16m   192.168.6.111   z-k8s-n-1   <none>           <none>

Validating the Setup

After replacing kube-proxy, first verify that the Cilium agent on each node is running in the correct mode:

kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement

The output looks similar to:

Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), clean-cilium-state (init)
KubeProxyReplacement:   Strict   [enp1s0 192.168.6.102 (Direct Routing)]
  • Check the detailed status:

    kubectl -n kube-system exec ds/cilium -- cilium status --verbose
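
To narrow the verbose output down to the kube-proxy replacement section, you can filter it; the section header below is what Cilium 1.11/1.12 prints and may differ slightly in other versions:

    kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A 20 "KubeProxyReplacement Details"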
    

Optional: Verifying with an Nginx Deployment

  • Prepare my-nginx.yaml :

Nginx Deployment my-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80
  • Apply the deployment:

    kubectl create -f my-nginx.yaml
    

Check pod creation:

kubectl get pods -o wide

There was a hiccup: the image would not finish downloading:

NAME                       READY   STATUS              RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
my-nginx-df7bbf6f5-457mh   0/1     ContainerCreating   0          12m   <none>       z-k8s-n-5   <none>           <none>
my-nginx-df7bbf6f5-6gndk   0/1     ContainerCreating   0          12m   <none>       z-k8s-n-1   <none>           <none>

kubectl describe pods my-nginx-df7bbf6f5-457mh shows the pod stuck in the pulling image state:

...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/my-nginx-df7bbf6f5-457mh to z-k8s-n-5
  Normal  Pulling    10m   kubelet            Pulling image "nginx"

Check the cluster events:

kubectl get events --sort-by=.metadata.creationTimestamp

which shows:

LAST SEEN   TYPE     REASON              OBJECT                          MESSAGE
16m         Normal   Scheduled           pod/my-nginx-df7bbf6f5-457mh    Successfully assigned default/my-nginx-df7bbf6f5-457mh to z-k8s-n-5
16m         Normal   Scheduled           pod/my-nginx-df7bbf6f5-6gndk    Successfully assigned default/my-nginx-df7bbf6f5-6gndk to z-k8s-n-1
16m         Normal   SuccessfulCreate    replicaset/my-nginx-df7bbf6f5   Created pod: my-nginx-df7bbf6f5-6gndk
16m         Normal   SuccessfulCreate    replicaset/my-nginx-df7bbf6f5   Created pod: my-nginx-df7bbf6f5-457mh
16m         Normal   ScalingReplicaSet   deployment/my-nginx             Scaled up replica set my-nginx-df7bbf6f5 to 2
16m         Normal   Pulling             pod/my-nginx-df7bbf6f5-457mh    Pulling image "nginx"
16m         Normal   Pulling             pod/my-nginx-df7bbf6f5-6gndk    Pulling image "nginx"

It turned out the image pull was just slow; the pods eventually started running:

NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
my-nginx-df7bbf6f5-457mh   1/1     Running   0          12h   10.0.6.22    z-k8s-n-5   <none>           <none>
my-nginx-df7bbf6f5-6gndk   1/1     Running   0          12h   10.0.3.160   z-k8s-n-1   <none>           <none>
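
The next step is to expose the Deployment as a NodePort Service. The exact command was not recorded here; based on the service shown below, it was presumably the one from the upstream Cilium guide:

kubectl expose deployment my-nginx --type=NodePort --port=80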

The command reports:

service/my-nginx exposed
  • Check the NodePort service:

    kubectl get svc my-nginx
    

The status shows:

NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.101.117.255   <none>        80:30828/TCP   110s
  • We can now use the cilium service list command to verify the new NodePort service created by Cilium's eBPF kube-proxy replacement:

List the services known to the cilium DaemonSet
kubectl -n kube-system exec ds/cilium -- cilium service list

The output shows:

Service list output from the cilium DaemonSet
ID   Frontend              Service Type   Backend
1    10.104.129.196:443    ClusterIP      1 => 192.168.6.114:4244
                                          2 => 192.168.6.102:4244
                                          3 => 192.168.6.115:4244
                                          4 => 192.168.6.101:4244
                                          5 => 192.168.6.103:4244
                                          6 => 192.168.6.112:4244
                                          7 => 192.168.6.113:4244
                                          8 => 192.168.6.111:4244
2    10.108.4.221:8080     ClusterIP      1 => 10.0.5.157:8080
3    10.96.0.1:443         ClusterIP      1 => 192.168.6.101:6443
                                          2 => 192.168.6.102:6443
                                          3 => 192.168.6.103:6443
4    10.96.0.10:53         ClusterIP      1 => 10.0.0.141:53
                                          2 => 10.0.0.241:53
5    10.96.0.10:9153       ClusterIP      1 => 10.0.0.141:9153
                                          2 => 10.0.0.241:9153
6    10.100.109.59:8080    ClusterIP      1 => 10.0.7.132:8080
9    192.168.6.102:31066   NodePort       1 => 10.0.5.157:8080
10   0.0.0.0:31066         NodePort       1 => 10.0.5.157:8080
11   192.168.6.102:30798   NodePort       1 => 10.0.7.132:8080
12   0.0.0.0:30798         NodePort       1 => 10.0.7.132:8080
13   10.101.117.255:80     ClusterIP      1 => 10.0.3.160:80
                                          2 => 10.0.6.22:80
14   192.168.6.102:30828   NodePort       1 => 10.0.3.160:80
                                          2 => 10.0.6.22:80
15   0.0.0.0:30828         NodePort       1 => 10.0.3.160:80
                                          2 => 10.0.6.22:80
  • Capture the NodePort assigned to the service with the following command:

    node_port=$(kubectl get svc my-nginx -o=jsonpath='{@.spec.ports[0].nodePort}')
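
The captured variable can then be used in place of the literal port in the checks below, for example (node IP taken from the cluster above):

    curl http://192.168.6.112:${node_port}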
    

In fact, as the cilium service list output above shows, there are now three ways to reach the service:

  • 10.101.117.255:80 ClusterIP

  • 192.168.6.102:30828 NodePort

  • 0.0.0.0:30828 NodePort

These correspond to:

  • accessing 10.101.117.255 on port 80 from any cluster node

  • accessing z-k8s-m-2 (192.168.6.102) on port 30828

  • accessing port 30828 on any cluster node

All of them return the nginx page (here, accessing z-k8s-n-2 192.168.6.112 as an example):

curl 192.168.6.112:30828

The output shows:

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark;  }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Socket LoadBalancer Bypass in Pod Namespace

When configuring Cilium for the Cilium Istio integration quick start, if Cilium is deployed in the kube-proxy replacement mode described in this article ( kube-proxy_free ), Cilium's socket load balancing must be adjusted by setting socketLB.hostNamespaceOnly=true , otherwise Istio's encryption and telemetry features will break.

Since I already enabled kube-proxy_free above, the first step of deploying the Cilium Istio integration is the configuration update in this section, enabling socketLB.hostNamespaceOnly=true :

Warning

I got the configuration wrong here and it took some troubleshooting to fix; see the diagnosis and correction below. At the end I give a correct, simplified configuration that does not change any defaults. Cilium has many powerful networking options that interact with each other and with the underlying (underlay) network (VXLAN etc.), so changes must be made very carefully.

Update the Cilium kube-proxy free configuration, enabling socketLB.hostNamespaceOnly for Istio integration (contains an error; cilium fails to start)
API_SERVER_IP=z-k8s-api.staging.huatai.me
API_SERVER_PORT=6443
helm upgrade cilium cilium/cilium --version 1.12.1 \
   --namespace kube-system \
   --reuse-values \
   --set tunnel=disabled \
   --set autoDirectNodeRoutes=true \
   --set kubeProxyReplacement=strict \
   --set socketLB.hostNamespaceOnly=true \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT}
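
After a change like this, the cilium DaemonSet rolls out new pods. One way to watch the rollout progress (a generic kubectl command, not specific to Cilium) is:

kubectl -n kube-system rollout status ds/cilium --timeout=5m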

However, this upgrade hit a strange problem: the cilium pods on several nodes kept crashing:

$ kubectl get pods -n kube-system -o wide
NAME                                READY   STATUS             RESTARTS      AGE    IP              NODE        NOMINATED NODE   READINESS GATES
cilium-2brxn                        0/1     CrashLoopBackOff   4 (67s ago)   3m4s   192.168.6.103   z-k8s-m-3   <none>           <none>
cilium-6rhms                        1/1     Running            0             25h    192.168.6.115   z-k8s-n-5   <none>           <none>
cilium-mzrkm                        0/1     CrashLoopBackOff   4 (79s ago)   3m5s   192.168.6.113   z-k8s-n-3   <none>           <none>
cilium-operator-6dfc84b7fc-m8ftr    1/1     Running            0             3m5s   192.168.6.114   z-k8s-n-4   <none>           <none>
cilium-operator-6dfc84b7fc-sxjp5    1/1     Running            0             3m6s   192.168.6.113   z-k8s-n-3   <none>           <none>
cilium-pmdj4                        1/1     Running            0             25h    192.168.6.102   z-k8s-m-2   <none>           <none>
cilium-qjxcc                        0/1     CrashLoopBackOff   4 (81s ago)   3m5s   192.168.6.101   z-k8s-m-1   <none>           <none>
cilium-t5n4c                        1/1     Running            0             25h    192.168.6.114   z-k8s-n-4   <none>           <none>
cilium-vjqlr                        1/1     Running            0             25h    192.168.6.111   z-k8s-n-1   <none>           <none>
cilium-vk624                        0/1     CrashLoopBackOff   4 (74s ago)   3m4s   192.168.6.112   z-k8s-n-2   <none>           <none>

Describe one of the pods:

kubectl -n kube-system describe pods cilium-vk624

It shows the container failing its startup probe:

Pods crash-looping after enabling socketLB.hostNamespaceOnly
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m51s                  default-scheduler  Successfully assigned kube-system/cilium-vk624 to z-k8s-n-2
  Normal   Pulled     4m50s                  kubelet            Container image "quay.io/cilium/cilium:v1.12.1@sha256:ea2db1ee21b88127b5c18a96ad155c25485d0815a667ef77c2b7c7f31cab601b" already present on machine
  Normal   Created    4m50s                  kubelet            Created container mount-cgroup
  Normal   Pulled     4m50s                  kubelet            Container image "quay.io/cilium/cilium:v1.12.1@sha256:ea2db1ee21b88127b5c18a96ad155c25485d0815a667ef77c2b7c7f31cab601b" already present on machine
  Normal   Started    4m50s                  kubelet            Started container mount-cgroup
  Normal   Started    4m49s                  kubelet            Started container apply-sysctl-overwrites
  Normal   Created    4m49s                  kubelet            Created container apply-sysctl-overwrites
  Normal   Pulled     4m49s                  kubelet            Container image "quay.io/cilium/cilium:v1.12.1@sha256:ea2db1ee21b88127b5c18a96ad155c25485d0815a667ef77c2b7c7f31cab601b" already present on machine
  Normal   Pulled     4m48s                  kubelet            Container image "quay.io/cilium/cilium:v1.12.1@sha256:ea2db1ee21b88127b5c18a96ad155c25485d0815a667ef77c2b7c7f31cab601b" already present on machine
  Normal   Created    4m48s                  kubelet            Created container mount-bpf-fs
  Normal   Started    4m48s                  kubelet            Started container mount-bpf-fs
  Normal   Created    4m47s                  kubelet            Created container clean-cilium-state
  Normal   Started    4m47s                  kubelet            Started container clean-cilium-state
  Normal   Started    4m43s (x2 over 4m46s)  kubelet            Started container cilium-agent
  Warning  Unhealthy  4m42s (x2 over 4m44s)  kubelet            Startup probe failed: Get "http://127.0.0.1:9879/healthz": dial tcp 127.0.0.1:9879: connect: connection refused
  Warning  BackOff    4m38s (x3 over 4m40s)  kubelet            Back-off restarting failed container
  Normal   Pulled     4m24s (x3 over 4m47s)  kubelet            Container image "quay.io/cilium/cilium:v1.12.1@sha256:ea2db1ee21b88127b5c18a96ad155c25485d0815a667ef77c2b7c7f31cab601b" already present on machine
  Normal   Created    4m24s (x3 over 4m46s)  kubelet            Created container cilium-agent

A status check also reveals the problem:

kubectl -n kube-system exec ds/cilium -- cilium status --verbose

with unreachable nodes reported:

...
Encryption:               Disabled
Cluster health:           4/8 reachable   (2022-08-22T16:34:16Z)
  Name                    IP              Node          Endpoints
  z-k8s-n-4 (localhost)   192.168.6.114   reachable     reachable
  z-k8s-m-1               192.168.6.101   unreachable   reachable
  z-k8s-m-2               192.168.6.102   reachable     reachable
  z-k8s-m-3               192.168.6.103   unreachable   reachable
  z-k8s-n-1               192.168.6.111   reachable     reachable
  z-k8s-n-2               192.168.6.112   unreachable   reachable
  z-k8s-n-3               192.168.6.113   unreachable   reachable
  z-k8s-n-5               192.168.6.115   reachable     reachable

Check the logs of a crashing pod:

kubectl -n kube-system logs cilium-vk624

The error turns out to be an invalid configuration parameter:

Crashing pod log after enabling socketLB.hostNamespaceOnly
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
level=info msg="Started gops server" address="127.0.0.1:9890" subsys=daemon
level=warning msg="If auto-direct-node-routes is enabled, then you are recommended to also configure ipv4-native-routing-cidr. If ipv4-native-routing-cidr is not configured, this may lead to pod to pod traffic being masqueraded, which can cause problems with performance, observability and policy" subsys=config
level=info msg="Memory available for map entries (0.003% of 4120702976B): 10301757B" subsys=config
level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 131072" subsys=config
level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 65536" subsys=config
level=info msg="option bpf-nat-global-max set by dynamic sizing to 131072" subsys=config
level=info msg="option bpf-neigh-global-max set by dynamic sizing to 131072" subsys=config
level=info msg="option bpf-sock-rev-map-max set by dynamic sizing to 65536" subsys=config
level=info msg="  --agent-health-port='9879'" subsys=daemon
level=info msg="  --agent-labels=''" subsys=daemon
level=info msg="  --agent-not-ready-taint-key='node.cilium.io/agent-not-ready'" subsys=daemon
level=info msg="  --allocator-list-timeout='3m0s'" subsys=daemon
level=info msg="  --allow-icmp-frag-needed='true'" subsys=daemon
level=info msg="  --allow-localhost='auto'" subsys=daemon
level=info msg="  --annotate-k8s-node='false'" subsys=daemon
level=info msg="  --api-rate-limit=''" subsys=daemon
level=info msg="  --arping-refresh-period='30s'" subsys=daemon
level=info msg="  --auto-create-cilium-node-resource='true'" subsys=daemon
level=info msg="  --auto-direct-node-routes='true'" subsys=daemon
level=info msg="  --bgp-announce-lb-ip='false'" subsys=daemon
level=info msg="  --bgp-announce-pod-cidr='false'" subsys=daemon
level=info msg="  --bgp-config-path='/var/lib/cilium/bgp/config.yaml'" subsys=daemon
level=info msg="  --bpf-ct-global-any-max='262144'" subsys=daemon
level=info msg="  --bpf-ct-global-tcp-max='524288'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-fin='10s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-syn='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-tcp-grace='1m0s'" subsys=daemon
level=info msg="  --bpf-filter-priority='1'" subsys=daemon
level=info msg="  --bpf-fragments-map-max='8192'" subsys=daemon
level=info msg="  --bpf-lb-acceleration='disabled'" subsys=daemon
level=info msg="  --bpf-lb-affinity-map-max='0'" subsys=daemon
level=info msg="  --bpf-lb-algorithm='random'" subsys=daemon
level=info msg="  --bpf-lb-dev-ip-addr-inherit=''" subsys=daemon
level=info msg="  --bpf-lb-dsr-dispatch='opt'" subsys=daemon
level=info msg="  --bpf-lb-dsr-l4-xlate='frontend'" subsys=daemon
level=info msg="  --bpf-lb-external-clusterip='false'" subsys=daemon
level=info msg="  --bpf-lb-maglev-hash-seed='JLfvgnHc2kaSUFaI'" subsys=daemon
level=info msg="  --bpf-lb-maglev-map-max='0'" subsys=daemon
level=info msg="  --bpf-lb-maglev-table-size='16381'" subsys=daemon
level=info msg="  --bpf-lb-map-max='65536'" subsys=daemon
level=info msg="  --bpf-lb-mode='snat'" subsys=daemon
level=info msg="  --bpf-lb-rev-nat-map-max='0'" subsys=daemon
level=info msg="  --bpf-lb-rss-ipv4-src-cidr=''" subsys=daemon
level=info msg="  --bpf-lb-rss-ipv6-src-cidr=''" subsys=daemon
level=info msg="  --bpf-lb-service-backend-map-max='0'" subsys=daemon
level=info msg="  --bpf-lb-service-map-max='0'" subsys=daemon
level=info msg="  --bpf-lb-sock='false'" subsys=daemon
level=info msg="  --bpf-lb-sock-hostns-only='true'" subsys=daemon
level=info msg="  --bpf-lb-source-range-map-max='0'" subsys=daemon
level=info msg="  --bpf-map-dynamic-size-ratio='0.0025'" subsys=daemon
level=info msg="  --bpf-nat-global-max='524288'" subsys=daemon
level=info msg="  --bpf-neigh-global-max='524288'" subsys=daemon
level=info msg="  --bpf-policy-map-max='16384'" subsys=daemon
level=info msg="  --bpf-root='/sys/fs/bpf'" subsys=daemon
level=info msg="  --bpf-sock-rev-map-max='262144'" subsys=daemon
level=info msg="  --bypass-ip-availability-upon-restore='false'" subsys=daemon
level=info msg="  --certificates-directory='/var/run/cilium/certs'" subsys=daemon
level=info msg="  --cflags=''" subsys=daemon
level=info msg="  --cgroup-root='/run/cilium/cgroupv2'" subsys=daemon
level=info msg="  --cluster-health-port='4240'" subsys=daemon
level=info msg="  --cluster-id='0'" subsys=daemon
level=info msg="  --cluster-name='default'" subsys=daemon
level=info msg="  --clustermesh-config='/var/lib/cilium/clustermesh/'" subsys=daemon
level=info msg="  --cmdref=''" subsys=daemon
level=info msg="  --config=''" subsys=daemon
level=info msg="  --config-dir='/tmp/cilium/config-map'" subsys=daemon
level=info msg="  --conntrack-gc-interval='0s'" subsys=daemon
level=info msg="  --crd-wait-timeout='5m0s'" subsys=daemon
level=info msg="  --datapath-mode='veth'" subsys=daemon
level=info msg="  --debug='false'" subsys=daemon
level=info msg="  --debug-verbose=''" subsys=daemon
level=info msg="  --derive-masquerade-ip-addr-from-device=''" subsys=daemon
level=info msg="  --devices=''" subsys=daemon
level=info msg="  --direct-routing-device=''" subsys=daemon
level=info msg="  --disable-cnp-status-updates='true'" subsys=daemon
level=info msg="  --disable-conntrack='false'" subsys=daemon
level=info msg="  --disable-endpoint-crd='false'" subsys=daemon
level=info msg="  --disable-envoy-version-check='false'" subsys=daemon
level=info msg="  --disable-iptables-feeder-rules=''" subsys=daemon
level=info msg="  --dns-max-ips-per-restored-rule='1000'" subsys=daemon
level=info msg="  --dns-policy-unload-on-shutdown='false'" subsys=daemon
level=info msg="  --dnsproxy-concurrency-limit='0'" subsys=daemon
level=info msg="  --dnsproxy-concurrency-processing-grace-period='0s'" subsys=daemon
level=info msg="  --egress-masquerade-interfaces=''" subsys=daemon
level=info msg="  --egress-multi-home-ip-rule-compat='false'" subsys=daemon
level=info msg="  --enable-auto-protect-node-port-range='true'" subsys=daemon
level=info msg="  --enable-bandwidth-manager='false'" subsys=daemon
level=info msg="  --enable-bbr='false'" subsys=daemon
level=info msg="  --enable-bgp-control-plane='false'" subsys=daemon
level=info msg="  --enable-bpf-clock-probe='true'" subsys=daemon
level=info msg="  --enable-bpf-masquerade='false'" subsys=daemon
level=info msg="  --enable-bpf-tproxy='false'" subsys=daemon
level=info msg="  --enable-cilium-endpoint-slice='false'" subsys=daemon
level=info msg="  --enable-custom-calls='false'" subsys=daemon
level=info msg="  --enable-endpoint-health-checking='true'" subsys=daemon
level=info msg="  --enable-endpoint-routes='false'" subsys=daemon
level=info msg="  --enable-envoy-config='true'" subsys=daemon
level=info msg="  --enable-external-ips='true'" subsys=daemon
level=info msg="  --enable-health-check-nodeport='true'" subsys=daemon
level=info msg="  --enable-health-checking='true'" subsys=daemon
level=info msg="  --enable-host-firewall='false'" subsys=daemon
level=info msg="  --enable-host-legacy-routing='false'" subsys=daemon
level=info msg="  --enable-host-port='true'" subsys=daemon
level=info msg="  --enable-host-reachable-services='false'" subsys=daemon
level=info msg="  --enable-hubble='true'" subsys=daemon
level=info msg="  --enable-hubble-recorder-api='true'" subsys=daemon
level=info msg="  --enable-icmp-rules='true'" subsys=daemon
level=info msg="  --enable-identity-mark='true'" subsys=daemon
level=info msg="  --enable-ip-masq-agent='false'" subsys=daemon
level=info msg="  --enable-ipsec='false'" subsys=daemon
level=info msg="  --enable-ipv4='true'" subsys=daemon
level=info msg="  --enable-ipv4-egress-gateway='false'" subsys=daemon
level=info msg="  --enable-ipv4-fragment-tracking='true'" subsys=daemon
level=info msg="  --enable-ipv4-masquerade='true'" subsys=daemon
level=info msg="  --enable-ipv6='false'" subsys=daemon
level=info msg="  --enable-ipv6-masquerade='true'" subsys=daemon
level=info msg="  --enable-ipv6-ndp='false'" subsys=daemon
level=info msg="  --enable-k8s-api-discovery='false'" subsys=daemon
level=info msg="  --enable-k8s-endpoint-slice='true'" subsys=daemon
level=info msg="  --enable-k8s-event-handover='false'" subsys=daemon
level=info msg="  --enable-k8s-terminating-endpoint='true'" subsys=daemon
level=info msg="  --enable-l2-neigh-discovery='true'" subsys=daemon
level=info msg="  --enable-l7-proxy='true'" subsys=daemon
level=info msg="  --enable-local-node-route='true'" subsys=daemon
level=info msg="  --enable-local-redirect-policy='false'" subsys=daemon
level=info msg="  --enable-mke='false'" subsys=daemon
level=info msg="  --enable-monitor='true'" subsys=daemon
level=info msg="  --enable-node-port='false'" subsys=daemon
level=info msg="  --enable-policy='default'" subsys=daemon
level=info msg="  --enable-recorder='false'" subsys=daemon
level=info msg="  --enable-remote-node-identity='true'" subsys=daemon
level=info msg="  --enable-runtime-device-detection='false'" subsys=daemon
level=info msg="  --enable-selective-regeneration='true'" subsys=daemon
level=info msg="  --enable-service-topology='false'" subsys=daemon
level=info msg="  --enable-session-affinity='false'" subsys=daemon
level=info msg="  --enable-svc-source-range-check='true'" subsys=daemon
level=info msg="  --enable-tracing='false'" subsys=daemon
level=info msg="  --enable-unreachable-routes='false'" subsys=daemon
level=info msg="  --enable-vtep='false'" subsys=daemon
level=info msg="  --enable-well-known-identities='false'" subsys=daemon
level=info msg="  --enable-wireguard='false'" subsys=daemon
level=info msg="  --enable-wireguard-userspace-fallback='false'" subsys=daemon
level=info msg="  --enable-xdp-prefilter='false'" subsys=daemon
level=info msg="  --enable-xt-socket-fallback='true'" subsys=daemon
level=info msg="  --encrypt-interface=''" subsys=daemon
level=info msg="  --encrypt-node='false'" subsys=daemon
level=info msg="  --endpoint-gc-interval='5m0s'" subsys=daemon
level=info msg="  --endpoint-interface-name-prefix=''" subsys=daemon
level=info msg="  --endpoint-queue-size='25'" subsys=daemon
level=info msg="  --endpoint-status=''" subsys=daemon
level=info msg="  --envoy-config-timeout='2m0s'" subsys=daemon
level=info msg="  --envoy-log=''" subsys=daemon
level=info msg="  --exclude-local-address=''" subsys=daemon
level=info msg="  --fixed-identity-mapping=''" subsys=daemon
level=info msg="  --force-local-policy-eval-at-source='true'" subsys=daemon
level=info msg="  --fqdn-regex-compile-lru-size='1024'" subsys=daemon
level=info msg="  --gops-port='9890'" subsys=daemon
level=info msg="  --host-reachable-services-protos='tcp,udp'" subsys=daemon
level=info msg="  --http-403-msg=''" subsys=daemon
level=info msg="  --http-idle-timeout='0'" subsys=daemon
level=info msg="  --http-max-grpc-timeout='0'" subsys=daemon
level=info msg="  --http-normalize-path='true'" subsys=daemon
level=info msg="  --http-request-timeout='3600'" subsys=daemon
level=info msg="  --http-retry-count='3'" subsys=daemon
level=info msg="  --http-retry-timeout='0'" subsys=daemon
level=info msg="  --hubble-disable-tls='false'" subsys=daemon
level=info msg="  --hubble-event-buffer-capacity='4095'" subsys=daemon
level=info msg="  --hubble-event-queue-size='0'" subsys=daemon
level=info msg="  --hubble-export-file-compress='false'" subsys=daemon
level=info msg="  --hubble-export-file-max-backups='5'" subsys=daemon
level=info msg="  --hubble-export-file-max-size-mb='10'" subsys=daemon
level=info msg="  --hubble-export-file-path=''" subsys=daemon
level=info msg="  --hubble-listen-address=':4244'" subsys=daemon
level=info msg="  --hubble-metrics='dns,drop,tcp,flow,port-distribution,icmp,http'" subsys=daemon
level=info msg="  --hubble-metrics-server=':9965'" subsys=daemon
level=info msg="  --hubble-recorder-sink-queue-size='1024'" subsys=daemon
level=info msg="  --hubble-recorder-storage-path='/var/run/cilium/pcaps'" subsys=daemon
level=info msg="  --hubble-socket-path='/var/run/cilium/hubble.sock'" subsys=daemon
level=info msg="  --hubble-tls-cert-file='/var/lib/cilium/tls/hubble/server.crt'" subsys=daemon
level=info msg="  --hubble-tls-client-ca-files='/var/lib/cilium/tls/hubble/client-ca.crt'" subsys=daemon
level=info msg="  --hubble-tls-key-file='/var/lib/cilium/tls/hubble/server.key'" subsys=daemon
level=info msg="  --identity-allocation-mode='crd'" subsys=daemon
level=info msg="  --identity-change-grace-period='5s'" subsys=daemon
level=info msg="  --identity-restore-grace-period='10m0s'" subsys=daemon
level=info msg="  --install-egress-gateway-routes='false'" subsys=daemon
level=info msg="  --install-iptables-rules='true'" subsys=daemon
level=info msg="  --install-no-conntrack-iptables-rules='false'" subsys=daemon
level=info msg="  --ip-allocation-timeout='2m0s'" subsys=daemon
level=info msg="  --ip-masq-agent-config-path='/etc/config/ip-masq-agent'" subsys=daemon
level=info msg="  --ipam='cluster-pool'" subsys=daemon
level=info msg="  --ipsec-key-file=''" subsys=daemon
level=info msg="  --iptables-lock-timeout='5s'" subsys=daemon
level=info msg="  --iptables-random-fully='false'" subsys=daemon
level=info msg="  --ipv4-native-routing-cidr=''" subsys=daemon
level=info msg="  --ipv4-node='auto'" subsys=daemon
level=info msg="  --ipv4-pod-subnets=''" subsys=daemon
level=info msg="  --ipv4-range='auto'" subsys=daemon
level=info msg="  --ipv4-service-loopback-address='169.254.42.1'" subsys=daemon
level=info msg="  --ipv4-service-range='auto'" subsys=daemon
level=info msg="  --ipv6-cluster-alloc-cidr='f00d::/64'" subsys=daemon
level=info msg="  --ipv6-mcast-device=''" subsys=daemon
level=info msg="  --ipv6-native-routing-cidr=''" subsys=daemon
level=info msg="  --ipv6-node='auto'" subsys=daemon
level=info msg="  --ipv6-pod-subnets=''" subsys=daemon
level=info msg="  --ipv6-range='auto'" subsys=daemon
level=info msg="  --ipv6-service-range='auto'" subsys=daemon
level=info msg="  --join-cluster='false'" subsys=daemon
level=info msg="  --k8s-api-server=''" subsys=daemon
level=info msg="  --k8s-heartbeat-timeout='30s'" subsys=daemon
level=info msg="  --k8s-kubeconfig-path=''" subsys=daemon
level=info msg="  --k8s-namespace='kube-system'" subsys=daemon
level=info msg="  --k8s-require-ipv4-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-require-ipv6-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-service-cache-size='128'" subsys=daemon
level=info msg="  --k8s-service-proxy-name=''" subsys=daemon
level=info msg="  --k8s-sync-timeout='3m0s'" subsys=daemon
level=info msg="  --k8s-watcher-endpoint-selector='metadata.name!=kube-scheduler,metadata.name!=kube-controller-manager,metadata.name!=etcd-operator,metadata.name!=gcp-controller-manager'" subsys=daemon
level=info msg="  --keep-config='false'" subsys=daemon
level=info msg="  --kube-proxy-replacement='strict'" subsys=daemon
level=info msg="  --kube-proxy-replacement-healthz-bind-address=''" subsys=daemon
level=info msg="  --kvstore=''" subsys=daemon
level=info msg="  --kvstore-connectivity-timeout='2m0s'" subsys=daemon
level=info msg="  --kvstore-lease-ttl='15m0s'" subsys=daemon
level=info msg="  --kvstore-max-consecutive-quorum-errors='2'" subsys=daemon
level=info msg="  --kvstore-opt=''" subsys=daemon
level=info msg="  --kvstore-periodic-sync='5m0s'" subsys=daemon
level=info msg="  --label-prefix-file=''" subsys=daemon
level=info msg="  --labels=''" subsys=daemon
level=info msg="  --lib-dir='/var/lib/cilium'" subsys=daemon
level=info msg="  --local-max-addr-scope='252'" subsys=daemon
level=info msg="  --local-router-ipv4=''" subsys=daemon
level=info msg="  --local-router-ipv6=''" subsys=daemon
level=info msg="  --log-driver=''" subsys=daemon
level=info msg="  --log-opt=''" subsys=daemon
level=info msg="  --log-system-load='false'" subsys=daemon
level=info msg="  --max-controller-interval='0'" subsys=daemon
level=info msg="  --metrics=''" subsys=daemon
level=info msg="  --mke-cgroup-mount=''" subsys=daemon
level=info msg="  --monitor-aggregation='medium'" subsys=daemon
level=info msg="  --monitor-aggregation-flags='all'" subsys=daemon
level=info msg="  --monitor-aggregation-interval='5s'" subsys=daemon
level=info msg="  --monitor-queue-size='0'" subsys=daemon
level=info msg="  --mtu='0'" subsys=daemon
level=info msg="  --node-port-acceleration='disabled'" subsys=daemon
level=info msg="  --node-port-algorithm='random'" subsys=daemon
level=info msg="  --node-port-bind-protection='true'" subsys=daemon
level=info msg="  --node-port-mode='snat'" subsys=daemon
level=info msg="  --node-port-range='30000,32767'" subsys=daemon
level=info msg="  --policy-audit-mode='false'" subsys=daemon
level=info msg="  --policy-queue-size='100'" subsys=daemon
level=info msg="  --policy-trigger-interval='1s'" subsys=daemon
level=info msg="  --pprof='false'" subsys=daemon
level=info msg="  --pprof-port='6060'" subsys=daemon
level=info msg="  --preallocate-bpf-maps='false'" subsys=daemon
level=info msg="  --prepend-iptables-chains='true'" subsys=daemon
level=info msg="  --procfs='/host/proc'" subsys=daemon
level=info msg="  --prometheus-serve-addr=':9962'" subsys=daemon
level=info msg="  --proxy-connect-timeout='1'" subsys=daemon
level=info msg="  --proxy-gid='1337'" subsys=daemon
level=info msg="  --proxy-max-connection-duration-seconds='0'" subsys=daemon
level=info msg="  --proxy-max-requests-per-connection='0'" subsys=daemon
level=info msg="  --proxy-prometheus-port='9964'" subsys=daemon
level=info msg="  --read-cni-conf=''" subsys=daemon
level=info msg="  --restore='true'" subsys=daemon
level=info msg="  --route-metric='0'" subsys=daemon
level=info msg="  --sidecar-istio-proxy-image='cilium/istio_proxy'" subsys=daemon
level=info msg="  --single-cluster-route='false'" subsys=daemon
level=info msg="  --socket-path='/var/run/cilium/cilium.sock'" subsys=daemon
level=info msg="  --sockops-enable='false'" subsys=daemon
level=info msg="  --state-dir='/var/run/cilium'" subsys=daemon
level=info msg="  --tofqdns-dns-reject-response-code='refused'" subsys=daemon
level=info msg="  --tofqdns-enable-dns-compression='true'" subsys=daemon
level=info msg="  --tofqdns-endpoint-max-ip-per-hostname='50'" subsys=daemon
level=info msg="  --tofqdns-idle-connection-grace-period='0s'" subsys=daemon
level=info msg="  --tofqdns-max-deferred-connection-deletes='10000'" subsys=daemon
level=info msg="  --tofqdns-min-ttl='3600'" subsys=daemon
level=info msg="  --tofqdns-pre-cache=''" subsys=daemon
level=info msg="  --tofqdns-proxy-port='0'" subsys=daemon
level=info msg="  --tofqdns-proxy-response-max-delay='100ms'" subsys=daemon
level=info msg="  --trace-payloadlen='128'" subsys=daemon
level=info msg="  --tunnel='disabled'" subsys=daemon
level=info msg="  --tunnel-port='0'" subsys=daemon
level=info msg="  --version='false'" subsys=daemon
level=info msg="  --vlan-bpf-bypass=''" subsys=daemon
level=info msg="  --vtep-cidr=''" subsys=daemon
level=info msg="  --vtep-endpoint=''" subsys=daemon
level=info msg="  --vtep-mac=''" subsys=daemon
level=info msg="  --vtep-mask=''" subsys=daemon
level=info msg="  --write-cni-conf-when-ready=''" subsys=daemon
level=info msg="     _ _ _" subsys=daemon
level=info msg=" ___|_| |_|_ _ _____" subsys=daemon
level=info msg="|  _| | | | | |     |" subsys=daemon
level=info msg="|___|_|_|_|___|_|_|_|" subsys=daemon
level=info msg="Cilium 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/amd64" subsys=daemon
level=info msg="cilium-envoy  version: 5739e4be8ae7134fee683d920d25c3732ac6c819/1.21.5/Distribution/RELEASE/BoringSSL" subsys=daemon
level=info msg="clang (10.0.0) and kernel (5.4.0) versions: OK!" subsys=linux-datapath
level=info msg="linking environment: OK!" subsys=linux-datapath
level=info msg="Detected mounted BPF filesystem at /sys/fs/bpf" subsys=bpf
level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups
level=info msg="Parsing base label prefixes from default label list" subsys=labels-filter
level=info msg="Parsing additional label prefixes from user inputs: []" subsys=labels-filter
level=info msg="Final label prefixes to be used for identity evaluation:" subsys=labels-filter
level=info msg=" - reserved:.*" subsys=labels-filter
level=info msg=" - :io\\.kubernetes\\.pod\\.namespace" subsys=labels-filter
level=info msg=" - :io\\.cilium\\.k8s\\.namespace\\.labels" subsys=labels-filter
level=info msg=" - :app\\.kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:io\\.kubernetes" subsys=labels-filter
level=info msg=" - !:kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:.*beta\\.kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:k8s\\.io" subsys=labels-filter
level=info msg=" - !:pod-template-generation" subsys=labels-filter
level=info msg=" - !:pod-template-hash" subsys=labels-filter
level=info msg=" - !:controller-revision-hash" subsys=labels-filter
level=info msg=" - !:annotation.*" subsys=labels-filter
level=info msg=" - !:etcd_node" subsys=labels-filter
level=info msg="Auto-disabling \"enable-bpf-clock-probe\" feature since KERNEL_HZ cannot be determined" error="Cannot probe CONFIG_HZ" subsys=daemon
level=info msg="Using autogenerated IPv4 allocation range" subsys=node v4Prefix=10.112.0.0/16
level=info msg="Initializing daemon" subsys=daemon
level=info msg="Establishing connection to apiserver" host="https://z-k8s-api.staging.huatai.me:6443" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=fatal msg="Error while creating daemon" error="invalid daemon configuration: native routing cidr must be configured with option --ipv4-native-routing-cidr in combination with --enable-ipv4-masquerade --tunnel=disabled --ipam=cluster-pool --enable-ipv4=true" subsys=daemon

The key lines are:

...
level=warning msg="If auto-direct-node-routes is enabled, then you are recommended to also configure ipv4-native-routing-cidr. If ipv4-native-routing-cidr is not configured, this may lead to pod to pod traffic being masqueraded, which can cause problems with performance, observability and policy" subsys=config
...
level=fatal msg="Error while creating daemon" error="invalid daemon configuration: native routing cidr must be configured with option --ipv4-native-routing-cidr in combination with --enable-ipv4-masquerade --tunnel=disabled --ipam=cluster-pool --enable-ipv4=true" subsys=daemon

The cause:

Note that the tunnel parameter accepts only 3 values, {vxlan, geneve, disabled}, where geneve is another UDP-based encapsulation protocol (Generic Network Virtualization Encapsulation).

Once tunnel is disabled, you must also configure ipv4-native-routing-cidr: x.x.x.x/y, the CIDR within which routing is done natively without encapsulation. See Cilium Concepts >> Networking >> Routing >> Native-Routing.

cilium enables encapsulation by default, which requires no extra configuration and works with the underlying network as-is. In this mode all cluster nodes form a mesh of tunnels using a UDP-based encapsulation protocol such as VXLAN or Geneve, and all traffic between Cilium nodes is encapsulated.
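
For reference, if native routing were actually the goal, the upgrade would also need the native routing CIDR. A minimal sketch, assuming the Helm option ipv4NativeRoutingCIDR and using the cluster-pool pod CIDR as a placeholder value (both are assumptions; adjust to your own network, with API_SERVER_IP/API_SERVER_PORT set as before):

helm upgrade cilium cilium/cilium --version 1.12.1 \
   --namespace kube-system \
   --reuse-values \
   --set tunnel=disabled \
   --set autoDirectNodeRoutes=true \
   --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
   --set kubeProxyReplacement=strict \
   --set socketLB.hostNamespaceOnly=true \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT}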

So I revised the configuration to:

Re-apply the Cilium kube-proxy free configuration, enabling socketLB.hostNamespaceOnly for Istio, with some options restored to their defaults
API_SERVER_IP=z-k8s-api.staging.huatai.me
API_SERVER_PORT=6443
# tunnel, autoDirectNodeRoutes, loadBalancer.acceleration and loadBalancer.mode
# are the chart defaults, just stated explicitly here (inline comments after a
# trailing backslash would break line continuation, so they live up here)
helm upgrade cilium cilium/cilium --version 1.12.1 \
   --namespace kube-system \
   --reuse-values \
   --set tunnel=vxlan \
   --set autoDirectNodeRoutes=false \
   --set kubeProxyReplacement=strict \
   --set socketLB.hostNamespaceOnly=true \
   --set loadBalancer.acceleration=disabled \
   --set loadBalancer.mode=snat \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT}

In summary, I took a detour: the right approach is to keep the defaults and make only the minimal change. The simplified configuration is as follows (use this version):

Simplified and correct approach: update the Cilium kube-proxy free configuration, enabling socketLB.hostNamespaceOnly for Istio (without changing defaults)
API_SERVER_IP=z-k8s-api.staging.huatai.me
API_SERVER_PORT=6443
helm upgrade cilium cilium/cilium --version 1.12.1 \
   --namespace kube-system \
   --reuse-values \
   --set kubeProxyReplacement=strict \
   --set socketLB.hostNamespaceOnly=true \
   --set k8sServiceHost=${API_SERVER_IP} \
   --set k8sServicePort=${API_SERVER_PORT}
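
After applying this upgrade, one way to confirm the option reached the agents is to look for the corresponding daemon flag in the agent log; as in the startup log shown above, it should now report --bpf-lb-sock-hostns-only='true':

kubectl -n kube-system logs ds/cilium | grep bpf-lb-sock-hostns-only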

Note

For more on routing and acceleration, see:

Reference