Upgrading a Kubernetes Cluster to 1.25 with kubeadm

Preparation

Operating System Upgrade (Optional)

Note

Strictly speaking, the operating system upgrade has nothing to do with the Kubernetes upgrade, but since these hosts are the base runtime environment of my private cloud architecture, I want to stay on the latest Ubuntu Linux LTS to get the most out of the hardware and software. This mirrors upgrading the Ubuntu operating system underneath Ceph to 22.04.

The do-release-upgrade command checks the free space on the root filesystem and aborts automatically if there is not enough, so, just as when expanding a VM disk on a libvirt LVM volume, the virtual machine's root disk needs to be enlarged first. Note, however, that the Kubernetes cluster VMs were cloned from Ceph RBD images, so the method changes to online expansion of the Ceph RBD device with libvirt and XFS:

  • Resize the RBD image to 16GB ( 1024x16=16384 ) and refresh the VM disk with virsh blockresize:

rbd resize adjusts the RBD block device image size, virsh blockresize adjusts the VM's vda size
rbd resize --size 16384 libvirt-pool/z-k8s-m-1
virsh blockresize --domain z-k8s-m-1 --path vda --size 16G
  • Log in to the VM and run growpart and xfs_growfs to resize the partition and the filesystem:

Expand the root filesystem inside the VM with growpart and xfs_growfs
# install growpart
apt install cloud-guest-utils
# grow partition 2
growpart /dev/vda 2
# grow the XFS root filesystem
xfs_growfs /
  • do-release-upgrade checks the versions of every installed package; because kubeadm, kubectl and kubelet are held back from upgrading, it reports:

    Checking for a new Ubuntu release
    Please install all available updates for your release before upgrading.
    

So temporarily move the repository configuration out of the way, and restore it after the OS upgrade to continue with the Kubernetes upgrade:

mv /etc/apt/sources.list.d/kubernetes.list ~/
Run the Ubuntu release upgrade
sudo apt update && sudo apt upgrade -y
sudo apt autoremove -y
sudo reboot
sudo apt install update-manager-core
sudo do-release-upgrade -d
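
After the reboot that follows the release upgrade, a quick check confirms the new release before restoring the Kubernetes repository (a minimal sketch; it assumes the upgrade targeted Ubuntu 22.04):

Verify the release after do-release-upgrade
lsb_release -a    # the Description line should now read Ubuntu 22.04.x LTS
uname -r          # confirm the VM is running the new kernel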

Note

Make absolutely sure that apt hold has been used to pin the host's Kubernetes-related packages; otherwise the release upgrade can leave the Kubernetes cluster in an unpredictable, broken state.
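
A quick way to double-check the holds before starting the release upgrade, for example:

Verify that the Kubernetes packages are held by apt
apt-mark showhold
# expected to list: kubeadm kubectl kubelet
# if they are missing, pin them before upgrading the OS:
apt-mark hold kubeadm kubectl kubelet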

Note

It is recommended to log in to the VM through virsh console to run the OS upgrade. Upgrading over ssh also works, but the upgrade may drop the ssh session and the VM can become temporarily unreachable; the upgrade itself runs inside screen, so a dropped ssh session does not abort it, but recovering is cumbersome and you end up going through virsh console anyway.
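
For example, from the libvirt host (using this cluster's VM names):

Attach to the VM serial console from the libvirt host
virsh console z-k8s-m-1
# detach from the console with Ctrl+]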

Note

The Debian-family package repository provided by the Kubernetes project is always pinned to xenial, i.e. Ubuntu Linux 16.04 LTS. This is presumably to ensure the widest possible compatibility.
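
For reference, the repository entry that gets moved aside and later restored typically looks like the line below (an assumption based on the standard apt.kubernetes.io setup; adjust if your entry differs):

Typical content of /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main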

  • I upgraded the three control plane VMs first, then the worker nodes; after everything was done I made sure all VMs had been rebooted, and then checked the nodes with kubectl get nodes:

    kubectl get nodes -o wide
    

The output looks similar to:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   114d   v1.24.2   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   112d   v1.24.2   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   112d   v1.24.2   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          112d   v1.24.2   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          112d   v1.24.2   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          112d   v1.24.2   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          112d   v1.24.2   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          112d   v1.24.2   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6

Upgrading the Cluster (1.24.2) to the Latest Patch Release (1.24.7)

  • The operating system upgrade described above has been completed

  • Restore the kubernetes repository configuration and refresh the package index:

    mv ~/kubernetes.list /etc/apt/sources.list.d/
    apt update
    
  • List all available Kubernetes versions to determine the upgrade target:

    apt-cache madison kubeadm
    

The target versions we are going to upgrade to are visible:

kubeadm |  1.25.3-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages
...
kubeadm |  1.24.7-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages
...

Upgrading the Control Plane Nodes

  • The upgrade should be performed on one control plane node at a time

  • Pick the control plane node to upgrade first: it must have the /etc/kubernetes/admin.conf file (a quick check is shown below)
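
A simple way to confirm this on the candidate node, for example:

Confirm admin.conf exists on the node chosen for the first upgrade
ls -l /etc/kubernetes/admin.conf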

Run kubeadm upgrade

On the first control plane node z-k8s-m-1

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.24.7 (latest patch release of the current minor version)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.7-00 && \
apt-mark hold kubeadm

Note

Although upgrading kubeadm on a single control plane node is enough to drive the whole cluster upgrade, for consistency I upgrade kubeadm on the other control plane nodes as well.

  • Verify the kubeadm version:

    kubeadm version
    

Output:

kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7", GitCommit:"e6f35974b08862a23e7f4aad8e5d7f7f2de26c15", GitTreeState:"clean", BuildDate:"2022-10-12T10:55:41Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
  • Verify the upgrade plan:

Verify the upgrade plan with kubeadm
kubeadm upgrade plan

The output is shown below; because I run Cilium as a complete kube-proxy replacement, the kube-proxy component listed here has to be handled separately:

Output of kubeadm upgrade plan for 1.24.7
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.24.3
[upgrade/versions] kubeadm version: v1.24.7
I1109 12:40:57.517112  336232 version.go:255] remote version is much newer: v1.25.3; falling back to: stable-1.24
[upgrade/versions] Target version: v1.24.7
[upgrade/versions] Latest version in the v1.24 series: v1.24.7

W1109 12:40:59.174039  336232 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" not found
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       TARGET
kubelet     8 x v1.24.2   v1.24.7

Upgrade to the latest version in the v1.24 series:

COMPONENT                 CURRENT   TARGET
kube-apiserver            v1.24.3   v1.24.7
kube-controller-manager   v1.24.3   v1.24.7
kube-scheduler            v1.24.3   v1.24.7
kube-proxy                v1.24.3   v1.24.7
CoreDNS                   v1.8.6    v1.8.6

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.24.7

_____________________________________________________________________


The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.

API GROUP                 CURRENT VERSION   PREFERRED VERSION   MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io   -                 v1alpha1            no
kubelet.config.k8s.io     v1beta1           v1beta1             no
_____________________________________________________________________
  • Upgrade the first control plane node, specifying target version 1.24.7:

Upgrade the Kubernetes components on the first control plane node to 1.24.7 (latest patch release of the current minor version)
sudo kubeadm upgrade apply v1.24.7

Upgrade output:

Output of upgrading the control plane node's Kubernetes components to 1.24.7 (including the interactive prompt)
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1109 15:11:55.826367  402890 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" not found
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.24.7"
[upgrade/versions] Cluster version: v1.24.3
[upgrade/versions] kubeadm version: v1.24.7
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.24.7" (timeout: 5m0s)...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests1416038386"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade/postupgrade] Removing the deprecated label node-role.kubernetes.io/master='' from all control plane Nodes. After this step only the label node-role.kubernetes.io/control-plane='' will be present on control plane Nodes.
[upgrade/postupgrade] Adding the new taint &Taint{Key:node-role.kubernetes.io/control-plane,Value:,Effect:NoSchedule,TimeAdded:<nil>,} to all control plane Nodes. After this step both taints &Taint{Key:node-role.kubernetes.io/control-plane,Value:,Effect:NoSchedule,TimeAdded:<nil>,} and &Taint{Key:node-role.kubernetes.io/master,Value:,Effect:NoSchedule,TimeAdded:<nil>,} should be present on control plane Nodes.
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
W1109 15:14:23.647437  402890 postupgrade.go:152] the ConfigMap "kube-proxy" in the namespace "kube-system" was not found. Assuming that kube-proxy was not deployed for this cluster. Note that once 'kubeadm upgrade apply' supports phases you will have to skip the kube-proxy upgrade manually

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.24.7". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
  • Manually upgrade the CNI plugin, in this case the Cilium network (I am deferring this step until Cilium publishes a new stable release, so it stays unchanged for now); a sketch of what it would look like follows below
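
When that Cilium upgrade does happen, it goes through Cilium's own tooling rather than kubeadm. A minimal sketch, assuming Cilium was installed from the official Helm chart into kube-system (the release name and the target version below are placeholders, not values taken from this cluster):

Hypothetical Cilium upgrade via Helm (placeholder version)
helm repo update
helm upgrade cilium cilium/cilium \
    --namespace kube-system \
    --reuse-values \
    --version <target-cilium-version>
# check agent and operator health after the rollout
cilium status --wait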

On the other control plane nodes z-k8s-m-2 and z-k8s-m-3

  • The other control plane nodes only need upgrade node rather than upgrade apply:

Upgrade the node's Kubernetes components to 1.24.7 (upgrade node)
sudo kubeadm upgrade node

Upgrade output:

Output of upgrading the other control plane nodes' Kubernetes components to 1.24.7 (upgrade node)
[upgrade] Reading configuration from the cluster...
[upgrade] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1109 15:24:07.586473    6794 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" is forbidden: User "system:node:z-k8s-m-2" cannot get resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node 'z-k8s-m-2' and this object
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade] Upgrading your Static Pod-hosted control plane instance to version "v1.24.7"...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests703550162"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade] The control plane instance for this node was successfully updated!
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[upgrade] The configuration for this node was successfully updated!
[upgrade] Now you should go ahead and upgrade the kubelet package using your package manager.

The other control plane nodes do not need kubeadm upgrade plan, and they do not need the CNI plugin update step either.

Draining the Control Plane Nodes

After the control plane images have been upgraded, note that kubelet / kubectl on these nodes has not been upgraded yet, so kubectl get nodes still shows the old VERSION 1.24.2. With the control plane components upgraded, kubelet / kubectl on the control plane nodes can now be upgraded:

  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is z-k8s-m-1; the other control plane nodes are similar)

Drain the control plane node (DaemonSets excluded)
kubectl drain z-k8s-m-1 --ignore-daemonsets

Output:

Output of draining the control plane node (DaemonSets excluded)
node/z-k8s-m-1 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-xhmc5, metallb-system/speaker-8hgcd
node/z-k8s-m-1 drained

Upgrading kubelet and kubectl on the Control Plane Nodes

  • Upgrade kubelet and kubectl:

Upgrade kubelet and kubectl on the control plane node to 1.24.7
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.7-00 kubectl=1.24.7-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the control plane node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded control plane node back online and make it schedulable again (the example below is z-k8s-m-1; the other control plane nodes are similar):

Uncordon the control plane node and bring it back online
kubectl uncordon z-k8s-m-1
  • The upgrade of the first control plane node z-k8s-m-1 is now complete; checking kubectl get nodes shows it has moved to 1.24.7

    kubectl get nodes -o wide
    

Output:

Check output after the first control plane node upgrade
NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   114d   v1.24.7   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   113d   v1.24.2   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   113d   v1.24.2   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          113d   v1.24.2   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          113d   v1.24.2   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          113d   v1.24.2   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          113d   v1.24.2   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          113d   v1.24.2   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Repeat the "drain the control plane node" and "upgrade kubelet and kubectl" steps above on the remaining control plane nodes.

Upgrading the Worker Nodes

With the control plane upgraded to 1.24.7, the worker nodes can be upgraded. Workers should be upgraded one at a time, or a few at a time, without dropping below the minimum capacity the running workloads need.
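
One way to keep track of which workers still need the upgrade is to filter out the control plane nodes and watch the VERSION column, for example:

List only the worker nodes and their kubelet versions
kubectl get nodes -o wide -l '!node-role.kubernetes.io/control-plane'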

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.24.7 (latest patch release of the current minor version)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.7-00 && \
apt-mark hold kubeadm
  • Run kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.24.7 (upgrade node)
sudo kubeadm upgrade node
  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is worker node z-k8s-n-1)

Drain the worker node (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets

The output shows that pods using local storage cannot be evicted; among them is the Cilium-related hubble-ui pod:

Output of draining the worker node (DaemonSets excluded): pods using local storage, including the Cilium hubble-ui pod, cannot be evicted
error: unable to drain node "z-k8s-n-1" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istio-ingressgateway-85cc7b7ccd-zpx42, kube-system/hubble-ui-579fdfbc58-g2lv6, continuing command...
There are pending nodes to be drained:
 z-k8s-n-1
cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istio-ingressgateway-85cc7b7ccd-zpx42, kube-system/hubble-ui-579fdfbc58-g2lv6

Revise the drain command and add the --delete-emptydir-data flag:

Drain the worker node with the --delete-emptydir-data flag (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets --delete-emptydir-data

The output contains some evicting errors (discussed after the listing below):

Output of draining the worker node with --delete-emptydir-data (with evicting errors)
node/z-k8s-n-1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-sl25l, kube-system/otelcol-hubble-collector-46gtj, metallb-system/speaker-xt5zf
evicting pod podinfo/podinfo-frontend-76b9ff9c94-dsl8l
evicting pod cert-manager/cert-manager-7b4f4986bb-jpkr5
evicting pod cert-manager/cert-manager-cainjector-6b9d8b7d57-5fw2r
evicting pod cert-manager/cert-manager-webhook-d7bc6f65d-sl4fv
evicting pod cilium-monitoring/grafana-b96dcb76b-2przz
evicting pod cilium-test/echo-other-node-d79544ccf-2mxr9
evicting pod default/my-nginx-df7bbf6f5-6gndk
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
evicting pod jaeger/jaeger-default-fbc6fd6fd-p88nx
evicting pod jaeger/jaeger-operator-67dcc96554-q4ccj
evicting pod kube-system/hubble-relay-84b4ddb556-z86lj
evicting pod kube-system/hubble-ui-579fdfbc58-g2lv6
evicting pod opentelemetry-operator-system/opentelemetry-operator-controller-manager-696c488948-mkwph
evicting pod podinfo/podinfo-backend-595c9bd9c7-7c2vv
evicting pod podinfo/podinfo-backend-595c9bd9c7-bbplt
evicting pod podinfo/podinfo-client-5b9bb6b9cd-7xh76
evicting pod podinfo/podinfo-client-5b9bb6b9cd-fv5zq
evicting pod podinfo/podinfo-frontend-76b9ff9c94-7gt9q
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1109 16:55:32.209267   33920 request.go:682] Waited for 1.183463048s due to client-side throttling, not priority and fairness, request: POST:https://z-k8s-api.staging.huatai.me:6443/api/v1/namespaces/podinfo/pods/podinfo-client-5b9bb6b9cd-7xh76/eviction
pod/jaeger-operator-67dcc96554-q4ccj evicted
pod/cert-manager-webhook-d7bc6f65d-sl4fv evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/cert-manager-7b4f4986bb-jpkr5 evicted
pod/my-nginx-df7bbf6f5-6gndk evicted
pod/hubble-ui-579fdfbc58-g2lv6 evicted
pod/grafana-b96dcb76b-2przz evicted
pod/cert-manager-cainjector-6b9d8b7d57-5fw2r evicted
pod/opentelemetry-operator-controller-manager-696c488948-mkwph evicted
pod/podinfo-backend-595c9bd9c7-7c2vv evicted
pod/podinfo-backend-595c9bd9c7-bbplt evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/podinfo-frontend-76b9ff9c94-dsl8l evicted
pod/hubble-relay-84b4ddb556-z86lj evicted
I1109 16:55:42.330041   33920 request.go:682] Waited for 1.5766885s due to client-side throttling, not priority and fairness, request: GET:https://z-k8s-api.staging.huatai.me:6443/api/v1/namespaces/podinfo/pods/podinfo-frontend-76b9ff9c94-7gt9q
pod/podinfo-frontend-76b9ff9c94-7gt9q evicted
pod/podinfo-client-5b9bb6b9cd-fv5zq evicted
pod/podinfo-client-5b9bb6b9cd-7xh76 evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/jaeger-default-fbc6fd6fd-p88nx evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/echo-other-node-d79544ccf-2mxr9 evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
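
The repeated failures for istio-ingressgateway come from its PodDisruptionBudget: evicting the pod would exceed the budget's allowed disruption, typically because too few replicas are running. A sketch of how to inspect (and, if acceptable, temporarily work around) this, rather than what I actually did here:

Inspect the PodDisruptionBudget that blocks the eviction (sketch)
kubectl get pdb -n istio-system
kubectl describe pdb -n istio-system
# if acceptable, temporarily add a replica so the eviction can complete
# (deployment name inferred from the evicted pod above; scale back afterwards)
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=2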
  • Upgrade kubelet and kubectl in the same way:

Upgrade kubelet and kubectl on the worker node to 1.24.7
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.7-00 kubectl=1.24.7-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the worker node (1.24.7)
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded worker node back online and make it schedulable again (the example below is z-k8s-n-1; the other worker nodes are similar):

Uncordon the worker node and bring it back online
kubectl uncordon z-k8s-n-1

After all nodes have been upgraded, checking with kubectl get nodes -o wide shows every node uniformly at version 1.24.7:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   115d   v1.24.7   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   113d   v1.24.7   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   113d   v1.24.7   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          113d   v1.24.7   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          113d   v1.24.7   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          113d   v1.24.7   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          113d   v1.24.7   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          113d   v1.24.7   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Upgrading the Cluster (1.24.7) to the Latest Release (1.25.3)

With the 1.24.2 to 1.24.7 patch upgrade above completed, the prerequisites are in place to move up a minor version to the latest release, 1.25.3.

The procedure is exactly the same, only the target version changes to 1.25.3; the process is recorded below.

Run kubeadm upgrade

On the first control plane node z-k8s-m-1

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.25.3 (latest release)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.25.3-00 && \
apt-mark hold kubeadm
  • Verify the kubeadm version:

    kubeadm version
    
  • Verify the upgrade plan:

Verify the upgrade plan with kubeadm
kubeadm upgrade plan

If nothing unusual is reported, proceed.

  • Upgrade the first control plane node, specifying target version 1.25.3:

Upgrade the Kubernetes components on the first control plane node to 1.25.3 (latest release)
sudo kubeadm upgrade apply v1.25.3

On the other control plane nodes z-k8s-m-2 and z-k8s-m-3

  • Upgrade the other control plane nodes with kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.25.3 (upgrade node)
sudo kubeadm upgrade node

Draining the Control Plane Nodes

  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is z-k8s-m-1; the other control plane nodes are similar)

Drain the control plane node (DaemonSets excluded)
kubectl drain z-k8s-m-1 --ignore-daemonsets

Upgrading kubelet and kubectl on the Control Plane Nodes

  • Upgrade kubelet and kubectl:

Upgrade kubelet and kubectl on the control plane node to 1.25.3
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.25.3-00 kubectl=1.25.3-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the control plane node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded control plane node back online and make it schedulable again (the example below is z-k8s-m-1; the other control plane nodes are similar):

Uncordon the control plane node and bring it back online
kubectl uncordon z-k8s-m-1

Repeat the "drain the control plane node" and "upgrade kubelet and kubectl" steps above on the remaining control plane nodes.

Upgrading the Worker Nodes

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.25.3
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.25.3-00 && \
apt-mark hold kubeadm
  • Run kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.25.3 (upgrade node)
sudo kubeadm upgrade node
  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is worker node z-k8s-n-1)

Drain the worker node (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets --delete-emptydir-data
  • Upgrade kubelet and kubectl in the same way:

Upgrade kubelet and kubectl on the worker node to 1.25.3
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.25.3-00 kubectl=1.25.3-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the worker node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded worker node back online and make it schedulable again (the example below is z-k8s-n-1; the other worker nodes are similar):

Uncordon the worker node and bring it back online
kubectl uncordon z-k8s-n-1

After all nodes have been upgraded, checking with kubectl get nodes -o wide shows every node uniformly at version 1.25.3:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   115d   v1.25.3   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   114d   v1.25.3   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   114d   v1.25.3   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          114d   v1.25.3   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          114d   v1.25.3   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          114d   v1.25.3   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          114d   v1.25.3   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          114d   v1.25.3   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Failure Recovery

Note

Following the official documentation, I upgraded from 1.24.2 to 1.25.3 without hitting any serious problems, so this section merely summarizes the official documentation for future reference.

kubeadm upgrade

  • If kubeadm upgrade fails and does not roll back, you can run kubeadm upgrade again: the command is idempotent and can be repeated.

  • You can also run kubeadm upgrade apply --force to recover from a bad state; see the example below.
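
For example, if the apply step above had been interrupted, re-running it against the same target version (adding --force if kubeadm refuses to proceed) would be the recovery path:

Re-run an interrupted upgrade (example using the 1.25.3 target from above)
# re-run the idempotent apply first
sudo kubeadm upgrade apply v1.25.3
# if it refuses to proceed, force it to recover from the bad state
sudo kubeadm upgrade apply v1.25.3 --force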

Data Backup

During an upgrade, if the cluster uses the built-in (stacked) etcd distributed key-value store, kubeadm backs up the etcd data under the /etc/kubernetes/tmp directory:

kubeadm-backup-etcd-<date>-<time>
kubeadm-backup-manifests-<date>-<time>
  • If the etcd upgrade fails and cannot be rolled back, the contents of the kubeadm-backup-etcd-<date>-<time> folder above can be copied into /var/lib/etcd to restore manually. For an external etcd this directory is empty.

  • kubeadm-backup-manifests-<date>-<time> is the backup of the current control plane node's static Pod manifests; its contents can be copied back into /etc/kubernetes/manifests to restore manually. A rough sketch of such a restore follows below.
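
A rough sketch of such a manual restore, kept deliberately generic (the timestamped directory names must be filled in from /etc/kubernetes/tmp, the backup layout should be inspected before copying, and kubelet is stopped so the static Pods are not restarted mid-restore):

Manual restore from the kubeadm backups (sketch; adjust the timestamped paths)
sudo systemctl stop kubelet
# inspect the backup layout, then copy the etcd data back under /var/lib/etcd
ls /etc/kubernetes/tmp/kubeadm-backup-etcd-<date>-<time>/
sudo cp -a /etc/kubernetes/tmp/kubeadm-backup-etcd-<date>-<time>/. /var/lib/etcd/
# restore the previous static Pod manifests
sudo cp /etc/kubernetes/tmp/kubeadm-backup-manifests-<date>-<time>/*.yaml /etc/kubernetes/manifests/
sudo systemctl start kubelet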

References