Upgrading a Kubernetes Cluster to 1.25 with kubeadm

Preparation

Operating System Upgrade (Optional)

Note

Strictly speaking, the operating system upgrade has nothing to do with the Kubernetes upgrade, but since these hosts are the base runtime environment of my private cloud architecture, I want to stay on the latest Ubuntu Linux LTS to get the most out of the hardware and software. This mirrors upgrading the Ubuntu operating system underneath Ceph to 22.04.

The do-release-upgrade command checks the free space on the root filesystem and aborts automatically if there is not enough, so, just as when expanding a VM disk on a libvirt LVM volume, the virtual machine's root disk needs to be enlarged first. Note, however, that the Kubernetes cluster VMs were cloned from Ceph RBD images, so the method changes to online expansion of the Ceph RBD device with libvirt and XFS:

  • Resize the RBD image to 16GB ( 1024x16=16384 ) and refresh the VM disk with virsh blockresize:

rbd resize adjusts the RBD block device image size, virsh blockresize adjusts the VM's vda size
rbd resize --size 16384 libvirt-pool/z-k8s-m-1
virsh blockresize --domain z-k8s-m-1 --path vda --size 16G
  • Log in to the VM and run growpart and xfs_growfs to resize the partition and the filesystem:

Expand the root filesystem inside the VM with growpart and xfs_growfs
# install growpart
apt install cloud-guest-utils
# grow partition 2
growpart /dev/vda 2
# grow the XFS root filesystem
xfs_growfs /
  • do-release-upgrade checks the versions of every installed package; because kubeadm, kubectl and kubelet are held back from upgrading, it reports:

    Checking for a new Ubuntu release
    Please install all available updates for your release before upgrading.
    

So temporarily move the repository configuration out of the way, and restore it after the OS upgrade to continue with the Kubernetes upgrade:

mv /etc/apt/sources.list.d/kubernetes.list ~/
Run the Ubuntu release upgrade
sudo apt update && sudo apt upgrade -y
sudo apt autoremove -y
sudo reboot
sudo apt install update-manager-core
sudo do-release-upgrade -d
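
After the reboot that follows the release upgrade, a quick check confirms the new release before restoring the Kubernetes repository (a minimal sketch; it assumes the upgrade targeted Ubuntu 22.04):

Verify the release after do-release-upgrade
lsb_release -a    # the Description line should now read Ubuntu 22.04.x LTS
uname -r          # confirm the VM is running the new kernel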

Note

Make absolutely sure that apt hold has been used to pin the host's Kubernetes-related packages; otherwise the release upgrade can leave the Kubernetes cluster in an unpredictable, broken state.
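
A quick way to double-check the holds before starting the release upgrade, for example:

Verify that the Kubernetes packages are held by apt
apt-mark showhold
# expected to list: kubeadm kubectl kubelet
# if they are missing, pin them before upgrading the OS:
apt-mark hold kubeadm kubectl kubelet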

Note

It is recommended to log in to the VM through virsh console to run the OS upgrade. Upgrading over ssh also works, but the upgrade may drop the ssh session and the VM can become temporarily unreachable; the upgrade itself runs inside screen, so a dropped ssh session does not abort it, but recovering is cumbersome and you end up going through virsh console anyway.
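
For example, from the libvirt host (using this cluster's VM names):

Attach to the VM serial console from the libvirt host
virsh console z-k8s-m-1
# detach from the console with Ctrl+]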

Note

The Debian-family package repository provided by the Kubernetes project is always pinned to xenial, i.e. Ubuntu Linux 16.04 LTS. This is presumably to ensure the widest possible compatibility.
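
For reference, the repository entry that gets moved aside and later restored typically looks like the line below (an assumption based on the standard apt.kubernetes.io setup; adjust if your entry differs):

Typical content of /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main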

  • I upgraded the three control plane VMs first, then the worker nodes; after everything was done I made sure all VMs had been rebooted, and then checked the nodes with kubectl get nodes:

    kubectl get nodes -o wide
    

The output looks similar to:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   114d   v1.24.2   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   112d   v1.24.2   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   112d   v1.24.2   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          112d   v1.24.2   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          112d   v1.24.2   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          112d   v1.24.2   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          112d   v1.24.2   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          112d   v1.24.2   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.4.0-131-generic   containerd://1.6.6

Upgrading the Cluster (1.24.2) to the Latest Patch Release (1.24.7)

  • The operating system upgrade described above has been completed

  • Restore the kubernetes repository configuration and refresh the package index:

    mv ~/kubernetes.list /etc/apt/sources.list.d/
    apt update
    
  • List all available Kubernetes versions to determine the upgrade target:

    apt-cache madison kubeadm
    

The target versions we are going to upgrade to are visible:

kubeadm |  1.25.3-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages
...
kubeadm |  1.24.7-00 | https://apt.kubernetes.io kubernetes-xenial/main amd64 Packages
...

Upgrading the Control Plane Nodes

  • The upgrade should be performed on one control plane node at a time

  • Pick the control plane node to upgrade first: it must have the /etc/kubernetes/admin.conf file (a quick check is shown below)
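
A simple way to confirm this on the candidate node, for example:

Confirm admin.conf exists on the node chosen for the first upgrade
ls -l /etc/kubernetes/admin.conf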

Run kubeadm upgrade

On the first control plane node z-k8s-m-1

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.24.7 (latest patch release of the current minor version)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.7-00 && \
apt-mark hold kubeadm

Note

Although upgrading kubeadm on a single control plane node is enough to drive the whole cluster upgrade, for consistency I upgrade kubeadm on the other control plane nodes as well.

  • Verify the kubeadm version:

    kubeadm version
    

Output:

kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7", GitCommit:"e6f35974b08862a23e7f4aad8e5d7f7f2de26c15", GitTreeState:"clean", BuildDate:"2022-10-12T10:55:41Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
  • Verify the upgrade plan:

Verify the upgrade plan with kubeadm
kubeadm upgrade plan

The output is shown below; because I run Cilium as a complete kube-proxy replacement, the kube-proxy component listed here has to be handled separately:

Output of kubeadm upgrade plan for 1.24.7
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.24.3
[upgrade/versions] kubeadm version: v1.24.7
I1109 12:40:57.517112  336232 version.go:255] remote version is much newer: v1.25.3; falling back to: stable-1.24
[upgrade/versions] Target version: v1.24.7
[upgrade/versions] Latest version in the v1.24 series: v1.24.7

W1109 12:40:59.174039  336232 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" not found
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       TARGET
kubelet     8 x v1.24.2   v1.24.7

Upgrade to the latest version in the v1.24 series:

COMPONENT                 CURRENT   TARGET
kube-apiserver            v1.24.3   v1.24.7
kube-controller-manager   v1.24.3   v1.24.7
kube-scheduler            v1.24.3   v1.24.7
kube-proxy                v1.24.3   v1.24.7
CoreDNS                   v1.8.6    v1.8.6

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.24.7

_____________________________________________________________________


The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.

API GROUP                 CURRENT VERSION   PREFERRED VERSION   MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io   -                 v1alpha1            no
kubelet.config.k8s.io     v1beta1           v1beta1             no
_____________________________________________________________________
  • Upgrade the first control plane node, specifying target version 1.24.7:

Upgrade the Kubernetes components on the first control plane node to 1.24.7 (latest patch release of the current minor version)
sudo kubeadm upgrade apply v1.24.7

Upgrade output:

Output of upgrading the control plane node's Kubernetes components to 1.24.7 (including the interactive prompt)
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1109 15:11:55.826367  402890 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" not found
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.24.7"
[upgrade/versions] Cluster version: v1.24.3
[upgrade/versions] kubeadm version: v1.24.7
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.24.7" (timeout: 5m0s)...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests1416038386"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-13-15/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade/postupgrade] Removing the deprecated label node-role.kubernetes.io/master='' from all control plane Nodes. After this step only the label node-role.kubernetes.io/control-plane='' will be present on control plane Nodes.
[upgrade/postupgrade] Adding the new taint &Taint{Key:node-role.kubernetes.io/control-plane,Value:,Effect:NoSchedule,TimeAdded:<nil>,} to all control plane Nodes. After this step both taints &Taint{Key:node-role.kubernetes.io/control-plane,Value:,Effect:NoSchedule,TimeAdded:<nil>,} and &Taint{Key:node-role.kubernetes.io/master,Value:,Effect:NoSchedule,TimeAdded:<nil>,} should be present on control plane Nodes.
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
W1109 15:14:23.647437  402890 postupgrade.go:152] the ConfigMap "kube-proxy" in the namespace "kube-system" was not found. Assuming that kube-proxy was not deployed for this cluster. Note that once 'kubeadm upgrade apply' supports phases you will have to skip the kube-proxy upgrade manually

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.24.7". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
  • Manually upgrade the CNI plugin, in this case the Cilium network (I am deferring this step until Cilium publishes a new stable release, so it stays unchanged for now); a sketch of what it would look like follows below
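
When that Cilium upgrade does happen, it goes through Cilium's own tooling rather than kubeadm. A minimal sketch, assuming Cilium was installed from the official Helm chart into kube-system (the release name and the target version below are placeholders, not values taken from this cluster):

Hypothetical Cilium upgrade via Helm (placeholder version)
helm repo update
helm upgrade cilium cilium/cilium \
    --namespace kube-system \
    --reuse-values \
    --version <target-cilium-version>
# check agent and operator health after the rollout
cilium status --wait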

On the other control plane nodes z-k8s-m-2 and z-k8s-m-3

  • The other control plane nodes only need upgrade node rather than upgrade apply:

Upgrade the node's Kubernetes components to 1.24.7 (upgrade node)
sudo kubeadm upgrade node

Upgrade output:

Output of upgrading the other control plane nodes' Kubernetes components to 1.24.7 (upgrade node)
[upgrade] Reading configuration from the cluster...
[upgrade] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1109 15:24:07.586473    6794 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" is forbidden: User "system:node:z-k8s-m-2" cannot get resource "configmaps" in API group "" in the namespace "kube-system": no relationship found between node 'z-k8s-m-2' and this object
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade] Upgrading your Static Pod-hosted control plane instance to version "v1.24.7"...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests703550162"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-09-15-24-50/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade] The control plane instance for this node was successfully updated!
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[upgrade] The configuration for this node was successfully updated!
[upgrade] Now you should go ahead and upgrade the kubelet package using your package manager.

The other control plane nodes do not need kubeadm upgrade plan, and they do not need the CNI plugin update step either.

Draining the Control Plane Nodes

After the control plane images have been upgraded, note that kubelet / kubectl on these nodes has not been upgraded yet, so kubectl get nodes still shows the old VERSION 1.24.2. With the control plane components upgraded, kubelet / kubectl on the control plane nodes can now be upgraded:

  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is z-k8s-m-1; the other control plane nodes are similar)

Drain the control plane node (DaemonSets excluded)
kubectl drain z-k8s-m-1 --ignore-daemonsets

Output:

Output of draining the control plane node (DaemonSets excluded)
node/z-k8s-m-1 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-xhmc5, metallb-system/speaker-8hgcd
node/z-k8s-m-1 drained

Upgrading kubelet and kubectl on the Control Plane Nodes

  • Upgrade kubelet and kubectl:

Upgrade kubelet and kubectl on the control plane node to 1.24.7
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.7-00 kubectl=1.24.7-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the control plane node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded control plane node back online and make it schedulable again (the example below is z-k8s-m-1; the other control plane nodes are similar):

Uncordon the control plane node and bring it back online
kubectl uncordon z-k8s-m-1
  • The upgrade of the first control plane node z-k8s-m-1 is now complete; checking kubectl get nodes shows it has moved to 1.24.7

    kubectl get nodes -o wide
    

Output:

Check output after the first control plane node upgrade
NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   114d   v1.24.7   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   113d   v1.24.2   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   113d   v1.24.2   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          113d   v1.24.2   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          113d   v1.24.2   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          113d   v1.24.2   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          113d   v1.24.2   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          113d   v1.24.2   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Repeat the "drain the control plane node" and "upgrade kubelet and kubectl" steps above on the remaining control plane nodes.

Upgrading the Worker Nodes

With the control plane upgraded to 1.24.7, the worker nodes can be upgraded. Workers should be upgraded one at a time, or a few at a time, without dropping below the minimum capacity the running workloads need.
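
One way to keep track of which workers still need the upgrade is to filter out the control plane nodes and watch the VERSION column, for example:

List only the worker nodes and their kubelet versions
kubectl get nodes -o wide -l '!node-role.kubernetes.io/control-plane'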

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.24.7 (latest patch release of the current minor version)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.7-00 && \
apt-mark hold kubeadm
  • Run kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.24.7 (upgrade node)
sudo kubeadm upgrade node
  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is worker node z-k8s-n-1)

Drain the worker node (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets

The output shows that pods using local storage cannot be evicted; among them is the Cilium-related hubble-ui pod:

Output of draining the worker node (DaemonSets excluded): pods using local storage, including the Cilium hubble-ui pod, cannot be evicted
error: unable to drain node "z-k8s-n-1" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istio-ingressgateway-85cc7b7ccd-zpx42, kube-system/hubble-ui-579fdfbc58-g2lv6, continuing command...
There are pending nodes to be drained:
 z-k8s-n-1
cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istio-ingressgateway-85cc7b7ccd-zpx42, kube-system/hubble-ui-579fdfbc58-g2lv6

Revise the drain command and add the --delete-emptydir-data flag:

Drain the worker node with the --delete-emptydir-data flag (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets --delete-emptydir-data

The output contains some evicting errors (discussed after the listing below):

Output of draining the worker node with --delete-emptydir-data (with evicting errors)
node/z-k8s-n-1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-sl25l, kube-system/otelcol-hubble-collector-46gtj, metallb-system/speaker-xt5zf
evicting pod podinfo/podinfo-frontend-76b9ff9c94-dsl8l
evicting pod cert-manager/cert-manager-7b4f4986bb-jpkr5
evicting pod cert-manager/cert-manager-cainjector-6b9d8b7d57-5fw2r
evicting pod cert-manager/cert-manager-webhook-d7bc6f65d-sl4fv
evicting pod cilium-monitoring/grafana-b96dcb76b-2przz
evicting pod cilium-test/echo-other-node-d79544ccf-2mxr9
evicting pod default/my-nginx-df7bbf6f5-6gndk
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
evicting pod jaeger/jaeger-default-fbc6fd6fd-p88nx
evicting pod jaeger/jaeger-operator-67dcc96554-q4ccj
evicting pod kube-system/hubble-relay-84b4ddb556-z86lj
evicting pod kube-system/hubble-ui-579fdfbc58-g2lv6
evicting pod opentelemetry-operator-system/opentelemetry-operator-controller-manager-696c488948-mkwph
evicting pod podinfo/podinfo-backend-595c9bd9c7-7c2vv
evicting pod podinfo/podinfo-backend-595c9bd9c7-bbplt
evicting pod podinfo/podinfo-client-5b9bb6b9cd-7xh76
evicting pod podinfo/podinfo-client-5b9bb6b9cd-fv5zq
evicting pod podinfo/podinfo-frontend-76b9ff9c94-7gt9q
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1109 16:55:32.209267   33920 request.go:682] Waited for 1.183463048s due to client-side throttling, not priority and fairness, request: POST:https://z-k8s-api.staging.huatai.me:6443/api/v1/namespaces/podinfo/pods/podinfo-client-5b9bb6b9cd-7xh76/eviction
pod/jaeger-operator-67dcc96554-q4ccj evicted
pod/cert-manager-webhook-d7bc6f65d-sl4fv evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/cert-manager-7b4f4986bb-jpkr5 evicted
pod/my-nginx-df7bbf6f5-6gndk evicted
pod/hubble-ui-579fdfbc58-g2lv6 evicted
pod/grafana-b96dcb76b-2przz evicted
pod/cert-manager-cainjector-6b9d8b7d57-5fw2r evicted
pod/opentelemetry-operator-controller-manager-696c488948-mkwph evicted
pod/podinfo-backend-595c9bd9c7-7c2vv evicted
pod/podinfo-backend-595c9bd9c7-bbplt evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/podinfo-frontend-76b9ff9c94-dsl8l evicted
pod/hubble-relay-84b4ddb556-z86lj evicted
I1109 16:55:42.330041   33920 request.go:682] Waited for 1.5766885s due to client-side throttling, not priority and fairness, request: GET:https://z-k8s-api.staging.huatai.me:6443/api/v1/namespaces/podinfo/pods/podinfo-frontend-76b9ff9c94-7gt9q
pod/podinfo-frontend-76b9ff9c94-7gt9q evicted
pod/podinfo-client-5b9bb6b9cd-fv5zq evicted
pod/podinfo-client-5b9bb6b9cd-7xh76 evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/jaeger-default-fbc6fd6fd-p88nx evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/echo-other-node-d79544ccf-2mxr9 evicted
evicting pod istio-system/istio-ingressgateway-85cc7b7ccd-zpx42
error when evicting pods/"istio-ingressgateway-85cc7b7ccd-zpx42" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
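
The repeated failures for istio-ingressgateway come from its PodDisruptionBudget: evicting the pod would exceed the budget's allowed disruption, typically because too few replicas are running. A sketch of how to inspect (and, if acceptable, temporarily work around) this, rather than what I actually did here:

Inspect the PodDisruptionBudget that blocks the eviction (sketch)
kubectl get pdb -n istio-system
kubectl describe pdb -n istio-system
# if acceptable, temporarily add a replica so the eviction can complete
# (deployment name inferred from the evicted pod above; scale back afterwards)
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=2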
  • Upgrade kubelet and kubectl in the same way:

Upgrade kubelet and kubectl on the worker node to 1.24.7
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.7-00 kubectl=1.24.7-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the worker node (1.24.7)
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded worker node back online and make it schedulable again (the example below is z-k8s-n-1; the other worker nodes are similar):

Uncordon the worker node and bring it back online
kubectl uncordon z-k8s-n-1

After all nodes have been upgraded, checking with kubectl get nodes -o wide shows every node uniformly at version 1.24.7:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   115d   v1.24.7   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   113d   v1.24.7   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   113d   v1.24.7   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          113d   v1.24.7   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          113d   v1.24.7   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          113d   v1.24.7   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          113d   v1.24.7   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          113d   v1.24.7   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Upgrading the Cluster (1.24.7) to the Latest Release (1.25.3)

With the 1.24.2 to 1.24.7 patch upgrade above completed, the prerequisites are in place to move up a minor version to the latest release, 1.25.3.

The procedure is exactly the same, only the target version changes to 1.25.3; the process is recorded below.

Run kubeadm upgrade

On the first control plane node z-k8s-m-1

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.25.3 (latest release)
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.25.3-00 && \
apt-mark hold kubeadm
  • Verify the kubeadm version:

    kubeadm version
    
  • Verify the upgrade plan:

Verify the upgrade plan with kubeadm
kubeadm upgrade plan

If nothing unusual is reported, proceed.

  • Upgrade the first control plane node, specifying target version 1.25.3:

Upgrade the Kubernetes components on the first control plane node to 1.25.3 (latest release)
sudo kubeadm upgrade apply v1.25.3

On the other control plane nodes z-k8s-m-2 and z-k8s-m-3

  • Upgrade the other control plane nodes with kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.25.3 (upgrade node)
sudo kubeadm upgrade node

Draining the Control Plane Nodes

  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is z-k8s-m-1; the other control plane nodes are similar)

Drain the control plane node (DaemonSets excluded)
kubectl drain z-k8s-m-1 --ignore-daemonsets

Upgrading kubelet and kubectl on the Control Plane Nodes

  • Upgrade kubelet and kubectl:

Upgrade kubelet and kubectl on the control plane node to 1.25.3
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.25.3-00 kubectl=1.25.3-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the control plane node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded control plane node back online and make it schedulable again (the example below is z-k8s-m-1; the other control plane nodes are similar):

Uncordon the control plane node and bring it back online
kubectl uncordon z-k8s-m-1

Repeat the "drain the control plane node" and "upgrade kubelet and kubectl" steps above on the remaining control plane nodes.

Upgrading the Worker Nodes

  • Upgrade kubeadm:

Upgrade kubeadm on the node to 1.25.3
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.25.3-00 && \
apt-mark hold kubeadm
  • Run kubeadm upgrade:

Upgrade the node's Kubernetes components to 1.25.3 (upgrade node)
sudo kubeadm upgrade node
  • Mark the node unschedulable and evict all workloads to prepare it for maintenance (the example below is worker node z-k8s-n-1)

Drain the worker node (DaemonSets excluded)
kubectl drain z-k8s-n-1 --ignore-daemonsets --delete-emptydir-data
  • Upgrade kubelet and kubectl in the same way:

Upgrade kubelet and kubectl on the worker node to 1.25.3
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.25.3-00 kubectl=1.25.3-00 && \
apt-mark hold kubelet kubectl
  • Restart kubelet:

Restart kubelet on the worker node
sudo systemctl daemon-reload
sudo systemctl restart kubelet
  • Bring the upgraded worker node back online and make it schedulable again (the example below is z-k8s-n-1; the other worker nodes are similar):

Uncordon the worker node and bring it back online
kubectl uncordon z-k8s-n-1

After all nodes have been upgraded, checking with kubectl get nodes -o wide shows every node uniformly at version 1.25.3:

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
z-k8s-m-1   Ready    control-plane   115d   v1.25.3   192.168.6.101   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-2   Ready    control-plane   114d   v1.25.3   192.168.6.102   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-m-3   Ready    control-plane   114d   v1.25.3   192.168.6.103   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-1   Ready    <none>          114d   v1.25.3   192.168.6.111   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-2   Ready    <none>          114d   v1.25.3   192.168.6.112   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-3   Ready    <none>          114d   v1.25.3   192.168.6.113   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-4   Ready    <none>          114d   v1.25.3   192.168.6.114   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6
z-k8s-n-5   Ready    <none>          114d   v1.25.3   192.168.6.115   <none>        Ubuntu 22.04.1 LTS   5.15.0-52-generic   containerd://1.6.6

Failure Recovery

Note

Following the official documentation, I upgraded from 1.24.2 to 1.25.3 without hitting any serious problems, so this section merely summarizes the official documentation for future reference.

kubeadm upgrade

  • If kubeadm upgrade fails and does not roll back, you can run kubeadm upgrade again: the command is idempotent and can be repeated.

  • You can also run kubeadm upgrade apply --force to recover from a bad state; see the example below.
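
For example, if the apply step above had been interrupted, re-running it against the same target version (adding --force if kubeadm refuses to proceed) would be the recovery path:

Re-run an interrupted upgrade (example using the 1.25.3 target from above)
# re-run the idempotent apply first
sudo kubeadm upgrade apply v1.25.3
# if it refuses to proceed, force it to recover from the bad state
sudo kubeadm upgrade apply v1.25.3 --force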

Data Backup

During an upgrade, if the cluster uses the built-in (stacked) etcd distributed key-value store, kubeadm backs up the etcd data under the /etc/kubernetes/tmp directory:

kubeadm-backup-etcd-<date>-<time>
kubeadm-backup-manifests-<date>-<time>
  • If the etcd upgrade fails and cannot be rolled back, the contents of the kubeadm-backup-etcd-<date>-<time> folder above can be copied into /var/lib/etcd to restore manually. For an external etcd this directory is empty.

  • kubeadm-backup-manifests-<date>-<time> is the backup of the current control plane node's static Pod manifests; its contents can be copied back into /etc/kubernetes/manifests to restore manually. A rough sketch of such a restore follows below.
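
A rough sketch of such a manual restore, kept deliberately generic (the timestamped directory names must be filled in from /etc/kubernetes/tmp, the backup layout should be inspected before copying, and kubelet is stopped so the static Pods are not restarted mid-restore):

Manual restore from the kubeadm backups (sketch; adjust the timestamped paths)
sudo systemctl stop kubelet
# inspect the backup layout, then copy the etcd data back under /var/lib/etcd
ls /etc/kubernetes/tmp/kubeadm-backup-etcd-<date>-<time>/
sudo cp -a /etc/kubernetes/tmp/kubeadm-backup-etcd-<date>-<time>/. /var/lib/etcd/
# restore the previous static Pod manifests
sudo cp /etc/kubernetes/tmp/kubeadm-backup-manifests-<date>-<time>/*.yaml /etc/kubernetes/manifests/
sudo systemctl start kubelet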

References