排查已经被删除pods(重启)

在kubernetes集群中, deployments 的pods在异常crash时会自动恢复,不过我们还是需要排查原先pods为何会crash:

kubectl -n kube-system get pods -o wide| grep scheduler

显示:

scheduler-86c895bdc9-b7rnr                 1/1     Running   3          65d     192.168.37.200    z-k8s-m-1  <none>           <none>
scheduler-86c895bdc9-ng4gw                 1/1     Running   5          65d     192.168.37.192    z-k8s-m-2  <none>           <none>
scheduler-86c895bdc9-w8mrh                 1/1     Running   4          60d     192.168.37.231    z-k8s-m-3  <none>           <none>
  • 要知道哪个pod重启了,可以通过查看指定 namespace 的events:

    kubectl -n kube-system get event
    

会看到最近 1 小时事件:

LAST SEEN   TYPE      REASON                 OBJECT                                   MESSAGE
7m9s        Warning   Unhealthy              pod/apiserver-65cc4bc696-7grkn           Readiness probe failed: Get https://192.168.37.203:6443/readyz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
59m         Warning   Unhealthy              pod/apiserver-65cc4bc696-kndsn           Readiness probe failed: Get https://192.168.37.158:6443/readyz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
29m         Normal    Pulling                pod/scheduler-756b6fcfc7-9wxww   Pulling image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.5_20211026111044_216c6766"
29m         Normal    Pulled                 pod/scheduler-756b6fcfc7-9wxww   Successfully pulled image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.5_20211026111044_216c6766"
29m         Normal    Created                pod/scheduler-756b6fcfc7-9wxww   Created container scheduler
29m         Normal    Started                pod/scheduler-756b6fcfc7-9wxww   Started container scheduler
29m         Normal    WithOutPostStartHook   pod/scheduler-756b6fcfc7-9wxww   Container scheduler with out poststart hook
29m         Normal    Pulling                pod/scheduler-86c895bdc9-ng4gw   Pulling image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.4_20210928203943_53875ece"
29m         Normal    Pulled                 pod/scheduler-86c895bdc9-ng4gw   Successfully pulled image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.4_20210928203943_53875ece"
29m         Normal    Created                pod/scheduler-86c895bdc9-ng4gw   Created container scheduler
29m         Normal    Started                pod/scheduler-86c895bdc9-ng4gw   Started container scheduler
29m         Normal    WithOutPostStartHook   pod/scheduler-86c895bdc9-ng4gw   Container scheduler with out poststart hook
34m         Normal    Pulling                pod/scheduler-86c895bdc9-w8mrh   Pulling image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.4_20210928203943_53875ece"
34m         Normal    Pulled                 pod/scheduler-86c895bdc9-w8mrh   Successfully pulled image "reg.docker.huatai.me/k8s/scheduler:release-v1.2.4_20210928203943_53875ece"
34m         Normal    Created                pod/scheduler-86c895bdc9-w8mrh   Created container scheduler
34m         Normal    Started                pod/scheduler-86c895bdc9-w8mrh   Started container scheduler
34m         Normal    WithOutPostStartHook   pod/scheduler-86c895bdc9-w8mrh   Container scheduler with out poststart hook

要过滤出指定pods,一种方法是直接使用 grep 命令,例如 kubectl get events | grep scheduler-86c895bdc9-w8mrh 。另外一种方法是采用标准的 field-selector

  • 首先找出 field

    kubectl -n kube-system get events --output json
    

可以看到:

...
{
   "apiVersion": "v1",
   ...
   "involvedObject": {
       ...
       "name": "scheduler-86c895bdc9-w8mrh",
       ...
   }
}
  • 可以知道 field-selector 就是 involvedObject.name ,所以查询命令:

    kubectl -n kube-system get events --field-selector involvedObject.name=scheduler-86c895bdc9-w8mrh
    

效果和使用 grep 相同

但是,这里存在一个问题,就是你不知道重启前,被替换掉的pod的名字,所以你就很难查看之前pods如何crash掉的。

  • 以前版本有一个查询 -a 参数:

    kubectl -n kube-system get pods -a
    

类似于 docker ps -a 可以查看已经清理掉掉pods,不过,近期版本已经不再支持,需要采用命令:

kubectl -n kube-system get event -o custom-columns=NAME:.metadata.name | cut -d "." -f1

但是,很不幸,这个命令只输出最近一个小时的时间。对于我上述案例,最近一小时并没有包含之前创建pods的情况记录,所以也找不到对应pods名字。

不过,对于管控pods,调度的 scheduler 只在限定的节点 ( z-k8s-m-1z-k8s-m-3 ),所以还是可以找到对于pods上的记录和日志

注解

获取pod重启时间 用于找出最近重启的应用服务器(重建)

参考