Ceph OSD "slow ops" errors after a hard shutdown and reboot

The problem

The private cloud data layer (ZData Ceph) in my private cloud architecture runs on three Open Virtual Machine Firmware (OVMF) virtual machines. Before rebooting the physical server, I ran virsh shutdown XXX against the three VMs in quick succession, so the three Ceph VMs were shut down almost simultaneously. Since the shutdown was not initiated from inside each guest operating system, this kind of virsh shutdown may not have been a clean shutdown. After the server was rebooted, the three Ceph VMs started normally, but another VM that uses the Ceph storage came up paused:

 Id   Name         State
----------------------------
 1    z-b-data-2   running
 2    z-b-data-3   running
 3    z-b-data-1   running
 4    z-k8s-m-1    paused
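
For a guest stuck in paused, libvirt can usually report why it was suspended; in my experience QEMU pauses a guest when its virtual disk (here an RBD-backed volume on the not-yet-ready Ceph cluster) returns I/O errors. A minimal check, sketched with the guest name from the listing above:

# ask libvirt why the domain is paused (e.g. "paused (I/O error)")
virsh domstate z-k8s-m-1 --reason

# list any block-device errors libvirt has recorded for this guest
virsh domblkerror z-k8s-m-1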

Running ceph -s in the Ceph VM cluster shows a warning:

cluster:
  id:     0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
  health: HEALTH_WARN
          Reduced data availability: 33 pgs inactive, 33 pgs peering
          5 slow ops, oldest one blocked for 1925 sec, daemons [osd.1,osd.2] have slow ops.

services:
  mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 11m)
  mgr: z-b-data-1(active, since 32m)
  osd: 3 osds: 3 up (since 31m), 3 in (since 4w)

data:
  pools:   2 pools, 33 pgs
  objects: 44.79k objects, 174 GiB
  usage:   517 GiB used, 880 GiB / 1.4 TiB avail
  pgs:     100.000% pgs not active
           33 peering

This indicates that something is wrong with the OSDs.
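
To see what is actually blocked behind the "slow ops" counter, Ceph can report details both cluster-wide and per daemon. A sketch of the checks I would normally run next (the ceph daemon command uses the admin socket, so it has to be executed on the node that hosts osd.1):

# expand HEALTH_WARN into per-check detail, including which OSDs report slow ops
ceph health detail

# on the node hosting osd.1: dump the operations currently in flight on that daemon
ceph daemon osd.1 dump_ops_in_flight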

The dmesg -T output also contains a hint:

[Tue Jan 10 00:41:47 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.

So it seems the way the Ceph processes are terminated on shutdown could be improved?
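
The message itself comes from systemd, because the packaged ceph-volume@.service sets KillMode=none. I have not verified whether overriding this is safe for Ceph's OSD activation flow, so the following is only a sketch of what inspecting and overriding it would look like, not something I applied:

# show the shipped unit file, including the KillMode= line systemd is warning about
systemctl cat ceph-volume@.service

# create a drop-in override (opens an editor); the drop-in would contain:
#   [Service]
#   KillMode=mixed
systemctl edit ceph-volume@.service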

Troubleshooting

Since it was late at night and I had no immediate solution, I left the investigation for the next day. By then, the VM that had previously been paused had already returned to the running state:

The previously paused VM is now running
$ virsh list
 Id   Name         State
----------------------------
 1    z-b-data-2   running
 2    z-b-data-3   running
 3    z-b-data-1   running
 4    z-k8s-m-1    running
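
The guest apparently resumed on its own once the Ceph pool became reachable again. Had it stayed paused, my understanding is that it can also be resumed manually after the backing storage recovers; a one-line sketch using the same guest name:

# resume a guest that QEMU suspended on a storage I/O error, once the RBD pool is back
virsh resume z-k8s-m-1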

At this point, ceph -s showed a more serious warning (objects degraded):

Ceph cluster degraded (objects degraded)
  cluster:
    id:     0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
    health: HEALTH_WARN
            Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized

  services:
    mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 9h)
    mgr: z-b-data-1(active, since 9h)
    osd: 3 osds: 2 up (since 4h), 2 in (since 8h)

  data:
    pools:   2 pools, 33 pgs
    objects: 44.79k objects, 174 GiB
    usage:   345 GiB used, 586 GiB / 931 GiB avail
    pgs:     44788/134364 objects degraded (33.333%)
             33 active+undersized+degraded

  io:
    client:   0 B/s rd, 44 KiB/s wr, 0 op/s rd, 5 op/s wr

You can see that one OSD is now down:

(Figure ../../_images/ceph_degraded.png: Ceph dashboard showing the cluster degraded with one OSD down)

Checking the Ceph logs shows:

Ceph cluster log showing "33 pgs: 33 active+undersized+degraded"
1/10/23 11:00:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:50:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:40:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:35:18 AM [DBG] pgmap v17785: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 76 KiB/s wr, 11 op/s; 44788/134364 objects degraded (33.333%)
1/10/23 10:35:16 AM [DBG] pgmap v17784: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 16 KiB/s wr, 2 op/s; 44788/134364 objects degraded (33.333%)
1/10/23 10:35:14 AM [DBG] pgmap v17783: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 18 KiB/s wr, 2 op/s; 44788/134364 objects degraded (33.333%)
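
Before drilling into individual daemons, it helps to confirm which placement groups are affected and get a one-line OSD summary; the steps below then narrow this down to a specific OSD. A sketch of the commands I would reach for here:

# list the PGs stuck in the undersized/degraded states reported above
ceph pg dump_stuck undersized degraded

# one-line summary of how many OSDs are up/in
ceph osd stat
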
  • First, find out which OSD node is down:

Check OSD status with the ceph osd tree command
# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         1.36409  root default
-3         0.45470      host z-b-data-1
 0    ssd  0.45470          osd.0            up   1.00000  1.00000
-5         0.45470      host z-b-data-2
 1    ssd  0.45470          osd.1          down         0  1.00000
-7         0.45470      host z-b-data-3
 2    ssd  0.45470          osd.2            up   1.00000  1.00000

We can see that osd.1 on node 2 is down.

  • Then check the storage space on the servers:

Check OSD storage space with the ceph osd df command
# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL    %USE   VAR   PGS  STATUS
 0    ssd  0.45470   1.00000  466 GiB  172 GiB  172 GiB   8 KiB  542 MiB  293 GiB  37.03  1.00   33      up
 1    ssd  0.45470         0      0 B      0 B      0 B     0 B      0 B      0 B      0     0    0    down
 2    ssd  0.45470   1.00000  466 GiB  172 GiB  172 GiB   8 KiB  560 MiB  293 GiB  37.04  1.00   33      up
                       TOTAL  931 GiB  345 GiB  344 GiB  17 KiB  1.1 GiB  586 GiB  37.03
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.00

We can see that the storage on node 2 is no longer accessible.

  • Checking the service on node 2 shows that the osd service is still in the running state (a restart-in-place sketch follows this list):

systemctl shows the osd service as running
# systemctl status ceph-osd@1.service
ceph-osd@1.service - Ceph object storage daemon osd.1
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Tue 2023-01-10 00:41:45 CST; 10h ago
    Process: 561 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
   Main PID: 570 (ceph-osd)
      Tasks: 76
     Memory: 817.9M
        CPU: 1min 34.125s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
             └─570 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph

Jan 10 01:32:47 z-b-data-2 ceph-osd[570]: 2023-01-10T01:32:47.775+0800 7fa7e2e26640 -1 osd.1 2341 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.1324143.0:1 1.0 1.4e660c0c (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e2337)
Jan 10 01:32:48 z-b-data-2 ceph-osd[570]: 2023-01-10T01:32:48.727+0800 7fa7e2e26640 -1 osd.1 2341 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.1324143.0:1 1.0 1.4e660c0c (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e2337)
...
  • Check /var/log/ceph/osd.log on node 2:

The osd.log on node 2
** File Read Latency Histogram By Level [default] **

2023-01-10T00:42:07.600+0800 7fa801e5e5c0  1 bluestore(/var/lib/ceph/osd/ceph-1) _upgrade_super from 4, latest 4
2023-01-10T00:42:07.600+0800 7fa801e5e5c0  1 bluestore(/var/lib/ceph/osd/ceph-1) _upgrade_super done
2023-01-10T00:42:07.608+0800 7fa801e5e5c0  0 <cls> ./src/cls/cephfs/cls_cephfs.cc:201: loading cephfs
2023-01-10T00:42:07.612+0800 7fa801e5e5c0  0 <cls> ./src/cls/hello/cls_hello.cc:316: loading cls_hello
2023-01-10T00:42:07.612+0800 7fa801e5e5c0  0 _get_class not permitted to load lua
2023-01-10T00:42:07.636+0800 7fa801e5e5c0  0 _get_class not permitted to load sdk
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  0 _get_class not permitted to load test_remote_reads
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  1 osd.1 2331 warning: got an error loading one or more classes: (1) Operation not permitted
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  0 osd.1 2331 crush map has features 288514051259236352, adjusting msgr requires for clients
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  0 osd.1 2331 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  0 osd.1 2331 crush map has features 3314933000852226048, adjusting msgr requires for osds
2023-01-10T00:42:07.644+0800 7fa801e5e5c0  1 osd.1 2331 check_osdmap_features require_osd_release unknown -> quincy

The log does not show much that is abnormal, but on the dashboard the OSDs page shows that the OSD on node 2 is in the out, down state:

(Figure ../../_images/ceph_osd_out_down.png: the dashboard OSDs page showing osd.1 on node 2 as out, down)
  • I also checked node 2's system logs and found nothing abnormal, and the physical server's dmesg log showed no hardware errors either.
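
Since the ceph-osd daemon on node 2 was still running while the cluster map reported osd.1 as out and down, a less disruptive option than rebooting the whole VM would probably have been to restart just that daemon and watch whether it re-peers. I did not try this at the time; a sketch, to be run on node 2:

# ask the daemon itself what state it believes it is in (via the admin socket)
ceph daemon osd.1 status

# restart only the affected OSD daemon and let it rejoin the cluster
systemctl restart ceph-osd@1.service

# follow the cluster's reaction in real time
ceph -w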

Recovery

Although I had not found an obvious error, node 2 was clearly down and its data was by now about 4 hours behind, so I shut down node 2's operating system and then started that VM again.
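
In terms of commands this amounts to roughly the following (a sketch rather than a transcript; this time the guest is shut down cleanly from inside):

# inside the z-b-data-2 guest: clean operating-system shutdown
shutdown -h now

# on the physical host, once the domain shows as shut off: start it again
virsh start z-b-data-2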

As soon as node 2 came back up, Ceph immediately started repairing and resynchronizing the data:

After restarting Ceph node 2, data resynchronization and repair begins
1/10/23 11:10:00 AM [INF] overall HEALTH_OK
1/10/23 11:07:14 AM [INF] Cluster is now healthy
1/10/23 11:07:14 AM [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 19/134364 objects degraded (0.014%), 6 pgs degraded)
1/10/23 11:07:12 AM [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
1/10/23 11:07:10 AM [WRN] Health check update: Degraded data redundancy: 25/134364 objects degraded (0.019%), 8 pgs degraded (PG_DEGRADED)
1/10/23 11:07:04 AM [WRN] Health check update: Degraded data redundancy: 2793/134364 objects degraded (2.079%), 15 pgs degraded, 3 pgs undersized (PG_DEGRADED)
1/10/23 11:07:04 AM [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
1/10/23 11:07:00 AM [INF] osd.1 [v2:192.168.6.205:6802/565,v1:192.168.6.205:6803/565] boot
1/10/23 11:06:55 AM [WRN] pg 2.1f is stuck undersized for 4h, current state active+undersized+degraded, last acting [0,2]
1/10/23 11:06:55 AM [WRN] pg 2.1e is stuck undersized for 4h, current state active+undersized+degraded, last acting [0,2]
1/10/23 11:06:55 AM [WRN] pg 2.1d is stuck undersized for 4h, current state active+undersized+degraded, last acting [2,0]

The Ceph cluster then returned to a healthy state:

After restarting Ceph node 2, Ceph is healthy again
# ceph -s
  cluster:
    id:     0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 8m)
    mgr: z-b-data-1(active, since 10h)
    osd: 3 osds: 3 up (since 8m), 3 in (since 8m)

  data:
    pools:   2 pools, 33 pgs
    objects: 44.79k objects, 174 GiB
    usage:   517 GiB used, 880 GiB / 1.4 TiB avail
    pgs:     33 active+clean

Summary

My preliminary guess is that the virtualization layer (OVMF VMs with GPU and NVMe passthrough) may have hit a passthrough storage access problem during the rapid reboot, which affected the Ceph data node. This investigation was fairly rough (a sketch of a cleaner planned-shutdown routine follows the list below):

  • I still lack a deep understanding of Ceph's low-level, system-level internals and need to close that gap

  • I need to simulate different failure scenarios and run recovery drills
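
For the next planned reboot, what I want to rehearse is roughly the standard maintenance routine: flag the cluster first, then shut the Ceph guests down one at a time. This is only a sketch under that assumption, using the VM names from this post:

# (shut down guests that consume Ceph storage, e.g. z-k8s-m-1, before this point)

# flag the cluster for planned maintenance so OSDs are not marked out while down
ceph osd set noout
ceph osd set norecover

# shut the Ceph guests down one at a time, waiting for each to power off
for vm in z-b-data-1 z-b-data-2 z-b-data-3; do
    virsh shutdown "$vm"
    until [ "$(virsh domstate "$vm")" = "shut off" ]; do sleep 5; done
done

# ... after the physical host and the guests are back up and all OSDs are in:
ceph osd unset norecover
ceph osd unset noout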

References