OSD "slow ops" errors after a hard shutdown and restart of Ceph
Symptom
The ZData Ceph data layer of my private cloud architecture runs on three Open Virtual Machine Firmware (OVMF) virtual machines. Before rebooting the physical server, I ran virsh shutdown XXX against all three VMs in quick succession, so the three Ceph VMs went down almost simultaneously. Because no shutdown was executed inside the guest operating systems, this kind of virsh shutdown may not be a clean way to power off. After the server came back up, the three Ceph VMs started normally, but other VMs backed by Ceph storage came up paused:
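To avoid the near-simultaneous hard stop next time, the shutdowns can be serialized so each VM fully powers off before the next one is touched. A minimal sketch, assuming virsh domstate reports "shut off" once the guest is down; the state-reporting command is passed in as a parameter so the wait logic can be exercised without libvirt:

```shell
#!/bin/sh
# wait_for_state: poll a state-reporting command until the VM reaches
# the wanted state or the retry budget runs out.
#   $1 = VM name, $2 = wanted state, $3 = max tries (1s apart),
#   $4 = command that prints the state ("virsh domstate" with libvirt).
wait_for_state() {
    vm=$1; want=$2; tries=$3; state_cmd=$4
    while [ "$tries" -gt 0 ]; do
        [ "$($state_cmd "$vm")" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Shut the three Ceph VMs down one at a time (names from this cluster),
# waiting up to 5 minutes for each to report "shut off".
shutdown_ceph_vms() {
    for vm in z-b-data-1 z-b-data-2 z-b-data-3; do
        virsh shutdown "$vm" || return 1
        wait_for_state "$vm" "shut off" 300 "virsh domstate" || {
            echo "timed out waiting for $vm to stop" >&2
            return 1
        }
    done
}
```

The wait loop is the important part; a plain loop of virsh shutdown calls reproduces exactly the simultaneous-stop problem described above.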
Id Name State
----------------------------
1 z-b-data-2 running
2 z-b-data-3 running
3 z-b-data-1 running
4 z-k8s-m-1 paused
Running ceph -s on the Ceph VM cluster shows the warning:
cluster:
id: 0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
health: HEALTH_WARN
Reduced data availability: 33 pgs inactive, 33 pgs peering
5 slow ops, oldest one blocked for 1925 sec, daemons [osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 11m)
mgr: z-b-data-1(active, since 32m)
osd: 3 osds: 3 up (since 31m), 3 in (since 4w)
data:
pools: 2 pools, 33 pgs
objects: 44.79k objects, 174 GiB
usage: 517 GiB used, 880 GiB / 1.4 TiB avail
pgs: 100.000% pgs not active
33 peering
This indicates the OSDs are in an abnormal state.
The dmesg -T output also contains a hint:
[Tue Jan 10 00:41:47 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
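The warning can be silenced with a systemd drop-in that replaces the deprecated KillMode=none, as the message itself suggests. This is an assumption on my part: Ceph may ship KillMode=none deliberately so that systemd does not kill the daemons that ceph-volume activation spawns, so treat this as an experiment and verify OSD activation on a non-production node first:

```ini
# /etc/systemd/system/ceph-volume@.service.d/override.conf
# Hypothetical override; confirm OSDs still activate after a reboot.
[Service]
KillMode=mixed
```

After adding the drop-in, run systemctl daemon-reload for it to take effect.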
So it seems the way Ceph processes are shut down could use improvement?
Investigation
Since it was late at night and I had no immediate solution, I resumed troubleshooting the next day. By then the previously paused VM had already returned to the running state:
$ virsh list
Id Name State
----------------------------
1 z-b-data-2 running
2 z-b-data-3 running
3 z-b-data-1 running
4 z-k8s-m-1 running
At this point ceph -s showed a more serious cluster warning (objects degraded):
cluster:
id: 0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
health: HEALTH_WARN
Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
services:
mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 9h)
mgr: z-b-data-1(active, since 9h)
osd: 3 osds: 2 up (since 4h), 2 in (since 8h)
data:
pools: 2 pools, 33 pgs
objects: 44.79k objects, 174 GiB
usage: 345 GiB used, 586 GiB / 931 GiB avail
pgs: 44788/134364 objects degraded (33.333%)
33 active+undersized+degraded
io:
client: 0 B/s rd, 44 KiB/s wr, 0 op/s rd, 5 op/s wr
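The 33.333% figure is not arbitrary: with the default three-replica pools and one of three equally weighted OSDs down, exactly one copy of every object is missing. The numbers in the ceph -s output above check out:

```shell
# Sanity-check the "33.333% degraded" figure from ceph -s:
# 44788 objects x 3 replicas = 134364 object copies in total, and the
# down OSD held exactly one copy of each object.
total=$((44788 * 3))
pct=$(awk -v d=44788 -v t="$total" 'BEGIN { printf "%.3f", d * 100 / t }')
echo "$total copies, ${pct}% degraded"
```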
One OSD was now down.
The Ceph log shows:
1/10/23 11:00:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:50:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:40:00 AM [WRN] overall HEALTH_WARN Degraded data redundancy: 44788/134364 objects degraded (33.333%), 33 pgs degraded, 33 pgs undersized
1/10/23 10:35:18 AM [DBG] pgmap v17785: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 76 KiB/s wr, 11 op/s; 44788/134364 objects degraded (33.333%)
1/10/23 10:35:16 AM [DBG] pgmap v17784: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 16 KiB/s wr, 2 op/s; 44788/134364 objects degraded (33.333%)
1/10/23 10:35:14 AM [DBG] pgmap v17783: 33 pgs: 33 active+undersized+degraded; 174 GiB data, 345 GiB used, 586 GiB / 931 GiB avail; 0 B/s rd, 18 KiB/s wr, 2 op/s; 44788/134364 objects degraded (33.333%)
First, find which OSD node is down:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.36409 root default
-3 0.45470 host z-b-data-1
0 ssd 0.45470 osd.0 up 1.00000 1.00000
-5 0.45470 host z-b-data-2
1 ssd 0.45470 osd.1 down 0 1.00000
-7 0.45470 host z-b-data-3
2 ssd 0.45470 osd.2 up 1.00000 1.00000
Node 2's osd.1 is down.
Next, check disk usage across the OSDs:
# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 ssd 0.45470 1.00000 466 GiB 172 GiB 172 GiB 8 KiB 542 MiB 293 GiB 37.03 1.00 33 up
1 ssd 0.45470 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
2 ssd 0.45470 1.00000 466 GiB 172 GiB 172 GiB 8 KiB 560 MiB 293 GiB 37.04 1.00 33 up
TOTAL 931 GiB 345 GiB 344 GiB 17 KiB 1.1 GiB 586 GiB 37.03
MIN/MAX VAR: 1.00/1.00 STDDEV: 0.00
Node 2's storage can no longer be accessed.
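For scripting this check, the down OSDs can be pulled straight out of the ceph osd tree output. Here the relevant rows from above are inlined so the filter can be shown standalone; against a live cluster you would pipe ceph osd tree into the awk instead:

```shell
# Filter `ceph osd tree` rows whose STATUS column reads "down".
# On OSD rows the columns are: ID CLASS WEIGHT NAME STATUS REWEIGHT PRI-AFF.
tree=' 0   ssd  0.45470      osd.0      up  1.00000  1.00000
 1   ssd  0.45470      osd.1    down        0  1.00000
 2   ssd  0.45470      osd.2      up  1.00000  1.00000'
down_osds=$(printf '%s\n' "$tree" | awk '$5 == "down" { print $4 }')
echo "down: $down_osds"
```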
Checking the service on node 2, the osd service is still in the running state:
# systemctl status ceph-osd@1.service
● ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Tue 2023-01-10 00:41:45 CST; 10h ago
Process: 561 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 1 (code=exited, status=0/SUCCESS)
Main PID: 570 (ceph-osd)
Tasks: 76
Memory: 817.9M
CPU: 1min 34.125s
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@1.service
└─570 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
Jan 10 01:32:47 z-b-data-2 ceph-osd[570]: 2023-01-10T01:32:47.775+0800 7fa7e2e26640 -1 osd.1 2341 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.1324143.0:1 1.0 1.4e660c0c (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e2337)
Jan 10 01:32:48 z-b-data-2 ceph-osd[570]: 2023-01-10T01:32:48.727+0800 7fa7e2e26640 -1 osd.1 2341 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.1324143.0:1 1.0 1.4e660c0c (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e2337)
...
Checking /var/log/ceph/osd.log on node 2:
** File Read Latency Histogram By Level [default] **
2023-01-10T00:42:07.600+0800 7fa801e5e5c0 1 bluestore(/var/lib/ceph/osd/ceph-1) _upgrade_super from 4, latest 4
2023-01-10T00:42:07.600+0800 7fa801e5e5c0 1 bluestore(/var/lib/ceph/osd/ceph-1) _upgrade_super done
2023-01-10T00:42:07.608+0800 7fa801e5e5c0 0 <cls> ./src/cls/cephfs/cls_cephfs.cc:201: loading cephfs
2023-01-10T00:42:07.612+0800 7fa801e5e5c0 0 <cls> ./src/cls/hello/cls_hello.cc:316: loading cls_hello
2023-01-10T00:42:07.612+0800 7fa801e5e5c0 0 _get_class not permitted to load lua
2023-01-10T00:42:07.636+0800 7fa801e5e5c0 0 _get_class not permitted to load sdk
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 0 _get_class not permitted to load test_remote_reads
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 1 osd.1 2331 warning: got an error loading one or more classes: (1) Operation not permitted
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 0 osd.1 2331 crush map has features 288514051259236352, adjusting msgr requires for clients
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 0 osd.1 2331 crush map has features 288514051259236352 was 8705, adjusting msgr requires for mons
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 0 osd.1 2331 crush map has features 3314933000852226048, adjusting msgr requires for osds
2023-01-10T00:42:07.644+0800 7fa801e5e5c0 1 osd.1 2331 check_osdmap_features require_osd_release unknown -> quincy
The log shows nothing particularly abnormal, but on the dashboard's OSDs page node 2 is displayed as out down.
I also checked node 2's system log without finding anything unusual, and the physical server's dmesg shows no hardware errors either.
Recovery
Since there was no obvious error, but node 2 was clearly down and the data had been degraded for 4 hours, I shut down node 2's operating system and then started the VM again.
As soon as node 2 booted, Ceph immediately began repairing and resynchronizing data:
1/10/23 11:10:00 AM [INF] overall HEALTH_OK
1/10/23 11:07:14 AM [INF] Cluster is now healthy
1/10/23 11:07:14 AM [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 19/134364 objects degraded (0.014%), 6 pgs degraded)
1/10/23 11:07:12 AM [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
1/10/23 11:07:10 AM [WRN] Health check update: Degraded data redundancy: 25/134364 objects degraded (0.019%), 8 pgs degraded (PG_DEGRADED)
1/10/23 11:07:04 AM [WRN] Health check update: Degraded data redundancy: 2793/134364 objects degraded (2.079%), 15 pgs degraded, 3 pgs undersized (PG_DEGRADED)
1/10/23 11:07:04 AM [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
1/10/23 11:07:00 AM [INF] osd.1 [v2:192.168.6.205:6802/565,v1:192.168.6.205:6803/565] boot
1/10/23 11:06:55 AM [WRN] pg 2.1f is stuck undersized for 4h, current state active+undersized+degraded, last acting [0,2]
1/10/23 11:06:55 AM [WRN] pg 2.1e is stuck undersized for 4h, current state active+undersized+degraded, last acting [0,2]
1/10/23 11:06:55 AM [WRN] pg 2.1d is stuck undersized for 4h, current state active+undersized+degraded, last acting [2,0]
The Ceph cluster then returned to a healthy state:
# ceph -s
cluster:
id: 0e6c8b6f-0d32-4cdb-a45d-85f8c7997c17
health: HEALTH_OK
services:
mon: 3 daemons, quorum z-b-data-1,z-b-data-2,z-b-data-3 (age 8m)
mgr: z-b-data-1(active, since 10h)
osd: 3 osds: 3 up (since 8m), 3 in (since 8m)
data:
pools: 2 pools, 33 pgs
objects: 44.79k objects, 174 GiB
usage: 517 GiB used, 880 GiB / 1.4 TiB avail
pgs: 33 active+clean
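A takeaway for future maintenance on a three-node cluster like this: between node operations, block until the cluster reports HEALTH_OK again instead of moving on while PGs are still degraded. A small sketch; the health command is a parameter so the loop can be tested without a live cluster, and with a real cluster you would pass "ceph health":

```shell
#!/bin/sh
# wait_healthy: poll a health-reporting command until it prints a line
# starting with HEALTH_OK, retrying up to $1 times with a 1s pause.
wait_healthy() {
    tries=$1; health_cmd=$2
    while [ "$tries" -gt 0 ]; do
        $health_cmd | grep -q '^HEALTH_OK' && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Real-cluster usage would be: wait_healthy 600 "ceph health"
```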
Summary
My preliminary guess is that the virtualization setup (OVMF with GPU and NVMe passthrough) may run into passthrough storage access problems during a rapid restart, which affected the Ceph data node. This investigation was fairly rough:
- I still lack a deep understanding of Ceph's low-level, system-level internals and need to close that gap
- I need to rehearse recovery by simulating different failure scenarios