Ubuntu Hibernate
Starting with Ubuntu 21.04, hibernation is implemented by systemd together with the kernel. The older userspace tool uswsusp is no longer supported.
Enabling systemd-hibernate.service directly fails with an error:
systemctl enable systemd-hibernate.service
The error message is:
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.
Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
.wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
instance name specified.
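The message above says the unit has no [Install] section, which is why it cannot be enabled. A minimal sketch for checking this on a unit file follows; the helper name has_install_section is made up for illustration.

```shell
# Report (via exit status) whether a unit file contains an [Install] section.
# has_install_section is a hypothetical helper, not part of systemd.
has_install_section() {
    # -q: no output, -x: the whole line must match
    grep -qx '\[Install\]' "$1"
}
```

For example, `has_install_section /lib/systemd/system/systemd-hibernate.service || echo "static unit"` should print "static unit" on a system where the unit file has no [Install] section, as the error message indicates.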
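The kernel has to be told which swap device to resume from, using the resume=UUID= parameter:
With a swap partition, the resume_offset parameter is not needed.
With a swap file, the offset of the swap file must be passed as well; see the Arch Linux Hibernates notes for how to set it.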
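For the swap file case, the offset is commonly taken from the physical start of the file's first extent as reported by filefrag -v. The sketch below assumes an ext4-style filesystem and the usual filefrag -v column layout; the function name and the /swapfile path are placeholders, not something from my setup.

```shell
# Read `filefrag -v` output on stdin and print the physical start offset
# of the first extent (the row whose first column is "0:").
# Assumption: the 4th column looks like "38912.." as printed by filefrag -v.
resume_offset_from_filefrag() {
    awk '$1 == "0:" { sub(/\.\.$/, "", $4); print $4; exit }'
}
```

It would be used as `filefrag -v /swapfile | resume_offset_from_filefrag`, and the printed number passed to the kernel as resume_offset=.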
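Preparing the swap partition
Either a swap partition or a swap file can be used. As on-disk storage for hibernation there is no fundamental difference between them: a swap partition only needs the resume=UUID= kernel parameter pointing at the partition, while a swap file additionally needs the resume_offset parameter (see Ubuntu Hibernate (archived old practice)). In this walkthrough I use a swap partition:
Prepare the partition: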
parted /dev/sda mklabel gpt
parted -a optimal /dev/sda mkpart swap linux_swap 0% 512MB
mkswap /dev/sda1
swapon /dev/sda1
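The swap device has to be able to hold the hibernation image. As a rough, conservative check (my own heuristic, not a rule taken from the kernel), one can compare free swap with the memory currently in use. The function name swap_fits_memory is made up for this sketch.

```shell
# Print "ok" if free swap is at least as large as the memory in use
# (MemTotal - MemAvailable), otherwise "too-small".
# Reads a /proc/meminfo-style file; values are in kB.
swap_fits_memory() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapFree:"     { swapfree = $2 }
        END { print ((swapfree >= total - avail) ? "ok" : "too-small") }
    ' "${1:-/proc/meminfo}"
}
```

Called without arguments it reads /proc/meminfo. The kernel compresses the image and may shrink it, so "too-small" is a warning sign, not a guarantee of failure.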
Check the UUID of the partition:
blkid /dev/sda1
The output shows the UUID:
/dev/sda1: UUID="4525e419-1e24-48e8-9b7c-e03d281d7b41" TYPE="swap" PARTLABEL="swap" PARTUUID="ba7c6674-5d6e-4aa5-a2ed-cb0e265b73b7"
Add the corresponding swap entry to /etc/fstab:
/dev/disk/by-uuid/4525e419-1e24-48e8-9b7c-e03d281d7b41 none swap defaults 0 0
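Configuring kernel parameters
Edit /etc/default/grub and add the hibernate resume parameter to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt intel_pstate=enable processor.max_cstate=1 intel_idle.max_cstate=1 rd.driver.blacklist=nouveau,rivafb,nvidiafb,rivatv resume=UUID=4525e419-1e24-48e8-9b7c-e03d281d7b41"
Changes to /etc/default/grub only take effect once the GRUB configuration has been regenerated with update-grub.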
Edit /etc/initramfs-tools/conf.d/resume to pass the resume parameter to the initramfs:
RESUME=UUID=4525e419-1e24-48e8-9b7c-e03d281d7b41
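The same UUID appears in three places: the /etc/fstab entry, the resume= kernel parameter, and the RESUME= line for the initramfs. To avoid copy errors, the three lines can be generated from one value. This is a small sketch; hibernate_config_lines is a name I made up.

```shell
# Print the fstab entry, the kernel parameter and the initramfs setting
# for the swap partition with the given UUID.
hibernate_config_lines() {
    uuid=$1
    printf '/dev/disk/by-uuid/%s none swap defaults 0 0\n' "$uuid"
    printf 'resume=UUID=%s\n' "$uuid"
    printf 'RESUME=UUID=%s\n' "$uuid"
}
```

The UUID itself can be read with `blkid -s UUID -o value /dev/sda1`, so a typical call would be `hibernate_config_lines "$(blkid -s UUID -o value /dev/sda1)"`.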
Rebuild the initramfs:
update-initramfs -c -k all
Reboot the server once:
reboot
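After a reboot, one way to confirm that the resume= parameter actually reached the kernel is to look at /proc/cmdline. The helper below is a sketch with a made-up name; it prints the value of resume= and returns a non-zero status when the parameter is missing.

```shell
# Print the value of the resume= parameter from a kernel command line file
# (default: /proc/cmdline). Returns 1 if no resume= parameter is present.
resume_param() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^resume=//p' | grep .
}
```

On the setup described here, `resume_param` should print UUID=4525e419-1e24-48e8-9b7c-e03d281d7b41.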
After the reboot, hibernation can be triggered through systemd:
systemctl hibernate
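Once the running state has been written to disk, the server powers off. To resume, press the power button; near the end of the console output there is a series of messages about loading the image:
(Figure: restoring the hibernated state during operating system boot)
After the system has resumed, check the systemd-hibernate service with the following command to see whether the hibernate cycle succeeded, along with any error messages: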
systemctl status systemd-hibernate.service
The output looks like this:
○ systemd-hibernate.service - Hibernate
Loaded: loaded (/lib/systemd/system/systemd-hibernate.service; static)
Active: inactive (dead)
Docs: man:systemd-hibernate.service(8)
Dec 02 08:03:33 zcloud.staging.huatai.me systemd-sleep[17268]: System returned from sleep state.
Dec 02 08:03:33 zcloud.staging.huatai.me systemd[1]: systemd-hibernate.service: Deactivated successfully.
Dec 02 08:03:33 zcloud.staging.huatai.me systemd[1]: Finished Hibernate.
Dec 02 08:03:33 zcloud.staging.huatai.me systemd[1]: systemd-hibernate.service: Consumed 29.492s CPU time.
Dec 02 08:14:02 zcloud.staging.huatai.me systemd[1]: Starting Hibernate...
Dec 02 08:14:03 zcloud.staging.huatai.me systemd-sleep[50217]: Entering sleep state 'hibernate'...
Dec 02 18:17:09 zcloud.staging.huatai.me systemd-sleep[50217]: System returned from sleep state.
Dec 02 18:17:09 zcloud.staging.huatai.me systemd[1]: systemd-hibernate.service: Deactivated successfully.
Dec 02 18:17:09 zcloud.staging.huatai.me systemd[1]: Finished Hibernate.
Dec 02 18:17:09 zcloud.staging.huatai.me systemd[1]: systemd-hibernate.service: Consumed 29.599s CPU time.
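The journal lines above can be reduced to one line per hibernate cycle by pairing each "Entering sleep state" message with the following "System returned from sleep state" message. This is a sketch that assumes the default short journal format shown above; hibernate_cycles is a made-up name.

```shell
# Read journal lines on stdin and print "<entered> -> <returned>" for each
# hibernate cycle, using the month, day and time fields as logged.
hibernate_cycles() {
    awk '
        /Entering sleep state/ { start = $1 " " $2 " " $3 }
        /System returned from sleep state/ && start != "" {
            print start " -> " $1 " " $2 " " $3
            start = ""
        }
    '
}
```

A typical call would be `journalctl -u systemd-hibernate.service --no-pager | hibernate_cycles`. For the log above, the last cycle comes out as Dec 02 08:14:03 -> Dec 02 18:17:09.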
Non-root users
Note
How To Enable Hibernation On Ubuntu (When Using A Swap File) also describes how to let non-root users hibernate. I have not tried it; refer to the original article if you need it.
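Troubleshooting
After running systemctl hibernate, I powered the server back on and saw that it did decompress the image during boot, that is, it restored the previously hibernated in-memory state from the swap partition. The console output during resume shows the image saved on disk being loaded completely:
[ 54.388716] PM: hibernation: resume from hibernation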
[ 54.419218] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 54.462750] OOM killer disabled.
[ 54.482747] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 54.797313] PM: Using 3 thread(s) for decompression
[ 54.820978] PM: Loading and decompressing image data (4114252 pages)...
[ 54.929475] PM: Image loading progress: 0%
[ 65.625271] PM: Image loading progress: 10%
[ 70.616074] PM: Image loading progress: 20%
[ 75.275002] PM: Image loading progress: 30%
[ 80.097537] PM: Image loading progress: 40%
[ 84.834466] PM: Image loading progress: 50%
[ 89.929866] PM: Image loading progress: 60%
[ 94.942385] PM: Image loading progress: 70%
[ 99.811497] PM: Image loading progress: 80%
[ 104.480920] PM: Image loading progress: 90%
[ 108.438452] PM: Image loading progress: 100%
[ 108.496220] PM: Image loading done
[ 108.514655] PM: hibernation: Read 16457008 kbytes in 53.64 seconds (306.80 MB/s)
[ 108.555362] printk: Suspending console(s) (use no_console_suspend to debug)
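The numbers in this log are consistent with each other: 4114252 pages of 4 KiB are 16457008 kbytes, and that amount read in 53.64 seconds gives the reported 306.80 MB/s. The sketch below redoes the arithmetic; the 4 KiB page size and the function name are my assumptions.

```shell
# Convert a page count to kbytes (assuming 4 KiB pages) and compute the
# throughput the way the log line reports it (kbytes / seconds / 1000).
hibernate_image_stats() {
    awk -v pages="$1" -v secs="$2" 'BEGIN {
        kbytes = pages * 4
        printf "%d kbytes, %.2f MB/s\n", kbytes, kbytes / secs / 1000
    }'
}
```

`hibernate_image_stats 4114252 53.64` prints "16457008 kbytes, 306.80 MB/s", matching the PM: lines above.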
A few seconds later, however, an MCE (Machine-Check Exception) error appeared:
X64 Exception Type 12 - Machine-Check Exception
The server then powered off by itself, printing the following on the console:
HP ProLiant System BIOS P89 v2.96 (05/17/2022)
(C) Copyright 1982 - 2022 Hewlett Packard Enterprise Development LP
Early system initialization, please wait...
iLO 4 IPv4: 192.168.7.254
iLO 4 IPv6: FE80::9657:A5FF:FE5E:7F16
2%: System Chipset Initialization
4%: QPI Link Initialization - Start
6%: QPI Link Initialization - Complete
The server is not powered on. The Virtual Serial Port is not available.
It then restarted automatically. The self-test output on the console during this restart:
HP ProLiant System BIOS P89 v2.96 (05/17/2022)
(C) Copyright 1982 - 2022 Hewlett Packard Enterprise Development LP
Early system initialization, please wait...
iLO 4 IPv4: 192.168.7.254
iLO 4 IPv6: FE80::9657:A5FF:FE5E:7F16
2%: System Chipset Initialization
4%: QPI Link Initialization - Start
6%: QPI Link Initialization - Complete
9%: Early Processor Initialization
11%: Memory Initialization - Start
24%: Memory Initialization - Complete
27%: System Security Initialization
30%: HPE SmartMemory Initialization
39%: Loading System Firmware Modules - Start
The dmesg -T system log shows no errors, and the EDAC hardware diagnostics gave no hint either. However, the Integrated Management Log in the iLO web console of the HPE ProLiant DL360 Gen9 server contains the following MCE records:
ID | Severity | Class | Last Update | Count | Description |
---|---|---|---|---|---|
300 | Critical | CPU | 12/19/2024 12:04 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000012, Status 0xBE200000'000C110A, Address 0x00000000'80210000, Misc 0xB8FC3816'00402086) |
299 | Critical | PCI Bus | 12/19/2024 12:02 | 1 | PCI Bus Error (Slot 0, Bus 0, Device 0, Function 0) |
The hibernate attempts yesterday and the day before produced the same errors:
ID | Severity | Class | Last Update | Count | Description |
---|---|---|---|---|---|
298 | Critical | CPU | 12/18/2024 13:56 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000002, Bank 0x00000012, Status 0xBE200000'000C110A, Address 0x00000000'80210000, Misc 0xB8FC3816'00402086) |
297 | Critical | CPU | 12/18/2024 13:56 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000002, Bank 0x00000000, Status 0xBD800000'00000006, Address 0x00000000'00000000, Misc 0x000007FA'9A729000) |
296 | Critical | PCI Bus | 12/18/2024 13:54 | 1 | PCI Bus Error (Slot 0, Bus 0, Device 0, Function 0) |

ID | Severity | Class | Last Update | Count | Description |
---|---|---|---|---|---|
295 | Critical | CPU | 12/17/2024 10:35 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000020, Bank 0x00000000, Status 0xBD800000'00000006, Address 0x00000000'00000000, Misc 0x000007FA'9A729000) |
294 | Critical | CPU | 12/17/2024 10:35 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000012, Status 0xBE200000'000C110A, Address 0x00000000'80210000, Misc 0xB8FC3816'00402086) |
293 | Critical | PCI Bus | 12/17/2024 10:33 | 1 | PCI Bus Error (Slot 0, Bus 0, Device 0, Function 0) |
292 | Critical | CPU | 12/17/2024 09:50 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000006, Bank 0x00000012, Status 0xBE200000'000C110A, Address 0x00000000'80210000, Misc 0xB8FC3816'00402086) |
291 | Critical | CPU | 12/17/2024 09:50 | 1 | Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000006, Bank 0x00000000, Status 0xBD800000'00000006, Address 0x00000000'00000000, Misc 0x000007FA'9A729000) |
290 | Critical | PCI Bus | 12/17/2024 09:48 | 1 | PCI Bus Error (Slot 0, Bus 0, Device 0, Function 0) |
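The Status field in each MCE record is a machine-check status register value, and its top bits are architectural flags defined in the Intel SDM (VAL, OVER, UC, EN, MISCV, ADDRV and PCC at bits 63 down to 57). As a sketch, the flags can be decoded from the upper 32-bit half, which iLO prints before the apostrophe. The function name is made up, and the lower, model-specific bits are deliberately not interpreted.

```shell
# Decode the architectural flag bits of a machine-check status value from
# its upper 32 bits (the part before the apostrophe in the iLO log).
# Bits 63..57 of the full register are bits 31..25 of the upper half.
mce_status_flags() {
    high=$(( $1 ))
    out=""
    for spec in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${spec%%:*}
        name=${spec#*:}
        if [ $(( (high >> bit) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$name"
        fi
    done
    printf '%s\n' "$out"
}
```

`mce_status_flags 0xBE200000` prints "VAL UC EN MISCV ADDRV PCC" and `mce_status_flags 0xBD800000` prints "VAL UC EN MISCV ADDRV". Both records are flagged as uncorrected (UC), which agrees with the "Uncorrectable" wording in the log.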
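I had PCIe errors on this machine before (see HPE DL360 Gen9 server PCI Bus Error); at the time, reseating the memory seemed to fix them.
I searched Google, and a case described in an "Uncorrectable Machine Check Exception" thread on the HP support forum gave me an idea:
The MCE appears to be caused by the PCIe error, and the MCE records involve both Processor 1 and Processor 2 (a simultaneous hardware fault in both processors is extremely unlikely).
It looks like a device attached to PCIe triggers the problem: it does not respond properly when resuming from hibernate, and the host reports this as an MCE. I suspect the Intel Optane M10 that I recently added on a PCIe adapter card. This device initializes very oddly and previously had problems being detected, so it may fail to return to its pre-hibernate state on resume and trigger the MCE (Machine-Check Exception) error.
I could swap hardware and rule out the PCIe devices one by one to see which is related to the MCE (Machine-Check Exception) error on hibernate, but I do not have time to keep digging right now. The system is stable in normal operation and the fault only occurs when resuming from hibernate, so I will keep observing and investigate further when time permits.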
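To see at a glance which processors the MCE records name, the Description text copied from the Integrated Management Log can be tallied. This is a sketch that assumes one record per line in the format shown in the tables above; mce_per_processor is a made-up name.

```shell
# Read IML description lines on stdin and count the
# "Machine Check Exception" records per processor number.
mce_per_processor() {
    sed -n 's/.*Machine Check Exception (Board [0-9]*, Processor \([0-9]*\),.*/\1/p' \
        | sort -n | uniq -c \
        | awk '{ print "Processor " $2 ": " $1 }'
}
```

Applied to the records above, this gives 6 for Processor 1 and 1 for Processor 2.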