HPE DL360 Gen9 Server PCI Bus Error
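Warning
Servers are built to stay powered on for long stretches, not to be switched on and off repeatedly:
In my experience, second-hand servers are especially fragile and do not tolerate being left powered off for long periods.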
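My HPE DL360 Gen9 server was purchased in September 2021 and has seen about two and a half years of continuous use. For the last six months, though, I was out of work (What's past is prologue) and away traveling, so the machine sat powered off for half a year. That is most likely the main cause of the damage: after enduring Shanghai's hot, humid weather, it finally raised critical error alerts when I powered it on today: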
The Integrated Management Log (CSV format) shows:
ID | Severity | Class | Last Update | Initial Update | Count | Description |
---|---|---|---|---|---|---|
246 | Critical | System Error | 08/08/2024 01:31 | [NOT SET] | 8 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) |
245 | Critical | System Error | 08/08/2024 01:29 | 08/08/2024 01:29 | 1 | An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000) |
244 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 40 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2) |
243 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 17 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0) |
242 | Critical | OS | 08/08/2024 01:28 | [NOT SET] | 1 | User Remotely Initiated NMI Switch |
241 | Critical | OS | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | User Remotely Initiated NMI Switch |
240 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 118 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020) |
239 | Critical | System Error | 08/08/2024 01:31 | [NOT SET] | 83 | Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible |
238 | Critical | PCI Bus | 08/08/2024 01:31 | [NOT SET] | 49 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020) |
237 | Critical | PCI Bus | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020) |
236 | Critical | System Error | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible |
235 | Critical | PCI Bus | 08/08/2024 01:28 | 08/08/2024 01:28 | 1 | PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0) |
234 | Caution | POST Message | 08/08/2024 01:25 | [NOT SET] | 1 | POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. This memory will not be available to the operating system. ACTION: Replace the failed DIMM to restore the full amount of memory. |
233 | Caution | POST Message | 08/08/2024 01:25 | [NOT SET] | 1 | POST Error: 207-Memory initialization error on Processor 1, DIMM 8. The operating system may not have access to all of the memory installed in the system. |
232 | Caution | POST Message | 08/08/2024 01:25 | 08/08/2024 01:25 | 1 | POST Error: 207-Memory initialization error on Processor 1, DIMM 9. The operating system may not have access to all of the memory installed in the system. |
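The PCI Express entries above all carry `Error status 0x00000020`. The log does not say which register this value comes from. As a rough sketch, assuming it is the raw Uncorrectable Error Status register defined by PCI Express Advanced Error Reporting (an assumption that the log does not confirm), the set bits can be named as follows. The helper name `decode_uncorrectable_status` is made up for this illustration.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# ASSUMPTION: the IML "Error status" field is this register verbatim.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable_status(status):
    """Return the names of all known bits set in `status`, lowest bit first."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


decoded = decode_uncorrectable_status(0x00000020)
print(decoded)
```

Under that assumption the value maps to a single bit, bit 5, meaning the link to the device dropped unexpectedly. That would fit a card or riser with poor contact, but it is only a hint, not a diagnosis.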
How unlucky: a server that still booted normally two months ago has now gone on strike...
Troubleshooting
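HP's iLO (HP server iLO technology) provides a very convenient graphical management interface. The management log shows that the memory errors are concentrated on two memory slots of Processor 1, DIMM 8 and DIMM 9. This means that either the DIMMs have failed or they are making poor contact. Since two DIMMs failing at the same time is unlikely, I lean toward a bad seating connection, in which case reseating the DIMMs may well fix the problem.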
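When the log is long, the same reading can be done mechanically. The following is a minimal sketch, not an HPE tool: it takes a handful of (Count, Description) pairs copied by hand from the IML table above and tallies which DIMM slots and PCI functions they mention. The names `IML_ROWS` and `tally_locations` are hypothetical, and the regular expressions assume the description wording seen in this particular log.

```python
import re
from collections import Counter

# (Count, first sentence of Description) for a subset of the IML entries above.
IML_ROWS = [
    (40, "PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"),
    (17, "PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"),
    (1, "POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9."),
    (1, "POST Error: 207-Memory initialization error on Processor 1, DIMM 8."),
    (1, "POST Error: 207-Memory initialization error on Processor 1, DIMM 9."),
]

# ASSUMPTION: locations always appear in exactly this wording.
DIMM_RE = re.compile(r"Processor (\d+), DIMM (\d+)")
PCI_RE = re.compile(r"Slot (\d+), Bus (\d+), Device (\d+), Function (\d+)")


def tally_locations(rows):
    """Sum the Count column per DIMM slot and per PCI function."""
    dimms, pci_functions = Counter(), Counter()
    for count, text in rows:
        match = DIMM_RE.search(text)
        if match:
            dimms[tuple(int(g) for g in match.groups())] += count
        match = PCI_RE.search(text)
        if match:
            pci_functions[tuple(int(g) for g in match.groups())] += count
    return dimms, pci_functions


dimms, pci_functions = tally_locations(IML_ROWS)
for (cpu, dimm), total in sorted(dimms.items()):
    print(f"Processor {cpu}, DIMM {dimm}: {total} event(s)")
for location, total in sorted(pci_functions.items()):
    print(f"Slot/Bus/Device/Function {location}: {total} event(s)")
```

For these rows the tally points at DIMM 8 and DIMM 9 on Processor 1, and at functions 0 and 2 of the device in Slot 1.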
Pay attention to the HP DL360 Gen9 memory installation order:
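I reseated the DIMMs that had reported errors several times and then powered the server back on. Sure enough, the memory check during POST came back completely clean. In iLO, System Information >> Memory Information shows that the DIMMs that had reported errors are now working normally (status Good, In Use):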