HPE DL360 Gen9 Server PCI Bus Error

Warning

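Servers are built to stay powered on for long periods and do not take well to repeated power cycling:

In my experience, second-hand servers are especially fragile and do not cope well with being powered off for a long time.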

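I bought my HPE DL360 Gen9 server in September 2021, so it had been in continuous use for about two and a half years. Over the last six months, however, I was unemployed (see 凡是过往 皆为序章, "What's past is prologue") and away traveling, so the server stayed powered off for half a year. That is most likely the main cause of the damage: after enduring Shanghai's hot, humid weather, the server finally reported critical errors when I powered it on today: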

The Integrated Management Log (exported in CSV format) shows:

HPE DL360 Gen9 server PCI bus error log

ID,Severity,Class,Last Update,Initial Update,Count,Description
246,Critical,System Error,08/08/2024 01:31,[NOT SET],8,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
245,Critical,System Error,08/08/2024 01:29,08/08/2024 01:29,1,"An Unrecoverable System Error (NMI) has occurred (Service Information: 0x00000000, 0x00000000)"
244,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],40,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"
243,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],17,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
242,Critical,OS,08/08/2024 01:28,[NOT SET],1,"User Remotely Initiated NMI Switch"
241,Critical,OS,08/08/2024 01:28,08/08/2024 01:28,1,"User Remotely Initiated NMI Switch"
240,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],118,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 2, Error status 0x00000020)"
239,Critical,System Error,08/08/2024 01:31,[NOT SET],83,"Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible"
238,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],49,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)"
237,Critical,PCI Bus,08/08/2024 01:28,08/08/2024 01:28,1,"Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)"
236,Critical,System Error,08/08/2024 01:28,08/08/2024 01:28,1,"Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible"
235,Critical,PCI Bus,08/08/2024 01:28,08/08/2024 01:28,1,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
234,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. This memory will not be available to the operating system. ACTION: Replace the failed DIMM to restore the full amount of memory."
233,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 8. The operating system may not have access to all of the memory installed in the system."
232,Caution,POST Message,08/08/2024 01:25,08/08/2024 01:25,1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 9. The operating system may not have access to all of the memory installed in the system."
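With this many repeated entries, it can help to add up the Count column per hardware location before deciding where to look first. The following is only a minimal sketch: it assumes the CSV export uses the column names shown in the log above and quotes descriptions that contain commas, and the function name `summarize` and the regular expressions are my own illustration, not part of any HPE tool. Only four rows from the log are embedded to keep it short.

```python
import csv
import io
import re
from collections import Counter

# Four rows copied from the log above. A real export would be read from a
# file; the header names and quoting style here are assumptions.
IML_CSV = '''ID,Severity,Class,Last Update,Initial Update,Count,Description
244,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],40,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 2)"
243,Critical,PCI Bus,08/08/2024 01:31,[NOT SET],17,"PCI Bus Error (Slot 1, Bus 0, Device 3, Function 0)"
234,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 295-DIMM Failure - Uncorrectable Memory Error - Processor 1, DIMM 9. This memory will not be available to the operating system. ACTION: Replace the failed DIMM to restore the full amount of memory."
233,Caution,POST Message,08/08/2024 01:25,[NOT SET],1,"POST Error: 207-Memory initialization error on Processor 1, DIMM 8. The operating system may not have access to all of the memory installed in the system."
'''

# Hypothetical patterns for the two kinds of location seen in Description.
DIMM_RE = re.compile(r"Processor (\d+), DIMM (\d+)")
PCI_RE = re.compile(r"Slot (\d+), Bus (\d+), Device (\d+), Function (\d+)")


def summarize(csv_text):
    """Sum the Count column per hardware location named in Description."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        desc = row["Description"]
        count = int(row["Count"])
        m = DIMM_RE.search(desc)
        if m:
            totals[f"Processor {m.group(1)} DIMM {m.group(2)}"] += count
            continue
        m = PCI_RE.search(desc)
        if m:
            slot, bus, dev, fn = m.groups()
            totals[f"PCI Slot {slot} ({bus}:{dev}.{fn})"] += count
        # Rows that name no location (for example NMI entries) are skipped.
    return totals


if __name__ == "__main__":
    for location, total in summarize(IML_CSV).most_common():
        print(f"{total:4d}  {location}")
```

Grouping by location rather than by Class keeps the DIMM entries and the PCI Slot 1 entries visible side by side, which is the comparison the troubleshooting below relies on.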

Unfortunately, a server that had still booted normally two months earlier was now out of action...

Troubleshooting

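  • HP server iLO provides a very convenient graphical management interface. The management log shows that the errors are concentrated on two memory slots of Processor 1, DIMM 8 and DIMM 9, which points to either failed memory modules or poor contact in the slots:

    • Two modules suffering a hardware failure at the same time seems unlikely, so I suspected the modules were poorly seated, in which case reseating them might resolve the problem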

../../../../_images/dl360_gen9_pci_bus_error-1.png
../../../../_images/dl360_gen9_pci_bus_error-2.png
../../../../_images/hpe_dl360_gen9_memory.webp

HPE DL360 Gen9 memory slot order

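  • I reseated the failing DIMMs several times and powered the server on again. Sure enough, the memory check then passed cleanly. Under System Information >> Memory Information in HP server iLO, the DIMMs that had reported errors are now working normally (status Good, In Use):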

../../../../_images/dl360_gen9_pci_bus_error-3.png