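Monitoring Storage Devices with S.M.A.R.T.

My second-hand HPE ProLiant DL360 Gen9 server uses an Intel SATA SSD that I bought a long time ago. Every so often this SSD leaves alarming error records in the system log:

SSD error log in dmesg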
[Sun Aug  6 11:05:54 2023] ata5.00: exception Emask 0x0 SAct 0x80080000 SErr 0x0 action 0x6 frozen
[Sun Aug  6 11:05:54 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug  6 11:05:54 2023] ata5.00: cmd 60/08:98:98:20:9c/00:00:02:00:00/40 tag 19 ncq dma 4096 in
                                    res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug  6 11:05:54 2023] ata5.00: status: { DRDY }
[Sun Aug  6 11:05:54 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug  6 11:05:54 2023] ata5.00: cmd 60/08:f8:e8:e4:8c/00:00:00:00:00/40 tag 31 ncq dma 4096 in
                                    res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug  6 11:05:54 2023] ata5.00: status: { DRDY }
[Sun Aug  6 11:05:54 2023] ata5: hard resetting link
[Sun Aug  6 11:05:54 2023] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Sun Aug  6 11:05:54 2023] ata5.00: configured for UDMA/133
[Sun Aug  6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[Sun Aug  6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 Sense Key : Illegal Request [current]
[Sun Aug  6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 Add. Sense: Unaligned write command
[Sun Aug  6 11:05:54 2023] sd 4:0:0:0: [sdb] tag#31 CDB: Read(10) 28 00 00 8c e4 e8 00 00 08 00
[Sun Aug  6 11:05:54 2023] blk_update_request: I/O error, dev sdb, sector 9233640 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Sun Aug  6 11:05:54 2023] ata5: EH complete
[Sun Aug  6 11:05:54 2023] ata5.00: Enabling discard_zeroes_data
[Sun Aug  6 11:06:24 2023] ata5.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x6 frozen
[Sun Aug  6 11:06:24 2023] ata5.00: failed command: READ FPDMA QUEUED
[Sun Aug  6 11:06:24 2023] ata5.00: cmd 60/08:c0:70:1f:ce/00:00:00:00:00/40 tag 24 ncq dma 4096 in
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Aug  6 11:06:24 2023] ata5.00: status: { DRDY }
[Sun Aug  6 11:06:24 2023] ata5: hard resetting link
[Sun Aug  6 11:06:24 2023] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Sun Aug  6 11:06:24 2023] ata5.00: configured for UDMA/133
[Sun Aug  6 11:06:24 2023] ata5.00: device reported invalid CHS sector 0
[Sun Aug  6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[Sun Aug  6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 Sense Key : Illegal Request [current]
[Sun Aug  6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 Add. Sense: Unaligned write command
[Sun Aug  6 11:06:24 2023] sd 4:0:0:0: [sdb] tag#24 CDB: Read(10) 28 00 00 ce 1f 70 00 00 08 00
[Sun Aug  6 11:06:24 2023] blk_update_request: I/O error, dev sdb, sector 13508464 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Sun Aug  6 11:06:24 2023] ata5: EH complete
[Sun Aug  6 11:06:24 2023] ata5.00: Enabling discard_zeroes_data
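The CDB lines in the log above can be cross-checked against the reported failing sectors. The following is a minimal sketch, assuming the standard Read(10) layout (opcode 0x28, bytes 2-5 the big-endian LBA, bytes 7-8 the transfer length in blocks); the function name decode_read10 is made up for this illustration.

```shell
#!/bin/sh
# Decode the LBA and transfer length from a SCSI Read(10) CDB given as
# ten space-separated hex bytes, the way the kernel prints them.
# decode_read10 is a hypothetical helper, not part of any tool.
decode_read10() {
    # Split the single argument into positional parameters $1..$10.
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a Read(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))     # bytes 2-5: logical block address
    blocks=$(( 0x$8$9 ))      # bytes 7-8: number of blocks to read
    echo "lba=$lba blocks=$blocks"
}

decode_read10 "28 00 00 8c e4 e8 00 00 08 00"   # prints lba=9233640 blocks=8
decode_read10 "28 00 00 ce 1f 70 00 00 08 00"   # prints lba=13508464 blocks=8
```

Both LBAs match the sectors named in the blk_update_request lines, and 8 blocks of 512 bytes is the 4096-byte transfer shown as "ncq dma 4096 in".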

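Note

I suspect the firmware of this Intel 545s Series SSD may be at fault. According to Latest Firmware For Solidigm™ (Formerly Intel®) Solid State Drives, the latest firmware for the Intel 545s Series SSDs is 004C (for the 512GB model) and 0B3C (for the 1TB model). I plan to upgrade the firmware to see whether that fixes the link resets.

I want to use the drives' S.M.A.R.T. support to detect and monitor disk anomalies:

Installing smartmontools

Install smartmontools on Ubuntu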
sudo apt install smartmontools

SMART info

  • Check whether the disk supports SMART and has it enabled:

smartctl -i: show the disk's identity information
sudo smartctl -i /dev/sda

The SMART information of my SanDisk CloudSpeed Eco Gen. II enterprise SATA SSD is as follows:

smartctl -i output for the SanDisk SSD
=== START OF INFORMATION SECTION ===
Model Family:     Sandisk SATA Cloudspeed Max and GEN2 ESS SSDs
Device Model:     SDLF1CRR-019T-1HA1
Serial Number:    A007C9D9
LU WWN Device Id: 5 001173 100a88424
Firmware Version: ZR11RPA1
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 23 11:43:03 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
smartctl -i output for the Intel SSD
=== START OF INFORMATION SECTION ===
Model Family:     Intel 545s Series SSDs
Device Model:     INTEL SSDSC2KW512G8
Serial Number:    BTLA7513037S512DGN
LU WWN Device Id: 5 5cd2e4 14eea7536
Firmware Version: LHF002C
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 23 11:42:31 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
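When several disks need checking, the two "SMART support is:" lines can be tested mechanically. This is only a sketch: smart_enabled is a hypothetical helper, and it assumes the output wording shown above (smartctl 7.2).

```shell
#!/bin/sh
# Read `smartctl -i` output on stdin; succeed (exit 0) only if SMART is
# reported as both available and enabled. Hypothetical helper.
smart_enabled() {
    awk '
        /^SMART support is:[ \t]+Available/ { available = 1 }
        /^SMART support is:[ \t]+Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}

# Demo with the two lines from the output above.
sample='SMART support is: Available - device has SMART capability.
SMART support is: Enabled'
if printf '%s\n' "$sample" | smart_enabled; then
    echo "SMART is available and enabled"
fi
```

On a live system this would be used as `sudo smartctl -i /dev/sda | smart_enabled`.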

SMART test

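SMART self-tests can run in two different modes:

  • Background Mode: the test runs at low priority, so the drive keeps servicing normal commands. When the drive is busy, the test is paused and then continues at a reduced rate, so normal operation is not interrupted.

  • Foreground Mode: while the test runs, the drive answers every command with a CHECK CONDITION status, so this mode should only be used on a drive that is not in use.

As a rule of thumb, background mode is recommended.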

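Tests common to ATA and SCSI

Short Test

The short test is meant to identify a defective drive quickly, so it takes at most about 2 minutes. It checks the drive in three stages:

  • Electrical Properties: the controller tests its own electronics. Because this test is manufacturer-specific, it is not possible to say exactly what is tested; examples include the internal RAM, the read/write circuitry, and the head electronics.

  • Mechanical Properties: the servo and positioning mechanisms are tested; the exact sequence also differs from manufacturer to manufacturer.

  • Read/Verify: a region of the disk is read and some of its data verified; the size and location of that region are again manufacturer-specific.

Long Test

The long test was designed as the final test in production. It is the same as the short test, with two differences:

  • The long test has no time limit

  • The long test performs Read/Verify on the entire disk rather than just a small part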

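ATA-specific tests

Conveyance Tests

A conveyance test can determine within a few minutes whether a drive was damaged in transit.

Selective Tests

A selective test takes an LBA range and scans only that region:

Scan a specified LBA range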
sudo smartctl -t select,10-20 /dev/sdc #LBA 10 to LBA 20 (incl.)
sudo smartctl -t select,10+11 /dev/sdc #LBA 10 to LBA 20 (incl.)

Several ranges (up to five) can be scanned in one test:

Scan multiple LBA ranges
sudo smartctl -t select,0-10 -t select,5-15 -t select,10-20 /dev/sdc
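The two spellings used above, 10-20 and 10+11, describe the same span: START-END with an inclusive end, or START+COUNT. A small sketch of the arithmetic follows; span_to_range is a hypothetical helper that handles only these two purely numeric forms.

```shell
#!/bin/sh
# Convert a "START+COUNT" span into the equivalent inclusive "START-END"
# form; "START-END" spans pass through unchanged. Hypothetical helper.
span_to_range() {
    case $1 in
        *+*) start=${1%%+*}
             count=${1#*+}
             echo "$start-$(( start + count - 1 ))" ;;
        *-*) echo "$1" ;;
        *)   echo "unrecognized span: $1" >&2
             return 1 ;;
    esac
}

span_to_range 10+11   # prints 10-20
span_to_range 10-20   # prints 10-20
```

The end is START + COUNT - 1 because the starting LBA itself counts as the first of the COUNT blocks.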

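Testing with smartctl

Checking a device's SMART capabilities

  • Before testing, check how long each test is estimated to take:

smartctl -c: show device capabilities, including the estimated test times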
sudo smartctl -c /dev/sda

The estimated test times for /dev/sda (SanDisk CloudSpeed Eco Gen. II enterprise SATA SSD):

smartctl -c output for the SanDisk CloudSpeed Eco Gen. II enterprise SATA SSD, including the estimated test times
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(20160) seconds.
Offline data collection
capabilities: 			 (0x5d) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (   1) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

My other disk, /dev/sdb (Intel 545s Series):

smartctl -c output for the Intel 545s Series SSD, including the estimated test times
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.
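The estimated times are spread over two lines each in the capability output, which makes them awkward to compare between drives. Below is a rough sketch of pulling them out; polling_times is a hypothetical helper and it assumes the two-line layout shown above.

```shell
#!/bin/sh
# Read `smartctl -c` output on stdin and print "<test> <minutes>" for
# each recommended polling time. Hypothetical helper.
polling_times() {
    awk '
        # Remember the test name from lines like "Extended self-test routine".
        /self-test routine[ \t]*$/ { name = $1 }
        /recommended polling time:/ {
            # The number sits between parentheses, e.g. "(  30) minutes."
            if (match($0, /\([ \t]*[0-9]+\)/)) {
                s = substr($0, RSTART + 1, RLENGTH - 2)
                gsub(/[ \t]/, "", s)
                print name, s
            }
        }
    '
}

# Demo with the Intel 545s lines from the output above.
polling_times <<'EOF'
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
EOF
```

For the Intel drive this yields 2 and 30 minutes; the SanDisk output above reports 1 minute for both tests.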

Running the tests

/dev/sda

  • Run the long test:

smartctl long test on sda; note that adding -C selects Foreground Mode
sudo smartctl -t long -C /dev/sda
  • Output of the long test:

Output of the smartctl long test on sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-78-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in captive mode".
Drive command "Execute SMART Extended self-test routine immediately in captive mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Wed Aug 23 15:05:27 2023 CST

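This SanDisk CloudSpeed Eco Gen. II enterprise SATA SSD claims to need only 1 minute for the long test (seriously? That is the same time as the short test, so I doubt the test is real).

  • View the test result (-a option):

smartctl: view the sda test result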
sudo smartctl -a /dev/sda
smartctl test result for sda, showing a drive health (remaining life) of 92%
=== START OF INFORMATION SECTION ===
Model Family:     Sandisk SATA Cloudspeed Max and GEN2 ESS SSDs
Device Model:     SDLF1CRR-019T-1HA1
Serial Number:    A007C9D9
LU WWN Device Id: 5 001173 100a88424
Firmware Version: ZR11RPA1
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 23 15:13:40 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (20160) seconds.
Offline data collection
capabilities:                    (0x5d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       47763 (2 27 0)
 13 Lifetime_UECC_Ct        0x0012   100   100   001    Old_age   Always       -       0
 32 Lifetime_Write_AmpFctr  0x0002   100   100   000    Old_age   Always       -       0
 33 Write_AmpFctr           0x0002   100   100   000    Old_age   Always       -       100
170 Reserve_Erase_BlkCt     0x0032   100   100   000    Old_age   Always       -       18218
171 Program_Fail_Ct         0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Ct           0x0032   100   100   000    Old_age   Always       -       0
175 Lifetime_Die_Failure_Ct 0x0032   100   100   000    Old_age   Always       -       0
178 SSD_LifeLeft(0.01%)     0x0012   100   100   000    Old_age   Always       -       9126
183 LT_Link_Rate_DwnGrd_Ct  0x0032   100   100   000    Old_age   Always       -       0
191 Clean_Shutdown_Ct       0x0032   100   100   000    Old_age   Always       -       46
192 Unclean_Shutdown_Ct     0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   059   030    Old_age   Always       -       32 (Min/Max 19/41)
196 Lifetime_Retried_Blk_Ct 0x001b   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
211 Read_Disturb_ReallocEvt 0x0032   100   100   000    Old_age   Always       -       0
233 Lifetime_Nand_Writes    0x0032   100   100   000    Old_age   Always       -       1347968
235 Capacitor_Health        0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       806144
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       923840
244 Therm_Throt_Activation  0x0032   100   100   000    Old_age   Always       -       0
245 Drive_Life_Remaining%   0x0012   092   092   002    Old_age   Always       -       92
253 SPI_Test_Remaining      0x0012   100   100   001    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Completed without error       00%     47763         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Here SSD_LifeLeft(0.01%) is expressed in units of 0.01%: its raw value of 9126 corresponds to 91.26%, which is consistent with the 92 shown by Drive_Life_Remaining%.
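The conversion is a division by 100, since the raw value counts hundredths of a percent. A one-line sketch (life_left_percent is a hypothetical name):

```shell
#!/bin/sh
# Convert a raw SSD_LifeLeft(0.01%) value into a percentage.
# Hypothetical helper; the unit comes from the attribute name itself.
life_left_percent() {
    awk -v raw="$1" 'BEGIN { printf "%.2f\n", raw / 100 }'
}

life_left_percent 9126   # prints 91.26
```

Since Drive_Life_Remaining% reports 92 rather than 91, the drive seemingly rounds this figure up, though that is only a guess from a single reading.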

/dev/sdb

  • Run the long test:

smartctl long test on sdb; note that adding -C selects Foreground Mode
sudo smartctl -t long -C /dev/sdb
  • Output of the long test:

Output of the smartctl long test on sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-78-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in captive mode".
Drive command "Execute SMART Extended self-test routine immediately in captive mode" successful.
Testing has begun.
Please wait 30 minutes for test to complete.
Test will complete after Wed Aug 23 16:10:44 2023 CST

The Intel SSD's long test looks like a real test: it needs 30 minutes to complete.

  • View the test result (-a option):

smartctl: view the sdb (Intel SSD) test result
sudo smartctl -a /dev/sdb
smartctl test result for sdb: both attempts failed to complete, with Extended captive: Interrupted (host reset)
=== START OF INFORMATION SECTION ===
Model Family:     Intel 545s Series SSDs
Device Model:     INTEL SSDSC2KW512G8
Serial Number:    BTLA7513037S512DGN
LU WWN Device Id: 5 5cd2e4 14eea7536
Firmware Version: LHF002C
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 23 22:41:29 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (  41)	The self-test routine was interrupted
					by the host with a hard or soft reset.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       24193
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       160
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   079   079   005    Pre-fail  Always       -       1413069406491
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       36
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0032   027   044   000    Old_age   Always       -       27 (Min/Max 13/44)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       36
199 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1787678
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       0
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   079   079   000    Old_age   Always       -       0
236 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1787678
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       47693
249 NAND_Writes_1GiB        0x0032   100   100   000    Old_age   Always       -       168517
252 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       329

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      90%     24191         -
# 2  Extended captive    Interrupted (host reset)      90%     24191         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

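Oddly, the SMART output for this Intel SSD shows no health (remaining life, ID# 245) attribute, and the test did not complete: its status is Interrupted (host reset). I ran the test twice with the same result (see the two Extended captive entries in the self-test log).

My guess: because /dev/sdb is in use (it is mounted as the system disk), is the foreground test being interrupted by the disk's normal read/write activity?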
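The "Self-test execution status: (  41)" value in the output above carries the same information as the self-test log. As a sketch, assuming the usual ATA encoding of this byte (high four bits a status code, low four bits the remaining work in tenths), it can be decoded as follows; decode_selftest_status is a hypothetical helper.

```shell
#!/bin/sh
# Split the self-test execution status byte (printed by smartctl as a
# decimal number) into its status code and remaining percentage.
# Hypothetical helper; assumes the standard ATA nibble layout.
decode_selftest_status() {
    code=$(( $1 >> 4 ))                  # high nibble: status code
    remaining=$(( ($1 & 15) * 10 ))      # low nibble: tenths remaining
    echo "status=$code remaining=${remaining}%"
}

decode_selftest_status 41   # prints status=2 remaining=90%
decode_selftest_status 0    # prints status=0 remaining=0%
```

Status code 2 is the "interrupted by the host with a hard or soft reset" case that smartctl spells out, and 90% matches the Remaining column of the two Extended captive log entries.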

  • Switch to a Background Mode long test (drop the -C option):

smartctl long test on sdb; note that omitting -C selects Background Mode
sudo smartctl -t long /dev/sdb

This time the command returns to the prompt immediately (unlike with -C, where it blocks for a while):

Output of the smartctl long test (Background Mode) on sdb
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 30 minutes for test to complete.
Test will complete after Wed Aug 23 23:15:21 2023 CST
Use smartctl -X to abort test.

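The estimated time is still 30 minutes, but the message now says off-line mode (with -C it said captive mode).

  • Sure enough, in off-line mode the scan completes normally. The result:

Result of the smartctl long test (Background Mode) on sdb, which completed normally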
=== START OF INFORMATION SECTION ===
Model Family:     Intel 545s Series SSDs
Device Model:     INTEL SSDSC2KW512G8
Serial Number:    BTLA7513037S512DGN
LU WWN Device Id: 5 5cd2e4 14eea7536
Firmware Version: LHF002C
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Aug 24 00:33:08 2023 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       24193
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       160
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   079   079   005    Pre-fail  Always       -       1413069406491
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       36
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0032   028   045   000    Old_age   Always       -       28 (Min/Max 13/45)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       36
199 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1787849
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       0
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   079   079   000    Old_age   Always       -       0
236 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1787849
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       47693
249 NAND_Writes_1GiB        0x0032   100   100   000    Old_age   Always       -       168542
252 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       329

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24193         -
# 2  Extended captive    Interrupted (host reset)      90%     24191         -
# 3  Extended captive    Interrupted (host reset)      90%     24191         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
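To spot unfinished runs without reading the whole report, the self-test log entries can be filtered. This is a sketch only: failed_selftests is a hypothetical helper, and it reports every entry whose status is anything other than "Completed without error", which would also include a test still in progress.

```shell
#!/bin/sh
# Read `smartctl -l selftest` (or `smartctl -a`) output on stdin and
# print the self-test log entries that did not complete cleanly.
# Hypothetical helper.
failed_selftests() {
    awk '/^# *[0-9]+ / && !/Completed without error/'
}

# Demo with the three log entries from the output above.
failed_selftests <<'EOF'
# 1  Extended offline    Completed without error       00%     24193         -
# 2  Extended captive    Interrupted (host reset)      90%     24191         -
# 3  Extended captive    Interrupted (host reset)      90%     24191         -
EOF
```

For the log above, only the two Extended captive entries are printed.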

The LifeTime(hours) value here is 24193, which is the Power_On_Hours value at the time the test ran, i.e. how long the disk had been powered on.
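Hour counts of this size are easier to judge as days or years. A small sketch, assuming 8766 hours per year (365.25 days); hours_to_age is a hypothetical helper.

```shell
#!/bin/sh
# Convert power-on hours into whole days and approximate years.
# Hypothetical helper; 8766 = 365.25 * 24 hours per year.
hours_to_age() {
    awk -v h="$1" 'BEGIN { printf "%d days (%.2f years)\n", int(h / 24), h / 8766 }'
}

hours_to_age 24193   # prints 1008 days (2.76 years)
hours_to_age 47763   # prints 1990 days (5.45 years)
```

The second call uses the Power_On_Hours of the SanDisk drive for comparison.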

Strange: why can't I see Drive_Life_Remaining% on the Intel SSD?

After some searching, it seems Intel has its own diagnostic tool: How to Perform Quick/Full Diagnostic of Intel® SSDs Using Intel® Memory and Storage Tool (Intel® MAS) GUI (this is the diagnostic tool for Intel Optane SSDs / Memory devices).

For details, see Support for Intel® Memory and Storage Tool: https://www.intel.com/content/www/us/en/support/products/202249/memory-and-storage/ssd-management-tools/intel-memory-and-storage-tool.html
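A possible stand-in within smartctl itself is attribute 233, Media_Wearout_Indicator, which appears in the Intel output above with a normalized VALUE of 079. This rests on an assumption not confirmed by this document: that on this model the normalized value starts at 100 and falls as the NAND wears, as Intel describes for its SATA SSDs in general. Under that assumption, the sketch below reads the value; attr_value is a hypothetical helper.

```shell
#!/bin/sh
# Read `smartctl -A` (or `smartctl -a`) output on stdin and print the
# normalized VALUE column of the attribute with the given ID.
# Hypothetical helper; treating attribute 233 as "remaining life" is an
# assumption about this drive, not something smartctl states.
attr_value() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $4 + 0 }'
}

# Demo with the attribute line from the Intel output above.
attr_value 233 <<'EOF'
233 Media_Wearout_Indicator 0x0032   079   079   000    Old_age   Always       -       0
EOF
```

For this drive it prints 79. The same helper with ID 245 returns the Drive_Life_Remaining% value on the SanDisk drive.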

References