Linux Hard Disk Diagnostics

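There are three ways to judge whether a hard disk is failing. If two or more of the three report problems, replacing the disk outright is recommended.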

1. Check the system log for disk errors
2. Test for bad blocks
3. Run a S.M.A.R.T. disk check
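
The decision rule can be written down as a tiny shell helper. This is only a sketch: the function name should_replace_disk is made up for illustration, and it assumes the rule means "at least two of the three checks found a problem".

```shell
# Hypothetical helper: decide whether to replace a disk from the three checks.
# Each argument is 1 if that check found a problem, 0 otherwise:
#   $1 = kernel log errors, $2 = badblocks, $3 = S.M.A.R.T.
# Returns success (0) when at least two checks report a problem.
should_replace_disk() {
    [ $(( $1 + $2 + $3 )) -ge 2 ]
}

# Example: log errors and bad blocks found, S.M.A.R.T. still clean.
if should_replace_disk 1 1 0; then
    echo "replace the disk"
fi
```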

1. Check the system log for disk errors

#grep "I/O error" /var/log/messages
Jun 12 15:23:30 2011 barracuda 2011 kernel: [ 376.184695] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 394.836691] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 413.212693] end_request: I/O error, dev sdb, sector 793283775
Jun 12 15:23:32 2011 barracuda 2011 kernel: [ 431.624701] end_request: I/O error, dev sdb, sector 793283783
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 449.808699] end_request: I/O error, dev sdb, sector 793283791
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 468.988698] end_request: I/O error, dev sdb, sector 793283929
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 493.208697] end_request: I/O error, dev sdb, sector 793286447
Jun 12 15:23:34 2011 barracuda 2011 kernel: [ 511.308686] end_request: I/O error, dev sdb, sector 793286447
Jun 20 08:25:08 2011 barracuda 2011 kernel: [ 96.904694] end_request: I/O error, dev sdb, sector 793286460
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 131.892704] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 149.992695] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 168.384693] end_request: I/O error, dev sdb, sector 793283808

This shows that the /dev/sdb disk in this machine has bad sectors.
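
When the log is long, it can help to count the errors per device instead of reading every line. The following is a minimal sketch, assuming the kernel messages have the "I/O error, dev <name>, sector <n>" shape shown above; the function name count_io_errors is hypothetical.

```shell
# Hypothetical helper: count "I/O error" kernel messages per device.
# Reads log lines on stdin and prints "<device> <count>", one device per line.
count_io_errors() {
    awk '/I\/O error, dev / {
             # The device name is the field after "dev"; drop its trailing comma.
             for (i = 1; i < NF; i++)
                 if ($i == "dev") { d = $(i + 1); sub(/,$/, "", d); n[d]++ }
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Example (the log path is the one used above and differs between distributions):
# count_io_errors < /var/log/messages
```

A device that keeps accumulating errors across several days, as sdb does in the log above, is a stronger signal than a single isolated message.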

2. Test for bad blocks

#badblocks

#badblocks -v /dev/sdb9 -o blocks.txt
Checking blocks 0 to 476461408
Checking for bad blocks (read-only test): ^C 2759968/ 476461408
#cat blocks.txt
960
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014

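These are the bad blocks that were found. Note that badblocks reports block numbers in units of its block size, not 512-byte sector numbers.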
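
To compare the badblocks result with the sector numbers in the kernel log, the block numbers have to be converted. The sketch below assumes the default badblocks block size of 1024 bytes (the -b option changes it) and 512-byte sectors; the function name blocks_to_sectors is hypothetical. Because the test above ran on the partition /dev/sdb9, the resulting sector numbers are relative to the start of that partition, not to the whole disk.

```shell
# Hypothetical helper: convert badblocks block numbers (stdin, one per line)
# to 512-byte sector ranges, relative to the start of the tested device.
# $1 is the badblocks block size in bytes (assumed 1024 when omitted).
blocks_to_sectors() {
    awk -v bs="${1:-1024}" '
        /^[0-9]+$/ {
            per = bs / 512                      # sectors per block
            printf "%d %d-%d\n", $1, $1 * per, ($1 + 1) * per - 1
        }'
}

# Example:
# blocks_to_sectors < blocks.txt
```

Each output line holds the block number followed by the first and last sector that block covers.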

3. Run a S.M.A.R.T. disk check

#smartctl

#smartctl -d ata -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES.2
Device Model: ST3250310NS              #disk model
Serial Number: 9SF0TRBH                #disk serial number
Firmware Version: SN05
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jul 30 13:39:50 2014 CST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963

SMART support is: Available - device has SMART capability.
SMART support is: Enabled              #the disk supports S.M.A.R.T. and it is enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED  #disk health check passed

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
 was completed without error.
 Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
 without error or no self-test has ever 
 been run.
Total time to complete Offline 
data collection: ( 634) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
 Auto Offline data collection on/off support.
 Suspend Offline collection upon new
 command.
 Offline surface scan supported.
 Self-test supported.
 Conveyance Self-test supported.
 Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
 power-saving mode.
 Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
 General Purpose Logging supported.
Short self-test routine 
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 58) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
 SCT Feature Control supported.
 SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate 0x000f 075 063 044 Pre-fail Always - 36165465          
 3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 71                    
 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0               
 7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 13114328965           
 9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 19465                   
 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
 12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 71
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 096 000 Old_age Always - 212
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 062 058 045 Old_age Always - 38 (Lifetime Min/Max 35/42)
194 Temperature_Celsius 0x0022 038 042 000 Old_age Always - 38 (0 21 0 0)
195 Hardware_ECC_Recovered 0x001a 056 033 000 Old_age Always - 36165465
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged             #entries in the error log mean the disk may have had problems before

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 00% 617 -
# 2 Extended offline Completed without error 00% 617 -
# 3 Extended offline Completed without error 00% 616 -
# 4 Extended offline Completed without error 00% 615 -
# 5 Extended offline Completed without error 00% 614 -
# 6 Extended offline Completed without error 00% 613 -
# 7 Extended offline Completed without error 00% 612 -
# 8 Extended offline Completed without error 00% 611 -
# 9 Extended offline Completed without error 00% 610 -
#10 Extended offline Completed without error 00% 609 -
#11 Extended offline Completed without error 00% 608 -
#12 Extended offline Completed without error 00% 607 -
#13 Extended offline Completed without error 00% 606 -
#14 Extended offline Completed without error 00% 605 -
#15 Extended offline Completed without error 00% 604 -
#16 Extended offline Completed without error 00% 602 -
#17 Extended offline Completed without error 00% 601 -
#18 Extended offline Completed without error 00% 600 -
#19 Extended offline Completed without error 00% 599 -
#20 Extended offline Completed without error 00% 598 -
#21 Extended offline Completed without error 00% 597 -

SMART Selective self-test log data structure revision number 1
 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
 1 0 0 Not_testing
 2 0 0 Not_testing
 3 0 0 Not_testing
 4 0 0 Not_testing
 5 0 0 Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The most important S.M.A.R.T. attributes are picked out and explained below.
http://en.wikipedia.org/wiki/S.M.A.R.T.

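1    Raw Read Error Rate
     Read error rate. The raw value is vendor-specific and can be ignored on Seagate disks.
3    Spin Up Time
     Time the spindle takes to reach speed. A worsening value may indicate an aging spindle motor.
4    Start/Stop Count
     Number of times the spindle motor has been started and stopped, which is normally the number of times the power has been switched on and off. The motor wears most when starting and stopping, so this is a reference value for estimating wear.
5    Reallocated Sector Count
     Number of reallocated sectors. When the disk hits a read, write, or verify error on a sector, it marks the sector and remaps it to a spare area. As soon as the raw value rises above zero, the disk can be judged to have a problem.
7    Seek Error Rate
     Seek error rate. The raw value is vendor-specific and can be ignored on Seagate disks.
9    Power-On Hours Count
     Total time the disk has been powered on; a reference value for estimating wear. (25,000 hours is roughly 3 years of continuous operation.)
10   Spin-up Retry Count
     Number of retried spin-up attempts. A value that keeps increasing may be an early sign of spindle motor failure.
12   Power Cycle Count
     Number of complete power-on/power-off cycles.
188  Command Timeout
     Number of commands sent to the disk that timed out. This should normally be 0; a value above zero may point to a problem with the disk's power supply or cabling.
197  Current Pending Sector Count
     Number of unstable sectors waiting to be remapped after a failed read, that is, suspected bad sectors. As soon as the raw value rises above zero, the disk can be judged to have a problem.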
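
The checks described above can be automated against the attribute table. This is a rough sketch rather than a complete health check: it assumes the table layout shown in the smartctl output above (raw value in the tenth column), treats any nonzero raw value of attributes 5, 188, and 197 as suspicious, and uses a hypothetical function name, check_smart_attrs.

```shell
# Hypothetical helper: flag attributes 5, 188 and 197 in smartctl attribute
# output read from stdin. Prints one line per attribute with a nonzero raw
# value and returns 1 if any were found, 0 otherwise.
check_smart_attrs() {
    awk '
        BEGIN { bad = 0 }
        # Attribute rows start with a numeric ID; the raw value is field 10.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1; name = $2; raw = $10 + 0
            if ((id == 5 || id == 188 || id == 197) && raw > 0) {
                printf "WARN %s %s raw=%d\n", id, name, raw
                bad = 1
            }
        }
        END { exit bad }'
}

# Example (smartctl -A prints only the attribute table):
# smartctl -d ata -A /dev/sda | check_smart_attrs
```

Run against the sample output above, this would flag only Command_Timeout, whose raw value is 212.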

Reference: http://en.wikipedia.org/wiki/S.M.A.R.T.
