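There are three ways to check whether a hard disk is failing. If at least two of the three report a problem, replace the disk outright.
1. Check the system log for disk errors
2. Test for bad blocks
3. Run a S.M.A.R.T. disk check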
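The replacement rule can be written down as a small helper. This is only a sketch: `should_replace_disk` is a hypothetical name, each argument is assumed to be 1 when that check found a problem and 0 otherwise, and the default threshold of 2 reflects one reading of the rule above (at least two of the three checks).

```shell
# Succeeds (exit status 0) when enough of the three checks reported a problem.
# Arguments: log_bad blocks_bad smart_bad [threshold]
# Each flag is 1 (problem found) or 0 (clean). The threshold defaults to 2,
# which is an assumption about how the rule should be read.
should_replace_disk() {
    log_bad=$1
    blocks_bad=$2
    smart_bad=$3
    threshold=${4:-2}
    [ $((log_bad + blocks_bad + smart_bad)) -ge "$threshold" ]
}
```

For example, `should_replace_disk 1 1 0 && echo "replace the disk"` prints the message, while `should_replace_disk 1 0 0` returns a non-zero status.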
1. Check the system log for disk errors
#grep "I/O error" /var/log/messages
Jun 12 15:23:30 2011 barracuda 2011 kernel: [ 376.184695] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 394.836691] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 413.212693] end_request: I/O error, dev sdb, sector 793283775
Jun 12 15:23:32 2011 barracuda 2011 kernel: [ 431.624701] end_request: I/O error, dev sdb, sector 793283783
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 449.808699] end_request: I/O error, dev sdb, sector 793283791
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 468.988698] end_request: I/O error, dev sdb, sector 793283929
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 493.208697] end_request: I/O error, dev sdb, sector 793286447
Jun 12 15:23:34 2011 barracuda 2011 kernel: [ 511.308686] end_request: I/O error, dev sdb, sector 793286447
Jun 20 08:25:08 2011 barracuda 2011 kernel: [ 96.904694] end_request: I/O error, dev sdb, sector 793286460
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 131.892704] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 149.992695] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 168.384693] end_request: I/O error, dev sdb, sector 793283808
This shows that the /dev/sdb disk on this machine has bad sectors.
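When the log holds many such lines, it can help to tally them per device. The following is a minimal sketch and not part of the original procedure: `count_io_errors` is a hypothetical helper name, and it assumes the kernel message format seen in the excerpt above (`I/O error, dev <name>, sector <n>`).

```shell
# Count kernel I/O error lines per device.
# Reads syslog-style lines on stdin and prints "<device> <count>" per device.
# Assumes each error line contains "I/O error, dev <name>, sector <n>".
count_io_errors() {
    sed -n 's|.*I/O error, dev \([^,]*\), sector.*|\1|p' |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

It could be used as `grep "I/O error" /var/log/messages | count_io_errors`. For the twelve lines shown above, all on sdb, this would print `sdb 12`.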
2. Test for bad blocks
#badblocks
#badblocks -v /dev/sdb9 -o blocks.txt
Checking blocks 0 to 476461408
Checking for bad blocks (read-only test): ^C 2759968/ 476461408
#cat blocks.txt
960
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
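These are the bad block numbers. Note that badblocks reports blocks (1024 bytes each by default), not disk sectors, and the numbers are relative to the start of the device that was tested, here the partition /dev/sdb9.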
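A long list of block numbers is easier to read when consecutive blocks are collapsed into ranges. The sketch below assumes the file holds one block number per line in ascending order, as in the listing above. `summarize_bad_blocks` is a hypothetical helper name.

```shell
# Collapse a sorted list of bad block numbers (one per line on stdin)
# into contiguous ranges, printing "<first>-<last> (<n> blocks)" per range.
summarize_bad_blocks() {
    awk '
        function print_range(a, b) {
            printf "%d-%d (%d blocks)\n", a, b, b - a + 1
        }
        NR == 1        { start = prev = $1; next }
        $1 == prev + 1 { prev = $1; next }          # still inside the current run
                       { print_range(start, prev); start = prev = $1 }
        END            { if (NR > 0) print_range(start, prev) }
    '
}
```

Running `summarize_bad_blocks < blocks.txt` on the listing above would report one isolated block (960) and one run of 19 consecutive blocks (996 to 1014).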
3. Run a S.M.A.R.T. disk check
#smartctl
#smartctl -d ata -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES.2
Device Model: ST3250310NS #disk model
Serial Number: 9SF0TRBH #disk serial number
Firmware Version: SN05
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jul 30 13:39:50 2014 CST
==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963
SMART support is: Available - device has SMART capability.
SMART support is: Enabled #the disk supports S.M.A.R.T. and it is enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED #the disk passed its health check
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 634) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 58) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 075 063 044 Pre-fail Always - 36165465
3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 71
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 13114328965
9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 19465
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 71
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 096 000 Old_age Always - 212
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 062 058 045 Old_age Always - 38 (Lifetime Min/Max 35/42)
194 Temperature_Celsius 0x0022 038 042 000 Old_age Always - 38 (0 21 0 0)
195 Hardware_ECC_Recovered 0x001a 056 033 000 Old_age Always - 36165465
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged #entries in the error log mean the disk may have had problems in the past
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 00% 617 -
# 2 Extended offline Completed without error 00% 617 -
# 3 Extended offline Completed without error 00% 616 -
# 4 Extended offline Completed without error 00% 615 -
# 5 Extended offline Completed without error 00% 614 -
# 6 Extended offline Completed without error 00% 613 -
# 7 Extended offline Completed without error 00% 612 -
# 8 Extended offline Completed without error 00% 611 -
# 9 Extended offline Completed without error 00% 610 -
#10 Extended offline Completed without error 00% 609 -
#11 Extended offline Completed without error 00% 608 -
#12 Extended offline Completed without error 00% 607 -
#13 Extended offline Completed without error 00% 606 -
#14 Extended offline Completed without error 00% 605 -
#15 Extended offline Completed without error 00% 604 -
#16 Extended offline Completed without error 00% 602 -
#17 Extended offline Completed without error 00% 601 -
#18 Extended offline Completed without error 00% 600 -
#19 Extended offline Completed without error 00% 599 -
#20 Extended offline Completed without error 00% 598 -
#21 Extended offline Completed without error 00% 597 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The most important S.M.A.R.T. attributes are picked out and explained below.
http://en.wikipedia.org/wiki/S.M.A.R.T.
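ID | Attribute | Description
1 | Raw Read Error Rate | Read error rate. The raw value is vendor-specific; on Seagate drives it can be ignored.
3 | Spin Up Time | Time the spindle takes to spin up. May indicate an aging spindle motor.
4 | Start/Stop Count | Number of times the spindle motor has started and stopped, which is generally the number of times the power was switched on and off. The motor wears most when starting and stopping, so this is a lifetime indicator.
5 | Reallocated Sector Count | Number of reallocated sectors. When the disk hits a read, write, or verify error on a sector, it marks the sector and remaps it to a spare area. Once this value appears the disk may be defective; a value of 1 or more is enough to treat the disk as faulty.
7 | Seek Error Rate | Seek error rate. The raw value is vendor-specific; on Seagate drives it can be ignored.
9 | Power-On Hours Count | Total time the disk has been powered on. This is a lifetime indicator (25,000 hours is roughly 3 years).
10 | Spin-up Retry Count | Number of spin-up retries. If this value keeps increasing, it may be an early sign of spindle motor failure.
12 | Power Cycle Count | Number of power on/off cycles of the disk.
188 | Command Timeout | Number of commands sent to the disk that timed out. This value should normally be 0; a value above zero may be related to the disk's power supply or cabling.
197 | Current Pending Sector Count | Number of unstable sectors waiting to be remapped after read errors, that is, suspected bad sectors. A value of 1 or more is enough to treat the disk as faulty.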
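Checking these attributes by eye gets tedious across many disks, so the raw values can be filtered with a small script. The following is a sketch under stated assumptions: `flag_smart_attrs` is a hypothetical helper, it expects attribute rows in the ten-column layout shown in the smartctl output above (raw value in the tenth column), and its default watch list of IDs 5 and 197 is simply the two attributes described here as indicating a fault when non-zero.

```shell
# Print watched SMART attributes whose raw value is greater than zero.
# Reads smartctl attribute rows on stdin; prints "<id> <name> <raw>".
# Argument 1 (optional): space-separated attribute IDs to watch.
# The default "5 197" is an assumption chosen for illustration.
flag_smart_attrs() {
    ids="${1:-5 197}"
    awk -v ids="$ids" '
        BEGIN {
            n = split(ids, list, " ")
            for (i = 1; i <= n; i++) watch[list[i]] = 1
        }
        # Attribute rows have a hex FLAG in column 3; this skips other
        # tables in the report whose first column is also a small number.
        ($1 in watch) && $3 ~ /^0x/ && ($10 + 0) > 0 { print $1, $2, $10 }
    '
}
```

A possible invocation is `smartctl -A /dev/sda | flag_smart_attrs`. For the disk shown above it would print nothing, since both attributes have a raw value of 0. Passing another list, for example `flag_smart_attrs "188"`, checks Command Timeout instead.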
Reference: http://en.wikipedia.org/wiki/S.M.A.R.T.