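There are three ways to check whether a hard disk is failing. If at least two of the three report a problem, replace the disk outright.
1. Check the system log for disk errors
2. Test for bad blocks
3. Check the disk with S.M.A.R.T.
1. Check the system log for disk errors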
#grep "I/O error" /var/log/messages
Jun 12 15:23:30 2011 barracuda 2011 kernel: [ 376.184695] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 394.836691] end_request: I/O error, dev sdb, sector 793283772
Jun 12 15:23:31 2011 barracuda 2011 kernel: [ 413.212693] end_request: I/O error, dev sdb, sector 793283775
Jun 12 15:23:32 2011 barracuda 2011 kernel: [ 431.624701] end_request: I/O error, dev sdb, sector 793283783
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 449.808699] end_request: I/O error, dev sdb, sector 793283791
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 468.988698] end_request: I/O error, dev sdb, sector 793283929
Jun 12 15:23:33 2011 barracuda 2011 kernel: [ 493.208697] end_request: I/O error, dev sdb, sector 793286447
Jun 12 15:23:34 2011 barracuda 2011 kernel: [ 511.308686] end_request: I/O error, dev sdb, sector 793286447
Jun 20 08:25:08 2011 barracuda 2011 kernel: [ 96.904694] end_request: I/O error, dev sdb, sector 793286460
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 131.892704] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 149.992695] end_request: I/O error, dev sdb, sector 793283799
Jun 20 08:26:43 2011 barracuda 2011 kernel: [ 168.384693] end_request: I/O error, dev sdb, sector 793283808
These messages indicate that the /dev/sdb disk on this machine has bad sectors.
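When the log is long, it can help to see how many I/O errors each device has produced. The following is a minimal sketch, assuming kernel messages in the "I/O error, dev NAME," form shown above; count_io_errors is a hypothetical helper name, not a standard tool.

```shell
# count_io_errors: hypothetical helper, not part of any package.
# Prints "DEVICE COUNT" for every device that appears in
# "I/O error, dev DEVICE," kernel messages in the given log file.
count_io_errors() {
    # $1: log file to scan, e.g. /var/log/messages
    awk '/I\/O error, dev / {
             for (i = 2; i < NF; i++)
                 if ($i == "dev" && $(i - 1) == "error,") {
                     d = $(i + 1)
                     sub(/,$/, "", d)   # strip the trailing comma
                     n[d]++
                 }
         }
         END { for (d in n) print d, n[d] }' "$1" | sort
}
```

Usage: count_io_errors /var/log/messages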
2. Test for bad blocks
#badblocks
#badblocks -v /dev/sdb9 -o blocks.txt
Checking blocks 0 to 476461408
Checking for bad blocks (read-only test): ^C 2759968/ 476461408
#cat blocks.txt
960 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014
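These numbers are the bad blocks found on /dev/sdb9 (badblocks counts in 1024-byte blocks by default, not 512-byte sectors).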
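A long list of block numbers is easier to read when consecutive blocks are collapsed into ranges. The following is a minimal sketch, assuming a file of whitespace-separated block numbers such as the blocks.txt written above; summarize_blocks is a hypothetical helper name.

```shell
# summarize_blocks: hypothetical helper, not part of e2fsprogs.
# Collapses a list of bad block numbers into ranges, one per line.
summarize_blocks() {
    # $1: file written by "badblocks -o", e.g. blocks.txt
    tr -s ' \t' '\n' < "$1" | grep -E '^[0-9]+$' | sort -n | awk '
        function emit(a, b) { if (a == b) print a; else print a "-" b }
        NR == 1        { start = prev = $1; next }
        $1 == prev + 1 { prev = $1; next }      # still inside the same run
                       { emit(start, prev); start = prev = $1 }
        END            { if (NR > 0) emit(start, prev) }'
}
```

Usage: summarize_blocks blocks.txt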
3. Check the disk with S.M.A.R.T.
#smartctl
#smartctl -d ata -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda ES.2
Device Model:     ST3250310NS    # disk model
Serial Number:    9SF0TRBH    # disk serial number
Firmware Version: SN05
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jul 30 13:39:50 2014 CST

==> WARNING: There are known problems with these drives,
see the following Seagate web pages:
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207963

SMART support is: Available - device has SMART capability.
SMART support is: Enabled    # the disk supports S.M.A.R.T. and it is enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED    # the disk passed the health check

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 634) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  58) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   075   063   044    Pre-fail  Always       -       36165465
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       13114328965
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19465
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       71
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       212
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   058   045    Old_age   Always       -       38 (Lifetime Min/Max 35/42)
194 Temperature_Celsius     0x0022   038   042   000    Old_age   Always       -       38 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   056   033   000    Old_age   Always       -       36165465
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged    # entries in the error log would mean the disk may have had problems before

SMART Self-test log structure revision number 1
Num  Test_Description    Status                       Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      00%        617              -
# 2  Extended offline    Completed without error       00%        617              -
# 3  Extended offline    Completed without error       00%        616              -
# 4  Extended offline    Completed without error       00%        615              -
# 5  Extended offline    Completed without error       00%        614              -
# 6  Extended offline    Completed without error       00%        613              -
# 7  Extended offline    Completed without error       00%        612              -
# 8  Extended offline    Completed without error       00%        611              -
# 9  Extended offline    Completed without error       00%        610              -
#10  Extended offline    Completed without error       00%        609              -
#11  Extended offline    Completed without error       00%        608              -
#12  Extended offline    Completed without error       00%        607              -
#13  Extended offline    Completed without error       00%        606              -
#14  Extended offline    Completed without error       00%        605              -
#15  Extended offline    Completed without error       00%        604              -
#16  Extended offline    Completed without error       00%        602              -
#17  Extended offline    Completed without error       00%        601              -
#18  Extended offline    Completed without error       00%        600              -
#19  Extended offline    Completed without error       00%        599              -
#20  Extended offline    Completed without error       00%        598              -
#21  Extended offline    Completed without error       00%        597              -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Of the S.M.A.R.T. attributes, the following are the most important ones to look at:
http://en.wikipedia.org/wiki/S.M.A.R.T.
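ID | Attribute | Meaning
1 | Raw Read Error Rate | Read error rate. The value is vendor-specific and can be ignored on Seagate drives.
3 | Spin Up Time | Time the platters take to spin up. A worsening value may mean the spindle motor is aging.
4 | Start/Stop Count | Number of times the spindle motor has started and stopped, which is generally the number of times the power was switched on and off. The motor wears most when starting and stopping, so this is a lifetime indicator.
5 | Reallocated Sector Count | Number of reallocated sectors. When the disk hits a read, write, or verify error on a sector, it marks that sector and remaps it to a spare area. A non-zero value suggests a defect; a raw value of 1 or more is enough to treat the disk as faulty.
7 | Seek Error Rate | Seek error rate. The value is vendor-specific and can be ignored on Seagate drives.
9 | Power-On Hours Count | Total time the disk has been powered on; a lifetime indicator (25,000 hours is roughly 3 years).
10 | Spin-up Retry Count | Number of spin-up retries. A value that keeps increasing may be an early sign of spindle motor failure.
12 | Power Cycle Count | Number of power on/off cycles of the disk.
188 | Command Timeout | Number of commands sent to the disk that timed out. This should normally be 0; a value above zero may point to the disk's power supply or cabling.
197 | Current Pending Sector Count | Number of unstable sectors that could not be read and are waiting to be reallocated, that is, suspected bad sectors. A raw value of 1 or more is enough to treat the disk as faulty.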
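The attributes that should stay at zero can be checked automatically. The following is a minimal sketch, assuming the ten-column attribute table layout shown in the smartctl output above (raw value in the last column); check_smart_attrs is a hypothetical helper name, and the watched IDs (5, 188, 197) are simply the ones this note treats as problematic when non-zero.

```shell
# check_smart_attrs: hypothetical helper, not part of smartmontools.
# Reads a smartctl attribute table on stdin and prints "ID NAME RAW"
# for each watched attribute whose raw value is greater than zero.
check_smart_attrs() {
    awk '$1 ~ /^[0-9]+$/ &&
         ($1 == 5 || $1 == 188 || $1 == 197) &&
         $10 + 0 > 0 {
             print $1, $2, $10    # column 10 is RAW_VALUE
         }'
}
```

Usage: smartctl -A /dev/sda | check_smart_attrs

No output means none of the watched attributes has a non-zero raw value. Against the sample output above, this would report attribute 188 (Command_Timeout) with a raw value of 212.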