This post also applies to non-Exadata systems, as hard drives work the same way in other storage arrays too – just the commands you would use for extracting the disk-level metrics would be different. Scroll down to the smartctl part if you want to skip the Oracle stuff and get straight to the Linux disk diagnosis commands.
I just noticed that one of our Exadatas had a disk put into “predictive failure” mode and thought I’d show how to measure why the disk is in that mode (as opposed to just replacing it without really understanding the issue ;-)
SQL> @exadata/cellpd
Show Exadata cell versions from V$CELL_CONFIG....

DISKTYPE             CELLNAME             STATUS                  TOTAL_GB     AVG_GB  NUM_DISKS   PREDFAIL   POORPERF WTCACHEPROB   PEERFAIL   CRITICAL
-------------------- -------------------- -------------------- ----------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
FlashDisk            192.168.12.3         normal                       183         23          8
FlashDisk            192.168.12.3         not present                  183         23          8          3
FlashDisk            192.168.12.4         normal                       366         23         16
FlashDisk            192.168.12.5         normal                       366         23         16
HardDisk             192.168.12.3         normal                     20489       1863         11
HardDisk             192.168.12.3         warning - predictive        1863       1863          1          1
HardDisk             192.168.12.4         normal                     22352       1863         12
HardDisk             192.168.12.5         normal                     22352       1863         12
So, one of the disks in storage cell with IP 192.168.12.3 has been put into predictive failure mode. Let’s find out why!
To find out exactly which disk it is, I ran one of my scripts for displaying Exadata disk topology (partial output below):
SQL> @exadata/exadisktopo2
Showing Exadata disk topology from V$ASM_DISK and V$CELL_CONFIG....

CELLNAME         LUN_DEVICENAME  PHYSDISK  PHYSDISK_STATUS        CELLDISK          CD_DEVICEPART  GRIDDISK                  ASM_DISK                  ASM_DISKGROUP  LUNWRITECACHEMODE
---------------- --------------- --------- ---------------------- ----------------- -------------- ------------------------- ------------------------- -------------- --------------------------------------------------------------
192.168.12.3     /dev/sda        35:0      normal                 CD_00_enkcel01    /dev/sda3      DATA_CD_00_enkcel01       DATA_CD_00_ENKCEL01       DATA           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sda        35:0      normal                 CD_00_enkcel01    /dev/sda3      RECO_CD_00_enkcel01       RECO_CD_00_ENKCEL01       RECO           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdb        35:1      normal                 CD_01_enkcel01    /dev/sdb3      DATA_CD_01_enkcel01       DATA_CD_01_ENKCEL01       DATA           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdb        35:1      normal                 CD_01_enkcel01    /dev/sdb3      RECO_CD_01_enkcel01       RECO_CD_01_ENKCEL01       RECO           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdc        35:2      normal                 CD_02_enkcel01    /dev/sdc       DATA_CD_02_enkcel01       DATA_CD_02_ENKCEL01       DATA           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdc        35:2      normal                 CD_02_enkcel01    /dev/sdc       DBFS_DG_CD_02_enkcel01    DBFS_DG_CD_02_ENKCEL01    DBFS_DG        "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdc        35:2      normal                 CD_02_enkcel01    /dev/sdc       RECO_CD_02_enkcel01       RECO_CD_02_ENKCEL01       RECO           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdd        35:3      warning - predictive   CD_03_enkcel01    /dev/sdd       DATA_CD_03_enkcel01                                                "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdd        35:3      warning - predictive   CD_03_enkcel01    /dev/sdd       DBFS_DG_CD_03_enkcel01                                             "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                 /dev/sdd        35:3      warning - predictive   CD_03_enkcel01    /dev/sdd       RECO_CD_03_enkcel01                                                "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
Ok, looks like /dev/sdd (with address 35:3) is the “failed” one.
When listing the alerts from the storage cell, we indeed see that a failure has been predicted, a warning raised and even handled – the XDMG process gets notified and the ASM disks get dropped from the failing grid disks (you can see this in the exadisktopo output above: the ASM_DISK and ASM_DISKGROUP columns are empty for the /dev/sdd grid disks).
CellCLI> LIST ALERTHISTORY WHERE alertSequenceID = 456 DETAIL;
name: 456_1
alertDescription: "Data hard disk entered predictive failure status"
alertMessage: "Data hard disk entered predictive failure status. Status : WARNING - PREDICTIVE FAILURE Manufacturer : HITACHI Model Number : H7220AA30SUN2.0T Size : 2.0TB Serial Number : 1016M7JX2Z Firmware : JKAOA28A Slot Number : 3 Cell Disk : CD_03_enkcel01 Grid Disk : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01"
alertSequenceID: 456
alertShortName: Hardware
alertType: Stateful
beginTime: 2013-11-27T07:48:03-06:00
endTime: 2013-11-27T07:55:52-06:00
examinedBy:
metricObjectName: 35:3
notificationState: 1
sequenceBeginTime: 2013-11-27T07:48:03-06:00
severity: critical
alertAction: "The data hard disk has entered predictive failure status. A white cell locator LED has been turned on to help locate the affected cell, and an amber service action LED has been lit on the drive to help locate the affected drive. The data from the disk will be automatically rebalanced by Oracle ASM to other disks. Another alert will be sent and a blue OK-to-Remove LED will be lit on the drive when rebalance completes. Please wait until rebalance has completed before replacing the disk. Detailed information on this problem can be found at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1112995.1 "
name: 456_2
alertDescription: "Hard disk can be replaced now"
alertMessage: "Hard disk can be replaced now. Status : WARNING - PREDICTIVE FAILURE Manufacturer : HITACHI Model Number : H7220AA30SUN2.0T Size : 2.0TB Serial Number : 1016M7JX2Z Firmware : JKAOA28A Slot Number : 3 Cell Disk : CD_03_enkcel01 Grid Disk : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01 "
alertSequenceID: 456
alertShortName: Hardware
alertType: Stateful
beginTime: 2013-11-27T07:55:52-06:00
examinedBy:
metricObjectName: 35:3
notificationState: 1
sequenceBeginTime: 2013-11-27T07:48:03-06:00
severity: critical
alertAction: "The data on this disk has been successfully rebalanced by Oracle ASM to other disks. A blue OK-to-Remove LED has been lit on the drive. Please replace the drive."
The two alerts show that a (soon to be) failing disk was first detected (event 456_1) and then ASM kicked in, dropped the ASM disks residing on the failing physical disk and rebalanced the data elsewhere (event 456_2).
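As a sanity check on the ASM side, you could also confirm that the CD_03 grid disks are no longer disk group members and that no rebalance is still running. This is just a generic V$ASM_DISK / V$ASM_OPERATION query, not part of the cell alert workflow – a minimal sketch, with the disk name pattern simply taken from this system’s naming convention:

-- run on the ASM instance
SELECT group_number, name, path, mount_status, header_status, mode_status
FROM   v$asm_disk
WHERE  path LIKE '%CD_03_enkcel01';

-- any ongoing rebalance operation would show up here
SELECT group_number, operation, state, est_minutes
FROM   v$asm_operation;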
But we still do not know why the disk is expected to fail! The alert info and CellCLI command output do not have this detail. This is where S.M.A.R.T. monitoring comes in – major hard drive manufacturers support the SMART standard for both reactive and predictive monitoring of a hard disk’s internal workings, and there are commands for querying these metrics.
Let’s find the failed disk info at the cell level with CELLCLI:
CellCLI> LIST PHYSICALDISK;
         35:0            JK11D1YAJTXVMZ  normal
         35:1            JK11D1YAJB4V0Z  normal
         35:2            JK11D1YAJAZMMZ  normal
         35:3            JK11D1YAJ7JX2Z  warning - predictive failure
         35:4            JK11D1YAJB3J1Z  normal
         35:5            JK11D1YAJB4J8Z  normal
         35:6            JK11D1YAJ7JXGZ  normal
         35:7            JK11D1YAJB4E5Z  normal
         35:8            JK11D1YAJ8TY3Z  normal
         35:9            JK11D1YAJ8TXKZ  normal
         35:10           JK11D1YAJM5X9Z  normal
         35:11           JK11D1YAJAZNKZ  normal
         FLASH_1_0       1014M02JC3      not present
         FLASH_1_1       1014M02JYG      not present
         FLASH_1_2       1014M02JV9      not present
         FLASH_1_3       1014M02J93      not present
         FLASH_2_0       1014M02JFK      not present
         FLASH_2_1       1014M02JFL      not present
         FLASH_2_2       1014M02JF7      not present
         FLASH_2_3       1014M02JF8      not present
         FLASH_4_0       1014M02HP5      normal
         FLASH_4_1       1014M02HNN      normal
         FLASH_4_2       1014M02HP2      normal
         FLASH_4_3       1014M02HP4      normal
         FLASH_5_0       1014M02JUD      normal
         FLASH_5_1       1014M02JVF      normal
         FLASH_5_2       1014M02JAP      normal
         FLASH_5_3       1014M02JVH      normal
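As a side note, on a cell with many disks you don’t have to eyeball the whole list – CellCLI accepts WHERE filters just like the ALERTHISTORY query earlier. A sketch only (double-check the exact filter syntax on your cell software version):

CellCLI> LIST PHYSICALDISK WHERE diskType = HardDisk AND status != normal DETAIL;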
Ok, let’s look into the details, as we also need the deviceId for querying the SMART info:
CellCLI> LIST PHYSICALDISK 35:3 DETAIL;
         name:                   35:3
         deviceId:               26
         diskType:               HardDisk
         enclosureDeviceId:      35
         errMediaCount:          0
         errOtherCount:          0
         foreignState:           false
         luns:                   0_3
         makeModel:              "HITACHI H7220AA30SUN2.0T"
         physicalFirmware:       JKAOA28A
         physicalInsertTime:     2010-05-15T21:10:49-05:00
         physicalInterface:      sata
         physicalSerial:         JK11D1YAJ7JX2Z
         physicalSize:           1862.6559999994934G
         slotNumber:             3
         status:                 warning - predictive failure
Ok, the disk device was /dev/sdd, the disk name is 35:3 and the device ID is 26. And it’s a SATA disk. So I will run smartctl with the sat+megaraid device type option to query the disk SMART metrics – via the SCSI controller the disks are attached to. Note that the ,26 at the end is the deviceId reported by the LIST PHYSICALDISK command. There’s quite a lot of output; I have highlighted the important part in red:
> smartctl -a /dev/sdd -d sat+megaraid,26

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-400.11.1.el5uek] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI H7220AA30SUN2.0T 1016M7JX2Z
Serial Number:    JK11D1YAJ7JX2Z
LU WWN Device Id: 5 000cca 221df9d11
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Nov 28 06:28:13 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (22330) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   028   028   016    Pre-fail  Always       -       430833663
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       614 (Average 624)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   058   058   005    Pre-fail  Always       -       743
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30754
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       23 (Min/Max 17/48)
196 Reallocated_Event_Count 0x0032   064   064   000    Old_age   Always       -       827
197 Current_Pending_Sector  0x0022   089   089   000    Old_age   Always       -       364
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%            30754  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
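By the way, if you only care about the attribute table and not the full capability dump, smartctl’s -A option prints just the vendor-specific SMART attributes, which you can then grep down to the interesting ones – a minimal sketch, assuming the same device and megaraid deviceId as above:

# -A prints only the vendor-specific SMART attribute table
> smartctl -A /dev/sdd -d sat+megaraid,26 | \
      egrep 'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count'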
So, from the highlighted output above we see that the Raw_Read_Error_Rate indicator for this hard drive is pretty close to its threshold of 16. The SMART metrics really are just “health indicators” following the defined standards (read the wiki article). The normalized health indicator value can range from 0 to 253. For the Raw_Read_Error_Rate metric (which measures the physical read success rate from the disk surface), a bigger value is better (apparently 100 is the max, with the Hitachi disks at least – most of the other disks were showing around 90-100). So whenever there are media read errors, the metric will drop – the more errors, the more it drops.
Apparently some read errors are inevitable (and are caught by various checks like ECC), especially on high-density disks. The errors get corrected or worked around, sometimes via ECC, sometimes by a re-read. So, yes, your hard drive performance can get worse as disks age or approach failure. If the metric goes below the defined threshold of 16 (and stays there consistently over some period of time), the disk apparently isn’t working so well anymore, so it should be replaced.
Note that the RAW_VALUE column does not necessarily show the number of failed reads from the disk platter. It may represent the number of sectors that failed to be read, or it may be just a bitmap – or both combined into the low- and high-order bytes of this value. For example, converting the raw value of 430833663 to hex gives 0x19ADFFFF. Perhaps the low-order FFFF is some sort of a bitmap and the high-order 0x19AD is the number of failed sectors or reads. There’s some more info available about Seagate disks, but our V2 has Hitachi ones and I couldn’t find anything about how to decode the RAW_VALUE for their disks. So, we just have to trust that the “normalized” SMART health indicators for the different metrics tell us when there’s a problem.
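The hex conversion itself is a one-liner with standard shell arithmetic, nothing Exadata-specific:

# convert the RAW_VALUE to hex and split it into high/low 16-bit halves
> printf '0x%08X\n' 430833663
0x19ADFFFF
> printf 'high=0x%04X low=0x%04X\n' $((430833663 >> 16)) $((430833663 & 0xFFFF))
high=0x19AD low=0xFFFF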
Even though I did not see the actual value (nor the worst value) crossing the threshold when I ran the smartctl command, 28 is still pretty close to the threshold of 16, considering that normally the indicator should be close to 100. So my guess is that the indicator actually did cross the threshold and that is when the alert got raised – it’s just that by the time I logged in and ran my diagnostics commands, the disk was working better again. It looks like the “worst” values are not remembered properly by the disks (or it could be that some SMART tool resets these every now and then). Note that we would see SMART alerts with the actual problem metric values in the Linux /var/log/messages file if the smartd service were enabled in the Storage Cell Linux OS – but apparently it’s disabled and probably one of Oracle’s own daemons in the cell does that monitoring instead.
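You can check the smartd state on the cell yourself with the stock RHEL/OEL 5 service commands (nothing cell-specific here):

# is smartd running right now and is it configured to start on boot?
> service smartd status
> chkconfig --list smartd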
So what does this info tell us? A low “health indicator” for Raw_Read_Error_Rate means that there are problems with physically reading the bits off the disk platter. That points to bad sectors or weak sectors (which will probably soon become bad sectors). Had we seen a bad health state for UDMA_CRC_Error_Count instead, for example, it would have indicated a data transfer issue over the SATA cable. So it looks like the reason this disk is in the predictive failure state is simply that it has had too many read errors from the physical disk platter.
If you look at the other highlighted metrics above – Reallocated_Sector_Ct and Current_Pending_Sector – you see there are hundreds of disk sectors that have had IO issues: 743 have already been migrated (remapped) to a spare disk area and 364 are still pending reallocation. As these disks have a 512B sector size, this means that some Oracle block-size IOs against a single logical sector range may actually have to read part of the data from the original location and then seek to some other location on the disk for the rest (the remapped sectors). So, again, your disk performance may get worse when your disk is about to fail or is just having quality issues.
For reference, here’s an example from another, healthier disk in this Exadata storage cell:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       2687039
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       99
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       601 (Average 610)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30778
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       81
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Min/Max 16/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
The Raw_Read_Error_Rate indicator still shows 86 (% from ideal?), but it’s much farther away from the threshold of 16. Many of the other disks showed even 99 or 100, and apparently this metric value changes as the disk behavior changes: some disks with a value of 88 jumped to 100 and an hour later were at 95, and so on. So the VALUE column allows near real-time monitoring of these internal disk metrics – the only thing that doesn’t make sense right now is why the WORST column gets reset over time.
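If you want to keep an eye on this attribute across all hard disks in a cell, a small shell loop does the trick. This is a sketch only: the deviceId values below are made-up examples, so fill in the real ones from LIST PHYSICALDISK DETAIL on your cell first:

#!/bin/bash
# Print the Raw_Read_Error_Rate health indicator for a set of disks behind the
# MegaRAID controller. DEV_IDS must be filled in with the deviceId values
# reported by "LIST PHYSICALDISK DETAIL" (they are not necessarily sequential -
# 26 was the failing disk on this cell). /dev/sda is used only as a handle to
# reach the controller; smartctl addresses each individual drive via the
# sat+megaraid,<deviceId> option.
DEV_IDS="20 21 22 26"        # example values only - look yours up first
for id in $DEV_IDS; do
    echo "== megaraid deviceId $id =="
    smartctl -A /dev/sda -d sat+megaraid,$id | grep Raw_Read_Error_Rate
done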
For the better behaving disk, the Reallocated_Sector_Ct and Current_Pending_Sector metrics show zero, so this disk doesn’t seem to have bad or weak sectors (yet).
I hope this post is another example that it is possible to dig deeper – but only when the piece of software or hardware is properly instrumented, of course. Without such instrumentation it would be way harder (you would have to take a stethoscope and record the noise of the hard drive for analysis, or open the drive in a dust-free clean room and see what it’s doing yourself ;-)