Monitoring Disk Events
As of Oracle Hardware Management Pack 2.3.2.2, enhanced diagnostic features have been added to collect disk error and SMART events from disks attached to the Sun Storage 6 Gb SAS PCIe HBA, Internal (SGX-SAS6-INT-Z), whether independent or in a RAID volume.
These enhanced diagnostic events are captured and logged in /var/log/ssm/event.log when the hardware management agent is running.
The following table lists the enhanced diagnostic events being logged.
Event Name in Log | Description |
---|---|
PD_RECOVERED_ERROR | A disk recovered error was detected. |
PD_BAD_DEVICE_FAULT | A non-recoverable drive failure was detected by the device while performing a command. |
PD_MEDIA_ERROR | A medium error was detected by the device that was non-recoverable. |
PD_DEVICE_ERROR | A non-recoverable hardware failure was detected by the device. The device may be offlined or degraded. |
PD_TRANSPORT_ERROR | A path to the device has been unconfigured due to transport instability. |
PD_OVER_TEMPERATURE | Disk SMART process reports a critical temperature. |
PD_SELF_TEST_FAILURE | One or more disk SMART self tests failed. |
PD_PREDICTIVE_FAILURE | SMART health-monitoring firmware reported that a disk failure is imminent. |
The controller polls each physical disk at regular intervals. If a disk has encountered an error, an event is generated by the controller. The hardware management agent captures that event and enters it in the hardware management event log.
To view the event information in the hardware management event log, type:
# view /var/log/ssm/event.log
For enhanced diagnostic disk events, you will see information similar to:
Thu Apr 30 12:32:31 2015:(CLI) Event Name : PD_MEDIA_ERROR
Thu Apr 30 12:32:31 2015:(CLI) Event Description : A medium error was detected by the device that was non-recoverable.
Thu Apr 30 12:32:31 2015:(CLI) ASC : 0x10
Thu Apr 30 12:32:31 2015:(CLI) ASCQ : 0x3
Thu Apr 30 12:32:31 2015:(CLI) Sense Key : 0x3
Thu Apr 30 12:32:31 2015:(CLI) Source : LSI
Thu Apr 30 12:32:31 2015:(CLI) SAS Address : 0x5000cca01200fadd
Thu Apr 30 12:32:31 2015:(CLI) LSI Description : Unexpected sense: PD 0c(e0xfc/s1) Path 5000cca01200fadd, CDB: 2f 00 00 fc 4d 42 00 10 00 00, Sense: 3/10/03
Thu Apr 30 12:32:31 2015:(CLI) Event TimeStamp : 04/30/2015 ; 19:30:25
Thu Apr 30 12:32:31 2015:(CLI) Node ID : 00000000:12
Thu Apr 30 12:32:31 2015:(CLI) Nac Name : /SYS/HDD1
Thu Apr 30 12:32:31 2015:(CLI) Serial Number : 001015N0JPXA PMG0JPXA
Thu Apr 30 12:32:31 2015:(CLI) WWN No : PDS:5000cca01200fadd
Thu Apr 30 12:32:31 2015:(CLI) Disk Model : H106030SDSUN300G
You can then use the information in the event listing to determine which physical disk in the system has the issue. Fields such as the Oracle ILOM Nac Name (which matches the label on the front panel of the system) and the drive Serial Number help you identify the disk and its drive slot in the system.
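If the log holds many events, the fields that identify a disk can be pulled out with a short script. The following Python sketch is not part of Oracle Hardware Management Pack. It assumes that every line follows the "timestamp:(CLI) field : value" layout of the sample listing and that each event begins with an Event Name line. The names parse_events and describe are hypothetical.

```python
import re

# Assumed layout, inferred from the sample listing only:
#   "<timestamp>:(CLI) <field> : <value>"
LINE_RE = re.compile(
    r"^(?P<ts>.+?):\(CLI\)\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.*)$"
)


def parse_events(lines):
    """Group log lines into one dict per event.

    A new event is assumed to start at each 'Event Name' line.
    Lines that do not match the assumed layout are skipped.
    """
    events = []
    current = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        key = match.group("key")
        value = match.group("value").strip()
        if key == "Event Name":
            current = {"Logged": match.group("ts")}
            events.append(current)
        if current is not None:
            current[key] = value
    return events


def describe(event):
    """Return a one-line summary naming the event, drive slot, and serial number."""
    return "{} at {} (serial {})".format(
        event.get("Event Name", "?"),
        event.get("Nac Name", "?"),
        event.get("Serial Number", "?"),
    )
```

Passing the lines of /var/log/ssm/event.log to parse_events and calling describe on each result gives one summary line per event, which can be matched against the front panel labels.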
Note:
For PD_OVER_TEMPERATURE, PD_SELF_TEST_FAILURE and PD_PREDICTIVE_FAILURE events, use Oracle ILOM to configure proactive alerts.
For the other disk diagnostic events described in this document, it is up to the administrator to check the hardware management event log for these disk events when a disk problem is suspected. There is currently no alert mechanism to proactively announce these events.
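Because these events must be checked for manually, an administrator might run a periodic scan of the log. The following Python sketch is an illustration under assumptions, not a supported alert mechanism. It assumes that Event Name lines look like the sample listing, and it treats the five events in the table other than the three SMART events as the ones without proactive alerts. The names count_unalerted and scan_log are hypothetical.

```python
import re

# Events from the table that are not covered by Oracle ILOM proactive alerts,
# per the note above (every event except the three SMART events).
UNALERTED = {
    "PD_RECOVERED_ERROR",
    "PD_BAD_DEVICE_FAULT",
    "PD_MEDIA_ERROR",
    "PD_DEVICE_ERROR",
    "PD_TRANSPORT_ERROR",
}

# Assumed line layout: "<timestamp>:(CLI) Event Name : <name>"
NAME_RE = re.compile(r":\(CLI\)\s*Event Name\s*:\s*(\S+)")


def count_unalerted(lines):
    """Count occurrences of each event that has no proactive alert."""
    counts = {}
    for line in lines:
        match = NAME_RE.search(line)
        if match and match.group(1) in UNALERTED:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts


def scan_log(path="/var/log/ssm/event.log"):
    """Read the hardware management event log and return the counts."""
    with open(path, errors="replace") as log:
        return count_unalerted(log)
```

A non-empty result from scan_log indicates that at least one of these events has been logged and that the full entries in the log are worth reviewing.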