Oracle Hardware Management Pack Diagnostics for SGX-SAS6-INT-Z HBA Need to Be Extended (20364298)

In Oracle Solaris 11.2 SRU 10, enhanced diagnostic features were added to collect more data from disks connected to the Sun Storage 6 Gb SAS PCIe HBA, Internal (SGX-SAS6-INT-Z). The additional data covers various disk errors and SMART events. In addition, these events identify suspect physical disks among logical disks in a RAID volume. The events are captured and logged in /var/log/ssm/event.log when the hardware management agent (svc:/system/sp/management:default) is running.

The following table lists the enhanced diagnostic events that are logged.

Event Name in Log	Description
PD_RECOVERED_ERROR	A disk recovered error was detected.
PD_BAD_DEVICE_FAULT	A non-recoverable drive failure was detected by the device while performing a command.
PD_MEDIA_ERROR	A medium error was detected by the device that was non-recoverable.
PD_DEVICE_ERROR	A non-recoverable hardware failure was detected by the device. The device may be offlined or degraded.
PD_TRANSPORT_ERROR	A path to the device has been unconfigured due to transport instability.
PD_OVER_TEMPERATURE	Disk SMART process reports a critical temperature.
PD_SELF_TEST_FAILURE	One or more disk SMART self tests failed.
PD_PREDICTIVE_FAILURE	SMART health-monitoring firmware reported that a disk failure is imminent.

The controller polls each physical disk in the volume at regular intervals. If a disk has encountered an error, an event is generated by the controller. The hardware management agent captures that event and enters it in the hardware management event log.

To view the event information in the hardware management event log, type:

# view /var/log/ssm/event.log

For disk events, you will see information similar to:

Thu Apr 30 12:32:31 2015:(CLI) Event Name  : PD_MEDIA_ERROR
Thu Apr 30 12:32:31 2015:(CLI) Event Description : A medium error was detected by the device that was non-recoverable.
Thu Apr 30 12:32:31 2015:(CLI) ASC  : 0x10
Thu Apr 30 12:32:31 2015:(CLI) ASCQ : 0x3
Thu Apr 30 12:32:31 2015:(CLI) Sense Key : 0x3
Thu Apr 30 12:32:31 2015:(CLI) Source : LSI
Thu Apr 30 12:32:31 2015:(CLI) SAS Address : 0x5000cca01200fadd
Thu Apr 30 12:32:31 2015:(CLI) LSI Description : Unexpected sense: PD 0c(e0xfc/s1) Path 5000cca01200fadd, CDB: 2f 00 00 fc 4d 42 00 10 00 00, Sense: 3/10/03
Thu Apr 30 12:32:31 2015:(CLI) Event TimeStamp : 04/30/2015 ; 19:30:25
Thu Apr 30 12:32:31 2015:(CLI) Node ID : 00000000:12
Thu Apr 30 12:32:31 2015:(CLI) Nac Name : /SYS/HDD1
Thu Apr 30 12:32:31 2015:(CLI) Serial Number : 001015N0JPXA   PMG0JPXA
Thu Apr 30 12:32:31 2015:(CLI) WWN No : PDS:5000cca01200fadd
Thu Apr 30 12:32:31 2015:(CLI) Disk Model : H106030SDSUN300G

You can then use the information in the event listing to determine which physical disk in the system has the issue. Information such as the Oracle ILOM Nac Name (which matches the label on the front panel of the system) and the drive Serial Number helps you identify the disk and its drive slot in the system.
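
If the log holds many events, pulling out the identifying fields by script may be quicker than reading the listing by eye. The following Python sketch is illustrative only and is not part of Oracle Hardware Management Pack. It assumes that every log line has the form "timestamp:(source) field : value" seen in the sample listing, and that a line without that prefix is the wrapped tail of the previous value. The function name parse_events and the extra "Logged" key are hypothetical.

```python
import re

# Assumed line layout, inferred from the sample listing only:
#   <timestamp>:(<source>) <field name> : <value>
LINE_RE = re.compile(
    r"^(?P<stamp>.+?):\((?P<origin>\w+)\)\s+(?P<key>[^:]+?)\s*:\s*(?P<value>.*)$"
)


def parse_events(lines):
    """Group log lines into one dict per event, keyed by field name."""
    events = []
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match is None:
            # Unprefixed text: treat it as the wrapped tail of the previous value.
            if events and last_key and line.strip():
                events[-1][last_key] += " " + line.strip()
            continue
        key = match.group("key")
        value = match.group("value").strip()
        if key == "Event Name":
            # Each "Event Name" line is assumed to start a new event record.
            events.append({"Logged": match.group("stamp")})
        if not events:
            continue  # field seen before any "Event Name" line; skip it
        events[-1][key] = value
        last_key = key
    return events


# A few lines taken from the sample listing, including one wrapped value.
SAMPLE = [
    "Thu Apr 30 12:32:31 2015:(CLI) Event Name  : PD_MEDIA_ERROR",
    "Thu Apr 30 12:32:31 2015:(CLI) Event Description : A medium error was ",
    "detected by the device that was non-recoverable.",
    "Thu Apr 30 12:32:31 2015:(CLI) Nac Name : /SYS/HDD1",
    "Thu Apr 30 12:32:31 2015:(CLI) Serial Number : 001015N0JPXA   PMG0JPXA",
    "Thu Apr 30 12:32:31 2015:(CLI) Disk Model : H106030SDSUN300G",
]

for event in parse_events(SAMPLE):
    print(event["Event Name"], event.get("Nac Name"), event.get("Serial Number"))
```

To run the same function against the live log, pass it an open file, for example parse_events(open("/var/log/ssm/event.log")). If the real log layout differs from the sample listing, the regular expression needs adjusting.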

Note - For PD_OVER_TEMPERATURE, PD_SELF_TEST_FAILURE and PD_PREDICTIVE_FAILURE events, use Oracle ILOM to configure proactive alerts.

For the other disk diagnostic events described in this document, the administrator must check the hardware management event log when a disk problem is suspected. There is currently no mechanism that proactively alerts on these events.
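
Because these events are not announced, one possible workaround is to scan the log periodically for them. The following Python sketch is an illustration under stated assumptions, not an Oracle-provided tool. It assumes that each event is introduced by a line containing "Event Name" followed by a colon and the event name, as in the sample listing. The names UNALERTED_EVENTS and count_unalerted are hypothetical, and the second sample line is invented to show the filtering.

```python
from collections import Counter

# Event names from the table that the note does not list as covered
# by Oracle ILOM proactive alerts.
UNALERTED_EVENTS = {
    "PD_RECOVERED_ERROR",
    "PD_BAD_DEVICE_FAULT",
    "PD_MEDIA_ERROR",
    "PD_DEVICE_ERROR",
    "PD_TRANSPORT_ERROR",
}


def count_unalerted(lines):
    """Count occurrences of each unalerted event name in the given log lines."""
    counts = Counter()
    for line in lines:
        # Assumed layout: "...(CLI) Event Name  : PD_MEDIA_ERROR"
        _, marker, tail = line.partition("Event Name")
        if not marker:
            continue  # not an event header line
        name = tail.split(":", 1)[-1].strip()
        if name in UNALERTED_EVENTS:
            counts[name] += 1
    return counts


SAMPLE = [
    # Taken from the sample listing.
    "Thu Apr 30 12:32:31 2015:(CLI) Event Name  : PD_MEDIA_ERROR",
    "Thu Apr 30 12:32:31 2015:(CLI) Nac Name : /SYS/HDD1",
    # Hypothetical line: a SMART event that ILOM alerts can cover, so it is skipped.
    "Thu Apr 30 12:40:02 2015:(CLI) Event Name  : PD_OVER_TEMPERATURE",
]

for name, total in sorted(count_unalerted(SAMPLE).items()):
    print(f"{name}: {total}")
```

To check the live log, pass count_unalerted an open file for /var/log/ssm/event.log. The script only counts matching events; how the result is reported (for example, by mail or a scheduled job) is left to the administrator.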