Sun Server X4-2 Product Notes

Updated: May 2019
 
 

Diagnosing SAS Data Path Failures on Servers Using MegaRAID Disk Controllers

Important Operating Note

On Oracle x86 servers using MegaRAID disk controllers, Serial Attached SCSI (SAS) data path errors can occur. To triage and isolate a data path problem on the SAS disk controller, disk backplane (DBP), SAS cable, SAS expander, or hard disk drive (HDD), gather and review the events in the disk controller event log. Classify and analyze all failure events reported by the disk controller based on the server SAS topology.

To classify a MegaRAID disk controller event:

  • Gather and parse the MegaRAID disk controller event logs, either automatically by running the sundiag utility or manually by using the StorCLI command.

    • For Oracle Exadata Database Machine database or storage cell servers, run the sundiag utility.

    • For Sun Server X4-2, use the StorCLI command.

For example, manually gather and parse the controller event log by using the StorCLI command. At the root prompt, type:

root# ./storcli64 /c0 show events file=event.log
Controller=0
Status=Success

Note -  The file= argument sets the name of the file that receives the disk controller event log. This example writes the MegaRAID controller event log to a file named event.log.
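When the event log is collected more than once during triage, a fixed name such as event.log is overwritten each time. The following is a minimal sketch that runs the same StorCLI command but saves each collection under a timestamped name. The StorCLI path, the default output directory, the event-<timestamp>.log naming scheme, and the function name save_event_log are all hypothetical. The sketch also assumes that StorCLI writes the file= argument relative to its working directory.

```shell
# Minimal sketch: save the controller 0 event log under a timestamped name
# so that repeated collections do not overwrite each other.
# Hypothetical (not from the product notes): the StorCLI path, the default
# output directory, the event-<timestamp>.log naming scheme.
STORCLI="${STORCLI:-/opt/MegaRAID/storcli/storcli64}"

save_event_log() {
    outdir="${1:-/var/tmp/megaraid}"            # hypothetical default
    name="event-$(date +%Y%m%d-%H%M%S).log"
    mkdir -p "$outdir" || return 1
    # Assumption: StorCLI writes file= relative to its working directory,
    # so run it from inside the output directory (in a subshell).
    (cd "$outdir" && "$STORCLI" /c0 show events file="$name") >/dev/null || return 1
    # Fail if StorCLI did not produce a non-empty log file.
    [ -s "$outdir/$name" ] || return 1
    # Print the path of the saved log for the caller.
    echo "$outdir/$name"
}
```

For example, `log=$(save_event_log /var/tmp/megaraid)` stores the path of the new log in `log` for later review.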

To show drive and slot errors separately, at the root prompt, type:

root# /opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show errorcounters
Controller=0
Status=Success
Description=Show Drive/Cable Error Counters Succeeded.

Error Counters:

Drive      Error Counter for Drive Error  Error Counter for Slot
/c0/e8/s0  0                              0
/c0/e8/s1  0                              0
/c0/e8/s2  0                              0
/c0/e8/s3  0                              0
/c0/e8/s4  0                              0
/c0/e8/s5  0                              0
/c0/e8/s12 0                              0
/c0/e8/s13 0                              0

The drive and slot error counters are reported separately, so a nonzero count can be attributed either to the drive itself or to its slot.
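On a server with many drives, you can filter the error counter output so that only drives with a nonzero count are shown. The following is a minimal sketch that uses awk. It assumes that each drive row of the show errorcounters output begins with the drive path (for example, /c0/e8/s4), followed by the drive error count and then the slot error count, separated by whitespace. The function name nonzero_error_counters is hypothetical.

```shell
# Minimal sketch: print drives whose drive or slot error counter is nonzero.
# Assumption: each drive row is "<drive path> <drive errors> <slot errors>",
# e.g. "/c0/e8/s4  0  3"; header and status lines are ignored.
# Reads the file given as $1, or standard input if no argument is given.
nonzero_error_counters() {
    awk '$1 ~ /^\/c[0-9]+\/e[0-9]+\/s[0-9]+$/ && NF >= 3 {
        if ($2 + 0 > 0 || $3 + 0 > 0)
            printf "%s drive=%d slot=%d\n", $1, $2, $3
    }' "${1:--}"
}
```

For example, pipe the command shown above into the filter: `/opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show errorcounters | nonzero_error_counters`. If every counter is zero, as in the example output, the filter prints nothing.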

The following SCSI sense codes in the controller event log indicate a SAS data path fault. Each code has the form sense key/ASC/ASCQ (additional sense code and qualifier). Sense key B is ABORTED COMMAND:

B/4B/05 :SERIOUS: DATA OFFSET ERROR
B/4B/03 :SERIOUS: ACK/NAK TIMEOUT
B/47/01 :SERIOUS: DATA PHASE CRC ERROR DETECTED
B/4B/00 :SERIOUS: DATA PHASE ERROR

These errors are caused by a communication fault between the disk and the host bus adapter (HBA). Even a single occurrence on one disk indicates a data path issue. The RAID controller, SAS cables, SAS expander, or disk backplane might be interrupting communication on the path between the RAID controller and the disks.
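You can check a saved event log for the four data path sense codes listed above with grep. The following is a minimal sketch that assumes each relevant event line contains the code as text, such as "Sense: b/47/01". Matching is case-insensitive because a log might print the hexadecimal digits in lowercase. The function names count_data_path_errors and has_data_path_fault are hypothetical. The counts are numbers of matching lines, not numbers of occurrences.

```shell
# Minimal sketch: count event log lines that contain each SAS data path
# sense code (sense key/ASC/ASCQ) listed in this section.
# Assumption: the code appears as text in the line, e.g. "Sense: b/4b/05".
count_data_path_errors() {
    log="$1"
    for code in B/4B/05 B/4B/03 B/47/01 B/4B/00; do
        # grep -c prints 0 (and exits 1) when nothing matches.
        n=$(grep -ci "$code" "$log")
        printf '%s %s\n' "$code" "$n"
    done
}

# Succeeds (exit status 0) if any of the codes appears at least once,
# because even a single occurrence indicates a data path issue.
has_data_path_fault() {
    count_data_path_errors "$1" | awk '$2 > 0 { found = 1 } END { exit !found }'
}
```

For example, `has_data_path_fault event.log && echo "SAS data path fault suspected"` flags a log that contains any of the codes. The per-code counts from count_data_path_errors can then guide the review against the server SAS topology.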

Oracle Service personnel can find more information about the diagnosis and triage of hard disk and SAS data path failures on x86 servers at the My Oracle Support web site: https://support.oracle.com. Refer to the Knowledge Article Doc ID 2161195.1. If there are multiple, simultaneous disk problems on an Exadata server, Oracle Service personnel can refer to Knowledge Article Doc ID 1370640.1.