CHAPTER 7
Troubleshooting Your Array
This chapter covers the following troubleshooting topics:
To check front-panel and back-panel LEDs, see Chapter 5.
For more troubleshooting tips, refer to the Sun StorEdge 3320 SCSI Array Release Notes at:
http://www.sun.com/products-n-solutions/hardware/docs/Network_Storage_Solutions/Workgroup/
Monitoring conditions at different points within the array enables you to avoid problems before they occur. Cooling element, temperature, voltage, and power sensors are located at key points in the enclosure. The SCSI Accessed Fault-Tolerant Enclosure (SAF-TE) processor monitors the status of these sensors. Refer to the Sun StorEdge 3000 Family RAID Firmware User’s Guide for additional details.
The following table describes the location of the enclosure devices as viewed from the back of the Sun StorEdge 3320 SCSI array, in the orientation shown in FIGURE 7-1.
FIGURE 7-1 Sun StorEdge 3320 SCSI Array Enclosure Device Orientation
The enclosure sensor locations and alarm conditions are described in the following table.
Disk slot identifier refers to the backplane FRU to which disks are connected.
By default, all RAID arrays are preconfigured with one or two logical drives. For a logical drive to be visible to the host server, its partitions must be mapped to host LUNs. To make the mapped LUNs visible to a specific host, perform any additional steps that your operating system requires. For host-specific information about different operating systems, see:
If you attach a JBOD array directly to a host server and do not see the drives on the host server, check that the cabling is correct and that there is proper termination. For details, see the special cabling procedures in Appendix B.
For additional information about specific servers, see the operating system appendices in this document.
Controller failure symptoms are as follows:
A “SCSI Bus Reset Issued” alert message is displayed for each of the SCSI channels. A “Redundant Controller Failure Detected” alert message is also displayed. These messages are also written to the event log.
If one controller in the redundant controller configuration fails, the surviving controller takes over for the failed controller. The primary controller state will be held by the surviving controller regardless of the serial number until redundancy is restored.
The surviving controller disables and disconnects from its counterpart while gaining access to all the signal paths. It then manages the ensuing event notifications and takes over all processes. It remains the primary controller regardless of its original status, and any replacement controller afterward assumes the role of secondary controller.
The failover and failback processes are completely transparent to the host.
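The role changes described above can be sketched as a small state model. This is only an illustration of the documented behavior, not firmware code: the Controller class, the role strings, and the serial numbers are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    serial: int                # hypothetical serial number, for illustration only
    role: str = "secondary"    # "primary", "secondary", or "failed"


def fail_over(pair, failed_serial):
    """Mark one controller as failed; the survivor becomes primary,
    regardless of its serial number or original role."""
    for controller in pair:
        if controller.serial == failed_serial:
            controller.role = "failed"
        else:
            controller.role = "primary"
    return pair


def replace_failed(pair, new_serial):
    """Swap the failed controller for a replacement.
    The replacement always joins as the secondary controller."""
    return [
        Controller(new_serial, "secondary") if c.role == "failed" else c
        for c in pair
    ]
```

For example, if the original primary (serial 100) fails, the controller with serial 200 becomes primary and stays primary after a replacement with serial 300 is installed as secondary.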
Note - If the surviving controller is removed and the failed controller is left in the system, and the system is power-cycled, the failed controller can become primary and write stale data to disk.
Controllers are hot-swappable in a redundant configuration, and replacing a failed unit takes only a few minutes. Because the I/O connections are on the controllers, you might experience some unavailability between the time the failed controller is removed and the time a new one is installed in its place.
To maintain your redundant controller configuration, replace the failed controller as soon as possible. For details, refer to the Sun StorEdge 3000 Family FRU Installation Guide.
Note - When the controller cannot identify the drives, either because of disk channel errors or because the system was powered up in the wrong sequence, the drive state changes to USED and all logical drives enter a FATAL FAIL state. To recover, resolve the condition that caused the loss of access to the disk drives, and then power-cycle the system. The FATAL FAIL state remains following the power cycle and requires user intervention to clear. For details regarding the FATAL FAIL state, see Section 7.5, Recovering From Fatal Drive Failure.
With a RAID array, your data is protected by the RAID parity drive and by one or more global spares.
A Fatal Fail occurs when more drives fail than your RAID redundancy can accommodate. The redundancy of your RAID array depends on your configuration. In a RAID 3 or RAID 5 configuration, two or more drives must fail for a FATAL FAIL status. In a RAID 1 configuration, you can lose multiple drives without fatal failure if all the failed drives reside on one side of a mirrored pair.
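The redundancy rules above can be expressed as a short check. The following is a minimal sketch, not the firmware's logic: the function name, the drive identifiers, and the mirror_pairs parameter are hypothetical. It assumes that a RAID 1 logical drive is fatally failed only when both members of some mirrored pair have failed, and it treats RAID 0 as having no redundancy, which is a general property of RAID 0 rather than something stated in this chapter.

```python
def is_fatal_fail(raid_level, failed, mirror_pairs=None):
    """Return True if the failed drives exceed what the RAID level tolerates.

    raid_level   -- 0, 1, 3, or 5
    failed       -- iterable of failed drive IDs (hypothetical identifiers)
    mirror_pairs -- for RAID 1 only: list of (drive_a, drive_b) mirrored pairs
    """
    failed = set(failed)
    if raid_level == 0:
        # No redundancy: any single drive failure is fatal.
        return len(failed) >= 1
    if raid_level in (3, 5):
        # Single-parity levels tolerate one failed drive.
        return len(failed) >= 2
    if raid_level == 1:
        if mirror_pairs is None:
            raise ValueError("mirror_pairs is required for RAID 1")
        # Fatal only if both sides of the same mirrored pair are lost.
        return any(a in failed and b in failed for a, b in mirror_pairs)
    raise ValueError(f"unsupported RAID level in this sketch: {raid_level}")
```

With mirrored pairs (0, 1) and (2, 3), losing drives 0 and 2 is survivable under this model because each pair still has one working member, while losing drives 0 and 1 is a fatal failure.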
It might be possible to recover the RAID array from a Fatal Fail. However, it might be impossible to do a full data recovery, depending on the circumstances of the failure. Recovering from a Fatal Fail requires reusing the drives that report as failed. It is important to check your recovered data using the data application or host-based tools following a Fatal Fail recovery.
It is rare for two or more drives to fail at the same time. To minimize the chance of this happening, perform regular RAID integrity checks. For RAID 3 and RAID 5, you can do this with the array console’s “regenerate Parity” option or with the Sun StorEdge CLI check parity command. Refer to the Sun StorEdge 3000 Family RAID Firmware User’s Guide for details on the “regenerate Parity” option. Refer to the Sun StorEdge 3000 Family CLI User’s Guide for details on the check parity command.
If a multiple drive failure has occurred, it might be possible to recover by performing the following steps:
1. Discontinue all I/O activity immediately.
2. To cancel the beeping alarm, from the RAID firmware Main Menu, choose “system Functions → Mute beeper”.
See Section 6.4, Silencing Audible Alarms for more information about silencing audible alarms.
3. Physically check that all the drives are firmly seated in the array and that none have been partially or completely removed.
4. In the RAID firmware Main Menu, choose “view and edit Logical drives,” and look for:
Status: FATAL FAIL (two or more failed drives)
5. Select the logical drive, press Return, and choose “view scsi drives.”
If two physical drives fail, one drive has a BAD status and one drive has a MISSING status.
6. Unassign any global or local spare drives.
7. From the RAID firmware Main Menu, choose “system Functions → Reset controller” and choose Yes when prompted.
8. When the system comes back up, clear the FATAL FAIL state.
a. From the RAID firmware Main Menu, choose “view and edit Logical drives.”
b. Select the logical drive with the FATAL FAIL status and press Enter.
d. Choose Yes when the “Back to degraded?” prompt is displayed.
Note - The prompt is “Back to normal?” for RAID 0 configurations.
After clearing the FATAL FAIL, the status changes to DRV FAILED.
9. If the status is still FATAL FAIL, you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive.
Proceed with the following procedures:
a. Delete the logical drive.
Refer to the Sun StorEdge 3000 Family RAID Firmware User’s Guide for more information.
b. Create a new logical drive.
Refer to the Sun StorEdge 3000 Family RAID Firmware User’s Guide for more information.
10. If the logical drive has changed to “degraded,” run fsck(1M).
11. After fsck(1M) completes successfully, rebuild the logical drive.
Note - The logical drive can be rebuilt with a local or a global spare drive. If no local or global spare is assigned, the logical drive will be rebuilt with the remaining BAD drive.
a. If you unassigned any local or global spare drives in Step 6, reassign them now.
The rebuild will begin automatically.
12. If no spare drives are available, perform the following steps.
a. From the RAID firmware Main Menu, choose “view and edit Logical drives.”
b. Select the logical drive that has the status DRV FAILED.
c. Choose “Rebuild logical drive,” and then choose Yes to rebuild the logical drive.
The rebuilding progress is displayed on the screen. A notification message informs you when the process is complete.
Rebuilding the logical drive restores the RAID integrity to a self-consistent state. This does not guarantee that the data has not been corrupted. All possible application checks should be performed to ensure that the data is not corrupted before it is used for business or production purposes.
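The decision points in steps 8 through 12 can be summarized as follows. This is a rough sketch for orientation only, assuming the status strings shown in the firmware menus (FATAL FAIL and DRV FAILED); the function name and its parameters are hypothetical, and the procedure does not say what to do if fsck fails, so the sketch simply keeps returning step 10 in that case.

```python
def next_recovery_step(status, fsck_passed, spare_available):
    """Suggest the next step after the FATAL FAIL state has been cleared.

    status          -- logical drive status reported by the firmware
    fsck_passed     -- True once fsck(1M) has completed successfully
    spare_available -- True if a local or global spare can be (re)assigned
    """
    if status == "FATAL FAIL":
        # Step 9: the state could not be cleared; data might be lost.
        return "step 9: re-create the logical drive"
    if status == "DRV FAILED":
        if not fsck_passed:
            # Step 10: check the file system before rebuilding.
            return "step 10: run fsck"
        if spare_available:
            # Step 11: reassigning a spare starts the rebuild automatically.
            return "step 11: reassign spares"
        # Step 12: no spare, so start the rebuild from the menu.
        return "step 12: choose Rebuild logical drive"
    return "no recovery action defined in this sketch"
```

The ordering reflects the procedure: the file system check comes before any rebuild, and the manual rebuild is needed only when no spare drive is available.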
For additional troubleshooting tips, refer to the release notes for your array.
To silence audible alarms that are caused by component failures, use a paperclip to push the Reset button. See Section 6.4, Silencing Audible Alarms for more information about silencing audible alarms.
This section provides flowcharts to illustrate common troubleshooting methods.
The flowcharts included in this section are:
For the JBOD and expansion unit flowchart, see Section B.13, Troubleshooting Sun StorEdge 3320 SCSI JBOD Arrays.
For overview information about LEDs, see Chapter 5.
For information about replacing modules, refer to the Sun StorEdge 3000 Family FRU Installation Guide.
The following flowchart provides troubleshooting procedures for the power supply and fan module.
FIGURE 7-2 Power Supply or Fan Module Flowchart, 1 of 2
FIGURE 7-3 Power Supply or Fan Module Flowchart, 2 of 2
Before you perform the drive LED troubleshooting procedures, you might want to use the firmware application to identify a failed drive. For details, refer to the Sun StorEdge 3000 Family RAID Firmware User’s Guide.
For overview information about drive LEDs and how they work, see Section 5.2, Front-Panel LEDs.
Caution - To prevent any possible data loss, back up user data to another storage device prior to replacing a disk drive.
Caution - When you replace drives, make sure that all I/O is stopped.
The following flowchart provides troubleshooting procedures for drive LEDs.
FIGURE 7-4 Drive LEDs Flowchart, 1 of 3
FIGURE 7-5 Drive LEDs Flowchart, 2 of 3
FIGURE 7-6 Drive LEDs Flowchart, 3 of 3
For more information about checking and replacing drive modules, refer to the Sun StorEdge 3000 Family FRU Installation Guide.
The following flowchart provides troubleshooting procedures for the Sun StorEdge 3320 SCSI array front-panel LEDs.
FIGURE 7-7 Front-Panel LEDs Flowchart, 1 of 5
FIGURE 7-8 Front-Panel LEDs Flowchart, 2 of 5
FIGURE 7-9 Front-Panel LEDs Flowchart, 3 of 5
FIGURE 7-10 Front-Panel LEDs Flowchart, 4 of 5
FIGURE 7-11 Front-Panel LEDs Flowchart, 5 of 5
The following flowchart provides troubleshooting procedures for the I/O controller module.
FIGURE 7-12 I/O Controller Module Flowchart, 1 of 2
FIGURE 7-13 I/O Controller Module Flowchart, 2 of 2
Copyright © 2009 Sun Microsystems, Inc. All rights reserved.