C H A P T E R  7

Troubleshooting Your Array

This chapter covers sensor locations, RAID LUNs not visible to the host, JBOD disks not visible to the host, controller failover, recovering from fatal drive failure, using the Reset button, and troubleshooting flowcharts.

To check front-panel and back-panel LEDs, see Chapter 5.

For more troubleshooting tips, refer to the Sun StorEdge 3310 SCSI Array Release Notes.


7.1 Sensor Locations

Monitoring conditions at different points within the array enables you to avoid problems before they occur. Cooling element, temperature, voltage, and power sensors are located at key points in the enclosure. The SCSI Accessed Fault-Tolerant Enclosure (SAF-TE) processor monitors the status of these sensors. Refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide for additional details.

FIGURE 7-1 shows the location of the enclosure devices as viewed from the back of the Sun StorEdge 3310 SCSI array.

  FIGURE 7-1 Sun StorEdge 3310 SCSI Array Enclosure Device Orientation

Figure showing the location of fans and power supplies for the Sun StorEdge 3310 SCSI array.

The enclosure sensor locations and alarm conditions are described in the following table.

TABLE 7-1 Sensor Locations and Alarms

Sensor Type     Description                                           Alarm Condition
--------------  ----------------------------------------------------  ----------------------------------
Fan 0           Left side power supply fan                            < 900 RPM
Fan 1           Right side power supply fan                           < 900 RPM
PS 0            Left side power supply                                Voltage, temperature, or fan fault
PS 1            Right side power supply                               Voltage, temperature, or fan fault
Temp 0          Left drive temperature sensor                         < 32°F (0°C) or > 131°F (55°C)
Temp 1          Center drive temperature sensor                       < 32°F (0°C) or > 131°F (55°C)
Temp 2          Temperature sensor on left side power supply module   < 32°F (0°C) or > 140°F (60°C)
Temp 3          Temperature sensor on left side EMU module            < 32°F (0°C) or > 140°F (60°C)
Temp 4          Temperature sensor on right side EMU module           < 32°F (0°C) or > 140°F (60°C)
Temp 5          Right drive temperature sensor                        < 32°F (0°C) or > 131°F (55°C)
Temp 6          Temperature sensor on right side power supply module  < 32°F (0°C) or > 140°F (60°C)
Disk Slot 0-11  Disk slot identifier for the backplane FRU to which   Not applicable
                disks are connected
Temp CPU        Temperature sensor on RAID controller                 > 203°F (95°C)
Temp Board 1    Temperature sensor on RAID controller                 > 185°F (85°C)
Temp Board 2    Temperature sensor on RAID controller                 > 185°F (85°C)
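As a hedged illustration only (this sketch is not part of the SAF-TE firmware; the function name, dictionary layout, and use of Celsius readings are assumptions for the example), the alarm conditions in TABLE 7-1 can be expressed as a small lookup:

```python
# Illustrative sketch of the TABLE 7-1 alarm thresholds; not array software.
# Temperature readings are assumed to be in degrees Celsius.
SENSOR_LIMITS = {
    "Fan 0": ("rpm_min", 900),
    "Fan 1": ("rpm_min", 900),
    "Temp 0": ("temp_range", (0, 55)),   # drive sensors: 0-55 C band
    "Temp 1": ("temp_range", (0, 55)),
    "Temp 2": ("temp_range", (0, 60)),   # power supply / EMU: 0-60 C band
    "Temp 3": ("temp_range", (0, 60)),
    "Temp 4": ("temp_range", (0, 60)),
    "Temp 5": ("temp_range", (0, 55)),
    "Temp 6": ("temp_range", (0, 60)),
    "Temp CPU": ("temp_max", 95),        # RAID controller sensors
    "Temp Board 1": ("temp_max", 85),
    "Temp Board 2": ("temp_max", 85),
}

def in_alarm(sensor, reading):
    """Return True if the reading violates the sensor's alarm threshold."""
    kind, limit = SENSOR_LIMITS[sensor]
    if kind == "rpm_min":
        return reading < limit           # fan slower than 900 RPM
    if kind == "temp_max":
        return reading > limit           # controller sensor over its ceiling
    low, high = limit
    return reading < low or reading > high  # outside the allowed band
```

For example, a left power supply fan reading of 850 RPM would satisfy the "< 900 RPM" alarm condition, while a center drive temperature of 25°C would not trigger an alarm.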



7.2 RAID LUNs Not Visible to the Host



Note - Some versions of operating system software or utilities might not display all mapped LUNs if there is no partition or logical drive mapped to LUN 0. Map a partition or logical drive to LUN 0 if you are in doubt, or refer to your operating system documentation.



By default, all RAID arrays are preconfigured with one or two logical drives. For a logical drive to be visible to the host server, its partitions must be mapped to host LUNs. To make the mapped LUNs visible to a specific host, perform any steps required by your operating system. For host-specific information, see the operating system appendices in this document.


7.3 JBOD Disks Not Visible to the Host

If you attach a JBOD array directly to a host server and do not see the drives on the host server, check that the cabling is correct and that there is proper termination. For details, see the special cabling procedures in Appendix B.

For additional information about specific servers, see the operating system appendices in this document.


7.4 Controller Failover

Controller failure symptoms are as follows:

A "SCSI Bus Reset Issued" alert message is displayed for each of the SCSI channels. A "Redundant Controller Failure Detected" alert message is also displayed. These messages are also written to the event log.

If one controller in a redundant controller configuration fails, the surviving controller takes over for the failed controller and holds the primary controller role, regardless of serial number, until redundancy is restored.

The surviving controller disables and disconnects from its counterpart, gains access to all the signal paths, manages the ensuing event notifications, and takes over all processes. It remains the primary controller regardless of its original status, and any replacement controller installed afterward assumes the role of secondary controller.

The failover and failback processes are completely transparent to the host.



Note - If the surviving controller is removed and the failed controller is left in the system, and the system is power-cycled, the failed controller can become primary and write stale data to disk.





Note - If the system is powered down and the failed controller is replaced, and the replacement controller runs an earlier firmware release but has a higher serial number than the surviving controller, the system might hang during bootup.



Controllers are hot-swappable in a redundant configuration, and replacing a failed unit takes only a few minutes. Because the I/O connections are on the controllers, you might experience some unavailability between the time the failed controller is removed and the time a new one is installed in its place.

To maintain your redundant controller configuration, replace the failed controller as soon as possible. For details, refer to the Sun StorEdge 3000 Family FRU Installation Guide.



Note - When the controller cannot identify the drives, either because of disk channel errors or because of powering up in the wrong sequence, the drive state changes to USED and all logical drives enter a FATAL FAIL state. To recover, resolve the condition that caused the loss of access to the disk drives and then power-cycle the system. The FATAL FAIL state persists after the power cycle and requires user intervention to clear. For details, see Section 7.5, Recovering From Fatal Drive Failure.




7.5 Recovering From Fatal Drive Failure

In a RAID array, your data is protected by the RAID parity drive and by a global spare or spares.

A Fatal Fail occurs when more drives fail than your RAID redundancy can accommodate. The redundancy of your RAID array depends on your configuration. In a RAID 3 or RAID 5 configuration, two or more drives must fail to produce a FATAL FAIL status. In a RAID 1 configuration, you can lose multiple drives without a fatal failure, provided that no mirrored pair loses both of its members, that is, all the failed drives reside on one side of the mirrored pairs.
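The redundancy rules above can be sketched as a pair of checks (an illustrative sketch only; the function names are hypothetical and are not part of the array firmware):

```python
def raid5_fatal(failed_drives):
    """RAID 3 and RAID 5 tolerate a single failed drive;
    two or more failed drives is a Fatal Fail."""
    return failed_drives >= 2

def raid1_fatal(failed_members_per_pair):
    """A RAID 1 array survives as long as every mirrored pair keeps at
    least one healthy member; losing both members of any pair is fatal.
    failed_members_per_pair: failed-drive count for each mirrored pair."""
    return any(failed >= 2 for failed in failed_members_per_pair)
```

For example, a RAID 1 array can lose one drive from each of three different pairs without a fatal failure, because each pair still has a surviving member, while losing both drives of a single pair is fatal.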

It might be possible to recover the RAID array from a Fatal Fail. However, it might be impossible to do a full data recovery, depending on the circumstances of the failure. Recovering from a Fatal Fail requires reusing the drives that report as failed. It is important to check your recovered data using the data application or host-based tools following a Fatal Fail recovery.

It is rare for two or more drives to fail at the same time. To minimize the chance of this happening, perform regular RAID integrity checks. For RAID 3 and RAID 5, you can do this with the array console's "regenerate Parity" option (refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide for details) or with the Sun StorEdge CLI check parity command-line utility (refer to the Sun StorEdge 3000 Family CLI User's Guide for details).

If a multiple drive failure has occurred, it might be possible to recover by performing the following steps:

1. Discontinue all I/O activity immediately.

2. To cancel the beeping alarm, from the RAID firmware Main Menu, choose "system Functions → Mute beeper."

See Section 6.4, Silencing Audible Alarms for more information about silencing audible alarms.

3. Physically check that all the drives are firmly seated in the array and that none have been partially or completely removed.

4. In the RAID firmware Main Menu, choose "view and edit Logical drives," and look for:

Status: FATAL FAIL (two or more failed drives)

5. Select the logical drive, press Return, and choose "view scsi drives."

If two physical drives fail, one drive has a BAD status and one drive has a MISSING status.

6. Unassign any global or local spare drives.

7. Reset the controller.

From the RAID firmware Main Menu, choose "system Functions → Reset controller" and choose Yes when prompted.

8. When the system comes back up, clear the FATAL FAIL state.

a. From the RAID firmware Main Menu, choose "view and edit Logical drives."

b. Select the logical drive with the FATAL FAIL status and press Return.

c. Select "Clear state."

d. Choose Yes when the "Back to degraded?" prompt is displayed.



Note - The prompt is "Back to normal?" for RAID 0 configurations.



After clearing the FATAL FAIL, the status changes to DRV FAILED.

9. If the status is still FATAL FAIL, you might have lost all data on the logical drive, and it might be necessary to re-create the logical drive.

Proceed with the following procedures:

a. Delete the logical drive.

Refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide for more information.

b. Create a new logical drive.

Refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide for more information.

10. If the logical drive has changed to "degraded," run fsck(1M).

11. After fsck(1M) completes successfully, rebuild the logical drive.



Note - The logical drive can be rebuilt with a local or a global spare drive. If no local or global spare is assigned, the logical drive will be rebuilt with the remaining BAD drive.



a. If you unassigned any local or global spare drives in Step 6, reassign them now.

The rebuild will begin automatically.

12. If no spare drives are available, perform the following steps.

a. From the RAID firmware Main Menu, choose "view and edit Logical drives."

b. Select the logical drive that has the status DRV FAILED.

c. Choose "Rebuild logical drive," and then choose Yes to rebuild the logical drive.

The rebuilding progress is displayed on the screen. A notification message informs you when the process is complete.



Note - As physical drives fail and are replaced, the rebuild process regenerates the data and parity information that was on the failed drive. However, the NVRAM configuration file that was present on the drive is not re-created. For details on restoring the NVRAM configuration file to the drive, refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide.



Rebuilding the logical drive restores the RAID integrity to a self-consistent state. This does not guarantee that the data has not been corrupted. All possible application checks should be performed to ensure that the data is not corrupted before it is used for business or production purposes.

For additional troubleshooting tips, refer to the release notes for your array.


7.6 Using the Reset Button

To test that the LEDs work, use a paperclip to press and hold the Reset button for five seconds. All the LEDs should change from green to amber during this test; any LED that fails to light is faulty. When you release the Reset button, the LEDs return to their initial state. See Chapter 5 for more information.

To silence audible alarms that are caused by component failures, use a paperclip to push the Reset button. See Section 6.4, Silencing Audible Alarms for more information about silencing audible alarms.


7.7 Troubleshooting Flowcharts

This section provides flowcharts that illustrate common troubleshooting methods for the power supply and fan module, drive LEDs, front-panel LEDs, and the I/O controller module.

For the JBOD and expansion unit flowchart, see Section B.14, Troubleshooting Sun StorEdge 3310 SCSI JBOD Arrays.

For overview information about LEDs, see Chapter 5.

For information about replacing modules, refer to the Sun StorEdge 3000 Family FRU Installation Guide.



caution icon

Caution - Whenever you are troubleshooting and replacing components, there is an increased possibility of data loss. To prevent any possible data loss, back up user data to another storage device prior to troubleshooting your array.



7.7.1 Power Supply and Fan Module

The following flowchart provides troubleshooting procedures for the power supply and fan module.

  FIGURE 7-2 Power Supply or Fan Module Flowchart, 1 of 2

Flowchart diagram for diagnosing power supply and fan problems.

  FIGURE 7-3 Power Supply or Fan Module Flowchart, 2 of 2

Flowchart diagram for diagnosing power supply and fan problems (continued).

7.7.2 Drive LEDs

Before you perform the drive LED troubleshooting procedures, you might want to use the firmware application to identify a failed drive. For details, refer to the Sun StorEdge 3000 Family RAID Firmware User's Guide.

For overview information about drive LEDs and how they work, see Section 5.2, Front-Panel LEDs.



caution icon

Caution - When you rotate or replace drives, make sure that all I/O is stopped. To prevent any possible data loss, back up user data to another storage device prior to replacing a disk drive.



The following flowchart provides troubleshooting procedures for drive LEDs.

  FIGURE 7-4 Drive LEDs Flowchart, 1 of 3

Flowchart diagram for diagnosing drive LED problems.

  FIGURE 7-5 Drive LEDs Flowchart, 2 of 3

Flowchart diagram for diagnosing drive LED problems (continued).

  FIGURE 7-6 Drive LEDs Flowchart, 3 of 3

Flowchart diagram for diagnosing drive LED problems (continued).

For more information about checking and replacing drive modules, refer to the Sun StorEdge 3000 Family FRU Installation Guide.

7.7.3 Front-Panel LEDs

The following flowchart provides troubleshooting procedures for the Sun StorEdge 3310 SCSI array front-panel LEDs.



Note - The LED ribbon cable referred to in this flowchart is the white cable that connects the front-panel LEDs to the midplane. It is located on the right front-panel ear and is directly attached to the LEDs.



  FIGURE 7-7 Front-Panel LEDs Flowchart, 1 of 5

Flowchart diagram for diagnosing SCSI array front-panel LED problems.

  FIGURE 7-8 Front-Panel LEDs Flowchart, 2 of 5

Flowchart diagram for diagnosing SCSI array front-panel LED problems (continued).

  FIGURE 7-9 Front-Panel LEDs Flowchart, 3 of 5

Flowchart diagram for diagnosing SCSI array front-panel LED problems (continued).

  FIGURE 7-10 Front-Panel LEDs Flowchart, 4 of 5

Flowchart diagram for diagnosing SCSI array front-panel LED problems (continued).

  FIGURE 7-11 Front-Panel LEDs Flowchart, 5 of 5

Flowchart diagram for diagnosing SCSI array front-panel LED problems (continued).

7.7.4 I/O Controller Module

The following flowchart provides troubleshooting procedures for the I/O controller module.

  FIGURE 7-12 I/O Controller Module Flowchart, 1 of 2

Flowchart diagram for diagnosing I/O controller module problems.

  FIGURE 7-13 I/O Controller Module Flowchart, 2 of 2

Flowchart diagram for diagnosing I/O controller module problems (continued).