CHAPTER 7

Maintaining and Troubleshooting Your Array

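This chapter describes maintenance procedures, along with the troubleshooting procedures and error messages you can use to isolate configuration and hardware problems. This chapter covers the following topics:

Section 7.1, Sensor Locations

Section 7.2, Upgrading Firmware

Section 7.3, Failed Component Alarms

Section 7.4, Silencing Audible Alarms

Section 7.5, General Troubleshooting Guidelines

Section 7.6, Troubleshooting Solaris Operating System Configuration Issues

Section 7.7, JBOD Disks Not Visible to the Host

Section 7.8, Identifying a Failed Drive for Replacement

Section 7.9, JBOD Troubleshooting Decision Trees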

To check front-panel and back-panel LEDs, see Chapter 6.

For more troubleshooting tips, refer to the Sun StorEdge 3120 SCSI Release Notes at:

http://www.sun.com/products-n-solutions/hardware/docs/Network_Storage_Solutions/Workgroup/


7.1 Sensor Locations

Monitoring conditions at different points within the array enables you to avoid problems before they occur. Cooling element, temperature, voltage, and power sensors are located at key points in the enclosure. The SCSI Accessed Fault-Tolerant Enclosure (SAF-TE) processor monitors the status of these sensors.

The locations of the enclosure devices are described as viewed from the back of the Sun StorEdge 3120 SCSI array, as shown in FIGURE 7-1.

  FIGURE 7-1 Sun StorEdge 3120 SCSI Array Enclosure Device Orientation

Figure showing the location of fans and power supplies for the Sun StorEdge 3120 SCSI array.

The enclosure sensor locations and alarm conditions are described in the following table.

TABLE 7-1 Sensor Locations and Alarms

Sensor Type | Description | Alarm Condition
Fan 0 | Left side power supply fan | < 900 RPM
Fan 1 | Right side power supply fan | < 900 RPM
PS 0 | Left side power supply | Voltage, temperature, or fan fault
PS 1 | Right side power supply | Voltage, temperature, or fan fault
Temp 0 | Left drive temperature sensor | < 32°F (0°C) or > 131°F (55°C)
Temp 1 | Center drive temperature sensor | < 32°F (0°C) or > 131°F (55°C)
Temp 2 | Temperature sensor on left side power supply module | < 32°F (0°C) or > 140°F (60°C)
Temp 3 | Temperature sensor on left side I/O module | < 32°F (0°C) or > 131°F (55°C)
Temp 4 | Temperature sensor on right side I/O module | < 32°F (0°C) or > 131°F (55°C)
Temp 5 | Right drive temperature sensor | < 32°F (0°C) or > 131°F (55°C)
Temp 6 | Temperature sensor on right side power supply module | < 32°F (0°C) or > 140°F (60°C)
Disk Slot 0-3 | Disk slot identifier refers to the backplane FRU to which disks are connected | Not applicable
CPU | Directly reported from the CPU | > 203°F (95°C)
Board 1 | Located on R228 solder side under the 129 ASIC | > 185°F (85°C)
Board 2 | Located on R298 solder side under the U38 1010 chip | > 185°F (85°C)

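The alarm limits in TABLE 7-1 can be expressed as a small check, which may be useful if you script against readings you have collected yourself. The following is a minimal sketch only: the function names and the lowercase sensor names (temp0, cpu, board1, and so on) are hypothetical, and how you obtain a reading in degrees Celsius or RPM is outside the sketch.

```sh
# Classify a whole-number Celsius reading against the limits in TABLE 7-1.
# Usage: check_temp SENSOR CELSIUS
# SENSOR names are illustrative: temp0..temp6, cpu, board1, board2.
check_temp() {
    sensor=$1
    celsius=$2
    low=""    # empty means the table lists no lower limit
    case $sensor in
        temp2|temp6)                   low=0; high=60 ;;  # power supply modules
        temp0|temp1|temp3|temp4|temp5) low=0; high=55 ;;  # drive and I/O module sensors
        cpu)                           high=95 ;;
        board1|board2)                 high=85 ;;
        *) echo "unknown sensor: $sensor"; return 2 ;;
    esac
    if [ -n "$low" ] && [ "$celsius" -lt "$low" ]; then
        echo "ALARM: $sensor below ${low}C"
        return 1
    fi
    if [ "$celsius" -gt "$high" ]; then
        echo "ALARM: $sensor above ${high}C"
        return 1
    fi
    echo "OK: $sensor"
}

# Classify a fan speed against the 900 RPM limit in TABLE 7-1.
# Usage: check_fan FAN RPM
check_fan() {
    if [ "$2" -lt 900 ]; then
        echo "ALARM: $1 below 900 RPM"
        return 1
    fi
    echo "OK: $1"
}
```

For example, check_temp temp0 56 reports an alarm, while check_temp temp2 56 does not, because the power supply module sensors have the higher 60°C limit.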


7.2 Upgrading Firmware

Firmware upgrades are made available as patches that you can download from the Sun web site, located at:

http://sunsolve.sun.com

Each patch applies to a particular type of firmware, including:

Each patch includes an associated README text file that provides detailed instructions about how to download and install that patch. Firmware downloads follow the same general steps:


7.3 Failed Component Alarms

Failed component alarm tones use Morse code dot and dash characters. The dot "." is a short tone sounding for one unit of time. The dash "-" is a long tone sounding for three units of time.

Alarms, also referred to as beep codes, are presented in a sequence, starting with the critical component failure alarm, which alerts you to a component problem or failure or a firmware mismatch. This alarm is then followed by alarms for whichever components or assemblies have failed. Once the beep code sequence is complete, it repeats. To understand the beep codes, listen to the sequence of codes until you can break down the sequence into its separate alarms. You can also check your software or firmware for alarms, error messages, or logs to isolate and understand the cause.

For example, in the case of a fan failure in a power supply, you might first hear the critical component failure alarm, followed by a power supply failure alarm from power supply 0 or power supply 1, followed by a fan failure event alarm, followed by an event alarm. This sequence will continue to repeat.

TABLE 7-2 Failed Component Alarm Codes

Failure | Morse Code Letter | Morse Code Sound Pattern
Critical component failure or mismatch | 8 dashes | --------
Power supply 0 failure | P0 | .--. -----
Power supply 1 failure | P1 | .--. .----
Event alarm | E | .
Fan failure | F | ..-.
Voltage failure | V | ...-
Temperature failure | T | -
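If you write down the tones as you hear them, a lookup such as the following can translate each alarm in TABLE 7-2 back to its meaning. This is a sketch for illustration only; the function name decode_alarm is hypothetical, and it assumes you have already split the repeating sequence into its separate alarms.

```sh
# Map one alarm from TABLE 7-2 to its meaning.
# Input: the Morse letter code (P0, P1, E, F, V, T) or the sound pattern
# written with '.' for a short tone and '-' for a long tone.
decode_alarm() {
    # Remove spaces so that ". . - ." and "..-." are treated alike.
    code=$(printf '%s' "$1" | tr -d ' ')
    case $code in
        --------)      echo "Critical component failure or mismatch" ;;
        P0|.--.-----)  echo "Power supply 0 failure" ;;
        P1|.--..----)  echo "Power supply 1 failure" ;;
        E|.)           echo "Event alarm" ;;
        F|..-.)        echo "Fan failure" ;;
        V|...-)        echo "Voltage failure" ;;
        T|-)           echo "Temperature failure" ;;
        *)             echo "Unknown alarm code: $1"; return 1 ;;
    esac
}
```

Decoding the four alarms from the fan failure example in turn (eight dashes, P0, F, E) gives the critical component failure, the power supply 0 failure, the fan failure, and the event alarm.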



7.4 Silencing Audible Alarms

An audible alarm indicates that an environmental component in the array has failed. These error conditions and events are reported by event messages and event logs. Component failures are also indicated by LED activity on the array.

To silence the alarm:

1. Use a paperclip to push the Reset button on the right ear of the array.

For details about where the Reset button is located, see Section 6.2, Front-Panel LEDs.

2. Check the front-panel and back-panel LEDs to determine the cause of the alarm.

For more information, see Chapter 6.

3. In Sun StorEdge Configuration Service, check the event log to determine the cause of the alarm.

Component event messages include but are not limited to the following terms:

For details about using Sun StorEdge Configuration Service to determine the cause of an alarm, see Section 5.2.2, Viewing Component and Alarm Characteristics.



Caution - Be particularly careful to observe and rectify a temperature failure alarm. If you detect this alarm, shut down the JBOD, and also shut down the server if it is actively performing I/O operations to the affected array. Otherwise, system damage and data loss can occur.




7.5 General Troubleshooting Guidelines

When a problem is not otherwise reproducible, suspect hardware may need to be replaced. Always make only one change at a time and carefully monitor results. When possible, it is best to restore the original hardware before replacing another part to eliminate the introduction of additional unknown problem sources.

After hardware replacement, a problem can usually be considered solved if it does not resurface during a period equal to twice the original average interval between occurrences. For example, if a problem was occurring once a week on average before a potential fix was made, running two weeks without seeing the problem again suggests that the fix was successful.
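The rule of thumb above is simple arithmetic. As a sketch, with a hypothetical function name, the observation period in days is twice the average number of days between occurrences:

```sh
# Print how many days to observe after a fix: twice the average
# number of days between occurrences of the problem.
# Usage: observation_days AVERAGE_DAYS_BETWEEN_OCCURRENCES
observation_days() {
    echo $(( $1 * 2 ))
}
```

For the weekly problem in the example, observation_days 7 gives the two-week period.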

Troubleshooting hardware problems is usually accomplished by a FRU isolation sequence that uses the process of elimination. Set up a minimal configuration that shows the problem and then replace elements in this order, testing after each replacement until the problem is solved:

Often you can also find out what causes a hardware problem by determining the elements that do not cause it. Start out by testing the smallest configuration that does work, and then keep adding components until a failure is detected.

To view error messages reported by JBODs, use any of the following:

For more information about replacing the chassis, see Section 8.7, Installing a JBOD Chassis FRU.



Caution - Back up the chassis data onto another storage device prior to replacing a disk drive to prevent any possible data loss.



Before you begin troubleshooting JBODs, check the cables that connect the host to the JBOD. Look for bent pins, loose wires, loose cable shields, loose cable casings, and any cables bent 90 degrees or more. If you find any of these problems, replace the cable.

The FIGURE 7-2 flowchart provides troubleshooting procedures specifically for JBODs.

7.5.1 Writing Events to a Log File for an IBM AIX Host

For an IBM AIX operating system, events are not logged by default. You might need to change /etc/syslog.conf to enable it to write to a log file.

1. Modify /etc/syslog.conf to add the following line:

*.info /tmp/syslog rotate size 1000k

2. Make sure the file that is specified in the added line exists.

If it does not exist, you must create it. For example, in the above configuration, you would create a file named /tmp/syslog.

3. Restart the syslog daemon so that it rereads /etc/syslog.conf by typing:

kill -HUP `cat /etc/syslog.pid`
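Steps 1 and 2 above can be scripted so that they are safe to run more than once. The following is a minimal sketch, not a supported procedure; the function name enable_syslog_file is hypothetical, and the two file names are passed as arguments so that you can try the function on scratch copies first.

```sh
# Add the logging rule from step 1 to a syslog configuration file unless
# it is already present, and create the log file from step 2 if needed.
# Usage: enable_syslog_file CONF_FILE LOG_FILE
enable_syslog_file() {
    conf=$1
    log=$2
    rule="*.info $log rotate size 1000k"
    # -F treats the rule as a fixed string; -x requires a whole-line match.
    if ! grep -Fxq -- "$rule" "$conf" 2>/dev/null; then
        printf '%s\n' "$rule" >> "$conf"
    fi
    # The log file must exist before the syslog daemon is restarted.
    [ -e "$log" ] || : > "$log"
}
```

On the AIX host, this would be called as superuser with /etc/syslog.conf and /tmp/syslog, followed by the kill -HUP command from step 3.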


7.6 Troubleshooting Solaris Operating System Configuration Issues

Follow this sequence of general steps to isolate software and configuration issues.



Note - Look for storage-related messages in /var/adm/messages and identify any suspect Sun StorEdge 3120 SCSI arrays.



1. Check the Sun StorEdge Configuration Service Console for alerts or messages.

2. Check the LEDs.

For more information, see Chapter 6.

3. In Sun StorEdge CLI, run the show enclosure-status command.

For more information, see Section 5.4, Monitoring with the Sun StorEdge CLI.

4. Check the revisions of software packages, patches, and hardware.

5. Verify the correct device file paths.

6. Check any related software, configuration, or startup files for recent changes.

7. Search SunSolve Online for any known related bugs and problems at:
http://sunsolve.Sun.COM.
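To look for storage-related messages as described in the note at the start of this section, you can filter /var/adm/messages. The sketch below is illustrative only: the function name storage_messages is hypothetical, and the search terms (scsi, disk, StorEdge) are assumptions about what a relevant message might contain, not a list taken from the array documentation.

```sh
# Print lines from a messages file that mention SCSI or disk activity.
# The search terms are assumptions; adjust them for your configuration.
# Usage: storage_messages [FILE]    (FILE defaults to /var/adm/messages)
storage_messages() {
    grep -i -E 'scsi|disk|StorEdge' "${1:-/var/adm/messages}"
}
```

Passing a file name makes it possible to run the same filter over a saved or rotated copy of the log.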


7.7 JBOD Disks Not Visible to the Host

If you attach a JBOD array directly to a host server and do not see the drives on the host server, check that the cabling is correct and that there is proper termination. See the special cabling procedures in Section 4.6, Connecting Sun StorEdge 3120 SCSI Arrays to Hosts.

7.7.1 Making JBODs Visible to Hosts Running the Solaris Operating System

If the JBOD cabling is correct and the drives are still not visible, run the devfsadm utility to rescan the drives. The new disks are listed when you run the format command.

If the drives are still not visible, reboot the host(s) with the reboot -- -r command so that the drives are visible to the host.

7.7.2 Making JBODs Visible to Hosts Running the Windows 2000 and Windows 2003 Operating Systems

Before beginning this procedure, make sure that you are using a supported SCSI host bus adapter (HBA) such as an Adaptec 39160. Refer to the Release Notes for your array for current information about which HBAs are supported.

Also make sure that you are using a supported driver for your HBA. For the Adaptec 39160, use FMS V4.0a or later.

1. Boot your system and verify that the host bus adapter (HBA) basic input/output system (BIOS) recognizes the new SCSI device.



Note - While your system is starting up, you should see the new SCSI device.



2. If a Found New Hardware Wizard is displayed, click Cancel.

You are now ready to format your new device.

3. Open the Disk Management folder.

a. Right-click the My Computer icon and choose Manage.

b. Select the Disk Management folder.

c. If a Write Signature and Upgrade Disk Wizard is displayed, click Cancel.

A "Connecting to Logical Disk Manager Server" status message is displayed.

4. Select the new device when it is displayed.

 Screen capture showing the Computer Management window.

5. Right-click in the Unallocated partition of the device and choose Create Partition.

A Create Partition Wizard is displayed.

 Screen capture showing the Create Partition Wizard.

6. Click Next.

7. Choose Primary partition and click Next.

8. Specify the amount of disk space to use or accept the default value, and click Next.

 Screen capture showing the Amount of disk space value.

9. Assign a drive letter and click Next.

10. Choose Format this partition with the following settings.

a. Specify NTFS as the File system to use.

b. Make sure the Perform a Quick Format checkbox is checked.

 Screen capture showing the Perform a Quick Format checkbox.

c. Click Next.

A confirmation dialog box displays the settings you have specified.

 Screen capture showing the settings you have specified in the wizard.

11. Click Finish.

The new partition is formatted and the formatted partition is identified as NTFS in the Computer Management window.

 Screen capture showing the Computer Management window.

12. Repeat these steps for any other new partitions and devices you want to format.

7.7.3 Making JBODs Visible to Hosts Running the Linux Operating System

When booting the server, watch for the host bus adapter (HBA) card BIOS message line to be displayed onscreen, and then press the key sequence that opens the HBA BIOS. For Adaptec SCSI cards, the key sequence is <Ctrl><A>.

The key strokes are listed onscreen when the adapter is initializing. After you enter the Adaptec HBA BIOS with <Ctrl><A>, perform the following steps.

1. Highlight Configure/View Host Adapter Settings and press Return.

2. Go to Advanced Configuration Options and press Return.

3. Go to Host Adapter BIOS and press Return.

a. Select disabled:scan bus if this is not going to be a bootable device.

b. If it is going to be a bootable device, select the default Enabled. The * represents the default setting.

4. Press Esc until you return to the main options screen where Configure/View Host Adapter Settings was located.

5. Select SCSI Disk Utilities and press Return.

The BIOS will now scan the SCSI card for any SCSI devices attached to the HBA. You will see the HBA's SCSI ID as well as any other SCSI devices attached to the HBA. If you only see the HBA's SCSI ID, then something is not correct with the configuration on the SCSI attached device, or the cable between the HBA and the SCSI device is bad or not attached.

6. If you are satisfied with the configuration, press Esc until a screen opens and displays Exit Utility?. Select Yes and press Return. A screen opens stating Please press any key to reboot. Press a key to reboot the server.

7. Repeat the same steps for every HBA that you want to attach to the Sun StorEdge 3120 JBOD array.

7.7.4 Making JBODs Visible to Hosts Running the HP-UX Operating System

The following steps describe how to discover drives on systems running the HP-UX operating system.

1. Run the command:

# ioscan -fnC disk

2. If the drive is still not seen, the host might need to be rebooted. Run the commands:

# sync;sync;sync
# reboot

7.7.5 Making JBODs Visible to Hosts Running the IBM AIX Operating System

The following steps describe how to discover drives on systems running the IBM AIX operating system.



Note - You must have superuser privileges to run these commands.



1. Create the logical drive and map its LUN to the correct host channel.

2. Run the command:

# cfgmgr

3. Run the command:

# lspv

Output similar to the following is displayed.

hdisk0 000df50dd520b2e rootvg
hdisk1 000df50d928c3c98 None
hdisk1 000df50d928c3c98 None

4. If any of the drives show "none," you must assign a Physical Volume IDENTIFIER.

5. Run the command:

# smitty

a. Select Devices.

b. Select Fixed Disk.

c. Select Change/Show Characteristics of a Disk.

d. Select the disk without a pvid.

e. Select ASSIGN physical volume identifier, press Tab once to display Yes for the value, and press Return.

f. Press Return again to confirm and repeat steps a-f as necessary.

6. From the smitty main menu, select System Storage Management (Physical & Logical Storage) → Logical Volume Manager → Volume Groups → Add a Volume Group.

7. Specify a name for the volume group, make sure the partitions for the journaled file system are large enough, and select the Physical Volume Name(s).

8. From the smitty main menu, select System Storage Management (Physical & Logical Storage) → File Systems → Add / Change / Show / Delete File Systems → (Enhanced) Journaled File System.

9. Select the volume group and set the field.

Run the command:

# umount mount point
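The check in steps 3 and 4 above can be automated by filtering the lspv output. The following sketch assumes that lspv prints the disk name in the first column and the physical volume identifier in the second, and that a disk with no identifier shows "none" there; the function name disks_without_pvid is hypothetical.

```sh
# Read lspv-style output on standard input and print the names of disks
# whose second column (assumed to be the PVID) is "none", ignoring case.
disks_without_pvid() {
    awk 'tolower($2) == "none" { print $1 }'
}
```

On the AIX host, this might be used as lspv | disks_without_pvid to list the disks that still need an identifier assigned through smitty.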


7.8 Identifying a Failed Drive for Replacement

You can identify a failed drive by checking:



Caution - You can mix drive capacities in the same chassis, but do not mix spindle speeds (RPM) on the same SCSI bus. For instance, you can use 36-Gbyte and 73-Gbyte drives with no performance problems if both are 10K RPM drives. Violating this configuration guideline leads to poor performance.



7.8.1 Verifying Operating System Device Information

To identify failed disks, you can review the operating system device information to verify drive status.



Note - While your system is starting, you will see the new SCSI device.




7.9 JBOD Troubleshooting Decision Trees

  FIGURE 7-2 Troubleshooting Decision Tree for JBODs, Figure 1 of 2

Diagram showing troubleshooting steps for the Sun StorEdge 3120 JBOD.

  FIGURE 7-3 Troubleshooting Decision Tree for JBODs, Figure 2 of 2

Diagram showing troubleshooting steps for the Sun StorEdge 3120 JBOD.