CHAPTER 14

Maintaining Your Array

Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see hardware-related maintenance and troubleshooting information.

This chapter covers the following firmware-oriented maintenance and troubleshooting topics:

Battery Operation
Checking Status Windows
Upgrading Firmware
Troubleshooting Your Array
Additional Troubleshooting Information
Battery Operation

The battery LED (on the far right side of the I/O controller module) is amber if the battery is bad or missing. The LED blinks green if the battery is charging and is solid green when the battery is fully charged.

Battery Status

Battery status is displayed at the top of the initial firmware screen. The BAT: status displays a value in the range from BAD to ----- (charging) to +++++ (fully charged).

For maximum life, lithium ion batteries are not recharged until the charge level is very low, indicated by a status of -----. Automatic recharging at this point takes very little time.

A battery module whose status shows one or more + signs can support cache memory for 72 hours. As long as one or more + signs are displayed, your battery is performing correctly.

TABLE 14-1 Battery Status Indicators

Battery Display   Description
-----             Discharged; the battery is automatically recharged when it reaches this state.
+----             Adequately charged to maintain cache memory for 72 hours or more in case of power loss. Automatic recharging occurs when the battery status drops below this level.
++---             90% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
+++--             92% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
++++-             95% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
+++++             Over 97% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
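
If you monitor the array from a host-side script (for example, by parsing captured firmware screen or CLI output), the charge indicator can be reduced to the table above. The following Python sketch is a minimal, hypothetical decoder; the function name and the idea of reading the BAT: token from captured text are illustrative assumptions, not part of the firmware.

# Hypothetical helper: decode the BAT: charge indicator shown on the
# initial firmware screen into a human-readable description.
BATTERY_STATES = {
    "BAD":   "Battery failed or missing (check the amber battery LED)",
    "-----": "Discharged; automatic recharge starts at this level",
    "+----": "Charged enough to hold cache memory for 72 hours or more",
    "++---": "About 90% charged",
    "+++--": "About 92% charged",
    "++++-": "About 95% charged",
    "+++++": "Over 97% charged (fully charged)",
}

def describe_battery(token: str) -> str:
    """Return a description for a BAT: status token such as '+++--'."""
    return BATTERY_STATES.get(token.strip(), "Unknown battery status: " + token)

if __name__ == "__main__":
    print(describe_battery("+++--"))   # About 92% charged

As long as the token contains at least one + sign, the battery can support cache memory for 72 hours, as described above.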


Your lithium ion battery should be changed every two years if the unit is continuously operated at 77 degrees F (25 degrees C). If the unit is continuously operated at 95 degrees F (35 degrees C) or higher, the battery should be changed every year. The shelf life of your battery is three years.



Note - The RAID controller has a temperature sensor which shuts off battery charging when the temperature reaches 129 degrees F (54 degrees C). When this happens, the battery status might be reported as BAD, but no alarm is written to the event log because no actual battery failure has occurred. This behavior is normal. As soon as the temperature returns to the normal range, battery charging resumes and the battery status is reported correctly. It is not necessary to replace or otherwise interfere with the battery in this situation.



Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see your array's acceptable operating and nonoperating temperature ranges.

For information about the date of manufacture and how to replace the battery module, refer to the Sun StorEdge 3000 Family FRU Installation Guide.

Battery Support for Cache Operations

Unfinished writes are cached in memory in write-back mode. If power to the array is interrupted, data stored in the cache memory is not lost. Battery modules can support cache memory for several days.

Write cache is not automatically disabled when the battery is offline due to battery failure or a disconnected battery, but you can set an event trigger to make this happen. See Configuring the Battery Backup (BBU) Low Event or BBU Failed Event Trigger for more information.


Checking Status Windows

The status windows used to monitor and manage the array are described in the following sections:

Logical Drive Status Table
Physical Drive Status Table
Channel Status Table

Logical Drive Status Table

To check and configure logical drives, from the Main Menu choose "view and edit Logical drives" and press Return.

 Screen capture shows the Main Menu with "view and edit Logical drives" selected.

The status of all logical drives is displayed.

 Screen capture shows the status of all logical drives for this array.

TABLE 14-2 shows definitions and values for logical drive parameters.

TABLE 14-2 Parameters Displayed in the Logical Drive Status Window

Parameter   Description
LG          Logical drive number.
            P0: Logical drive 0 of the primary controller, where P = primary controller and 0 = logical drive number.
            S1: Logical drive 1 of the secondary controller, where S = secondary controller and 1 = logical drive number.
ID          Logical drive ID number (controller-generated).
LV          The logical volume to which this logical drive belongs. NA indicates no logical volume.
RAID        Assigned RAID level.
SIZE (MB)   Capacity of the logical drive.
Status1     Logical drive status (column 1):
            COPYING     The logical drive is in the process of copying from another drive.
            CREATING    The logical drive is being created.
            GOOD        The logical drive is in good condition.
            DRV FAILED  A drive member has failed in the logical drive.
            FATAL FAIL  More than one drive member in the logical drive has failed.
            INCOMPLETE  Two or more member drives in this logical drive have failed.
            SHUT-DOWN   The controller has been shut down using the Shutdown command. Restart the controller to restore the logical drive to GOOD status.
            REBUILDING  The logical drive is being rebuilt.
Status2     Logical drive status (column 2):
            I   The logical drive is initializing.
            A   A physical drive is being added to the logical drive.
            E   The logical drive is expanding.
Status3     Logical drive status (column 3):
            R   The logical drive is rebuilding.
            P   Parity is being regenerated on the logical drive.
O           Stripe size:
            2   4 KB
            3   8 KB
            4   16 KB
            5   32 KB
            6   64 KB
            7   128 KB
            8   256 KB
C           Write policy setting:
            B   Write-back
            T   Write-through
#LN         Total number of drive members in this logical drive.
#SB         Number of standby drives available for the logical drive, including local spare and global spare drives available for the logical drive.
#FL         Number of failed drive members in the logical drive.
Name        Logical drive name (user configurable).




Note - The SIZE (MB) parameter for a logical drive might not correspond exactly with the total size reported for each of the physical drives that make up the logical drive when using the "view and edit Logical drives" menu option. Any discrepancy is minor and is a result of how the drive manufacturers report their device size, which varies among manufacturers.



To handle failed, incomplete, or fatal failure status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.
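
The stripe-size codes in the O column of TABLE 14-2 form a simple power-of-two series (a code of n corresponds to 2 to the power n KB), so decoding a status row outside the firmware interface is straightforward. The Python sketch below is an illustrative helper, not part of the firmware or CLI; the function names and the idea of parsing a captured status row are assumptions.

# Hypothetical helpers: interpret the "O" (stripe size) and "C" (write policy)
# codes from a row of the logical drive status window.
WRITE_POLICY = {"B": "Write-back", "T": "Write-through"}

def stripe_size_kb(code: int) -> int:
    """Convert a stripe-size code (2-8) to kilobytes: 2 -> 4 KB, ..., 8 -> 256 KB."""
    if not 2 <= code <= 8:
        raise ValueError("stripe-size code must be between 2 and 8")
    return 2 ** code

def describe_write_policy(code: str) -> str:
    """Return the write policy name for a 'C' column code."""
    return WRITE_POLICY.get(code.upper(), "Unknown write policy: " + code)

if __name__ == "__main__":
    print(stripe_size_kb(6))            # 64
    print(describe_write_policy("B"))   # Write-back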

Physical Drive Status Table

To check and configure physical drives, from the Main Menu, choose "view and edit Drives" and press Return.

 Screen capture shows the Main Menu with "view and edit scsi Drives" selected.

The Physical Drive Status table is displayed with the status of all physical drives in the array.

 Screen capture shows the status of all SCSI drives in the array.

TABLE 14-3 Parameters Displayed in the Physical Drive Status Window

Parameter   Description
Chl         Channel that is assigned to the drive.
ID          ID of the drive.
Size (MB)   Drive capacity in megabytes.
Speed       xxMB      Maximum synchronous transfer rate of this drive.
            ASYNC     The drive is using asynchronous mode.[1]
LG_DRV      x         The drive is a physical drive member of logical drive x.
Status      COPYING   The logical drive is in the process of copying from another drive.
            GLOBAL    The drive is a global spare drive.
            INITING   The drive is initializing.
            ON-LINE   The drive is in good condition.
            REBUILD   The drive is rebuilding.
            STAND-BY  Local spare drive or global spare drive. If the drive is a local spare, the LG_DRV column displays the drive number of the logical drive to which the spare is assigned. If the drive is a global spare, the LG_DRV column displays GLOBAL.
            NEW DRV   The new drive has not been configured to any logical drive or as a spare drive.
            USED DRV  The drive was previously configured as part of a logical drive from which it has been removed; it still contains data from that logical drive.
            FRMT DRV  The drive has been formatted with reserved space allocated for controller-specific information.
            BAD       Failed drive.
            ABSENT    Drive slot is not occupied or the drive is defective and cannot be detected.
            MISSING   Drive once existed, but is now missing.
            SB-MISS   Spare drive missing.
Vendor and product ID   Vendor and product model information of the drive.


A physical drive has a USED status when it was once part of a logical drive but no longer is. This can happen, for instance, when a drive in a RAID 5 array is replaced by a spare drive and the logical drive is rebuilt with the new drive. If the removed drive is later reinstalled in the array and scanned, the drive status is identified as USED because the drive still has data on it from a logical drive.

When a logical drive is deleted properly, the user data on its member drives is erased and the drive status is shown as FRMT rather than USED. A drive with FRMT status has been formatted with either 64 KB or 256 MB of reserved space for storing controller-specific information, but has no user data on it.

If you remove the reserved space using the "view and edit Drives" menu, the drive status changes to NEW.
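
If you collect physical drive status values outside the firmware interface (for example, from captured screen output), a simple host-side check can flag the drives that need attention. The Python sketch below is illustrative only; the status strings come from TABLE 14-3, but the input format and helper names are assumptions.

# Hypothetical helper: flag physical drives whose status calls for action.
# Status strings are those listed in TABLE 14-3.
ATTENTION = {
    "BAD":      "Failed drive; replace it",
    "MISSING":  "Drive once existed but is now missing",
    "SB-MISS":  "Spare drive missing",
    "ABSENT":   "Slot empty or drive cannot be detected",
    "USED DRV": "Contains data from a previous logical drive",
}

def drives_needing_attention(drives):
    """drives: iterable of (chl, drive_id, status) tuples parsed elsewhere."""
    for chl, drive_id, status in drives:
        if status in ATTENTION:
            yield (chl, drive_id, status, ATTENTION[status])

if __name__ == "__main__":
    sample = [(2, 0, "ON-LINE"), (2, 1, "BAD"), (2, 6, "STAND-BY")]
    for row in drives_needing_attention(sample):
        print(row)   # (2, 1, 'BAD', 'Failed drive; replace it')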

To replace BAD drives, or if two drives show BAD and MISSING status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.



Note - If a drive is installed but not listed, the drive might be defective or installed incorrectly.





Note - When power is turned on, the controller scans all physical drives that are connected through the drive channels. If a physical drive is connected after a Sun StorEdge 3310 SCSI controller or Sun StorEdge 3320 SCSI controller completes initialization, use the "Scan scsi drive" menu option ("view and edit Drives → Scan scsi drive") so that the controller recognizes the newly added physical drive and you can configure it as a member of a logical drive or as a spare drive.



Channel Status Table

To check and configure channels, from the Main Menu, choose "view and edit channelS," and press Return.

The Channel Status table is displayed with the status of all channels on the array.

 Screen capture shows the status of all channels on the array.


Note - Each controller has an RS232 port and an Ethernet port. This architecture ensures continuous access for communication should either controller fail. Because the connection is established with only one controller at a time, even when the array is in redundant mode, the CurSynClk and CurWid settings are displayed only for the connected controller. Therefore, if a user maps one LUN to the primary controller and another LUN to the secondary controller, only the LUN mapped through the currently connected controller is displayed through the serial and Ethernet port.





caution icon

Caution - Do not change the PID and SID values of drive channels.



TABLE 14-4 Parameters Displayed in the Channel Status Table

Parameter   Description
Chl         Channel ID.
Mode        Channel mode:
            RCCOM     Redundant controller communication channel. Displays as RCC in the Channel Status table.
            Host      The channel is functioning as a host channel.
            Drive     The channel is functioning as a drive channel.
            DRV+RCC   The channel is functioning as a drive channel with a redundant controller communication channel (Fibre Channel only).
PID         Primary controller's ID mapping:
            *    Multiple IDs are applied (host channel mode only).
            #    The ID to which host LUNs are mapped in host channel mode; the ID for the primary controller in drive channel mode.
            NA   No ID applied.
SID         Secondary controller's ID mapping:
            *    Multiple IDs are applied (host channel mode only).
            #    The ID to which host LUNs are mapped in host channel mode; the ID for the secondary controller in drive channel mode.
            NA   No ID applied.
DefSynClk   Default bus synchronous clock:
            xx.x MHz   Maximum synchronous transfer rate (SCSI arrays only).
            x GHz      Maximum synchronous transfer rate (FC arrays only).
            Async      Channel is set for asynchronous transfers (SCSI arrays only).
            Auto       Channel is configured to communicate at 1 or 2 GHz (FC arrays only).
DefWid      Default bus width:
            Wide       Channel is set to allow wide (16-bit) transfers (SCSI arrays only).
            Narrow     Channel is set to allow narrow (8-bit) transfers (SCSI arrays only).
            Serial     Channel is using serial communication.
S           Signal:
            S    Single-ended
            L    LVD
            F    Fibre
Term        Terminator status:
            On   Termination is enabled (SCSI arrays only).
            Off  Termination is disabled (SCSI arrays only).
            NA   Displayed for a redundant controller communication (RCCOM) channel (SCSI arrays) and for all FC array channels.
CurSynClk   Current bus synchronous clock. This field displays values only for channels that are assigned to the primary controller:
            xx.x MHz   The current speed at which a SCSI array channel is communicating.
            x GHz      The current speed at which an FC array channel is communicating.
            Async      The channel is communicating asynchronously, or no device is detected.
            (empty)    The default bus synchronous clock has changed. Reset the controller for the change to take effect.
CurWid      Current bus width. This field displays values only for channels that are assigned to the primary controller:
            Wide       The channel is currently servicing wide (16-bit) transfers (SCSI arrays only).
            Narrow     The channel is currently servicing narrow (8-bit) transfers (SCSI arrays only).
            Serial     Channel is using serial communication.
            (empty)    The default bus width has changed. Reset the controller for the change to take effect.
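
Because an empty CurSynClk or CurWid field can mean that a default was changed but the controller has not yet been reset, a host-side report can call those channels out explicitly. The sketch below is a hypothetical illustration; the field names mirror TABLE 14-4, but the input structure and function name are assumptions.

# Hypothetical helper: report channels whose current clock or width is blank.
# A blank field can indicate that the default was changed and the controller
# needs a reset; it can also simply mean the channel is assigned to the
# secondary controller, so treat the result as a hint, not proof.
def channels_needing_reset(channels):
    """channels: iterable of dicts with 'Chl', 'CurSynClk', and 'CurWid' keys."""
    for ch in channels:
        if not ch.get("CurSynClk") or not ch.get("CurWid"):
            yield ch["Chl"]

if __name__ == "__main__":
    sample = [
        {"Chl": 0, "CurSynClk": "80.0 MHz", "CurWid": "Wide"},
        {"Chl": 1, "CurSynClk": "", "CurWid": ""},   # defaults changed, not yet reset
    ]
    print(list(channels_needing_reset(sample)))      # [1]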



Upgrading Firmware

From time to time, firmware upgrades are made available as patches. Check the release notes for your array to find out the current patch IDs available for your array.

You can download RAID controller firmware patches from SunSolve Online, located at:

http://sunsolve.sun.com

Each patch applies to one or more particular pieces of firmware, including controller firmware, disk drive firmware, SES firmware, and PLD firmware.



Note - Disk drive firmware is provided through Sun disk firmware patches, which include the required download utility. Sun disk firmware patches are separate from Sun StorEdge 3000 family firmware patches. Do not use Sun StorEdge Configuration Service or the Sun StorEdge CLI to download disk drive firmware.



SunSolve has extensive search capabilities that can help you find these patches, as well as regular patch reports and alerts to let you know when firmware upgrades and other patches become available. In addition, SunSolve provides reports about bugs that have been fixed in patch updates.

Each patch includes an associated README text file that provides detailed instructions about how to download and install that patch. Generally speaking, however, all firmware downloads follow the same steps:



Note - For instructions on how to download firmware to disk drives in a JBOD directly attached to a host, refer to the README file in the patch that contains the firmware.





caution icon

Caution - Be particularly careful about downloading and installing PLD firmware. If the wrong firmware is installed, or the firmware is installed on the wrong device, your controller might be rendered inoperable. Always be sure to upgrade your SES firmware first before trying to determine if you need a PLD upgrade.



Patch Downloads

1. Once you have determined that a patch is available to update firmware on your array, make note of the patch number or use SunSolve Online's search capabilities to locate and navigate to the patch.

2. Read the README text file associated with that patch for detailed instructions on downloading and installing the firmware upgrade.

3. Follow those instructions to download and install the patch.

Installing Firmware Upgrades

It is important that you run only firmware versions that are supported by your array, so verify that the version you want to install is supported before updating your firmware.

Refer to the release notes for your array for Sun Microsystems patches containing firmware upgrades that are available for your array. Refer to SunSolve Online for subsequent patches containing firmware upgrades.

If you are downloading a Sun patch that includes a firmware upgrade, the README file associated with that patch tells you which Sun StorEdge 3000 family arrays support that firmware release.



caution icon

Caution - Major upgrades of controller firmware, or replacing a controller with one that has a significantly different version of firmware, might involve differences in non-volatile RAM (NVRAM) that require following special upgrade procedures. For more information, refer to the Sun StorEdge 3000 Family FRU Installation Guide and to the release notes for your array.



To download new versions of controller firmware, or SES and PLD firmware, use either Sun StorEdge Configuration Service or the Sun StorEdge CLI:



Note - Do not use both in-band and out-of-band connections at the same time to manage the array. You might cause conflicts between multiple operations.





Note - Disk drive firmware is provided through Sun disk firmware patches which include the required download utility. Sun disk firmware patches are separate from the Sun StorEdge 3000 family firmware patches. Do not use the Sun StorEdge CLI or Sun StorEdge Configuration Service to download disk drive firmware.



Controller Firmware Upgrade Features

The following firmware upgrade features apply to the controller firmware:

When downloading is performed on a dual-controller system, firmware is flashed onto both controllers without interrupting host I/O. When the download process is complete, the primary controller resets and lets the secondary controller take over service temporarily. When the primary controller comes back online, the secondary controller hands over the workload and then resets itself so that the new firmware takes effect. The rolling upgrade is performed automatically by the controller firmware, and no user intervention is required.

A controller that replaces a failed unit in a dual-controller system often has a newer release of the firmware installed than the firmware in the controller it replaced. To maintain compatibility, the surviving primary controller automatically updates the firmware running on the replacement secondary controller to the firmware version of the primary controller.



Note - When you upgrade your controller firmware in the Solaris operating system, the format(1M) command still shows the earlier revision level.



Upgrading SES and PLD Firmware

When you replace an I/O controller, the new controller might have a version of SES or PLD firmware different from the other controller in your array. If this mismatch occurs, you hear an audible alarm and the Event LED blinks amber when you install the controller.

To synchronize the SES firmware and hardware PLD versions, you must download new SES firmware through Sun StorEdge Configuration Service or the Sun StorEdge CLI.

If you have not installed this software, you must install it from the software CD that shipped with your array. Refer to the Sun StorEdge 3000 Family Configuration Service User's Guide for your array to see instructions for downloading firmware for devices. Refer to the Sun StorEdge 3000 Family CLI User's Guide, or the sccli(1M) man page for similar instructions for using the Sun StorEdge CLI. Refer to the release notes for your array for instructions about where to obtain the firmware that you need to download.

When you open Sun StorEdge Configuration Service or the Sun StorEdge CLI and connect to the array, an error message alerts you to the mismatched version problem.


Troubleshooting Your Array

For hardware troubleshooting information, refer to the Installation, Operation and Service Manual for your array. For additional troubleshooting tips, refer to the release notes for your array.

Controller Failover

Controller failure symptoms include:

A Bus Reset Issued warning message is displayed for each of the channels. In addition, a Redundant Controller Failure Detected alert message is displayed.

If one controller in the redundant controller configuration fails, the surviving controller takes over for the failed controller.

A failed controller is managed by the surviving controller, which disables and disconnects from its counterpart while gaining access to all the signal paths. The surviving controller then manages the ensuing event notifications and takes over all processes. The surviving controller is always the primary controller regardless of its original status, and any replacement controller afterward assumes the role of the secondary controller.

The failover and failback processes are completely transparent to hosts.

Controllers are hot-swappable if they are in a redundant configuration. Replacing a failed controller takes only a few minutes. Since the I/O connections are on the controllers, you might experience some unavailability between the time when cables on the failed controller are disconnected and the time when a new controller is installed and its cables are connected.

To maintain your redundant controller configuration, replace a failed controller as soon as possible. For details, refer to the Sun StorEdge 3000 Family FRU Installation Guide.

RAID LUNs Not Visible to Host

By default, all RAID arrays are preconfigured with one or two logical drives. For a logical drive to be visible to the host server, its partitions must be mapped to host LUNs. For mapping details, see Mapping a Partition to a Host LUN for SCSI arrays or LUN Mapping for FC and SATA arrays.

To make the mapped LUNs visible to a specific host, perform any steps required for your operating system. Refer to the Installation, Operation and Service Manual for your array to see host-specific information about different operating systems.

Rebuilding Logical Drives

This section describes automatic and manual procedures for rebuilding logical drives. The time required to rebuild a logical drive is determined by the size of the logical drive, the amount of I/O being processed by the controller, and the array's Rebuild Priority setting. With no I/O being processed, the time required to rebuild a 2-Tbyte RAID 5 logical drive can be approximately:



Note - As disks fail and are replaced, the rebuild process regenerates the data and parity information that was on the failed disk. However, the NVRAM configuration file that was present on the disk is not re-created. After the rebuild process is complete, restore your configuration as described in Restoring Your Configuration (NVRAM) From Disk.



Automatic Logical Drive Rebuild

Rebuild with Spare. When a member drive in a logical drive fails, the controller first determines whether there is a local spare drive assigned to the logical drive. If there is a local spare drive, the controller automatically starts to rebuild the data from the failed drive onto the spare.

If there is no local spare drive available, the controller searches for a global spare drive. If there is a global spare, the controller automatically uses the global spare to rebuild the logical drive.

Failed Drive Swap Detect. If neither a local spare drive nor a global spare drive is available, and Periodic Auto-Detect Failure Drive Swap Check Time is disabled, the controller does not attempt to rebuild unless you apply a forced-manual rebuild.

To enable Periodic Auto-Detect Failure Drive Swap Check Time, perform the following steps:

1. From the Main Menu, choose "view and edit Configuration parameters → Drive-side Parameters → Periodic Auto-Detect Failure Drive Swap Check Time."

A list of check time intervals is displayed.

2. Select a Periodic Auto-Detect Failure Drive Swap Check Time interval.

A confirmation message is displayed.

3. Choose Yes to confirm.

When Periodic Auto-Detect Failure Drive Swap Check Time is enabled (that is, when a check time interval has been selected), the controller detects whether the failed drive has been replaced by checking the failed drive's channel and ID. Once the failed drive has been replaced, the rebuild begins immediately.



Note - This feature requires system resources and can impact performance.



If the failed drive is not replaced but a local spare is added to the logical drive, the rebuild begins with the spare.

FIGURE 14-1 illustrates this automatic rebuild process.

  FIGURE 14-1 Automatic Rebuild

Flowchart shows automatic rebuild process.
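
The decision sequence shown in FIGURE 14-1 can also be summarized in a few lines of Python. The sketch below is an illustrative model of the logic described above, not firmware code; the function and parameter names are assumptions.

# Illustrative model of the automatic rebuild decision flow (FIGURE 14-1).
# Not firmware code; names and structure are assumptions made for clarity.
def automatic_rebuild(local_spare, global_spare, swap_check_enabled,
                      failed_drive_replaced):
    """Return a short description of what the controller does next."""
    if local_spare:
        return "Rebuild onto the local spare drive assigned to the logical drive"
    if global_spare:
        return "Rebuild onto an available global spare drive"
    if not swap_check_enabled:
        return "Wait: no rebuild occurs until a forced-manual rebuild is applied"
    if failed_drive_replaced:
        return "Failed drive replaced on the same channel/ID: rebuild begins immediately"
    return "Keep checking the failed drive's channel and ID at the selected interval"

if __name__ == "__main__":
    print(automatic_rebuild(local_spare=False, global_spare=True,
                            swap_check_enabled=False, failed_drive_replaced=False))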

Manual Rebuild

When a user applies forced-manual rebuild, the controller first determines whether there is a local spare drive assigned to the logical drive. If a local spare drive is available, the controller automatically starts to rebuild onto the spare drive.

If no local spare drive is available, the controller searches for a global spare drive. If there is a global spare drive, the controller begins to rebuild the logical drive immediately. FIGURE 14-2 illustrates this manual rebuild process.

If neither a local spare nor a global spare drive is available, the controller monitors the channel and ID of the failed drive. After the failed drive has been replaced with a healthy one, the controller begins to rebuild the logical drive onto the new drive. If no drive is available for rebuilding, the controller does not attempt to rebuild until the user applies another forced-manual rebuild.

  FIGURE 14-2 Manual Rebuild

Flowchart shows manual rebuild process.

Concurrent Rebuild in RAID 1+0

RAID 1+0 allows multiple-drive failure and concurrent multiple-drive rebuild. Newly installed drives must be scanned and configured as local spares. These drives are rebuilt at the same time; you do not need to repeat the rebuilding process for each drive.

Modifying Drive-Side Parameters

There are a number of interrelated drive-side configuration parameters you can set using the "view and edit Configuration parameters" menu option. Experimenting with these parameters can produce undesirable results, so change them only when you have good reason to do so.

See Drive-Side Parameters Menu for cautions about changing sensitive drive-side parameter settings. In particular, do not set Periodic SAF-TE and SES Device Check Time to less than one second, and do not set Drive I/O Timeout to anything less than 30 seconds for FC or SATA arrays.
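
If you track intended drive-side settings in a host-side script or change-control file, a simple validation step can catch the two values called out above before they are applied through the firmware menus. This Python sketch is purely illustrative; the parameter names and input format are assumptions.

# Hypothetical sanity check for the two sensitive drive-side parameters noted above.
def validate_drive_side(saf_te_ses_check_secs: float,
                        drive_io_timeout_secs: float,
                        fc_or_sata: bool) -> list:
    """Return a list of warnings for values the text above advises against."""
    warnings = []
    if saf_te_ses_check_secs < 1:
        warnings.append("Periodic SAF-TE and SES Device Check Time is below one second")
    if fc_or_sata and drive_io_timeout_secs < 30:
        warnings.append("Drive I/O Timeout is below 30 seconds on an FC or SATA array")
    return warnings

if __name__ == "__main__":
    print(validate_drive_side(0.5, 15, fc_or_sata=True))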


Additional Troubleshooting Information

For additional troubleshooting tips, refer to the Installation, Operation and Service Manual for your array, and to the release notes for your array.


1 (TableFootnote) When a Sun StorEdge 3310 SCSI array or Sun StorEdge 3320 SCSI array is powered up, it can take approximately 30-40 seconds before the drive speed is displayed correctly. Before that happens, the drive speed can display as Async.