C H A P T E R 14 - Maintaining Your Array

Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see hardware-related maintenance and troubleshooting information.

This chapter covers the following firmware-oriented maintenance and troubleshooting topics:

Battery Operation

Battery Status

Battery Support for Cache Operations

Checking Status Windows

Logical Drive Status Table

Physical Drive Status Table

Channel Status Table

Upgrading Firmware

Patch Downloads

Installing Firmware Upgrades

Controller Firmware Upgrade Features

Upgrading SES and PLD Firmware

Troubleshooting Your Array

Controller Failover

RAID LUNs Not Visible to Host

Rebuilding Logical Drives

Modifying Drive-Side Parameters

Additional Troubleshooting Information

Battery Operation

The battery LED (on the far right side of the I/O controller module) is amber if the battery is bad or missing. The LED blinks green if the battery is charging and is solid green when the battery is fully charged.

Battery Status

Battery status is displayed at the top of the initial firmware screen. The BAT: indicator ranges from BAD, through ----- (charging), to +++++ (fully charged).

For maximum life, lithium ion batteries are not recharged until the charge level is very low, indicated by a status of -----. Automatic recharging at this point takes very little time.

A battery module whose status shows one or more + signs can support cache memory for 72 hours. As long as one or more + signs are displayed, your battery is performing correctly.

TABLE 14-1 Battery Status Indicators
Battery Display	Description
-----	Discharged; the battery is automatically recharged when it reaches this state.
+----	Adequately charged to maintain cache memory for 72 hours or more in case of power loss. Automatic recharging occurs when the battery status drops below this level.
++---	90% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
+++--	92% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
++++-	95% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
+++++	Over 97% charged; adequate to maintain cache memory for 72 hours or more in case of power loss.
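
A monitoring script that captures the firmware screen could interpret the BAT: indicator along the lines of the following Python sketch. It is illustrative only: the function names are hypothetical and are not part of any Sun tool, and the descriptions are paraphrased from TABLE 14-1.

```python
# Illustrative sketch only. The names below are hypothetical, not a Sun API.
# Descriptions are paraphrased from TABLE 14-1.
BATTERY_LEVELS = {
    "-----": "discharged; automatic recharging begins",
    "+----": "adequately charged",
    "++---": "90% charged",
    "+++--": "92% charged",
    "++++-": "95% charged",
    "+++++": "over 97% charged",
}


def describe_battery(indicator):
    """Return a short description for a BAT: indicator string."""
    value = indicator.strip()
    if value.upper() == "BAD":
        return "battery reported as BAD"
    if value not in BATTERY_LEVELS:
        raise ValueError("unrecognized battery indicator: %r" % indicator)
    return BATTERY_LEVELS[value]


def supports_cache_for_72_hours(indicator):
    """True when one or more + signs are displayed."""
    value = indicator.strip()
    return value in BATTERY_LEVELS and "+" in value


if __name__ == "__main__":
    for sample in ("-----", "+----", "+++++", "BAD"):
        print(sample, "->", describe_battery(sample),
              "| 72-hour cache support:", supports_cache_for_72_hours(sample))
```

The second function encodes only the rule stated in this section: any indicator that shows at least one + sign is adequate to maintain cache memory for 72 hours.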

Your lithium ion battery should be changed every two years if the unit is continuously operated at 77 degrees F (25 degrees C). If the unit is continuously operated at 95 degrees F (35 degrees C) or higher, the battery should be changed every year. The shelf life of your battery is three years.

Note - The RAID controller has a temperature sensor which shuts off battery charging when the temperature reaches 129 degrees F (54 degrees C). When this happens, the battery status might be reported as BAD, but no alarm is written to the event log because no actual battery failure has occurred. This behavior is normal. As soon as the temperature returns to the normal range, battery charging resumes and the battery status is reported correctly. It is not necessary to replace or otherwise interfere with the battery in this situation.

Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see your array's acceptable operating and nonoperating temperature ranges.

For information about the date of manufacture and how to replace the battery module, refer to the Sun StorEdge 3000 Family FRU Installation Guide.

Battery Support for Cache Operations

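In write-back mode, unfinished writes are cached in memory. If power to the array is discontinued, the battery module preserves the data stored in the cache memory so that it is not lost. A battery module whose status shows one or more + signs can support cache memory for 72 hours or more (see TABLE 14-1).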

Write cache is not automatically disabled when the battery is offline due to battery failure or a disconnected battery, but you can set an event trigger to make this happen. See Configuring the Battery Backup (BBU) Low Event or BBU Failed Event Trigger for more information.

Checking Status Windows

The status windows used to monitor and manage the array are described in the following sections:

Logical Drive Status Table

Physical Drive Status Table

Channel Status Table

Logical Drive Status Table

To check and configure logical drives, from the Main Menu choose "view and edit Logical drives" and press Return.

Screen capture shows the Main Menu with "view and edit Logical drives" selected.

The status of all logical drives is displayed.

Screen capture shows the status of all logical drives for this array.

TABLE 14-2 shows definitions and values for logical drive parameters.

TABLE 14-2 Parameters Displayed in the Logical Drive Status Window
Parameter		Description
LG		Logical drive number. P0: logical drive 0 of the primary controller, where P = primary controller and 0 = logical drive number. S1: logical drive 1 of the secondary controller, where S = secondary controller and 1 = logical drive number.
ID		Logical drive ID number (controller-generated)
LV		The logical volume to which this logical drive belongs. NA indicates no logical volume.
RAID		Assigned RAID level
SIZE (MB)		Capacity of the logical drive
Status1		Logical drive status:
	COPYING		The logical drive is in the process of copying from another drive.
	CREATING		The logical drive is being initiated.
	GOOD		The logical drive is in good condition.
	DRV FAILED		A drive member failed in the logical drive.
	FATAL FAIL		More than one drive member in a logical drive has failed.
	INCOMPLETE		Two or more member drives in this logical drive have failed.
	SHUT-DOWN		The controller has been shut down using the Shutdown command. Restart the controller to restore it to GOOD status.
	REBUILDING		The logical drive is being rebuilt.
Status2		Logical Drive status column 2
	I		The logical drive is initializing.
	A		Adding a physical drive to the logical drive.
	E		Expanding a logical drive.
Status3		Logical Drive status column 3
	R		The logical drive is rebuilding.
	P		Regenerating parity on the logical drive.
O		Stripe size:
	2		4 KB
	3		8 KB
	4		16 KB
	5		32 KB
	6		64 KB
	7		128 KB
	8		256 KB
C		Write policy setting
	B		Write-back
	T		Write-through
#LN		Total number of drive members in this logical drive
#SB		Number of standby drives available for the logical drive. This includes local spare and global spare drives available for the logical drive.
#FL		Number of failed drive members in the logical drive
Name		Logical drive name (user configurable)
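
The coded columns in TABLE 14-2 lend themselves to simple lookup tables. The following Python sketch shows one way a script might decode the LG, O, and C columns. The function and variable names are hypothetical, and the sketch assumes that the column values have already been extracted from the status window as strings.

```python
# Illustrative sketch only; names are hypothetical. Values come from TABLE 14-2.
STRIPE_SIZE_KB = {2: 4, 3: 8, 4: 16, 5: 32, 6: 64, 7: 128, 8: 256}
WRITE_POLICY = {"B": "write-back", "T": "write-through"}
CONTROLLER_ROLE = {"P": "primary", "S": "secondary"}


def parse_lg(field):
    """Split an LG value such as 'P0' into (controller role, drive number)."""
    field = field.strip()
    if len(field) < 2 or field[0] not in CONTROLLER_ROLE or not field[1:].isdigit():
        raise ValueError("unexpected LG value: %r" % field)
    return CONTROLLER_ROLE[field[0]], int(field[1:])


def stripe_size_kb(code):
    """Translate the O column code into a stripe size in KB."""
    return STRIPE_SIZE_KB[int(code)]


def write_policy(code):
    """Translate the C column code into a write policy name."""
    return WRITE_POLICY[code.strip().upper()]


if __name__ == "__main__":
    print(parse_lg("P0"), stripe_size_kb("7"), write_policy("B"))
```

Each stripe size code listed in the table corresponds to 2 raised to that code, in KB (for example, code 7 is 128 KB), which is a convenient cross-check when reading the O column.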

Note - The SIZE (MB) parameter for a logical drive might not correspond exactly with the total size reported for each of the physical drives that make up the logical drive when using the "view and edit Logical drives" menu option. Any discrepancy is minor and is a result of how the drive manufacturers report their device size, which varies among manufacturers.

To handle failed, incomplete, or fatal failure status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.

Physical Drive Status Table

To check and configure physical drives, from the Main Menu, choose "view and edit Drives" and press Return.

Screen capture shows the Main Menu with "view and edit scsi Drives" selected.

The Physical Drive Status table is displayed with the status of all physical drives in the array.

Screen capture shows the status of all SCSI drives in the array.

TABLE 14-3 Parameters Displayed in the Physical Drive Status Window
Parameters		Description
Chl		Channel that is assigned to the drive
ID		ID of the drive
Size (MB)		Drive capacity in megabytes
Speed		xxMB: Maximum synchronous transfer rate of this drive. ASYNC: The drive is using asynchronous mode.^[1]
LG_DRV		x: The drive is a physical drive member of logical drive x.
Status	COPYING		The logical drive is in the process of copying from another drive.
	GLOBAL		The drive is a global spare drive.
	INITING		The drive is initializing.
	ON-LINE		The drive is in good condition.
	REBUILD		The drive is rebuilding.
	STAND-BY		Local spare drive or global spare drive. If the drive is a local spare, the LG_DRV column displays the drive number of the logical drive to which the spare is assigned. If the drive is a global spare, the LG_DRV column displays GLOBAL.
	NEW DRV		The new drive has not been configured to any logical drive or as a spare drive.
	USED DRV		The drive was previously configured as part of a logical drive from which it has been removed; it still contains data from that logical drive.
	FRMT DRV		The drive has been formatted with reserved space allocated for controller-specific information.
	BAD		Failed drive.
	ABSENT		Drive slot is not occupied or the drive is defective and cannot be detected.
	MISSING		Drive once existed, but is now missing.
	SB-MISS		Spare drive missing.
Vendor and product ID			Vendor and product model information of the drive.

A physical drive has a USED status when it was once part of a logical drive but no longer is. This can happen, for instance, when a drive in a RAID 5 array is replaced by a spare drive and the logical drive is rebuilt with the new drive. If the removed drive is later reinstalled in the array and scanned, the drive status is identified as USED because the drive still has data on it from a logical drive.

When a logical drive is deleted properly, this user information is erased and the drive status is shown as FRMT rather than USED. A drive with FRMT status has been formatted with either 64 KB or 256 MB of reserved space for storing controller-specific information, but has no user data on it.

If you remove the reserved space using the "view and edit Drives" menu, the drive status changes to NEW.
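
The relationship between the USED, FRMT, and NEW states described above can be summarized in a few lines of Python. This is a sketch of the rules in this section, not of the controller firmware: the function name and its two boolean inputs are hypothetical, and it applies only to drives that are not currently members of a logical drive or configured as spares.

```python
# Illustrative sketch only; the function and parameter names are hypothetical.
def unassigned_drive_status(has_reserved_space, has_old_logical_drive_data):
    """Status of a drive that is not a logical drive member or a spare."""
    if has_old_logical_drive_data:
        # Removed from a logical drive but still holding its data.
        return "USED"
    if has_reserved_space:
        # Reserved space for controller-specific information, no user data.
        return "FRMT"
    # Reserved space removed, or the drive was never configured.
    return "NEW"


if __name__ == "__main__":
    print(unassigned_drive_status(True, True))
    print(unassigned_drive_status(True, False))
    print(unassigned_drive_status(False, False))
```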

To replace BAD drives, or if two drives show BAD and MISSING status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.

Note - If a drive is installed but not listed, the drive might be defective or installed incorrectly.

Note - When power is turned on, the controller scans all physical drives that are connected through the drive channels. If a physical drive is connected after a Sun StorEdge 3310 SCSI controller or Sun StorEdge 3320 SCSI controller completes initialization, use the "Scan scsi drive" menu option ("view and edit Drives → Scan scsi drive") to let the controller recognize the newly added physical drive so you can configure it as a member of a logical drive or as a spare drive.

Channel Status Table

To check and configure channels, from the Main Menu, choose "view and edit channelS" and press Return.

The Channel Status table is displayed with the status of all channels on the array.

Screen capture shows the status of all channels on the array.

Note - Each controller has an RS232 port and an Ethernet port. This architecture ensures continuous access for communication should either controller fail. Since the connection is established with only one controller at a time, even when the array is in redundant mode, the CurSynClk and CurWid settings are displayed only for the connected controller. Therefore, if a user maps one LUN to the primary controller and another LUN to the secondary controller, only the LUN mapped to the currently connected controller is displayed through the serial or Ethernet port.

Caution - Do not change the PID and SID values of drive channels.

TABLE 14-4 Parameters Displayed in the Channel Status Table
Parameters	Description
Chl	Channel's ID.
Mode	Channel mode.
	RCCOM	Redundant controller communication channel. Displays as `RCC` in the Channel Status table.
	Host	The channel is functioning as a host channel.
	Drive	The channel is functioning as a drive channel.
	DRV+RCC	The channel is functioning as a drive channel with a redundant controller communication channel. (Fibre Channel only).
PID	Primary controller's ID mapping:
	*	Multiple IDs were applied (host channel mode only).
	#	The ID to which host LUNs are mapped in the host channel mode. The ID for the primary controller in the drive channel mode.
	NA	No ID applied.
SID	Secondary controller's ID mapping:
	*	Multiple IDs (host channel mode only).
	#	The ID to which host LUNs are mapped in the host channel mode. The ID for the secondary controller in the drive channel mode.
	NA	No ID applied.
DefSynClk	Default bus synchronous clock:
	xx.x MHz	Maximum synchronous transfer rate (SCSI arrays only).
	x GHz	Maximum synchronous transfer rate (FC arrays only).
	Async	Channel is set for asynchronous transfers (SCSI arrays only).
	Auto	Channel is configured to communicate at 1 or 2 GHz (FC arrays only).
DefWid	Default bus width:
	Wide	Channel is set to allow wide (16-bit) transfers (SCSI arrays only).
	Narrow	Channel is set to allow narrow (8-bit) transfers (SCSI arrays only).
	Serial	Channel is using serial communication.
S	Signal:
	S	Single-ended
	L	LVD
	F	Fibre
Term	Terminator status:
	On	Termination is enabled (SCSI arrays only).
	Off	Termination is disabled (SCSI arrays only).
	NA	For a redundant controller communications (RCCOM) channel (SCSI arrays) and all FC array channels.
CurSynClk	Current bus synchronous clock. This field only displays values for channels that are assigned to the primary controller.
	xx.x MHz	The current speed at which a SCSI array channel is communicating.
	x GHz	The current speed at which an FC array channel is communicating.
	Async	The channel is communicating asynchronously or no device is detected.
	(empty)	The default bus synchronous clock has changed. Reset the controller for changes to take effect.
CurWid	Current bus width. This field only displays values for channels that are assigned to the primary controller.
	Wide	The channel is currently servicing wide 16-bit transfers (SCSI arrays only).
	Narrow	The channel is currently servicing narrow 8-bit transfers (SCSI arrays only).
	Serial	Channel is using serial communication.
	(empty)	The default bus width has changed. Reset the controller for the changes to take effect.

Upgrading Firmware

From time to time, firmware upgrades are made available as patches. Check the release notes for your array to find out the current patch IDs available for your array.

You can download RAID controller firmware patches from SunSolve Online, located at:

http://sunsolve.sun.com

Each patch applies to one or more particular pieces of firmware, including:

Controller firmware

SES firmware

PLD firmware

SATA router firmware (SATA only)

MUX firmware (SATA only)

Note - Disk drive firmware is provided through Sun disk firmware patches, which include the required download utility. Sun disk firmware patches are separate from Sun StorEdge 3000 family firmware patches. Do not use Sun StorEdge Configuration Service or the Sun StorEdge CLI to download disk drive firmware.

SunSolve has extensive search capabilities that can help you find these patches, as well as regular patch reports and alerts to let you know when firmware upgrades and other patches become available. In addition, SunSolve provides reports about bugs that have been fixed in patch updates.

Each patch includes an associated README text file that provides detailed instructions about how to download and install that patch. In general, all firmware downloads follow the same steps:

Locating the patch on SunSolve that contains the firmware upgrade you want

Downloading the patch to a location on your network

Using your array software (Sun StorEdge Configuration Service or the Sun StorEdge CLI) to "flash" the firmware to the device it updates

Note - For instructions on how to download firmware to disk drives in a JBOD directly attached to a host, refer to the README file in the patch that contains the firmware.

Caution - Be particularly careful about downloading and installing PLD firmware. If the wrong firmware is installed, or the firmware is installed on the wrong device, your controller might be rendered inoperable. Always upgrade your SES firmware before trying to determine whether you need a PLD upgrade.

Patch Downloads

1. Once you have determined that a patch is available to update firmware on your array, make note of the patch number or use SunSolve Online's search capabilities to locate and navigate to the patch.

2. Read the README text file associated with that patch for detailed instructions on downloading and installing the firmware upgrade.

3. Follow those instructions to download and install the patch.

Installing Firmware Upgrades

Run only firmware versions that are supported by your array. Before updating your firmware, make sure that the version of firmware you want to use is supported by your array.

Refer to the release notes for your array for Sun Microsystems patches containing firmware upgrades that are available for your array. Refer to SunSolve Online for subsequent patches containing firmware upgrades.

If you are downloading a Sun patch that includes a firmware upgrade, the README file associated with that patch tells you which Sun StorEdge 3000 family arrays support that firmware release.

Caution - Major upgrades of controller firmware, or replacing a controller with one that has a significantly different version of firmware, might involve differences in non-volatile RAM (NVRAM) that require following special upgrade procedures. For more information, refer to the Sun StorEdge 3000 Family FRU Installation Guide and to the release notes for your array.

To download new versions of controller firmware, or SES and PLD firmware, use one of the following tools:

The Sun StorEdge CLI (with an in-band connection, for Linux and Microsoft Windows hosts, and for servers running the Solaris operating system)

Sun StorEdge Configuration Service (with an in-band connection, for Solaris and Microsoft Windows hosts)

Note - Do not use both in-band and out-of-band connections at the same time to manage the array. You might cause conflicts between multiple operations.

Note - Disk drive firmware is provided through Sun disk firmware patches which include the required download utility. Sun disk firmware patches are separate from the Sun StorEdge 3000 family firmware patches. Do not use the Sun StorEdge CLI or Sun StorEdge Configuration Service to download disk drive firmware.

Controller Firmware Upgrade Features

The following firmware upgrade features apply to the controller firmware:

Redundant Controller Rolling Firmware Upgrade

When downloading is performed on a dual-controller system, firmware is flashed onto both controllers without interrupting host I/O. When the download process is complete, the primary controller resets and lets the secondary controller take over the service temporarily. When the primary controller comes back online, the secondary controller hands over the workload and then resets itself for the new firmware to take effect. The rolling upgrade is automatically performed by controller firmware, and the user's intervention is not necessary.
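
The sequence described above can be written down as an ordered list to make the key property explicit: at every step, at least one controller remains available to service host I/O. The following Python sketch is a model of this paragraph only. The step wording and the True/False availability flags are an interpretation of the text, and the names are hypothetical rather than taken from the firmware.

```python
# Illustrative model only; names and availability flags are assumptions.
# Each entry: (step description, primary serving I/O, secondary serving I/O)
ROLLING_UPGRADE_STEPS = [
    ("firmware is flashed onto both controllers", True, True),
    ("primary resets; secondary takes over the service", False, True),
    ("primary is back online; secondary hands over the workload", True, True),
    ("secondary resets so that the new firmware takes effect", True, False),
    ("both controllers run the new firmware", True, True),
]


def host_io_uninterrupted(steps):
    """True if at least one controller serves I/O at every step."""
    return all(primary or secondary for _, primary, secondary in steps)


if __name__ == "__main__":
    for number, (description, _, _) in enumerate(ROLLING_UPGRADE_STEPS, start=1):
        print(number, description)
    print("host I/O uninterrupted:", host_io_uninterrupted(ROLLING_UPGRADE_STEPS))
```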

Automatically Synchronized Controller Firmware Versions

A controller that replaces a failed unit in a dual-controller system often has a newer release of the firmware installed than the firmware in the controller it replaced. To maintain compatibility, the surviving primary controller automatically updates the firmware running on the replacement secondary controller to the firmware version of the primary controller.

Note - When you upgrade your controller firmware in the Solaris operating system, the format(1M) command still shows the earlier revision level.

Upgrading SES and PLD Firmware

When you replace an I/O controller, the new controller might have a version of SES or PLD firmware different from the other controller in your array. If this mismatch occurs, when you install a controller you hear an audible alarm and see a blinking amber Event LED.

To synchronize the SES firmware and hardware PLD versions, you must download new SES firmware through Sun StorEdge Configuration Service or the Sun StorEdge CLI.

If you have not installed this software, you must install it from the software CD that shipped with your array. Refer to the Sun StorEdge 3000 Family Configuration Service User's Guide for your array to see instructions for downloading firmware for devices. Refer to the Sun StorEdge 3000 Family CLI User's Guide, or the sccli(1M) man page for similar instructions for using the Sun StorEdge CLI. Refer to the release notes for your array for instructions about where to obtain the firmware that you need to download.

When you open Sun StorEdge Configuration Service or the Sun StorEdge CLI and connect to the array, an error message alerts you to the mismatched version problem.

Troubleshooting Your Array

For hardware troubleshooting information, refer to the Installation, Operation and Service Manual for your array. For additional troubleshooting tips, refer to the release notes for your array.

Controller Failover

Controller failure symptoms include:

The surviving controller sounds an audible alarm.

The Controller Status LED turns solid amber on the failed controller.

The surviving controller sends event messages announcing the controller failure of the other controller.

A Bus Reset Issued warning message is displayed for each of the channels. In addition, a Redundant Controller Failure Detected alert message is displayed.

If one controller in the redundant controller configuration fails, the surviving controller takes over for the failed controller.

A failed controller is managed by the surviving controller, which disables and disconnects from its counterpart while gaining access to all the signal paths. The surviving controller then manages the ensuing event notifications and takes over all processes. The surviving controller is always the primary controller regardless of its original status, and any replacement controller afterward assumes the role of the secondary controller.

The failover and failback processes are completely transparent to hosts.

Controllers are hot-swappable if they are in a redundant configuration. Replacing a failed controller takes only a few minutes. Since the I/O connections are on the controllers, you might experience some unavailability between the time when cables on the failed controller are disconnected and the time when a new controller is installed and its cables are connected.

To maintain your redundant controller configuration, replace a failed controller as soon as possible. For details, refer to the Sun StorEdge 3000 Family FRU Installation Guide.

RAID LUNs Not Visible to Host

By default, all RAID arrays are preconfigured with one or two logical drives. For a logical drive to be visible to the host server, its partitions must be mapped to host LUNs. For mapping details, see Mapping a Partition to a Host LUN for SCSI arrays or LUN Mapping for FC and SATA arrays.

To make the mapped LUNs visible to a specific host, perform any steps required for your operating system. Refer to the Installation, Operation and Service Manual for your array to see host-specific information about different operating systems.

Rebuilding Logical Drives

This section describes automatic and manual procedures for rebuilding logical drives. The time required to rebuild a logical drive is determined by the size of the logical drive, the I/O that is being processed by the controller, and the array's Rebuild Priority setting. With no I/O being processed, the time required to rebuild a 2-Tbyte RAID 5 logical drive can be approximately:

4.5 hours for a Sun StorEdge 3310 SCSI array or Sun StorEdge 3510 FC array

6.5 hours for a Sun StorEdge 3511 SATA array
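
For planning purposes, the approximate times above can be converted into an implied average rebuild rate. The following Python sketch performs that arithmetic. It assumes that 1 Tbyte means 10^12 bytes, which this chapter does not state, and the function name is hypothetical. The results are rough averages for an idle array, not performance specifications.

```python
# Illustrative arithmetic only; the function name is hypothetical.
TBYTE = 10 ** 12  # assumption: decimal terabyte


def implied_rebuild_rate_mb_per_s(capacity_bytes, hours):
    """Average rate in MB/s (10**6 bytes per second) implied by a rebuild time."""
    return capacity_bytes / (hours * 3600) / 10 ** 6


if __name__ == "__main__":
    for label, hours in (("SCSI or FC array", 4.5), ("SATA array", 6.5)):
        rate = implied_rebuild_rate_mb_per_s(2 * TBYTE, hours)
        print("%s: about %.0f MB/s" % (label, rate))
```

Under these assumptions, the figures work out to roughly 123 MB/s for the 4.5-hour case and roughly 85 MB/s for the 6.5-hour case. Host I/O and a lower Rebuild Priority setting lengthen the rebuild accordingly.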

Note - As disks fail and are replaced, the rebuild process regenerates the data and parity information that was on the failed disk. However, the NVRAM configuration file that was present on the disk is not re-created. After the rebuild process is complete, restore your configuration as described in Restoring Your Configuration (NVRAM) From Disk.

Automatic Logical Drive Rebuild

Rebuild with Spare. When a member drive in a logical drive fails, the controller first determines whether there is a local spare drive assigned to the logical drive. If there is a local spare drive, the controller automatically starts to rebuild the data from the failed drive onto the spare.

If there is no local spare drive available, the controller searches for a global spare drive. If there is a global spare, the controller automatically uses the global spare to rebuild the logical drive.

Failed Drive Swap Detect. If neither a local spare drive nor a global spare drive is available, and Periodic Auto-Detect Failure Drive Swap Check Time is disabled, the controller does not attempt to rebuild unless you apply a forced-manual rebuild.

To enable Periodic Auto-Detect Failure Drive Swap Check Time, perform the following steps:

1. From the Main Menu, choose "view and edit Configuration parameters → Drive-side Parameters → Periodic Auto-Detect Failure Drive Swap Check Time."

A list of check time intervals is displayed.

2. Select a Periodic Auto-Detect Failure Drive Swap Check Time interval.

A confirmation message is displayed.

3. Choose Yes to confirm.

When Periodic Auto-Detect Failure Drive Swap Check Time is enabled (that is, when a check time interval has been selected), the controller detects whether the failed drive has been replaced by checking the failed drive's channel and ID. Once the failed drive has been replaced, the rebuild begins immediately.

Note - This feature requires system resources and can impact performance.

If the failed drive is not replaced but a local spare is added to the logical drive, the rebuild begins with the spare.

FIGURE 14-1 illustrates this automatic rebuild process.

FIGURE 14-1 Automatic Rebuild

Flowchart shows automatic rebuild process.
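
The automatic rebuild decisions described in this section can also be expressed as a short function. The following Python sketch follows the order given in the text: local spare first, then global spare, then the Periodic Auto-Detect Failure Drive Swap Check Time setting. The function name, parameter names, and returned strings are hypothetical and serve only to summarize the logic. They do not represent the controller firmware.

```python
# Illustrative sketch only; names and return strings are hypothetical.
def automatic_rebuild_action(local_spare_available, global_spare_available,
                             swap_check_enabled, failed_drive_replaced):
    """Return what the controller does after a member drive fails."""
    if local_spare_available:
        return "rebuild onto local spare"
    if global_spare_available:
        return "rebuild onto global spare"
    if not swap_check_enabled:
        # No spare and no periodic check: nothing happens automatically.
        return "wait for forced-manual rebuild"
    if failed_drive_replaced:
        # The periodic check found a new drive at the failed drive's
        # channel and ID.
        return "rebuild onto replacement drive"
    return "keep checking failed drive's channel and ID"


if __name__ == "__main__":
    print(automatic_rebuild_action(False, True, False, False))
    print(automatic_rebuild_action(False, False, True, True))
```

A local spare that is added while the controller is still checking corresponds to calling the function again with the first argument set to True, which yields a rebuild onto that spare.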

Manual Rebuild

When a user applies forced-manual rebuild, the controller first determines whether there is a local spare drive assigned to the logical drive. If a local spare drive is available, the controller automatically starts to rebuild onto the spare drive.

If no local spare drive is available, the controller searches for a global spare drive. If there is a global spare drive, the controller begins to rebuild the logical drive immediately. FIGURE 14-2 illustrates this manual rebuild process.

If neither a local spare nor a global spare drive is available, the controller monitors the channel and ID of the failed drive. After the failed drive has been replaced with a healthy one, the controller begins to rebuild the logical drive onto the new drive. If no drive is available for rebuilding, the controller does not attempt to rebuild until the user applies another forced-manual rebuild.

FIGURE 14-2 Manual Rebuild

Concurrent Rebuild in RAID 1+0

RAID 1+0 allows multiple-drive failure and concurrent multiple-drive rebuild. Newly installed drives must be scanned and configured as local spares. These drives are rebuilt at the same time; you do not need to repeat the rebuilding process for each drive.

Modifying Drive-Side Parameters

There are a number of interrelated drive-side configuration parameters you can set using the "view and edit Configuration parameters" menu option. It is possible to encounter undesirable results if you experiment with these parameters. Only change parameters when you have good reason to do so.

See Drive-Side Parameters Menu for cautions about changing sensitive drive-side parameter settings. In particular, do not set Periodic SAF-TE and SES Device Check Time to less than one second, and do not set Drive I/O Timeout to anything less than 30 seconds for FC or SATA arrays.
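
If you record planned settings before applying them, the two limits named above can be checked with a small script. The following Python sketch is a minimal example. The function name and parameter names are hypothetical, values are assumed to be expressed in seconds, and a disabled setting is not modeled.

```python
# Illustrative sketch only; names are hypothetical and values are in seconds.
def check_drive_side_settings(array_type, safte_ses_check_time_s,
                              drive_io_timeout_s):
    """Return a list of warnings for settings below the limits in this section."""
    warnings = []
    if safte_ses_check_time_s < 1:
        warnings.append(
            "Periodic SAF-TE and SES Device Check Time is less than one second")
    if array_type.upper() in ("FC", "SATA") and drive_io_timeout_s < 30:
        warnings.append("Drive I/O Timeout is less than 30 seconds")
    return warnings


if __name__ == "__main__":
    for warning in check_drive_side_settings("SATA", 0.5, 7):
        print("WARNING:", warning)
```

An empty list means that neither limit is violated. It does not mean that the settings are otherwise appropriate for your array.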

Additional Troubleshooting Information

For additional troubleshooting tips, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array, and to the release notes for your array.

^[1] When a Sun StorEdge 3310 SCSI array or Sun StorEdge 3320 SCSI array is powered up, it can take approximately 30-40 seconds before the drive speed is displayed correctly. Before that happens, the drive speed can display as Async.