CHAPTER 14
Maintaining Your Array
Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see hardware-related maintenance and troubleshooting information.
This chapter covers the following firmware-oriented maintenance and troubleshooting topics:
The battery LED (on the far right side of the I/O controller module) is amber if the battery is bad or missing. The LED blinks green if the battery is charging and is solid green when the battery is fully charged.
Battery status is displayed at the top of the initial firmware screen. BAT: status displays somewhere in the range from BAD to ----- (charging) to +++++ (fully charged).
For maximum life, lithium ion batteries are not recharged until the charge level is very low, indicated by a status of -----. Automatic recharging at this point takes very little time.
A battery module whose status shows one or more + signs can support cache memory for 72 hours. As long as one or more + signs are displayed, your battery is performing correctly.
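As a minimal sketch of how the BAT: field described above could be interpreted by a monitoring script, the following Python functions classify a status string. The function names are hypothetical, and the sketch assumes the gauge is always five characters made up of + and - signs; the firmware itself provides no such API.

```python
def describe_battery_status(bat: str) -> str:
    """Interpret the value shown after 'BAT:' on the initial firmware screen."""
    value = bat.strip()
    if value.upper() == "BAD":
        return "bad or missing"
    # Assumption: the gauge is always five characters, each '+' or '-'.
    if len(value) != 5 or set(value) - {"+", "-"}:
        raise ValueError(f"unrecognized battery status: {bat!r}")
    plus_signs = value.count("+")
    if plus_signs == 0:
        return "charging"
    if plus_signs == 5:
        return "fully charged"
    return "partially charged"


def battery_supports_cache(bat: str) -> bool:
    """One or more '+' signs means the battery can support cache memory."""
    return "+" in bat
```

For example, `battery_supports_cache("+----")` is true, while a status of `-----` or `BAD` is not.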
Your lithium ion battery should be changed every two years if the unit is continuously operated at 77 degrees F (25 degrees C). If the unit is continuously operated at 95 degrees F (35 degrees C) or higher, the battery should be changed every year. The shelf life of your battery is three years.
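The replacement schedule above can be sketched as a small date calculation. This is an illustration only: the function names and the in-service date are hypothetical, and because the guidance does not cover temperatures between 25 and 35 degrees C, the sketch makes the conservative assumption that anything above 25 degrees C uses the one-year interval.

```python
from datetime import date


def replacement_interval_years(operating_temp_c: float) -> int:
    """Return the battery replacement interval in years."""
    # Stated guidance: two years at 25 C, one year at 35 C or higher.
    # Assumption: temperatures between 25 C and 35 C are treated as one year.
    if operating_temp_c <= 25.0:
        return 2
    return 1


def replacement_due(in_service: date, operating_temp_c: float) -> date:
    """Return the date by which the battery should be changed."""
    years = replacement_interval_years(operating_temp_c)
    try:
        return in_service.replace(year=in_service.year + years)
    except ValueError:
        # February 29 in a non-leap target year: fall back to February 28.
        return in_service.replace(year=in_service.year + years, day=28)
```

A battery placed in service on `date(2009, 1, 15)` in a 40 degree C environment would be due on `date(2010, 1, 15)`.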
Refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array to see your array's acceptable operating and nonoperating temperature ranges.
For information about the date of manufacture and how to replace the battery module, refer to the Sun StorEdge 3000 Family FRU Installation Guide.
In write-back mode, unfinished writes are cached in memory. If power to the array is discontinued, data stored in the cache memory is not lost, because the battery module maintains the cache. Battery modules can support cache memory for several days.
Write cache is not automatically disabled when the battery is offline due to battery failure or a disconnected battery, but you can set an event trigger to make this happen. See Configuring the Battery Backup (BBU) Low Event or BBU Failed Event Trigger for more information.
The status windows used to monitor and manage the array are described in the following sections:
To check and configure logical drives, from the Main Menu, choose "view and edit Logical drives" and press Return.
The status of all logical drives is displayed.
TABLE 14-2 shows definitions and values for logical drive parameters.
To handle failed, incomplete, or fatal failure status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.
To check and configure physical drives, from the Main Menu, choose "view and edit Drives" and press Return.
The Physical Drive Status table is displayed with the status of all physical drives in the array.
xxMB: Maximum synchronous transfer rate of this drive.
ASYNC: The drive is using asynchronous mode.
The logical drive is in the process of copying from another drive.
Local spare drive or global spare drive. If the drive is a local spare, the LG_DRV column displays the drive number of the logical drive to which the spare is assigned. If the drive is a global spare, the LG_DRV column displays GLOBAL.
NEW: The new drive has not been configured to any logical drive or as a spare drive.
USED: The drive was previously configured as part of a logical drive from which it has been removed; it still contains data from that logical drive.
FRMT: The drive has been formatted with reserved space allocated for controller-specific information.
Drive slot is not occupied or the drive is defective and cannot be detected.
A physical drive has a USED status when it was once part of a logical drive but no longer is. This can happen, for instance, when a drive in a RAID 5 array is replaced by a spare drive and the logical drive is rebuilt with the new drive. If the removed drive is later reinstalled in the array and scanned, the drive status is identified as USED because the drive still has data on it from a logical drive.
When a logical drive is deleted properly, this user information is erased and the drive status is shown as FRMT rather than USED. A drive with FRMT status has been formatted with either 64 KB or 256 MB of reserved space for storing controller-specific information, but has no user data on it.
If you remove the reserved space using the "view and edit Drives" menu, the drive status changes to NEW.
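The relationship between the USED, FRMT, and NEW statuses described above can be summarized in a short sketch. The function and its two boolean inputs are hypothetical simplifications for illustration; the firmware determines drive status internally and reports many other statuses that are not modeled here.

```python
def unassigned_drive_status(has_reserved_space: bool,
                            has_logical_drive_data: bool) -> str:
    """Classify a drive that is not currently part of any logical drive."""
    if not has_reserved_space:
        # Removing the reserved space returns the drive to NEW.
        return "NEW"
    if has_logical_drive_data:
        # Data left over from a former logical drive marks the drive USED.
        return "USED"
    # Reserved space only, no user data.
    return "FRMT"
```

Under these assumptions, a drive pulled from a rebuilt RAID 5 logical drive and later rescanned reports USED, and the same drive reports NEW once its reserved space is removed.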
To replace BAD drives, or if two drives show BAD and MISSING status, refer to the Sun StorEdge 3000 Family Installation, Operation and Service Manual for your array.
Note - If a drive is installed but not listed, the drive might be defective or installed incorrectly.
To check and configure channels, from the Main Menu, choose "view and edit channelS" and press Return.
The Channel Status table is displayed with the status of all channels on the array.
Caution - Do not change the PID and SID values of drive channels.
From time to time, firmware upgrades are made available as patches. Check the release notes for your array to find out the current patch IDs available for your array.
You can download RAID controller firmware patches from SunSolve Online, located at:
Each patch applies to one or more particular pieces of firmware, including:
SunSolve has extensive search capabilities that can help you find these patches, as well as regular patch reports and alerts to let you know when firmware upgrades and other patches become available. In addition, SunSolve provides reports about bugs that have been fixed in patch updates.
Each patch includes an associated README text file that provides detailed instructions about how to download and install that patch. In general, all firmware downloads follow the same steps:
Note - For instructions on how to download firmware to disk drives in a JBOD directly attached to a host, refer to the README file in the patch that contains the firmware.
1. Once you have determined that a patch is available to update firmware on your array, make note of the patch number or use SunSolve Online's search capabilities to locate and navigate to the patch.
2. Read the README text file associated with that patch for detailed instructions on downloading and installing the firmware upgrade.
3. Follow those instructions to download and install the patch.
It is important that you run a version of firmware that is supported by your array. Before updating your firmware, make sure that the version of firmware you want to use is supported by your array.
Refer to the release notes for your array for Sun Microsystems patches containing firmware upgrades that are available for your array. Refer to SunSolve Online for subsequent patches containing firmware upgrades.
If you are downloading a Sun patch that includes a firmware upgrade, the README file associated with that patch tells you which Sun StorEdge 3000 family arrays support that firmware release.
To download new versions of controller firmware, or SES and PLD firmware, use one of the following tools:
Note - Do not use both in-band and out-of-band connections at the same time to manage the array. You might cause conflicts between multiple operations.
The following firmware upgrade features apply to the controller firmware:
When downloading is performed on a dual-controller system, firmware is flashed onto both controllers without interrupting host I/O. When the download process is complete, the primary controller resets and lets the secondary controller take over the service temporarily. When the primary controller comes back online, the secondary controller hands over the workload and then resets itself for the new firmware to take effect. The rolling upgrade is automatically performed by controller firmware, and the user's intervention is not necessary.
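The ordering of the rolling upgrade can be illustrated with a small simulation. This is a sketch under stated assumptions, not the controller firmware's implementation: the `Controller` class, its fields, and the version strings are hypothetical. The check inside `record` expresses the property described above, that at least one controller stays online at every step so host I/O is not interrupted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Controller:
    role: str
    running: str   # firmware version currently executing
    flashed: str   # firmware version stored in flash
    online: bool = True

    def go_offline_for_reset(self) -> None:
        self.online = False

    def come_back_online(self) -> None:
        # After a reset the controller runs whatever image is in flash.
        self.running = self.flashed
        self.online = True


def rolling_upgrade(primary: Controller, secondary: Controller,
                    new_version: str) -> List[str]:
    log: List[str] = []

    def record(step: str) -> None:
        # At least one controller must be online at every step.
        if not (primary.online or secondary.online):
            raise RuntimeError("no controller online at step: " + step)
        log.append(step)

    primary.flashed = secondary.flashed = new_version
    record("firmware flashed onto both controllers")
    primary.go_offline_for_reset()
    record("primary resets; secondary takes over service")
    primary.come_back_online()
    record("primary back online; secondary hands over the workload")
    secondary.go_offline_for_reset()
    record("secondary resets")
    secondary.come_back_online()
    record("both controllers running the new firmware")
    return log
```

Calling `rolling_upgrade` with two controllers leaves both reporting the new version in `running`, and the returned list gives the steps in the order they occurred.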
A controller that replaces a failed unit in a dual-controller system often has a newer release of the firmware installed than the firmware in the controller it replaced. To maintain compatibility, the surviving primary controller automatically updates the firmware running on the replacement secondary controller to the firmware version of the primary controller.
Note - When you upgrade your controller firmware in the Solaris operating system, the format(1M) command still shows the earlier revision level.
When you replace an I/O controller, the new controller might have a version of SES or PLD firmware different from the other controller in your array. If this mismatch occurs, you hear an audible alarm and see a blinking amber Event LED when you install the controller.
To synchronize the SES firmware and hardware PLD versions, you must download new SES firmware through Sun StorEdge Configuration Service or the Sun StorEdge CLI.
If you have not installed this software, you must install it from the software CD that shipped with your array. Refer to the Sun StorEdge 3000 Family Configuration Service User's Guide for your array to see instructions for downloading firmware for devices. Refer to the Sun StorEdge 3000 Family CLI User's Guide or the sccli(1M) man page for similar instructions for using the Sun StorEdge CLI. Refer to the release notes for your array for instructions about where to obtain the firmware that you need to download.
When you open Sun StorEdge Configuration Service or the Sun StorEdge CLI and connect to the array, an error message alerts you to the mismatched version problem.
For hardware troubleshooting information, refer to the Installation, Operation and Service Manual for your array. For additional troubleshooting tips, refer to the release notes for your array.
Controller failure symptoms include:
A Bus Reset Issued warning message is displayed for each of the channels. In addition, a Redundant Controller Failure Detected alert message is displayed.
If one controller in the redundant controller configuration fails, the surviving controller takes over for the failed controller.
A failed controller is managed by the surviving controller, which disables and disconnects from its counterpart while gaining access to all the signal paths. The surviving controller then manages the ensuing event notifications and takes over all processes. The surviving controller is always the primary controller regardless of its original status, and any replacement controller afterward assumes the role of the secondary controller.
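The role assignment rule described above can be sketched in a few lines. The function names and the controller labels are hypothetical and serve only to illustrate the rule that the survivor is always primary and a replacement always joins as secondary.

```python
def fail_over(roles: dict, failed: str) -> dict:
    """Return the role map after `failed` drops out of a redundant pair."""
    if failed not in roles or len(roles) != 2:
        raise ValueError("expected a redundant pair containing the failed controller")
    # The survivor becomes primary regardless of its original role.
    return {name: "primary" for name in roles if name != failed}


def add_replacement(roles: dict, replacement: str) -> dict:
    """Return the role map after a replacement controller is installed."""
    updated = dict(roles)
    # Any replacement controller assumes the secondary role.
    updated[replacement] = "secondary"
    return updated
```

For example, if the original primary fails, the former secondary becomes primary, and the controller installed in the failed unit's place becomes secondary rather than reclaiming the primary role.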
The failover and failback processes are completely transparent to hosts.
Controllers are hot-swappable if they are in a redundant configuration. Replacing a failed controller takes only a few minutes. Since the I/O connections are on the controllers, you might experience some unavailability between the time when cables on the failed controller are disconnected and the time when a new controller is installed and its cables are connected.
To maintain your redundant controller configuration, replace a failed controller as soon as possible. For details, refer to the Sun StorEdge 3000 Family FRU Installation Guide.
By default, all RAID arrays are preconfigured with one or two logical drives. For a logical drive to be visible to the host server, its partitions must be mapped to host LUNs. For mapping details, see Mapping a Partition to a Host LUN for SCSI arrays or LUN Mapping for FC and SATA arrays.
To make the mapped LUNs visible to a specific host, perform any steps required for your operating system. Refer to the Installation, Operation and Service Manual for your array to see host-specific information about different operating systems.
This section describes automatic and manual procedures for rebuilding logical drives. The time required to rebuild a logical drive is determined by the size of the logical drive, the I/O that is being processed by the controller and the array's Rebuild Priority setting. With no I/O being processed, the time required to build a 2-Tbyte RAID 5 logical drive can be approximately:
Note - As disks fail and are replaced, the rebuild process regenerates the data and parity information that was on the failed disk. However, the NVRAM configuration file that was present on the disk is not re-created. After the rebuild process is complete, restore your configuration as described in Restoring Your Configuration (NVRAM) From Disk.
Rebuild with Spare. When a member drive in a logical drive fails, the controller first determines whether there is a local spare drive assigned to the logical drive. If there is a local spare drive, the controller automatically starts to rebuild the data from the failed drive onto the spare.
If there is no local spare drive available, the controller searches for a global spare drive. If there is a global spare, the controller automatically uses the global spare to rebuild the logical drive.
Failed Drive Swap Detect. If neither a local spare drive nor a global spare drive is available, and Periodic Auto-Detect Failure Drive Swap Check Time is disabled, the controller does not attempt to rebuild unless you apply a forced-manual rebuild.
To enable Periodic Auto-Detect Failure Drive Swap Check Time, perform the following steps:
1. From the Main Menu, choose "view and edit Configuration parameters → Drive-side Parameters → Periodic Auto-Detect Failure Drive Swap Check Time".
A list of check time intervals is displayed.
2. Select a Periodic Auto-Detect Failure Drive Swap Check Time interval.
A confirmation message is displayed.
When Periodic Auto-Detect Failure Drive Swap Check Time is enabled (that is, when a check time interval has been selected), the controller detects whether the failed drive has been replaced by checking the failed drive's channel and ID. Once the failed drive has been replaced, the rebuild begins immediately.
Note - This feature requires system resources and can impact performance.
If the failed drive is not replaced but a local spare is added to the logical drive, the rebuild begins with the spare.
FIGURE 14-1 illustrates this automatic rebuild process.
When a user applies forced-manual rebuild, the controller first determines whether there is a local spare drive assigned to the logical drive. If a local spare drive is available, the controller automatically starts to rebuild onto the spare drive.
If no local spare drive is available, the controller searches for a global spare drive. If there is a global spare drive, the controller begins to rebuild the logical drive immediately. FIGURE 14-2 illustrates this manual rebuild process.
If neither a local spare nor a global spare drive is available, the controller monitors the channel and ID of the failed drive. After the failed drive has been replaced with a healthy one, the controller begins to rebuild the logical drive onto the new drive. If no drive is available for rebuilding, the controller does not attempt to rebuild until the user applies another forced-manual rebuild.
RAID 1+0 allows multiple-drive failure and concurrent multiple-drive rebuild. Newly installed drives must be scanned and configured as local spares. These drives are rebuilt at the same time; you do not need to repeat the rebuilding process for each drive.
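The automatic and forced-manual rebuild procedures share the same order of preference, which can be sketched as a single decision function. The function name, its parameters, and the returned labels are hypothetical; the sketch only restates the decision order described in this section and does not model the rebuild itself.

```python
from typing import Optional


def rebuild_target(local_spare: bool,
                   global_spare: bool,
                   failed_drive_replaced: bool,
                   swap_check_enabled: bool = False,
                   forced_manual: bool = False) -> Optional[str]:
    """Return the drive the controller rebuilds onto, or None if it waits."""
    # A local spare assigned to the logical drive is always preferred.
    if local_spare:
        return "local spare"
    # Otherwise a global spare is used if one exists.
    if global_spare:
        return "global spare"
    # With no spare, the controller checks the failed drive's channel and ID
    # only when the periodic swap check is enabled or a forced-manual
    # rebuild has been applied.
    if (swap_check_enabled or forced_manual) and failed_drive_replaced:
        return "replacement drive"
    return None
```

With no spares, a replaced drive, and the swap check disabled, the function returns `None`, matching the case in which the controller does nothing until a forced-manual rebuild is applied.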
There are a number of interrelated drive-side configuration parameters you can set using the "view and edit Configuration parameters" menu option. It is possible to encounter undesirable results if you experiment with these parameters. Only change parameters when you have good reason to do so.
See Drive-Side Parameters Menu for cautions about changing sensitive drive-side parameter settings. In particular, do not set Periodic SAF-TE and SES Device Check Time to less than one second, and do not set Drive I/O Timeout to anything less than 30 seconds for FC or SATA arrays.
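A planned configuration could be checked against the two limits above before it is entered in the firmware menus. This is a minimal sketch: the function name, parameter names, and array type strings are assumptions made for illustration, and the firmware offers no such validation call.

```python
from typing import List


def check_drive_side_parameters(array_type: str,
                                device_check_time_s: float,
                                drive_io_timeout_s: float) -> List[str]:
    """Return warnings for settings that fall below the documented limits."""
    warnings = []
    # Periodic SAF-TE and SES Device Check Time: not less than one second.
    if device_check_time_s < 1.0:
        warnings.append(
            "Periodic SAF-TE and SES Device Check Time is less than one second")
    # Drive I/O Timeout: not less than 30 seconds for FC or SATA arrays.
    if array_type.upper() in {"FC", "SATA"} and drive_io_timeout_s < 30.0:
        warnings.append("Drive I/O Timeout is less than 30 seconds")
    return warnings
```

An empty list means neither limit is violated; for example, `check_drive_side_parameters("FC", 30.0, 30.0)` returns `[]`.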
For additional troubleshooting tips, refer to the Installation, Operation and Service Manual for your array, and to the release notes for your array.
Copyright © 2009, Dot Hill Systems Corporation. All rights reserved.