Managing System Hardware and Data

CHAPTER 5

This chapter describes how to start and stop hives, cells, and nodes, as well as how to remove all data from the 5800 system. It contains the following sections:

Starting and Stopping System Components

System Performance and Capacity Impact after Disks or Nodes are Offline

Recovering From a Power Failure

Deleting All Data From the System

Note - For instructions on accessing the CLI commands and GUI functions described in this chapter, see Using the Administrative Interfaces.

Starting and Stopping System Components

To perform administrative actions on the hardware, you might need to shut down or reboot cells.

Caution - For best results, before shutting down or rebooting a cell, shut down any applications that store data to or retrieve data from that cell, and keep them shut down until you have completed the maintenance action on the cell.

Caution - After you reboot a cell, make sure that the query engine status, as reported by the sysstat command, is HAFaultTolerant before resuming applications that store data to or retrieve data from the cell. See sysstat for more information on the sysstat command.
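The check described in the caution above can be scripted. The following is a minimal sketch, assuming that you have already captured the text printed by sysstat (for example, from a saved CLI session) and that the status appears on a line containing "Query Engine Status:", as in the sysstat example in this chapter. The function names are hypothetical and are not part of the 5800 system software.

```python
import re


def query_engine_status(sysstat_output):
    """Return the query engine status reported in sysstat text, or None.

    Assumes the "Query Engine Status: <value>" wording shown in the
    sysstat example in this chapter.
    """
    match = re.search(r"Query Engine Status:\s*(\S+)", sysstat_output)
    return match.group(1) if match else None


def safe_to_resume_applications(sysstat_output):
    """True only when the reported status is exactly HAFaultTolerant."""
    return query_engine_status(sysstat_output) == "HAFaultTolerant"
```

If the status line is missing, the sketch treats the cell as not ready, which errs on the side of keeping applications stopped.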

To Shut Down a Cell Using the CLI

Shut down a cell with the command shutdown --cellid cellid.

For example:

ST5800 $ shutdown --cellid 1
shutdown? [y/N]: n
ST5800 $ shutdown --cellid 1 
shutdown? [y/N]: y 
Connection to hc1-admin closed.

Note - If you want to completely power down a cell (for example, so that you can move a rack), issue the shutdown --all command, which shuts down the service node as well as all the storage nodes in the system. Then, switch all power switches on the front of the rack to the off or 0 position.

To Shut Down a Cell Using the GUI

1. From the navigational panel, choose Cells > Cell <identifier>.

The Cell Summary panel is displayed.

2. From the Cell Operations drop-down list box, choose Shutdown Cell.

3. Click Apply.

A confirmation message asks if you want to continue with shutting down the cell and if you want to shut down the service node as part of the shutdown process.

4. Select the Shutdown service node checkbox to shut down the service node as part of the shutdown process.

5. Click Yes to begin the shutdown process.

To Reboot a Cell Using the CLI

Reboot a cell with the command reboot --cellid cellid.

For example:

ST5800 $ reboot --cellid 1
Reboot? [y/N]: n 
ST5800 $ reboot 
Reboot? [y/N]: y 
Connection to hc1-admin closed.

Note - If you want to reboot the switches and service node along with the storage nodes on the cell, issue the reboot --cellid cellid --all command.

To Reboot a Cell Using the GUI

1. From the navigational panel, choose Cells > Cell <identifier>.

The Cell Summary panel is displayed.

2. From the Cell Operations drop-down list box, choose Reboot Cell.

3. Click Apply.

A confirmation message asks if you want to continue with rebooting the cell and if you want to reboot the service node and switches as part of the reboot process.

4. Select the Reboot service node and switches checkbox to reboot the service node and switches as part of the reboot process.

5. Click Yes to begin the reboot process.

To Power Up a Cell

1. Verify that the system is completely shut down by ensuring that the power switches on the front of the rack are set to the off or 0 position.

2. Switch the black power switches on the front of the rack to the on or 1 position.

3. Wait several minutes.

4. Log in to the CLI and verify that the 5800 system is operational using the hwstat and sysstat commands. (For more information, see hwstat and sysstat.)
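As an illustration of the verification in step 4, the following sketch extracts the node and disk counts from captured sysstat text and compares them with the counts you expect for your cell. It assumes the "N nodes online, M disks online" wording shown in the sysstat example later in this chapter. The function names and the expected counts are hypothetical, and the sketch does not parse hwstat output, whose format is not shown in this chapter.

```python
import re


def online_counts(sysstat_output):
    """Return (nodes_online, disks_online) from sysstat text, or None."""
    match = re.search(r"(\d+) nodes online, (\d+) disks online", sysstat_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cell_fully_online(sysstat_output, expected_nodes, expected_disks):
    """True when the reported counts equal the counts you expect."""
    return online_counts(sysstat_output) == (expected_nodes, expected_disks)
```

For the example output in this chapter, you would call cell_fully_online(text, 8, 32), where 8 and 32 are the counts that the example reports.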

System Performance and Capacity Impact after Disks or Nodes are Offline

The 5800 system includes extensive healing capabilities that allow the system to recover from failed disks or nodes. This healing activity may affect system performance and capacity, as described in this section.

If disks fail and are replaced, or nodes go offline and then back online, you may notice that the amount of space utilized on the system changes. (Use the df command to display space utilization on the system.)

If a disk goes offline, or if a disk that was previously offline comes back online, the resulting healing activity will affect performance of input and output operations to the 5800 system. Performance of these operations may decrease by approximately 30% for the duration of a healing cycle. The sysstat command displays the status of a healing cycle as the Data Reliability Check. For example:

ST5800 $ sysstat 
Cell 0: Online. Estimated Free Space: 7T
8 nodes online, 32 disks online.
Data VIP 10.8.60.104, Admin VIP 10.8.60.103
Data services Online, Query Engine Status: HAFaultTolerant
Data Integrity check last completed at Fri Aug 03 19:51:50 UTC 2007
Data Reliability check last completed at Tue Aug 07 07:52:45 UTC 2007
Query Integrity check last completed at Tue Aug 07 07:52:45 UTC 2007
NDMP status: Backup ready.
ST5800 $

If a disk fails or is replaced, a healing cycle is completed when the last completed date for the Data Reliability Check reflects a date and time after the failure or replacement occurred.
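The comparison described above can be sketched as follows. The sketch assumes that the timestamp format matches the sysstat example in this section ("Tue Aug 07 07:52:45 UTC 2007") and that you record the time of the disk failure or replacement yourself as a timezone-aware UTC value. The function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Wording taken from the sysstat example in this section.
_RELIABILITY_LINE = re.compile(r"Data Reliability check last completed at (.+)")


def last_reliability_check(sysstat_output):
    """Return the last Data Reliability check time as a UTC datetime.

    Returns None if the line is missing or the timestamp does not match
    the assumed format.
    """
    match = _RELIABILITY_LINE.search(sysstat_output)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(
            match.group(1).strip(), "%a %b %d %H:%M:%S UTC %Y"
        )
    except ValueError:
        return None
    return stamp.replace(tzinfo=timezone.utc)


def healing_cycle_complete(sysstat_output, event_time):
    """True when the last check finished after the failure or replacement."""
    completed = last_reliability_check(sysstat_output)
    return completed is not None and completed > event_time
```

A missing or unparsable timestamp is treated as "not complete", so the sketch never reports a finished healing cycle that it cannot confirm.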

The healing cycle can take 12 hours for a single disk failure or up to 36 hours for a node failure. During this period, the system is less fault tolerant than usual. While normally the system can sustain simultaneous failures of any two disks without losing data, during a healing cycle, the system can tolerate the failure of only one additional disk (other than the one for which the system is healing).

If two or more additional disks fail while the system is in the healing cycle caused by the original disk failure, some data may be lost. (The likelihood of so many failures within such a short time period is extremely low, however.)

Note - For best performance, avoid taking disks or nodes offline during a healing cycle, because doing so may create the appearance of data loss.

Recovering From a Power Failure

When power is restored after a power failure, the 5800 system becomes operational automatically without administrator intervention.

Note - You might have to push the power button on the service node to resume power to that node.

It takes approximately two hours from the time that power is restored until the disks come back online and data services are available. Use the hwstat command to verify that all nodes and disks are online. See hwstat for more information about the hwstat command.

After the disks are back online, the query engine is repopulated, which requires a minimum of 12 hours. During repopulation, queries to the data stored on the system might return incomplete results. When the sysstat command returns a status of Query Integrity Established, you can be sure that queries are now returning complete results. (See sysstat for more information about the sysstat command.)
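To plan around these delays, you can estimate the earliest times at which each stage might finish. The following sketch only adds the approximate figures given above (two hours, then a minimum of 12 hours) to the time that power was restored. The names are hypothetical, and the results are planning estimates, not a substitute for checking hwstat and sysstat.

```python
from datetime import datetime, timedelta, timezone

DISKS_ONLINE_AFTER = timedelta(hours=2)       # "approximately two hours"
QUERY_REPOPULATION_MIN = timedelta(hours=12)  # "a minimum of 12 hours"


def recovery_estimates(power_restored):
    """Return (data_services_estimate, earliest_complete_query_results)."""
    data_services = power_restored + DISKS_ONLINE_AFTER
    earliest_queries = data_services + QUERY_REPOPULATION_MIN
    return data_services, earliest_queries


restored = datetime(2007, 8, 7, 0, 0, tzinfo=timezone.utc)  # example value only
services_at, queries_at = recovery_estimates(restored)
```

Because repopulation requires at least 12 hours after the disks are online, the second value is a lower bound: complete query results should not be expected before roughly 14 hours after power is restored.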

Data Availability After Power Loss

No data loss should occur as a result of a power failure. Any client store operations that were in progress at the time of power failure will have failed, but any stored data for which the client received an OID remains securely stored on the 5800 system.

In very rare instances, however, individual fragments of stored objects may become unavailable after the system recovers from the power failure. If three fragments of the same object become unavailable, the system will return an ArchiveException “Error opening fragments for oid” error when a client tries to retrieve the object. In this case, contact Sun service for assistance in restoring the object that has become unavailable.

To determine if any objects have become unavailable as a result of the power loss, wait approximately 12 hours after power is restored, and issue the sysstat command to see if the Data Reliability Check has completed. If the Data Reliability Check is listed as not completed since boot, wait a few more hours and check sysstat again.

When sysstat indicates that the Data Reliability Check has completed, check the external log messages for RecoverLostFrags warnings and errors such as the following:

Sep 4 21:24:37 10.7.224.101 java: [local1.warning] java[1228]: [ID 702911 local1.warning] 286 EXT_WARNING [MgmtServer.monitorDataDoctor] (296.1) Healing Task RecoverLostFrags completed with 10 errors: This may indicate a potential serious problem and should be escalated to a Service Technician.

If you see an error of this type, wait approximately 12 more hours for another healing cycle to be completed. (To determine when a healing cycle has completed, issue the sysstat command and check the timestamp for Data Reliability Check.) Then, check the log messages again for RecoverLostFrags warnings or errors issued around the time that the most recent Data Reliability Check was completed.

If the system consistently issues RecoverLostFrags errors and warnings at the end of every healing cycle, contact Sun service, since there may be some risk of data being unavailable.
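Scanning the external log for these messages can be automated. The following sketch matches only the wording of the sample message shown above ("Healing Task RecoverLostFrags completed with N errors"); other message variants, if they exist, are not covered. The function name is hypothetical.

```python
import re

# Pattern based solely on the sample log message in this section.
_RECOVER_LOST_FRAGS = re.compile(
    r"Healing Task RecoverLostFrags completed with (\d+) errors"
)


def recover_lost_frags_error_counts(log_lines):
    """Return the error count from each matching log line, in order."""
    counts = []
    for line in log_lines:
        match = _RECOVER_LOST_FRAGS.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts
```

If the returned list shows nonzero counts for every healing cycle that you check, that corresponds to the consistent errors described above, and contacting Sun service is the appropriate next step.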

Deleting All Data From the System

You can delete (“wipe”) all data stored on a 5800 system hive. When you perform the wipe operation, all user data is destroyed. The system resets the metadata schema file to the original factory settings, while other settings (such as network settings and passwords) are unaffected.

Note - The option to wipe data from a single cell is not available in a multicell configuration; in a multicell configuration, you must wipe data from all cells simultaneously.

Caution - When you wipe data from the system, the metadata schema file is also reset back to the original factory settings. If you want to save your metadata schema file, be sure to back it up before wiping the data.

To Delete All Data Using the CLI

Delete all data and metadata from the hive with the command wipe.

For example:

ST5800 $ wipe 
Destroy all data and clear the metadata schema? [y/N]: y

To Delete All Data Using the GUI

1. From the navigational panel, choose Cells > Cell <identifier>.

The Cell Summary panel is displayed.

2. From the Cell Operations drop-down list box, choose Wipe Cell (or Wipe All Cells for a multicell system).

3. Click Apply.

A confirmation message asks if you want to continue with removing data and metadata from all cells.

4. Click Yes to begin the wipe process.

Note - In a multicell configuration, you cannot wipe data from a single cell; you must wipe all cells simultaneously.