5 Abnormal Operations





This chapter describes how to manage Prestoserve under abnormal operating conditions. Abnormal operations include cases where the system could not shut down cleanly, or where a disk accelerated by Prestoserve encounters errors or failure.

5.1 Unclean Shutdowns

Prestoserve flushes all cached data back to the real disk whenever the system is halted. The system should therefore be shut down normally before you install a new operating system or new bootblocks. If an abnormal shutdown occurs, some recently written blocks may still be cached in the Prestoserve non-volatile memory, and the boot program cannot find them because it has no knowledge that Prestoserve is present in the system.

A "clean" shutdown results when the system is halted according to the instructions that come with the operating system. It is also possible to shut down the Sun Prestoserve subsystem gracefully using the presto -d command prior to shutting the system down.
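As an illustration of the graceful shutdown described above, the following sketch wraps presto -d in a small shell function that you could call before halting the system. The function name prepare_clean_halt and the PRESTO override variable are hypothetical and are not part of Prestoserve. The sketch also assumes that presto returns a nonzero exit status when it cannot comply, which this guide does not state.

```sh
#!/bin/sh
# Sketch only: take Prestoserve down before halting the system.
# PRESTO is a hypothetical override hook (not part of Prestoserve); it
# defaults to the presto command named in this chapter.
PRESTO=${PRESTO:-presto}

prepare_clean_halt() {
    # presto -d is the graceful shutdown command described above.
    # ASSUMPTION: presto exits with a nonzero status if it cannot comply.
    if "$PRESTO" -d; then
        echo "Prestoserve is down; halt the system as the OS instructions describe."
        return 0
    fi
    echo "presto -d failed; cached data may remain, do not halt yet." >&2
    return 1
}
```

The function only reports whether it is safe to proceed. The halt itself is left to the operator, following the instructions that come with the operating system.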

An "unclean" shutdown results from a power failure, a hardware failure, or an abort to the system monitor (the L1-A, BREAK, or STOP-A sequence) while Sun Prestoserve is not in the DOWN state.

Note - After an unclean shutdown, Prestoserve may contain valuable data which has not been written to disk.

5.1.1 Determining if an SBus Card has Unwritten Cached Data

The "CHECK DATA" button and "DATA OK" LED on the faceplate of the SBus Prestoserve card inform you when the card contains unwritten cached data. When you press the button, the LED lights if there is cached data on the Sun Prestoserve card that has not been written to disk. When the SBus Prestoserve card is removed from the server, or the server is powered down, the LED is powered by two batteries on the card. The batteries can support the LED drain for a few days. However, do not hold the button down for extended periods when the card is not in a powered-up system.

As a standard precaution, always press the "CHECK DATA" button before installing or removing a Sun Prestoserve card and verify that the card is clean (unless you are intentionally removing a card that has unwritten cached data). If a card contains unwritten cached data, you can either follow the procedure described in "Disabling the Batteries" on page 4-4 to discard the data, or see the section below, "Moving a Prestoserve SBus or NVRAM Card Containing Dirty Data," to write the data to disk.

5.1.2 Booting a Different Kernel

Sun Prestoserve uses internal physical device numbers to identify data blocks. If you reconfigure your machine after an unclean shutdown, you run the risk of having cached data blocks flushed to the wrong device. This could happen under the following conditions:

    If you must boot a different kernel and the system was not cleanly shut down, disable Sun Prestoserve by following the procedure in "Disabling the Batteries" on page 4-4 to clear the Sun Prestoserve state. (SunOS 4.x only)

Note - After an unclean shutdown, Prestoserve may contain valuable data which has not been written to disk.

5.1.3 Moving a Prestoserve SBus or NVRAM Card Containing Dirty Data

The Prestoserve device driver keeps the machine ID of the system it is serving in non-volatile memory. If you move a Prestoserve card containing dirty data to a different machine, the Prestoserve driver notices at boot time that the new machine's ID differs from the stored machine ID, and the driver prompts you to select one of these options.

    This solution is useful if you no longer care about the data from the previous system. One scenario for this is when the disk on the previous system was destroyed.

    This solution is useful when you have swapped CPU boards or installed new ID PROMS in your CPU, but the disk configuration remains the same.

    This solution is useful in situations where the original machine had been shut down uncleanly by mistake.

5.1.4 Moving a CPU Board or ID PROM

If you must move the original CPU board or the ID PROM to the new machine along with the Prestoserve card, and you want to discard the data the card contains, clear the Prestoserve buffers manually by disabling the battery for more than five minutes.

5.2 Disk Error Handling

This section describes how Prestoserve manages disk error conditions. Temporary disk failures are those that can be fixed without major repairs, such as a disk being off-line or write-protected. Serious disk failures, such as a head crash, involve significant repair work and may result in data loss.

5.2.1 Temporary Disk Failures

Because Sun Prestoserve caches disk blocks, data written by an application may not reach the disk for some time. If a disk fails while Sun Prestoserve is enabled, the system does not notice the failure until Sun Prestoserve attempts to flush its cache. At that point, Sun Prestoserve enters the ERROR state and immediately attempts to flush its entire cache. If the flush succeeds, Sun Prestoserve leaves the ERROR state. If the cache cannot be flushed, Sun Prestoserve becomes a read-only data cache, and subsequent writes to blocks that are not already in the Sun Prestoserve cache are passed directly through to the real disk driver.

When Sun Prestoserve is in the ERROR state, the new data written to a block already in the Sun Prestoserve cache replaces the existing block. Then, this block is flushed synchronously to the disk to see if the error condition still persists. If the error persists, the application receives the error from the failed write operation. If the write succeeds, Sun Prestoserve leaves the ERROR state once it can successfully flush all of its buffers.

The first time Sun Prestoserve enters the ERROR state, it displays a message that lists the major and minor numbers of the real device. A device-specific error message from the real device driver may already have been displayed. Note that any retries that a disk driver would normally perform in an error condition are still performed for each I/O request issued by Sun Prestoserve.

Sun Prestoserve exits the ERROR state only when it can successfully flush its entire cache to the disk. It attempts to flush its cache only when a request to write a block already in the cache is made and this block is successfully written out to disk. Requests to write blocks not already in the cache are directly passed through to the real disk driver. Thus, Sun Prestoserve is not accelerating requests when in the ERROR state, and Sun Prestoserve may remain in the ERROR state even after the disk problem is corrected.
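The ERROR state behavior described above can be summarized with a toy model. The sketch below is not Prestoserve code. It is a small shell simulation in which every name (STATE, CACHE, DISK_OK, RESULT, presto_flush, presto_write) is hypothetical, and a single DISK_OK flag stands in for the health of the real disk.

```sh
#!/bin/sh
# Toy model of the ERROR-state write path. All names are hypothetical.
STATE=UP      # UP or ERROR
CACHE=""      # space-separated numbers of dirty cached blocks
DISK_OK=1     # 1 = real disk accepts writes, 0 = writes fail
RESULT=""     # what happened to the most recent write

in_cache() {
    case " $CACHE " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Attempt to flush the entire cache to the real disk.
presto_flush() {
    if [ "$DISK_OK" -eq 1 ]; then
        CACHE=""
        STATE=UP          # a complete flush is the only way out of ERROR
        return 0
    fi
    STATE=ERROR
    return 1
}

presto_write() {
    blk=$1
    if [ "$STATE" = UP ]; then
        in_cache "$blk" || CACHE="$CACHE $blk"
        RESULT=cached     # normal accelerated write
        return 0
    fi
    if in_cache "$blk"; then
        # New data replaces the cached copy, then the block is flushed
        # synchronously to probe whether the error persists.
        if [ "$DISK_OK" -eq 1 ]; then
            RESULT=flushed
            presto_flush  # probe succeeded, so flush everything else
            return 0
        fi
        RESULT=error      # the application sees the failed write
        return 1
    fi
    # Block not in the cache: pass straight through to the real disk driver.
    if [ "$DISK_OK" -eq 1 ]; then
        RESULT=passthrough
        return 0
    fi
    RESULT=error
    return 1
}
```

In this model, a pass-through write that succeeds after the disk is repaired leaves STATE at ERROR. Only a write to a block that is already cached triggers the flush attempt that can return the model to UP, which mirrors the reason Sun Prestoserve may remain in the ERROR state after the disk problem is corrected.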

If you locate the cause of the I/O failure and fix it (for example, by disabling the disk's write protection), re-enable Sun Prestoserve so that it can verify the fix and leave the ERROR state. Use this command:

# presto -u

Rebooting the system also causes Sun Prestoserve to attempt to flush its cache.

5.2.2 Serious Disk Failures

Use the information in this section to resolve problems that occur when Sun Prestoserve contains dirty data and an accelerated disk has suffered a failure that cannot be easily repaired.

If you suffer a major I/O failure that necessitates replacing media (and full restores), consider using the presto -R command, which attempts to flush all cached data and then destroys any data that cannot be written back to the disk. presto -R is the only software means available to destroy data cached on Sun Prestoserve. Before you replace a bad disk that cannot be written, you can use presto -R to ensure that disk blocks logically belonging to the bad disk are not flushed onto the new disk. When the new disk has no valid data on it, no damage results from flushing stray blocks to it. presto -R is therefore necessary only when you replace a defective disk with a disk that already contains a filesystem. In this case, presto -R must be run before you replace the disk.

If you are experiencing disk errors, but want to continue running with the faulty disk disabled, follow these steps:

    1. Use presto -R to flush all cached blocks and disable Sun Prestoserve.
    2. Unmount the bad disk.
    3. Use presto -u to enable Sun Prestoserve.

The ERROR state affects all "Presto-ized" disks, so it is necessary to deactivate the defective disk before re-enabling Prestoserve acceleration.
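The three steps above can be sketched as a shell function. This is an illustration only: the function name isolate_bad_disk and the PRESTO and UMOUNT override variables are hypothetical, and the sketch assumes that presto and umount return a nonzero exit status on failure, which this guide does not state for presto.

```sh
#!/bin/sh
# Sketch only: keep running with a faulty accelerated disk taken out of use.
# PRESTO and UMOUNT are hypothetical override hooks, not part of Prestoserve.
PRESTO=${PRESTO:-presto}
UMOUNT=${UMOUNT:-umount}

isolate_bad_disk() {
    mount_point=$1
    if [ -z "$mount_point" ]; then
        echo "usage: isolate_bad_disk mount_point" >&2
        return 2
    fi
    # Step 1: flush what can be flushed and discard the rest.
    # ASSUMPTION: a nonzero exit status means the step failed.
    "$PRESTO" -R || { echo "presto -R failed" >&2; return 1; }
    # Step 2: unmount the bad disk so it is no longer accelerated.
    "$UMOUNT" "$mount_point" || { echo "unmount failed" >&2; return 1; }
    # Step 3: re-enable Prestoserve for the remaining disks.
    "$PRESTO" -u
}
```

The function stops at the first failing step, so presto -u is never run while the defective disk is still mounted. Because presto -R destroys any cached data that cannot be written to disk, run a procedure like this only after you have decided that the data can be lost.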