System Administration Guide: Network Services

Client Recovery in NFS Version 4

The NFS version 4 protocol is a stateful protocol. A protocol is stateful when both the client and the server maintain current information about the following.

When a failure occurs, such as a server crash, the client and the server work together to reestablish the open and lock states that existed prior to the failure.

When a server crashes and is rebooted, the server loses its state. The client detects that the server has rebooted and begins the process of helping the server rebuild its state. This process is known as client recovery, because the client directs the process.

When the client discovers that the server has rebooted, the client immediately suspends its current activity and begins the process of client recovery. When the recovery process starts, a message, such as the following, is displayed in the system error log /var/adm/messages.


NOTICE: Starting recovery server basil.example.company.com

During the recovery process, the client sends the server information about the client's previous state. Note, however, that during this period the client does not send any new requests to the server. Any new requests to open files or set file locks must wait for the server to complete its recovery period before proceeding.

When the client recovery process is complete, the following message is displayed in the system error log /var/adm/messages.


NOTICE: Recovery done for server basil.example.company.com

Now the client has successfully completed sending its state information to the server. However, even though the client has completed this process, other clients might not have completed their process of sending state information to the server. Therefore, for a period of time, the server does not accept any open or lock requests. This period of time, which is known as the grace period, is designated to permit all the clients to complete their recovery.

During the grace period, if the client attempts to open any new files or establish any new locks, the server denies the request with the GRACE error code. On receiving this error, the client must wait for the grace period to end and then resend the request to the server. During the grace period the following message is displayed.


NFS server recovering

Note that during the grace period the commands that do not open files or set file locks can proceed. For example, the commands ls and cd do not open a file or set a file lock. Thus, these commands are not suspended. However, a command such as cat, which opens a file, would be suspended until the grace period ends.

When the grace period has ended, the following message is displayed.


NFS server recovery ok.

The client can now send new open and lock requests to the server.

Client recovery can fail for a variety of reasons. For example, if a network partition exists after the server reboots, the client might not be able to reestablish its state with the server before the grace period ends. When the grace period has ended, the server does not permit the client to reestablish its state because new state operations could create conflicts. For example, a new file lock might conflict with an old file lock that the client is trying to recover. When such situations occur, the server returns the NO_GRACE error code to the client.

If the recovery of an open operation for a particular file fails, the client marks the file as unusable and the following message is displayed.


WARNING: The following NFS file could not be recovered and was marked dead 
(can't reopen:  NFS status 70):  file :  filename

Note that the number 70 is only an example.

If reestablishing a file lock during recovery fails, the following error message is posted.


NOTICE: nfs4_send_siglost:  pid PROCESS-ID lost
lock on server SERVER-NAME

In this situation, the SIGLOST signal is posted to the process. The default action for the SIGLOST signal is to terminate the process.

For you to recover from this state, you must restart any applications that had files open at the time of the failure. Note that the following can occur.

Thus, some processes can access a particular file while other processes cannot.