Solaris Resource Manager 1.3 System Administration Guide

Crash Recovery

There are many concerns for an administrator when a Solaris system has a failure, but there are some additional considerations when a Solaris Resource Manager system is being used. They are:

The limits database may have been corrupted by a disk error or for some other reason
The operation of limdaemon during a system failure, particularly in relation to connect-time usage charging, may be faulty

The following sections discuss these in detail and offer suggestions for handling the situation, where appropriate.

Corruption of the Limits Database

Solaris Resource Manager maintenance of the limits database is robust and corruption is unlikely. However, if corruption does occur, it is of major concern since this database is basic to the operation of Solaris Resource Manager. Any potential corruption should be investigated and, if detected, corrected.

Symptoms

No single symptom can reliably be used to determine whether the limits database has been corrupted, but there are a number of indicators that potentially reflect a corrupted limits database:

The detection of a group loop by the Solaris Resource Manager kernel is a positive indication that the limits database has been corrupted. Group loops are strictly prevented by Solaris Resource Manager, and they can only occur when an sgroup attribute has become corrupted in some way. Refer to Group Loops for more details.
Users seeing the message 'No limits information available' displayed when they attempt to log in, and their logins being rejected. This can occur if the corruption to the limits database causes their flag.real attributes to be cleared, which effectively deletes their lnodes. It will affect not only the lnode which is deleted, but also any lnodes which are orphaned (refer to Orphaned Lnodes for details). Note that the 'No limits information available' message will also appear if no lnode has been created for the account or if the lnode has been intentionally deleted, so it is not a clear indicator that the limits database has been corrupted.
Unrealistic values suddenly appearing in usage or limits attributes. This may cause some users to suddenly hit limits.
Users suddenly complaining of a loss of privilege or unexpected privileges, caused by corruption of privilege flags.

If an administrator suspects that there is corruption in the limits database, the best way to detect it is to use limreport to request a list of lnodes with attributes that should have values within a known range. If values outside that range are reported, corruption has taken place. limreport could also be used to list lnodes which have a clear flag.real. This will indicate accounts in the password map for which no lnode exists.

Correction

When corruption is detected, the administrator should revert to an uncorrupted version of the limits database. If the corruption is limited to a small section of the limits database, the administrator may be able to save the contents of all other lnodes and reload them into a fresh limits database using the limreport and limadm commands. This would be preferable if no recent copy of the limits database is available since the new limits database would now contain the most recent usage and accrue attributes. The procedure for saving and restoring the limits database is documented in Chapter 5, Managing Lnodes. For simple cases of missing lnodes it could be sufficient to just recreate them by using the limadm command.

Connect-Time Loss by `limdaemon`

If limdaemon terminates for any reason, all users currently logged in cease to be charged for any connect-time usage. Furthermore, when limdaemon is restarted, any users logged in will continue to use those terminals free of charge. This is because the daemon relies on login notifications from login to establish a Solaris Resource Manager login session record within the internal structures it uses to calculate connect-time usages. Therefore, whenever it starts, there are no Solaris Resource Manager login sessions established until the first notification is received.

Typically this will not be a problem if limdaemon terminated due to a system crash, since the crash will also cause other processes to terminate. Login sessions would then not be able to recommence until the system is restarted.

If limdaemon terminates for some other reason, the administrator has two choices:

Restart the daemon immediately, and ignore the lost charging of terminal connect-time for users who are already logged in. This could mean that a user has free use of a terminal indefinitely unless identified and logged out.
Bring the system back to single-user mode then return to multi-user mode, thus ensuring that all current login sessions are terminated and users can only log in again after the daemon has been restarted.

Crash Recovery

Corruption of the Limits Database

Symptoms

Correction

Connect-Time Loss by limdaemon

Connect-Time Loss by `limdaemon`