Managing Server States

Introduction

This chapter describes the various states that servers can have, the automatic switchover capability for certain failures, and how you can manage the states of the servers manually.

Understanding Server States

The LSMS has two servers for high availability. Usually, the LSMS is in duplex mode, with one server the active server and the other server in a standby state. In duplex mode, the active server is the master MySQL database server, and the standby server acts as the MySQL slave. Any database changes are made on the active server and are replicated to the standby server.

If the active server is not able to run LSMS functions, the standby server can take over to be the active server. The servers are peers; either server can be the active server, but only one server can be active at a time.

When one server is in ACTIVE state and the other server is not in STANDBY state, the LSMS is in simplex mode. When the LSMS is in simplex mode, the non-ACTIVE server should be brought back to STANDBY state as soon as possible (use the procedure described in “Starting a Server”).

The state of each server is monitored by the LSMS High Availability (HA) utility. Table 6-1 shows the possible states for each server (but only one server at a time can be in the ACTIVE state).

Table 6-1 LSMS Server States

State	Server Status
ACTIVE	Server is online, running the LSMS application, and acts as the MySQL master.
STANDBY	Server is online and participating in database replication. The server is ready to become the active server if automatic switchover is necessary or if manual switchover is performed. The server is not currently running the LSMS application.
UNINITIALIZED "INHIBITED"	Server is online but it is not participating in database replication and no application is running.
Note: Other transitional states may be displayed while a server is changing from one to another of these states.
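
As an illustration of how the two server states determine the operating mode, the following Python sketch classifies a pair of states. The function name `lsms_mode` and the strings it returns are hypothetical and are not part of the LSMS software; the sketch only encodes the rules stated in this section.

```python
# Hypothetical helper for illustration; not part of the LSMS software.
ACTIVE = "ACTIVE"
STANDBY = "STANDBY"
INHIBITED = 'UNINITIALIZED "INHIBITED"'


def lsms_mode(state_a: str, state_b: str) -> str:
    """Classify the LSMS operating mode from the states of the two servers."""
    if state_a == ACTIVE and state_b == ACTIVE:
        # Only one server can be in ACTIVE state at a time.
        return "invalid"
    states = {state_a, state_b}
    if states == {ACTIVE, STANDBY}:
        return "duplex"
    if ACTIVE in states:
        # One server is ACTIVE and the other is not in STANDBY state.
        return "simplex"
    return "no active server"
```

For example, `lsms_mode(ACTIVE, INHIBITED)` returns `"simplex"`, which signals that the non-ACTIVE server should be started as soon as possible.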

Understanding Switchover

Changing active status from one server to another is called switchover. The server on which the LSMS is running at a given time is called the active server. If the other server is in STANDBY state, it is called the standby server. (If the other server is in UNINITIALIZED "INHIBITED" state, the LSMS is said to be running in simplex mode, which means that only one server is currently available to run the LSMS application, and switchover is not possible.) During switchover, the server that was in ACTIVE state changes to UNINITIALIZED "INHIBITED" state and the server that was in STANDBY state changes to ACTIVE state.

What Happens During Switchover?

During a switchover, the following functions occur:

The active server shuts down the LSMS application and transitions to UNINITIALIZED "INHIBITED" state.
The standby server stops replicating the MySQL database.
The standby server starts the LSMS application.

Note:
After switchover the state of the previously active server is UNINITIALIZED "INHIBITED", so this server is not ready to act as a standby server. As soon as possible, perform the procedure described in “Starting a Server” to put this server in STANDBY state.
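
The state changes described above can be sketched as a small transition function. This is a minimal model for illustration only: the function name `switchover` and the dictionary layout are assumptions, and the real LSMS HA utility performs the application shutdown, replication stop, and application start that this sketch leaves out.

```python
# Hypothetical model of the state changes during switchover.
ACTIVE = "ACTIVE"
STANDBY = "STANDBY"
INHIBITED = 'UNINITIALIZED "INHIBITED"'


def switchover(states: dict[str, str]) -> dict[str, str]:
    """Return the server states after a switchover.

    `states` maps each server name to its current state.
    """
    active = [name for name, state in states.items() if state == ACTIVE]
    standby = [name for name, state in states.items() if state == STANDBY]
    if len(active) != 1 or len(standby) != 1:
        # In simplex mode there is no standby server to switch to.
        raise ValueError("switchover requires one ACTIVE and one STANDBY server")
    # The previously active server ends up inhibited, not standby.
    return {active[0]: INHIBITED, standby[0]: ACTIVE}
```

Calling `switchover` a second time on its own result raises `ValueError`, which mirrors the point of the note: until the inhibited server is started again, no further switchover is possible.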

The following items describe the results of a switchover:

Any server-side GUIs (started using the start_mgui command) are terminated. This type of GUI must be restarted manually.
All NPAC associations are terminated and then automatically restarted to connect to the newly active server (for more information, see LSMS Connectivity).
All EMS associations are terminated and then automatically restarted to connect to the newly active server (for more information, see LSMS Connectivity).
The Virtual IP (VIP) address is switched from the previously active server to the newly active server. In all types of network configuration, the VIP address is used for the application network, which is used by the following functions:
- The Service Assurance feature is restarted by the Surveillance feature after the newly active server takes over.
- After directly-connected Query Servers detect a period of inactivity, they attempt to reconnect. The reconnection is made to the newly active server.
- Web-based GUIs (if this feature is enabled).
  
  Note:
  Although it is possible to start a web-based GUI by specifying the server’s specific IP address, it is recommended that web-based GUIs use the VIP address. Any web-based GUIs that do not use the VIP address will terminate during switchover.
  
  Switchover has the following effects on connections on the web-based GUIs that use the VIP address:
  - An alarm that switchover is being initiated is displayed
  - Any user-initiated actions, such as audits or bulk loads, are terminated
  - All web-based GUI sessions automatically reconnect themselves to the newly active server within the GUI refresh interval
  - Until the GUI reconnects, no new GUI notifications will be displayed

For some types of failure on the active server, the LSMS automatically attempts to switch over. If automatic switchover is not possible, or at any time you wish, you can manually switch over to the other server. For more information about switching over, see “Understanding Automatic Switchover” and “Manually Switching Over from the Active Server to the Standby Server”.

What Needs to Happen When Switchover Completes?

When automatic or manual switchover completes, the LSMS is operating in simplex mode, with one server in ACTIVE state and the other server in UNINITIALIZED "INHIBITED" state. Only the server in ACTIVE state is in a condition that is available for running the LSMS application.

As soon as possible, manual intervention is needed to change the state of the non-active server to STANDBY state by performing the procedure described in Starting a Server. When this procedure is performed on a non-active server (while the other server is in ACTIVE state), the following functions are performed:

The MySQL binary logs of the active server are copied to the server being started.
The server being started takes the MySQL slave role and begins database replication.
The server changes to STANDBY state; it is now available if switchover is needed again.

Understanding Automatic Switchover

The LSMS is designed with a number of redundant systems (such as power feeds and CPUs) to enable a server to continue hosting the LSMS application even after some failures. For cases of double-faults or other failure conditions for which there is no designed redundancy, the LSMS is designed to automatically switch over from the active server to the standby server. These failure conditions fall into the following categories:

Hardware-related failures, such as loss of both power feeds, loss of redundant power feeds, loss of memory controller, and so on
Database-related failures, such as failed mysqld process
Network-related failures, if the user has defined certain network interfaces to be critical

Automatic Switchover Due to Hardware-Related Failure

The LSMS HA daemons on the active and standby servers send each other heartbeats once every second. When a server detects a loss of 10 heartbeats in a row, the server concludes that the other server is no longer functional and does the following:

If the active server detects the loss of 10 heartbeats in a row from the standby server, the active server disqualifies the standby server from either automatic or manual switchover and posts the following notification:
```
LSMS4015|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Heartbeat failure
```
Until the standby server returns to STANDBY state, automatic switchover is not possible, and if manual switchover is attempted, the lsmsmgr text interface displays a warning indicating that there is no standby mode and no action is taken.

Figure 6-1 Unable to Switchover to Standby
If the standby server detects the loss of 10 heartbeats in a row from the other server, the standby server transitions to ACTIVE state. The results are the same as those described in “What Happens During Switchover?” The standby server also posts the following notification:
```
LSMS4015|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Heartbeat failure
```
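
The heartbeat rule described in this section (one heartbeat per second, failure declared after 10 consecutive losses) can be sketched as a simple counter. The class name `HeartbeatMonitor` and its interface are hypothetical; the actual LSMS HA daemon is not documented here, so treat this only as a model of the counting logic.

```python
# Hypothetical model of the consecutive-loss rule; not the LSMS HA daemon.
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats from the mate server."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold  # 10 losses in a row, as stated in the text
        self.missed = 0

    def tick(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval.

        Returns True when the mate server is considered no longer functional.
        """
        if heartbeat_received:
            # Any received heartbeat resets the run of losses.
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.threshold
```

Because a received heartbeat resets the counter, nine losses followed by one heartbeat and nine more losses never trigger the failure condition in this model.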

Automatic Switchover Due to Database-Related Failure

Each server monitors itself for accessibility to its database. In addition, the standby server monitors whether the replication process is running and whether its replication of the active server’s database is within a configured threshold (the default is one day).

If a server finds an error in any of these conditions, it posts the following notification:
```
LSMS4007|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - DB repl error
```
In addition, the server does the following:
- If the active server detects that its database is inaccessible, the active server switches over to the standby server and posts the following notifications:
```
LSMS4000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover initiated
```
  If switchover is successful, the following notification is posted:
```
LSMS4001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover complete
```
  If switchover is not successful, the following notification is posted:
```
LSMS4002|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover failed
```
- If the standby server detects that its replication process is not running, its database is inaccessible, or its database is lagging by more than the configured threshold, the standby server transitions to UNINITIALIZED "INHIBITED" state, and posts one of the following notifications, depending on whether the standby server is Server A (the server with the default server name lsmspri) or Server B (the server with the default server name lsmssec):
```
LSMS4013|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Primary inhibited
```
```
LSMS4014|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Secondary inhibited
```
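
The three conditions that cause the standby server to inhibit itself can be expressed as a single check. The following sketch is illustrative only: the function name and parameters are hypothetical, and the only value taken from the text is the default threshold of one day.

```python
from datetime import timedelta

# Default replication lag threshold stated in the text.
DEFAULT_LAG_THRESHOLD = timedelta(days=1)


def standby_should_inhibit(
    replication_running: bool,
    database_accessible: bool,
    replication_lag: timedelta,
    threshold: timedelta = DEFAULT_LAG_THRESHOLD,
) -> bool:
    """Return True if the standby server would leave STANDBY state.

    Hypothetical helper; the inputs stand for the checks the standby
    server performs on itself.
    """
    return (
        not replication_running
        or not database_accessible
        or replication_lag > threshold
    )
```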

Automatic Switchover Due to Network-Related Failure

Users have the option of defining any network interfaces (NPAC, EMS, and/or Application) as critical. For each network interface that the user defines as critical, the user defines one or more IP addresses to be pinged by each server every minute. (For information about how to define a network interface as critical, refer to the Configuration Guide.)

When a network interface is defined as critical, each server pings the first configured IP address every minute. If the ping fails and only one IP address has been defined for that network interface, the interface is considered to have failed. If the interface has additional IP addresses defined, the interface is not considered to have failed until all IP addresses have been pinged with no response.
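
The rule above (an interface fails only when none of its configured addresses responds) can be sketched as follows. The function name `interface_failed` and the `ping` callback are assumptions made for illustration; how the LSMS actually issues its pings is not described here.

```python
from typing import Callable, Sequence


def interface_failed(addresses: Sequence[str], ping: Callable[[str], bool]) -> bool:
    """Return True only if no configured address answers a ping.

    Addresses are tried in configured order, and the check stops at the
    first address that responds. `ping` is a hypothetical callback that
    returns True when the address responds.
    """
    if not addresses:
        # A critical interface is defined with one or more addresses.
        raise ValueError("at least one IP address must be configured")
    for address in addresses:
        if ping(address):
            return False
    return True
```

With a single configured address, one failed ping marks the interface as failed; with several addresses, every one of them must fail first.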

When a network interface is considered to have failed, the server posts one of the following notifications that corresponds to the failed interface:


LSMS2000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - NPAC interface failure


LSMS0001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - EMS interface failure


LSMS4004|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - APP interface failure
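
The notifications shown in this chapter are pipe-delimited, so a monitoring script could split them into fields. The sketch below is an assumption based only on the sample lines in this chapter: the names `Notification` and `parse_notification` are hypothetical, and the third field (shown as xxxxxxx in the samples) is kept as an opaque string because its meaning is not stated here.

```python
from datetime import datetime
from typing import NamedTuple


class Notification(NamedTuple):
    event: str           # event number, for example "LSMS4004"
    timestamp: datetime  # parsed from text such as "14:58 Oct 22, 2005"
    field3: str          # shown as "xxxxxxx" in the samples; meaning not stated
    text: str            # remainder of the line


def parse_notification(line: str) -> Notification:
    """Split one sample-format notification line into its four fields."""
    event, when, field3, text = line.strip().split("|", 3)
    # Month abbreviations are parsed using the default (English) locale.
    timestamp = datetime.strptime(when, "%H:%M %b %d, %Y")
    return Notification(event, timestamp, field3, text)
```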

After the server posts the notification of interface failure, it does the following:

If the active server detects that a critical network interface has failed, the active server determines whether any critical network interfaces are considered to have failed on the standby server:
- If any critical network interfaces are considered to have failed on the standby server, the active server continues in the ACTIVE state; it does not switch over.
- If all critical network interfaces are responding to pings on the standby server, the active server switches over to the standby server and posts the following notifications:
```
LSMS4000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover initiated
```
  If switchover is successful, the following notification is posted:
```
LSMS4001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover complete
```
  If switchover is not successful, the following notification is posted:
```
LSMS4002|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover failed
```
If the standby server detects that a critical network interface has failed, it continues to operate in STANDBY state. Although automatic switchover is not performed in this case, it is still possible to manually switch over to a standby server that has detected a critical network interface failure.
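
The decision the active server makes after a critical interface failure can be summarized in a few lines. This is a hedged sketch: the function name and the use of sets of interface names are assumptions, and it models only the automatic decision, not manual switchover.

```python
def active_should_switch_over(
    active_failed_interfaces: set[str],
    standby_failed_interfaces: set[str],
) -> bool:
    """Decide whether the active server switches over automatically.

    Hypothetical helper; each set holds the names of the critical
    interfaces currently considered failed on that server.
    """
    if not active_failed_interfaces:
        # Nothing has failed on the active server, so there is no trigger.
        return False
    # Switch over only if every critical interface on the standby responds.
    return not standby_failed_interfaces
```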

Managing Server States Manually

The following sections describe how you can manually manage the server states:

Determining the Server Status

Use either of the following to determine the server status:

Using the `lsmsmgr` Interface to Determine the Server Status

Use the following procedure to determine the status of both servers.

Log into either server as the lsmsmgr user.
From the main lsmsmgr interface, select Maintenance, and then LSMS Node Status.

Figure 6-2 LSMS Node Status

In Figure 6-2, the server that was logged into is named lsmspri and its state is ACTIVE; the mate server is named lsmssec and its state is STANDBY.
Press any key to return to the lsmsmgr Maintenance menu.

Using the `hastatus` Command to Determine the Server Status

To use the command line to determine the state of an individual server, perform the following procedure.

Log in as the lsmsadm or lsmsall user to the command line of the server whose state you want to determine.
(For information about logging in, see “Logging In to LSMS Server Command Line”.)
Enter the following command:
$ hastatus
The command line interface displays the status, similar to the following example, and then returns the prompt.
ACTIVE
$
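
If the state needs to be read from a script, the `hastatus` command can be wrapped as in the following sketch. It assumes that `hastatus` prints only the state on a single line, as in the example above, and that the script runs on the LSMS server as a user permitted to run the command. The function names are hypothetical.

```python
import subprocess
from typing import Callable


def _run_hastatus() -> str:
    # Runs the hastatus command described in this procedure. This must be
    # executed on an LSMS server; the output format is assumed from the example.
    completed = subprocess.run(
        ["hastatus"], capture_output=True, text=True, check=True
    )
    return completed.stdout


def server_state(run: Callable[[], str] = _run_hastatus) -> str:
    """Return the state printed by hastatus with surrounding whitespace removed."""
    return run().strip()
```

The `run` parameter exists so that the function can be exercised without an LSMS server, for example `server_state(lambda: "ACTIVE\n")`.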

Manually Switching Over from the Active Server to the Standby Server

When there is a failure on the active server, or at other times for testing, you can use the lsmsmgr interface to manually switch over to the standby server, as described in the following procedure.

Log in as the lsmsmgr user to the active server.
(For information about logging in as lsmsmgr, see “Logging In to LSMS Server Command Line”.)
From the main lsmsmgr interface, select Maintenance, and then Inhibit Node.
If the server you logged into is the ACTIVE server, the lsmsmgr interface displays information that confirms that the local node (the server you logged into) is active and the mate server is available as a standby (which implies that its state is STANDBY).

Figure 6-3 Inhibit Active Node
Ensure that the Yes button is highlighted and press Enter.
A window, as shown in Figure 6-4, displays, but no action is needed.

Figure 6-4 Check Network Status on Standby Node
After the network status on the standby node is checked, a confirmation window displays.

Figure 6-5 Confirm Switchover
Ensure that the Yes button is highlighted and press Enter.
The window shown in Figure 6-6 displays.

Figure 6-6 Manual Switchover In Progress
When the switchover is complete, press any key to continue.

Figure 6-7 Manual Switchover Complete

The server that was previously in STANDBY state is now in ACTIVE state, and the server that was previously in ACTIVE state is now in UNINITIALIZED "INHIBITED" state.

Note:

As soon as possible, perform the procedure described in “Starting a Server” to change the state of the server that is in UNINITIALIZED "INHIBITED" state to STANDBY state so that it is available if automatic switchover is needed or if manual switchover is desired.

Inhibiting a Standby Server

Occasionally (for example, before powering down), it may be necessary to inhibit the standby server.

Note:

Inhibiting the active server results in switchover, as described in “Manually Switching Over from the Active Server to the Standby Server”.

Use the following procedure to inhibit the standby server.

Log in as the lsmsmgr user to the standby server.
(For information about logging in as lsmsmgr, see “Logging In to LSMS Server Command Line”.)
From the main lsmsmgr interface, select Maintenance, and then Inhibit Node.
The lsmsmgr interface displays the window shown in Figure 6-8.

Figure 6-8 Inhibit a Non-Active Server
Ensure that the Yes button is highlighted and press Enter.
While the server is being inhibited, the lsmsmgr interface disappears and the following text is displayed on the command line, where <hostname> is the name of the server:
```
Inhibiting node <hostname>...
```
When the server has been completely inhibited, the lsmsmgr interface appears again. Press any key to continue.

Figure 6-9 Node Successfully Inhibited

The lsmsmgr main menu is displayed again.

Note:

Do not allow this server to remain in UNINITIALIZED "INHIBITED" state any longer than necessary. As soon as possible, perform the procedure described in “Starting a Server” to change the state of the server to STANDBY state so that it is available if automatic switchover is needed or if manual switchover is desired.

Starting a Server

A server in UNINITIALIZED "INHIBITED" state cannot run the LSMS application and is not available as a standby server. Use the following procedure to change the state of a server from UNINITIALIZED "INHIBITED" to a state where it is available to run the LSMS application.

During the starting process on a given server, the LSMS HA utility checks to see if the other server is in ACTIVE state. Therefore, the state of the server at the end of this procedure will be one of the following:

If the other server is not in the ACTIVE state, this server will transition to ACTIVE state.
If the other server was in the ACTIVE state, this server will perform the following functions:
- Copy the MySQL binary logs from the active server
- Take a snapshot of the active server’s database
- Transition to STANDBY state
- Configure its MySQL to be a slave to the active server’s master
- Start performing MySQL replication
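
The two possible outcomes listed above can be sketched as a small planning function. The name `plan_start` and the step strings are illustrative assumptions that restate the list above; the LSMS HA utility performs the real work.

```python
# Steps performed when the other server is already ACTIVE, as listed above.
STANDBY_STEPS = [
    "Copy the MySQL binary logs from the active server",
    "Take a snapshot of the active server's database",
    "Transition to STANDBY state",
    "Configure MySQL as a slave to the active server's master",
    "Start performing MySQL replication",
]


def plan_start(other_server_state: str) -> tuple[str, list[str]]:
    """Return the resulting state of the started server and the steps involved.

    Hypothetical helper for illustration only.
    """
    if other_server_state == "ACTIVE":
        return "STANDBY", list(STANDBY_STEPS)
    # No active mate: the server being started becomes the active server.
    return "ACTIVE", []
```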

Log in as the lsmsmgr user to the appropriate server, depending on the server states, as follows (for information about logging in as lsmsmgr, see “Logging In to LSMS Server Command Line”):
- If both servers are in UNINITIALIZED "INHIBITED" state, log into the server that you want to make active.
  After you have finished this procedure on that server, repeat this procedure for the other server.
- If one server is in ACTIVE state, log into the server that is not active.
  
  Note:
  Do not attempt to change the state of the server while any of the following processes are running on the active server: backups (automatic or manual), running the import command, running the lsmsdb quickaudit command, or creating query server snapshots, all of which use temporary storage space. If you attempt to change the state of the server while any of these processes are running, you may not have enough disk space to complete the process. Since backups can be run automatically, perform the procedure described in “Checking for Running Backups” to ensure that no backups are running.
From the main lsmsmgr interface, select Maintenance, and then Start Node.
The lsmsmgr interface displays the window shown in Figure 6-10.

Figure 6-10 Starting a Server
Ensure that the Yes button is highlighted and press Enter.
While the server is being started, the lsmsmgr interface disappears and text similar to the following is displayed on the command line when this procedure is being performed on a server (lsmssec in this example) in UNINITIALIZED "INHIBITED" state while the other server is in ACTIVE state:
```
LSMS starting up on lsmssec...
Checking status from active mate...
Running status on lsmspri node
Copying DB from active mate. Local node will become standby.
  This may take a while
LSMS shutting down lsmssec...
Syncing mate:/mnt/snap/ to /var/TKLC/lsms/db/
Sync'ed
LSMS starting up on lsmssec...
Unihibiting node lsmssec...
Startup of local node successful

Press enter to continue...
```
Note:
The text that displays is different when this procedure is being performed when both servers were originally in UNINITIALIZED "INHIBITED" state, but the condition when both servers are in UNINITIALIZED "INHIBITED" state happens only during upgrade.
Press any key.
The lsmsmgr main menu is displayed again.

The state of the server will be as described in the beginning of this section. To display the server state, use the procedure described in Determining the Server Status.
