6 Managing Server States
Introduction
This chapter describes the various states that servers can have, the automatic switchover capability for certain failures, and how you can manage the states of the servers manually.
Understanding Server States
The LSMS has two servers for high availability. Usually, the LSMS is in duplex mode, with one server the active server and the other server in a standby state. In duplex mode, the active server is the master MySQL database server, and the standby server acts as the MySQL slave. Any database changes are made on the active server and are replicated to the standby server.
If the active server is not able to run LSMS functions, the standby server can take over to be the active server. The servers are peers; either server can be the active server, but only one server can be active at a time.
When one server is in ACTIVE state and the other server is not in STANDBY state, the LSMS is in simplex mode. When the LSMS is in simplex mode, the non-ACTIVE server should be brought back to STANDBY state as soon as possible (use the procedure described in “Starting a Server”).
The state of each server is monitored by the LSMS High Availability (HA) utility. Table 6-1 shows the possible states for each server (but only one server at a time can be in the ACTIVE state).
Table 6-1 LSMS Server States
State | Server Status |
---|---|
ACTIVE | Server is online, running the LSMS application, and acts as the MySQL master. |
STANDBY | Server is online and participating in database replication. The server is ready to become the active server if automatic switchover is necessary or if manual switchover is performed. The server is not currently running the LSMS application. |
UNINITIALIZED "INHIBITED" | Server is online, but it is not participating in database replication and no application is running. |

Note: Other transitional states may be displayed while a server is changing from one of these states to another.
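The relationship between the two server states and the operating mode can be sketched in a few lines of Python. This is an illustration only: the function name `lsms_mode` and its return strings are assumptions made for this example and are not part of the LSMS software. Only the state names are taken from Table 6-1.

```python
# Illustrative sketch; lsms_mode is a hypothetical helper, not an LSMS command.
ACTIVE = "ACTIVE"
STANDBY = "STANDBY"
INHIBITED = 'UNINITIALIZED "INHIBITED"'


def lsms_mode(state_a: str, state_b: str) -> str:
    """Classify the operating mode from the states of the two servers."""
    if state_a == ACTIVE and state_b == ACTIVE:
        # The text states that only one server can be active at a time.
        raise ValueError("only one server can be in ACTIVE state")
    if ACTIVE not in (state_a, state_b):
        # Assumption: the text does not name this condition.
        return "no active server"
    other = state_b if state_a == ACTIVE else state_a
    return "duplex" if other == STANDBY else "simplex"


print(lsms_mode(ACTIVE, STANDBY))
print(lsms_mode(INHIBITED, ACTIVE))
```

Any state other than STANDBY on the non-active server, including a transitional state, is treated here as simplex mode.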
Understanding Switchover
Changing active status from one server to another is called switchover. The server on which the LSMS is running at a given time is called the active server. If the other server is in STANDBY state, it is called the standby server. (If the other server is in UNINITIALIZED "INHIBITED" state, the LSMS is said to be running in simplex mode, which means that only one server is currently available to run the LSMS application, and switchover is not possible.) During switchover, the server that was in ACTIVE state changes to UNINITIALIZED "INHIBITED" state and the server that was in STANDBY state changes to ACTIVE state.
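The state changes that make up a switchover can be modeled as a small function. The sketch below is a hypothetical illustration, not the LSMS HA utility itself: the function name `switchover` and the dictionary layout are assumptions. The server names `lsmspri` and `lsmssec` are the default names mentioned later in this chapter.

```python
# Illustrative sketch; switchover() is a hypothetical model of the state change.
ACTIVE = "ACTIVE"
STANDBY = "STANDBY"
INHIBITED = 'UNINITIALIZED "INHIBITED"'


def switchover(states: dict) -> dict:
    """Return the server states that result from a switchover."""
    active = [name for name, state in states.items() if state == ACTIVE]
    standby = [name for name, state in states.items() if state == STANDBY]
    if len(active) != 1 or len(standby) != 1:
        # In simplex mode there is no standby server, so switchover is not possible.
        raise RuntimeError("switchover requires one ACTIVE and one STANDBY server")
    result = dict(states)
    result[active[0]] = INHIBITED   # previously active server is inhibited
    result[standby[0]] = ACTIVE     # previously standby server becomes active
    return result


after = switchover({"lsmspri": ACTIVE, "lsmssec": STANDBY})
print(after)
```

Calling `switchover` a second time on the result raises an error, which mirrors the point that the inhibited server must be started again before another switchover is possible.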
What Happens During Switchover?
During a switchover, the following functions occur:
- The active server shuts down the LSMS application and transitions to UNINITIALIZED "INHIBITED" state.
- The standby server stops replicating the MySQL database.
- The standby server starts the LSMS application.
Note:
After switchover the state of the previously active server is UNINITIALIZED "INHIBITED", so this server is not ready to act as a standby server. As soon as possible, perform the procedure described in “Starting a Server” to put this server in STANDBY state.
The following items describe the results of a switchover:
- Any server-side GUIs (started using the start_mgui command) are terminated. This type of GUI must be restarted manually.
- All NPAC associations are terminated and then automatically restarted to connect to the newly active server (for more information, see LSMS Connectivity).
- All EMS associations are terminated and then automatically restarted to connect to the newly active server (for more information, see LSMS Connectivity).
- The Virtual IP (VIP) address is switched from the previously active server to the newly active server. In all types of network configuration, the VIP address is used for the application network, which is used by the following functions:
  - The Service Assurance feature is restarted by the Surveillance feature after the newly active server takes over.
  - After directly-connected Query Servers detect a period of inactivity, they attempt to reconnect. The reconnection is made to the newly active server.
  - Web-based GUIs (if this feature is enabled).
Note:
Although it is possible to start a web-based GUI by specifying the server’s specific IP address, it is recommended that web-based GUIs use the VIP address. Any web-based GUIs that do not use the VIP address will terminate during switchover.
Switchover has the following effects on connections on the web-based GUIs that use the VIP address:
- An alarm that switchover is being initiated is displayed
- Any user-initiated actions, such as audits or bulk loads, are terminated
- All web-based GUI sessions automatically reconnect themselves to the newly active server within the GUI refresh interval
- Until the GUI reconnects, no new GUI notifications will be displayed
For some types of failure on the active server, the LSMS automatically attempts to switch over. If automatic switchover is not possible, or at any time you wish, you can manually switch over to the other server. For more information about switching over, see the following:
- Understanding Automatic Switchover
- Manually Switching Over from the Active Server to the Standby Server
What Needs to Happen When Switchover Completes?
When automatic or manual switchover completes, the LSMS is operating in simplex mode, with one server in ACTIVE state and the other server in UNINITIALIZED "INHIBITED" state. Only the server in ACTIVE state is in a condition that is available for running the LSMS application.
As soon as possible, manual intervention is needed to change the state of the non-active server to STANDBY state by performing the procedure described in Starting a Server. When this procedure is performed on a non-active server (while the other server is in ACTIVE state), the following functions are performed:
- The MySQL binary logs of the active server are copied to the server being started.
- The server being started takes the MySQL slave role and begins database replication.
- The server changes to STANDBY state; it is now available if switchover is needed again.
Understanding Automatic Switchover
The LSMS is designed with a number of redundant systems (such as power feeds and CPUs) to enable a server to continue hosting the LSMS application even after some failures. For cases of double-faults or other failure conditions for which there is no designed redundancy, the LSMS is designed to automatically switch over from the active server to the standby server. These failure conditions fall into the following categories:
- Hardware-related failures, such as loss of both power feeds, loss of redundant power feeds, loss of memory controller, and so on
- Database-related failures, such as failed mysqld process
- Network-related failures, if the user has defined certain network interfaces to be critical
Automatic Switchover Due to Hardware-Related Failure
The LSMS HA daemons on the active and standby servers send each other heartbeats once every second. When a server detects a loss of 10 heartbeats in a row, the server concludes that the other server is no longer functional and does the following:
- If the active server detects the loss of 10 heartbeats in a row from the standby server, the active server disqualifies the standby server from either automatic or manual switchover and posts the following notification:
LSMS4015|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Heartbeat failure
Until the standby server returns to STANDBY state, automatic switchover is not possible, and if manual switchover is attempted, the lsmsmgr text interface displays a warning indicating that there is no standby mode and no action is taken.
Figure 6-1 Unable to Switchover to Standby
- If the standby server detects the loss of 10 heartbeats in a row from the other server, the standby server transitions to ACTIVE state. The results are the same as those described in “What Happens During Switchover?”.
LSMS4015|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Heartbeat failure
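The heartbeat rule described in this section (one heartbeat per second, with the peer declared non-functional after 10 consecutive losses) can be sketched as a simple counter. The class name `HeartbeatMonitor` and its method are hypothetical and chosen for illustration; the actual LSMS HA daemon implementation is not shown in this chapter.

```python
# Illustrative sketch; HeartbeatMonitor is a hypothetical class, not the HA daemon.
from dataclasses import dataclass

HEARTBEAT_INTERVAL_SECONDS = 1   # from the text: one heartbeat every second
MISSED_HEARTBEAT_LIMIT = 10      # from the text: 10 heartbeats lost in a row


@dataclass
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats from the peer server."""

    missed: int = 0

    def tick(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval.

        Returns True once the peer has missed enough consecutive
        heartbeats to be considered no longer functional.
        """
        if heartbeat_received:
            self.missed = 0          # any heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= MISSED_HEARTBEAT_LIMIT


monitor = HeartbeatMonitor()
for second in range(1, 11):
    if monitor.tick(heartbeat_received=False):
        print(f"peer considered down after {second} seconds")
```

Because the count is reset by any received heartbeat, intermittent losses shorter than 10 intervals do not trigger the failure handling in this model.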
Automatic Switchover Due to Database-Related Failure
Each server monitors itself for accessibility to its database. In addition, the standby server monitors whether the replication process is running and whether its replication of the active server’s database is within a configured threshold (the default is one day).
- If a server finds an error in any of these conditions, it posts the following notification:
LSMS4007|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - DB repl error
In addition, the server does the following:
- If the active server detects that its database is inaccessible, the active server switches over to the standby server and posts the following notifications:
LSMS4000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover initiated
If switchover is successful, the following notification is posted:
LSMS4001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover complete
If switchover is not successful, the following notification is posted:
LSMS4002|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover failed
- If the standby server detects that its replication process is not running, its database is inaccessible, or its database is lagging by more than the configured threshold, the standby server transitions to UNINITIALIZED "INHIBITED" state and posts one of the following notifications, depending on whether the standby server is Server A (the server with the default server name lsmspri) or Server B (the server with the default server name lsmssec):
LSMS4013|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Primary inhibited
LSMS4014|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Secondary inhibited
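The database checks described in this section can be summarized as two decision functions, one for each role. This is a minimal sketch under stated assumptions: the function names, parameter names, and returned strings are hypothetical, and the one-day default is the only value taken from the text.

```python
# Illustrative sketch; function and parameter names are hypothetical.
from datetime import timedelta

DEFAULT_LAG_THRESHOLD = timedelta(days=1)  # default stated in the text


def active_db_action(db_accessible: bool) -> str:
    """Action taken by the active server after checking its own database."""
    return "continue" if db_accessible else "switch over to standby"


def standby_db_action(
    db_accessible: bool,
    replication_running: bool,
    replication_lag: timedelta,
    threshold: timedelta = DEFAULT_LAG_THRESHOLD,
) -> str:
    """Action taken by the standby server after checking database and replication."""
    if not db_accessible or not replication_running or replication_lag > threshold:
        return 'transition to UNINITIALIZED "INHIBITED"'
    return "remain STANDBY"


print(active_db_action(db_accessible=False))
print(standby_db_action(True, True, timedelta(hours=30)))
```

In this model a lag exactly equal to the threshold does not inhibit the standby server, following the wording "lagging by more than the configured threshold".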
Automatic Switchover Due to Network-Related Failure
Users have the option of defining any network interfaces (NPAC, EMS, and/or Application) as critical. For each network interface that the user defines as critical, the user defines one or more IP addresses to be pinged by each server every minute. (For information about how to define a network interface as critical, refer to the Configuration Guide.)
When a network interface is defined as critical, each server pings the first configured IP address every minute. If the ping fails and only one IP address has been defined for that network interface, the interface is considered to have failed. If the interface has additional IP addresses defined, the interface is not considered to have failed until all IP addresses have been pinged with no response.
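The ping rule above (an interface is considered failed only when none of its configured IP addresses responds) can be sketched as follows. The function name `interface_failed` and the injected `ping` callable are assumptions made so that the example runs without network access; the addresses are documentation-only examples.

```python
# Illustrative sketch; interface_failed and the ping callable are hypothetical.
from typing import Callable, Sequence


def interface_failed(addresses: Sequence[str], ping: Callable[[str], bool]) -> bool:
    """Return True only if no configured address answers a ping.

    Addresses are tried in configured order, starting with the first;
    checking stops at the first address that responds.
    """
    if not addresses:
        raise ValueError("a critical interface needs at least one IP address")
    for address in addresses:
        if ping(address):
            return False
    return True


# Example with a stand-in ping function instead of real network traffic.
reachable = {"192.0.2.2"}
print(interface_failed(["192.0.2.1"], lambda a: a in reachable))
print(interface_failed(["192.0.2.1", "192.0.2.2"], lambda a: a in reachable))
```

Passing the ping function as a parameter keeps the rule itself separate from how a ping is actually sent.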
When a network interface is considered to have failed, the server posts one of the following notifications that corresponds to the failed interface:
LSMS2000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - NPAC interface failure
LSMS0001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - EMS interface failure
LSMS4004|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - APP interface failure
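The notifications shown in this chapter share a pipe-delimited layout, so a script that reads them could split each line into four fields. The sketch below is hypothetical: the field names `event`, `timestamp`, `source`, and `text` are assumptions, and the meaning of the third field (shown as `xxxxxxx` in the examples) is not stated in this chapter, so it is kept as an opaque string. The month abbreviation is parsed with `%b`, which assumes an English locale.

```python
# Illustrative sketch; the field names are assumptions, not documented LSMS terms.
from datetime import datetime
from typing import NamedTuple


class Notification(NamedTuple):
    event: str           # for example "LSMS4004"
    timestamp: datetime  # parsed from text such as "14:58 Oct 22, 2005"
    source: str          # third field; meaning not stated here, kept as-is
    text: str            # for example "Notify:Sys Admin - APP interface failure"


def parse_notification(line: str) -> Notification:
    """Split one notification line into its four pipe-delimited fields."""
    event, stamp, source, text = line.strip().split("|", 3)
    return Notification(event, datetime.strptime(stamp, "%H:%M %b %d, %Y"), source, text)


n = parse_notification(
    "LSMS4004|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - APP interface failure"
)
print(n.event, n.timestamp.isoformat(), n.text)
```

A line with fewer than four fields raises `ValueError`, which is usually preferable to silently accepting a malformed notification.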
After the server posts the notification of interface failure, it does the following:
- If the active server detects that a critical network interface has failed, the active server determines whether any critical network interfaces are considered to have failed on the standby server:
  - If any critical network interfaces are considered to have failed on the standby server, the active server continues in the ACTIVE state; it does not switch over.
  - If all critical network interfaces are responding to pings on the standby server, the active server switches over to the standby server and posts the following notifications:
LSMS4000|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover initiated
If switchover is successful, the following notification is posted:
LSMS4001|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover complete
If switchover is not successful, the following notification is posted:
LSMS4002|14:58 Oct 22, 2005|xxxxxxx|Notify:Sys Admin - Switchover failed
- If the standby server detects that a critical network interface has failed, it continues to operate in STANDBY state. Although automatic switchover is not performed in this case, it is possible to manually switch over to a standby server that has detected that a critical network interface has failed.
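The decision made after a critical network interface failure can be condensed into one function. This is a sketch only: the function name, parameters, and returned strings are hypothetical, and the case where the LSMS is already in simplex mode is not modeled.

```python
# Illustrative sketch; network_failure_action is a hypothetical helper.
def network_failure_action(role: str, peer_has_failed_interface: bool = False) -> str:
    """Action taken by a server that has detected a failed critical interface.

    role is the state of the detecting server; peer_has_failed_interface
    tells whether any critical interface is also considered failed on the
    other server (only relevant when the detecting server is active).
    """
    if role == "STANDBY":
        # The standby server keeps running; manual switchover remains possible.
        return "remain STANDBY"
    if role == "ACTIVE":
        if peer_has_failed_interface:
            return "remain ACTIVE"
        return "switch over to standby"
    raise ValueError(f"unexpected role: {role}")


print(network_failure_action("ACTIVE", peer_has_failed_interface=False))
print(network_failure_action("ACTIVE", peer_has_failed_interface=True))
print(network_failure_action("STANDBY"))
```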
Managing Server States Manually
The following sections describe how you can manually manage the server states:
Determining the Server Status
Use either of the following to determine the server status:
Using the lsmsmgr Interface to Determine the Server Status
Use the following procedure to determine the status of both servers.
Using the hastatus Command to Determine the Server Status
To use the command line to determine the state of an individual server, perform the following procedure.
Manually Switching Over from the Active Server to the Standby Server
When there is a failure on the active server, or at other times for testing, you can use the lsmsmgr interface to manually switch over to the standby server, as described in the following procedure.
The server that was previously in STANDBY state is now in ACTIVE state, and the server that was previously in ACTIVE state is now in UNINITIALIZED "INHIBITED" state.
Note:
As soon as possible, perform the procedure described in “Starting a Server” to change the state of the server that is in UNINITIALIZED "INHIBITED" state to STANDBY state so that it is available if automatic switchover is needed or if manual switchover is desired.
Inhibiting a Standby Server
Occasionally (for example, before powering down), it may be necessary to inhibit the standby server.
Note:
Inhibiting the active server results in switchover, as described in “Manually Switching Over from the Active Server to the Standby Server”.
Use the following procedure to inhibit the standby server.
Note:
Do not allow this server to remain in UNINITIALIZED "INHIBITED" state any longer than necessary. As soon as possible, perform the procedure described in “Starting a Server” to change the state of the server to STANDBY state so that it is available if automatic switchover is needed or if manual switchover is desired.
Starting a Server
A server in UNINITIALIZED "INHIBITED" state cannot run the LSMS application and is not available as a standby server. Use the following procedure to change the state of a server from UNINITIALIZED "INHIBITED" to a state where it is available to run the LSMS application.
During the starting process on a given server, the LSMS HA utility checks to see if the other server is in ACTIVE state. Therefore, the state of the server at the end of this procedure will be one of the following:
- If the other server is not in the ACTIVE state, this server will transition to ACTIVE state.
- If the other server is in the ACTIVE state, this server will perform the following functions:
  - Copy the MySQL binary logs from the active server
  - Take a snapshot of the active server’s database
  - Transition to STANDBY state
  - Configure its MySQL to be a slave to the active server’s master
  - Start performing MySQL replication
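The outcome of starting a server depends only on the state of the other server, which can be expressed compactly. The sketch below is a hypothetical model: the name `start_server` is an assumption, and the step descriptions are copied from the list above rather than taken from the actual HA utility.

```python
# Illustrative sketch; start_server is a hypothetical model of the start procedure.
STANDBY_STEPS = [
    "Copy the MySQL binary logs from the active server",
    "Take a snapshot of the active server's database",
    "Transition to STANDBY state",
    "Configure MySQL as a slave to the active server's master",
    "Start performing MySQL replication",
]


def start_server(other_server_state: str):
    """Return the resulting state of the started server and the steps it performs."""
    if other_server_state == "ACTIVE":
        return "STANDBY", list(STANDBY_STEPS)
    # No active peer: the started server becomes the active server.
    return "ACTIVE", []


state, steps = start_server("ACTIVE")
print(state, len(steps))
```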
The state of the server will be as described in the beginning of this section. To display the server state, use the procedure described in Determining the Server Status.