Overview of the Membership Service in TimesTen Scaleout

The membership service tracks the status of the data and management instances and resolves inconsistency issues caused by a network partition error.

Tracking the Instance Status

A grid is a collection of instances that reside on multiple hosts that communicate over a single internal network. The membership service knows which instances are active. When each instance starts, it connects to a membership server within the membership service to register itself, as shown in Figure 3-1. If one of the membership servers fails, the instances that were connected to the failed membership server transparently reconnect to one of the available membership servers.

Figure 3-1 Instances Register with the Membership Servers

Description of Figure 3-1 follows
Description of "Figure 3-1 Instances Register with the Membership Servers"

Each instance maintains a persistent connection to one of the membership servers, so that it can query the active instance list. If the network between the membership servers and the instances is down, the instances refuse to perform until the network is fixed and communication is restored with the membership servers.

Figure 3-2 demonstrates how data instances in a grid connect to each other, where each data instance connects to every other data instance in a grid. It also shows how each data instance in this example maintains a persistent connection with one of the membership servers.

Figure 3-2 Data Instances Communicating with Each Other

Description of Figure 3-2 follows
Description of "Figure 3-2 Data Instances Communicating with Each Other"

If a data instance loses a connection to another instance, it queries the active instance list on its membership server to verify if the "lost" instance is up. If the "lost" instance is up, then the data instance makes an effort to re-establish a connection with that instance. Otherwise, to avoid unnecessary delays, no further attempts are made to establish communication to the "lost" instance.

When a "lost" instance restarts, it registers itself with the membership service and proactively informs all other instances in a grid that it is up. When it is properly synchronized with the rest of a grid, the recovered instance is once again used to process transactions from applications.

In Figure 3-3, the host4.instance1 data instance is not up. If the host3.instance1 data instance tries to communicate with the host4.instance1 data instance, it discovers a broken connection. The host3.instance1 data instance queries the active instance list on its membership server, which informs it that the host4.instance1 data instance is not on the active instance list. If the host4.instance1 data instance comes back up, it registers itself again with the membership service, which then includes it in the list of active instances in this grid.

Figure 3-3 Instance Reacts to a Dead Connection

Description of Figure 3-3 follows
Description of "Figure 3-3 Instance Reacts to a Dead Connection"

Recovering from a Network Partition Error

A network partition error splits the instances involved in a single grid into two subsets. With a network partition error, each subset of instances is unable to communicate with the other subset of instances.

Figure 3-4 shows a network partition that would return inconsistent results to application queries without the membership service, since the application could access one subset of instances without being able to contact the disconnected subset of instances. Any updates made to one subset of instances would not be reflected in the other subset. If an application connects to the host1 data instance, then the query returns results from the host1 and host3 data instances; but any data that resides on the host2 and host4 data instances is not available because there is no connection between the two subsets.

Figure 3-4 Network Partition Failure

Description of Figure 3-4 follows
Description of "Figure 3-4 Network Partition Failure"

If you encounter a network partition, the membership service provides a resolution. Figure 3-5 shows a grid with six instances and three membership servers. A network communications error has split a grid into two subsets where host3.instance1 and host4.instance1 no longer know about or communicate with the rest of the instances. In addition, the ms_host1 membership server is not in communication with the other two membership servers.

For the membership service to work properly to manage the status of a grid, there must be a majority of active membership servers of the total servers created that can communicate with each other in order to work properly. If a membership server fails, the others continue to serve requests as long as a majority is available.

For example:

  • A membership service that consists of three membership servers can handle one membership server failure.

  • A membership service of five membership servers can handle two membership server failures.

  • A membership service of six membership servers can handle only two failures since three membership servers are not a majority.

Note:

When you configure the number of membership servers, you should always create an odd number of membership servers to serve as the membership service. If you have an even number of membership servers and a network partition error occurs, then each subset of a grid might have the same number of membership servers where neither side would have a majority. Thus, both sides of the network partitioned grid would stop working.

If the number of remaining membership servers falls below the number needed for a majority, the remaining membership servers refuse all requests until at least a majority of membership servers are running. In addition, data instances that cannot communicate with the membership service cannot run any transactions. You must research the failure issue and restart any failed membership servers.

Because of the communications failure, the ms_host1 membership server does not know about the other two membership servers. Since there are not enough membership servers to constitute a majority, the ms_host1 membership server can no longer accept incoming requests from the host3.instance1 and host4.instance1 data instances. The host3.instance1 and host4.instance1 data instances cannot run any transactions until the failed membership server is restarted.

Figure 3-5 Network Partition with Membership Service

Description of Figure 3-5 follows
Description of "Figure 3-5 Network Partition with Membership Service"

To discover if there may be a network partition, you will see errors in the daemon log about elements losing contact with their membership server.

Once you resolve the connection error that caused your grid to split into two, all of the membership servers reconnect and synchronize the membership information. In our example in Figure 3-5, the ms_host1 membership server rejoins the membership service. After which, the host3.instance1 and host4.instance1 data instances also rejoin this grid as active instances.