Cluster behavior

The Cluster Coordinator ensures that the nodes (represented by the Dgraph processes) in the cluster provide query processing that is stable in the face of individual follower node failures. This topic discusses cluster behavior in various scenarios, such as cluster startup, updates to the data files, and response to a node failure.

Bringing a cluster online

On startup, the following actions take place:
  • Any node can be started in either leader or follower mode. The leader node must be started first. A cluster contains exactly one leader node and any number of follower nodes.
  • If you attempt to start two leader nodes in the same cluster, you will receive an error.
  • If you attempt to start a follower node before the leader node has been started, the follower node will issue an error and will not start until the leader node is started.
  • Once started, each node registers with the Cluster Coordinator that manages the distributed state of the cluster.

    The one node that you do not explicitly designate as a follower is the leader; you designate all other nodes as follower nodes.

  • The leader node determines the current version of the data files and informs the Cluster Coordinator.
  • Once the leader node has been started, the follower nodes can be started.
  • After all follower nodes have started, each of them acquires read-only access to the current version of the data files.
  • Follower nodes do not alter the data files in any way; they continue answering queries based on the version of the data files to which they have read-only access at startup, even if the leader node is in the process of updating, merging, or deleting data files on disk.

    Follower nodes reject all web service and HTTP requests that perform updates (such as admin?op=updateaspell) with a 403 (Forbidden) HTTP status code.
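
The startup rules above can be sketched as a small model. This is an illustration only, not Endeca Server code: the Coordinator class, its method names, and ClusterError are hypothetical, and the sketch simply raises an error where a real follower node would wait for the leader node.

```python
class ClusterError(Exception):
    """Raised when a node is started in violation of the startup rules."""


class Coordinator:
    """Toy stand-in for the Cluster Coordinator's registration bookkeeping.

    All names here are hypothetical; this is not an Endeca Server API.
    """

    def __init__(self):
        self.leader = None        # name of the single leader node, if started
        self.followers = []       # names of registered follower nodes
        self.data_version = None  # version reported by the leader node

    def start_node(self, name, follower=False, data_version=None):
        if not follower:
            # Only one leader node is allowed per cluster.
            if self.leader is not None:
                raise ClusterError("a leader node is already registered")
            self.leader = name
            self.data_version = data_version
            return
        # A follower node cannot start before the leader node.
        if self.leader is None:
            raise ClusterError("follower cannot start before the leader")
        self.followers.append(name)

    def handle_request(self, node, is_update):
        """Return an HTTP-style status code for a request sent to `node`."""
        if is_update and node != self.leader:
            return 403  # follower nodes refuse updating requests
        return 200
```

For example, calling start_node("f1", follower=True) on a fresh Coordinator raises ClusterError, while the same call succeeds after start_node("leader", data_version=1).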

Processing an update

In a cluster, updates to the records in the data files (or any other updates) are sent only to the Oracle Endeca Server that is hosting the leader node Dgraph process.

The leader node processes the update and commits it to the on-disk data files. The Cluster Coordinator informs all follower nodes that a new version of data files is available. The leader node and all follower nodes can continue to use files from the previous version of the data files to finish query processing that had started against that version.

As each node finishes processing queries against the previous version, it releases its references to that version. Once the follower nodes are notified of the new version, they acquire read-only access to the new version and start using it.
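
The version handoff described above can be pictured as reference counting on data file versions. The following is a minimal sketch under that assumption; the VersionedStore class and its methods are hypothetical and do not reflect how the Dgraph process is implemented.

```python
class VersionedStore:
    """Toy model of one node's view of the data file versions."""

    def __init__(self, initial_version):
        self.current = initial_version
        # Maps each version still in use to its number of in-flight queries.
        self.refs = {initial_version: 0}

    def begin_query(self):
        """Pin the current version for the lifetime of a query."""
        version = self.current
        self.refs[version] += 1
        return version

    def end_query(self, version):
        """Release the reference that the query held on its version."""
        self.refs[version] -= 1
        self._release_unused()

    def publish(self, new_version):
        """Switch new queries to a newly announced version."""
        self.refs.setdefault(new_version, 0)
        self.current = new_version
        self._release_unused()

    def _release_unused(self):
        # Drop old versions that no query references any longer.
        unused = [v for v, n in self.refs.items()
                  if n == 0 and v != self.current]
        for version in unused:
            del self.refs[version]

    def live_versions(self):
        return sorted(self.refs)
```

A query that started against version 1 keeps that version alive after version 2 is published; version 1 is released only when that query ends.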

Wrapping updates to the cluster in an outer transaction operation is recommended but optional.

Responding to a node failure

In a cluster, a follower or a leader node may fail:
  • Failure of the leader node. When the leader node goes offline, no updates are processed, and none are propagated to the follower nodes until the leader node comes back online. However, follower nodes continue to maintain a consistent view of the data and to answer queries.

    For updates to resume, the Dgraph process on the leader node must be restarted. If the Dgraph process is running as a service, you can set it to restart automatically. Once restarted, the leader node joins the cluster and becomes operational, and the updates start to propagate from this node to all other nodes.

  • Failure of a follower node. When one of the follower nodes goes offline, it is removed from the cluster. The other nodes do not need to keep track of this event. If the follower node is restarted, it rejoins the cluster; it should be restarted with the same name.
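
The two failure cases above can be summarized in a short model. This is a sketch for illustration only; the Cluster class and its methods are hypothetical names, not part of Endeca Server.

```python
class Cluster:
    """Toy model of which operations remain available after a node failure."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.online = {leader, *followers}  # names of nodes currently online

    def fail(self, node):
        """A node going offline is removed from the cluster."""
        self.online.discard(node)

    def restart(self, node):
        """A restarted node rejoins the cluster under the same name."""
        self.online.add(node)

    def can_update(self):
        """Updates are processed only while the leader node is online."""
        return self.leader in self.online

    def query_nodes(self):
        """Every node that is online keeps answering queries."""
        return sorted(self.online)
```

With a leader node and two follower nodes, fail("leader") makes can_update() return False while query_nodes() still lists both follower nodes; restart("leader") makes updates possible again.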

Responding to network failures of the Cluster Coordinator service

If the network connection between nodes in the cluster and the Cluster Coordinator service fails, the Dgraph processes on the affected nodes shut down.

A node rejoins the cluster once the Dgraph process on the leader node is restarted (this happens automatically if the process runs as a service) and is able to establish a connection with the Cluster Coordinator service.

Responding to a Cluster Coordinator failure

The Endeca Cluster Coordinator service cannot be configured to run as a Windows service. This means that if you are using the Endeca Clustering feature and run the Endeca Server as a Windows service, you should closely monitor the state of the Cluster Coordinator service.

The reason is that if the Endeca Server service crashes, the Windows Service Controller will automatically restart it. However, if the Cluster Coordinator service crashes, then it is not automatically restarted. This can lead to a situation where the Dgraph processes are running but the Cluster Coordinator service is not.
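
One simple way to monitor the Cluster Coordinator service is to check periodically that its port still accepts TCP connections. The sketch below assumes the service listens on a TCP port known to you; the function name is hypothetical, and the host and port must come from your own installation. A successful connection shows only that something is listening on the port, not that the service is healthy.

```python
import socket


def coordinator_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    The host and port are supplied by the caller; no default for the
    Cluster Coordinator service is assumed here.
    """
    try:
        # Open the connection and close it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

A scheduled task could call this function every few minutes and alert an administrator when it returns False, so that the Cluster Coordinator service can be restarted manually.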