How high availability is achieved

This topic discusses how the BDD cluster deployment ensures high availability of query processing.

Important: Because a BDD cluster can contain an arbitrary number of nodes, it provides high availability only if it is deployed on a sufficient number of Dgraph nodes and Hadoop nodes, and at least three ZooKeeper instances are running across the Hadoop nodes. This topic discusses the cluster behavior that enables high availability and notes cases where system administrators must take action to restore services.

The following three sections describe the BDD cluster behavior that provides high availability.

Note: This topic assumes BDD deployments with more than one running instance of the Dgraph. Although you can deploy BDD on a single node, such deployments are suitable only for development environments, because they do not guarantee high availability of query processing in BDD. In a deployment where a single node hosts the only Dgraph instance, a failure of that node shuts down the Dgraph process.

Availability of WebLogic Server nodes hosting Studio

When a WebLogic Server node goes down, the Studio instance it hosts goes down with it. As long as the BDD cluster uses an external load balancer and includes more than one WebLogic Server node running Studio, this does not disrupt Big Data Discovery operations.

If a WebLogic Server node hosting Studio fails, the BDD cluster (through its external load balancer) stops routing requests to it and relies on the remaining Studio nodes until you restart the failed node.
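
An external load balancer typically detects a failed Studio node through a periodic HTTP health check. The following shell sketch illustrates the idea with curl. The host names, the Studio port (assumed here to be 7003, a common default for the Studio managed server), and the /bdd context path are assumptions; a production load balancer performs this kind of probing itself.

    #!/bin/sh
    # Sketch: probe each Studio node the way a load balancer health check would.
    # Host names, port, and context path are assumptions; adjust for your deployment.
    STUDIO_PORT=7003
    for host in studio1.example.com studio2.example.com; do
        if curl -s -f -o /dev/null "http://${host}:${STUDIO_PORT}/bdd"; then
            echo "${host}: up"
        else
            echo "${host}: DOWN -- the load balancer should stop routing to this node"
        fi
    done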

Availability of Dgraph nodes

The ZooKeeper ensemble running on a subset of Hadoop (CDH, HDP, or MapR) nodes ensures high availability of the Dgraph cluster nodes and services:
  • Failure of a leader Dgraph. When the leader Dgraph of a database goes offline, the BDD cluster relies on ZooKeeper and the Dgraph Gateway to elect a new leader, to which it then starts sending updates. During this stage, the follower Dgraphs continue to maintain a consistent view of the data and to answer queries. You should manually restart the failed node with the bdd-admin script. When the Dgraph that had the leader role is restarted and rejoins the cluster, it becomes one of the follower Dgraphs. It is also possible for that Dgraph to rejoin the cluster before the cluster needs to appoint a new leader; in this case, it continues to serve as the leader.
  • Failure of a follower Dgraph. When a follower Dgraph goes offline, the BDD cluster starts routing requests to the other available Dgraphs. You should manually restart the failed node with the bdd-admin script, as shown in the sketch after this list. Once the node is restarted, it rejoins the cluster, and the cluster adjusts its routing information accordingly.
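
The following shell sketch illustrates the manual restart step described above: it checks the status of the Dgraph instances and restarts the Dgraph on a failed node with the bdd-admin script. The component flag (-c dgraph), the node flag (-n), and the host name are assumptions that may vary by BDD release, so verify them against the bdd-admin reference for your version.

    # Sketch only: restart the Dgraph on a failed node with bdd-admin.
    # The -c/-n flags and the host name are assumptions; verify them
    # against the bdd-admin reference for your BDD release.
    cd $BDD_HOME/BDD_manager/bin                 # $BDD_HOME is the BDD install root

    ./bdd-admin.sh status -c dgraph              # identify the failed Dgraph instance
    ./bdd-admin.sh restart -c dgraph -n dgraph2.example.com   # restart it (hypothetical host)
    ./bdd-admin.sh status -c dgraph              # confirm the node rejoined the cluster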

Availability of ZooKeeper instances

The ZooKeeper instances themselves must be highly available. The following statements describe the requirements in detail:
  • Each Hadoop node in the BDD cluster deployment can optionally be configured at deployment time to host a ZooKeeper instance. To ensure availability of ZooKeeper instances, deploy them in a cluster of their own, known as an ensemble, on a subset of the Hadoop nodes. As long as a majority of the ensemble is running, the BDD cluster can use its ZooKeeper services. Because ZooKeeper requires a majority, the optimal number of Hadoop nodes hosting ZooKeeper instances is an odd number of at least three (see the quorum check sketch after this list).
  • A Hadoop node hosting a ZooKeeper instance assumes responsibility for ensuring the uptime of the ZooKeeper process. It starts ZooKeeper when BDD is deployed and restarts it should it stop running.
  • If you configure fewer than three Hadoop nodes to run ZooKeeper, ZooKeeper becomes a single point of failure. Should it fail, the data sets served by BDD become entirely unavailable. To recover, you must restart or replace the Hadoop node that was running the failed ZooKeeper instance (the required action depends on the nature of the failure).
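
To make the majority requirement concrete: an ensemble of n instances tolerates the loss of floor((n-1)/2) of them, so three instances tolerate one failure and five tolerate two. The following shell sketch probes each ensemble member with ZooKeeper's standard ruok four-letter command (a live server replies imok) and compares the number of healthy instances against the quorum threshold. The host names are assumptions; 2181 is the default ZooKeeper client port.

    #!/bin/sh
    # Sketch: check whether a ZooKeeper ensemble still has quorum.
    # Host names are assumptions; 2181 is the default ZooKeeper client port.
    HOSTS="zk1.example.com zk2.example.com zk3.example.com"
    total=0; healthy=0
    for host in $HOSTS; do
        total=$((total + 1))
        if [ "$(echo ruok | nc -w 2 "$host" 2181)" = "imok" ]; then
            healthy=$((healthy + 1))
        fi
    done
    quorum=$((total / 2 + 1))                    # majority threshold
    echo "healthy: ${healthy}/${total} (quorum needs ${quorum})"
    if [ "$healthy" -ge "$quorum" ]; then
        echo "The ensemble has quorum; BDD can use ZooKeeper services."
    else
        echo "Quorum lost; restart or replace the failed ZooKeeper nodes."
    fi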