How enhanced availability is achieved

This topic discusses how the BDD cluster deployment ensures enhanced availability of query processing.

Important: The BDD cluster deployment provides enhanced availability but does not provide high availability. This topic describes the cluster behavior that enables enhanced availability and notes the instances where system administrators must take action to restore services.

The following three sections describe how the BDD cluster provides enhanced availability.

Note: This topic applies to BDD deployments with more than one running instance of the Dgraph. Although you can deploy BDD on a single node, such deployments are suitable only for development environments, because they do not guarantee the availability of query processing: if the single node hosting the only Dgraph instance fails, the Dgraph process goes down with it.

Availability of WebLogic Server nodes hosting Studio

When a WebLogic Server node goes down, Studio goes down with it. As long as the BDD cluster uses an external load balancer and includes more than one WebLogic Server node on which Studio is started, this does not disrupt Big Data Discovery operations.

If a WebLogic Server node hosting Studio fails, the external load balancer stops routing requests to it and relies on the remaining Studio nodes until you restart the failed node.
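
To illustrate what the load balancer contributes, the following minimal Python sketch mimics the kind of HTTP health probe an external load balancer runs against Studio nodes. The hostnames, port, and probe URL are hypothetical placeholders, not values prescribed by BDD; a real load balancer expresses the same check in its own configuration.

    import urllib.request

    STUDIO_NODES = ["studio1.example.com", "studio2.example.com"]
    STUDIO_PORT = 7003  # hypothetical Studio port; use your deployment's value

    def healthy(host: str) -> bool:
        """Return True if Studio on this host answers an HTTP probe."""
        try:
            url = f"http://{host}:{STUDIO_PORT}/"  # probe path is illustrative
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status < 500
        except OSError:
            return False

    # The load balancer keeps only responsive nodes in its rotation.
    available = [host for host in STUDIO_NODES if healthy(host)]
    print("Routing Studio traffic to:", available)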

Availability of Dgraph nodes

The ZooKeeper ensemble running on a subset of Hadoop (CDH or HDP) nodes ensures the enhanced availability of the Dgraph cluster nodes and services:
  • Failure of a leader Dgraph. When the leader Dgraph of a database goes offline, the BDD cluster elects a new leader and starts sending updates to it. During this stage, follower Dgraphs continue to maintain a consistent view of the data and to answer queries. Manually restart the failed node with the bdd-admin script. If the restarted Dgraph rejoins the cluster after a new leader has been appointed, it becomes one of the follower Dgraphs; if it rejoins before the cluster needs to appoint a new leader, it continues to serve as the leader. (A sketch of the underlying election pattern follows this list.)
  • Failure of a follower Dgraph. When a follower Dgraph goes offline, the BDD cluster routes requests to the remaining available Dgraphs. Manually restart the failed node with the bdd-admin script. Once restarted, the node rejoins the cluster, and the cluster adjusts its routing information accordingly.
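
The failover described above relies on ZooKeeper's standard leader-election recipe. The following minimal Python sketch shows that general pattern using the kazoo client library; the ensemble addresses, election path, and node identifier are hypothetical, and BDD performs the equivalent election internally rather than through this API.

    from kazoo.client import KazooClient

    # Hypothetical ensemble addresses; BDD configures these at deployment time.
    zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181")
    zk.start()

    def serve_as_leader():
        # Runs only on the node that wins the election; leadership is held
        # until this function returns or the ZooKeeper session is lost.
        print("Elected leader: accepting updates for the database")

    # Every candidate runs this. run() blocks until this candidate is elected,
    # then invokes serve_as_leader(). If the current leader's session dies,
    # ZooKeeper automatically promotes the next candidate in line.
    election = zk.Election("/demo/leader-election", identifier="dgraph-node-1")
    election.run(serve_as_leader)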

Availability of ZooKeeper instances

The ZooKeeper instances themselves must be highly available. The following statements describe the requirements in detail:
  • Each Hadoop node in the BDD cluster deployment can optionally be configured, at deployment time, to host a ZooKeeper instance. To ensure the availability of ZooKeeper, deploy its instances on a subset of the Hadoop nodes as a cluster of their own, known as an ensemble. As long as a majority of the ensemble is running, the BDD cluster can use its ZooKeeper services. Because ZooKeeper requires a majority, the optimal number of Hadoop nodes hosting ZooKeeper instances is an odd number of at least 3 (see the quorum sketch after this list).
  • A Hadoop node hosting a ZooKeeper instance assumes responsibility for ensuring the ZooKeeper process uptime. It will start ZooKeeper when BDD is deployed and will restart it should it stop running.
  • If you do not configure at least three Hadoop nodes to run ZooKeeper, ZooKeeper becomes a single point of failure. Should it fail, the data sets served by BDD become entirely unavailable, and the Hadoop node that was running the failed ZooKeeper instance must be restarted or replaced (the required action depends on the nature of the failure).
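
The majority requirement explains the odd-number recommendation: an ensemble of n instances stays available while floor(n/2) + 1 of them are running, so it tolerates the failure of the rest. A minimal Python sketch of this arithmetic:

    def tolerated_failures(ensemble_size: int) -> int:
        """Failures a ZooKeeper ensemble survives while keeping a majority."""
        majority = ensemble_size // 2 + 1
        return ensemble_size - majority

    for n in (1, 2, 3, 4, 5):
        print(f"{n} instance(s): quorum survives {tolerated_failures(n)} failure(s)")
    # Output: 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2

A four-instance ensemble tolerates no more failures than a three-instance one, and a two-instance ensemble tolerates none at all, which is why an odd count of at least three is optimal.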