Managing ODH Updates for Big Data Service Clusters
Manage updates for Big Data Service ODH clusters from the Cluster details page.
By default, ODH updates are installed with downtime that impacts running applications. Big Data Service 3.0.28 and later includes the Host Order Upgrade cluster update strategy to support updates with minimal downtime.
Verify that no group other than `bds_rp_users` exists with group ID `54339`. This group ID is reserved for Big Data Service, and the update fails if any other group occupies it.
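One quick way to perform this check on a cluster node is with Python's standard `grp` module. The following is a minimal sketch; the helper name and the injectable `lookup` parameter are illustrative, not part of Big Data Service:

```python
import grp

RESERVED_GID = 54339           # group ID reserved for Big Data Service
ALLOWED_NAME = "bds_rp_users"  # the only group allowed to own the reserved GID

def gid_conflict(gid=RESERVED_GID, allowed=ALLOWED_NAME, lookup=grp.getgrgid):
    """Return the name of a conflicting group occupying `gid`, or None.

    `lookup` is injectable so the check can be exercised without reading
    the real group database.
    """
    try:
        entry = lookup(gid)
    except KeyError:
        return None  # GID unused: safe
    return None if entry.gr_name == allowed else entry.gr_name

if __name__ == "__main__":
    conflict = gid_conflict()
    if conflict:
        print(f"GID {RESERVED_GID} is occupied by '{conflict}'; the update would fail")
    else:
        print(f"GID {RESERVED_GID} is unused or owned by {ALLOWED_NAME}")
```

Run the script on each node before starting the update; any non-empty conflict name must be resolved first.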
In this approach, Big Data Service updates and restarts services on batches (groups) of hosts at a time. The update orchestrator lets you select the batch size for the upgrade. Host Order Upgrade is faster than the other upgrade strategies and is suitable for environments that can't tolerate downtime.
- Installation with downtime: The downtime-based update strategy is the same as Express Upgrade, and it's suitable for clusters that can afford downtime. This strategy is the default ODH update installation strategy.
- AD/FD Installation: The availability domain (AD) based update strategy is a variation of the Host Order Upgrade update, where host group sets are created based on AD, and all the hosts in a set are upgraded sequentially before moving to the next set. You can provide a sleep duration between ADs.
  Note: The order of preference is the AD containing the maximum number of node types (for example, master, utility, edge, and worker nodes).
- Batch manner: The batch-based update strategy is a variation of Host Order Upgrade where host group sets are created based on the batch size you provide, except for the First Batch, which Big Data Service selects. The First Batch is a special batch irrespective of the batch size you provide.
  - Big Data Service picks the First Batch with all available node types on a cluster across all ADs/FDs to ensure the update succeeds on all node types.
  - The provided batch size must be less than or equal to the number of nodes in an AD.
  - You can provide an inter-batch pause duration.
  Note:
  - If no batch size is passed, the default batch size is the minimum number of nodes across ADs.
  - For multi-AD regions, all batches except the First Batch are prepared within each AD. The sequence of batches starts from the AD where the First Batch is created.
For more information on Big Data Service updates, see:
- Planning ODH Update Installation
- Listing Updates for a Cluster
- Installing Available Updates for a Cluster
- Listing the Update History for a Cluster
You can list updates, install updates, and view update history for a cluster on the Cluster details page.
Planning ODH Update Installation
Installation with Downtime
Plan required downtime for Big Data Service ODH cluster update installation. The following information explains how to quantify the downtime and how to measure when it's complete. For information on update installation stages and required downtime, see Monitoring ODH Update Workflow Steps.
Gauging Impact
For an HA, secure ODH cluster with 7 to 25 nodes, downtime is expected to start 40 to 50 minutes after the update starts.
AD/FD Installation or Batch Manner
Each update supports backward and forward compatibility. For example, if one AD is updated with a newer version of components, the components running in the other ADs remain compatible with the newer version.
Gauging Impact
Prerequisites to update ODH with minimal downtime:
- Big Data Service 3.0.28 or later.
- The cluster must be HA-enabled to avoid impact on applications.
- Enable HA for components such as Hive, Oozie, Ranger, and Schema Registry to avoid downtime. The ODH update doesn't explicitly check any component for HA; downtime occurs when HA isn't configured for supported components.
- If your cluster has fewer than 10 worker nodes, don't adopt the AD/FD Installation patch strategy. Instead, adopt the Batch manner patch strategy with a batch size of one.
- Trino and Hue don't support HA, and an upgrade can incur downtime for these services.
- When an ODH update is in progress, no LCM operations are allowed.
- When an ODH update has failed, no LCM operations are allowed except retriggering the ODH update or deleting the cluster.
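Because the update doesn't verify HA for you, you can check component instance counts yourself before starting. The sketch below parses the response shape of an Ambari REST call such as `GET /api/v1/clusters/<cluster>/services/HIVE/components/HIVE_SERVER`; the two-instance threshold, helper name, and sample hostnames are assumptions for illustration:

```python
def is_ha(component_json, min_instances=2):
    """Treat a component as HA if it runs on at least `min_instances` hosts.

    `component_json` is the parsed body of an Ambari REST component call,
    which lists one entry per host in "host_components".
    """
    hosts = component_json.get("host_components", [])
    return len(hosts) >= min_instances

# Abridged example of Ambari's response shape (hostnames are hypothetical):
sample = {
    "host_components": [
        {"HostRoles": {"host_name": "un0.example.com"}},
        {"HostRoles": {"host_name": "un1.example.com"}},
    ]
}
print(is_ha(sample))  # → True
```

Repeat the check for each HA-capable service (HiveServer2, Oozie server, Ranger admin, Schema Registry) before choosing a minimal-downtime strategy.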
Expected Behavior of Components
The following table lists the expected behavior for components during an ODH update.
| Component | Expected Behavior |
|---|---|
| Hadoop | On an HA cluster, there's no downtime during the update of the HDFS component. This assumes that HDFS components are backward and forward compatible. Note: The wait time is configured based on how long the DataNodes take to sync data and on the data ingestion rate. |
| YARN | YARN services are updated and restarted in host batches as part of the AD/FD Installation or Batch manner update process. Note: All components that use YARN as the resource scheduler (for example, Flink, Spark, Hive, and MapReduce) are expected to show similar behavior. |
| Spark | For upgrade, the standard update steps are followed. |
| Kafka | For upgrade, the standard update steps are followed. Note: The wait time between batches or ADs is decided based on the data ingestion and processing rate. |
| HBase | Prerequisite: For bulk load to an HBase table, you must set the required configuration before the upgrade. For upgrade, the standard update steps are followed. |
| Hive | No expected downtime if Hive HA is enabled. |
| JupyterHub | For upgrade, the standard update steps are followed. |
| Oozie | No expected downtime if Oozie HA is enabled. |
| Ranger | No expected downtime if Ranger HA is enabled. |
| Trino | Trino HA isn't available in open source, so it isn't implemented in ODH. When the Trino coordinator is stopped, all pending Trino queries fail. You must restart Trino applications after the Trino coordinator is restarted. |
| Hue | No expected downtime. |
| Flume | Data processing by the Flume agent is stopped during the update. There's no data loss, because a checkpoint is maintained based on the data source; after the agent restarts, it resumes from the saved checkpoint. |
| Schema Registry | No expected downtime if Schema Registry HA is enabled. |
Monitoring ODH Update Workflow Steps
When you see PREPARE_UPGRADE, the update installation and downtime are about to begin.
| Step | Time Line | Update Stage | Installation with Downtime | AD/FD Installation or Batch Manner |
|---|---|---|---|---|
| 1 | T0 | DOWNLOAD patch | No downtime required | No downtime required |
| 2 | T1 | PROCESS_PATCH_METADATA patch | No downtime required | No downtime required |
| 3 | T2 | PATCH_AMBARI_SERVER_JAR patch | No downtime required; Ambari restarts | No downtime required; Ambari restarts |
| 4 | T2 | REGISTER_PATCH patch | No downtime required | No downtime required |
| 5 | T2 | CREATE_PATCH_REPO patch | No downtime required | No downtime required |
| 6 | T2 | APPLY_CUSTOM_PATCH patch | No downtime required; Ambari restarts | No downtime required; Ambari restarts |
| 7 | T2 | INSTALL_PATCH patch | No downtime required | No downtime required |
| 8 | T3 | PREPARE_UPGRADE patch | Required downtime | See Planning ODH Update Installation |
| 9 | T3 | APPLY_UPGRADE patch | Required downtime | See Planning ODH Update Installation |
| 10 | T4 | Patching complete | No downtime required | No downtime required |
High-level update time line
- T0: Apply Update in OCI Console.
- T1: The cluster health is checked for update readiness, and the ODH update bundle is downloaded to the cluster nodes (no downtime) (stages 1 and 2 in the previous table).
- T2: While the ODH update is being prepared on the cluster, the Ambari Server can be restarted. If you're signed in to Ambari, you must sign out and sign back in. (no downtime for your Hadoop jobs) (stages 3 to 7 in the previous table).
If updated with the Installation with Downtime strategy:
- T3: Downtime starts: All ODH/Hadoop services are stopped. The ODH update is applied to all the nodes in the cluster, and all Hadoop services are started. (stages 8 and 9 in the previous table)
- T4: Update application is completed, and downtime ends. (stage 10 in the previous table)
If updated with the AD/FD Installation or Batch manner strategy:
- T3: Occurs in two phases:
- Phase1: Update an initial collection of nodes in the cluster (First Batch).
An initial collection of the following node types is picked and updated one node at a time:
- 1 Utility node in AD-X
- 1 master node in AD-Y
- 1 Storage worker node from AD-Z
- 1 Compute only worker node from any AD
- 1 Edge node from any AD
- If a node of a specific type is unavailable in an AD, a node of the same type is picked at random from any other AD.
Updating of this collection must succeed for further progress to occur. Otherwise, the update fails and the cluster is rolled back to its initial state.
- Phase2: Update the remaining nodes of the cluster in batches.
Irrespective of the size of the batch, updating progresses with all nodes in AD-X, followed by AD-Y, and then by AD-Z. For example:
- For a cluster of 100 nodes in a multi-AD region
- Distributed as 33 nodes in AD-X, 33 nodes in AD-Y, and 34 nodes in AD-Z
- With batch size of 20
- Updating progresses in the following order:
- 20 nodes in AD-X, 13 nodes in AD-X
- 20 nodes in AD-Y, 13 nodes in AD-Y
- 20 nodes in AD-Z, 14 nodes in AD-Z
- T4: Update application is completed. (stage 10 in the previous table)
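The two phases above can be sketched in Python. The node-type names, data layout, and function names are illustrative assumptions (the real orchestrator is internal to Big Data Service), and for simplicity Phase 2 here batches all nodes without removing the First Batch nodes:

```python
import random

NODE_TYPES = ["utility", "master", "storage_worker", "compute_worker", "edge"]

def pick_first_batch(nodes_by_ad, rng=random):
    """Phase 1: pick one node of each available type, spread across ADs.

    nodes_by_ad maps an AD name to a list of (node_name, node_type) pairs.
    If the preferred AD has no node of a type, fall back to a random node
    of that type from any AD.
    """
    ads = list(nodes_by_ad)
    batch = []
    for i, node_type in enumerate(NODE_TYPES):
        preferred = ads[i % len(ads)]  # rotate through ADs for spread
        candidates = [n for n, t in nodes_by_ad[preferred] if t == node_type]
        if not candidates:  # type missing in preferred AD: look everywhere
            candidates = [n for ad in ads
                          for n, t in nodes_by_ad[ad] if t == node_type]
        if candidates:
            batch.append(rng.choice(candidates))
    return batch

def plan_batches(nodes_by_ad, batch_size=None):
    """Phase 2: split each AD's nodes into batches, one AD at a time.

    The default batch size mirrors the documented default: the minimum
    node count across ADs. Returns a list of (ad, [nodes]) batches.
    """
    if batch_size is None:
        batch_size = min(len(nodes) for nodes in nodes_by_ad.values())
    batches = []
    for ad, nodes in nodes_by_ad.items():
        for i in range(0, len(nodes), batch_size):
            batches.append((ad, nodes[i:i + batch_size]))
    return batches

# The 100-node example: 33 nodes in AD-X and AD-Y, 34 in AD-Z, batch size 20.
cluster = {
    "AD-X": [f"x{i}" for i in range(33)],
    "AD-Y": [f"y{i}" for i in range(33)],
    "AD-Z": [f"z{i}" for i in range(34)],
}
sizes = [(ad, len(b)) for ad, b in plan_batches(cluster, batch_size=20)]
print(sizes)
# → [('AD-X', 20), ('AD-X', 13), ('AD-Y', 20), ('AD-Y', 13), ('AD-Z', 20), ('AD-Z', 14)]
```

The printed batch sizes reproduce the 20/13, 20/13, 20/14 ordering in the example above.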
Time between update stages
For an HA, secure ODH cluster with 7 to 25 nodes:
| Update Strategy | Time Between T0 to T3 | Time Between T3 to T4 |
|---|---|---|
| Installation with Downtime | ~ 30 to 40 minutes | ~ 40 to 50 minutes of downtime |
| AD/FD Installation | ~ 30 to 40 minutes | ~ 70 to 250 minutes of minimal impact |
| Batch Manner | ~ 30 to 40 minutes | ~ 100 to 250 minutes of minimal impact (with batch size 1) |
Rollback Scenarios
For the AD/FD Installation or Batch manner strategies, if the upgrade of the initial collection of nodes (First Batch) fails, the ODH update is rolled back. However, if the First Batch has passed and a later batch fails, no rollback occurs, and the ODH update moves to a failed state. The cluster state remains active, and only retriggering the ODH update or deleting the cluster is allowed. No other LCM operations are allowed.