High Availability

Oracle Engineered Systems are built to eliminate single points of failure, allowing the system and hosted workloads to remain operational in case of hardware or software faults, as well as during upgrades and maintenance operations. Private Cloud Appliance has redundancy built into its architecture at every level: hardware, controller software, master database, services, and so on. Features such as backup, automated service requests and optional disaster recovery further enhance the system's serviceability and continuity of service.

Hardware Redundancy

The minimum base rack configuration contains redundant networking, storage and server components to ensure that failure of any single element does not affect overall system availability.

Data connectivity throughout the system is built on redundant pairs of leaf and spine switches. Link aggregation is configured on all interfaces: switch ports, host NICs and uplinks. The leaf switches interconnect the rack components using cross-cabling to redundant network interfaces in each component. Each leaf switch also has a connection to each of the spine switches, which are also interconnected. The spine switches form the backbone of the network and carry the traffic that enters or leaves the rack. Their uplinks to the data center network consist of two cable pairs, which are cross-connected to two redundant top-of-rack (ToR) switches.

The management cluster, which runs the controller software and system-level services, consists of three fully active management nodes. Inbound requests pass through the virtual IP of the management node cluster and are distributed across the three nodes by a load balancer. If one of the nodes stops responding and is fenced from the cluster, the load balancer continues to send traffic to the two remaining nodes until the failing node is healthy again and rejoins the cluster.
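The failover behavior can be pictured with a small sketch. The following Python fragment is purely illustrative: the node names (mgmt01 to mgmt03) and the health-check and dispatch functions are hypothetical stand-ins, not the appliance's actual load balancer, but they show how traffic keeps flowing to the healthy nodes while a failed node is out of the cluster.

    import itertools

    # Hypothetical management node addresses behind the cluster's virtual IP.
    MANAGEMENT_NODES = ["mgmt01", "mgmt02", "mgmt03"]

    def is_healthy(node: str, fenced: set[str]) -> bool:
        """Stand-in health check: a node that has fenced from the cluster is skipped."""
        return node not in fenced

    def dispatch(requests: list[str], fenced: set[str]) -> dict[str, list[str]]:
        """Round-robin requests across healthy nodes only, mimicking the behavior
        described above: traffic keeps flowing to the remaining nodes while a
        failed node is out of the cluster."""
        healthy = [n for n in MANAGEMENT_NODES if is_healthy(n, fenced)]
        if not healthy:
            raise RuntimeError("no healthy management nodes available")
        assignment: dict[str, list[str]] = {n: [] for n in healthy}
        for req, node in zip(requests, itertools.cycle(healthy)):
            assignment[node].append(req)
        return assignment

    # Example: with mgmt02 fenced, requests are spread over the two remaining nodes.
    print(dispatch([f"req-{i}" for i in range(6)], fenced={"mgmt02"}))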

Storage for the system as well as for the cloud resources in the environment is provided by the internal ZFS Storage Appliance. Its two controllers form an active-active cluster, providing high availability and excellent throughput at the same time. The ZFS pools are built on disks in a mirrored configuration for optimum data protection. This applies to the standard high-capacity disk tray as well as an optional SSD-based high-performance tray.

System Availability

The appliance controller software and services layer are deployed on the three-node management cluster, and take advantage of the high availability that is inherent to the cluster design. The Kubernetes container orchestration environment also uses clustering for both its own controller nodes and the service pods it hosts. Multiple replicas of the microservices are running at any given time. Nodes and pods are distributed across the management nodes, and Kubernetes ensures that failing pods are replaced with new instances to keep all services running in an active/active setup.
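As an illustration of the reconciliation principle, the sketch below models a service that should keep a fixed number of replicas running. The Service class and the service name are hypothetical, and the real controllers are Kubernetes components, not this Python loop; the sketch only shows how failed pods are replaced to restore the desired replica count.

    from dataclasses import dataclass, field

    @dataclass
    class Service:
        """Illustrative model of a microservice that should keep N replicas running."""
        name: str
        desired_replicas: int
        running_pods: list[str] = field(default_factory=list)

    def reconcile(service: Service) -> None:
        """Replace failed pods so the running count matches the desired count,
        similar in spirit to what the Kubernetes controllers do for service pods."""
        missing = service.desired_replicas - len(service.running_pods)
        for _ in range(missing):
            new_pod = f"{service.name}-pod-{len(service.running_pods) + 1}"
            service.running_pods.append(new_pod)
            print(f"started replacement pod {new_pod}")

    # Example with a hypothetical service name: one of three replicas has failed
    # and reconciliation restores it.
    svc = Service(name="compute-api", desired_replicas=3,
                  running_pods=["compute-api-pod-1", "compute-api-pod-2"])
    reconcile(svc)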

All services and components store data in a common, central database. It is a MySQL cluster database with instances deployed across the three management nodes. Availability, load balancing, data synchronization and clustering are all controlled by internal components of the MySQL cluster.

A significant part of the system-level infrastructure networking is software-defined, just like all the virtual networking at the VCN and instance level. The configuration of virtual switches, routers and gateways is not stored and managed by the switches, but is distributed across several components of the network architecture. The network controller is deployed as a highly available containerized service.

The upgrade framework leverages the hardware redundancy and the clustered designs to provide rolling upgrades for all components. In essence, during the upgrade of one component instance, the remaining instances ensure that there is no downtime. The upgrade is complete when all component instances have been upgraded and returned to normal operation.
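The following sketch illustrates the rolling pattern in the abstract. The instance names and the upgrade_one callback are placeholders, not the appliance's upgrade framework; the point is simply that only one instance is out of service at a time.

    def rolling_upgrade(instances: list[str], upgrade_one) -> None:
        """Upgrade component instances one at a time; the instances that are not
        currently being upgraded continue to serve traffic, so the component as a
        whole stays available throughout."""
        for instance in instances:
            print(f"draining {instance}")
            upgrade_one(instance)          # upgrade this instance in isolation
            print(f"{instance} upgraded and returned to service")

    # Example: a three-instance clustered component upgraded with no downtime.
    rolling_upgrade(["node1", "node2", "node3"],
                    upgrade_one=lambda inst: print(f"upgrading {inst}"))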

Compute Instance Availability

At the level of a compute instance, high availability refers to the automated recovery of an instance in case the underlying infrastructure fails. The state of the compute nodes, hypervisors, and compute instances is monitored continually. Each compute node is polled at a 5-minute interval. When compute instances go down, by default the system attempts to recover them automatically.

By default, the system attempts to restart instances in their selected fault domain, which is the fault domain specified in the instance configuration, but restarts them in another fault domain if the selected fault domain does not have sufficient resources. You can configure the Compute service to restart instances only in their selected fault domain, and to stop instances when the selected fault domain has insufficient resources. See "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide to configure strict fault domain enforcement.
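The placement logic can be summarized with a small, hypothetical decision function. The fault domain names and capacity figures below are made up, and the strict flag stands in for the strict fault domain enforcement setting mentioned above; this is a sketch of the described behavior, not the Compute service implementation.

    def choose_fault_domain(selected: str, capacity: dict[str, int],
                            required: int, strict: bool) -> str | None:
        """Pick where to restart an instance. By default the selected fault domain
        is tried first and the other fault domains are fallbacks; with strict
        enforcement the instance is stopped instead of being placed elsewhere."""
        if capacity.get(selected, 0) >= required:
            return selected
        if strict:
            return None  # insufficient resources in the selected FD: leave stopped
        for fd, free in capacity.items():
            if fd != selected and free >= required:
                return fd
        return None

    # Example with illustrative fault domain names: FD1 is full, so the instance
    # lands in FD2 unless strict enforcement is configured.
    capacity = {"FD1": 0, "FD2": 64, "FD3": 16}
    print(choose_fault_domain("FD1", capacity, required=8, strict=False))  # FD2
    print(choose_fault_domain("FD1", capacity, required=8, strict=True))   # None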

If a compute node goes down because of an unplanned reboot, instances are by default restarted when the compute node successfully returns to normal operation. This behavior is configurable. At the next polling interval, if instances are found that should be running but are in a different state, the start command is issued again by default. If any instances have crashed and remain in that state, the hypervisor attempts to restart them, up to 5 times. Instances that were not running before the compute node became unavailable remain shut down when the compute node is up and running again.
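A simplified sketch of this recovery behavior might look as follows. The instance names and the try_start placeholder are hypothetical; only the 5-attempt retry limit and the rule that previously stopped instances stay stopped are taken from the description above.

    MAX_RESTART_ATTEMPTS = 5  # hypervisor retry limit described above

    def try_start(name: str, attempt: int) -> bool:
        """Placeholder for the hypervisor start call; succeeds on the second try here."""
        return attempt >= 2

    def recover_after_reboot(instances: dict[str, str]) -> None:
        """After a compute node returns from an unplanned reboot, restart only the
        instances that were running before the outage; instances that keep
        crashing are retried up to the limit, then left in their failed state."""
        for name, previous_state in instances.items():
            if previous_state != "running":
                print(f"{name}: was {previous_state} before the reboot, leaving it shut down")
                continue
            for attempt in range(1, MAX_RESTART_ATTEMPTS + 1):
                if try_start(name, attempt):
                    print(f"{name}: restarted on attempt {attempt}")
                    break
            else:
                print(f"{name}: still down after {MAX_RESTART_ATTEMPTS} attempts")

    # Example with hypothetical instances: one was running before the reboot,
    # one was already stopped and therefore stays down.
    recover_after_reboot({"web-vm": "running", "batch-vm": "stopped"})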

If a compute node is lost due to a failure, by default the system attempts to live-migrate running compute instances from the failed compute node to other compute nodes. The actual behavior depends on how you have configured the Compute service, as described in "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide.

A compute node is considered failing when it has been disconnected from the data network or has been in a powered-off state for more than 10 minutes. This 10-minute timeout corresponds to two unsuccessful polling attempts, and is the threshold for placing the compute node in FAIL state and its agent in EVACUATING state. This condition is required before reboot migration can start.
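Expressed as a sketch, the detection thresholds look like this. The healthy-case state names are illustrative, while FAIL and EVACUATING are the states named above; the polling interval and failure threshold are the values from this section.

    POLL_INTERVAL_MINUTES = 5       # each compute node is polled every 5 minutes
    FAILURE_THRESHOLD_MINUTES = 10  # two consecutive unsuccessful polling attempts

    def evaluate_node(missed_polls: int) -> tuple[str, str]:
        """Return (node_state, agent_state) after a number of consecutive missed
        polls, mirroring the thresholds described above; 'OK' and 'ACTIVE' are
        illustrative placeholders for the healthy case."""
        unreachable_minutes = missed_polls * POLL_INTERVAL_MINUTES
        if unreachable_minutes >= FAILURE_THRESHOLD_MINUTES:
            return "FAIL", "EVACUATING"   # reboot migration may now start
        return "OK", "ACTIVE"

    # Example: one missed poll keeps the node healthy; two missed polls mark it failed.
    print(evaluate_node(missed_polls=1))  # ('OK', 'ACTIVE')
    print(evaluate_node(missed_polls=2))  # ('FAIL', 'EVACUATING')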

Reboot migration implies that all compute instances from the failing compute node are stopped and restarted on another compute node. When migration is complete, the failing compute node's agent indicates that instances have been evacuated. If the compute node eventually reboots successfully, it must go through a cleanup process that removes all stale instance configurations and associated virtual disks. After cleanup, the compute node can host compute instances again. If automatic resolve is enabled on the Compute service, instances that were migrated to a different fault domain can be migrated back to their selected fault domain.

During the entire reboot migration, the instances remain in the "moving" configuration state. Once migration is complete, the instance configuration state is changed to "running". Instances that were stopped before the failure are not migrated, since they are not associated with any compute node.
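A minimal model of the reboot migration flow, using hypothetical instance and compute node names, is shown below; it only illustrates the state transitions described in this section and is not the evacuation mechanism itself.

    from dataclasses import dataclass

    @dataclass
    class Instance:
        name: str
        config_state: str        # e.g. "running", "moving", "stopped"
        compute_node: str | None

    def reboot_migrate(instances: list[Instance], failed_node: str, target_node: str) -> None:
        """Move running instances off a failed node: each one passes through the
        'moving' configuration state and returns to 'running' on the target node.
        Stopped instances have no compute node association and are left alone."""
        for inst in instances:
            if inst.compute_node != failed_node or inst.config_state != "running":
                continue
            inst.config_state = "moving"
            inst.compute_node = target_node   # stopped on the failed node, restarted elsewhere
            inst.config_state = "running"
            print(f"{inst.name} evacuated to {target_node}")

    # Example with hypothetical names: only the running instance on the failed node moves.
    vms = [Instance("app-vm", "running", "cn01"), Instance("idle-vm", "stopped", None)]
    reboot_migrate(vms, failed_node="cn01", target_node="cn02")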

By default, fault domain preference is not strictly enforced with instance migration. This is another preference that you can configure as described in "Configuring the Compute Service for High Availability" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide.

In case of planned maintenance, the administrator must first disable provisioning for the compute node in question and apply a maintenance lock. When the compute node is under a provisioning lock, the administrator can live-migrate all running compute instances to another compute node, as described in "Migrating Instances from a Compute Node" in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide. You can specify the force option to stop any instances that cannot be migrated. Maintenance mode can only be activated when there are no more running instances on the compute node. Once the compute node is in maintenance mode, all compute instance operations on it are disabled, and it cannot be provisioned or deprovisioned.
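The ordering of these maintenance steps can be sketched as follows. The node dictionary, instance records and force flag are illustrative stand-ins for the administrative commands documented in the Administrator Guide; the sketch only captures the sequence of lock, migrate or stop, then maintenance mode.

    def enter_maintenance(node: dict, force: bool = False) -> None:
        """Illustrative ordering of the maintenance steps described above: lock
        provisioning first, then migrate (or, with force, stop) the remaining
        instances, and only then activate maintenance mode."""
        node["provisioning_locked"] = True               # no new instances land here

        for inst in list(node["running_instances"]):     # evacuate what is left
            if inst["migratable"]:
                print(f"live-migrating {inst['name']}")
                node["running_instances"].remove(inst)
            elif force:
                print(f"stopping {inst['name']} (force)")
                node["running_instances"].remove(inst)

        if node["running_instances"]:
            raise RuntimeError("instances still running; maintenance mode not activated")
        node["maintenance"] = True                       # all instance operations now disabled

    # Example with hypothetical instances: one is migrated, the non-migratable
    # one is stopped because force is specified.
    cn = {"provisioning_locked": False, "maintenance": False,
          "running_instances": [{"name": "vm1", "migratable": True},
                                {"name": "vm2", "migratable": False}]}
    enter_maintenance(cn, force=True)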

Continuity of Service

Private Cloud Appliance offers several features that support and further enhance high availability. Health monitoring at all levels of the system is a key factor. Diagnostic and performance data is collected from all components, then centrally stored and processed, and made available to administrators in the form of visualizations on standard dashboards. In addition, alerts are generated when metrics exceed their defined thresholds.
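As a simple illustration of threshold-based alerting, the sketch below checks a few hypothetical metrics against made-up thresholds; the actual metric names, thresholds, and alerting pipeline are defined by the appliance's monitoring services.

    # Hypothetical metric names and thresholds, purely for illustration.
    THRESHOLDS = {"cpu_util_pct": 90.0, "filesystem_used_pct": 85.0}

    def check_metrics(samples: dict[str, float]) -> list[str]:
        """Return an alert message for every metric that exceeds its threshold."""
        return [f"ALERT: {name}={value} exceeds threshold {THRESHOLDS[name]}"
                for name, value in samples.items()
                if name in THRESHOLDS and value > THRESHOLDS[name]]

    # Example: only the filesystem metric crosses its threshold.
    for alert in check_metrics({"cpu_util_pct": 72.5, "filesystem_used_pct": 91.2}):
        print(alert)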

Monitoring allows an administrator to track system health over time, take preventative measures when required, and respond to issues when they occur. In addition, systems registered with My Oracle Support provide phone-home capabilities, using the collected diagnostic data for fault monitoring and targeted proactive support. Registered systems can also automatically submit service requests to Oracle for specific problems reported by the appliance.

To mitigate data loss and support the recovery of system and services configuration in case of failure, consistent and complete backups are made regularly. A backup can also be executed manually, for example to create a restore point just before a critical modification. The backups are stored in a dedicated NFS share on the ZFS Storage Appliance, and allow the entire Service Enclave to be restored when necessary.

Optionally, workloads deployed on the appliance can be protected against downtime and data loss through the implementation of disaster recovery. To achieve this, two Private Cloud Appliance systems need to be set up at different sites and configured as each other's replica. Resources under disaster recovery control are stored separately on the ZFS Storage Appliances in each system, and replicated between the two. When an incident occurs at one site, the environment is brought up on the replica system with practically no downtime. Oracle recommends that disaster recovery be implemented for all critical production systems.