1 Architecture and Design

Welcome to Oracle Private Cloud Appliance. This chapter provides an overview of Oracle's platform for building cloud services and applications inside your own data center.

Introduction to Oracle Private Cloud Appliance

Oracle Private Cloud Appliance is engineered to deliver a comprehensive suite of cloud infrastructure services within the secure environment of your on-premises network. The system integrates all required hardware and software components, and has been tested, configured and tuned for optimal performance by Oracle engineers. In essence, it is a flexible, general-purpose IaaS (Infrastructure as a Service) solution that supports a wide variety of workloads. Its pluggable platform provides an excellent foundation to layer PaaS (Platform as a Service) and SaaS (Software as a Service) solutions on top of the infrastructure.

This release of Private Cloud Appliance provides API compatibility with Oracle's public cloud solution, Oracle Cloud Infrastructure. You access the core IaaS services using the same methods, tools and interfaces as with Oracle Cloud Infrastructure. An installation of Private Cloud Appliance represents a customer region. Workloads are portable between your private and public cloud environments, but the private cloud is disconnected from Oracle Cloud Infrastructure and thus runs its own control plane components in order to host its set of compatible services.

As an engineered system, Private Cloud Appliance is designed to meet strict business continuity and serviceability requirements. It can monitor all components, detect potential problems, send out alerts, and automatically log a service request. Subsequent troubleshooting and repair can be performed without affecting the uptime of the environment.

System upgrades are also designed for minimum disruption and maximum availability. Health checks are performed before an upgrade to ensure that all components are in an acceptable state. The upgrade process is modular and allows components – such as firmware, operating systems, containerized services or the system's main database – to be upgraded individually or as an integrated multi-component workflow.

Compatibility with Oracle Cloud Infrastructure

A principal objective of Private Cloud Appliance is to allow you to consume the core Oracle Cloud Infrastructure services from the safety of your own on-premises network, behind your own firewall. The infrastructure services provide a foundation for building PaaS and SaaS solutions; the deployed workloads can be migrated between the public and the private cloud infrastructure with minimal or no modification required. For this purpose, Private Cloud Appliance offers API compatibility with Oracle Cloud Infrastructure.

As a rack-scale system, Private Cloud Appliance can be considered the smallest deployable unit of Oracle Cloud Infrastructure, aligned with the physical hierarchy of the public cloud design:

The following mapping shows, for each hierarchy concept, the Oracle Cloud Infrastructure design and its Oracle Private Cloud Appliance equivalent.

Realm

  Oracle Cloud Infrastructure design: A Realm is a superset of Regions, and the highest physical subdivision of the Oracle cloud. There are no cross-realm features. Oracle Cloud Infrastructure currently consists of a Realm for Commercial Regions and a Realm for Government Cloud Regions.

  Private Cloud Appliance mapping: The concept of a Realm exists in Private Cloud Appliance, but it has no practical function. It allows the appliance to participate in any Realm.

Region

  Oracle Cloud Infrastructure design: A Region is a geographic area. An Oracle Cloud Infrastructure Region is composed of at least three Availability Domains. It is possible to migrate or replicate data and resources between Regions.

  Private Cloud Appliance mapping: Private Cloud Appliance is designed as a single Region. Because this private region is disconnected from any other systems, the Region concept has no practical function. Domain and system identifiers are used in system configuration instead, and mapped to the region and realm values.

Availability Domain

  Oracle Cloud Infrastructure design: An Availability Domain consists of one or more data centers. Availability Domains are isolated from each other; they have independent power and cooling infrastructure and separate internal networking. A failure in one Availability Domain is highly unlikely to impact others. Availability Domains within the same region are interconnected through an encrypted network with high bandwidth and low latency. This is a critical factor in providing high availability and disaster recovery.

  Private Cloud Appliance mapping: Each Private Cloud Appliance is configured as an Availability Domain. Multiple installations are distinct from each other: they do not function as Availability Domains within the same region.

Fault Domain

  Oracle Cloud Infrastructure design: A Fault Domain is a grouping of infrastructure components within an Availability Domain. The goal is to isolate downtime events due to failures or maintenance, and to make sure that resources in other Fault Domains are not affected. Each Availability Domain contains three Fault Domains. Fault Domains provide anti-affinity: the ability to distribute instances so that they do not run on the same physical hardware.

  Private Cloud Appliance mapping: Private Cloud Appliance adheres to the public cloud design: each Availability Domain contains three Fault Domains. A Fault Domain corresponds with one or more physical compute nodes.

Private Cloud Appliance also aligns with the logical partitioning of Oracle Cloud Infrastructure. It supports multiple tenancies, which are securely isolated from each other by means of tunneling and encapsulation in the appliance underlay network. Tenancies are hosted on the same physical hardware, but users and resources that belong to a given tenancy cannot interact with other tenancies. In addition, the Compute Enclave – which refers to all tenancies collectively, and to the cloud resources created and managed within them – is logically isolated from the Service Enclave, from where the appliance infrastructure is controlled. Refer to Enclaves and Interfaces for more information.

The Compute Enclave interfaces provide access in the same way as Oracle Cloud Infrastructure. The CLI is identical, and the browser UI offers practically the same user experience. API support is also identical, but limited to the subset of cloud services that Private Cloud Appliance offers.

The consistency of the supported APIs is a crucial factor in the compatibility between the public and private cloud platforms. It ensures that the core cloud services support resources and configurations in the same way. More specifically, Private Cloud Appliance supports the same logical constructs for networking and storage, manages user identity and access in the same way, and offers the same compute shapes and images for instance deployment as Oracle Cloud Infrastructure. As a result, workloads set up in Oracle Cloud Infrastructure are easily portable to Private Cloud Appliance and vice versa. However, due to the disconnected operating mode of the private cloud environment, workloads must be migrated offline.
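
As a simple illustration of this compatibility, the following sketch uses the standard OCI SDK for Python against a Private Cloud Appliance region. The profile name, domain and endpoint URL shown here are hypothetical placeholders; the exact endpoint values depend on your own appliance configuration.

  # Minimal sketch: pointing the standard OCI SDK for Python at a Private Cloud
  # Appliance region instead of the Oracle public cloud. Endpoint and profile
  # values are hypothetical placeholders.
  import oci

  # A regular ~/.oci/config profile is used; only the service endpoint differs
  # from the public cloud, because the appliance runs its own control plane.
  config = oci.config.from_file(profile_name="DEFAULT")

  identity = oci.identity.IdentityClient(
      config,
      service_endpoint="https://identity.mypca.example.com",  # assumed appliance endpoint
  )

  # The same SDK call that works against Oracle Cloud Infrastructure lists the
  # compartments of the tenancy hosted on the appliance.
  for compartment in identity.list_compartments(compartment_id=config["tenancy"]).data:
      print(compartment.name, compartment.lifecycle_state)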

Enclaves and Interfaces

From a cloud user perspective, the Private Cloud Appliance Compute Enclave offers a practically identical experience to Oracle Cloud Infrastructure. However, the appliance also runs its own specific and separate administration area known as the Service Enclave. This section describes the boundaries between the enclaves and their interfaces, which are intended for different groups of users and administrators with clearly distinct access profiles.

Enclave Boundaries

The Compute Enclave was deliberately designed for maximum compatibility with Oracle Cloud Infrastructure. Users of the Compute Enclave have certain permissions to create and manage cloud resources. These privileges are typically based on group membership. The Compute Enclave is where workloads are created, configured and hosted. The principal building blocks at the users' disposal are compute instances and associated network and storage resources.

Compute instances are created from a compute image, which contains a preconfigured operating system and optional additional software. Compute instances have a particular shape, which is a template of virtual hardware resources such as CPUs and memory. For minimal operation, a compute instance needs a boot volume and a connection to a virtual cloud network (VCN). As you continue to build the virtual infrastructure for your workload, you will likely add more compute instances, assign private and public network interfaces, set up NFS shares or object storage buckets, and so on. All those resources are fully compatible with Oracle Cloud Infrastructure and can be ported between your private and public cloud environments.
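
The following sketch shows these building blocks in a single SDK call, assuming the OCI SDK for Python is configured as in the earlier example. All OCIDs, the shape name, the availability domain name and the endpoint are hypothetical placeholders; use the shapes, images and networks available in your own tenancy.

  # Minimal sketch: launching a compute instance from an image, with a shape,
  # into a subnet of a VCN. All OCIDs, names and endpoints are placeholders.
  import oci

  config = oci.config.from_file()
  compute = oci.core.ComputeClient(
      config,
      service_endpoint="https://iaas.mypca.example.com",  # assumed appliance endpoint
  )

  details = oci.core.models.LaunchInstanceDetails(
      availability_domain="AD-1",                      # the appliance is a single AD
      fault_domain="FAULT-DOMAIN-1",                   # optional: one of the three fault domains
      compartment_id="ocid1.compartment.....example",  # placeholder OCID
      display_name="demo-instance",
      shape="VM.PCAStandard1.2",                       # example shape name; use one from your shape list
      source_details=oci.core.models.InstanceSourceViaImageDetails(
          image_id="ocid1.image.....example",          # preconfigured operating system image
      ),
      create_vnic_details=oci.core.models.CreateVnicDetails(
          subnet_id="ocid1.subnet.....example",        # subnet in an existing VCN
          assign_public_ip=False,
      ),
  )

  instance = compute.launch_instance(details).data
  print(instance.id, instance.lifecycle_state)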

The Service Enclave is the part of the system where the appliance infrastructure is controlled. Access is closely monitored and restricted to privileged administrators. It runs on a cluster of three management nodes. Because Private Cloud Appliance is operationally disconnected from Oracle Cloud Infrastructure, it needs a control plane of its own, specific to the design and scale of the appliance. Its API is likewise specific to Private Cloud Appliance, and access to it is strictly controlled. Functionality provided by the Service Enclave includes hardware and capacity management, service delivery, monitoring, and tools for service and support.

Both enclaves are strictly isolated from each other. Each enclave provides its own set of interfaces: a web UI, a CLI and an API per enclave. An administrator account with full access to the Service Enclave has no permissions whatsoever in the Compute Enclave. The administrator creates the tenancy with a primary user account for initial access, but has no information about the tenancy contents and activity. Users of the Compute Enclave are granted permission to use, manage and create cloud resources, but they have no control over the tenancy they work in, or the hardware on which their virtual resources reside.

Access Profiles

Each enclave has its own interfaces. To access the Compute Enclave, you use either the Compute Web UI or OCI CLI. To access the Service Enclave, you use either the Service Web UI or Service CLI.

Note:

You access the graphical interfaces of both enclaves using a web browser. For support information, please refer to the Oracle software web browser support policy.

The properties of your account determine which operations you are authorized to perform and which resources you can view, manage or create. Whether you use the web UI or the CLI makes no difference in terms of permissions. All operations result in requests to a third, central interface of the enclave: the API. Incoming requests to the Service API or Compute API are evaluated and subsequently authorized or rejected by the API service.

Different categories of users interact with the appliance for different purposes. At the enclave level we distinguish between administrators of the appliance infrastructure on the one hand, and users managing cloud resources within tenancies on the other hand. Within each enclave, different access profiles exist that offer different permissions.

In the Service Enclave, only a select team of administrators should be granted full access. There are other administrator roles with restricted access, for example for those responsible specifically for system monitoring, capacity planning, availability, upgrade, and so on. For more information about administrator roles, see Administrator Access. Whenever Oracle accesses the Service Enclave to perform service and support operations, an account with full access must be used.

When a tenancy is created, it has only one Compute Enclave user account: the tenancy administrator, who has full access to all resources in the tenancy. Practically speaking, every additional account with access to the tenancy is a regular Compute Enclave user account, with more or less restrictive permissions depending on group membership and policy definitions. It is the task of the tenancy administrator to set up additional user accounts and user groups, define a resource organization and management strategy, and create policies to apply that strategy.

Once a resource management strategy has been defined and a basic configuration of users, groups and compartments exists, the tenancy administrator can delegate responsibilities to other users with elevated privileges. You could decide to use a simple policy allowing a group of administrators to manage resources for the entire organization, or you might prefer a more granular approach. For example, you can organize resources in a compartment per team or project and let a compartment administrator manage them. In addition, you might want to keep network resources and storage resources contained within their own separate compartments, controlled respectively by a network administrator and storage administrator. The policy framework of the Identity and Access Management service offers many different options to organize resources and control access to them. For more information, refer to the chapter Identity and Access Management Overview.
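
As a sketch of this delegation model, the following example creates a policy that lets dedicated administrator groups manage the network and storage compartments. Group names, compartment names and the endpoint are hypothetical; the policy statement syntax is the same as in Oracle Cloud Infrastructure.

  # Minimal sketch: a tenancy administrator delegating network and storage
  # administration through an IAM policy. Names and endpoint are placeholders.
  import oci

  config = oci.config.from_file()
  identity = oci.identity.IdentityClient(
      config,
      service_endpoint="https://identity.mypca.example.com",  # assumed appliance endpoint
  )

  policy_details = oci.identity.models.CreatePolicyDetails(
      compartment_id=config["tenancy"],  # attach the policy at the tenancy level
      name="delegated-admin-policy",
      description="Delegate network and storage administration to dedicated groups",
      statements=[
          "Allow group NetworkAdmins to manage virtual-network-family in compartment Networking",
          "Allow group StorageAdmins to manage volume-family in compartment Storage",
      ],
  )

  policy = identity.create_policy(policy_details).data
  print(policy.id, policy.lifecycle_state)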

When creating scripts or automation tools to interact directly with the API, make sure that the developers understand the authentication and authorization principles and the strict separation of the enclaves. Basic API reference documentation is available for both enclaves.

To view the API reference, append /api-reference to the base URL of the Compute Web UI or Service Web UI. For example:

  • Service Web UI base URL: https://adminconsole.myprivatecloud.example.com

    Service Enclave API reference: https://adminconsole.myprivatecloud.example.com/api-reference

  • Compute Web UI base URL: https://console.myprivatecloud.example.com

    Compute Enclave API reference: https://console.myprivatecloud.example.com/api-reference
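
For scripts that call the API directly rather than through the SDK clients, each request must be signed with the caller's API key. The sketch below uses the SDK's request signer together with the requests library; the base URL and OCID are hypothetical placeholders, and the resource path follows the Oracle Cloud Infrastructure Core API convention (consult the API reference above for the exact paths).

  # Minimal sketch: calling the Compute Enclave API directly with a signed
  # request, the way an automation script might. Base URL and OCID are
  # placeholders; consult the API reference above for the exact paths.
  import oci
  import requests

  config = oci.config.from_file()
  signer = oci.signer.Signer(
      tenancy=config["tenancy"],
      user=config["user"],
      fingerprint=config["fingerprint"],
      private_key_file_location=config["key_file"],
  )

  # List compute instances in a compartment.
  url = "https://iaas.mypca.example.com/20160918/instances"
  response = requests.get(
      url,
      params={"compartmentId": "ocid1.compartment.....example"},
      auth=signer,  # the signer adds the required Authorization header
  )
  response.raise_for_status()
  for instance in response.json():
      print(instance["displayName"], instance["lifecycleState"])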

Layered Architecture

The architecture of Oracle Private Cloud Appliance takes a layered approach. At the foundation are the hardware components, on which the core platform is built. This, in turn, provides a framework for administrative and operational services exposed to different user groups. The layers are integrated but not monolithic: they can be further developed at different rates as long as they maintain compatibility. For instance, supporting a new type of server hardware or extending storage functionality are enhancements that can be applied separately, and without redeploying the entire controller software stack.

Hardware Layer

The hardware layer contains all physical system components and their firmware and operating systems.

  • The three management nodes form a cluster that runs the base environment for the controller software.

  • The compute nodes provide the processing capacity to host compute instances.

  • The storage appliance provides disk space for storage resources used by compute instances. It also provides the storage space required by the appliance internally for its operation.

  • The network switches provide the physical connections between all components and the uplinks to the public (data center) network.

Platform Layer

Private Cloud Appliance uses a service-based deployment model. The product is divided into functional areas that run as services, each within its own container. The platform provides the base for this model. Leveraging the capabilities of Oracle Cloud Native Environment, the management node cluster orchestrates the deployment of the containerized services, for which it also hosts the image registry.

In addition, the platform offers a number of fundamental services of its own that are required by all other services: message transport, secrets management, database access, logging, monitoring, and so on. These fundamental services are standardized so that all services deployed on top of the platform can plug into them in the same way, which makes new service integrations easier and faster.

The platform also plays a central role in hardware administration, managing the data exchange between the hardware layer and the services. Information about the hardware layer, and any changes made to it, must be communicated to the services layer to keep the inventory up-to-date. When operations are performed at the service layer, an interface is required to pass down commands to the hardware. For this purpose, the platform has a tightly secured API that is only exposed internally and requires the highest privileges. This API interacts with management interfaces such as the server ILOMs and the storage controllers, as well as the inventory database and container orchestration tools.

For additional information about this layer in the appliance architecture, refer to Platform Layer Overview.

Infrastructure Services Layer

This layer contains all the services deployed on top of the platform. They form two functionally distinct groups: user-level cloud services and administrative services.

Cloud services offer functionality to users of the cloud environment, and are very similar in operation to the corresponding Oracle Cloud Infrastructure services. They constitute the Compute Enclave, and enable the deployment of customer workloads through compute instances and associated resources. Cloud services include the compute and storage services, identity and access management, and networking.

The administrative services are either internal or restricted to administrators of the appliance. These enable the operation of the cloud services and provide support for them. They constitute the Service Enclave. Administrator operations include system initialization, compute node provisioning, capacity expansion, tenancy management, upgrade, and so on. These operations have no externalized equivalent in Oracle Cloud Infrastructure, where Oracle fulfills the role of the infrastructure administrator.

Platform Layer Overview

In the Oracle Private Cloud Appliance architecture, the platform layer is the part that provides a standardized infrastructure for on-premises cloud services. It controls the hardware layer, enables the services layer, and establishes a secure interface that allows those layers to interact in a uniform and centralized manner. Common components and features of the infrastructure services layer are built into the platform, thus simplifying and accelerating the deployment of those microservices.

Fundamental Services

To fulfill its core role of providing the base infrastructure to deploy on-premises cloud services, the platform relies on a set of fundamental internal services of its own. This section describes their function.

Hardware Management

When a system is initialized, low level platform components orchestrate the provisioning of the management node cluster and the compute nodes. During this process, all nodes, including the controllers of the ZFS Storage Appliance, are connected to the required administration and data networks. When additional compute nodes are installed at a later stage, the same provisioning mechanism integrates the new node into the global system configuration. Additional disk trays are also automatically integrated by the storage controllers.

The first step in managing the hardware is to create an inventory of the rack's components. The inventory is a separate database that contains specifications and configuration details for the components installed in the rack. It maintains a history of all components that were ever presented to the system, and is updated continuously with the latest information captured from the active system components.

The services layer and several system components need the hardware inventory details so they can interact with the hardware. For example, a component upgrade or service deployment process needs to send instructions to the hardware layer and receive responses. Similarly, when you create a compute instance, a series of operations needs to be performed at the level of the compute nodes, network components and storage appliance to bring up the instance and its associated network and storage resources.

All the instructions intended for the hardware layer are centralized in a hardware management service, which acts as a gateway to the hardware layer. The hardware management service uses the dedicated and highly secured platform API to execute the required commands on the hardware components: server ILOMs, ZFS storage controllers, and so on. This API runs directly on the management node operating system. It is separated from the container orchestration environment where microservices are deployed.

Service Deployment

Oracle Private Cloud Appliance follows a granular, service-based development model. Functionality is logically divided into separate microservices, which exist across the architectural layers and represent a vertical view of the system. Services have internal as well as externalized functions, and they interact with each other in different layers.

These microservices are deployed in Kubernetes containers. The container runtime environment as well as the registry are hosted on the three-node management cluster. Oracle Cloud Native Environment provides the basis for container orchestration, which includes the automated deployment, configuration and startup of the microservices containers. By design, all microservices consist of multiple instances spread across different Kubernetes nodes and pods. Besides high availability, the Kubernetes design also offers load balancing between the instances of each microservice.

Containerization simplifies service upgrades and functional enhancements. The services are tightly integrated but not monolithic, allowing individual upgrades on condition that compatibility requirements are respected. A new version of a microservice is published to the container registry and automatically propagated to the Kubernetes nodes and pods.

Common Service Components

Some components and operational mechanisms are required by many or all services, so it is more efficient to build them into the platform and let the services consume them when they are deployed. These common components add a set of essential features to each service built on top of the platform, thus simplifying service development and deployment.

  • Message Transport

    All components and services are connected to a common transport layer. It is a message broker that allows components to send and receive messages written in a standardized format. This message transport service is deployed as a cluster of three instances for high availability and throughput, and uses TLS for authentication and traffic encryption.

  • Secret Service

    Secrets used programmatically throughout the system, such as login credentials and certificates, are centrally managed by the secret service. All components and services are clients of the secret service: after successful authentication the client receives a token for use with every operation it attempts to execute. Policies defined within the secret service determine which operations a client is authorized to perform. Secrets are not stored in a static way; they have a limited lifespan and are dynamically created and managed.

    During system initialization, the secret service is unsealed and prepared for use. It is deployed as an active/standby cluster on the management nodes, within a container at the platform layer, but outside of the Kubernetes microservices environment. This allows the secret service to offer its functionality to the platform layer at startup, before the microservices are available. All platform components and microservices must establish their trust relationship with the secret service before they are authorized to execute any operations.

  • Logging

    The platform provides unified logging across the entire system. For this purpose, all services and components integrate with the Fluentd data collector. Fluentd collects data from a pre-configured set of log files and stores it in a central location. Logs are captured from system components, the platform layer and the microservices environment, and made available through the Loki log aggregation system for traceability and analysis.

  • Monitoring

    For monitoring purposes, the platform relies on Prometheus to collect metric data. Since Prometheus is deployed inside the Kubernetes environment, it has direct access to the microservices metrics. Components outside Kubernetes, such as hardware components and compute instances, provide their metric data to Prometheus through the internal network and the load balancer. The management nodes and Kubernetes itself can communicate directly with Prometheus.

  • Analytics

    Logging and monitoring data are intended for infrastructure administrators. They can consult the data through the Service Web UI, where a number of built-in queries for health and performance parameters are visualized on a dashboard. Alerts are sent when a key threshold is exceeded, so that appropriate countermeasures can be taken.

  • Database

    All services and components store data in a common, central database. It is a MySQL cluster database with instances deployed across the three management nodes and running on bare metal. Availability, load balancing, data synchronization and clustering are all controlled by internal components of the MySQL cluster. For optimum performance, data storage is provided by LUNs on the ZFS Storage Appliance, directly attached to each of the management nodes. Access to the database is strictly controlled by the secret service.

  • Load Balancing

    The management nodes form a cluster of three active nodes, meaning they are all capable of simultaneously receiving inbound connections. The ingress traffic is controlled by a statically configured load balancer that listens on a floating IP address and distributes traffic across the three management nodes. An instance of the load balancer runs on each of the management nodes.

    In a similar way, all containerized microservices run as multiple pods within the container orchestration environment on the management node cluster. Kubernetes provides the load balancing for the ingress traffic to the microservices.

Physical Resource Allocation

When users deploy compute instances – or virtual machines – they consume physical resources provided by the hardware layer. The hypervisor manages the allocation of those physical resources based on algorithms that enable the best performance possible for a given configuration.

The compute nodes in Private Cloud Appliance have a Non-Uniform Memory Access (NUMA) architecture, meaning each CPU has access not only to its own local memory but also to the memory of the other CPUs. Each CPU socket and its associated local memory banks are called a NUMA node. Local memory access, within the same NUMA node, always provides higher bandwidth and lower latency than access to the memory of another NUMA node.

In general, the memory sharing design of NUMA helps with the scaling of multiprocessor workloads, but it may also adversely affect virtual machine performance if its resources are distributed across multiple NUMA nodes. Hypervisor policies ensure that a virtual machine's CPU and memory reside on the same NUMA node whenever possible. Using specific CPU pinning techniques, each virtual machine is pinned to one or multiple CPU cores so that memory is accessed locally on the NUMA node where the virtual machine is running. Cross-node memory transports are avoided or kept at a minimum, so the users of compute instances benefit from optimal performance across the entire system.

If a virtual machine can fit into a single NUMA node on a hypervisor host, then this NUMA node is used for pinning, referred to as strict pinning. If a virtual machine cannot fit into a single NUMA node on a hypervisor host, but can fit into multiple NUMA nodes, then multiple NUMA nodes are used for pinning, referred to as loose pinning.

Caution:

The CPU pinning applied through the hypervisor to optimize CPU and memory allocation to compute instances is not configurable by an appliance administrator, tenancy administrator or instance owner.
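
The decision between strict and loose pinning can be illustrated with the following conceptual sketch. It is not appliance code, and the NUMA node capacities are hypothetical; it only shows the placement logic described above.

  # Conceptual sketch (not appliance code) of the strict-versus-loose pinning
  # decision described above. NUMA node capacities are hypothetical examples.
  from dataclasses import dataclass

  @dataclass
  class NumaNode:
      free_cpus: int
      free_memory_gb: int

  def choose_pinning(vm_cpus, vm_memory_gb, nodes):
      # Strict pinning: the whole virtual machine fits inside one NUMA node.
      for node in nodes:
          if vm_cpus <= node.free_cpus and vm_memory_gb <= node.free_memory_gb:
              return "strict"
      # Loose pinning: the virtual machine only fits when spread across nodes.
      if (vm_cpus <= sum(n.free_cpus for n in nodes)
              and vm_memory_gb <= sum(n.free_memory_gb for n in nodes)):
          return "loose"
      return "does not fit on this host"

  # Example host with two NUMA nodes of 32 free cores and 256 GB free memory each.
  host = [NumaNode(32, 256), NumaNode(32, 256)]
  print(choose_pinning(16, 128, host))  # strict: fits in a single NUMA node
  print(choose_pinning(48, 384, host))  # loose: requires both NUMA nodes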

Private Cloud Appliance detects the NUMA topology of compute nodes during their provisioning (or any upgrade path from one release to another) and stores this information for use during the deployment of compute instances. The NUMA details of each compute instance are stored in the instance configuration. The NUMA settings are preserved when compute instances are migrated to another host compute node, but they might be overridden and dynamically adjusted if the target compute node is unable to accommodate that particular configuration.

Compute instances running on a Private Cloud Appliance that is not yet NUMA-aware can take advantage of the optimization policies as soon as the system has been patched or upgraded. There is no action required from the administrator to align instances with the NUMA topology; the existing instance configurations are made compatible as part of the process.

High Availability

Oracle Engineered Systems are built to eliminate single points of failure, allowing the system and hosted workloads to remain operational in case of hardware or software faults, as well as during upgrades and maintenance operations. Private Cloud Appliance has redundancy built into its architecture at every level: hardware, controller software, master database, services, and so on. Features such as backup, automated service requests and optional disaster recovery further enhance the system's serviceability and continuity of service.

Hardware Redundancy

The minimum base rack configuration contains redundant networking, storage and server components to ensure that failure of any single element does not affect overall system availability.

Data connectivity throughout the system is built on redundant pairs of leaf and spine switches. Link aggregation is configured on all interfaces: switch ports, host NICs and uplinks. The leaf switches interconnect the rack components using cross-cabling to redundant network interfaces in each component. Each leaf switch also has a connection to each of the spine switches, which are also interconnected. The spine switches form the backbone of the network and enable traffic external to the rack. Their uplinks to the data center network consist of two cable pairs, which are cross-connected to two redundant ToR (top-of-rack) switches.

The management cluster, which runs the controller software and system-level services, consists of three fully active management nodes. Inbound requests pass through the virtual IP of the management node cluster, and are distributed across the three nodes by a load balancer. If one of the nodes stops responding and is fenced from the cluster, the load balancer continues to send traffic to the two remaining nodes until the failing node is healthy again and rejoins the cluster.

Storage for the system as well as for the cloud resources in the environment is provided by the internal ZFS Storage Appliance. Its two controllers form an active-active cluster, providing high availability and excellent throughput at the same time. The ZFS pools are built on disks in a mirrored configuration for optimum data protection. This applies to the standard high-capacity disk tray as well as an optional SSD-based high-performance tray.

System Availability

The appliance controller software and services layer are deployed on the three-node management cluster, and take advantage of the high availability that is inherent to the cluster design. The Kubernetes container orchestration environment also uses clustering for both its own controller nodes and the service pods it hosts. Multiple replicas of the microservices are running at any given time. Nodes and pods are distributed across the management nodes, and Kubernetes ensures that failing pods are replaced with new instances to keep all services running in an active/active setup.

All services and components store data in a common, central database. It is a MySQL cluster database with instances deployed across the three management nodes. Availability, load balancing, data synchronization and clustering are all controlled by internal components of the MySQL cluster.

A significant part of the system-level infrastructure networking is software-defined, just like all the virtual networking at the VCN and instance level. The configuration of virtual switches, routers and gateways is not stored and managed by the switches, but is distributed across several components of the network architecture. The network controller is deployed as a highly available containerized service.

The upgrade framework leverages the hardware redundancy and the clustered designs to provide rolling upgrades for all components. In essence, during the upgrade of one component instance, the remaining instances ensure that there is no downtime. The upgrade is complete when all component instances have been upgraded and returned to normal operation.

Compute Instance Availability

At the level of a compute instance, high availability refers to the automated recovery of an instance in case the underlying infrastructure fails. The state of the compute nodes, hypervisors and compute instances is monitored continually; each compute node is polled at a 5-minute interval. When compute instances go down, the system takes measures to recover them automatically.

If an individual compute instance crashes due to internal issues, the hypervisor attempts to restart the instance on the same compute node.

If a compute node goes down because of an unplanned reboot, its instances are restarted on the same host compute node once the compute node successfully returns to normal operation. At the next polling interval, if instances are found that should be running but are in a different state, the start command is issued again. If any instances have crashed and remain in that state, the hypervisor attempts to restart them up to 5 times. Instances that were not running before the compute node became unavailable remain shut down when the compute node is up and running again.

Note:

When a compute node becomes unavailable, the compute instances that were running on it keep that state in the Compute Web UI. When the compute node has rebooted and the instance recovery operations are executed, their state changes to "shut down" and eventually back to "running" once the instances are available again.
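
The recovery behavior described above can be summarized in the following conceptual sketch. It is not appliance code; the states and function names are illustrative only.

  # Conceptual sketch (not appliance code) of the per-poll recovery policy:
  # instances that should be running are started again, and a crashed instance
  # is retried up to 5 times. States and names are illustrative only.
  MAX_RESTART_ATTEMPTS = 5

  def start_instance(name):
      print(f"issuing start command for {name}")  # stand-in for the hypervisor call

  def reconcile(instances, restart_attempts):
      """Runs once per polling interval for a compute node that is back online."""
      for inst in instances:
          if not inst["should_be_running"]:
              continue  # instances that were already stopped stay shut down
          if inst["state"] == "running":
              restart_attempts.pop(inst["name"], None)  # healthy again, reset the counter
          elif inst["state"] == "crashed":
              attempts = restart_attempts.get(inst["name"], 0)
              if attempts < MAX_RESTART_ATTEMPTS:
                  restart_attempts[inst["name"]] = attempts + 1
                  start_instance(inst["name"])
          else:
              # for example "shut down" after the node rebooted: issue the start command again
              start_instance(inst["name"])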

In case of planned maintenance, the administrator must first disable provisioning for the compute node in question by applying a provisioning lock. When the compute node is under a provisioning lock, the administrator can live-migrate all running compute instances to another compute node. If no target compute node is available in the same fault domain, instances may be migrated to another fault domain. If live migration fails, the administrator must manually shut down the instance and restart it on a different compute node. A maintenance lock can only be applied when there are no more running instances on the compute node. While in maintenance mode, all compute instance operations on the compute node are disabled, and the compute node cannot be provisioned or deprovisioned.

Continuity of Service

Private Cloud Appliance offers several features that support and further enhance high availability. Health monitoring at all levels of the system is a key factor. Diagnostic and performance data is collected from all components, then centrally stored and processed, and made available to administrators in the form of visualizations on standard dashboards. In addition, alerts are generated when metrics exceed their defined thresholds.

Monitoring allows an administrator to track system health over time, take preventative measures when required, and respond to issues when they occur. In addition, systems registered with My Oracle Support provide phone-home capabilities, using the collected diagnostic data for fault monitoring and targeted proactive support. Registered systems can also submit automated service requests with Oracle for specific problems reported by the appliance.

To mitigate data loss and support the recovery of system and services configuration in case of failure, consistent and complete backups are made regularly. A backup can also be executed manually, for example to create a restore point just before a critical modification. The backups are stored in a dedicated NFS share on the ZFS Storage Appliance, and allow the entire Service Enclave to be restored when necessary.

Optionally, workloads deployed on the appliance can be protected against downtime and data loss through the implementation of disaster recovery. To achieve this, two Private Cloud Appliance systems need to be set up at different sites, and configured to be each other's replica. Resources under disaster recovery control are stored separately on the ZFS Storage Appliances in each system, and replicated between the two. When an incident occurs at one site, the environment is brought up on the replica system with practically no downtime. Oracle recommends that disaster recovery is implemented for all critical production systems.