3 Appliance Administration Overview

The appliance infrastructure is controlled from a system area that is securely isolated from the workspace where cloud resources are created and managed. This administration area is called the Service Enclave, and is available only to privileged administrators. This chapter describes how administrators access the Service Enclave and which features and functions are available to configure the appliance and keep it in optimum operating condition.

Administrator Access

An appliance administrator is a highly privileged user with access to the physical components of Oracle Private Cloud Appliance. There is no functional relationship between an appliance administrator account and a tenancy administrator account; these are entirely separate entities. While appliance administrators may be authorized to create and delete tenancies, their account does not grant any permission to access a tenancy or use its resources. An appliance administrator has no access whatsoever to user data or instances.

Access to the administrative functionality is provided through separate interfaces: a Service Web UI, a Service CLI and a Service API, which are all highly restricted. Administrative functionality includes hardware management, tenancy management, system and component upgrade, system backup and restore, monitoring, and so on. Infrastructure administrators have one or more of the roles described below:

SuperAdmin

Administrators with the SuperAdmin role have unrestricted access to the Service Enclave. They are authorized to perform all available operations, including the setup of other administrator accounts and management of authorization groups (admin roles).

Admin

The Admin role grants permission to list, create, modify and delete practically all supported object types. Permissions excluded from this role are: administrator account and authorization group management, and disaster recovery operations.

Monitor

Administrators with a Monitor role are authorized to execute read-only commands. For example, using the get API calls, they can list and filter for objects of a certain type.

Some objects related to specific features, such as the disaster recovery items, are excluded because they require additional privileges.

DR Admin

The DrAdmin role grants the same permissions as the Admin role, with the addition of all operations related to disaster recovery.

Day Zero Config

The Day0Config role only provides specific access to operations related to the initial setup of the appliance – a process also referred to as the "day zero configuration".

The state of the system determines which operations an administrator with this role is allowed to perform. For example, when the system is ready for the primary administrator account to be created, only that specific command is available. Then, when the system is ready to register system initialization data, only the commands to set those parameters are available.

Internal

This role is reserved for internal system use.

Appliance administrator accounts can be created locally, but Private Cloud Appliance also supports federating with an existing identity provider, so that users can log in with their existing ID and password. User groups from the identity provider must be mapped to the appliance administrator groups, to ensure that administrator roles are assigned correctly to each authorized account.

A single federated identity provider is supported for appliance administrator accounts. The process of establishing a federation trust with the identity provider is the same as for identity federation at the tenancy level. This is described in the chapter Identity and Access Management Overview. Refer to the section Federating with Identity Providers.

Status and Health Monitoring

The overall health status of the system is continually monitored, using real-time data from the hardware and platform layers. The system health checks and monitoring data are the foundation of problem detection. When an unhealthy condition is found, administrators use this information to begin troubleshooting. If necessary, they register a service request with Oracle for assistance in resolving the problem. If the system's built-in Oracle Auto Service Request client has been configured, it can generate a service request automatically, depending on the nature and severity of the issue.

Monitoring

Independently of the built-in health checks, an administrator can consult the monitoring data at any time to verify the overall status of the system or the condition of a particular component or service. This is done through the Grafana interface, by querying the system-wide metric data stored in Prometheus.

Grafana provides a visual approach to monitoring: it allows you to create dashboards composed of a number of visualization panels. Each panel corresponds with a single metric query or a combination of metric queries, displayed in the selected format. Options include graphs, tables, charts, diagrams, gauges, and so on. For each metric panel, thresholds can be defined. When the query result exceeds or drops below a given threshold, the display color changes, providing a quick indication of which elements are healthy, require investigation, or are malfunctioning.
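
For administrators who prefer to consult this data programmatically, Prometheus also exposes an HTTP query API. The sketch below is a minimal example, assuming a reachable Prometheus endpoint; the URL is a placeholder and the query shown (scrape targets that are currently down) is only one example of a useful expression. Grafana remains the primary interface for day-to-day monitoring.

    """Minimal sketch: run an instant query against the Prometheus HTTP API."""
    import requests

    PROMETHEUS_URL = "https://prometheus.example.internal"  # placeholder endpoint

    def query_instant(expression):
        """Return the result vector of a PromQL instant query (/api/v1/query)."""
        response = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": expression},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["data"]["result"]

    # Example expression: scrape targets that are currently down, grouped by job.
    for sample in query_instant("count by (job) (up == 0)"):
        print(sample["metric"].get("job", "unknown"), sample["value"][1])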

Oracle provides a set of pre-defined dashboards that allow administrators to start monitoring the system as soon as it is up and running. The default monitoring dashboards are grouped into the following categories:

Service Advisor

Appliance-specific collection of dashboards for monitoring the Kubernetes container orchestration environment, the containerized services it hosts, and the system health check services.

Service Level Monitoring

A read-only collection of dashboards that provide statistical data for all the microservices.

Kubernetes Monitoring

An additional collection of dashboards provided by Oracle's cloud native monitoring and visualization experts. These provide extensive, detailed information about the Kubernetes cluster and its services.

The default dashboards contain metric data for the system's physical components (servers, switches, storage providers, and their operating systems and firmware) as well as its logical components (controller software, platform, Kubernetes cluster and microservices, compute instances and their virtualized resources). This allows the administrator or support engineer to verify the health status of both component categories independently, and to find correlations between them. For example, a particular microservice might exhibit poor performance due to a lack of available memory. The monitoring data indicates whether this is a symptom of a system configuration issue, a lack of physical resources, or a hardware failure. The monitoring system has an alerting service capable of detecting and reporting hardware faults. The administrator may optionally configure a notification channel to receive alerts based on rules defined in the monitoring system.

As part of service and support operations, Oracle may ask you to report specific metric data displayed in the default dashboards. For this reason, the default dashboard configurations should always be preserved. However, if some of the monitoring functionality is inadvertently modified or broken, the defaults can be restored. In a similar way, it is possible for Oracle to create new dashboards or modify existing ones for improved monitoring, and push them to your operational environment without the need for a formal upgrade procedure.

All the open source monitoring and logging tools described here have public APIs that allow customers to integrate with their existing health monitoring and alerting systems. However, Oracle does not provide support for such custom configurations.

Fault Domain Observability

The Fault Domain is a key concept for keeping the appliance infrastructure, the compute instances, and their related resources running in a healthy state. A Fault Domain groups a set of infrastructure components with the goal of isolating downtime events caused by failures or maintenance, making sure that resources in other Fault Domains are not affected.

In line with Oracle Cloud Infrastructure, there are always three Fault Domains in a Private Cloud Appliance. Each of its Fault Domains corresponds with one or more physical compute nodes. Apart from using Grafana to consult monitoring data across the entire system, an administrator can also access key capacity metrics for Fault Domains directly from the Service Enclave:

  • Number of compute nodes per Fault Domain

  • Total and available amount of RAM per Fault Domain

  • Total and available number of vCPUs per Fault Domain

  • Unassigned system CPU and RAM capacity

The Fault Domain metrics reflect the actual physical resources that can be consumed by compute instances hosted on the compute nodes. The totals do not include resources reserved for the operation of the hypervisor: 40 GB of RAM and 4 CPU cores (8 vCPUs).

In addition to the three Fault Domains, the Service CLI displays an "Unassigned" category. It refers to installed compute nodes that have not been provisioned, and thus are not part of a Fault Domain yet. For unassigned compute nodes the memory capacity cannot be calculated, but the CPU metrics are displayed.
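
As a worked example of this capacity arithmetic, the sketch below uses invented hardware figures (not values from a real system) and assumes the hypervisor reservation applies to each compute node.

    # Worked example of the Fault Domain capacity arithmetic described above.
    # The hardware figures are invented; they are not taken from a real system.

    HYPERVISOR_RAM_GB = 40  # RAM assumed to be reserved per compute node
    HYPERVISOR_VCPUS = 8    # 4 CPU cores (8 vCPUs) assumed reserved per compute node

    def usable_capacity(nodes):
        """Return (usable RAM in GB, usable vCPUs) for a list of compute nodes.

        Each node is a dict with its installed 'ram_gb' and 'vcpus'.
        """
        ram = sum(node["ram_gb"] - HYPERVISOR_RAM_GB for node in nodes)
        vcpus = sum(node["vcpus"] - HYPERVISOR_VCPUS for node in nodes)
        return ram, vcpus

    # A hypothetical Fault Domain with three identical compute nodes.
    fault_domain = [{"ram_gb": 1024, "vcpus": 128} for _ in range(3)]
    print(usable_capacity(fault_domain))  # (2952, 360)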

System Health Checks

Health checks are the most basic form of detection. They run at regular intervals as Kubernetes CronJob services, which are very similar to regular UNIX cron jobs. A status entry is created for every health check result, which is always one of two possibilities: healthy or not healthy. All status information is stored for further processing in Prometheus; the unhealthy results also generate log entries in Loki with details to help advance the troubleshooting process.

Health checks are meant to verify the status of specific system components, and to detect status changes. Each health check process follows the same basic principle: to record the current condition and compare it to the expected result. If they match, the health check passes; if they differ, the health check fails. A status change from healthy to not healthy indicates that troubleshooting is required.
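
In code, the principle can be reduced to a few lines. The sketch below is purely illustrative and is not the appliance's health check implementation; the check name and the expected condition are hypothetical.

    from datetime import datetime, timezone

    EXPECTED_STATE = "Ready"  # hypothetical expected condition

    def run_check(observed_state):
        """Compare the observed condition with the expected result and record the outcome."""
        healthy = observed_state == EXPECTED_STATE
        return {
            "check": "example-checker",  # hypothetical check name
            "time": datetime.now(timezone.utc).isoformat(),
            "status": "healthy" if healthy else "not healthy",
            "observed": observed_state,
            "expected": EXPECTED_STATE,
        }

    print(run_check("Ready"))     # healthy
    print(run_check("NotReady"))  # not healthy: would also produce a detailed log entry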

For the purpose of troubleshooting, there are two principal data sources at your disposal: logs and metrics. Both categories of data are collected from all over the system and stored in a central location: logs are aggregated in Loki and metrics in Prometheus. Both tools have a query interface that allows you to filter and visualize the data: they both integrate with Grafana. Its browser-based interface can be accessed from the Service Web UI.

To investigate what causes a health check to fail, it helps to filter logs and metric data based on the type of failure. Loki categorizes data with a labeling system, displaying log messages that match the selected log label. Select a label from the list to view the logs for the service or application you are interested in. This list allows you to select not only the health checks but also the internal and external appliance services.
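
As an illustration of label-based filtering, the sketch below retrieves recent log lines from Loki over its HTTP API. The endpoint URL is a placeholder, and the label selector is an assumption; the label names available on your system depend on how the services are registered in Loki.

    """Minimal sketch: filter recent logs in Loki by label over its HTTP API."""
    import time

    import requests

    LOKI_URL = "https://loki.example.internal"  # placeholder endpoint

    def recent_logs(selector, minutes=15, limit=100):
        """Yield (timestamp, line) pairs matching a LogQL label selector."""
        now_ns = int(time.time() * 1e9)
        response = requests.get(
            f"{LOKI_URL}/loki/api/v1/query_range",
            params={
                "query": selector,
                "start": now_ns - minutes * 60 * 10**9,
                "end": now_ns,
                "limit": limit,
            },
            timeout=30,
        )
        response.raise_for_status()
        for stream in response.json()["data"]["result"]:
            for timestamp, line in stream["values"]:
                yield timestamp, line

    # Assumed selector: lines from the flannel-checker job that mention a failure.
    for ts, line in recent_logs('{job="flannel-checker"} |= "fail"'):
        print(ts, line)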

In addition, the latest status from each health check is displayed in the Platform Health Check dashboard, which is part of the Service Advisor dashboard set provided by default in Grafana.

Private Cloud Appliance runs the health checks listed below.

cert-checker

Verifies on each management node that no certificates have expired.

flannel-checker

Verifies that the Flannel container network service is fully operational on each Kubernetes node.

kubernetes-checker

Verifies the health status of Kubernetes nodes and pods, as well as the containerized services and their connection endpoints.

mysql-cluster-checker

Verifies the health status of the MySQL cluster database.

l0-cluster-services-checker

Verifies that low-level clustering services and key internal components (such as the platform API and DHCP) in the hardware and platform layer are fully operational.

network-checker

Verifies that the system network configuration is correct.

registry-checker

Verifies that the container registry is fully operational on each management node.

vault-checker

Verifies that the secret service is fully operational on each management node.

etcd-checker

Verifies that the etcd service is fully operational on each management node.

zfssa-analytics-exporter

Reports ZFS Storage Appliance cluster status, active problems, and management path connection information. It also reports analytics information for a configurable list of datasets.

Centralized Logging

The platform provides unified logging across the entire system. The Fluentd data collector retrieves logs from all components and stores them in a central location, along with the appliance telemetry data. As a result, all the necessary troubleshooting and debugging information is maintained in a single data store, and does not need to be collected from different sources when an issue needs to be investigated. The overall health of the system is captured in one view, a Grafana dashboard, meaning there is no need for an administrator to check individual components.

Whenever an issue is found that requires assistance from Oracle, the administrator logs a service request. A support bundle is usually requested as part of that process. Thanks to the centralized logging, the support bundle is straightforward to generate, and remains possible even if system operation is severely compromised. Generating the support bundle is a scripted operation that produces a single compressed archive file. The administrator does need to manually upload the archive file containing the consolidated logs and other diagnostic data.

If the Oracle Auto Service Request client has been configured on your system, it can generate a service request automatically for you. However, the support bundle must still be uploaded manually.

Upgrade

Upgrading components of Private Cloud Appliance is the responsibility of the appliance administrator. The system provides a framework to verify the state of the appliance prior to an upgrade, and to execute an upgrade procedure as a workflow that initiates each individual task and tracks its progress until it completes. Thanks to built-in redundancy at all system levels, the appliance components can be upgraded without service interruptions to the operational environment.

The source content for upgrades – packages, archives, deployment charts and so on – is delivered through an ISO image. The location of the image is a required parameter for upgrade commands. The administrator can perform an upgrade either through the Service Web UI or the Service CLI, and must select one of two available options: individual component upgrade or full management node cluster upgrade.

Pre-Checks

All upgrade operations are preceded by a verification process to ensure that system components are in the correct state to be upgraded. For these pre-checks, the upgrade mechanism relies on the platform-level health checks. Even though health checks are executed continually for monitoring purposes, they must be run specifically before an upgrade operation. The administrator is not required to run the pre-checks manually; they are executed by the upgrade code when an upgrade command is entered. All checks must pass for an upgrade to be allowed to start, even if a single-component upgrade is selected.

Certain upgrade procedures require that the administrator first sets a provisioning lock and maintenance lock. While the locks are active, no provisioning operations or other conflicting activity can occur, meaning the upgrade process is protected against potential disruptions. Once the upgrade has completed, the maintenance and provisioning locks must be released so the system returns to full operational mode.
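
The gating principle can be pictured with a short sketch. This is not the appliance's upgrade framework; it only illustrates the rule that every pre-check must pass before an upgrade is allowed to start, with the lock requirement of certain procedures reduced to simple flags. The check names are taken from the health check table earlier in this chapter.

    # Illustrative sketch of the pre-check gate; not the appliance's upgrade code.

    PRECHECKS = ["cert-checker", "network-checker", "vault-checker", "etcd-checker"]

    def run_precheck(name):
        """Stub: in reality this would consult the latest platform health check result."""
        return True

    def upgrade_allowed(provisioning_locked, maintenance_locked):
        if not (provisioning_locked and maintenance_locked):
            return False  # conflicting activity has not been excluded yet
        failed = [name for name in PRECHECKS if not run_precheck(name)]
        if failed:
            print("Upgrade blocked; failing pre-checks:", ", ".join(failed))
            return False
        return True

    print(upgrade_allowed(provisioning_locked=True, maintenance_locked=True))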

Single Component Upgrade

Private Cloud Appliance upgrades are designed to be modular, allowing individual components to be upgraded rather than the entire system at once. With single component upgrade, the following component options are available:

  • ILOM firmware

    Use this option to upgrade the Oracle Integrated Lights Out Manager (ILOM) firmware of a specific server within the appliance. After the firmware is upgraded successfully, the ILOM is automatically rebooted. However, the administrator must manually restart the server for all changes to take effect.

  • Switch firmware

    Use this option to upgrade the operating software of the switches. You must specify which switch category to upgrade: the leaf switches, the spine switches, or the management switch.

  • ZFS Storage Appliance firmware

    Use this option to upgrade the operating software on the ZFS Storage Appliance. Both controllers, which operate in an active-active cluster configuration, are upgraded as part of the same process.

  • Host operating system

    Use this option to upgrade the Oracle Linux operating system on a management node. It triggers a yum upgrade on the selected management node, and is configured to use a yum repository populated through the ISO image.

  • Clustered MySQL database

    Use this option to upgrade the MySQL database on all management nodes. The database installation is rpm-based and thus relies on the yum repository that is populated through the ISO image. The packages for the database are deliberately kept out of the host operating system upgrade, because the timing of the database upgrade is critical. The database upgrade workflow manages the backup operations and the cluster state, and stops and restarts the relevant services. It ensures all the steps are performed in the correct order on each management node.

  • Kubernetes cluster

    Use this option to upgrade the Kubernetes cluster, which is the container orchestration environment where services are deployed. The Kubernetes cluster runs on all the management nodes and compute nodes; its upgrade involves three major operations:

    • Upgrading the Kubernetes packages and all dependencies: kubeadm, kubelet, kubectl and so on.

    • Upgrading the Kubernetes container images: kube-apiserver, kube-controller-manager, kube-proxy and so on.

    • Updating any deprecated Kubernetes APIs and services YAML manifest files.

  • Secret service

    The process to upgrade the secret service on all management nodes consists of a rolling upgrade of its two main components: the etcd key-value store and the Vault secrets manager. Both are upgraded independently of each other, in no particular order, using the new image files made available in the podman registry.

  • Platform services

    Use this option to upgrade the containerized services running within the Kubernetes cluster on the management nodes. The service upgrade mechanism is based on Helm, the Kubernetes equivalent of a package manager. For services that need to be upgraded, new container images and Helm deployment charts are delivered through an ISO image and uploaded to the internal registry. None of the operations up to this point have an effect on the operational environment.

    At the platform level, an upgrade is triggered by restarting the pods that run the services. The new deployment charts are detected, causing the pods to retrieve the new container image when they restart. If a problem is found, a service can be rolled back to the previous working version of the image. A minimal sketch of this Helm-based flow appears after this list.

  • Compute node

    Use this option to perform a yum upgrade of the Oracle Linux operating system on a compute node. Upgrades include the ovm-agent package, which contains appliance-specific code to optimize virtual machine operations and hypervisor functionality. You must upgrade the compute nodes one by one; there can be no concurrent upgrade operations.
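
To make the platform services mechanism more concrete, the sketch below drives the two Helm operations involved: upgrading a release to a new chart version and rolling it back if that version misbehaves. The release name, chart reference, and version are hypothetical; on the appliance these steps are carried out by the upgrade workflow, not run by hand.

    """Illustrative sketch of a Helm-based service upgrade and rollback."""
    import subprocess

    def helm(*args):
        """Run a helm command and return its output."""
        result = subprocess.run(["helm", *args], check=True, capture_output=True, text=True)
        return result.stdout

    # Upgrade a release to the chart version delivered with the new ISO image.
    helm("upgrade", "example-service",         # hypothetical release name
         "internal-registry/example-service",  # hypothetical chart reference
         "--version", "1.2.3",                 # hypothetical chart version
         "--namespace", "default")

    # If the new version misbehaves, roll back to the previous working revision.
    helm("rollback", "example-service", "--namespace", "default")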

Full Management Node Cluster Upgrade

Upgrades of individual components are largely self-contained. The full management node cluster upgrade integrates a number of those component upgrades into a global workflow that executes the component upgrades in a predefined order. With a single command, all three management nodes in the cluster are upgraded sequentially and component by component. This means an upgrade of a given component is executed on each of the three management nodes before the global workflow moves to the next component upgrade.

The order in which components are upgraded is predefined because of dependencies, and must not be changed. During the full management node cluster upgrade, the following components are upgraded:

  1. Management node host operating system

  2. Clustered MySQL database

  3. Secret service

  4. Kubernetes cluster

  5. Platform services

Patching

Patching refers to the ability to apply security enhancements and functional updates to Private Cloud Appliance independently of regular product releases. Patches are delivered as RPM packages through a series of dedicated channels on the Unbreakable Linux Network (ULN). To gain access to these channels, you need a Customer Support Identifier (CSI) and a ULN subscription.

The appliance is not allowed to connect directly to Oracle's ULN servers. The supported process is to set up a ULN mirror on a system inside the data center. The patch channels are then synchronized on the ULN mirror, where the management nodes can access the RPMs. Compute nodes need access to a subset of the RPMs, which are copied to a designated location on the internal shared storage and kept up-to-date.

Patches are installed using a mechanism similar to upgrade. A key difference is that patching commands include the path to the ULN mirror. The Service CLI provides a separate patch command for each supported component type. In the Service Web UI the administrator applies a patch by creating an upgrade request for a given component type, but selecting the Patch option instead of Upgrade.

ULN patches may be delivered for any of the component types that are part of a traditional product release: operating system updates for management and compute nodes, platform updates, microservices updates, firmware updates, compute image updates, and so on.

For more detailed information and step-by-step instructions, refer to the Oracle Private Cloud Appliance Patching Guide.

Backup and Restore

The integrated Private Cloud Appliance backup service is intended to protect the system configuration against data loss and corruption. It does not create backups of the customer environment, but is instead geared toward storing the data required for system and service operation, so that any crucial service or component can be restored to its last-known healthy state. In line with the microservice-based deployment model, the backup service orchestrates the various backup operations across the entire system and ensures data consistency and integrity, but it does not define the individual component backup requirements. That logic is part of the component backup plugins.

The backup plugin is the key element that determines which files must be backed up for a given system component or service, and how the data must be collected. For example, a simple file copy may work for certain files while a snapshot is required for other data, or in some cases a service may need to be stopped to allow the backup to be created. The plugin also determines the backup frequency and retention time. Each plugin registers with the backup service, so the service is aware of the active plugins and can schedule the required backup operations in a consistent manner as Kubernetes CronJobs. Plugins are aggregated into a backup profile; the backup profile is the task list that the backup service executes when a backup job is launched.
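
Conceptually, a plugin is a small contract between a component and the backup service. The sketch below is an invented rendering of that contract, intended only to mirror the responsibilities described above; it is not the actual plugin interface, and all names and defaults are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class BackupPlugin:
        """Hypothetical rendering of what a component backup plugin declares."""
        component: str               # which component or service is covered
        schedule: str = "0 0 * * *"  # cron expression for the backup frequency
        retention_days: int = 14     # how long the backup data is kept

        def collect(self, destination):
            """How the data is gathered: file copy, snapshot, stop-and-copy, and so on."""
            raise NotImplementedError

    # A hypothetical profile: the task list the backup service walks through.
    profile = [
        BackupPlugin(component="mysql-cluster"),
        BackupPlugin(component="zfssa-configuration"),
    ]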

The backup data collected through the plugins is then stored by the backup service in a dedicated NFS share on the internal ZFS Storage Appliance, using ZFS encryption to ensure that the data at rest is secure. If required, the backup files can optionally be replicated to an external storage location.

When restoring a service or component from a backup, the service again relies on the logic provided by the plugin. A component restore process has two major phases: verification and data management. During the verification phase, the backup is evaluated for completeness and appropriateness in relation to the current condition of the component. Next, during the data management phase, the required actions are taken to stop or suspend a component, replace the data, and restart or resume normal operation. As with backup, the operations to restore the data are specific to the component in question.

The default backup and restore implementation is to execute a global backup profile that covers the MySQL cluster database, the ZFS Storage Appliance configuration, a snapshot of the ZFS projects on the storage appliance, and all registered component backup plugins. The default profile is executed daily at midnight UTC and has a 14-day retention policy. Backups are stored in /nfs/shared_storage/backups/backup_*. All restore operations must be performed manually on a per-component basis.

Note:

In release 3.0.1 of Private Cloud Appliance, the Backup and Restore service is not available through the Service Web UI or Service CLI, which also implies that administrators cannot configure the backup schedule.

Automated restore operations based on the backup plugins are currently not possible. If a manual restore from a backup is required, please contact Oracle for assistance.

Disaster Recovery

The goal of disaster recovery is to provide high availability at the level of an installation site, and to protect critical workloads hosted on a Private Cloud Appliance against outages and data loss. The implementation requires two Private Cloud Appliance systems installed at different sites, and a third system running an Oracle Enterprise Manager installation with Oracle Site Guard.

The two Private Cloud Appliance systems are both fully operational environments on their own, but at the same time configured to be each other's replica. A dedicated network connection between the two peer ZFS Storage Appliances – one in each rack – ensures reliable data replication at 5-minute intervals. When an incident is detected in either environment, the role of Oracle Site Guard is to execute the failover workflows, known as operation plans.

Setting up disaster recovery is the responsibility of an appliance administrator or Oracle engineer. It involves interconnecting all participating systems, and configuring the Oracle Site Guard operation plans and the replication settings on both Private Cloud Appliance systems. The administrator determines which workloads and resources are under disaster recovery control by creating and managing DR configurations through the Service CLI on the two appliances.

The DR configurations are the core elements. The administrator adds critical compute instances to a DR configuration, so that they can be protected against site-level incidents. Storage and network connection information is collected and stored for each instance included in the DR configuration. With the creation of a DR configuration, a dedicated ZFS project is set up for replication to the peer ZFS Storage Appliance, and the compute instance resources involved are moved from the default storage location to this new ZFS project. A DR configuration can be refreshed at any time to pick up changes that might have occurred to the instances it includes.

Next, site mapping details are added to the DR configuration. All relevant compartments and subnets must be mapped to their counterparts on the replica system. A DR configuration cannot work unless the compartment hierarchy and network configuration exist on both Private Cloud Appliance systems.
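
Conceptually, a DR configuration ties a set of protected instances to the mappings that make them recoverable on the replica system. The structure below is an invented illustration of that relationship, not the Service CLI object model; all names and identifiers are placeholders.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class DRConfiguration:
        """Invented illustration of what a DR configuration ties together."""
        name: str
        instances: List[str] = field(default_factory=list)             # protected instance OCIDs
        compartment_map: Dict[str, str] = field(default_factory=dict)  # primary -> replica
        subnet_map: Dict[str, str] = field(default_factory=dict)       # primary -> replica

    config = DRConfiguration(
        name="critical-workloads",                                     # placeholder name
        instances=["ocid1.instance.example"],                          # placeholder OCID
        compartment_map={"ocid1.compartment.prod": "ocid1.compartment.prod-replica"},
        subnet_map={"ocid1.subnet.app": "ocid1.subnet.app-replica"},
    )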

When an incident occurs, failover operations are launched to bring up the instances under disaster recovery control on the replica system. This failover is not granular but a site-wide process involving these steps:

  1. The site-level incident causes running instances to fail abruptly. They cannot be shut down gracefully.

  2. Reversing the roles of the primary and replica ZFS Storage Appliance.

  3. Recovering the affected compute instances of the primary system by starting them on the replica system.

  4. Cleaning up the primary system: removing stopped instances, frozen DR configurations, and so on.

  5. Setting up reverse DR configurations based on the ZFS project and instance metadata.

A failover is the result of a disruptive incident being detected at one of the installation sites. However, Oracle Site Guard also supports switchover, which is effectively the same process but manually triggered by an administrator. In a controlled switchover scenario the first step in the process is to safely stop the running instances on the primary system to avoid data loss or corruption. Switchover is typically used for planned maintenance or testing. After a failover or switchover, when both sites are fully operational, a failback is performed to return the two Private Cloud Appliance systems to their original configuration.

Serviceability

Serviceability refers to the ability to detect, diagnose and correct issues that occur in an operational system. Its primary requirement is the collection of system data: general hardware telemetry details, log files from all system components, and results from system and configuration health checks. For a more detailed description of monitoring, system health and logging, refer to Status and Health Monitoring.

As an engineered system, Private Cloud Appliance is designed to process the collected data in a structured manner. To provide real-time status information, the system collects and stores metric data from all components and services in a unified way using Prometheus. Centrally aggregated data from Prometheus is visualized through metric panels in a Grafana dashboard, which permits an administrator to check overall system status at a glance. Logs are captured across the appliance using Fluentd, and collected in Loki for diagnostic purposes.

When a component status changes from healthy to not healthy, the alerting mechanism can be configured to send notifications to initiate a service workflow. If support from Oracle is required, the first step is for an administrator to open a service request and provide a problem description. However, if the Oracle Auto Service Request client service is activated, and depending on its specific configuration, a service request can be opened automatically in your name. Either way, the administrator is responsible for generating a support bundle and uploading it for inclusion in the service request. For more information about Oracle Auto Service Request, see Status and Health Monitoring in the "Oracle Private Cloud Appliance Administrator Guide".

To resolve the reported issue, Oracle may need access to the appliance infrastructure. For this purpose, a dedicated service account is configured during system initialization. For security reasons, this non-root account has no password. You must generate and provide a service key to allow the engineer to work on the system on your behalf. Activity related to the service account leaves an audit trail and is clearly separated from other user activity.

Most service scenarios for Oracle Engineered Systems are covered by detailed action plans, which are executed by the assigned field engineer. When the issue is resolved, or if a component has been repaired or replaced, the engineer validates that the system is healthy again before the service request is closed.

This structured approach to problem detection, diagnosis and resolution ensures that high-quality service is delivered, with minimum operational impact, delay and cost, and with maximum efficiency.