Status and Health Monitoring

The overall health status of the system is continually monitored, using real-time data from the hardware and platform layers. System health checks and monitoring data are the foundation of problem detection. When an unhealthy condition is found, administrators use this information to begin troubleshooting. If necessary, they register a service request with Oracle for assistance in resolving the problem. If the Private Cloud Appliance is registered for Oracle Auto Service Request (ASR), certain hardware failures automatically generate a service request and send diagnostic data to Oracle support.

Monitoring

Independently of the built-in health checks, an administrator can consult the monitoring data at any time to verify the overall status of the system or the condition of a particular component or service. This is done through the Grafana interface, by querying the system-wide metric data stored in Prometheus.
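
For instance, outside of Grafana the same metric data can be retrieved programmatically over the Prometheus HTTP API. The following Python sketch shows a minimal instant query; the Prometheus address and the metric name are illustrative assumptions, not values defined by the appliance.

    # Minimal sketch: run an instant PromQL query over the Prometheus HTTP API.
    # The endpoint URL and the metric name are illustrative assumptions.
    import requests

    PROMETHEUS_URL = "https://prometheus.example.internal"   # hypothetical address

    def query_instant(promql: str) -> list:
        """Run an instant PromQL query and return the result vector."""
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": promql},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") != "success":
            raise RuntimeError(f"query failed: {body}")
        return body["data"]["result"]

    # Example: available memory per node (metric name is an assumption).
    for sample in query_instant("node_memory_MemAvailable_bytes"):
        labels, (_, value) = sample["metric"], sample["value"]
        print(labels.get("instance", "unknown"), value)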

Grafana provides a visual approach to monitoring: it allows you to create dashboards composed of a number of visualization panels. Each panel corresponds to a single metric query or a combination of metric queries, displayed in the selected format. Options include graphs, tables, charts, diagrams, gauges, and so on. For each metric panel, thresholds can be defined. When the query result exceeds or drops below a given threshold, the display color changes, providing a quick indication of which elements are healthy, which require investigation, and which are malfunctioning.
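
To illustrate how a panel ties a metric query to threshold colors, the sketch below assembles a minimal dashboard definition and uploads it through the Grafana HTTP API. The Grafana address, API token, metric name, and threshold value are assumptions, and a real panel definition carries more fields (placement, datasource, and so on); the Oracle-provided default dashboards should not be modified this way.

    # Minimal sketch of a dashboard with one panel and a color threshold.
    # URL, token, metric, and threshold values are illustrative assumptions.
    import requests

    GRAFANA_URL = "https://grafana.example.internal"   # hypothetical address
    API_TOKEN = "replace-with-a-service-account-token"

    dashboard = {
        "dashboard": {
            "id": None,
            "title": "Example - Node Memory",
            "panels": [
                {
                    "type": "timeseries",
                    "title": "Available memory",
                    "targets": [{"expr": "node_memory_MemAvailable_bytes"}],
                    "fieldConfig": {
                        "defaults": {
                            "unit": "bytes",
                            "thresholds": {
                                "mode": "absolute",
                                "steps": [
                                    {"color": "red", "value": None},        # base range
                                    {"color": "green", "value": 8 * 2**30}, # healthy above 8 GiB
                                ],
                            },
                        }
                    },
                }
            ],
        },
        "overwrite": False,
    }

    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        json=dashboard,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()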

Oracle provides a set of pre-defined dashboards that allow administrators to start monitoring the system as soon as it is up and running. The default monitoring dashboards are grouped into the following categories:

  • Service Advisor: Appliance-specific collection of dashboards for monitoring the Kubernetes container orchestration environment, the containerized services it hosts, and the system health check services.

  • Service Level Monitoring: A read-only collection of dashboards that provide statistical data for all the microservices.

  • Kubernetes Monitoring: An additional collection of dashboards provided by Oracle's cloud native monitoring and visualization experts. These provide extensive and detailed information about the Kubernetes cluster and its services.

The default dashboards contain metric data for the system's physical components – servers, switches, storage providers and their operating systems and firmware – as well as its logical components – controller software, platform, Kubernetes cluster and microservices, compute instances and their virtualized resources. This allows the administrator or support engineer to verify the health status of both component categories independently, and find correlations between them. For example, a particular microservice might exhibit poor performance due to lack of available memory. The monitoring data indicates whether this is a symptom of a system configuration issue, a lack of physical resources, or a hardware failure. The monitoring system has an alerting service capable of detecting and reporting hardware faults. The administrator may optionally configure a notification channel to receive alerts based on rules defined in the monitoring system.
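
As an illustration only, the sketch below polls the Prometheus alerts endpoint and forwards firing alerts to a webhook, which stands in for a notification channel. The URLs are assumptions, and this is not the appliance's built-in alerting or notification mechanism.

    # Minimal sketch: forward firing Prometheus alerts to a webhook.
    # Both URLs are illustrative assumptions.
    import requests

    PROMETHEUS_URL = "https://prometheus.example.internal"
    WEBHOOK_URL = "https://alerts.example.internal/notify"   # hypothetical channel

    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=30)
    resp.raise_for_status()
    alerts = resp.json()["data"]["alerts"]

    for alert in alerts:
        if alert.get("state") == "firing":
            requests.post(
                WEBHOOK_URL,
                json={
                    "name": alert["labels"].get("alertname", "unknown"),
                    "labels": alert["labels"],
                    "annotations": alert.get("annotations", {}),
                },
                timeout=30,
            )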

As part of service and support operations, Oracle may ask you to report specific metric data displayed in the default dashboards. For this reason, the default dashboard configurations should always be preserved. However, if some of the monitoring functionality is inadvertently modified or broken, the defaults can be restored. In a similar way, it is possible for Oracle to create new dashboards or modify existing ones for improved monitoring, and push them to your operational environment without the need for a formal upgrade procedure.

All the open source monitoring and logging tools described here have public APIs that allow customers to integrate with their existing health monitoring and alerting systems. However, Oracle does not provide support for such custom configurations.

Fault Domain Observability

The Fault Domain is a key concept for keeping the appliance infrastructure, the compute instances, and their related resources running in a healthy state. A Fault Domain groups a set of infrastructure components with the goal of isolating downtime events caused by failures or maintenance, ensuring that resources in other Fault Domains are not affected.

In line with Oracle Cloud Infrastructure, there are always three Fault Domains in a Private Cloud Appliance. Each Fault Domain corresponds to one or more physical compute nodes. Apart from using Grafana to consult monitoring data across the entire system, an administrator can also access key capacity metrics for Fault Domains directly from the Service Enclave:

  • Number of compute nodes per Fault Domain

  • Total and available amount of RAM per Fault Domain

  • Total and available number of vCPUs per Fault Domain

  • Unassigned system CPU and RAM capacity

The Fault Domain metrics reflect the actual physical resources that can be consumed by compute instances hosted on the compute nodes. The totals do not include resources reserved for the operation of the hypervisor: 40 GB of RAM and 4 CPU cores (8 vCPUs).
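
As a worked example, the sketch below computes the totals a Fault Domain would report from hypothetical per-node hardware figures. Only the hypervisor reservation of 40 GB of RAM and 8 vCPUs per compute node comes from the description above; the node specifications are assumptions.

    # Minimal sketch: capacity a Fault Domain would report, given hypothetical
    # per-node hardware. Only the hypervisor reservation (40 GB RAM, 8 vCPUs
    # per compute node) is taken from the text above; everything else is assumed.
    HYPERVISOR_RAM_GB = 40
    HYPERVISOR_VCPUS = 8   # 4 physical cores, 2 threads each

    def fault_domain_totals(nodes):
        """nodes: list of (installed_ram_gb, installed_vcpus) per compute node."""
        ram_gb = sum(r - HYPERVISOR_RAM_GB for r, _ in nodes)
        vcpus = sum(v - HYPERVISOR_VCPUS for _, v in nodes)
        return ram_gb, vcpus

    # Hypothetical Fault Domain with three identical compute nodes.
    nodes = [(1024, 128)] * 3
    ram_gb, vcpus = fault_domain_totals(nodes)
    print(f"Total usable RAM: {ram_gb} GB, total usable vCPUs: {vcpus}")
    # -> Total usable RAM: 2952 GB, total usable vCPUs: 360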

In addition to the three Fault Domains, the Service CLI displays an "Unassigned" category. It refers to installed compute nodes that have not been provisioned, and thus are not yet part of a Fault Domain. For unassigned compute nodes, the memory capacity cannot be calculated, but the CPU metrics are displayed.

System Health Checks

Health checks are the most basic form of detection. They run at regular intervals as Kubernetes CronJob services, which are very similar to regular UNIX cron jobs. A status entry is created for every health check result, which is always one of two possibilities: healthy or not healthy. All status information is stored for further processing in Prometheus; the unhealthy results also generate log entries in Loki with details to help advance the troubleshooting process.
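
Because the checks are ordinary Kubernetes CronJob resources, they can be inspected with standard Kubernetes tooling. The sketch below uses the official kubernetes Python client; access to the cluster configuration is assumed, and the namespace in which the checks run may differ on a real system.

    # Minimal sketch: list CronJobs in the cluster with the official
    # kubernetes Python client. Cluster access is assumed.
    from kubernetes import client, config

    config.load_kube_config()   # or config.load_incluster_config() inside a pod
    batch = client.BatchV1Api()

    for cj in batch.list_cron_job_for_all_namespaces().items:
        print(
            cj.metadata.namespace,
            cj.metadata.name,
            cj.spec.schedule,
            cj.status.last_schedule_time if cj.status else None,
        )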

Health checks are meant to verify the status of specific system components, and to detect status changes. Each health check process follows the same basic principle: to record the current condition and compare it to the expected result. If they match, the health check passes; if they differ, the health check fails. A status change from healthy to not healthy indicates that troubleshooting is required.
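
The sketch below illustrates this principle in its simplest form. It is not the implementation of any of the appliance's checkers; the probe and the expected value are placeholders.

    # Minimal sketch of the health check principle: record the current
    # condition, compare it to the expected result, and report the outcome.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("example-checker")

    def probe_current_condition() -> str:
        """Placeholder for the component-specific probe."""
        return "running"

    def run_check(expected: str = "running") -> bool:
        observed = probe_current_condition()
        healthy = observed == expected
        if healthy:
            log.info("health check passed (observed=%s)", observed)
        else:
            # On the appliance, unhealthy results also produce detailed
            # log entries in Loki to support troubleshooting.
            log.error("health check failed (observed=%s, expected=%s)",
                      observed, expected)
        return healthy

    run_check()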

For the purpose of troubleshooting, there are two principal data sources at your disposal: logs and metrics. Both categories of data are collected from all over the system and stored in a central location: logs are aggregated in Loki and metrics in Prometheus. Both tools have a query interface that allows you to filter and visualize the data, and both integrate with Grafana, whose browser-based interface can be accessed from the Service Web UI.

To investigate what causes a health check to fail, it helps to filter logs and metric data based on the type of failure. Loki categorizes data with a labeling system, displaying log messages that match the selected log label. Select a label from the list to view the logs for the service or application you are interested in. This list allows you to select not only the health checks but also the internal and external appliance services.
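
As an illustration of label-based selection, the sketch below queries the Loki HTTP API directly for recent log lines matching a label selector. The Loki address and the label name and value are assumptions; in practice the same filtering is done through the Grafana interface.

    # Minimal sketch: fetch recent log lines from Loki filtered by a label.
    # The Loki address and the label selector are illustrative assumptions.
    import time
    import requests

    LOKI_URL = "https://loki.example.internal"      # hypothetical address
    selector = '{app="network-checker"}'            # assumed label name and value

    now_ns = int(time.time() * 1e9)
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": selector,
            "start": now_ns - 3600 * 10**9,   # last hour
            "end": now_ns,
            "limit": 100,
        },
        timeout=30,
    )
    resp.raise_for_status()

    for stream in resp.json()["data"]["result"]:
        for timestamp, line in stream["values"]:
            print(timestamp, line)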

In addition, the latest status from each health check is displayed in the Platform Health Check dashboard, which is part of the Service Advisor dashboard set provided by default in Grafana.

Private Cloud Appliance runs the health checks listed below.

  • cert-checker: Verifies on each management node that no certificates have expired.

  • flannel-checker: Verifies that the Flannel container network service is fully operational on each Kubernetes node.

  • kubernetes-checker: Verifies the health status of Kubernetes nodes and pods, as well as the containerized services and their connection endpoints.

  • mysql-cluster-checker: Verifies the health status of the MySQL cluster database.

  • l0-cluster-services-checker: Verifies that low-level clustering services and key internal components (such as platform API, DHCP) in the hardware and platform layer are fully operational.

  • network-checker: Verifies that the system network configuration is correct.

  • registry-checker: Verifies that the container registry is fully operational on each management node.

  • vault-checker: Verifies that the secret service is fully operational on each management node.

  • etcd-checker: Verifies that the etcd service is fully operational on each management node.

  • zfssa-analytics-exporter: Reports ZFS Storage Appliance cluster status, active problems, and management path connection information. It also reports analytics information for a configurable list of datasets.

Centralized Logging

The platform provides unified logging across the entire system. The Fluentd data collector retrieves logs from all components and stores them in a central location, along with the appliance telemetry data. As a result, all the necessary troubleshooting and debugging information is maintained in a single data store, and does not need to be collected from different sources when an issue needs to be investigated. The overall health of the system is captured in a single view, a Grafana dashboard, so there is no need for an administrator to check individual components.

Whenever an issue is found that requires assistance from Oracle, the administrator logs a service request. A support bundle is usually requested as part of that process. Thanks to the centralized logging, the support bundle is straightforward to generate, and remains possible even if system operation is severely compromised. Generating the support bundle is a scripted operation that produces a single compressed archive file. The administrator does need to manually upload the archive file containing the consolidated logs and other diagnostic data.

As noted above, if the Private Cloud Appliance is registered for ASR, certain hardware failures automatically generate a service request and send diagnostic data to Oracle support.