5 Status and Health Monitoring

The system health checks and monitoring data are the foundation of problem detection. All the necessary troubleshooting and debugging information is maintained in a single data store, and does not need to be collected from individual components when an issue needs to be investigated. The overall health of the system is captured in one central location: Grafana.

Oracle has built default dashboards and alerts into Grafana, as well as a mechanism to consult the logs stored in Loki. Customers might prefer to expand and customize this setup, but this is beyond the scope of the Oracle Private Cloud Appliance documentation.

Implementation details and technical background information for this feature can be found in the Oracle Private Cloud Appliance Concepts Guide. Refer to the section "Status and Health Monitoring" in the chapter Appliance Administration Overview.

Using Grafana

With Grafana, Oracle Private Cloud Appliance offers administrators a single, visually oriented interface to the logs and metrics collected at all levels and across all components of the system. This section provides basic guidelines to access Grafana and navigate through the logs and monitoring dashboards.

To access the Grafana home page

  1. Open the Service Web UI and log in.

  2. On the right-hand side of the dashboard, click the Monitoring tile.

    The Grafana home page opens in a new browser tab. Enter your user name and password when prompted.

When logs and metrics are stored in Prometheus, they are given a time stamp based on the time and time zone settings of the appliance. Grafana, however, displays times according to your user preferences, which can result in an offset if you are in a different time zone. It might be preferable to synchronize the time line in the Grafana visualizations with the time zone of the appliance.

To change the Grafana time line display

  1. Open the Grafana home page.

  2. In the menu bar on the left-hand side, click your user account icon (near the bottom) to display your account preferences.

  3. In the Preferences section, change the Time Zone setting to the same time zone as the appliance.

  4. Click Save to apply the change.

The pre-defined dashboards for Private Cloud Appliance are not directly accessible from the Grafana home page, although you can star your most-used dashboards so that they appear there. Dashboards are organized in folders, which you access through the Dashboards section of the main menu.

To browse the Grafana dashboards

  1. In the menu bar on the left-hand side, point to Dashboards and select Manage.

    The list of folders, or dashboard sets, is displayed.

  2. Click a folder to display the list of dashboards it contains. Click a dashboard to display its contents.

  3. To navigate back to the list of folders and dashboards, use the menu bar as you did in step 1.

With the exception of the My Sauron (Read Only) dashboard set, all pre-defined dashboards and panels are editable by design. You can modify them or create your own using the specific metrics you want to monitor. The same applies to the alerts.

Alerts are managed in a separate area. Oracle has pre-defined a series of alerts for your convenience.

To access the alerting rules and notifications

  1. In the menu bar on the left-hand side, click Alerting (the bell icon).

    A list of all defined alert rules is displayed, including their current status.

  2. Click an alert rule to display a detail panel and see how its status has evolved over time and relative to the alert threshold.

  3. To navigate back to the list of alert rules, use the menu bar as you did in step 1.

  4. To configure alert notifications, go to the Notification Channels tab of the Alerting page.

Note:

If you wish to configure custom alerts using your own external notification channel, you must first configure the proxy for Grafana using the Sauron API endpoint. To do so, log in to the management node that owns the management virtual IP and run the following command:

$ sudo curl -u <admin_user_name> \
-XPUT 'https://api.<mypca>.example.com/v1/grafana/proxy/config?http-proxy=<proxy_fqdn>:<proxy_port>&https-proxy=<proxy_fqdn>:<proxy_port>'
Enter host password for user '<admin_user_name>':
Grafana proxy config successfully updated!

Finally, Grafana also provides access to the appliance logs, which are aggregated through Loki. For more information, see Accessing System Logs.

Checking the Health and Status of Hardware and Platform Components

The hardware and platform layers form the foundations of the system architecture. Any unhealthy condition at this level is expected to have an adverse effect on operations in the infrastructure services. A number of pre-defined Grafana dashboards allow you to check the status of those essential low-level components, and drill down into the real-time and historic details of the relevant metrics.

The dashboards described in this section provide a good starting point for basic system health checks, and troubleshooting in case issues are found. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.
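
If you want to go beyond the pre-defined dashboards, you can enter PromQL expressions directly in a Grafana panel or in the Explore view. As a minimal sketch, assuming the standard node_exporter metric names on which the server dashboards are based, the following query charts the CPU utilization percentage per node:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)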

The following dashboards are listed with the Grafana folder that contains them.

  • Server Stats (folder: Service Monitoring)

    This comprehensive dashboard displays telemetry data for the server nodes. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on.

    Some panels in this dashboard display a large number of time series in a single graph. Click an item in the legend to display that series by itself, or hover over the graph to view detailed data at a specific point on the time axis.

  • Platform Health Check (folder: PCA 3.0 Service Advisor)

    This dashboard integrates the appliance health check mechanisms into the centralized logging and monitoring approach that Grafana provides.

    By default, the Platform Health Check dashboard displays the failures for all health check services. You can change the panel display by selecting a health checker from the list of platform services, and you can choose to display healthy, unhealthy, or all results.

    Typically, if you see health check failures, you want to start troubleshooting. For that purpose, each health check result contains a time stamp that serves as a direct link to the related Loki logs. To view the logs related to a health check result, click its time stamp.

  • Node Exporter Full (folder: My Sauron (Read Only))

    This dashboard displays a large number of detailed metric panels for a single compute or management node. Select a host from the list to display its data.

    This dashboard can be considered a fine-grained extension of the Server Stats dashboard. Its many panels provide detailed coverage of the server node hardware status as well as the operating system services and processes. Information that you would typically collect at the command line of each physical node is combined into a single dashboard showing live data and its evolution over time.

All dashboards in the My Sauron (Read Only) folder provide data that would be critical in case a system-level failure needs to be resolved. Therefore, these dashboards cannot be modified or deleted.

Viewing and Interpreting Monitoring Data

The infrastructure services layer, which is built on top of the platform and enables all the cloud user and administrator functionality, can be monitored through an extensive collection of Grafana dashboards. These microservices are deployed across the three management nodes in Kubernetes containers, so their monitoring is largely based on Kubernetes node and pod metrics. The Kubernetes cluster also extends onto the compute nodes, where Kubernetes worker nodes collect vital additional data for system operation and monitoring.

The dashboards described in this section provide a good starting point for microservices health monitoring. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.
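
You can also query the Kubernetes metrics yourself in the Explore view. The following PromQL expression is a minimal sketch, assuming the standard kube-state-metrics metric names, that counts the pods per namespace that are not in the Running phase:

sum by (namespace) (kube_pod_status_phase{phase!="Running"})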

The following dashboards are listed with the Grafana folder that contains them.

  • ClusterLabs HA Cluster Details (folder: Service Monitoring)

    This dashboard uses a bespoke Prometheus exporter to display data for HA clusters based on Pacemaker. On each HTTP request, the exporter locally inspects the cluster status by parsing the distributed data that the cluster components' tools already provide.

    The monitoring data includes the Pacemaker cluster summary, node and resource stats, and Corosync ring errors and quorum votes.

  • MySQL Cluster Exporter (folder: Service Monitoring)

    This dashboard displays performance details for the MySQL database cluster. Data includes database service metrics such as uptime, connection statistics, and table lock counts, as well as more general information about MySQL objects, connections, network traffic, memory, and CPU usage.

  • Service Level (folder: Service Monitoring)

    This dashboard displays detailed information about RabbitMQ requests that are received by the fundamental appliance services. It allows you to monitor the number of requests, request latency, and any requests that caused an error.

  • VM Stats (folder: Service Monitoring)

    This comprehensive dashboard displays resource consumption information across the compute instances in your environment. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on.

    The panels in this dashboard display a large number of time series in a single graph. Click an item in the legend to display that series by itself, or hover over the graph to view detailed data at a specific point on the time axis.

  • Kube Endpoint (folder: PCA 3.0 Service Advisor)

    This dashboard focuses specifically on the Kubernetes endpoints and provides endpoint alerts. These alerts can be sent to a notification channel of your choice.

  • Kube Ingress (folder: PCA 3.0 Service Advisor)

    This dashboard provides data about ingress traffic to the Kubernetes services and their pods. Two alerts are built in and can be sent to a notification channel of your choice.

  • Kube Node (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data for all the server nodes, meaning the management and compute nodes that belong to the Kubernetes cluster and host microservices pods. You can monitor pod count, CPU and memory usage, and so on. The metric panels display information for all nodes; in the graph-based panels, you can click to view information for a single node.

  • Kube Pod (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data at the level of the microservices pods, allowing you to view the total number of pods and how they are distributed across the nodes. You can monitor their status per namespace and per service, and check whether they have triggered any alerts.

  • Kube Service (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data at the Kubernetes service level. The data can be filtered for specific services, but displays all services by default. Two alerts are built in and can be sent to a notification channel of your choice.

  • All dashboards (folders: Kubernetes Monitoring, Kubernetes Monitoring Containers, and Kubernetes Monitoring Node)

    These folders contain a large and diverse collection of dashboards with a wide range of monitoring data, covering practically all aspects of your Kubernetes cluster.

    The data covers Kubernetes at the cluster, node, pod, and container levels. Metrics provide insights into deployment, ingress, usage of CPU, disk, memory and network, and much more.

Accessing System Logs

Logs are collected from all over the system and aggregated in Loki. All the log data can be queried, filtered, and displayed using the central interface of Grafana.

To view the Loki logs

  1. Open the Grafana home page.

  2. In the menu bar on the left-hand side, click Explore (the compass icon).

    By default, the Explore page's data source is set to "Prometheus".

  3. At the top of the page, near the left-hand side, select "Loki" from the data source list.

  4. Use the Log Labels list to query and filter the logs.

The logs are categorized with labels, which you can query in order to display log entries of a particular type or category. The principal log label categories used within Private Cloud Appliance are the following:

  • job

    The log labels in this category are divided into three groups:

    • Platform: logs from services and components running in the foundation layers of the appliance architecture.

      Log labels in this category include: "him"/"has"/"hms" (hardware management), "api-server", "vault"/"etcd" (secret service), "corosync"/"pacemaker"/"pcsd" (management cluster), "messages" (RabbitMQ), "pca-platform-l0", "pca-platform-l1api", and so on.

    • Infrastructure services: logs from the user-level cloud services and administrative services deployed on top of the platform. These services are easier to identify by their name.

      Log labels in this category include: "brs" (backup/restore), "ceui" (Compute Web UI), "seui" (Service Web UI), "compute", "dr-admin" (disaster recovery), "filesystem", "iam" (identity and access management), "pca-upgrader", and so on.

    • Standard output: logs that the containerized infrastructure services send to the stdout stream. This output is visible to users when they execute a UI operation or CLI command.

      Use the log label job="k8s-stdout-logs" to filter for the standard output logs. The log data comes from the microservices' Kubernetes containers, and can be filtered further by specifying a pod and/or container name.

  • k8s_app

    Log labels in this category allow you to narrow down the standard output logs (job="k8s-stdout-logs"). That log data comes from the microservices' Kubernetes containers, and can be filtered further by selecting the label that corresponds with the name of the specific service you are interested in.

You navigate through the logs by selecting one of the job or k8s_app log labels: pick the label that corresponds to the service or application you are interested in, and the list of logs is displayed in reverse chronological order. You can narrow your search by zooming in on a portion of the time line shown above the log entries. Color coding helps to identify the items that require your attention; for example, warnings are marked in yellow and errors are marked in red.
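
For example, the following Loki query, a minimal sketch in which the label value "compute" and the filter string are illustrative, displays only those standard output entries of the compute service's containers that contain the string ERROR:

{job="k8s-stdout-logs", k8s_app="compute"} |= "ERROR"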

Audit Logs

The audit logs can be consulted as separate categories. From the Log Labels list, you can select these audit labels:

  • job="vault-audit"

    Use this log label to filter for the audit logs of the Vault cluster. Vault, a key component of the secret service, keeps a detailed log of all requests and responses. You can view every authenticated interaction with Vault, including errors. Because these logs contain sensitive information, many strings within requests and responses are hashed so that secrets are not shown in plain text in the audit logs.

  • job="kubernetes-audit"

    Use this log label to filter for the audit logs of the Kubernetes cluster. The Kubernetes audit policy is configured to log request metadata: requesting user, time stamp, resource, verb, etc. Request body and response body are not included in the audit logs.

  • job="audit"

    Use this log label to filter for the Oracle Linux kernel audit daemon logs. The kernel audit daemon (auditd) is the userspace component of the Linux Auditing System. It captures specific events such as system logins, account modifications and sudo operations.

  • log="audit"

    Use this log label to filter for the audit logs of the ZFS Storage Appliance.
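
Label selectors can be combined with line filters to narrow the results. For example, the following query, a simple sketch in which the filter string is illustrative, shows only the Vault audit entries that contain the string "error":

{job="vault-audit"} |= "error"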

In addition to using the log labels from the list, you can also build custom queries. For example, to filter for the audit logs of the admin service and API service, enter the following query into the field next to the Log Labels list:

{job=~"(admin|api-server)"} | json tag="tag" | tag=~"(api-audit.log|audit.log)"

To execute the query, either click the Run Query button in the top-right corner or press Shift + Enter.

Using Auto Service Requests

Oracle Private Cloud Appliance is qualified for Oracle Auto Service Request (ASR). ASR is a software feature for support purposes. It is integrated with My Oracle Support and helps resolve problems faster by automatically opening service requests when specific hardware failures occur. Using ASR is optional: the service must be registered and enabled for your appliance.

Understanding Oracle Auto Service Request (ASR)

Oracle Auto Service Request (ASR) is designed to automatically open service requests when specific Private Cloud Appliance hardware faults occur. To enable this feature, the Private Cloud Appliance must be configured to send hardware fault telemetry to Oracle in one of three ways: directly to https://transport.oracle.com, to a proxy host, or to a different endpoint. For example, you can use a different endpoint if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.

When a hardware problem is detected, ASR submits a service request to Oracle Support Services. In many cases, Oracle Support Services can begin work on resolving the issue before the administrator is even aware the problem exists.

ASR detects faults in the most common hardware components, such as disks, fans, and power supplies, and automatically opens a service request when a fault occurs. ASR does not detect all possible hardware faults, and it is not a replacement for other monitoring mechanisms, such as SMTP alerts, within the customer data center. It is a complementary mechanism that expedites and simplifies the delivery of replacement hardware. ASR should not be used for downtime events in high-priority systems. For high-priority events, contact Oracle Support Services directly.

An email message is sent to both the My Oracle Support email account and the technical contact for Private Cloud Appliance to notify them of the creation of the service request. A service request may not be filed automatically on some occasions. This can happen because of the unreliable nature of the SNMP protocol or a loss of connectivity to ASR. Oracle recommends that customers continue to monitor their systems for faults and call Oracle Support Services if they do not receive notice that a service request has been filed automatically.

For more information about ASR, consult the Oracle Auto Service Request documentation on the Oracle support website.

Oracle Auto Service Request Prerequisites

Before you register for the Oracle Auto Service Request (ASR) service, make sure that the prerequisites in this section are met.

  1. Make sure that you have a valid My Oracle Support account.

    If necessary, create an account at https://support.oracle.com/portal/.

  2. Ensure that the following are set up correctly in My Oracle Support:

    • technical contact person at the customer site who is responsible for Private Cloud Appliance

    • valid shipping address at the customer site where the Private Cloud Appliance is located, so that parts are delivered to the site where they must be installed

  3. Verify connectivity to the Internet using HTTPS.

    For example, try curl to test whether you can access https://support.oracle.com/portal/.
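
    The following commands are one way to run that check; the second variant is for sites that route outbound traffic through a proxy, with placeholder values for the proxy host and port:

    $ curl -I https://support.oracle.com/portal/
    $ curl -I -x http://<proxy_host>:<proxy_port> https://support.oracle.com/portal/

    Receiving HTTP response headers (for example, a 200 or a redirect status) indicates that the connection works.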

Registering Private Cloud Appliance for Oracle Auto Service Request

To register the Oracle Auto Service Request (ASR) client, the Private Cloud Appliance must be configured to send hardware fault telemetry to Oracle in one of three ways: directly to https://transport.oracle.com, to a proxy host, or to a different endpoint. For example, you can use a different endpoint if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.

When you register your Private Cloud Appliance for ASR, the service is automatically enabled.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Register button.

  3. Fill in the username and password, then complete the fields for the Phone Home configuration that you choose.

    • Username*: Enter the user name of your Oracle Single Sign On (SSO) account. SSO credentials can be obtained from My Oracle Support.

    • Password*: Enter the password for your SSO account.

    • Proxy Username: To use a proxy host, enter a username to access that host.

    • Proxy Password: To use a proxy host, enter the password to access that host.

    • Proxy Host: To use a proxy host, enter the name of that host.

    • Proxy Port: To use a proxy host, enter the port used to access the host.

    • Endpoint: Optionally, if you use an aggregation point or another endpoint for ASR data consolidation, enter that endpoint in this format: http://<host>[:<port>]/asr

    *Required fields

Using the Service CLI

Configure ASR Directly to https://transport.oracle.com

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=asr-pca3_ca@example.com \ 
    password=********  confirmPassword=******** \
    endpoint=https://transport.oracle.com/
    Command: asrClientRegister username=asr-pca3_ca@example.com \ 
    password=*****  confirmPassword=***** \ 
    endpoint=https://transport.oracle.com/
    Status: Success
    Time: 2021-07-12 18:47:14,630 UTC
  3. Confirm the configuration.

    PCA-ADMIN> show asrPhonehome
    Command: show asrPhonehome
    Status: Success
    Time: 2021-09-30 13:08:42,210 UTC
    Data:
      Is Registered = true
      Overall Enable Disable = true
      Username = asr.user@example.com
      Endpoint = https\://transport.oracle.com/
    PCA-ADMIN>

Configure ASR to a Proxy Host

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=asr-pca3_ca@oracle.com \ 
    password=******** confirmPassword=******** \ 
    proxyHost=zeb proxyPort=80 \ 
    proxyUsername=support \ 
    proxyPassword=**** proxyConfirmPassword=****

Configure ASR to a Different Endpoint

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=oracle_email@example.com \
    password=********  confirmPassword=******** \
    endpoint=http://<host>:<port>/asr
    Command: asrClientRegister username=oracle_email@example.com \
    password=*****  confirmPassword=***** \
    endpoint=http://<host>:<port>/asr
    Status: Success
    Time: 2021-07-12 18:47:14,630 UTC

Testing Oracle Auto Service Request Configuration

Once configured, you can test your Oracle Auto Service Request (ASR) configuration to ensure that end-to-end communication is working properly.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Select Test Registration in the Controls menu.

  3. Click Test Registration. A dialog box confirms whether the test is successful.

  4. If the test is not successful, verify your ASR configuration information and repeat the test.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientsendTestMsg custom command to test the ASR configuration.

    PCA-ADMIN> asrClientsendTestMsg
    Command: asrClientsendTestMsg
    Status: Success
    Time: 2021-12-08 18:43:30,093 UTC
    PCA-ADMIN>

Unregistering Private Cloud Appliance for Oracle Auto Service Request

When you unregister your Private Cloud Appliance for Oracle Auto Service Request (ASR), the service is automatically disabled, so you do not need to perform that step.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Unregister button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientUnregister custom command to unregister the appliance.

    PCA-ADMIN> asrClientUnregister
    Command: asrClientUnregister
    Status: Success
    Time: 2021-06-23 15:25:18,127 UTC
    PCA-ADMIN>

Disabling Oracle Auto Service Request

During system maintenance, or in other circumstances, you might want to temporarily disable Oracle Auto Service Request (ASR) on your appliance to halt the flow of fault messages to your configured endpoint without unregistering the system.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Disable button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientDisable custom command to halt the ASR service.

    PCA-ADMIN> asrClientDisable
    Command: asrClientDisable
    Status: Success
    Time: 2021-06-23 15:26:17,753 UTC
    PCA-ADMIN>

Enabling Oracle Auto Service Request

If you have disabled Oracle Auto Service Request (ASR) on your appliance, use one of these methods to restart the ASR service.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Enable button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientEnable custom command to start the ASR service.

    PCA-ADMIN> asrClientEnable
    Command: asrClientEnable
    Status: Success
    Time: 2021-06-23 15:26:47,632 UTC
    PCA-ADMIN>