5 Status and Health Monitoring

The system health checks and monitoring data are the foundation of problem detection. All the necessary troubleshooting and debugging information is maintained in a single data store, and does not need to be collected from individual components when an issue needs to be investigated. The overall health of the system is captured in one central location: Grafana.

Oracle has built default dashboards and alerts into Grafana, as well as a mechanism to consult the logs stored in Loki. Customers might prefer to expand and customize this setup, but this is beyond the scope of the Oracle Private Cloud Appliance documentation.

Implementation details and technical background information for this feature can be found in the Oracle Private Cloud Appliance Concepts Guide. Refer to the section "Status and Health Monitoring" in the chapter Appliance Administration Overview.

Using Grafana

With Grafana, Oracle Private Cloud Appliance offers administrators a single, visually oriented interface to the logs and metrics collected at all levels and across all components of the system. This section provides basic guidelines to access Grafana and navigate through the logs and monitoring dashboards.

To access the Grafana home page

  1. Open the Service Web UI and log in.

  2. On the right-hand side of the dashboard, click the Monitoring tile.

    The Grafana home page opens in a new browser tab. Enter your user name and password when prompted.

When logs and metrics are stored in Prometheus, they are given a time stamp based on the time and time zone settings of the appliance. However, Grafana displays times according to user preferences, which may result in an offset if you are in a different time zone. It might be preferable to synchronize the time line in the Grafana visualizations with the time zone of the appliance.

To change the Grafana time line display

  1. Open the Grafana home page.

  2. In the menu bar on the left-hand side, click your user account icon (near the bottom) to display your account preferences.

  3. In the Preferences section, change the Time Zone setting to the same time zone as the appliance.

  4. Click the Save button below to apply the change.
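
If you are not sure which time zone the appliance is set to, you can check it on a management node before adjusting the Grafana preference. The following is a minimal sketch, assuming root SSH access to a management node; timedatectl is the standard Oracle Linux utility for this.

[root@pcamn01 ~]# timedatectl

The output includes the local time, the universal time, and the configured time zone.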

The pre-defined dashboards for Private Cloud Appliance are not directly accessible from the Grafana home page, although you can star your most used dashboards to appear on your home page later. Dashboards are organized in folders, which you access through the Dashboards section of the main menu.

To browse the Grafana dashboards

  1. In the menu bar on the left-hand side, point to Dashboards and select Manage.

    The list of folders, or dashboard sets, is displayed.

  2. Click a folder to display the list of dashboards it contains. Click a dashboard to display its contents.

  3. To navigate back to the list of folders and dashboards, use the menu bar as you did in step 1.

With the exception of the My Sauron (Read Only) dashboard set, all pre-defined dashboards and panels are editable by design. You can modify them or create your own using the specific metrics you want to monitor. The same applies to the alerts.

Alerts are managed in a separate area. Oracle has pre-defined a series of alerts for your convenience.

To access the alerting rules and notifications

  1. In the menu bar on the left-hand side, click Alerting (the bell icon).

    A list of all defined alert rules is displayed, including their current status.

  2. Click an alert rule to display a detail panel and see how its status has evolved over time and relative to the alert threshold.

  3. To navigate back to the list of alert rules, use the menu bar as you did in step 1.

  4. To configure alert notifications, go to the Notification Channels tab of the Alerting page.

Note:

If you wish to configure custom alerts using your own external notification channel, you must first configure the proxy for Grafana using the Sauron API endpoint. To do so, log in to the management node that owns the management virtual IP and run the following command:

$ sudo curl -u <admin_user_name> \
-XPUT 'https://api.<mypca>.example.com/v1/grafana/proxy/config?http-proxy=<proxy_fqdn>:<proxy_port>&https-proxy=<proxy_fqdn>:<proxy_port>'
Enter host password for user '<admin_user_name>':
Grafana proxy config successfully updated!

Finally, Grafana also provides access to the appliance logs, which are aggregated through Loki. For more information, see Accessing System Logs.

Checking the Health and Status of Hardware and Platform Components

The hardware and platform layers form the foundations of the system architecture. Any unhealthy condition at this level is expected to have an adverse effect on operations in the infrastructure services. A number of pre-defined Grafana dashboards allow you to check the status of those essential low-level components, and drill down into the real-time and historic details of the relevant metrics.

The dashboards described in this section provide a good starting point for basic system health checks, and troubleshooting in case issues are found. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.

  • Server Stats (folder: Service Monitoring)

    This comprehensive dashboard displays telemetry data for the server nodes. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on.

    Some panels in this dashboard display a large number of time series in a single graph. You can click a series in the legend to display only that series, or hover over the graph to view detailed data at a specific point on the time axis.

  • Platform Health Check (folder: PCA 3.0 Service Advisor)

    This dashboard integrates the appliance health check mechanisms into the centralized approach that Grafana provides for logging and monitoring.

    By default, the Platform Health Check dashboard displays the failures for all health check services. You can change the panel display by selecting a health checker from the list of platform services, and you can choose to display healthy, unhealthy, or all results.

    Typically, if you see health check failures, you want to start troubleshooting. For that purpose, each health check result contains a time stamp that serves as a direct link to the related Loki logs. To view the logs related to a health check result, click its time stamp.

  • Node Exporter Full (folder: My Sauron (Read Only))

    This dashboard displays a large number of detailed metric panels for a single compute or management node. Select a host from the list to display its data.

    This dashboard can be considered a fine-grained extension of the Server Stats dashboard. The many panels provide detailed coverage of the server node hardware status as well as the operating system services and processes. Information that you would typically collect at the command line of each physical node is combined into a single dashboard showing live data and its evolution over time.

All dashboards in the My Sauron (Read Only) folder provide data that would be critical in case a system-level failure needs to be resolved. Therefore, these dashboards cannot be modified or deleted.

Viewing and Interpreting Monitoring Data

The infrastructure services layer, which is built on top of the platform and enables all the cloud user and administrator functionality, can be monitored through an extensive collection of Grafana dashboards. These microservices are deployed across the three management nodes in Kubernetes containers, so their monitoring is largely based on Kubernetes node and pod metrics. The Kubernetes cluster also extends onto the compute nodes, where Kubernetes worker nodes collect vital additional data for system operation and monitoring.

The dashboards described in this section provide a good starting point for microservices health monitoring. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.

  • ClusterLabs HA Cluster Details (folder: Service Monitoring)

    This dashboard uses a bespoke Prometheus exporter to display data for HA clusters based on Pacemaker. On each HTTP request, it locally inspects the cluster status by parsing pre-existing distributed data provided by the cluster components' tools.

    The monitoring data includes the Pacemaker cluster summary, node and resource stats, and Corosync ring errors and quorum votes.

  • MySQL Cluster Exporter (folder: Service Monitoring)

    This dashboard displays performance details for the MySQL database cluster. Data includes database service metrics such as uptime, connection statistics, and table lock counts, as well as more general information about MySQL objects, connections, network traffic, memory and CPU usage, and so on.

  • Service Level (folder: Service Monitoring)

    This dashboard displays detailed information about RabbitMQ requests that are received by the fundamental appliance services. It allows you to monitor the number of requests, request latency, and any requests that caused an error.

  • VM Stats (folder: Service Monitoring)

    This comprehensive dashboard displays resource consumption information across the compute instances in your environment. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on.

    The panels in this dashboard display a large number of time series in a single graph. You can click a series in the legend to display only that series, or hover over the graph to view detailed data at a specific point on the time axis.

  • Kube Endpoint (folder: PCA 3.0 Service Advisor)

    This dashboard focuses specifically on the Kubernetes endpoints and provides endpoint alerts. These alerts can be sent to a notification channel of your choice.

  • Kube Ingress (folder: PCA 3.0 Service Advisor)

    This dashboard provides data about ingress traffic to the Kubernetes services and their pods. Two alerts are built in and can be sent to a notification channel of your choice.

  • Kube Node (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data for all the server nodes, meaning management and compute nodes, that belong to the Kubernetes cluster and host microservices pods. You can monitor pod count, CPU and memory usage, and so on. The metric panels display information for all nodes. In the graph-based panels, you can click a node to view information for just that node.

  • Kube Pod (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data at the level of the microservices pods, allowing you to view the total number of pods and how they are distributed across the nodes. You can monitor their status per namespace and per service, and check whether they have triggered any alerts.

  • Kube Service (folder: PCA 3.0 Service Advisor)

    This dashboard displays metric data at the Kubernetes service level. The data can be filtered for specific services, but displays all services by default. Two alerts are built in and can be sent to a notification channel of your choice.

  • All dashboards (folders: Kubernetes Monitoring, Kubernetes Monitoring Containers, Kubernetes Monitoring Node)

    These folders contain a large and diverse collection of dashboards with a wide range of monitoring data, covering practically all aspects of your Kubernetes cluster.

    The data covers Kubernetes at the cluster, node, pod, and container levels. Metrics provide insights into deployments, ingress, usage of CPU, disk, memory, and network, and much more.

Monitoring System Capacity

It is important to track the key metrics that determine the system's capacity to host your compute instances and the storage they use. The detailed data for compute node load and storage usage can be found in the Grafana dashboards, but as an administrator you also have direct access to the current consumption of CPU and memory as well as storage space.

Viewing CPU and Memory Usage By Fault Domain

The getFaultDomainInfo command provides an overview of memory and CPU usage across a fault domain.

Using the Service Web UI

  1. In the PCA Config navigation menu, click Fault Domains.

    The table displays CPU and memory usage data by fault domain.

  2. To view more detailed information about a component, click its host name in the table.

Using the Service CLI

  1. To display a list of the CPU and memory usage in a fault domain, use the getFaultDomainInfo command.

    The UNASSIGNED row refers to compute nodes that are not currently assigned to a fault domain. Because these compute nodes do not belong to a fault domain, their memory and CPU usage in a fault domain is zero. You can access memory and CPU usage per compute node by viewing the Compute Node Information page in the Service Web UI.

    PCA-ADMIN> getFaultDomainInfo
    Command: getFaultDomainInfo
    Status: Success
    Time: 2022-06-17 14:43:13,292 UTC
    Data:
      id           totalCNs   totalMemory   freeMemory   totalvCPUs   freevCPUs   notes
      --           --------   -----------   ----------   ----------   ---------   -----
      UNASSIGNED   11         0.0           0.0          0            0
      FD1          1          984.0         968.0        120          118
      FD2          1          984.0         984.0        120          120
      FD3          1          984.0         984.0        120          120
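
As a quick worked example (assuming the memory columns are reported in GB), you can derive the capacity currently in use in a fault domain from the total and free values in the output above:

$ echo "984.0 - 968.0" | bc    # GB of memory in use in FD1
$ echo "120 - 118" | bc        # vCPUs in use in FD1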
Viewing Disk Space Usage on the ZFS Storage Appliance

The Service Enclave runs a storage monitoring tool called ZFS pool manager, which polls the ZFS Storage Appliance every 60 seconds. The Service CLI allows you to display its current information on the usage of available disk space in each ZFS pool. You can also set the usage threshold that triggers a fault when exceeded.

In a standard storage configuration you only have one pool. If your system includes high-performance disk trays then you can view usage information for both pools separately.

Use the Service CLI as follows to check storage capacity:

  1. Display the status of a ZFS pool.

    PCA-ADMIN> list ZfsPool
    Command: list ZfsPool
    Status: Success
    Time: 2022-10-10 08:44:11,938 UTC
    Data:
      id                                     name
      --                                     ----
      e898b147-7cf0-4bd0-8b54-e32ec83d04cb   PCA_POOL
      c2f67943-df81-47a5-9713-06768318b623   PCA_POOL_HIGH
    
    PCA-ADMIN> show ZfsPool id=e898b147-7cf0-4bd0-8b54-e32ec83d04cb
    Command: show ZfsPool id=e898b147-7cf0-4bd0-8b54-e32ec83d04cb
    Status: Success
    Time: 2022-10-10 08:44:22,051 UTC
    Data:
      Id = e898b147-7cf0-4bd0-8b54-e32ec83d04cb
      Type = ZfsPool
      Pool Status = Online
      Free Pool = 44879343128576
      Total Pool = 70506183131136
      Pool Usage Percent = 0.3634693989163486
      Name = PCA_POOL
      Work State = Normal
  2. Configure the fault threshold of the ZFS pool manager. It is set to 80 percent full (value = 0.8) by default.

    PCA-ADMIN> show ZfsPoolManager
    Command: show ZfsPoolManager
    Status: Success
    Time: 2022-10-10 08:58:11,231 UTC
    Data:
      Id = a6ca861b-f83a-4032-91c5-bc506394d0de
      Type = ZfsPoolManager
      LastRunTime = 2022-10-09 12:17:52,964 UTC
      Poll Interval (sec) = 60
      The minimum Zfs pool usage percentage to trigger a major fault = 0.8
      Manager's run state = Running
    
    PCA-ADMIN> edit ZfsPoolManager usageMajorFaultPercent=0.75
    Command: edit ZfsPoolManager usageMajorFaultPercent=0.75
    Status: Success
    Time: 2022-10-10 08:58:27,657 UTC
    JobId: 67cfe180-f2a2-4d59-a676-01b3d73cffae
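
The pool size values in the output above appear to be reported in bytes: converting the free space to TiB and recomputing the usage fraction with standard shell arithmetic, as in the sketch below, matches the Pool Usage Percent value from the example.

$ echo "scale=2; 44879343128576 / 1024^4" | bc               # free space in TiB, prints 40.81
$ echo "scale=4; 1 - 44879343128576 / 70506183131136" | bc   # usage fraction, prints approximately .3635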

Accessing System Logs

Logs are collected from all over the system and aggregated in Loki. All the log data can be queried, filtered, and displayed using the central interface of Grafana.

To view the Loki logs

  1. Open the Grafana home page.

  2. In the menu bar on the left-hand side, click Explore (the compass icon).

    By default, the Explore page's data source is set to "Prometheus".

  3. At the top of the page near the left hand side, select "Loki" from the data source list.

  4. Use the Log Labels list to query and filter the logs.

The logs are categorized with labels, which you can query in order to display log entries of a particular type or category. The principal log label categories used within Private Cloud Appliance are the following:

  • job

    The log labels in this category are divided into three groups:

    • Platform: logs from services and components running in the foundation layers of the appliance architecture.

      Log labels in this category include: "his"/"has"/"hms" (hardware management), "api-server", "vault"/"etcd" (secret service), "corosync"/"pacemaker"/"pcsd" (management cluster), "messages" (RabbitMQ), "pca-platform-l0", "pca-platform-l1api", and so on.

    • Infrastructure services: logs from the user-level cloud services and administrative services deployed on top of the platform. These services are easier to identify by their name.

      Log labels in this category include: "brs" (backup/restore), "ceui" (Compute Web UI), "seui" (Service Web UI), "compute", "dr-admin" (disaster recovery), "filesystem", "iam" (identity and access management), "pca-upgrader", and so on.

    • Standard output: logs that the containerized infrastructure services send to the stdout stream. This output is visible to users when they execute a UI operation or CLI command.

      Use the log label job="k8s-stdout-logs" to filter for the standard output logs. The log data comes from the microservices' Kubernetes containers, and can be filtered further by specifying a pod and/or container name.

  • k8s_app

    Log labels in this category allow you to narrow down the standard output logs (job="k8s-stdout-logs"). That log data comes from the microservices' Kubernetes containers, and can be filtered further by selecting the label that corresponds with the name of the specific service you are interested in.

You navigate through the logs by selecting one of the job or k8s_app log labels. You pick the label that corresponds with the service or application you are interested in, and the list of logs is displayed in reverse chronological order. You can narrow your search by zooming in on a portion of the time line shown above the log entries. Color coding helps to identify the items that require your attention; for example: warnings are marked in yellow and errors are marked in red.

Audit Logs

The audit logs can be consulted as separate categories. From the Log Labels list, you can select these audit labels:

  • job="vault-audit"

    Use this log label to filter for the audit logs of the Vault cluster. Vault, a key component of the secret service, keeps a detailed log of all requests and responses. You can view every authenticated interaction with Vault, including errors. Because these logs contain sensitive information, many strings within requests and responses are hashed so that secrets are not shown in plain text in the audit logs.

  • job="kubernetes-audit"

    Use this log label to filter for the audit logs of the Kubernetes cluster. The Kubernetes audit policy is configured to log request metadata: requesting user, time stamp, resource, verb, etc. Request body and response body are not included in the audit logs.

  • job="audit"

    Use this log label to filter for the Oracle Linux kernel audit daemon logs. The kernel audit daemon (auditd) is the userspace component of the Linux Auditing System. It captures specific events such as system logins, account modifications and sudo operations.

  • log="audit"

    Use this log label to filter for the audit logs of the ZFS Storage Appliance.

In addition to using the log labels from the list, you can also build custom queries. For example, to filter for the audit logs of the admin service and API service, enter the following query into the field next to the Log Labels list:

{job=~"(admin|api-server)"} | json tag="tag" | tag=~"(api-audit.log|audit.log)"

To execute, either click the Run Query button in the top-right corner or press Shift + Enter.
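
As another illustration, the following query filters the standard output logs of the compute service for lines containing ERROR; this is an example you can adapt, not a pre-defined query, and the |= operator is the standard Loki line filter for matching a substring.

{job="k8s-stdout-logs", k8s_app="compute"} |= "ERROR"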

Using Oracle Auto Service Request

Oracle Private Cloud Appliance is qualified for Oracle Auto Service Request (ASR). ASR is integrated with My Oracle Support. When specific hardware failures occur, ASR automatically opens a service request and sends diagnostic information. The appliance administrator receives notification that a service request is open.

Using ASR is optional: the service must be registered and enabled for your appliance.

Understanding Oracle Auto Service Request

ASR automatically opens service requests when specific Private Cloud Appliance hardware faults occur. To enable this feature, the Private Cloud Appliance must be configured to send hardware fault telemetry to Oracle directly at https://transport.oracle.com, to a proxy host, or to a different endpoint. For example, you can use a different endpoint if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.

When a hardware problem is detected, ASR submits a service request to Oracle Support Services. In many cases, Oracle Support Services can begin work on resolving the issue before the administrator is even aware the problem exists.

ASR detects faults in the most common hardware components, such as disks, fans, and power supplies, and automatically opens a service request when a fault occurs. ASR does not detect all possible hardware faults, and it is not a replacement for other monitoring mechanisms, such as SMTP alerts, within the customer data center. ASR is a complementary mechanism that expedites and simplifies the delivery of replacement hardware. ASR should not be used for downtime events in high-priority systems. For high-priority events, contact Oracle Support Services directly.

An email message is sent to both the My Oracle Support email account and the technical contact for Private Cloud Appliance to notify them of the creation of the service request. A service request might not be filed automatically in some cases, for example if a loss of connectivity to ASR occurs. Administrators should monitor their systems for faults and call Oracle Support Services if they do not receive notice that a service request has been filed automatically.

For more information about ASR, consult the Oracle Auto Service Request product documentation.

Oracle Auto Service Request Prerequisites

Before you register for the ASR service, ensure the following prerequisites are satisfied:

  1. You have a valid My Oracle Support account.

    If necessary, create an account at https://support.oracle.com/portal/.

  2. The following are set up correctly in My Oracle Support:

    • Technical contact person at the customer site who is responsible for Private Cloud Appliance

    • Valid shipping address at the customer site where the Private Cloud Appliance is located, so that parts are delivered to the site where they must be installed

  3. The management nodes have an active outbound Internet connection using HTTPS or an HTTPS proxy.

    For example, try curl to test whether you can access https://support.oracle.com/portal/.
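
For example, a minimal connectivity check from a management node could look like the following; the second form is for environments that require an HTTPS proxy (replace the placeholders with your proxy details).

$ curl -I https://support.oracle.com/portal/
$ curl -I -x <proxy_fqdn>:<proxy_port> https://support.oracle.com/portal/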

Registering Private Cloud Appliance for Oracle Auto Service Request

To register a Private Cloud Appliance as an ASR client, the appliance must be configured to send hardware fault telemetry to Oracle in one of the following ways: directly to https://transport.oracle.com, through a proxy host, or to a different endpoint.

An example of when you would use a different endpoint is if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.

When you register your Private Cloud Appliance for ASR, the ASR service is automatically enabled.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Register button.

  3. Fill in the username and password, then complete the fields for the Phone Home configuration that you choose.

    • Username: Required. Enter your Oracle Single Sign On (SSO) credentials, which can be obtained from My Oracle Support.

    • Password: Required. Enter the password for your SSO account.

    • Proxy Username: To use a proxy host, enter a username to access that host.

    • Proxy Password: To use a proxy host, enter the password to access that host.

    • Proxy Host: To use a proxy host, enter the name of that host.

    • Proxy Port: To use a proxy host, enter the port used to access the host.

    • Endpoint: If you use an aggregation point, or other endpoint for ASR data consolidation, enter that endpoint in this format: http://host[:port]/asr

Using the Service CLI

Configure ASR directly to https://transport.oracle.com

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=asr-pca3_ca@example.com \ 
    password=********  confirmPassword=******** \
    endpoint=https://transport.oracle.com/
    Command: asrClientRegister username=asr-pca3_ca@example.com \ 
    password=*****  confirmPassword=***** \ 
    endpoint=https://transport.oracle.com/
    Status: Success
    Time: 2021-07-12 18:47:14,630 UTC
  3. Confirm the configuration.

    PCA-ADMIN> show asrPhonehome
    Command: show asrPhonehome
    Status: Success
    Time: 2021-09-30 13:08:42,210 UTC
    Data:
      Is Registered = true
      Overall Enable Disable = true
      Username = asr.user@example.com
      Endpoint = https\://transport.oracle.com/
    PCA-ADMIN>

Configure ASR to a Proxy Host

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=asr-pca3_ca@oracle.com \ 
    password=******** confirmPassword=******** \ 
    proxyHost=zeb proxyPort=80 \ 
    proxyUsername=support \ 
    proxyPassword=**** proxyConfirmPassword=****

Configure ASR to a Different Endpoint

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientRegister custom command to register the appliance.

    PCA-ADMIN> asrClientRegister username=oracle_email@example.com \ 
    password=********  confirmPassword=******** \
    endpoint=http://<endpoint_host>:<port>/asr
    Command: asrClientRegister username=oracle_email@example.com \ 
    password=*****  confirmPassword=***** \ 
    endpoint=http://<endpoint_host>:<port>/asr
    Status: Success
    Time: 2021-07-12 18:47:14,630 UTC

Testing Oracle Auto Service Request Configuration

Once configured, test your ASR configuration to ensure end-to-end communication is working properly.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Select Test Registration in the Controls menu.

  3. Click Test Registration. A dialog confirms whether the test is successful.

  4. If the test is not successful, confirm your ASR configuration information and repeat the test.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientsendTestMsg custom command to test the ASR configuration.

    PCA-ADMIN> asrClientsendTestMsg
    Command: asrClientsendTestMsg
    Status: Success
    Time: 2021-12-08 18:43:30,093 UTC
    PCA-ADMIN>

Unregistering Private Cloud Appliance for Oracle Auto Service Request

When you unregister your Private Cloud Appliance for ASR, the ASR service is automatically disabled; you do not need to perform a separate step.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Unregister button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientUnregister custom command to unregister the appliance.

    PCA-ADMIN> asrClientUnregister
    Command: asrClientUnregister
    Status: Success
    Time: 2021-06-23 15:25:18,127 UTC
    PCA-ADMIN>

Disabling Oracle Auto Service Request

You can disable ASR on an appliance to temporarily prevent fault messages from being sent and service requests from being created. For example, during system maintenance, components might be down but not failed or faulted. To restart the ASR service, see Enabling Oracle Auto Service Request.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Disable button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientDisable custom command to halt the ASR service.

    PCA-ADMIN> asrClientDisable
    Command: asrClientDisable
    Status: Success
    Time: 2021-06-23 15:26:17,753 UTC
    PCA-ADMIN>

Enabling Oracle Auto Service Request

This section describes how to restart the ASR service if the ASR service is disabled.

Using the Service Web UI

  1. Open the navigation menu and click ASR Phone Home.

  2. Click the Enable button. Confirm the operation when prompted.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the asrClientEnable custom command to start the ASR service.

    PCA-ADMIN> asrClientEnable
    Command: asrClientEnable
    Status: Success
    Time: 2021-06-23 15:26:47,632 UTC
    PCA-ADMIN>

Using Support Bundles

Support bundles are files of diagnostic data collected from the Private Cloud Appliance that are used to evaluate and fix problems.

Support bundles can be uploaded to Oracle Support automatically or manually. Support bundles are uploaded securely and contain the minimum required data: system identity (not IP addresses), problem symptoms, and diagnostic information such as logs and status.

Support bundles can be created and not uploaded. You might want to create a bundle for your own use. Creating a support bundle is a convenient way to collect related data.

Support bundles are created and uploaded in the following ways:

Oracle Auto Service Request (ASR)

ASR automatically creates a service request and support bundle when certain hardware faults occur. The service request and support bundle are automatically sent to Oracle Support, and the Private Cloud Appliance administrator is notified. See Using Oracle Auto Service Request.

asrInitiateBundle

The asrInitiateBundle command is a PCA-ADMIN command that creates a support bundle, attaches the support bundle to an existing service request, and uploads to Oracle Support. See Using the asrInitiateBundle Command.

support-bundles

The support-bundles command is a management node command that creates a support bundle of a specified type. Oracle Support might ask you to run this command to collect more data related to a service request, or you might want to collect this data for your own use. See Using the support-bundles Command.

Manual upload to Oracle Support

Several methods are available for uploading support bundles or other data to Oracle Support. See Uploading Support Bundles to Oracle Support.

Using the asrInitiateBundle Command

The asrInitiateBundle command takes three parameters, all required:

PCA-ADMIN> asrInitiateBundle mode=triage sr=SR_number bundleType=auto

A triage support bundle is collected and automatically attached to service request SR_number. For more information about the triage support bundle, see Triage Mode.

If the ASR service is enabled, bundleType=auto uploads the bundle to Oracle Support using the Phone Home service. For information about the Phone Home service, see Registering Private Cloud Appliance for Oracle Auto Service Request.
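
For example, a typical invocation from the Service CLI, using a placeholder service request number, might look like this:

PCA-ADMIN> asrInitiateBundle mode=triage sr=3-xxxxxxxxxxx bundleType=auto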

Using the support-bundles Command

The support-bundles command collects various types of bundles, or modes, of diagnostic data such as health check status, command outputs, and logs. This topic describes the available modes. The following is the recommended way to use this command:

  1. Start data collection by specifying triage mode to understand the preliminary status of the Private Cloud Appliance.

  2. If NOT_HEALTHY appears in the triage mode results, then do one of the following:

    • Use time_slice mode to collect data by time slots. These results can be further narrowed by specifying pod name, job, and k8s_app label.

    • Use smart mode to query data from specific health-checkers.

The support-bundles command requires a mode (-m) option. Some modes have additional options.

The following options are common to all modes of the support-bundles command.

  • -m mode (required): The type of bundle.

  • -sr SR_number, --sr_number SR_number (optional): The service request number.

For most modes, the support-bundles command produces a single archive file. The output archive file is named [SR_number_]pca-support-bundle.current-time.tgz. The SR_number is used if you provided the -sr option. If you are creating the support bundle for a service request, you should specify the SR_number.

For native mode, the support-bundles command produces a directory of archive files.

The archive files are stored in /nfs/shared_storage/support_bundles/ on the management node.
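
For example, to see which bundles have already been collected, list the shared storage location from the management node (a simple check, assuming you are logged in as described in the next section):

[root@pcamn01 ~]# ls -lh /nfs/shared_storage/support_bundles/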

Log in to the Management Node

To use the support-bundles command, log in as root to the management node that is running Pacemaker resources. Collect data first from the management node that is running Pacemaker resources, then from other management nodes as needed.

If you do not know which management node is running Pacemaker resources, log in to any management node and check the Pacemaker cluster status. The following example output shows that the Pacemaker cluster resources are running on pcamn01.

[root@pcamn01 ~]# pcs status
Cluster name: mncluster
Stack: corosync
Current DC: pcamn01
...
Full list of resources:

 scsi_fencing (stonith:fence_scsi): Stopped (disabled)
 Resource Group: mgmt-rg
     vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn01
     vip-mgmt-host (ocf::heartbeat:IPaddr2): Started pcamn01
     vip-mgmt-ilom (ocf::heartbeat:IPaddr2): Started pcamn01
     vip-mgmt-lb (ocf::heartbeat:IPaddr2): Started pcamn01
     vip-mgmt-ext (ocf::heartbeat:IPaddr2): Started pcamn01
     l1api (systemd:l1api): Started pcamn01
     haproxy (ocf::heartbeat:haproxy): Started pcamn01
     pca-node-state (systemd:pca_node_state): Started pcamn01
     dhcp (ocf::heartbeat:dhcpd): Started pcamn01
     hw-monitor (systemd:hw_monitor): Started pcamn01

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
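
If you only need to identify the node that runs the mgmt-rg resource group, a quick filter of the same output is usually enough; this is one convenient way, not the only one. The resource line printed after the group name ends with the node on which the group is started.

[root@pcamn01 ~]# pcs status | grep -A1 "Resource Group: mgmt-rg"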

Triage Mode

In triage mode, Prometheus platform_health_check is queried for both HEALTHY and NOT_HEALTHY status. If NOT_HEALTHY is found, use time_slice mode to get more detail.

[root@pcamn01 ~]# support-bundles -m triage

The following files are in the output archive file.

  • header.json: Timestamp and command line used to generate this bundle.

  • compute_node_info.json: Pods running in the compute nodes.

  • management_node_info.json: Pods running in the management nodes.

  • rack_info.json: Rack installation time and build version.

  • loki_search_results.log.n: Chunk files in JSON format.

Time Slice Mode

In time slice mode, data is collected by specifying start and end timestamps.

If you do not specify either the -j or --all option, then data is collected from all health checker jobs.

You can narrow the data collection by specifying any of the following:

  • Loki job label

  • Loki k8s_app label

  • Pod name

[root@pcamn01 ~]# support-bundles -m time_slice -j flannel-checker -s 2021-05-29T22:40:00.000Z \
-e 2021-06-29T22:40:00.000Z -l INFO

See more examples below.

The time slice mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.

  • Only one of --job_name, --all, or --k8s_app can be specified.

  • If none of --job_name, --all, or --k8s_app is specified, pod filtering uses the default job pattern (.+checker).

  • The --all option can collect a huge amount of data. You might want to limit your time slice to 48 hours.

  • -j job_name, --job_name job_name (optional): Loki job name. Default value: .+checker. See Label List Query below.

  • --all (optional): Queries all job names except jobs known for excessive logging, such as audit, kubernetes-audit, and vault-audit, and the k8s_app label pcacoredns.

  • --k8s_app label (optional): The k8s_app label value to query within the k8s-stdout-logs job. See Label List Query below.

  • -l level, --levelname level (optional): Message level.

  • -s timestamp, --start_date timestamp (required): Start date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd.

  • -e timestamp, --end_date timestamp (required): End date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd.

  • --pod_name pod_name (optional): The pod name (such as kube or network-checker) used to filter output based on the pod. Only the starting letters are necessary.

Label List Query

Use the label list query to list the available job names and k8s_app label values.

[root@pcamn01 ~]# support-bundles -m label_list
2021-10-14T23:19:18.265 - support_bundles - INFO - Starting Support Bundles
2021-10-14T23:19:18.317 - support_bundles - INFO - Locating filter-logs Pod
2021-10-14T23:19:18.344 - support_bundles - INFO - Executing command - ['python3', 
'/usr/lib/python3.6/site-packages/filter_logs/label_list.py']
2021-10-14T23:19:18.666 - support_bundles - INFO -
Label:  job
Values: ['admin', 'api-server', 'asr-client', 'asrclient-checker', 'audit', 'cert-checker', 'ceui', 
'compute', 'corosync', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 'flannel-checker', 
'his', 'hms', 'iam', 'k8s-stdout-logs', 'kubelet', 'kubernetes-audit', 'kubernetes-checker', 
'l0-cluster-services-checker', 'messages', 'mysql-cluster-checker', 'network-checker', 'ovm-agent', 
'ovn-controller', 'ovs-vswitchd', 'ovsdb-server', 'pca-healthchecker', 'pca-nwctl', 'pca-platform-l0', 
'pca-platform-l1api', 'pca-upgrader', 'pcsd', 'registry-checker', 'sauron-checker', 'secure', 
'storagectl', 'uws', 'vault', 'vault-audit', 'vault-checker', 'zfssa-checker', 'zfssa-log-exporter']
 
Label:  k8s_app
Values: ['admin', 'api', 'asr-client', 'asrclient-checker', 'brs', 'cert-checker', 'compute', 
'default-http-backend', 'dr-admin', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 
'flannel-checker', 'fluentd', 'ha-cluster-exporter', 'has', 'his', 'hms', 'iam', 'ilom', 
'kube-apiserver', 'kube-controller-manager', 'kube-proxy', 'kubernetes-checker', 
'l0-cluster-services-checker', 'loki', 'loki-bnr', 'mysql-cluster-checker', 'mysqld-exporter', 
'network-checker', 'pcacoredns', 'pcadnsmgr', 'pcanetwork', 'pcaswitchmgr', 'prometheus', 'rabbitmq', 
'registry-checker', 'sauron-api', 'sauron-checker', 'sauron-grafana', 'sauron-ingress-controller', 
'sauron-mandos', 'sauron-operator', 'sauron-prometheus', 'sauron-prometheus-gw', 
'sauron-sauron-exporter', 'sauron.oracledx.com', 'storagectl', 'switch-metric', 'uws', 'vault-checker', 
'vmconsole', 'zfssa-analytics-exporter', 'zfssa-csi-nodeplugin', 'zfssa-csi-provisioner', 'zfssa-log-exporter']

Examples:

No job label, no k8s_app label, collect log from all health checkers.

[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

One job ceui.

[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -j ceui -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

One k8s_app network-checker.

[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx --k8s_app network-checker -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

All jobs and date.

[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s `date -d "2 days ago" -u +"%Y-%m-%dT%H:%M:%S.000Z"` -e `date -u +"%Y-%m-%dT%H:%M:%S.000Z"`

All jobs.

[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx --all -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

The following files are in the output archive file.

  • header.json: Timestamp and command line used to generate this bundle.

  • loki_search_results.log.n: Chunk files in JSON format.

Smart Mode

In smart mode, health checkers are queried for recent NOT_HEALTHY status. By default, two days of logs are collected. If you need more than two days of logs, specify the --force option. Use the -hc option to specify a health checker.

[root@pcamn01 ~]# support-bundles -m smart

See more examples below.

The smart mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.

If only the start date or only the end date is given, the missing value is calculated as two days after the given start date or two days before the given end date, and that range is queried. If only the start date is given and it is within the two-day time range, the default end date (the most recent unhealthy time) is used.

  • -hc health_checker_name, --health_checker health_checker_name (optional): Loki health checker name. See the health checker log files list below.

  • --errors_only (optional): Level name filtering takes place only on Error, Critical, and Severe.

  • --force (optional): Force the start date to override the two-day time range limit.

  • -s timestamp, --start_date timestamp (optional): Start date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd. Default value: end date minus 2 days.

  • -e timestamp, --end_date timestamp (optional): End date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd. Default value: most recent unhealthy time.

The following list shows the supporting log files and log sources for each health checker.

  • L0_hw_health-checker: pca.log, pca.health.log, pca.l1api.log, pacemaker.log, and the pca-platform-l1api, pca-healthchecker, pacemaker, and pca-platform-l0 logs

  • cert-checker: no logs; only the certificate and expiry date (from the checker)

  • etcd-checker: etcd-container.log

  • flannel-checker: k8s-stdout-logs, filtered by pod (flannel), node, and container

  • kubernetes-checker: k8s-stdout-logs, filtered by pod (kube-apiserver), node, and container

  • l0-cluster-services-checker: pacemaker.log, corosync.log, and the corosync and pcsd logs

  • mysql-cluster-checker: mysqld

  • network-checker: HMS

  • registry-checker: messages (the registry itself does not produce logs)

  • vault-checker: hc-vault-audit.log

  • zfssa-checker: zfssa-checker and zfssa-log-exporter (log = alert | audit | pcalog)

Examples:

No -hc. Query unhealthy data from all health checkers.

[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx

Use -hc to specify one health checker.

[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx -hc network-checker

Timestamps with --force.

[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx -s "2022-01-11T00:00:00" -e "2022-01-15T23:59:59" --force

The following files are in the output archive file.

  • header.json: Timestamp and command line used to generate this bundle.

  • loki_search_results.log.n: Chunk files in JSON format.

Native Mode

Unlike other support bundle modes, the native bundle command returns immediately and the bundle collection runs in the background. Native bundles might take hours to collect. Collection progress information is provided in the native_collection.log in the bundle directory.

Also unlike other support bundle modes, the output of native bundles is not a single archive file. Instead, a bundle directory is created in the /nfs/shared_storage/support_bundles/ area on the management node. The directory contains the native_collection.log file and a number of tar.gz files.

[root@pcamn01 ~]# support-bundles -m native -t bundle_type [-c component_name] [-sr SR_number]

The native mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.

  • -t bundle_type, --type bundle_type (required): Bundle type: sosreport or zfs-bundle.

  • -c component_name, --component component_name (optional): Component name. This option only applies to type sosreport.

ZFS Bundle

When type is zfs-bundle, a ZFS support bundle collection starts on both ZFS nodes and downloads the new ZFS support bundles into the bundle directory.

[root@pcamn01 ~]# support-bundles -m native -t zfs-bundle
2021-11-16T22:49:30.982 - support_bundles - INFO - Starting Support Bundles
2021-11-16T22:49:31.037 - support_bundles - INFO - Locating filter-logs Pod
2021-11-16T22:49:31.064 - support_bundles - INFO - Executing command - ['python3', '/usr/lib/python3.6/site-packages/filter_logs/native.py', '-t', 'zfs-bundle']
2021-11-16T22:49:31.287 - support_bundles - INFO - LAUNCHING COMMAND: ['python3', '/usr/lib/python3.6/site-packages/filter_logs/native_app.py', '-t', 'zfs-bundle', '--target_directory', '/support_bundles/zfs-bundle_20211116T224931267']
ZFS native bundle collection running to /nfs/shared_storage/support_bundles/zfs-bundle_20211116T224931267
Monitor /nfs/shared_storage/support_bundles/zfs-bundle_20211116T224931267/native_collection.log for progress.
 
2021-11-16T22:49:31.287 - support_bundles - INFO - Finished running Support Bundles
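
Because the collection runs in the background, you can follow its progress by tailing the log file reported in the command output, for example:

[root@pcamn01 ~]# tail -f /nfs/shared_storage/support_bundles/zfs-bundle_20211116T224931267/native_collection.log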

SOS Report Bundle

When type is sosreport, the component_name is a management node or compute node. If component_name is not specified, the report is collected from all management and compute nodes.

[root@pcamn01 ~]# support-bundles -m native -t sosreport -c pcacn003 -sr SR_number

Uploading Support Bundles to Oracle Support

After you create a support bundle using the support-bundles command as described in Using the support-bundles Command, you can use the methods described in this topic to upload the support bundle to Oracle Support.

To use these methods, you must satisfy the following requirements:

  • You must have a My Oracle Support user ID with Create and Update SR permissions granted by the appropriate Customer User Administrator (CUA) for each Support Identifier (SI) being used to upload files.

  • For file uploads to existing service requests, the Support Identifier associated with the service request must be in your profile.

  • To upload files larger than 2 GB, sending machines must have network access to connect to the My Oracle Support servers at transport.oracle.com to use FTPS and HTTPS.

    The Oracle FTPS service is a "passive" implementation. With an implicit configuration, the initial connection is from the client to the service on control port 990, and the connection is then switched to a high port to exchange data. Oracle defines a possible data port range of 32000-42000; depending on your network configuration, you might need to enable outbound connections on both port 990 and ports 32000-42000. TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 is the only encryption method enabled.

    The Oracle HTTPS diagnostic upload service uses the standard HTTPS port of 443 and does not require any additional ports to be opened.

    When using command line protocols, do not include your password in the command. Enter your password only when prompted.

  • Oracle requires the use of TLS 1.2+ for all file transfers.

  • Do not upload encrypted or password-protected files, standalone or within an archive. A service request update will note such a file as corrupted, or the upload will be rejected because disallowed file types were found. Files are encrypted in transit when you use FTPS or HTTPS; additional protection is not required.

  • Do not upload files with file type extensions exe, bat, asp, or com, either standalone or within an archive. A Service Request update will note that a disallowed file type was found.

Uploading Files 2 GB or Smaller

Use the SR file upload utility on the My Oracle Support Portal.

  1. Log in to My Oracle Support with your My Oracle Support username and password.

  2. Do one of the following:

    • Create a new service request and in the next step, select the Upload button.

    • Select and open an existing service request.

  3. Click the Add Attachment button located at the top of the page.

  4. Click the Choose File button.

  5. Navigate and select the file to upload.

  6. Click the Attach File button.

You can also use the methods described in the next section for larger files.

Uploading Files Larger Than 2 GB

You cannot upload a file larger than 200 GB. See Splitting Files.

FTPS

Syntax:

Be sure to include the / character after the service request number.

$ curl -T path_and_filename -u MOS_user_ID ftps://transport.oracle.com/issue/SR_number/

Example:

$ curl -T /u02/files/bigfile.tar -u MOSuserID@example.com ftps://transport.oracle.com/issue/3-1234567890/

HTTPS

Syntax:

Be sure to include the / character after the service request number.

$ curl -T path_and_filename -u MOS_user_ID https://transport.oracle.com/upload/issue/SR_number/

Example:

$ curl -T D:\data\bigfile.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/

Renaming the file during send

$ curl -T D:\data\bigfile.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/NotSoBig.tar

Using a proxy

$ curl -k -T D:\data\bigfile.tar -x proxy.example.com:80 -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/

Splitting Files

You can split a large file into multiple parts and upload the parts. Oracle Transport will concatenate the segments when you complete uploading all the parts.

Only HTTPS protocol can be used. Only the UNIX split utility can be used. The Microsoft Windows split utility produces an incompatible format.

To reduce upload times, compress the original file prior to splitting.

  1. Split the file.

    The following command splits the file file1.tar into 2 GB parts named file1.tar.partaa and file1.tar.partab.

    Important:

    Specify the .part extension exactly as shown below.

    $ split -b 2048m file1.tar file1.tar.part
  2. Upload the resulting file1.tar.partaa and file1.tar.partab files.

    Important:

    Do not rename these output part files.

    $ curl -T file1.tar.partaa -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/
    $ curl -T file1.tar.partab -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/
  3. Send the command to put the parts back together.

    The split files themselves are not attached to the service request. Only the final concatenated file is attached to the service request.

    $ curl -X PUT -H X-multipart-total-size:original_size -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/file1.tar?multiPartComplete=true

    In the preceding command, original_size is the size of the original unsplit file as shown by a file listing (see the example after these steps).

  4. Verify the size of the newly-attached file.

    Note:

    This verification command must be executed immediately after the concatenation command in Step 3. Otherwise, the file will have begun processing and will no longer be available for this command.

    $ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/file1.tar
        X-existing-file-size: original_size
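
A simple way to obtain the original_size value used in steps 3 and 4 is to query the file size in bytes before splitting; this is a sketch using the standard stat utility and the example file name from step 1.

$ stat -c %s file1.tar    # prints the file size in bytes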

Resuming an Interrupted HTTPS Upload

You can resume a file upload that terminated abnormally. Resuming can only be done by using HTTPS; resuming does not work with FTPS. When an upload is interrupted, start by retrieving the size of the portion that has already been uploaded, as shown in the first step below.

  1. Determine how much of the file has already been uploaded.

    $ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
    HTTP/1.1 204 No Content
    Date: Tue, 15 Nov 2022 22:53:54 GMT
    Content-Type: text/plain
    X-existing-file-size: already_uploaded_size
    X-Powered-By: Servlet/3.0 JSP/2.2
  2. Resume the file upload.

    Note the file size returned in “X-existing-file-size” in Step 1. Use that file size after the -C switch and in the -H “X-resume-offset:” switch.

    $ curl -Calready_uploaded_size -H "X-resume-offset: already_uploaded_size" -T myinfo.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
  3. Verify the final file size.

    $ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
    X-existing-file-size: original_size

    In the preceding command, original_size is the size of the original file as shown by a file listing.