5 Status and Health Monitoring
The system health checks and monitoring data are the foundation of problem detection. All the necessary troubleshooting and debugging information is maintained in a single data store, and does not need to be collected from individual components when an issue needs to be investigated. The overall health of the system is captured in one central location: Grafana.
Oracle has built default dashboards and alerts into Grafana, as well as a mechanism to consult the logs stored in Loki. Customers might prefer to expand and customize this setup, but this is beyond the scope of the Oracle Private Cloud Appliance documentation.
Implementation details and technical background information for this feature can be found in the Oracle Private Cloud Appliance Concepts Guide. Refer to the section "Status and Health Monitoring" in the chapter Appliance Administration Overview.
Using Grafana
With Grafana, Oracle Private Cloud Appliance offers administrators a single, visually oriented interface to the logs and metrics collected at all levels and across all components of the system. This section provides basic guidelines to access Grafana and navigate through the logs and monitoring dashboards.
To access the Grafana home page

1. Open the Service Web UI and log in.

2. On the right-hand side of the dashboard, click the Monitoring tile.

   The Grafana home page opens in a new browser tab. Enter your user name and password when prompted.
When metrics and logs are stored in Prometheus and Loki, they are given a time stamp based on the time and time zone settings of the appliance. However, Grafana displays times according to your user preferences, which can result in an offset if you are in a different time zone. It might be preferable to synchronize the time line in the Grafana visualizations with the time zone of the appliance.
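To illustrate this offset, the following sketch (illustrative only, not part of the appliance tooling) renders one appliance timestamp, assumed to be in UTC, as a user in another time zone would see it in Grafana:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# A metric sample stamped by the appliance (assumed to run in UTC).
sample = datetime(2022, 6, 17, 14, 43, 13, tzinfo=timezone.utc)

# The same instant as Grafana would render it for a user whose
# time zone preference is US/Pacific instead of the appliance zone.
local_view = sample.astimezone(ZoneInfo("America/Los_Angeles"))

print(sample.isoformat())      # 2022-06-17T14:43:13+00:00
print(local_view.isoformat())  # 2022-06-17T07:43:13-07:00
```

Both values name the same instant; only the display differs, which is why setting the Grafana preference to the appliance time zone avoids confusion when correlating logs.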
To change the Grafana time line display

1. Open the Grafana home page.

2. In the menu bar on the left-hand side, click your user account icon (near the bottom) to display your account preferences.

3. In the Preferences section, change the Time Zone setting to the same time zone as the appliance.

4. Click the Save button to apply the change.
The pre-defined dashboards for Private Cloud Appliance are not directly accessible from the Grafana home page, although you can star your most used dashboards to appear on your home page later. Dashboards are organized in folders, which you access through the Dashboards section of the main menu.
To browse the Grafana dashboards

1. In the menu bar on the left-hand side, point to Dashboards and select Manage.

   The list of folders, or dashboard sets, is displayed.

2. Click a folder to display the list of dashboards it contains. Click a dashboard to display its contents.

3. To navigate back to the list of folders and dashboards, use the menu bar as you did in step 1.
With the exception of the My Sauron (Read Only) dashboard set, all pre-defined dashboards and panels are editable by design. You can modify them or create your own using the specific metrics you want to monitor. The same applies to the alerts.
Alerts are managed in a separate area. Oracle has pre-defined a series of alerts for your convenience.
To access the alerting rules and notifications

1. In the menu bar on the left-hand side, click Alerting (the bell icon).

   A list of all defined alert rules is displayed, including their current status.

2. Click an alert rule to display a detail panel and see how its status has evolved over time and relative to the alert threshold.

3. To navigate back to the list of alert rules, use the menu bar as you did in step 1.

4. To configure alert notifications, go to the Notification Channels tab of the Alerting page.
Note:

If you wish to configure custom alerts using your own external notification channel, you must first configure the proxy for Grafana using the Sauron API endpoint. To do so, log in to the management node that owns the management virtual IP and run the following command:

```
$ sudo curl -u <admin_user_name> \
  -XPUT 'https://api.<mypca>.example.com/v1/grafana/proxy/config?http-proxy=<proxy_fqdn>:<proxy_port>&https-proxy=<proxy_fqdn>:<proxy_port>'
Enter host password for user '<admin_user_name>':
Grafana proxy config successfully updated!
```
Finally, Grafana also provides access to the appliance logs, which are aggregated through Loki. For more information, see Accessing System Logs.
Checking the Health and Status of Hardware and Platform Components
The hardware and platform layers form the foundations of the system architecture. Any unhealthy condition at this level is expected to have an adverse effect on operations in the infrastructure services. A number of pre-defined Grafana dashboards allow you to check the status of those essential low-level components, and drill down into the real-time and historic details of the relevant metrics.
The dashboards described in this section provide a good starting point for basic system health checks, and troubleshooting in case issues are found. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.
Grafana Folder | Dashboard | Description
---|---|---
Service Monitoring | Server Stats | This comprehensive dashboard displays telemetry data for the server nodes. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on. Some panels in this dashboard display a large number of time series in a single graph, so note that you can click to display a single one, or hover over the graph to view detailed data at a specific point on the time axis.
PCA 3.0 Service Advisor | Platform Health Check | This dashboard integrates the appliance health check mechanisms into the centralized approach that Grafana provides for logging and monitoring. By default, the Platform Health Check dashboard displays the failures for all health check services. You can change the panel display by selecting a health checker from the list of platform services, and you can choose to display healthy, unhealthy or all results. Typically, if you see health check failures you want to start troubleshooting. For that purpose, each health check result contains a time stamp that serves as a direct link to the related Loki logs. To view the logs related to any health check result, simply click the time stamp.
My Sauron (Read Only) | Node Exporter Full | This dashboard displays a large number of detailed metric panels for a single compute or management node. Select a host from the list to display its data. This dashboard could be considered a fine-grained extension of the Server Stats dashboard. The many different panels provide detailed coverage of the server node hardware status as well as the operating system services and processes. Information that you would typically collect at the command line of each physical node is combined into a single dashboard showing live data and its evolution over time. All dashboards in the My Sauron (Read Only) folder provide data that would be critical in case a system-level failure needs to be resolved. Therefore, these dashboards cannot be modified or deleted.
Viewing and Interpreting Monitoring Data
The infrastructure services layer, which is built on top of the platform and enables all the cloud user and administrator functionality, can be monitored through an extensive collection of Grafana dashboards. These microservices are deployed across the three management nodes in Kubernetes containers, so their monitoring is largely based on Kubernetes node and pod metrics. The Kubernetes cluster also extends onto the compute nodes, where Kubernetes worker nodes collect vital additional data for system operation and monitoring.
The dashboards described in this section provide a good starting point for microservices health monitoring. You might prefer to use different dashboards, metrics and visualizations instead. The necessary data, collected across the entire system, is stored in Prometheus, and can be queried and presented through Grafana in countless ways.
Grafana Folder | Dashboard | Description
---|---|---
Service Monitoring | ClusterLabs HA Cluster Details | This dashboard uses a bespoke Prometheus exporter to display data for HA clusters based on Pacemaker. On each HTTP request it locally inspects the cluster status, by parsing pre-existing distributed data provided by the cluster components' tools. The monitoring data includes Pacemaker cluster summary, nodes and resource stats, and Corosync ring errors and quorum votes.
Service Monitoring | MySQL Cluster Exporter | This dashboard displays performance details for the MySQL database cluster. Data includes database service metrics such as uptime, connection statistics, table lock counts, as well as more general information about MySQL objects, connections, network traffic, memory and CPU usage, etc.
Service Monitoring | Service Level | This dashboard displays detailed information about RabbitMQ requests that are received by the fundamental appliance services. It allows you to monitor the number of requests, request latency, and any requests that caused an error.
Service Monitoring | VM Stats | This comprehensive dashboard displays resource consumption information across the compute instances in your environment. It includes graphs for CPU and memory utilization, disk activity, network traffic, and so on. The panels in this dashboard display a large number of time series in a single graph, so note that you can click to display a single one, or hover over the graph to view detailed data at a specific point on the time axis.
PCA 3.0 Service Advisor | Kube Endpoint | This dashboard focuses specifically on the Kubernetes endpoints and provides endpoint alerts. These alerts can be sent to a notification channel of your choice.
PCA 3.0 Service Advisor | Kube Ingress | This dashboard provides data about ingress traffic to the Kubernetes services and their pods. Two alerts are built-in and can be sent to a notification channel of your choice.
PCA 3.0 Service Advisor | Kube Node | This dashboard displays metric data for all the server nodes, meaning management and compute nodes, that belong to the Kubernetes cluster and host microservices pods. You can monitor pod count, CPU and memory usage, and so on. The metric panels display information for all nodes. In the graph-based panels you can click to view information for just a single node.
PCA 3.0 Service Advisor | Kube Pod | This dashboard displays metric data at the level of the microservices pods, allowing you to view the total number of pods overall and how they are distributed across the nodes. You can monitor their status per namespace and per service, and check if they have triggered any alerts.
PCA 3.0 Service Advisor | Kube Service | This dashboard displays metric data at the Kubernetes service level. The data can be filtered for specific services, but displays all by default. Two alerts are built-in and can be sent to a notification channel of your choice.
Kubernetes Monitoring, Kubernetes Monitoring Containers, Kubernetes Monitoring Node | (all) | These folders contain a large and diverse collection of dashboards with a wide range of monitoring data, covering practically all aspects of your Kubernetes cluster. The data covers Kubernetes at the cluster, node, pod and container levels. Metrics provide insights into deployment, ingress, usage of CPU, disk, memory and network, and much more.
Monitoring System Capacity
It is important to track the key metrics that determine the system's capacity to host your compute instances and the storage they use. The detailed data for compute node load and storage usage can be found in the Grafana dashboards, but as an administrator you also have direct access to the current consumption of CPU and memory as well as storage space.
Viewing CPU and Memory Usage By Fault Domain
The `getFaultDomainInfo` command provides an overview of memory and CPU usage across the fault domains.
Using the Service Web UI

1. In the PCA Config navigation menu, click Fault Domains.

   The table displays CPU and memory usage data by fault domain.

2. To view more detailed information about a component, click its host name in the table.
Using the Service CLI

1. To display CPU and memory usage per fault domain, use the `getFaultDomainInfo` command.

   The UNASSIGNED row refers to compute nodes that are not currently assigned to a fault domain. Because these compute nodes do not belong to a fault domain, their memory and CPU usage in a fault domain is zero. You can access memory and CPU usage per compute node by viewing the Compute Node Information page in the Service Web UI.

   ```
   PCA-ADMIN> getFaultDomainInfo
   Command: getFaultDomainInfo
   Status: Success
   Time: 2022-06-17 14:43:13,292 UTC
   Data:
     id          totalCNs  totalMemory  freeMemory  totalvCPUs  freevCPUs  notes
     --          --------  -----------  ----------  ----------  ---------  -----
     UNASSIGNED  11        0.0          0.0         0           0
     FD1         1         984.0        968.0       120         118
     FD2         1         984.0        984.0       120         120
     FD3         1         984.0        984.0       120         120
   ```
Viewing Disk Space Usage on the ZFS Storage Appliance
The Service Enclave runs a storage monitoring tool called ZFS pool manager, which polls the ZFS Storage Appliance every 60 seconds. The Service CLI allows you to display its current information on the usage of available disk space in each ZFS pool. You can also set the usage threshold that triggers a fault when exceeded.
In a standard storage configuration you only have one pool. If your system includes high-performance disk trays then you can view usage information for both pools separately.
Use the Service CLI as follows to check storage capacity:
1. Display the status of a ZFS pool.

   ```
   PCA-ADMIN> list ZfsPool
   Command: list ZfsPool
   Status: Success
   Time: 2022-10-10 08:44:11,938 UTC
   Data:
     id                                    name
     --                                    ----
     e898b147-7cf0-4bd0-8b54-e32ec83d04cb  PCA_POOL
     c2f67943-df81-47a5-9713-06768318b623  PCA_POOL_HIGH

   PCA-ADMIN> show ZfsPool id=e898b147-7cf0-4bd0-8b54-e32ec83d04cb
   Command: show ZfsPool id=e898b147-7cf0-4bd0-8b54-e32ec83d04cb
   Status: Success
   Time: 2022-10-10 08:44:22,051 UTC
   Data:
     Id = e898b147-7cf0-4bd0-8b54-e32ec83d04cb
     Type = ZfsPool
     Pool Status = Online
     Free Pool = 44879343128576
     Total Pool = 70506183131136
     Pool Usage Percent = 0.3634693989163486
     Name = PCA_POOL
     Work State = Normal
   ```
2. Configure the fault threshold of the ZFS pool manager. It is set to 80 percent full (value = 0.8) by default.

   ```
   PCA-ADMIN> show ZfsPoolManager
   Command: show ZfsPoolManager
   Status: Success
   Time: 2022-10-10 08:58:11,231 UTC
   Data:
     Id = a6ca861b-f83a-4032-91c5-bc506394d0de
     Type = ZfsPoolManager
     LastRunTime = 2022-10-09 12:17:52,964 UTC
     Poll Interval (sec) = 60
     The minimum Zfs pool usage percentage to trigger a major fault = 0.8
     Manager's run state = Running

   PCA-ADMIN> edit ZfsPoolManager usageMajorFaultPercent=0.75
   Command: edit ZfsPoolManager usageMajorFaultPercent=0.75
   Status: Success
   Time: 2022-10-10 08:58:27,657 UTC
   JobId: 67cfe180-f2a2-4d59-a676-01b3d73cffae
   ```
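The `Pool Usage Percent` value in the `show ZfsPool` output corresponds to `1 - (Free Pool / Total Pool)`. The following sketch (an illustration based on the sample output above, assuming that relationship) reproduces the value and compares it against the fault threshold:

```python
# Values from the sample show ZfsPool output (bytes).
free_pool = 44879343128576
total_pool = 70506183131136

# Fraction of the pool in use.
usage = 1 - free_pool / total_pool
threshold = 0.8  # default usageMajorFaultPercent

print(f"Pool usage: {usage:.4f}")  # ~0.3635, matching Pool Usage Percent
print("major fault" if usage >= threshold else "within threshold")
```

At roughly 36% usage, this pool is well below the default 0.8 threshold, so no major fault would be raised.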
Accessing System Logs
Logs are collected from all over the system and aggregated in Loki. All the log data can be queried, filtered, and displayed using the central interface of Grafana.
To view the Loki logs

1. Open the Grafana home page.

2. In the menu bar on the left-hand side, click Explore (the compass icon).

   By default, the Explore page's data source is set to "Prometheus".

3. At the top of the page near the left-hand side, select "Loki" from the data source list.

4. Use the Log Labels list to query and filter the logs.
The logs are categorized with labels, which you can query in order to display log entries of a particular type or category. The principal log label categories used within Private Cloud Appliance are the following:
- `job`

  The log labels in this category are divided into three groups:

  - Platform: logs from services and components running in the foundation layers of the appliance architecture.

    Log labels in this category include: `"him"`/`"has"`/`"hms"` (hardware management), `"api-server"`, `"vault"`/`"etcd"` (secret service), `"corosync"`/`"pacemaker"`/`"pcsd"` (management cluster), `"messages"` (RabbitMQ), `"pca-platform-l0"`, `"pca-platform-l1api"`, and so on.

  - Infrastructure services: logs from the user-level cloud services and administrative services deployed on top of the platform. These services are easier to identify by their name.

    Log labels in this category include: `"brs"` (backup/restore), `"ceui"` (Compute Web UI), `"seui"` (Service Web UI), `"compute"`, `"dr-admin"` (disaster recovery), `"filesystem"`, `"iam"` (identity and access management), `"pca-upgrader"`, and so on.

  - Standard output: logs that the containerized infrastructure services send to the `stdout` stream. This output is visible to users when they execute a UI operation or CLI command.

    Use the log label `job="k8s-stdout-logs"` to filter for the standard output logs. The log data comes from the microservices' Kubernetes containers, and can be filtered further by specifying a pod and/or container name.

- `k8s_app`

  Log labels in this category allow you to narrow down the standard output logs (`job="k8s-stdout-logs"`). That log data comes from the microservices' Kubernetes containers, and can be filtered further by selecting the label that corresponds with the name of the specific service you are interested in.
You navigate through the logs by selecting one of the `job` or `k8s_app` log labels. You pick the label that corresponds with the service or application you are interested in, and the list of logs is displayed in reverse chronological order. You can narrow your search by zooming in on a portion of the time line shown above the log entries. Color coding helps to identify the items that require your attention; for example, warnings are marked in yellow and errors are marked in red.
Audit Logs
The audit logs can be consulted as separate categories. From the Log Labels list, you can select these audit labels:

- `job="vault-audit"`

  Use this log label to filter for the audit logs of the Vault cluster. Vault, a key component of the secret service, keeps a detailed log of all requests and responses. You can view every authenticated interaction with Vault, including errors. Because these logs contain sensitive information, many strings within requests and responses are hashed so that secrets are not shown in plain text in the audit logs.

- `job="kubernetes-audit"`

  Use this log label to filter for the audit logs of the Kubernetes cluster. The Kubernetes audit policy is configured to log request metadata: requesting user, time stamp, resource, verb, etc. Request body and response body are not included in the audit logs.

- `job="audit"`

  Use this log label to filter for the Oracle Linux kernel audit daemon logs. The kernel audit daemon (auditd) is the userspace component of the Linux Auditing System. It captures specific events such as system logins, account modifications and sudo operations.

- `log="audit"`

  Use this log label to filter for the audit logs of the ZFS Storage Appliance.
In addition to using the log labels from the list, you can also build custom queries. For example, to filter for the audit logs of the admin service and API service, enter the following query into the field next to the Log Labels list:
```
{job=~"(admin|api-server)"} | json tag="tag" | tag=~"(api-audit.log|audit.log)"
```

To execute the query, either click the Run Query button in the top-right corner or press Shift+Enter.
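A LogQL query like this one can also be issued outside Grafana, against Loki's HTTP API (`/loki/api/v1/query_range` is a standard Loki endpoint; the host name below is a placeholder). This sketch only builds the request URL, so it assumes nothing about connectivity:

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder Loki address; substitute your appliance's Loki endpoint.
base_url = "https://loki.example.com/loki/api/v1/query_range"

# The same LogQL query shown above, URL-encoded for the API.
logql = '{job=~"(admin|api-server)"} | json tag="tag" | tag=~"(api-audit.log|audit.log)"'
params = {
    "query": logql,
    "limit": 100,             # cap the number of returned log lines
    "direction": "backward",  # newest entries first, like Grafana's view
}

request_url = base_url + "?" + urlencode(params)
print(request_url)

# Round-trip check: the encoded query decodes back to the original LogQL.
decoded = parse_qs(urlparse(request_url).query)["query"][0]
```

Fetching `request_url` with any HTTP client (and whatever authentication your deployment requires) returns the matching log streams as JSON.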
Using Oracle Auto Service Request
Oracle Private Cloud Appliance is qualified for Oracle Auto Service Request (ASR). ASR is integrated with My Oracle Support. When specific hardware failures occur, ASR automatically opens a service request and sends diagnostic information. The appliance administrator receives notification that a service request is open.
Using ASR is optional: the service must be registered and enabled for your appliance.
Understanding Oracle Auto Service Request
ASR automatically opens service requests when specific Private Cloud Appliance hardware faults occur. To enable this feature, the Private Cloud Appliance must be configured to send hardware fault telemetry to Oracle directly at https://transport.oracle.com, to a proxy host, or to a different endpoint. For example, you can use a different endpoint if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.
When a hardware problem is detected, ASR submits a service request to Oracle Support Services. In many cases, Oracle Support Services can begin work on resolving the issue before the administrator is even aware the problem exists.
ASR detects faults in the most common hardware components, such as disks, fans, and power supplies, and automatically opens a service request when a fault occurs. ASR does not detect all possible hardware faults, and it is not a replacement for other monitoring mechanisms, such as SMTP alerts, within the customer data center. ASR is a complementary mechanism that expedites and simplifies the delivery of replacement hardware. ASR should not be used for downtime events in high-priority systems. For high-priority events, contact Oracle Support Services directly.
An email message is sent to both the My Oracle Support email account and the technical contact for Private Cloud Appliance to notify them of the creation of the service request. A service request might not be filed automatically in some cases, for example if a loss of connectivity to ASR occurs. Administrators should monitor their systems for faults and call Oracle Support Services if they do not receive notice that a service request has been filed automatically.
For more information about ASR, consult the following resources:
- Oracle Auto Service Request web page: https://www.oracle.com/servers/technologies/auto-service-request.html

- Oracle Auto Service Request user documentation: https://docs.oracle.com/cd/E37710_01/index.htm
Oracle Auto Service Request Prerequisites
Before you register for the ASR service, ensure the following prerequisites are satisfied:
- You have a valid My Oracle Support account.

  If necessary, create an account at https://support.oracle.com/portal/.

- The following are set up correctly in My Oracle Support:

  - Technical contact person at the customer site who is responsible for Private Cloud Appliance

  - Valid shipping address at the customer site where the Private Cloud Appliance is located, so that parts are delivered to the site where they must be installed

- The management nodes have an active outbound Internet connection using HTTPS or an HTTPS proxy.

  For example, use `curl` to test whether you can access https://support.oracle.com/portal/.
Registering Private Cloud Appliance for Oracle Auto Service Request
To register a Private Cloud Appliance as an ASR client, the appliance must be configured to send hardware fault telemetry to Oracle in one of the following ways:
- Directly at https://transport.oracle.com

- To a proxy host

- To a different endpoint
An example of when you would use a different endpoint is if you have the ASR Manager software installed in your data center as an aggregation point for multiple systems.
When you register your Private Cloud Appliance for ASR, the ASR service is automatically enabled.
Using the Service Web UI
1. Open the navigation menu and click ASR Phone Home.

2. Click the Register button.

3. Fill in the username and password, then complete the fields for the Phone Home configuration that you choose.

   - Username: Required. Enter your Oracle Single Sign On (SSO) credentials, which can be obtained from My Oracle Support.

   - Password: Required. Enter the password for your SSO account.

   - Proxy Username: To use a proxy host, enter a username to access that host.

   - Proxy Password: To use a proxy host, enter the password to access that host.

   - Proxy Host: To use a proxy host, enter the name of that host.

   - Proxy Port: To use a proxy host, enter the port used to access the host.

   - Endpoint: If you use an aggregation point, or other endpoint for ASR data consolidation, enter that endpoint in this format: `http://host[:port]/asr`
Using the Service CLI
Configure ASR directly to https://transport.oracle.com
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientRegister` custom command to register the appliance.

   ```
   PCA-ADMIN> asrClientRegister username=asr-pca3_ca@example.com \
   password=******** confirmPassword=******** \
   endpoint=https://transport.oracle.com/
   Command: asrClientRegister username=asr-pca3_ca@example.com \
   password=***** confirmPassword=***** \
   endpoint=https://transport.oracle.com/
   Status: Success
   Time: 2021-07-12 18:47:14,630 UTC
   ```

3. Confirm the configuration.

   ```
   PCA-ADMIN> show asrPhonehome
   Command: show asrPhonehome
   Status: Success
   Time: 2021-09-30 13:08:42,210 UTC
   Data:
     Is Registered = true
     Overall Enable Disable = true
     Username = asr.user@example.com
     Endpoint = https\://transport.oracle.com/
   PCA-ADMIN>
   ```
Configure ASR to a Proxy Host
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientRegister` custom command to register the appliance.

   ```
   PCA-ADMIN> asrClientRegister username=asr-pca3_ca@oracle.com \
   password=******** confirmPassword=******** \
   proxyHost=zeb proxyPort=80 \
   proxyUsername=support \
   proxyPassword=**** proxyConfirmPassword=****
   ```
Configure ASR to a Different Endpoint
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientRegister` custom command to register the appliance.

   ```
   PCA-ADMIN> asrClientRegister username=oracle_email@example.com \
   password=******** confirmPassword=******** \
   endpoint=https://transport.oracle.com/
   Command: asrClientRegister username=oracle_email@example.com \
   password=***** confirmPassword=***** \
   endpoint=https://transport.oracle.com/
   Status: Success
   Time: 2021-07-12 18:47:14,630 UTC
   ```
Testing Oracle Auto Service Request Configuration
Once configured, test your ASR configuration to ensure end-to-end communication is working properly.
Using the Service Web UI
1. Open the navigation menu and click ASR Phone Home.

2. Select Test Registration in the Controls menu.

3. Click Test Registration. A dialog confirms whether the test is successful.

4. If the test is not successful, confirm your ASR configuration information and repeat the test.
Using the Service CLI
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientsendTestMsg` custom command to test the ASR configuration.

   ```
   PCA-ADMIN> asrClientsendTestMsg
   Command: asrClientsendTestMsg
   Status: Success
   Time: 2021-12-08 18:43:30,093 UTC
   PCA-ADMIN>
   ```
Unregistering Private Cloud Appliance for Oracle Auto Service Request
When you unregister your Private Cloud Appliance for ASR, the ASR service is automatically disabled; you do not need to perform a separate step.
Using the Service Web UI
1. Open the navigation menu and click ASR Phone Home.

2. Click the Unregister button. Confirm the operation when prompted.

Using the Service CLI

1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientUnregister` custom command to unregister the appliance.

   ```
   PCA-ADMIN> asrClientUnregister
   Command: asrClientUnregister
   Status: Success
   Time: 2021-06-23 15:25:18,127 UTC
   PCA-ADMIN>
   ```
Disabling Oracle Auto Service Request
You can disable ASR on an appliance to temporarily prevent fault messages from being sent and service requests from being created. For example, during system maintenance, components might be down but not failed or faulted. To restart the ASR service, see Enabling Oracle Auto Service Request.
Using the Service Web UI
1. Open the navigation menu and click ASR Phone Home.

2. Click the Disable button. Confirm the operation when prompted.

Using the Service CLI

1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientDisable` custom command to halt the ASR service.

   ```
   PCA-ADMIN> asrClientDisable
   Command: asrClientDisable
   Status: Success
   Time: 2021-06-23 15:26:17,753 UTC
   PCA-ADMIN>
   ```
Enabling Oracle Auto Service Request
This section describes how to restart the ASR service if it has been disabled.
Using the Service Web UI
1. Open the navigation menu and click ASR Phone Home.

2. Click the Enable button. Confirm the operation when prompted.

Using the Service CLI

1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```

2. Use the `asrClientEnable` custom command to start the ASR service.

   ```
   PCA-ADMIN> asrClientEnable
   Command: asrClientEnable
   Status: Success
   Time: 2021-06-23 15:26:47,632 UTC
   PCA-ADMIN>
   ```
Using Support Bundles
Support bundles are files of diagnostic data collected from the Private Cloud Appliance that are used to evaluate and fix problems.
Support bundles can be uploaded to Oracle Support automatically or manually. Support bundles are uploaded securely and contain the minimum required data: system identity (not IP addresses), problem symptoms, and diagnostic information such as logs and status.
Support bundles can be created and not uploaded. You might want to create a bundle for your own use. Creating a support bundle is a convenient way to collect related data.
Support bundles are created and uploaded in the following ways:

- Oracle Auto Service Request (ASR)

  ASR automatically creates a service request and support bundle when certain hardware faults occur. The service request and support bundle are automatically sent to Oracle Support, and the Private Cloud Appliance administrator is notified. See Using Oracle Auto Service Request.

- `asrInitiateBundle`

  The `asrInitiateBundle` command is a `PCA-ADMIN` command that creates a support bundle, attaches the support bundle to an existing service request, and uploads it to Oracle Support. See Using the asrInitiateBundle Command.

- `support-bundles`

  The `support-bundles` command is a management node command that creates a support bundle of a specified type. Oracle Support might ask you to run this command to collect more data related to a service request, or you might want to collect this data for your own use. See Using the support-bundles Command.

- Manual upload to Oracle Support

  Several methods are available for uploading support bundles or other data to Oracle Support. See Uploading Support Bundles to Oracle Support.
Using the asrInitiateBundle Command

The `asrInitiateBundle` command takes three parameters, all required:

```
PCA-ADMIN> asrInitiateBundle mode=triage sr=SR_number bundleType=auto
```

A `triage` support bundle is collected and automatically attached to service request SR_number. For more information about the `triage` support bundle, see Triage Mode.

If the ASR service is enabled, `bundleType=auto` uploads the bundle to Oracle Support using the Phone Home service. For information about the Phone Home service, see Registering Private Cloud Appliance for Oracle Auto Service Request.
Using the support-bundles Command

The `support-bundles` command collects various types of bundles, or modes, of diagnostic data such as health check status, command outputs, and logs. This topic describes the available modes. The following is the recommended way to use this command:

1. Start data collection by specifying `triage` mode to understand the preliminary status of the Private Cloud Appliance.

2. If NOT_HEALTHY appears in the `triage` mode results, then do one of the following:

   - Use `time_slice` mode to collect data by time slots. These results can be further narrowed by specifying pod name, job, and k8s_app label.

   - Use `smart` mode to query data from specific health checkers.

The `support-bundles` command requires a mode (`-m`) option. Some modes have additional options.
The following table lists the options that are common to all modes of the `support-bundles` command.

Option | Description | Required
---|---|---
`-m` | The type of bundle. | yes
`-sr` | The service request number. | no
For most modes, the support-bundles command produces a single archive file. The output archive file is named [SR_number_]pca-support-bundle.current-time.tgz. The SR_number is used if you provided the -sr option. If you are creating the support bundle for a service request, you should specify the SR_number.
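As an illustration of this naming convention, the name can be composed as follows. This is a sketch only: the exact timestamp format is an assumption, and the appliance substitutes its own current-time value.

```shell
# Illustrative only: the timestamp format here is an assumption; the
# appliance generates its own current-time component.
sr="3-1234567890"
ts="20220111T000000"
name="${sr:+${sr}_}pca-support-bundle.${ts}.tgz"
echo "$name"          # 3-1234567890_pca-support-bundle.20220111T000000.tgz

sr=""                 # without the -sr option, the SR prefix is omitted
name_no_sr="${sr:+${sr}_}pca-support-bundle.${ts}.tgz"
echo "$name_no_sr"    # pca-support-bundle.20220111T000000.tgz
```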
For native mode, the support-bundles command produces a directory of archive files. The archive files are stored in /nfs/shared_storage/support_bundles/ on the management node.
Log in to the Management Node
To use the support-bundles command, log in as root to the management node that is running Pacemaker resources. Collect data first from the management node that is running Pacemaker resources, then from other management nodes as needed.
If you do not know which management node is running Pacemaker resources, log in to any management node and check the Pacemaker cluster status. The following output shows that the Pacemaker cluster resources are running on pcamn01.
[root@pcamn01 ~]# pcs status
Cluster name: mncluster
Stack: corosync
Current DC: pcamn01 ...
Full list of resources:
 scsi_fencing   (stonith:fence_scsi):       Stopped (disabled)
 Resource Group: mgmt-rg
     vip-mgmt-int   (ocf::heartbeat:IPaddr2):      Started pcamn01
     vip-mgmt-host  (ocf::heartbeat:IPaddr2):      Started pcamn01
     vip-mgmt-ilom  (ocf::heartbeat:IPaddr2):      Started pcamn01
     vip-mgmt-lb    (ocf::heartbeat:IPaddr2):      Started pcamn01
     vip-mgmt-ext   (ocf::heartbeat:IPaddr2):      Started pcamn01
     l1api          (systemd:l1api):               Started pcamn01
     haproxy        (ocf::heartbeat:haproxy):      Started pcamn01
     pca-node-state (systemd:pca_node_state):      Started pcamn01
     dhcp           (ocf::heartbeat:dhcpd):        Started pcamn01
     hw-monitor     (systemd:hw_monitor):          Started pcamn01
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
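To find the active node non-interactively, the node name printed after "Started" for the mgmt-rg resources can be extracted. A minimal sketch, run here against sample output; on a live system, pipe the real `pcs status` instead:

```shell
# Sample `pcs status` output; on a live management node use:
#   pcs status | awk '/vip-mgmt-int/ {print $NF}'
pcs_output='Resource Group: mgmt-rg
 vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn01
 haproxy (ocf::heartbeat:haproxy): Started pcamn01'

# The last field of the vip-mgmt-int line is the node running the
# Pacemaker resource group.
active_node=$(printf '%s\n' "$pcs_output" | awk '/vip-mgmt-int/ {print $NF}')
echo "$active_node"    # pcamn01
```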
Triage Mode
In triage mode, Prometheus platform_health_check is queried for both HEALTHY and NOT_HEALTHY status. If NOT_HEALTHY is found, use time_slice mode to get more detail.
[root@pcamn01 ~]# support-bundles -m triage
The following files are in the output archive file.
File | Description
---|---
 | Timestamp and command line to generate this bundle.
 | Pods running in the compute node.
 | Pods running in the management node.
 | Rack installation time and build version.
 | Chunk files in json.
Time Slice Mode
In time slice mode, data is collected by specifying start and end timestamps.
If you do not specify either the -j or --all option, then data is collected from all health checker jobs.
You can narrow the data collection by specifying any of the following:
-
Loki job label
-
Loki k8s_app label
-
Pod name
[root@pcamn01 ~]# support-bundles -m time_slice -j flannel-checker -s 2021-05-29T22:40:00.000Z \ -e 2021-06-29T22:40:00.000Z -l INFO
See more examples below.
The time slice mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.
-
Only one of --job_name, --all, and --k8s_app can be specified.
-
If none of --job_name, --all, or --k8s_app is specified, the pod filtering will occur on the default (.+checker).
-
The --all option can collect a huge amount of data. You might want to limit your time slice to 48 hours.
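A 48-hour window ending now can be computed with GNU date (as found on Oracle Linux management nodes) in the timestamp format used in the examples:

```shell
# Build a 48-hour time slice ending now, formatted for -s and -e.
end=$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")
start=$(date -u -d "48 hours ago" +"%Y-%m-%dT%H:%M:%S.000Z")
# The resulting values can then be passed to the command, for example:
#   support-bundles -m time_slice --all -s "$start" -e "$end"
echo "start=$start end=$end"
```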
Option | Description | Required
---|---|---
-j job_name, --job_name job_name | Loki job name. Default value: See Label List Query below. | no
--all | Queries all job names except for jobs known for too much logging, such as audit, kubernetes-audit, and vault-audit, and the k8s_app label pcacoredns. | no
--k8s_app label | The k8s_app label value. See Label List Query below. | no
-l level | Message level. | no
-s start_time | Start date, in the timestamp format shown in the examples below. | yes
-e end_time | End date, in the timestamp format shown in the examples below. | yes
--pod_name pod_name | The pod name (such as kube or network-checker) to filter output based on the pod. Only the starting letters are necessary. | no
Label List Query
Use the label list query to list the available job names and k8s_app label values.
[root@pcamn01 ~]# support-bundles -m label_list
2021-10-14T23:19:18.265 - support_bundles - INFO - Starting Support Bundles
2021-10-14T23:19:18.317 - support_bundles - INFO - Locating filter-logs Pod
2021-10-14T23:19:18.344 - support_bundles - INFO - Executing command - ['python3', '/usr/lib/python3.6/site-packages/filter_logs/label_list.py']
2021-10-14T23:19:18.666 - support_bundles - INFO -
Label: job
Values: ['admin', 'api-server', 'asr-client', 'asrclient-checker', 'audit', 'cert-checker', 'ceui', 'compute', 'corosync', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 'flannel-checker', 'his', 'hms', 'iam', 'k8s-stdout-logs', 'kubelet', 'kubernetes-audit', 'kubernetes-checker', 'l0-cluster-services-checker', 'messages', 'mysql-cluster-checker', 'network-checker', 'ovm-agent', 'ovn-controller', 'ovs-vswitchd', 'ovsdb-server', 'pca-healthchecker', 'pca-nwctl', 'pca-platform-l0', 'pca-platform-l1api', 'pca-upgrader', 'pcsd', 'registry-checker', 'sauron-checker', 'secure', 'storagectl', 'uws', 'vault', 'vault-audit', 'vault-checker', 'zfssa-checker', 'zfssa-log-exporter']
Label: k8s_app
Values: ['admin', 'api', 'asr-client', 'asrclient-checker', 'brs', 'cert-checker', 'compute', 'default-http-backend', 'dr-admin', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 'flannel-checker', 'fluentd', 'ha-cluster-exporter', 'has', 'his', 'hms', 'iam', 'ilom', 'kube-apiserver', 'kube-controller-manager', 'kube-proxy', 'kubernetes-checker', 'l0-cluster-services-checker', 'loki', 'loki-bnr', 'mysql-cluster-checker', 'mysqld-exporter', 'network-checker', 'pcacoredns', 'pcadnsmgr', 'pcanetwork', 'pcaswitchmgr', 'prometheus', 'rabbitmq', 'registry-checker', 'sauron-api', 'sauron-checker', 'sauron-grafana', 'sauron-ingress-controller', 'sauron-mandos', 'sauron-operator', 'sauron-prometheus', 'sauron-prometheus-gw', 'sauron-sauron-exporter', 'sauron.oracledx.com', 'storagectl', 'switch-metric', 'uws', 'vault-checker', 'vmconsole', 'zfssa-analytics-exporter', 'zfssa-csi-nodeplugin', 'zfssa-csi-provisioner', 'zfssa-log-exporter']
Examples:
No job label, no k8s_app label, collect log from all health checkers.
[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"
One job ceui.
[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -j ceui -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"
One k8s_app network-checker.
[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx --k8s_app network-checker -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"
All jobs and date.
[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s `date -d "2 days ago" -u +"%Y-%m-%dT%H:%M:%S.000Z"` -e `date -u +"%Y-%m-%dT%H:%M:%S.000Z"`
All jobs.
[root@pcamn01 ~]# support-bundles -m time_slice -sr 3-xxxxxxxxxxx --all -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"
The following files are in the output archive file.
File | Description
---|---
 | Timestamp and command line to generate this bundle.
 | Chunk files in json.
Smart Mode
In smart mode, health checkers are queried for recent NOT_HEALTHY status. By default, two days of logs are collected. If you need more than two days of logs, specify the --force option. Use the -hc option to specify a health checker.
[root@pcamn01 ~]# support-bundles -m smart
See more examples below.
The smart mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.
If only the start date or only the end date is given, the other end of the range is calculated: two days before the given end date, or two days after the given start date. If only a start date less than two days in the past is given, the default end date, the most recent unhealthy time, is used.
Option | Description | Required
---|---|---
-hc health_checker | Loki health checker name. See the health checker log files table below. | no
--errors_only | Level name filtering takes place only on Error, Critical, and Severe. | no
--force | Force the start date to override the two-day time range limit. | no
-s start_time | Start date, in the timestamp format shown in the examples below. Default value: End date minus 2 days. | no
-e end_time | End date, in the timestamp format shown in the examples below. Default value: Most recent unhealthy time. | no
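The two-day defaulting of the start and end dates can be sketched with GNU date arithmetic. This is an illustration of the calculation, not the tool's own implementation:

```shell
# Given only a start date, the end of the smart-mode query range
# defaults to start + 2 days (GNU date relative-item syntax).
start="2022-01-11T00:00:00"
end=$(date -u -d "2022-01-11 00:00:00 UTC + 2 days" +"%Y-%m-%dT%H:%M:%S")
echo "start=$start end=$end"    # end is 2022-01-13T00:00:00
```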
The following table lists the log files for each health checker.
Health Checker | Supporting Log Files
---|---
L0_hw_health-checker | 
cert-checker | No logs - only certificate and expiry date (from the checker)
etcd-checker | 
flannel-checker | 
kubernetes-checker | 
l0-cluster-services-checker | 
mysql-cluster-checker | 
network-checker | 
registry-checker | messages (registry itself does not produce logs)
vault-checker | 
zfssa-checker | 
Examples:
No -hc. Query unhealthy data from all health checkers.
[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx
Use -hc to specify one health checker.
[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx -hc network-checker
Timestamps with --force.
[root@pcamn01 ~]# support-bundles -m smart -sr 3-xxxxxxxxxxx -s "2022-01-11/00:00:00" -e "2022-01-15/23:59:59" --force
The following files are in the output archive file.
File | Description
---|---
 | Timestamp and command line to generate this bundle.
 | Chunk files in json.
Native Mode
Unlike other support bundle modes, the native bundle command returns immediately and the bundle collection runs in the background. Native bundles might take hours to collect. Collection progress information is provided in the native_collection.log file in the bundle directory.
Also unlike other support bundle modes, the output of native bundles is not a single archive file. Instead, a bundle directory is created in the /nfs/shared_storage/support_bundles/ area on the management node. The directory contains the native_collection.log file and a number of tar.gz files.
[root@pcamn01 ~]# support-bundles -m native -t bundle_type [-c component_name] [-sr SR_number]
The native mode of the support-bundles command has the following options in addition to the mode and service request number options listed at the beginning of this topic.
Option | Description | Required
---|---|---
-t bundle_type | Bundle type: zfs-bundle or sosreport. | yes
-c component_name | Component name. This option only applies to type sosreport. | no
ZFS Bundle
When type is zfs-bundle, a ZFS support bundle collection starts on both ZFS nodes and downloads the new ZFS support bundles into the bundle directory.
[root@pcamn01 ~]# support-bundles -m native -t zfs-bundle
2021-11-16T22:49:30.982 - support_bundles - INFO - Starting Support Bundles
2021-11-16T22:49:31.037 - support_bundles - INFO - Locating filter-logs Pod
2021-11-16T22:49:31.064 - support_bundles - INFO - Executing command - ['python3', '/usr/lib/python3.6/site-packages/filter_logs/native.py', '-t', 'zfs-bundle']
2021-11-16T22:49:31.287 - support_bundles - INFO - LAUNCHING COMMAND: ['python3', '/usr/lib/python3.6/site-packages/filter_logs/native_app.py', '-t', 'zfs-bundle', '--target_directory', '/support_bundles/zfs-bundle_20211116T224931267']
ZFS native bundle collection running to /nfs/shared_storage/support_bundles/zfs-bundle_20211116T224931267
Monitor /nfs/shared_storage/support_bundles/zfs-bundle_20211116T224931267/native_collection.log for progress.
2021-11-16T22:49:31.287 - support_bundles - INFO - Finished running Support Bundles
SOS Report Bundle
When type is sosreport, the component_name is a management node or compute node. If component_name is not specified, the report is collected from all management and compute nodes.
[root@pcamn01 ~]# support-bundles -m native -t sosreport -c pcacn003 -sr SR_number
Uploading Support Bundles to Oracle Support
After you create a support bundle using the support-bundles command as described in Using the support-bundles Command, you can use the methods described in this topic to upload the support bundle to Oracle Support.
To use these methods, you must satisfy the following requirements:
-
You must have a My Oracle Support user ID with Create and Update SR permissions granted by the appropriate Customer User Administrator (CUA) for each Support Identifier (SI) being used to upload files.
-
For file uploads to existing service requests, the Support Identifier associated with the service request must be in your profile.
-
To upload files larger than 2 GB, sending machines must have network access to connect to the My Oracle Support servers at transport.oracle.com to use FTPS and HTTPS.
The Oracle FTPS service is a "passive" implementation. With an implicit configuration, the initial connection is from the client to the service on a control port of 990, and the connection is then switched to a high port to exchange data. Oracle defines a possible data port range of 32000-42000; depending on your network configuration, you might need to enable outbound connections on both port 990 and ports 32000-42000. TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 is the only encryption method enabled.
The Oracle HTTPS diagnostic upload service uses the standard HTTPS port of 443 and does not require any additional ports to be opened.
When using command line protocols, do not include your password in the command. Enter your password only when prompted.
-
Oracle requires the use of TLS 1.2+ for all file transfers.
-
Do not upload encrypted or password-protected files, either standalone or within an archive. A service request update will note such a file as corrupted, or the upload will be rejected because disallowed file types were found. Files are encrypted in transit when you use FTPS and HTTPS; additional protections are not required.
-
Do not upload files with the file type extensions exe, bat, asp, or com, either standalone or within an archive. A service request update will note that a disallowed file type was found.
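A simple client-side pre-flight check for these extensions can avoid a rejected upload. This is an illustrative helper only; Oracle's servers perform their own validation, including inside archives:

```shell
# Illustrative pre-upload check: flag the disallowed file type
# extensions before sending. Does not inspect inside archives.
check_upload() {
  case "${1##*.}" in
    exe|bat|asp|com) echo "reject" ;;
    *)               echo "ok" ;;
  esac
}
check_upload pca-support-bundle.tgz    # ok
check_upload setup.exe                 # reject
```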
Uploading Files 2 GB or Smaller
Use the SR file upload utility on the My Oracle Support Portal.
-
Log in to My Oracle Support with your My Oracle Support username and password.
-
Do one of the following:
-
Create a new service request and in the next step, select the Upload button.
-
Select and open an existing service request.
-
-
Click the Add Attachment button located at the top of the page.
-
Click the Choose File button.
-
Navigate and select the file to upload.
-
Click the Attach File button.
You can also use the methods described in the next section for larger files.
Uploading Files Larger Than 2 GB
You cannot upload a file larger than 200 GB. See Splitting Files.
FTPS
Syntax:
Be sure to include the /
character after the service request number.
$ curl -T path_and_filename -u MOS_user_ID ftps://transport.oracle.com/issue/SR_number/
Example:
$ curl -T /u02/files/bigfile.tar -u MOSuserID@example.com ftps://transport.oracle.com/issue/3-1234567890/
HTTPS
Syntax:
Be sure to include the /
character after the service request number.
$ curl -T path_and_filename -u MOS_user_ID https://transport.oracle.com/upload/issue/SR_number/
Example:
$ curl -T D:\data\bigfile.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/
Renaming the file during send:
$ curl -T D:\data\bigfile.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/NotSoBig.tar
Using a proxy:
$ curl -k -T D:\data\bigfile.tar -x proxy.example.com:80 -u MOSuserID@example.com https://transport.oracle.com/upload/issue/3-1234567890/
Splitting Files
You can split a large file into multiple parts and upload the parts. Oracle Transport will concatenate the segments when you complete uploading all the parts.
Only HTTPS protocol can be used. Only the UNIX split utility can be used. The Microsoft Windows split utility produces an incompatible format.
To reduce upload times, compress the original file prior to splitting.
-
Split the file.
The following command splits the file file1.tar into 2 GB parts named file1.tar.partaa and file1.tar.partab.
Important:
Specify the .part extension exactly as shown below.
$ split -b 2048m file1.tar file1.tar.part
-
Upload the resulting file1.tar.partaa and file1.tar.partab files.
Important:
Do not rename these output part files.
$ curl -T file1.tar.partaa -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/
$ curl -T file1.tar.partab -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/
-
Send the command to put the parts back together.
The split files will not be attached to the service request. Only the final concatenated file will be attached to the service request.
$ curl -X PUT -H X-multipart-total-size:original_size -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/file1.tar?multiPartComplete=true
In the preceding command,
original_size
is the size of the original unsplit file as shown by a file listing. -
Verify the size of the newly attached file.
Note:
This verification command must be executed immediately after the concatenation command in Step 3. Otherwise, the file will have begun processing and will no longer be available for this command.
$ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/file1.tar
X-existing-file-size: original_size
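The split-and-reassemble flow can be sanity-checked locally before uploading. The sketch below uses small 1 MB parts for illustration (real uploads use -b 2048m); Oracle Transport performs the equivalent concatenation server-side:

```shell
# Create a 3,000,000-byte test file, split it into 1 MB parts, and
# verify that concatenating the parts reproduces the original size.
head -c 3000000 /dev/zero > file1.tar
split -b 1m file1.tar file1.tar.part
cat file1.tar.part* > reassembled.tar
orig=$(stat -c%s file1.tar)
rest=$(stat -c%s reassembled.tar)
echo "original=$orig reassembled=$rest"
rm -f file1.tar file1.tar.part* reassembled.tar
```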
Resuming an Interrupted HTTPS Upload
You can resume a file upload that terminated abnormally. Resuming can be done only over HTTPS; it does not work with FTPS. When an upload is interrupted, start by retrieving the size of the partially uploaded file.
-
Determine how much of the file has already been uploaded.
$ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
HTTP/1.1 204 No Content
Date: Tue, 15 Nov 2022 22:53:54 GMT
Content-Type: text/plain
X-existing-file-size: already_uploaded_size
X-Powered-By: Servlet/3.0 JSP/2.2
-
Resume the file upload.
Note the file size returned in "X-existing-file-size" in Step 1. Use that file size after the -C switch and in the -H "X-resume-offset:" switch.
$ curl -Calready_uploaded_size -H "X-resume-offset: already_uploaded_size" -T myinfo.tar -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
-
Verify the final file size.
$ curl -I -u MOSuserID@example.com https://transport.oracle.com/upload/issue/SR_number/myinfo.tar
X-existing-file-size: original_size
In the preceding command,
original_size
is the size of the original file as shown by a file listing.
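Extracting the resume offset from the response headers can be scripted. A minimal sketch, run here against sample headers; on a real resume, capture the output of curl -I instead:

```shell
# Sample response headers from Step 1 (real output: curl -sI ...).
headers='HTTP/1.1 204 No Content
Content-Type: text/plain
X-existing-file-size: 1048576'

# Strip carriage returns (curl header output is CRLF-terminated) and
# pull the value of the X-existing-file-size header.
offset=$(printf '%s\n' "$headers" | tr -d '\r' \
  | awk -F': ' '/^X-existing-file-size/ {print $2}')
echo "$offset"    # 1048576
```

The extracted value then feeds both the -C switch and the X-resume-offset header in Step 2.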