Viewing Admin Service Health Data
This section describes Private Cloud Appliance Admin service health metrics and the conditions that raise faults. This health information does not cover hardware faults; it covers resource utilization (CPU, memory, and storage), hardware run state, and health checker notifications. The hardware faults listed at the bottom of Table 5-1 are reported through ASR.
Admin Service Faults Summary
The threshold, run state, and health checker notification fault types listed in the following table are described in more detail in the sections that follow.
Table 5-1 Admin Service Fault Detection Configuration Summary
Fault Type | Fault Detection Frequency (seconds) | Fault Detection Delay (seconds) | Data Source | Method of Detection
---|---|---|---|---
Compute Node CPU and Memory Utilization Faults | 60 | < 20 | Admin calls ComputeNode service | Faults are raised by the fault task based on the compute node object attributes stored in the database.
Storage Utilization Faults | 120 | < 20 | Admin calls Prometheus service | Faults are raised by the fault task based on Prometheus ZFS pool usage and status data stored in the database.
Hardware Run State Faults | 150 | < 20 | Admin calls Hardware | Faults are raised by the fault task based on hardware component node/ILOM run states stored in the database.
Health Checker Notification Faults | Defined by the ZFS/Network health checker notification frequency | 0 | Various HealthChecker services send notifications | Faults are created based on the RabbitMQ notification fault results.
Platform ILOM Faults | 150 | 0 | Admin calls Hardware | Faults are created based on the L1 API results for ILOM object data. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.
Hardware Status Faults | On initialization, and when the | < 20 | Admin calls Hardware | Faults are raised by the fault task based on the PcaSystem object attribute. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.
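The detection frequencies in the table describe periodic fault tasks. As a rough, hypothetical sketch (not the Admin service's actual code), such a task can be modeled as a loop that runs a check at its configured interval and reports the result each time:

```python
import time

# Hypothetical sketch of the polling pattern Table 5-1 describes: each
# fault task runs its check at a configured detection frequency (for
# example, 60 seconds for the compute node task) and raises or clears
# faults based on the data the check returns.
def run_fault_task(poll_interval_sec, check_once, iterations=3):
    """Run check_once() at a fixed interval, collecting each result."""
    results = []
    for _ in range(iterations):
        results.append(check_once())
        time.sleep(poll_interval_sec)
    return results

# A real task would loop indefinitely; here we run two short iterations
# with a stub check so the sketch is self-contained.
print(run_fault_task(0.01, lambda: "no fault", iterations=2))
```

The actual tasks read their input (compute node attributes, Prometheus data, run states) from the database rather than calling a stub, as the Data Source column indicates.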
Using the Service Web UI to View Admin Service Faults
1. Click the Active Faults link at the top of the Service Enclave Home page, or click Faults on the Navigation menu.

   The Faults page is displayed.

2. At the top of the Faults page, toggle whether to list all faults or only active faults.

3. For more information about a fault, click the name of the fault, or click View Details on the Actions menu.

   The details page shows the description, cause, and recommended action to take.
Using the Service CLI to View Admin Service Faults
1. To view the list of Admin service faults, use the `list fault` command. Both active and cleared faults are listed.

   ```
   PCA-ADMIN> list fault
   Command: list fault
   Status: Success
   Time: 2023-03-07 15:34:52,613 UTC
   Data:
     id                                    name                                              status   severity
     --                                    ----                                              ------   --------
     33c61b8a-dcc7-4b8f-bc0f-56915ecc62f5  RackUnitIlomRunStateFaultStatusFault(pcacn005)    Cleared  Critical
     f7d22180-aeae-4159-b5c8-5e55a7906a78  RackUnitIlomRunStateFaultStatusFault(pcacn004)    Cleared  Critical
     a4fef907-8e54-4750-9fac-6829fbade90d  ComputeNodeCpuFaultStatusFault(pcacn006)          Cleared  Minor
     f8d93384-da30-43cd-9396-6e6671d240e2  RackUnitIlomRunStateFaultStatusFault(pcacn010)    Cleared  Critical
     8e61bb81-7a02-4c26-8ef4-c13b198f64da  ComputeNodeCpuFaultStatusFault(pcacn007)          Cleared  Warning
     3216b6f9-326b-4992-99a3-ab23cb18243b  AK-8003-F9--PCIe 3                                Active   Minor
     ef3fb25b-0573-4524-8d1c-fb704c814446  AK-8003-HF--vnic1                                 Active   Major
     f830cd46-21ff-4d74-ba81-c82fd6f52c67  ComputeNodeCpuFaultStatusFault(pcacn005)          Cleared  Minor
     d2e71da0-ba63-4983-97da-24033d5c6447  ZfsPoolUsageFaultStatusFault(PCA_POOL)            Cleared  Major
     eecd5ef2-4a71-4137-be96-54c028212d2f  ComputeNodeMemoryFaultStatusFault(pcacn004)       Cleared  Minor
     cf68d2ee-e483-e573-b46e-c31bcbc8e968  ISTOR-8000-1S--ORACLE SERVER E5-2L                Cleared  Major
     0686c11d-b96b-e5aa-dfbe-a20154da4794  SPAMD-8002-FJ--ORACLE SERVER E5-2L                Cleared  Major
     b488a45a-80df-46e3-b0b5-a35527eb9c0e  AK-8003-F9--PCIe 10                               Active   Minor
     ac48f88d-e181-4b03-b620-6bfbf4ad95ef  RackUnitIlomRunStateFaultStatusFault(pcacn007)    Cleared  Critical
     b4c66a7c-def3-42c2-8842-d4763afc5184  RackUnitIlomRunStateFaultStatusFault(pcacn006)    Cleared  Critical
     9fc2e45a-1cff-4f95-828d-58742c8ce12f  ComputeNodeMemoryFaultStatusFault(pcacn002)       Active   Minor
     c0124122-a91c-4110-89cc-deebe54de7ba  ComputeNodeMemoryFaultStatusFault(pcacn006)       Cleared  Critical
     ca26ed46-4d1c-4ade-9e74-af27d94cf8f4  AK-8003-HF--vnic2                                 Active   Major
     58e9ab5d-d4e7-4d94-9ca6-e85a1c88b3b8  RackUnitRunStateFaultStatusFault(sn022147XLF014)  Cleared  Critical
     474c269f-4018-45d7-97d5-da17c9c845f4  RackUnitIlomRunStateFaultStatusFault(pcacn001)    Cleared  Critical
     2b5ece1c-50fc-436a-81b3-da0c5b418fe3  RackUnitIlomRunStateFaultStatusFault(pcacn003)    Cleared  Critical
     1c164eb9-9a76-4592-8ab6-150edb8f7a75  ComputeNodeCpuFaultStatusFault(pcacn001)          Cleared  Warning
     55ed1494-6aac-4248-91cb-9ac8295d668c  AK-8003-HF--PCIe 6                                Active   Major
     afbcc080-0b93-434b-8ead-fa673f302170  AK-8003-F9--PCIe 6                                Active   Minor
     8b36c2db-a3b4-41c8-b416-8e733ace3aeb  PcaSystemReSyncHwStatusStatusFault(null)          Cleared  Warning
     28c5ba93-6b4e-42f3-ad61-90734b46bf30  SPENV-8000-RU--ORACLE SERVER E5-2L                Cleared  Critical
     3d932188-0120-489f-a512-1a244ec01e49  RackUnitIlomRunStateFaultStatusFault(pcacn009)    Cleared  Critical
     21e6faa9-68e1-47ae-a298-e2cb14d2a406  ComputeNodeMemoryFaultStatusFault(pcacn007)       Cleared  Minor
     db023304-fb7a-613b-ad9b-e277b7ce5675  SPENV-8000-A7--ORACLE SERVER E5-2L                Cleared  Major
     63839bf5-335b-48ff-86a0-9e981e3e9902  RackUnitRunStateFaultStatusFault(sn012147XLF014)  Cleared  Critical
     2e851c6e-aa29-4a25-846a-29b08967dd95  RackUnitValidationStateStatusFault(pcacn008)      Cleared  Major
     76805c56-fcf6-48a2-b4fd-ffa77570e83c  ComputeNodeCpuFaultStatusFault(pcacn002)          Active   Minor
     9be74faf-df4d-ea20-cfc1-92b2a6a01b06  SPENV-8000-A7--ORACLE SERVER E5-2L                Cleared  Major
     1624064f-d380-4ffc-9000-d293c185d7ac  ComputeNodeCpuFaultStatusFault(pcacn003)          Cleared  Warning
     7ca3f7af-f0bd-45d9-bad7-15794d49e7c6  RackUnitIlomRunStateFaultStatusFault(pcacn008)    Cleared  Critical
     3e7a3503-7a71-4ef1-a3ad-fba2162571ab  ComputeNodeCpuFaultStatusFault(pcacn004)          Cleared  Warning
     0922cd8e-297e-4356-b736-b09ac382b28b  AK-8003-F9--PCIe 10                               Active   Minor
     ab44ad2c-1105-417d-aa47-e8cb477ef0ec  AK-8003-F9--PCIe 3                                Active   Minor
   ```
2. To view the details of a specific fault, including the description, cause, and recommended action to take, use the `show fault` command with the specific fault ID.

   ```
   PCA-ADMIN> show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
   Command: show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
   Status: Success
   Time: 2023-03-07 15:36:19,414 UTC
   Data:
     Id = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
     Type = Fault
     Category = Internal
     Severity = Minor
     Status = Active
     Last Update Time = 2023-03-06 20:04:11,668 UTC
     Message Id = AK-8003-F9
     Time Reported = Mon Mar 06 2023 16:50:24 GMT+0000 (UTC)
     Action = Check the networking cable, switch port, and switch configuration. Contact your vendor for support if the network port remains inexplicably down. Please refer to the associated reference document at http://support.oracle.com/msg/AK-8003-F9 for the latest service procedures and policies regarding this diagnosis.
     Health Exporter = zfssa-analytics-exportersn022147XLF014
     uuid = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
     Diagnosing Source = zfssa_analytics_exporter
     FaultHistoryLogIds 1 = id:fdfaa42f-de8d-4622-a9df-ea229b7bad6f type:FaultHistoryLog name:
     BaseManagedObjectId = id:2147XLF015/PCIe 3/465774J-2121701684 type:HardwareComponent name:
     Description = Network connectivity via port mlxne4 has been lost.
     Name = AK-8003-F9--PCIe 3
     Work State = Normal
   ```
Additional examples of using the Service CLI to show Admin service faults are shown in Compute Node CPU and Memory Utilization Faults.
Compute Node CPU and Memory Utilization Faults
The Admin service raises faults for the percentage of memory used and the percentage of CPU used for a `ComputeNode` object. More severe faults are raised as more memory and CPU are used. When the percentage used drops below a certain threshold, any faults are cleared.
These are utilization faults (CPU and memory usage), not hardware faults. Problems with CPU and memory hardware are reported through ASR.
CPU Usage
The following table shows the default compute node CPU usage thresholds that raise faults of different severities.
CPU Percentage | Fault Severity | Fault State
---|---|---
< .75 | Not applicable | Cleared
>= .75 | Warning | Active
>= .80 | Minor | Active
>= .90 | Major | Active
>= .95 | Critical | Active
Memory Usage
The following table shows the default compute node memory usage thresholds that raise faults of different severities.
Memory Percentage | Fault Severity | Fault State
---|---|---
< .75 | Not applicable | Cleared
>= .75 | Warning | Active
>= .80 | Minor | Active
>= .90 | Major | Active
>= .95 | Critical | Active
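The two threshold tables above can be summarized as a single mapping from a usage fraction to a fault severity. The following is a minimal sketch, not the Admin service's actual implementation, using the default thresholds:

```python
# Map a compute node CPU or memory usage fraction to a fault severity
# using the default thresholds (0.75/0.80/0.90/0.95). Below 0.75, any
# existing fault is cleared, represented here as None.
def usage_fault_severity(usage: float):
    """Return the fault severity for a usage fraction, or None if cleared."""
    if usage >= 0.95:
        return "Critical"
    if usage >= 0.90:
        return "Major"
    if usage >= 0.80:
        return "Minor"
    if usage >= 0.75:
        return "Warning"
    return None  # below 0.75: fault cleared

print(usage_fault_severity(0.82))  # Minor
print(usage_fault_severity(0.60))  # None
```

Because the thresholds are checked from highest to lowest, each usage value maps to exactly one severity, matching the `>=` rows in the tables.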
Using the Service CLI to View Compute Node Faults
To view the default compute node CPU and memory usage fault trigger settings using the Service CLI, use the `show cnUpdateManager` command:

```
PCA-ADMIN> show cnUpdateManager
Command: show cnUpdateManager
Status: Success
Time: 2023-03-06 23:41:37,249 UTC
Data:
  Id = caaaaaa1-a076-4e48-94b5-7bdcd4e0c42c
  Type = CnUpdateManager
  LastRunTime = 2023-03-06 23:41:33,676 UTC
  Poll Interval (sec) = 60
  The minimum CPU usage percentage to trigger a critical fault = 0.95
  The minimum CPU usage percentage to trigger a major fault = 0.9
  The minimum CPU usage percentage to trigger a minor fault = 0.8
  The minimum CPU usage percentage to trigger a warning = 0.75
  The minimum memory usage percentage to trigger a critical fault = 0.95
  The minimum memory usage percentage to trigger a major fault = 0.9
  The minimum memory usage percentage to trigger a minor fault = 0.8
  The minimum memory usage percentage to trigger a warning = 0.75
```
To view the list of all faults and the details of a specific fault, see Viewing Admin Service Health Data. The following example shows a specific compute node fault. The fault details do not show the current usage value; they show only that usage is at least the minor fault threshold and below the major fault threshold. To see current usage, use the Service Web UI.
```
PCA-ADMIN> show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Command: show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Status: Success
Time: 2023-03-07 15:40:50,917 UTC
Data:
  Id = 76805c56-fcf6-48a2-b4fd-ffa77570e83c
  Type = Fault
  Category = Status
  Severity = Minor
  Status = Active
  Associated Attribute = cpuFault
  Last Update Time = 2023-03-04 01:06:25,666 UTC
  Cause = ComputeNode pcacn002 attribute cpuFault = MINOR.
  FaultHistoryLogIds 1 = id:79b44c26-cb4e-4bec-a58c-6efc7fc63fed type:FaultHistoryLog name:
  FaultHistoryLogIds 2 = id:fc90a99a-031b-457f-b585-5c905e61362e type:FaultHistoryLog name:
  FaultHistoryLogIds 3 = id:48068f78-1328-447d-9506-efb6f22d154d type:FaultHistoryLog name:
  FaultHistoryLogIds 4 = id:d97c5819-923c-480d-8f61-2341c8403182 type:FaultHistoryLog name:
  FaultHistoryLogIds 5 = id:18cdd005-53c0-488c-a2df-28f2da3b1092 type:FaultHistoryLog name:
  FaultHistoryLogIds 6 = id:bfe1ffcd-5899-4400-914c-b467d8671e0c type:FaultHistoryLog name:
  FaultHistoryLogIds 7 = id:459fa55b-8654-4c07-8ae7-6d0ef011e3b1 type:FaultHistoryLog name:
  FaultHistoryLogIds 8 = id:b9c8a909-f8ea-4de6-9bfe-2516e7addf73 type:FaultHistoryLog name:
  FaultHistoryLogIds 9 = id:6ab5d1ca-3659-49a7-8e68-946bbbeccc9f type:FaultHistoryLog name:
  FaultHistoryLogIds 10 = id:d04d06a1-1e2c-404c-ac67-680e0deb34c5 type:FaultHistoryLog name:
  FaultHistoryLogIds 11 = id:22dd163e-528f-4346-b177-d62c7ceb9885 type:FaultHistoryLog name:
  FaultHistoryLogIds 12 = id:cdb2dbf5-6999-43c2-bb5f-17192bfad3e2 type:FaultHistoryLog name:
  FaultHistoryLogIds 13 = id:aa7b2e43-ab0b-4d78-bfe7-d4b0dd0fec4a type:FaultHistoryLog name:
  BaseManagedObjectId = id:0dd96e90-de00-4fa0-82e3-16937e4601f8 type:ComputeNode name:
  Description = ComputeNode pcacn002 attribute cpuFault = MINOR.
  Name = ComputeNodeCpuFaultStatusFault(pcacn002)
  Work State = Normal
```
Storage Utilization Faults
The following table describes the two kinds of Oracle ZFS Storage Appliance faults raised in the Admin service.
These are utilization faults (ZFS pool usage), not hardware faults. Problems with ZFS hardware are reported through ASR.
Private Cloud Appliance uses Prometheus metrics data collected for the ZFS Storage Appliance to report pool usage. Total pool size per pool (`zfssa_pool_total`) and free space per pool (`zfssa_pool_free`) are used to calculate the pool usage percentage. The `zfssa_pool_status` metric reports the health of a pool.
Metric Name | Metric Value Description | Fault Condition
---|---|---
zfssa_pool_total, zfssa_pool_free | Pool usage percentage is calculated using the following formula for each pool: (zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total | If the pool usage percentage is above a pre-configured value, a major fault is raised. The default value is 80 percent.
zfssa_pool_status | The zfssa_pool_status metric reports the health of a pool. | A major fault is raised for any pool that reports an unhealthy status.
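The pool usage calculation described above is a simple ratio. Here is a minimal sketch (not the Admin service's actual code) of the formula and the default 80 percent major-fault threshold; the byte counts are made-up values, not real Prometheus samples:

```python
# Default threshold above which a major fault is raised for pool usage.
DEFAULT_POOL_USAGE_THRESHOLD = 0.80  # 80 percent

def pool_usage(total_bytes: float, free_bytes: float) -> float:
    """(zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total"""
    return (total_bytes - free_bytes) / total_bytes

def pool_usage_fault(total_bytes: float, free_bytes: float) -> bool:
    """True if pool usage is above the 80% major-fault threshold."""
    return pool_usage(total_bytes, free_bytes) > DEFAULT_POOL_USAGE_THRESHOLD

# Example: a 100 TB pool with 15 TB free is 85% used, above the threshold.
usage = pool_usage(100e12, 15e12)
print(round(usage, 2))                   # 0.85
print(pool_usage_fault(100e12, 15e12))   # True
```

A pool with more than 20 percent free space would return `False` here, and any corresponding fault would be cleared when usage drops back below the threshold.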
Hardware Run State Faults
A critical or major fault is raised if a hardware unit on the rack such as a management node, compute node, storage node, or switch has an invalid run state.
The following table shows the severity of the fault that is raised for each run state. Any run state other than those listed clears any existing fault.
Run State Value (case insensitive) | Fault Severity | Fault State
---|---|---
UNABLE TO CONNECT TO ILOM | Critical | Active
FAIL | Critical | Active
SERVICE REQUIRED | Major | Active
other | Not applicable | Cleared
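Because the match is case insensitive and everything outside the three listed states clears the fault, the table reduces to a small lookup. This is an illustrative sketch, an assumption rather than the service's actual code:

```python
# Severities for the run states listed in the table above; any other
# run state clears the fault (represented here as None). Matching is
# case-insensitive, so the lookup normalizes to upper case.
RUN_STATE_SEVERITY = {
    "UNABLE TO CONNECT TO ILOM": "Critical",
    "FAIL": "Critical",
    "SERVICE REQUIRED": "Major",
}

def run_state_fault(run_state: str):
    """Return the fault severity for a run state, or None (cleared)."""
    return RUN_STATE_SEVERITY.get(run_state.strip().upper())

print(run_state_fault("fail"))     # Critical
print(run_state_fault("Running"))  # None
```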
Health Checker Notification Faults
Health Checker faults are raised from notifications sent by the ZFSSA and Network Health Checker components. The Admin service raises a fault for every notification it receives.
Following are example attributes of the `faultedComponents` object in the Network Health Checker component fault data:

```
"class": "cisco.fan.fail",
"severity": "Major",
"description": "Fan module has failed and needs to be replaced. This can lead to overheating and temperature alarms.",
...
"class": "cisco.power.fail",
"severity": "Major",
"description": "Power Supply has failed or has been shutdown",
```
Following are example attributes of the `faultedComponents` object in the ZFSSA Health Checker component fault data:

```
"severity": "Major",
"type": "Fault",
"description": "An internal power supply failure has been detected.",
```
Detailed information is provided about the part that has failed. An `action` attribute contains a brief description of how to fix the problem and might include a link to the appropriate support document.
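To illustrate how a notification like the fragments above could become a fault record, here is a hypothetical sketch; the field names follow the examples, but the fault structure itself is an assumption, not the Admin service's actual data model:

```python
import json

# A notification payload shaped like the Network Health Checker example
# above. In the real system this arrives via RabbitMQ; here it is inline.
notification = json.loads("""
{"class": "cisco.fan.fail",
 "severity": "Major",
 "description": "Fan module has failed and needs to be replaced."}
""")

# Build a simple fault record from the notification attributes.
fault = {
    "name": notification["class"],
    "severity": notification["severity"],
    "description": notification["description"],
    "status": "Active",
}
print(fault["severity"])  # Major
```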
Manually Clearing Faults
This section describes how to manually clear faults using the Service CLI. You cannot manually clear faults using the Service Web UI.
Using the Service CLI
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```
2. Use the `list fault` command to find the fault identifiers.

   ```
   PCA-ADMIN> list fault
   Command: list fault
   Status: Success
   Time: 2024-01-31 21:38:05,472 UTC
   Data:
     id                                Name              Status  Severity
     --                                ----              ------  --------
     71671228-.….….-56a6a58947c6a6789  pcamn02-example   Active  Critical
     524cb805-.….….-acc3458bb79t04295  RackUnit-example  Active  Major
   PCA-ADMIN>
   ```
3. Use the `clearFault` command with the fault identifier to clear the fault.

   ```
   PCA-ADMIN> clearFault id=524cb805-.….….-acc3458bb79t04295
   Command: clearFault
   Status: Success
   Time: 2024-01-31 21:39:30,094 UTC
   PCA-ADMIN>
   ```
Note: You can verify the result by running another `list fault` command.

```
PCA-ADMIN> list fault
Command: list fault
Status: Success
Time: 2024-01-31 21:40:02,685 UTC
Data:
  id                                Name             Status  Severity
  --                                ----             ------  --------
  71671228-.….….-56a6a58947c6a6789  pcamn02-example  Active  Critical
PCA-ADMIN>
```