Viewing Admin Service Health Data
This section describes Private Cloud Appliance Admin service health metrics and the conditions that raise faults. This health information does not cover hardware faults; it covers resource utilization (CPU, memory, and storage), hardware run state, and health checker notifications. The hardware faults listed at the bottom of Table 5-1 are reported through ASR.
Admin Service Faults Summary
The threshold, run state, and health checker notification fault types listed in the following table are described in more detail in the sections that follow.
Table 5-1 Admin Service Fault Detection Configuration Summary
Fault Type | Fault Detection Frequency (seconds) | Fault Detection Delay (seconds) | Data Source | Method of Detection
---|---|---|---|---
Compute Node CPU and Memory Utilization Faults | 60 | < 20 | Admin calls ComputeNode service | Faults are raised by the fault task based on the compute node object attributes stored in the database.
Storage Utilization Faults | 120 | < 20 | Admin calls Prometheus service | Faults are raised by the fault task based on Prometheus ZFS pool usage and status data stored in the database.
Hardware Run State Faults | 150 | < 20 | Admin calls Hardware | Faults are raised by the fault task based on hardware component node/ILOM run states stored in the database.
Health Checker Notification Faults | Defined by the ZFS/Network health checker notification frequency | 0 | Various HealthChecker services send notifications | Faults are created based on the RabbitMQ notification fault results.
Platform ILOM Faults | 150 | 0 | Admin calls Hardware | Faults are created based on the L1 API results for ILOM object data. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.
Hardware Status Faults | On initialization, and when the | < 20 | Admin calls Hardware | Faults are raised by the fault task based on the PcaSystem object attribute. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.
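The detection frequencies in the table describe periodic fault tasks. As a rough, hypothetical sketch (not the Admin service's actual code), such a task can be modeled as a loop that runs a check at its configured interval and reports the result each time:

```python
import time

# Hypothetical sketch of the polling pattern Table 5-1 describes: each
# fault task runs its check at a configured detection frequency (for
# example, 60 seconds for the compute node task) and raises or clears
# faults based on the data the check returns.
def run_fault_task(poll_interval_sec, check_once, iterations=3):
    """Run check_once() at a fixed interval, collecting each result."""
    results = []
    for _ in range(iterations):
        results.append(check_once())
        time.sleep(poll_interval_sec)
    return results

# A real task would loop indefinitely; here we run two short iterations
# with a stub check so the sketch is self-contained.
print(run_fault_task(0.01, lambda: "no fault", iterations=2))
```

The actual tasks read their input (compute node attributes, Prometheus data, run states) from the database rather than calling a stub, as the Data Source column indicates.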
Using the Service Web UI to View Admin Service Faults
1. Click the Active Faults link at the top of the Service Enclave Home page, or click Faults on the Navigation menu.

   The Faults page is displayed.

2. At the top of the Faults page, toggle whether to list all faults or only active faults.

3. For more information about a fault, click the name of the fault, or click View Details on the Actions menu.

   The details page shows the description, cause, and recommended action to take.
Using the Service CLI to View Admin Service Faults
1. To view the list of Admin service faults, use the `list fault` command. Both active and cleared faults are listed.

   ```
   PCA-ADMIN> list fault
   Command: list fault
   Status: Success
   Time: 2023-03-07 15:34:52,613 UTC
   Data:
     id                                    name                                              status   severity
     --                                    ----                                              ------   --------
     33c61b8a-dcc7-4b8f-bc0f-56915ecc62f5  RackUnitIlomRunStateFaultStatusFault(pcacn005)    Cleared  Critical
     f7d22180-aeae-4159-b5c8-5e55a7906a78  RackUnitIlomRunStateFaultStatusFault(pcacn004)    Cleared  Critical
     a4fef907-8e54-4750-9fac-6829fbade90d  ComputeNodeCpuFaultStatusFault(pcacn006)          Cleared  Minor
     f8d93384-da30-43cd-9396-6e6671d240e2  RackUnitIlomRunStateFaultStatusFault(pcacn010)    Cleared  Critical
     8e61bb81-7a02-4c26-8ef4-c13b198f64da  ComputeNodeCpuFaultStatusFault(pcacn007)          Cleared  Warning
     3216b6f9-326b-4992-99a3-ab23cb18243b  AK-8003-F9--PCIe 3                                Active   Minor
     ef3fb25b-0573-4524-8d1c-fb704c814446  AK-8003-HF--vnic1                                 Active   Major
     f830cd46-21ff-4d74-ba81-c82fd6f52c67  ComputeNodeCpuFaultStatusFault(pcacn005)          Cleared  Minor
     d2e71da0-ba63-4983-97da-24033d5c6447  ZfsPoolUsageFaultStatusFault(PCA_POOL)            Cleared  Major
     eecd5ef2-4a71-4137-be96-54c028212d2f  ComputeNodeMemoryFaultStatusFault(pcacn004)       Cleared  Minor
     cf68d2ee-e483-e573-b46e-c31bcbc8e968  ISTOR-8000-1S--ORACLE SERVER E5-2L                Cleared  Major
     0686c11d-b96b-e5aa-dfbe-a20154da4794  SPAMD-8002-FJ--ORACLE SERVER E5-2L                Cleared  Major
     b488a45a-80df-46e3-b0b5-a35527eb9c0e  AK-8003-F9--PCIe 10                               Active   Minor
     ac48f88d-e181-4b03-b620-6bfbf4ad95ef  RackUnitIlomRunStateFaultStatusFault(pcacn007)    Cleared  Critical
     b4c66a7c-def3-42c2-8842-d4763afc5184  RackUnitIlomRunStateFaultStatusFault(pcacn006)    Cleared  Critical
     9fc2e45a-1cff-4f95-828d-58742c8ce12f  ComputeNodeMemoryFaultStatusFault(pcacn002)       Active   Minor
     c0124122-a91c-4110-89cc-deebe54de7ba  ComputeNodeMemoryFaultStatusFault(pcacn006)       Cleared  Critical
     ca26ed46-4d1c-4ade-9e74-af27d94cf8f4  AK-8003-HF--vnic2                                 Active   Major
     58e9ab5d-d4e7-4d94-9ca6-e85a1c88b3b8  RackUnitRunStateFaultStatusFault(sn022147XLF014)  Cleared  Critical
     474c269f-4018-45d7-97d5-da17c9c845f4  RackUnitIlomRunStateFaultStatusFault(pcacn001)    Cleared  Critical
     2b5ece1c-50fc-436a-81b3-da0c5b418fe3  RackUnitIlomRunStateFaultStatusFault(pcacn003)    Cleared  Critical
     1c164eb9-9a76-4592-8ab6-150edb8f7a75  ComputeNodeCpuFaultStatusFault(pcacn001)          Cleared  Warning
     55ed1494-6aac-4248-91cb-9ac8295d668c  AK-8003-HF--PCIe 6                                Active   Major
     afbcc080-0b93-434b-8ead-fa673f302170  AK-8003-F9--PCIe 6                                Active   Minor
     8b36c2db-a3b4-41c8-b416-8e733ace3aeb  PcaSystemReSyncHwStatusStatusFault(null)          Cleared  Warning
     28c5ba93-6b4e-42f3-ad61-90734b46bf30  SPENV-8000-RU--ORACLE SERVER E5-2L                Cleared  Critical
     3d932188-0120-489f-a512-1a244ec01e49  RackUnitIlomRunStateFaultStatusFault(pcacn009)    Cleared  Critical
     21e6faa9-68e1-47ae-a298-e2cb14d2a406  ComputeNodeMemoryFaultStatusFault(pcacn007)       Cleared  Minor
     db023304-fb7a-613b-ad9b-e277b7ce5675  SPENV-8000-A7--ORACLE SERVER E5-2L                Cleared  Major
     63839bf5-335b-48ff-86a0-9e981e3e9902  RackUnitRunStateFaultStatusFault(sn012147XLF014)  Cleared  Critical
     2e851c6e-aa29-4a25-846a-29b08967dd95  RackUnitValidationStateStatusFault(pcacn008)      Cleared  Major
     76805c56-fcf6-48a2-b4fd-ffa77570e83c  ComputeNodeCpuFaultStatusFault(pcacn002)          Active   Minor
     9be74faf-df4d-ea20-cfc1-92b2a6a01b06  SPENV-8000-A7--ORACLE SERVER E5-2L                Cleared  Major
     1624064f-d380-4ffc-9000-d293c185d7ac  ComputeNodeCpuFaultStatusFault(pcacn003)          Cleared  Warning
     7ca3f7af-f0bd-45d9-bad7-15794d49e7c6  RackUnitIlomRunStateFaultStatusFault(pcacn008)    Cleared  Critical
     3e7a3503-7a71-4ef1-a3ad-fba2162571ab  ComputeNodeCpuFaultStatusFault(pcacn004)          Cleared  Warning
     0922cd8e-297e-4356-b736-b09ac382b28b  AK-8003-F9--PCIe 10                               Active   Minor
     ab44ad2c-1105-417d-aa47-e8cb477ef0ec  AK-8003-F9--PCIe 3                                Active   Minor
   ```
2. To view the details of a specific fault, including the description, cause, and recommended action to take, use the `show fault` command with the specific fault ID.

   ```
   PCA-ADMIN> show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
   Command: show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
   Status: Success
   Time: 2023-03-07 15:36:19,414 UTC
   Data:
     Id = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
     Type = Fault
     Category = Internal
     Severity = Minor
     Status = Active
     Last Update Time = 2023-03-06 20:04:11,668 UTC
     Message Id = AK-8003-F9
     Time Reported = Mon Mar 06 2023 16:50:24 GMT+0000 (UTC)
     Action = Check the networking cable, switch port, and switch configuration. Contact your vendor for support if the network port remains inexplicably down. Please refer to the associated reference document at http://support.oracle.com/msg/AK-8003-F9 for the latest service procedures and policies regarding this diagnosis.
     Health Exporter = zfssa-analytics-exportersn022147XLF014
     uuid = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
     Diagnosing Source = zfssa_analytics_exporter
     FaultHistoryLogIds 1 = id:fdfaa42f-de8d-4622-a9df-ea229b7bad6f type:FaultHistoryLog name:
     BaseManagedObjectId = id:2147XLF015/PCIe 3/465774J-2121701684 type:HardwareComponent name:
     Description = Network connectivity via port mlxne4 has been lost.
     Name = AK-8003-F9--PCIe 3
     Work State = Normal
   ```
Additional examples of using the Service CLI to show Admin service faults are shown in Compute Node CPU and Memory Utilization Faults.
Compute Node CPU and Memory Utilization Faults
The Admin service raises faults for the percentage of memory used and the percentage of CPU used for a `ComputeNode` object. More severe faults are raised as more memory and CPU are used. When the percentage used drops below a certain threshold, any faults are cleared.
These are utilization faults (CPU and memory usage), not hardware faults. Problems with CPU and memory hardware are reported through ASR.
CPU Usage
The following table shows the default compute node CPU usage thresholds that raise faults of different severities.
CPU Percentage | Fault Severity | Fault State
---|---|---
< .75 | Not applicable | Cleared
>= .75 | Warning | Active
>= .80 | Minor | Active
>= .90 | Major | Active
>= .95 | Critical | Active
Memory Usage
The following table shows the default compute node memory usage thresholds that raise faults of different severities.
Memory Percentage | Fault Severity | Fault State
---|---|---
< .75 | Not applicable | Cleared
>= .75 | Warning | Active
>= .80 | Minor | Active
>= .90 | Major | Active
>= .95 | Critical | Active
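The two threshold tables above can be summarized as a single mapping from a usage fraction to a fault severity. The following is a minimal sketch, not the Admin service's actual implementation, using the default thresholds:

```python
# Map a compute node CPU or memory usage fraction to a fault severity
# using the default thresholds (0.75/0.80/0.90/0.95). Below 0.75, any
# existing fault is cleared, represented here as None.
def usage_fault_severity(usage: float):
    """Return the fault severity for a usage fraction, or None if cleared."""
    if usage >= 0.95:
        return "Critical"
    if usage >= 0.90:
        return "Major"
    if usage >= 0.80:
        return "Minor"
    if usage >= 0.75:
        return "Warning"
    return None  # below 0.75: fault cleared

print(usage_fault_severity(0.82))  # Minor
print(usage_fault_severity(0.60))  # None
```

Because the thresholds are checked from highest to lowest, each usage value maps to exactly one severity, matching the `>=` rows in the tables.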
Using the Service CLI to View Compute Node Faults
To view the default compute node CPU and memory usage fault trigger settings using the Service CLI, use the `show cnUpdateManager` command:

```
PCA-ADMIN> show cnUpdateManager
Command: show cnUpdateManager
Status: Success
Time: 2023-03-06 23:41:37,249 UTC
Data:
  Id = caaaaaa1-a076-4e48-94b5-7bdcd4e0c42c
  Type = CnUpdateManager
  LastRunTime = 2023-03-06 23:41:33,676 UTC
  Poll Interval (sec) = 60
  The minimum CPU usage percentage to trigger a critical fault = 0.95
  The minimum CPU usage percentage to trigger a major fault = 0.9
  The minimum CPU usage percentage to trigger a minor fault = 0.8
  The minimum CPU usage percentage to trigger a warning = 0.75
  The minimum memory usage percentage to trigger a critical fault = 0.95
  The minimum memory usage percentage to trigger a major fault = 0.9
  The minimum memory usage percentage to trigger a minor fault = 0.8
  The minimum memory usage percentage to trigger a warning = 0.75
```
To view the list of all faults and the details of a specific fault, see Viewing Admin Service Health Data. The following example shows a specific compute node fault. The fault details do not show the current usage value; they show only that usage is at least the minor fault threshold and below the major fault threshold. To see current usage, use the Service Web UI.
```
PCA-ADMIN> show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Command: show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Status: Success
Time: 2023-03-07 15:40:50,917 UTC
Data:
  Id = 76805c56-fcf6-48a2-b4fd-ffa77570e83c
  Type = Fault
  Category = Status
  Severity = Minor
  Status = Active
  Associated Attribute = cpuFault
  Last Update Time = 2023-03-04 01:06:25,666 UTC
  Cause = ComputeNode pcacn002 attribute cpuFault = MINOR.
  FaultHistoryLogIds 1 = id:79b44c26-cb4e-4bec-a58c-6efc7fc63fed type:FaultHistoryLog name:
  FaultHistoryLogIds 2 = id:fc90a99a-031b-457f-b585-5c905e61362e type:FaultHistoryLog name:
  FaultHistoryLogIds 3 = id:48068f78-1328-447d-9506-efb6f22d154d type:FaultHistoryLog name:
  FaultHistoryLogIds 4 = id:d97c5819-923c-480d-8f61-2341c8403182 type:FaultHistoryLog name:
  FaultHistoryLogIds 5 = id:18cdd005-53c0-488c-a2df-28f2da3b1092 type:FaultHistoryLog name:
  FaultHistoryLogIds 6 = id:bfe1ffcd-5899-4400-914c-b467d8671e0c type:FaultHistoryLog name:
  FaultHistoryLogIds 7 = id:459fa55b-8654-4c07-8ae7-6d0ef011e3b1 type:FaultHistoryLog name:
  FaultHistoryLogIds 8 = id:b9c8a909-f8ea-4de6-9bfe-2516e7addf73 type:FaultHistoryLog name:
  FaultHistoryLogIds 9 = id:6ab5d1ca-3659-49a7-8e68-946bbbeccc9f type:FaultHistoryLog name:
  FaultHistoryLogIds 10 = id:d04d06a1-1e2c-404c-ac67-680e0deb34c5 type:FaultHistoryLog name:
  FaultHistoryLogIds 11 = id:22dd163e-528f-4346-b177-d62c7ceb9885 type:FaultHistoryLog name:
  FaultHistoryLogIds 12 = id:cdb2dbf5-6999-43c2-bb5f-17192bfad3e2 type:FaultHistoryLog name:
  FaultHistoryLogIds 13 = id:aa7b2e43-ab0b-4d78-bfe7-d4b0dd0fec4a type:FaultHistoryLog name:
  BaseManagedObjectId = id:0dd96e90-de00-4fa0-82e3-16937e4601f8 type:ComputeNode name:
  Description = ComputeNode pcacn002 attribute cpuFault = MINOR.
  Name = ComputeNodeCpuFaultStatusFault(pcacn002)
  Work State = Normal
```
Storage Utilization Faults
The following table describes the two kinds of Oracle ZFS Storage Appliance faults raised in the Admin service.
These are utilization faults (ZFS pool usage), not hardware faults. Problems with ZFS hardware are reported through ASR.
Private Cloud Appliance uses Prometheus metrics data collected for the ZFS Storage Appliance to report pool usage. Total pool size per pool (`zfssa_pool_total`) and free space per pool (`zfssa_pool_free`) are used to calculate the pool usage percentage. The `zfssa_pool_status` metric reports the health of a pool.
Metric Name | Metric Value Description | Fault Condition
---|---|---
zfssa_pool_total, zfssa_pool_free | Pool usage percentage is calculated using the following formula for each pool: (zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total | If the pool usage percentage is above a pre-configured value, a major fault is raised. The default value is 80 percent.
zfssa_pool_status | The zfssa_pool_status metric reports the health of a pool. | A major fault is raised for any pool that reports an unhealthy status.
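The pool usage calculation described above is a simple ratio. Here is a minimal sketch (not the Admin service's actual code) of the formula and the default 80 percent major-fault threshold; the byte counts are made-up values, not real Prometheus samples:

```python
# Default threshold above which a major fault is raised for pool usage.
DEFAULT_POOL_USAGE_THRESHOLD = 0.80  # 80 percent

def pool_usage(total_bytes: float, free_bytes: float) -> float:
    """(zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total"""
    return (total_bytes - free_bytes) / total_bytes

def pool_usage_fault(total_bytes: float, free_bytes: float) -> bool:
    """True if pool usage is above the 80% major-fault threshold."""
    return pool_usage(total_bytes, free_bytes) > DEFAULT_POOL_USAGE_THRESHOLD

# Example: a 100 TB pool with 15 TB free is 85% used, above the threshold.
usage = pool_usage(100e12, 15e12)
print(round(usage, 2))                   # 0.85
print(pool_usage_fault(100e12, 15e12))   # True
```

A pool with more than 20 percent free space would return `False` here, and any corresponding fault would be cleared when usage drops back below the threshold.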
Hardware Run State Faults
A critical or major fault is raised if a hardware unit on the rack such as a management node, compute node, storage node, or switch has an invalid run state.
The following table shows the severity of the fault that is raised for each run state. Any run state other than those listed clears any existing fault.
Run State Value (case insensitive) | Fault Severity | Fault State
---|---|---
UNABLE TO CONNECT TO ILOM | Critical | Active
FAIL | Critical | Active
SERVICE REQUIRED | Major | Active
other | Not applicable | Cleared
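Because the match is case insensitive and everything outside the three listed states clears the fault, the table reduces to a small lookup. This is an illustrative sketch, an assumption rather than the service's actual code:

```python
# Severities for the run states listed in the table above; any other
# run state clears the fault (represented here as None). Matching is
# case-insensitive, so the lookup normalizes to upper case.
RUN_STATE_SEVERITY = {
    "UNABLE TO CONNECT TO ILOM": "Critical",
    "FAIL": "Critical",
    "SERVICE REQUIRED": "Major",
}

def run_state_fault(run_state: str):
    """Return the fault severity for a run state, or None (cleared)."""
    return RUN_STATE_SEVERITY.get(run_state.strip().upper())

print(run_state_fault("fail"))     # Critical
print(run_state_fault("Running"))  # None
```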
Health Checker Notification Faults
Health Checker faults are raised from notifications sent by the ZFSSA and Network Health Checker components. The Admin service raises a fault for every notification it receives.
Following are example attributes of the `faultedComponents` object in the Network Health Checker component fault data:

```
"class": "cisco.fan.fail",
"severity": "Major",
"description": "Fan module has failed and needs to be replaced. This can lead to overheating and temperature alarms.",
...
"class": "cisco.power.fail",
"severity": "Major",
"description": "Power Supply has failed or has been shutdown",
```
Following are example attributes of the `faultedComponents` object in the ZFSSA Health Checker component fault data:

```
"severity": "Major",
"type": "Fault",
"description": "An internal power supply failure has been detected.",
```
Detailed information is provided about the part that has failed. An `action` attribute contains a brief description of how to fix the problem and might include a link to the appropriate support document.
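To illustrate how a notification like the fragments above could become a fault record, here is a hypothetical sketch; the field names follow the examples, but the fault structure itself is an assumption, not the Admin service's actual data model:

```python
import json

# A notification payload shaped like the Network Health Checker example
# above. In the real system this arrives via RabbitMQ; here it is inline.
notification = json.loads("""
{"class": "cisco.fan.fail",
 "severity": "Major",
 "description": "Fan module has failed and needs to be replaced."}
""")

# Build a simple fault record from the notification attributes.
fault = {
    "name": notification["class"],
    "severity": notification["severity"],
    "description": notification["description"],
    "status": "Active",
}
print(fault["severity"])  # Major
```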
Manually Clearing Faults
This section describes how to manually clear faults using the Service CLI. You cannot manually clear faults using the Service Web UI.
Using the Service CLI
1. Using SSH, log in to the management node VIP as `admin`.

   ```
   # ssh -l admin 100.96.2.32 -p 30006
   ```
2. Use the `list fault` command to find the fault identifiers.

   ```
   PCA-ADMIN> list fault
   Command: list fault
   Status: Success
   Time: 2024-01-31 21:38:05,472 UTC
   Data:
     id                                Name              Status  Severity
     --                                ----              ------  --------
     71671228-.….….-56a6a58947c6a6789  pcamn02-example   Active  Critical
     524cb805-.….….-acc3458bb79t04295  RackUnit-example  Active  Major
   PCA-ADMIN>
   ```
3. Use the `clearFault` command with the fault identifier to clear the fault.

   ```
   PCA-ADMIN> clearFault id=524cb805-.….….-acc3458bb79t04295
   Command: clearFault
   Status: Success
   Time: 2024-01-31 21:39:30,094 UTC
   PCA-ADMIN>
   ```
Note: You can verify the result by running another `list fault` command.

```
PCA-ADMIN> list fault
Command: list fault
Status: Success
Time: 2024-01-31 21:40:02,685 UTC
Data:
  id                                Name             Status  Severity
  --                                ----             ------  --------
  71671228-.….….-56a6a58947c6a6789  pcamn02-example  Active  Critical
PCA-ADMIN>
```