Viewing Admin Service Health Data

This section describes Private Cloud Appliance Admin service health metrics and the conditions that raise faults. This health information does not cover hardware faults; it covers resource utilization (CPU, memory, and storage), hardware run state, and health checker notifications. The hardware faults listed at the bottom of Table 5-1 are reported through ASR.

Admin Service Faults Summary

The following table summarizes the threshold, run state, and health checker notification fault types. Each fault type is described in more detail in the sections that follow.

Table 5-1 Admin Service Fault Detection Configuration Summary

Fault Type: Compute Node CPU and Memory Utilization Faults
    Fault Detection Frequency (seconds): 60
    Fault Detection Delay (seconds): < 20
    Data Source: Admin calls ComputeNode service
    Method of Detection: Faults are raised by the fault task based on the compute node object attributes stored in the database.

Fault Type: Storage Utilization Faults
    Fault Detection Frequency (seconds): 120
    Fault Detection Delay (seconds): < 20
    Data Source: Admin calls Prometheus service
    Method of Detection: Faults are raised by the fault task based on Prometheus ZFS pool usage and status data stored in the database.

Fault Type: Hardware Run State Faults
    Fault Detection Frequency (seconds): 150
    Fault Detection Delay (seconds): < 20
    Data Source: Admin calls Hardware list REST API
    Method of Detection: Faults are raised by the fault task based on hardware component node/ILOM run states stored in the database.

Fault Type: Health Checker Notification Faults
    Fault Detection Frequency (seconds): Defined by the ZFS/Network health checker notification frequency
    Fault Detection Delay (seconds): 0
    Data Source: Various HealthChecker services send notifications
    Method of Detection: Faults are created based on the RabbitMQ notification fault results.

Fault Type: Platform ILOM Faults
    Fault Detection Frequency (seconds): 150
    Fault Detection Delay (seconds): 0
    Data Source: Admin calls Hardware getMgmt and getCompute ILOM Health REST APIs
    Method of Detection: Faults are created based on the L1 API results for ILOM object data. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.

Fault Type: Hardware Status Faults
    Fault Detection Frequency (seconds): On initialization, and when the syncHardwareData command runs
    Fault Detection Delay (seconds): < 20
    Data Source: Admin calls Hardware list REST API
    Method of Detection: Faults are raised by the fault task based on the PcaSystem object attribute. See PCA X9-2 Appliance: Automatic Service Request (ASR) Event Coverage (Doc ID 2833567.1) for a list of Private Cloud Appliance X9-2 events that are actionable by ASR.

Using the Service Web UI to View Admin Service Faults

  1. Click the Active Faults link at the top of the Service Enclave Home page, or click Faults on the Navigation menu.

    The Faults page is displayed.
  2. At the top of the Faults page, you can toggle whether to list all faults or only active faults.

  3. For more information about a fault, click the name of the fault, or click View Details on the Actions menu.

    The details page shows the description, cause, and recommended action to take.

Using the Service CLI to View Admin Service Faults

  1. To view the list of Admin service faults, use the list fault command.

    Both active and cleared faults are listed.

    PCA-ADMIN> list fault
    Command: list fault
    Status: Success
    Time: 2023-03-07 15:34:52,613 UTC
    Data:
      id                                     name                                               status    severity
      --                                     ----                                               ------    --------
      33c61b8a-dcc7-4b8f-bc0f-56915ecc62f5   RackUnitIlomRunStateFaultStatusFault(pcacn005)     Cleared   Critical
      f7d22180-aeae-4159-b5c8-5e55a7906a78   RackUnitIlomRunStateFaultStatusFault(pcacn004)     Cleared   Critical
      a4fef907-8e54-4750-9fac-6829fbade90d   ComputeNodeCpuFaultStatusFault(pcacn006)           Cleared   Minor
      f8d93384-da30-43cd-9396-6e6671d240e2   RackUnitIlomRunStateFaultStatusFault(pcacn010)     Cleared   Critical
      8e61bb81-7a02-4c26-8ef4-c13b198f64da   ComputeNodeCpuFaultStatusFault(pcacn007)           Cleared   Warning
      3216b6f9-326b-4992-99a3-ab23cb18243b   AK-8003-F9--PCIe 3                                 Active    Minor
      ef3fb25b-0573-4524-8d1c-fb704c814446   AK-8003-HF--vnic1                                  Active    Major
      f830cd46-21ff-4d74-ba81-c82fd6f52c67   ComputeNodeCpuFaultStatusFault(pcacn005)           Cleared   Minor
      d2e71da0-ba63-4983-97da-24033d5c6447   ZfsPoolUsageFaultStatusFault(PCA_POOL)             Cleared   Major
      eecd5ef2-4a71-4137-be96-54c028212d2f   ComputeNodeMemoryFaultStatusFault(pcacn004)        Cleared   Minor
      cf68d2ee-e483-e573-b46e-c31bcbc8e968   ISTOR-8000-1S--ORACLE SERVER E5-2L                 Cleared   Major
      0686c11d-b96b-e5aa-dfbe-a20154da4794   SPAMD-8002-FJ--ORACLE SERVER E5-2L                 Cleared   Major
      b488a45a-80df-46e3-b0b5-a35527eb9c0e   AK-8003-F9--PCIe 10                                Active    Minor
      ac48f88d-e181-4b03-b620-6bfbf4ad95ef   RackUnitIlomRunStateFaultStatusFault(pcacn007)     Cleared   Critical
      b4c66a7c-def3-42c2-8842-d4763afc5184   RackUnitIlomRunStateFaultStatusFault(pcacn006)     Cleared   Critical
      9fc2e45a-1cff-4f95-828d-58742c8ce12f   ComputeNodeMemoryFaultStatusFault(pcacn002)        Active    Minor
      c0124122-a91c-4110-89cc-deebe54de7ba   ComputeNodeMemoryFaultStatusFault(pcacn006)        Cleared   Critical
      ca26ed46-4d1c-4ade-9e74-af27d94cf8f4   AK-8003-HF--vnic2                                  Active    Major
      58e9ab5d-d4e7-4d94-9ca6-e85a1c88b3b8   RackUnitRunStateFaultStatusFault(sn022147XLF014)   Cleared   Critical
      474c269f-4018-45d7-97d5-da17c9c845f4   RackUnitIlomRunStateFaultStatusFault(pcacn001)     Cleared   Critical
      2b5ece1c-50fc-436a-81b3-da0c5b418fe3   RackUnitIlomRunStateFaultStatusFault(pcacn003)     Cleared   Critical
      1c164eb9-9a76-4592-8ab6-150edb8f7a75   ComputeNodeCpuFaultStatusFault(pcacn001)           Cleared   Warning
      55ed1494-6aac-4248-91cb-9ac8295d668c   AK-8003-HF--PCIe 6                                 Active    Major
      afbcc080-0b93-434b-8ead-fa673f302170   AK-8003-F9--PCIe 6                                 Active    Minor
      8b36c2db-a3b4-41c8-b416-8e733ace3aeb   PcaSystemReSyncHwStatusStatusFault(null)           Cleared   Warning
      28c5ba93-6b4e-42f3-ad61-90734b46bf30   SPENV-8000-RU--ORACLE SERVER E5-2L                 Cleared   Critical
      3d932188-0120-489f-a512-1a244ec01e49   RackUnitIlomRunStateFaultStatusFault(pcacn009)     Cleared   Critical
      21e6faa9-68e1-47ae-a298-e2cb14d2a406   ComputeNodeMemoryFaultStatusFault(pcacn007)        Cleared   Minor
      db023304-fb7a-613b-ad9b-e277b7ce5675   SPENV-8000-A7--ORACLE SERVER E5-2L                 Cleared   Major
      63839bf5-335b-48ff-86a0-9e981e3e9902   RackUnitRunStateFaultStatusFault(sn012147XLF014)   Cleared   Critical
      2e851c6e-aa29-4a25-846a-29b08967dd95   RackUnitValidationStateStatusFault(pcacn008)       Cleared   Major
      76805c56-fcf6-48a2-b4fd-ffa77570e83c   ComputeNodeCpuFaultStatusFault(pcacn002)           Active    Minor
      9be74faf-df4d-ea20-cfc1-92b2a6a01b06   SPENV-8000-A7--ORACLE SERVER E5-2L                 Cleared   Major
      1624064f-d380-4ffc-9000-d293c185d7ac   ComputeNodeCpuFaultStatusFault(pcacn003)           Cleared   Warning
      7ca3f7af-f0bd-45d9-bad7-15794d49e7c6   RackUnitIlomRunStateFaultStatusFault(pcacn008)     Cleared   Critical
      3e7a3503-7a71-4ef1-a3ad-fba2162571ab   ComputeNodeCpuFaultStatusFault(pcacn004)           Cleared   Warning
      0922cd8e-297e-4356-b736-b09ac382b28b   AK-8003-F9--PCIe 10                                Active    Minor
      ab44ad2c-1105-417d-aa47-e8cb477ef0ec   AK-8003-F9--PCIe 3                                 Active    Minor
  2. To view the details of a specific fault, including description, cause, and recommended action to take, use the show fault command with the specific fault ID.

    PCA-ADMIN> show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
    Command: show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
    Status: Success
    Time: 2023-03-07 15:36:19,414 UTC
    Data:
      Id = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
      Type = Fault
      Category = Internal
      Severity = Minor
      Status = Active
      Last Update Time = 2023-03-06 20:04:11,668 UTC
      Message Id = AK-8003-F9
      Time Reported = Mon Mar 06 2023 16:50:24 GMT+0000 (UTC)
      Action = Check the networking cable, switch port, and switch configuration. Contact your vendor for support
               if the network port remains inexplicably down. Please refer to the associated reference document at
               http://support.oracle.com/msg/AK-8003-F9 for the latest service procedures and policies regarding 
               this diagnosis.
      Health Exporter = zfssa-analytics-exportersn022147XLF014
      uuid = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
      Diagnosing Source = zfssa_analytics_exporter
      FaultHistoryLogIds 1 = id:fdfaa42f-de8d-4622-a9df-ea229b7bad6f  type:FaultHistoryLog  name:
      BaseManagedObjectId = id:2147XLF015/PCIe 3/465774J-2121701684  type:HardwareComponent  name:
      Description = Network connectivity via port mlxne4 has been lost.
      Name = AK-8003-F9--PCIe 3
      Work State = Normal

Additional examples of using the Service CLI to show Admin service faults are shown in Compute Node CPU and Memory Utilization Faults.

Compute Node CPU and Memory Utilization Faults

The Admin service raises faults for the percent of memory used and percent of CPU used for a ComputeNode object. More severe faults are raised as more memory and CPU are used. When the percent used drops below a certain percentage, any faults are cleared.

These are utilization faults (CPU and memory usage), not hardware faults. Problems with CPU and memory hardware are reported through ASR.

CPU Usage

The following table shows the default percent of compute node CPU usage that raises different severities of faults.

CPU Percentage   Fault Severity   Fault State
< 0.75           Not applicable   Cleared
>= 0.75          Warning          Active
>= 0.80          Minor            Active
>= 0.90          Major            Active
>= 0.95          Critical         Active

Memory Usage

The following table shows the default percent of compute node memory usage that raises different severities of faults.

Memory Percentage   Fault Severity   Fault State
< 0.75              Not applicable   Cleared
>= 0.75             Warning          Active
>= 0.80             Minor            Active
>= 0.90             Major            Active
>= 0.95             Critical         Active
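The threshold logic in the two tables above can be sketched as a simple mapping. This is a minimal illustration only, not Admin service code; the threshold values match the defaults reported by the show cnUpdateManager command.

```python
def utilization_severity(usage: float) -> str:
    """Map a CPU or memory usage fraction to the fault severity the
    Admin service raises, using the default thresholds."""
    if usage >= 0.95:
        return "Critical"
    if usage >= 0.90:
        return "Major"
    if usage >= 0.80:
        return "Minor"
    if usage >= 0.75:
        return "Warning"
    return "Cleared"  # below 0.75, any existing fault is cleared


print(utilization_severity(0.85))  # Minor
```

Note that the thresholds are checked from most to least severe, so a single usage value maps to exactly one severity.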

Using the Service CLI to View Compute Node Faults

To view the default fault trigger settings for compute node CPU and memory usage using the Service CLI, use the show cnUpdateManager command:

PCA-ADMIN> show cnUpdateManager
Command: show cnUpdateManager
Status: Success
Time: 2023-03-06 23:41:37,249 UTC
Data:
  Id = caaaaaa1-a076-4e48-94b5-7bdcd4e0c42c
  Type = CnUpdateManager
  LastRunTime = 2023-03-06 23:41:33,676 UTC
  Poll Interval (sec) = 60
  The minimum CPU usage percentage to trigger a critical fault = 0.95
  The minimum CPU usage percentage to trigger a major fault = 0.9
  The minimum CPU usage percentage to trigger a minor fault = 0.8
  The minimum CPU usage percentage to trigger a warning = 0.75
  The minimum memory usage percentage to trigger a critical fault = 0.95
  The minimum memory usage percentage to trigger a major fault = 0.9
  The minimum memory usage percentage to trigger a minor fault = 0.8
  The minimum memory usage percentage to trigger a warning = 0.75

To view the list of all faults and the details of a specific fault, see Viewing Admin Service Health Data. The following example shows a specific compute node fault. The fault details do not show current usage; you can infer only that usage is at or above the minor fault threshold and below the major fault threshold. To see current usage, use the Service Web UI.

PCA-ADMIN> show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Command: show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Status: Success
Time: 2023-03-07 15:40:50,917 UTC
Data:
  Id = 76805c56-fcf6-48a2-b4fd-ffa77570e83c
  Type = Fault
  Category = Status
  Severity = Minor
  Status = Active
  Associated Attribute = cpuFault
  Last Update Time = 2023-03-04 01:06:25,666 UTC
  Cause = ComputeNode pcacn002 attribute cpuFault = MINOR.
  FaultHistoryLogIds 1 = id:79b44c26-cb4e-4bec-a58c-6efc7fc63fed  type:FaultHistoryLog  name:
  FaultHistoryLogIds 2 = id:fc90a99a-031b-457f-b585-5c905e61362e  type:FaultHistoryLog  name:
  FaultHistoryLogIds 3 = id:48068f78-1328-447d-9506-efb6f22d154d  type:FaultHistoryLog  name:
  FaultHistoryLogIds 4 = id:d97c5819-923c-480d-8f61-2341c8403182  type:FaultHistoryLog  name:
  FaultHistoryLogIds 5 = id:18cdd005-53c0-488c-a2df-28f2da3b1092  type:FaultHistoryLog  name:
  FaultHistoryLogIds 6 = id:bfe1ffcd-5899-4400-914c-b467d8671e0c  type:FaultHistoryLog  name:
  FaultHistoryLogIds 7 = id:459fa55b-8654-4c07-8ae7-6d0ef011e3b1  type:FaultHistoryLog  name:
  FaultHistoryLogIds 8 = id:b9c8a909-f8ea-4de6-9bfe-2516e7addf73  type:FaultHistoryLog  name:
  FaultHistoryLogIds 9 = id:6ab5d1ca-3659-49a7-8e68-946bbbeccc9f  type:FaultHistoryLog  name:
  FaultHistoryLogIds 10 = id:d04d06a1-1e2c-404c-ac67-680e0deb34c5  type:FaultHistoryLog  name:
  FaultHistoryLogIds 11 = id:22dd163e-528f-4346-b177-d62c7ceb9885  type:FaultHistoryLog  name:
  FaultHistoryLogIds 12 = id:cdb2dbf5-6999-43c2-bb5f-17192bfad3e2  type:FaultHistoryLog  name:
  FaultHistoryLogIds 13 = id:aa7b2e43-ab0b-4d78-bfe7-d4b0dd0fec4a  type:FaultHistoryLog  name:
  BaseManagedObjectId = id:0dd96e90-de00-4fa0-82e3-16937e4601f8  type:ComputeNode  name:
  Description = ComputeNode pcacn002 attribute cpuFault = MINOR.
  Name = ComputeNodeCpuFaultStatusFault(pcacn002)
  Work State = Normal

Storage Utilization Faults

The following table describes the two kinds of Oracle ZFS Storage Appliance faults raised in the Admin service.

These are utilization faults (ZFS pool usage), not hardware faults. Problems with ZFS hardware are reported through ASR.

Private Cloud Appliance uses Prometheus metric data collected for ZFS Storage Appliance to report pool usage. Total pool size per pool (zfssa_pool_total) and free space per pool (zfssa_pool_free) are used to calculate pool usage percentage. The zfssa_pool_status metric reports the health of a pool.

Metric Name: zfssa_pool_total, zfssa_pool_free
    Metric Value Description: Pool usage percentage is calculated for each pool using the following formula: (zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total
    Fault Condition: If the pool usage percentage is above a pre-configured value, a major fault is raised. The default value is 80 percent.

Metric Name: zfssa_pool_status
    Metric Value Description: The zfssa_pool_status metric can have the following values:
      • 0 - exported
      • 1 - degraded
      • 2 - online
      • -1 - offline
      • -2 - faulted
      • -3 - unavailable
      • -4 - removed
    Fault Condition: A major fault is raised for any pool/zfssa_node combination that has any pool status value other than 0 or 2.
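The pool usage formula and the two fault conditions above can be sketched as follows. This is an illustrative sketch only, not Admin service code; the function names are hypothetical, and the 0.80 default threshold is the documented default usage percentage.

```python
def pool_usage_fraction(total: float, free: float) -> float:
    """Pool usage per the formula (zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total."""
    return (total - free) / total


def pool_faults(total: float, free: float, status: int, threshold: float = 0.80) -> list:
    """Return the major-fault reasons for one pool/zfssa_node combination."""
    faults = []
    if pool_usage_fraction(total, free) > threshold:
        faults.append("usage above threshold")
    if status not in (0, 2):  # only exported (0) and online (2) are healthy
        faults.append("unhealthy pool status")
    return faults


# A pool that is 90% full and degraded (status 1) raises both major faults.
print(pool_faults(total=100.0, free=10.0, status=1))
```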

Hardware Run State Faults

A critical or major fault is raised if a hardware unit on the rack such as a management node, compute node, storage node, or switch has an invalid run state.

The following table shows the severity of the fault raised for each run state. Any run state other than those listed clears any existing fault.

Run State Value (case insensitive)   Fault Severity   Fault State
UNABLE TO CONNECT TO ILOM            Critical         Active
FAIL                                 Critical         Active
SERVICE REQUIRED                     Major            Active
Any other run state                  Not applicable   Cleared
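The run-state table above amounts to a small case-insensitive lookup. This is an illustrative sketch only, not Admin service code; the function name is hypothetical.

```python
# Severity raised for each invalid hardware run state; matching is
# case insensitive. Any other run state clears an existing fault.
RUN_STATE_SEVERITY = {
    "UNABLE TO CONNECT TO ILOM": "Critical",
    "FAIL": "Critical",
    "SERVICE REQUIRED": "Major",
}


def run_state_fault(run_state):
    """Return the fault severity for a run state, or None if any
    existing fault should be cleared instead."""
    return RUN_STATE_SEVERITY.get(run_state.upper())


print(run_state_fault("Service Required"))  # Major
```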

Health Checker Notification Faults

Health Checker faults are raised from notifications sent by the ZFSSA and Network Health Checker components. The Admin service raises a fault for every notification it receives.

Following are example attributes of the faultedComponents object in the Network Health Checker component fault data:

"class": "cisco.fan.fail",
"severity": "Major",
"description": "Fan module has failed and needs to be replaced. This can lead to overheating and temperature alarms.",
...
"class": "cisco.power.fail",
"severity": "Major",
"description": "Power Supply has failed or has been shutdown",

Following are example attributes of the faultedComponents object in the ZFSSA Health Checker component fault data:

"severity":"Major",
"type":"Fault",
"description":"An internal power supply failure has been detected.",

Detailed information is provided about the part that has failed.

An action attribute contains a brief description of what to do to fix the problem and might include a link to the appropriate support document.

Manually Clearing Faults

This section describes how to manually clear faults using the Service CLI. You cannot manually clear faults using the Service Web UI.

Using the Service CLI

  1. Using SSH, log into the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the list fault command to find the list of fault identifications.

    PCA-ADMIN> list fault
    Command: list fault
    Status: Success
    Time: 2024-01-31 21:38:05,472 UTC
    Data:
    id                                 Name                       Status Severity
--                                 ----                       ------ --------
    71671228-.….….-56a6a58947c6a6789   pcamn02-example            Active Critical 
    524cb805-.….….-acc3458bb79t04295   RackUnit-example           Active Major
    PCA-ADMIN> 
  3. Use the clearFault command with the fault identifier to clear the fault.

PCA-ADMIN> clearFault id=524cb805-.….….-acc3458bb79t04295
    Command: clearFault
    Status: Success
    Time: 2024-01-31 21:39:30,094 UTC
    PCA-ADMIN>

    Note:

    You can verify the clear fault result by using another list fault command.
    PCA-ADMIN> list fault
    Command: list fault
    Status: Success
    Time: 2024-01-31 21:40:02,685 UTC
    Data:
    id                                 Name                       Status Severity
--                                 ----                       ------ --------
    71671228-.….….-56a6a58947c6a6789   pcamn02-example            Active Critical 
    PCA-ADMIN>