Admin Service Health Data

The Admin service in Roving Edge generates health metrics and raises faults when certain conditions occur.

This health information does not cover hardware faults. It describes resource utilization (CPU, memory, and storage), hardware run state, and health checker notifications. The hardware faults listed at the bottom of the Admin service fault detection summary table are reported through ASR.

Admin Service Fault Detection Configuration Summary

Fault Type: CPU and memory utilization
  Detection Frequency: 60 sec
  Detection Delay: < 20 sec
  Data Source: Admin calls ComputeNode service
  Method of Detection: Faults are raised by fault task based on the compute node object attributes stored in the database.

Fault Type: Storage utilization
  Detection Frequency: 120 sec
  Detection Delay: < 20 sec
  Data Source: Admin calls Prometheus service
  Method of Detection: Faults are raised by fault task based on Prometheus ZFS pool usage and status data stored in the database.

Fault Type: Hardware run state
  Detection Frequency: 150 sec
  Detection Delay: < 20 sec
  Data Source: Admin calls Hardware list REST API
  Method of Detection: Faults are raised by fault task based on hardware component node/ILOM run states stored in the database.

Fault Type: Health checker notification
  Detection Frequency: Defined by the ZFS/Network health checker notification frequency
  Detection Delay: 0 sec
  Data Source: Various HealthChecker services send notifications
  Method of Detection: Faults are created based on the RabbitMQ notification fault results.

Fault Type: Platform ILOM faults
  Detection Frequency: 150 sec
  Detection Delay: 0 sec
  Data Source: Admin calls Hardware getMgmt and getCompute ILOM Health REST APIs
  Method of Detection: Faults are created based on the L1 API results for ILOM object data. For more information about ASR actionable events, see this support note: Doc ID 2833567.2.

Fault Type: Hardware status
  Detection Frequency: On initialization, and when the syncHardwareData command runs
  Detection Delay: < 20 sec
  Data Source: Admin calls Hardware list REST API
  Method of Detection: Faults are raised by fault task based on the PcaSystem object attribute. For more information about ASR actionable events, see this support note: Doc ID 2833567.2.

Viewing Admin Service Faults

Using the Service Web UI
  1. Click the Active Faults link at the top of the Service Web UI Home page, or click Faults on the Navigation menu.

    The Faults page is displayed.
  2. At the top of the Faults page, you can toggle whether to list all faults or only active faults.

  3. For more information about a fault, click the name of the fault, or click View Details on the Actions menu.

    The details page shows the description, cause, and recommended action to take.

Using the Service CLI
  1. To view the list of Admin service faults, use the list fault command.

    Both active and cleared faults are listed.

    PCA-ADMIN> list fault
    Data:
      id                                     name                                               status    severity
      --                                     ----                                               ------    --------
      33c61b8a-dcc7-4b8f-bc0f-56915ecc62f5   RackUnitIlomRunStateFaultStatusFault(pcacn005)     Cleared   Critical
      f7d22180-aeae-4159-b5c8-5e55a7906a78   RackUnitIlomRunStateFaultStatusFault(pcacn004)     Cleared   Critical
      a4fef907-8e54-4750-9fac-6829fbade90d   ComputeNodeCpuFaultStatusFault(pcacn006)           Cleared   Minor
      f8d93384-da30-43cd-9396-6e6671d240e2   RackUnitIlomRunStateFaultStatusFault(pcacn010)     Cleared   Critical
      8e61bb81-7a02-4c26-8ef4-c13b198f64da   ComputeNodeCpuFaultStatusFault(pcacn007)           Cleared   Warning
      3216b6f9-326b-4992-99a3-ab23cb18243b   AK-8003-F9--PCIe 3                                 Active    Minor
      ef3fb25b-0573-4524-8d1c-fb704c814446   AK-8003-HF--vnic1                                  Active    Major
      f830cd46-21ff-4d74-ba81-c82fd6f52c67   ComputeNodeCpuFaultStatusFault(pcacn005)           Cleared   Minor
      d2e71da0-ba63-4983-97da-24033d5c6447   ZfsPoolUsageFaultStatusFault(PCA_POOL)             Cleared   Major
      eecd5ef2-4a71-4137-be96-54c028212d2f   ComputeNodeMemoryFaultStatusFault(pcacn004)        Cleared   Minor
      cf68d2ee-e483-e573-b46e-c31bcbc8e968   ISTOR-8000-1S--ORACLE SERVER E5-2L                 Cleared   Major
      0686c11d-b96b-e5aa-dfbe-a20154da4794   SPAMD-8002-FJ--ORACLE SERVER E5-2L                 Cleared   Major
      b488a45a-80df-46e3-b0b5-a35527eb9c0e   AK-8003-F9--PCIe 10                                Active    Minor
      ac48f88d-e181-4b03-b620-6bfbf4ad95ef   RackUnitIlomRunStateFaultStatusFault(pcacn007)     Cleared   Critical
      b4c66a7c-def3-42c2-8842-d4763afc5184   RackUnitIlomRunStateFaultStatusFault(pcacn006)     Cleared   Critical
      9fc2e45a-1cff-4f95-828d-58742c8ce12f   ComputeNodeMemoryFaultStatusFault(pcacn002)        Active    Minor
      c0124122-a91c-4110-89cc-deebe54de7ba   ComputeNodeMemoryFaultStatusFault(pcacn006)        Cleared   Critical
      ca26ed46-4d1c-4ade-9e74-af27d94cf8f4   AK-8003-HF--vnic2                                  Active    Major
      58e9ab5d-d4e7-4d94-9ca6-e85a1c88b3b8   RackUnitRunStateFaultStatusFault(sn022147XLF014)   Cleared   Critical
      474c269f-4018-45d7-97d5-da17c9c845f4   RackUnitIlomRunStateFaultStatusFault(pcacn001)     Cleared   Critical
      2b5ece1c-50fc-436a-81b3-da0c5b418fe3   RackUnitIlomRunStateFaultStatusFault(pcacn003)     Cleared   Critical
      1c164eb9-9a76-4592-8ab6-150edb8f7a75   ComputeNodeCpuFaultStatusFault(pcacn001)           Cleared   Warning
      55ed1494-6aac-4248-91cb-9ac8295d668c   AK-8003-HF--PCIe 6                                 Active    Major
      afbcc080-0b93-434b-8ead-fa673f302170   AK-8003-F9--PCIe 6                                 Active    Minor
      8b36c2db-a3b4-41c8-b416-8e733ace3aeb   PcaSystemReSyncHwStatusStatusFault(null)           Cleared   Warning
      28c5ba93-6b4e-42f3-ad61-90734b46bf30   SPENV-8000-RU--ORACLE SERVER E5-2L                 Cleared   Critical
      3d932188-0120-489f-a512-1a244ec01e49   RackUnitIlomRunStateFaultStatusFault(pcacn009)     Cleared   Critical
      21e6faa9-68e1-47ae-a298-e2cb14d2a406   ComputeNodeMemoryFaultStatusFault(pcacn007)        Cleared   Minor
      db023304-fb7a-613b-ad9b-e277b7ce5675   SPENV-8000-A7--ORACLE SERVER E5-2L                 Cleared   Major
      63839bf5-335b-48ff-86a0-9e981e3e9902   RackUnitRunStateFaultStatusFault(sn012147XLF014)   Cleared   Critical
      2e851c6e-aa29-4a25-846a-29b08967dd95   RackUnitValidationStateStatusFault(pcacn008)       Cleared   Major
      76805c56-fcf6-48a2-b4fd-ffa77570e83c   ComputeNodeCpuFaultStatusFault(pcacn002)           Active    Minor
      9be74faf-df4d-ea20-cfc1-92b2a6a01b06   SPENV-8000-A7--ORACLE SERVER E5-2L                 Cleared   Major
      1624064f-d380-4ffc-9000-d293c185d7ac   ComputeNodeCpuFaultStatusFault(pcacn003)           Cleared   Warning
      7ca3f7af-f0bd-45d9-bad7-15794d49e7c6   RackUnitIlomRunStateFaultStatusFault(pcacn008)     Cleared   Critical
      3e7a3503-7a71-4ef1-a3ad-fba2162571ab   ComputeNodeCpuFaultStatusFault(pcacn004)           Cleared   Warning
      0922cd8e-297e-4356-b736-b09ac382b28b   AK-8003-F9--PCIe 10                                Active    Minor
      ab44ad2c-1105-417d-aa47-e8cb477ef0ec   AK-8003-F9--PCIe 3                                 Active    Minor
  2. To view the details of a specific fault, including description, cause, and recommended action to take, use the show fault command with the specific fault ID.

    PCA-ADMIN> show fault id=ab44ad2c-1105-417d-aa47-e8cb477ef0ec
    Data:
      Id = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
      Type = Fault
      Category = Internal
      Severity = Minor
      Status = Active
      Last Update Time = 2023-03-06 20:04:11,668 UTC
      Message Id = AK-8003-F9
      Time Reported = Mon Mar 06 2023 16:50:24 GMT+0000 (UTC)
      Action = Check the networking cable, switch port, and switch configuration. Contact your vendor for support
               if the network port remains inexplicably down. Please refer to the associated reference document at
               http://support.oracle.com/msg/AK-8003-F9 for the latest service procedures and policies regarding 
               this diagnosis.
      Health Exporter = zfssa-analytics-exportersn022147XLF014
      uuid = ab44ad2c-1105-417d-aa47-e8cb477ef0ec
      Diagnosing Source = zfssa_analytics_exporter
      FaultHistoryLogIds 1 = id:fdfaa42f-de8d-4622-a9df-ea229b7bad6f  type:FaultHistoryLog  name:
      BaseManagedObjectId = id:2147XLF015/PCIe 3/465774J-2121701684  type:HardwareComponent  name:
      Description = Network connectivity via port mlxne4 has been lost.
      Name = AK-8003-F9--PCIe 3
      Work State = Normal
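
When the fault list is long, it can help to filter it outside the CLI. The following is a minimal sketch, assuming the list fault output has been captured as text and that columns are separated by at least two spaces, as in the example above. The function names parse_fault_list and active_faults are hypothetical helpers, not part of the Service CLI.

```python
import re

# A few rows copied from the list fault example above.
SAMPLE = """\
Data:
  id                                     name                                               status    severity
  --                                     ----                                               ------    --------
  3216b6f9-326b-4992-99a3-ab23cb18243b   AK-8003-F9--PCIe 3                                 Active    Minor
  f830cd46-21ff-4d74-ba81-c82fd6f52c67   ComputeNodeCpuFaultStatusFault(pcacn005)           Cleared   Minor
  ef3fb25b-0573-4524-8d1c-fb704c814446   AK-8003-HF--vnic1                                  Active    Major
"""

# A data row starts with a UUID. Fault names can contain single spaces,
# so columns are split only on runs of two or more spaces.
ROW = re.compile(
    r"^\s*(?P<id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s{2,}"
    r"(?P<name>.+?)\s{2,}(?P<status>\S+)\s{2,}(?P<severity>\S+)\s*$"
)


def parse_fault_list(text):
    """Return one dict per fault row; header and separator lines are skipped."""
    faults = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            faults.append(match.groupdict())
    return faults


def active_faults(faults):
    """Keep only the faults whose status column reads Active."""
    return [fault for fault in faults if fault["status"] == "Active"]


if __name__ == "__main__":
    for fault in active_faults(parse_fault_list(SAMPLE)):
        print(fault["id"], fault["severity"], fault["name"])
```

The id value of any row can then be passed to the show fault command described in step 2.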

Viewing CPU and Memory Utilization Faults

The Admin service raises faults based on the percentage of memory used and the percentage of CPU used for a ComputeNode object. Fault severity increases as more memory and CPU are used. When usage drops below the warning threshold, the fault is cleared.

These are utilization faults (CPU and memory usage), not hardware faults. Problems with CPU and memory hardware are reported through ASR.

CPU Usage

The following table shows the default compute node CPU usage thresholds that raise faults of each severity. Values are expressed as decimal fractions; for example, 0.75 means 75 percent.

CPU Percentage   Fault Severity   Fault State
--------------   --------------   -----------
< 0.75           Not applicable   Cleared
>= 0.75          Warning          Active
>= 0.80          Minor            Active
>= 0.90          Major            Active
>= 0.95          Critical         Active

Memory Usage

The following table shows the default compute node memory usage thresholds that raise faults of each severity. Values are expressed as decimal fractions; for example, 0.75 means 75 percent.

Memory Percentage   Fault Severity   Fault State
-----------------   --------------   -----------
< 0.75              Not applicable   Cleared
>= 0.75             Warning          Active
>= 0.80             Minor            Active
>= 0.90             Major            Active
>= 0.95             Critical         Active
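
The CPU and memory tables above can be read as a simple threshold lookup. The following sketch is illustrative only and is not the Admin service implementation. It assumes that the highest threshold met determines the severity, and the names DEFAULT_THRESHOLDS and classify_usage are hypothetical.

```python
# Default thresholds from the tables above, highest first,
# expressed as fractions (0.75 means 75 percent).
DEFAULT_THRESHOLDS = [
    (0.95, "Critical"),
    (0.90, "Major"),
    (0.80, "Minor"),
    (0.75, "Warning"),
]


def classify_usage(usage, thresholds=DEFAULT_THRESHOLDS):
    """Map a usage fraction to a (severity, state) pair.

    Assumption: the highest threshold that usage meets or exceeds wins.
    Below the lowest threshold there is no severity and the fault is cleared.
    """
    for minimum, severity in thresholds:
        if usage >= minimum:
            return severity, "Active"
    return None, "Cleared"


if __name__ == "__main__":
    for sample in (0.50, 0.75, 0.85, 0.92, 0.99):
        print(sample, classify_usage(sample))
```

Because the same default values apply to both resources, one function covers CPU and memory; a different threshold list can be passed in if the defaults do not apply.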

Viewing Compute Node Faults in the Service CLI

To view the compute node CPU and memory usage default fault trigger settings, use the show cnUpdateManager command:

PCA-ADMIN> show cnUpdateManager
Data:
  Id = caaaaaa1-a076-4e48-94b5-7bdcd4e0c42c
  Type = CnUpdateManager
  LastRunTime = 2023-03-06 23:41:33,676 UTC
  Poll Interval (sec) = 60
  The minimum CPU usage percentage to trigger a critical fault = 0.95
  The minimum CPU usage percentage to trigger a major fault = 0.9
  The minimum CPU usage percentage to trigger a minor fault = 0.8
  The minimum CPU usage percentage to trigger a warning = 0.75
  The minimum memory usage percentage to trigger a critical fault = 0.95
  The minimum memory usage percentage to trigger a major fault = 0.9
  The minimum memory usage percentage to trigger a minor fault = 0.8
  The minimum memory usage percentage to trigger a warning = 0.75
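
If the thresholds are needed in a script, they can be read from captured output. The following is a rough sketch, assuming the attribute labels appear exactly as in the example above; parse_thresholds is a hypothetical helper and not a Service CLI feature.

```python
import re

# Lines copied from the show cnUpdateManager example above.
SAMPLE = """\
  Poll Interval (sec) = 60
  The minimum CPU usage percentage to trigger a critical fault = 0.95
  The minimum CPU usage percentage to trigger a major fault = 0.9
  The minimum CPU usage percentage to trigger a minor fault = 0.8
  The minimum CPU usage percentage to trigger a warning = 0.75
  The minimum memory usage percentage to trigger a critical fault = 0.95
  The minimum memory usage percentage to trigger a major fault = 0.9
  The minimum memory usage percentage to trigger a minor fault = 0.8
  The minimum memory usage percentage to trigger a warning = 0.75
"""

# Matches the threshold lines only; other attributes are ignored.
THRESHOLD_LINE = re.compile(
    r"The minimum (?P<resource>CPU|memory) usage percentage to trigger a "
    r"(?P<level>critical|major|minor|warning)(?: fault)? = (?P<value>[0-9.]+)"
)


def parse_thresholds(text):
    """Return {"cpu": {level: value}, "memory": {level: value}}."""
    thresholds = {}
    for match in THRESHOLD_LINE.finditer(text):
        resource = match["resource"].lower()
        thresholds.setdefault(resource, {})[match["level"]] = float(match["value"])
    return thresholds


if __name__ == "__main__":
    print(parse_thresholds(SAMPLE))
```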

To view the list of all faults and the details of a specific fault, see Viewing Admin Service Faults. The following example shows a specific compute node fault. The fault details do not show current usage. The Minor severity indicates only that usage is at least the minor fault threshold and less than the major fault threshold. To see current usage, use the Service Web UI.

PCA-ADMIN> show fault id=76805c56-fcf6-48a2-b4fd-ffa77570e83c
Data:
  Id = 76805c56-fcf6-48a2-b4fd-ffa77570e83c
  Type = Fault
  Category = Status
  Severity = Minor
  Status = Active
  Associated Attribute = cpuFault
  Last Update Time = 2023-03-04 01:06:25,666 UTC
  Cause = ComputeNode pcacn002 attribute cpuFault = MINOR.
  FaultHistoryLogIds 1 = id:79b44c26-cb4e-4bec-a58c-6efc7fc63fed  type:FaultHistoryLog  name:
  FaultHistoryLogIds 2 = id:fc90a99a-031b-457f-b585-5c905e61362e  type:FaultHistoryLog  name:
  FaultHistoryLogIds 3 = id:48068f78-1328-447d-9506-efb6f22d154d  type:FaultHistoryLog  name:
  FaultHistoryLogIds 4 = id:d97c5819-923c-480d-8f61-2341c8403182  type:FaultHistoryLog  name:
  FaultHistoryLogIds 5 = id:18cdd005-53c0-488c-a2df-28f2da3b1092  type:FaultHistoryLog  name:
  FaultHistoryLogIds 6 = id:bfe1ffcd-5899-4400-914c-b467d8671e0c  type:FaultHistoryLog  name:
  FaultHistoryLogIds 7 = id:459fa55b-8654-4c07-8ae7-6d0ef011e3b1  type:FaultHistoryLog  name:
  FaultHistoryLogIds 8 = id:b9c8a909-f8ea-4de6-9bfe-2516e7addf73  type:FaultHistoryLog  name:
  FaultHistoryLogIds 9 = id:6ab5d1ca-3659-49a7-8e68-946bbbeccc9f  type:FaultHistoryLog  name:
  FaultHistoryLogIds 10 = id:d04d06a1-1e2c-404c-ac67-680e0deb34c5  type:FaultHistoryLog  name:
  FaultHistoryLogIds 11 = id:22dd163e-528f-4346-b177-d62c7ceb9885  type:FaultHistoryLog  name:
  FaultHistoryLogIds 12 = id:cdb2dbf5-6999-43c2-bb5f-17192bfad3e2  type:FaultHistoryLog  name:
  FaultHistoryLogIds 13 = id:aa7b2e43-ab0b-4d78-bfe7-d4b0dd0fec4a  type:FaultHistoryLog  name:
  BaseManagedObjectId = id:0dd96e90-de00-4fa0-82e3-16937e4601f8  type:ComputeNode  name:
  Description = ComputeNode pcacn002 attribute cpuFault = MINOR.
  Name = ComputeNodeCpuFaultStatusFault(pcacn002)
  Work State = Normal
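
The show fault output is a list of "name = value" attributes, which makes it straightforward to load into a dictionary. The following is a minimal sketch under two assumptions: attribute lines all share the same indentation, and more deeply indented lines (such as the wrapped Action text in the earlier show fault example) continue the previous value. The helper names are hypothetical.

```python
# Lines copied from the show fault example above (abbreviated).
SAMPLE = """\
Data:
  Id = 76805c56-fcf6-48a2-b4fd-ffa77570e83c
  Type = Fault
  Severity = Minor
  Status = Active
  Associated Attribute = cpuFault
  Cause = ComputeNode pcacn002 attribute cpuFault = MINOR.
  FaultHistoryLogIds 1 = id:79b44c26-cb4e-4bec-a58c-6efc7fc63fed  type:FaultHistoryLog  name:
  FaultHistoryLogIds 2 = id:fc90a99a-031b-457f-b585-5c905e61362e  type:FaultHistoryLog  name:
  Name = ComputeNodeCpuFaultStatusFault(pcacn002)
"""


def parse_show_output(text):
    """Parse 'name = value' lines into a dict, joining wrapped values."""
    fields = {}
    key_indent = None  # indentation of the first attribute line
    last_key = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "Data:":
            continue
        indent = len(line) - len(line.lstrip())
        # Split on the first " = " only, so values may contain " = " themselves.
        key, sep, value = stripped.partition(" = ")
        if key_indent is None and sep:
            key_indent = indent
        if sep and indent == key_indent:
            last_key = key
            fields[last_key] = value
        elif last_key is not None:
            # Deeper indentation or no separator: continuation of the last value.
            fields[last_key] += " " + stripped
    return fields


def history_log_ids(fields):
    """Collect the id part of every FaultHistoryLogIds attribute."""
    return [
        value.split()[0].split(":", 1)[1]
        for key, value in fields.items()
        if key.startswith("FaultHistoryLogIds")
    ]


if __name__ == "__main__":
    fault = parse_show_output(SAMPLE)
    print(fault["Severity"], fault["Associated Attribute"])
    print(history_log_ids(fault))
```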

Storage Utilization Faults

The following table describes the two kinds of ZFS Storage Appliance faults raised in the Admin service. These are utilization faults (ZFS pool usage), not hardware faults. Problems with ZFS hardware are reported through ASR.

Prometheus matrix data collected for the ZFS Storage Appliance is used to report pool usage. Total pool size per pool (zfssa_pool_total) and free space per pool (zfssa_pool_free) are used to calculate pool usage percentage. The zfssa_pool_status metric reports the health of a pool.

Metric Name: zfssa_pool_total, zfssa_pool_free
  Metric Value Description: Pool usage percentage is calculated using the following formula for each pool:

    (zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total

  Fault Condition: If the pool usage percentage is above a pre-configured value, a major fault is raised. The default value is 80 percent.

Metric Name: zfssa_pool_status
  Metric Value Description: The zfssa_pool_status metric can have the following values:

    • 0 - exported
    • 1 - degraded
    • 2 - online
    • -1 - offline
    • -2 - faulted
    • -3 - unavailable
    • -4 - removed

  Fault Condition: A major fault is raised for any pool/zfssa_node combination that has any pool status value other than 0 or 2.
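
The two fault conditions above can be worked through with a short calculation. The following sketch is illustrative and is not the Admin service code. It assumes the metric values have already been retrieved as numbers, and it treats "above" as strictly greater than the threshold. The names pool_usage and pool_faults are hypothetical.

```python
# zfssa_pool_status values as listed above.
POOL_STATUS = {
    0: "exported",
    1: "degraded",
    2: "online",
    -1: "offline",
    -2: "faulted",
    -3: "unavailable",
    -4: "removed",
}
NON_FAULT_STATUS = {0, 2}        # exported and online do not raise a fault
DEFAULT_USAGE_THRESHOLD = 0.80   # default of 80 percent, as a fraction


def pool_usage(total, free):
    """(zfssa_pool_total - zfssa_pool_free) / zfssa_pool_total, as a fraction."""
    if total <= 0:
        raise ValueError("pool total must be positive")
    return (total - free) / total


def pool_faults(pool, total, free, status, threshold=DEFAULT_USAGE_THRESHOLD):
    """Return a description of each major fault condition the pool meets."""
    faults = []
    if pool_usage(total, free) > threshold:
        faults.append(f"{pool}: usage above {threshold:.0%}")
    if status not in NON_FAULT_STATUS:
        faults.append(f"{pool}: status {POOL_STATUS.get(status, 'unknown')}")
    return faults


if __name__ == "__main__":
    # A pool with 100 units total and 15 free is 85 percent used.
    print(pool_faults("PCA_POOL", total=100, free=15, status=2))
    print(pool_faults("PCA_POOL", total=100, free=50, status=1))
```

For example, a pool with a total of 100 units and 15 units free has a usage of (100 - 15) / 100 = 0.85, which is above the 80 percent default and would meet the usage fault condition.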

Health Checker Notification Faults

Health Checker faults are raised in response to notifications from the ZFS Storage Appliance and Network Health Checker components. The Admin service raises a fault for every notification it receives.

Following are example attributes of the faultedComponents object in the Network Health Checker component fault data:

"class": "cisco.fan.fail",
"severity": "Major",
"description": "Fan module has failed and needs to be replaced. This can lead to overheating and temperature alarms.",
[...]
"class": "cisco.power.fail",
"severity": "Major",
"description": "Power Supply has failed or has been shutdown",

Following are example attributes of the faultedComponents object in the ZFS Storage Appliance Health Checker component fault data:

"severity":"Major",
"type":"Fault",
"description":"An internal power supply failure has been detected.",

Detailed information is provided about the part that has failed.

An action attribute contains a brief description of what to do to fix the problem and might include a link to the appropriate support document.
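
The attributes shown above can be summarized with a few lines of code once a notification is available as JSON. The following sketch rests on an assumption that is not confirmed by this document: that faultedComponents is a JSON array of objects at the top level of the notification. Only the attribute names and values are taken from the examples above; the envelope and the summarize_components helper are hypothetical.

```python
import json

# Hypothetical notification envelope built from the Network Health Checker
# attributes shown above. The real message layout may differ.
PAYLOAD = """
{
  "faultedComponents": [
    {
      "class": "cisco.fan.fail",
      "severity": "Major",
      "description": "Fan module has failed and needs to be replaced. This can lead to overheating and temperature alarms."
    },
    {
      "class": "cisco.power.fail",
      "severity": "Major",
      "description": "Power Supply has failed or has been shutdown"
    }
  ]
}
"""


def summarize_components(payload):
    """Return (identifier, severity, description) for each faulted component.

    The Network Health Checker examples carry a "class" attribute and the
    ZFS Storage Appliance example carries "type", so either is accepted.
    """
    data = json.loads(payload)
    summary = []
    for component in data.get("faultedComponents", []):
        identifier = component.get("class", component.get("type", "unknown"))
        summary.append((identifier, component["severity"], component["description"]))
    return summary


if __name__ == "__main__":
    for identifier, severity, description in summarize_components(PAYLOAD):
        print(f"[{severity}] {identifier}: {description}")
```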

Manually Clearing Faults

Faults can be manually cleared using the Service CLI. You cannot clear faults using the Service Web UI.

  1. Using SSH, log in to the management node VIP as admin.

    # ssh -l admin 100.96.2.32 -p 30006
  2. Use the list fault command to find the ID of the fault that you want to clear.

    PCA-ADMIN> list fault
    Data:
    id                             Name               Status   Severity
    --                             ----               ------   --------
    71671228...56a6a58947c6a6789   pcamn02-example    Active   Critical
    524cb805...acc3458bb79t04295   RackUnit-example   Active   Major
  3. Use the clearFault command with the fault identifier to clear the fault.

    Note

    To verify that the fault was cleared, run the list fault command again.

    PCA-ADMIN> clearFault id=[524cb805...acc3458bb79t04295]
    Status: Success
    
    PCA-ADMIN> list fault
    Data:
    id                             Name               Status   Severity
    --                             ----               ------   --------
    71671228...56a6a58947c6a6789   pcamn02-example    Active   Critical