2.9 Health Monitoring

The Oracle Private Cloud Appliance Controller Software contains a monitoring service, which is started and stopped with the ovca service on the active management node. When the system runs for the first time, it creates an inventory database and a monitor database. Once these are set up and the monitoring service is active, health information about the hardware components is updated continuously.

The inventory database is populated with information about the various components installed in the rack, including the IP addresses to be used for monitoring. With this information, the ping manager pings all known components every 3 minutes and updates the inventory database to indicate whether a component is pingable and when it was last seen online. When errors occur, they are logged in the monitor database. Error information is retrieved from the component ILOMs.
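The ping cycle described above can be illustrated with a minimal Python sketch. It is not the appliance's actual ping manager: the in-memory `inventory` dictionary, the field names `pingable` and `last_seen`, and the function names are all hypothetical, and the `ping` flags assume the Linux iputils implementation.

```python
import subprocess
from datetime import datetime, timezone

# Interval stated in the text: all known components are pinged every 3 minutes.
PING_INTERVAL_SECONDS = 180


def ping_once(ip, timeout_s=2):
    """Return True if a single ICMP echo request to ip gets a reply.

    Assumes the Linux iputils ping: -c sets the count, -W the reply timeout.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sweep(inventory, ping=ping_once, now=None):
    """Ping every component once and update its record in place.

    inventory is a hypothetical stand-in for the inventory database:
    a dict mapping component name to a dict with at least an "ip" key.
    """
    now = now or datetime.now(timezone.utc)
    for record in inventory.values():
        reachable = ping(record["ip"])
        record["pingable"] = reachable
        if reachable:
            # last_seen only moves forward when the component answers,
            # so an unreachable component keeps its previous timestamp.
            record["last_seen"] = now
    return inventory
```

A scheduler would call `sweep()` and then wait `PING_INTERVAL_SECONDS` before the next pass. The `ping` parameter exists so that the reachability test can be replaced with a stub when no hardware is available.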

For troubleshooting purposes, historical health status details can be retrieved through the CLI support mode by an authorized Oracle Field Engineer. When the CLI is used in support mode, a number of additional commands are available, two of which display the contents of the health monitoring databases.

  • Use show db inventory to display component health status information from the inventory database.

  • Use show db monitor to display errors logged in the monitor database.

The appliance administrator can retrieve current component health status information from the Oracle Linux command line on the active management node, using the Oracle Private Cloud Appliance Health Check utility. The Health Check utility is built on the framework of the Oracle Private Cloud Appliance Upgrader, and is included in the Upgrader package. It detects the appliance network architecture and runs the sets of health checks defined for the system in question.

Checking the Current Health Status of an Oracle Private Cloud Appliance Installation

  1. Using SSH and an account with superuser privileges, log in to the active management node.

    Note

    The default root password is Welcome1. For security reasons, you must set a new password at your earliest convenience.

    # ssh root@10.100.1.101
    root@10.100.1.101's password:
    [root@ovcamn05r1 ~]#
  2. Launch the Health Check utility.

    # pca_healthcheck
    PCA Rack Type: PCA X8_BASE.
    Please refer to log file
    /nfs/shared_storage/pca_upgrader/log/pca_healthcheck_2019_10_04-12.09.45.log
    for more details.

    After detecting the rack type, the utility executes the applicable health checks.

    Beginning PCA Health Checks...
    
    Check Management Nodes Are Running                                     1/24
    Check Support Packages                                                 2/24
    Check PCA DBs Exist                                                    3/24
    PCA Config File                                                        4/24
    Check Shares Mounted on Management Nodes                               5/24
    Check PCA Version                                                      6/24
    Check Installed Packages                                               7/24
    Check for OpenSSL CVE-2014-0160 - Security Update                      8/24
    Management Nodes Have IPv6 Disabled                                    9/24
    Check Oracle VM Manager Version                                       10/24
    Oracle VM Manager Default Networks                                    11/24
    Repositories Defined in Oracle VM Manager                             12/24
    PCA Services                                                          13/24
    Oracle VM Server Model                                                14/24
    Network Interfaces on Compute Nodes                                   15/24
    Oracle VM Manager Settings                                            16/24
    Check Network Leaf Switch                                             17/24
    Check Network Spine Switch                                            18/24
    All Compute Nodes Running                                             19/24
    Test for ovs-agent Service on Compute Nodes                           20/24
    Test for Shares Mounted on Compute Nodes                              21/24
    Check for bash ELSA-2014-1306 - Security Update                       22/24
    Check Compute Node's Active Network Interfaces                        23/24
    Checking for xen OVMSA-2014-0026 - Security Update                    24/24
    
    PCA Health Checks completed after 2 minutes
  3. When the health checks have been completed, check the report for failures.

    Check Management Nodes Are Running                                   Passed
    Check Support Packages                                               Passed
    Check PCA DBs Exist                                                  Passed
    PCA Config File                                                      Passed
    Check Shares Mounted on Management Nodes                             Passed
    Check PCA Version                                                    Passed
    Check Installed Packages                                             Passed
    Check for OpenSSL CVE-2014-0160 - Security Update                    Passed
    Management Nodes Have IPv6 Disabled                                  Passed
    Check Oracle VM Manager Version                                      Passed
    Oracle VM Manager Default Networks                                   Passed
    Repositories Defined in Oracle VM Manager                            Passed
    PCA Services                                                         Passed
    Oracle VM Server Model                                               Passed
    Network Interfaces on Compute Nodes                                  Passed
    Oracle VM Manager Settings                                           Passed
    Check Network Leaf Switch                                            Passed
    Check Network Spine Switch                                           Failed
    All Compute Nodes Running                                            Passed
    Test for ovs-agent Service on Compute Nodes                          Passed
    Test for Shares Mounted on Compute Nodes                             Passed
    Check for bash ELSA-2014-1306 - Security Update                      Passed
    Check Compute Node's Active Network Interfaces                       Passed
    Checking for xen OVMSA-2014-0026 - Security Update                   Passed
    
    ---------------------------------------------------------------------------
    Overall Status                                                       Failed
    ---------------------------------------------------------------------------
    
    Please refer to log file
    /nfs/shared_storage/pca_upgrader/log/pca_healthcheck_2019_10_04-12.09.45.log
    for more details.
  4. If certain checks have resulted in failures, review the log file for additional diagnostic information. Search for text strings such as "error" and "failed".

    # grep -inr "failed" /nfs/shared_storage/pca_upgrader/log/pca_healthcheck_2019_10_04-12.09.45.log
    
    726:[2019-10-04 12:10:51 264234] INFO (healthcheck:254) Check Network Spine Switch Failed -
    731:  Spine Switch ovcasw22r1 North-South Management Network Port-channel check                 [FAILED]
    733:  Spine Switch ovcasw22r1 Multicast Route Check                                             [FAILED]
    742:  Spine Switch ovcasw23r1 North-South Management Network Port-channel check                 [FAILED]
    750:[2019-10-04 12:10:51 264234] ERROR (precheck:148) [Check Network Spine Switch ()] Failed
    955:[2019-10-04 12:12:26 264234] INFO (precheck:116) [Check Network Spine Switch ()] Failed
    
    # less /nfs/shared_storage/pca_upgrader/log/pca_healthcheck_2019_10_04-12.09.45.log
    
    [...]
      Spine Switch ovcasw22r1 North-South Management Network Port-channel check                 [FAILED]
      Spine Switch ovcasw22r1 OSPF Neighbor Check                                               [OK]
      Spine Switch ovcasw22r1 Multicast Route Check                                             [FAILED]
      Spine Switch ovcasw22r1 PIM RP Check                                                      [OK]
      Spine Switch ovcasw22r1 NVE Peer Check                                                    [OK]
      Spine Switch ovcasw22r1 Spine Filesystem Check                                            [OK]
      Spine Switch ovcasw22r1 Hardware Diagnostic Check                                         [OK]
    [...]
  5. Investigate and fix any detected problems. Repeat the health check until the system passes all checks.
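The review in steps 3 and 4 can be partly scripted. The following Python sketch extracts the failed checks from the summary report and the failed sub-checks from the log file. It assumes only the line layouts shown in the sample output above (a name, at least two spaces, then `Passed`/`Failed` or a bracketed status such as `[FAILED]`); the function names are hypothetical and are not part of the Health Check utility.

```python
import re

# Summary report line: check name, two or more spaces, then Passed or Failed.
RESULT_LINE = re.compile(
    r"^\s*(?P<name>\S.*?)\s{2,}(?P<status>Passed|Failed)\s*$"
)

# Log detail line: sub-check name, two or more spaces, then [OK], [FAILED], etc.
DETAIL_LINE = re.compile(
    r"^\s*(?P<name>\S.*?)\s{2,}\[(?P<status>[A-Z]+)\]\s*$"
)


def failed_checks(report_text):
    """Return the names of checks marked Failed in the summary report."""
    failed = []
    for line in report_text.splitlines():
        match = RESULT_LINE.match(line)
        if not match or match.group("status") != "Failed":
            continue
        # "Overall Status" is the summary row, not an individual check.
        if match.group("name") != "Overall Status":
            failed.append(match.group("name"))
    return failed


def failed_details(log_text):
    """Return the names of sub-checks marked [FAILED] in the log file text."""
    failed = []
    for line in log_text.splitlines():
        match = DETAIL_LINE.match(line)
        if match and match.group("status") == "FAILED":
            failed.append(match.group("name"))
    return failed
```

For example, passing the report from step 3 to `failed_checks()` yields only `Check Network Spine Switch`, and passing the log excerpt from step 4 to `failed_details()` yields the port-channel and multicast route sub-checks for `ovcasw22r1`. Timestamped log entries that do not end in a bracketed status are ignored, so `grep` remains the better tool for locating those.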