2.10 Health Monitoring

The Oracle PCA controller software contains a monitoring service, which is started and stopped with the ovca service on the active management node. When the system runs for the first time, it creates an inventory database and a monitor database. Once these are set up and the monitoring service is active, health information about the hardware components is updated continuously.

The inventory database is populated with information about the various components installed in the rack, including the IP addresses to be used for monitoring. With this information, the ping manager pings all known components every 3 minutes and updates the inventory database to indicate whether a component is pingable and when it was last seen online. When errors occur, they are logged in the monitor database. Error information is retrieved from the component ILOMs.
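The ping cycle described above can be illustrated with a minimal Python sketch. It is not the ping manager's actual implementation: the record keys `pingable` and `last_seen`, the `sweep` and `is_pingable` functions, and the use of the Linux `ping` utility are all assumptions made for illustration, and the real inventory database schema is not shown in this section.

```python
import subprocess
from datetime import datetime, timezone

PING_INTERVAL_SECONDS = 180  # the 3-minute cycle described in the text


def is_pingable(ip, timeout_seconds=2):
    """Send one ICMP echo request with the Linux `ping` utility (assumed available)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_seconds), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sweep(inventory, probe=is_pingable, now=None):
    """Update each inventory record in place with the latest ping result.

    `inventory` maps an IP address to a dict with the hypothetical keys
    "pingable" and "last_seen".
    """
    now = now or datetime.now(timezone.utc)
    for ip, record in inventory.items():
        reachable = probe(ip)
        record["pingable"] = reachable
        if reachable:
            # last_seen only advances when the component answers.
            record["last_seen"] = now
    return inventory
```

A monitoring loop would call `sweep(inventory)` and then sleep for `PING_INTERVAL_SECONDS`. Passing the probe as a parameter keeps the bookkeeping testable without sending real network traffic.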

For troubleshooting purposes, historic health status details can be retrieved through the CLI support mode by an authorized Oracle Field Engineer. When the CLI is used in support mode, a number of additional commands are available, two of which display the contents of the health monitoring databases.

  • Use show db inventory to display component health status information from the inventory database.

  • Use show db monitor to display errors logged in the monitor database.

The appliance administrator can retrieve current component health status information through the Oracle PCA CLI at any time by means of the diagnose command.

Checking the Current Health Status of an Oracle PCA Installation

  1. Using SSH and an account with superuser privileges, log in to the active management node.

    Note

    The default root password is Welcome1. For security reasons, you must set a new password at your earliest convenience.

    # ssh root@10.100.1.101
    root@10.100.1.101's password:
    [root@ovcamn05r1 ~]#
  2. Launch the Oracle PCA command line interface.

    # pca-admin
    Welcome to PCA! Release: 2.3.2
    PCA>
  3. Check the current status of the rack components by querying their ILOMs.

    PCA> diagnose ilom
    Checking ILOM health............please wait..
    
    IP_Address      Status          Health_Details
    ----------      ------          --------------
    192.168.4.129   Not Connected
    192.168.4.128   Not Connected
    192.168.4.127   Not Connected
    192.168.4.126   Not Connected
    192.168.4.125   Not Connected
    192.168.4.124   Not Connected
    192.168.4.123   Not Connected
    192.168.4.122   Not Connected
    192.168.4.121   Not Connected
    192.168.4.120   Not Connected
    192.168.4.101   OK
    192.168.4.102   OK
    192.168.4.105   Faulty          Mon Nov 25 14:17:37 2013  Power    PS1 (Power Supply 1) 
                                    A loss of AC input to a power supply has occurred. 
                                    (Probability: 100, UUID: 2c1ec5fc-ffa3-c768-e602-ca12b86e3ea1, 
                                    Part Number: 07047410, Serial Number: 476856F+1252CE027X, 
                                    Reference Document: http://www.sun.com/msg/SPX86-8003-73)
    192.168.4.107   OK
    192.168.4.106   OK
    192.168.4.109   OK
    192.168.4.108   OK
    192.168.4.112   OK
    192.168.4.113   Not Connected
    192.168.4.110   OK
    192.168.4.111   OK
    192.168.4.116   Not Connected
    192.168.4.117   Not Connected
    192.168.4.114   Not Connected
    192.168.4.115   Not Connected
    192.168.4.118   Not Connected
    192.168.4.119   Not Connected
    -----------------
    27 rows displayed
    
    Status: Success
  4. Verify that the Oracle PCA controller software is fully operational.

    PCA> diagnose software
    PCA Software Acceptance Test runner utility
    Test -    701 - OpenSSL CVE-2014-0160 Heartbleed bug Acceptance [PASSED]
    Test -    785 - PCA package Acceptance [PASSED]
    Test -   1083 - Mgmt node xsigo network interface Acceptance [PASSED]
    Test -    787 - Shared Storage Acceptance [PASSED]
    Test -    973 - Simple connectivity Acceptance [PASSED]
    Test -   1078 - Test for ovs-agent service on CNs Acceptance [PASSED]
    Test -   1079 - Test for shares mounted on CNs Acceptance [PASSED]
    Test -   1080 - ovs-log check Acceptance [PASSED]
    Test -    788 - PCA services Acceptance [PASSED]
    Test -    789 - PCA config file Acceptance [PASSED]
    Test -   1300 - All compute nodes running Acceptance [PASSED]
    Test -   1318 - Check support packages in PCA image Acceptance [PASSED]
    Test -    928 - Repositories defined in OVM manager Acceptance [PASSED]
    Test -   1107 - Compute node xsigo network interface Acceptance [PASSED]
    Test -   1316 - PCA version Acceptance [PASSED]
    Test -   1117 - Network interfaces check Acceptance [PASSED]
    Test -    824 - OVM manager settings Acceptance [PASSED]
    Test -    927 - OVM server model Acceptance [PASSED]
    Test -    925 - PCA log Acceptance [PASSED]
    Test -    926 - Networks defined in OVM manager for CNs Acceptance [PASSED]
    Test -    822 - Compute node network interface Acceptance [PASSED]
    Status: Success

    Note

    For additional information about these diagnostic results, see /var/log/ovca-diagnosis.log. Note, however, that health monitoring status information changes frequently while the appliance environment is running. If the system does not perform as expected, use this information only as an indication of where a problem might have occurred.

  5. Close the CLI.

    PCA> exit
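
When the output of steps 3 and 4 has been saved to a text file, it can be summarized with a short script. The following Python sketch is one possible approach, not an Oracle-provided tool. The function names `parse_ilom` and `failed_tests` are hypothetical, the parser assumes that only the three status values shown in the sample output occur (OK, Faulty, Not Connected), and it assumes that a test that does not pass is reported with a bracketed result other than PASSED, which the sample output does not show.

```python
import re

# Assumption: only the status values seen in the sample output occur.
ILOM_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(Not Connected|OK|Faulty)\s*(.*)$"
)
TEST_ROW = re.compile(r"^\s*Test -\s+(\d+) - (.+) \[(\w+)\]\s*$")


def parse_ilom(text):
    """Return a list of {"ip", "status", "details"} dicts from `diagnose ilom` output."""
    rows = []
    for line in text.splitlines():
        match = ILOM_ROW.match(line)
        if match:
            ip, status, details = match.groups()
            rows.append({"ip": ip, "status": status, "details": details.strip()})
        elif rows and line.startswith(" " * 20) and line.strip():
            # Deeply indented lines continue the Health_Details of the previous row.
            joined = rows[-1]["details"] + " " + line.strip()
            rows[-1]["details"] = joined.strip()
    return rows


def failed_tests(text):
    """Return (test_id, name, result) for every test whose result is not PASSED."""
    failures = []
    for line in text.splitlines():
        match = TEST_ROW.match(line)
        if match and match.group(3) != "PASSED":
            failures.append((int(match.group(1)), match.group(2), match.group(3)))
    return failures
```

For example, `[r["ip"] for r in parse_ilom(text) if r["status"] == "Faulty"]` lists the ILOM addresses that report a fault, and an empty list from `failed_tests(text)` means that every test line found was marked PASSED.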