8 Troubleshooting the Exadata Plug-in

This chapter provides troubleshooting tips and techniques for installing, discovering, and configuring the Exadata plug-in.

8.1 Establish SSH Connectivity

For Release 12.1.0.1.0, the SSH key location is <ORACLE_HOME>/.ssh where ORACLE_HOME is the installation directory of the Enterprise Manager agent. For example:

/u01/app/oracle/product/gc12/agent/core/12.1.0.1.0

Note:

Some metric collection has a dependency on ~/.ssh/known_hosts.

For Release 12.1.0.2.0 or later, the SSH key location is $HOME/.ssh of the agent user.

To set up SSH connectivity between the computer where the Agent is running and the Oracle Exadata Storage Server, perform the following steps as the Agent user:

  1. Log in to the computer where the Enterprise Manager Agent is running, open a terminal, and run the following commands as the Agent user to generate an SSH private/public key pair if one is not already present:
    • For Release 12.1.0.1.0:

      $ cd <ORACLE_HOME>/.ssh 
      $ ssh-keygen -t dsa -f id_dsa
      

      Where <ORACLE_HOME> is the installation directory of the Enterprise Manager Agent.

    • For Release 12.1.0.2.0 or later:

      $ cd $HOME/.ssh
      $ ssh-keygen -t dsa -f id_dsa
      

      Where $HOME is the home directory of the Agent user.

  2. Copy the public key (id_dsa.pub) to the /tmp directory on the storage cell:
    $ scp id_dsa.pub root@<cell_ipaddress>:/tmp
    
  3. Add the contents of the id_dsa.pub file to the authorized_keys file in the .ssh directory within the home directory of the cellmonitor user:
    $ ssh -l root <cell_ipaddress> "cat /tmp/id_dsa.pub >> ~cellmonitor/.ssh/authorized_keys"
    

    Note:

    If the authorized_keys file does not exist, then create one by copying the id_dsa.pub file to the .ssh directory within the home directory of the user cellmonitor:

    $ ssh -l root <cell_ipaddress> "cp /tmp/id_dsa.pub ~cellmonitor/.ssh/authorized_keys; chown cellmonitor:cellmonitor ~cellmonitor/.ssh/authorized_keys"
    
  4. Make sure that the .ssh directory and the authorized_keys file have the correct permissions:
    # chmod 700 ~cellmonitor/.ssh
    # chmod 600 ~cellmonitor/.ssh/authorized_keys
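
The permissions from step 4 can be checked with a small script. The following is a minimal sketch, not part of the plug-in: the function name check_ssh_perms is hypothetical, and a POSIX find is assumed. It accepts any authorized_keys mode that denies write access to group and others, which includes the 600 mode set above.

```shell
# Hypothetical helper: verify the permissions on a user's .ssh directory.
# Usage: check_ssh_perms /home/cellmonitor/.ssh
check_ssh_perms() {
    ssh_dir=$1
    keys_file=$ssh_dir/authorized_keys

    if [ ! -d "$ssh_dir" ]; then
        echo "missing directory: $ssh_dir"
        return 1
    fi
    if [ ! -f "$keys_file" ]; then
        echo "missing file: $keys_file"
        return 1
    fi

    rc=0
    # The directory must be exactly mode 700.
    if [ -z "$(find "$ssh_dir" -prune -perm 700 -print)" ]; then
        echo "wrong mode on $ssh_dir (expected 700)"
        rc=1
    fi
    # The key file must not be writable by group or others.
    if [ -n "$(find "$keys_file" -prune \( -perm -020 -o -perm -002 \) -print)" ]; then
        echo "$keys_file is writable by group or others"
        rc=1
    fi
    return $rc
}
```

Run the function as root on the storage cell, passing the .ssh directory of the cellmonitor user. A zero exit status means both checks passed.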
    

8.2 Discovery Troubleshooting

Very often, the error message itself states the cause of the error. Look for error messages in the OMS and agent logs (run a case-insensitive search for dbmdiscovery) or in the Discovery window itself.
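
As a minimal sketch of the log search, the helper below wraps grep. The function name find_discovery_messages is hypothetical; the log file names you pass depend on your installation.

```shell
# Hypothetical helper: list log lines that mention dbmdiscovery, ignoring case.
# Usage: find_discovery_messages gcagent.log emoms.log
find_discovery_messages() {
    # -i: case-insensitive match; -n: prefix each match with its line number.
    # /dev/null forces grep to print file names even for a single log file.
    grep -i -n 'dbmdiscovery' "$@" /dev/null
}
```

The function returns a non-zero status when no line matches, which makes it usable in a script condition.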

The following sections are provided:

8.2.1 Hardware Availability

All the hardware components must be "known" and reachable; otherwise, communication failures will occur. Use the ping command for each hardware component of the Exadata rack to make sure all names are resolved.
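
The ping check can be scripted for a whole rack. This is a sketch only: the function name report_unreachable and the PING_CMD override are hypothetical, and the list of host names must come from your own rack inventory. A name that does not resolve also fails the ping, so it is reported as well.

```shell
# Hypothetical helper: read host names (one per line) on standard input and
# print each name that does not answer a single ping.
# PING_CMD can be overridden; the default sends one echo request.
report_unreachable() {
    while read -r host; do
        # Skip blank lines and comment lines.
        case $host in
            ''|'#'*) continue ;;
        esac
        if ! ${PING_CMD:-ping -c 1} "$host" </dev/null >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}
```

For example, report_unreachable < rack_hosts.txt prints the components that need attention, where rack_hosts.txt is a file you maintain.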

The MAP targets in Enterprise Manager Cloud Control 13c may fail while collecting the correlation identifier. This failure can happen if the credentials are incorrect or if a target (for example, the ILOM) is too slow in responding.

ILOM can be slow when the number of open sessions on ILOM has exceeded the limit. You can resolve this issue by temporarily closing the sessions on the ILOM.

The rack placement of targets can fail:

  • If examan did not return a valid rack position for the target.

  • If there is an existing target in the same location.

8.2.2 Discovery Failure Diagnosis

Should discovery of your Oracle Exadata Database Machine fail, collect the following information for diagnosis:

  • Any examan-*.xml, examan-*.html, targets-*.xml, and examan*.log files from the AGENT_ROOT/agent_inst/sysman/emd/state directory.

  • Agent logs: emagent_perl.trc and gcagent.log

  • OMS logs: emoms.trc and emoms.log

  • Any snapshot (screen capture) of errors shown on the target summary page.
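
Gathering the agent-side files can be scripted. The sketch below is an assumption-laden example, not an Oracle tool: the function name collect_discovery_diag and the archive layout are hypothetical, and you supply the state and log directories for your agent. The OMS logs and screen captures must still be collected separately.

```shell
# Hypothetical helper: copy discovery diagnostics into a staging directory
# and pack them into a gzip-compressed tar archive.
# Usage: collect_discovery_diag <state_dir> <agent_log_dir> <archive.tar.gz>
collect_discovery_diag() {
    state_dir=$1
    log_dir=$2
    archive=$3
    staging=$(mktemp -d) || return 1
    mkdir "$staging/discovery_diag" || return 1

    # examan and targets files from the agent state directory.
    find "$state_dir" -type f \( -name 'examan-*.xml' -o -name 'examan-*.html' \
        -o -name 'targets-*.xml' -o -name 'examan*.log' \) \
        -exec cp {} "$staging/discovery_diag/" \;

    # Agent logs, if present.
    for log in emagent_perl.trc gcagent.log; do
        if [ -f "$log_dir/$log" ]; then
            cp "$log_dir/$log" "$staging/discovery_diag/"
        fi
    done

    tar -czf "$archive" -C "$staging" discovery_diag
    rc=$?
    rm -rf "$staging"
    return $rc
}
```

Pass an absolute path for the archive so that it is easy to locate when you attach it to a service request.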

8.2.3 Cell is not Discovered

If the cell itself is not discovered, possible causes could be:

  • The installation of RDBMS Oracle Home Release 11.2 is incorrect.

  • The /etc/oracle/cell/network-config/cellip.ora file on the compute node is missing or is not readable by the agent user.

  • The cell is not listed in the /etc/oracle/cell/network-config/cellip.ora file.

  • Management Server (MS) or cellsrv is down.

  • The cell management IP was changed improperly. Restarting both cellsrv and MS may help.

  • To check that the cell is discovered with a valid management IP, run the following command on the compute node used for discovery:

    $ORACLE_HOME/bin/kfod op=cellconfig
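
The cellip.ora checks from the list above can be sketched as a shell function. The function name cell_listed is hypothetical, and the sketch assumes that each entry in the file has the form cell="ip" or cell="ip1;ip2"; adjust it if your file differs.

```shell
# Hypothetical helper: succeed if the given IP address appears in cellip.ora.
# Assumes entries of the form cell="ip" or cell="ip1;ip2".
# Usage: cell_listed /etc/oracle/cell/network-config/cellip.ora 192.168.10.3
cell_listed() {
    cellip_file=$1
    cell_ip=$2
    if [ ! -r "$cellip_file" ]; then
        echo "cannot read $cellip_file" >&2
        return 2
    fi
    # Put each quoted address on its own line, then require an exact match
    # so that 192.168.10.3 does not match 192.168.10.30.
    tr '";=' '\n\n\n' < "$cellip_file" | grep -F -x -q -- "$cell_ip"
}
```

Run it as the agent user so that the readability test reflects what the agent sees. Exit status 2 indicates a missing or unreadable file, and 1 indicates that the cell is not listed.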
    

8.2.4 Compute Node Error Message

Problems with the compute node may generate the following error:

The selected compute node is not an existing host target managed by Enterprise Manager. Please add the compute node as managed target before you continue.

Possible causes for this error include:

  • The compute node was not added as an Enterprise Manager host target before the Exadata Database Machine discovery.

  • The host target name for compute node is an IP address. This problem can be an /etc/hosts or DNS issue.

  • The host target name is not fully qualified with the domain name (for example, hostname.mycompany.com).
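
The last two causes can be told apart by inspecting the host target name. The following sketch uses a hypothetical function, classify_host_name, that relies only on the shape of the string and does not query DNS.

```shell
# Hypothetical helper: classify a host target name as "ip", "short", or "fqdn".
classify_host_name() {
    case $1 in
        *:*)
            # A colon only appears in IPv6 addresses.
            echo ip ;;
        *[!0-9.]*)
            # Contains something other than digits and dots: a host name.
            case $1 in
                *.*) echo fqdn ;;
                *)   echo short ;;
            esac ;;
        *.*.*.*)
            # Only digits and dots in a dotted-quad shape.
            echo ip ;;
        *)
            echo short ;;
    esac
}
```

A result of ip or short means that the host target name should be corrected before you retry discovery.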

8.2.5 Compute Node or InfiniBand Switch is not Discovered

If there are problems with discovery of the compute node or the InfiniBand switch, possible causes could be:

  • The InfiniBand switch host name or nm2user password is incorrect.

  • The connection from the compute node to the InfiniBand switch through SSH is blocked by a firewall.

  • The InfiniBand switch is down or takes too long to respond to SSH.

To resolve problems with the compute node or InfiniBand switch discovery, try:

  • If the InfiniBand switch node is not discovered, the InfiniBand switch model or switch firmware may not be supported by EM Exadata. Run the ibnetdiscover command. Output should look like:

    Switch 36 "S-002128469f47a0a0" # "Sun DCS 36 QDR switch switch1.example.com" enhanced port 0 lid 1 lmc 0
    
  • To verify discovery of the compute node, run the following command on the compute node used for discovery:

    # ssh <IB switch> -l nm2user ibnetdiscover
    
  • If the compute node is not discovered, run the ibnetdiscover command. Output should look like:

    Ca 2 "H-00212800013e8f4a" # " xdb1db02 S 192.168.229.85 HCA-1"
    

    A bug in the 11.2.2.2.2 compute node image shows "S" and the InfiniBand IP as missing. Output would look like:

    Ca 2 " H-00212800013e8f4a " # "xdb1db02 HCA-1"
    

    A workaround for this problem is to run the following command as root on the compute nodes:

    # /opt/oracle.cellos/ib_set_node_desc.sh
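
To spot compute nodes affected by the missing "S" and InfiniBand IP, the ibnetdiscover output can be filtered. This is a sketch under the assumption that channel adapter lines look like the samples in this section; the function name missing_ib_ip is hypothetical.

```shell
# Hypothetical helper: read ibnetdiscover output on standard input and print
# the node description of every channel adapter (Ca) line that lacks the
# "S <IP address>" part.
missing_ib_ip() {
    awk '
        /^Ca[ \t]/ {
            desc = $0
            # Drop everything up to the opening quote of the description.
            sub(/^[^#]*#[ \t]*"/, "", desc)
            # Drop the closing quote and anything after it.
            sub(/"[^"]*$/, "", desc)
            if (desc !~ / S [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+( |$)/) {
                gsub(/^[ \t]+|[ \t]+$/, "", desc)
                print desc
            }
        }'
}
```

For example, pipe the output of the ssh command shown earlier in this section into missing_ib_ip. Any node it prints is a candidate for the workaround script.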
    

8.2.6 Compute Node not Managed by Enterprise Manager

If the compute node is not being managed by Enterprise Manager, then check the following troubleshooting steps:

  • If the Agent host name is different from the compute node host name, then run the following command as root to match it to the Agent host name:

    # ibnetdiscover
    
  • If the wrong Agent is used for discovery, then select the compute node Agent for discovery.

  • If the compute node name has been reset from the client to management or vice-versa, then run the following command:

    # /usr/sbin/set_nodedesc.sh
    
  • If a short host name is used for agents, then reinstall the agents using the fully qualified host name <hostname.domain>.

8.2.7 Extra or Missing Components in the Newly Discovered Exadata Database Machine

If you are showing extra components or if there are missing components, then check the following troubleshooting steps:

  • For extra components, examine them for Exadata Database Machine membership. Deselect any extra components manually from the discovered list.

  • Verify which schematic file was used for discovery. Ensure that Enterprise Manager can read the latest xml file (for example, databasemachine.xml) on the compute node.

  • For missing components, check the schematic file content.

  • If you need to generate a new schematic file, then log a service request (SR) with Oracle Support and provide the details.

8.2.8 InfiniBand Network Performance Page Shows No Data

If the InfiniBand network performance page does not show data, verify that the files under the /opt/oracle.SupportTools/em/ directory on the compute nodes are publicly readable. See Oracle Bug 13255511 for more information.
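
A quick way to find files that are not publicly readable is a find command on the permission bits. The wrapper below is a minimal sketch; the function name list_not_world_readable is hypothetical.

```shell
# Hypothetical helper: print files under a directory that others cannot read.
# Usage: list_not_world_readable /opt/oracle.SupportTools/em
list_not_world_readable() {
    # -perm -004 is true when the "others" read bit is set; ! negates it.
    find "$1" -type f ! -perm -004 -print
}
```

Empty output means that every regular file under the directory has the read bit set for others.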

8.2.9 ILOM, PDU, KVM, or Cisco Switch is not Discovered

If the ILOM, PDU, KVM, or Cisco switch is not discovered, the most likely cause is that the Exadata Database Machine Schematic file cannot be read or has incorrect data. See Troubleshooting the Exadata Database Machine Schematic File.

8.2.10 Target Does not Appear in Selected Targets Page

Even though no error may appear during the Exadata Database Machine guided discovery, the target does not appear on the Select Components page. Possible causes and solutions include:

  • Check the All Targets page to make sure that the target has not been added as an Enterprise Manager target already:

    • Log in to Enterprise Manager.

    • Select Targets, then All Targets.

    • On the All Targets page, check to see if the Oracle Exadata target appears in the list.

  • A target that is added manually may not be connected to the Exadata Database Machine system target through association. To correct this problem:

    • Delete these targets before initiating the Exadata Database Machine guided discovery.

    • Alternatively, use the emcli command to add these targets to the appropriate system target as members.

8.2.11 Target is Down or Metric Collection Error After Discovery

After the Exadata Database Machine guided discovery, an error that the target is down or that there is a problem with the metric collection may display. Possible causes and recommended solutions include:

  • For the cell or InfiniBand switch, the setup of SSH may not be configured properly. To troubleshoot and resolve this problem:

    • Verify that the agent's SSH public key in the <AGENT_INST>/.ssh/id_dsa.pub file is in the authorized_keys file of $HOME/.ssh for cellmonitor or nm2user.

    • Verify permissions. The permission settings for .ssh and authorized_keys should be:

      drwx------ 2 cellmonitor cellmonitor 4096 Oct 13 07:06 .ssh
      -rw-r--r-- 1 cellmonitor cellmonitor 441842 Nov 10 20:03 authorized_keys
      
    • Resolve a PerformOperationException error. See Troubleshooting the Exadata Database Machine Schematic File for more information.

  • If the SSH setup is confirmed to be properly configured, but the target status is still down, then check to make sure there are valid monitoring and backup agents assigned to monitor the target. To confirm, click the Database Machine menu and select Monitoring Agent. Figure 8-1 shows an example of the monitoring agents:

    Figure 8-1 Monitoring Agents Example



  • For the ILOM, PDU, KVM, or Cisco switch, possible causes include:

    • The Exadata Database Machine Schematic Diagram file has the wrong IP address.

    • Monitoring credentials are not set or are incorrect. To verify:

      • Log in to Enterprise Manager.

      • Click Setup, then Security, and finally Monitoring Credentials.

      • On the Monitoring Credentials page, click the Oracle Exadata target type. Then set the monitoring credentials.

8.2.12 ILOM Credential Validation Fails During 12c Discovery

ILOM Credential Validation Failure Errors

Note:

The following applies to Enterprise Manager Cloud Control Release 2, Plug-in Update 1 and later.

ILOM credential validation may fail while performing a 12c discovery. The following errors may occur:

  • Authentication failed

    Problem: Credentials provided are invalid.

    Resolution: Use valid credentials.

  • Unable to establish IPMI V2 / RMCP+ session.

    Problem: IPMI V2 is not enabled on the ILOM server.

    Resolution: If IPMI V2 was disabled on purpose and the ILOM server is configured to use the more secure TLS1.2 protocol, then the IPMI client needs to be upgraded to use TLS1.2 for establishing a session with ILOM server.

  • Unable to establish LAN session.

    Problem: IPMI V1.5 is not enabled on the ILOM server.

    Resolution: Enable IPMI V1.5 on the ILOM server. IPMI V1.5 is used for communication as a third option after TLS1.2 and IPMI V2 have failed.

8.2.13 Discovery Process Hangs

If the discovery process for the Exadata Database Machine hangs, then check the following troubleshooting steps:

  • Examine your network to verify:

    • That the host name can be resolved.

    • That the Agent(s) can access the OMS.

    • That a simple job can be executed from the console.

  • If the OMS reported any errors, then check the following log file:

    MW_HOME/gc_inst/sysman/log/emoms.log
    
  • For Repository issues, check the Repository database's alert.log file.

  • For Agent issues, check the following log file:

    $AGENT/agent_inst/sysman/log/gcagent.log
    

8.3 Troubleshooting the Exadata Database Machine Schematic File

The Exadata Database Machine Schematic file version 503 is required as a prerequisite for guided discovery. As part of any discovery troubleshooting, possible causes and recommended resolution with the schematic file can include:

  • The schematic file on the compute node is missing or is not readable by the agent user.

    • For Exadata Release 11.2.3.2 and later, the schematic file is:

      /opt/oracle.SupportTools/onecommand/catalog.xml
      
    • For Exadata Release 11.2.3.1 and earlier, the schematic file is:

      /opt/oracle.SupportTools/onecommand/databasemachine.xml
      
  • If a PerformOperationException error appears, the agent NMO is not configured for setuid-root:

    • From the OMS log:

      2011-11-08 12:28:12,910 [[ACTIVE] ExecuteThread: '6' for queue: 'weblogic.kernel.Default (self-tuning)'] 
      ERROR model.DiscoveredTarget logp.251 - 
      ERROR: NMO not setuid-root (Unix only) oracle.sysman.emSDK.agent.client.exception.PerformOperationException:
      
    • As root, run:

      # <AGENT_INST>/root.sh
      
  • In the PAM configuration under /etc/pam.d, pam_ldap.so is used instead of pam_unix.so on compute nodes.

    • Even though the agent user and password are correct, this error appears in the agent log:

      oracle.sysman.emSDK.agent.client.exception.PerformOperationException:
      ERROR: Invalid username and/or password
      
  • The schematic file has errors because of a known Exadata Database Machine configurator bug:

    • Verify that the Exadata Database Machine configurator is version 12.0

    • Verify that the schematic file is version 503

    • Older versions may or may not have the bug depending on the Exadata Database Machine rack type and partitioning.

  • If the schematic file is blank, then:

    • Check that your browser is supported by Enterprise Manager Cloud Control 12c.

    • Run through discovery again and watch for messages.

    • Check the emoms.log file for exceptions at the same time.

  • If components are missing, then:

    • Add manually to the schematic page (click Edit).

    • Check for component presence in Enterprise Manager. Check to see if it is monitored.

8.4 Exadata Database Machine Management Troubleshooting

If data is missing in Resource Utilization graphs, then run a "view object" SQL query to find out what data is missing. Common problems include:

  • Schematic file is not loaded correctly.

  • Cluster, Database, and ASM are not added as Enterprise Manager targets.

  • Database or cell target is down or is returning metric collection errors.

  • Metric is collected in the Enterprise Manager repository, but has an IS_CURRENT != Y setting.

8.5 Exadata Derived Association Rules

Exadata derived association rules depend on Exadata and DB/ASM ECM data. This data may take up to 30 minutes to appear depending on metric collection schedule. To check for data availability:

  • From the Enterprise Manager Cloud Control console:

    • Click Targets, then All Targets.

    • On the All Targets page, click the Oracle Exadata target from the list.

    • Click Database System, then Configuration, and finally Last Collected.

    • On the Latest Configuration page, click Actions, then Refresh.

  • From the command line:

    # emctl control agent runCollection <target_name>:<target_type> <collectionName>
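
When you trigger collections for several targets, a small wrapper keeps the argument order straight. This is a sketch only: the function name run_collection is hypothetical, and it assumes that emctl from the agent installation is on the PATH.

```shell
# Hypothetical wrapper around the runCollection command shown above.
# Usage: run_collection <target_name> <target_type> <collectionName>
run_collection() {
    if [ $# -ne 3 ]; then
        echo "usage: run_collection <target_name> <target_type> <collectionName>" >&2
        return 2
    fi
    # The target is passed as a single name:type argument.
    emctl control agent runCollection "$1:$2" "$3"
}
```

The wrapper passes the target as one name:type argument followed by the collection name, and refuses to run when an argument is missing.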
    

Other troubleshooting tips include:

  • Verify that ECM data are collected and present in Enterprise Manager repository.

  • Verify that all data and conditions in the query are met by running the query in SQL*Plus.

  • Verify triggers by enabling debug logging to check for timing issues.

8.6 InfiniBand Patch Details Missing

Problem: If you select both InfiniBand switches in an Exadata Database Machine, the Patch Details section is empty. Installation of a patch on both InfiniBand switches is not affected; only the display of the installed patch details is.

Workaround: Select each InfiniBand switch individually to review the patch details.

8.7 Oracle Auto Service Request (ASR) Issues

8.7.1 Oracle ASR Not Working on Exadata Storage Server

Problem: You may encounter a problem where Oracle ASR is not working on the Exadata Storage Server.

Resolution: Ensure that there are two subscriptions on the Exadata Storage Server:

  • The type should be ASR or V3ASR for receiving the cell ILOM traps.

  • The type should be default (no type) or v3 for receiving the MS MIB traps from the cell.

8.7.2 No Slots Available Error

The asr type entry on MS MIB adds a subscription to the cell ILOM automatically. If the ILOM SNMP slots are full, then the subscription command on the cell may fail with the following error:

CELL-02669: No slots are available for ILOM SNMP subscribers

There is a limit of 15 subscribers on the ILOM which might cause this failure. You will need to free up some slots on the ILOM and retry the ASR subscription:

  1. Log in to the ILOM console (for example: https://XXXCELL-c.example.com).
  2. Click ILOM Administration and then Notification.
  3. Choose the slot and set the subscriber to 0.0.0.0 to clear it.

8.8 Target Status Issues

If the Target status shows DOWN inaccurately, then:

  • For the Cell: Check SSH equivalence (cellmonitor user) with the following command:

    ssh -i /home/oracle/.ssh/id_dsa -l cellmonitor <cell name> cellcli -e list cell
    

    Output should be: <cell name>

  • For the PDU: Check to make sure you can access the PDU through your browser to verify that it is connected to your LAN:

    http://<pdu name>
    
  • For the Cisco Switch: Check for proper SNMP subscriptions. See Set Up SNMP for Cisco Ethernet Switch Targets for details.
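
The cell check above can be wrapped so that it never waits at a password prompt. The following is a minimal sketch, assuming OpenSSH on the agent host; the function name check_cell_ssh is hypothetical, and the key path is whatever your agent user uses.

```shell
# Hypothetical helper: run "cellcli -e list cell" on a cell as cellmonitor
# using key-based login only.
# Usage: check_cell_ssh /home/oracle/.ssh/id_dsa <cell name>
check_cell_ssh() {
    key_file=$1
    cell=$2
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a missing key shows up as a non-zero exit status.
    output=$(ssh -i "$key_file" -o BatchMode=yes -l cellmonitor "$cell" \
        cellcli -e list cell </dev/null) || {
        echo "ssh to $cell as cellmonitor failed" >&2
        return 1
    }
    if [ -z "$output" ]; then
        echo "no output from cellcli on $cell" >&2
        return 1
    fi
    echo "$output"
}
```

A failure here points to the SSH setup described in Establish SSH Connectivity rather than to the cell itself.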

8.9 Metric Collection Issues

If the Target status shows a Metric Collection Error, then:

  • Hover over the icon or navigate to Incident Manager.

  • Read the full text of the error.

  • Visit the Monitoring Configuration page and examine the settings. From the Setup menu, select Monitoring Configuration.

  • Trigger a new collection: From the Target menu, select Configuration, then select Last Collected, then Actions, and finally select Refresh.

  • Access the monitoring Agent Metric through your browser:

    https://<agent URL>/emd/browser/main
    

    Click Target >> and then click Response to evaluate the results. You may need to log a service request (SR) with Oracle Support.

8.10 Status: Pending Issues

For those issues where components are in a Pending status, see the following troubleshooting steps:

8.10.1 Cellsys Targets

If the Cellsys target seems to be in a Pending status for too long, then:

  • Verify that there is an association for the Cluster ASM, Database, and Storage Cell.

  • Check and fix the status of the associated target database.

  • Check and fix the status of the associated target ASM cluster.

  • Ensure an UP status for all cell server targets.

  • Delete any unassociated cellsys targets.

8.10.2 Database Machine Target or Any Associated Components

If the Exadata Database Machine target or any associated components are in a Pending status for too long, then:

  • Check for duplicate or pending delete targets. From the Setup menu, select Manage Cloud Control, then select Health Overview.

  • Check the target configuration. From the target's home page menu, select Target Setup, then select Monitoring Configurations.

  • Search for the target name in the agent or OMS logs:

    $ grep "<target name>" gcagent.log
    $ grep "<target name>" emoms.log
    

8.11 Enhanced MIB Incompatibility

With some shipments of the X4 series of Exadata and all later versions of the hardware, the Exadata Database Machine contains an upgraded PDU that runs a newer version of firmware. This firmware ships with an enhanced Management Information Base (MIB) that is not compatible with the Exadata 12.1.x plug-in. In these cases, change the setting in the PDU to use the original MIB, as illustrated in Figure 8-2.

  1. Select Original MIB if you are monitoring the ILOM as an "Oracle Engineered System PDU" in Enterprise Manager.
  2. To use the Enhanced MIB instead, you can use the "Oracle System Infrastructure PDU" and then convert Exadata Database Machine components to use Oracle Enterprise Manager Cloud Control 13c. For details on converting to this version, see Using the Conversion Wizard to Convert 12c Targets to 13c Targets.

Figure 8-2 Selecting Original MIB


8.12 Monitoring Agent Not Deployed for IPv6 Environments

Problem: For IPv6 environments, the monitoring agent is not deployed.

Cause: If the IPv6 address is not included in the /etc/hosts file, then the agent will not be deployed.

Resolution: Edit the /etc/hosts of the compute node (or the VM in case of virtual Exadata) to map the OMS host name to an IPv6 address.
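
Before redeploying the agent, you can confirm that the mapping is in place. The sketch below is illustrative: the function name has_ipv6_mapping is hypothetical, oms.example.com stands in for your OMS host name, and an address is treated as IPv6 when it contains a colon.

```shell
# Hypothetical helper: succeed if a hosts file maps the given host name
# to an IPv6 address.
# Usage: has_ipv6_mapping /etc/hosts oms.example.com
has_ipv6_mapping() {
    awk -v name="$2" '
        { sub(/#.*/, "") }          # strip comments
        $1 ~ /:/ {                  # IPv6 addresses contain a colon
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit !found }
    ' "$1"
}
```

Run it on the compute node (or on the VM in a virtual Exadata) against /etc/hosts. A non-zero exit status means that the entry still has to be added.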

8.13 Configure IPv6–SNMPv3 Subscription

You can configure SNMPv3 subscriptions only on the Oracle Exadata Storage Server page.

Prerequisites:
  • You need to create an SNMPv3 user on the cell. See Step 1 for instructions.

  • Exadata Storage Server version must be 12.1.2.2 or later to support SNMPv3 subscription.

To configure your IPv6–SNMPv3 subscription:

  1. Create an SNMPv3 user on the Exadata Storage Server target.
    • Suppose you need a user called "v3user". You can check whether this user is already present on the cell using this command:

      cellcli -e "list cell attributes snmpUser"
      
    • If this user is not present on the Exadata Storage Server, create it with the following CellCLI command:

      ALTER CELL snmpUser=((name='[v3user]', authProtocol='MD5', authPassword='[passwd]', privProtocol='DES', privPassword='[passwd]'))
      

    Note:

    The Enterprise Manager Agent supports DES for privacy combined with MD5 or SHA for authentication.
  2. When running the discovery process on 12c and 13c, ensure that the v3 monitoring credentials on the cell target in EM match the v3 user on the cell.

    Note:

    Make sure that an SNMP v3 user has been created before the discovery process is initiated (see Step 1 for instructions).
  3. If you need to edit the SNMP v3 monitoring credentials of the Exadata Storage Server after discovery, you can log in to EM and:
    • Select Configuration, then Security, then Monitoring Credentials.

    • Select the Exadata Storage Server type, and click the Manage Monitoring Credentials button.

    • Select the Exadata Storage Server and set the v3 Credentials.

    Note:

    Ensure that the v3 monitoring credentials set up on the cell target in EM match the v3 user on the cell.