A Troubleshooting Issues

This appendix provides solutions to common issues you might encounter when using provisioning and patching Deployment Procedures. In particular, this appendix covers the following:

Troubleshooting Provisioning and Patching Issues
Frequently Asked Questions

Troubleshooting Provisioning and Patching Issues

This section provides troubleshooting tips for some common provisioning and patching issues. In particular, this section covers the following:

Troubleshooting Provisioning Issues
Troubleshooting Patching Issues
Troubleshooting Virtualization Management Issues
Troubleshooting Linux Provisioning Issues
Troubleshooting Linux Patching Issues

Troubleshooting Provisioning Issues

To troubleshoot provisioning issues, see My Oracle Support note 816656.1. You can access My Oracle Support at:

https://support.oracle.com/

Troubleshooting Patching Issues

This section describes how you can troubleshoot patching issues. In particular, this section covers the following:

Using Troubleshooting Options Offered by Deployment Procedure
Collecting Information from Procedure Logs
Instance XML of the Procedure Executed
Raising a Service Request

Using Troubleshooting Options Offered by Deployment Procedure

When a patching Deployment Procedure fails at a step, you can choose to do one of the following:

Ignore: Ignore the failure and proceed with the step.
Retry: Retry the step, for example, after resolving a locked-process issue.
Update and Retry: Update the runtime values and then retry the step, in case any of those values are incorrect.

Figure A-1 shows the troubleshooting options available as part of the procedure steps.

Figure A-1 Troubleshooting options as part of the procedure steps

Additionally, during the Update and Retry mode, you can enable the debug option to get more information on the failures. See Figure A-2.

Figure A-2 Getting More Debug Information

Note:

This option is a part of the patching Deployment Procedures only.

Collecting Information from Procedure Logs

You can collect information about the failure from the procedure logs available in the Oracle home of the OMS and the Oracle home of the Management Agent.

The following are the OMS-related procedure logs:

Generic Enterprise Manager trace files:

$<ORACLE_HOME>/sysman/log/emoms.trc

Provision Advisory Framework logs:

$<ORACLE_HOME>/sysman/log/pafLogs/

For a specific Deployment Procedure instance, the logs are in the following file:

$<ORACLE_HOME>/sysman/log/pafLogs/<name>_<instance_guid>.log
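As a small sketch, the log for one procedure run can be located by its instance GUID. The function name find_paf_log is hypothetical; it relies only on the <name>_<instance_guid>.log naming pattern described above.

```shell
# Hypothetical helper: list Provision Advisory Framework logs whose file
# name ends in _<instance_guid>.log, per the naming pattern above.
find_paf_log() {
    log_dir=$1        # for example "$ORACLE_HOME/sysman/log/pafLogs"
    instance_guid=$2  # GUID of the procedure instance
    # Print matching files, one per line; return non-zero if none is found.
    find "$log_dir" -maxdepth 1 -type f -name "*_${instance_guid}.log" | grep .
}

# Example invocation (the path is a placeholder):
# find_paf_log "$ORACLE_HOME/sysman/log/pafLogs" 16B15CB29C3F9E6CE040578C96093F61
```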

The following are the Management Agent-related procedure logs:

$<ORACLE_HOME>/sysman/log/emagent.nohup

$<ORACLE_HOME>/sysman/log/emagent.trc

Optionally, to capture more details, you can set logging to a finer level. Follow these steps to reset the log level and capture the logs mentioned above. (Note: Archive the old logs and perform a fresh run after resetting the log level, so that the logs contain only the new run.)

On the OMS side:
1. Open the following file available in the Oracle home of the OMS:
  
  $<ORACLE_HOME>/sysman/config/emomslogging.properties
2. Set the log4j.rootCategory= parameter to DEBUG.
3. Restart the OMS by running the following commands:
  
  $<ORACLE_HOME>/bin/emctl stop oms
  
  $<ORACLE_HOME>/bin/emctl start oms
On the Management Agent side:
1. Open the following file available in the Oracle home of the Management Agent:
  
  $<ORACLE_HOME>/sysman/config/emd.properties
2. Make the following settings:
  
  tracelevel.Dispatcher=DEBUG
  
  tracelevel.command=DEBUG
3. Reload the Management Agent by running the following command:
  
  $<ORACLE_HOME>/bin/emctl reload agent

The above settings are to be set only when you want additional details and when the logs do not have sufficient information to debug the issue. Make sure to set the debug level back to the original levels after reproducing the issue.
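One way to make the change easy to undo is to keep a copy of the properties file before editing it. The following is a minimal sketch, not an Oracle-provided tool: the function names set_property and restore_properties are hypothetical, and the sketch assumes entries are written as key=value with no spaces around the equals sign. The OMS restart or Management Agent reload described above is still required after each change.

```shell
# Hypothetical helper: set key=value in a properties file, keeping a
# one-time backup (<file>.orig) so the original levels can be restored.
# Assumes "key=value" lines and that key and value contain no '|'.
set_property() {
    file=$1; key=$2; value=$3
    [ -f "$file.orig" ] || cp -p "$file" "$file.orig"
    if grep -q "^${key}=" "$file"; then
        # Replace the existing line, writing back in place so the file
        # keeps its permissions.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" && rm -f "$file.tmp"
    else
        # The key is not present yet: append it.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Hypothetical helper: put the backed-up file back after reproducing the issue.
restore_properties() {
    mv "$1.orig" "$1"
}

# Example (Management Agent side), followed by "emctl reload agent":
# set_property "$ORACLE_HOME/sysman/config/emd.properties" tracelevel.Dispatcher DEBUG
# set_property "$ORACLE_HOME/sysman/config/emd.properties" tracelevel.command DEBUG
```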

Instance XML of the Procedure Executed

Instance XML provides insight into all the values embedded in the procedure from the selected targets and user inputs. To generate an instance XML, you need to set up EMCLI either on the OMS or on the targets, as SYSMAN or as the Enterprise Manager user who executed the procedure.

For more information, see Oracle Enterprise Manager Command Line Interface available at:

http://www.oracle.com/technology/documentation/oem.html

Run the following commands, where instance_guid is derived from the URL of the failed procedure (see Figure A-3):

emcli get_instance_data_xml -instance={instance_guid}

emcli get_instance_data_xml -instance=16B15CB29C3F9E6CE040578C96093F61 > data.xml

Figure A-3 Instance GUID for the Procedure Instance Run
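Because the GUID is copied by hand from a URL, a quick sanity check before calling EMCLI can catch truncated values. This is only a sketch: the function names are hypothetical, and the 32-hexadecimal-character length is an assumption based on the example GUID shown above.

```shell
# Hypothetical check: an instance GUID such as the one in the example
# above is assumed to be 32 hexadecimal characters.
is_instance_guid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{32}$'
}

# Hypothetical wrapper: validate the GUID, then save the instance XML
# using the emcli verb shown above.
save_instance_xml() {
    guid=$1
    out=${2:-data.xml}
    if ! is_instance_guid "$guid"; then
        echo "not a 32-character hexadecimal GUID: $guid" >&2
        return 1
    fi
    emcli get_instance_data_xml -instance="$guid" > "$out"
}

# Example invocation:
# save_instance_xml 16B15CB29C3F9E6CE040578C96093F61 data.xml
```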

Raising a Service Request

Associate the following with a service request (SR) to Oracle Support to enable speedy resolution:

Screenshots of the failed procedure and step
Log from the failed step UI
Tar/zip of the logs
Optionally, instance XML of the run
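The tar/zip of the logs can be produced with standard tools. A minimal sketch follows; the function name bundle_logs and the archive naming scheme are hypothetical, and the directory to pack is whichever log directory applies to the failure.

```shell
# Hypothetical helper: pack a log directory into a timestamped tar.gz
# that can be attached to the service request.
bundle_logs() {
    log_dir=$1        # for example "$ORACLE_HOME/sysman/log"
    out_dir=${2:-.}   # where to write the archive
    archive="$out_dir/em_logs_$(date +%Y%m%d_%H%M%S).tar.gz"
    # -C makes paths inside the archive relative to the parent of log_dir.
    tar -czf "$archive" -C "$(dirname "$log_dir")" "$(basename "$log_dir")" &&
        echo "$archive"
}

# Example invocation (paths are placeholders):
# bundle_logs "$ORACLE_HOME/sysman/log" /tmp
```

Write the archive outside the directory being packed so that the archive does not include itself.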

Troubleshooting Virtualization Management Issues

How do I view Enterprise Manager Agent errors?

Check the Alerts from the Virtual Server Home Page where the error occurs.

Check the following EM Agent logs for errors. Use the logs of the EM Agent that monitors the Virtual Server on which the collection error was observed.

<AGENT_STATE>/sysman/log/sysman/log/emagent.log       
<AGENT_STATE>/sysman/log/sysman/log/emagent.nohup
<AGENT_STATE>/sysman/log/sysman/log/emagent.trc
<AGENT_STATE>/sysman/log/sysman/log/emagentfetchlet.log 
<AGENT_STATE>/sysman/log/sysman/log/emagentfetchlet.trc

Enhance the logging level at the Agent to Debug, reproduce the issue, and take a look at the above-mentioned files for errors. Do remember to restore the logging level to what it was after getting the required data.

Update the following properties in <AGENT_STATE>/sysman/config/emagentlogging.properties file to the below-mentioned values:
```
log4j.rootCategory=DEBUG, emagentlogAppender, emagenttrcAppender
```
Also update the following properties in the <AGENT_STATE>/sysman/config/emd.properties file to the below-mentioned values:
```
tracelevel.command=DEBUG 
tracelevel.Dispatcher=DEBUG
```
After making these changes, the EM Agent will need to be bounced as follows:
```
<AGENT_HOME>/bin/emctl reload agent
```
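Once the issue has been reproduced, the agent log files listed earlier can be searched for errors. The following is a minimal sketch, assuming the lines of interest contain the word "error" in some letter case; the function name scan_logs_for_errors is hypothetical.

```shell
# Hypothetical helper: print lines mentioning "error" (any letter case)
# from the given log files, prefixed with file name and line number.
scan_logs_for_errors() {
    # Refuse to run without file arguments (grep would wait on stdin).
    [ $# -gt 0 ] || return 2
    # -H print the file name, -i ignore case, -n print line numbers,
    # -s stay silent about files that do not exist.
    grep -Hins 'error' "$@"
}

# Example invocation (substitute the agent state directory):
# scan_logs_for_errors "$AGENT_STATE"/sysman/log/sysman/log/emagent.*
```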

How do I view OMS errors?

To enable enhanced logging at the OMS, update the following properties in OMS_ORACLE_HOME/sysman/config/emomslogging.properties on each OMS, and bounce each OMS after the update:

log4j.category.em.ovm=DEBUG, emovmlogAppender
log4j.appender.emovmlogAppender=org.apache.log4j.RollingFileAppender
log4j.appender.emovmlogAppender.File=/private/rahgupta/app/oracle/product/10.2.0/EMHomes/oms10g/sysman/log/emovm.log
log4j.appender.emovmlogAppender.Append=true
log4j.appender.emovmlogAppender.MaxFileSize=5000000
log4j.appender.emovmlogAppender.MaxBackupIndex=100
log4j.appender.emovmlogAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.emovmlogAppender.layout.ConversionPattern=%d [%t] %-5p %c{2} %M.%L - %m\n

From each OMS the following logs need to be looked into:

$ORACLE_HOME/sysman/log/emoms.log
$ORACLE_HOME/sysman/log/emoms.trc
$ORACLE_HOME/sysman/log/emovm.log*
$ORACLE_HOME/opmn/logs/OC4J~OC4J_EM~default_island~1

It is possible that the log files get rolled over, particularly if reproducing the issue takes a while. In that case increase the following parameter in the above-mentioned file to a larger number:

LogFileMaxRolls=50

How do I view Virtual Server errors?

View the following logs for virtual server errors:

/var/log/ovs-agent/ovs_operation.log
/var/log/ovs-agent/ovs_query.log
/var/log/ovs-agent/ovs_root.log

Troubleshooting Linux Provisioning Issues

I cannot see my stage or boot server in the UI to configure it with the provisioning application.

Either Management Agents have not been installed on the Stage/Boot Server machine, or they are not uploading data to the OMS. Refer to the Agent Deployment Best Practices document available at http://www.oracle.com/ocom/technology/products/oem/pdf/10gr2_agent_deploy_bp.pdf for troubleshooting information and known issues.

"Cannot create under the software library, please contact your administrator" error comes up while creating default image.

This may happen because the Software Library is not configured. Refer to Configuring Oracle Software Library. If it is configured and the error still occurs, then check if the library directory is accessible from the OMS server. The software library location should be accessible from the OMS and should have enough space.

Default Image or OS component creation fails with a '404' status while copying the header.info.

This may happen because of the RPM Repository URL being incorrect or because the RPM Repository has not been configured properly. Please refer to Setting Up RPM Repository and Configuring RPM Repository.

Default Image or OS component creation fails with an error message "Following RPMs are not found in Repository".

Reference machine has additional RPMs that are not available in the RPM repository. One of the following alternatives can be selected:

Choose a different reference machine.
Add extra RPMs in RPM repository (Make sure the RPMs are added in the repository as per Redhat specifications. Header files should be updated with the new RPM).
If the RPMs are not required on the provisioned machine, add them to the Deny RPM list in the OS Component creation user interface.

Default Image or OS component creation job is suspended.

This may happen if the Preferred Credentials are not set for Reference machine. Please set the Preferred Credentials.

Default Image or OS component creation fails with "Sudo error".

This may happen if:

Sudo is not installed on the reference machine.
The user mentioned for the Reference Machine is not in the sudoers list. Edit /etc/sudoers file and add the user in the list.
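Both conditions can be checked roughly from a shell on the reference machine. This sketch uses hypothetical function names and makes a deliberate simplification: it only looks for a sudoers line that starts with the user name, so it ignores group entries, aliases, and files under /etc/sudoers.d. Reading /etc/sudoers normally requires root.

```shell
# Hypothetical check: is the sudo command available on this machine?
sudo_installed() {
    command -v sudo >/dev/null 2>&1
}

# Hypothetical rough check: does the sudoers file contain an uncommented
# line that starts with the given user name followed by whitespace?
user_in_sudoers() {
    user=$1
    sudoers=${2:-/etc/sudoers}
    grep -Eq "^${user}[[:space:]]" "$sudoers"
}

# Example invocation (the user name is a placeholder):
# sudo_installed || echo "sudo is not installed"
# user_in_sudoers oracle || echo "oracle is not listed in /etc/sudoers"
```

When an entry has to be added, editing the file with visudo rather than a plain editor lets the syntax be checked before the file is saved.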

Default Image staging fails with "Cannot stage default image".

This happens if there is already a default image staged for the same IP address range as the current one. Remove the staged image by clicking on the Remove button in the Default Image section on the Administration tab.

Default Image or OS component staging/provision fails with "Sudo error".

This may happen if:

Sudo is not installed on the reference machine.
The user mentioned for the Reference Machine is not in the sudoers list. Edit /etc/sudoers file and add the user in the list.

Default Image or OS component staging/provision job is suspended.

This may happen if Preferred Credentials are not set for Reference machine. Please set the Preferred Credentials.

Default Image or OS component staging/provision job fails with "directory permission error".

This error happens because of insufficient user privileges on the stage server machine. STAGE_TOP_LEVEL_DIRECTORY should have write permission for the Stage user. In case of NAS, the NAS directory should be mounted on the staging server. If the error occurs while writing to the /tftpboot directory, then that directory must have write permission for the Stage user.

Bare metal machine is not coming up since it cannot locate the Boot file.

Verify the dhcp settings (/etc/dhcpd.conf) and tftp settings for the target machine. Check whether the services (dhcpd, xinetd, portmap) are running. Make the necessary setting changes if required, or start the required services if they are down.

Even though the environment is correctly set up, the bare metal box is not getting booted over the network.

DHCP server does not get a DHCPDISCOVER message for the MAC address of the bare metal machine.

Edit the DHCP configuration to include the IP address of the subnet where the bare metal machine is being booted up.

Provisioning Default image on the bare metal box fails with "reverse name lookup failed" error.

Verify that the DNS has the entry for the IP address and the host name.

Agent Installation fails after operating system has been provisioned on the bare metal box?

No host name is assigned to the bare metal box after provisioning the OS?

This might happen if the "get-lease-hostnames" entry in the dhcpd.conf file is set to true. Edit the dhcpd.conf file to set get-lease-hostnames entry to false.

Also, ensure that the length of the host name is compatible with the host name length supported by the operating system.
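The dhcpd.conf change can be scripted. A minimal sketch, assuming the entry is written on its own line as "get-lease-hostnames true;"; the function name disable_lease_hostnames and the .bak backup suffix are hypothetical.

```shell
# Hypothetical helper: rewrite "get-lease-hostnames true;" to
# "get-lease-hostnames false;" in a dhcpd.conf file, keeping a backup.
disable_lease_hostnames() {
    conf=${1:-/etc/dhcpd.conf}
    cp -p "$conf" "$conf.bak" || return 1
    # Preserve the original indentation captured in group 1.
    sed -E 's/^([[:space:]]*get-lease-hostnames[[:space:]]+)true[[:space:]]*;/\1false;/' \
        "$conf.bak" > "$conf"
}

# Example invocation; the dhcpd service has to be restarted afterwards
# for the change to take effect:
# disable_lease_hostnames /etc/dhcpd.conf
```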

Bare metal machine hangs after initial boot up (tftp error/kernel error).

This may happen if the tftp service is not running. Enable the tftp service. Go to the '/etc/xinetd.d/tftp' file and change the 'disable' flag to 'no' (disable=no). Also verify the dhcp settings.
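The flag change in the tftp service file can be sketched as follows. The function name enable_tftp and the .bak backup suffix are hypothetical, and the sketch assumes the flag is written on its own line as "disable = yes", with any amount of spacing.

```shell
# Hypothetical helper: change "disable = yes" to "disable = no" in the
# xinetd tftp service file, keeping a backup of the original.
enable_tftp() {
    conf=${1:-/etc/xinetd.d/tftp}
    cp -p "$conf" "$conf.bak" || return 1
    # Preserve the original spacing captured in group 1.
    sed -E 's/^([[:space:]]*disable[[:space:]]*=[[:space:]]*)yes[[:space:]]*$/\1no/' \
        "$conf.bak" > "$conf"
}

# Example invocation; xinetd has to be restarted afterwards so that it
# rereads the service file:
# enable_tftp /etc/xinetd.d/tftp
```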

Kernel panic occurs when the Bare Metal machine boots up.

Verify the dhcp settings and tftp settings for the target machine and make the necessary changes as required. In rare cases, the copied initrd and vmlinuz files may be corrupted. Copying them from the RPM repository again fixes the problem.

Bare metal machine hangs after loading the initial kernel image.

This may happen if the network is half duplex. Half-duplex networks are not directly supported, but the following steps fix the problem:

On the Reference Machine modify 'ethtool -s eth0 duplex half' entry to 'ethtool -s eth0 duplex full' in the kickstart file.
Once done recreate the Default Image.

Bare metal machine cannot locate the kickstart file (Redhat screen appears for manually entering the values such as 'language', 'keyboard' etc).

This happens if STAGE_TOP_LEVEL_DIRECTORY is not mountable or not accessible. Make sure the stage top level is network accessible to the target machine. Though very rare, this might also happen because of a problem in resolving the stage server hostname. Enter the IP address of the stage/NAS server instead of the hostname, and try the provisioning operation again.

Bare metal machine does not go ahead with the silent installation (Redhat screen appears for manually entering the network details).

Verify that DNS is configured for the stage server host name and DHCP is configured to deliver correct DNS information to the target machine. If the problem persists, specify the IP address of the stage/NAS server instead of hostname and try the provisioning operation again.

After provisioning, the machine is not registered in Enterprise Manager.

This happens if Enterprise Manager Agent is not placed in the STAGE_TOP_LEVEL_DIRECTORY before provisioning operation. Place the EM agent in this directory and try the operation again. It might also happen if the OMS registration password provided for securing the agents is incorrect. Go to the agent oracle home on the target machine and run the emctl secure agent command supplying the correct OMS registration password.

Check the time zone of the OMS and the provisioned operating system. Modify the time zone of the provisioned operating system to match with the OMS time zone.

Component creation fails with "Cannot create under the software library, please contact your administrator".

This may happen if the Software Library is not configured with the Enterprise Manager. Refer to Configuring Oracle Software Library. Create the components once the software library is configured.

Target does not appear under the hardware tab or while selecting hardware during assignment creation.

When you have multiple Oracle Management Services managed by a load balancer, you must configure a common directory that can be used by the Management Agents to upload data. To configure a common directory for this purpose, run the following command:

emctl config oms loader -shared yes -dir <directory>

With 64-bit OS provisioning, agent is not installed.

During OS provisioning, specify the full path of the agent RPM in the Advanced Properties page.

Troubleshooting Linux Patching Issues

My Staging Server Setup DP fails at "Channels Information Collection" step with the error message "Could not fetch the subscribed channels properly". How do I fix this?

This error is seen if there is a network communication error between up2date and ULN. Check whether up2date is configured with the correct proxy settings by following https://linux.oracle.com/uln_faq.html - 9. You can verify whether the issue is resolved by using the command "up2date --nox --show-channels". If the command lists all the subscribed channels, the issue is resolved.

My "up2date --nox --show-channels" command does not list the subscribed channels properly. How do I fix this?

Go to the /etc/sysconfig/rhn/sources file, uncomment "up2date default", and comment out all the local RPM repositories configured.

How can I register to channels of other architectures and releases?

Refer to https://linux.oracle.com/uln_faq.html for this and more such related FAQs.

After visiting some other page, I come back to "Setup Groups" page; I do not see the links to the jobs submitted. How can I get it back?

Click "Show" in the details column.

Package Information Job fails with "ERROR: No Package repository was found" or "Unknown Host" error. How do I fix it?

The package repository you have selected is not valid. Check whether the metadata files have been created by running the yum-arch and createrepo commands. Connectivity from the OMS to the RPM repository might also be a cause.
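Whether the metadata exists can be checked on the machine that hosts the repository. A minimal sketch, assuming the repository directory is locally accessible: createrepo normally writes repodata/repomd.xml and yum-arch normally writes headers/header.info. The function name has_repo_metadata is hypothetical.

```shell
# Hypothetical check: does a repository directory contain the metadata
# that createrepo (repodata/repomd.xml) or yum-arch (headers/header.info)
# normally generates?
has_repo_metadata() {
    repo_dir=$1
    [ -f "$repo_dir/repodata/repomd.xml" ] || [ -f "$repo_dir/headers/header.info" ]
}

# Example invocation (the path is a placeholder):
# has_repo_metadata /var/www/html/rpm_repo || echo "repository metadata is missing"
```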

Even after the deployment procedure finished its execution successfully, the Compliance report still shows my Group as non-compliant, why?

Compliance Collection is a job that runs once every 24 hours. You should wait for the next cycle of the job for the Compliance report to update itself. Alternatively, you can go to the Jobs tab and edit the job to change its schedule.

I see a UI error message saying "Package list is too long". How do I fix it?

Unselect some of the selected packages. The UI error message tells you the package from which to start unselecting.

Frequently Asked Questions

This section provides answers to some frequently asked questions. In particular, this section covers the following:

Frequently Asked Questions on Setting Up the Provisioning Environment
Frequently Asked Questions on Directives and Components
Frequently Asked Questions on Network Profiles and Images

Frequently Asked Questions on Setting Up the Provisioning Environment

What is PXE (Pre-boot Execution Environment)?

The Pre-boot Execution Environment (PXE, aka Pre-Execution Environment) is an environment to bootstrap computers using a network interface card independently of available data storage devices (like hard disks) or installed operating systems. Refer to Appendix G, "PXE Booting and Kickstart Technology" for more information.

Can my Boot server and Stage server be located on different machines?

It is recommended that the Boot server and Stage server be co-located on the same physical machine. If they are not, ensure that the network boot directory (say /tftpboot/pxelinux.cfg) is exposed to the Stage server via NFS and can be mounted by it.

Can my boot server reside on a subnet other than the one on which the bare metal boxes will be added?

Yes. But it is a recommended best practice to have boot server in the same subnet on which the bare metal boxes will be added. If the network is subdivided into multiple virtual networks, and there is a separate DHCP/PXE boot server in each network, the Assignment must specify the boot server on the same network as the designated hardware server.

If one wants to use a boot server in a remote subnet then one of the following should be done:

-- Router should be configured to forward DHCP traffic to a DHCP server on a remote subnet. This traffic is broadcast traffic and routers do not normally forward broadcast traffic unless configured to do so. A network router can be a hardware-based router, such as those manufactured by the Cisco Corporation or software-based such as Microsoft's Routing and Remote Access Services (RRAS). In either case, you need to configure the router to relay DHCP traffic to designated DHCP servers.

-- If routers cannot be used for DHCP/BOOTP relay, set up a DHCP/BOOTP relay agent on one machine in each subnet. The DHCP/BOOTP relay agent relays DHCP and BOOTP message traffic between the DHCP-enabled clients on the local network and a remote DHCP server located on another physical network by using the IP address of the remote DHCP server.

Why is Agent rpm staged on the Stage server?

Agent rpm is used for installing the agent on the target machine after booting over the network using PXE. With operating system provisioning, agent bits are also pushed on the machine from the staging location specified in the Advanced Properties.

Can I use the Agent rpm for installing Agent on Stage and Boot Server?

Yes, but only if the operating system of the Stage or Boot Server machine is RedHat Linux 4.0, 3.1 or 3.0. Refer to the section Using agent rpm for Oracle Management Agent Installation on the following page for more information:

http://www.oracle.com/technology/software/products/oem/htdocs/provisioning_agent.html

Can the yum repository be accessed by any protocol other than HTTP?

Though the rpm repository can be exposed via file:// or ftp:// as well, the recommended method is to expose it via http://. The latter is faster and more secure.

Frequently Asked Questions on Directives and Components

Are directives associated with Components or Images?

Directives can be associated with both components and images. As explained later during the creation of components and images, directives are actually associated with one of Staging, Pre-Install, Install or Post-Installation provisioning steps of a component. In case of an Image, directives can be associated with Stage, Cleanup, Post-Install, and Diagnostics provisioning steps of an Image.

What is the significance of the Status of a directive? How can one change it?

Look at the following table to know the possible Status values and what they signify.

Table A-1 Status Values

Status	Description
Incomplete	This Status signifies that some step was not completed during the directive creation, for example, uploading the actual script for the directive, or that a user saved the directive while creating it and some steps still need to be performed to complete the directive creation.
Ready	This signifies that the directive creation was successful and the directive is now ready to be used along with any component/image.
Active	A user can manually change the status of a Ready directive to Active to signify that it is ready for provisioning. Clicking Activate changes the Status to Active.

What is a Maturity Level of a directive? How can one change it?

Look at the following table to know the possible Maturity Level values and what they signify:

Table A-2 Maturity Levels

Maturity Level	Maturity Level Description
Untested	This signifies that the directive has not been tested and is the default maturity level that is assigned to the directive when it is created.
Beta	A directive can be manually promoted to Beta using the Promote button after testing the directive.
Production	A directive can be manually promoted to Production using the Promote button after a user is satisfied that the directive can be used for actual provisioning on production systems.

Can the same component be used in multiple images?

Yes. Components are reusable and a given component can be a part of multiple images at the same time.

Do I need to modify the images associated with a component, if the component is edited?

No. If the component is changed for some reason, then the change is propagated to all the images that contain that component.

For creating the Linux OS component does the Reference Machine need to have a management agent running on it?

Yes. Reference Machine has to be one of the managed targets of the Enterprise Manager.

What if the working directory does not have enough space while creating a clone component?

The component creation job that is kicked off at the end will indicate that the component creation failed because of insufficient space. Create the component again by specifying a temporary directory with sufficient space, and the creation will succeed.

What is the significance of the Status of a component? How can one change it?

Status of a component is similar to that of a directive. Refer to What is the significance of the Status of a directive? How can one change it?.

What is a Maturity Level of a component? How can one change it?

Maturity Level of a component is similar to that of a directive. Refer to What is a Maturity Level of a directive? How can one change it?.

What is the purpose of associating Properties with Generic Components?

Properties are used to provide flexibility for customizing the components depending on need. The property values are fed to the directives, which then customize the installation depending upon the values.

Frequently Asked Questions on Network Profiles and Images

Can I use 10.1.x IP addresses for Private Network Configuration?

No. Because of a known limitation, this type of addressing cannot be done for private IPs of the cluster nodes.

What is Reset Timeout?

After the OS installation on a target machine, the management agent is installed on it, which reports to the OMS and makes the machine a managed target of Enterprise Manager. The time between the start of OS provisioning and the agent reporting back to the OMS server is known as the Reset Timeout. The provisioning application uses it to decide whether the provisioning process has timed out. If the machine or network is slow, it is advisable to set a high reset timeout value.