Sun N1 System Manager 1.2 Administration Guide

Chapter 6 Troubleshooting

This chapter provides troubleshooting information on the following topics:

Discovery Problems

If discovery fails and the job output contains the following message, the target server has reached its maximum number of SNMP destinations:

Error. The limit on the number of SNMP destinations has been exceeded.

The service processor of the Sun Fire V20z and V40z server has a limit of three SNMP destinations. To see the current SNMP destinations, perform the following steps:

Log into the service processor using SSH.
Run the following command:
sp get snmp-destinations

The SNMP destinations appear in the output.

If there are three destinations for a V20z or a V40z, discovery will fail. The failure occurs because the N1 System Manager adds another snmp-destination to the service processor during discovery.

The SNMP destinations can be configured in a service processor by N1 System Manager or some other management software. You can delete entries from the SNMP destinations if you know that the SNMP destination entry is no longer needed. This would be the case if you discovered the target server using N1 System Manager on one management server and then decided to not use that management server without deleting the server. You can use the sp delete snmp-destination command on the service processor if you need to delete an entry. Use the delete command with caution because some other management software may need the entry for monitoring. A provisionable server's SNMP destination is deleted, however, when the server is deleted from the N1 System Manager using the delete server command. It is best practice always to use the delete server command when removing a provisionable server.
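
As a rough pre-discovery check, the destination count can be compared against the limit of three. The following sketch assumes that sp get snmp-destinations prints one destination per non-blank line, which this guide does not confirm. The helper name snmp_slots_free and the sp-host address are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes one SNMP destination per non-blank line of output.
# snmp_slots_free is a hypothetical helper, not an N1 System Manager command.
SNMP_LIMIT=3   # limit for Sun Fire V20z and V40z service processors

# Reads the destination list on stdin; succeeds if discovery can add one more.
snmp_slots_free() {
    # grep -c prints 0 and returns nonzero when nothing matches, hence "|| true"
    count=$(grep -c '[^[:space:]]' || true)
    echo "destinations in use: $count of $SNMP_LIMIT"
    [ "$count" -lt "$SNMP_LIMIT" ]
}

# Example (sp-host is a placeholder for the service processor address):
#   ssh admin@sp-host sp get snmp-destinations | snmp_slots_free
```

If the function fails, free an entry with sp delete snmp-destination, or delete the server from the other management server, before you retry discovery.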

Security Problems

This section provides security-based troubleshooting information.

The N1 System Manager uses strong encryption techniques to ensure secure communication between the management server and each managed server.

The keys used by the N1 System Manager are stored under the /etc/opt/sun/cacao/security directory on each server where the servers are running Linux. For servers running the Solaris OS, these keys are stored under the /etc/opt/SUNWcacao/security directory.

Why Regenerate Security Keys?

The security keys used by the N1 System Manager must be identical across all servers. Under normal operation, the security keys can be left in their default configuration. You might have to regenerate the security keys in the following situations:

If there is a risk that the root password of the management server has been exposed or compromised, regenerate the security keys.
If the system date on the management server has been changed using the date command, regenerate the security keys. After such a change, there is a risk that the management server provides no services the next time the N1 System Manager management daemon, n1sminit, is restarted. In this case, regenerate the keys and restart the management daemon, as explained in How to Regenerate Common Agent Container Security Keys.

How to Regenerate Common Agent Container Security Keys

Steps

On the management server as root, stop the N1 System Manager management daemon.
# /etc/init.d/n1sminit stop

Regenerate security keys using the create-keys subcommand.

If the management server is running Linux:
# /opt/sun/cacao/bin/cacaoadm create-keys --force
If the management server is running the Solaris OS:
# /opt/SUNWcacao/bin/cacaoadm create-keys --force

As root on the management server, restart the N1 System Manager management daemon.
# /etc/init.d/n1sminit start
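
The three steps above can be combined into one script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and only the paths and commands shown in the procedure are used. It selects the cacaoadm path from the output of uname -s.

```shell
#!/bin/sh
# Sketch only: wraps the three documented steps. The function names are
# hypothetical; the paths and commands are the ones given in the procedure.

# Print the cacaoadm path for an OS name as reported by `uname -s`.
cacaoadm_path() {
    case "$1" in
        Linux) echo /opt/sun/cacao/bin/cacaoadm ;;
        SunOS) echo /opt/SUNWcacao/bin/cacaoadm ;;
        *)     echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Stop the daemon, regenerate the keys, restart the daemon. Run as root.
regenerate_keys() {
    cacaoadm=$(cacaoadm_path "$(uname -s)") || return 1
    /etc/init.d/n1sminit stop || return 1
    "$cacaoadm" create-keys --force || return 1
    /etc/init.d/n1sminit start
}

# Example: regenerate_keys
```

Note that the sketch leaves the daemon stopped if create-keys fails, so that the failure can be investigated before services resume.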

General Security Considerations

The following list provides general security considerations that you should be aware of when you are using the N1 System Manager:

The Java™ Web Console that is used to launch the N1 System Manager's browser interface uses self-signed certificates. These certificates should be treated with the appropriate level of trust by clients and users.
The terminal emulator applet that is used by the browser interface for the serial console feature does not provide a certificate-based authentication of the applet. The applet also requires that you enable SSHv1 for the management server. For certificate-based authentication or to avoid enabling SSHv1, use the serial console feature by running the connect command from the n1sh shell.
SSH fingerprints that are used to connect from the management server to the provisioning network interfaces on the provisionable servers are automatically acknowledged by the N1 System Manager software. This automation might make the provisionable servers vulnerable to “man-in-the-middle” attacks.
The Web Console (Sun ILOM Web GUI) autologin feature for Sun Fire X4100 and Sun Fire X4200 servers exposes the server's service processor credentials to users who can view the web page source for the Login page. To avoid this security issue, disable the autologin feature by running the n1smconfig utility. See Configuring the N1 System Manager System in Sun N1 System Manager 1.2 Installation and Configuration Guide for details.

Troubleshooting OS Distributions

This section describes scenarios that cause OS deployment to fail and explains how to correct failures.

Distribution Copy Failures

If the creation of an OS distribution fails with a copying files error, check the size of the ISO image and ensure that it is not corrupted. You might see output similar to the following in the job details:

bash-3.00# /opt/sun/n1gc/bin/n1sh show job 25
Job ID:   25
Date:     2005-07-20T14:28:43-0600
Type:     Create OS Distribution
Status:   Error (2005-07-20T14:29:08-0600)
Command:  create os RedHat file /images/rhel-3-U4-i386-es-disc1.iso
Owner:    root
Errors:   1
Warnings: 0

Steps
ID     Type             Start                      Completion                 Result
1      Acquire Host     2005-07-20T14:28:43-0600   2005-07-20T14:28:43-0600   Completed
2      Run Command      2005-07-20T14:28:43-0600   2005-07-20T14:28:43-0600   Completed
3      Acquire Host     2005-07-20T14:28:46-0600   2005-07-20T14:28:46-0600   Completed
4      Run Command      2005-07-20T14:28:46-0600   2005-07-20T14:29:06-0600   Error 1

Errors
Error 1:
Description: INFO   : Mounting /images/rhel-3-U4-i386-es-disc1.iso at /mnt/loop23308
INFO   : Version is 3ES, disc is 1
INFO   : Version is 3ES, disc is 1
INFO   : type redhat ver: 3ES
cp: /var/opt/SUNWscs/data/allstart/image/3ES-bootdisk.img: Bad address
INFO   : Could not copy PXE file bootdisk.img
INFO   : umount_exit: mnt is: /mnt/loop23308
INFO   : ERROR: Could not add floppy to the Distro

Results
Result 1:
Server:   -
Status:   -1
Message:  Creating OS rh30u4-es failed.

In the above case, try copying a different set of distribution files to the management server. See To Copy an OS Distribution From CDs or a DVD or To Copy an OS Distribution From ISO Files.
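
Before you copy a new set of files, a quick size check can rule out a truncated image. The following sketch assumes that you know the correct size in bytes from a trusted source, such as the vendor's download page. The helper name check_iso_size is hypothetical, and the byte count in the example is a placeholder.

```shell
#!/bin/sh
# Sketch only: compare an ISO image against a known-good size in bytes.
# check_iso_size is a hypothetical helper; the expected size must come from
# a trusted source such as the vendor's download page.
check_iso_size() {
    iso=$1
    expected=$2
    [ -r "$iso" ] || { echo "cannot read $iso" >&2; return 1; }
    # tr strips the padding that some wc implementations add
    actual=$(wc -c < "$iso" | tr -d ' ')
    if [ "$actual" -ne "$expected" ]; then
        echo "$iso: size $actual bytes, expected $expected" >&2
        return 1
    fi
    echo "$iso: size OK ($actual bytes)"
}

# Example (the byte count is a placeholder, not the real image size):
#   check_iso_size /images/rhel-3-U4-i386-es-disc1.iso 123456789
```

A matching size does not rule out corruption. Comparing a checksum, for example the output of md5sum on a Linux management server, against the value published by the vendor is a stronger check.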

Mount Point Issues

Distribution copy failures might also occur if there are file systems on the /mnt mount point. Move all file systems off of the /mnt mount point before attempting create os command operations.
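
To see what is currently mounted on /mnt or below it, you can filter the mount table. The following sketch assumes a Linux management server, where /proc/mounts lists the mount point in the second field. The helper name mounts_under_mnt is hypothetical.

```shell
#!/bin/sh
# Sketch only: list file systems mounted on /mnt or below it.
# Assumes a mount table in Linux /proc/mounts format (mount point in field 2);
# mounts_under_mnt is a hypothetical helper.
mounts_under_mnt() {
    awk '$2 == "/mnt" || index($2, "/mnt/") == 1 { print $2 }'
}

# Example on a Linux management server:
#   mounts_under_mnt < /proc/mounts
# Any output names a mount point to move before running create os.
```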

Patching Solaris 9 Distributions

The inability to deploy Solaris 9 OS distributions to servers from a Linux management server is usually due to a problem with NFS mounts. To solve this problem, you need to apply a patch to the mini-root of the Solaris 9 OS distribution. This section provides instructions for applying the required patches. The instructions differ according to the management and patch server configuration scenarios in the following table.

Table 6–1 Task Map for Patching a Solaris 9 Distribution


Management Server	Patch Server	Task
Red Hat 3.0 u2	Solaris 9 OS on x86 platform	To Patch a Solaris 9 OS Distribution by Using a Solaris 9 OS on x86 Patch Server
Red Hat 3.0 u2	Solaris 9 OS on SPARC platform	To Patch a Solaris 9 OS Distribution by Using a Solaris 9 OS on SPARC Patch Server

Using a Provisionable Server to Patch OS Distributions

When you are using a patch server to perform the following tasks, you need to have root access to both the management server and the provisionable server at once. For some tasks, you need to first patch the provisionable server, then mount the management server and patch the distribution.

To Patch a Solaris 9 OS Distribution by Using a Solaris 9 OS on x86 Patch Server

This procedure describes how to patch a Solaris 9 OS distribution in the N1 System Manager. The steps in this procedure need to be performed on both the patch server and the management server. The patches described are necessary for the N1 System Manager to be able to provision Solaris OS 9 update 7 and below. This procedure is not required for Solaris OS 9 update 8 and above.

Consider opening two terminal windows to complete the steps. The following steps first guide you through patching the patch server and then provide steps for patching the distribution.

Before You Begin

Create a Solaris 9 OS distribution on the management server. See To Copy an OS Distribution From CDs or a DVD or To Copy an OS Distribution From ISO Files. Type show os os-name at the command line to view the ID of the OS distribution. This number is used in place of DISTRO_ID in the instructions.
Install the Solaris 9 OS on x86 platform software on a machine that is not the management server.
Create a /patch directory on the Solaris 9 x86 patch server.
For a Solaris OS on x86 distribution, download and unzip the following patches into the /patch directory on the Solaris 9 OS on x86 patch server: 117172-17 and 117468-02. You can access these patches from http://sunsolve.sun.com.
For a Solaris OS on SPARC distribution, download and unzip the following patches into the /patch directory on the Solaris 9 OS on x86 patch server: 117171-17, 117175-02, and 113318-20. You can also access these patches from http://sunsolve.sun.com.

Steps

Patch the Solaris 9 OS on x86 patch server.
1. Log in as root.
  % su
  password: password
  The root prompt appears.
2. Reboot the Solaris 9 patch server to single-user mode.
  # reboot -- -s
3. In single-user mode, change to the patch directory.
  # cd /patch
4. Install the patches.
  # patchadd -M . 117172-17
  # patchadd -M . 117468-02
  Tip –
  Pressing Control+D returns you to multiuser mode.

Prepare to patch the distribution on the management server.
1. Log in to the management server as root.
  % su
  password: password
  The root prompt appears.
2. Edit the /etc/exports file.
  # vi /etc/exports
3. Change /js *(ro,no_root_squash) to /js *(rw,no_root_squash).
4. Save and close the /etc/exports file.
5. Restart NFS.
  # /etc/init.d/nfs restart
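
The edit to /etc/exports in this step, and its reversal later in the procedure, can also be made with sed instead of an interactive editor. The following sketch assumes that the export line has exactly the form shown above. The helper name set_js_export_mode is hypothetical.

```shell
#!/bin/sh
# Sketch only: rewrite the /js export line between ro and rw.
# set_js_export_mode is a hypothetical helper; it assumes the line has exactly
# the form shown in the procedure: /js *(ro,no_root_squash)
set_js_export_mode() {
    mode=$1
    case "$mode" in
        ro|rw) ;;
        *) echo "mode must be ro or rw" >&2; return 1 ;;
    esac
    # Other export lines pass through unchanged.
    sed "s|^/js \*(r[ow],no_root_squash)|/js *($mode,no_root_squash)|"
}

# Example: write the edited copy to a scratch file and review it first
#   set_js_export_mode rw < /etc/exports > /tmp/exports.new
```

Writing to a scratch file first lets you compare the result with the original before you replace /etc/exports and restart NFS.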

Patch the distribution that you copied to the management server.
1. Log in to the Solaris 9 patch server as root.
  % su
  password: password
  The root prompt appears.
2. Mount the management server.
  # mount -o rw management-server-IP:/js/DISTRO_ID /mnt
3. Install the patches by performing one of the following actions:
  - If you are patching an x86 distribution, type the following commands:
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117172-17
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117468-02
  - If you are patching a SPARC distribution, type the following commands:
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117171-17
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117175-02
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 113318-20
    Note –
    You will receive a partial error for the first patch installation. Ignore this error.
4. Unmount the management server.
  # umount /mnt

Restart NFS on the management server.
1. Edit the /etc/exports file.
  # vi /etc/exports
2. Change /js *(rw,no_root_squash) to /js *(ro,no_root_squash).
3. Restart NFS.
  # /etc/init.d/nfs restart
  NFS is restarted.
  
  The Solaris 9 OS on SPARC distribution is ready for deployment to target servers.

Fix the Solaris 9 OS on x86 distribution.
1. Change to /js/<distro_id>/Solaris_9/Tools/Boot/boot/solaris.
  # cd /js/<distro_id>/Solaris_9/Tools/Boot/boot/solaris
2. Re-create the bootenv.rc link.
  # ln -s ../../tmp/root/boot/solaris/bootenv.rc .
  The Solaris 9 OS on x86 distribution is ready for deployment to target servers.

Troubleshooting

If you want to patch another distribution, you might have to delete the /patch/117172-17 directory and re-create it using the unzip 117172-17.zip command. When the first distribution is patched, the patchadd command makes a change to the directory that causes problems with the next patchadd command execution.

This patch is not needed for Solaris 9 update 8 build 5 and later. Solaris OS releases starting with Solaris 9 9/05 (s9x_u8wos_05) therefore do not require this patch.

To Patch a Solaris 9 OS Distribution by Using a Solaris 9 OS on SPARC Patch Server

This procedure describes how to patch a Solaris 9 OS distribution in the N1 System Manager. The steps in this procedure need to be performed on the provisionable server and the management server. Consider opening two terminal windows to complete the steps. The following steps first guide you through patching the provisionable server and then provide steps for patching the distribution.

Before You Begin

Create a Solaris 9 OS distribution on the management server. See To Copy an OS Distribution From CDs or a DVD or To Copy an OS Distribution From ISO Files. Type show os os-name at the command line to view the ID of the OS distribution. This number is used in place of DISTRO_ID in the instructions.
Install the Solaris 9 OS on SPARC software on a machine that is not the management server. See To Load an OS Profile on a Server or a Server Group.
Create a /patch directory on the Solaris 9 SPARC patch server.
For a Solaris OS on x86 distribution, download and unzip the following patches into the /patch directory on the Solaris 9 OS on SPARC patch server: 117172-17 and 117468-02. You can access these patches from http://sunsolve.sun.com.
For a Solaris OS on SPARC distribution, download and unzip the following patches into the /patch directory on the Solaris 9 OS on SPARC patch server: 117171-17, 117175-02, and 113318-20. You can also access these patches from http://sunsolve.sun.com.

Steps

Set up and patch the Solaris 9 OS on SPARC machine.
1. Log in to the Solaris 9 machine as root.
  % su
  password: password
2. Reboot the Solaris 9 machine to single-user mode.
  # reboot -- -s
3. In single-user mode, change to the patch directory.
  # cd /patch
4. Install the patches.
  # patchadd -M . 117171-17
  # patchadd -M . 117175-02
  # patchadd -M . 113318-20
  Tip –
  Pressing Control+D returns you to multiuser mode.

Patch the distribution that you copied to the management server.
1. Log in to the Solaris 9 machine as root.
  % su
  password: password
2. Mount the management server.
  # mount -o rw management-server-IP:/js/DISTRO_ID /mnt
3. Install the patches by performing one of the following actions:
  - If you are patching a Solaris OS on x86 software distribution, type the following commands:
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117172-17
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117468-02
  - If you are patching a Solaris OS on SPARC software distribution, type the following commands:
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117171-17
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 117175-02
    # patchadd -C /mnt/Solaris_9/Tools/Boot/ -M /patch 113318-20
    Note –
    You will receive a partial error for the first patch installation. Ignore this error.
4. Unmount the management server.
  # umount /mnt

Restart NFS on the management server.
1. Edit the /etc/exports file.
  # vi /etc/exports
2. Change /js *(rw,no_root_squash) to /js *(ro,no_root_squash).
3. Restart NFS.
  # /etc/init.d/nfs restart
  NFS is restarted.
  
  The Solaris 9 OS on SPARC distribution is ready for deployment to target servers.

Fix the Solaris 9 OS on x86 distribution.
1. Change to /js/<distro_id>/Solaris_9/Tools/Boot/boot/solaris.
  # cd /js/<distro_id>/Solaris_9/Tools/Boot/boot/solaris
2. Re-create the bootenv.rc link.
  # ln -s ../../tmp/root/boot/solaris/bootenv.rc .
  The Solaris 9 OS on x86 distribution is ready for deployment to target servers.

Troubleshooting

If you want to patch another distribution, you might have to delete the /patch/117172-17 directory and re-create it using the unzip 117172-17.zip command. When the first distribution is patched, the patchadd command makes a change to the directory that causes problems with the next patchadd command execution.

OS Profile Deployment Failures

Chapter 2, Sun N1 System Manager System and Network Preparation, in Sun N1 System Manager 1.2 Site Preparation Guide recommends that the OS provisioning network be isolated. This is mainly due to the use of DHCP on the network and due to the high bandwidth consumed by provisioning operations.

Because DHCP is a broadcast protocol, DHCP servers on the same network can conflict with one another. OS monitoring is also performed on the provisioning network, and it can consume significant network bandwidth in larger configurations.

It is also recommended that the management network hosts the hardware monitoring and management capabilities. However, if your business needs require that the networks be unified and you can configure your network to deal with the DHCP and bandwidth considerations outlined above, your site might not need to isolate the networks.

OS profile deployments might fail or fail to complete if any of the following conditions occur:

Partitions are not modified to suit a Sun Fire V40z or SPARC V440 server. See To Modify the Default Solaris OS Profile for a Sun Fire V40z or a SPARC v440 Server.
Scripts are not modified to install the driver needed to recognize the Ethernet interface on a Sun Fire V20z server. See To Modify a Solaris 9 OS Profile for a Sun Fire V20z Server With a K2.0 Motherboard.
DHCP is not correctly configured. See Solaris Deployment Job Times Out or Stops.
OS profile installs only the Solaris Core System Support distribution group. See Solaris OS Profile Installation Fails.
The target server cannot access DHCP information or mount distribution directories. See Invalid Management Server Netmask.
The management server cannot access files during a Load OS operation. See Restarting NFS to Resolve Boot Failed Errors.
The Linux deployment stops. See Linux Deployment Stops.
The Red Hat deployment fails. See Red Hat OS Profile Deployment Failures.

Use the following graphic as a guide to troubleshooting best practices. The graphic describes steps to take when you initiate provisioning operations. Taking these steps will help you troubleshoot deployments with greater efficiency.

[Figure: troubleshooting steps to take when initiating deployments.]

To Modify the Default Solaris OS Profile for a Sun Fire V40z or a SPARC v440 Server

This procedure describes how to modify the Solaris OS profile that is created by default. The following modification is required for successful installation of the default Solaris OS profile on a Sun Fire V40z or a SPARC v440 server.

Steps

Clone the default profile.

N1-ok> create osprofile sol10v40z clone sol10

Remove the root partition.

N1-ok> remove osprofile sol10v40z partition /

Remove the swap partition.

N1-ok> remove osprofile sol10v40z partition swap

Add new root parameters.

N1-ok> add osprofile sol10v40z partition / device c1t0d0s0 sizeoption free type ufs

Add new swap parameters.

N1-ok> add osprofile sol10v40z partition swap device c1t0d0s1 size 2000 type swap sizeoption fixed

To Modify a Solaris 9 OS Profile for a Sun Fire V20z Server With a K2.0 Motherboard

This procedure describes how to create and add a script to your Solaris OS profile. This script installs the Broadcom 5704 NIC driver needed for Solaris 9 x86 to recognize the NIC Ethernet interface on a Sun Fire V20z server with a K2.0 motherboard. Earlier versions of the Sun Fire V20z server use the K1.0 motherboard. Newer versions use the K2.0 motherboard.

Note –

This patch is needed for K2.0 motherboards but can also be used on K1.0 motherboards without negative consequences.

Steps

Type the following command:
% /opt/sun/n1gc/bin/n1sh show os
The list of available OS distributions appears.

Note down the ID for the Solaris 9 distribution.

You use this ID, which is in fact the DISTRO_ID of the OS, in the next step.

Type the following command:
# mkdir /js/DISTRO_ID/patch
Here, DISTRO_ID is the ID you noted previously. A patch directory is created for the Solaris 9 distribution.

Download the 116666-04 patch from http://sunsolve.sun.com to the /js/DISTRO_ID/patch directory.

Change to the /js/DISTRO_ID/patch directory.
# cd /js/DISTRO_ID/patch

Unzip the patch file.
# unzip 116666-04.zip

Type the following command:
# mkdir /js/scripts

In the /js/scripts directory, create a script called patch_sol9_k2.sh that includes the following three lines:
#!/bin/sh
echo "Adding patch for bge devices."
patchadd -R /a -M /cdrom/patch 116666-04
Note –
Ensure the script is executable. You can use the chmod 775 patch_sol9_k2.sh command.

Add the script to the Solaris 9 OS profile.

N1-ok> add osprofile osprofile script /js/scripts/patch_sol9_k2.sh type post

Example 6–1 Adding a Script to a Solaris OS Profile

This example shows how to add a script to an OS profile. The type attribute specifies that the script is to be run after the installation.

N1-ok> add osprofile sol9K2 script /js/scripts/patch_sol9_k2.sh type post

Next Steps

To load the modified Solaris OS profile, see To Load an OS Profile on a Server or a Server Group.

Solaris Deployment Job Times Out or Stops

If you attempt to load a Solaris OS profile and the OS Deploy job times out or stops, check the output in the job details to ensure that the target server completed a PXE boot. For example:

PXE-M0F: Exiting Broadcom PXE ROM.
      Broadcom UNDI PXE-2.1 v7.5.14
     Copyright (C) 2000-2004 Broadcom Corporation
     Copyright (C) 1997-2000 Intel Corporation
     All rights reserved.      
CLIENT MAC ADDR: 00 09 3D 00 A5 FC  GUID: 68D3BE2E 6D5D 11D8 BA9A 0060B0B36963
     DHCP.

If the PXE boot fails, the /etc/dhcpd.conf file on the management server might not have been set up correctly by the N1 System Manager.

Note –

The best diagnostic tool is to open a console window on the target machine and then run the deployment. See To Open a Server's Serial Console.

If you suspect that the /etc/dhcpd.conf file was configured incorrectly, complete the following procedure to modify the configuration.

To Modify the Network Interface Configuration

Steps

Inspect the dhcpd.conf file for errors.
# vi /etc/dhcpd.conf

If errors exist that need to be corrected, run the following command:
# /usr/bin/n1smconfig
The n1smconfig utility appears.

Modify the provisioning network interface configuration.

See Configuring the N1 System Manager System in Sun N1 System Manager 1.2 Installation and Configuration Guide for detailed instructions.

Load the OS profile on the target server.
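
When you inspect dhcpd.conf, one thing to look for is the MAC address of the target server as reported in the PXE output. The following sketch assumes standard ISC dhcpd syntax (hardware ethernet followed by a colon-separated address). Whether the N1 System Manager writes per-host entries in this form is an assumption, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: check whether a client MAC address appears in dhcpd.conf text.
# Assumes ISC dhcpd syntax ("hardware ethernet aa:bb:cc:dd:ee:ff;"); whether
# N1 System Manager writes per-host entries this way is an assumption.
# Both function names are hypothetical.

# Convert "00 09 3D 00 A5 FC" (PXE banner style) to "00:09:3d:00:a5:fc".
normalize_mac() {
    echo "$1" | tr 'A-F' 'a-f' | tr ' ' ':'
}

# Reads dhcpd.conf on stdin; succeeds if the MAC address is present.
mac_in_dhcpd_conf() {
    mac=$(normalize_mac "$1")
    tr 'A-F' 'a-f' | grep 'hardware ethernet' | grep -q "$mac"
}

# Example:
#   mac_in_dhcpd_conf '00 09 3D 00 A5 FC' < /etc/dhcpd.conf && echo "client is listed"
```

A missing address does not by itself prove a misconfiguration, but it is a reasonable prompt to rerun n1smconfig as described above.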

Solaris OS Profile Installation Fails

OS profiles that install only the Core System Support distribution group do not load successfully. Specify “Entire Distribution plus OEM Support” as the value for the distributiongroup parameter. Doing so configures a profile that will install the needed version of SSH and other tools that are required for servers to be managed by the N1 System Manager.

Invalid Management Server Netmask

If the target server cannot access DHCP information or mount the distribution directories on the management server during a Solaris 10 deployment, you might have network problems caused by an invalid netmask. The console output might be similar to the following:

Booting kernel/unix...
  krtld: Unused kernel arguments: `install'.
  SunOS Release 5.10 Version Generic 32-bit
  Copyright 1983-2005 Sun Microsystems, Inc.  All rights reserved.
  Use is subject to license terms.
  Unsupported Tavor FW version: expected: 0003.0001.0000, actual: 0002.0000.0000
  NOTICE: tavor0: driver attached (for maintenance mode only)
  Configuring devices.
  Using DHCP for network configuration information.
  Beginning system identification...
  Searching for configuration file(s)...
  Using sysid configuration file /sysidcfg
  Search complete.
  Discovering additional network configuration...
  Completing system identification...
  Starting remote procedure call (RPC) services: done.
  System identification complete.
  Starting Solaris installation program...
  Searching for JumpStart directory...
  /sbin/dhcpinfo: primary interface requested but no primary interface is set
  not found
  Warning: Could not find matching rule in rules.ok
  Press the return key for an interactive Solaris install program...

To fix the problem, set the management server netmask value to 255.255.255.0. See To Configure the N1 System Manager System in Sun N1 System Manager 1.2 Installation and Configuration Guide.

Linux Deployment Stops

If you are deploying a Linux OS and the deployment stops, check the console of the target server to see if the installer is in interactive mode. If the installer is in interactive mode, the deployment timed out because of a delay in the transmission of data from the management server to the target server. This delay usually occurs because the switch or switches connecting the two machines have spanning tree enabled. Either turn off spanning tree on the switch or disable spanning tree for the ports that are connected to the management server and the target server.

If spanning tree is already disabled and OS deployment stops, there may be a problem with your network.

Note –

For Red Hat installations to work with some networking configurations, you must enable spanning tree.

Red Hat OS Profile Deployment Failures

Building Red Hat OS profiles on the N1 System Manager might require additional analysis to avoid failures. If you have a problem with a custom OS profile, perform the following steps while the problem deployment is still active.

Log into the management server as root.

Run the following command:

# cat /var/opt/sun/scs/share/allstart/config/ks*cfg > failed_ks_cfg

The failed_ks_cfg file will contain all of the KickStart parameters, including those that you customized. Verify that the parameters stated in the configuration file are appropriate for the current hardware configuration. Correct any errors and try the deployment again.

OS Deployment Fails on V20z or V40z With `internal error` Message

If OS deployment fails on a V20z or a V40z with the internal error occurred message provided in the job results, direct the platform console output to the service processor. If the platform console output cannot simply be directed to the service processor, reboot the service processor. To reboot the service processor, log on to the service processor and run the sp reboot command.

To check the console output, log on to the service processor, and run the platform console command. Examine the output during OS deployment to resolve the problem.

Restarting NFS to Resolve `Boot Failed` Errors

Error: boot: lookup /js/4/Solaris_10/Tools/Boot failed
boot: cannot open kernel/sparcv9/unix

Solution:

The message differs depending on the OS that is being deployed. If the management server cannot access files during a Load OS operation, the failure might be caused by a network problem. Restarting NFS might correct the problem.

On a Solaris system, type the following:

# /etc/init.d/nfs.server stop
# /etc/init.d/nfs.server start

On a Linux system, type the following:

# /etc/init.d/nfs restart

Resolving Command Failures Related to OS Monitoring

Adding the OS monitoring feature might fail due to stale SSH entries on the management server. If the add server server-name feature osmonitor agentip command fails and no true security breach has occurred, remove the /root/.ssh/known_hosts file or the specific entry in the file that corresponds to the provisionable server. Then, retry the add command.
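
Removing only the stale entry keeps the fingerprints of all other servers intact. The following sketch filters one host out of known_hosts text. It assumes unhashed entries, where the host names or IP addresses appear in plain text in the first field. The helper name remove_known_host is hypothetical.

```shell
#!/bin/sh
# Sketch only: drop the known_hosts entries for one host instead of deleting
# the whole file. Assumes unhashed entries (host names or IP addresses in
# plain text in field 1); remove_known_host is a hypothetical helper.
remove_known_host() {
    awk -v host="$1" '{
        # Field 1 can hold several comma-separated names for the same key.
        n = split($1, names, ",")
        keep = 1
        for (i = 1; i <= n; i++)
            if (names[i] == host) keep = 0
        if (keep) print
    }'
}

# Example (review the result before replacing the original file):
#   remove_known_host 192.168.2.10 < /root/.ssh/known_hosts > /tmp/known_hosts.new
```

On systems with a recent OpenSSH, ssh-keygen -R host performs the same removal and also handles hashed entries.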

Additionally, adding the OS monitoring feature to a server that has the base management feature might fail. The following job output shows the error:

Repeat attempts for this operation are not allowed.

This error indicates that SSH credentials have previously been supplied and cannot be altered. To avoid this error, issue the add server feature osmonitor command without agentssh credentials. See To Add the OS Monitoring Feature for instructions.

N1-ok> show job 61
Job ID: 61
Date: 2005-08-16T16:14:27-0400
Type: Modify OS Monitoring Support
Status: Error (2005-08-16T16:14:38-0400)
Command: add server 192.168.2.10 feature osmonitor agentssh root/rootpasswd
Owner: root
Errors: 1
Warnings: 0

Steps
ID Type Start Completion Result
1 Acquire Host 2005-08-16T16:14:27-0400 2005-08-16T16:14:28-0400 Completed
2 Run Command 2005-08-16T16:14:28-0400 2005-08-16T16:14:28-0400 Completed
3 Acquire Host 2005-08-16T16:14:29-0400 2005-08-16T16:14:30-0400 Completed
4 Run Command 2005-08-16T16:14:30-0400 2005-08-16T16:14:36-0400 Error

Results
Result 1:
Server: 192.168.2.10
Status: -3
Message: Repeat attempts for this operation are not allowed.

Checking for OS Monitoring Agents

If you tried to install the OS monitoring agents as described in To Add the OS Monitoring Feature and OS monitoring data did not appear, verify that the OS monitoring feature was installed, as follows:

It can take 5-7 minutes before all OS monitoring data is fully initialized. You may see that CPU idle is at 0.0%, which causes a Failed Critical status with OS usage. This should clear up within 5-7 minutes after adding or upgrading the OS monitoring feature. At that point, OS monitoring data should be available for the provisionable server by using the show server server command.

Use the grep command to check whether the agents themselves were successfully installed.

To verify the Solaris feature, type the following commands:

# pkginfo |grep n1sm 

sparc:   SUNWn1smsparcag-1-2
solx86:  SUNWn1smx86ag-1-2
# ps -ef |grep -i esd
root 23817     1  0 19:57:59 ?       0:01 esd - init agent -dir /var/opt/SUNWsymon -q

To verify the Linux feature, type the following commands:

# rpm -qa | grep n1sm-linux-agent

# ps -ef | grep -i esd
root 1940 1 0 Jan28 ? 00:00:14 esd - init agent -dir /var/opt/SUNWsymon -q
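
The process check can be scripted so that it does not match the grep command itself. The following sketch uses the process string from the sample ps output above. The helper name esd_agent_running is hypothetical.

```shell
#!/bin/sh
# Sketch only: decide from `ps -ef` output whether the OS monitoring agent
# process is running. esd_agent_running is a hypothetical helper; the match
# string comes from the sample ps output above.
esd_agent_running() {
    # The second grep discards any grep process that ps happened to list.
    grep 'esd - init agent' | grep -v grep >/dev/null
}

# Example:
#   ps -ef | esd_agent_running && echo "agent process found"
```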

OS Update Problems

This section describes possible solutions for the following troubleshooting scenarios:

OS Update Creation Failures

The name that is specified when you create a new OS update must be unique. The OS update to be created also needs to be unique. That is, in addition to the uniqueness of the file name for each OS update, the combination of the internal package name, version, release, and file name also needs to be unique.

For example, if test1.rpm is the source for an RPM named test1, another OS update called test2 cannot have the same file name as test1.rpm. To avoid additional naming issues, do not name an OS update with the same name as the internal package name for any other existing packages on the provisionable server.

You can specify an adminfile value when you create an OS update. For the Solaris OS update packages, a default admin file is located at /opt/sun/n1gc/etc/admin.

mail=
instance=unique
partial=nocheck
runlevel=nocheck
idepend=nocheck
rdepend=nocheck
space=quit
setuid=nocheck
conflict=nocheck
action=nocheck
basedir=default
authentication=nocheck

If you use an adminfile to install an OS update, ensure that the package file name matches the name of the package. If the file name does not match that of the package, and an adminfile is used to install the OS update, uninstallation will fail. See OS Update Uninstallation Failures.

The default admin file setting used for Solaris package deployments in the N1 System Manager is instance=unique. If you want to report errors for duplicated packages, change the admin file setting to instance=quit. This change causes an error to appear in the Load Update job results if a duplicate package is detected.
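
One way to make this change without touching the default file is to generate a modified copy and specify that copy as the adminfile value. The following is a minimal sketch. The helper name set_instance_quit and the output path are hypothetical.

```shell
#!/bin/sh
# Sketch only: emit a copy of an admin file with instance=quit so that
# duplicate packages are reported as errors. set_instance_quit is a
# hypothetical helper; it leaves every other parameter unchanged.
set_instance_quit() {
    sed 's/^\([[:space:]]*\)instance=.*/\1instance=quit/'
}

# Example: create a custom admin file rather than editing the default one
#   set_instance_quit < /opt/sun/n1gc/etc/admin > /tmp/admin.quit
```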

See the admin(4) man page for detailed information about admin file parameter settings. Type man -s4 admin as root user on a Solaris system to view the man page.

For Solaris packages, a response file might also be needed. For instructions on how to specify an admin file and a response file when you create an OS update, see To Copy an OS Update.

Solaris OS Update Deployment Failures

This section describes troubleshooting scenarios and possible solutions for the following categories of failures during Solaris OS update deployment:

Failures that occur before the job is submitted
Load Update job failures
Unload Update job failures
Stop Job failures for Load Update

In the following unload command, the update could be either the update name in the list that appears when you type show update all list, or the update could be the actual package name on the target server.

N1-ok> unload server server update update

Always check that the package targets the correct architecture.

Note –

The N1 System Manager does not distinguish 32-bit from 64-bit for the Solaris (x86 or SPARC) OS, so the package or patch might not install successfully if it is installed on an incompatible OS.

If the package or patch does install successfully, but performance decreases, check that the architecture of the patch matches the architecture of the OS.

The following are common failures that can occur before the job is submitted:

Target server is not initialized

Solution:

Check that the add server feature osmonitor command was issued and that it succeeded.

Another running job on the target server

Solution:

Only one job is allowed at a time on a server. Try again after the job completes.

Update is incompatible with operating system on target server

Solution:

Check that the OS type of the target server matches one of the update OS types. Type show update update-name at the N1-ok> prompt to view the OS type for the update.

Target server is not in a good state or is powered off

Solution:

Check that the target server is up and running. Type show server server-name at the N1-ok> prompt to view the server status. Type reset server server-name force to force a reboot.

The following are possible causes for Load Update job failures:

Load Update jobs sometimes fail because the same package, or a higher version of the package, already exists on the target server. If a job fails, check whether the package is already installed on the target server.

error: Failed dependencies:

A prerequisite package should be installed.

Solution:

For a Solaris system, configure the idepend= parameter in the admin file.

Preinstall or postinstall scripts failure: Non-zero status

pkgadd: ERROR: ... script did not complete successfully

Solution:

Check the preinstallation or postinstallation scripts for errors.

Interactive request script supplied by package

Solution:

This message indicates that the response file is missing or that the setting in the admin file is incorrect. Add a response file to correct this error.

patch-name was installed without backing up the original files

Solution:

This message indicates that the Solaris OS update was installed without backing up the original file. No action needs to be taken.

Insufficient diskspace

Solution:

Load Update jobs might fail due to insufficient disk space. Check the available disk space by typing df -k. Also check the package size. If the package size is too large, create more available disk space on the target server.
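
The disk space check described above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names avail_kb and fits are hypothetical, and the script assumes a df command that supports the POSIX -P option, which keeps each file system on one output line. The size of the package file is only a lower bound on the space that the installed package needs.

```shell
#!/bin/sh
# avail_kb DIR: print the free space, in kilobytes, of the file system
# that holds DIR. With -P, the value is in column 4 of the second line.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# fits PKG_FILE DIR: succeed if the size of PKG_FILE, in kilobytes,
# is smaller than the free space under DIR.
fits() {
    pkg_kb=$(du -sk "$1" | awk '{ print $1 }')
    [ "$pkg_kb" -lt "$(avail_kb "$2")" ]
}

# Example (file and directory names are assumptions):
# fits ./FOOi386pkg /var || echo "not enough free space under /var"
```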

The following are stop job failures for loading or unloading update operations:

If you stop a Load Update or Unload Update job and the job does not stop, check for the following processes on the management server and kill any that are still running:

# ps -ef | grep swi_pkg_pusher
# ps -ef | egrep 'pkgadd|pkgrm|scp'

Then, check for related processes that are still running on the provisionable server:

# ps -ef | egrep 'pkgadd|pkgrm'

Extend the search pattern if other related processes are involved.

The following are common failures for Unload Server and Unload Group jobs:

The rest of this section provides errors and possible solutions for failures related to the following commands: unload server server-name update update-name and unload group group-name update update-name.

Removal of <SUNWssmu> was suspended (interaction required)

Solution:

This message indicates a failed dependency for uninstalling a Solaris package. Check the admin file setting and provide an appropriate response file.

Job step failure without error details

Solution:

This message might indicate that the job was not successfully started internally. Contact a Sun Service Representative for more information.

Job step failure with vague error details: Connection to 10.0.0.xx

Solution:

This message might indicate that the uninstallation failed because some packages were not fully installed. In this case, manually install the package in question on the target server. For example:

To manually install a .pkg file, type the following command:

# pkgadd -d pkg-name -a admin-file

To manually install a patch, type the following command:

# patchadd -d patch-name -a admin-file

Then, run the unload command again.

Job hangs

Solution:

If the job appears to hang, stop the job and manually kill the remaining processes. For example:

To stop the job, type the following command:

# n1sh stop job job-ID

Then find the PID of the pkgadd process and kill the process by typing the following commands:

# ps -ef | grep pkgadd
# kill pkgadd-PID

Then run the unload command again.

Linux OS Update Deployment Failures

This section describes troubleshooting scenarios and possible solutions for the following categories of failures during Linux OS update deployment:

Failures that occur before the job is submitted
Load Update job failures
Unload Update job failures
Stop Job failures for Load Update

N1-ok> load server server update update

The following are common failures that can occur before the job is submitted:

Target server is not initialized

Solution:

Check that the add server feature osmonitor command was issued and that it succeeded.

Another running job on the target server

Solution:

Only one job is allowed at a time on a server. Try again after the job completes.

Update is incompatible with operating system on target server

Solution:

Check that the OS type of the target server matches one of the update OS types. Type show update update-name at the N1-ok> prompt to view the OS type for the update.

Target server is not in a good state or is powered off

Solution:

Check that the target server is up and running. Type show server server-name at the N1-ok> prompt to view the server status. Type reset server server-name force to force a reboot.

The following are possible causes for Load Update job failures:

error: Failed dependencies:

A prerequisite package should be installed

Solution:

Use an RPM tool to address and resolve Linux RPM dependencies.

Preinstall or postinstall scripts failure: Non-zero status

ERROR: ... script did not complete successfully

Solution:

Check the preinstallation or postinstallation scripts for errors.

Insufficient diskspace

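Solution:

Load Update jobs might fail due to insufficient disk space. Check the available disk space by typing df -k. Also check the package size. If the package is too large, free more disk space on the target server.
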
The following are stop job failures for loading or unloading update operations:

If you stop a Load Update or Unload Update job and the job does not stop, check for the following processes on the management server and kill any that are still running:

# ps -ef | grep swi_pkg_pusher
# ps -ef | grep rpm

Then, check for related processes that are still running on the provisionable server:

# ps -ef | grep rpm

The following are common failures for Unload Server and Unload Group jobs:

Job step failure without error details

Solution:

This message might indicate that the job was not successfully started internally. Contact a Sun Service Representative for more information.

Job step failure with vague error details: Connection to 10.0.0.xx

Solution:

This message might indicate that the uninstallation failed because some RPMs were not fully installed. In this case, manually install the package in question on the target server. For example:

To manually install an RPM, type the following command:

# rpm -Uvh rpm-name

Then, run the unload command again.

Job hangs

Solution:

If the job appears to hang, stop the job and manually kill the remaining processes. For example:

To stop the job, type the following command:

# n1sh stop job job-ID

Then find the PID of the rpm process and kill the process by typing the following commands:

# ps -ef | grep rpm-name
# kill rpm-PID

Then run the unload command again.

OS Update Uninstallation Failures

If you cannot uninstall an OS update that was installed with an adminfile, check that the package file name matches the name of the package. To check the package name:

bash-2.05# ls FOOi386pkg
FOOi386pkg
bash-2.05# pkginfo -d ./FOOi386pkg
application FOOi386pkg     FOO Package for Testing
bash-2.05# pkginfo -d ./FOOi386pkg | /usr/bin/awk '{print $2}'
FOOi386pkg

bash-2.05# cp FOOi386pkg Foopackage
bash-2.05# pkginfo -d ./Foopackage
application FOOi386pkg     FOO Package for Testing
bash-2.05# pkginfo -d ./Foopackage | /usr/bin/awk '{print $2}'
FOOi386pkg
bash-2.05#

If the name is not the same, rename the adminfile in the provisionable server's /tmp directory to match the name of the package and try the unload command again. If the package still exists, remove it from the provisionable server by using pkgrm.
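
The comparison shown in the session above can be wrapped in two small shell functions. This is only a sketch: the function names are hypothetical, and the parsing assumes that the package name is the second field of the first line of the pkginfo -d output, as in the sample session. In that session, the file FOOi386pkg matches its package name, whereas the copy named Foopackage does not.

```shell
#!/bin/sh
# pkg_name_from_pkginfo: read `pkginfo -d FILE` output on standard input
# and print the package name (second field of the first line).
pkg_name_from_pkginfo() {
    awk 'NR == 1 { print $2 }'
}

# names_match FILE PKGNAME: succeed if the base name of FILE equals PKGNAME.
names_match() {
    [ "$(basename "$1")" = "$2" ]
}

# Example on a Solaris system, using the sample package from above:
# name=$(pkginfo -d ./Foopackage | pkg_name_from_pkginfo)
# names_match ./Foopackage "$name" || echo "file name differs from $name"
```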

Downloading V20z and V40z Server Firmware Updates

This section provides detailed information to help you download and prepare the firmware versions that are required to discover Sun Fire V20z and V40z servers.

To Download and Prepare Sun Fire V20z and V40z Server Firmware

Steps

Create directories into which the V20z and V40z firmware update zip files are to be saved.

Create separate directories for each server type firmware download. For example:
# mkdir V20z-firmware V40z-firmware

In a web browser, go to http://www.sun.com/servers/entry/v20z/downloads.html.

The Sun Fire V20z/V40z Server downloads page appears.

Click Current Release.

The Sun Fire V20z/V40z NSV Bundles 2.3.0.11 page appears.

Click Download.

The download Welcome page appears. Type your username and password, and then click Login.

The Terms of Use page appears. Read the license agreement carefully. You must accept the terms of the license to continue and download the firmware. Click Accept and then click Continue.

The Download page appears. Several downloadable files are displayed.

To download the V20z firmware zip file, click V20z BIOS and SP Firmware, English (nsv-v20z-bios-fw_V2_3_0_11.zip).

Save the 10.21-Mbyte file to the directory that you created for the V20z firmware in Step 2.

To download the V40z firmware zip file, click V40z BIOS and SP Firmware, English (nsv-v40z-bios-fw_V2_3_0_11.zip).

Save the 10.22-Mbyte file to the directory that you created for the V40z firmware in Step 2.

Change to the directory where you downloaded the V20z firmware file.
1. Type unzip nsv-v20z-bios-fw_V2_3_0_11.zip to unpack the zip file.
  
  Type y to continue.
  
  The sw_images directory is extracted.
  
  The following files in the sw_images directory are used by the N1 System Manager to update V20z provisionable server firmware:
  - Service Processor:
    
    sw_images/sp/spbase/V2.3.0.11/install.image
  - BIOS:
    
    sw_images/platform/firmware/bios/V2.33.5.2/bios.sp

Change to the directory where you downloaded the V40z firmware zip file.
1. Type unzip nsv-v40z-bios-fw_V2_3_0_11.zip to unpack the zip file.
  
  The sw_images directory is extracted.
  
  The following files in the sw_images directory are used by the N1 System Manager to update V40z provisionable server firmware:
  - Service Processor:
    
    sw_images/sp/spbase/V2.3.0.11/install.image
  - BIOS:
    
    sw_images/platform/firmware/bios/V2.33.5.2/bios.sp
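
After unpacking, the presence of the two firmware files listed above can be confirmed with a short shell function. This is a sketch only; check_firmware is a hypothetical helper name, and the directory names in the example are the ones suggested earlier in this procedure.

```shell
#!/bin/sh
# check_firmware DIR: report whether the service processor and BIOS
# firmware files exist under DIR. Returns 0 only if both are present.
check_firmware() {
    dir=$1
    status=0
    for f in \
        sw_images/sp/spbase/V2.3.0.11/install.image \
        sw_images/platform/firmware/bios/V2.33.5.2/bios.sp
    do
        if [ -f "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

# Example:
# check_firmware V20z-firmware && check_firmware V40z-firmware
```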

Next Steps

Copy the firmware updates to the N1 System Manager as described in To Copy a Firmware Update.
Update the firmware on a provisionable server or a server group as described in To Load a Firmware Update on a Server or a Server Group.

Downloading ALOM 1.5 Firmware Updates

This section provides detailed information to help you download and prepare the firmware versions that are required to discover Sun servers that use ALOM 1.5.

To Download and Prepare ALOM 1.5 Firmware

Steps

Create directories into which the ALOM firmware update zip files are to be saved.

Create separate directories for each server type firmware download. For example:
# mkdir ALOM-firmware

In a web browser, go to http://jsecom16.sun.com/ECom/EComActionServlet?StoreId=8.

The downloads page appears.

To download the ALOM 1.5 firmware zip file, log in and navigate to the ALOM 1.5, All Platforms/SPARC, English, Download.

Download the file to the directory you created for the ALOM firmware in Step 2.

Change to the directory where you downloaded the ALOM firmware file and untar the file.

bash-3.00# tar xvf ALOM_1.5.3_fw.tar
x README, 9186 bytes, 18 tape blocks
x copyright, 93 bytes, 1 tape blocks
x alombootfw, 161807 bytes, 317 tape blocks
x alommainfw, 5015567 bytes, 9797 tape blocks

The files are extracted.

Next Steps

Copy the firmware updates to the N1 System Manager as described in To Copy a Firmware Update.
Update the firmware on a provisionable server or a server group as described in To Load a Firmware Update on a Server or a Server Group.

Handling Threshold Breaches

If a threshold value is breached for a monitored attribute, an event is generated. You can create notification rules to warn you about this type of event. Notification of threshold breaches or warnings is done through the event log. This log is most easily viewed through the browser interface.

Notifications can be created using the create notification command and the resulting notification sent by email or to a pager. See create notification in Sun N1 System Manager 1.2 Command Line Reference Manual for syntax details.

Identifying Hardware and OS Threshold Breaches

If the value of a monitored hardware health attribute or OS resource utilization attribute breaches a threshold value, an event log entry is immediately created to indicate that the threshold has been breached. The event log is available from the browser interface. A symbol also appears in the monitored data table in the browser interface to indicate the breach, as shown in the graphic at To Retrieve Threshold Values for a Server.

Alternatively, use the show log command to verify that the event log has been generated:

N1-ok> show log
Id            Date                       Severity    Subject     Message
.
. 
10            2005-11-22T01:45:02-0800   WARNING     Sun_V20z_XG041105786
A critical high threshold was violated for server Sun_V20z_XG041105786: Attribute cpu0.vtt-s3 Value 1.32

13            2005-11-22T01:50:08-0800   WARNING     Sun_V20z_XG041105786
A normal low  threshold was violated for server Sun_V20z_XG041105786: Attribute cpu0.vtt-s3 Value 1.2
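
If the show log output has been saved to a file, the threshold messages can be picked out with grep. The following is a sketch that assumes the message wording shown in the sample output above; the function names and the file name in the example are hypothetical.

```shell
#!/bin/sh
# threshold_events: read saved `show log` output on standard input and
# print only the message lines that report a threshold violation.
threshold_events() {
    grep 'threshold was violated'
}

# critical_count: read the same input and print the number of
# violations that are reported as critical.
critical_count() {
    grep -c 'critical .* threshold was violated'
}

# Example (showlog.txt is an assumed file holding saved output):
# threshold_events < showlog.txt
# critical_count < showlog.txt
```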

If monitoring traps are lost, a particular threshold status may not be refreshed for up to 30 hours, although the overall status can still be refreshed every 10 minutes.

Identifying Monitoring Failure

If monitoring is enabled, as described in Enabling and Disabling Monitoring, and the status in the output of the show server or show group commands is unknown or unreachable, then the server or server group is not being reached successfully for monitoring. If the status remains unknown or unreachable for less than 30 minutes, a transient network problem might be occurring. However, if the status remains unknown or unreachable for more than 30 minutes, monitoring might have failed, possibly because of a failure in the monitoring feature. For more information, see Resolving Command Failures Related to OS Monitoring.

If monitoring traps are lost, a particular threshold status may not be refreshed for up to 30 hours, although the overall status should still be refreshed every 10 minutes.

A time stamp is provided in the monitoring data output. The relationship between this time stamp and the current time can also be used to judge if there is an error with the monitoring agent.

Problems After Rebooting or Restarting Services

If you reboot the management server, and the N1 System Manager services do not restart, you must regenerate security keys as explained in Why Regenerate Security Keys?.

If you stop the N1 System Manager services using the n1sminit stop command, and the services do not restart after using the n1sminit start command, you must regenerate security keys as explained in Why Regenerate Security Keys?.

Management Features Unavailable on Provisionable Servers After Rebooting

When the load server or load group command is used to install software on the provisionable server, the provisionable server's networktype attribute could be set to dhcp. This setting means that the server uses DHCP to get its provisioning network IP address. If the system reboots and obtains an IP address that differs from the one used for the agentip parameter during the load or add server command, then the following features might not work:

The OS Monitoring content of the show server command. (No OS monitoring)
The load server server update and load group group update commands
The start server server command command
The set server server threshold command
The set server server refresh command

In this case, use the set server server agentip command to correct the server's agent IP address. See To Modify the Agent IP for a Server for details.

Fixing Notifications From ALOM-based Servers

The ports of some models of provisionable servers use the Advanced Lights Out Manager (ALOM) standard. These servers, detailed in Provisionable Server Requirements in Sun N1 System Manager 1.2 Site Preparation Guide, use email instead of SNMP traps to send notifications about hardware events to the management server. For information about other events, see Managing Event Log Entries and Setting Up Event Notifications.

To ensure that the management server receives event notifications from these servers, configure the management server, or another designated server that can be accessed by the N1 System Manager, as a mail server to receive notifications about hardware events from provisionable servers that use ALOM. This is explained in Configuring the Management Server Mail Service and Account in Sun N1 System Manager 1.2 Site Preparation Guide and Configuring the N1 System Manager System in Sun N1 System Manager 1.2 Installation and Configuration Guide.

If there are no notifications about hardware events from provisionable servers that use ALOM, it could mean that all managed servers are healthy. However, it is also possible that the management server, or other designated server that can be accessed by the N1 System Manager, has not been configured correctly as an email server, or that the email configuration has been invalidated by other issues such as a network error or a domain name change.

If an email server has been configured, the problem might be that mail accounts have to be reset. The email accounts for provisionable servers may have been deleted or corrupted, or changes could have been made to the email server that impact its configuration, such as a change in the domain name or in a management network IP address.

To reset or change email addresses used by the management server, use the following procedure.

To Reset Email Accounts for ALOM-based Provisionable Servers

The following procedure describes how to reset email accounts for provisionable servers. Following this procedure enables you to replace previous email addresses used by the management server with new addresses.

The email addresses you reset should be reserved for use only by the N1 System Manager.

Before You Begin

Confirm that the problem is related to the fact that email alerts are not being received for the server. It is possible that the management server, or some other chosen server that can be accessed by the N1 System Manager, has not been configured correctly as an email server, or that email configuration has been invalidated due to other issues such as network error or domain name change.

Before trying the following procedure, verify that email sent from the ALOM server can be received by the designated email server, by configuring an independent mail client, such as Mozilla, with the same mail server IP, username and password. Then use the telnet command to access an ALOM server, and execute the resetsc -y command to generate a warning message. Check if the mail client is able to receive the ALOM warning message. If it is, you do not need to follow this procedure.

See Discovering Servers for information about default telnet login and passwords for servers.

Before trying the following procedure, verify also that the N1 System Manager has access to the designated email server by using the telnet command to access an ALOM server, and executing the showsc command. Make sure the following parameters/values are set as shown:

The if_emailalerts variable is set to true.
The mgt_mailhost variable is set to the designated mail server's IP address.
The mgt_mailalert(1) variable is set to the email address to which alerts must be sent.

If you do not see these settings, or if you see incorrect values for the mgt_mailalert email address, follow this procedure.

Steps

Switch off monitoring for ALOM-based provisionable servers.

Set the monitored attribute to false by using the set server command.
N1-ok> set server server monitored false
In this example, server is the name of the ALOM-based provisionable server for which you want to reset the email account. Executing this command disables monitoring of the server.
- If the ALOM-based servers are in the same group, use the set group command to switch off monitoring for the server group.
  N1-ok> set group group monitored false
  In this example, group is the name of the group of ALOM-based provisionable servers for which you want to reset email accounts. Executing this command disables monitoring of the server group.

Change the email address for the server using the n1smconfig command with the -a option.

ALOM-based servers support email addresses of up to 33 characters in length.

Note –
If you manually configured ALOM-based servers to send event notifications by email to other addresses, using the telnet command and the setsc mgt_mailalert command, those addresses will not be changed by running the n1smconfig command.

Switch on monitoring for the ALOM-based provisionable server.

Set the monitored attribute to true by using the set server command.
N1-ok> set server server monitored true
- If the ALOM-based servers are in the same group, use the set group command to switch on monitoring for the server group.
  N1-ok> set group group monitored true
  In this example, group is the name of the group of ALOM-based provisionable servers for which you want to reset email accounts. Executing this command enables monitoring of the server group.
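
Because ALOM-based servers support email addresses of at most 33 characters, as noted in the procedure above, a replacement address can be checked for length before it is configured. The following is a minimal sketch; valid_alom_address is a hypothetical helper, and it checks length only, not whether the address is well formed.

```shell
#!/bin/sh
# valid_alom_address ADDR: succeed if ADDR is non-empty and at most
# 33 characters long, the limit stated for ALOM-based servers.
valid_alom_address() {
    [ -n "$1" ] && [ "${#1}" -le 33 ]
}

# Example (the address is an assumption):
# valid_alom_address "n1smalerts@example.com" || echo "address too long"
```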
