System Administration Guide: Advanced Administration

Chapter 18 Troubleshooting Miscellaneous Software Problems (Tasks)

This chapter describes miscellaneous software problems that might occur occasionally and are relatively easy to fix. These are problems that aren't related to a specific software application or topic, such as unsuccessful reboots and full file systems. The following sections describe how to resolve them.

What to Do If Rebooting Fails

Note –

Some of the information in this section pertains to systems that are running the Oracle Solaris 10 release only.

If the system does not reboot completely, or if it reboots and then crashes again, there might be a software or hardware problem that is preventing the system from booting successfully.

Cause of System Not Booting	How to Fix the Problem
The system can't find /platform/`uname -m`/kernel/unix.	You may need to change the `boot-device` setting in the PROM on a SPARC based system. For information on changing the default boot device, see How to Change the Default Boot Device by Using the Boot PROM in System Administration Guide: Basic Administration.
Oracle Solaris 10: There is no default boot device on an x86 based system. The message displayed is: `Not a UFS filesystem.`	Oracle Solaris 10: Boot the system by using the Configuration Assistant/boot diskette and select the disk from which to boot.
Solaris 10 1/06: The GRUB boot archive has become corrupted, or the SMF boot archive service has failed. An error message is displayed if you run the `svcs -x` command.	Solaris 10 1/06: Boot the failsafe archive.
There's an invalid entry in the `/etc/passwd` file.	For information on recovering from an invalid `passwd` file, see Chapter 12, Booting an Oracle Solaris System (Tasks), in System Administration Guide: Basic Administration.
There's a hardware problem with a disk or another device.	Check the hardware connections: Make sure the equipment is plugged in. Make sure all the switches are set properly. Look at all the connectors and cables, including the Ethernet cables. If all this fails, turn off the power to the system, wait 10 to 20 seconds, and then turn on the power again.

If none of the above suggestions solves the problem, contact your local service provider.
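For the first cause in the table, the kernel path can be tested from a shell once the root file system is accessible, for example after a failsafe or network boot. The following is a minimal sketch and not part of the documented procedure. The function name `kernel_file_present` and the `/a` mount point are assumptions for illustration, and the syntax assumes a POSIX shell such as ksh.

```shell
# Hypothetical helper: report whether the kernel file exists under a given
# root directory (for example /a when the damaged root is mounted there).
kernel_file_present() {
    root_dir=${1:-/}
    # Strip a trailing slash so that "/" and "/a" both build a clean path.
    kernel="${root_dir%/}/platform/$(uname -m)/kernel/unix"
    if [ -f "$kernel" ]; then
        echo "found: $kernel"
    else
        echo "missing: $kernel"
        return 1
    fi
}
```

For example, `kernel_file_present /a` checks a root file system that is mounted on `/a`.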

What to Do If You Forgot the Root Password

If you forget the root password and cannot log in to the system, do the following:

Stop the system by using the keyboard stop sequence.
Oracle Solaris 10: Boot the system from a boot server or an install server, or from a local CD-ROM.
Mount the root (/) file system.
Remove the root password from the /etc/shadow file.
Reboot the system.
Log in and set root's password.

On x86 based systems, the procedure includes a failsafe boot option:

Stop the system by using the keyboard stop sequence.
Starting with Solaris 10 1/06 release: On x86 based systems, boot the system in the Solaris failsafe archive.
Oracle Solaris 10: Boot the system from a boot server or an install server, or from a local CD-ROM.
Mount the root (/) file system.
Remove the root password from the /etc/shadow file.
Reboot the system.
Log in and set root's password.

These procedures are fully described in Chapter 12, Booting an Oracle Solaris System (Tasks), in System Administration Guide: Basic Administration.

Note –

GRUB based booting is not available on SPARC based systems in this release.

The following examples describe how to recover from a forgotten root password on both SPARC and x86 based systems.

Example 18–1 SPARC: What to Do If You Forgot the Root Password

The following example shows how to recover when you forget the root password by booting from the network. This example assumes that the boot server is already available. Be sure to apply a new root password after the system has rebooted.

(Use the keyboard abort sequence: press the Stop-A keys to stop the system)
ok boot net -s
# mount /dev/dsk/c0t3d0s0 /a
# cd /a/etc
# TERM=vt100
# export TERM
# vi shadow
(Remove root's encrypted password string)
# cd /
# umount /a
# init 6

Example 18–2 x86: Performing a GRUB Based Boot When You Have Forgotten the Root Password

This example assumes that the boot server is already available. Be sure to apply a new root password after the system has rebooted.

GNU GRUB  version 0.95  (637K lower / 3144640K upper memory)
 +-------------------------------------------------------------------+
 | be1
 | be1 failsafe
 | be3
 | be3 failsafe
 | be2
 | be2 failsafe
 +-------------------------------------------------------------------+
      Use the ^ and v keys to select which entry is highlighted.
      Press enter to boot the selected OS, 'e' to edit the
      commands before booting, or 'c' for a command-line.

Searching for installed OS instances...
	
	An out of sync boot archive was detected on /dev/dsk/c0t0d0s0.
	The boot archive is a cache of files used during boot and
	should be kept in sync to ensure proper system operation.
	
	Do you wish to automatically update this boot archive? [y,n,?] n
Searching for installed OS instances...

Multiple OS instances were found. To check and mount one of them
read-write under /a, select it from the following list. To not mount
any, select 'q'.

  1  pool10:13292304648356142148     ROOT/be10
  2  rpool:14465159259155950256      ROOT/be01

Please select a device to be mounted (q for none) [?,??,q]: 1
mounting /dev/dsk/c0t0d0s0 on /a
starting shell.
      .
      .
      .
# cd /a/etc
# vi shadow
(Remove root's encrypted password string)
# cd /
# umount /a
# reboot

Example 18–3 x86: Booting a System When You Have Forgotten the Root Password

Oracle Solaris 10: The following example shows how to recover when you forget root's password by booting from the network. This example assumes that the boot server is already available. Be sure to apply a new root password after the system has rebooted.

Press any key to reboot.
Resetting...
.
.
.
Initializing system                                                             
Please wait...                                                                  
                                                                                
                                                                                
                     <<< Current Boot Parameters >>>                            
Boot path: /pci@0,0/pci-ide@7,1/ide@0/cmdk@0,0:a                                
Boot args:                                                                      
                                                                                
Type    b [file-name] [boot-flags] <ENTER>     to boot with options            
or      i <ENTER>                              to enter boot interpreter       
or      <ENTER>                                to boot with defaults           
                                                                               
                  <<< timeout in 5 seconds >>>

Select (b)oot or (i)nterpreter: b -s
SunOS Release 5.10 Version amd64-gate-2004-09-30 32-bit
Copyright 1983-2004 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
Booting to milestone "milestone/single-user:default".
Hostname: venus
NIS domain name is example.com
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass): xxxxxx
Entering System Maintenance Mode
.
.
.
# mount /dev/dsk/c0t0d0s0 /a
      .
      .
      .
# cd /a/etc
# vi shadow
(Remove root's encrypted password string)
# cd /
# umount /a
# init 6
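The `vi shadow` step in the preceding examples can also be done without an interactive editor. The following is a minimal sketch, assuming the root file system is mounted on `/a` and an awk that rebuilds records with `OFS` (on Solaris 10, `nawk` might be needed). The function name `clear_root_hash` is hypothetical.

```shell
# Hypothetical helper: blank the encrypted password (field 2) of the root
# entry in a shadow file, leaving every other entry untouched.
clear_root_hash() {
    shadow_file=$1
    # Keep a backup copy before rewriting the file.
    cp "$shadow_file" "$shadow_file.bak" || return 1
    awk 'BEGIN { FS = ":"; OFS = ":" }
         $1 == "root" { $2 = "" }
         { print }' "$shadow_file.bak" > "$shadow_file"
}
```

With the root file system mounted as in the examples, the call would be `clear_root_hash /a/etc/shadow`. Remove the backup file after you have set a new root password, because it still contains the old password hash.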

x86: What to Do If the SMF Boot Archive Service Fails During a System Reboot

Note –

This procedure applies to systems that are running the Oracle Solaris 10 release only.

Solaris 10 1/06: If the system crashes, the boot archive SMF service, svc:/system/boot-archive:default, might fail when the system is rebooted. If the boot archive service has failed, a message similar to the following is displayed when you run the svcs -x command:

svc:/system/boot-archive:default (check boot archive content)
 State: maintenance since Fri Jun 03 10:24:52 2005
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
   See: http://sun.com/msg/SMF-8000-KS
   See: /etc/svc/volatile/system-boot-archive:default.log
Impact: 48 dependent services are not running.  (Use -v for list.)

svc:/network/rpc/gss:default (Generic Security Service)
 State: uninitialized since Fri Jun 03 10:24:51 2005
Reason: Restarter svc:/network/inetd:default is not running.
   See: http://sun.com/msg/SMF-8000-5H
   See: gssd(1M)
Impact: 10 dependent services are not running.  (Use -v for list.)

svc:/application/print/server:default (LP print server)
 State: disabled since Fri Jun 03 10:24:51 2005
Reason: Disabled by an administrator.
   See: http://sun.com/msg/SMF-8000-05
   See: lpsched(1M)
Impact: 1 dependent service is not running.  (Use -v for list.)

To correct the problem, take the following action:

Reboot the system and select the failsafe archive option from the GRUB boot menu.
Answer y when prompted by the system to rebuild the boot archive.

After the boot archive is rebuilt, the system is ready to boot.
To continue booting, clear the SMF boot archive service by using the following command.

# svcadm clear boot-archive

Note that you must become superuser or the equivalent to run this command.

For more information on rebuilding the GRUB boot archive, see How to Boot an x86 Based System in Failsafe Mode in System Administration Guide: Basic Administration and the bootadm(1M) man page.
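When `svcs -x` reports many services, it can help to reduce the output to the services that are in the maintenance state. The following is a minimal sketch that assumes the output layout shown earlier in this section. The function name `services_in_maintenance` is hypothetical.

```shell
# Read `svcs -x` output on standard input and print the FMRI of every
# service whose State line reports "maintenance".
services_in_maintenance() {
    awk '/^svc:/ { fmri = $1 }
         $1 == "State:" && $2 == "maintenance" { print fmri }'
}

# Typical use on the affected system:
#   svcs -x | services_in_maintenance
```

If the list includes `svc:/system/boot-archive:default`, the boot archive service is the one to clear after the archive has been rebuilt.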

What to Do If a System Hangs

A system can freeze or hang rather than crash completely if some software process is stuck. Follow these steps to recover from a hung system.

1. Determine whether the system is running a window environment and follow these suggestions. If these suggestions don't solve the problem, go to step 2.
   - Make sure the pointer is in the window where you are typing the commands.
   - Press Control-q in case the user accidentally pressed Control-s, which freezes output in that window. Control-s freezes only the window, not the entire screen. If a window is frozen, try using another window.
   - If possible, log in remotely from another system on the network. Use the pgrep command to look for the hung process. If it looks like the window system is hung, identify the process and kill it.
2. Press Control-\ to force a “quit” in the running program and (probably) write out a core file.
3. Press Control-c to interrupt the program that might be running.
4. Log in remotely and attempt to identify and kill the process that is hanging the system.
5. Log in remotely, become superuser or assume an equivalent role, and reboot the system.
6. If the system still does not respond, force a crash dump and reboot. For information on forcing a crash dump and booting, see Forcing a Crash Dump and Reboot of the System in System Administration Guide: Basic Administration.
7. If the system still does not respond, turn the power off, wait a minute or so, then turn the power back on.
8. If you cannot get the system to respond at all, contact your local service provider for help.
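The remote login steps above rely on finding the hung process with `pgrep` and then killing it. The following is a minimal sketch of one way to do the second part: ask the process to terminate, then force it only if it is still present. The function name `stop_process` and the two-second wait are assumptions, not part of the documented procedure.

```shell
# Hypothetical helper: send SIGTERM to a process, wait briefly, and send
# SIGKILL only if the process still exists.
stop_process() {
    pid=$1
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running or cannot be signaled"
        return 0
    fi
    kill -TERM "$pid"
    sleep 2
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# Typical use after a remote login (the process name is a placeholder):
#   pgrep -l name        # list matching process IDs and names
#   stop_process 1234    # then stop the one that is hung
```

Trying SIGTERM first gives the program a chance to clean up before it is forced to exit.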

What to Do If a File System Fills Up

When the root (/) file system or any other file system fills up, you will see the following message in the console window:

.... file system full

There are several reasons why a file system fills up. The following sections describe several scenarios for recovering from a full file system. For information on routinely cleaning out old and unused files to prevent full file systems, see Chapter 6, Managing Disk Use (Tasks).

File System Fills Up Because a Large File or Directory Was Created

Reason Error Occurred	How to Fix the Problem
Someone accidentally copied a file or directory to the wrong location. This also happens when an application crashes and writes a large `core` file into the file system.	Log in as superuser or assume an equivalent role and use the `ls -tl` command in the specific file system to identify which large file is newly created and remove it. For information on removing `core` files, see How to Find and Delete `core` Files.
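As a complement to `ls -tl`, which lists one directory at a time, `find` can search a whole file system for files that are both large and recently modified. The following is a minimal sketch. The function name `recent_large_files` and the default threshold are assumptions chosen for illustration.

```shell
# Hypothetical helper: list regular files under a directory that are larger
# than a given number of 512-byte blocks and were modified within the last
# day. -xdev keeps the search on one file system.
recent_large_files() {
    dir=$1
    min_blocks=${2:-204800}     # 204800 x 512 bytes = 100 Mbytes
    find "$dir" -xdev -type f -size +"$min_blocks" -mtime -1 -print
}

# Typical use:
#   recent_large_files /var
```

Review the listed files before you remove anything, because a large recent file is not necessarily unwanted.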

A `TMPFS` File System is Full Because the System Ran Out of Memory

Reason Error Occurred	How to Fix the Problem
This can occur if `TMPFS` is trying to write more than it is allowed or some current processes are using a lot of memory.	For information on recovering from `tmpfs`-related error messages, see the tmpfs(7FS) man page.

What to Do If File ACLs Are Lost After Copy or Restore

Reason Error Occurred	How to Fix the Problem
If files or directories with ACLs are copied or restored into the `/tmp` directory, the ACL attributes are lost. The `/tmp` directory is usually mounted as a temporary file system, which doesn't support UFS file system attributes such as ACLs.	Copy or restore files into the `/var/tmp` directory instead.

Troubleshooting Backup Problems

This section describes some basic troubleshooting techniques to use when backing up and restoring data.

The root (`/`) File System Fills Up After You Back Up a File System

You back up a file system, and the root (/) file system fills up. Nothing is written to the media, and the ufsdump command prompts you to insert the second volume of media.

Reason Error Occurred	How to Fix the Problem
If you used an invalid destination device name with the `-f` option, the `ufsdump` command wrote to a file in the `/dev` directory of the root (`/`) file system, filling it up. For example, if you typed `/dev/rmt/st0` instead of `/dev/rmt/0`, the backup file `/dev/rmt/st0` was created on the disk rather than being sent to the tape drive.	Use the `ls -tl` command in the `/dev` directory to identify which file is newly created and abnormally large, and remove it.
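A mistyped tape device name can be caught before the backup starts by checking that the destination is a character device. The following is a minimal sketch. The function name `check_dump_target` is hypothetical, and the check is intended only for backups to a local tape drive, because `ufsdump` can also legitimately write to a regular file or to a remote device.

```shell
# Hypothetical guard: succeed only if the backup destination exists and is
# a character device, as a local tape drive would be.
check_dump_target() {
    target=$1
    if [ -c "$target" ]; then
        echo "ok: $target is a character device"
    else
        echo "refusing: $target is not a character device" >&2
        return 1
    fi
}

# Typical use:
#   check_dump_target /dev/rmt/0 && ufsdump 0uf /dev/rmt/0 /export/home
```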

Make Sure the Backup and Restore Commands Match

You can only use the ufsrestore command to restore files backed up with the ufsdump command. If you back up with the tar command, restore with the tar command. If you use the ufsrestore command to restore a tape that was written with another command, an error message tells you that the tape is not in ufsdump format.

Check to Make Sure You Have the Right Current Directory

It is easy to restore files to the wrong location. Because the ufsdump command always copies files with full path names relative to the root of the file system, you should usually change to the root directory of the file system before running the ufsrestore command. If you change to a lower-level directory, after you restore the files you will see a complete file tree created under that directory.
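One way to make the target directory explicit is to run the restore in a subshell that first changes to that directory. The following is a minimal sketch. The function name `run_in_dir` is hypothetical, and the syntax assumes a POSIX shell such as ksh.

```shell
# Hypothetical helper: run a command from a given directory without
# changing the caller's current directory. The command is not run at all
# if the directory change fails.
run_in_dir() {
    target_dir=$1
    shift
    ( cd "$target_dir" && "$@" )
}

# Typical use:
#   run_in_dir /export/home ufsrestore xvf /dev/rmt/0
```

Because the `cd` happens inside the subshell, a typing mistake in the directory name stops the restore instead of extracting files into whatever directory the shell happens to be in.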

Interactive Commands

When you run the ufsrestore command in interactive mode (the i option), a ufsrestore > prompt is displayed, as shown in this example:

# ufsrestore ivf /dev/rmt/0
Verify volume and initialize maps
Media block size is 126
Dump   date: Fri Jan 30 10:13:46 2004
Dumped from: the epoch
Level 0 dump of /export/home on starbug:/dev/dsk/c0t0d0s7
Label: none
Extract directories from tape
Initialize symbol table.
ufsrestore >

Troubleshooting Common Agent Container Problems in the Oracle Solaris OS

This section addresses problems that you might encounter with the common agent container shared component. In this Oracle Solaris release, the common agent container Java program is included in the Oracle Solaris OS. The program implements a container for Java management applications. Typically, the container is not visible to the user.

The following are potential problems:

Port number conflicts
Compromised security for the superuser password

Port Number Conflicts

The common agent container occupies the following port numbers by default:

JMX port (TCP) = 11162
SNMPAdaptor port (UDP) = 11161
SNMPAdaptor port for traps (UDP) = 11162
Commandstream Adaptor port (TCP) = 11163
RMI connector port (TCP) = 11164

Note –

If you are troubleshooting an installation of Oracle Solaris Cluster, the port assignments are different.

If your installation already reserves any of these port numbers, change the port numbers that are occupied by the common agent container, as described in the following procedure.
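To see whether one of the default ports is already in use, the output of `netstat -an` can be searched for the port number. The following is a minimal sketch that assumes the Solaris output layout, in which the local address is the first column and the port follows the last dot (for example, `*.11162`). The function name `port_in_use` is hypothetical, and the `-v` option might require `nawk` on Solaris 10.

```shell
# Hypothetical helper: read `netstat -an` output on standard input and
# succeed if any local address ends in the given port number.
# It does not distinguish between TCP and UDP.
port_in_use() {
    port=$1
    awk -v p="$port" '
        BEGIN { pat = "\\." p "$" }
        $1 ~ pat { found = 1 }
        END { exit found ? 0 : 1 }'
}

# Typical use:
#   netstat -an | port_in_use 11162 && echo "port 11162 is in use"
```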

How to Check Port Numbers

This procedure shows how to change a port number that the common agent container occupies.

Become superuser or assume an equivalent role.

Roles contain authorizations and privileged commands. For more information about roles, see Configuring RBAC (Task Map) in System Administration Guide: Security Services.

Stop the common agent container management daemon.
# /usr/sbin/cacaoadm stop

Change the port number by using the following syntax:
# /usr/sbin/cacaoadm set-param param=value
For example, to change the port occupied by the SNMPAdaptor from the default of 11161 to 11165, type:
# /usr/sbin/cacaoadm set-param snmp-adaptor-port=11165

Restart the common agent container management daemon.
# /usr/sbin/cacaoadm start
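The stop, set-param, and start steps can be combined so that the daemon is not left stopped if the parameter change fails. The following is a minimal sketch that uses only the `cacaoadm` subcommands shown in this procedure. The function name `change_cacao_port` and the `CACAOADM` variable are hypothetical, and restarting the daemon after a failed change is a design choice of this sketch.

```shell
# Path to the command; the variable is an assumption that makes the
# sketch easy to adapt or dry-run.
CACAOADM=${CACAOADM:-/usr/sbin/cacaoadm}

# Hypothetical helper: stop the daemon, change one parameter, restart.
change_cacao_port() {
    param=$1
    value=$2
    "$CACAOADM" stop || return 1
    if ! "$CACAOADM" set-param "$param=$value"; then
        # Bring the daemon back up with its previous settings.
        "$CACAOADM" start
        return 1
    fi
    "$CACAOADM" start
}

# Typical use, as superuser:
#   change_cacao_port snmp-adaptor-port 11165
```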

Compromised Security for Superuser Password

It might be necessary to regenerate security keys on a host that is running the Java ES. For example, if there is a risk that a superuser password has been exposed or compromised, you should regenerate the security keys. The keys that are used by the common agent container services are stored in the /etc/cacao/instances/instance-name/security directory. The following task shows you how to generate security keys for the Oracle Solaris OS.

How to Generate Security Keys for the Oracle Solaris OS

Become superuser or assume an equivalent role.

Roles contain authorizations and privileged commands. For more information about roles, see Configuring RBAC (Task Map) in System Administration Guide: Security Services.

Stop the common agent container management daemon.
# /usr/sbin/cacaoadm stop

Regenerate the security keys.

# /usr/sbin/cacaoadm create-keys --force

Restart the common agent container management daemon.
# /usr/sbin/cacaoadm start
Note –
For the Sun Cluster software, you must propagate this change across all nodes in the cluster.
