Solaris Common Messages and Troubleshooting Guide

"S"

save: SYSTEM error, Arg list too long

Cause

The save command fails with this error because the database (index) file for the client is greater than 2 Gbytes. With Solaris 2.6 and SBU 5.0.1 this is no longer a problem.

Action

With earlier versions of Solaris, bring up nwadmin -> indexes -> select appropriate client -> select appropriate fs -> remove oldest cycle -> reclaim space.

This may have to be repeated a few times to reclaim enough space. The indexes can be recreated later if necessary by using scanner.

SCSI bus DATA IN phase parity error

Cause

The most common cause of this problem is unapproved hardware. Some SCSI devices for the PC market do not meet the high I/O speed requirements for the UNIX market. Other possible causes of this problem are improper cabling or termination, and power fluctuations. Data corruption is possible but unlikely to occur, because this parity error prevents data transfer.

Action

Check that all SCSI devices on the bus are Sun-approved hardware. Then verify that all cables measure no longer than six meters, total, and that all SCSI connections are properly terminated. If power fluctuations are occurring, invest in an uninterruptible power supply.

SCSI transport failed: reason 'reset'

Cause

This message indicates that the system sent data over the SCSI bus, but the data never reached its destination because of a SCSI bus reset. The most common cause of this condition is conflicting SCSI targets. Data corruption is possible but unlikely to occur, because this failure prevents data transfer.

Action

Verify that all cables measure no longer than six meters, total, and that all SCSI connections are properly terminated. If power surges are a problem, acquire a surge suppressor or uninterruptible power supply.

A machine's internal disk drive is usually SCSI target 3. Make sure that external and secondary disk drives are targeted to 1, 2, or 0, and do not conflict with each other. Also make sure that tape drives are targeted to 4 or 5, and CD drives to 6, avoiding any conflict with each other or with disk drives. If the targeting of the internal disk drive is in question, power off the machine, remove all external drives, turn the power on, and from the PROM monitor run the probe-scsi-all or probe-scsi command.
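The targeting rule above comes down to one check: no two devices on the same bus may share a target number. The following is a minimal sketch of that check, assuming a POSIX shell (such as ksh) and assuming you have typed in the target numbers by hand, for example after reading them from the probe-scsi output. The function name duplicate_targets is hypothetical.

```shell
# duplicate_targets: print each SCSI target number that appears more than
# once in the argument list. Hypothetical helper; the target numbers are
# supplied by hand, not read from the hardware.
duplicate_targets() {
    printf '%s\n' "$@" | sort -n | uniq -d
}
```

For example, duplicate_targets 3 1 4 6 1 prints 1, which shows that two devices claim target 1. No output means the listed targets do not conflict.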

If SCSI device targeting is acceptable, memory configuration could be the problem, especially for machines with the sun4c architecture. Ensure that high-capacity memory chips (such as 4MB SIMMs) are in lower banks, while lower-capacity memory chips (such as 1MB SIMMs) are in the upper banks.

SPARC systems do not always support third party CD-ROM drives, and may generate a similar "unknown vendor" error message. Check with the CD-ROM vendor for specific configuration requirements.

Some third-party disk drives have a read-ahead cache that interferes with Solaris device drivers. Make sure that any existing read-ahead cache facility is turned off.

See Also

For more information on SCSI targets, see the section on device naming conventions in the Solaris 1.x to Solaris 2.x Transition Guide. If you are using the AnswerBook, "scsi targets" is a good search string.

Security exception on host string. USER ACCESS DENIED.

Cause

When you try to create a user with AdminSuite and place the home directory on a system remote from the NIS+ server, the following error message appears:

Security exception on host hostname. USER ACCESS DENIED.
The user identity (555)username was received, but that user
   is not authorized to execute the requested functionality
      on this system. Is this user a member of an appropriate
        security group on this system ?
            (Function: class directory method create_dir)

The user can still use rsh(1) to reach the remote machine and create a home directory on that system.

Action

The user is not a member of the sysadmin group in the NIS+ tables:


 # niscat group.org_dir | grep sysadmin 
 sysadmin::14:

Add the username to the sysadmin group.
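
To confirm membership before and after the change, you can test the member list in the group entry. This is a minimal sketch, assuming a POSIX shell and assuming the entry has the four colon-separated fields seen in the niscat output above, with members in the fourth field separated by commas. The function name in_group_line is hypothetical.

```shell
# in_group_line: succeed if USER is named in the member list (fourth
# colon-separated field) of a group entry such as "sysadmin::14:alice,bob".
# Hypothetical helper; it only parses text and does not query NIS+ itself.
in_group_line() {
    line=$1
    user=$2
    members=$(echo "$line" | cut -d: -f4)
    # Wrap both sides in commas so "bob" does not match "bobby".
    case ",$members," in
        *",$user,"*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

With the entry shown above, in_group_line "sysadmin::14:" username fails, because the member list is empty.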

Segmentation Fault

Cause

Segmentation faults usually result from programming error. This message is usually accompanied by a core dump, except on read-only filesystems.

Action

To see which program produced a core(4) file, run either the file(1) command or the adb(1) command. The following examples show the output of the file(1) and adb(1) commands on a core file from the dtmail program.


$ file core
core: ELF 32-bit MSB core file SPARC Version 1, from `dtmail'

$ adb core
core file = core -- program `dtmail'
SIGSEGV  11: segmentation violation
^D      (use Control-d to quit the adb program)

Ask the vendor or author of this program for a debugged version.

Technical Notes

A process has received a signal indicating that it attempted to access an area of memory that is protected or that does not exist. The two most common causes of segmentation faults are attempting to dereference a null pointer or indexing past the bounds of an array.

sendmail[]: can't lookup data via name server "dns" or sendmail[]: can't lookup data via name server "nis"

Cause

The following entry in the /etc/nsswitch.conf file: sendmailvars: dns nis files causes the messages to appear in the console window.

Action

The sendmailvars database can only be used with local files and/or NIS+. So, if you do not have this database set up, the default sendmailvars entry should look as follows in the /etc/nsswitch.conf file:


sendmailvars: files
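
To check an existing file for the problem entry, a short script can pull out the sendmailvars line and flag the dns and nis sources. This is a minimal sketch, assuming a POSIX shell and a sed that understands the [[:space:]] character class. Both function names are hypothetical.

```shell
# sendmailvars_sources: print the sources listed for sendmailvars in an
# nsswitch.conf-style file (first argument; defaults to /etc/nsswitch.conf).
sendmailvars_sources() {
    sed -n 's/^sendmailvars:[[:space:]]*//p' "${1:-/etc/nsswitch.conf}"
}

# sendmailvars_ok: fail if the entry names dns or nis, the two sources
# that the console messages complain about.
sendmailvars_ok() {
    for src in $(sendmailvars_sources "$1"); do
        case $src in
            dns|nis) return 1 ;;
        esac
    done
    return 0
}
```

Running sendmailvars_ok with no argument checks /etc/nsswitch.conf. A nonzero exit status means the entry still lists dns or nis.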

sendmail[int]: NOQUEUE: SYSERR: net hang reading from string

Cause

This is a sendmail(1M) message that appears on the console and in the log file /var/adm/messages. If this message occurs once for a particular user, it is possible that a mail message from this user ends with a partial line (having no terminating newline character). If this message appears frequently or at busy times, especially along with other networking errors, it could indicate network problems.

Action

Check the user's mail spool file to see if a message ends without a newline character. If so, talk with the user and determine how to prevent the problem from occurring again. If these messages are the result of network problems, you could try moving the mail spool directory to another machine with a faster network interface.
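
One quick way to make the newline check is to look at the last byte of the file. This is a minimal sketch, assuming a POSIX shell and a tail that accepts the -c option. The function name ends_with_newline is hypothetical, and the test covers only the end of the file, that is, the last message in the spool.

```shell
# ends_with_newline: succeed if the last byte of FILE is a newline.
# Command substitution strips a trailing newline, so an empty result
# means the final byte was a newline (or the file is empty).
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}
```

For example, ends_with_newline on a user's spool file returns a nonzero status when the final message stops in the middle of a line.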

Technical Notes

During the SMTP receipt of DATA phase, a message-terminating period on a line of its own never arrived, so sendmail(1M) timed out and produced this error.

Service wouldn't let us acquire selection

Cause

This message indicates that the OpenWindows selection service failed to seize the requested selection from /tmp/winselection. Some diagnostics follow: the requested selection could be 0 for unknown, 1 for caret, 2 for primary, 3 for secondary, or 4 for clipboard. The result could be 0 for failure, 2 for nonexistent, 3 for didn't have, 4 for wrong rank, 5 for continued, 6 for cancelled, or 7 for unrecognized.

setmnt: Cannot open /etc/mnttab for writing

Cause

The system is having problems writing to /etc/mnttab. The filesystem containing /etc may be mounted read-only, or is not mounted at all.

Action

Check that this file exists and is writable by root. If so, ensure that the /etc filesystem has been mounted, and is mounted read-write rather than read-only.

share_nfs: /home: Operation not applicable

Cause

This message usually indicates that the system has a local filesystem mounted on /home, which is where the automounter usually mounts users' home directories.

Action

When a system is running the automounter, do not mount local filesystems on the /home directory. Mount them on another directory, such as /disk2, which on most systems you will have to create. You could also change the automounter auto_home entry, but that is a more difficult solution.

Slice c0t1d0s0 is too small to contain 1 replicas

Cause

When trying to add a state replica using metatool to cylinder 0 of a disk, the following error message appears:


	Your attempt to attach metastate database
	replicas on slice "c?t?d?s?" failed for the
	following reason: Slice c?t?d?s? is too small
	to contain 1 replicas.

This is because metatool masks out the very first cylinder to protect the disk label. In DiskSuite 4.1, metatool does allow adding the databases to cylinder 0 on disks of 2.1 Gbytes or larger.

Action

The workaround is not to start at cylinder 0 but at cylinder 1, or use the command line (metadb -a).

Socket type not supported

Cause

The support for the socket type has not been configured into the system or no implementation for it exists.

Technical Notes

The symbolic name for this error is ESOCKTNOSUPPORT, errno=121.

Soft error rate (int%) during writing was too high

Cause

This message from the SCSI tape drive appears when Exabyte or DAT tapes generate too many soft (recoverable) errors. It is followed by the advisory "Please, replace tape cartridge" message. Soft errors are an indication that hard errors could soon occur, causing data corruption.

Action

First clean the tape head with a cleaning tape as recommended by the manufacturer. If that doesn't work, replace the tape cartridge. You might need to replace the tape drive if the problem still occurs with new tape cartridges.

Soft error rate (retries = int) during writing was too high

Cause

This message from the SCSI tape drive appears when Archive tapes generate too many soft (recoverable) errors. It is followed by the advisory "Periodic head cleaning required and/or replace tape cartridge" message. Soft errors are an indication that hard errors could soon occur, causing data corruption.

Action

First clean the tape head with a cleaning tape as recommended by the manufacturer. If that doesn't work, replace the tape cartridge. You might need to replace the tape drive if the problem still occurs with new tape cartridges.

Software caused connection abort

Cause

A connection abort was caused internal to your host machine.

Technical Notes

The symbolic name for this error is ECONNABORTED, errno=130.

Srmount error

Cause

This error is RFS specific. It occurs when an attempt is made to stop RFS while resources are still mounted by remote machines, or when a resource is readvertised with a client list that does not include a remote machine with the resource currently mounted.

Technical Notes

The symbolic name for this error is ESRMNT, errno=69.

Stale NFS file handle

Cause

A file or directory that was opened by an NFS client was either removed or replaced on the server.

Action

If you were editing this file, write it to a local filesystem instead. Try remounting the filesystem on top of itself or shutting down any client processes that refer to stale file handles. If neither of these solutions works, reboot the system.

Technical Notes

The original vnode is no longer valid. The only way to get rid of this error is to force the NFS server and client to renegotiate file handles.

The symbolic name for this error is ESTALE, errno=151.

statd: cannot talk to statd at string

Cause

This message comes from the NFS status monitor daemon statd(1M), which provides crash recovery services for the NFS lock daemon lockd(1M). The message indicates that statd(1M) has left old references in the /var/statmon/sm and /var/statmon/sm.bak directories. After a user has removed or modified a host in the hosts database, statd(1M) might not properly purge files in these directories, which results in its trying to communicate with a nonexistent host.

Action

Remove the file named variable (where variable is the hostname) from both the /var/statmon/sm and /var/statmon/sm.bak directories. Then kill the statd daemon and restart it. If that doesn't get rid of the message, kill and restart lockd(1M) as well. If that doesn't work, reboot the machine at your convenience.
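
The file removal step can be scripted so that both directories are handled the same way. This is a minimal sketch, assuming a POSIX shell. The function name purge_statmon_host and its optional second argument (an alternate base directory, useful for a dry run on a copy) are hypothetical. It does not kill or restart statd; do that by hand afterward.

```shell
# purge_statmon_host: remove the file named HOST from the sm and sm.bak
# directories under BASE (second argument; defaults to /var/statmon).
# Hypothetical helper; it only removes files and leaves statd alone.
purge_statmon_host() {
    host=$1
    base=${2:-/var/statmon}
    for dir in "$base/sm" "$base/sm.bak"; do
        if [ -f "$dir/$host" ]; then
            rm -f "$dir/$host" && echo "removed $dir/$host"
        fi
    done
}
```

Run as root, purge_statmon_host hostname removes the stale entries for that host and reports each file it deleted.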

stty: TCGETS: Operation not supported on socket

Cause

This message results when a user tries to remote copy with rcp(1) or remote shell with rsh(1) from one machine to another, but has an stty(1) command in the remote .cshrc file. This error results in failure of the rcp(1) or rsh(1) command.

Action

The solution is to move the invocation of the stty(1) command to the user's .login (or equivalent) file. Alternatively, execute the stty(1) command in .cshrc only when the shell is interactive. Here is a test to do that:


if ($?prompt) stty ...

Technical Notes

The rcp(1) and rsh(1) commands make a connection using sockets, which do not support stty(1)'s TCGETS ioctl.

su: No shell

Cause

This message indicates that someone changed the default login shell for root to a program that is missing from the system. For example, the final colon-separated field in /etc/passwd could have been changed from /sbin/sh to /usr/bin/bash, which does not exist in that location. Possibly an extra space was appended at the end of the line. The outcome is that you cannot log in as root or switch user to root, and so cannot directly fix this problem.

Action

The only solution is to reboot the system from another source, then edit the password file to correct this problem. Invoke sync(1M) several times, then halt the machine by typing Stop-A or by pressing the reset button. Reboot single-user from CD-ROM, the net, or diskette, such as by typing boot cdrom -s at the ok prompt.

After the system comes up and gives you a # prompt, mount the device corresponding to the original / partition somewhere, such as with a mount(1M) command similar to the one below. Then run an editor on the newly-mounted system password file (use ed(1) if terminal support is lacking):


# mount /dev/dsk/c0t3d0s0 /mnt
# ed /mnt/etc/passwd

Use the editor to change the password file's root entry to call an existing shell, such as /usr/bin/csh or /usr/bin/ksh.
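
Before rebooting, it can help to verify that the edited root entry now names a usable shell. This is a minimal sketch, assuming a POSIX shell and awk are available in the rescue environment. The function name check_root_shell is hypothetical. Its second argument is the mount point of the original / partition (for example /mnt). If /usr is a separate partition that is not mounted there, shells under /usr will be reported as missing even when they are not.

```shell
# check_root_shell: inspect the root entry of a passwd file (first
# argument) on a filesystem mounted at PREFIX (second argument, may be
# empty). Prints a diagnosis and fails if the shell field looks unusable.
check_root_shell() {
    passwd_file=$1
    prefix=$2
    # Field 7 is the login shell; trailing spaces are kept so they show up.
    shell=$(awk -F: '$1 == "root" { print $7; exit }' "$passwd_file")
    case $shell in
        *" "*) echo "root shell field contains a space: '$shell'"
               return 1 ;;
    esac
    # An empty field falls back to a default shell, so test named shells only.
    if [ -n "$shell" ] && [ ! -x "$prefix$shell" ]; then
        echo "root shell $shell is missing or not executable under ${prefix:-/}"
        return 1
    fi
    return 0
}
```

For example, check_root_shell /mnt/etc/passwd /mnt exits with status 0 when the root entry points at an executable shell on the mounted partition.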

Technical Notes

To keep the "No shell" problem from happening, habitually use admintool or /usr/ucb/vipw to edit the password file. These tools make it difficult to change password entries in ways that make the system unusable.

SunPC may NOT run correctly as root

Cause

With SunPC 4.1 and the 102924 jumbo patch installed, the following error message is displayed when a user attempts to run SunPC:


SunPC may NOT run correctly as root.
Please run in user mode.
SunPC script is exiting

Yet the user is not root.

The user's primary group id is probably root. For example:


$ /usr/bin/id
uid=33650(gruff) gid=0(root)

Action

Change the user's primary group to another group, such as 10. If the user still needs to be in the root group, add the root group to the user's secondary group list.
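
To confirm the diagnosis, or to verify the fix, you can extract the primary group ID from the id output. This is a minimal sketch, assuming a POSIX shell and id output in the format shown in the example above. The function name primary_gid is hypothetical.

```shell
# primary_gid: print the numeric primary group ID from a line of id(1)
# output such as "uid=33650(gruff) gid=0(root)".
# The space before "gid=" keeps the pattern from matching "egid=".
primary_gid() {
    echo "$1" | sed -n 's/^.* gid=\([0-9][0-9]*\).*$/\1/p'
}
```

For example, primary_gid "$(/usr/bin/id)" prints 0 while the user's primary group is still root, and the new group number after the change.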

su: 'su root' failed for string on /dev/pts/int

Cause

The user specified after "for" tried to become superuser, but typed the wrong password.

Action

If the user is supposed to know the root password, wait to see if the correct password is supplied. If the user is not supposed to know the root password, ask why he or she is attempting to become superuser.
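
If these messages are being saved to a log file, a short filter can show who is failing most often. This is a minimal sketch, assuming a POSIX shell and assuming each logged line contains the message text exactly as shown in the heading above. The function name failed_su_users is hypothetical, and the log file to scan is passed as the argument because its location depends on your syslog configuration.

```shell
# failed_su_users: list users with failed 'su root' attempts recorded in
# LOGFILE (first argument), most frequent first, one "count user" per line.
failed_su_users() {
    sed -n "s/.*su: 'su root' failed for \([^ ]*\) on .*/\1/p" "$1" |
        sort | uniq -c | sort -rn
}
```

A user who appears with a high count, and who is not supposed to know the root password, is worth a conversation.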

su: 'su root' succeeded for string on /dev/pts/int

Cause

The user specified after "for" just became superuser by typing the root password.

Action

If the user is supposed to know the root password, this message is only informational. If the user is not supposed to know the root password, change this password immediately and ask how the user learned it.

syncing file systems...

Cause

This indicates that the kernel is updating the super-blocks before taking the system down, to ensure filesystem integrity. This message appears after a halt(1M) or reboot(1M) command. It can also appear after a system panic, in which case the system might contain corrupted data.

Action

If you just halted or rebooted the machine, don't worry; this message is normal. In case of a system panic, look up the panic messages that appear above this one. Your system vendor might be able to help diagnose the problem. So that you can describe the panic to the vendor, either leave your system in its panicked state or be sure that you can reproduce the problem.

Technical Notes

Numbers that sometimes display after the three dots in the message show the count of dirty pages that are being written out. Numbers in brackets show an estimate of the number of busy buffers in the system.

SYSLOGD CAUSES SYSTEM HANGS

Cause

(Over and Over again = installpatch problems)

syslog service starting.

Cause

During system reboot, this message might appear and the boot seems to hang. After starting syslogd(1M) service, the system runs /etc/rc2.d/S75cron, which in turn calls ps(1). Sometimes after an abrupt system crash /dev/bd.off becomes a link to nowhere, causing the ps(1) command to hang indefinitely.

Action

Reboot single user (for example with boot -s) and run ls -l /dev/bd* to see if this is the problem. If so, remove /dev/bd.off, then run bdconfig off or reboot with the -r (reconfigure) option.

This is the most commonly reported situation that causes ps(1) to hang.
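
The "link to nowhere" condition can also be tested directly rather than read from the ls -l output. This is a minimal sketch, assuming a POSIX shell whose test command supports -L and -e (some older Bourne shells use -h for symbolic links instead). The function name is_dangling_link is hypothetical.

```shell
# is_dangling_link: succeed if the argument is a symbolic link whose
# target does not exist.
is_dangling_link() {
    [ -L "$1" ] && [ ! -e "$1" ]
}
```

For example, is_dangling_link /dev/bd.off && echo "/dev/bd.off points nowhere" reports the condition described above.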

System booting after fatal error FATAL

Cause

The system reboots automatically. Afterward, the messages file contains System booting after fatal error FATAL.

The message is issued during a reboot after the system detects a hardware error. Possible causes include UPA address parity errors, master queue overflows, DTAG parity errors, E-Cache tag parity errors, and coherence errors.

Action

Use prtdiag(1M) to help identify failed hardware components. These errors indicate either a bad CPU module or a bad system board.

system hang

Cause

4.1.3C SBUS cards suffer system freeze

SYSTEM HANGS DURING BOOT

Cause

When the user boots a system, it hangs after the boot up messages "root on," "swap on," and "dump on." After the system displays these messages, the LEDs will flash and the system hangs.

This is the result of an earlier fsck that deleted devices under the /dev directory. Check for the /dev/console device and if it is missing, make one.