Gathering Debug Data for Sun Java System Messaging Server

To Collect Debug Data on a Hung or Unresponsive Messaging Server Process

A process hang occurs when a Messaging Server process stops responding to requests while it is still running locally. Messaging Server's seven specific processes are:

Additionally, Messaging Server uses these three processes:

The system can be running more than one of each of these processes (especially tcp_smtp_server).

Other processes from the job_controller.cnf configuration file might be running (autoreply, ims_master, l_master, conversion, reprocess, and so on), as well as reconstruct and imexpire processes.

Before You Begin

Make sure that you collect all the data over the same time frame in which the problem occurs. See 1.6 Configuring Solaris OS to Generate Core Files if a core file is not generated.

Collect the following information for process hang problems. Run the commands in order while the problem is occurring. If possible, record the time at which the hang happened and which processes were affected.

  1. Collect the general system information as explained in To Collect Required Debug Data for Any Messaging Server Problem.

  2. Run the netstat command and save the output.

    UNIX and Linux

    netstat -an | grep messaging-service-port

    Windows

    netstat -an

  3. Run the following commands and save the output.

    Solaris OS

    ps -ef | grep server-root
    vmstat 5 5
    iostat -x
    top
    uptime

    HP-UX

    ps -aux | grep server-root
    vmstat 5 5
    iostat -x
    top
    sar

    Linux

    ps -aux | grep server-root
    vmstat 5 5
    top
    uptime
    sar

    Windows

    Obtain the MESSAGING process PID: C:\windbg-root>tlist.exe

    Obtain process details of the MESSAGING running process PID: C:\windbg-root>tlist.exe messaging-pid
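The output collected in this step is only useful if it can be matched to the time of the hang, so it can help to save each command's output together with the command line and a timestamp. The following is a minimal sketch for UNIX and Linux, not part of the product: the save_output function name, its label argument, and the output layout are hypothetical choices.

```shell
#!/bin/sh
# save_output OUTDIR LABEL COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND and stores its stdout and stderr in
# OUTDIR/LABEL.out, preceded by the command line and a UTC start time.
save_output() {
    outdir=$1
    label=$2
    shift 2
    mkdir -p "$outdir" || return 1
    {
        echo "# command: $*"
        echo "# started: $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        "$@"
    } > "$outdir/$label.out" 2>&1
}
```

For example, "save_output /tmp/msg-debug vmstat vmstat 5 5" would write /tmp/msg-debug/vmstat.out (the directory name is only an illustration). A pipeline has to be wrapped, for instance with sh -c 'ps -ef | grep server-root', and interactive tools such as top need whatever batch option your platform's version provides.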

  4. Get the swap information.

    Solaris OS

    swap -l

    HP-UX

    swapinfo

    Linux

    free

    Windows

    Already provided in C:\report.txt as described in To Collect Required Debug Data for Any Messaging Server Problem.

  5. Get the system logs.

    Solaris OS and Linux

    /var/adm/messages
    /var/log/syslog

    HP-UX

    /var/adm/syslog/syslog.log

    Windows

    Event log files: Start -> Settings -> Control Panel -> Event Viewer -> Select Log. Then click Action -> Save log file as.


    Note –

    For UNIX systems, depending on your site's configuration of the SNDOPR_PRIORITY option.dat option and your syslog configuration (syslog.conf), the MTA might be sending automatically generated syslog notices to a non-default location. Also, the LOG_SNDOPR option.dat option controls whether additional potential syslog notices are generated by the MTA message logging facility.


  6. For UNIX and Linux systems, get the contents of the /etc/nsswitch.conf and /etc/resolv.conf files.

  7. Get the most current log files for the hanging process, if known. Otherwise, get the log files for all processes.

    If an expected MTA log file does not seem to have been generated, set LOG_SNDOPR in the option.dat file and check for syslog notices. When the LOG_SNDOPR option is set, the MTA generates a syslog notice if it cannot write to its regular log files.

    • Sun Java System Messaging Server (Messaging Server 6):

      UNIX and Linux

      server-root/log/*

      Windows

      server-root\data\log\default\default
      server-root\data\log\http\http
      server-root\data\log\imap\imap
      server-root\data\log\pop\pop
      server-root\data\log\imta\*

    • iPlanet Messaging Server (Messaging Server 5):

      UNIX and Linux

      server-root/msg-identifier/log/default/default
      server-root/msg-identifier/log/http/http
      server-root/msg-identifier/log/imap/imap
      server-root/msg-identifier/log/pop/pop
      server-root/msg-identifier/log/imta/*

      Windows

      server-root\msg-identifier\log\default\default
      server-root\msg-identifier\log\http\http
      server-root\msg-identifier\log\imap\imap
      server-root\msg-identifier\log\pop\pop
      server-root\msg-identifier\log\imta\*

  8. (Solaris OS only) If you are able to isolate the hanging process, get the following debug data for that process. Otherwise, get the following data for each of the Messaging Server processes.

    Using the PID obtained in Step 3, run each of the following commands five times, once every 10 seconds, and save the output of every run:

    pstack messaging-pid
    pmap -x messaging-pid
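The repeated sampling in this step can be scripted. The following is a minimal sketch, assuming a POSIX shell: the sample_process function name, the output file names, and the PSTACK_CMD and PMAP_CMD override variables are hypothetical and exist only so the loop can be tried on a system without the Solaris tools.

```shell
#!/bin/sh
# sample_process PID OUTDIR [COUNT] [INTERVAL]
# Hypothetical helper: runs pstack and pmap -x against PID COUNT times
# (default 5), INTERVAL seconds apart (default 10), one file per sample.
sample_process() {
    pid=$1
    outdir=$2
    count=${3:-5}
    interval=${4:-10}
    # Overridable for testing; left unquoted below so "pmap -x" splits into words.
    pstack_cmd=${PSTACK_CMD:-pstack}
    pmap_cmd=${PMAP_CMD:-pmap -x}

    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        $pstack_cmd "$pid" > "$outdir/pstack.$pid.$i.out" 2>&1
        $pmap_cmd "$pid" > "$outdir/pmap.$pid.$i.out" 2>&1
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

Calling "sample_process messaging-pid /tmp/msg-debug" with the defaults produces five pstack files and five pmap files, numbered in the order they were taken.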
    
  9. Look for any core file that could have been dumped by one of the Messaging Server processes. If you find one, see To Collect Debug Data on a Messaging Server Crashed Process.

  10. If any of the ims_master, imapd, mshttpd, popd, mboxutil, or reconstruct processes is hung, as the mailsrv user, get the outputs of the db_stat command as follows.

    • UNIX and Linux:

      cd msg-instance/data/store/mboxlist
      server-root/lib/db_stat -Co -h msg-instance/data/store/mboxlist/ -N
      server-root/lib/db_stat -t

      If you have set the dbtmpdir configutil variable, use that location instead of msg-instance/store/mboxlist/.


      Note –

      For Solaris OS systems only, the script dbhang captures all the following debug data for you. Edit the top of the script to match your system's configuration. Specifically, edit the SRVROOT, INST, MAILUSER, and DBTMPDIR parameters. Then run the script and collect the data. See 1.7 Running the Messaging Server Debugging Scripts for more information.


    • Windows:

      cd msg-instance\data\store\mboxlist
      server-root\lib\db_stat -Co -h msg-instance\data\store\mboxlist\ -N
      server-root\lib\db_stat -t
  11. Get the output of the following command.

    Solaris OS

    truss -ealf -rall -wall -vall -o /tmp/truss.out -p messaging-pid

    HP-UX

    tusc -v -fealT -rall -wall -o /tmp/tusc.out -p messaging-pid

    Linux

    strace -fv -o /tmp/strace.out -p messaging-pid

    Windows

    Use DebugView: http://www.sysinternals.com/Utilities/DebugView.html


    Note –

    Wait one minute after launching the appropriate command (truss, strace, tusc, or DebugView) then stop it by pressing Control-C in the terminal where you launched the command.
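On UNIX and Linux, the one-minute trace can also be run unattended. The following is a minimal sketch, not a documented procedure: the trace_for function name is hypothetical, and the script sends SIGTERM rather than the SIGINT that Control-C delivers, because non-interactive shells start background jobs with SIGINT ignored. It assumes that your tracer detaches cleanly on SIGTERM; verify this on a test process before relying on it.

```shell
#!/bin/sh
# trace_for SECONDS COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background, waits SECONDS,
# then asks it to stop with SIGTERM and waits for it to exit.
trace_for() {
    duration=$1
    shift
    "$@" &
    tracer_pid=$!
    sleep "$duration"
    # Assumption: the tracer treats SIGTERM like Control-C and detaches.
    kill -TERM "$tracer_pid" 2>/dev/null || true
    wait "$tracer_pid" 2>/dev/null || true
    return 0
}
```

For example, on Linux: "trace_for 60 strace -fv -o /tmp/strace.out -p messaging-pid".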


  12. Get core files and the output of the following commands.

    In a process hang situation, it is helpful to compare several core files to review the state of the threads over time. To avoid overwriting a core file, copy it to a new name, wait approximately one minute, then rerun the following commands. Do this three times to obtain three core files.


    Note –

    For HP-UX, you need the following two patches to use the gcore command: PHKL_31876 and PHCO_32173. If you cannot install these patches, use the HP-UX /opt/langtools/bin/gdb command from version 3.2 and later, or the dumpcore command.


    Solaris OS

    cd server-root/bin/slapd/server
    gcore -o /tmp/messaging_process-core messaging-pid
    pstack /tmp/messaging_process-core

    HP-UX

    # gcore -p messaging-pid
    (gdb) attach messaging-pid
    Attaching to process messaging-pid
    No executable file name was specified
    (gdb) dumpcore
    Dumping core to the core file core.messaging-pid
    (gdb) quit
    The program is running. Quit anyway (and detach it)? (y or n) y
    Detaching from program: , process messaging-pid
    
    Linux

    # gdb
    (gdb) attach messaging-pid
    Attaching to process messaging-pid
    No executable file name was specified
    (gdb) gcore
    Saved corefile core.messaging-pid
    
    (gdb) backtrace
    (gdb) quit
    
    Windows

    Get the MESSAGING process PID:

    C:\windbg-root>tlist.exe

    Generate a crash dump on the MESSAGING running process PID:

    C:\windbg-root>adplus.vbs -hang -p messaging-pid -o C:\crashdump_dir


    Note –

    For Windows, provide the complete generated folder under C:\crashdump_dir.
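For the gcore-based platforms, the three snapshots described at the start of this step can be taken under distinct names from the outset, so that no core file is overwritten. The following is a minimal sketch: the snapshot_cores function name, the messaging-core file prefix, and the GCORE_CMD override variable are hypothetical. It assumes that gcore appends the PID to the name given with -o.

```shell
#!/bin/sh
# snapshot_cores PID OUTDIR [COUNT] [INTERVAL]
# Hypothetical helper: takes COUNT core snapshots of PID (default 3),
# INTERVAL seconds apart (default 60), each under a numbered prefix.
snapshot_cores() {
    pid=$1
    outdir=$2
    count=${3:-3}
    interval=${4:-60}
    gcore_cmd=${GCORE_CMD:-gcore}   # overridable for testing

    mkdir -p "$outdir" || return 1
    n=1
    while [ "$n" -le "$count" ]; do
        # Assumption: gcore writes the core to "<prefix>.<pid>".
        $gcore_cmd -o "$outdir/messaging-core.$n" "$pid" || return 1
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
        n=$((n + 1))
    done
}
```

Make sure the output directory has room for several core files before running it, because each snapshot can be as large as the process image.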


  13. (Solaris OS only) Archive the result of the script pkg_app (one core file is sufficient).

    ./pkg_app.ksh pid-of-application corefile
    

    The Sun Support Center must have the output from the pkg_app script to properly analyze the core file(s).


    Note –

    Make sure the appropriate limitations are set by using the ulimit command, and that the user is not nobody. Also check the coreadm command for additional control. See 1.6 Configuring Solaris OS to Generate Core Files if a core file is not generated.


  14. When you have collected all debug data, perform the following steps to restore the service.

    Messaging Server processes usually hang because of an orphaned lock left in one of the databases. Stopping the server (especially the stored process) and cleaning the temporary shared db files helps to resolve the problem.

    1. Stop Messaging Server.

      cd server-root/sbin
      ./stop-msg

    2. Make sure that all Messaging Server processes stopped.

      Wait one minute, then kill any remaining processes, except tcp_smtp_server processes (which do not use databases).

    3. Restart Messaging Server.

      ./start-msg

    4. After restarting the services, check the logs for any unexpected behavior.
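To check for processes that survived stop-msg on UNIX and Linux, the ps listing can be filtered before anything is killed. The following is a minimal sketch: the leftover_pids function name is hypothetical, and it assumes that every Messaging Server process shows the server-root path in its ps -ef line, as the grep in Step 3 does.

```shell
#!/bin/sh
# leftover_pids SERVER_ROOT
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PID
# (second column) of each line that mentions SERVER_ROOT, skipping
# tcp_smtp_server processes and any grep command.
leftover_pids() {
    awk -v root="$1" '
        index($0, root) && !/tcp_smtp_server/ && !/grep/ { print $2 }
    '
}
```

Run it as "ps -ef | leftover_pids server-root" one minute after stop-msg, review the PIDs it prints, and only then kill them. The sketch deliberately does not kill anything itself.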