Troubleshooting Applications DBA Operations

Managing Worker Processes

The adadmin and adop utilities can process jobs in parallel to reduce the time needed to complete them. This section describes the procedures for reviewing these processes, and handling situations where processing has been interrupted.

Note: For more information, see Using Parallel Processing.

Reviewing Worker Status

Requirement

How can I monitor the progress of parallel processing jobs?

Discussion

When adadmin and adop process jobs in parallel, they assign jobs to workers for completion. There may be situations that cause a worker to stop processing. AD Controller is a utility you can use to determine the status of workers and manage worker tasks. You use it to monitor the actions of workers and the status of the processing jobs they have been assigned.

Actions

To review worker status, perform these steps:

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Review worker status.

    Select "Show worker status" from the AD Controller main menu. AD Controller displays a summary of current worker activity. The summary columns are:

    • Control Worker is the worker number

    • Code is the last instruction from the manager to this worker

    • Context is the general action the manager is executing

    • Filename is the file the worker is running (if any)

    • Status is the current status of the worker

    The following table describes the types of status that may be assigned to a worker and reported in the Status column.

    Worker Status Values

    Status          Meaning
    --------------  ------------------------------------------------------------
    Assigned        The manager assigned a job to the worker, and the worker has not started.
    Completed       The worker completed the job, and the manager has not yet assigned it a new job.
    Failed          The worker encountered a problem.
    Fixed, Restart  The worker should retry the failed operation now that the problem has been fixed.
    Restarted       The worker is retrying a job or has successfully restarted a job (note that the status does not change to Running).
    Running         The worker is running a job.
    Wait            The worker is idle.

    If the worker status shows as Failed, the problem may need to be fixed before the AD utility can complete its processing. This is described next.

Determining Why a Worker Failed

Requirement

One of the workers has failed. How do I determine the cause of the failure?

Discussion

When a worker fails its job, you do not have to wait until the other workers and the manager stop. Use the worker log files (adworknnn.log, where nnn is the worker number) to determine what caused the failure. These log files are written to $APPL_TOP/admin/<SID>/log. You can find the worker log file and copy it to a temporary area so that you can review it. If the job was deferred after the worker failed, there may be no action required on your part.
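The following is a minimal bash sketch of finding a worker log and copying it to a temporary area for review. It assumes the environment file has already set APPL_TOP and that you pass the <SID> directory name yourself. The function name copy_worker_log and the default destination /tmp/adwork_review are hypothetical; only the adworknnn.log naming and the log directory come from this section.

```shell
# Hypothetical helper: copy_worker_log WORKER_NUM SID [DEST_DIR]
# Builds $APPL_TOP/admin/<SID>/log/adworkNNN.log, where NNN is the
# worker number zero-padded to three digits, and copies it to DEST_DIR.
copy_worker_log() {
    local worker=$1
    local sid=$2
    local dest=${3:-/tmp/adwork_review}
    local log

    # Fail early if the environment file has not been sourced.
    if [ -z "${APPL_TOP:-}" ]; then
        echo "APPL_TOP is not set; source the environment file first" >&2
        return 1
    fi

    log=$(printf '%s/admin/%s/log/adwork%03d.log' "$APPL_TOP" "$sid" "$worker")
    if [ ! -f "$log" ]; then
        echo "No log file found at $log" >&2
        return 1
    fi

    # Review a copy, leaving the file the worker writes to untouched.
    mkdir -p "$dest" || return 1
    cp "$log" "$dest/" || return 1
    echo "$dest/$(basename "$log")"
}
```

For example, `copy_worker_log 7 MYSID` (with MYSID replaced by your own <SID>) copies adwork007.log and prints the path of the copy.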

The first time a job fails, the manager defers the job and assigns the worker a new job. If the deferred job fails a second time, the manager defers it again only if the job's runtime is less than ten minutes. If the deferred job fails a third time, or if the job's runtime is greater than ten minutes, the job stays at a failed status and the worker waits for intervention.
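The deferral rule above can be restated as a small bash function. This is an illustrative model of the documented behavior, not AD's actual source code; the function name deferral_outcome is hypothetical, and because the text does not say what happens at exactly ten minutes, the model assumes such a job is not deferred.

```shell
# Illustrative model only of the documented deferral rule.
#   $1 = number of times the job has failed so far (1, 2, 3, ...)
#   $2 = runtime of the job in whole minutes
# Prints "deferred" if the manager would defer the job,
# or "failed" if the job stays failed and the worker waits.
deferral_outcome() {
    local failures=$1
    local runtime_min=$2

    if [ "$failures" -eq 1 ]; then
        echo deferred   # first failure: always deferred
    elif [ "$failures" -eq 2 ] && [ "$runtime_min" -lt 10 ]; then
        echo deferred   # second failure of a job that ran under ten minutes
    else
        echo failed     # third failure, or a long-running job (assumed for exactly 10)
    fi
}
```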

Actions

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Identify the worker that encountered a problem.

    Workers that have encountered problems stop processing jobs and show a status of Failed. Follow the steps in the Reviewing Worker Status section in this chapter to determine which workers have a status of Failed.

  4. Review the log file to find out why the worker failed.

    The following is an example of a worker failure message:

    AD Worker error:
    The following ORACLE error:
    
    ORA-01630: max # extents (50) reached in temp segment in tablespace TSTEMP
    occurred while executing the SQL statement:
    
    CREATE INDEX AP.AP_INVOICES_N11 ON AP.AP_INVOICES_ALL (PROJECT_ID, TASK_ID)
    NOLOGGING STORAGE (INITIAL 4K NEXT 512K MINEXTENTS 1 MAXEXTENTS 50
    PCTINCREASE 0 FREELISTS 4) PCTFREE 10 MAXTRANS 255 TABLESPACE  APX
    
    AD Worker error:
    Unable to compare or correct tables or indexes or keys because of the error
    above

    In this example, the worker could not create the index AP_INVOICES_N11 because the maximum number of extents in the temporary tablespace was reached.

  5. Determine how best to resolve the problem that caused the failure. For example, search My Oracle Support for potential causes. If you cannot identify a fix, you may wish to open a service request with Oracle Support.
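To pull the relevant lines out of a long worker log before searching My Oracle Support, a grep-based sketch such as the following may help. The names show_worker_errors and ora_codes are hypothetical; they assume errors are reported in the "AD Worker error" and ORA-nnnnn format shown in the example above.

```shell
# Hypothetical helper: print each "AD Worker error" line and each ORA- error
# line from a worker log, prefixed with its line number.
show_worker_errors() {
    grep -n -E 'AD Worker error|ORA-[0-9]{5}' "$1"
}

# Hypothetical helper: list the distinct ORA- error codes in a worker log,
# which are useful search terms on My Oracle Support.
ora_codes() {
    grep -o -E 'ORA-[0-9]{5}' "$1" | sort -u
}
```

Run both against the copy of the log, for example `ora_codes adwork001.log`, rather than against the live file.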

Handling a Failed Job

Requirement

I have reviewed the log file for the failed worker and determined the problem. What do I do next?

Discussion

A worker usually runs continuously in the background. When it fails to complete the job it was assigned, it reports a status of Failed. When the manager displays an error message, confirm the failed status of the worker by using AD Controller to review worker status. If the job was deferred after the worker failed, no action may be required.

Note: For more information, see Using Parallel Processing.

Actions

Perform the following steps:

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Identify the failed file.

    The Control Worker and Filename columns in the AD Controller worker status screen show the numbers of the workers that failed and the names of the files that failed to run.

  4. Review the worker log file.

    Each worker logs the status of tasks assigned to it in a log file called adworknnn.log, where nnn is the worker number. For example, adwork001.log for worker 1 and adwork007.log for worker 7. These files are in the $APPL_TOP/admin/<SID>/log directory on the patch file system. Review adworknnn.log for the failed worker to determine the source of the error.

  5. Resolve the error.

    Resolve the error using the information provided in the log files. Contact Oracle Support Services if you do not understand how to resolve the issue.

  6. Restart the failed job.

    Choose Option 2 from the AD Controller main menu to tell the worker to restart a failed job.

  7. Verify worker status.

    Choose Option 1 (Show worker status) again. The Status column for the worker that failed should now show Restarted or Fixed, Restart.

    Note: When all workers are in either Failed or Wait status, the manager becomes idle. At this point, you must take action to get the failed workers running again.

Terminating a Hanging Worker Process

Requirement

A worker process has been running for a long time. What should I do?

Discussion

When running AD utilities, there may be situations when a worker process appears to hang, or stop processing. If this occurs, it may be necessary to terminate the process manually. Once you do, you must also restart that process manually.

Caution: A process that appears to be hanging could actually just be a long-running job.

To terminate a process, start AD Controller, obtain the process ID of the worker, and then stop any hanging processes. Once you have made the necessary changes, you can restart the job or worker.

Note: For more information, see Restarting a Failed Worker.

Actions

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Determine what the worker process is doing.

    Use the AD Controller worker status screen to determine the file being processed and check the worker log file to see what it is doing:

    • Check whether the process is consuming CPU.

    • Review the file to see what actions are being taken.

    • Check for correct indexes on the tables (if the problem appears to be performance-related).

    • Check for an entry for this process in the V$SESSION view. This may provide clues to what the process is doing in the database.

  4. Get the worker's process ID.

    If the job is identified as "hanging," determine the worker's process ID.

    UNIX:

    $ ps -a | grep adworker

    Windows:

    Invoke the Windows Task Manager (with Ctrl-Alt-Delete or Ctrl-Shift-Esc) to view processes.

  5. Determine what processes the worker has started, if any.

    If there are child processes, get their process IDs. Examples of child processes include SQL*Plus and FNDLOAD.

  6. Stop the hanging process, using the command that is appropriate for your operating system.

  7. Fix the issue that caused the worker to hang. Contact Oracle Support Services if you require assistance doing this.

  8. Restart the job or the worker.

    See Restarting a Failed Worker in this chapter for more information.
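On UNIX, steps 3 through 6 can be sketched as a few bash functions. This is a hedged illustration, not an Oracle-supplied tool: the function names are hypothetical, the worker is assumed to appear as adworker in ps output (as in the ps example above), and the V$SESSION.PROCESS column is assumed to hold the worker's client OS process ID, which is typically the case for connections made from the same machine. The sketch uses `ps -eo pid=,ppid=,args=` instead of `ps -a` so that processes without a controlling terminal are also listed.

```shell
# Step 3: print a SQL query showing what the database session for a given
# OS process is doing. Paste the output into SQL*Plus yourself.
session_query() {
    local pid=$1
    printf '%s\n' \
        'SELECT sid, serial#, status, program, module, event, sql_id, last_call_et' \
        '  FROM v$session' \
        "  WHERE process = '$pid';"
}

# Step 4: read "PID PPID ARGS" lines on stdin (from ps -eo pid=,ppid=,args=)
# and print the PID of each process whose command name contains "adworker".
# Matching only the command field keeps awk itself out of the result.
worker_pids() {
    awk '$3 ~ /adworker/ { print $1 }'
}

# Step 5: from the same input, print the PIDs whose parent is $1
# (child processes such as SQL*Plus or FNDLOAD).
child_pids() {
    awk -v parent="$1" '$2 == parent { print $1 }'
}

# Step 6: ask a process to exit with SIGTERM, wait up to $2 seconds
# (default 30), and send SIGKILL only if it is still running.
stop_process() {
    local pid=$1
    local grace=${2:-30}
    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "cannot signal $pid (not running, or not owned by you)" >&2
        return 1
    fi
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    echo "$pid still running after SIGTERM; sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example usage (12345 is a placeholder PID):
#   ps -eo pid=,ppid=,args= | worker_pids
#   ps -eo pid=,ppid=,args= | child_pids 12345
#   session_query 12345
#   stop_process 12345 60
```

Trying SIGTERM first gives the process a chance to exit cleanly; the grace period is an arbitrary choice. As described in Restarting a Failed Worker, terminating a hanging child process causes the worker that spawned it to show Failed.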

Restarting Processes

This section describes some situations where you may need to choose the restart option in AD Controller.

Restarting a Failed Worker

Requirement

I need to restart a failed worker.

Discussion

If a worker has failed, or if you have terminated a hanging worker process, you need to restart the worker manually.

Some worker processes spawn other processes called child processes. If you terminate a hanging child process, the worker that spawned it shows a status of Failed. After you fix the problem, choose to restart the failed job. Once the worker is restarted, the associated child processes are started as well.

Actions

Perform these steps:

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Choose Option 1 to review worker status.

  4. Take the appropriate action for each worker status.

    If the worker shows Failed, choose Option 2 to restart the failed job. When prompted, enter the number of the worker that failed.

    If the worker shows Running or Restarted status, but the process is not really running, select the following menu options:

    • Option 4: Tell manager that a worker has failed its job. When prompted, enter the number of the hanging worker.

    • Option 6: Tell manager to start a worker that has shut down on the current machine. When prompted, enter the number of the worker that failed.

    Caution: Do not choose Option 6 if the worker process is running. Doing so will create duplicate worker processes with the same worker ID.

    The worker will restart its assigned jobs and spawn the necessary child processes.
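The decision in step 4 can be summarized as a small bash function. It is a hypothetical helper that only restates the rules above: whether the worker's OS process is really alive ("yes" or "no") is an input you determine yourself, for example with ps, and statuses this section does not cover map to "none".

```shell
# Hypothetical helper: adctrl_options STATUS ALIVE
#   STATUS = value from the Status column (Option 1)
#   ALIVE  = "yes" if the worker's OS process is running, otherwise "no"
# Prints the AD Controller menu options to choose, in order.
adctrl_options() {
    local status=$1
    local alive=$2
    case "$status" in
        Failed)
            echo "2" ;;              # restart the failed job
        Running|Restarted)
            if [ "$alive" = "no" ]; then
                echo "4 6"           # mark the job failed, then start the worker
            else
                echo "none"          # worker really is running: never use Option 6
            fi ;;
        *)
            echo "none" ;;
    esac
}
```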

Restarting an AD Utility After Machine Failure

Requirement

While I was running an AD utility, the machine crashed. What is the best way to restart the utility?

Discussion

Because the manager cannot automatically detect a machine crash, you must manually notify it that all jobs have failed and manually restart the workers. If you restart the utility without doing this, the utility status and the system status will not be synchronized.

Actions

Perform these steps:

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. Select the following options:

    • Option 4: Tell manager that a worker has failed its job (specify 'all' for workers)

    • Option 2: Tell worker to restart a failed job (specify 'all' for workers)

  4. Restart the AD utility that was running when the machine crashed.

Shutting Down and Restarting Managers

This section discusses some reasons for shutting down and reactivating managers.

Shutting Down a Manager

Requirement

How do I stop an AD utility while it is running?

Discussion

There may be situations when you need to shut down an AD utility while it is running. For example, you may need to shut down the database during an adadmin or adop session.

You should perform this shutdown in an orderly fashion so that it does not affect your data. The best way to do this is to shut down the workers manually, which causes the AD utility to quit cleanly.

Actions

Perform these steps:

  1. Set the environment by executing (sourcing) the patch file system environment file:

    $ source <patch APPL_TOP path>/APPS<CONTEXT_NAME>.env

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Start AD Controller by entering adctrl on the command line.

  3. In adctrl, select Option 3 (Tell worker to quit) and enter 'all' for the worker number. Each worker stops when it either completes or fails its current job.

  4. Verify that no worker processes are running. Use the appropriate command for your platform.

    UNIX:

    $ ps -a | grep adworker

    Windows:

    Invoke Windows Task Manager (with Ctrl-Alt-Delete or Ctrl-Shift-Esc) to view the relevant processes.

  5. When all workers have shut down, the manager and the AD utility quit.
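On UNIX, the check in step 4 can be turned into a polling loop that waits for the workers to exit. This is a sketch, not an Oracle-supplied script: list_processes and wait_for_workers_exit are hypothetical names, it assumes workers appear as adworker in the process list, and the default timeout of 600 seconds and interval of 10 seconds are arbitrary. It uses `ps -eo args=` rather than `ps -a` so that processes without a controlling terminal are also listed.

```shell
# Hook that prints one command line per process; redefine it if your
# platform's ps needs different options.
list_processes() {
    ps -eo args=
}

# Poll until no adworker process remains, or give up after a timeout.
wait_for_workers_exit() {
    local timeout=${1:-600}   # total seconds to wait
    local interval=${2:-10}   # seconds between checks
    local waited=0
    # The pattern '[a]dworker' matches "adworker" but not grep's own command line.
    while list_processes | grep -q '[a]dworker'; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "adworker processes still running after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "No adworker processes are running."
}
```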

Restarting a Manager

Requirement

No workers are running jobs when they should be. What is the problem?

Discussion

A restarted worker resumes the failed job immediately as long as the worker process is running. The other workers change to a Wait status if they cannot run any jobs because of dependencies on the failed job, or because there are no jobs left in the phase. When no workers are able to run, the manager becomes idle and messages like the following will appear on the screen:

ATTENTION: All workers either have failed or are waiting:

FAILED: file cedropcb.sql on worker 1.
FAILED: file adgrnctx.sql on worker 2.
FAILED: file aftwf01.sql on worker 3.

ATTENTION: Please fix the above failed worker(s) so the manager can continue.
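To map each FAILED line in a message like this to the log file you need to review, a sketch such as the following may help. The function name failed_workers is hypothetical; it assumes the message format is exactly as shown above and uses the zero-padded adworknnn.log naming described in Handling a Failed Job.

```shell
# Hypothetical helper: read the manager's ATTENTION output on stdin and print
# "worker file logfile" for each line of the form
#   FAILED: file <name> on worker <n>.
failed_workers() {
    sed -n 's/^FAILED: file \([^ ]*\) on worker \([0-9][0-9]*\)\.$/\2 \1/p' |
    while read -r worker file; do
        printf '%s %s adwork%03d.log\n' "$worker" "$file" "$worker"
    done
}
```

For the message above, this prints `1 cedropcb.sql adwork001.log`, `2 adgrnctx.sql adwork002.log`, and `3 aftwf01.sql adwork003.log`.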

Actions

Complete the following steps for each failed worker:

  1. Set the environment and start AD Controller by entering adctrl on the command line.

    Note: For more information, see Setting the Environment in Running AD Utilities.

  2. Determine the cause of the error.

    Choose Option 1 to view the status. Review the worker log file for the failed worker to determine the source of the error.

  3. Resolve the error.

    Use the information provided in the log files. Contact Oracle Support Services if you do not understand how to resolve the issue.

  4. Restart the failed job.

    Choose Option 2 on the AD Controller menu to tell the worker to restart a failed job. The worker process restarts, causing the AD utility to become active again.