33 Diagnosing Job System Issues

Job Diagnostics gives you an in-depth view into the job system. Using intuitive dashboards, Job Diagnostics provides an administrator view of the job system to diagnose problems and resolve job system performance issues.

Typical Job System Issues

Below are some of the top issues that can affect Job System performance:

  • Agent is Down, Unknown, or Suspended in Blackout
  • Agent is overloaded resulting in excessive job retries (Metric Extensions can often cause this)
  • Priority jobs are getting starved due to failing System Retry Jobs
  • DB session hang due to repository background process deadlocks
  • OMS UI console to PBS communication failure
  • Corrective Actions trigger too frequently due to incorrect metric threshold settings
  • User-suspended jobs are locking resources
  • Long running jobs are blocking common Job System resources, thus preventing new jobs from running
  • Job backlog due to a stuck job at the head of the queue

The Job Diagnostics dashboard enables administrators to easily identify the above issues, diagnose the root cause, and take appropriate action.

Job System Components

The Enterprise Manager Job System is an OMS subsystem and includes a Job Scheduler and Job Workers. In turn, the Job Scheduler consists of two components: the Job Step Scheduler and the Job Dispatcher. In addition to user-submitted jobs, the majority of the background tasks in Enterprise Manager are run via a series of jobs. Typical tasks carried out by these jobs are loading metric data, calculating the availability of composite targets, rollup and purge of metric data and notifications.

The Job System relies on numerous components to perform optimally. Job Diagnostics consolidates performance information pertaining to these components into intuitive dashboards for easy comparison and analysis. The primary components of the Job System are shown in the following illustration.


Graphic shows the Job System architecture.

Job System Components Used by Job Diagnostics

  • Job Step Scheduler – The Job Step Scheduler is a global component so there is only one per Enterprise Manager environment. It is scheduled to run by the DBMS Scheduler. The primary purpose of this component is to mark steps ready for the dispatcher to execute.

  • Job Dispatcher – The Job Dispatcher runs locally on each OMS and its purpose is to dispatch the jobs found by the Job Step Scheduler to the Job Workers. The Enterprise Manager Job System also has a notion of short jobs (user jobs that complete quickly) and long jobs (user jobs that run a long time), and has separate worker pools in the OMS (not in the database as with the job workers) to handle those requests. If the dispatcher cannot keep up with the work in the queue, the backlog increases. This is not a problem as long as the backlog is temporary. If the backlog persists, then either the dispatcher cannot keep up with the amount of work, which could mean adding another OMS server, or there is a problem with the Job Workers and they cannot accept work from the dispatcher.

  • Job Workers – Job Workers take work for a given job step from the Job Dispatcher and process it. This can happen while holding a thread for steps that do processing in Java, by contacting the repository for those that use SQL, or by contacting the agent for those that run remotely. If Job Workers are always busy and never free, then capacity needs to be added, either by adding another OMS server or by increasing the number of Job Workers and potentially increasing the number of DB connections (each Job Worker takes a connection to the database).
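The difference between a temporary and a persistent backlog can be illustrated with a toy model. The following sketch is not Enterprise Manager code: the function name `simulate_backlog` and the per-interval step counts are hypothetical, chosen only to show how a short burst drains once load drops while sustained overload keeps growing.

```python
def simulate_backlog(ready_per_interval, capacity_per_interval):
    """Return the backlog of steps remaining after each interval.

    ready_per_interval: steps marked ready in each interval (made-up data).
    capacity_per_interval: steps that can be dispatched and executed per
        interval (a simplifying assumption; real capacity varies).
    """
    backlog = 0
    history = []
    for ready in ready_per_interval:
        pending = backlog + ready
        processed = min(pending, capacity_per_interval)
        backlog = pending - processed
        history.append(backlog)
    return history


# A short burst: the backlog rises, then drains once the load drops.
burst = simulate_backlog([10, 40, 40, 10, 10, 10], capacity_per_interval=20)

# Sustained overload: the backlog grows in every interval.
overload = simulate_backlog([30, 30, 30, 30, 30, 30], capacity_per_interval=20)

print(burst)
print(overload)
```

In this model, only the second pattern would suggest adding capacity or investigating the Job Workers.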

Accessing Job Diagnostics

  1. Log in to the Enterprise Manager console as a user with Super Administrator privileges. Note: Super Administrator privileges are required to access the Job Diagnostics UI.
  2. From the Setup menu, select Manage Cloud Control and then Job Diagnostics. The Job Diagnostics home page displays.

Home (Overview) Dashboard

From the Job Diagnostics home page, you can select the following Job System areas to analyze.


Graphic shows the left job component selector region.

  • Home/Dispatchers: Toggle between the Job Diagnostics Home page and the Dispatchers page. See Dispatchers.
  • Retried Jobs: Jobs that are retried several times. Retries degrade Job System performance by consuming excessive resources. For example, jobs are retried when the agent is down or unreachable.
  • Retried Steps: Steps that are retried.
  • Longest Queues: A job queue ensures that jobs are executed in a particular order on a particular target, for example, save target, delete target, and update properties. Various subsystems of Enterprise Manager use job queues, which are generally used by system jobs. The following table lists common system jobs.

    Table 33-1 Typical System Jobs that use Queues

    Job Name           | Scheduler Job Name                   | Task
    Agent Ping         | EM_PING_MARK_NODE_STATUS             | Keeps track of the health of the host targets in Enterprise Manager.
    Daily Maintenance  | EM_DAILY_MAINTENANCE                 | Performs the daily repository maintenance tasks such as partition maintenance and stats updates.
    Repository Metrics | MGMT_COLLECTION.Collection Subsystem | Shows the amount of work done for the repository metrics.
    Rollup             | EM_ROLLUP_SCHED_JOB                  | Indicates the amount of data involved in the rollup job.
  • Jobs Executing: View a list of jobs that have been executing for the selected Time Frame.

Job System Overview

The Overview section displays at-a-glance information about the three main elements of the Job System in addition to a list of all steps processed in the selected time frame:


Graphic shows the Job Diagnostics home page.

Dispatchers

This region shows the status of all dispatchers within your Enterprise Manager environment. Currently there is at most 1 dispatcher per OMS.

Steps Scheduler

The Steps Scheduler marks the steps as ready for execution so that the dispatcher can pick them up for execution.

Book-keeping Steps

Book-keeping steps are internal Job System steps that help maintain continuity of job execution when various subsystems of Enterprise Manager perform specific actions. For example, they mark jobs, executions, and steps as failed, scheduled, or suspended based on system events such as an agent bounce, blackouts, or group changes.

Steps Processed

This region lists the steps that have been processed by the Job System in a given time frame. The time frame can be fine-grained, from 5 minutes up to a maximum of 1 day. The graph shows the steps marked as ready and the steps that were executed; the yellow line displays the backlog of steps. A high level of backlog indicates that there may be an issue, such as running out of threads.

The table shows the details of all steps that were executed within the selected time frame. Clicking on a step takes you to that step’s Job Activity page where you can view more detailed information.

Retried Jobs

Click Retried Jobs to view the jobs that were retried most often during the specified time frame, along with the number of retries for each.

For example, if the total number of retried jobs is 51 and each job was retried 100 times, then 51 × 100 = 5100 job cycles were spent retrying jobs, which can represent a significant amount of system resources.
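The arithmetic generalizes to jobs with different retry counts. The helper below is a minimal sketch with a hypothetical name (`total_retry_cycles`); it sums retry counts that you would read off the Retried Jobs page and does not call any Enterprise Manager interface.

```python
def total_retry_cycles(retries_per_job):
    """Sum the retry counts of all retried jobs.

    retries_per_job: one retry count per retried job, as read from the
        Retried Jobs page (entered by hand; this is not an EM API call).
    """
    return sum(retries_per_job)


# The example from the text: 51 retried jobs, each retried 100 times.
cycles = total_retry_cycles([100] * 51)
print(cycles)  # 5100
```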

In the following graphic, you can see that the top job, SI_NMR, was retried 100 times before it failed.


Graphic shows the Retried Jobs page.

Clicking on a job takes you to that job’s Job Activity page where you can see which target the job was executed on as well as the output log for that job.


Graphic shows the Job Activity page for the selected Retried Job.

In the above graphic, the Output log shows that NMO is not set up. When an agent is installed, you need to execute the root.sh file, which enables the agent to execute actions for several types of jobs. After reaching the maximum limit of 100 retries, the job is moved to Suspended by User status so that the user can correct the problem before moving the job forward.

Longest Queues

Job Queues ensure that a list of jobs is executed in sequential order. Click Longest Queues to view how many job queues there are and the maximum number of scheduled job executions.


Graphic shows the Longest Queues page.

For example, adding or deleting a target creates several jobs that have to be executed in a particular order. This is accomplished by adding the jobs to a job queue. In the graphic above, the queue has 92 scheduled executions and the status of the job at the top of the queue (Head Job Status) is Agent is not Ready.
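Because a queue runs its jobs strictly in order, a single job that cannot run at the head holds back everything behind it. The following sketch models that behavior with a plain Python deque. The function `drain` and the `head_can_run` callback are hypothetical illustrations, not part of the Job System.

```python
from collections import deque


def drain(queue, head_can_run):
    """Run queued jobs in order, stopping at the first job that cannot run.

    queue: a deque of job names, head at the left.
    head_can_run: callable deciding whether the head job may run, for
        example whether its agent is ready (hypothetical check).
    """
    completed = []
    while queue and head_can_run(queue[0]):
        completed.append(queue.popleft())
    return completed


queue = deque(["save target", "update properties", "delete target"])

# Agent is not ready: the head job is stuck, so nothing behind it runs.
blocked = drain(queue, lambda job: False)
remaining_while_blocked = len(queue)

# Once the agent problem is resolved, the queue drains in order.
done = drain(queue, lambda job: True)
```

This is why the head job is the place to start when diagnosing a long queue.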

Click on a Queue Name to view explicit details for that queue. The Queue Details dialog appears.


Graphic shows the Queue Details dialog.

In the Top Job Types table, you see the job types currently stuck in the queue along with the number of executions. To find out why the jobs in the queue are not getting processed, click on the Head of the Queue job name to go to the Job Activity page for the head job.

On the Job Activity page, you will find specific details about why the job at the head of the queue is causing the backlog. In this case, the current status is Agent is not Ready.

Graphic shows the Job Activity page for the job at the head of the queue.

With this knowledge, you can go to the Target Status page to determine what the problem is with the agent, as shown below.

Graphic shows the Target Status page for the problem agent.

On the above page, the target’s status is Unknown and the agent is blocked with a plug-in mismatch. If the agent is blocked and unable to upload or accept requests, all job requests on it are delayed until the problem is fixed. In this case, the solution is to resolve the plug-in mismatch.

More generally, when the head job status is Agent is not Ready, the underlying issue can be a plug-in mismatch (as in this case), an agent that is down, an agent blackout, or any other issue preventing agent communication. Navigate to the agent home page to determine the root cause. Once the issue is resolved, jobs should automatically start running again.

There are also cases where a target is logically obsolete but not yet deleted from Enterprise Manager. Jobs often build up on such targets. Work with your operations team to finish deleting those targets if possible.

Jobs Executing

Click Jobs Executing to view a summary for all jobs that have successfully executed during the selected time frame.


Graphic shows the Jobs Executing page.

Click on a Job Name to view the Job Activity page for that job.

Dispatchers

As mentioned previously, Job Dispatchers are services that handle dispatching the Job Steps for execution. To view the current status of all Dispatchers in your Enterprise Manager environment (one dispatcher per OMS), select Dispatcher from the drop-down menu.


Graphic shows the selection of the Dispatchers menu option.

The Dispatcher dashboard displays.


Graphic shows the Dispatchers dashboard.

This dashboard displays the details for all dispatchers in your Enterprise Manager environment (one per OMS). You can click on a specific Dispatcher name to display details about that dispatcher. In addition to Status and Up Since, details for the dispatcher's Thread Pool and Connection Pool are also shown.

Thread Pool

Thread pools provide a way to limit the resources used by the Job System. For example, the User Short pool defaults to 25 threads, which allows each OMS to run up to 25 short-running user steps concurrently.
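The capping effect of a fixed-size pool can be demonstrated with Python's standard `ThreadPoolExecutor`. This is only an analogy for how a pool bounds concurrency, not a description of how the OMS implements its pools. The function name and the simulated step duration are assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

USER_SHORT_POOL_SIZE = 25  # default pool size stated in the text


def measure_peak_concurrency(pool_size, step_count, step_seconds=0.01):
    """Submit step_count simulated steps and return the peak number
    that were running at the same time."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def step():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(step_seconds)  # stand-in for real step work
        with lock:
            running -= 1

    # Leaving the with-block waits for all submitted steps to finish.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(step_count):
            pool.submit(step)
    return peak


peak = measure_peak_concurrency(USER_SHORT_POOL_SIZE, step_count=100)
print(peak <= USER_SHORT_POOL_SIZE)  # the pool never exceeds its size
```

However many steps are submitted, the number running at once never exceeds the pool size; the rest wait, which is what appears as backlog.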

Job steps can be categorized into five broad categories:

  • User Short - End user (short running)
  • User Long - End user (long running)
  • System Normal - Steps run by system jobs.
  • System Critical - Steps run by system jobs.
  • Internal - Steps created by the Job System for performing low-level actions like step timeouts, grace period timeouts, and bookkeeping steps.

Connection Pool (maximum number of connections allowed for the Dispatcher)

There are three categories of connections:

  • Job Worker - For the worker threads of the Job System that execute particular steps.
  • Job Receiver - Pool of threads that accept asynchronous status and updates from the agent.
  • Job Dispatcher - Takes care of dispatching the steps to various workers for execution.

If the Job Worker percent usage is high, the dispatcher cannot dispatch to all the workers in a timely fashion. In this situation, there could be a resource problem, and the environment could probably benefit from more worker threads. However, do not go beyond doubling the number of threads. If doubling the number of threads is not sufficient, contact Oracle, as it might be better to add an additional OMS.
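The two checks in this guidance, percent usage and the doubling limit, amount to simple arithmetic. The helpers below are a hedged sketch with hypothetical names; the input values are figures you would read from the dashboard, and no threshold for "high" usage is assumed because the text does not define one.

```python
def pool_usage_percent(in_use, pool_size):
    """Percent of a pool currently in use (values read from the dashboard)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 100.0 * in_use / pool_size


def within_doubling_limit(original_size, proposed_size):
    """True if the proposed thread count is at most double the original,
    following the 'do not go beyond doubling' guidance."""
    return proposed_size <= 2 * original_size


usage = pool_usage_percent(in_use=20, pool_size=25)
ok = within_doubling_limit(original_size=25, proposed_size=50)
too_many = within_doubling_limit(original_size=25, proposed_size=60)
```

When a proposed size falls outside the limit, the text's advice applies: contact Oracle and consider adding an OMS rather than adding more threads.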

The graph below the Dispatcher Thread Pool and Connection Pool status shows how many steps were executed by each pool.


Graphic shows the lower graph area highlighted.

This graph is interactive and allows you to choose the pool for which you want to see information, thus allowing you to see which pools are being used more at a specific point in time.