Agent Troubleshooting

The following topics describe troubleshooting tips related to the Unified Monitoring Agent, for both Linux and Windows.

Hardware Requirements

Depending on your logging requirements and configuration (number of logs, type of buffering, and so on), the hardware requirements and performance of the Unified Monitoring Agent can vary widely. When no operational pressure is present (fewer than 1,000 log events per minute), the agent should not consume more than 200 MB of RAM and 20% of a CPU core. The Unified Monitoring Agent service has hard-coded limits of 5 GB of RAM and 40% of a core. 1 GB of RAM is also recommended.
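To compare the agent's actual memory footprint with these figures, you can sum the resident memory of its processes. The following is a minimal sketch, not part of the agent: the helper name sum_rss_mb is hypothetical, and it assumes a procps-style ps that reports RSS in kilobytes.

```shell
# Hypothetical helper: reads RSS values in KB (one per line) on stdin
# and prints the total in whole megabytes (fractions are truncated).
sum_rss_mb() {
  awk '{ total += $1 } END { printf "%d\n", total / 1024 }'
}

# Possible usage on an instance (assumes the agent processes match this pattern):
#   ps -e -o rss= -o args= | grep 'unified-monitoring-agen[t]' | awk '{print $1}' | sum_rss_mb
```

For example, RSS values of 40864 KB and 38192 KB add up to 77 MB, which is well below the 200 MB guideline.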

Enabling Monitoring

Monitoring can aid with troubleshooting. See Enabling Monitoring for Compute Instances for more information on how you can enable monitoring (metrics and logging) in your Oracle Cloud Infrastructure Compute instances.

Linux

systemd Units

The Unified Monitoring Agent is based on systemd units, and is composed of the following components:

  1. unified-monitoring-agent.service: The main Unified Monitoring Agent service.
  2. unified-monitoring-agent_config_downloader.service: The configuration automatic updater service.
  3. unified-monitoring-agent_config_downloader.timer: The timer unit, which triggers the automatic downloader service at specified, randomized intervals.
  4. unified-monitoring-agent_restarter.path: The path unit, which triggers the reload of the configuration by the Unified Monitoring Agent, if a change is detected (because of a new configuration being downloaded by the automatic updater service).

Note

Remember that most of the systemctl or journalctl commands must be run with super user privileges (either as root, or through sudo).

To verify that these systemd units are operating correctly, use the systemctl command as follows:

systemctl status <unit_name>

Where <unit_name> must be replaced with one of the following values:

  1. unified-monitoring-agent.service
  2. unified-monitoring-agent_config_downloader.service
  3. unified-monitoring-agent_config_downloader.timer
  4. unified-monitoring-agent_restarter.path

Typically these systemctl commands show output similar to the following:

systemctl status unified-monitoring-agent.service
● unified-monitoring-agent.service - unified-monitoring-agent: Fluentd based data collector for Oracle Cloud Infrastructure
   Loaded: loaded (/usr/lib/systemd/system/unified-monitoring-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-09-29 13:54:03 UTC; 1min 37s ago
     Docs: https://docs.cloud.oracle.com/
  Process: 2337 ExecReload=/bin/kill -USR2 ${MAINPID} (code=exited, status=0/SUCCESS)
  Process: 2321 ExecStart=/opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10 $EXTRA_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2327 (fluentd)
   Memory: 66.3M (limit: 5.0G)
   CGroup: /system.slice/unified-monitoring-agent.service
           ├─2327 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unif...
           └─2330 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.lo...
systemctl status unified-monitoring-agent_config_downloader.service
● unified-monitoring-agent_config_downloader.service - unified-monitoring-agent Fluentd configuration downloader.
  Loaded: loaded (/usr/lib/systemd/system/unified-monitoring-agent_config_downloader.service; enabled; vendor preset: disabled)
  Active: inactive (dead) since Tue 2020-09-29 13:54:38 UTC; 1min 30s ago
 Process: 2333 ExecStart=/opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluent_config_updater.rb -c /etc/unified-monitoring-agent/conf.d/ -b 10 (code=exited, status=0/SUCCESS)
Main PID: 2333 (code=exited, status=0/SUCCESS) 
systemctl status unified-monitoring-agent_config_downloader.timer
● unified-monitoring-agent_config_downloader.timer - Run unified-monitoring-agent configuration automatic updater.
   Loaded: loaded (/usr/lib/systemd/system/unified-monitoring-agent_config_downloader.timer; enabled; vendor preset: disabled)
   Active: active (waiting) since Tue 2020-09-29 13:54:03 UTC; 3min 57s ago 
systemctl status unified-monitoring-agent_restarter.path
● unified-monitoring-agent_restarter.path - "Monitor the /etc/unified-monitoring-agent/conf.d/ directory for changes"
   Loaded: loaded (/usr/lib/systemd/system/unified-monitoring-agent_restarter.path; enabled; vendor preset: disabled)
   Active: active (waiting) since Tue 2020-09-29 13:54:03 UTC; 4min 9s ago 

The most important parts of the systemctl command output are the Loaded and Active fields. The Loaded field should have the value loaded for all systemd units. The Active field should have one of the following values:

  • active (running) for the unified-monitoring-agent.service unit.
  • active (waiting) or active (running) for the unified-monitoring-agent_restarter.path and the unified-monitoring-agent_config_downloader.timer units.
  • active (running) or inactive (dead) for the unified-monitoring-agent_config_downloader.service unit. For the latter value, the Main PID field includes the value (code=exited, status=0/SUCCESS).
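The checks above can be sketched as a small script. This is an illustrative example only: the function names unit_state_ok and check_agent_units are hypothetical, and the script relies on systemctl is-active, which reports only the high-level state (active, inactive, failed, and so on), not the sub-state shown in parentheses.

```shell
# Hypothetical helper: succeeds if state $2 is acceptable for unit $1,
# following the expected Active values listed above.
unit_state_ok() {
  case "$1" in
    unified-monitoring-agent_config_downloader.service)
      # This unit runs to completion, so "inactive" is also acceptable.
      [ "$2" = "active" ] || [ "$2" = "inactive" ]
      ;;
    *)
      [ "$2" = "active" ]
      ;;
  esac
}

# Prints one line per unit: "<unit> <state> OK" or "<unit> <state> CHECK".
check_agent_units() {
  for unit in \
    unified-monitoring-agent.service \
    unified-monitoring-agent_config_downloader.service \
    unified-monitoring-agent_config_downloader.timer \
    unified-monitoring-agent_restarter.path
  do
    state=$(systemctl is-active "$unit" 2>/dev/null)
    state=${state:-unknown}
    if unit_state_ok "$unit" "$state"; then verdict=OK; else verdict=CHECK; fi
    printf '%s %s %s\n' "$unit" "$state" "$verdict"
  done
}
```

Run check_agent_units with super user privileges. For any unit flagged CHECK, the full systemctl status output gives more detail.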

Processes

Another way to verify the correct operation of the Unified Monitoring Agent is to check the system's running processes. When operating correctly, the Unified Monitoring Agent runs two processes: one supervisor process and one worker process. You can verify their existence by running the following command in a terminal (sample output included):

ps aux | grep unified-monitoring-agen[t]
root      2327  0.0  2.3 307704 40864 ?        Sl   13:54   0:00 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10
root      2330  0.2  2.1 297456 38192 ?        S    13:54   0:03 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10 --under-supervisor

As shown in the preceding sample, two processes are running with the same arguments, except for the extra --under-supervisor argument added to the second one. This argument denotes the worker process, which makes the process without it the supervisor.
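The distinction between supervisor and worker can be checked automatically. The following sketch uses a hypothetical helper, count_agent_procs, and assumes the agent's fluentd binary lives under /opt/unified-monitoring-agent/embedded/bin/ as in the sample output.

```shell
# Hypothetical helper: reads process command lines on stdin and reports how
# many agent supervisor and worker processes were seen. A process is counted
# as a worker when its arguments include --under-supervisor.
count_agent_procs() {
  awk '
    /unified-monitoring-agent\/embedded\/bin\/fluentd/ {
      if ($0 ~ /--under-supervisor/) worker++; else supervisor++
    }
    END { printf "supervisor=%d worker=%d\n", supervisor, worker }
  '
}

# Possible usage:
#   ps -e -o args= | count_agent_procs
```

On a healthy instance the expected result is one supervisor and one worker.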

Logs

Note

Remember that most of the systemctl or journalctl commands must be run with super user privileges (either as root, or through sudo).

The Unified Monitoring Agent logs are available at /var/log/unified-monitoring-agent/unified-monitoring-agent.log. This file includes logs from the Unified Monitoring Agent itself.

Besides the agent's logs, which do not contain system-related events (for example, service start, service stop, and so on), you can also view the logs from journald, systemd's system logging service. To view the system logs specific to a unit, you can use the journalctl command like the following:

journalctl -u <unit_name>

Where <unit_name> must be replaced with one of the following values:

  1. unified-monitoring-agent.service
  2. unified-monitoring-agent_config_downloader.service
  3. unified-monitoring-agent_config_downloader.timer
  4. unified-monitoring-agent_restarter.path

When querying journald logs through journalctl, you can also define specific time ranges:

journalctl --since "2020-12-30 00:00:01" --until "2020-12-31 23:59:59"

The date format used is YYYY-MM-DD HH:MM:SS.

You can also tail the journal logs by adding the -f parameter:

journalctl -f
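The unit and time-range options can be combined. As a minimal sketch, the hypothetical helper below (agent_journal_since is not part of the agent) prints the journal of all four agent units since a given point in time. It assumes a journalctl version that accepts several -u options and relative times such as "1 hour ago".

```shell
# Hypothetical helper: prints the journal of every agent unit since the time
# given as the first argument (defaults to "1 hour ago").
agent_journal_since() {
  since=${1:-"1 hour ago"}
  journalctl --no-pager --since "$since" \
    -u unified-monitoring-agent.service \
    -u unified-monitoring-agent_config_downloader.service \
    -u unified-monitoring-agent_config_downloader.timer \
    -u unified-monitoring-agent_restarter.path
}

# Possible usage:
#   agent_journal_since "2020-12-30 00:00:01"
```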

Troubleshooting Scenarios

Problem: The Unified Monitoring Agent is not installed.

Solution: For newly created instances, the automatic installation of the agent can take up to 25 minutes. If the agent is not installed after this time period, check the following:

  1. The network connectivity of the instance.
  2. Whether monitoring is enabled in the Console.

You can also check the log file /var/log/oracle-cloud-agent/plugins/unifiedmonitoring/unifiedmonitoring.log for information regarding the installation of the Unified Monitoring Agent by the Oracle Cloud Agent.

Problem: The Unified Monitoring Agent is not running. Its status is not loaded and active, or the supervisor and worker processes are not both running.

Solution: Restart the Unified Monitoring Agent and check the logs for any problems:

systemctl restart unified-monitoring-agent

Problem: Configuration is not automatically downloaded.

Solution: Ensure you have followed the steps in Installing the Agent and Verify Agent Installation. Consult the journal of the automatic configuration updater service by running:

journalctl -u unified-monitoring-agent_config_downloader.service

Problem: Configuration is not automatically reloaded.

Solution: Ensure you have followed the steps in Installing the Agent and Verify Agent Installation. Consult the journal of all the units:

  1. The timer unit must have run at least one time.
  2. The automatic configuration download service must have run after the relevant timer unit has triggered it. You can verify from its logs that the configuration has been downloaded and extracted to the Unified Monitoring Agent's configuration directory. You can also verify this by listing the files in that directory: ls -lhatR /etc/unified-monitoring-agent.
  3. Verify that the path unit is active by checking its status: systemctl status unified-monitoring-agent_restarter.path.
  4. Verify that a reload signal has been received by the Unified Monitoring Agent, by inspecting its journal: journalctl -u unified-monitoring-agent.service. "Reloading unified-monitoring-agent" appears in the output of this command.
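Steps 2 through 4 above can be sketched as a script. This is an illustration under stated assumptions: the function name check_config_reload is hypothetical, and the reload message is searched for in the journals of both the agent service and the configuration downloader service, so that it is found regardless of which unit logged it. Step 1 (the timer has run) is not covered and can be checked with systemctl status on the timer unit.

```shell
# Hypothetical helper: prints PASS or FAIL for steps 2-4 of the list above.
# $1 is the configuration directory (defaults to /etc/unified-monitoring-agent).
check_config_reload() {
  conf_dir=${1:-/etc/unified-monitoring-agent}

  # Step 2: the configuration directory should contain at least one file.
  if [ -n "$(find "$conf_dir" -type f 2>/dev/null | head -n 1)" ]; then
    echo "config files: PASS"
  else
    echo "config files: FAIL"
  fi

  # Step 3: the path unit should be active.
  if [ "$(systemctl is-active unified-monitoring-agent_restarter.path 2>/dev/null)" = "active" ]; then
    echo "path unit: PASS"
  else
    echo "path unit: FAIL"
  fi

  # Step 4: look for the reload message in the journals of the agent units.
  if journalctl --no-pager \
       -u unified-monitoring-agent.service \
       -u unified-monitoring-agent_config_downloader.service 2>/dev/null |
     grep -q "Reloading unified-monitoring-agent"; then
    echo "reload message: PASS"
  else
    echo "reload message: FAIL"
  fi
}
```

Run check_config_reload with super user privileges. A FAIL line indicates which step to investigate first.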

Problem: You are testing your parsing pattern and need to force the agent to download the configuration right away.

Solution: Run the following command:

systemctl restart unified-monitoring-agent_config_downloader

Note

Automatic update of the configuration on the agent side can take up to 30 minutes.

Data Collection

If you want to open a ticket so an engineer can help you with your problem regarding the Unified Monitoring Agent, include the output of the following commands. Super user privileges might be required for some of them.

yum info unified-monitoring-agent
rpm -ql unified-monitoring-agent | xargs sha512sum
systemctl status --full unified-monitoring-agent.service
systemctl status --full unified-monitoring-agent_config_downloader.service
systemctl status --full unified-monitoring-agent_config_downloader.timer
systemctl status --full unified-monitoring-agent_restarter.path
journalctl -a --no-pager -u unified-monitoring-agent.service
journalctl -a --no-pager -u unified-monitoring-agent_config_downloader.service
journalctl -a --no-pager -u unified-monitoring-agent_config_downloader.timer
journalctl -a --no-pager -u unified-monitoring-agent_restarter.path

Also include an archive of the files under /var/log/unified-monitoring-agent/ and /var/log/oracle-cloud-agent/. You can create a gzipped tar archive of these directories with the command:

tar cvzf agent_logs_$(date +%s).tar.gz /var/log/unified-monitoring-agent/ /var/log/oracle-cloud-agent/
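To avoid running each command by hand, the command outputs listed above can be gathered into one directory. The following is a minimal sketch: the function name collect_agent_diagnostics and the output file names are hypothetical, and the commands themselves are the ones from the preceding list.

```shell
# Hypothetical helper: saves the output of each data collection command to a
# separate file under the directory given as $1, then prints that directory.
collect_agent_diagnostics() {
  out_dir=${1:-agent_diag_$(date +%s)}
  mkdir -p "$out_dir" || return 1

  yum info unified-monitoring-agent > "$out_dir/yum_info.txt" 2>&1
  rpm -ql unified-monitoring-agent | xargs sha512sum > "$out_dir/sha512sums.txt" 2>&1

  for unit in \
    unified-monitoring-agent.service \
    unified-monitoring-agent_config_downloader.service \
    unified-monitoring-agent_config_downloader.timer \
    unified-monitoring-agent_restarter.path
  do
    systemctl status --full "$unit" > "$out_dir/status_$unit.txt" 2>&1
    journalctl -a --no-pager -u "$unit" > "$out_dir/journal_$unit.txt" 2>&1
  done

  echo "$out_dir"
}
```

Run the function with super user privileges and attach the resulting directory to the ticket together with the log archive.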

If the Unified Monitoring Agent is running but behaves erratically, you can also include backtrace and memory profile information. Run the following command and include the resulting /tmp/sigdump-<integer>.log files in your report (<integer> typically has 1–6 digits, although in rare cases it can have more).

ps aux | grep unified-monitoring-agen[t] | grep ruby | awk '{print $2}' | xargs kill -SIGCONT

This command finds the PIDs of the Unified Monitoring Agent processes and sends each of them the SIGCONT signal, which causes a dump to be generated in /tmp/sigdump-<integer>.log.
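After sending the signal, you can locate the dump files that were just written. The sketch below uses a hypothetical helper, recent_sigdumps, and assumes a find implementation that supports -maxdepth and -mmin (as GNU find does).

```shell
# Hypothetical helper: lists sigdump files in directory $1 (default /tmp)
# that were modified within the last $2 minutes (default 5).
recent_sigdumps() {
  dir=${1:-/tmp}
  minutes=${2:-5}
  find "$dir" -maxdepth 1 -type f -name 'sigdump-*.log' -mmin "-$minutes"
}

# Possible usage:
#   recent_sigdumps /tmp 10
```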

Uninstall and Reinstall

You can remove the Unified Monitoring Agent, without removing the agent's configuration, by running the following command:

yum -y remove unified-monitoring-agent

The agent's configuration remains under the /etc/unified-monitoring-agent/ directory. If you do not want to keep the configuration for a future (re)installation of the Unified Monitoring Agent package, you need to remove it manually:

# use the following command to print the contents of the agent's configuration directory
find /etc/unified-monitoring-agent/
# use the following command to remove the directory and all of its contents (this step cannot be undone)
rm -rf /etc/unified-monitoring-agent/
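Because the removal cannot be undone, you might want to archive the configuration directory first. The following is a minimal sketch: the function name backup_agent_config and the default archive name are hypothetical.

```shell
# Hypothetical helper: creates a gzipped tar archive of the configuration
# directory $1 (default /etc/unified-monitoring-agent) at path $2, then
# prints the archive path. Fails if the directory does not exist.
backup_agent_config() {
  src=${1:-/etc/unified-monitoring-agent}
  dest=${2:-agent_config_$(date +%s).tar.gz}
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  # Archive relative to the parent directory so the archive holds a single
  # top-level folder.
  tar czf "$dest" -C "$(dirname "$src")" "$(basename "$src")" && echo "$dest"
}

# Possible usage:
#   backup_agent_config && rm -rf /etc/unified-monitoring-agent/
```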

The agent is automatically reinstalled by the Oracle Cloud Agent within 25 minutes at most. You need to have monitoring enabled for your instance in the Console for this to occur. See Managing Plugins with Oracle Cloud Agent for more information.

Windows

Unified Monitoring Agent Troubleshooting Steps

Check the service status:

  1. The agent runs as part of a Windows service. To see its status, open the Start menu, type Services.msc, and open it. Go to the service Oracle Cloud Unified Monitoring Service to see the status.
  2. Right-click the service and select Properties for more information. You can start, stop, or restart the service from here.
  3. From the Start menu type cmd, right-click on Command Prompt and select Run as Administrator. Run the following commands:
  • To view Unified Monitoring Agent service status:
    sc query unified-monitoring-agent
  • To restart the Unified Monitoring Agent service:
    sc stop unified-monitoring-agent
    sc start unified-monitoring-agent

Note

The preceding commands do not work in PowerShell, so you must instead use the Windows Command Prompt.

Find Windows Service errors:
  1. From the Start menu, type Event Viewer and select it.
  2. Open Windows Logs, then System. Every time a service starts or stops, fails to do either, or crashes suddenly, it is recorded here.
    Note

    On most Windows machines, there is a cap on how many events can be in the event viewer. As a result, if an event happened a long time ago, the logs might not be available.

Fluentd logs:
  1. Open explorer.exe (file icon on the taskbar).
  2. Go to C:\oracle_unified_agent.
  3. If there is only one file, it means that there isn’t a valid configuration file on the machine.
  4. If there are two files, then there is a supervisor log that will have all the setup/start-up logs, and a worker log with all the parsing/output logs. unified-monitoring-agent.conf is the name of the configuration file if it has been downloaded properly.
  5. Run Fluentd manually. Try the preceding steps to identify the issue, but if needed, you can debug an issue by manually running Fluentd.
    Note

    Running Fluentd manually runs it in place of the Windows service, which stops the service from running normally. This behavior is different from Linux.
  6. Use the following command to run Fluentd manually. This can be run in PowerShell or Command Prompt, but it needs to be run as Administrator:
    C:\oracle_unified_agent\unified-monitoring-agent\embedded\bin\fluentd -c C:\oracle_unified_agent\unified-monitoring-agent.conf -vv

Automatic Configuration Updater Troubleshooting Steps

  1. Verify that Task Scheduler is running as expected.
  2. From the Start menu, type Task Scheduler and open it.
  3. Go to Task Scheduler (Local), then Task Scheduler Library. Find the task named UnifiedAgentConfigUpdater.
  4. Verify the Last Run Time. If it shows an invalid date, or says that the task has not run, then the Next Run Time indicates when the task runs for the first time. For debugging, select the task and select Run if you need it to run immediately.
  5. Last Run Result specifies the outcome of downloading the configuration from the control plane. If there is an error result, you need to run it manually to determine what happened. Task Scheduler does not keep output logs.
  6. Run the configuration updater manually.
    Note

    Run the updater in PowerShell as an Administrator for the best experience.

    C:\oracle_unified_agent\unified-monitoring-agent\embedded\bin\ruby.exe C:\oracle_unified_agent\unified-monitoring-agent\embedded\lib\ruby\gems\2.6.0\gems\fluent-public-config-updater*\lib\fluent_config_updater.rb -c C:\oracle_unified_agent -b 10

Oracle Cloud Agent Troubleshooting Steps

Check the Oracle Cloud Agent logs. For Windows Server 2012 R2 or 2016, the log file locations are:

  • C:\Users\OCA\AppData\Local\OracleCloudAgent\agent.log
  • C:\Users\OCAUM\AppData\Local\OracleCloudAgent\plugins\unifiedmonitoring\unifiedmonitoring.log (runtime logs)
  • C:\Users\OCAUM\AppData\Local\OracleCloudAgent\plugins\unifiedmonitoring\unifiedmonitoring_msi.log (install logs)
  • C:\oracle_unified_agent\unified-monitoring-agent-0.log (agent worker log, which might not exist depending on state)
  • C:\oracle_unified_agent\unified-monitoring-agent-supervisor-0.log (agent supervisor log, which might not exist depending on state)

Windows Server 2019 log file locations:

  • C:\Windows\ServiceProfiles\OCA\AppData\Local\OracleCloudAgent\agent.log
  • C:\Windows\ServiceProfiles\OCAUM\AppData\Local\OracleCloudAgent\plugins\unifiedmonitoring\unifiedmonitoring.log (runtime logs)
  • C:\Windows\ServiceProfiles\OCAUM\AppData\Local\OracleCloudAgent\plugins\unifiedmonitoring\unifiedmonitoring_msi.log (install logs)
  • C:\oracle_unified_agent\unified-monitoring-agent-0.log (agent worker log, which might not exist depending on state)
  • C:\oracle_unified_agent\unified-monitoring-agent-supervisor-0.log (agent supervisor log, which might not exist depending on state)

Intermittent Failed MSI Install

An intermittent failed MSI install can occur for one of two reasons:

  1. An MSI installation was interrupted (system reboot, process stop, and so on), and on the second run, the msiexec.exe process is still holding a file handle to a folder that it created.
  2. During an upgrade, the MSI fails to get access to the main agent folder because ruby.exe does not exit as it should (a Fluentd issue). This causes the MSI to fail and to clean up the system, removing much of the agent (although not the position or buffer files).

In both instances, installing a second time, or letting Oracle Cloud Agent run through the install a second time, resolves the issue. If the installation is still stuck in this state, do the following:

  1. Stop all msiexec and ruby processes on the Details tab of Task Manager.
  2. Rename C:\oracle_unified_agent to C:\oracle_unified_agent_old.
  3. Install the agent again, or wait for Oracle Cloud Agent to install it.