Troubleshoot Management Agents Service

This section covers typical issues and resolutions related to the Management Agents service, such as installing and deinstalling Management Agents and Management Gateways.

Troubleshoot Management Agents Installation and Configuration Issues

Users may encounter various errors during the Oracle Management Agent installation and configuration process. Causes and recommended actions for some common errors are listed below.

Please uninstall the agent and remove the service file before installing the new agent!

Cause: An agent is already installed on your host, or a previous deinstallation did not successfully remove the agent service file.

Action:
  • Run rpm -e oracle.mgmt_agent to uninstall the agent. If the command succeeds, try installing the new agent. If it fails, try the next recommended action.
  • Run ls /opt/oracle/mgmt_agent to check for residual files from the previous agent installation. If the directory exists, delete it by running: rm -rf /opt/oracle/mgmt_agent.
  • Check whether an agent service file already exists at the following location, depending on your Linux version:
    • For OL7 (if you are using systemd): /etc/systemd/system/mgmt_agent.service
    • For OL6 (if you are using init): /etc/init/mgmt_agent.conf

      If the service file exists, remove it (on OL7, run rm -rf /etc/systemd/system/mgmt_agent.service; on OL6, run rm -rf /etc/init/mgmt_agent.conf), and then retry installing the new agent.
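The checks in the steps above can be scripted. The following is a minimal Python sketch, assuming the default paths listed in this section; it only reports leftovers and deletes nothing. The `find_residuals` helper and its `root` parameter are hypothetical additions so the check can be pointed at a test directory (use "/" on a real host).

```python
import os
from typing import List

# Leftovers named in the troubleshooting steps above.
RESIDUAL_PATHS = [
    "/opt/oracle/mgmt_agent",                  # previous agent installation
    "/etc/systemd/system/mgmt_agent.service",  # OL7 (systemd) service file
    "/etc/init/mgmt_agent.conf",               # OL6 (init) service file
]


def find_residuals(root: str = "/") -> List[str]:
    """Return the residual agent paths that exist under `root` (hypothetical helper)."""
    found = []
    for path in RESIDUAL_PATHS:
        candidate = os.path.join(root, path.lstrip("/"))
        if os.path.exists(candidate):
            found.append(path)
    return found


if __name__ == "__main__":
    leftovers = find_residuals()
    if leftovers:
        print("Remove these before reinstalling:")
        for path in leftovers:
            print("  " + path)
    else:
        print("No residual agent files found.")
```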

"Java is not a 64-bit JVM! Please set path of a 64-bit JVM in the environment variable JAVA_HOME" or "Java not found please set your preferred path in JAVA_HOME"

Cause: The JAVA_HOME environment variable is not set, or it does not point to a 64-bit JDK location.

Action: Set the JAVA_HOME environment variable to the location of a 64-bit JDK, and then retry installing the agent. Currently, only 64-bit JDKs are supported.
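Before retrying, you can confirm that JAVA_HOME points to a 64-bit JVM. This minimal Python sketch assumes a Linux layout ($JAVA_HOME/bin/java) and a HotSpot-based JDK that reports the `sun.arch.data.model` property through `-XshowSettings:properties`; the helper names are hypothetical.

```python
import os
import re
import subprocess
from typing import Optional


def data_model_from_settings(output: str) -> Optional[int]:
    """Extract sun.arch.data.model (32 or 64) from -XshowSettings output."""
    match = re.search(r"sun\.arch\.data\.model\s*=\s*(\d+)", output)
    return int(match.group(1)) if match else None


def check_java_home() -> str:
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    java = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java):
        return "No java executable found at " + java
    # The JVM prints the property listing to stderr; -version then exits.
    result = subprocess.run(
        [java, "-XshowSettings:properties", "-version"],
        capture_output=True, text=True,
    )
    model = data_model_from_settings(result.stderr + result.stdout)
    if model == 64:
        return "OK: 64-bit JVM at " + java
    return "Not a 64-bit JVM (sun.arch.data.model=%s)" % model


if __name__ == "__main__":
    print(check_java_home())
```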

Agent Installation failed with message: useradd: Can't get unique GID (no more available GIDs)

Cause: The installation script cannot add the user and group needed during the Management Agent installation process because all group IDs in the available range on your Linux system are already in use.

Executing install
    Unpacking software zip
    Copying files to destination dir (/opt/oracle/mgmt_agent) 
useradd: Can't get unique GID (no more available GIDs) 
useradd: can't create group 
Agent installation failed, please check log file

Action: Consult with the system administrator before proceeding with the following:

  1. Edit the /etc/login.defs file. You need sudo privileges to edit the file.

    Look for the following entries:
    SYS_GID_MIN               nnnn
    SYS_GID_MAX               mmmm
    SYS_UID_MIN               pppp
    SYS_UID_MAX               qqqq
    where nnnn and pppp are the minimum values and mmmm and qqqq are the maximum values.

    If the above entries don't exist in the file, add them.

  2. Update the value of SYS_GID_MAX entry based on the system administrator's recommendation, and save the file.

  3. Remove the failed agent installation by running: sudo rpm -e oracle.mgmt_agent.

  4. Log out of the shell, and then log back in.

  5. Retry the agent installation.
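To see how many system GIDs remain free in the configured range before and after changing SYS_GID_MAX, you can use a minimal Python sketch like the following. It assumes the KEY VALUE layout of /etc/login.defs shown above and reads only the local /etc/group; if your groups also come from a directory service, pass the output of getent group (same format) instead. The function names are hypothetical.

```python
from typing import Dict, Optional, Set


def parse_login_defs(text: str) -> Dict[str, str]:
    """Parse KEY VALUE lines from /etc/login.defs, skipping comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            values[parts[0]] = parts[1]
    return values


def used_gids(group_text: str) -> Set[int]:
    """Collect GIDs from text in /etc/group format (name:passwd:gid:members)."""
    gids = set()
    for line in group_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            gids.add(int(fields[2]))
    return gids


def free_system_gids(login_defs_text: str, group_text: str) -> Optional[int]:
    """Count unused GIDs in [SYS_GID_MIN, SYS_GID_MAX]; None if either is unset."""
    defs = parse_login_defs(login_defs_text)
    if "SYS_GID_MIN" not in defs or "SYS_GID_MAX" not in defs:
        return None
    low, high = int(defs["SYS_GID_MIN"]), int(defs["SYS_GID_MAX"])
    taken = used_gids(group_text)
    return sum(1 for gid in range(low, high + 1) if gid not in taken)


if __name__ == "__main__":
    try:
        with open("/etc/login.defs") as f:
            defs_text = f.read()
        with open("/etc/group") as f:
            group_text = f.read()
    except OSError as exc:
        print("Could not read system files:", exc)
    else:
        free = free_system_gids(defs_text, group_text)
        if free is None:
            print("SYS_GID_MIN/SYS_GID_MAX are not set in /etc/login.defs")
        else:
            print("Free system GIDs: %d" % free)
```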

useradd: cannot create directory /usr/share/mgmt_agent

During the Management Agent installation, the mgmt_agent user is created with the default home directory location under /usr/share/mgmt_agent.

Cause: The file permissions under /usr/share are not sufficient, or the file system is read-only.

Possible Actions:

  • Set file permissions to give the mgmt_agent user access to the default user home directory location: /usr/share.

  • To use a different location, set the USER_HOME_DIR_ROOT environment variable to the path you prefer to use as the home directory for the mgmt_agent user, and ensure that the management agent user has the correct file permissions on that directory.
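Both causes above can be checked before installing. This minimal Python sketch (Linux only, because it relies on os.statvfs) reports whether the home directory root exists, is on a read-only file system, or is not writable. It reads USER_HOME_DIR_ROOT and falls back to /usr/share; `check_home_root` is a hypothetical helper, and os.access reflects the user running the check, so run it as the same user that runs the installer.

```python
import os
from typing import List


def check_home_root(path: str) -> List[str]:
    """Return problems that would stop a home directory being created under `path`."""
    if not os.path.isdir(path):
        return [path + " does not exist or is not a directory"]
    problems = []
    # ST_RDONLY is set in f_flag when the file system is mounted read-only.
    if os.statvfs(path).f_flag & os.ST_RDONLY:
        problems.append(path + " is on a read-only file system")
    # Creating a subdirectory needs write and search (execute) permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(path + " is not writable by the current user")
    return problems


if __name__ == "__main__":
    root = os.environ.get("USER_HOME_DIR_ROOT", "/usr/share")
    issues = check_home_root(root)
    print("\n".join(issues) if issues else root + " looks usable")
```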

Windows: The system cannot find the path specified. Agent install failed.

ERRORLEVEL=9009

Possible Cause: Environment variables have not been set properly due to spaces in the directory/folder name.

Windows allows spaces in directory (folder) names, which causes an issue with the Management Agent installation because Windows adds quotes to such names automatically. For example, for a folder named Program Files, Windows inserts quotes because of the space in the name, so it becomes "Program Files".

The extra quotes cause the installation to fail because the Management Agent installer does not allow quotes in environment variables such as JAVA_HOME and AGENT_INSTALL_BASEDIR.

Note

The Management Agent installer does not accept the following special characters in the path: [, ^^, ", ', &, or ].
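You can screen an environment variable value against the characters in the note before running the installer. In this minimal Python sketch, `path_problems` is a hypothetical helper; it assumes that "^^" in the note stands for the caret character (^), and it also flags spaces because they can lead Windows to add quotes.

```python
import os
from typing import List

# Characters the note above says the installer rejects in a path.
# Assumption: "^^" in the note is read as the caret character (^).
FORBIDDEN = set("[^\"'&]")


def path_problems(value: str) -> List[str]:
    """List reasons why an environment variable value may break the installer."""
    problems = []
    bad = sorted(set(value) & FORBIDDEN)
    if bad:
        problems.append("contains forbidden characters: " + " ".join(bad))
    if " " in value:
        problems.append("contains spaces, which can lead Windows to add quotes")
    return problems


if __name__ == "__main__":
    for name in ("JAVA_HOME", "AGENT_INSTALL_BASEDIR"):
        value = os.environ.get(name)
        if value is not None:
            print(name, path_problems(value) or "OK")
```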

Action:

The recommended way to set up environment variables in Windows is by using the Advanced System Settings.
  • On the Windows taskbar, right-click the Windows icon and select System.
  • In the Settings window, under Related Settings, click Advanced system settings.

  • On the Advanced tab, click Environment Variables.

  • Click New to create a new environment variable. Click Edit to modify an existing environment variable.
  • After creating or modifying the environment variable, click Apply and then OK to have the change take effect.
    Note

    The graphical user interface for creating environment variables may vary slightly, depending on your version of Windows.

Management Agent status is "Not Available" in Console after the initial installation

Possible Cause: Incorrect system timestamp

Action: Verify the system time on the agent's host, and correct it if needed.
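One way to spot a wrong system clock is to compare it with the Date header returned by a web server the host can reach. This minimal Python sketch is an assumption-based check, not an agent feature: the endpoint is whatever HTTPS URL you pass on the command line, the header has one-second resolution, and network delay adds a little more, so treat small differences as noise.

```python
import email.utils
import sys
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def clock_skew_seconds(http_date: str, now: Optional[datetime] = None) -> float:
    """Local time minus server time, from an HTTP Date header (always in GMT)."""
    server_time = email.utils.parsedate_to_datetime(http_date)
    local_time = now or datetime.now(timezone.utc)
    return (local_time - server_time).total_seconds()


def skew_against(url: str) -> float:
    """Send a HEAD request to `url` and compare its Date header with the local clock."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return clock_skew_seconds(response.headers["Date"])


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print("Clock skew: %.1f s" % skew_against(sys.argv[1]))
    else:
        print("usage: clock_skew.py https://<reachable HTTPS endpoint>")
```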

After configuration, the Management Agent is not visible in console or through the API

Possible Cause: If the Management Agent or the Management Gateway agent does not appear in the Oracle Cloud Console or through the API after you configure it, the required policies may not be set up for the user or the user group.

Action: Verify that the user or the user group has the required policies configured for the Management Agent or Management Gateway agent. To set up policies, see Create policies for user group.

Prometheus or Kubernetes metrics monitored using Management Agent are not available

Possible Causes: The Management Agent does not require a dynamic group or policies for its own metrics, but it does require them for Prometheus and Kubernetes metrics. You must define a dynamic group and a policy that allows the agents in that dynamic group to post metrics to OCI Monitoring. If the metrics do not show up in the compartment or in the OCI Monitoring namespace, check the policies and the dynamic group for the following issues:
  • (a) Missing policies

    Action: Verify that the policies are added for the Management Agent as described in the setup instructions. For details, see Set Up Oracle Cloud Infrastructure for Management Agent Service.

    If the policies are missing, add them as described in Set Up Oracle Cloud Infrastructure for Management Agent Service.

  • (b) Typos in policies

    Action: Review the policy syntax for errors by comparing it against the policy samples. For details, see Set Up Oracle Cloud Infrastructure for Management Agent Service.

    For example, ensure that the dynamic group matching rule follows the syntax below, with straight single quote characters (') around the compartment OCID and the managementagent resource type:

    ALL {resource.type='managementagent', resource.compartment.id='ocid1.compartment.oc1.examplecompartmentid'}
  • (c) Incorrect compartment id in Dynamic Group definition

    Action: Verify that the install key compartment id is the same as the compartment id specified in the agent's dynamic group definition. By default the agent is created in the install key's compartment.
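Checks (b) and (c) can be run against the text of a matching rule. This minimal Python sketch handles only the simple rule form shown above; `check_matching_rule` is a hypothetical helper, and you supply the install key's compartment OCID yourself.

```python
import re
from typing import List

# Curly quotes are often pasted in from documents instead of straight quotes.
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"


def check_matching_rule(rule: str, install_key_compartment: str) -> List[str]:
    """Flag common mistakes in a Management Agent dynamic group matching rule."""
    problems = []
    if any(ch in rule for ch in CURLY_QUOTES):
        problems.append("rule contains curly quotes; use straight single quotes (')")
    if not re.search(r"resource\.type\s*=\s*'managementagent'", rule):
        problems.append("missing resource.type='managementagent'")
    match = re.search(r"resource\.compartment\.id\s*=\s*'([^']+)'", rule)
    if not match:
        problems.append("missing resource.compartment.id='<compartment OCID>'")
    elif match.group(1) != install_key_compartment:
        problems.append("compartment OCID differs from the install key compartment")
    return problems
```

For example, check_matching_rule(rule, "ocid1.compartment.oc1.examplecompartmentid") returns an empty list for the rule shown in (b).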

Agent runs into OutOfMemoryError

Possible Cause: The agent might run out of heap memory if it is not tuned properly to support the load that has been assigned to it.

Action: Update the heap memory settings for the Management Agent.

The default maximum heap for the agent is:
  • 128 MB for the Management Agent deployed as an OCA plugin.
  • 512 MB for a standalone Management Agent (the agent downloaded from the Management Agent console).
To assign more heap to the agent, do the following:
  • Open the file agent_inst/config/java.options.
  • Update the heap setting by modifying the following line: -Xmx512m

    This line sets the maximum heap for the agent to 512 MB. For example, to change the maximum heap to 800 MB, update the line to: -Xmx800m

  • Save the file and restart the agent for the changes to take effect.
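The edit above can be scripted. This minimal Python sketch assumes a standalone agent in the default /opt/oracle/mgmt_agent directory (for the OCA plugin, agent_inst is under /var/lib/oracle-cloud-agent/plugins/oci-managementagent/polaris/); `set_max_heap` is a hypothetical helper that refuses to guess when no -Xmx line exists and writes a backup before changing the file. Restart the agent afterward.

```python
import os
import re

# Matches a maximum-heap flag such as -Xmx512m, -Xmx1g or -Xmx800M.
XMX = re.compile(r"-Xmx\d+[kKmMgG]?")


def set_max_heap(options_text: str, megabytes: int) -> str:
    """Return java.options content with every -Xmx flag set to `megabytes` MB."""
    if not XMX.search(options_text):
        raise ValueError("no -Xmx setting found; edit the file by hand")
    return XMX.sub("-Xmx%dm" % megabytes, options_text)


if __name__ == "__main__":
    # Assumed default path for a standalone agent.
    path = "/opt/oracle/mgmt_agent/agent_inst/config/java.options"
    if os.path.exists(path):
        with open(path) as f:
            original = f.read()
        with open(path + ".bak", "w") as f:
            f.write(original)  # keep a backup of the untouched file
        with open(path, "w") as f:
            f.write(set_max_heap(original, 800))
        print("Updated %s; restart the agent to apply the change." % path)
    else:
        print("File not found:", path)
```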

IP address being displayed in host column when Management Agent installed on Windows host

Problem: The Management Agent is installed on a Windows host, and the Management Agent console displays the IP address of the Windows host instead of its fully qualified domain name (FQDN).

Action:
  • Log in to the Windows host and open the Control Panel.

  • Click System and Security and then click System.

  • Look for the Computer name, domain, and workgroup settings section and then click Change settings located on the right-hand side of this section.

    The System Properties pop-up window is displayed.

  • By default, the Computer Name tab is selected. If it's not selected, then click Computer Name.

  • Look for the following message: To rename this computer or its domain or workgroup click Change.

  • Click Change.

    A pop-up window named Computer Name/Domain Changes is displayed.

  • For example, if the FQDN of the Windows host is FOOBAR004.subnet1ab2regsu.dummytenantreg1.abcvcn.com, enter the short hostname of the Windows host, FOOBAR004, in the Computer Name text box.

  • Click More. Another pop-up window named DNS suffix and NetBIOS Computer Name is displayed.

    In the text box named Primary DNS suffix of this computer, provide the DNS name of the Windows host.

    For example: subnet1ab2regsu.dummytenantreg1.abcvcn.com

  • Click OK or Apply and close all the pop-up windows.

  • Restart the Windows host.

  • Uninstall the existing Management Agent by running the uninstaller.bat script from a Windows terminal.

  • Reinstall the Management Agent on the Windows host.

The Management Agent installation should succeed, and the Agent UI page should display the FQDN of the Windows host in the Host column.
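To work out the values for the Computer Name and Primary DNS suffix fields, you can split the FQDN at the first dot. This minimal Python sketch also prints what name resolution currently reports through socket.getfqdn(); treating that as what the agent sees is an assumption, since the agent may determine the hostname differently. `split_fqdn` is a hypothetical helper.

```python
import socket
from typing import Tuple


def split_fqdn(fqdn: str) -> Tuple[str, str]:
    """Split an FQDN into (short hostname, DNS suffix)."""
    short_name, _, suffix = fqdn.strip().rstrip(".").partition(".")
    return short_name, suffix


if __name__ == "__main__":
    reported = socket.getfqdn()
    name, suffix = split_fqdn(reported)
    print("Name resolution reports:", reported)
    print("Computer Name:", name)
    print("Primary DNS suffix:", suffix or "(none; set it in the DNS suffix window)")
```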

Management Agent installation fails on SELinux when using external volume

After the installation, the agent service fails to start, which leaves the agent in a non-working state, and the following messages are displayed:
systemctl start mgmt_agent
Job for mgmt_agent.service failed because the control process exited with error code.
See "systemctl status mgmt_agent.service" and "journalctl -xeu mgmt_agent.service" for details.
To confirm, check the service manager logs for error details.
journalctl -xeu mgmt_agent.service
...
Dec 08 15:48:19 ol9-arm systemd[1261408]: mgmt_agent.service: Failed to execute /dir1/oracle/managementagent/agent_inst/bin/agentcore: Permission denied
Dec 08 15:48:19 ol9-arm systemd[1261408]: mgmt_agent.service: Failed at step EXEC spawning /dir1/oracle/managementagent/agent_inst/bin/agentcore: Permission denied
Also, check the audit logs.
$ ausearch -ts recent -m avc -i
...
type=AVC msg=audit(12/08/2023 15:49:26.991:51338) : avc:  denied  { read open } for  pid=1261576 comm=(gentcore) path=/dir1/oracle/managementagent/agent_inst/bin/agentcore dev="dm-0" ino=915154 scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:object_r:default_t:s0 tclass=file permissive=0

These messages indicate that SELinux does not allow the agent to execute files in the chosen installation folder.

Action: Contact your system administrator to create the SELinux policies required to install and run the Management Agent in that folder.
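When you contact your administrator, the useful fields of each AVC record are the denied permissions, the path, the object class, and the source and target contexts (in the example above, the agentcore binary carries the default_t type). This minimal Python sketch extracts them from saved ausearch output; `parse_avc` is a hypothetical helper, and it assumes the record layout shown above.

```python
import re
import sys
from typing import Dict


def parse_avc(line: str) -> Dict[str, object]:
    """Pull the fields an administrator needs from one AVC denial record."""
    perms = re.search(r"denied\s+\{\s*([^}]*?)\s*\}", line)
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    target = fields.get("tcontext", "")
    return {
        "denied": perms.group(1).split() if perms else [],
        "path": fields.get("path"),
        "tclass": fields.get("tclass"),
        "source_context": fields.get("scontext"),
        # SELinux contexts are user:role:type:level; the type is the third part.
        "target_type": target.split(":")[2] if target.count(":") >= 2 else None,
    }


if __name__ == "__main__":
    # Usage: ausearch -ts recent -m avc -i > avc.txt, then pass avc.txt as an argument.
    for log_path in sys.argv[1:]:
        with open(log_path) as f:
            for raw in f:
                if "type=AVC" in raw:
                    print(parse_avc(raw))
```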

In a RHEL 9.x environment, the Management Agent or the Management Gateway Agent installation fails with the following message: "mgmt_gateway service creation failed. Reason: Detected Linux:" or "mgmt_agent service creation failed. Reason: Detected Linux:".

Cause: Red Hat removed the chkconfig package in the Red Hat Enterprise Linux (RHEL) 9 distribution; for more details, see the Red Hat Knowledgebase. As a result, the installation may fail, and the install log messages indicate that the setup attempted to use an incorrect service manager for the agent.

The problem occurs when the OsFamily is not identified by the rules in the agentcore script, so the installer attempts to set up the agent service using init.d instead of systemctl on RHEL 9.x. The following log shows an example of the error; the output in your environment may differ slightly.
$ rpm -ivh oracle.mgmt_gateway.231118.1208.1702955171.Linux-x86_64.rpm
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Checking pre-requisites
Checking if any previous gateway service exists
Checking if OS has systemd or initd
Checking available disk space for gateway install
Checking if /opt/oracle/mgmt_agent directory exists
Checking if 'mgmt_agent' user exists
'mgmt_agent' user already exists, the gateway will proceed installation without creating a new one.
Checking Java version
Trying /omc/java/jdk1.8.0_391
Java version: 1.8.0_391 found at /omc/java/jdk1.8.0_391/bin/java
Checking agent version
Updating / installing...
1:oracle.mgmt_gateway-231118.1208.1################################# [100%]

Executing install
Unpacking software zip
Copying files to destination dir (/opt/oracle/mgmt_agent)
Initializing software from template
Checking if JavaScript engine is available to use
Creating 'mgmt_gateway' daemon
mgmt_gateway service creation failed. Reason: Detected Linux:
Installing the mgmt_gateway daemon...
ln: failed to create symbolic link '/etc/init.d/mgmt_gateway': No such file or directory
ln: failed to create symbolic link '/etc/rc3.d/K20mgmt_gateway': No such file or directory
ln: failed to create symbolic link '/etc/rc3.d/S20mgmt_gateway': No such file or directory
ln: failed to create symbolic link '/etc/rc5.d/S20mgmt_gateway': No such file or directory
ln: failed to create symbolic link '/etc/rc5.d/K20mgmt_gateway': No such file or directory
Service not installed.
warning: %post(oracle.mgmt_gateway-231118.1208.1702955171-1.x86_64) scriptlet failed, exit status 1
Action:
  1. Confirm the environment uses Red Hat Enterprise Linux 9.x by running the following command:
    $ cat /etc/redhat-release
    Red Hat Enterprise Linux release 9.3 (Plow)
  2. Verify that the chkconfig package is missing, as described in the related article on the Red Hat Knowledgebase.

Solution

  1. Install the missing package using the following command:
    $ dnf install chkconfig
  2. Verify that the package is now installed:
    $ rpm -qa | grep chkconfig
  3. Try installing the Management Agent or Management Gateway Agent again.
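The diagnosis and the check above can be combined in a small pre-install script. This minimal Python sketch reads /etc/redhat-release and relies on rpm -q exiting with status 0 only when the package is installed; `rhel_major_version` is a hypothetical helper.

```python
import re
import subprocess
from typing import Optional


def rhel_major_version(release_text: str) -> Optional[int]:
    """Return the major version from /etc/redhat-release text, or None."""
    match = re.search(r"Red Hat Enterprise Linux.*release (\d+)", release_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    try:
        with open("/etc/redhat-release") as f:
            major = rhel_major_version(f.read())
        # rpm -q exits with status 0 only when the package is installed.
        result = subprocess.run(["rpm", "-q", "chkconfig"], capture_output=True)
        installed = result.returncode == 0
    except OSError as exc:
        print("Not an RPM-based RHEL host, or the check failed:", exc)
    else:
        if major == 9 and not installed:
            print("RHEL 9 without chkconfig: run 'dnf install chkconfig' first")
        else:
            print("RHEL major version: %s, chkconfig installed: %s" % (major, installed))
```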

Troubleshoot Management Agents Deinstallation Issues

This topic covers the typical issues and their resolutions related to deinstalling Oracle Management Agents.

Error:… specifies multiple packages

Cause: The rpm registry has multiple packages with that name.

Action: Use the --allmatches flag when running the rpm -e command:
rpm -e oracle.mgmt_agent --allmatches

Error:… scriptlet failed with exit code

Cause: The rpm could not stop the running agent or failed to remove the agent service file from the system.

Action: To resolve the error, remove the agent manually:
  • Check if your agent is running:

    For OL7: systemctl status mgmt_agent

    For OL6: /sbin/initctl status mgmt_agent

    If you see the agent is running, stop it:

    For OL7: systemctl stop mgmt_agent

    For OL6: /sbin/initctl stop mgmt_agent

  • Remove rpm by executing rpm -e oracle.mgmt_agent --noscripts. This command will skip all rpm scripts and try to remove the package from its registry.
  • Remove all the agent files by executing rm -rf /opt/oracle/mgmt_agent. Also run the following:

    For OL7: rm -rf /etc/systemd/system/mgmt_agent.service

    For OL6: rm -rf /etc/init/mgmt_agent.conf
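To decide between the OL7 and OL6 commands above, you can detect the service manager and print the matching command list for review. This minimal Python sketch is a dry run and executes nothing; it assumes systemd when /run/systemd/system is a directory (the usual sd_booted() test) and Upstart, the OL6 init system, when /sbin/initctl exists. The helper names are hypothetical.

```python
import os
from typing import List


def init_system(root: str = "/") -> str:
    """Guess the service manager: systemd (OL7), upstart (OL6), or unknown."""
    if os.path.isdir(os.path.join(root, "run/systemd/system")):
        return "systemd"
    if os.path.exists(os.path.join(root, "sbin/initctl")):
        return "upstart"
    return "unknown"


def cleanup_commands(init: str) -> List[str]:
    """Manual removal commands from the steps above for the given init system."""
    if init == "systemd":
        stop = "systemctl stop mgmt_agent"
        service_file = "/etc/systemd/system/mgmt_agent.service"
    elif init == "upstart":
        stop = "/sbin/initctl stop mgmt_agent"
        service_file = "/etc/init/mgmt_agent.conf"
    else:
        raise ValueError("unknown init system: " + init)
    return [
        stop,
        "rpm -e oracle.mgmt_agent --noscripts",
        "rm -rf /opt/oracle/mgmt_agent",
        "rm -rf " + service_file,
    ]


if __name__ == "__main__":
    detected = init_system()
    if detected == "unknown":
        print("Could not detect systemd or Upstart; follow the steps manually.")
    else:
        # Dry run: print the commands for review instead of running them.
        print("\n".join(cleanup_commands(detected)))
```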

Troubleshoot Management Agents on Compute Instances

Users may encounter various errors during the deployment of Oracle Management Agent on compute instances. Causes and recommended actions for some common errors are listed below.

Agent is in Not Available state and agent log file reports "Invalid tags"

The Management Agents page shows the Agent in 'Not available' state and the mgmt_agent.log file (located under <Agent_Inst>/logs directory) reports the following message:

ErrorBody:{"code" : "InvalidParameter","message" : "Invalid tags: Resource creation failed because the resource requires tag value(s). Add a value to the each of the following tag definition(s): \nGLOBAL.ComponentType, GLOBAL.ApplicationName,

Cause:

This issue occurs when the compartment requires defined tags on every resource but the resource creation request does not include them. The activation request then fails with the message "Invalid tags: Resource creation failed because the resource requires tag value(s)", and the agent status is shown as 'Not Available'.

Action:

  • Management Agents

    If you have a standalone Management Agent that was installed using an RPM or a ZIP file, uninstall it and then reinstall it with a response file that provides the DefinedTags parameter, as described in the Review Agent Parameters section.

  • Management Agents on Compute Instances
    If the Management Agent is enabled through the OCI Console using the OCA plugin, then there is no response file since it's not used for compute instances. In this case, do the following:
    1. Log in to the instance where the Management Agent is deployed and switch to the oracle-cloud-agent user by running:
      sudo -u oracle-cloud-agent sh
    2. Create the agent.definedtags file in the /var/lib/oracle-cloud-agent/plugins/oci-managementagent/polaris/agent_inst/config/security/resource/ directory.

    3. In the agent.definedtags file, add the defined tags required to create the resource.

      For example, you can add the following:
      [{"GLOBAL":{"ComponentType":"<value>"}}, {"GLOBAL":{"ApplicationName":"<value>"}}]
    4. Restart oracle-cloud-agent using the command:
      sudo systemctl restart oracle-cloud-agent
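To avoid leaving a `<value>` placeholder or malformed JSON in agent.definedtags, you can generate its content. This minimal Python sketch follows the one-object-per-tag layout of the example in step 3; `defined_tags_json` is a hypothetical helper, and the tag values in the usage line ("agent", "billing") are made up.

```python
import json
from typing import Dict, List

PLACEHOLDER = "<value>"


def defined_tags_json(tags: Dict[str, Dict[str, str]]) -> str:
    """Build agent.definedtags content: one {namespace: {key: value}} object per tag."""
    entries: List[Dict[str, Dict[str, str]]] = []
    for namespace, keys in tags.items():
        for key, value in keys.items():
            if not value or value == PLACEHOLDER:
                raise ValueError("no value for %s.%s" % (namespace, key))
            entries.append({namespace: {key: value}})
    return json.dumps(entries)


if __name__ == "__main__":
    # Hypothetical values: replace them with the values your compartment requires.
    print(defined_tags_json({"GLOBAL": {"ComponentType": "agent", "ApplicationName": "billing"}}))
```

Redirect the output into the file while working as the oracle-cloud-agent user so the file keeps the expected ownership.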

Management Agent setup failed with fork/exec oracle.polaris.oca.main: permission denied

Users may encounter this error resulting in failure to install or start the Management Agent.

The error message shown in the Plugin view of compute instance for the Management Agent Plugin looks similar to the following:

workflow.go:23: [ERROR] step [*core.SetupImageStep] execution failed with [setup image failed with [fork/exec 230821.1905/bin/oracle.polaris.oca.main: permission denied]]
mgmtagent_image.go:139: [ERROR] bootstrap workflow failed with error setup image failed with [fork/exec 230821.1905/bin/oracle.polaris.oca.main: permission denied]
agent.go:74: [ERROR] failed to start agent during bootstrap with [setup image failed with [fork/exec 230821.1905/bin/oracle.polaris.oca.main: permission denied]]

Possible Cause:

This issue may occur when the compute instance disallows fork/exec operations from the /tmp directory because /tmp is mounted as tmpfs with the noexec flag.

To confirm this possible cause, run the following:
$ mount | grep tmpfs
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,inode64)

If the output for /tmp includes the noexec flag, as in the example above, this is the cause of the error.
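The same check can be done by reading /proc/mounts directly. This minimal Python sketch returns the mount options for /tmp; `mount_options` is a hypothetical helper, and a None result means /tmp is not a separate mount, in which case the options of the file system that contains it apply.

```python
from typing import List, Optional


def mount_options(mounts_text: str, mount_point: str = "/tmp") -> Optional[List[str]]:
    """Return the mount options for `mount_point` from /proc/mounts text, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later (stacked) mount wins
    return options


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            opts = mount_options(f.read())
        if opts is None:
            print("/tmp is not a separate mount; check the file system that contains it")
        else:
            print("noexec is set on /tmp" if "noexec" in opts else "/tmp allows execution")
    except OSError as exc:
        print("Could not read /proc/mounts:", exc)
```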

Action:

  1. Stop Oracle Cloud Agent.
    $ sudo systemctl stop oracle-cloud-agent
  2. Add the following setting to the file: /etc/oracle-cloud-agent/plugins/oci-managementagent/config.yml
    overrideTmpDir: true
  3. Start Oracle Cloud Agent.
    $ sudo systemctl start oracle-cloud-agent

Troubleshoot Management Gateways

This topic covers typical issues and resolutions related to Management Gateways.

Remove Management Gateway

Cause: In some cases, you may need to remove an existing Management Gateway installation in order to reinstall it.

Action:
  • Check if the gateway is running:

    For OL7: systemctl status mgmt_gateway

    For OL6: /sbin/initctl status mgmt_gateway

    If the gateway is running, stop it:

    For OL7: systemctl stop mgmt_gateway

    For OL6: /sbin/initctl stop mgmt_gateway

  • Remove the installed Gateway RPM using the following command: rpm -e oracle.mgmt_gateway --noscripts

  • Remove any remaining Gateway files using the following command:

    rm -rf /opt/oracle/mgmt_agent

  • Run the following:

    For OL7: rm -rf /etc/systemd/system/mgmt_gateway.service

    For OL6: rm -rf /etc/init/mgmt_agent.conf

Configure Management Gateway

Cause: In some cases, the hostname might not resolve in the installation environment, which can cause the installation to fail with the following error message:

"Could not resolve hostname <hostname value> in the installation environment. Resolve the hostname or provide the GatewayCertCommonName in the response file and rerun the gateway setup script."

Action:

  • Make sure the hostname of the environment resolves: running the command hostname -f should return the fully qualified domain name (FQDN).
  • Optionally, provide a custom FQDN for the gateway configuration by setting the GatewayCertCommonName property in the input response file. See Response File Parameters.
  • Rerun the gateway setup script:
    sudo /opt/oracle/mgmt_agent/agent_inst/bin/setupGateway.sh opts=<user_home_directory>/gateway.rsp
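Before rerunning the setup script, you can check that the host name is fully qualified and resolves. This minimal Python sketch uses socket.getfqdn() as an approximation of hostname -f (the two can differ depending on resolver configuration); the helper names are hypothetical.

```python
import socket
from typing import Optional


def looks_fully_qualified(name: str) -> bool:
    """True if `name` has a host label plus at least one non-empty domain label."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def resolve(name: str) -> Optional[str]:
    """Return an IPv4 address for `name`, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


if __name__ == "__main__":
    fqdn = socket.getfqdn()
    address = resolve(fqdn)
    if looks_fully_qualified(fqdn) and address:
        print("%s resolves to %s" % (fqdn, address))
    else:
        print("'%s' is not a resolvable FQDN; fix name resolution or set "
              "GatewayCertCommonName in the response file." % fqdn)
```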

Cause: In some cases, the Management Gateway installation might fail with the following error message because of missing policies in OCI or resource limits in the tenancy. If you see this error, follow the steps below.

"Failed to start Management Gateway as certificates could not be created, initialized or retrieved in OCI. Please check the logs for more details."

Action:

  • Open the log file in the Management Gateway installation directory, for example: /opt/oracle/mgmt_agent/plugins/GatewayProxy/statedir/log/mgmt_gateway.log
  • If the log file contains 404 errors such as the following, then choose one of the following options to resolve the issue:
    2023-07-25 15:38:06.694/CEST [pool-3-thread-1] INFO  com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateUtility - Response String {  "code" : "NotAuthorizedOrNotFound",  "message" : "Authorization failed or requested resource not found."}
    2023-07-25 15:38:06.696/CEST [pool-3-thread-1] ERROR com.oracle.mgmtagent.proxy.ProxyServer - Error while initializing and loading certificate bundles com.oracle.mgmtagent.proxy.exception.CertificateFailureException: The response status is 404 after multiple retries at com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateUtility.executeRequest(CertificateUtility.java:293) ~
  • If the log file contains 400 errors such as the following, then review the following options to resolve the issue:
    2023-09-20 18:51:32.772/GMT [pool-3-thread-1] INFO  com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateCreationUtil - Create Vault Service Url invoked https://kms.us-ashburn-1.oraclecloud.com/20180608/vaults
    2023-09-20 18:51:33.400/GMT [pool-3-thread-1] INFO  com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateUtility - Received response code 400
    2023-09-20 18:51:33.400/GMT [pool-3-thread-1] INFO  com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateUtility - Header name opc-request-id , value /5704D03441842D3818B824B2D6B2712E/1D1FED893474FDA900188E24F3DEE59B
    2023-09-20 18:51:33.401/GMT [pool-3-thread-1] INFO  com.oracle.mgmtagent.proxy.oci.certificate.util.CertificateUtility - Response String {  "code" : "LimitExceeded",  "message" : "The limit for this tenancy has been exceeded."}
    • Check the limit for the Default Vault Count resource of the Key Management Service in the OCI Console. You can raise a request to increase the resource limits. For more information, see Managing Keys and Managing Vaults.
    • You can set up certificates manually, for details see Perform Prerequisites for Deploying Management Gateway and go to the Manual Certificate Management section.
      Note

      When you create the Issued by internal CA certificates, the Certificate Profile must be either TLS Server or TLS Client and only the RSA signing algorithms are supported.
  • If the logs show any other failures related to the Vault or Key service APIs, raise a request and reach out to the oci_kms team, providing the response body and opc-request-id.
  • If the logs show any other failures related to the Certificate Authorities or Certificates service APIs, raise a request and reach out to the oci_certificates team, providing the response body and opc-request-id.
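To collect the error codes and opc-request-id values needed for a support request, you can scan the gateway log. This minimal Python sketch assumes the log line formats shown above; pairing each response with the most recent opc-request-id line is a heuristic, and the HINTS mapping is a hypothetical summary of the causes described in this section.

```python
import json
import os
import re
from typing import List, Tuple

RESPONSE = re.compile(r"Response String (\{.*\})")
REQUEST_ID = re.compile(r"Header name opc-request-id\s*,\s*value\s*(\S+)")
HINTS = {
    "NotAuthorizedOrNotFound": "check the OCI policies for the gateway (404)",
    "LimitExceeded": "check the Default Vault Count limit for Key Management (400)",
}


def scan_gateway_log(lines: List[str]) -> List[Tuple[str, str, str]]:
    """Return (code, message, last opc-request-id seen) for each OCI error response."""
    findings = []
    request_id = ""
    for line in lines:
        rid = REQUEST_ID.search(line)
        if rid:
            request_id = rid.group(1)
        match = RESPONSE.search(line)
        if match:
            try:
                body = json.loads(match.group(1))
            except ValueError:
                continue  # skip response bodies that are not valid JSON
            findings.append((body.get("code", ""), body.get("message", ""), request_id))
    return findings


if __name__ == "__main__":
    # Example log location from the step above; adjust it for your installation.
    log_path = "/opt/oracle/mgmt_agent/plugins/GatewayProxy/statedir/log/mgmt_gateway.log"
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as f:
            for code, message, request_id in scan_gateway_log(f.readlines()):
                print(code, "|", message, "| opc-request-id:", request_id or "n/a")
                print("  hint:", HINTS.get(code, "see the options above"))
    else:
        print("Log not found:", log_path)
```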