Troubleshooting Telegraf Metric Collection

If Telegraf metric data does not appear in Oracle Management Cloud as expected, follow the basic debugging procedure shown below.

Troubleshooting Procedure

  1. Ensure that the generic metric collector entity was added successfully to the cloud agent by the omcli add_entity command. If it does not show up in the metric browser, run the omcli status_entity command as shown below:

    omcli status_entity agent <entityDefinitionJsonFilePath>

    Validation errors, if any, will be shown in the command output.

  2. Enable trace level logging in emd.properties. Set the following two properties:

    Logger._enableTrace=true
    Logger.sdklog.level=DEBUG

    Then restart the cloud agent and run the tail command on the gcagent_sdk.trc file in the agent's log directory.
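
If you prefer to script the property change rather than edit the file by hand, the following is a minimal Python sketch of one way to do it. The helper name set_properties and the sample text are hypothetical, and the sketch assumes emd.properties uses plain key=value lines with # comments. It operates on text only, so reading and writing the actual file (and backing it up first) is left to you.

```python
def set_properties(text, updates):
    """Return properties-file text with each key in `updates` set to its value.

    Existing key=value lines are rewritten in place; keys that are not
    present are appended at the end. Comment lines (starting with '#')
    and blank lines are left untouched.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                out.append(f"{key}={remaining.pop(key)}")
                continue
        out.append(line)
    # Append any properties that were not already present in the text.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Hypothetical excerpt standing in for the real emd.properties content.
sample = "# hypothetical excerpt\nLogger.sdklog.level=INFO\n"
print(set_properties(sample, {
    "Logger._enableTrace": "true",
    "Logger.sdklog.level": "DEBUG",
}))
```

The same helper could be reused with the values from step 7 when debugging is finished.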

  3. The log file should show the complete payload that the agent received from Telegraf, which metrics the receiver is in turn sending to Oracle Management Cloud, and which metrics are unmapped.
    Search for "gmcReceiver received payload" in the log file to see the full payload received. If this line is not seen in the log file, the agent may not be receiving data from Telegraf. If this is the case:
    • Check if Telegraf is running.
    • Check that the intended input plugins are enabled and Telegraf is able to collect their metrics by running the telegraf --test command as shown in the following example.
      $ telegraf --test
      2019/03/04 21:00:09 I! Using config file: /etc/telegraf/telegraf.conf
      > cpu,collector=telegraf,cpu=cpu0,host=myhost.myco.com usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1551762010000000000
      > cpu,collector=telegraf,cpu=cpu1,host=myhost.myco.com usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1551762010000000000
      > cpu,collector=telegraf,cpu=cpu2,host=myhost.myco.com usage_guest=0,usage_guest_nice=0,usage_idle=98.00000004470348,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=1.999999998952262 1551762010000000000
      > cpu,collector=telegraf,cpu=cpu3,host=myhost.myco.com usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1551762010000000000
      > cpu,collector=telegraf,cpu=cpu-total,host=myhost.myco.com usage_guest=0,usage_guest_nice=0,usage_idle=100,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=0,usage_user=0 1551762010000000000
      > mem,collector=telegraf,host=myhost.myco.com active=6735482880i,available=11130187776i,available_percent=73.67584678266645,buffered=3569352704i,cached=7279378432i,commit_limit=22233530368i,committed_as=4000460800i,dirty=839680i,free=281456640i,high_free=0i,high_total=0i,huge_page_size=2097152i,huge_pages_free=0i,huge_pages_total=0i,inactive=5336559616i,low_free=0i,low_total=0i,mapped=1415385088i,page_tables=116322304i,shared=1340026880i,slab=2446262272i,swap_cached=14417920i,swap_free=14367285248i,swap_total=14680047616i,total=15106969600i,used=3976781824i,used_percent=26.324153217333542,vmalloc_chunk=35184301154304i,vmalloc_total=35184372087808i,vmalloc_used=50819072i,wired=0i,write_back=0i,write_back_tmp=0i 1551762010000000000
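
To check quickly which input plugins produced output and which host tag value they carry, the telegraf --test output can be summarized with a short script. The following Python sketch is illustrative only: the function names are hypothetical, the sample is abbreviated from the output above, and the parser is deliberately simplified (it assumes no escaped spaces or commas and no quoted string fields).

```python
from collections import Counter


def parse_test_line(line):
    """Parse one '> measurement,tags fields timestamp' line.

    Returns (measurement, tags, field_names, timestamp), or None for
    lines that are not metric lines (for example Telegraf log lines).
    Simplified: does not handle escaped characters or quoted strings.
    """
    line = line.strip()
    if not line.startswith("> "):
        return None
    parts = line[2:].split(" ")
    if len(parts) != 3:
        return None
    head, fields, timestamp = parts
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    field_names = [pair.split("=", 1)[0] for pair in fields.split(",")]
    return measurement, tags, field_names, int(timestamp)


def summarize(output):
    """Count metric lines per (measurement, host tag) pair."""
    counts = Counter()
    for line in output.splitlines():
        parsed = parse_test_line(line)
        if parsed:
            measurement, tags, _, _ = parsed
            counts[(measurement, tags.get("host"))] += 1
    return counts


# Abbreviated from the example output above (most fields omitted).
sample = """\
2019/03/04 21:00:09 I! Using config file: /etc/telegraf/telegraf.conf
> cpu,collector=telegraf,cpu=cpu0,host=myhost.myco.com usage_idle=100,usage_user=0 1551762010000000000
> cpu,collector=telegraf,cpu=cpu1,host=myhost.myco.com usage_idle=100,usage_user=0 1551762010000000000
> mem,collector=telegraf,host=myhost.myco.com free=281456640i,total=15106969600i 1551762010000000000
"""

for (measurement, host), count in summarize(sample).items():
    print(measurement, host, count)
```

A measurement that is missing from the summary suggests that the corresponding input plugin is not enabled or is failing to collect.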
      

    If the test is successful but metrics are still not being received by the cloud agent, check that the HTTP output plugin has been configured correctly, in particular the host and port in the URL. Check Telegraf's log file or syslog for any errors reported by outputs.http. Also check whether other software, such as SELinux, anti-virus software, or a firewall, is blocking Telegraf from writing metrics to the cloud agent's port.
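
One way to rule out a blocked port is to attempt a plain TCP connection from the Telegraf host to the cloud agent. The sketch below uses only the Python standard library. The host and port values are placeholders, not defaults from this document; substitute the host and port from the URL in your HTTP output plugin configuration. A successful connection shows only that the port is reachable, not that the agent accepts the payload.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the cloud agent's host and port.
    agent_host = "127.0.0.1"
    agent_port = 12345
    state = "reachable" if can_connect(agent_host, agent_port) else "NOT reachable"
    print(f"{agent_host}:{agent_port} is {state}")
```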

  4. Search the log file for payload-level summary lines, which start with the "Source Metrics" line. These lines give summary counts, such as how many metrics were received in each payload, how many were sent to Oracle Management Cloud, and how many are unmapped.

    Payload Level Summary Logging Example - gcagent_sdk.trc

    2019-03-04 21:45:04,613 [401336:9A108C02] DEBUG - Source Metrics: 18
    2019-03-04 21:45:04,613 [401336:9A108C02] DEBUG - SEND_METRIC_GROUP_CALLED: 18

    If the summary shows SEND_METRIC_GROUP_CALLED: <count>, this is the normal case.

    If the summary shows NO_ASSOC_GMC_ENTITY_WITH_MONITORING_CAPABILITY: <count>, check that the omc_filter_expression of the generic metric collector (gmc) entity allows the payload to pass through the filter. Ensure that the host name (if any) specified in the omc_filter_expression property exactly matches the value of the host field in the payload. Also ensure that the gmc entity has either a Standard or an Enterprise license. The license can be checked from Oracle Management Cloud’s Administration UI.

    If the summary shows METRIC_UPLOAD_RATE_LIMIT_EXCEEDED: <count>, then <count> metrics in the payload were down-sampled and were not sent up to Oracle Management Cloud. This is expected if metrics are sent more often than once a minute, that is, if the interval in the telegraf.conf file is shorter than "60s".

    If the summary shows WAITING_FOR_MAPPING_METADATA: <count>, then <count> metrics in the payload are waiting for auto-map processing to complete. This is a transient state that is expected only in the automatic mapping case. Auto-map processing can take from a few minutes to tens of minutes to complete.

    If the summary shows SKIPPED_DUPLICATE_METRIC_POST: <count>, then <count> metrics in the payload were skipped because multiple metric records were detected for the same entity, metric group, and timestamp. In some cases this is harmless, such as when the payload contains redundant records. In other cases, the user may need to adjust which tags are used for entity identification by manually specifying entity_identifier in the Telegraf configuration file. In still other cases, the input plugin configuration may need to be adjusted, or the behavior may be a mapping limitation. For example, ensure that the process name or pattern specified for the procstat plugin captures a single process (PID). Ingesting procstat metrics for multiple processes in Oracle Management Cloud is currently not supported and results in skipped posts, as seen in the log file.

    If the summary shows SKIPPED_AGGREGATE_METRIC_POST: <count>, then <count> metrics were skipped because they are aggregates. Ingestion of aggregate metrics such as sum, min, max, mean, count, histograms, etc. from Telegraf is currently not supported.
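
When many payloads have been logged, it can help to total the summary counters across the whole trace file rather than read them one payload at a time. The following Python sketch assumes that every summary line follows the "DEBUG - <NAME>: <count>" format shown in the example above; the regular expression and the function name are illustrative, not part of the agent.

```python
import re
from collections import Counter

# Matches lines such as "... DEBUG - SEND_METRIC_GROUP_CALLED: 18".
# Assumption: all summary counters are logged in this same format.
SUMMARY_RE = re.compile(r"DEBUG - (Source Metrics|[A-Z][A-Z_]+): (\d+)\s*$")


def tally_summary(lines):
    """Sum each payload-level summary counter over an iterable of log lines."""
    totals = Counter()
    for line in lines:
        match = SUMMARY_RE.search(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals


# The two summary lines from the example above.
SAMPLE = """\
2019-03-04 21:45:04,613 [401336:9A108C02] DEBUG - Source Metrics: 18
2019-03-04 21:45:04,613 [401336:9A108C02] DEBUG - SEND_METRIC_GROUP_CALLED: 18
"""

for name, total in tally_summary(SAMPLE.splitlines()).items():
    print(f"{name}: {total}")
```

To run it against a real trace file, pass the open file object to tally_summary, since iterating over a file yields its lines. Comparing the Source Metrics total with the other totals shows where metrics are being dropped.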

  5. If SEND_METRIC_GROUP_CALLED: <count> is seen, entities should eventually appear in the monitoring service UI with an entity type that matches the Telegraf plugin name and an entity name that contains the Telegraf host's name (as obtained from the host field within the payload sent by Telegraf to the cloud agent). If you do not see such an entity, it is possible that the entity has been created but has not been granted a Standard or Enterprise license. To fix this, add a license from the License Administration UI. From the Oracle Management Cloud console, select the Administration > Entities Configuration > Licensing link. On this page, open the Unlicensed Entities link. If it shows the auto-created entity, assign License Edition = Standard or Enterprise and click Save. To ensure this happens automatically in the future, set License Auto-Assignment to Standard or Enterprise.

  6. Once the auto-created entity shows up in the list of entities in the monitoring service UI, drill down into the entity to see the auto-mapped metrics. Only the availability metric is shown by default. On the Performance Charts tab, click Options > Choose Metrics to select the auto-created metrics and view their charts. Metric alert rules based on availability and thresholds can also be defined on these performance metrics and are expected to work similarly to alerts on metrics natively collected by the Oracle cloud agent. Anomaly alerts are disabled out of the box for Telegraf metrics auto-mapped in Oracle Management Cloud.

  7. When debugging is no longer required, turn off trace level logging and set the SDK log level to INFO. Set the following in emd.properties.

    Reset Log Level

    
    Logger._enableTrace=false
    Logger.sdklog.level=INFO