Monitoring for Replication Nodes (RN)

Each Storage Node hosts one or more Replication Nodes, which store the data in key-value pairs. For more information, see Replication Nodes and Shards in the Concepts Guide.

See the following section:

Metrics for Replication Node

  • repNodeServiceStatus The current status of the Replication Node. The possible values are as follows:

    • starting(1) The Replication Node is booting up.

    • waitingForDeploy (2) The Replication Node is waiting to be registered with the Storage Node Agent.

    • running(3) The Replication Node is running.

    • stopping(4) The Replication Node is in the process of shutting down.

    • stopped(5) The Replication Node was shut down intentionally and cleanly.

    • errorRestarting(6) The Replication Node is restarting after encountering an error.

    • errorNoRestart(7) The service is in an error state, will not restart automatically, and requires administrative intervention. Search for SEVERE entries in both the log file for the Replication Node and the log file of the Storage Node Agent (SNA) controlling the failed service. The Replication Node log is located at:

      <kvroot>/<storename>/log/rg*-rn*_*.log

      where <kvroot> and <storename> are user inputs and each * represents a number in the log file name. For example, rg3-rn2_0.log is the latest log and rg3-rn2_1.log is the previous log.

      Note that the kvroot and storename will be different for every installation. Similarly, to find the log file for SNA, use:

      <kvroot>/<storename>/log/sn*_*.log

      Examples of SN log file names are sn1_0.log and sn1_1.log.

      Search for the SEVERE keyword in these log files, then read the matching messages to fix the errors. You may require help from Oracle NoSQL Database support. The action to take depends on the nature of the failure, and can vary from explicitly stopping and restarting the service (easy) to replacing the service instance entirely (difficult and slow). The issue can be any of the following:

      • Resource issue – A necessary resource, for example disk space, memory, or network, is not available.

      • Configuration problem – A configuration-related issue that needs to be fixed.

      • Software bug – A bug in the code that requires help from Oracle NoSQL Database support.

      • On disk corruption – Something in persistent storage has been corrupted.

      Note that corruption situations are rare but difficult to handle, and require help from Oracle NoSQL Database support.

    • unreachable(8) The Replication Node is unreachable by the admin service.

      Note:

      If a Storage Node is UNREACHABLE, or a Replication Node is having problems and its Storage Node is UNREACHABLE, the first thing to check is the network connectivity between the Admin and the Storage Node. However, if the managing Storage Node Agent is reachable and the managed Replication Node is not, the network is probably OK and the problem lies elsewhere.

    • expectedRestarting(9) The Replication Node is executing an expected restart, because some plan CLI commands cause a component to restart. This differs from errorRestarting(6), which is a restart after encountering an error.
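The status codes and the log search described above can be sketched in Python as follows. This is a minimal illustration, not part of the product: the enum member names, the needs_intervention and find_severe_entries helpers, and the assumption that a relevant log line simply contains the text SEVERE are all hypothetical. Only the numeric codes and the log path patterns come from the list above.

```python
import enum
from pathlib import Path


class RepNodeServiceStatus(enum.IntEnum):
    """Numeric codes listed for repNodeServiceStatus (member names are illustrative)."""
    STARTING = 1
    WAITING_FOR_DEPLOY = 2
    RUNNING = 3
    STOPPING = 4
    STOPPED = 5
    ERROR_RESTARTING = 6
    ERROR_NO_RESTART = 7
    UNREACHABLE = 8
    EXPECTED_RESTARTING = 9


def needs_intervention(code: int) -> bool:
    """Return True only for errorNoRestart(7), which will not recover by itself."""
    return RepNodeServiceStatus(code) is RepNodeServiceStatus.ERROR_NO_RESTART


def find_severe_entries(kvroot, storename, pattern="rg*-rn*_*.log"):
    """Collect (file name, line number, text) for every line containing SEVERE.

    The default pattern matches Replication Node logs; pass "sn*_*.log"
    to scan the Storage Node Agent logs instead.
    """
    log_dir = Path(kvroot) / storename / "log"
    hits = []
    for log_file in sorted(log_dir.glob(pattern)):
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with log_file.open(errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if "SEVERE" in line:
                    hits.append((log_file.name, line_no, line.rstrip("\n")))
    return hits
```

A monitoring script could call needs_intervention on the reported status and, when it returns True, run find_severe_entries twice, once with the default pattern and once with "sn*_*.log", to gather the messages to review.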

The following metrics can be monitored to get a sense of the performance of each Replication Node in the cluster. There are two flavors of metric granularity:

  • Interval By default, each node in the cluster samples performance data every 20 seconds and aggregates the metrics over this interval. To change the interval, use the admin CLI command plan change-parameters -global and supply the collectorInterval parameter with a new value (see Changing Parameters).

  • Cumulative Metrics that have been collected and aggregated since the node started.

The metrics are further broken down into measurements for operations over single keys versus operations over multiple keys.

Note:

All timestamp metrics are in UTC, so convert them to a time zone relevant to where the store is deployed as needed.
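The conversion mentioned in the note can be sketched as below. This assumes the timestamp has already been parsed into a Python datetime; the format in which a monitoring agent actually delivers the value is not specified here. The utc_to_local helper, the sample value, and the UTC+05:30 zone are hypothetical.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def utc_to_local(timestamp: datetime, local_zone: tzinfo) -> datetime:
    """Convert a UTC metric timestamp to the given local time zone."""
    if timestamp.tzinfo is None:
        # Naive values are treated as UTC, since the metrics are reported in UTC.
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    return timestamp.astimezone(local_zone)


# Hypothetical sample value and a fixed UTC+05:30 zone, for illustration only.
interval_start = datetime(2024, 1, 15, 12, 0, 0)
local_zone = timezone(timedelta(hours=5, minutes=30))
print(utc_to_local(interval_start, local_zone).isoformat())
# prints 2024-01-15T17:30:00+05:30
```

When daylight saving rules matter, a named zone from the standard zoneinfo module (Python 3.9 or later) can be passed in place of the fixed offset.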

  • repNodeIntervalStart The start timestamp of when this sample of single key operation measurements was collected.

  • repNodeIntervalEnd The end timestamp of when this sample of single key operation measurements was collected.

  • repNodeIntervalTotalOps Total number of single key operations (get, put, delete) processed by the Replication Node in the interval being measured.

  • repNodeIntervalThroughput Number of single key operations (get, put, delete) per second completed during the interval being measured.

  • repNodeIntervalLatMin The minimum latency sample of single key operations (get, put, delete) during the interval being measured.

  • repNodeIntervalLatMax The maximum latency sample of single key operations (get, put, delete) during the interval being measured.

  • repNodeIntervalLatAvg The average latency sample of single key operations (get, put, delete) during the interval being measured (returned as a float).

  • repNodeIntervalPct95 The 95th percentile of the latency sample of single key operations (get, put, delete) during the interval being measured.

  • repNodeIntervalPct99 The 99th percentile of the latency sample of single key operations (get, put, delete) during the interval being measured.

  • repNodeCumulativeStart The start timestamp of when the Replication Node started collecting cumulative performance metrics (all the below metrics that are cumulative).

  • repNodeCumulativeEnd The end timestamp of the period over which the Replication Node has collected cumulative performance metrics (all the below metrics that are cumulative).

  • repNodeCumulativeTotalOps The total number of single key operations that have been processed by the Replication Node.

  • repNodeCumulativeThroughput The sustained operations per second of single key operations measured by this node since it has started.

  • repNodeCumulativeLatMin The minimum latency of single key operations measured by this node since it has started.

  • repNodeCumulativeLatMax The maximum latency of single key operations measured by this node since it has started.

  • repNodeCumulativeLatAvg The average latency of single key operations measured by this node since it has started (returned as a float).

  • repNodeCumulativePct95 The 95th percentile of the latency of single key operations (get, put, delete) since it has started.

  • repNodeCumulativePct99 The 99th percentile of the latency of single key operations (get, put, delete) since it has started.

  • repNodeMultiIntervalStart The start timestamp of when this sample of multiple key operation measurements was collected.

  • repNodeMultiIntervalEnd The end timestamp of when this sample of multiple key operation measurements was collected.

  • repNodeMultiIntervalTotalOps Total number of multiple key operations (execute) processed by the Replication Node in the interval being measured.

  • repNodeMultiIntervalThroughput Number of multiple key operations (execute) per second completed during the interval being measured.

  • repNodeMultiIntervalLatMin The minimum latency sample of multiple key operations (execute) during the interval being measured.

  • repNodeMultiIntervalLatMax The maximum latency sample of multiple key operations (execute) during the interval being measured.

  • repNodeMultiIntervalLatAvg The average latency sample of multiple key operations (execute) during the interval being measured (returned as a float).

  • repNodeMultiIntervalPct95 The 95th percentile of the latency sample of multiple key operations (execute) during the interval being measured.

  • repNodeMultiIntervalPct99 The 99th percentile of the latency sample of multiple key operations (execute) during the interval being measured.

  • repNodeMultiIntervalTotalRequests The total number of multiple key operations (execute) during the interval being measured.

  • repNodeMultiCumulativeStart The start timestamp of when the Replication Node started collecting cumulative multiple key performance metrics (all the below metrics that are cumulative).

  • repNodeMultiCumulativeEnd The end timestamp of the period over which the Replication Node has collected cumulative multiple key performance metrics (all the below metrics that are cumulative).

  • repNodeMultiCumulativeTotalOps The total number of multiple key operations that have been processed by the Replication Node since it has started.

  • repNodeMultiCumulativeThroughput The sustained operations per second of multiple key operations measured by this node since it has started.

  • repNodeMultiCumulativeLatMin The minimum latency of multiple key operations (execute) measured by this node since it has started.

  • repNodeMultiCumulativeLatMax The maximum latency of multiple key operations (execute) measured by this node since it has started.

  • repNodeMultiCumulativeLatAvg The average latency of multiple key operations (execute) measured by this node since it has started (returned as a float).

  • repNodeMultiCumulativePct95 The 95th percentile of the latency of multiple key operations (execute) since it has started.

  • repNodeMultiCumulativePct99 The 99th percentile of the latency of multiple key operations (execute) since it has started.

  • repNodeMultiCumulativeTotalRequests The total number of multiple key operations measured by this node since it has started.

  • repNodeCommitLag The average commit lag (in milliseconds) for a given Replication Node's update operations during a given time interval.

  • repNodeCacheSize The size in bytes of the Replication Node's cache of B-tree nodes, which is calculated using the DBCacheSize utility.

  • repNodeConfigProperties The set of configuration name/value pairs that the Replication Node is currently running with. Each parameter is a constant that has a string value, which is used to set the parameter in KVSTORE. For example, the parameter CHECKPOINTER_BYTES_INTERVAL has the string value je.checkpointer.bytesInterval in the Javadoc. The Javadoc also gives details such as each parameter's data type, minimum, maximum, and so on.

  • repNodeCollectEnvStats True or false depending on whether the Replication Node is currently collecting performance statistics.

  • repNodeStatsInterval The interval (in seconds) that the Replication Node is utilizing for aggregate statistics.

  • repNodeMaxTrackedLatency The maximum number of milliseconds for which latency statistics will be tracked. For example, if this parameter is set to 1000, then any operation at the Replication Node that exhibits a latency of 1000 milliseconds or greater is not put into the array of metric samples for subsequent reporting.

  • repNodeJavaMiscParams The values of the -Xms, -Xmx, and -XX:ParallelGCThreads= settings as encountered when the Java VM running this Replication Node was booted.

  • repNodeLoggingConfigProps The value of the loggingConfigProps parameter as encountered when the Java VM running this Replication Node was booted.

  • repNodeHeapMB The size of the Java heap for this Replication Node, in MB.

  • repNodeMountPoint The path to the file system mount point where this Replication Node's files are stored.

  • repNodeMountPointSize The size of the file system mount point where this Replication Node's files are stored.

  • repNodeHeapSize The current value of -Xmx for this Replication Node.

  • repNodeLatencyCeiling The upper bound (in milliseconds) at which latency samples may be gathered at this Replication Node before an alert is generated. For example, if this is set to 3, then any latency sample above 3 milliseconds generates an alert.

  • repNodeCommitLagThreshold If the average commit lag (in milliseconds) for a given Replication Node during a given time interval exceeds this value, an alert is generated.

  • repNodeReplicationState The replication state of the node.

  • repNodeThroughputFloor The lower bound (in operations per second) at which throughput samples may be gathered at this Replication Node before an alert is generated. For example, if this is set to 300,000, then any throughput calculation at this Replication Node that is lower than 300,000 operations per second generates an alert.
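The three alert thresholds described above (repNodeLatencyCeiling, repNodeCommitLagThreshold, and repNodeThroughputFloor) can be illustrated with a small sketch. This is not how the store itself evaluates alerts. The Thresholds and IntervalSample classes, the check_alerts helper, the choice of the average latency as the value compared against the ceiling, and the assumption that latency metrics are reported in milliseconds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    latency_ceiling_ms: float       # repNodeLatencyCeiling
    throughput_floor_ops: float     # repNodeThroughputFloor
    commit_lag_threshold_ms: float  # repNodeCommitLagThreshold


@dataclass
class IntervalSample:
    latency_avg_ms: float  # assumed to come from repNodeIntervalLatAvg
    throughput_ops: float  # assumed to come from repNodeIntervalThroughput
    commit_lag_ms: float   # assumed to come from repNodeCommitLag


def check_alerts(sample: IntervalSample, limits: Thresholds) -> List[str]:
    """Return one message per threshold that the sample violates."""
    alerts = []
    if sample.latency_avg_ms > limits.latency_ceiling_ms:
        alerts.append("latency above ceiling")
    if sample.throughput_ops < limits.throughput_floor_ops:
        alerts.append("throughput below floor")
    if sample.commit_lag_ms > limits.commit_lag_threshold_ms:
        alerts.append("commit lag above threshold")
    return alerts


# Hypothetical values: ceiling of 3 ms and floor of 300,000 ops/sec as in the
# examples above; the commit lag threshold of 50 ms is made up for illustration.
limits = Thresholds(latency_ceiling_ms=3, throughput_floor_ops=300_000,
                    commit_lag_threshold_ms=50)
sample = IntervalSample(latency_avg_ms=5.0, throughput_ops=250_000,
                        commit_lag_ms=10)
print(check_alerts(sample, limits))
```

With these sample values the latency and throughput checks fire while the commit lag check does not, which mirrors the two worked examples in the metric descriptions.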