Monitoring for Replication Nodes (RN)

Each Storage Node hosts one or more Replication Nodes, which store data as key-value pairs. For more information, see Replication Nodes and Shards in the Concepts Guide.

Metrics for Replication Node

repNodeServiceStatus – The current status of the Replication Node. The possible values are as follows:
- starting(1) – The Replication Node is booting up.
- waitingForDeploy(2) – The Replication Node is waiting to be registered with the Storage Node Agent.
- running(3) – The Replication Node is running.
- stopping(4) – The Replication Node is in the process of shutting down.
- stopped(5) – The Replication Node was shut down intentionally and cleanly.
- errorRestarting(6) – The Replication Node is restarting after encountering an error.
- errorNoRestart(7) – The service is in an error state, will not restart automatically, and requires administrative intervention. Search for SEVERE entries in both the log file for the Replication Node and the log file of the SNA controlling the failed service. The Replication Node log files are located at:
```
<kvroot>/<storename>/log/rg*-rn*_*.log
```
  where <kvroot> and <storename> are user-supplied values and each * stands for a number; the final number is the log sequence. For example, rg3-rn2_0.log is the latest log and rg3-rn2_1.log is the previous one.
  
  Note that <kvroot> and <storename> differ for every installation. Similarly, to find the log files for the SNA, use:
```
<kvroot>/<storename>/log/sn*_*.log
```
  Examples of SN log files are sn1_0.log and sn1_1.log.
  Search these log files for the SEVERE keyword and read the matching messages to fix the errors; you may also need help from Oracle NoSQL Database support. The action to take depends on the nature of the failure, and ranges from explicitly stopping and restarting the service (easy) to replacing the service instance entirely (difficult and slow). The cause can be any of the following:
  - Resource issue – A necessary resource, for example disk space, memory, or network, is not available.
  - Configuration problem – A configuration-related issue that needs to be fixed.
  - Software bug – A bug in the code, which requires help from Oracle NoSQL Database support.
  - On-disk corruption – Something in persistent storage has been corrupted.
  Note that corruption is rare but difficult to handle, and requires help from Oracle NoSQL Database support.
- unreachable(8) – The Replication Node is unreachable by the admin service.
  
  Note:
  
  If a Storage Node is UNREACHABLE, or a Replication Node is having problems and its Storage Node is UNREACHABLE, first check the network connectivity between the Admin and the Storage Node. However, if the managing Storage Node Agent is reachable and the managed Replication Node is not, the network is probably fine and the problem lies elsewhere.
- expectedRestarting(9) – The Replication Node is executing an expected restart, because some plan CLI commands cause a component to restart. This differs from errorRestarting(6), which is a restart after encountering an error.
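
The status codes and log locations above can be sketched in Python as follows. This is a minimal illustration, not part of the product: the names RepNodeServiceStatus, needs_intervention, and find_severe_entries are hypothetical, and the sketch assumes the log files are plain text readable from the machine where it runs.

```python
from enum import IntEnum
from pathlib import Path


class RepNodeServiceStatus(IntEnum):
    """Numeric values of repNodeServiceStatus as listed in this section."""
    STARTING = 1
    WAITING_FOR_DEPLOY = 2
    RUNNING = 3
    STOPPING = 4
    STOPPED = 5
    ERROR_RESTARTING = 6
    ERROR_NO_RESTART = 7
    UNREACHABLE = 8
    EXPECTED_RESTARTING = 9


def needs_intervention(code):
    """True only for errorNoRestart(7), which does not recover on its own."""
    return RepNodeServiceStatus(code) is RepNodeServiceStatus.ERROR_NO_RESTART


def find_severe_entries(kvroot, storename):
    """Collect (file name, line number, text) for every SEVERE line
    in the RN and SN logs under <kvroot>/<storename>/log."""
    log_dir = Path(kvroot) / storename / "log"
    hits = []
    # The two glob patterns mirror the RN and SNA log paths shown above.
    for pattern in ("rg*-rn*_*.log", "sn*_*.log"):
        for log_file in sorted(log_dir.glob(pattern)):
            with log_file.open(errors="replace") as fh:
                for line_no, line in enumerate(fh, start=1):
                    if "SEVERE" in line:
                        hits.append((log_file.name, line_no, line.rstrip("\n")))
    return hits
```

A monitoring script might call needs_intervention on the polled status value and, when it returns True, pass the installation's own kvroot and store name to find_severe_entries to list the messages to review.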

The following metrics can be monitored to get a sense for the performance of each Replication Node in the cluster. There are two flavors of metric granularity:

Interval – By default, each node in the cluster samples performance data every 20 seconds and aggregates the metrics over this interval. To change the interval, use the admin CLI command plan change-parameters -global and supply the collectorInterval parameter with a new value (see Changing Parameters).
Cumulative – Metrics that have been collected and aggregated since the node started.

The metrics are further broken down into measurements for operations over single keys versus operations over multiple keys.
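
The performance metric names in this section follow a regular pattern: the prefix repNode, an optional Multi for multiple key operations, Interval or Cumulative, and a statistic suffix. The sketch below builds the full list from that pattern, which may be convenient when configuring a monitoring tool. The function and constant names are hypothetical.

```python
# Statistic suffixes shared by all four metric groups in this section.
COMMON_SUFFIXES = [
    "Start", "End", "TotalOps", "Throughput",
    "LatMin", "LatMax", "LatAvg", "Pct95", "Pct99",
]


def performance_metric_names():
    """Build the interval and cumulative metric names for single key
    and multiple key operations."""
    names = []
    for scope in ("", "Multi"):
        for granularity in ("Interval", "Cumulative"):
            suffixes = list(COMMON_SUFFIXES)
            # Only the multiple key groups also report TotalRequests.
            if scope == "Multi":
                suffixes.append("TotalRequests")
            names.extend(f"repNode{scope}{granularity}{s}" for s in suffixes)
    return names
```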

Note:

All timestamp metrics are in UTC, so convert them to the time zone where the store is deployed as needed.
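
As an illustration of that conversion, the sketch below assumes the timestamp has already been parsed into a Python datetime that holds a UTC value; how the raw metric value is parsed depends on the monitoring tool and is not covered here. The function name utc_to_local is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_to_local(ts_utc, local_tz):
    """Convert a UTC timestamp to the given time zone.

    A naive datetime is treated as UTC, matching how the
    timestamp metrics are reported.
    """
    if ts_utc.tzinfo is None:
        ts_utc = ts_utc.replace(tzinfo=timezone.utc)
    return ts_utc.astimezone(local_tz)


# Example: a store deployed in a zone five and a half hours ahead of UTC.
deployment_tz = timezone(timedelta(hours=5, minutes=30))
local_start = utc_to_local(datetime(2024, 1, 15, 12, 0, 0), deployment_tz)
```

A fixed offset is used here to keep the example self-contained; a named zone from the standard zoneinfo module can be passed as local_tz instead when daylight saving rules matter.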

repNodeIntervalStart – The start timestamp of when this sample of single key operation measurements was collected.
repNodeIntervalEnd – The end timestamp of when this sample of single key operation measurements was collected.
repNodeIntervalTotalOps – Total number of single key operations (get, put, delete) processed by the Replication Node in the interval being measured.
repNodeIntervalThroughput – Number of single key operations (get, put, delete) per second completed during the interval being measured.
repNodeIntervalLatMin – The minimum latency sample of single key operations (get, put, delete) during the interval being measured.
repNodeIntervalLatMax – The maximum latency sample of single key operations (get, put, delete) during the interval being measured.
repNodeIntervalLatAvg – The average latency sample of single key operations (get, put, delete) during the interval being measured (returned as a float).
repNodeIntervalPct95 – The 95th percentile of the latency sample of single key operations (get, put, delete) during the interval being measured.
repNodeIntervalPct99 – The 99th percentile of the latency sample of single key operations (get, put, delete) during the interval being measured.
repNodeCumulativeStart – The start timestamp of when the Replication Node started collecting cumulative performance metrics (all the below metrics that are cumulative).
repNodeCumulativeEnd – The end timestamp of when the Replication Node stopped collecting cumulative performance metrics (all the below metrics that are cumulative).
repNodeCumulativeTotalOps – The total number of single key operations that have been processed by the Replication Node.
repNodeCumulativeThroughput – The sustained operations per second of single key operations measured by this node since it has started.
repNodeCumulativeLatMin – The minimum latency of single key operations measured by this node since it has started.
repNodeCumulativeLatMax – The maximum latency of single key operations measured by this node since it has started.
repNodeCumulativeLatAvg – The average latency of single key operations measured by this node since it has started (returned as a float).
repNodeCumulativePct95 – The 95th percentile of the latency of single key operations (get, put, delete) since it has started.
repNodeCumulativePct99 – The 99th percentile of the latency of single key operations (get, put, delete) since it has started.
repNodeMultiIntervalStart – The start timestamp of when this sample of multiple key operation measurements was collected.
repNodeMultiIntervalEnd – The end timestamp of when this sample of multiple key operation measurements was collected.
repNodeMultiIntervalTotalOps – Total number of multiple key operations (execute) processed by the Replication Node in the interval being measured.
repNodeMultiIntervalThroughput – Number of multiple key operations (execute) per second completed during the interval being measured.
repNodeMultiIntervalLatMin – The minimum latency sample of multiple key operations (execute) during the interval being measured.
repNodeMultiIntervalLatMax – The maximum latency sample of multiple key operations (execute) during the interval being measured.
repNodeMultiIntervalLatAvg – The average latency sample of multiple key operations (execute) during the interval being measured (returned as a float).
repNodeMultiIntervalPct95 – The 95th percentile of the latency sample of multiple key operations (execute) during the interval being measured.
repNodeMultiIntervalPct99 – The 99th percentile of the latency sample of multiple key operations (execute) during the interval being measured.
repNodeMultiIntervalTotalRequests – The total number of multiple key operations (execute) during the interval being measured.
repNodeMultiCumulativeStart – The start timestamp of when the Replication Node started collecting cumulative multiple key performance metrics (all the below metrics that are cumulative).
repNodeMultiCumulativeEnd – The end timestamp of when the Replication Node stopped collecting cumulative multiple key performance metrics (all the below metrics that are cumulative).
repNodeMultiCumulativeTotalOps – The total number of multiple key operations that have been processed by the Replication Node since it has started.
repNodeMultiCumulativeThroughput – The sustained operations per second of multiple key operations measured by this node since it has started.
repNodeMultiCumulativeLatMin – The minimum latency of multiple key operations (execute) measured by this node since it has started.
repNodeMultiCumulativeLatMax – The maximum latency of multiple key operations (execute) measured by this node since it has started.
repNodeMultiCumulativeLatAvg – The average latency of multiple key operations (execute) measured by this node since it has started (returned as a float).
repNodeMultiCumulativePct95 – The 95th percentile of the latency of multiple key operations (execute) since it has started.
repNodeMultiCumulativePct99 – The 99th percentile of the latency of multiple key operations (execute) since it has started.
repNodeMultiCumulativeTotalRequests – The total number of multiple key operations measured by this node since it has started.
repNodeCommitLag – The average commit lag (in milliseconds) for a given Replication Node's update operations during a given time interval.
repNodeCacheSize – The size in bytes of the replication node's cache of B-tree nodes, which is calculated using the DBCacheSize utility.
repNodeConfigProperties – The set of configuration name/value pairs that the Replication Node is currently running with. Each parameter is a constant with a string value, which is used to set the parameter in KVSTORE. For example, the parameter CHECKPOINTER_BYTES_INTERVAL has the string value je.checkpointer.bytesInterval in the Javadoc, which also documents each parameter's data type, minimum, maximum, and so on.
repNodeCollectEnvStats – True or false depending on whether the Replication Node is currently collecting performance statistics.
repNodeStatsInterval – The interval (in seconds) that the Replication Node is utilizing for aggregate statistics.
repNodeMaxTrackedLatency – The maximum number of milliseconds for which latency statistics will be tracked. For example, if this parameter is set to 1000, then any operation at the Replication Node that exhibits a latency of 1000 milliseconds or greater is not put into the array of metric samples for subsequent reporting.
repNodeJavaMiscParams – The values of -Xms, -Xmx, and -XX:ParallelGCThreads= as encountered when the Java VM running this Replication Node was booted.
repNodeLoggingConfigProps – The value of the loggingConfigProps parameter as encountered when the Java VM running this Replication Node was booted.
repNodeHeapMB – The size of the Java heap for this Replication Node, in MB.
repNodeMountPoint – The path to the file system mount point where this Replication Node's files are stored.
repNodeMountPointSize – The size of the file system mount point where this Replication Node's files are stored.
repNodeHeapSize – The current value of -Xmx for this Replication Node.
repNodeLatencyCeiling – The upper bound (in milliseconds) at which latency samples may be gathered at this Replication Node before an alert is generated. For example, if this is set to 3, then any latency sample above 3 generates an alert.
repNodeCommitLagThreshold – If the average commit lag (in milliseconds) for a given Replication Node during a given time interval exceeds this value, an alert is generated.
repNodeReplicationState – The replication state of the node.
repNodeThroughputFloor – The lower bound (in operations per second) at which throughput samples may be gathered at this Replication Node before an alert is generated. For example, if this is set to 300,000, then any throughput calculation at this Replication Node that is lower than 300,000 operations per second generates an alert.
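
The three alert thresholds described above (repNodeLatencyCeiling, repNodeCommitLagThreshold, and repNodeThroughputFloor) amount to simple comparisons. The sketch below mirrors them for a single set of polled values, for example to cross-check alerts in an external dashboard. It is an illustration only: the function name is hypothetical, the metrics are assumed to arrive as a dictionary keyed by metric name, and comparing repNodeIntervalLatAvg against the latency ceiling is an assumption, since this section does not say which latency statistic the store itself compares.

```python
def find_threshold_breaches(metrics):
    """Return the names of the thresholds breached by the sampled values."""
    breaches = []
    # Assumption: the interval average latency is the value compared
    # against the ceiling (both in milliseconds).
    if metrics["repNodeIntervalLatAvg"] > metrics["repNodeLatencyCeiling"]:
        breaches.append("repNodeLatencyCeiling")
    # Throughput below the floor (operations per second) is a breach.
    if metrics["repNodeIntervalThroughput"] < metrics["repNodeThroughputFloor"]:
        breaches.append("repNodeThroughputFloor")
    # Average commit lag above the threshold (milliseconds) is a breach.
    if metrics["repNodeCommitLag"] > metrics["repNodeCommitLagThreshold"]:
        breaches.append("repNodeCommitLagThreshold")
    return breaches
```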