Monitoring for Arbiter Nodes

An Arbiter Node is a lightweight process that participates in electing a new master when the old master becomes unavailable. For more information, see Arbiter Nodes in the Concepts Guide.

See the following section:

Metrics for Arbiter Nodes

  • arbNodeServiceStatus The current status of the Arbiter Node. The possible values are as follows:

    • starting(1) The Storage Node Agent is booting up.

    • waitingForDeploy(2) The Arbiter Node is waiting to be registered with the Storage Node Agent.

    • running(3) The Arbiter Node is running.

    • stopping(4) The Arbiter Node is in the process of shutting down.

    • stopped(5) An intentional clean shutdown.

    • errorRestarting(6) The Arbiter Node is restarting after encountering an error.

    • errorNoRestart(7) The service is in an error state and will not be restarted automatically. Administrative intervention is required. Search for SEVERE entries in both the service's log file and the log file of the SNA controlling the failed service. For an Arbiter Node, the service's log file is the Arbiter log:

      <kvroot>/<storename>/log/rg*_an1_*.log

      where <kvroot> and <storename> are user inputs and * represents the number of the log.

      Note that the kvroot and storename are different for every installation. Similarly, to find the log file for the SNA, use:

      <kvroot>/<storename>/log/sn*_*.log

      Examples of SN log file names are sn1_0.log and sn1_1.log.

      Search for the SEVERE keyword in these log files, then read the matching messages to fix the errors. You may need help from Oracle NoSQL Database support. The action to take depends on the nature of the failure and can vary from explicitly stopping and restarting the service (easy) to replacing the service instance entirely (slow and not easy). The issue can be any of the following:
      • Resource issue – A necessary resource, for example disk space, memory, or network, is not available.

      • Configuration problem – A configuration-related issue that needs to be fixed.

      • Software bug – A bug in the code that needs attention from Oracle NoSQL Database support.

      • On-disk corruption – Something in persistent storage has been corrupted.

      Note that corruption situations are difficult to handle, but they are rare and require help from Oracle NoSQL Database support.

    • unreachable(8) The Arbiter Node is unreachable by the admin service.

      Note:

      If a Storage Node is UNREACHABLE, or an Admin Node is having problems and its Storage Node is UNREACHABLE, the first thing to check is the network connectivity between the Admin and the Storage Node. However, if the managing Storage Node Agent is reachable and the managed Arbiter Node is not, the network is probably working and the problem lies elsewhere.

    • expectedRestarting(9) The Arbiter Node is executing an expected restart, because some plan CLI commands cause a component to restart. This differs from errorRestarting(6), which is a restart after encountering an error.
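
The status codes listed above can be captured in a small lookup, which is convenient when a monitoring script receives only the numeric value. The following is a minimal sketch in Python. The class name, the helper function, and the choice of which statuses to flag are illustrative assumptions and are not part of Oracle NoSQL Database.

```python
from enum import IntEnum


class ArbNodeServiceStatus(IntEnum):
    """Numeric codes as listed for arbNodeServiceStatus."""
    STARTING = 1
    WAITING_FOR_DEPLOY = 2
    RUNNING = 3
    STOPPING = 4
    STOPPED = 5
    ERROR_RESTARTING = 6
    ERROR_NO_RESTART = 7
    UNREACHABLE = 8
    EXPECTED_RESTARTING = 9


# Hypothetical triage policy: statuses an operator may want to look at.
# Adjust this set to match your own alerting rules.
NEEDS_ATTENTION = {
    ArbNodeServiceStatus.ERROR_RESTARTING,
    ArbNodeServiceStatus.ERROR_NO_RESTART,
    ArbNodeServiceStatus.UNREACHABLE,
}


def describe_status(code: int) -> str:
    """Translate a numeric status into a readable label with a triage flag."""
    status = ArbNodeServiceStatus(code)  # raises ValueError for unknown codes
    flag = "needs attention" if status in NEEDS_ATTENTION else "ok"
    return f"{status.name}({status.value}): {flag}"
```

Calling describe_status(7) yields a label for errorNoRestart flagged as needing attention, while an unlisted code such as 10 raises ValueError instead of being silently accepted.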

    Note:

    All timestamp metrics are in UTC. Convert them as needed to the time zone relevant to where the store is deployed.
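
The log search described for errorNoRestart(7) can be scripted. The sketch below, in Python, expands the two file name patterns given above under <kvroot>/<storename>/log and collects every line containing the SEVERE keyword. The function and constant names are hypothetical, and the sketch assumes nothing about the log line format beyond the presence of that keyword.

```python
import glob
import os

# File name patterns taken from the text above.
ARBITER_LOG_PATTERN = "rg*_an1_*.log"
SNA_LOG_PATTERN = "sn*_*.log"


def find_severe_entries(kvroot, storename, pattern):
    """Return (path, line_number, text) for each line containing SEVERE.

    kvroot and storename are installation-specific values that the
    caller must supply.
    """
    log_dir = os.path.join(kvroot, storename, "log")
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, encoding="utf-8", errors="replace") as log_file:
            for line_number, line in enumerate(log_file, start=1):
                if "SEVERE" in line:
                    hits.append((path, line_number, line.rstrip("\n")))
    return hits
```

Running the function once with ARBITER_LOG_PATTERN and once with SNA_LOG_PATTERN covers both the Arbiter log and the log of the controlling SNA. The matching messages still have to be read and interpreted by a person.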

  • arbNodeConfigProperties The set of configuration name/value pairs that the Arbiter Node is currently running with. This is analogous to the Replication Node.

  • arbNodeJavaMiscParams The values of -Xms, -Xmx, and -XX:ParallelGCThreads= as encountered when the Java VM running this Arbiter Node was booted.

  • arbNodeLoggingConfigProps The value of the loggingConfigProps parameter as encountered when the Java VM running this Arbiter Node was booted.

  • arbNodeCollectEnvStats True or false depending on whether the Arbiter Node is currently collecting performance statistics.

  • arbNodeStatsInterval The interval (in seconds) that the Arbiter Node uses for aggregate statistics.

  • arbNodeHeapMB The size of the Java heap for this Arbiter Node, in MB.

  • arbNodeAcks The number of transactions acknowledged.

  • arbNodeMaster The current master.

  • arbNodeState The replication state of the node. An Arbiter has an associated replication state (analogous to the replication node state). The state diagram is UNKNOWN <-> REPLICA -> DETACHED.

  • arbNodeVLSN The current acked VLSN. Arbiters track the VLSN/DTVLSN of the transaction commit that the Arbiter acknowledges. This is the highest VLSN value that the Arbiter acknowledged.

  • arbNodeReplayQueueOverflow The current replayQueueOverflow value. The arbNodeReplayQueueOverflow statistic is incremented when the Arbiter cannot process acknowledgement requests fast enough, so that the thread reading from the network has to wait for free space in the queue. RepParms.REPLICA_MESSAGE_QUEUE_SIZE specifies the maximum number of entries that the queue can hold. The default is 1000 entries. A high arbNodeReplayQueueOverflow value may indicate that the queue size is too small or that the Arbiter is not able to process requests as fast as the system load requires.
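
The state diagram given for arbNodeState (UNKNOWN <-> REPLICA -> DETACHED) can be used to sanity-check a series of sampled values. The Python sketch below encodes the diagram literally and reports any observed change that it does not permit. The names are hypothetical. Because a monitor samples at intervals, an intermediate state can be missed, so a reported pair is a prompt to investigate rather than proof of a fault.

```python
# Transitions read directly from the diagram UNKNOWN <-> REPLICA -> DETACHED.
ALLOWED_TRANSITIONS = {
    "UNKNOWN": {"REPLICA"},
    "REPLICA": {"UNKNOWN", "DETACHED"},
    "DETACHED": set(),
}


def unexpected_transitions(samples):
    """Return (previous, current) pairs that the diagram does not permit.

    samples is a chronological list of arbNodeState values. Consecutive
    identical values mean no change and are skipped.
    """
    unexpected = []
    for previous, current in zip(samples, samples[1:]):
        if previous == current:
            continue
        if current not in ALLOWED_TRANSITIONS.get(previous, set()):
            unexpected.append((previous, current))
    return unexpected
```

For example, a series that moves from UNKNOWN to REPLICA and later to DETACHED produces an empty result, while a change from DETACHED back to REPLICA is reported.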