If quorum is lost because one or more admin service instances have failed, the store cannot be administered, although it can still be used to read and write key/value pairs. Because the admin service uses quorum-based replication to keep the administrative data highly available, deploy enough admins to make quorum loss unlikely, typically an odd number such as 3.
Usually, the Storage Node Agent (SNA) automatically restarts a failed admin service, preserving quorum and the ability to administer the store. But if the Storage Node (SN), and specifically its SNA, fails, the admin service remains unavailable.
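The relationship between the number of admins and quorum can be sketched with a little arithmetic. This is a minimal illustration that assumes quorum is a simple majority of the deployed admins; the function names are hypothetical and are not part of the product.

```python
def quorum_size(num_admins: int) -> int:
    """Smallest number of admins that forms a simple majority (assumed quorum rule)."""
    if num_admins < 1:
        raise ValueError("at least one admin is required")
    return num_admins // 2 + 1


def tolerated_failures(num_admins: int) -> int:
    """Number of admins that can fail while a majority is still available."""
    return num_admins - quorum_size(num_admins)


if __name__ == "__main__":
    # Compare group sizes 1 through 5.
    for n in range(1, 6):
        print(f"{n} admin(s): quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, two admins tolerate no failures while three tolerate one, and adding a fourth admin does not raise the number of tolerated failures, which is one reason an odd number is typical.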
When zone or node failures result in a loss of admin quorum, the admin CLI can still be used in read-only mode to help with diagnosing failures. For more information, see Troubleshooting.
If admin quorum has been lost, the following procedure lets you reconfigure the store so that the remaining admin services can be used to perform administrative tasks.
For the following example, assume all of the following:
A store has 3 Storage Nodes (sn1, sn2, sn3) deployed on hosts (host-sn1, host-sn2, host-sn3) respectively.
An admin service is deployed to two of the three machines: admin1 is deployed to host-sn1 and admin2 is deployed to host-sn2.
The admin2 service fails, so quorum is lost.
To address the lost admin service quorum:
Verify that admin service quorum has been lost. If a command that requires quorum is executed in the CLI, a timeout should occur if quorum has been lost. For example:
kv-> pool create -name new-pool
Transaction retry limit exceeded (12.1.3.0.1)
Use the show admins command to determine the names of the admin services that were deployed:
kv-> show admins
admin1: Storage Node sn1, HTTP port 13231 (master)
admin2: Storage Node sn2, HTTP port 13231
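If the list of deployed admins needs to be processed by a script, the show admins output can be parsed line by line. The sketch below assumes the output always has the shape shown in this example (name, Storage Node, HTTP port, optional role in parentheses); the AdminInfo type and parse_show_admins function are hypothetical helpers, not part of the CLI.

```python
import re
from typing import List, NamedTuple, Optional


class AdminInfo(NamedTuple):
    name: str
    storage_node: str
    http_port: int
    role: Optional[str]  # e.g. "master", or None when no role is printed


# Matches lines such as: admin1: Storage Node sn1, HTTP port 13231 (master)
_ADMIN_LINE = re.compile(
    r"^(admin\d+): Storage Node (sn\d+), HTTP port (\d+)(?: \((\w+)\))?$"
)


def parse_show_admins(output: str) -> List[AdminInfo]:
    """Return one AdminInfo per admin line; other lines (such as the prompt) are skipped."""
    admins = []
    for line in output.splitlines():
        match = _ADMIN_LINE.match(line.strip())
        if match:
            name, sn, port, role = match.groups()
            admins.append(AdminInfo(name, sn, int(port), role))
    return admins


if __name__ == "__main__":
    sample = (
        "kv-> show admins\n"
        "admin1: Storage Node sn1, HTTP port 13231 (master)\n"
        "admin2: Storage Node sn2, HTTP port 13231\n"
    )
    for admin in parse_show_admins(sample):
        print(admin)
```

The result tells you which Storage Nodes host an admin, and therefore which hosts to check in the next step.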
Identify the remaining healthy admin service(s) so that quorum can be reconfigured for them. Log in to sn1 and run the following command:
> jps -m | grep Admin
12276 ManagedService -root /opt/ondb/var/kvroot -class Admin -service BootstrapAdmin.13230 -config config1.xml
This output shows that only the first admin service deployed to the store (the bootstrap admin) is still healthy and running on host-sn1.
If a given admin service is down, either its Storage Node is down and cannot be accessed, or the command above produces no output for that admin service. In this case, admin2 is down and is therefore omitted from the output.
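The check above can also be scripted. The following sketch extracts the process id and service name of each running admin from text shaped like the jps -m output in this example. The function name running_admins is hypothetical, and the RepNode line in the sample is an invented illustration of a non-admin process, not output taken from a real store.

```python
import shlex
from typing import Dict


def running_admins(jps_output: str) -> Dict[int, str]:
    """Map process id -> value of -service for every line whose -class is Admin."""
    result = {}
    for line in jps_output.splitlines():
        tokens = shlex.split(line)
        # Expect: <pid> <main class> <arguments...>
        if len(tokens) < 2 or not tokens[0].isdigit():
            continue
        args = tokens[2:]
        # Pair each "-flag" with the token that follows it.
        opts = {
            args[i]: args[i + 1]
            for i in range(len(args) - 1)
            if args[i].startswith("-")
        }
        if opts.get("-class") == "Admin":
            result[int(tokens[0])] = opts.get("-service", "")
    return result


if __name__ == "__main__":
    sample = (
        "12276 ManagedService -root /opt/ondb/var/kvroot -class Admin "
        "-service BootstrapAdmin.13230 -config config1.xml\n"
        # Hypothetical non-admin line, included only to show filtering.
        "12300 ManagedService -root /opt/ondb/var/kvroot -class RepNode "
        "-service rg1-rn1 -config config1.xml\n"
    )
    print(running_admins(sample))
```

An empty result on a reachable host corresponds to the case described above, where the command produces no output for the admin service.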
Once the healthy admin service(s) have been identified, modify the configuration file of each corresponding SN so that the je.rep.electableGroupSizeOverride configuration property of the admin component equals the number of remaining healthy admin services. In this example, the configuration file (config.xml) corresponding to sn1 must be modified. The file has the following structure:
<config version="1">
  <component name="storageNodeParams" type="storageNodeParams" validate="true">
    <property name="storageNodeId" value="1" type="INT"/>
    <property name="rootDirPath" value="/opt/ondb/var/kvroot" type="STRING"/>
    <property name="haHostname" value="host-sn1" type="STRING"/>
    <property name="registryPort" value="13230" type="INT"/>
    ..........
  </component>
  <component name="mountPoints" type="bootstrapParams" validate="false">
    <property name="/disk1/ondb/data" value="" type="STRING"/>
  </component>
  <component name="globalParams" type="globalParams" validate="true">
    <property name="storeName" value="example-store" type="STRING"/>
    <property name="isLoopback" value="false" type="BOOLEAN"/>
  </component>
  <component name="admin1" type="adminParams" validate="true">
    <property name="adminHttpPort" value="13231" type="INT"/>
    <property name="storageNodeId" value="1" type="INT"/>
    ..........
  </component>
  <component name="rg1-rn1" type="repNodeParams" validate="true">
    <property name="repNodeId" value="rg1-rn1" type="STRING"/>
    <property name="repNodeType" value="ELECTABLE" type="STRING"/>
    <property name="storageNodeId" value="1" type="INT"/>
    ..........
  </component>
</config>
In the component named admin1, add (or modify) the property named configProperties so that its value sets je.rep.electableGroupSizeOverride to 1. For example:
<component name="admin1" type="adminParams" validate="true">
  <property name="configProperties" value="je.rep.electableGroupSizeOverride 1;" type="STRING"/>
  ..........
</component>
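The edit above can be sketched programmatically with Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not a supported tool: the function name set_group_size_override is hypothetical, the sample XML is a cut-down stand-in for the real file, and the code assumes that configProperties holds "name value;" entries separated by semicolons, as in the example value shown above.

```python
import xml.etree.ElementTree as ET

OVERRIDE = "je.rep.electableGroupSizeOverride"


def set_group_size_override(config_xml: str, admin_name: str, size: int) -> str:
    """Return config_xml with the override set to size in the named admin component."""
    root = ET.fromstring(config_xml)
    admin = root.find(f"./component[@name='{admin_name}']")
    if admin is None or admin.get("type") != "adminParams":
        raise ValueError(f"no adminParams component named {admin_name!r}")

    prop = admin.find("./property[@name='configProperties']")
    if prop is None:
        # Add the property if the component does not have it yet.
        prop = ET.SubElement(
            admin, "property",
            {"name": "configProperties", "value": "", "type": "STRING"},
        )

    # Assumption: entries look like "name value;". Keep unrelated entries,
    # drop any previous override, then append the new one.
    entries = [e.strip() for e in prop.get("value", "").split(";") if e.strip()]
    entries = [e for e in entries if e.split()[0] != OVERRIDE]
    entries.append(f"{OVERRIDE} {size}")
    prop.set("value", "".join(e + ";" for e in entries))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<config version="1">'
        '<component name="admin1" type="adminParams" validate="true">'
        '<property name="adminHttpPort" value="13231" type="INT"/>'
        '</component>'
        '</config>'
    )
    print(set_group_size_override(sample, "admin1", 1))
```

The function works on a string rather than on the file itself, so the caller decides how to read the file and write the result back. Keeping a backup copy of the original configuration file before writing is a sensible precaution.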
Kill each healthy admin service process for which a configuration file was modified. To kill the healthy admin service (admin1) on host-sn1, run:
> kill 12276
where 12276 is the process id of the single healthy admin service, as reported by the jps command above.
The SNA automatically restarts the killed admin service, which, in this example, is then configured to operate with a quorum size of 1 rather than 2.
Use the single healthy admin service (admin1) to perform the necessary failure recovery administration tasks on the store.
Once the failed admin service (admin2 in this example) has been recovered, repeat steps 4 and 5 (modifying the configuration file and restarting the admin service) to return the admin service(s) to their original quorum. For this example, je.rep.electableGroupSizeOverride should be set to 2.