Two sets of configurable properties control the Deployment Template fault tolerance mechanism and the frequency of status checks for components.
You can now configure fault tolerance (i.e., retries) for any component (such as Forge, Dgidx, and Dgraph) when invoked through the EAC. This functionality also extends to the CAS server when running a crawl with the CAS component. The name of the fault-tolerance property is maxMissedStatusQueriesAllowed.
When components are run, the Deployment Template instructs the EAC to start a component, then polls on a regular interval to check if the component is running, stopped, or failed. If one of these status checks fails, the Deployment Template assumes the component has failed and the script ends. The maxMissedStatusQueriesAllowed property allows a configurable number of consecutive failures to be tolerated before the script will end.
The following is an example of a Forge component configured to tolerate a maximum of ten consecutive failures:
<forge id="Forge" host-id="ITLHost">
  <properties>
    <property name="numStateBackups" value="10"/>
    <property name="numLogBackups" value="10"/>
    <property name="maxMissedStatusQueriesAllowed" value="10"/>
  </properties>
  ...
</forge>
The default number of allowed consecutive failures is 5. Note that only consecutive failures count toward this limit: every time a status query returns successfully, the counter is reset to zero.
Keep in mind that you can use different fault-tolerance settings for your components. For example, you could set a value of 10 for the Forge component, a value of 8 for Dgidx, and a value of 6 for the Dgraph.
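The counter behavior described above can be illustrated with a minimal Java sketch. The class name MissedStatusCounter and its methods are hypothetical and are not part of the Deployment Template or the EAC Toolkit. The sketch also assumes that the script gives up once the number of consecutive failed queries exceeds the configured value; the exact boundary is not stated here.

```java
// Hypothetical sketch: MissedStatusCounter is an illustrative name,
// not a class from the Deployment Template or the EAC Toolkit.
final class MissedStatusCounter {
    private final int maxMissedStatusQueriesAllowed;
    private int consecutiveMisses = 0;

    MissedStatusCounter(int maxMissedStatusQueriesAllowed) {
        this.maxMissedStatusQueriesAllowed = maxMissedStatusQueriesAllowed;
    }

    /**
     * Records the outcome of one status query.
     * Returns true when the script should stop. Assumption: that happens
     * once the failures in a row exceed the allowed number.
     */
    boolean recordAndCheckGiveUp(boolean querySucceeded) {
        if (querySucceeded) {
            consecutiveMisses = 0; // any successful query resets the counter
            return false;
        }
        consecutiveMisses++;
        return consecutiveMisses > maxMissedStatusQueriesAllowed;
    }

    int getConsecutiveMisses() {
        return consecutiveMisses;
    }

    public static void main(String[] args) {
        MissedStatusCounter counter = new MissedStatusCounter(5); // documented default
        boolean[] outcomes = {false, false, false, true, false, false};
        for (boolean ok : outcomes) {
            boolean giveUp = counter.recordAndCheckGiveUp(ok);
            System.out.println("success=" + ok
                    + " misses=" + counter.getConsecutiveMisses()
                    + " giveUp=" + giveUp);
        }
    }
}
```

In this run, three failures are followed by a success, so the counter returns to zero and the two later failures start a new count rather than adding to the earlier three.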
As described in the previous section, the Deployment Template polls on a regular interval to check if a started component is running, stopped, or failed. A set of four properties is available to configure, for each component, how frequently the Deployment Template polls for status while the component is running. Because each property has a default value, you need to set only those properties that are important to you.
The polling properties are as follows:
minWaitSeconds specifies the threshold (in seconds) at which slow polling switches to standard (regular) polling. The default is -1 (i.e., no threshold, so the standard polling interval is used from the start).
slowPollingIntervalMs specifies the interval (in milliseconds) at which status queries are sent as long as the minWaitSeconds time has not elapsed. The default slow polling interval is 60 seconds.
standardPollingIntervalMs (specified in milliseconds) is used after the minWaitSeconds time has passed. If no minWaitSeconds setting is specified, the standardPollingIntervalMs setting is always used. The default standard polling interval is 1 second.
maxWaitSeconds specifies the threshold (in seconds) at which the Deployment Template gives up asking for status and assumes that the component has failed. The default is -1 (i.e., no threshold, so the Deployment Template will keep trying indefinitely).
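How the four properties interact can be sketched in Java as follows. This is an illustration only: the PollingSchedule class and its method names are hypothetical, and the sketch assumes that any negative value means "no threshold" and that standard polling begins exactly when minWaitSeconds has elapsed.

```java
// Hypothetical sketch: PollingSchedule is an illustrative name,
// not a class from the Deployment Template or the EAC Toolkit.
final class PollingSchedule {
    private final long minWaitSeconds;
    private final long maxWaitSeconds;
    private final long slowPollingIntervalMs;
    private final long standardPollingIntervalMs;

    PollingSchedule(long minWaitSeconds, long maxWaitSeconds,
                    long slowPollingIntervalMs, long standardPollingIntervalMs) {
        this.minWaitSeconds = minWaitSeconds;
        this.maxWaitSeconds = maxWaitSeconds;
        this.slowPollingIntervalMs = slowPollingIntervalMs;
        this.standardPollingIntervalMs = standardPollingIntervalMs;
    }

    /** Documented defaults: no thresholds, 60 s slow interval, 1 s standard interval. */
    static PollingSchedule withDefaults() {
        return new PollingSchedule(-1L, -1L, 60_000L, 1_000L);
    }

    /** Interval to wait before the next status query, given the elapsed run time. */
    long intervalMs(long elapsedSeconds) {
        // Assumption: a negative minWaitSeconds disables the slow-polling phase.
        boolean slowPhase = minWaitSeconds >= 0 && elapsedSeconds < minWaitSeconds;
        return slowPhase ? slowPollingIntervalMs : standardPollingIntervalMs;
    }

    /** True once maxWaitSeconds has elapsed; never true when no threshold is set. */
    boolean hasTimedOut(long elapsedSeconds) {
        return maxWaitSeconds >= 0 && elapsedSeconds >= maxWaitSeconds;
    }

    public static void main(String[] args) {
        PollingSchedule schedule = new PollingSchedule(30L, 120L, 30_000L, 5_000L);
        for (long t : new long[] {0L, 29L, 30L, 119L, 120L}) {
            System.out.println("t=" + t + "s interval=" + schedule.intervalMs(t)
                    + "ms timedOut=" + schedule.hasTimedOut(t));
        }
    }
}
```

With the defaults, intervalMs always returns the standard interval and hasTimedOut never returns true, which matches the "no threshold" behavior described for the -1 values.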
Here is an example configuration for a long-running Forge component that typically takes 8 hours to complete:
<forge id="Forge" host-id="ITLHost">
  <properties>
    <property name="numStateBackups" value="10"/>
    <property name="numLogBackups" value="10"/>
    <property name="standardPollingIntervalMs" value="60000"/>
    <property name="slowPollingIntervalMs" value="600000"/>
    <property name="minWaitSeconds" value="28800"/>
    <property name="maxMissedStatusQueriesAllowed" value="10"/>
  </properties>
  ...
</forge>
The result of this configuration would be that for the first 8 hours (minWaitSeconds=28800), Forge's status would be checked every 10 minutes (slowPollingIntervalMs=600000), after which time the status would be checked every minute (standardPollingIntervalMs=60000). If a status check fails, a maximum of 10 consecutive retries will be attempted, based on the standardPollingIntervalMs setting.
Keep in mind that these values can be set independently for each component.
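As a quick check on the numbers in the Forge example, the following Java sketch works out roughly how many status queries are sent during the slow-polling phase and how long the allowed retries span. The class and method names are hypothetical, and the counts are approximate because the exact timing of the first query is not specified here.

```java
// Hypothetical sketch: illustrative arithmetic only,
// not code from the Deployment Template or the EAC Toolkit.
final class ForgePollingArithmetic {
    /** Approximate number of status queries sent during the slow-polling phase. */
    static long slowPhaseQueries(long minWaitSeconds, long slowPollingIntervalMs) {
        return (minWaitSeconds * 1000L) / slowPollingIntervalMs;
    }

    /** Seconds spanned by the allowed consecutive retries at the standard interval. */
    static long retryWindowSeconds(int maxMissedStatusQueriesAllowed,
                                   long standardPollingIntervalMs) {
        return (maxMissedStatusQueriesAllowed * standardPollingIntervalMs) / 1000L;
    }

    public static void main(String[] args) {
        // Values taken from the Forge example configuration.
        System.out.println("Slow-phase queries: " + slowPhaseQueries(28_800L, 600_000L)); // 48
        System.out.println("Retry window (s): " + retryWindowSeconds(10, 60_000L));       // 600
    }
}
```

In other words, the example sends about 48 status queries over the first 8 hours, and 10 consecutive retries at the one-minute standard interval cover about 10 minutes.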
Fault tolerance and polling interval values can also be set for these utilities:
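You set the new values by adjusting the BeanShell script code that is used to construct and invoke the utility. You adjust the code by using setter methods from the EAC Toolkit's Utility class, such as setMinWaitSeconds, setMaxWaitSeconds, setMaxMissedStatusQueriesAllowed, setPollingIntervalMs, and setSlowPollingIntervalMs.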
If you do not use any of these methods, then the utility will use the default values listed in the two previous sections.
For example, here is a default utility invocation in the CAS crawl scripts:
// create the target dir, if it doesn't already exist
mkDirUtil = new CreateDirUtility(CAS.getAppName(),
    CAS.getEacHost(), CAS.getEacPort(), CAS.isSslEnabled());
mkDirUtil.init(Forge.getHostId(), destDir, CAS.getWorkingDir());
mkDirUtil.run();
You would then add these methods before calling the run() method, so that the code would now look like this:
// create the target dir, if it doesn't already exist
mkDirUtil = new CreateDirUtility(CAS.getAppName(),
    CAS.getEacHost(), CAS.getEacPort(), CAS.isSslEnabled());
mkDirUtil.init(Forge.getHostId(), destDir, CAS.getWorkingDir());
mkDirUtil.setMinWaitSeconds(30);
mkDirUtil.setMaxWaitSeconds(120);
mkDirUtil.setMaxMissedStatusQueriesAllowed(10);
mkDirUtil.setPollingIntervalMs(5000);
mkDirUtil.setSlowPollingIntervalMs(30000);
mkDirUtil.run();
Alternatively, if your utility was defined in your AppConfig.xml like this:
<copy id="MyCopy" src-host-id="ITLHost" dest-host-id="MDEXHost" recursive="true">
  <src>./path/to/files</src>
  <dest>./path/to/target</dest>
</copy>
You would add the same type of lines as above, before calling the run() method; for example:
MyCopy.setMaxMissedStatusQueriesAllowed(10);
MyCopy.run();
For more information on the Utility methods, see the Javadocs for the EAC Toolkit package.