Fetch incremental crawl data script

This fetch script copies the incremental crawl output files to the directory used for a partial update. The script is reproduced in this section, with numbered steps describing the actions performed at each point in the script.

The script does not perform the partial update itself; that operation is managed by other scripts in the AppConfig.xml document.
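The overall flow of the fetch script — acquire a lock flag, copy the crawl output files, set a per-file "ready" flag for each delivered file, then release the lock — can be sketched in plain Java. Note that `FlagStore` and `FetchSketch.fetch` below are hypothetical stand-ins written for illustration; they are not the Deployment Template's actual `LockManager` or `CopyUtility` APIs.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the EAC flag store used by LockManager.
class FlagStore {
    private final Set<String> flags = ConcurrentHashMap.newKeySet();
    // Setting a flag doubles as acquiring a lock; false means already held.
    boolean acquire(String name) { return flags.add(name); }
    void set(String name)        { flags.add(name); }
    void release(String name)    { flags.remove(name); }
    boolean isSet(String name)   { return flags.contains(name); }
}

public class FetchSketch {
    // Copy every regular file from srcDir to destDir, setting a
    // "partial_extract::<filename>" flag for each delivered file.
    // The whole operation is guarded by a lock flag that is always
    // released, whether the copy succeeds or fails.
    static List<String> fetch(FlagStore flags, Path srcDir, Path destDir)
            throws IOException {
        if (!flags.acquire("complete_cas_crawl_data_lock")) {
            throw new IllegalStateException("crawl data directory is locked");
        }
        List<String> copied = new ArrayList<>();
        try {
            Files.createDirectories(destDir);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
                for (Path src : files) {
                    if (!Files.isRegularFile(src)) continue;
                    Files.copy(src, destDir.resolve(src.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                    // convention: mark each file as ready for partial update
                    flags.set("partial_extract::" + src.getFileName());
                    copied.add(src.getFileName().toString());
                }
            }
        } finally {
            flags.release("complete_cas_crawl_data_lock");
        }
        return copied;
    }
}
```

The try/finally block mirrors the script's guarantee that the lock flag is removed on both success and error.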

<script id="fetchIncrementalCasCrawlData">
   <![CDATA[
 log.info("Fetching incremental CAS crawl data for processing.");
  1. Obtain a lock on the complete crawl data directory. The script attempts to set a flag that serves as a lock on the data/complete_cas_crawl_output directory. The flag is removed if an error occurs or when the script completes successfully.
      // try to acquire a lock on the complete crawl data directory
      // for up to 10 minutes
      if (LockManager.acquireLockBlocking("complete_cas_crawl_data_lock",
          600))
    
  2. Get the path of the source data directory. The directory name is specified by the casCrawlIncrementalOutputDestDir property of the custom-component section.
      incrSrcDir = PathUtils.getAbsolutePath(CAS.getWorkingDir(),
        CAS.getCasCrawlIncrementalOutputDestDir()) + "/\\*";
    
  3. Get the path of the destination directory. The destination directory (to which incremental output files are copied) is specified by the IncomingDataDir property (./data/partials/incoming by default) in the PartialForge section of the AppConfig.xml file.
      incrDestDir = PathUtils.getAbsolutePath(PartialForge.getWorkingDir(),
        PartialForge.getIncomingDataDir());
    
  4. Copy the incremental source data to the destination directory. A CopyUtility object (named crawlDataCopy) is instantiated and used to copy the incremental source data to the data/partials/incoming directory.
      // copy incremental crawl data  
      crawlDataCopy = new CopyUtility(PartialForge.getAppName(),
        PartialForge.getEacHost(), PartialForge.getEacPort(),
        PartialForge.isSslEnabled()); 
      crawlDataCopy.init("copy_complete_cas_incremental_crawl_data",
        CAS.getFsCrawlOutputDestHost(), PartialForge.getHostId(), incrSrcDir,
        incrDestDir, true);
      crawlDataCopy.run();
    
  5. Set flags to indicate that files are ready for partial update processing. The default partial update functionality in the AppConfig.xml script expects flags indicating which files are ready for partial update processing. For each file delivered in the previous step, a flag is set whose name is the file name prefixed with the string "partial_extract::".
      // (re)set flags indicating which partial update files are ready
      // for processing -- convention is "partial_extract::[filename]"
      fileUtil = new FileUtility(PartialForge.getAppName(),
        PartialForge.getEacHost(), PartialForge.getEacPort(), 
        PartialForge.isSslEnabled());
      dirContents = fileUtil.getDirContents(incrDestDir, 
        PartialForge.getHostId());
    
      for (file : dirContents.keySet()) {
        fileName = PathUtils.getFileNameFromPath(file);
        LockManager.setFlag("partial_extract::" + fileName);
      }
    
  6. Release the lock. The "complete_cas_crawl_data_lock" flag is removed from the EAC, indicating that the fetch operation completed successfully. A "finished" message is also logged.
      // release lock on the crawl data directory
      LockManager.releaseLock("complete_cas_crawl_data_lock");
      ...
      log.info("Crawl data fetch script finished.");
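On the consuming side, the default partial update logic relies on the "partial_extract::" naming convention set in step 5: it looks for flags with that prefix to decide which delivered files to process, clearing each flag as the file is claimed. A minimal sketch of that convention follows; the `PartialUpdateSketch` class and its in-memory flag set are hypothetical illustrations, not the real EAC API.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class PartialUpdateSketch {
    static final String PREFIX = "partial_extract::";

    // Hypothetical in-memory stand-in for the EAC flag store.
    static final Set<String> flags = ConcurrentHashMap.newKeySet();

    // Return the file names whose "partial_extract::" flags are set,
    // clearing each flag as the file is claimed for processing.
    // Flags without the prefix (e.g. lock flags) are left untouched.
    static List<String> claimReadyFiles() {
        List<String> ready = new ArrayList<>();
        for (Iterator<String> it = flags.iterator(); it.hasNext(); ) {
            String flag = it.next();
            if (flag.startsWith(PREFIX)) {
                ready.add(flag.substring(PREFIX.length()));
                it.remove(); // clear the flag once the file is claimed
            }
        }
        Collections.sort(ready);
        return ready;
    }
}
```

Clearing each flag as it is claimed keeps a file from being processed twice if the partial update runs again before a new fetch delivers fresh data.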