The SourceConfig class allows a client to specify information about the data source that is being crawled. The SourceConfig class uses two methods to set data source properties: setModuleId() and setModuleProperties().

The SourceConfig object for a file system crawl requires a ModuleId that specifies "File System", a ModuleProperty to specify the seeds, and additional ModuleProperty objects for any optional source properties.


Here is an example of the source properties for a file system crawl.

// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);		
CasCrawler crawler = locator.getService();

// Create a new crawl Id with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");

// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);

// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();

// Create a file system module ID.
ModuleId moduleId = new ModuleId("File System");

// Set the module ID in the source config.
sourceConfig.setModuleId(moduleId);

// Create a module property object for the seeds.
ModuleProperty seeds = new ModuleProperty();
// Set the key for seeds.
seeds.setKey("seeds");
// Set multiple values for seeds.
seeds.setValues("C:\\tmp\\itldocset","C:\\tmp\\iapdocset");

// Set the seeds module property on the source config.
sourceConfig.addModuleProperty(seeds);

// Create a module property for gathering native file props.
ModuleProperty nativeFileProps = new ModuleProperty();
// Set the key for gathering native file properties.
nativeFileProps.setKey("gatherNativeFileProperties");
// Set the value to enable gathering native file properties.
nativeFileProps.setValues("true");

// Set the nativeFileProps module property on the source config.
sourceConfig.addModuleProperty(nativeFileProps);

// Create a module property object for expanding archives.
ModuleProperty extractArchives = new ModuleProperty();
// Set the key for extracting archive files.
extractArchives.setKey("expandArchives");
// Set the value to enable expanding archives.
extractArchives.setValues("true");

// Set the extractArchives module property on the source config.
sourceConfig.addModuleProperty(extractArchives);

// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);

// Create the crawl.
crawler.createCrawl(crawlConfig);

Note that if you retrieve a SourceConfig object from a configured crawl, you can call the getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.
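
The following is a minimal sketch of looking up one property by key after retrieving a list of module properties. It does not use the CAS API: StubModuleProperty, valuesFor, and ModulePropertyLookup are hypothetical stand-ins that only mirror the key/values shape shown in the example above, so the code can run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ModulePropertyLookup {

    // Hypothetical stand-in for a module property: one key, one or more values.
    // This is NOT the real CAS ModuleProperty class.
    static final class StubModuleProperty {
        private final String key;
        private final List<String> values;

        StubModuleProperty(String key, String... values) {
            this.key = key;
            this.values = Collections.unmodifiableList(Arrays.asList(values));
        }

        String getKey() { return key; }

        List<String> getValues() { return values; }
    }

    // Return the values stored under the given key, or an empty list if
    // no property in the list has that key.
    static List<String> valuesFor(List<StubModuleProperty> props, String key) {
        for (StubModuleProperty p : props) {
            if (p.getKey().equals(key)) {
                return p.getValues();
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Build a list shaped like the file system example above.
        List<StubModuleProperty> props = new ArrayList<>();
        props.add(new StubModuleProperty("seeds", "C:\\tmp\\itldocset", "C:\\tmp\\iapdocset"));
        props.add(new StubModuleProperty("expandArchives", "true"));

        System.out.println(valuesFor(props, "seeds"));
        System.out.println(valuesFor(props, "expandArchives"));
        System.out.println(valuesFor(props, "missing"));
    }
}
```

With the real API, the list would come from the getModuleProperties() method of the retrieved SourceConfig instead of being built by hand.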

The SourceConfig for a CMS crawl contains a mandatory ModuleId and additional ModuleProperty objects that define the CMS to crawl.

The module ID is a string that identifies the CMS connector. Use the CasCrawler.listModules() method to find out which module IDs are available on your CAS Server.


Here is an example of the source properties for a CMS crawl.

// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);		
CasCrawler crawler = locator.getService();

// Create a new crawl Id with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");

// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);

// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();

// Create a CMS module ID for a SampleLink repository.
ModuleId moduleId = new ModuleId("SampleLink");

// Set the module ID in the source configuration.
sourceConfig.setModuleId(moduleId);

// Create a list for the module property objects.
List<ModuleProperty> cmsPropsList = new ArrayList<ModuleProperty>();

// Configure the source properties that are specific to this 
// CMS. This example sets properties 
// for a SampleLink repository (a non-existent repository used to 
// illustrate the process).

// Create a module property for the repository URL.
ModuleProperty sampleLinkServer = new ModuleProperty();
// Set the key/value pair for the url source property.
sampleLinkServer.setKey("url");
sampleLinkServer.setValues("http://samplelink45.mysite.com");
// Set the module property in the module property list.
cmsPropsList.add(sampleLinkServer);

// Create a module property object to enable archive expansion.
ModuleProperty extractArchives = new ModuleProperty();
// Set the key for archive expansion.
extractArchives.setKey("expandArchives");
// Set the value to enable archive expansion.
extractArchives.setValues("true");
// Set the module property in the module property list.
cmsPropsList.add(extractArchives);

// Create a module property for username.
ModuleProperty uname = new ModuleProperty();
// Set the key for username. 
uname.setKey("username");
// Set the value and prepend the domain for Windows systems.
uname.setValues("SALES\\username");
// Set the module property in the module property list.
cmsPropsList.add(uname);

// Create a module property for password.
ModuleProperty upass = new ModuleProperty();
// Set the password key.
upass.setKey("password");
// Set the password value.
upass.setValues("endeca");
// Set the module property in the module property list.
cmsPropsList.add(upass);

// Set the module property list in the source configuration.
sourceConfig.setModuleProperties(cmsPropsList);

// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);

// Create the crawl.
crawler.createCrawl(crawlConfig);

Note that if you retrieve a SourceConfig object from a configured crawl, you can use its getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.

The SourceConfig object for a record store merger crawl requires a ModuleId that specifies com.endeca.cas.source.RecordStoreMerger, one or more ModuleProperty objects to specify the record store instances to merge, and additional ModuleProperty objects for optional source properties.


Here is an example of the source properties for a Record Store Merger crawl.

// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);		
CasCrawler crawler = locator.getService();

// Create a new crawl Id with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");

// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);

// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();

// Create a record store merger module ID.
ModuleId moduleId = new ModuleId("com.endeca.cas.source.RecordStoreMerger");

// Set the module ID in the source config.
sourceConfig.setModuleId(moduleId);

// Create a module property object for the data record stores.
ModuleProperty dataRecStores = new ModuleProperty();
// Set the key for data record stores.
dataRecStores.setKey("dataRecordStores");
// Set multiple values for each data record store name.
dataRecStores.setValues("DataStore1","DataStore2","DataStore3");

// Set the data record store module property on the source config.
sourceConfig.addModuleProperty(dataRecStores);

// Create a module property object for the dimension value record stores.
ModuleProperty dvalRecStores = new ModuleProperty();
// Set the key for dimension value record stores.
dvalRecStores.setKey("dimensionValueRecordStores");
// Set one value for each dimension value record store name.
dvalRecStores.setValues("DvalStoreCrawl1","DvalStoreCrawl2","DvalStoreCrawl3");

// Set the dimension value record store module property on the source config.
sourceConfig.addModuleProperty(dvalRecStores);

// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);

// Create the crawl.
crawler.createCrawl(crawlConfig);

Note that if you retrieve a SourceConfig object from a configured crawl, you can call the getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.

The SourceConfig for a custom data source crawl contains a mandatory ModuleId and ModuleProperty objects that define the custom data source to crawl and any other optional properties that are necessary for a custom data source.

Custom data sources can use any number of module properties. A plug-in developer determines what module properties are necessary for a custom data source and whether the module properties are required or optional.

A CAS application developer can check the available module properties for a custom data source by running the getModuleSpec task in the CAS Server Command-line Utility.

Here is an example of the source properties for a custom data source crawl.

// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);		
CasCrawler crawler = locator.getService();

// Create a new crawl Id with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo"); 

// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);

// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();

// Create a module ID for a Sample Data Source repository.
// Set the module ID in the constructor. 
ModuleId moduleId = new ModuleId("Sample Data Source");

// Set the module ID in the source configuration.
sourceConfig.setModuleId(moduleId);

// Create a list for the module property objects.
List<ModuleProperty> cmsPropsList = new ArrayList<ModuleProperty>();

// Create a module property for username.
// Set key/values of the module property as strings in the constructor.
ModuleProperty uname = new ModuleProperty("username", "SALES\\username");

// Set the module property in the module property list.
cmsPropsList.add(uname);

// Create a module property for password.
// Set key/values of the module property as strings in the constructor.
ModuleProperty upass = new ModuleProperty("password", "endeca");

// Set the module property in the module property list.
cmsPropsList.add(upass);

// Set the module property list in the source configuration.
sourceConfig.setModuleProperties(cmsPropsList);

// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);

// Create the crawl.
crawler.createCrawl(crawlConfig);

The ManipulatorConfig for a manipulator contains a mandatory ModuleId and ModuleProperty objects that define the manipulator to run and any other optional properties that are necessary for a manipulator.

Manipulators can use any number of module properties. A plug-in developer determines what module properties are necessary for a manipulator and whether the module properties are required or optional.

A CAS application developer can check the available module properties for a manipulator by running the getModuleSpec task in the CAS Server Command-line Utility:

  1. Start a command prompt and navigate to <install path>\CAS\version\bin.

  2. Type cas-cmd.bat (for Windows) or cas-cmd.sh (for UNIX), and specify the getModuleSpec task with the ID of the module whose properties you want to see. For example:

    C:\Endeca\CAS\<version>\bin>cas-cmd getModuleSpec -id com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator
    Substring Manipulator
    =====================
    [Module Information]
     *Id: com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator
    
     *Type: MANIPULATOR
     *Description: Generates a new property that is a substring of another property
    value
    
    [Substring Manipulator Configuration Properties]
    Group:
    -------
    Source Property:
     *Name: sourceProperty
     *Type: {http://www.w3.org/2001/XMLSchema}string
     *Required: true
     *Default Value:
     *Max Length: 255
     *Description:
     *Multiple Values: false
     *Multiple Lines: false
     *Password: false
     *Always Show: false
    
    Target Property:
     *Name: targetProperty
     *Type: {http://www.w3.org/2001/XMLSchema}string
     *Required: true
     *Default Value:
     *Max Length: 255
     *Description:
     *Multiple Values: false
     *Multiple Lines: false
     *Password: false
     *Always Show: false
    
    Substring Length:
     *Name: length
     *Type: {http://www.w3.org/2001/XMLSchema}integer
     *Required: true
     *Default Value: 2147483647
     *Min Value: -2147483648
     *Max Value: 2147483647
     *Description: Substring length
     *Multiple Values: false
     *Multiple Lines: false
     *Password: false
     *Always Show: false
    
    Substring Start Index:
     *Name: startIndex
     *Type: {http://www.w3.org/2001/XMLSchema}integer
     *Required: false
     *Default Value: 0
     *Min Value: -2147483648
     *Max Value: 2147483647
     *Description: Substring start index (zero based)
     *Multiple Values: false
     *Multiple Lines: false
     *Password: false
     *Always Show: false
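
The Required flags in this output can be used to check a configuration before submitting it. The sketch below is only an illustration: the property names and flags are copied by hand from the output above, and RequiredPropertyCheck, substringManipulatorSpec, and missingRequired are hypothetical names, not part of the CAS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RequiredPropertyCheck {

    // Property names mapped to their "Required" flag, transcribed from the
    // getModuleSpec output for the Substring Manipulator.
    static Map<String, Boolean> substringManipulatorSpec() {
        Map<String, Boolean> spec = new LinkedHashMap<>();
        spec.put("sourceProperty", true);
        spec.put("targetProperty", true);
        spec.put("length", true);
        spec.put("startIndex", false);
        return spec;
    }

    // Return the required property names that are absent from the
    // configured keys, in the order they appear in the spec.
    static List<String> missingRequired(Map<String, Boolean> spec, Set<String> configuredKeys) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : spec.entrySet()) {
            if (entry.getValue() && !configuredKeys.contains(entry.getKey())) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A configuration that forgot to set "length".
        Set<String> configured = new HashSet<>(Arrays.asList("sourceProperty", "targetProperty"));
        System.out.println(missingRequired(substringManipulatorSpec(), configured)); // prints [length]
    }
}
```

The optional startIndex property never appears in the result, because the spec marks it as not required and gives it a default value of 0.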

Here is an example of the manipulator configuration for a crawl that includes the manipulator in the above example.

// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);		
CasCrawler crawler = locator.getService();

// Create a new crawl Id with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo"); 

// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);

// Create a list for manipulator configurations, even if
// there is only one.
List<ManipulatorConfig> manipulatorList = new ArrayList<ManipulatorConfig>();

// Create a module ID for a Substring Manipulator.
// Set the module ID in the constructor. 
ModuleId moduleId = new ModuleId("com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator");

// Create a manipulator configuration for that module ID.
ManipulatorConfig manipulator = new ManipulatorConfig(moduleId);

// Create a list for the module property objects.
List<ModuleProperty> manipulatorPropsList = new ArrayList<ModuleProperty>();

// Create a module property for sourceProperty.
// Set key/values of the module property as strings in the constructor.
ModuleProperty sp = new ModuleProperty("sourceProperty", "Endeca.Document.Text");

// Set the module property in the module property list.
manipulatorPropsList.add(sp);

// Create a module property for targetProperty.
// Set key/values of the module property as strings in the constructor.
ModuleProperty tp = new ModuleProperty("targetProperty", "Truncated.Text");

// Set the module property in the module property list.
manipulatorPropsList.add(tp);

// Create a module property for length.
// Set key/values of the module property as strings in the constructor.
ModuleProperty length = new ModuleProperty("length", "20");

// Set the module property in the module property list.
manipulatorPropsList.add(length);

// Set the module property list in the manipulator configuration.
manipulator.setModuleProperties(manipulatorPropsList);
manipulatorList.add(manipulator);

// Set the list of manipulator configurations in the crawl configuration.
crawlConfig.setManipulatorConfigs(manipulatorList);

// Create the crawl.
crawler.createCrawl(crawlConfig);
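
To make the effect of the length and startIndex properties concrete, here is a rough sketch of a substring operation with the values configured above (length 20, startIndex left at its default of 0). It is not the manipulator's actual code. The source does not say how out-of-range values are handled, so the clamping below is an assumption, suggested only by the default length of 2147483647. SubstringSketch and substring are hypothetical names.

```java
public class SubstringSketch {

    // Take up to `length` characters of `value`, starting at `startIndex`.
    // ASSUMPTION: out-of-range start and length values are clamped to the
    // bounds of the string rather than rejected.
    static String substring(String value, int startIndex, int length) {
        int start = Math.max(0, Math.min(startIndex, value.length()));
        // Use long arithmetic so start + Integer.MAX_VALUE cannot overflow.
        long end = Math.min((long) start + Math.max(0, length), (long) value.length());
        return value.substring(start, (int) end);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";

        // length = 20, startIndex = 0: the first 20 characters.
        System.out.println("[" + substring(text, 0, 20) + "]");

        // Default length (Integer.MAX_VALUE): the whole value from startIndex on.
        System.out.println("[" + substring(text, 4, Integer.MAX_VALUE) + "]");
    }
}
```

Under these assumptions, the value stored in Truncated.Text would be at most 20 characters long, and shorter when the source value is shorter.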

The TextExtractionConfig class allows a client to specify document conversion parameters to override default values.

The TextExtractionConfig class has methods to set these document conversion options:

To set the text-extraction options:

Note that if you retrieve a TextExtractionConfig object from a configured crawl, each of the set methods has a corresponding get method, such as the getTimeout() method.

