The SourceConfig class allows a client to specify information about the data source that is being crawled. The SourceConfig class uses two methods to set data source properties: setModuleId() and setModuleProperties().
The setModuleId() method sets the module ID of the data source for this crawl. A module ID is a ModuleId object.

The string File System is the module ID for a file system crawl (whose content source is a file system). You must specify this module ID when you create a file system crawl.

Each CMS connector has its own unique module ID. Use the CasCrawler.listModules() method to find out which module IDs are available to your CAS Server.
The string com.endeca.cas.source.RecordStoreMerger is the module ID for a record store merger crawl (whose content source is one or more record store instances). You must specify this module ID when you create a record store merger crawl.
A plug-in developer specifies the ModuleId for a custom data source. A CAS application developer can determine the ModuleId for a custom data source by running the listModules task in the CAS Server Command-line Utility.
Each ModuleProperty is a key/value pair or a key/multi-value pair that provides configuration information about this data source. You specify a ModuleProperty by calling setKey() to specify a string representing the key and by calling setValues() to set one or more corresponding values. You then set each ModuleProperty on the SourceConfig object by calling addModuleProperty().
The SourceConfig object for a file system crawl requires a ModuleId that specifies "File System", a ModuleProperty to specify the seeds, and additional ModuleProperty objects for any optional source properties.
Here is an example of the source properties for a file system crawl.
// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);
CasCrawler crawler = locator.getService();
// Create a new crawl ID with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");
// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);
// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();
// Create a file system module ID.
ModuleId moduleId = new ModuleId("File System");
// Set the module ID in the source config.
sourceConfig.setModuleId(moduleId);
// Create a module property object for the seeds.
ModuleProperty seeds = new ModuleProperty();
// Set the key for seeds.
seeds.setKey("seeds");
// Set multiple values for seeds.
seeds.setValues("C:\\tmp\\itldocset", "C:\\tmp\\iapdocset");
// Set the seeds module property on the source config.
sourceConfig.addModuleProperty(seeds);
// Create a module property for gathering native file props.
ModuleProperty nativeFileProps = new ModuleProperty();
// Set the key for gathering native file properties.
nativeFileProps.setKey("gatherNativeFileProperties");
// Set the value to enable gathering native file properties.
nativeFileProps.setValues("true");
// Set the nativeFileProps module property on the source config.
sourceConfig.addModuleProperty(nativeFileProps);
// Create a module property object for expanding archives.
ModuleProperty extractArchives = new ModuleProperty();
// Set the key for extracting archive files.
extractArchives.setKey("expandArchives");
// Set the value to enable expanding archives.
extractArchives.setValues("true");
// Set the extractArchives module property on the source config.
sourceConfig.addModuleProperty(extractArchives);
// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);
// Create the crawl.
crawler.createCrawl(crawlConfig);
Note that if you retrieve a SourceConfig object from a configured crawl, you can call the getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.
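The key/multi-value shape of a ModuleProperty, and the kind of read-back loop you might write after calling getModuleProperties(), can be sketched with a simplified stand-in class. Note that ModuleProperty below is a hypothetical minimal version for illustration only, not the real CAS API type; only the getKey()/getValues() accessor names are taken from this section.

```java
import java.util.*;

public class ModulePropertyDemo {
    // Hypothetical stand-in for the CAS ModuleProperty type: a key
    // paired with one or more string values.
    static final class ModuleProperty {
        private final String key;
        private final List<String> values;
        ModuleProperty(String key, String... values) {
            this.key = key;
            this.values = Arrays.asList(values);
        }
        String getKey() { return key; }
        List<String> getValues() { return values; }
    }

    // Render a list of module properties as "key=v1,v2" lines, the way
    // you might log the result of getModuleProperties() on a retrieved
    // SourceConfig.
    static List<String> describe(List<ModuleProperty> props) {
        List<String> out = new ArrayList<>();
        for (ModuleProperty p : props) {
            out.add(p.getKey() + "=" + String.join(",", p.getValues()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<ModuleProperty> props = Arrays.asList(
            new ModuleProperty("seeds", "C:\\tmp\\itldocset", "C:\\tmp\\iapdocset"),
            new ModuleProperty("gatherNativeFileProperties", "true"));
        for (String line : describe(props)) {
            System.out.println(line);
        }
    }
}
```

The multi-value case (such as multiple seeds) and the single-value case (such as a boolean flag) both flow through the same key/values structure.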
The SourceConfig for a CMS crawl contains a mandatory ModuleId and additional ModuleProperty objects that define the CMS to crawl.

The source configuration (the SourceConfig object) for a CMS crawl requires the module ID (a string that identifies the CMS connector). Use the CasCrawler.listModules() method to find out which module IDs are available to your CAS Server.
Here is an example of the source properties for a CMS crawl.
// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);
CasCrawler crawler = locator.getService();
// Create a new crawl ID with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");
// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);
// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();
// Create a CMS module ID for a SampleLink repository.
ModuleId moduleId = new ModuleId("SampleLink");
// Set the module ID in the source configuration.
sourceConfig.setModuleId(moduleId);
// Create a list for the module property objects.
List<ModuleProperty> cmsPropsList = new ArrayList<ModuleProperty>();
// Configure the source properties that are specific to this CMS.
// This example sets properties for a SampleLink repository (a
// non-existent repository used to illustrate the process).
// Create a module property for the DNS name.
ModuleProperty sampleLinkServer = new ModuleProperty();
// Set the key/value pair for the url source property.
sampleLinkServer.setKey("url");
sampleLinkServer.setValues("http://samplelink45.mysite.com");
// Set the module property in the module property list.
cmsPropsList.add(sampleLinkServer);
// Create a module property object to enable archive expansion.
ModuleProperty extractArchives = new ModuleProperty();
// Set the key for archive expansion.
extractArchives.setKey("expandArchives");
// Set the value to enable archive expansion.
extractArchives.setValues("true");
// Set the module property in the module property list.
cmsPropsList.add(extractArchives);
// Create a module property for username.
ModuleProperty uname = new ModuleProperty();
// Set the key for username.
uname.setKey("username");
// Set the value and prepend the domain for Windows systems.
uname.setValues("SALES\\username");
// Set the module property in the module property list.
cmsPropsList.add(uname);
// Create a module property for password.
ModuleProperty upass = new ModuleProperty();
// Set the password key.
upass.setKey("password");
// Set the password value.
upass.setValues("endeca");
// Set the module property in the module property list.
cmsPropsList.add(upass);
// Set the module property list in the source configuration.
sourceConfig.setModuleProperties(cmsPropsList);
// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);
// Create the crawl.
crawler.createCrawl(crawlConfig);
Note that if you retrieve a SourceConfig object from a configured crawl, you can use its getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.
The SourceConfig object for a record store merger crawl requires a ModuleId that specifies com.endeca.cas.source.RecordStoreMerger, one or more ModuleProperty objects to specify the record store instances to merge, and additional ModuleProperty objects for optional source properties.
Module Property Key for a Record Store Merger | Key Value
---|---
dataRecordStores | The name of each data record store instance whose records are merged.
dimensionValueRecordStores | The name of each dimension value record store instance whose records are merged.
Here is an example of the source properties for a Record Store Merger crawl.
// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);
CasCrawler crawler = locator.getService();
// Create a new crawl ID with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");
// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);
// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();
// Create a record store merger module ID.
ModuleId moduleId = new ModuleId("com.endeca.cas.source.RecordStoreMerger");
// Set the module ID in the source config.
sourceConfig.setModuleId(moduleId);
// Create a module property object for the data record stores.
ModuleProperty dataRecStores = new ModuleProperty();
// Set the key for data record stores.
dataRecStores.setKey("dataRecordStores");
// Set multiple values for each data record store name.
dataRecStores.setValues("DataStore1", "DataStore2", "DataStore3");
// Set the data record store module property on the source config.
sourceConfig.addModuleProperty(dataRecStores);
// Create a module property object for the dimension value record stores.
ModuleProperty dvalRecStores = new ModuleProperty();
// Set the key for dimension value record stores.
dvalRecStores.setKey("dimensionValueRecordStores");
// Set multiple values for each taxonomy record store name.
dvalRecStores.setValues("DvalStoreCrawl1", "DvalStoreCrawl2", "DvalStoreCrawl3");
// Set the dimension value record store module property on the source config.
sourceConfig.addModuleProperty(dvalRecStores);
// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);
// Create the crawl.
crawler.createCrawl(crawlConfig);
Note that if you retrieve a SourceConfig object from a configured crawl, you can call the getModuleId() method to get the module ID and the getModuleProperties() method to retrieve the list of module properties.
The SourceConfig for a custom data source crawl contains a mandatory ModuleId and ModuleProperty objects that define the custom data source to crawl and any other optional properties that are necessary for a custom data source.
A plug-in developer specifies the ModuleId for a custom data source. A CAS application developer can determine the ModuleId for a custom data source by running the listModules task in the CAS Server Command-line Utility:
1. Start a command prompt and navigate to <install path>\CAS\<version>\bin.
2. Type cas-cmd.bat (for Windows) or cas-cmd.sh (for UNIX) and specify the listModules task with the module type (-t) option and an argument of SOURCE. For example:

C:\Endeca\CAS\<version>\bin>cas-cmd.bat listModules -t SOURCE
Sample Data Source
*Id: Sample Data Source
*Type: SOURCE
*Description: Sample Data Source for Testing
...
In the list of data sources returned by listModules, locate the custom data source and its Id value.
Custom data sources can use any number of module properties. A plug-in developer determines what module properties are necessary for a custom data source and whether the module properties are required or optional.
A CAS application developer can check the available module properties for a custom data source by running the getModuleSpec task in the CAS Server Command-line Utility:
1. Start a command prompt and navigate to <install path>\CAS\<version>\bin.
2. Type cas-cmd.bat (for Windows) or cas-cmd.sh (for UNIX) and specify the getModuleSpec task with the ID of the module whose source properties you want to see. For example:

C:\Endeca\CAS\<version>\bin>cas-cmd.bat getModuleSpec -id "Sample Data Source"
Sample Data Source
=================
[Module Information]
*Id: Sample Data Source
*Type: SOURCE
*Description: Sample Data Source for Testing
[Sample Data Source Configuration Properties]
Group: Basic Settings
---------------------
User name:
*Name: username
*Type: {http://www.w3.org/2001/XMLSchema}string
*Required: true
*Max Length: 256
*Description: The name of the user used to log on to the repository
*Multiple Values: false
*Multiple Lines: false
*Password: false
*Always Show: true
Password:
*Name: password
*Type: {http://www.w3.org/2001/XMLSchema}string
*Required: true
*Max Length: 256
*Description: The password used to log on to the repository
*Multiple Values: false
*Multiple Lines: false
*Password: true
*Always Show: true
...
Here is an example of the source properties for a custom data source crawl.
// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);
CasCrawler crawler = locator.getService();
// Create a new crawl ID with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");
// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);
// Create the source configuration.
SourceConfig sourceConfig = new SourceConfig();
// Create a module ID for a Sample Data Source repository.
// Set the module ID in the constructor.
ModuleId moduleId = new ModuleId("Sample Data Source");
// Set the module ID in the source configuration.
sourceConfig.setModuleId(moduleId);
// Create a list for the module property objects.
List<ModuleProperty> cmsPropsList = new ArrayList<ModuleProperty>();
// Create a module property for username.
// Set key/values of the module property as strings in the constructor.
ModuleProperty uname = new ModuleProperty("username", "SALES\\username");
// Set the module property in the module property list.
cmsPropsList.add(uname);
// Create a module property for password.
// Set key/values of the module property as strings in the constructor.
ModuleProperty upass = new ModuleProperty("password", "endeca");
// Set the module property in the module property list.
cmsPropsList.add(upass);
// Set the module property list in the source configuration.
sourceConfig.setModuleProperties(cmsPropsList);
// Set the source configuration in the crawl configuration.
crawlConfig.setSourceConfig(sourceConfig);
// Create the crawl.
crawler.createCrawl(crawlConfig);
The ManipulatorConfig for a manipulator contains a mandatory ModuleId and ModuleProperty objects that define the manipulator to run and any other optional properties that are necessary for a manipulator.
A plug-in developer specifies the ModuleId for a manipulator. A CAS application developer can determine the ModuleId for a manipulator by running the listModules task in the CAS Server Command-line Utility:
1. Start a command prompt and navigate to <install path>\CAS\<version>\bin.
2. Type cas-cmd.bat (for Windows) or cas-cmd.sh (for UNIX) and specify the listModules task with the module type (-t) option and an argument of MANIPULATOR. For example:

C:\Endeca\CAS\<version>\bin>cas-cmd listModules -t MANIPULATOR
Substring Manipulator
*Id: com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator
*Type: MANIPULATOR
*Description: Generates a new property that is a substring of another property value
In the list of manipulators returned by listModules, locate the manipulator and its Id value. That becomes the ModuleId.
Manipulators can use any number of module properties. A plug-in developer determines what module properties are necessary for a manipulator and whether the module properties are required or optional.
A CAS application developer can check the available module properties for a manipulator by running the getModuleSpec task in the CAS Server Command-line Utility:
1. Start a command prompt and navigate to <install path>\CAS\<version>\bin.
2. Type cas-cmd.bat (for Windows) or cas-cmd.sh (for UNIX) and specify the getModuleSpec task with the Id of the module whose source properties you want to see. For example:

C:\Endeca\CAS\<version>\bin>cas-cmd getModuleSpec -id com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator
Substring Manipulator
=====================
[Module Information]
*Id: com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator
*Type: MANIPULATOR
*Description: Generates a new property that is a substring of another property value
[Substring Manipulator Configuration Properties]
Group:
-------
Source Property:
*Name: sourceProperty
*Type: {http://www.w3.org/2001/XMLSchema}string
*Required: true
*Default Value:
*Max Length: 255
*Description:
*Multiple Values: false
*Multiple Lines: false
*Password: false
*Always Show: false
Target Property:
*Name: targetProperty
*Type: {http://www.w3.org/2001/XMLSchema}string
*Required: true
*Default Value:
*Max Length: 255
*Description:
*Multiple Values: false
*Multiple Lines: false
*Password: false
*Always Show: false
Substring Length:
*Name: length
*Type: {http://www.w3.org/2001/XMLSchema}integer
*Required: true
*Default Value: 2147483647
*Min Value: -2147483648
*Max Value: 2147483647
*Description: Substring length
*Multiple Values: false
*Multiple Lines: false
*Password: false
*Always Show: false
Substring Start Index:
*Name: startIndex
*Type: {http://www.w3.org/2001/XMLSchema}integer
*Required: false
*Default Value: 0
*Min Value: -2147483648
*Max Value: 2147483647
*Description: Substring start index (zero based)
*Multiple Values: false
*Multiple Lines: false
*Password: false
*Always Show: false
Here is an example of the source properties for a crawl that includes the manipulator in the above example.
// Connect to the CAS Server.
CasCrawlerLocator locator = CasCrawlerLocator.create("localhost", 8500);
CasCrawler crawler = locator.getService();
// Create a new crawl ID with the name set to Demo.
CrawlId crawlId = new CrawlId("Demo");
// Create the crawl configuration.
CrawlConfig crawlConfig = new CrawlConfig(crawlId);
// Create a list for manipulator configurations, even if
// there is only one.
List<ManipulatorConfig> manipulatorList = new ArrayList<ManipulatorConfig>();
// Create a module ID for a Substring Manipulator.
// Set the module ID in the constructor.
ModuleId moduleId = new ModuleId("com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator");
// Create a manipulator configuration.
ManipulatorConfig manipulator = new ManipulatorConfig(moduleId);
// Create a list for the module property objects.
List<ModuleProperty> manipulatorPropsList = new ArrayList<ModuleProperty>();
// Create a module property for sourceProperty.
// Set key/values of the module property as strings in the constructor.
ModuleProperty sp = new ModuleProperty("sourceProperty", "Endeca.Document.Text");
// Set the module property in the module property list.
manipulatorPropsList.add(sp);
// Create a module property for targetProperty.
// Set key/values of the module property as strings in the constructor.
ModuleProperty tp = new ModuleProperty("targetProperty", "Truncated.Text");
// Set the module property in the module property list.
manipulatorPropsList.add(tp);
// Create a module property for length.
// Set key/values of the module property as strings in the constructor.
ModuleProperty length = new ModuleProperty("length", "20");
// Set the module property in the module property list.
manipulatorPropsList.add(length);
// Set the module property list in the manipulator configuration.
manipulator.setModuleProperties(manipulatorPropsList);
manipulatorList.add(manipulator);
// Set the list of manipulator configurations in the crawl configuration.
crawlConfig.setManipulatorConfigs(manipulatorList);
// Create the crawl.
crawler.createCrawl(crawlConfig);
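The record-level effect of a substring manipulator configured this way can be sketched in plain Java. This stand-in only reproduces the substring behavior implied by the module spec (sourceProperty, targetProperty, startIndex, length); it is not the actual SubstringManipulator implementation, and records are modeled as simple string maps for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class SubstringManipulatorSketch {
    // Reads sourceProperty from the record and writes a substring of at
    // most `length` characters, starting at `startIndex`, to
    // targetProperty. Leaves the record unchanged if the source property
    // is absent or startIndex is past the end of the value.
    static Map<String, String> manipulate(Map<String, String> record,
                                          String sourceProperty,
                                          String targetProperty,
                                          int startIndex,
                                          int length) {
        Map<String, String> out = new HashMap<>(record);
        String value = record.get(sourceProperty);
        if (value != null && startIndex < value.length()) {
            int end = Math.min(value.length(), startIndex + length);
            out.put(targetProperty, value.substring(startIndex, end));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("Endeca.Document.Text",
                "This is a long body of extracted document text.");
        // Mirror the example configuration: length 20, default start index 0.
        Map<String, String> result = manipulate(record,
                "Endeca.Document.Text", "Truncated.Text", 0, 20);
        System.out.println(result.get("Truncated.Text"));
    }
}
```

Each record that passes through the crawl gains the new Truncated.Text property while the original Endeca.Document.Text property is preserved.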
The TextExtractionConfig class allows a client to specify document conversion parameters to override default values.

Note: The phrases text extraction and document conversion mean the same thing.
The TextExtractionConfig class has methods to set these document conversion options:

- Whether document conversion should be performed. The default for file system crawls and CMS connector crawls is true. The default for custom data source extensions is false unless the extension developer implements an interface that supports binary content. If set to true, the next options can be used.
- Whether to use local file copies to perform the text extraction (file system crawls only).
- The time that the CAS Server waits for text extraction results from the Document Conversion Module before retrying.
To set the text-extraction options:
1. Make sure that you have already created a SourceConfig and a CrawlConfig, and set the name and the seeds (if required for the source type) for the crawl.
2. Instantiate an empty TextExtractionConfig object. For example:

   TextExtractionConfig textOptions = new TextExtractionConfig();

3. Call the setEnabled() method to set a Boolean indicating that extraction should be performed:

   // Enable text extraction for this crawl.
   textOptions.setEnabled(true);

4. For file system crawls, you can use the setMakeLocalCopy() method to set a Boolean indicating whether files should be copied to a local temporary directory before text is extracted from them. The default for setMakeLocalCopy() is false. Custom data source extensions may also make local copies if the extension developer implemented the BinaryContentFileProvider interface of the CAS Extension API.

   // Enable use of local file copying.
   textOptions.setMakeLocalCopy(true);

5. If desired, call the setTimeout() method and specify an integer to set the amount of time (in seconds) that CAS waits for text extraction on a document to finish before attempting again. The default is 90 seconds.

   // Set timeout to 120 seconds.
   textOptions.setTimeout(120);

6. Call the CrawlConfig.setTextExtractionConfig() method to set the populated TextExtractionConfig object in the CrawlConfig object:

   // Set the text extraction options in the configuration.
   crawlConfig.setTextExtractionConfig(textOptions);
   crawler.createCrawl(crawlConfig);
Note that if you retrieve a TextExtractionConfig object from a configured crawl, each of the set methods has a corresponding get method, such as the getTimeout() method.
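The setter/getter pairing and the defaults described above can be sketched with a simplified stand-in class. This is an illustrative sketch, not the real TextExtractionConfig class; the field defaults (extraction enabled, no local copy, 90-second timeout) are the ones this section states for file system crawls.

```java
public class TextExtractionConfigSketch {
    // Hypothetical stand-in mirroring the options in this section.
    static final class TextExtractionConfig {
        private boolean enabled = true;        // default for file system and CMS crawls
        private boolean makeLocalCopy = false; // default for setMakeLocalCopy()
        private int timeout = 90;              // default timeout, in seconds

        void setEnabled(boolean enabled) { this.enabled = enabled; }
        boolean getEnabled() { return enabled; }
        void setMakeLocalCopy(boolean makeLocalCopy) { this.makeLocalCopy = makeLocalCopy; }
        boolean getMakeLocalCopy() { return makeLocalCopy; }
        void setTimeout(int seconds) { this.timeout = seconds; }
        int getTimeout() { return timeout; }
    }

    public static void main(String[] args) {
        TextExtractionConfig textOptions = new TextExtractionConfig();
        // Override two defaults, as in the procedure above.
        textOptions.setMakeLocalCopy(true);
        textOptions.setTimeout(120);
        // Each set method has a corresponding get method for read-back.
        System.out.println("enabled=" + textOptions.getEnabled());
        System.out.println("makeLocalCopy=" + textOptions.getMakeLocalCopy());
        System.out.println("timeout=" + textOptions.getTimeout());
    }
}
```

Reading the values back from a retrieved configuration follows the same pattern: call the get method that matches whichever set method populated the option.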