CrawlingThreadService (Oracle Secure Enterprise Search Java API Reference)

Oracle Secure Enterprise Search Java API Reference
10g Release 1 (10.1.8.2)
E10465-01

oracle.search.sdk.crawler
Interface CrawlingThreadService

public interface CrawlingThreadService

CrawlingThreadService is an interface that a crawler plugin uses to perform crawl-related tasks. Its execution context is specific to the crawling thread that invokes the plugin's crawl() method.

Field Summary
`static int`	`DOC_EXCLUDED_BY_MIMETYPE` The document is excluded by mimetype.
`static int`	`DOC_EXCLUDED_BY_SIZE` The document is excluded by document size.
`static int`	`DOC_EXCLUDED_BY_URL_BOUNDARY` The document is excluded by URL boundary.
`static int`	`DOC_INCLUDED` The document should be included.

Method Summary
`int`	`checkDocumentExcluded(DocumentMetadata meta)` Checks if the document should be crawled or not.
`String`	`inferMimeType(String url)` Checks the mime type based on the URL suffix.
`void`	`markStatusNotChanged(DocumentMetadata meta)` Marks a URL entry as not requiring any changes or updates.
`void`	`submitForProcessing(DocumentContainer target)` Submits the document for processing.

Field Detail

DOC_INCLUDED

public static final int DOC_INCLUDED

The document should be included.

See Also: Constant Field Values

DOC_EXCLUDED_BY_URL_BOUNDARY

public static final int DOC_EXCLUDED_BY_URL_BOUNDARY

The document is excluded by URL boundary.

See Also: Constant Field Values

DOC_EXCLUDED_BY_MIMETYPE

public static final int DOC_EXCLUDED_BY_MIMETYPE

The document is excluded by mimetype.

See Also: Constant Field Values

DOC_EXCLUDED_BY_SIZE

public static final int DOC_EXCLUDED_BY_SIZE

The document is excluded by document size.

See Also: Constant Field Values

Method Detail

submitForProcessing

public void submitForProcessing(DocumentContainer target)
                         throws ProcessingException

Submits the document for processing. The document is indexed if its status code is DocumentContainer.STATUS_OK_FOR_INDEX. After processing is done, the document is automatically removed from the queue. Note that the DocumentMetadata in the submitted target is cleared automatically if the operation succeeds.

Parameters: target - the document container holding the content and metadata.
Throws: ProcessingException
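
Because the metadata in the target is cleared after a successful submit, a plugin that still needs a value from it (for logging, say) has to read that value first. The following is a minimal sketch of that ordering, not the SDK's own code. It assumes the SDK jar is not on the classpath, so `Meta`, `Container`, `Service`, and `ProcessingException` are simplified stand-ins whose fields are hypothetical; the `STATUS_OK_FOR_INDEX` value is a placeholder, and `submitAndDescribe` is an illustrative helper.

```java
// Stand-in types only: the real classes live in oracle.search.sdk.crawler.
final class SubmitSketch {
    /** Stand-in for the SDK's ProcessingException. */
    static class ProcessingException extends Exception {
        ProcessingException(String message) { super(message); }
    }

    /** Stand-in for DocumentMetadata; the url field is hypothetical. */
    static class Meta {
        String url;
        Meta(String url) { this.url = url; }
    }

    /** Stand-in for DocumentContainer; field names are hypothetical. */
    static class Container {
        // Placeholder value, not the SDK's real constant.
        static final int STATUS_OK_FOR_INDEX = 0;
        final Meta meta;
        final int status;
        Container(Meta meta, int status) {
            this.meta = meta;
            this.status = status;
        }
    }

    /** Mirrors only the documented submitForProcessing signature. */
    interface Service {
        void submitForProcessing(Container target) throws ProcessingException;
    }

    /**
     * Copies the URL out of the metadata before submitting, because the
     * metadata is documented as being cleared after a successful submit.
     */
    static String submitAndDescribe(Service svc, Container target)
            throws ProcessingException {
        String url = target.meta.url;      // read what is needed first
        svc.submitForProcessing(target);   // metadata may be cleared here
        return "submitted " + url;
    }
}
```

Reading `target.meta.url` after the call instead would observe whatever state the service leaves behind, which is the pitfall the note above warns about.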

markStatusNotChanged

public void markStatusNotChanged(DocumentMetadata meta)
                          throws ProcessingException

Marks a URL entry as not requiring any changes or updates. This will simply remove the entry from the URL Queue and will not re-index or perform any additional operations on this URL entry. This should be used when re-crawling content and when there is no change to a particular URL.

Parameters: meta - the metadata object corresponding to the URL entry
Throws: ProcessingException
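
The re-crawl case described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's implementation: `Meta`, `Service`, and `ProcessingException` are stand-ins that mirror only the documented signature, the `lastModified` field and the `lastSeen` map are hypothetical ways of detecting change, and `skipIfUnchanged` is an illustrative helper.

```java
import java.util.Map;

// Stand-in types only: the real classes live in oracle.search.sdk.crawler.
final class RecrawlSketch {
    /** Stand-in for the SDK's ProcessingException. */
    static class ProcessingException extends Exception {
        ProcessingException(String message) { super(message); }
    }

    /** Stand-in for DocumentMetadata; both fields are hypothetical. */
    static class Meta {
        final String url;
        final long lastModified;
        Meta(String url, long lastModified) {
            this.url = url;
            this.lastModified = lastModified;
        }
    }

    /** Mirrors only the documented markStatusNotChanged signature. */
    interface Service {
        void markStatusNotChanged(Meta meta) throws ProcessingException;
    }

    /**
     * Marks the entry as unchanged when its timestamp matches the one
     * recorded on the previous crawl. Returns true if the entry was
     * marked, false if the caller still has to fetch and submit it.
     */
    static boolean skipIfUnchanged(Service svc, Meta meta, Map<String, Long> lastSeen)
            throws ProcessingException {
        Long previous = lastSeen.get(meta.url);
        if (previous != null && previous.longValue() == meta.lastModified) {
            svc.markStatusNotChanged(meta);  // no re-index, entry leaves the queue
            return true;
        }
        lastSeen.put(meta.url, meta.lastModified);  // remember for the next crawl
        return false;
    }
}
```

A document seen for the first time, or one whose timestamp differs, falls through to the normal fetch-and-submit path.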

checkDocumentExcluded

public int checkDocumentExcluded(DocumentMetadata meta)

Checks if the document should be crawled or not. The check stops at the first rule that excludes the document, and only the status code for that rule is returned.
To avoid the overhead of processing excluded documents, call this method before enqueuing the document or, if no queue is used, before submitting it. If the document size or mimetype is not available, the rules based on size or mimetype are not applied. Rules are checked in this order: URL boundary, mimetype, size.

Exclusion checking is always performed internally when a document is submitted.

Parameters: meta - the document metadata
Returns: CrawlingThreadService.DOC_INCLUDED, CrawlingThreadService.DOC_EXCLUDED_BY_URL_BOUNDARY, CrawlingThreadService.DOC_EXCLUDED_BY_MIMETYPE, or CrawlingThreadService.DOC_EXCLUDED_BY_SIZE
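
One way a plugin might apply this pre-check is sketched below. It is an illustration only: `Meta` and `Service` are stand-ins that mirror the documented method and constant names, the numeric constant values are placeholders (the real ones are listed under Constant Field Values), and `describe` and `filterIncluded` are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types only: the real classes live in oracle.search.sdk.crawler.
final class ExclusionCheckSketch {
    /** Stand-in for DocumentMetadata; the url field is hypothetical. */
    static class Meta {
        final String url;
        Meta(String url) { this.url = url; }
    }

    /** Mirrors the documented method and constant names. */
    interface Service {
        // Placeholder values, NOT the SDK's real constant values.
        int DOC_INCLUDED = 0;
        int DOC_EXCLUDED_BY_URL_BOUNDARY = 1;
        int DOC_EXCLUDED_BY_MIMETYPE = 2;
        int DOC_EXCLUDED_BY_SIZE = 3;

        int checkDocumentExcluded(Meta meta);
    }

    /** Turns a status code into a short message, e.g. for a crawl log. */
    static String describe(int status) {
        switch (status) {
            case Service.DOC_INCLUDED:
                return "included";
            case Service.DOC_EXCLUDED_BY_URL_BOUNDARY:
                return "excluded by URL boundary";
            case Service.DOC_EXCLUDED_BY_MIMETYPE:
                return "excluded by mimetype";
            case Service.DOC_EXCLUDED_BY_SIZE:
                return "excluded by size";
            default:
                return "unknown status " + status;
        }
    }

    /** Keeps only the candidates that pass the check, before any enqueue or submit. */
    static List<Meta> filterIncluded(Service svc, List<Meta> candidates) {
        List<Meta> included = new ArrayList<>();
        for (Meta meta : candidates) {
            if (svc.checkDocumentExcluded(meta) == Service.DOC_INCLUDED) {
                included.add(meta);
            }
        }
        return included;
    }
}
```

Comparing against the named constants rather than literal numbers keeps the sketch independent of the actual values the SDK assigns.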

inferMimeType

public String inferMimeType(String url)

Checks the mime type based on the URL suffix.

Parameters: url - the document URL
Returns: the mimetype, such as "text/html". Returns null if no application is associated with the suffix or if the URL has no suffix.
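
Since the method can return null, callers need to handle that case. The sketch below shows one possible approach, assuming a fallback type is acceptable for the plugin. `Service` is a stand-in that mirrors only the documented signature, and `mimeTypeOrDefault` and its `fallback` parameter are hypothetical.

```java
// Stand-in type only: the real interface lives in oracle.search.sdk.crawler.
final class MimeTypeSketch {
    /** Mirrors only the documented inferMimeType signature. */
    interface Service {
        String inferMimeType(String url);
    }

    /**
     * Returns the inferred mimetype, or the given fallback when the
     * service returns null (unknown suffix or no suffix).
     */
    static String mimeTypeOrDefault(Service svc, String url, String fallback) {
        String inferred = svc.inferMimeType(url);
        return inferred != null ? inferred : fallback;
    }
}
```

Whether a fallback is appropriate depends on the plugin; leaving the mimetype unset is also an option when the null result should be preserved.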