Setting Up File Sources

A file source enables users to search files on the local computer. The following procedures identify the basic steps for setting up a file source using the Oracle SES Administration GUI. For more information on each page, click Help.

To create a file source: 

  1. On the Home page, select the Sources secondary tab to display the Sources page.

  2. For Source Type, select File.

  3. Click Create to display the Create File Source page.

  4. Complete the following fields. Click Help for additional information.

    • Source Name: Name that you assign to this file source.

    • Starting URL: The URL of the top directory where the crawler begins. See "Tips for Using File Sources".

  5. Click Create or Create & Customize.

  6. Follow the steps for crawling and indexing a source in "Getting Started Basics for the Oracle SES Administration GUI".

To customize a file source: 

  1. When creating a file source, click Create & Customize on the Create File Source page to display the Customize File Source page.

    Alternatively, after creating a source, click the Edit icon on the Home - Sources page.

  2. Click the following subtabs and make the desired changes.

    • Basic Settings: Source name, language, and starting URL.

    • URL Boundary Rules: Contents of a URL that include or exclude a page from crawling.

    • Document Types: Common document and image types that you can include or exclude from crawling. By default, Oracle SES crawls HTML, Excel, PowerPoint, Word, PDF, and plain text.

    • Display URL: URL that users see in place of the actual URL, for security reasons.

    • Authorization: Configuration of an Access Control List or an authorization manager plug-in.

    • Attribute Mapping: Maps document attributes to Oracle SES search attributes. See "File Document Attributes".

    • Crawling Parameters: Crawling conditions, such as depth, language, and HTTP cookies.

  3. Click Apply.
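The effect of the URL Boundary Rules and Document Types subtabs can be pictured as a filter applied to each candidate URL. The following Python sketch is only an illustration of that idea and is not how Oracle SES implements it. The rule lists, the extension set, and the function name should_crawl are all hypothetical, and the choice to let exclusion rules win over inclusion rules is an assumption.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical rule set; the names and values are illustrative only.
INCLUDE_CONTAINS = ["/docs/"]
EXCLUDE_CONTAINS = ["/docs/private/"]
ALLOWED_EXTENSIONS = {".html", ".pdf", ".txt"}


def should_crawl(url: str) -> bool:
    """Return True if the URL passes the illustrative boundary and type rules."""
    path = urlparse(url).path
    # Assumption: an exclusion rule overrides any inclusion rule.
    if any(fragment in path for fragment in EXCLUDE_CONTAINS):
        return False
    # The URL must contain at least one inclusion fragment.
    if not any(fragment in path for fragment in INCLUDE_CONTAINS):
        return False
    # Document type is approximated here by the file extension.
    extension = posixpath.splitext(path)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

With these sample rules, a PDF under /docs/ passes, while anything under /docs/private/ or with an extension outside the allowed set is skipped.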

File Document Attributes

Oracle SES crawls and searches various attributes. By default, Oracle SES maps these search attributes to common document attributes, such as AUTHOR, CREATOR, KEYWORD, and SUBJECT. You can enter and map additional document attributes.

Oracle SES crawls and indexes these document attributes:

  • Title

  • Author

  • Description

  • Host

  • Keywords

  • Language

  • LastModifiedDate

  • Mimetype

  • Subject

Tips for Using File Sources

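This section contains the following topics:

  • Crawling File Sources with Non-ASCII Character Sets

  • Crawling File Sources with Symbolic Links

  • Crawling File URLs

  • Crawling File Sources from a Network Drive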

Crawling File Sources with Non-ASCII Character Sets

For file sources to be crawled and displayed successfully in multibyte environments, the locale of the computer that starts the Oracle SES server must match the locale of the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.

If the locale of the installation environment is different, then Oracle SES must be restarted from an environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then restart Oracle SES with searchctl restartall from either a command prompt on Windows or an xterm on UNIX.
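Before restarting, it can help to confirm which locale the shell would hand to the server process. The following Python sketch is a minimal check under stated assumptions: it inspects only the standard POSIX variables LC_ALL, LC_CTYPE, and LANG in that order of precedence, and the function names effective_locale and locale_matches are hypothetical helpers, not part of Oracle SES.

```python
import os
from typing import Mapping, Optional


def effective_locale(env: Mapping[str, str]) -> Optional[str]:
    """Return the locale setting that governs character handling, if any."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def locale_matches(env: Mapping[str, str], expected_language: str) -> bool:
    """Check whether the effective locale starts with the expected tag."""
    current = effective_locale(env)
    return current is not None and current.startswith(expected_language)


if __name__ == "__main__":
    # Report the locale of the environment this script was started from.
    print(effective_locale(os.environ))
```

For the Korean example above, locale_matches(os.environ, "ko_KR") should return True in the shell from which searchctl restartall is run.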

Crawling File Sources with Symbolic Links

When crawling file sources on UNIX, the crawler resolves any symbolic link to its true directory path and enforces the boundary rule on the resolved path. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl has the following URLs:

  • /tmp/A

  • /tmp/A/B

  • /tmp2/beta

  • /tmp/A/C

If the inclusion rule is /tmp/A, then /tmp2/beta is excluded. The seed URL is treated as is.
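The resolve-then-check behavior described above can be sketched in Python. This is an illustration only, not the crawler's actual code: the function name crawlable_paths is hypothetical, it looks only at the direct children of the seed directory, and it assumes the inclusion rule is a directory prefix compared as written.

```python
import os
from typing import List


def crawlable_paths(seed: str, inclusion_prefix: str) -> List[str]:
    """List the seed and its children whose resolved path stays inside the rule."""
    rule = os.path.normpath(inclusion_prefix)
    accepted = [seed]  # the seed itself is taken as is
    for name in sorted(os.listdir(seed)):
        child = os.path.join(seed, name)
        # Resolve symbolic links to the true path before applying the rule.
        resolved = os.path.realpath(child)
        if os.path.commonpath([resolved, rule]) == rule:
            accepted.append(child)
    return accepted
```

For a directory A containing a real subdirectory B and a link C that points outside A, calling crawlable_paths(A, A) keeps A and A/B and drops C, because C resolves to a path outside the inclusion rule.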

Crawling File URLs

For a plug-in to return file URLs to the crawler, the file URLs must be fully qualified. For example, file://localhost/.
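As a rough illustration, a plug-in could build and check such URLs as in the Python sketch below. The helper names to_file_url and is_fully_qualified are hypothetical, and the sketch assumes that "fully qualified" means the URL carries the file scheme, a host such as localhost, and an absolute path.

```python
from urllib.parse import quote, urlparse


def to_file_url(absolute_path: str, host: str = "localhost") -> str:
    """Build a file URL with an explicit host from an absolute POSIX path."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute path")
    # quote() percent-encodes characters such as spaces but keeps "/".
    return f"file://{host}{quote(absolute_path)}"


def is_fully_qualified(url: str) -> bool:
    """Check for a file scheme, a host part, and an absolute path."""
    parts = urlparse(url)
    return (
        parts.scheme.lower() == "file"
        and bool(parts.netloc)
        and parts.path.startswith("/")
    )
```

A URL such as file:/tmp/A has no host part and would fail this check, whereas the output of to_file_url always passes it.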

If a file URL is to be used "as is", without going through Oracle SES to retrieve the file, then "file" in the Display URL Prefix should be upper case "FILE". For example, FILE://localhost/.... The starting URL is not case sensitive.

"As is" means that when a user clicks the search link of the document, the browser tries to use the specified file URL on the client computer to retrieve the file. Without that, Oracle SES uses this file URL on the server computer and sends the document through HTTP to the client computer.
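The case convention can be summarized with a small Python sketch. It only restates the rule described above and does not reflect Oracle SES internals. The function names are hypothetical.

```python
def is_served_as_is(display_url: str) -> bool:
    """True when the display URL uses the upper case FILE scheme."""
    # The check is deliberately case sensitive: "file://" does not qualify.
    return display_url.startswith("FILE://")


def describe_retrieval(display_url: str) -> str:
    """Say which computer resolves the file URL for a search result link."""
    if is_served_as_is(display_url):
        return "browser opens the file URL on the client computer"
    return "Oracle SES reads the file on the server and sends it over HTTP"
```

A display URL that begins with lower case file:// therefore falls into the second branch, where the server retrieves the document.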

Crawling File Sources from a Network Drive

If the files are crawled from a network drive, then the Oracle process should be started as a user who has access to the drive.

See Also:

"Required Tasks" for instructions on how to change the user running the Oracle process.
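One quick way to confirm access is to run a short check as the same user that starts the Oracle process. The Python sketch below assumes the network drive is mounted at a path visible to that user. The function name can_crawl_directory is a hypothetical helper.

```python
import os


def can_crawl_directory(path: str) -> bool:
    """Return True if the current user can list and enter the directory."""
    # R_OK: read directory entries; X_OK: traverse into the directory.
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)
```

If this returns False for the starting directory, the crawler is unlikely to see the files either, and the user running the process should be changed as described in "Required Tasks".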