Specifying File Source General Settings

Access the General Settings page by selecting PeopleTools, Search Framework, Search Designer Activity Guide, Search Definition, and then selecting a Source Type of File Source.

Use the General Settings page to specify the location of the files to be indexed as well as the crawler settings.

Image: File Source: General Settings page

This example illustrates the fields and controls on the File Source: General Settings page. You can find definitions for the fields and controls later on this page.


Field or Control

Definition

Description

Add a brief description to help identify the purpose of the search definition.

Source Type

Displays the type of search definition, such as Query, Web, File, and so on.

Starting URLs

In the Starting URLs grid, enter each location in your file system that contains files you want to expose to the Search Framework crawling process. Add a new row to the grid for each location.

Note: The starting URL is not case sensitive.

Note: For the search engine crawler to access the URL, the file starting URL must be fully qualified, as in file://localhost/.

On UNIX the starting URL format is:

  • To crawl local files use file://localhost/<directory structure>. For example:

    file://localhost/recruitment/resume/

  • To crawl from mounted file systems use file://localhost//<mounted_dir_path>. For example:

    file://localhost//dfs/recruitment/resume/

On Windows the starting URL format is:

  • To crawl local files use file://localhost/<Directory_Path>. For example:

    file://localhost/D:/recruitment/resume/

  • To crawl from a mapped drive use file://localhost//<machinename>/<shared_folder_path>/. For example:

    file://localhost//RTDC78067TLSBLD/recruitment/resume/
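The four formats above can be sketched as small string helpers. This is a minimal illustration only, not part of PeopleTools: the function names are hypothetical, and the sketch assumes that paths contain no characters that need URL encoding.

```python
# Hypothetical helpers that assemble starting URLs in the formats listed above.
# They only join strings; they do not check that the location exists.

def _normalize(path: str) -> str:
    """Use forward slashes and drop leading and trailing slashes."""
    return path.replace("\\", "/").strip("/")

def unix_local_url(directory: str) -> str:
    # file://localhost/<directory structure>
    return "file://localhost/" + _normalize(directory) + "/"

def unix_mounted_url(mounted_dir_path: str) -> str:
    # file://localhost//<mounted_dir_path>
    return "file://localhost//" + _normalize(mounted_dir_path) + "/"

def windows_local_url(directory_path: str) -> str:
    # file://localhost/<Directory_Path>, for example D:\recruitment\resume
    return "file://localhost/" + _normalize(directory_path) + "/"

def windows_share_url(machine_name: str, shared_folder_path: str) -> str:
    # file://localhost//<machinename>/<shared_folder_path>/
    return ("file://localhost//" + machine_name + "/"
            + _normalize(shared_folder_path) + "/")

if __name__ == "__main__":
    print(unix_local_url("/recruitment/resume"))
    print(unix_mounted_url("/dfs/recruitment/resume"))
    print(windows_local_url("D:\\recruitment\\resume"))
    print(windows_share_url("RTDC78067TLSBLD", "recruitment\\resume"))
```

Running the script prints the four example URLs shown in the lists above, which makes it easy to compare a generated value against the documented format before entering it in the grid.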

A search engine can crawl files on directories located on the server where the search engine is installed or network file paths accessible by the search engine.

When the search engine crawls files from a network drive, the Oracle process/service should be started as a user who has access to the network drive. To do this, modify the logon account of the OracleServiceSID and OracleSIDTNSListener services to match the domain administrator, and then restart both services.

In the Elasticsearch implementation, when the search engine crawls files from a network drive, the Oracle process/service should be started as a user who has at least read access to the network drive.

Crawler Timeout

Indicates the maximum allowed time to retrieve a file for crawling.

Max Document Size

The maximum document size in megabytes that the system will crawl. Larger documents are not crawled.

Enable Language Detection

When you enable language detection, the search engine automatically identifies the language of the document content and assigns the corresponding language code.

If the search engine crawler cannot determine a perfect match, it finds the best match from the trained set of languages and assigns that language. Otherwise, it assigns the default language specified in the crawler configuration.

Note: In the Elasticsearch implementation, automatic language detection is supported for all languages that are supported by PeopleSoft.

File URL Prefix

The part of the access URL that the system does not display in the search results, for security reasons.

This is an optional feature for cases where you need to hide the actual URL used for indexing. If you specify a File URL Prefix, you must also set the Display URL Prefix.

Display URL Prefix

The URL the system displays instead of the actual URL. For example, if the file URL is:

file://localhost/home/operation/doc/file.doc

and you want the display URL to appear as:

https://webhost/client/doc/file.doc

then specify the File URL Prefix as:

file://localhost/home/operation

and the Display URL Prefix as:

https://webhost/client

If you specify a Display URL Prefix, make sure that the files are reachable using the resulting URL. The search engine crawler replaces the URL string specified for the File URL Prefix with the Display URL Prefix.
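The replacement described here can be illustrated with a short sketch. It is not the crawler's actual implementation: the function name to_display_url is hypothetical, and the sketch assumes an exact, case-sensitive match on the leading part of the URL.

```python
def to_display_url(file_url: str, file_url_prefix: str,
                   display_url_prefix: str) -> str:
    """Swap the File URL Prefix at the start of file_url for the Display URL Prefix.

    Assumption: URLs that do not start with the prefix are returned unchanged.
    """
    if not file_url.startswith(file_url_prefix):
        return file_url
    # Keep everything after the prefix and attach it to the display prefix.
    return display_url_prefix + file_url[len(file_url_prefix):]


if __name__ == "__main__":
    print(to_display_url(
        "file://localhost/home/operation/doc/file.doc",
        "file://localhost/home/operation",
        "https://webhost/client",
    ))
```

With the values from the example above, the sketch prints https://webhost/client/doc/file.doc.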

Note: When you select Display URL Prefix, you must consider the following:

A File URL Prefix must be a fully resolved path; it should not be a symbolic link.
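Before entering a File URL Prefix, you can check on the file server whether the directory behind it is a fully resolved path. The following is a minimal sketch that assumes you run it on the machine that hosts the files and pass it the local directory path rather than the file:// URL; the function name is hypothetical.

```python
import os

def is_fully_resolved(path: str) -> bool:
    """Return True if no component of path is a symbolic link."""
    absolute = os.path.abspath(path)
    # realpath follows every symbolic link; if nothing changes,
    # the path was already fully resolved.
    return os.path.realpath(absolute) == absolute


if __name__ == "__main__":
    # /home/operation is the directory from the example above.
    print(is_fully_resolved("/home/operation"))
```

If the check returns False, os.path.realpath on the same path shows the resolved location that could be used for the prefix instead.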