Setting Up NTFS Sources for UNIX

This section contains information for NTFS sources on UNIX, which have additional setup steps not required on Windows. For NTFS sources on Windows, see "Setting Up NTFS Sources for Windows".

An NTFS source collects the content, metadata attributes, and ACLs of files in NTFS. An NTFS source supports incremental crawls: after the initial crawl, subsequent crawls collect only those documents that have changed since the last crawl. A document is recrawled if its content, metadata, or ACL information has changed. A file is also recrawled if it is moved between folders. Files deleted from NTFS are removed from the index during incremental crawls.
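The recrawl conditions above can be sketched as a simple comparison of two snapshots of the same file. This is an illustrative sketch only, not the connector's implementation: the FileSnapshot record and the needs_recrawl function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSnapshot:
    """Hypothetical record of what a crawl saw for one file."""
    path: str            # folder path plus file name
    content_hash: str    # digest of the file content
    metadata: tuple      # for example (("author", "jo"),)
    acl: frozenset       # principals that may read the file


def needs_recrawl(previous, current):
    """Return True if the file changed in any way listed in this section."""
    if previous is None:
        return True  # never crawled before
    return (
        previous.content_hash != current.content_hash
        or previous.metadata != current.metadata
        or previous.acl != current.acl
        or previous.path != current.path  # moved between folders
    )
```

A file whose content is identical but whose folder changed is still reported as needing a recrawl, which mirrors the rule for moved files.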

Important Notes for NTFS Sources

  • On the Windows server, the super user must have permission to read the NTFS file share.

  • The super user must be the impersonate user in the IIS Server.

  • The default behavior for NTFS for UNIX is to use a local file display URL, so the client computer must have access to the file share.

  • An ACL error may appear when crawling an NTFS source as a built-in user or group, such as an Administrator user. As a workaround, set explicit access to the administrator user: Security - Administrator (user), All Permissions.

  • Everyone is a special group that represents all current network users, including guests and users from other domains. When a user logs on to the network, the user is automatically added to the Everyone group. The NTFS connector supports the Everyone group. All documents for which the Everyone group has permission are crawled and accessed like public documents. There is no need to log in to the search application to access these public documents. However, if a user is denied access to a document while the Everyone group has access, then all users except for the denied user can see the document, and these users must log in to the search application to see the document.

  • When using Internet Explorer with files on a different domain, you must explicitly log on to Internet Explorer to open result links to those files.
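The Everyone group rule in the notes above can be read as a small decision function. The following is a minimal sketch under stated assumptions: the function can_view and its parameters are hypothetical, group membership other than Everyone is ignored, and the handling of documents that do not grant access to Everyone is an assumption rather than documented behavior.

```python
EVERYONE = "Everyone"


def can_view(allowed, denied, user=None):
    """Decide whether `user` (None means not logged in) may see a document.

    `allowed` and `denied` are sets of principal names taken from the ACL.
    """
    if EVERYONE in allowed and not denied:
        return True   # public document: no login required
    if user is None:
        return False  # any other document requires a logged-in user
    if user in denied:
        return False  # an explicit deny wins over Everyone
    return EVERYONE in allowed or user in allowed
```

For example, with Everyone allowed and one user denied, an anonymous visitor no longer sees the document, while every other logged-in user still does.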

Required Software

  • Microsoft Internet Information Server (IIS)

  • .NET Framework 2.0

Setting Up Identity Management with NTFS Sources

When an NTFS source is used, Oracle recommends that Active Directory be used as the identity management system for the Oracle SES instance. The Active Directory instance must be the same one that NTFS is using to authenticate users on the file system.

For the Oracle SES instance to read the files during crawling, add permission to each folder and file to make them accessible by the operating system user that runs the Oracle SES instance. Adding permissions to a folder automatically adds the same permissions to all the files and sub-folders in the folder.

NTFS sources rely on Active Directory for security permissions. Because permissions at the server local group level are not defined in Active Directory, these permissions are not supported when crawling NTFS sources. Permissions for server local groups (not domain local groups) are ignored during crawling. Permissions for domain groups and users inherited from server local groups also are ignored.

Creating an NTFS Source

To create an NTFS source on UNIX:

  1. On the Home page, select the Sources secondary tab.

  2. On the Sources page, select the NTFS source type and click Create.

  3. Complete the Create User-Defined Source page. Table 8-2 describes the parameters.

  4. Click Create or Create & Customize.

Table 8-2 NTFS Source Parameters for UNIX

Parameter Description

UNC Path

UNC path for the NTFS system to crawl; for example, \\MYSERVER\mysharedfolder

WebService Endpoint

Target endpoint (HTTP or HTTPS); for example:

https://mail.example.com/NTFSWebService/NTFSWebService.asmx

WebService User Name

User name to authenticate the NTFS WebService for the Endpoint.

WebServicePassword

Password for User Name.

Simple Include

To limit crawling, specify up to 50 colon-delimited (:) path inclusion boundary rules using simplified regular expressions. An inclusion rule specifies that a URL must contain, start with, or end with a term. Only the *, ^, and $ operators are permitted. An asterisk (*) is a wildcard. A caret (^) denotes the beginning of a URL, and a dollar sign ($) denotes the end of a URL. For example: ^https://*.oracle.com/.jpg$

Simple Exclude

To limit crawling, specify up to 50 colon-delimited (:) path exclusion boundary rules using simplified regular expressions. Only the *, ^, and $ operators are permitted.

Regular Expression Include

To limit crawling, specify up to 50 colon-delimited (:) path inclusion boundary rules using regular expressions with the full java.util.regex syntax. For example: ^https://.*\.oracle(?:corp){0,1}\.com

Regular Expression Exclude

To limit crawling, specify up to 50 colon-delimited (:) path exclusion boundary rules using regular expressions with the full java.util.regex syntax.

ACL Validation Attribute

ACL attribute used to validate the user. Enter USER_NAME for Active Directory or nickname for Oracle Internet Directory.

Domain Name

Domain name of the URL (UNC Path).

Incremental Crawl With File Change Detector

Enter true to use the File Change Detector, or false to use scan-based incremental crawl. See "Installing Oracle Search File Change Detector".
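
To make the difference between the simplified and the full regular expression rules in the table above concrete, the following sketch translates one simplified rule into an ordinary regular expression and tests it against sample URLs. It is an approximation under stated assumptions: simple_rule_to_regex and matches are hypothetical helpers, the sample rules and URLs are made up, and Python's re module stands in for java.util.regex (the constructs used here behave the same in both). The sketch handles a single rule and does not split a colon-delimited list.

```python
import re


def simple_rule_to_regex(rule):
    """Translate one simplified rule (*, ^, $ only) into a regex string."""
    starts = rule.startswith("^")
    ends = rule.endswith("$")
    body = rule
    if starts:
        body = body[1:]
    if ends:
        body = body[:-1]
    # Escape every literal character, then let each * match anything.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return ("^" if starts else "") + pattern + ("$" if ends else "")


def matches(rule, url):
    """True if the URL satisfies the simplified rule."""
    return re.search(simple_rule_to_regex(rule), url) is not None


# The full regular expression from the table, used directly.
FULL_RULE = r"^https://.*\.oracle(?:corp){0,1}\.com"

print(matches("^https://*.oracle.com/", "https://www.oracle.com/a.html"))
print(matches(".jpg$", "https://www.oracle.com/logo.jpg"))
print(re.search(FULL_RULE, "https://www.oraclecorp.com/") is not None)
```

A rule without ^ or $ behaves as a "contains" test, a leading ^ as "starts with", and a trailing $ as "ends with".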


After crawling an NTFS source, you may get a "No User Found Matching the Criteria" error message on the Home - Schedules - Data Synchronization page. If this error accompanies a crawl failure, then check that the principal is a valid user or group.

Installing and Configuring Windows Services

NTFS sources on UNIX require an NTFS agent to be installed and configured on the Windows domain where the NTFS files are to be crawled. The NTFS agent collects and sends content and metadata to the crawler plug-in on the Oracle SES computer in a crawl session. The communication protocol between Oracle SES and the NTFS agent is HTTP or HTTPS.

The NTFS agent must be installed on a Windows computer where IIS is present, and the computer must be in the same Windows domain where the NTFS file share to be crawled resides.

Typically, a remote file share is crawled with the permission of a domain administrator or a domain user with read privileges on the file share. The easiest way to configure this is to add the domain admin group to the 'administrators' group of the target computer.

The Oracle SES instance must connect to the same Active Directory instance that the Microsoft NTFS domain connects to.

Required Software

  • Windows .NET Framework 2.0

  • Internet Information Services (IIS) Manager

Required Tasks

Verify that Windows .NET Framework 2.0 is installed. If it is not, then download and install it from this site:

http://www.microsoft.com/download/en/default.aspx

Installing Oracle Search File Change Detector

By installing and configuring the Oracle Search File Change Detector service, you can realize significantly improved performance in incremental crawls. This service provides the crawler with a list of documents that are modified or deleted. This method is more efficient than scanning all files for changes.

The older, scan-based incremental crawl is still available and can be used when File Change Detector cannot be deployed on your NTFS system or under the conditions listed in "Configuring the NTFS Connector".

The following procedure installs File Change Detector in the Microsoft .NET Framework.

To install Oracle Search File Change Detector: 

  1. Copy OracleSearchFileChangeDetector.zip from ORACLE_HOME/search/lib/plugins/ntfsLinWin to the Windows server where Internet Information Services (IIS) is running.

  2. Unzip the contents of OracleSearchFileChangeDetector.zip to a folder. It contains two files:

    • OracleSearchFileChangeDetector.exe

    • OracleSearchFileChangeDetector.exe.config

  3. Open OracleSearchFileChangeDetector.exe.config in a text editor and modify the configuration settings as necessary. The settings are described in "Modifying the File Change Detector Configuration File".

  4. Open a command prompt window and navigate to the folder for .NET Framework Version 2.0. It has a name such as:

    C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727

  5. Install the OracleSearchFileChangeDetector service by issuing a command like the following, where path is the folder containing the configuration file:

    InstallUtil path\OracleSearchFileChangeDetector.exe
    

    For example:

    installutil d:\OracleSearchFileChangeDetector\OracleSearchFileChangeDetector.exe
    

    The Set Service Login dialog box is displayed.

  6. Enter the user credentials for the domain user identified in the ASP.NET Configuration Settings dialog box. For Username, use the format domain\username.

  7. Open the Windows Services utility and start the OracleSearchFileChangeDetector service.

  8. Install the NTFS Web service as described in "Installing the NTFS Web Service".

Modifying the File Change Detector Configuration File

The OracleSearchFileChangeDetector.exe.config file is the XML configuration file for the File Change Detector. When you add new sources, this file is automatically updated with the UNC path of the sources. However, if you make changes to the path of an already existing source, then you must restart File Change Detector for the new path to be watched.

Example 8-1 shows a sample configuration file.

Example 8-1 Oracle Search File Change Detector Configuration File

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
   <configSections>
      <section name="StartupFolders"
         type="FileChangeDetector.StartupFoldersConfigSection,
         OracleSearchFileChangeDetector"/>
   </configSections>
   <StartupFolders>
      <DefaultInternalBufferSizeValue>
         <add internalBufferSize="32768" />
      </DefaultInternalBufferSizeValue>
      <Folders>
         <add sourceName="NTFS1" path="10.255.255.255\writeHere"/>
         <add sourceName="NTFS2" path="10.255.255.255\Work"
            internalBufferSize="40960"/>
      </Folders>
      <Results>
         <add directory="C:\NTFS\Data" />
      </Results>
      <SESBufferSizeValue>
         <add sesBufferSize="1" />
      </SESBufferSizeValue>
   </StartupFolders>
</configuration>

The XML elements are described in the following topics.

DefaultInternalBufferSizeValue

Oracle Search File Change Detector uses a Windows API to capture file update events. The API uses an internal buffer to cache events. The buffer size is specified in the internalBufferSize parameter of the nested add element:

<DefaultInternalBufferSizeValue>
   <add internalBufferSize="n" />
</DefaultInternalBufferSizeValue>

The internalBufferSize parameter specifies the default buffer size for all folders that the File Change Detector monitors, as specified in the Folders element.

The internal buffer is allocated from non-paged memory, which cannot be swapped to disk. Therefore, keep the value of internalBufferSize as small as possible. Increase the value only when updates are frequent and highly concurrent, that is, more than 100 changes per second.

Folders

This element specifies the list of directories to be watched. Create one nested add element for each NTFS source:

<Folders>
   <add sourceName="name" path="path"/>
   <add sourceName="name" path="path" internalBufferSize="n"/>
</Folders>

The nested add element has these attributes:

  • sourceName: A unique name within the configuration file to identify the NTFS source. (Required)

  • path: The UNC path specified in the NTFS source configuration. (Required)

    To specify multiple UNC paths, use a colon (:) as the delimiter. For example:

    <add sourceName="ntfstest" path="\\server1\share1\Folder:\\server2\share1"/>
    
  • internalBufferSize: A value that overrides DefaultInternalBufferSizeValue for a source where extensive changes are expected. (Optional)
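
As a way to check how the Folders entries resolve, the following sketch reads a configuration fragment and reports, for each source, the individual UNC paths and the effective buffer size (the per-source value if present, otherwise the default). It is only an illustration: watched_folders is a hypothetical helper, the server and share names in SAMPLE are invented, and the File Change Detector itself may resolve these values differently.

```python
import xml.etree.ElementTree as ET

# Invented fragment in the same shape as the configuration file.
SAMPLE = r"""<configuration>
  <StartupFolders>
    <DefaultInternalBufferSizeValue>
      <add internalBufferSize="32768" />
    </DefaultInternalBufferSizeValue>
    <Folders>
      <add sourceName="NTFS1" path="\\server1\share1\Folder:\\server2\share1"/>
      <add sourceName="NTFS2" path="\\server1\Work" internalBufferSize="40960"/>
    </Folders>
  </StartupFolders>
</configuration>"""


def watched_folders(config_xml):
    """Return (sourceName, [UNC paths], effective buffer size) per source."""
    root = ET.fromstring(config_xml)
    default = int(
        root.find("StartupFolders/DefaultInternalBufferSizeValue/add")
        .get("internalBufferSize")
    )
    result = []
    for add in root.findall("StartupFolders/Folders/add"):
        paths = add.get("path").split(":")  # colon-delimited UNC paths
        size = int(add.get("internalBufferSize", default))
        result.append((add.get("sourceName"), paths, size))
    return result


for name, paths, size in watched_folders(SAMPLE):
    print(name, paths, size)
```

Running a check like this before restarting the service can catch a misspelled attribute name, because XML attribute names are case-sensitive.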

Results

Specifies the folder where the Oracle Search File Change Detector logs the changes. The value must be the same as the IncrementalCrawlData property in the Web service configuration.

<Results>
   <add directory="path" />
</Results>

SESBufferSizeValue

This element specifies the number of events cached in an internal buffer by the OracleSearchFileChangeDetector service before writing them to the log file. For example, a value of 1 indicates that every event is written immediately to the log file, while a value of 10 means that 10 events are cached before writing them to the log file.

Increase the value of the sesBufferSize parameter when capturing changes in folders where you expect extensive changes. However, the larger the buffer size is, the less up-to-date the changes in the log file are, because updates are less frequent. A reasonable value is the average number of concurrent updates to the crawled folders.

<SESBufferSizeValue>
   <add sesBufferSize="n" />
</SESBufferSizeValue>
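
The trade-off described for sesBufferSize can be illustrated with a small batching sketch. This is not the service's code: BufferedChangeLog and its write callback are hypothetical, and the sketch only models the idea of caching n events before they are written to the log.

```python
class BufferedChangeLog:
    """Cache change events and write them out in batches of ses_buffer_size."""

    def __init__(self, ses_buffer_size, write):
        if ses_buffer_size < 1:
            raise ValueError("ses_buffer_size must be at least 1")
        self.ses_buffer_size = ses_buffer_size
        self._write = write   # callable that receives a list of events
        self._pending = []

    def record(self, event):
        """Cache one event; write the batch once the buffer is full."""
        self._pending.append(event)
        if len(self._pending) >= self.ses_buffer_size:
            self.flush()

    def flush(self):
        """Write any cached events immediately."""
        if self._pending:
            self._write(list(self._pending))
            self._pending.clear()
```

With a buffer size of 1, every call to record writes at once. With a larger size, events that have been recorded but not yet flushed are missing from the log, which is why a larger value makes the log less up to date.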

Installing the NTFS Web Service

Install this service after you install Oracle Search File Change Detector, as described in "Installing Oracle Search File Change Detector".

To install the NTFS Web service: 

  1. Copy NTFSWebService.zip from ORACLE_HOME/search/lib/plugins/ntfsLinWin to the Windows server where Internet Information Services (IIS) is running.

  2. Unzip the files in NTFSWebService.zip into a permanent folder.

  3. Create a virtual directory on the Internet Information Server with the path pointing to the folder created in the previous step.

    1. Select Administrative Tools from the Windows Start menu, then select Internet Information Services (IIS) Manager.

    2. Expand the navigator in IIS Manager and right-click a Web site.

    3. Select New, then Virtual Directory.

    4. Follow the steps of the Virtual Directory Creation wizard.

  4. On the Virtual Directory Access Permissions page of the wizard, select Read and Run Scripts (such as ASP).

  5. Open NTFSWebService Properties.

  6. On the ASP.NET tab, verify that ASP.NET is version 2.0.

  7. On the General tab, enter the settings described in Table 8-3.

  8. On the Application tab, select Local Impersonation and enter the user credentials in the form domain\username.

    The application user must have these permissions:

    • Read on the NTFS Web Service physical directory.

    • Read on the file share to be crawled.

    • Write on the C:\WINDOWS\Microsoft.NET\Framework\version\Temporary ASP.NET Files folder.

      If the application user does not have access to this directory, then the Web service cannot load the required DLLs, and the following error is signaled when the Web service is accessed:

      Server Error in '/NTFSWS683343' Application
      Could not load file or assembly 'WEBSESNTFS' or one of its dependencies. Access is denied.
      

Table 8-3 ASP.NET Configuration Settings

Parameter Description

ServiceUsername

User name that authenticates Oracle SES to the NTFS Web service. You also enter this user name when creating the NTFS source. Oracle SES cannot access the Web service without the service username and password.

ServicePassword

Password for ServiceUsername. Ensure that this password is kept secure.

Batchsize

Determines the number of file URLs fetched for a Web service response. The NTFS connector processes a folder by fetching all the files in the folder.

FileChunkSize

Positive integer that specifies the chunk size. Large documents are sent to the NTFS connector in chunks of this size. For example, 1024000 divides the file into chunks of approximately 1 MB for sending over the Web.

Set the file chunk size to the data size that transfers most efficiently over your network.

IncrementalCrawlData

Path of the Results directory as specified in the Oracle Search File Change Detector configuration file. See "Modifying the File Change Detector Configuration File".
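
To show what a FileChunkSize value means in practice, the following sketch splits a byte stream into chunks and counts how many chunks a file of a given size needs. It assumes that the value is a number of bytes, which the 1024000 example suggests but the table does not state. The function names are hypothetical and do not correspond to Web service methods.

```python
import io


def read_in_chunks(stream, file_chunk_size):
    """Yield successive byte chunks of at most file_chunk_size bytes."""
    if file_chunk_size <= 0:
        raise ValueError("file_chunk_size must be a positive integer")
    while True:
        chunk = stream.read(file_chunk_size)
        if not chunk:
            break
        yield chunk


def chunk_count(file_size, file_chunk_size):
    """Number of chunks needed for a file of file_size bytes."""
    return -(-file_size // file_chunk_size)  # ceiling division


sizes = [len(c) for c in read_in_chunks(io.BytesIO(b"x" * 2500), 1024)]
print(sizes, chunk_count(5_000_000, 1024000))
```

A smaller chunk size means more Web service round trips for each large file, while a larger one means bigger individual responses, so the best value depends on the network between Oracle SES and the IIS server.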

On the Application tab, impersonate a user that has read permission on the shared folder. For example, if OSES is the domain and NTFSCrawler is a domain user that has read permission on the shared folder, then enter OSES\NTFSCrawler as the user name.


To verify that the NTFS Web service is installed correctly:  

  1. Open Internet Information Services (IIS) Manager.

  2. In the navigation tree, select NTFSWebService to display its contents in the right pane.

  3. Right-click NTFSWebService.asmx and choose Browse.

  4. Ensure that the Web service methods described in Table 8-4 are listed.

Table 8-4 NTFS Web Service Methods

Method Description

ClearFCDLog

Clears the current Oracle Search File Change Detector log.

ClearPreviousFCDLog

Clears the previous Oracle Search File Change Detector log.

GetDFList

Gets all the files and subfolders in a specified folder.

GetDocContainer

Gets the file and the access URL, display URL, and actual content after encoding. It also gets the ACL for the files and attributes of the file.

GetFileInParts

Gets the file after breaking it into chunks. The FileChunkSize parameter controls the chunk size.

GetMinimalMetadata

Fetches the ACL for the document and the last modified date of the file to determine whether the file has changed.

GetModifiedURLs

Gets a list of modified files and folders from the Oracle Search File Change Detector.


Configuring the NTFS Connector

The NTFS connector must be configured to perform incremental crawls with the Oracle Search File Change Detector. The connector has an additional parameter.

To configure the NTFS connector: 

  1. Open the Oracle SES Administration GUI, and select the Sources secondary tab.

  2. Create or edit the NTFS connector.

  3. Set the Incremental Crawl With File Change Detector parameter to true.

When the Incremental Crawl With File Change Detector parameter is set to true, the NTFS connector performs the incremental crawl using the detector change logs. It reverts automatically to a scan-based incremental crawl under these conditions:

  • The Oracle Search File Change Detector service is stopped.

  • The Oracle Search File Change Detector service is started after the previous crawl start time. Scan-based incremental crawl is performed because some changes in the NTFS system might not be captured by the File Change Detector.

  • The internal buffer of the File Change Detector overflowed. When the buffer overflows, the file change detector might not capture some changes.

To revert manually to a scan-based incremental crawl, set the Incremental Crawl With File Change Detector parameter to false.
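
The fallback conditions above amount to a short decision procedure, sketched below. The function incremental_crawl_mode and its parameters are hypothetical names used only for illustration. How the connector actually obtains the service state, the start times, and the overflow flag is not described in this section.

```python
from datetime import datetime


def incremental_crawl_mode(use_detector, service_running,
                           service_started, previous_crawl_started,
                           buffer_overflowed):
    """Return "detector" or "scan" for the next incremental crawl."""
    if not use_detector:
        return "scan"      # parameter set to false
    if not service_running:
        return "scan"      # detector service is stopped
    if service_started > previous_crawl_started:
        return "scan"      # changes made before the service started may be missing
    if buffer_overflowed:
        return "scan"      # some change events may have been dropped
    return "detector"


crawl = datetime(2024, 1, 1, 9, 0)
service = datetime(2024, 1, 1, 8, 0)
print(incremental_crawl_mode(True, True, service, crawl, False))
```

Each condition falls back to a scan for the same reason: the change log can no longer be trusted to list every change since the previous crawl.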

Known Issues

  • The Oracle Search File Change Detector does not capture changes to the top-level directories used in the crawler configuration (UNC Path). Changes to other directories within those folders are detected correctly.

  • Changes to the source configuration, such as boundary rules and maximum file size, do not affect incremental crawls that use the File Change Detector. For these changes to take effect, run a scan-based incremental crawl by setting the Incremental Crawl With File Change Detector parameter to false.

  • File Change Detector hangs after the Windows Server Active Directory is restarted. You must manually restart the File Change Detector service whenever Active Directory is restarted.