The following is a detailed view of how the CAS Server handles archive files:
An Endeca record is created for the archive file itself. This record has the Endeca.File.IsArchive property set to true.
In addition to the top-level documents (files or directories), nested archive files are also processed.
Document conversion (if enabled) is performed on all files within the archive, in accordance with document conversion filtering.
A separate Endeca record is created for each document (including nested archives) found in the archive. The record is processed as follows:
The record has the Endeca.File.IsInArchive property set to true. In addition, the Endeca.File.SourceArchive and Endeca.File.PathWithinSourceArchive properties are added with a reference to the parent archive.
The filtering behavior works the same for archived files and directories (that is, files and directories in an archive) as it does for non-archived files and directories.
For records from either file system or CMS crawls, the record Id is a concatenation of the Endeca.File.SourceArchiveId property and the Endeca.File.PathWithinSourceArchive property:
For file system records, the Endeca.FileSystem.Path property is the record Id. This property is a canonical string pointing to the file within the archive, and follows this format:
/path/to/archive//path/to/archivedfile
For CMS records, the Endeca.Id property is the record Id. This property is a canonical string pointing to the file within the archive, and follows this format:
reposId:itemId[:optionalContentPieceId]//path/to/archivedfile
Note
Path delimiters for the value of the PathWithinSourceArchive property appear as forward slashes (they are platform-independent).
Path delimiters for the value of the Endeca.FileSystem.Path property are platform-dependent, so in the case of Windows files, path delimiters on this property appear as backslashes. For example:
C:\path\to\archive//path/to/archivedfile
In the case of nested archives, the Endeca.File.PathWithinSourceArchive property takes the following format:
//path/to/nested/archive//path/within/nested/archive
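The record Id formats above can be illustrated with a small Python sketch. It is not CAS code: the function names build_record_id and split_archive_path are hypothetical, as are the sample paths in the usage note. It assumes that the PathWithinSourceArchive value carries a leading "//" (as the nested-archive format suggests) and that "//" appears only as the archive boundary marker.

```python
from typing import List


def build_record_id(source_archive_id: str, path_within_archive: str) -> str:
    """Concatenate the archive Id and the path within the archive.

    Assumption: the path within the archive normally starts with "//".
    If it does not, the separator is added so the result still matches
    the documented <archive>//<path> shape.
    """
    if not path_within_archive.startswith("//"):
        path_within_archive = "//" + path_within_archive.lstrip("/")
    return source_archive_id + path_within_archive


def split_archive_path(value: str) -> List[str]:
    """Split a record Id (or a PathWithinSourceArchive value) on "//".

    The first element of a record Id is the archive (or CMS item Id);
    each later element is a path inside the previous archive level.
    Empty segments (from a leading "//") are dropped.
    """
    return [part for part in value.split("//") if part]
```

For example, build_record_id("/path/to/archive", "//path/to/archivedfile") yields the file system format shown above, and split_archive_path applied to a nested Id such as /data/outer.zip//inner/nested.zip//docs/readme.txt separates the outer archive, the nested archive, and the file inside it. Because only "//" is treated as a boundary, the Windows backslashes in the archive part are left untouched.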
While the properties of archived entries are obtained in an Endeca record, the entries themselves are not physically extracted from the archive (that is, no new files are permanently saved to disk).
If an archive has entries with identical names, the first entry that is processed is kept (that is, an Endeca record is created for it) and the duplicate entry is ignored.
Seeds are restricted to actual files, directories, or entries. That is, seeds cannot point to archived files or directories.
The above behavior is the default for all archives crawled. To avoid processing archives, disable the Expand archives option for the data source.
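Two of the behaviors described above, reading archived entries without extracting them to disk and keeping only the first of several identically named entries, can be sketched with Python's standard zipfile module. This is only an illustration of the idea, not how the CAS Server is implemented. The function names and the sample entry names are hypothetical.

```python
import io
import warnings
import zipfile


def make_demo_archive() -> bytes:
    """Build a small in-memory ZIP that contains a duplicate entry name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names; that is intended here.
        warnings.simplefilter("ignore")
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("docs/a.txt", "first")
            zf.writestr("docs/a.txt", "second")  # duplicate name
            zf.writestr("docs/b.txt", "other")
    return buf.getvalue()


def list_unique_entries(archive_bytes: bytes) -> list:
    """Return (name, content) pairs, keeping the first entry per name.

    Entries are read from memory only; nothing is written to disk.
    """
    seen = set()
    kept = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.filename in seen:
                continue  # a later duplicate is ignored
            seen.add(info.filename)
            # Passing the ZipInfo object reads this exact entry,
            # not another entry that shares its name.
            kept.append((info.filename, zf.read(info)))
    return kept
```

Calling list_unique_entries(make_demo_archive()) returns one pair for docs/a.txt, holding the content of the first entry written, and one pair for docs/b.txt.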