The following is a detailed view of how the CAS Server handles archive files:
An Endeca record is created for the archive file itself. This record has the Endeca.File.IsArchive property set to true.
In addition to the top-level documents (files or directories), nested archive files are also processed.
Document conversion (if enabled) is performed on all files within the archive, in accordance with document conversion filtering.
A separate Endeca record is created for each document (including nested archives) found in the archive. The record is processed as follows:
The record has the Endeca.File.IsInArchive property set to true. In addition, the Endeca.File.SourceArchive and Endeca.File.PathWithinSourceArchive properties are added with a reference to the parent archive.
The filtering behavior works the same for archived files and directories (that is, files and directories in an archive) as it does for non-archived files and directories.
For records from either file system or CMS crawls, the record Id is a concatenation of the Endeca.File.SourceArchiveId property and the Endeca.File.PathWithinSourceArchive property:
For file system records, the Endeca.FileSystem.Path property is the record Id. This property is a canonical string pointing to the file within the archive, and follows this format:
    /path/to/archive//path/to/archivedfile
For CMS records, the Endeca.Id property is the record Id. This property is a canonical string pointing to the file within the archive, and follows this format:
    reposId:itemId[:optionalContentPieceId]//path/to/archivedfile
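The concatenation described above can be sketched in a few lines of Python. This is an illustration only, not CAS code: the function name archived_record_id is hypothetical, and the sketch assumes that the Endeca.File.PathWithinSourceArchive value already begins with the // delimiter, as the formats above suggest.

```python
def archived_record_id(source_archive_id: str, path_within_source_archive: str) -> str:
    """Join an archive identifier and a path inside that archive into a record Id.

    Assumption (not confirmed by the product documentation): the
    path-within-archive value carries its own leading "//" delimiter.
    """
    if not path_within_source_archive.startswith("//"):
        raise ValueError("expected a path that starts with '//'")
    return source_archive_id + path_within_source_archive


# File system crawl: the archive is identified by its path.
fs_id = archived_record_id("/path/to/archive", "//path/to/archivedfile")

# CMS crawl: the archive is identified by a repository and item Id.
cms_id = archived_record_id("reposId:itemId", "//path/to/archivedfile")
```

Under these assumptions, fs_id and cms_id match the two formats shown above.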
Note
Path delimiters for the value of the PathWithinSourceArchive property appear as forward slashes (they are platform-independent).
Path delimiters for the value of the Endeca.FileSystem.Path property are platform-dependent, so in the case of Windows files, path delimiters on this property appear as backslashes. For example:
    C:\path\to\archive//path/to/archivedfile
In the case of nested archives, the Endeca.File.PathWithinSourceArchive property takes the following format:
    //path/to/nested/archive//path/within/nested/archive
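When post-processing records, it can be useful to break a nested value into one path per archive level. The following is a minimal sketch, assuming that // appears only as the level delimiter and never inside an entry path; the function name archive_levels is hypothetical.

```python
def archive_levels(path_within_source_archive: str) -> list[str]:
    """Split a path-within-archive value into one path per nesting level.

    Assumption: "//" is used only as the delimiter between archive levels.
    The empty piece produced by the leading "//" is discarded.
    """
    return [part for part in path_within_source_archive.split("//") if part]


levels = archive_levels("//path/to/nested/archive//path/within/nested/archive")
# levels[0] is the nested archive's path inside the outer archive;
# levels[1] is the entry's path inside the nested archive.
```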
While the properties of archived entries are obtained in an Endeca record, the entries themselves are not physically extracted from the archive (that is, no new files are permanently saved to disk).
If an archive has entries with identical names, the first entry that is processed is kept (that is, an Endeca record is created for it) and the duplicate entry is ignored.
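The two behaviors above (reading entries without extracting them to disk, and keeping only the first of several identically named entries) can be imitated with Python's standard zipfile module. This sketch illustrates the described behavior only; it is not how the CAS Server is implemented, and the function name first_entries and the sample entry names are hypothetical.

```python
import io
import warnings
import zipfile


def first_entries(archive_bytes: bytes) -> dict[str, bytes]:
    """Read a ZIP archive in memory and keep the first entry for each name."""
    kept: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():          # entries in archive order
            if info.is_dir():
                continue                    # directories carry no content
            if info.filename in kept:
                continue                    # duplicate name: ignore it
            kept[info.filename] = zf.read(info)  # read without writing to disk
    return kept


# Build a small sample archive that contains a duplicate entry name.
buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # zipfile warns on duplicates
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/a.txt", "first")
        zf.writestr("docs/a.txt", "second")
        zf.writestr("docs/b.txt", "other")

records = first_entries(buf.getvalue())
```

Passing the ZipInfo object (rather than the name) to read() matters here, because a lookup by name would resolve to the last entry with that name.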
Seeds are restricted to actual files, directories, or entries. That is, a seed cannot point to a file or directory inside an archive.
The above behavior is the default for all archives crawled. To avoid processing archives, disable the Expand archives option for the data source.

