You can globally exclude file formats by adding their file
extensions to an exclusion line in the
crawl-urlfilter.txt file.
The default
crawl-urlfilter.txt configuration excludes these
file types:
- BMP (bitmap image), via
the .bmp and .BMP extensions
- CSS (Cascading Style
Sheet), via the .css extension
- EPS (Encapsulated
PostScript), via the .eps extension
- EXE (Windows executable),
via the .exe extension
- GIF (Graphics Interchange
Format), via the .gif and .GIF extensions
- GZIP (GNU Zip), via the
.gz extension
- ICO (icon image), via the
.ico and .ICO extensions
- JPG and JPEG (Joint
Photographic Experts Group), via the .jpeg, .JPEG, .jpg, and .JPG extensions
- MOV (Apple QuickTime
Movie), via the .mov and .MOV extensions
- MPG (Moving Picture
Experts Group), via the .mpg extension
- PNG (Portable Network
Graphics), via the .png and .PNG extensions
- RPM (Red Hat Package
Manager), via the .rpm extension
- SIT (StuffIt archive),
via the .sit extension
- TGZ (Gzipped Tar), via the
.tgz extension
- WMF (Windows Metafile),
via the .wmf extension
- ZIP (compressed archive),
via the .zip extension
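
The list above shows that some formats are excluded in both lowercase and uppercase (for example, .bmp and .BMP) while others appear only in lowercase, which suggests that the match is case-sensitive. The following Python sketch illustrates that behavior. It is not the crawler's own code: the exact syntax of the exclusion line is an assumption (a leading "-" followed by a regular expression anchored at the end of the URL), and the names build_exclusion_line and is_excluded are hypothetical helpers introduced only for this example.

```python
import re

# Extensions copied from the default list above.
EXCLUDED_EXTENSIONS = [
    "bmp", "BMP", "css", "eps", "exe", "gif", "GIF", "gz", "ico", "ICO",
    "jpeg", "JPEG", "jpg", "JPG", "mov", "MOV", "mpg", "png", "PNG",
    "rpm", "sit", "tgz", "wmf", "zip",
]


def build_exclusion_line(extensions):
    """Build a hypothetical exclusion line: '-' plus a suffix regex.

    Assumption: the filter file treats a leading '-' as "exclude" and the
    rest of the line as a regular expression.
    """
    return "-\\.(" + "|".join(extensions) + ")$"


def is_excluded(url, exclusion_line):
    """Return True if the URL matches the regex part of the exclusion line."""
    pattern = re.compile(exclusion_line[1:])  # drop the leading '-'
    return pattern.search(url) is not None


line = build_exclusion_line(EXCLUDED_EXTENSIONS)
print(line)
print(is_excluded("http://example.com/logo.GIF", line))    # .GIF is listed
print(is_excluded("http://example.com/index.html", line))  # .html is not listed
print(is_excluded("http://example.com/clip.MPG", line))    # only .mpg is listed
```

Under these assumptions, a URL ending in .MPG is not excluded, because only the lowercase .mpg form is in the list. If a site serves such files with uppercase extensions, you would need to add the uppercase variant to the exclusion line as well.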
Text conversion for all file types other than HTML, text-based, and
JavaScript files is performed by the CAS Document Conversion Module (if you
have installed and enabled the module). As a rule of thumb, therefore, you
should exclude any file format that the module does not support. For a list
of the supported file formats, see the
CAS Developer's Guide.