You can globally exclude file formats by adding their file extensions to an exclusion line in the crawl-urlfilter.txt file.
The default crawl-urlfilter.txt configuration excludes these file types:
BMP (bitmap image), via the .bmp and .BMP extensions
CSS (Cascading Style Sheet), via the .css extension
EPS (Encapsulated PostScript), via the .eps extension
EXE (Windows executable), via the .exe extension
GIF (Graphics Interchange Format), via the .gif and .GIF extensions
GZIP (GNU Zip), via the .gz extension
ICO (icon image), via the .ico and .ICO extensions
JPG and JPEG (Joint Photographic Experts Group), via the .jpeg, .JPEG, .jpg, and .JPG extensions
MOV (Apple QuickTime Movie), via the .mov and .MOV extensions
MPG (Moving Picture Experts Group), via the .mpg extension
PNG (Portable Network Graphics), via the .png and .PNG extensions
RPM (Red Hat Package Manager), via the .rpm extension
SIT (StuffIt archive), via the .sit extension
TGZ (Gzipped Tar), via the .tgz extension
WMF (Windows Metafile), via the .wmf extension
ZIP (compressed archive), via the .zip extension
Text conversion for every file type other than HTML, text-based, and JavaScript files is performed by the CAS Document Conversion Module (if you have installed and enabled the module). As a rule of thumb, therefore, you should exclude any file format that the module does not support. For a list of the supported file formats, see the CAS Developer's Guide.
To exclude file formats:
In a text editor, open crawl-urlfilter.txt.
Locate the following lines:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|...|bmp|BMP)$
(The example is truncated for ease of reading.)
Modify the second line so that it lists the file extensions that you want to exclude.
Save and close the file.
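Before editing the file, it can help to try a candidate exclusion line against a few URLs. The following Python sketch builds a line in the same shape as the one shown in the procedure and tests URLs against it. The helper names (build_exclusion_line, is_excluded), the extension list, and the URLs are all hypothetical; the list is deliberately short and is not the shipped default. The sketch also assumes that the crawler applies the expression as a search against the full URL string, which may differ from the crawler's actual matching behavior.

```python
import re


def build_exclusion_line(extensions):
    """Build a '-\\.(ext|EXT|...)$' line with lower- and uppercase variants."""
    alternatives = []
    for ext in extensions:
        ext = ext.lstrip(".")
        for variant in (ext.lower(), ext.upper()):
            if variant not in alternatives:
                alternatives.append(variant)
    return r"-\.(" + "|".join(re.escape(a) for a in alternatives) + ")$"


def is_excluded(url, line):
    """Return True if the exclusion line (leading '-') matches the URL."""
    if not line.startswith("-"):
        raise ValueError("expected an exclusion line starting with '-'")
    # Assumption: the pattern is searched anywhere in the URL, so the
    # trailing '$' ties the extension to the very end of the string.
    return re.search(line[1:], url) is not None


# Hypothetical extension list, for illustration only.
line = build_exclusion_line(["gif", "jpg", "bmp", "iso"])
print(line)

for url in (
    "http://example.com/logo.GIF",
    "http://example.com/index.html",
    "http://example.com/photo.jpg?size=large",
    "http://example.com/logo.Gif",
):
    print(url, "->", "excluded" if is_excluded(url, line) else "kept")
```

Under these assumptions, a URL with a query string after the extension and a URL with a mixed-case extension such as .Gif are both kept, because the pattern matches only the listed spellings at the very end of the URL. This is why the default line lists both lowercase and uppercase forms for several extensions.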