Excluding file formats

You can globally exclude file formats by adding their file extensions to an exclusion line in the crawl-urlfilter.txt file.

The default crawl-urlfilter.txt configuration excludes these file types:
  • BMP (bitmap image), via the .bmp and .BMP extensions
  • CSS (Cascading Style Sheet), via the .css extension
  • EPS (Encapsulated PostScript), via the .eps extension
  • EXE (Windows executable), via the .exe extension
  • GIF (Graphics Interchange Format), via the .gif and .GIF extensions
  • GZIP (GNU Zip), via the .gz extension
  • ICO (icon image), via the .ico and .ICO extensions
  • JPG and JPEG (Joint Photographic Experts Group), via the .jpeg, .JPEG, .jpg, and .JPG extensions
  • MOV (Apple QuickTime Movie), via the .mov and .MOV extensions
  • MPG (Moving Picture Experts Group), via the .mpg extension
  • PNG (Portable Network Graphics), via the .png and .PNG extensions
  • RPM (Red Hat Package Manager), via the .rpm extension
  • SIT (StuffIt archive), via the .sit extension
  • TGZ (Gzipped Tar), via the .tgz extension
  • WMF (Windows Metafile), via the .wmf extension
  • ZIP (compressed archive), via the .zip extension

Text conversion for every file type other than HTML, text-based, and JavaScript files is performed by the CAS Document Conversion Module (if you have installed and enabled the module). As a rule of thumb, therefore, you should exclude any file format that the module does not support. For a list of the supported file formats, see the CAS Developer's Guide.

To exclude file formats:
  1. In a text editor, open crawl-urlfilter.txt.
  2. Locate the following lines:
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|...|bmp|BMP)$
    (the example is truncated for ease of reading)
  3. Modify the second line so that it lists the file extensions that you wish to exclude, separated by | characters.
  4. Save and close the file.
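
Before you edit the file, it can help to check how a modified exclusion line would behave. The following Python sketch is illustrative only and is not part of the product. It assumes that a line beginning with - excludes any URL in which the regular expression is found, and it uses a shortened, hypothetical exclusion line rather than the full default. The helper names add_extensions and is_excluded are invented for this example, and Python's re module is used only to approximate the crawler's own regular expression matching.

```python
import re

# Hypothetical, shortened exclusion line used only for illustration.
# The real default line in crawl-urlfilter.txt lists more extensions.
EXCLUSION_LINE = r"-\.(gif|GIF|jpg|JPG|bmp|BMP)$"


def add_extensions(line, extensions):
    """Return the exclusion line with extra extensions added to the (a|b|c) group."""
    prefix, rest = line.split("(", 1)
    existing, suffix = rest.rsplit(")", 1)
    merged = existing.split("|")
    for ext in extensions:
        if ext not in merged:  # avoid listing an extension twice
            merged.append(ext)
    return prefix + "(" + "|".join(merged) + ")" + suffix


def is_excluded(line, url):
    """Return True if the URL matches the pattern of an exclusion ('-') line."""
    if not line.startswith("-"):
        raise ValueError("expected an exclusion line that starts with '-'")
    # Assumption: the pattern only has to be found somewhere in the URL.
    return re.search(line[1:], url) is not None


updated = add_extensions(EXCLUSION_LINE, ["pdf", "PDF"])
print(updated)  # -\.(gif|GIF|jpg|JPG|bmp|BMP|pdf|PDF)$
print(is_excluded(updated, "http://www.example.com/report.PDF"))  # True
print(is_excluded(updated, "http://www.example.com/index.html"))  # False
```

Because the pattern ends with $, it only matches when the extension is the last part of the URL. A URL such as http://www.example.com/report.pdf?page=2 would not be excluded by this line. Matching is also case-sensitive, which is why the default line lists both lowercase and uppercase forms of several extensions.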