Overriding the server text/html MIME type

When the extension in a URL and the MIME type returned by the server disagree, the Web Crawler by default trusts the URL extension over the server MIME type. The mime.types.trust-server.text-html property is intended for crawls in which this default causes pages served as "text/html" to be resolved as some other type.

Assume, for example, that one of the URLs to be crawled is similar to the following:
http://www.xyz.com/scripts/InfoPDF.asp?FileName=4368.pdf

In this case, the actual page is an ASP page, and therefore the server returns "text/html" as the MIME type for the page. However, the crawler sees that the URL ends in a ".pdf" extension, and therefore resolves it as a PDF file (i.e., it overrides the MIME type returned by the server). The crawler then invokes the Document Conversion module on the page, even though it should not.

In the above example, if the mime.types.trust-server.text-html property is set to true, the crawler trusts the server's "text/html" MIME type instead of the URL extension when resolving this conflict. The Document Conversion module is therefore not invoked.
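
The decision described above can be pictured with a short Python sketch. It is illustrative only and is not the crawler's actual code: the resolve_mime_type function, the EXTENSION_TYPES table, and the way the extension is detected (a simple check on the end of the URL) are all assumptions made for this example.

```python
# Hypothetical extension-to-MIME table; the crawler's real mapping is not
# described in this section.
EXTENSION_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
}


def resolve_mime_type(url, server_type, trust_server_text_html=False):
    """Pick a MIME type when the URL extension and the server may disagree.

    trust_server_text_html stands in for the
    mime.types.trust-server.text-html property.
    """
    # With the property enabled, a server-reported "text/html" wins outright.
    if trust_server_text_html and server_type == "text/html":
        return server_type

    # Default behavior: a recognized extension overrides the server type.
    # Assumption: the extension is read from the end of the whole URL.
    lowered = url.lower()
    for extension, mime_type in EXTENSION_TYPES.items():
        if lowered.endswith(extension):
            return mime_type

    # No recognized extension: fall back to what the server reported.
    return server_type


url = "http://www.xyz.com/scripts/InfoPDF.asp?FileName=4368.pdf"
default_result = resolve_mime_type(url, "text/html")
trusted_result = resolve_mime_type(url, "text/html", trust_server_text_html=True)
```

With the property left at its default, default_result is "application/pdf", which corresponds to the case where the Document Conversion module is invoked. With the property enabled, trusted_result is "text/html".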

To override the server text/html MIME type:

  1. In a text editor, open the default.xml file.
  2. Set the mime.types.trust-server.text-html property to true.
  3. Save the file.
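
If the change has to be applied to several installations, the same edit could be scripted. The following Python sketch assumes that default.xml stores each setting as a property element with name and value children; that layout is not shown in this section, so check it against your own file first. The set_trust_server_text_html function is a hypothetical helper, not part of the Web Crawler.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "mime.types.trust-server.text-html"


def set_trust_server_text_html(config_path, value="true"):
    """Set the property in a configuration file.

    Assumption: settings are stored as
    <property><name>...</name><value>...</value></property> elements.
    """
    tree = ET.parse(config_path)
    root = tree.getroot()

    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY_NAME:
            # The property exists: update (or create) its value element.
            value_element = prop.find("value")
            if value_element is None:
                value_element = ET.SubElement(prop, "value")
            value_element.text = value
            break
    else:
        # The property is not present: append a new entry to the root.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = PROPERTY_NAME
        ET.SubElement(prop, "value").text = value

    tree.write(config_path, encoding="utf-8", xml_declaration=True)
```

Note that ElementTree discards XML comments when it parses a file with its default settings, so run a script like this against a copy of default.xml and compare the result before replacing the original.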