Crawling errors

Processing source documents, including retrieving and extracting text, can introduce problems. This section lists several common errors and their workarounds, where applicable.

PDF content in Endeca.Document.Text displays as binary data — If the Endeca Crawler processes PDF files and their content appears in your application as binary data, the PDF files may contain custom-encoded embedded fonts. Content that uses custom-encoded embedded fonts cannot always be displayed correctly. To work around the issue, a system font is substituted for the custom-encoded font. The substitution succeeds if the encoding of the substituted system font is the same as the custom encoding of the embedded font. When the substitution fails, you see binary data in Endeca.Document.Text.
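To find affected records after a crawl, a rough heuristic check on the extracted text can help. The following is a minimal sketch, not part of the Endeca Crawler: it assumes you have already read the value of Endeca.Document.Text into a Python string, and the function name `looks_binary` and the 30% threshold are illustrative choices.

```python
import unicodedata

# Unicode categories that rarely appear in readable text:
# Cc = control characters, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}
ALLOWED_CONTROLS = {"\t", "\n", "\r"}
REPLACEMENT_CHAR = "\ufffd"  # emitted by decoders for undecodable bytes


def looks_binary(text, threshold=0.3):
    """Return True if a large share of the characters look like binary residue.

    `threshold` is a hypothetical cutoff; tune it against your own data.
    """
    if not text:
        return False
    suspicious = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        if ch == REPLACEMENT_CHAR or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            suspicious += 1
    return suspicious / len(text) > threshold
```

A record flagged by this check is a candidate for the font problem described above. The check cannot confirm the cause, so inspect the fonts embedded in the source PDF before drawing conclusions.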

The following issues can occur when retrieving documents from HTTP hosts. Each entry explains how the spider handles the issue:

Connection timeout — The spider retries the request five times. Each timeout is logged in an informational message. After the fifth timeout, an error message is logged, and the record for the offending URL is created with its Endeca.Document.Status property set to “Fetch Aborted.”
URL not found — The spider logs a warning message that the URL could not be located and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”
Malformed URL — The spider logs a warning message that the URL is malformed and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”
Authentication failure — The spider logs a warning message that the URL could not be retrieved and creates a record with its Endeca.Document.Status property set to “Fetch Failed.”
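The handling described above can be summarized in a short sketch. This is an illustration of the documented behavior, not the spider's actual implementation. The `fetch` callable, the `FetchFailure` exception, the `url` and `body` record keys, and the log tuple format are all hypothetical. The sketch also reads "retries the request five times" as giving up on the fifth timeout, and it sets no status on success because the text above does not state that value.

```python
STATUS_PROPERTY = "Endeca.Document.Status"
MAX_TIMEOUTS = 5  # the text says an error is logged after the fifth timeout


class FetchFailure(Exception):
    """Hypothetical error: URL not found, malformed URL, or authentication failure."""


def fetch_with_policy(url, fetch, log):
    """Fetch `url` with the caller-supplied `fetch` function and return a record dict.

    `log` is a list that receives (level, message) tuples.
    """
    record = {"url": url}  # illustrative key, not an Endeca property name
    timeouts = 0
    while timeouts < MAX_TIMEOUTS:
        try:
            record["body"] = fetch(url)
            return record  # success: no status set, since the text gives no value
        except TimeoutError:
            timeouts += 1
            log.append(("INFO", f"timeout {timeouts} fetching {url}"))
        except FetchFailure as exc:
            # Not found, malformed, and authentication failures share one outcome.
            log.append(("WARNING", f"{exc}: {url}"))
            record[STATUS_PROPERTY] = "Fetch Failed"
            return record
    log.append(("ERROR", f"giving up on {url} after {MAX_TIMEOUTS} timeouts"))
    record[STATUS_PROPERTY] = "Fetch Aborted"
    return record


def always_times_out(url):
    """Stand-in fetcher that simulates a host that never responds."""
    raise TimeoutError


messages = []
result = fetch_with_policy("http://example.com/doc.pdf", always_times_out, messages)
print(result[STATUS_PROPERTY])  # prints "Fetch Aborted"
```

Because every failed URL still produces a record, downstream processing can select on Endeca.Document.Status to separate aborted fetches (possibly transient, worth recrawling) from failed fetches (usually a bad URL or bad credentials).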