The crawler uses a set of messages to log the crawling activities.
The following table lists the most common crawler error messages.
Message ID | Message | Comment | Action |
---|---|---|---|
30025 | {0}: Connection refused | The Web site refuses the URL access request. | Check the network setup environment of the computer running the crawler. |
30027 | Not allowed URL: {0} | A URL link violates boundary rules and is discarded. | Confirm that the URL indeed can be ignored. |
30030 | Malformed URL: {0} | The URL is not properly formed. | Verify the URL. |
30031 | Excluded by ROBOTS.TXT: {0} | The robots.txt rule from the Web site of the URL does not allow the URL to be crawled. | Configure the crawler to ignore robots.txt rules only if you manage the target Web site. Use the Home - Sources - Crawling Parameters page. |
30040 | Ignore URL: {0} | Redirection to this URL is not allowed by the boundary rules. | Confirm that the URL indeed should be ignored. |
30041 | {0}: excluded by MIME type inclusion rule, URL is {1} | The content type of the URL is not in the MIME type inclusion list. | Check whether the specified content type should be included. |
30054 | Excessively long URL: {0} | The URL string is too long, and the URL is ignored. | N/A |
30057 | {0}: timeout reading document | The target Web site is too slow sending page content. | Increase the crawler timeout threshold from the crawler configuration page. The default is 30 seconds. |
30083 | {0}: Duplicate document ignored | An identical document has been seen before in the same crawl session. This could be an indication of URL looping; that is, a generation of different URLs pointing back to the same page. | Check if the URL is generated correctly. If necessary, disable indexing dynamic URLs. Use the Home - Sources - Crawling Parameters page. |
30126 | Binary document reported as text document: "{0}" | A binary file has been sent by the Web site as a text document. In most cases, the URL in question points to a binary-format document, such as a PDF. | Correct the Web site content type setting for the URL, if possible. |
30188 | Login form not specified for "{0}" | Unable to perform HTML form login, because the name of the form is not set. In general, the name of the form should be automatically set by the crawler. | Identify the URL of the login page, and check whether this is a regular HTML form login page or an OracleAS Single Sign-on login page. Report the problem to Oracle support. |
30199 | Encountered an error while responding to the following HTTP authentication request: [{0}] | Unable to authenticate through the target URL. | Verify if the authentication request is basic authentication or digest authentication. Also confirm the provided authentication credentials. |
30201 | Missing authentication credentials | Authentication data is not available to access the URL. | Check the type of authentication needed and provide it through the source customization page. |
30206 | Ignoring "{0}" due to host (or redirected host) connection problem | The crawler cannot contact the server of the URL. | Verify that the Web site in question is up and try to re-crawl. |
30209 | Document size ({0}) too big, ignored: {1} | Document size exceeds the default limit of 10 megabytes. | Increase the document size limit on the Global Settings - Crawler Configuration page. |
30215 | Excluded by crawling depth limit({0}): {1} | Previously crawled URL is excluded due to newly reduced crawling depth limit. | Confirm that the depth limit is correct. |
30782 | Invalid document attribute {0} - ignored | Some attribute picked up from the document is not defined for the source. It is ignored. | Most likely this is safe to ignore, unless you know that this particular attribute should be defined for this source. In that case, contact Oracle Support. |
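
When a crawl log is long, it can help to count how often each message ID occurs before working through the actions in the table. The following is a minimal sketch, not part of the product. It assumes that each log line contains the numeric message ID as a standalone token; the actual log line layout is not described here, so adjust the pattern to match your logs. The names `ACTIONS`, `summarize`, and `report` are hypothetical, and the action strings are abbreviated from the table.

```python
import re
from collections import Counter

# Message IDs and suggested actions, abbreviated from the table above.
# Only a subset is listed; extend the mapping as needed.
ACTIONS = {
    30025: "Check the network setup of the computer running the crawler.",
    30027: "Confirm that the URL can be ignored.",
    30031: "Ignore robots rules only for Web sites you manage.",
    30057: "Increase the crawler timeout threshold.",
    30083: "Check URL generation; consider disabling dynamic URLs.",
    30209: "Increase the document size limit.",
}

# Assumption: the message ID appears in each line as a standalone
# five-digit token. The real log format may differ.
ID_PATTERN = re.compile(r"\b(\d{5})\b")


def summarize(lines):
    """Count occurrences of known message IDs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in ID_PATTERN.finditer(line):
            msg_id = int(match.group(1))
            if msg_id in ACTIONS:
                counts[msg_id] += 1
                break  # count each line at most once
    return counts


def report(counts):
    """Return (message ID, count, action) tuples, most frequent first."""
    return [(msg_id, n, ACTIONS[msg_id]) for msg_id, n in counts.most_common()]
```

For example, passing the lines of a log file to `summarize` and then calling `report` on the result yields the most frequent known messages first, each paired with its suggested action. Lines that contain no known ID are skipped.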