Oracle® Secure Enterprise Search Administrator's Guide 11g Release 2 (11.2.2), Part Number E23427-01
The crawler uses a set of codes to indicate the result of crawling a URL. In addition to the standard HTTP status codes, it uses its own codes for situations that are not HTTP-related.
Only URLs with status 200 are indexed. If a record exists in EQ$URL with a status other than 200, then the crawler encountered an error while trying to fetch that document. A status of less than 600 maps directly to the HTTP status code.
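The classification rules above can be sketched as a small helper. This is an illustrative function, not part of the SES API; the bucket names are assumptions:

```python
def status_class(code):
    """Bucket a crawler status code per the rules above (illustrative only).

    Only status 200 means the URL was indexed. A status below 600 maps
    directly to the HTTP status code returned by the web server; higher
    values are crawler-specific.
    """
    if code == 0:
        return "enqueued"   # URL enqueued but not yet processed
    if code == 200:
        return "indexed"    # only these URLs are indexed
    if code < 600:
        return "http"       # standard HTTP status (e.g. 404)
    return "crawler"        # SES crawler-specific (e.g. 902, 1020)
```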
See Also:
"Status Code Definitions" in Hypertext Transfer Protocol -- HTTP/1.1

The following table lists the URL status codes, the document container codes used by the crawler plug-in, and the EQG codes.
| Code | Description | Document Container Code | EQG Code |
|---|---|---|---|
| 0 | A URL that has been enqueued but not yet processed | N/A | N/A |
| 200 | URL OK | STATUS_OK_FOR_INDEX | N/A |
| 400 | Bad request | STATUS_BAD_REQUEST | 30009 |
| 401 | Authorization required | STATUS_AUTH_REQUIRED | 30007 |
| 402 | Payment required | N/A | 30011 |
| 403 | Access forbidden | STATUS_ACCESS_FORBIDDEN | 30010 |
| 404 | Not found | STATUS_NOTFOUND | 30008 |
| 405 | Method not allowed | N/A | 30012 |
| 406 | Not acceptable | N/A | 30013 |
| 407 | Proxy authentication required | STATUS_PROXY_REQUIRED | 30014 |
| 408 | Request timeout | STATUS_REQUEST_TIMEOUT | 30015 |
| 409 | Conflict | N/A | 30016 |
| 410 | Gone | N/A | 30017 |
| 414 | Request URI too large | N/A | 30066 |
| 500 | Internal server error | STATUS_SERVER_ERROR | 10018 |
| 501 | Not implemented | N/A | 10019 |
| 502 | Bad gateway | STATUS_BAD_GATEWAY | 10020 |
| 503 | Service unavailable | STATUS_FETCH_ERROR | 10021 |
| 504 | Gateway timeout | N/A | 10022 |
| 505 | HTTP version not supported | N/A | 10023 |
| 902 | Timeout reading document | STATUS_READ_TIMEOUT | 30057 |
| 903 | Filtering failed | STATUS_FILTER_ERROR | 30065 |
| 904 | Out of memory error | STATUS_OUT_OF_MEMORY | 30003 |
| 905 | IOException in processing URL | STATUS_IO_EXCEPTION | 30002 |
| 906 | Connection refused | STATUS_CONNECTION_REFUSED | 30025 |
| 907 | Socket bind exception | N/A | 30079 |
| 908 | Filter not available | N/A | 30081 |
| 909 | Duplicate document detected | N/A | 30082 |
| 910 | Duplicate document ignored | STATUS_DUPLICATE_DOC | 30083 |
| 911 | Empty document | STATUS_EMPTY_DOC | 30106 |
| 951 | URL not indexed (this can happen if robots.txt specifies that a certain document should not be indexed) | STATUS_OK_BUT_NO_INDEX | N/A |
| 952 | URL crawled | STATUS_OK_CRAWLED | N/A |
| 953 | Metatag redirection | N/A | N/A |
| 954 | HTTP redirection | N/A | 30000 |
| 955 | Black list URL | N/A | N/A |
| 956 | URL is not unique | N/A | 31017 |
| 957 | Sentry URL (URL used as a placeholder) | N/A | N/A |
| 958 | Document read error | STATUS_CANNOT_READ | 30173 |
| 959 | Form login failed | STATUS_LOGIN_FAILED | 30183 |
| 960 | Document size too big; ignored | STATUS_DOC_SIZE_TOO_BIG | 30209 |
| 962 | Document excluded based on MIME type | STATUS_DOC_MIME_TYPE_EXCLUDED | 30041 |
| 964 | Document excluded based on boundary rules | STATUS_DOC_BOUNDARY_RULE_EXCLUDED | 30258 |
| 1001 | Data type is not TEXT/HTML | N/A | 30001 |
| 1002 | Broken network data stream | N/A | 30004 |
| 1003 | HTTP redirect location does not exist | N/A | 30005 |
| 1004 | Bad relative URL | N/A | 30006 |
| 1005 | HTTP error | N/A | 30024 |
| 1006 | Error parsing HTTP header | N/A | 30058 |
| 1007 | Invalid URL table column name | N/A | 30067 |
| 1009 | Binary document reported as text document | N/A | 30126 |
| 1010 | Invalid display URL | N/A | 30112 |
| 1011 | Invalid XML from OracleAS Portal | PORTAL_XMLURL_FAIL | 31011 |
| 1020-1024 | URL is not reachable. The status starts at 1020 and increases by one with each retry. After five tries (when it reaches 1025), the URL is deleted. | N/A | N/A |
| 1026-1029 | URL cannot be found. The status changes from 404 to 1026 when a URL cannot be found on re-crawl, and it increases by one with each retry. After five tries (when it reaches 1030), the URL is deleted. | N/A | N/A |
| 1111 | URL remained in the queue after a successful crawl, indicating that the crawler had a problem processing the document. To investigate, crawl the URL in a separate source and check the crawler log for errors. | N/A | N/A |
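The retry escalation described for codes 1020-1024 and 1026-1029 can be modeled as follows. This is an illustrative sketch, not part of the SES crawler; the function and constant names are assumptions, chosen only to mirror the thresholds stated in the table:

```python
# Thresholds taken from the table above (names are illustrative).
UNREACHABLE_FIRST, UNREACHABLE_DELETE = 1020, 1025
NOT_FOUND_FIRST, NOT_FOUND_DELETE = 1026, 1030

def next_status(current, reachable, found):
    """Model the status escalation after one re-crawl attempt.

    Returns the new status for the URL, or None when the URL would
    be deleted from EQ$URL after five failed tries.
    """
    if reachable and found:
        return 200                      # fetched OK; eligible for indexing
    if not reachable:
        first, delete_at = UNREACHABLE_FIRST, UNREACHABLE_DELETE
    else:                               # reachable but missing (was a 404)
        first, delete_at = NOT_FOUND_FIRST, NOT_FOUND_DELETE
    # First failure moves the URL to the start of the range;
    # each subsequent retry adds one.
    status = current + 1 if first <= current < delete_at else first
    return None if status >= delete_at else status
```

For example, a URL that keeps failing with "not reachable" moves through 1020, 1021, 1022, 1023, 1024, and is then deleted, while a 404 on re-crawl becomes 1026.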