Of the two encoding properties, OriginalCharEncoding is read from the Content-Type HTTP header; if that fails, the Web Crawler tries to detect it from the downloaded content bytes.
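
The fallback order described here can be sketched roughly as follows. This is only an illustration, not the Web Crawler's actual code: the function names are hypothetical, and the byte-level detection (a byte-order-mark check followed by a search for a `<meta>` charset declaration in the first 1024 bytes) is an assumption, since this page does not say how the crawler inspects the content bytes.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Assumed heuristic: look for charset=... inside a <meta> tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # None when no charset parameter is present


def charset_from_bytes(body: bytes) -> Optional[str]:
    """Guess the charset from the content itself (hypothetical fallback)."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def detect_original_encoding(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Header first; fall back to the content bytes only if the header gives nothing."""
    return charset_from_content_type(content_type) or charset_from_bytes(body)
```

With this sketch, a response carrying `Content-Type: text/html; charset=ISO-8859-1` resolves from the header alone, while a response with a bare `text/html` header falls through to the byte inspection.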

The Web Crawler also keeps an alias map that maps character encodings which are often used in mislabelled documents to their correct encodings. The map is:

If the original encoding has an entry in the map, CharEncodingForConversion is set to the mapped value; otherwise, it is set to the same value as OriginalCharEncoding.
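
As a minimal sketch of this lookup, the rule amounts to a dictionary access with a default. The function name and the single alias entry below are placeholders chosen for illustration and are not taken from the Web Crawler's actual map; the case-insensitive comparison is likewise an assumption.

```python
from typing import Mapping

# Placeholder alias map: this entry is illustrative only, NOT the crawler's real map.
ENCODING_ALIASES = {
    "iso-8859-1": "windows-1252",
}


def encoding_for_conversion(
    original_encoding: str,
    aliases: Mapping[str, str] = ENCODING_ALIASES,
) -> str:
    """Return the mapped encoding if one exists, else the original value unchanged."""
    key = original_encoding.strip().lower()  # assumed: lookup ignores case
    return aliases.get(key, original_encoding)
```

Here `encoding_for_conversion("ISO-8859-1")` yields the mapped value, whereas an encoding with no entry, such as `"utf-8"`, is returned as given.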

