For the two encoding properties, the OriginalCharEncoding is retrieved from the content-type set in the HTTP header; if that fails, the Web Crawler tries to retrieve it from the downloaded content bytes.
The Web Crawler also keeps an alias map that maps character encodings which are often used in mislabelled documents to their correct encodings. The map is:
If the encoding is mapped to a value, then CharEncodingForConversion is set to the mapped value; otherwise, it is set to the same value as the OriginalCharEncoding value.
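
The resolution order described above can be illustrated with a minimal Python sketch. It is not the Web Crawler's implementation: the function names, the regular expressions, the lower-casing of encoding names, and the (None, None) result when no encoding is found are all assumptions made for illustration, and the entries in ALIAS_MAP are hypothetical placeholders rather than the crawler's actual alias map.

```python
import re

# Hypothetical alias map. These entries are illustrative only and are NOT
# the Web Crawler's actual table.
ALIAS_MAP = {
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
}

# Assumed patterns for locating a charset label.
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = HEADER_CHARSET.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_bytes(content):
    """Return the charset declared in an HTML meta tag, or None.

    Only the first 4096 bytes are scanned (an assumed limit).
    """
    match = META_CHARSET.search(content[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def resolve_encodings(content_type, content):
    """Return (OriginalCharEncoding, CharEncodingForConversion)."""
    # Step 1: prefer the HTTP header; fall back to the downloaded bytes.
    original = charset_from_header(content_type) or charset_from_bytes(content)
    if original is None:
        # Assumption: the source does not say what happens in this case.
        return None, None
    # Step 2: use the mapped value if one exists, otherwise the original.
    return original, ALIAS_MAP.get(original, original)
```

For example, under these assumptions resolve_encodings("text/html; charset=ISO-8859-1", b"") yields an original encoding of iso-8859-1 and a conversion encoding of windows-1252, while a page labelled utf-8 keeps the same value for both properties.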