Performance considerations for language identification

Language identification requires a balance between accuracy and performance.

Where that balance lies depends on both your requirements and your data:

If the Web server being crawled provides incorrect encoding information, you can remove the encoding property (typically Endeca.Document.Encoding) before the parse phase. The PARSE_DOC expression then attempts to detect the encoding automatically. If the encoding of every document being crawled is known in advance, you can instead use an expression to add the correct encoding to each record before the parse expression runs.
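
The two options above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Endeca expression syntax: the record is modeled as a plain dictionary, and the function name prepare_encoding and its parameters known_encoding and trust_server are hypothetical. Only the property name Endeca.Document.Encoding comes from the text above.

```python
from typing import Optional

# Property name taken from the text above; everything else is hypothetical.
ENCODING_PROP = "Endeca.Document.Encoding"


def prepare_encoding(record: dict,
                     known_encoding: Optional[str] = None,
                     trust_server: bool = True) -> dict:
    """Return a copy of the record, adjusted before a parse step.

    - If the encoding is known in advance, set it explicitly.
    - Otherwise, if the server-reported encoding is not trusted,
      remove the property so the parser has to detect the encoding.
    - Otherwise, leave the record unchanged.
    """
    result = dict(record)  # do not mutate the caller's record
    if known_encoding is not None:
        result[ENCODING_PROP] = known_encoding
    elif not trust_server:
        result.pop(ENCODING_PROP, None)
    return result


if __name__ == "__main__":
    rec = {ENCODING_PROP: "ISO-8859-1", "url": "http://example.com/"}
    print(prepare_encoding(rec, trust_server=False))
    print(prepare_encoding(rec, known_encoding="UTF-8"))
```

Setting a known encoding skips automatic detection altogether, which is usually the cheaper path when the assumption holds for every document in the crawl.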