Language identification requires a balance between accuracy and performance.
The right balance depends on both your requirements and your data:
- To increase accuracy, raise the number of bytes in the LANG_ID_BYTES attribute in the ID_LANGUAGE expression.
- To increase performance, either reduce the number of bytes or, if possible, use different criteria to determine the language. For example, if the documents are already segmented by language into folders, a conditional ADD_PROP expression can create the language property on each record, avoiding the ID_LANGUAGE expression altogether.
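The folder-based alternative can be sketched outside the pipeline to show the logic. The Python below is only an illustration, not pipeline expression syntax: the record is modeled as a plain dictionary, and the `Url` and `Language` keys, the `FOLDER_LANGUAGES` mapping, and both function names are assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a top-level folder name to a language code.
FOLDER_LANGUAGES = {"en": "en", "fr": "fr", "de": "de"}


def language_from_url(url):
    """Return the language implied by the first path segment, or None."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    return FOLDER_LANGUAGES.get(segments[0].lower())


def add_language_property(record):
    """Set the (hypothetical) Language key when the folder identifies it.

    Records whose folder is not in the mapping are left unchanged, so a
    byte-based identification step could still handle them.
    """
    language = language_from_url(record.get("Url", ""))
    if language is not None:
        record["Language"] = language
    return record
```

Because the lookup only inspects the URL, it does not read any document content, which is where the performance gain over byte-based identification would come from.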
If the Web server being crawled provides incorrect encoding information, you can remove the encoding property (typically Endeca.Document.Encoding) before the parse phase. The PARSE_DOC expression then attempts to detect the encoding automatically. If the encoding of all documents being crawled is known in advance, an expression can instead add the correct encoding to each record before the parse expression runs.
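The two encoding strategies above amount to a small decision made before parsing. The following Python sketch illustrates that decision only, again modeling a record as a dictionary; the `prepare_encoding` function and its `known_encoding` parameter are hypothetical and do not correspond to any pipeline expression.

```python
# Property name taken from the text above; the record model is an assumption.
ENCODING_PROPERTY = "Endeca.Document.Encoding"


def prepare_encoding(record, known_encoding=None):
    """Return a copy of the record that is ready for the parse phase.

    If the encoding is known in advance, overwrite whatever the server
    reported. Otherwise drop the property so that the parser falls back
    to automatic detection.
    """
    prepared = dict(record)
    if known_encoding is not None:
        prepared[ENCODING_PROPERTY] = known_encoding
    else:
        prepared.pop(ENCODING_PROPERTY, None)
    return prepared
```

Working on a copy keeps the server-reported value available in the original record, which can help when diagnosing which servers report incorrect encodings.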