public class LCSDetectionHTMLReader extends LCSDetectionReader
The LCSDetectionHTMLReader class extends the LCSDetectionReader class to support language and encoding detection for input in HTML format.
The flag selects which character set is used as the input character set: METAVAL uses the character set given in the HTML meta value, and DETECTVAL uses the detected character set. The default flag value is DETECTVAL.
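The choice between the two flags can be pictured with a small JDK-only sketch. It does not use or reproduce the library's implementation: the class `CharsetChoiceSketch`, its helper methods, the regular expression, and the flag values 1 and 2 are all hypothetical stand-ins, and the fallback to the detected character set when no meta value is present is an assumption, since this page does not describe that case.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetChoiceSketch {
    // Hypothetical stand-ins for the METAVAL and DETECTVAL flags;
    // the real constant values are not given on this page.
    public static final int METAVAL = 1;
    public static final int DETECTVAL = 2;

    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="..."> and the http-equiv Content-Type form.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    public static String metaCharset(byte[] sample) {
        // ISO-8859-1 maps each byte to one char, so ASCII markup survives decoding.
        String text = new String(sample, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Picks the input charset by flag; assumed to fall back to the detected name. */
    public static String chooseCharset(byte[] sample, String detected, int flag) {
        if (flag == METAVAL) {
            String declared = metaCharset(sample);
            if (declared != null) {
                return declared;
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"Shift_JIS\"></head></html>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(chooseCharset(html, "UTF-8", METAVAL));   // prints Shift_JIS
        System.out.println(chooseCharset(html, "UTF-8", DETECTVAL)); // prints UTF-8
    }
}
```

A page whose meta value disagrees with its actual bytes is the case where the two flags give different results, which is presumably why the choice is left to the caller.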
The detection sampling length specifies how many bytes of plain text the detection feature examines. The default sampling length is 1K. Generally, LCSD handles the language/encoding detection and you do not need to set this value; change it only when you want to control how much text is sampled.
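As a rough illustration of what a sampling length means, the JDK-only sketch below reads at most `len` bytes from a stream and then rewinds it so the whole input can still be read afterwards. The class `SamplingSketch`, the method `peek`, and the value 1024 for "1K" are assumptions made for this example. Note also that it samples raw bytes, whereas the text above speaks of bytes of plain text, so the library's behavior for HTML input may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class SamplingSketch {
    // Assumed to correspond to the documented 1K default sampling length.
    public static final int DEFAULT_SAMPLE = 1024;

    /** Reads at most len bytes for detection, then rewinds the stream. */
    public static byte[] peek(InputStream in, int len) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(len);
        byte[] buf = new byte[len];
        int total = 0;
        // A single read may return fewer bytes than requested, so loop.
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n < 0) {
                break; // end of stream before the sample was full
            }
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[3000]);
        byte[] sample = peek(in, DEFAULT_SAMPLE);
        System.out.println(sample.length);   // prints 1024
        System.out.println(in.available());  // prints 3000
    }
}
```

A larger sample gives a detector more evidence at the cost of more work up front, which is the trade-off the sampling length lets you adjust.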
You can get the detection results from the LCSDResultSet class if needed.
Any read method throws UTFDataFormatException if the source is UTF-8 data and an invalid UTF-8 sequence is found.
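The following JDK-only sketch shows one way such a failure can arise and be reported. It is not the library's code: `StrictUtf8Sketch` and `decodeStrict` are hypothetical names, and the sketch simply uses a strict `CharsetDecoder` and rethrows malformed input as `java.io.UTFDataFormatException`.

```java
import java.io.UTFDataFormatException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Sketch {
    /** Decodes bytes as UTF-8 and fails on any invalid sequence instead of replacing it. */
    public static String decodeStrict(byte[] bytes) throws UTFDataFormatException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Surface the problem with the exception type named in the text above.
            throw new UTFDataFormatException("invalid UTF-8 sequence: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, 0x28};
        try {
            decodeStrict(bad);
        } catch (UTFDataFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because `UTFDataFormatException` is a subclass of `IOException`, a caller that wants to treat bad UTF-8 differently from other I/O failures needs to catch it before the more general `IOException`.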
| Modifier and Type | Field and Description |
|---|---|
| static int | DETECTVAL: Constant value to represent the DETECTVAL flag. |
| static int | METAVAL: Constant value to represent the METAVAL flag. |

Inherited field: DEFAULT_SAMPLING_SIZE

| Constructor and Description |
|---|
| LCSDetectionHTMLReader(InputStream in): Creates an LCSDetectionHTMLReader object. |
| LCSDetectionHTMLReader(InputStream in, int len): Creates an LCSDetectionHTMLReader object. |
| LCSDetectionHTMLReader(InputStream in, int len, int flag): Creates an LCSDetectionHTMLReader object. |

Inherited methods: close, getResult, mark, markSupported, read, read, read, ready, reset

public static final int METAVAL
Constant value to represent the METAVAL flag.

public static final int DETECTVAL
Constant value to represent the DETECTVAL flag.

public LCSDetectionHTMLReader(InputStream in) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the default sampling length and default profile for detection. The detected character set is used for conversion.
Parameters:
in - input stream on which to perform detection
Throws:
IOException - if any I/O error occurs
UTFDataFormatException - if any invalid UTF-8 data sequence is detected. Note that this occurs only if the source is UTF-8 data

public LCSDetectionHTMLReader(InputStream in, int len) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and default profile for detection. The detected character set is used for conversion.
Parameters:
in - input stream on which to perform detection
len - the sampling length
Throws:
IOException - if any I/O error occurs
UTFDataFormatException - if any invalid UTF-8 data sequence is detected. Note that this occurs only if the source is UTF-8 data

public LCSDetectionHTMLReader(InputStream in, int len, int flag) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and default profile for detection. If the flag is DETECTVAL, the detected character set is used for conversion; if the flag is METAVAL, the character set from the meta value is used.
Parameters:
in - input stream on which to perform detection
len - the sampling length
flag - METAVAL or DETECTVAL
Throws:
IOException - if any I/O error occurs
UTFDataFormatException - if any invalid UTF-8 data sequence is detected. Note that this occurs only if the source is UTF-8 data