public class LCSDetectionHTMLReader extends LCSDetectionReader
The LCSDetectionHTMLReader class extends the LCSDetectionReader class to support language and encoding detection for input in HTML format.
The METAVAL and DETECTVAL flags select the input character set: METAVAL uses the character set given in the HTML meta value, and DETECTVAL uses the detected character set. The default flag value is DETECTVAL.
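The difference between the two flags can be illustrated with a minimal, JDK-only sketch. It does not use or reproduce the LCSD implementation: the class name, the flag values, the regular expression, and the fallback to the detected value when no meta charset is present are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetChoiceSketch {
    // Hypothetical stand-ins for the METAVAL and DETECTVAL flags;
    // the real constant values are not given in this document.
    static final int METAVAL = 1;
    static final int DETECTVAL = 2;

    // Matches both <meta charset="..."> and content="text/html; charset=...".
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String metaCharset(byte[] html) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String head = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Picks the charset name according to the flag. Falling back to the
     * detected value when no meta charset exists is an assumption of this sketch.
     */
    static String chooseCharset(byte[] html, String detected, int flag) {
        if (flag == METAVAL) {
            String meta = metaCharset(html);
            if (meta != null) {
                return meta;
            }
        }
        return detected;
    }
}
```

Here `detected` stands for whatever character set name a detector has already produced; the sketch only shows which of the two candidates is selected.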
The detection sampling length is the number of bytes of plain text on which detection is performed. The default sampling length is 1K. In general, LCSD handles language and encoding detection without this value being set; change it only when you want to control how much text is sampled.
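As a rough illustration of sampling a bounded prefix, the following JDK-only sketch peeks at up to a given number of bytes and then rewinds the stream so that a later read still sees all the data. It is not the LCSD implementation. It assumes that "1K" means 1024 bytes, and it samples raw bytes rather than plain text with the markup removed; the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class SamplingSketch {
    // Assumption: the documented default of "1K" is taken to mean 1024 bytes.
    static final int DEFAULT_SAMPLING_SIZE = 1024;

    /**
     * Reads up to len bytes as a sample, then resets the stream to where it
     * started so that no input is consumed from the caller's point of view.
     */
    static byte[] peekSample(BufferedInputStream in, int len) {
        try {
            in.mark(len);                 // remember the current position
            byte[] buf = new byte[len];
            int total = 0;
            while (total < len) {
                int n = in.read(buf, total, len - total);
                if (n < 0) {
                    break;                // end of stream before len bytes
                }
                total += n;
            }
            in.reset();                   // rewind to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A larger sample gives a detector more evidence at the cost of more buffering; a shorter input simply yields a shorter sample.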
You can get the detection results from the LCSDResultSet class if needed.
Any read method throws UTFDataFormatException if the source is UTF-8 data and an invalid UTF-8 sequence is found.
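To show what an invalid UTF-8 sequence looks like in practice, here is a small JDK-only sketch that validates a byte array with a strict `CharsetDecoder`. It is an illustration only and says nothing about how LCSD performs its own check; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8CheckSketch {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the byte pair 0xC3 0x28 is rejected because the lead byte 0xC3 must be followed by a continuation byte in the range 0x80 to 0xBF. When calling the reader itself, note that `java.io.UTFDataFormatException` is a subclass of `IOException`, so a `catch` clause for it must come before a general `IOException` clause.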
Modifier and Type | Field and Description |
---|---|
static int | DETECTVAL: Constant value to represent the DETECTVAL flag. |
static int | METAVAL: Constant value to represent the METAVAL flag. |
Fields inherited from class LCSDetectionReader: DEFAULT_SAMPLING_SIZE
Constructor and Description |
---|
LCSDetectionHTMLReader(InputStream in): Creates an LCSDetectionHTMLReader object. |
LCSDetectionHTMLReader(InputStream in, int len): Creates an LCSDetectionHTMLReader object. |
LCSDetectionHTMLReader(InputStream in, int len, int flag): Creates an LCSDetectionHTMLReader object. |
Methods inherited from class LCSDetectionReader: close, getResult, mark, markSupported, read, read, read, ready, reset
public static final int METAVAL
Constant value to represent the METAVAL flag.
public static final int DETECTVAL
Constant value to represent the DETECTVAL flag.
public LCSDetectionHTMLReader(InputStream in) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the default sampling length and the default profile for detection. The detected character set is used for conversion.
Parameters:
in - the input stream on which to perform detection
Throws:
IOException - if an I/O error occurs
UTFDataFormatException - if an invalid UTF-8 sequence is detected. This occurs only if the source is UTF-8 data.
public LCSDetectionHTMLReader(InputStream in, int len) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and the default profile for detection. The detected character set is used for conversion.
Parameters:
in - the input stream on which to perform detection
len - the sampling length
Throws:
IOException - if an I/O error occurs
UTFDataFormatException - if an invalid UTF-8 sequence is detected. This occurs only if the source is UTF-8 data.
public LCSDetectionHTMLReader(InputStream in, int len, int flag) throws IOException, UTFDataFormatException
Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and the default profile for detection. The detected character set is used for conversion if the flag is DETECTVAL; the meta value of the character set is used for conversion if the flag is METAVAL.
Parameters:
in - the input stream on which to perform detection
len - the sampling length
flag - METAVAL or DETECTVAL
Throws:
IOException - if an I/O error occurs
UTFDataFormatException - if an invalid UTF-8 sequence is detected. This occurs only if the source is UTF-8 data.