Class LCSDetectionHTMLReader

  • All Implemented Interfaces:
    Closeable, AutoCloseable, Readable

    public class LCSDetectionHTMLReader
    extends LCSDetectionReader
    The LCSDetectionHTMLReader class extends the LCSDetectionReader class to support language and encoding detection for input in HTML format.

    You can use either the character set declared in the HTML meta value or the detected character set as the input character set by specifying the METAVAL or DETECTVAL flag. The default flag value is DETECTVAL.
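
    To make the "HTML meta value" concrete, the following is a minimal sketch that pulls a declared character set out of a byte sample using only the JDK. It is not the LCSD implementation; the class name MetaCharsetSketch, the method declaredCharset, and the regular expression are hypothetical and only illustrate what a METAVAL-style lookup has to find.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not part of the LCSD API.
final class MetaCharsetSketch {
    // Matches charset=... inside a <meta ...> tag. Covers both
    // <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the character set name declared in the sample, if any.
    static Optional<String> declaredCharset(byte[] sample) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // can be searched before the real encoding is known.
        String text = new String(sample, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"Shift_JIS\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        // prints "Shift_JIS"
        System.out.println(declaredCharset(html).orElse("(none declared)"));
    }
}
```

    A declared value can be missing or wrong, which is one reason a caller might prefer the detected character set over the meta value.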

    The detection sampling length specifies how many bytes of plain text the detection feature examines. The default sampling length is 1K. Generally, LCSD handles the language/encoding detection, and you do not need to set this value. Change it only when you want to control the detection sampling length.
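
    The idea of a bounded sample can be sketched with plain JDK streams: read at most a fixed number of bytes for inspection, then rewind so the full content can still be read. This is an assumption-laden illustration, not how LCSD samples internally; SamplingSketch, peek, and DEFAULT_SAMPLE_LENGTH are hypothetical names, and the sketch samples raw bytes rather than the plain text that the detection feature works on. It requires Java 11 or later for readNBytes(int).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical illustration only; this is not part of the LCSD API.
final class SamplingSketch {
    // Assumed to mirror the documented 1K default.
    static final int DEFAULT_SAMPLE_LENGTH = 1024;

    // Reads up to len bytes for inspection, then rewinds the stream so
    // later reads start from the beginning again.
    static byte[] peek(BufferedInputStream in, int len) throws IOException {
        in.mark(len);                      // remember the current position
        byte[] sample = in.readNBytes(len); // may be shorter at end of stream
        in.reset();                        // rewind to the marked position
        return sample;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];
        BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] sample = peek(in, DEFAULT_SAMPLE_LENGTH);
        System.out.println(sample.length);             // prints 1024
        System.out.println(in.readAllBytes().length);  // prints 5000
    }
}
```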

    You can get the detection results from the LCSDResultSet class if needed.

    Any read method throws UTFDataFormatException if the source is UTF-8 data and an invalid UTF-8 sequence is found.
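
    For readers unfamiliar with strict UTF-8 handling, the sketch below shows one way to reject an invalid UTF-8 sequence with the standard java.nio.charset decoder and surface it as a UTFDataFormatException. It is only an analogy for the behavior described above, not the LCSD implementation; StrictUtf8Sketch and decodeStrict are hypothetical names.

```java
import java.io.UTFDataFormatException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not part of the LCSD API.
final class StrictUtf8Sketch {
    // Decodes bytes as UTF-8 and fails instead of substituting U+FFFD
    // when the input contains a malformed sequence.
    static String decodeStrict(byte[] bytes) throws UTFDataFormatException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new UTFDataFormatException("invalid UTF-8 sequence: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC3, (byte) 0xA9};   // U+00E9 encoded in UTF-8
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // lead byte without a continuation byte
        try {
            System.out.println(decodeStrict(valid).length()); // prints 1
            decodeStrict(invalid);
            System.out.println("accepted");
        } catch (UTFDataFormatException e) {
            System.out.println("rejected"); // reached for the invalid input
        }
    }
}
```

    Because java.io.UTFDataFormatException is a subclass of IOException, a caller that wants to treat the two differently must catch UTFDataFormatException first.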

    Since:
    10.2
    • Field Detail

      • METAVAL

        public static final int METAVAL
        Constant value that represents the METAVAL flag, which selects the character set from the HTML meta value for conversion.
        See Also:
        Constant Field Values
      • DETECTVAL

        public static final int DETECTVAL
        Constant value that represents the DETECTVAL flag, which selects the detected character set for conversion. This is the default flag value.
        See Also:
        Constant Field Values
    • Constructor Detail

      • LCSDetectionHTMLReader

        public LCSDetectionHTMLReader(InputStream in)
                               throws IOException,
                                      UTFDataFormatException
        Creates an LCSDetectionHTMLReader object. Uses the default sampling length and the default profile for detection. The detected character set is used for conversion.
        Parameters:
        in - the input stream whose language and encoding you want to detect
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 data.
      • LCSDetectionHTMLReader

        public LCSDetectionHTMLReader(InputStream in,
                                      int len)
                               throws IOException,
                                      UTFDataFormatException
        Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and the default profile for detection. The detected character set is used for conversion.
        Parameters:
        in - the input stream whose language and encoding you want to detect
        len - the detection sampling length, in bytes
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 data.
      • LCSDetectionHTMLReader

        public LCSDetectionHTMLReader(InputStream in,
                                      int len,
                                      int flag)
                               throws IOException,
                                      UTFDataFormatException
        Creates an LCSDetectionHTMLReader object. Uses the specified sampling length and the default profile for detection. If the flag is DETECTVAL, the detected character set is used for conversion; if the flag is METAVAL, the character set from the HTML meta value is used for conversion.
        Parameters:
        in - the input stream whose language and encoding you want to detect
        len - the detection sampling length, in bytes
        flag - METAVAL or DETECTVAL
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 data.