Class LCSDetectionReader

  • All Implemented Interfaces:
    Closeable, AutoCloseable, Readable
    Direct Known Subclasses:
    LCSDetectionHTMLReader

    public class LCSDetectionReader
    extends Reader
    LCSDetectionReader is the language and character set detection (LCSD) reader class. It transparently detects the character set of the input and converts the data to Unicode.

    The most common usage is to read text data through the Reader interface, as follows:

     // try-with-resources closes the reader, and with it the underlying stream
     try (Reader rdr = new LCSDetectionReader(file.getInputStream()))
     {
       char[] cbuf = new char[1024];
       int len;
       while ((len = rdr.read(cbuf)) != -1)
       {
         // do something with the first len characters of cbuf
         ...
       }
     }
     
    Detection occurs only once, by sampling the first chunk of data.
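
    The sample-once behavior can be pictured with standard java.io classes. The sketch below is not the LCSD algorithm: it only looks for a UTF-16 byte order mark, and the class name BomSniffer, its method names, and the fallback to UTF-8 are assumptions made for illustration. It reads a bounded sample, picks a charset, rewinds the stream, and then decodes all of the data with that charset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: sniffs only a byte order mark, not a real detector.
class BomSniffer {
    // Picks a charset from the first n sampled bytes (assumed fallback: UTF-8).
    static Charset guess(byte[] sample, int n) {
        if (n >= 2) {
            int b0 = sample[0] & 0xFF;
            int b1 = sample[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                // The "UTF-16" decoder reads the BOM itself to choose the byte order.
                return StandardCharsets.UTF_16;
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Samples up to sampleSize bytes once, rewinds, and returns a decoding Reader.
    static Reader open(InputStream in, int sampleSize) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, sampleSize);
        buffered.mark(sampleSize);
        byte[] sample = new byte[sampleSize];
        int n = 0;
        while (n < sampleSize) {
            int r = buffered.read(sample, n, sampleSize - n);
            if (r == -1) {
                break; // stream is shorter than the sample size
            }
            n += r;
        }
        buffered.reset(); // the sampled bytes are decoded again as normal data
        return new InputStreamReader(buffered, guess(sample, n));
    }

    // Drains a Reader into a String.
    static String readAll(Reader rdr) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] cbuf = new char[1024];
        int len;
        while ((len = rdr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, len);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+FEFF encoded as UTF-16LE yields the bytes FF FE, followed by "hi".
        byte[] data = "\uFEFFhi".getBytes(StandardCharsets.UTF_16LE);
        try (Reader rdr = open(new ByteArrayInputStream(data), 64)) {
            System.out.println(readAll(rdr)); // prints "hi"
        }
    }
}
```

    Because the charset decision is made before the first character is returned, the caller sees an ordinary Reader and never handles raw bytes.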
    Since:
    10.2
    • Field Detail

      • DEFAULT_SAMPLING_SIZE

        protected static final int DEFAULT_SAMPLING_SIZE
        Default sampling byte length for language and character set detection.
        See Also:
        Constant Field Values
    • Constructor Detail

      • LCSDetectionReader

        public LCSDetectionReader​(InputStream in)
                           throws IOException,
                                  UTFDataFormatException
        Constructs the LCSD Reader instance with the character set determined by sampling initial data.
        Parameters:
        in - the InputStream object containing the text data
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
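
        Because java.io.UTFDataFormatException is a subclass of IOException, a caller that wants to tell malformed UTF-8 apart from other I/O failures has to catch it first. The sketch below shows that catch order. It uses DataInputStream.readUTF from the standard library as a stand-in source of the exception (readUTF parses Java's modified UTF-8 and is unrelated to LCSD decoding); the class name and the returned strings are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Illustrative only: DataInputStream stands in for a reader that can throw
// UTFDataFormatException.
class CatchOrderDemo {
    static String classify(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return "ok: " + in.readUTF();
        } catch (UTFDataFormatException e) {
            // The more specific exception must be caught before IOException.
            return "malformed UTF-8";
        } catch (IOException e) {
            return "I/O error";
        }
    }

    public static void main(String[] args) {
        // readUTF expects a two-byte length followed by that many encoded bytes.
        System.out.println(classify(new byte[] {0, 1, 'A'}));          // prints "ok: A"
        System.out.println(classify(new byte[] {0, 1, (byte) 0x80}));  // prints "malformed UTF-8"
        System.out.println(classify(new byte[] {0, 5, 'A'}));          // prints "I/O error"
    }
}
```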
      • LCSDetectionReader

        public LCSDetectionReader​(InputStream in,
                                  int size)
                           throws IOException,
                                  UTFDataFormatException
        Constructs the LCSD Reader instance with the character set determined by sampling initial data.
        Parameters:
        in - the InputStream object containing the text data
        size - the sampling size, in bytes
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
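
        A suitable sampling size depends on the data. As a general illustration, and not a description of how LCSD weighs its evidence, any detector that inspects only the first size bytes cannot see bytes that appear later. The hypothetical helper below reports whether a sample prefix contains a non-ASCII byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows what a fixed-size sample window can and cannot see.
class SampleWindowDemo {
    // Returns true if any of the first sampleSize bytes is outside 7-bit ASCII.
    static boolean sampleHasNonAscii(byte[] data, int sampleSize) {
        int limit = Math.min(sampleSize, data.length);
        for (int i = 0; i < limit; i++) {
            if ((data[i] & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 16 ASCII bytes followed by one two-byte UTF-8 character (U+00E9).
        byte[] data = "0123456789abcdef\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(sampleHasNonAscii(data, 8));   // prints "false"
        System.out.println(sampleHasNonAscii(data, 32));  // prints "true"
    }
}
```

        With the 8-byte window the data looks like plain ASCII, so a larger sample may be worth its cost when the input starts with a long ASCII header.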
      • LCSDetectionReader

        public LCSDetectionReader​(String profile,
                                  InputStream in)
                           throws IOException,
                                  UTFDataFormatException
        Constructs the LCSD Reader instance with the character set determined by sampling initial data.
        Parameters:
        profile - the LCSD profile name, or null for the default profile
        in - the InputStream object containing the text data
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
      • LCSDetectionReader

        public LCSDetectionReader​(String profile,
                                  InputStream in,
                                  int size)
                           throws IOException,
                                  UTFDataFormatException
        Constructs the LCSD Reader instance with the character set determined by sampling initial data.
        Parameters:
        profile - the LCSD profile name, or null for the default profile
        in - the InputStream object containing the text data
        size - the sampling size, in bytes
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
      • LCSDetectionReader

        public LCSDetectionReader​(Reader reader)
                           throws IOException
        Constructs the LCSD Reader instance over the given reader.

        Use this constructor to detect the language of text supplied by a reader object. Because a reader delivers characters that are already decoded, the character set is always UTF-16.

        Parameters:
        reader - the Reader object containing the text data
        Throws:
        IOException - if any I/O error occurs
      • LCSDetectionReader

        public LCSDetectionReader​(String profile,
                                  Reader reader)
                           throws IOException
        Constructs the LCSD Reader instance over the given reader.

        Use this constructor to detect the language of text supplied by a reader object. Because a reader delivers characters that are already decoded, the character set is always UTF-16.

        Parameters:
        profile - the LCSD profile name, or null for the default profile
        reader - the Reader object containing the text data
        Throws:
        IOException - if any I/O error occurs
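
        A java.io.Reader delivers Java char values, which are UTF-16 code units, so no byte encoding remains to be detected once data comes from a reader. The sketch below uses StringReader from the standard library, not LCSDetectionReader, to show that a single supplementary character arrives as two char values. The class and method names are illustrative.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative only: counts what a plain java.io.Reader hands to its caller.
class Utf16UnitsDemo {
    // Counts the char values (UTF-16 code units) delivered by the reader.
    static int countChars(Reader rdr) throws IOException {
        int count = 0;
        while (rdr.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // 'A' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String text = "A\uD83D\uDE00";
        System.out.println(text.codePointCount(0, text.length())); // prints "2"
        System.out.println(countChars(new StringReader(text)));    // prints "3"
    }
}
```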
    • Method Detail

      • getResult

        public LCSDResultSet getResult()
                                throws IOException,
                                       UTFDataFormatException
        Returns the result set of LCSD.

        Call this method if your application needs the detected language. The detected character set is applied implicitly during conversion, so call this method also if you need the name of the character set.

        Returns:
        the result set of LCSD
        Throws:
        IOException - if any I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
      • read

        public int read​(char[] cbuf,
                        int offset,
                        int length)
                 throws IOException,
                        UTFDataFormatException
        Reads characters into a portion of an array.
        Specified by:
        read in class Reader
        Parameters:
        cbuf - destination buffer
        offset - offset at which to start storing characters
        length - maximum number of characters to read
        Returns:
        the number of characters read, or -1 if the end of the stream has been reached
        Throws:
        IOException - if an I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
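
        As with any Reader, a single call may store fewer characters than requested. A minimal sketch, using StringReader as a stand-in and a hypothetical helper named readFully, loops until the requested portion of the array is filled or the stream ends:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative helper built only on the Reader.read(char[], int, int) contract.
class ReadPortionDemo {
    // Fills cbuf starting at offset with up to length characters.
    // Returns the number of characters stored, which is 0 at end of stream.
    static int readFully(Reader rdr, char[] cbuf, int offset, int length) throws IOException {
        int total = 0;
        while (total < length) {
            int n = rdr.read(cbuf, offset + total, length - total);
            if (n == -1) {
                break; // end of stream before the portion was filled
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        char[] cbuf = "..........".toCharArray();
        int stored = readFully(new StringReader("abc"), cbuf, 2, 5);
        System.out.println(stored);           // prints "3"
        System.out.println(new String(cbuf)); // prints "..abc....."
    }
}
```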
      • ready

        public boolean ready()
                      throws IOException
        Tells whether this stream is ready to be read.
        Overrides:
        ready in class Reader
        Returns:
        true if the next read() call is guaranteed not to block for input; false otherwise
        Throws:
        IOException - if an I/O error occurs
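
        A false result from ready() means only that a read might block, not that the stream has ended. The sketch below demonstrates this part of the general Reader contract with the standard PipedReader and PipedWriter classes as stand-ins; how LCSDetectionReader itself determines readiness is not described here, and the class name is illustrative.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;

// Illustrative only: a pipe makes the "no data yet" state easy to reproduce.
class ReadyDemo {
    // Returns { ready() before any data exists, ready() after data is written }.
    static boolean[] probe() throws IOException {
        try (PipedWriter writer = new PipedWriter();
             PipedReader reader = new PipedReader(writer)) {
            boolean before = reader.ready(); // nothing buffered, a read would block
            writer.write("hi");
            boolean after = reader.ready();  // two characters are buffered
            return new boolean[] { before, after };
        }
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = probe();
        System.out.println(result[0]); // prints "false"
        System.out.println(result[1]); // prints "true"
    }
}
```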
      • markSupported

        public boolean markSupported()
        Tells whether this stream supports the mark() operation.
        Overrides:
        markSupported in class Reader
        Returns:
        true if this stream supports the mark() operation
      • mark

        public void mark​(int readAheadLimit)
                  throws IOException
        Marks the present position in the stream.
        Overrides:
        mark in class Reader
        Parameters:
        readAheadLimit - limit on the number of characters that may be read while still preserving the mark. After this many characters have been read, attempting to reset the stream may fail.
        Throws:
        IOException - if the stream does not support the mark() operation, or if some other I/O error occurs
      • reset

        public void reset()
                   throws IOException
        Resets the stream.
        Overrides:
        reset in class Reader
        Throws:
        IOException - if the stream has not been marked, or if the mark has been invalidated, or if the stream does not support the reset() operation, or if some other I/O error occurs
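
        The markSupported(), mark(), and reset() methods are typically used together to look ahead without consuming input. The sketch below shows the pattern on a standard BufferedReader; the helper name peek is hypothetical, and the markSupported() check is kept because this page does not state what LCSDetectionReader returns from it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative look-ahead helper that works with any Reader supporting mark().
class PeekDemo {
    // Returns up to n leading characters and leaves the reader position unchanged.
    static String peek(Reader rdr, int n) throws IOException {
        if (!rdr.markSupported()) {
            throw new IOException("mark() is not supported by " + rdr.getClass().getName());
        }
        rdr.mark(n); // the mark stays valid for at least n characters
        char[] cbuf = new char[n];
        int total = 0;
        while (total < n) {
            int r = rdr.read(cbuf, total, n - total);
            if (r == -1) {
                break;
            }
            total += r;
        }
        rdr.reset(); // rewind to the marked position
        return new String(cbuf, 0, total);
    }

    public static void main(String[] args) throws IOException {
        try (Reader rdr = new BufferedReader(new StringReader("hello world"))) {
            System.out.println(peek(rdr, 5));      // prints "hello"
            System.out.println((char) rdr.read()); // prints "h"
        }
    }
}
```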
      • read

        public int read​(char[] cbuf)
                 throws IOException,
                        UTFDataFormatException
        Reads characters into an array.
        Overrides:
        read in class Reader
        Parameters:
        cbuf - destination buffer
        Returns:
        the number of characters read, or -1 if the end of the stream has been reached
        Throws:
        IOException - if an I/O error occurs
        UTFDataFormatException - if an invalid UTF-8 data sequence is detected. This occurs only if the source is UTF-8 encoded.
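
        The return value matters when the buffer is reused: after each call only the first returned number of characters are valid, and the rest of the array may still hold data from an earlier call. A minimal sketch of a complete read loop, with StringReader as a stand-in and an intentionally tiny buffer so that several calls are needed:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative read loop built on Reader.read(char[]).
class ReadLoopDemo {
    static String readAll(Reader rdr) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] cbuf = new char[4]; // deliberately small to force several reads
        int len;
        while ((len = rdr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, len); // append only the characters just read
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (Reader rdr = new StringReader("abcdefghij")) {
            System.out.println(readAll(rdr)); // prints "abcdefghij"
        }
    }
}
```

        Appending the whole array instead of its first len characters would repeat stale characters whenever a call returns fewer than four.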