Class LCSDetector


  • public class LCSDetector
    extends Object
    The LCSDetector class contains methods to automatically detect and recognize language, encoding, or both based on text input.

    To use the LCSDetector class, create an instance with one of its constructors. Call LCSDetector() to use the standard profile, or LCSDetector(String name) to select a profile suited to the text you plan to sample. Certain profiles may yield more accurate results. For example, if you are sampling medical journals, you may want to use a profile built mainly from medical journals. If you are sampling computer-related white papers, a profile built from similar documents improves detection accuracy. Currently, only one standard profile is provided, which is for general-purpose detection.

    The detection process begins with a call to one of the detect() methods, such as detect(byte[]). Statistics accumulate with every detect() call. When you are ready for the result, call the getResult() method to retrieve an LCSDResultSet instance. To begin a new detection using the same LCSDetector instance, call the reset() method to clear the accumulated statistics.
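    The accumulate, read, and reset cycle described above can be sketched as follows. This is a minimal illustration, not the library's implementation: ToyDetector is a hypothetical stand-in that only tallies ASCII and non-ASCII bytes so that the example runs without the Oracle library, and its getResult() returns a String, whereas the real method returns an LCSDResultSet.

```java
import java.nio.charset.StandardCharsets;

public class DetectorLifecycleSketch {

    // Hypothetical stand-in for LCSDetector: it only tallies bytes so the
    // accumulate / read / reset cycle can run without the Oracle library.
    static class ToyDetector {
        private long asciiBytes;
        private long highBytes;

        // Mirrors detect(byte[]): each call adds to the running statistics.
        void detect(byte[] input) {
            for (byte b : input) {
                if (b >= 0) {
                    asciiBytes++;
                } else {
                    highBytes++;
                }
            }
        }

        // Stand-in for getResult(); the real method returns an LCSDResultSet.
        String getResult() {
            return highBytes == 0 ? "ascii-only" : "has-non-ascii";
        }

        // Mirrors reset(): clears the statistics so the instance can be reused.
        void reset() {
            asciiBytes = 0;
            highBytes = 0;
        }
    }

    public static void main(String[] args) {
        ToyDetector detector = new ToyDetector();

        // First document, fed in two pieces; statistics accumulate across calls.
        detector.detect("plain text, ".getBytes(StandardCharsets.UTF_8));
        detector.detect("caf\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(detector.getResult()); // prints "has-non-ascii"

        // Second document: reset first so the earlier sample does not leak in.
        detector.reset();
        detector.detect("plain text only".getBytes(StandardCharsets.UTF_8));
        System.out.println(detector.getResult()); // prints "ascii-only"
    }
}
```

    With the real class, the same call order would apply: construct the detector, call detect() one or more times, call getResult(), and call reset() before sampling unrelated input.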

    Since:
    10.1.0.2
    See Also:
    LCSDResultSet
    • Constructor Summary

      Constructors 
      Constructor Description
      LCSDetector()
      Constructor which uses the standard default profile.
      LCSDetector​(String name)
      Constructor which takes a profile name and allows you to choose a profile other than the default.
    • Method Summary

      Modifier and Type Method Description
      void detect​(byte[] input)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      int detect​(byte[] input, int offset, int length)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      void detect​(char[] input)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      int detect​(char[] input, int offset, int length)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      void detect​(InputStream input)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      int detect​(InputStream input, int length)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      void detect​(String input)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      int detect​(String input, int length)
      Statistical data is cumulated in an internal structure when the detect() methods are called.
      LCSDResultSet getResult()
      Determines the top ranking language/character set pairs from the cumulated statistical data.
      static boolean isCharsetSupported​(int charsettype, String charset)
      Checks whether the given character set, named using the Oracle, IANA, or Java convention, is supported by the detection feature.
      void reset()
      Resets the statistical data for all language/character set pairs to 0.
      void setCharacterSetFilter​(String charset)
      Sets the character set filter if you know the character set of the input data.
      void setLanguageFilter​(String language)
      Sets the language filter if you know the language of the input data.
    • Constructor Detail

      • LCSDetector

        public LCSDetector()
        Constructor which uses the standard default profile.
      • LCSDetector

        public LCSDetector​(String name)
        Constructor which takes a profile name and allows you to choose a profile other than the default.
        Parameters:
        name - name of profile to use
        Throws:
        IllegalArgumentException - if an invalid profile name is specified
    • Method Detail

      • setCharacterSetFilter

        public void setCharacterSetFilter​(String charset)
        Sets the character set filter if you know the character set of the input data. The default value is none. If both the language filter and character set filter are set, they are ignored. If an invalid IANA character set name is passed in, it is ignored.
        Parameters:
        charset - IANA character set name
        Throws:
        IllegalArgumentException - if an invalid character set is specified
      • setLanguageFilter

        public void setLanguageFilter​(String language)
        Sets the language filter if you know the language of the input data. The default value is none. If both the language filter and character set filter are set, they are ignored. If an invalid language name is passed in, it is ignored.
        Parameters:
        language - ISO language name.
        Throws:
        IllegalArgumentException - if an invalid language is specified
      • detect

        public void detect​(byte[] input)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics.
        Parameters:
        input - the bytes to be sampled by the detect method
      • detect

        public int detect​(byte[] input,
                          int offset,
                          int length)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. Only the specified length of bytes is sampled.
        Parameters:
        input - the bytes to be sampled by the detect method
        offset - the index of the first byte to sample
        length - the number of bytes to sample
        Returns:
        the number of bytes sampled, or -1 if the end of the array is reached
        Throws:
        IllegalArgumentException - call the reset method
      • detect

        public void detect​(InputStream input)
                    throws IOException
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. The entire stream is sampled by the detect() method.
        Parameters:
        input - InputStream to be sampled by the detect method
        Throws:
        IOException - if an error occurs while operating on the stream
        IllegalArgumentException - call the reset method
      • detect

        public int detect​(InputStream input,
                          int length)
                   throws IOException
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. Only the specified length of bytes will be sampled.
        Parameters:
        input - InputStream to be sampled by the detect() method
        length - the number of bytes to sample
        Returns:
        the number of bytes sampled, or -1 if the end of the stream is reached
        Throws:
        IOException - if an error occurs while operating on the stream
        IllegalArgumentException - call the reset method
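        When the sampled bytes are also needed after detection, for example to decode them once the character set is known, one option is to read a bounded sample into a byte array first and pass that array to detect(byte[]). The sketch below uses only the JDK; readSample is a hypothetical helper and is not part of the LCSDetector API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class StreamSampleSketch {

    // Reads at most maxBytes from the stream and returns exactly what was read.
    // A result shorter than maxBytes means the stream ended early.
    static byte[] readSample(InputStream in, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buffer, total, maxBytes - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        byte[] sample = readSample(in, 3);
        System.out.println(sample.length); // prints 3
        // The sample could then be handed to a detector, e.g. detect(sample),
        // and kept around for decoding afterwards.
    }
}
```

        The loop is needed because a single InputStream.read call may return fewer bytes than requested even when more data is still to come.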
      • detect

        public void detect​(String input)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. The entire string is sampled by the detect() method.
        Parameters:
        input - the string to be sampled by the detect method
      • detect

        public int detect​(String input,
                          int length)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. Only the specified length of characters will be sampled.
        Parameters:
        input - a string to be sampled by the detect() method
        length - the number of characters to sample
        Returns:
        the number of characters sampled, or -1 if the end of the string is reached
        Throws:
        IllegalArgumentException - call the reset method
      • detect

        public void detect​(char[] input)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. The entire array is sampled by the detect() method.
        Parameters:
        input - the characters to be sampled by the detect method
      • detect

        public int detect​(char[] input,
                          int offset,
                          int length)
        Statistical data is cumulated in an internal structure when the detect() methods are called. Use the reset() method to clear the cumulated statistics. Only the specified length of characters will be sampled.
        Parameters:
        input - the char array to be sampled by the detect() method
        offset - the index of the first character to sample
        length - the number of characters to sample
        Returns:
        the number of characters sampled, or -1 if the end of the array is reached
        Throws:
        IllegalArgumentException - call the reset method
      • getResult

        public LCSDResultSet getResult()
        Determines the top ranking language/character set pairs from the cumulated statistical data.
        Returns:
        An LCSDResultSet object which contains the result
      • isCharsetSupported

        public static boolean isCharsetSupported​(int charsettype,
                                                 String charset)
        Checks whether the given character set, named using the Oracle, IANA, or Java convention, is supported by the detection feature.

        See LocaleMapper for the charsettype values ORACLE, IANA, and JAVA.

        Parameters:
        charsettype - can be ORACLE, IANA, or JAVA.
        charset - the given character set
        Returns:
        true if the given character set is supported by the detection feature, or false if not
        Throws:
        IllegalArgumentException - if an invalid profile is specified
      • reset

        public void reset()
        Resets the statistical data for all language/character set pairs to 0.