Integrating Search


Multi-language Searching and Indexing

WebLogic Portal provides several methods for configuring full-text search and indexing in multiple languages. Each method provides different capabilities, and you need to decide on a per-repository basis which method to use. If you decide to change methods later, you must also re-index your repository. Note that each indexed document can be associated with only one language.

The following sections describe each full-text search method and how to configure it: One Language per Autonomy Server, One Language per Repository, and Mixing Languages Within a Repository.

You should also consult the Autonomy documentation, which is included in your WebLogic Portal installation directory at <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/common/docs.

 


One Language per Autonomy Server

The default configuration for an Autonomy server is one language and one encoding across all repositories using that server. When you use this configuration for multiple languages, you need a separate Autonomy server for each language. In this case, you need to configure all indexed content and all full-text queries against that content to use the same LanguageType (language and encoding).

For example, you could have three repositories accessing a single Autonomy server. All three repositories must use the same LanguageType, such as frenchUTF8, and all documents indexed in each repository would need to be in French. Additionally, all queries on all repositories would need to be in the French language with UTF8 encoding. If you needed two languages, you would have to set up two Autonomy servers and two repositories, and manually configure the default language type in each server.

To set a default language type for a server, you edit the DefaultLanguageType in the [LanguageTypes] section in the server’s configuration file (AutonomyIDOLServer.cfg). For more information about defining a global default language type, see the IDOL Server Administration Guide at <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/common/docs.

 


One Language per Repository

To mix multiple repositories, possibly with different languages, in the same Autonomy server, you need to specify the language and encoding for each repository. This means that all nodes in a repository and all queries against it must use the same language and encoding. Both the language and the encoding are defined by the LanguageType. Some examples of language types are frenchUTF8 (French language, UTF8 encoding), frenchASCII, and russianCYRILLIC. When you use a language type such as frenchUTF8, all documents in that repository must be in French and all queries in that repository must be in the French language with UTF8 encoding.

The supported language types are listed in the [LanguageTypes] section in the server’s configuration file (AutonomyIDOLServer.cfg), which is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<os>/IDOLserver/IDOL directory.

To use this approach, you need to add two properties to the repository configuration. For example, to use the French language with UTF8 encoding, add:

fullTextSearchIndexLanguageType=frenchUTF8
fullTextSearchQueryLanguageType=frenchUTF8

Generally, you set these properties to the same value. All queries need to use the language and encoding specified by fullTextSearchQueryLanguageType.
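As a minimal sketch of the "same value" guideline, the following Java helper derives both repository properties from a single language type so that the two cannot drift apart. The class and method names are hypothetical and are not part of the WebLogic Portal API; only the two property names come from this guide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of WebLogic Portal: it only assembles
// the two repository property values described in this section.
class RepositoryLanguageProperties {

    static final String INDEX_KEY = "fullTextSearchIndexLanguageType";
    static final String QUERY_KEY = "fullTextSearchQueryLanguageType";

    /** Builds both properties from one language type, for example frenchUTF8. */
    static Map<String, String> forLanguageType(String languageType) {
        if (languageType == null || languageType.trim().isEmpty()) {
            throw new IllegalArgumentException("languageType must not be empty");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(INDEX_KEY, languageType); // language type used when indexing
        props.put(QUERY_KEY, languageType); // language type assumed for queries
        return props;
    }

    public static void main(String[] args) {
        // Prints the two name=value pairs to add to the repository.
        forLanguageType("frenchUTF8")
            .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

The printed pairs still have to be added to the repository through the Portal Administration Console.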

For instructions on how to add a property to a repository, see “Adding Custom Properties” in Configuring WLP Repositories in the Content Management Guide.

Note: After you disconnect a repository or make any changes to repository properties, Portal Administration Console users must log out and log back in to view the changes.

 


Mixing Languages Within a Repository

If you mix data of multiple language types within a repository, you can use Automatic Language Detection. This approach provides the greatest flexibility for both repository content and search options.

Automatic Language Detection identifies the language and encoding of a document when it is indexed and provides the ability to query data by language, encoding, or both. For example, you could specify that you want to find only French and Italian matches, regardless of encoding; or only Russian matches with UTF8 encoding; or all matches, regardless of language and encoding.

You configure Automatic Language Detection on a per-repository basis. This means you could have three different repositories with different indexing and querying abilities: two repositories might use Automatic Language Detection and have a mixture of documents of type frenchUTF8, englishASCII, and russianCYRILLIC, while the third repository contains only italianUTF8 documents.

Caution: When you configure Automatic Language Detection, any repository using the default configuration (one language and one encoding across all repositories using that server) will be automatically configured to use Automatic Language Detection. If you do not want this behavior, you must specify the language type (language and encoding) for each of those repositories, as described in One Language per Repository.
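The interaction described in the caution can be read as a small decision rule. The sketch below is an illustrative model of that reading, not the product's implementation: the class, enum, and method names are hypothetical, and it assumes that a repository with fullTextSearchIndexLanguageType set keeps its fixed language type even when the server-level AutoDetectLanguagesAtIndex setting is enabled.

```java
import java.util.Map;

// Illustrative model only; the names here are hypothetical.
class IndexingModeSketch {

    enum Mode { FIXED_LANGUAGE_TYPE, AUTO_DETECT, SERVER_DEFAULT }

    /**
     * @param autoDetectLanguagesAtIndex value of the server-wide setting
     * @param repositoryProperties       properties set on one repository
     */
    static Mode resolve(boolean autoDetectLanguagesAtIndex,
                        Map<String, String> repositoryProperties) {
        String fixed = repositoryProperties.get("fullTextSearchIndexLanguageType");
        if (fixed != null && !fixed.trim().isEmpty()) {
            // The repository pins its own language type.
            return Mode.FIXED_LANGUAGE_TYPE;
        }
        // Repositories without the property follow the server setting.
        return autoDetectLanguagesAtIndex ? Mode.AUTO_DETECT : Mode.SERVER_DEFAULT;
    }

    public static void main(String[] args) {
        Map<String, String> unset = Map.of();
        Map<String, String> italian =
            Map.of("fullTextSearchIndexLanguageType", "italianUTF8");
        System.out.println(resolve(true, unset));
        System.out.println(resolve(true, italian));
        System.out.println(resolve(false, unset));
    }
}
```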

Configuring Automatic Language Detection

When Automatic Language Detection is set, the server automatically identifies the language and encoding of a document when it is indexed. For more information about Automatic Language Detection, see the IDOL Server Administration Guide at <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/common/docs.

Note: Enabling this feature may have an impact on the ability to search for existing content in Content Management and GroupSpace repositories other than content defined as the DefaultLanguageType. This is because language reclassification can occur when this feature is enabled.

To configure Automatic Language Detection on a repository:

  1. Set the AutoDetectLanguagesAtIndex to true in the [Server] section of the AutonomyIDOLServer.cfg file, which is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<os>/IDOLserver/IDOL directory.
  2. Do not set a fullTextSearchIndexLanguageType property on the repository; remove if already set.
  3. Optionally, specify the default query language type by adding a property to the repository. For example:
     fullTextSearchQueryLanguageType=frenchUTF8
  4. Optionally, specify that results are returned across all languages, not just the language of the fullTextSearchQueryLanguageType, by adding the following property to the repository:
     fullTextSearchQueryAnyLanguage=true
  5. Re-index your repository content. For information on how to do this, see Re-Indexing WLP Repository Content.

Note: During indexing, if the language type cannot be determined automatically, the DefaultLanguageType is used. This is a global server setting, not a repository setting.
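Before re-indexing, it can help to confirm that the server setting from the first step is actually in place. The following Java sketch reads the lines of a configuration file and looks up a key inside a named section. The class and method names are hypothetical, the parser handles only simple [Section] and key=value lines, and treating names as case-insensitive is an assumption rather than documented IDOL behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical checker; a simplified reader, not the IDOL parser.
class IdolConfigCheck {

    /** Returns the value of key inside [section], or null if it is absent. */
    static String lookup(List<String> lines, String section, String key) {
        String current = "";
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("[") && line.endsWith("]")) {
                // A new section header starts here.
                current = line.substring(1, line.length() - 1).trim();
            } else if (current.equalsIgnoreCase(section)) {
                int eq = line.indexOf('=');
                if (eq > 0 && line.substring(0, eq).trim().equalsIgnoreCase(key)) {
                    return line.substring(eq + 1).trim();
                }
            }
        }
        return null;
    }

    /** True when AutoDetectLanguagesAtIndex=true appears in [Server]. */
    static boolean autoDetectEnabled(List<String> lines) {
        return "true".equalsIgnoreCase(
            lookup(lines, "Server", "AutoDetectLanguagesAtIndex"));
    }

    public static void main(String[] args) {
        // Sample content; a real file would be read with Files.readAllLines.
        List<String> cfg = Arrays.asList(
            "[Server]",
            "AutoDetectLanguagesAtIndex=true",
            "[LanguageTypes]",
            "DefaultLanguageType=englishASCII");
        System.out.println("Auto-detect enabled: " + autoDetectEnabled(cfg));
        System.out.println("Fallback language type: "
            + lookup(cfg, "LanguageTypes", "DefaultLanguageType"));
    }
}
```

The same lookup also shows which DefaultLanguageType is used as the fallback when detection fails.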

Creating Queries

Queries are very flexible; they can be in any language and encoding. For example, you can construct a query that returns results for Japanese documents using UTF-8, Shift_JIS, and EUC-JP encodings.

Use the following examples to specify the search results from your repositories. For additional information about these examples, see the WebLogic Portal Javadoc.

Query Text in Same Language and Any Encoding

If the query text is in the language and encoding defined by the fullTextSearchQueryLanguageType and you want results in the language of fullTextSearchQueryLanguageType regardless of the encoding, you do not need to create additional code.

Query Text in Same Language with Specific Encoding

If the query text is in the language and encoding defined by the fullTextSearchQueryLanguageType and you want the results in the same language as the fullTextSearchQueryLanguageType with a specific encoding:

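AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language type of the query text.
params.setLanguageType("englishASCII");
// Return only matches that use this encoding.
params.setMatchEncoding("UTF8");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);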

Query Text in Another Language and Encoding

If the query text is in a language and encoding different from fullTextSearchQueryLanguageType, you can override the repository fullTextSearchQueryLanguageType in the ContentContext class. This returns results in the specified LanguageType language, regardless of encoding:

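AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Overrides the repository's fullTextSearchQueryLanguageType for this query.
params.setLanguageType("englishASCII");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);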

Query Across All Languages

If the query text is in one language and encoding and you want to query across all languages:

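AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language type of the query text.
params.setLanguageType("englishASCII");
// Return matches in every language, not only the query language.
params.setAnyLanguage(true);
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);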

Query Multiple Specific Languages

If the query text is in one language and encoding and you want to query multiple specific languages:

AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language type of the query text.
params.setLanguageType("englishASCII");
params.setAnyLanguage(true);
// Restrict matches to these language types, separated by "+".
params.setMatchLanguageType("frenchASCII+germanUTF8");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);
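The value passed to setMatchLanguageType in the example above is a list of language types joined with a plus sign. When the list comes from user input or configuration, a small helper can build the string. This is a sketch under that assumption: the class and method are hypothetical and only produce the string, they do not call the WebLogic Portal API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that only builds the "+"-separated string.
class MatchLanguageTypes {

    /** Joins language types such as frenchASCII and germanUTF8 with "+". */
    static String join(List<String> languageTypes) {
        if (languageTypes.isEmpty()) {
            throw new IllegalArgumentException("at least one language type is required");
        }
        for (String type : languageTypes) {
            // Reject values that would corrupt the separator-based format.
            if (type == null || type.trim().isEmpty() || type.contains("+")) {
                throw new IllegalArgumentException("invalid language type: " + type);
            }
        }
        return String.join("+", languageTypes);
    }

    public static void main(String[] args) {
        // prints "frenchASCII+germanUTF8"
        System.out.println(join(Arrays.asList("frenchASCII", "germanUTF8")));
    }
}
```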

 


Enterprise Search for Microsoft Word, Excel, and PowerPoint Files in Multibyte Languages

You can configure search and indexing for Microsoft Word (.doc), Excel (.xls), and PowerPoint (.ppt) files in Content Management and GroupSpace communities. In these cases, you need to use the default configuration for an Autonomy server, that is, one language and one encoding across all repositories using that server. For more information, see One Language per Autonomy Server.

In addition to using the default Autonomy server configuration, you need to set the encodings for indexing and searching on the file names as described in this section. Without these encoding settings, search cannot find file names that use multibyte encodings. These encodings are set in the following files: omnislave.cfg, AutonomyIDOLServer.cfg, and web.xml.

Note: The supported language types and encodings are listed in the [LanguageTypes] section in the AutonomyIDOLServer.cfg file, which is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<os>/IDOLserver/IDOL directory.

Settings in omnislave.cfg

You must specify the system’s default encoding in the omnislave.cfg file. This file is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<OS>/filters directory.

To specify the encoding:

  1. In the omnislave.cfg file, remove any FileNameFromCharSet=<encoding> settings from any sections in which they appear.
  2. In the [Configuration] section, add the system’s default encoding. For example:
     FileNameFromCharSet=SHIFTJIS
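The two steps above can be sketched as a transformation over the lines of omnislave.cfg. The class and method names are hypothetical, and the sample section and key names other than [Configuration] and FileNameFromCharSet are invented for illustration. The sketch assumes simple key=value lines and appends a [Configuration] section if the file has none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; back up the real file before editing it.
class OmnislaveEncodingSketch {

    static final String KEY = "FileNameFromCharSet";

    static List<String> setFileNameCharSet(List<String> lines, String encoding) {
        List<String> out = new ArrayList<>();
        boolean added = false;
        for (String raw : lines) {
            String line = raw.trim();
            // Step 1: drop every existing FileNameFromCharSet setting.
            if (isSetting(line)) {
                continue;
            }
            out.add(raw);
            // Step 2: add one setting directly under [Configuration].
            if (!added && line.equalsIgnoreCase("[Configuration]")) {
                out.add(KEY + "=" + encoding);
                added = true;
            }
        }
        if (!added) {
            // Assumption: create the section when it does not exist yet.
            out.add("[Configuration]");
            out.add(KEY + "=" + encoding);
        }
        return out;
    }

    private static boolean isSetting(String line) {
        int eq = line.indexOf('=');
        return eq > 0 && line.substring(0, eq).trim().equalsIgnoreCase(KEY);
    }

    public static void main(String[] args) {
        // "Threads" and "[Filter]" are made-up sample content.
        List<String> cfg = Arrays.asList(
            "[Configuration]", "Threads=2", "[Filter]", "FileNameFromCharSet=UTF8");
        setFileNameCharSet(cfg, "SHIFTJIS").forEach(System.out::println);
    }
}
```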

Settings in AutonomyIDOLServer.cfg

You must specify the DefaultLanguageType and DefaultEncoding settings in the AutonomyIDOLServer.cfg file. The DefaultEncoding must be the same encoding as specified in omnislave.cfg, and the DefaultLanguageType must be the language type that corresponds to both that encoding and the language of the documents. For example, for the Japanese language with Shift-JIS encoding, you would specify:

[LanguageTypes]
DefaultLanguageType=japaneseSHIFT_JIS
DefaultEncoding=SHIFTJIS

GroupSpace Encoding

You also need to update the GroupSpace encoding in web.xml to use the same encoding that you specified in omnislave.cfg. (The web.xml file is located in the WEB-INF directory of the portal web project.) For example:

<context-param>
<param-name>com.bea.apps.groupspace.search.enterprise.outputEncoding</param-name>
<param-value>SHIFTJIS</param-value>
</context-param>
<context-param>
<param-name>com.bea.apps.groupspace.search.enterprise.connectionEncoding</param-name>
<param-value>shift_jis</param-value>
</context-param>

Note: The settings in AutonomyIDOLServer.cfg and web.xml are not limited to file name search; they are required to handle multibyte characters in Enterprise Search.

After modifying these files, you must re-index the existing content so that multibyte characters in file names can be searched. For information on how to do this, see Re-Indexing WLP Repository Content.
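Because the same encoding has to appear in omnislave.cfg, AutonomyIDOLServer.cfg, and web.xml, a quick consistency check before re-indexing can catch a mismatch early. The sketch below compares encoding names after stripping punctuation and case, so that SHIFTJIS and shift_jis are treated as the same name. The class and methods are hypothetical, and the normalization is a deliberate simplification: it compares spellings only and does not know about charset aliases.

```java
import java.util.Locale;

// Hypothetical consistency check; compares names, not charset aliases.
class EncodingConsistencySketch {

    /** Lowercases the name and removes everything except letters and digits. */
    static String normalize(String encodingName) {
        return encodingName.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    /** True when all given encoding names normalize to the same string. */
    static boolean consistent(String... encodingNames) {
        if (encodingNames.length == 0) {
            return true;
        }
        String first = normalize(encodingNames[0]);
        for (String name : encodingNames) {
            if (!normalize(name).equals(first)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Values as they appear in the examples in this section.
        String omnislave = "SHIFTJIS";            // FileNameFromCharSet
        String idolDefault = "SHIFTJIS";          // DefaultEncoding
        String outputEncoding = "SHIFTJIS";       // web.xml outputEncoding
        String connectionEncoding = "shift_jis";  // web.xml connectionEncoding
        System.out.println(consistent(
            omnislave, idolDefault, outputEncoding, connectionEncoding));
    }
}
```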

