Oracle® Fusion Middleware Autonomy Search Integration Sample Guide for Oracle WebLogic Portal
10g Release 3 (10.3.5)

Part Number E15073-04

6 Configuring Multi-language Searching and Indexing

Oracle WebLogic Portal provides several methods for configuring full-text search and indexing in multiple languages. Each method provides different capabilities. You need to decide on a per-repository basis which method is desirable. If you decide to change methods later, you must also re-index your repository. Note that each indexed document can be associated with only one language.

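The following sections describe each full-text search method and how to configure it:

  Section 6.1, "One Language per Autonomy Server"
  Section 6.2, "One Language per Repository"
  Section 6.3, "Mixing Languages Within a Repository"
  Section 6.4, "Enterprise Search for Microsoft Word, Excel, and PowerPoint Files in Multibyte Languages"

You should also consult the Autonomy documentation.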

6.1 One Language per Autonomy Server

The default configuration for an Autonomy server is one language and one encoding across all repositories using that server. To support multiple languages with this configuration, you need a separate Autonomy server for each language. In this case, you need to configure all indexed content and all full-text queries against that content to use the same LanguageType (language and encoding).

For example, you could have three repositories accessing a single Autonomy server. All three repositories must use the same LanguageType, such as FrenchUTF8, and all documents indexed in each repository would need to be in French. Additionally, all queries on all repositories would need to be in the French language with UTF8 encoding. If you needed two languages, you would have to set up two Autonomy servers and two repositories, and manually configure the default language type in each server.

To set a default language type for a server, you edit the DefaultLanguageType in the [LanguageTypes] section in the server's configuration file (AutonomyIDOLServer.cfg). For more information about defining a global default language type, see the IDOL Server Administration Guide, published by Autonomy Corporation. Contact WebLogic Portal Customer Support to obtain a copy of this guide.
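As a minimal sketch, assuming French with UTF8 encoding as the server-wide default, the relevant entry in AutonomyIDOLServer.cfg might look like the following. The value frenchUTF8 is only an example; confirm that the language type you choose is one of those defined in your own [LanguageTypes] section.

```
[LanguageTypes]
DefaultLanguageType=frenchUTF8
```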

6.2 One Language per Repository

To mix multiple repositories, possibly with different languages, in the same Autonomy server, you need to specify the language and encoding for each repository. This means that all nodes in a repository and all queries must use the same language type and encoding. Both the language type and encoding are defined by the LanguageType. Some examples of language types are frenchUTF8 (French language, UTF8 encoding), frenchASCII, and russianCYRILLIC. When you use a language type, such as frenchUTF8, all documents in the French-UTF8 repository must be in French and all queries in that repository must be in the French language with UTF8 encoding.
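To make the relationship between a LanguageType name and its two parts concrete, the following sketch splits a name into language and encoding. It is plain Java, not part of the WebLogic Portal or Autonomy APIs: the LanguageTypeName class is hypothetical, and the split rule (lowercase language followed by an encoding that starts with an uppercase letter) is only inferred from the example names in this section.

```java
// Illustrative only: LanguageTypeName and its split rule are assumptions,
// not part of the WebLogic Portal or Autonomy APIs.
final class LanguageTypeName {
    final String language;
    final String encoding;

    private LanguageTypeName(String language, String encoding) {
        this.language = language;
        this.encoding = encoding;
    }

    // Splits at the first uppercase letter after position 0,
    // for example "frenchUTF8" becomes "french" and "UTF8".
    static LanguageTypeName parse(String name) {
        for (int i = 1; i < name.length(); i++) {
            if (Character.isUpperCase(name.charAt(i))) {
                return new LanguageTypeName(name.substring(0, i), name.substring(i));
            }
        }
        throw new IllegalArgumentException("Not a language type name: " + name);
    }

    public static void main(String[] args) {
        LanguageTypeName type = parse("russianCYRILLIC");
        System.out.println(type.language + " / " + type.encoding); // russian / CYRILLIC
    }
}
```

A helper like this is useful only for display or sanity checks; the authoritative list of language types remains the [LanguageTypes] section of the server configuration file.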

The supported language types are listed in the [LanguageTypes] section of the server's configuration file (AutonomyIDOLServer.cfg), which is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<os>/IDOLserver/IDOL directory.

Note:

The path above is based on an upgraded Oracle WebLogic Portal domain. If you performed a clean installation of Oracle WebLogic Portal and installed Autonomy separately, the path in your environment might differ. In that case, substitute the path that applies to your environment.

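To use this approach, you need to add two properties to the repository configuration: fullTextSearchIndexLanguageType and fullTextSearchQueryLanguageType. For example, to use the French language with UTF8 encoding, add:

    fullTextSearchIndexLanguageType=frenchUTF8
    fullTextSearchQueryLanguageType=frenchUTF8

Generally, you set these properties to the same value. All queries need to use the language and encoding specified by fullTextSearchQueryLanguageType.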

For instructions on how to add a property to a repository, see "Adding Custom Properties" in the Oracle Fusion Middleware Content Management Guide for Oracle WebLogic Portal.

Note:

After you disconnect a repository or make any changes to repository properties, Portal Administration Console users must log out and log back in to view the changes.

6.3 Mixing Languages Within a Repository

If you mix data of multiple language types within a repository, you can use Automatic Language Detection. This approach provides the greatest flexibility for both repository content and search options.

Automatic Language Detection identifies the language and encoding of a document when it is indexed and provides the ability to query data by language, by encoding, or both. For example, you could specify that you want to find only French and Italian matches, regardless of encoding; or only Russian matches with UTF8 encoding; or all matches, regardless of language and encoding.
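The three example filters above can be modeled with a small in-memory sketch. This is a toy illustration in plain Java, not the Autonomy or WebLogic Portal API: the LanguageFilterSketch class, its Doc type, and the sample documents are all hypothetical, and real matching is performed by the Autonomy server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical and are not part of the
// Autonomy or WebLogic Portal APIs.
final class LanguageFilterSketch {

    // A document plus the language and encoding detected at index time.
    static final class Doc {
        final String name;
        final String language;
        final String encoding;

        Doc(String name, String language, String encoding) {
            this.name = name;
            this.language = language;
            this.encoding = encoding;
        }
    }

    // Hypothetical sample index contents.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("a.txt", "french", "UTF8"),
            new Doc("b.txt", "italian", "ASCII"),
            new Doc("c.txt", "russian", "UTF8"),
            new Doc("d.txt", "russian", "CYRILLIC"));
    }

    // A null languages list means "any language"; a null encoding means "any encoding".
    static List<String> match(List<Doc> docs, List<String> languages, String encoding) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : docs) {
            boolean languageOk = languages == null || languages.contains(doc.language);
            boolean encodingOk = encoding == null || encoding.equals(doc.encoding);
            if (languageOk && encodingOk) {
                hits.add(doc.name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        // French and Italian matches, regardless of encoding.
        System.out.println(match(docs, Arrays.asList("french", "italian"), null)); // [a.txt, b.txt]
        // Russian matches with UTF8 encoding only.
        System.out.println(match(docs, Arrays.asList("russian"), "UTF8")); // [c.txt]
        // All matches, regardless of language and encoding.
        System.out.println(match(docs, null, null).size()); // 4
    }
}
```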

You configure Automatic Language Detection on a per-repository basis. This means you could have three different repositories with different indexing and querying abilities: two repositories might use Automatic Language Detection and have a mixture of documents of type frenchUTF8, englishASCII, and russianCYRILLIC, while the third repository contains only italianUTF8 documents.

Caution:

When you configure Automatic Language Detection, any repository using the default configuration (one language and one encoding across all repositories using that server) is automatically configured to use Automatic Language Detection. If you do not want this behavior, specify the language type (language and encoding) for each of those repositories, as described in Section 6.2, "One Language per Repository."

6.3.1 Configuring Automatic Language Detection

When Automatic Language Detection is set, the server automatically identifies the language and encoding of a document when it is indexed. For more information about Automatic Language Detection, see the IDOL Server Administration Guide, published by Autonomy Corporation. Contact WebLogic Portal Customer Support to obtain a copy of this guide.

Note:

Enabling this feature may have an impact on the ability to search for existing content in Content Management repositories other than content defined as the DefaultLanguageType. This is because language reclassification can occur when this feature is enabled.

To configure Automatic Language Detection on a repository:

  1. Set AutoDetectLanguagesAtIndex to true in the [Server] section of the AutonomyIDOLServer.cfg file, which is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<os>/IDOLserver/IDOL directory.

    Note:

    The path above is based on an upgraded Oracle WebLogic Portal domain. If you performed a clean installation of Oracle WebLogic Portal and installed Autonomy separately, the path in your environment might differ. In that case, substitute the path that applies to your environment.

  2. Do not set a fullTextSearchIndexLanguageType property on the repository; remove if already set.

  3. Optionally, specify the default query language type by adding a property to the repository. For example:

    fullTextSearchQueryLanguageType=frenchUTF8

  4. Optionally, specify that results are returned across all languages, not just the language of the fullTextSearchQueryLanguageType, by adding the following property to the repository:

    fullTextSearchQueryAnyLanguage=true

  5. Re-index your repository content. For information on how to do this, see Section 8.2, "Re-Indexing WLP Repository Content."

    Note:

    During indexing, if the language type cannot be determined automatically, the DefaultLanguageType is used. This is a global server setting, not a repository setting.
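For step 1, the change to AutonomyIDOLServer.cfg might look like the following sketch. Only the setting named in the step is shown; any other entries already present in the [Server] section are assumed to stay unchanged.

```
[Server]
AutoDetectLanguagesAtIndex=true
```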

6.3.2 Creating Queries

Queries are very flexible; they can be in any language and encoding. For example, you can construct a query that returns results for Japanese documents using UTF-8, Shift_JIS, and EUC-JP encodings.

Use the following examples to specify the search results from your repositories. For additional information about these examples, see the Oracle Fusion Middleware Java API Reference for Oracle WebLogic Portal.

6.3.3 Query Text in Same Language and Any Encoding

If the query text is in the language and encoding defined by the fullTextSearchQueryLanguageType and you want results in the language of fullTextSearchQueryLanguageType regardless of the encoding, you do not need to create additional code.

6.3.4 Query Text in Same Language with Specific Encoding

If the query text is in the language and encoding defined by the fullTextSearchQueryLanguageType and you want the results in the same language as the fullTextSearchQueryLanguageType with a specific encoding:

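// Assumes context is the ContentContext used for the search.
AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language and encoding of the query text.
params.setLanguageType("englishASCII");
// Return only matches with UTF8 encoding.
params.setMatchEncoding("UTF8");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);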

6.3.5 Query Text in Another Language and Encoding

If the query text is in a language and encoding different from fullTextSearchQueryLanguageType, you can override the repository fullTextSearchQueryLanguageType in the ContentContext class. This returns results in the specified LanguageType language, regardless of encoding:

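// Assumes context is the ContentContext used for the search.
AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Overrides the repository's fullTextSearchQueryLanguageType for this query.
params.setLanguageType("englishASCII");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);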

6.3.6 Query Across All Languages

If the query text is in one language and encoding and you want to query across all languages:

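// Assumes context is the ContentContext used for the search.
AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language and encoding of the query text.
params.setLanguageType("englishASCII");
// Do not limit results to the language of the query text.
params.setAnyLanguage(true);
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);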

6.3.7 Query Multiple Specific Languages

If the query text is in one language and encoding and you want to query multiple specific languages:

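// Assumes context is the ContentContext used for the search.
AutonomyLanguageParameterSet params = new AutonomyLanguageParameterSet();
// Language and encoding of the query text.
params.setLanguageType("englishASCII");
params.setAnyLanguage(true);
// Language types to match, separated by '+'.
params.setMatchLanguageType("frenchASCII+germanUTF8");
context.setParameter(FullTextSearchLanguageParameterSet.
   QUERY_LANGUAGE_PARAMETER_SET_KEY, params);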
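When the list of languages to match is built at run time, the '+'-separated string used in the example above can be assembled with a small helper. The following is a minimal sketch in plain Java; the MatchLanguageTypes class is hypothetical and is not part of the WebLogic Portal API, and the '+' separator is taken only from the example above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the WebLogic Portal or Autonomy APIs.
final class MatchLanguageTypes {

    // Joins language type names with '+', rejecting empty input and
    // names that would corrupt the separator-delimited string.
    static String join(List<String> languageTypes) {
        if (languageTypes == null || languageTypes.isEmpty()) {
            throw new IllegalArgumentException("At least one language type is required");
        }
        for (String type : languageTypes) {
            if (type == null || type.isEmpty() || type.indexOf('+') >= 0) {
                throw new IllegalArgumentException("Invalid language type: " + type);
            }
        }
        return String.join("+", languageTypes);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("frenchASCII", "germanUTF8"))); // frenchASCII+germanUTF8
    }
}
```

The resulting string could then be passed to setMatchLanguageType in place of the literal used in the example above.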

6.4 Enterprise Search for Microsoft Word, Excel, and PowerPoint Files in Multibyte Languages

You can configure search and indexing for Microsoft Word (.doc), Excel (.xls), and PowerPoint (.ppt) files in Content Management communities. In these cases, you need to use the default configuration for an Autonomy server, that is, one language and one encoding across all repositories using that server. For more information, see Section 6.1, "One Language per Autonomy Server."

In addition to using the Autonomy server configuration, you need to set the encodings for indexing and searching on the file names as described in this section. Without these encoding settings, search cannot find file names that use multibyte encodings. These encodings are set in the following files:

  omnislave.cfg
  AutonomyIDOLServer.cfg

6.4.1 Settings in omnislave.cfg

You must specify the system's default encoding in the omnislave.cfg file. This file is located in the <WLPORTAL_HOME>/content-mgmt/thirdparty/autonomy-wlp10/<OS>/filters directory.

Note:

The path above is based on an upgraded Oracle WebLogic Portal domain. If you performed a clean installation of Oracle WebLogic Portal and installed Autonomy separately, the path in your environment might differ. In that case, substitute the path that applies to your environment.

To specify the encoding:

  1. In the omnislave.cfg file, remove any FileNameFromCharSet=<encoding> settings from any sections in which they appear.

  2. In the [Configuration] section, add the system's default encoding. For example:

    FileNameFromCharSet=SHIFTJIS
    

6.4.2 Settings in AutonomyIDOLServer.cfg

You must specify the DefaultLanguageType and DefaultEncoding settings in the AutonomyIDOLServer.cfg file. The DefaultEncoding must be the same encoding that is specified in omnislave.cfg, and the DefaultLanguageType must be the language type that corresponds to both that encoding and the language of the documents. For example, for the Japanese language with Shift-JIS encoding, you would specify:

[LanguageTypes]
DefaultLanguageType=japaneseSHIFT_JIS
DefaultEncoding=SHIFTJIS
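Because the two files must agree, a quick consistency check can catch a mismatch before you re-index. The following is a minimal sketch in plain Java that compares FileNameFromCharSet in omnislave.cfg with DefaultEncoding in AutonomyIDOLServer.cfg. The EncodingConsistencyCheck class is hypothetical, it takes the file contents as strings rather than reading from disk, and its simplified INI-style parsing (no comment handling, case-insensitive comparison of the two values) is an assumption rather than documented Autonomy behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the WebLogic Portal or Autonomy tooling.
final class EncodingConsistencyCheck {

    // Collects "key=value" lines that appear under the given [section] header.
    static Map<String, String> readSection(String cfgText, String section) {
        Map<String, String> values = new HashMap<>();
        String current = "";
        for (String raw : cfgText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.startsWith("[") && line.endsWith("]")) {
                current = line.substring(1, line.length() - 1);
            } else if (current.equals(section) && line.indexOf('=') > 0) {
                int eq = line.indexOf('=');
                values.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return values;
    }

    // True when both settings are present and name the same encoding.
    static boolean encodingsMatch(String omnislaveCfg, String idolCfg) {
        String fileNameCharSet =
            readSection(omnislaveCfg, "Configuration").get("FileNameFromCharSet");
        String defaultEncoding =
            readSection(idolCfg, "LanguageTypes").get("DefaultEncoding");
        return fileNameCharSet != null && fileNameCharSet.equalsIgnoreCase(defaultEncoding);
    }

    public static void main(String[] args) {
        String omnislave = "[Configuration]\nFileNameFromCharSet=SHIFTJIS\n";
        String idol = "[LanguageTypes]\nDefaultLanguageType=japaneseSHIFT_JIS\n"
            + "DefaultEncoding=SHIFTJIS\n";
        System.out.println(encodingsMatch(omnislave, idol)); // true
    }
}
```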