5
Understanding the Ultra Search Administration Tool

The Ultra Search administration tool lets you configure and schedule the Ultra Search crawler. This chapter contains the following topics:

Ultra Search Administration Tool
Logging On to Ultra Search
Logging On and Managing Instances as SSO Users
Instances Page
Crawler Page
Web Access Page
Attributes Page
Sources Page
Schedules Page
Queries Page
Users Page
Globalization Page

See Also:
The Oracle Ultra Search online help for detailed information about using the Ultra Search administration tool

Ultra Search Administration Tool

The administration tool is a Web application for configuring and scheduling the Ultra Search crawler. The administration tool is typically installed on the same machine as your Web server. You can access the administration tool from any browser in your intranet, directly as an Ultra Search database user, or as a single sign-on (SSO) user with a SSO server.

Note:

The Ultra Search administration tool and the Ultra Search query applications are part of the Ultra Search middle tier components module. However, the Ultra Search administration tool is independent from the Ultra Search query application. Therefore, they can be hosted on different machines to enhance security or scalability.

With the administration tool, you use to do the following:

Log on to Ultra Search
Create Ultra Search instances
Manage administrative users
Define data sources and assign them to data groups
Configure and schedule the Ultra Search crawler
Set query options
Translate search attributes and LOV and data group display names to different languages

See Also:
Chapter 2, "Installing and Configuring Ultra Search" for information about how to deploy the administration tool

Setting Crawler Parameters

To configure the Ultra Search crawler, you must do the following:

Set crawler parameters, such as the number of crawler threads. To do so, use the Crawler Page.
Set Web access parameters, such as authentication and the proxy server. To do so, use the Web Access Page.
Define crawler data sources. Data sources can be Web pages, database tables, files, email mailing lists, Oracle9i Application Server (9iAS) Portals, or user-defined data sources. You can assign one or more data sources to a crawler schedule. To define data sources, use the Sources Page. On this page, you can also set parameters for the source, such as domain inclusions or exclusions for Web sources or the display URL template or column for table sources.
Define a crawler synchronization schedule. The crawler uses the synchronization schedule to reconcile the Ultra Search index with current data source content. To define crawling schedules, use the Schedules Page.

Setting Query Options

Use query options to let users limit their searches. Searches can be limited to document attributes and data groups.

Attributes

Search attributes can be mapped to HTML metatags, table columns, document attributes, and email headers. Some attributes, such as author and description, are predefined and need no configuration. However, you can customize your own attributes. To set custom document attributes to expose to the query user, use the Attributes Page.

Data Groups

Data source groups are logical entities exposed to the search engine user. When entering a query, the search engine user is asked to select one or more data groups to search from. A data group consists of one or more data sources. To define data groups, use the Queries Page.

Online Help in Different Languages

Ultra Search provides context-sensitive online help, based on the language setting in the Users Page. If the translated help files are not installed on the local machine, then English online help files are used.

To download the latest online help files, visit the Oracle Technology Network (OTN). You must register online before using OTN; registration is free and can be done at http://technet.oracle.com/membership/index.htm.

If you already have a user name and password for OTN, then you can go directly to the documentation section of the OTN Web site at http://technet.oracle.com/docs/index.htm.

Logging On to Ultra Search

The following users can log on to the Ultra Search administration tool:

Users managed by the Oracle Internet Directory (OID). These are authenticated by the single-sign-on server (SSO)
Users existing in the database on which Ultra Search lives (non-SSO mode) [applicable in iAS]
Enterprise Manager IAS_ADMIN users [applicable in iAS]

To log on to the administration tool, point your Web browser to one of the following URLs:

For non-SSO mode: http://<hostname>:<port>/ultrasearch/admin/index.jsp
For SSO mode: http://<hostname>:<port>/ultrasearch/admin_sso/index.jsp

Immediately after installation, the only users able to create and manage instances are the following:

The WKSYS database user
The IAS_ADMIN Enterprise Manager user [applicable in iAS]
The PORTAL SSO user belonging to the default company [applicable in iAS]
The ORCLADMIN SSO user belonging to the default company [applicable in iAS]

After you are logged on as one of these special users, you can grant permission to other users, enabling them to create and manage Ultra Search instances. Using the Ultra Search administration tool, you can only grant and revoke Ultra Search related permissions to and from exiting users. To add or delete users, use the OID for single-sign-on users or Oracle Enterprise Manager for local database users.

Note:

The Ultra Search product database dictionary is installed in the WKSYS schema.

See Also:

Chapter 2, "Installing and Configuring Ultra Search"
"Instances Page" for more information about creating Ultra Search instances
"Users Page" for more information about granting permission to other users
"Logging On and Managing Instances as SSO Users" for more information about how Ultra Search handles SSO users

Logging On and Managing Instances as SSO Users

Note:

Single sign-on (SSO) is available only with the Oracle9i Application Server (9iAS) release. It is not available with the Oracle9i database release.

Logging On to Ultra Search through Oracle Portal

When single sign-on (SSO) users log in to the SSO-protected Ultra Search administration tool through the Oracle Portal administration page, one of the following occurs:

If the SSO user has been granted super-user privileges in Ultra Search, then the Ultra Search administration tool presents a pull-down list allowing the SSO user to select an instance to manage.
If the SSO user does not have super-user privileges, but has been explicitly granted permission to manage one or more Ultra Search instances, then the Ultra Search administration tool also presents a pull-down list allowing the SSO user to select an instance to manage.
If the SSO user does not have super-user privileges and does not have privileges to manage any Ultra Search instances, then the Ultra Search administration tool displays an error message indicating that the user has no privileges and that they should contact the appropriate authority to be granted privileges.

Granting privileges to SSO users

You might need to grant super-user privileges, or privileges for managing an Ultra Search instance, to an SSO user. This process is slightly different, depending on whether Oracle Portal is running in hosted mode or non-hosted mode, as described in the following section:

Note:

An SSO user is uniquely identified by Ultra Search with an SSO-nickname/subscriber-nickname combination.

In non-hosted mode, the subscriber-nickname is not required when granting privileges to an SSO user. This is because there is exactly one subscriber in Oracle Portal in non-hosted mode.

In hosted mode, the subscriber-nickname is required when granting privileges to an SSO user. This is because there can be more than one subscriber in Oracle Portal, and two or more users with the same SSO-nickname (for example, PORTAL) could be distinct SSO users distinguished by their subscriber-nickname. When running in hosted mode, also note the following:

When granting permissions for the default subscriber user, always specify "DEFAULT COMPANY" for the subscriber-nickname, even though the actual nickname could be different; for example, "ORACLE". The actual nickname is not recognized by Ultra Search.

When logging in to SSO as the default subscriber user, leave the subscriber nickname blank. Alternatively, enter "DEFAULT COMPANY" instead of the actual subscriber nickname; for example, "ORACLE" so that it is recognized by Ultra Search.

Note:

At any point after installation, an Oracle Portal script could be run to alter the running mode from non-hosted to hosted. Whenever this is performed, the Oracle Portal script invokes an Ultra Search script to inform Ultra Search of the change from non-hosted to hosted modes.

See Also:

Hosting Developer's Guide at http://otn.oracle.com/.

Instances Page

After successfully logging on to the Ultra Search administration tool, you find yourself on the Instances Page. This page manages all Ultra Search instances in the local database. In the top left corner of the page, there are tabs for creating, selecting, deleting, and editing instances.

Before you can use the administration tool to configure crawling and indexing, you must create an Ultra Search instance. An Ultra Search instance is identified with a name and has its own crawling schedules and index. Only users granted the WKADMIN role can create Ultra Search instances.

Creating an Instance

To create an instance, select the Create tab on the Instances Page. This takes you to another page with links for creating a regular instance (a master instance) and creating a read-only snapshot instance. Only Ultra Search super-users can create new instances.

Note:

If the search domains of Ultra Search instances overlap, then there could be crawling conflict for table data sources with logging enabled, email data sources, and some user-defined data sources.

Creating a Regular Instance

To create an instance, do the following:

Prepare the database user.

Every Ultra Search instance exists in one and only one database user/schema. To create a new Ultra Search instance, you first must have a database user that has been configured for Ultra Search and that does not already contain an Ultra Search instance.

The database user you create to house the Ultra Search instance should be assigned a dedicated self-contained tablespace. This is important if you plan to ever create snapshot instances of this instance. To do this, create a new tablespace. Then, create a new database user whose default tablespace is the one you just created.

See Also:

"Configuring the Oracle Server for Ultra Search" for information and instructions on configuring database users for Ultra Search
"Creating a Snapshot Instance"
Oracle9i Database Administrator's Guide for details on using transportable tablespaces

Follow instance creation in the Ultra Search administration tool.

From the main instance creation page, select the "Create instance" link, and provide the following information:
- Instance name
- Database schema: this is the user name from step 1.
- Schema password
You can also enter the following optional index preferences:
- Lexer
  
  Specify the name of the lexer you want to use for indexing. The default lexer is wksys.wk_lexer, as defined in the wk0pref.sql file. After the instance is created, the lexer can no longer be changed.
- Stoplist
  
  Specify the name of a stoplist you want to use during indexing. The default stoplist is wksys.wk_stoplist, as defined in the wk0pref.sql file. Try to avoid modifying the stoplist after the instance has been created.
- Storage
  
  Specify the name of the storage preference for the index of your instance. The default storage preference is wksys.wk_storage, as defined in the wk0pref.sql file. After the instance is created, the storage preference cannot be changed.
  See Also:
  
  Oracle Text Reference for more information on these creating and modifying lexers, stoplists, and storage
  
  "Managing Stoplists"

Creating a Snapshot Instance

A snapshot instance is a copy of another instance. Unlike a regular instance, a snapshot instance is read-only; it does not synchronize its index to the search domain. Also, after the master instances re-synchronizes to the search domain, the snapshot instance becomes out of date. At that point, you should delete the snapshot and create a new one.

Note:

The snapshot and its master instance cannot reside on the same database.

A snapshot instance is useful for the following:

Query Processing

Two Ultra Search instances can answer queries about the same search domain. Therefore, in a set amount of time, two instances can answer more queries about that domain than one instance. Because snapshot instances do not involve crawling and indexing, snapshot instance creation is fast and inexpensive. Thus, snapshot instances can improve scalability.
Backups

If the master instance gets corrupted, its snapshot can be transformed into a regular instance by editing the instance mode to updatable. Because the snapshot and its master instance cannot reside on the same database, a snapshot instance should be made updatable only to replace a corrupted master instance.

A snapshot instance does not inherit authentication from the master instance. Therefore, if you make a snapshot instance updatable, you must reenter any authentication information needed to crawl the search domain.

To create a snapshot instance, do the following:

Prepare the database user.

As with regular instances, snapshot instances require a database user that has been configured for Ultra Search and that does not already contain an Ultra Search instance.
Copy the data from the master instance.

This is done with the transportable tablespace mechanism, which does not allow renaming of tablespaces. Therefore, snapshot instances cannot be created on the same database as its master.

Identify the tablespace or the set of tablespaces that contain all the master instance data. Then, copy it, and plug it into the database user from step 1.
Follow snapshot instance creation in the Ultra Search administration tool.

From the main instance creation page, select the "Create read-only snapshot instance" link, and provide the following information:
- Snapshot instance name
- Snapshot schema name: this is the database user from step 1.
- Snapshot schema password
- Database link: this is the name of the database link to the database where the master instance lives.
- Master instance name

After providing this information, click Apply.

See Also:

"Configuring the Oracle Server for Ultra Search" for information and instructions on configuring database users for Ultra Search
Oracle9i Database Administrator's Guide for details on using transportable tablespaces

Selecting an Instance

You can have multiple Ultra Search instances. For example, an organization could have separate Ultra Search instances for its marketing, human resources, and development portals. The administration tool requires you to specify an instance before it lets you make any instance-specific changes.

To select an instance, do the following:

Click the Select tab on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.

Note:
Instances do not share data. Data sources, schedules, and indexes are specific to each instance.

Deleting an Instance

To delete an instance, do the following:

Click the Delete tab on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.

Note:
To delete an Ultra Search instance, the user must be assigned the WKADMIN role.

Editing an Instance

To edit an instance, click the Edit tab on the Instances Page. You can change the instance mode (make it instance updatable) or change the instance password.

Instance Mode

You can change the instance mode to updatable or read-only. Updatable instances synchronize themselves to the search domain on a set schedule, whereas read-only instances (snapshot instances) do not do any synchronization. To set the instance mode, select the box corresponding the to mode you want, and click Apply.

Schema Password

An Ultra Search instance must know the password of the database user in which it resides. The instance cannot get this information directly from the database. During instance creation, Oracle provides the database user password, and the instance caches this information.

If this database user password changes, then the password that the instance has cached must be updated. To do this, enter the new password and click Apply. After the new password is verified against the database, it replaces the cached password.

Crawler Page

The Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or email archives. Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.

With this page, you can do the following:

Settings

Crawler Threads

Specify the number of crawler threads to be spawned at run time.

Number of Processors

Specify the number of central processing units (CPUs) that exist on the server where the Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.

Automatic Language Detection

Not all documents retrieved by the Ultra Search crawler specify the language. For documents with no language specification, the Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.

The language recognizer is trained statistically using trigram data from documents in various languages (Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on Latin-1 alphabet and any language with a deterministic Unicode range of characters (Chinese, Japanese, Korean, and so on.).

The crawler determine the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table data source. If it cannot determine the language, then it takes the following steps:

If the language recognizer is not available or if it is unable to determine a language code, then the default language code is used
If the language recognizer is available, then the output from the recognizer is used.

This language code is populated in 'LANG' column of the wk$url and wk$doc tables. Multi-lexer is the only lexer used for Ultra Search. All document URLs are stored in wk$doc for indexing and wk$url for crawling.

Default Language

If automatic language detection is disabled, or when a Web document does not have a specified language, the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.

Note:

This default language is used only if the crawler cannot determine the document language during crawling. Set language preference in the Users Page.

You can select a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:

English
Brazilian portuguese
Danish
Dutch
French
German
Italian
Japanese
Korean
Portuguese
Simplified Chinese
Spanish
Swedish
Traditional Chinese

Crawling Depth

A Web document could contain links to other Web documents, which could contain more links. This setting lets you specify the maximum number of nested links the crawler will follow.

See Also:

Appendix A, "Tuning the Web Crawling Process" for more information on the importance of the crawling depth

Crawler Timeout Threshold

Specify in seconds a crawler timeout. The crawler timeout threshold is used to force a timeout when the crawler cannot access a Web page.

Default Character Set

Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.

Temporary Directory Location and Size

Specify a temporary directory and size. The crawler uses the temporary directory for intermittent storage during indexing. Specify the absolute path of the temporary directory. The size is the maximum temporary space in megabytes that will be used by the crawler.

The size of the temporary directory is important because it affects index fragmentation. The smaller the size, the more fragmented the index. As a result, the query will be slower, and index optimization needs to be performed more frequently. Increasing the directory size reduces index fragmentation, but it also reduces crawling throughput (total number of documents crawled each hour). This is because it takes longer to index a bigger temporary directory, and the crawler needs to wait for the indexing to complete before it can continue writing new documents to the directory.

Crawler Logging

Specify the following:

Level of detail: everything or only a summary
Crawler logfile directory
Crawler logfile language

The log file directory stores the crawler log files. The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler logfile language is the language the crawler uses to generate the log file.

Database Connect String

The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form of [hostname]:[port]:[sid] or in the form of a TNS keyword-value syntax; for example,

"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=5521)...))"

In a Real Application Clusters environment, the TNS keyword-value syntax should be used, because it allows connection to any node of the system. For example,

 "(DESCRIPTION=(LOAD_
BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))
(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001)))(CONNECT_DATA=(SERVICE_
NAME=sales.us.acme.com)))"

Remote Crawler Profiles

Use this page to view and edit remote crawler profiles. A remote crawler profile consists of all parameters needed to run the Ultra Search crawler on a remote machine other than the Ultra Search database. A remote crawler profile is identified by the hostname. The profile includes the cache, log, and mail directories that the remote crawler shares with the database machine.

To set these parameters, click Edit. Enter the shared directory paths as seen by the remote crawler. You must ensure that these directories are shared or mounted appropriately.

Crawler Statistics

Use this page to view the following crawler statistics:

Summary of Crawler Activity

This provides a general summary of crawler activity:

Aggregate crawler statistics
Total number of documents indexed
Crawler statistics by data source type

Detailed Crawler Statistics

This includes the following:

List of hosts crawled and indexed
Document distribution by depth
Document distribution by document type
Document distribution by data source type

Crawler Progress

This displays crawler progress for the past week. It shows the total number of documents that have been indexed for exactly one week prior to the current time. The Time column rounds the current time to the nearest hour.

Problematic URLs

This lists errors encountered during the crawling process. It also lists the number of URLs that cause each error.

Web Access Page

Use this page to set up basic authentication and proxies.

URL Authentication

The Ultra Search crawler provides basic authentication information to hosts that require it. Basic authentication is based on the model that the client must authenticate itself with a username and a password for each realm. A realm is a string that identifies a set of protected URLs on a Web server. Enter the host, realm, username, and password, and click Add.

Proxies

Specify a proxy server if the search space includes Web pages that reside outside your organization's firewall. Specifying a proxy server is optional. Currently, only the HTTP protocol is supported.

Note:

The crawler cannot use a proxy server that requires proxy authentication.

You can also set domain exceptions.

Attributes Page

When your indexed documents contain metadata, such as author and date information, you can let users refine their searches based on this information. For example, users can search for all documents where the author attribute has a certain value.

The list of values (LOV) for a document attribute can help specify a search query. An attribute value can have a display name for it. For example, the attribute country might use country code as the attribute value, but show the name of the country to the user. There could be multiple translations of the attribute display name.

To define a search attribute, use the Search Attributes subtab. Ultra Search provides some system-defined attributes, such as "Author" and "Description." You can also define your own.

After defining search attributes, you must map between document attributes and global search attributes for data sources. To do so, use the Mappings subtab.

Note:

Ultra Search provides a command-line tool to load metadata, such as search attribute LOVs and display names into an Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix E, "Loading Metadata into Ultra Search".

Search Attributes

Search attributes are attributes exposed to the query user. Ultra Search provides system-defined attributes, such as "Author" and "Description." Ultra Search maintains a global list of search attributes. You can add, edit, or delete search attributes. You can also click Manage LOV to change the list of values (LOV) for the search attribute. There are two categories of attribute LOVs: one is global across all data sources, the other is data source-specific.

To define your own attribute, enter the name of the attribute in the text box; select string, date, or number; and click Add.

You can add or delete LOV entry and display name for search attributes. Display name is optional. If display name is absent, then LOV entry is used in the query screen.

Note:

LOV is only represented as string type. If LOV is in date format, then you must use "DD-MM-YYYY" to enter the LOV.

Update Policy

To update the policy value, click the Manage LOV icon for any attribute.

A data source-specific LOV can be updated in three ways:

Manually
The crawler agent can automatically update the LOV during the crawling process
New LOV entries can be automatically added by inspecting attribute values of incoming documents

Caution:
If the update policy is agent-controlled, then the LOV and all translated values are erased in the next crawling.

Mappings

This section displays mapping information for user-defined sources. Mapping is done at the agent level, and document attributes are automatically mapped to search attributes with the same name initially. Document attributes and search attributes are mapped one-to-one. For each user-defined data source, you can edit which global search attribute the document attribute is mapped to.

For Web or table data sources, mappings are created manually when you create the data source. For user-defined data sources, mappings are automatically created on subsequent crawls.

Click Edit mappings to change this mapping.

Editing the existing mapping is costly, because the crawler must recrawl all documents for this data source. You should avoid this step, unless necessary.

Note:

After you define a search attribute mapping, you cannot remove that mapping.

Note:

There are no user-managed mappings for email sources. There are two predefined mappings for emails. The "From" field of an email is intrinsically mapped to the Ultra Search "Author" attribute. Likewise, the "Subject" field of an email is mapped to the Ultra Search "Subject" attribute. The abstract of the email message is mapped to the "Description" attribute.

Sources Page

A collection of documents is called a source. The data source is characterized by the properties of its location, such as a Web site or an email inbox. The Ultra Search crawler retrieves data from one or more data sources.

The different types of sources are:

Web Sources
Table Sources
Email Sources
File Sources
User-Defined Sources (requires a crawler agent)
Oracle9iAS Portal Sources
See Also:

"Schedules Page" to assign one or more data sources to a synchronization schedule

"Queries Page" to assign data sources to data groups to enable restrictive querying

You can create as many data sources as you want. The following section explains how to create and edit data sources.

Web Sources

A Web source represents HTML content on a specific Web site. Web sources differ from other data source types because they exist specifically to facilitate maintenance crawling of specific Web sites.

Creating Web Sources

To create a new Web source, do the following:

Specify a name for the Web source.
Override the default crawler settings for each Web source. This step is optional. You can change the crawling depth, the character set, the language, the number of crawler threads, and the crawler timeout threshold. You can also enable or disable robots exclusion, language detection, and the UrlRewriter. You can also edit those in Edit Web Sources. If you change any of the default settings, click Update.

Robots exclusion lets you control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.foobar.com/, it checks for http://www.foobar.com/robots.txt. If it finds it, the crawler analyzes its contents to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.

The URL Rewriter is a user-supplied java module for implementing the Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and ULR rewriting transforms the URL link. This transformation is necessary when access URLs are used.

The UrlRewriter provides the following possible outcomes for links:
- There is no change to the link. The crawler inserts it as it is.
- Discard the link. There is no insertion.
- A new display URL is returned, replacing the URL link for insertion.
- A display URL and an access URL are returned. The display URL may or may not be identical to the URL link.
The generated new "url link" is subject to all existing host, path, and mimetype inclusion and exclusion rules.

You must put the implemented rewriter class in a jar file and provide the class name and jar file name here.
See Also:

"Ultra Search URL Rewriter API"

"Display URL and Access URL"

"Crawler Page" for information on default languages
Enter a starting address. This is the URL for the crawler to begin crawling.
Set URL boundary rules to refine the crawling space. You can include or exclude hosts and URL paths. For example, an inclusion domain of oracle.com limits the Ultra Search crawler to hosts belonging to Oracle Corporation worldwide. (This is a suffix inclusion, so anything ending with oracle.com is crawled; however, http://www.oracle.com.tw is not crawled.) An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You cannot include or exclude port numbers. In other words, you cannot crawl www.acme.com:7777 but not www.acme.com:8888. Use UrlRewriter for this. Exclusion rules always override inclusion rules.
Specify the types of documents the Ultra Search crawler should process for this source. HTML and plain text are default document types that the crawler will always process.
Define, edit, or delete metatag mappings for your Web source. Metatags are descriptive tags in the HTML document header. One metatag can map to only one search attribute.

Table Sources

A table source represents content in a database table or view. The database table or view can reside in the Ultra Search database instance or in a remote database. Ultra Search accesses remote databases using database links.

See Also:

"Limitations With Database Links"

Creating Table Sources

To create a table source, click Create new table source, and follow these steps:

Specify a table source name, and the name of the database link, schema, and table. Click Locate table.
Specify settings for your table source, such as the default language and the primary key column. You can also specify the column where final content should be delivered, and the type of data stored in that column; for example, HTML, plain text, or binary. For information on default languages, see "Crawler Page".
Verify the information about your table source.
Decide whether or not to use the Ultra Search logging mechanism to optimize the crawling of table data sources. With this logging mechanism, only newly updated documents are revisited during the crawling process. You can enable logging for Oracle tables, enable logging for non-Oracle tables, or disable the logging mechanism. If you enable logging, then you are prompted to create a log table and log triggers. Oracle SQL statements are provided for Oracle tables. If you are using non-Oracle tables, then you must manually create a log table and log triggers. Follow the examples provided to create the log table and log triggers. After you have created the table, enter the table name in Log table name.
Map table columns to document attributes. Each table column can be mapped to exactly one document attribute. This lets the search engine seamlessly search data from the table source.
Specify the display URL template or column for the table source. This step is optional. Ultra Search uses a default text viewer for table data sources. If you specify display URL, then Ultra Search uses the Web URL defined to display the table data retrieved. If display URL column is available, then Ultra Search uses the column to get the URL to display the table data source content. You can also specify display URL templates in the following format: http://[hostname]:[port]/[path]?[parameter_name]=$(key1) where key1 is the corresponding table's primary key column.

The Table Column to Key Mappings section provides mapping information. Ultra Search supports table keys in STRING, NUMBER, or DATE type. If key1 is of NUMBER or DATE type, then you must specify the format model used by the Web site so that Oracle knows how to interpret the string. For example, the date format model for the string '11-Nov-1999' is 'DD-Mon-YYYY'. You can also map other table columns to Ultra Search attributes. Do not map the text column.

See Also:
Oracle9i SQL Reference for more on format models

Editing Table Sources

Click Edit to change the name of the table source; change, add, or delete table column and search attribute mappings; change the display URL template or column; and view values of the table source settings.

Table Sources Comprised of More Than One Table

If a table source has more than one table, then a view joining the relevant tables must be created. Ultra Search then uses this view as the table source. For example, two tables with a master-detail relationship can be joined through a select statement on the master table and a user-implemented PL/SQL function that concatenate the detail table rows.

Limitations With Database Links

The following restrictions apply to base tables or views on a remote database that are accessed over a database link by the crawler.

If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:
- A view comprised of a join of one or more tables
- A view based on a single table using a GROUP BY clause
The best way to know if a remote table or view can be safely crawled by Ultra Search is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus:
```
SELECT MIN(ROWID) FROM <table or view name>;
```
The base table or view cannot have text columns of type BFILE.

Email Sources

An email source derives its content from emails sent to a specific email address. When the Ultra Search crawler searches an email source, it collects all emails that have the specific email address in any of the "To:" or "Cc:" email header fields.

The most popular application of an email source is where an email source represents all emails sent to a mailing list. In such a scenario, multiple email sources are defined where each email source represents an email list.

To crawl email sources, you need an IMAP account. At present, the Ultra Search crawler can only crawl one IMAP account. Therefore, all emails to be crawled must be found in the inbox of that IMAP account. For example, in the case of mailing lists, the IMAP account should be subscribed to all desired mailing lists. All new postings to the mailing lists are sent to the IMAP email account and subsequently crawled. The Ultra Search crawler is IMAP4 compliant.

When the Ultra Search crawler retrieves an email message, it deletes the email message from the IMAP server. Then, it converts the email message content to HTML and temporarily stores that HTML in the cache directory for indexing. Next, the Ultra Search crawler stores all retrieved messages in a directory known as the archive directory. The email files stored in this directory are displayed to the search end-user when referenced by a query hit.

To crawl email sources, you must specify the username and password of the email account on the IMAP server. Also specify the IMAP server hostname and the archive directory.

Creating Email Sources

To create email sources, you must enter an email address and a description. The description can be viewed by all search end-users, so you should specify a short but meaningful name. When you create (register) an email source, the name you use is the email of the mailing list. If the emails are not sent to one of the registered mailing lists, then those emails are not crawled.

You can specify email address aliases for an email source. Specifying an alias for an email source causes all emails sent to the main email address, as well as the alias address, to be gathered by the crawler.

File Sources

A file source is the set of documents that can be accessed through the file protocol on the Ultra Search database machine or on a remote crawler machine.

To edit the name of a file source, click Edit.

Creating File Sources

To create a new file source, do the following:

Specify a name for the file source.
Designate files or directories to be crawled. If a URL represents a single file, then the Ultra Search crawler searches only that file. If a URL represents a directory, then the crawler recursively crawls all files and subdirectories in that directory.
Specify inclusion and exclusion paths to modify the crawling space associated with this file source. This step is optional. An inclusion path limits the crawling space. An exclusion path lets you further define the crawling space. If neither path is specified, then crawling is limited to the underlying file system access privileges.
Specify the types of documents the Ultra Search crawler should process for this file source. HTML and plain text are default document types that the crawler will always process.

Ultra Search displays file data sources in text format by default. However, if you specify display URL for the file data source, then Ultra Search uses the URL to display the file data source.

With display URL for file data sources, the URL uses network protocols, such as HTTP or HTTPS, to access the file data source. To generate display URL for the file data source, specify the prefix of the original file URL and the prefix of the display URL. Ultra Search replaces the prefix of the file URL with the prefix of the display URL.

For example, if your file URL is file:///home/archive/<sub_dir_name>/<file_name> and the display URL is https://host:7777/private/<sub_dir_name>/<file_name>, then you can specify the file URL prefix to file:///home/archive and the display URL prefix to https://host:7777/private.

User-Defined Sources

Ultra Search lets you define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, which contain their own databases and interfaces.

For each new data source type, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Ultra Search crawler, which enques it for later crawling.

See Also:

"Ultra Search Crawler Agent API"

To define a new data source, you first define a data source type to represent it. You define the type name, the crawler agent java class/jar file, and parameters to be used, such as starting address. After you define your data type, define a new data source by specifying parameter values.

Creating User-Defined Data Sources

To create a new user-defined data source, click Create new source. To create, edit, or delete data source types, click Manage types.

To create a user-defined data source type:

Specify data source type name, description, and crawler agent Java class file or jar file name. The crawler agent Java class path is predefined at installation time. The agent collects the list of document URLs and associated metadata from the proprietary document source and returns it to the Ultra Search crawler, which enqueues the information for later crawling. The agent class file or jar file must be located under $ORACLE_HOME/ultrasearch/lib/agent/.
Specify parameters for this data source type. If you add parameters, you need to enter the parameter name and a description. Also, you must decide whether to encrypt the parameter value.

Edit data source type information by changing the data source type name, description, crawler agent Java class/jar file name, or parameters.

To create a user-defined data source:

Specify a name, data source type, and default language for the data source. Each data source is created based on data source type definition. For information on default languages, see "Crawler Page".
Enter parameter values, such as starting point.
Specify mappings. This step is optional. Document attributes are automatically mapped directly to the search attribute with the same name during crawling. If you want document attributes to map to another search attribute, specify it here. The crawler picks up attributes that have been returned by the crawler agent or specified here.

Edit user-defined data sources by changing the name, type, default language, or starting address.

Oracle9iAS Portal Sources

Ultra Search supports the crawling and indexing of Oracle9i Application Server (9iAS) Portal installations. This enables searching across multiple portal installations. To crawl a 9iAS Portal, you must first register your portal with Ultra Search. To register your portal:

Select a name and portal URL base for the portal source. After it is created, the portal URL base is not updatable. For information on default languages, see "Crawler Page".
Click Register Portal. Ultra Search attempts to contact the Oracle 9iAS Portal instance and retrieve information about it.

After registering your portal, select the Oracle 9iAS Portal page groups you want to index. Each page group chosen is created as a 9iAS portal source.

You can edit the types of documents the Ultra Search crawler should process for a portal source. HTML and plain text are default document types that the crawler will always process. Edit document types by clicking the edit icon of the portal source after it has been created.

See Also:

The Oracle 9iAS Portal documentation.

Schedules Page

Use this page to schedule data synchronization and index optimization. Data synchronization means keeping the Ultra Search index up to date with all data sources. Index optimization means keeping the updated index optimized for best query performance.

See Also:

"Synchronizing Data Sources"

Data Synchronization

The tables on this page display information about synchronization schedules. A synchronization schedule has one or more data sources assigned to it. The synchronization schedule frequency specifies when the assigned data sources should be synchronized. Schedules are sorted first by name. Within a synchronization schedule, individual data sources are listed and can be sorted by source name or source type.

Creating Synchronization Schedules

To create a new schedule, click Create New Schedule and follow these steps:

Name the schedule
Select a schedule frequency and determine whether the schedule should automatically accept all URLs for indexing or examine URLs before indexing. You can also associate the schedule with a remote crawler profile.
Assign data sources to the schedule. After a data source has been assigned to a group, it cannot be assigned to other groups.

Updating Schedules

Update the indexing option in the Update Schedule page. If you decide to examine URLs before indexing for the schedule, then after you run the schedule, the schedule status is shown as "Indexing pending."

In data harvesting mode, you should begin crawling first. After crawling is done, click Examine URL to examine document URLs and status, remove unwanted documents, and start indexing. After you click Begin index, you see schedule status change from launching, executing, scheduled, and so on.

After you click the link for a specific host, you see list of document URLs that have been crawled for the host. You can delete document URLs in this section.

Editing Synchronization Schedules

After a synchronization schedule has been defined, you can do the following in the Synchronization Schedules List:

To assign the schedule to either a crawler that runs on the database host or a remote crawler that runs on a separate host, click Hostname.
To change its frequency, click the schedule interval text.
To alter its status, click Status.
To delete it, click Delete.

To edit its name, data source assignments, recrawl policy, or crawling mode, click Edit. When the crawler retrieves a document, it checks to see if it has changed. By default, if the document has not changed, the crawler does not process it. In certain situations, you might want to force the crawler to reprocess all documents. Click Edit to edit schedules in the following ways:

Update schedule name. This step is optional. To change the schedule name, specify a name for the schedule, and click Update schedule name.
Assign data sources to schedule. To assign a data source, select one or more available sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To undo assignments of a data source, select one or more scheduled sources and click <<.

Update crawler recrawl policy. You can update the recrawl policy to the following:

Process documents that have changed: This is maintenance crawling. Only documents that have changed are recrawled and indexed. For Web data sources, if there are new links in the updated document, then they are followed. For file data sources, new files are collected if its parent directory has changed.

Process all documents: The crawler recrawls the data source. For example, a user wants to crawl only text and HTML on a Web site. Later, the user also wants to crawl Microsoft word and Adobe PDF documents. The user must modify the document types for the source, edit the schedule to select "process all documents," then reexecute the schedule so that the crawler picks up PDF and doc document types for this data source. The crawler treats every document as if it has been changed, which means each document is fetched and processed again. For example, a new mimetype document is added, new host/path inclusion rules are added, or crawling depth increases (for example, from 5 to 7). Every document must be reparsed to pick up links that had been dropped before due to depth limit.

Note:

"Process all documents" does not help when the crawling scope has been narrowed. For example, if crawling depth was reduced from 7 to 5, the PDF mimetype was deleted, or a host inclusion was removed, then the user must remove the affected documents manually in a SQL*Plus session.

Update crawling mode. You can update the crawling mode to the following:
- Automatically accept all URLs for indexing: This mode crawls and indexes.
- Examine URLs before indexing: This mode crawls only.
- Index only: This mode indexes only.
The crawler behaves differently for the documents collected.

Crawling mode and recrawl policy can be combined for six different combinations. For example, "process all documents" and "index only" forces reindexing existing documents in this data source, while "process document that have changed" and "index only" re-indexes only changed documents.

Launching Synchronization Schedules

You can launch a synchronization schedule in the following ways:

Set a schedule frequency and wait for the predetermined launch time.

Execute it immediately. To do so, click Status, then Execute immediately.

Note:

Launching a synchronization schedule could take a very long time. If a schedule has been launched before, then the next time a schedule is launched, all URLs that belong to the data source(s) to be crawled by the schedule are copied over into a queue table. Depending on the number of URLs associated with that data source, the copy operation can potentially take a long time. The administration tool displays the schedule state as 'Launching' during the entire time.

Synchronization Status and Crawler Progress

Click the link in the status column to see the synchronization schedule status. To see the crawling progress for any data source associated with this schedule, click the statistics icon.

The crawling progress contains the following information:

Data source type
Data source name
Start time
Finish time
Elapsed time
Total indexing time
Total size of document data collected
Average document size
Average fetch throughput

It also contains the following statistics:

Documents to fetch
Documents fetched: This is the sum of "Document non-indexable", "Document conversion failure", and "Documents indexed"
Document fetch failures: This could be an HTTP server timeout or another HTTP server error
Documents rejected: The document is not within the URL boundary rule.
Documents discovered: This is the sum of "Documents to fetch", "Documents fetched", "Document fetch failures", and "Documents rejected"
Documents indexed
Documents non-indexable: This could be a file directory, a portal page that is a discovery node, or a robot metatag that says no index.
Document conversion failures: The binary file filter failed.

Index Optimization

To ensure fast query results, the Ultra Search crawler maintains an active index of all documents crawled over all data sources. This page lets you schedule when you would like the index to be optimized. The index should be optimized during hours of low usage.

Note:

Increasing the crawler temporary directory size can reduce index fragmentation.

Index Optimization Schedule

You can specify the index optimization schedule frequency. Be sure to specify all required data for the option that you select. You can optimize the index immediately, or you can enable the schedule.

Optimization Process Duration

Specify a maximum duration for the index optimization process. The actual time taken for optimization will not exceed this limit, but it could be shorter. Specifying a longer optimization time will result in a more optimized index. Alternatively, you can specify that the optimization continue until it is finished.

Queries Page

This section lets you specify query-related settings, such as data source groups, URL submission, relevancy boosting, and query statistics.

Data Groups

Data source groups are logical entities exposed to the search engine user. When entering a query, the user is asked to select one or more data groups to search from. Use this page to define these data groups.

A data group consists of one or more data sources. Data source can be assigned to multiple data groups. Data groups are sorted first by name. Within each data group, individual data sources are listed and can be sorted by source name or source type.

To create a new data source group, do the following:

Specify a name for the group.
Assign data sources to group. To assign a Web or table data source to this data group, select one or more available Web sources or table sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To unassign a Web or table data source, select one or more scheduled sources and click << .
Click Finish.

URL Submission

URL Submission Methods

URL submission lets query users submit URLs. These URLs are added to the seed URL list and included in the Ultra Search crawler search space. You can allow or disallow query users to submit URLs.

URL Boundary Rules Checking

URLs are submitted to a specific Web data source. URL boundary rules checking ensures that submitted URLs comply with the URL boundary rules of the web data source. You can allow or disallow URL boundary rules checking.

Relevancy Boosting

Relevancy boosting lets administrators override the search results and influence the order that documents are ranked in the query result list. This can be used to promote important documents to higher scores. It also makes them easier to find.

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.

Locate by Search

To boost a URL, first locate a URL by performing a search. You can specify a hostname to narrow the search. After you have located the URL, click Information to edit the query string and score for the document.

Manual URL Entry

If a document has not been crawled or indexed, then it cannot be found in a search. However, you can provide a URL and enter the relevancy boosting information with it. To do so, click Create, and enter the following:

Specify the document URL. You must assign the URL to a data source. This document is indexed the next time it is crawled.
Enter scores in the range of 1 to 100 for one or more query strings. When a user performs a search using the exact query string, the score applies for this URL.

The document is searchable after the document is loaded for the term. The document is also indexed the next time the schedule is run.

With manual URL entry, you can only assign URLs for Web data sources. Users will get an error message on this page if no Web data source is defined.

Note:

Ultra Search provides a command-line tool to load metadata, such as document relevance boosting, into an Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix E, "Loading Metadata into Ultra Search".

Query Statistics

Enabling Query Statistics

This section lets you enable or disable the collection of query statistics. The logging of query statistics reduces query performance. Therefore, Oracle recommends that you disable the collection of query statistics during regular operation.

Note:

After you enable query statistics, the table that stores statistics data is truncated every Sunday at 1:00 A.M.

Viewing Statistics

If query statistics is enabled, you can click one of the following categories:

Daily Summary of Query Statistics

This summarizes all query activity on a daily basis. The statistics gathered are:

Average query time: the average time taken over all queries
Number of queries: the total number of queries made in the day
Number of hits: the average number of results returned by each query

Top 50 Queries

This summarizes the 50 most frequent queries that occurred in the past 24 hours.

Query string: the query string
Average query time: the average time to return a result
Number of queries: the total number of queries made in the past 24 hours
Number of hits: the average number of results returned by each query
Frequency: the number of queries divided by total number of queries over all query strings
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries over all query strings

Top 50 Ineffective Queries

This summarizes the 50 most frequent queries that occurred in the past 24 hours. Each row in the table describes statistics for a particular query string.

Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries for that string

Top 50 Failed Queries

This summarizes the top 50 queries that failed over the past 24 hours. A failed query is one where the search engine end-user did not locate any query results.

The columns are:

Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Frequency: the percentage occurrence of a failed query
Cumulative frequency: the cumulative percentage occurrence of all failed queries

See Also:
Appendix B, "Tuning Query Performance"

Users Page

Use this page to manage Ultra Search administrative users. You can assign a user to manage an Ultra Search instance. You can also select a language preference.

Preferences

This section lets you set preference options for the Ultra Search administrator.

You can specify the date and time format. The pull-down menu lists the following languages:

English
Brazilian Portuguese
French
German
Italian
Japanese
Korean
Simplified Chinese
Spanish
Traditional Chinese

You can also select the number of rows to display on each page.

Super-Users

A user with super-user privileges can perform all administrative functions on all instances, including creating instances, dropping instances, and granting privileges. Only super-users can access this page.

To grant super-user administrative privileges to another user, specify the user name and type. Specify also whether the user should be allowed to grant super-user privileges to other users. Then click Add.

Privileges

Only instance owners, users that have been granted general administrative privileges on this instance, or super-users are allowed to access this page. Instance owners must have been granted the WKUSER role.

Granting general administrative privileges to a user allows that user to modify general settings for this instance. To do this, specify the user name and type. Specify also whether the user should be allowed to grant administrative privileges to other users. Then click Add.

To remove one ore more users from the list of administrators for this instance, select one or more usernames from the list of current administrators and click Remove.

General administrative privileges do not include the ability to:

Create an instance
Delete an instance

These privileges belong to super-users.

Globalization Page

Ultra Search lets you translate names to different languages. This page lets you enter multiple values for search attributes, list of values (LOV) display names, and data groups.

Search Attribute Name

This section lets you translate attribute display names to different languages. The pull-down menu lists the following languages:

English
Arabic
Brazilian Portuguese
Canadian French
Czech
Danish
Dutch
Finnish
French
German
Greek
Hebrew
Hungarian
Italian
Japanese
Korean
Latin American Spanish
Norwegian
Polish
Portuguese
Romanian
Russian
Simplified Chinese
Slovak
Spanish
Swedish
Thai
Traditional Chinese
Turkish

LOV Display Name

This section lets you translate data group names to different languages. Select a search attribute from the pull-down menu: author, description, mimetype, subject, or title. Select the LOV type, and then select the language from the pull-down menu. The pull-down menu lists the language options.

Data Group Name

This section lets you translate data group display names to different languages. The pull-down menu lists the language options.

5 Understanding the Ultra Search Administration Tool

Ultra Search Administration Tool

Setting Crawler Parameters

Setting Query Options

Attributes

Data Groups

Online Help in Different Languages

Logging On to Ultra Search

Logging On and Managing Instances as SSO Users

Logging On to Ultra Search through Oracle Portal

Granting privileges to SSO users

Instances Page

Creating an Instance

Creating a Regular Instance

Creating a Snapshot Instance

Selecting an Instance

Deleting an Instance

Editing an Instance

Instance Mode

Schema Password

Crawler Page

Settings

Crawler Threads

Number of Processors

Automatic Language Detection

Default Language

Crawling Depth

Crawler Timeout Threshold

Default Character Set

Crawler Logging

Database Connect String

Remote Crawler Profiles

Crawler Statistics

Summary of Crawler Activity

Detailed Crawler Statistics

Crawler Progress

Problematic URLs

Web Access Page

URL Authentication

Proxies

Attributes Page

Search Attributes

Update Policy

Mappings

Sources Page

Web Sources

Creating Web Sources

Table Sources

Creating Table Sources

Editing Table Sources

Table Sources Comprised of More Than One Table

Limitations With Database Links

Email Sources

Creating Email Sources

File Sources

Creating File Sources

User-Defined Sources

Creating User-Defined Data Sources

Oracle9iAS Portal Sources

Schedules Page

Data Synchronization

Creating Synchronization Schedules

Updating Schedules

Editing Synchronization Schedules

Launching Synchronization Schedules

Synchronization Status and Crawler Progress

Index Optimization

Index Optimization

Index Optimization Schedule

Optimization Process Duration

Queries Page

Data Groups

URL Submission

URL Submission Methods

URL Boundary Rules Checking

Relevancy Boosting

Locate by Search

Manual URL Entry

Query Statistics

Enabling Query Statistics

5
Understanding the Ultra Search Administration Tool