Understanding Verity Search Indexes
This section provides an overview of search indexes and discusses:
Types of indexes.
Components of the search architecture.
Search index limitations.
User search strategies.
Important! The Verity search engine is not supported for use with the PeopleSoft Search Framework. If you intend to configure the PeopleSoft Search Framework and any of the search features based on the PeopleSoft Search Framework, such as Application Search or Keyword Search, you must install Oracle Secure Enterprise Search.
Overview of Verity Search Indexes
A search index is a collection of files that is used during a search to quickly find documents of interest. The process of creating the search index is also called building the search index. The set of files that make up the index is a collection. This collection contains a list of words in the indexed documents, an internal documents table containing document field information, and logical pointers to the actual document files.
Fields contain metadata about a document. For example, Author and Title might be fields in an index. VdkVgwKey is a special field that identifies each document; its value is unique across all of the documents in the collection.
The document table is a relational table with one row for each document and columns of fields. Every index can be modified by defining a set of fields for it.
In PeopleSoft search implementations, every search index has a home location where all of the files pertaining to that index are located. This directory is the home directory of the index and is typically located at PS_CFG_HOME/data/search/INDEXNAME. Under this directory is another directory named for the database to which the application server or the Process Scheduler is connected. The actual collection files reside in this database directory.
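As a rough illustration of this layout, the following Python sketch assembles the collection directory from the parts named above. The function name and the sample values are hypothetical; only the PS_CFG_HOME/data/search/INDEXNAME pattern and the per-database subdirectory come from the description.

```python
from pathlib import PurePosixPath


def collection_dir(ps_cfg_home: str, index_name: str, database_name: str) -> PurePosixPath:
    """Return the directory that would hold the collection files.

    Layout assumed from the text: PS_CFG_HOME/data/search/INDEXNAME/<database>.
    """
    index_home = PurePosixPath(ps_cfg_home) / "data" / "search" / index_name
    return index_home / database_name


# Hypothetical values, for illustration only.
print(collection_dir("/opt/psft/cfg", "PARTS_IDX", "HRDEV"))
```

The intermediate value index_home corresponds to the home directory of the index; the returned path is the database directory beneath it.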
Every search index can be modified by changing the configuration files that are associated with the index. These configuration files are known as style files and reside in the style directory. A typical configuration of style files defines the fields for a particular index.
Types of Verity Indexes
PeopleSoft software supports three types of search indexes:
Record-based indexes.
HTTP spider indexes.
File system indexes.
Record-Based Indexes
Record-based indexes are used to create indexes of data in PeopleSoft tables. For example, if the PeopleSoft application has a catalog record that has two fields (Description and PartID), you can create a record-based index to index the contents of the Description and PartID fields. Once the index is created, you can use the PeopleCode search application programming interface (API) to search this index.
HTTP Spider Indexes
HTTP spider indexes index a web repository by accessing the documents from a web server. You typically specify the starting uniform resource locator (URL). Then the indexer walks through all documents by following the document links and indexes the documents in that repository. You can control to what depth the indexer should traverse.
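The depth control can be pictured as a breadth-first walk that stops following links after a set number of hops. The Python sketch below is only a model of that idea over an in-memory link map; it is not how the Verity indexer is implemented, and the function name, the link map, and the sample pages are all hypothetical.

```python
from collections import deque


def crawl(links: dict[str, list[str]], start_url: str, max_depth: int) -> list[str]:
    """Breadth-first walk from start_url, following links up to max_depth hops."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)  # a real indexer would fetch and index the document here
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: each page maps to the pages it links to.
site = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/c.html": ["/d.html"],
}
print(crawl(site, "/index.html", max_depth=1))
```

With a depth of 1, only the starting page and the pages it links to directly are visited; raising the depth to 2 would also reach /c.html.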
File System Indexes
File system indexes are similar to HTTP spider indexes, except that the repository that is indexed is a file system. You typically specify the path to the folder or directory. Then the indexer indexes all documents within that folder. HTTP spider indexes and file system indexes are sometimes collectively referred to as spider indexes. The indexer recognizes a wide variety of document formats, such as Word or Excel documents. The indexer skips any document that is in an unknown format.
Components of the Verity and PeopleSoft Search Architecture
PeopleSoft search architecture uses two main technologies: that provided by the PeopleSoft Portal and that provided by Verity. They are connected by the PeopleSoft search API.
PeopleSoft Portal Technologies
The PeopleSoft Portal search technology contains the following components:
Search input field.
Captures a query string that is entered by users in the portal header.
Passes the query string that is captured in the search input field to the Verity search engine.
Portal Registry API.
Applies security to filter the search results.
Contains a repository of content references that can be searched.
Search results page.
Formats and displays search results for the user.
Enables users to personalize search behavior and results.
Note: By default, the PeopleSoft search performs case-insensitive searches.
The basic items of the Verity architecture that are incorporated in the PeopleSoft Portal search architecture are:
Verity collection.
This is the set of files forming a search index. When a user performs a search, the search is conducted against the Verity collection. You can create and maintain your own collections with the Search Design and Search Administration PeopleTools.
BIF file.
This is an intermediate file that is created in the process of building a Verity collection. The BIF file is a text file that is used to specify the documents to be submitted to a collection. It contains a unique key, the document size (in bytes), field names and values, and the document location in the file system.
XML file.
This is another intermediate file that is created in the process of building a Verity collection. The XML file is a text file named indexname.xml that contains all of the information from the documents that is searchable but is not returned in the results list. This information is stored in zones. Zones are specific regions of a document to which searches can be limited.
Style files.
These files describe a set of configuration options that are used to create the indexes that are associated with a collection.
mkvdk.
This Verity command-line tool is used to:
Index a collection.
Insert new documents into a collection.
Perform simple maintenance tasks, like purging and deleting a collection.
Control indexing behavior and performance.
PeopleSoft Verity Search Utilities
To create and administer search indexes for use with PeopleSoft software, use the PeopleTools search utilities. These utilities enable you to administer indexes and to create file system, spider, and record-based indexes.
Building Verity Indexes
For both HTTP spider and file system indexes, options are available to include or exclude certain documents based on file types and Multipurpose Internet Mail Extensions (MIME) types. The index building procedure differs between record-based indexes and spider indexes. Typically, the index building procedure is carried out from an Application Engine job that is scheduled by using the Process Scheduler.
The steps for building record-based indexes are:
The data from the application tables is read and two files called indexname.xml and indexname.bif are created.
indexname.xml contains one XML record for each document that needs to be indexed. The XML record contains all of the data that needs to be indexed. indexname.bif contains field information, the VdkVgwKey of each document, and offsets to denote the start and end of each document in the XML file.
The XML file and the bulk insert file (BIF) are typically generated through PeopleCode and reside in the home location of the index. The Verity utility, mkvdk, is then called with the BIF file as its argument to build the index.
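The relationship between the two generated files can be sketched in Python as follows: each row becomes one XML record, and a companion entry remembers the document key and the byte offsets of that record. This is an illustration under stated assumptions, not the actual file syntax: the doc element name, the dictionary layout, and the function name are hypothetical, field names are assumed to be valid XML element names, and the real files are produced by PeopleCode.

```python
from xml.sax.saxutils import escape


def build_xml_and_offsets(rows: list[dict[str, str]], key_field: str):
    """Serialize rows to one XML byte string and record each document's span."""
    chunks = []
    entries = []
    position = 0
    for row in rows:
        body = "".join(f"<{name}>{escape(value)}</{name}>" for name, value in row.items())
        record = f"<doc>{body}</doc>\n".encode("utf-8")
        # Offsets are byte positions in the combined XML data (end is exclusive).
        entries.append({"key": row[key_field], "start": position, "end": position + len(record)})
        chunks.append(record)
        position += len(record)
    return b"".join(chunks), entries


# Hypothetical catalog rows using the Description and PartID example fields.
rows = [
    {"PartID": "P-100", "Description": "Hex bolt & nut"},
    {"PartID": "P-200", "Description": "Washer"},
]
xml_bytes, entries = build_xml_and_offsets(rows, key_field="PartID")
for entry in entries:
    print(entry["key"], xml_bytes[entry["start"]:entry["end"]].decode("utf-8").strip())
```

Recording byte offsets rather than character offsets keeps the spans valid when field values contain multibyte characters.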
The steps for building spider indexes are:
The Verity utility, vspider, is called.
The vspider utility takes a number of arguments, but the most important ones are the starting URL or directory to spider and the number of links to follow.
The vspider utility walks through all of the documents in the repository and builds the index.
Verity Search Index Limitations
Following are the PeopleSoft search index limitations:
Verity does not run on IBM z/OS.
Verity collections must reside on the PeopleSoft application server or be accessible from it through a shared drive.
Satisfying this requirement can take several forms, depending on the application server's operating system. On Microsoft Windows, this could be a network drive. On UNIX, this could be an NFS-mounted drive.
Verity collections are most efficient if you index large groups of data, rather than indexing one or two documents at a time.
Small updates degrade the index and require that you run the Verity cleanup utility.
Style files are located in the style subdirectory of the index.
To make style changes, apply them to the files in this directory.
You can have only one language per collection.
Additionally, a number of Verity search index features are limited to certain maximum values, as follows:
Wildcard auto-expansion is limited to 16,000 matches.
Number of collections
The maximum number of physical collections that can be searched at one time is 128.
Documents per collection
The maximum number of documents allowed per collection is 16 million, subject to disk space availability.
Fields per collection
The maximum number of fields allowed per collection is 250.
The maximum length of any field is 32 kilobytes.
Note: The number of characters that this limit corresponds to depends on the character set in use.
Field value length in bulk files
The maximum length of a field value in a bulk file is 32 kilobytes.
Note: The number of characters that this limit corresponds to depends on the character set in use.
Zones per document
The number of zones allowed per document is unlimited.
Characters in path
The maximum path size allowed is 256 characters.
Maximum documents with sort specification
The maximum number of documents that are returned when a sort specification is applied is 16,000.
Sort fields per search
The maximum number of fields that can be included in a sort specification is 16.
Refer to the Verity documentation for details about these features.
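As an aid to reading these limits, the Python sketch below checks a planned search or index definition against several of them before any indexing is attempted. The function and its parameters are hypothetical and not part of any PeopleTools or Verity API; it assumes that 32 kilobytes means 32 × 1024 bytes and that field length is measured on the encoded value, which is why the same text can pass in one character set and fail in another.

```python
MAX_COLLECTIONS_PER_SEARCH = 128
MAX_FIELDS_PER_COLLECTION = 250
MAX_FIELD_BYTES = 32 * 1024  # assumption: 32 kilobytes = 32,768 bytes
MAX_PATH_CHARS = 256
MAX_SORT_FIELDS = 16


def check_limits(collection_count: int, field_values: dict[str, str], path: str,
                 sort_fields: list[str], encoding: str = "utf-8") -> list[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    if collection_count > MAX_COLLECTIONS_PER_SEARCH:
        problems.append("too many collections in one search")
    if len(field_values) > MAX_FIELDS_PER_COLLECTION:
        problems.append("too many fields")
    for name, value in field_values.items():
        # Length is checked in bytes, so the result depends on the encoding.
        if len(value.encode(encoding)) > MAX_FIELD_BYTES:
            problems.append(f"field {name} exceeds 32 KB when encoded as {encoding}")
    if len(path) > MAX_PATH_CHARS:
        problems.append("path too long")
    if len(sort_fields) > MAX_SORT_FIELDS:
        problems.append("too many sort fields")
    return problems


# 20,000 accented characters occupy 40,000 bytes in UTF-8.
value = "é" * 20000
print(check_limits(1, {"Description": value}, "/data/search/IDX", ["Title"]))
```

The same 20,000 characters written with unaccented letters would occupy 20,000 bytes in UTF-8 and pass the check.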
User Search Strategies for Verity
A user submits a search request by entering a search string into the search input form field in the portal header. The “<form action=...>” element in the portal header is generated at runtime to link to a PeopleSoft Internet Architecture page, and JavaScript submits the form. The query string is passed to the Search API as a parameter named PortalSearchQuery to find matching results. Those results are filtered for security through PeopleCode by the Portal Registry API. The search results page echoes the original query string and displays a list of content references that match the request. If the user clicks the Go button but does not enter a search query, the search results page displays without any results.
The search results page performs the following steps:
Changes the case of the entered text to all uppercase characters.
By default, the Verity search engine searches for all mixed-case variations when a query string is entered in all lowercase or in all uppercase. However, search queries that are entered in mixed case automatically become case sensitive. (For example, a query on Apple is treated as case sensitive and finds only the precise string Apple, while a query on apple finds APPLE, Apple, and apple.) The portal makes one important change: it converts the query string to all uppercase, which prevents users from executing truly case-sensitive searches. This avoids situations where mixed-case searches would otherwise return no results. On the search results page, however, the original case is echoed back to the user.
Formats the query string to pass to the Search API.
This includes filtering out expired and hidden content references, and content references that are not valid yet.
Calls the Search API.
This returns the query results.
Calls the Portal Registry API.
This is done to apply security filtering to the results. Security is applied in PeopleCode by checking the Authorized property.
Formats and displays search results.
This completes the user's search request.
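The case handling in the first step can be modeled with a small Python sketch. It is a simplification under stated assumptions: matching is reduced to whole-string comparison, and both function names are hypothetical rather than part of the Search API or the Verity engine.

```python
def verity_style_match(query: str, text: str) -> bool:
    """Model the described rule: single-case queries ignore case, mixed-case queries do not."""
    if query.islower() or query.isupper():
        return query.lower() == text.lower()
    return query == text


def portal_normalize(query: str) -> str:
    """Model the portal's change: the query is converted to all uppercase."""
    return query.upper()


candidates = ["APPLE", "Apple", "apple"]

# A mixed-case query sent unchanged would match only the exact string.
print([t for t in candidates if verity_style_match("Apple", t)])

# After the portal converts the query to uppercase, every variation matches.
print([t for t in candidates if verity_style_match(portal_normalize("Apple"), t)])
```

The first line of output contains only Apple, while the second contains all three variations, which is the effect the uppercase conversion is meant to achieve.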
Note: End users must enter a double backslash (\\) when they need to submit a search request containing a backslash in the text. Using a single backslash (\) causes undesired results.
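Where query text is prepared by a program rather than typed by a user, the same doubling could be applied before submission. The helper below is a hypothetical Python sketch of that idea; the source states only that end users must type the double backslash themselves.

```python
def escape_backslashes(text: str) -> str:
    """Double every backslash so that each one is submitted as \\\\ rather than \\."""
    return text.replace("\\", "\\\\")


# Hypothetical search text containing backslashes.
print(escape_backslashes(r"C:\temp\report.doc"))
```

Text without backslashes passes through unchanged, so the helper can be applied to every query.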