Before users can execute searches, they need a database of searchable data against which they can target their searches. To do this, you create a database, called a collection, that indexes and stores information about the documents such as their content and file properties.
Searches require collections of files upon which to perform their searches. Once the documents are indexed, their contents and file properties, such as their titles, creation dates, and authors, are available for searching.
You can add documents to a collection or delete documents from it, and you can optimize, update, and manage your collections as needed.
Note. Search cannot work if the Web Publishing collection does not yet exist or has been deleted. If search does not work, restart the server with the web publishing
function turned on (the default), and try searching again.
This section includes the following topics:
About Collections
When your server administrator indexes all or some of a server's documents, information about the documents is stored in a collection. Collections contain such information as the format of the documents, the language they are in, their searchable attributes, the number of documents in the collection, the collection's status, and a brief description of the collection. For more details, see Displaying Collection Contents.
When you create a collection, you indicate the type of files that it contains: HTML, ASCII, news, email, or PDF.
You can index all the files in a directory or only those with a specific extension, for example, all the HTML or PDF documents.
A collection has records with information about each document that has been indexed. If the document is deleted from the collection, only the collection's entry for that document is removed. The original document is not deleted.
When you have multiple server instances, the collection you create is only associated with the server instance on which the collection was created. Therefore, users can only search collections for that server instance.
About Collection Attributes
Certain file formats have a default set of attributes that are indexed for files of that type, as shown in
Table 16.2.
By default, HTML collections have Title and SourceType attributes, but they can be indexed to permit searching and sorting by up to 30 file attributes tagged with the HTML <META> tag. You can change the maximum settings for file attributes in webpub.conf, as discussed in Adjusting the Maximum Number of Attributes.
For example, a document could have these lines of HTML code:
If this document was indexed with its META tags extracted, you could search it for specific values in the writer or product fields. For example, you could enter this query: Writer <contains> Hunter or Song <contains> Blue.
Note. Any attribute values in META-tagged fields are text strings only, which means
that dates and numbers are sorted as text, not as dates or numbers. Also, illegal
HTML characters in a META-tagged attribute are replaced with a hyphen. You
can use the Add Custom Property window (choose Web Publishing and click
the Add Custom Property link) to redefine the text-formatted dates and
numbers so that you can perform searches based on actual dates and numbers
for data in the Web Publishing collection.
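The following is a minimal sketch of what extracting META-tagged attributes amounts to. It is not the server's indexer. The sample document, the attribute names, and the helper names (MetaExtractor, extract_meta) are hypothetical and chosen only for illustration. It also shows why dates held as text strings sort as text.

```python
from html.parser import HTMLParser

MAX_META_ATTRIBUTES = 30  # default limit described in the text


class MetaExtractor(HTMLParser):
    """Collects NAME/CONTENT pairs from <META> tags as plain text attributes."""

    def __init__(self, limit=MAX_META_ATTRIBUTES):
        super().__init__()
        self.limit = limit
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        fields = dict(attrs)
        name, content = fields.get("name"), fields.get("content")
        if not name or content is None:
            return
        # Ignore attributes beyond the configured maximum.
        if name in self.attributes or len(self.attributes) < self.limit:
            self.attributes[name] = content


def extract_meta(html_text):
    """Return a dict of META attribute names to their text values."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.attributes


# Hypothetical document, not taken from the original manual.
SAMPLE = """<html><head>
<title>Sample</title>
<META NAME="Writer" CONTENT="J. Hunter">
<META NAME="Revised" CONTENT="10/2/98">
</head><body>Some text.</body></html>"""

attrs = extract_meta(SAMPLE)
print(attrs["Writer"])

# Values are text, so dates sort character by character, not chronologically.
print(sorted(["9/1/98", "10/2/98"]))
```

In this sketch the second print lists 10/2/98 before 9/1/98, because the character "1" sorts before "9". This is the text-sorting behavior that the note above describes.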
Creating a New Collection
You can create a collection that indexes the content of all or some of the files in a directory. You can define collections that contain only one kind of file or you can create a collection of documents in various formats that are automatically converted to HTML during indexing. When you define a multiple format collection (with the auto-convert option), the indexer first converts the documents into HTML and then indexes the contents of the HTML documents. The converted HTML documents are put into the html_doc directory in the server's search collections folder.
You can have a maximum of 12 collections on your server. On any server that uses web publishing, no more than 10 of these can be user-defined collections. If you want to add another collection after you reach the limit, you must first remove one of your existing collections (choose Search and click the Maintain Collection link). Do not remove the web publishing collection if one exists for your server.
You can only have entries for a maximum of 16 million documents in your collections. A document that is indexed in multiple collections counts as multiple documents. It is best to create new collections of over 10,000 documents at low-traffic times, or the indexing operation may affect your system's performance.
Note. You need to have at least 3MB of available disk space on your system to create
a collection. For information on how you can restrict the size of the index files,
see Restricting Your Index File Size.
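As an illustration of the counting rule above, this short sketch totals the index entries across collections, so that a document indexed in two collections counts twice. The collection names, file names, and function names are hypothetical. The server keeps its own count, so treat this only as a way to reason about the limit.

```python
MAX_DOCUMENT_ENTRIES = 16_000_000  # limit stated in the text


def total_entries(collections):
    """Sum entries across collections; a document in N collections counts N times."""
    return sum(len(docs) for docs in collections.values())


def can_index(collections, new_docs):
    """Return True if adding new_docs keeps the total within the limit."""
    return total_entries(collections) + len(new_docs) <= MAX_DOCUMENT_ENTRIES


# Hypothetical collections: b.html is indexed in both, so it counts twice.
collections = {
    "manuals": {"a.html", "b.html"},
    "faq": {"b.html", "c.html"},
}

print(total_entries(collections))
print(can_index(collections, ["d.html"]))
```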
To create a new collection, perform the following steps:
From the Server Manager, choose Search.
Click the New Collection link.
The web server displays the Create a Collection window. The Directory to
Index field displays the currently defined document directory and provides
a drop-down list of all the additional document directories defined for the server. For
more information about additional document directories, see Mapping URLs.
You can select any of the items in the drop-down list as a starting point for finding the directory you want to index.
If you want to index a different subdirectory, click the View button to see a list of resources.
You can index any directory that is listed or you can view the subdirectories in a listed directory and index one of those instead. Once you click the index link for a directory, you return to the Create Collection window and the directory name appears in the Directory to Index field.
You can index all HTML files in the chosen directory by leaving the default *.html pattern in the Documents matching field or you can define your own wildcard expression to restrict indexing to documents that match
that pattern.
For example, you could enter *.html to only index the content in
documents with the .html extension, or you could use either of these
patterns (complete with parentheses) to index all HTML documents:
(*.htm|*.html) or *(.htm|.html)
You can define multiple wildcards in an expression. For details of the
syntax for wildcard patterns, see Using Wildcards.
Note. You cannot index a file that includes a semi-colon (;) in its name. You must
rename such files before you can index them.
To index the subdirectories within the specified directory, click Include Subdirectories.
Type a name for your collection in the Collection Name field.
The collection name is used for collection maintenance. This is the physical
file name for the file, so follow the standard directory-naming conventions
for your operating system. You can use any characters up to a maximum of
128 characters. Spaces are converted to underscores.
Note. Do not use accented characters in the collection name. If you need
accented characters, exclude the accents from the collection name, but use
accented characters in the label. The label is what is displayed to the user
from the search interface.
Type a user-defined name for your collection in the optional Collection Label field.
This name is what users see when they use the text search interface. Make
your collection's label as descriptive and relevant as possible. You can use
any characters except single or double quotation marks, up to a maximum
of 128 characters.
Type a description for your collection (up to a maximum of 1024 characters) in the optional
Description field.
This description is displayed in the collection contents page.
Select the type of files the collection is to contain: ASCII, HTML, news, email, or PDF.
The kind of file format you choose indicates which default attributes are
used in the collection and which, if any, automatic HTML conversion of the
content is done as part of indexing. For information about the attributes for
each format, see Table 16.2 and About Collection Attributes.
If you choose HTML as the file type and also try to index non-HTML files,
the server creates the collection with the HTML set of default attributes and
does not attempt to convert any non-HTML file it indexes. If you index
HTML files into an ASCII collection, even the HTML markup tags are
indexed as part of the file's contents and when you display the files, the
contents are displayed as raw text. Regardless of the file type chosen, the
content of the file is always indexed.
Complex PDF files, such as those that are password protected or that
contain graphical navigation elements, cannot be correctly converted when
they are indexed as part of a multi-format collection. The file data converts
correctly when the files are part of a PDF-only collection. Graphic elements are
not converted.
Select whether or not to extract META-tagged attributes from HTML files during indexing.
If you extract these attributes, you can search on their values. You can
index on a maximum of thirty (30) different user-defined META tags in a
document. You can only use this option for HTML collections.
Select the collection's language from the drop-down list.
The default is English, labeled "English (ISO-8859-1)." For more information
on character sets, see Managing Server Content.
Click OK to create a new collection.
Note. Once you begin indexing a collection, you cannot stop the process until either
the indexing is complete or you reboot the system. Shutting down your server
does not kill the process.
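The naming rules from the Collection Name and Collection Label steps above can be sketched as follows. This is an assumption-laden illustration, not the server's own logic: the function names are hypothetical, and stripping accents with Unicode decomposition is simply one way to follow the advice to leave accents out of the collection name while keeping them in the label.

```python
import unicodedata

MAX_NAME_LENGTH = 128   # limit stated for the collection name
MAX_LABEL_LENGTH = 128  # limit stated for the collection label


def make_collection_name(text):
    """Derive a collection name: no accents, spaces become underscores."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    name = unaccented.strip().replace(" ", "_")
    if not name:
        raise ValueError("collection name is empty")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("collection name exceeds 128 characters")
    return name


def check_collection_label(label):
    """Validate a label: accents are fine, quotation marks are not."""
    if "'" in label or '"' in label:
        raise ValueError("label must not contain quotation marks")
    if len(label) > MAX_LABEL_LENGTH:
        raise ValueError("label exceeds 128 characters")
    return label


label = check_collection_label("Résumés français")
name = make_collection_name(label)
print(name, "/", label)
```

Here the accented label is kept for display in the search interface, and the derived name is used for collection maintenance.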
Configuring a Collection
After you have initially created a collection, you can modify some of the initial settings for the collection. This data resides in the collection information file, dblist.ini, and when you reconfigure a collection, the dblist.ini file is updated to reflect your changes. For more information about the configuration files, see Configuring Manually. You can revise the description, change its label, define a different URL for its documents, and define how to indicate highlighting in displayed documents, which pattern files to use, and how to format dates.
Note. This window allows you to modify some of the settings for the web publishing default collection, web_htm, because you are not changing actual collection data. Avoid making unnecessary changes to this collection's settings.
To configure a collection, perform the following steps:
From the Server Manager, choose Search.
Click the Configure Collection link.
The web server displays the Configure Collection window.
In the optional Description field, you can type a description for your collection up to a
maximum of 1024 characters.
In the optional Collection Label field, you can type a user-defined name for your
collection.
This is what users see when they use the text search interface. Make your
collection's label as descriptive and relevant as possible. You can use any
characters except single or double quotation marks, up to a maximum of
128 characters.
In the URL for Documents field, you can type in the new URL mapping for the collection's
documents if that has changed.
That is, if you originally indexed the directory of files that corresponded to
those defined by the URL mapping /publisher/help, and you have
changed that mapping to the simpler /helpFiles, you would replace the
URL of /publisher/help with the /helpFiles in this field. For more
information about additional document directories, see Mapping URLs.
In the Highlight Begin and Highlight End fields, you can type in the HTML tagging you
want the server to use when highlighting a search query word or phrase in a displayed document.
The default is to use bold, with the <b> and </b> tags, but you can add to
this or change it. For example, you could add <blink><FONT COLOR="#FF0000">
and the corresponding </FONT></blink> to highlight with
blinking bold red text.
You can define different default pattern files for displaying the search results: how the search
result's header, footer, and list entry line are formatted, respectively.
Initially, the pattern files are in the
server_root\plugins\search\ui\text directory.
In the Result Pattern File field, you can enter the name of the pattern file you want to
use when displaying a single highlighted document from the list of search results.
In the Date Format field, you can specify how you want input dates to be interpreted when
using this collection: MM/DD/YY, DD/MM/YY, or YY/MM/DD.
Click OK to change the collection configuration.
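The Date Format setting matters because the same input can mean different dates. The sketch below shows the ambiguity using Python's standard date parsing. The mapping table and function name are hypothetical, and the two-digit year handling is Python's convention, which may differ from the server's.

```python
from datetime import date, datetime

# Assumed mapping from the three formats named in the text to strptime codes.
DATE_FORMATS = {
    "MM/DD/YY": "%m/%d/%y",
    "DD/MM/YY": "%d/%m/%y",
    "YY/MM/DD": "%y/%m/%d",
}


def parse_input_date(text, collection_format):
    """Interpret an input date according to the collection's date format."""
    return datetime.strptime(text, DATE_FORMATS[collection_format]).date()


# The same input string yields different dates under different settings.
as_us = parse_input_date("03/04/98", "MM/DD/YY")
as_eu = parse_input_date("03/04/98", "DD/MM/YY")
as_iso_like = parse_input_date("98/04/03", "YY/MM/DD")

print(as_us, as_eu, as_iso_like)
print(as_us == date(1998, 3, 4), as_eu == date(1998, 4, 3))
```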
Updating a Collection
After you have initially created a collection, you may want to add or remove files. If you are adding documents, the files' contents are indexed (and converted if necessary) when their entries are added to the collection. If you are removing documents, the entries for the files are removed from the collection along with their metadata. This function does not affect the original documents, only their entries in the collection.
Note. If you selected the Extract Metatags option when you created this collection,
then the META-tagged HTML attributes are indexed whenever you add new
documents to this collection.
To update a collection, perform the following steps:
From the Server Manager, choose Search.
Click the Update Collection link.
The web server displays the Update Collection window.
Select the collection you want to update from the drop-down list.
The list of documents in the center of the form shows you what documents
have index entries in the currently selected collection. The list holds 100
records, and the Prev and Next buttons get the previous (or next) set of 100
files for collections that have more than 100 files in them.
In the Documents Matching field, you can type in a single filename or you can use wildcards
to specify the type of files you want added to or removed from the collection.
If you enter a wildcard such as *.html, only files with this extension are
affected. You can indicate files within a subdirectory by typing in the
pathname as it appears in the list of files. For example, you could delete all
the HTML files in the /frenchDocs directory by typing in (no slash before
the directory name): frenchDocs/*.html
Note. Be careful how you construct wildcard expressions. For example, if
you type in index.html, you can add or remove the index file from the
current collection. If instead you type in the expression */index.html,
you can add or remove all index.html files in the collection.
Select whether to index and add all matching documents from the subdirectories of the document
directory that was originally defined for the collection.
That is, if the collection originally indexed the /publisher directory, this
option looks for documents matching the new pattern within all the
subdirectories within /publisher. This does not apply for removing
documents.
Click AddDocs to add the indicated files and subdirectories.
Click RemoveDocs to remove the indicated files.
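To see how the wildcard expressions discussed above select files, here is a simplified sketch that translates a pattern into a regular expression. It handles only the asterisk and the parenthesized alternatives shown in this section, and it assumes that an asterisk can match any run of characters, including a slash. The full syntax is described in Using Wildcards. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a simplified wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")       # assumption: * matches any characters
        elif ch in "(|)":
            parts.append(ch)         # grouping and alternatives pass through
        else:
            parts.append(re.escape(ch))
    return re.compile("(?:" + "".join(parts) + ")")


def matches(pattern, path):
    """Return True if the whole path matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(path) is not None


# Paths are written as they appear in the list of files (no leading slash).
files = ["index.html", "frenchDocs/index.html", "frenchDocs/intro.html", "notes.txt"]

for pattern in ["index.html", "*/index.html", "frenchDocs/*.html", "*(.htm|.html)"]:
    print(pattern, "->", [f for f in files if matches(pattern, f)])
```

Under these assumptions, index.html selects only the top-level index file, while */index.html selects the index files inside directories.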
Maintaining a Collection
Periodically, you may want to maintain your collections. With normal usage, these tasks may not be necessary, but if you do a great deal of indexing and updating of collections, you may want to use some of these functions occasionally. You can perform the following collection management tasks:
Note. Do not use your local file manager to remove collections, especially the web publishing collections. If you do, any search that you run before restarting the server fails, even if it does not use the web publishing collection. When you restart the server, a new web publishing collection is created automatically, and searches can run again.
To perform any of the collection management tasks, use The Maintain Collection Page in the Server Manager.
Scheduling Regular Maintenance
You can schedule collection maintenance at regular intervals. You can set up separate maintenance schedules for optimizing and reindexing. With normal usage, these tasks may not be necessary, but if you do a great deal of indexing and updating of collections, you may want to use some of these functions occasionally. For example, some very active web sites may require frequent reindexing if new documents are added on a daily basis.
A common combination of tasks is to set up a pair of regularly scheduled reindex and update operations to clean out deleted entries and to add entries for new documents that match your collection criteria.
You can optimize a collection to improve performance if you frequently add, delete, or update documents or directories in your collections. An analogy is defragmenting your hard drive. Optimizing is not done automatically, so you must manually optimize after you reindex or update a collection. One situation when you might want to optimize a collection is just before publishing it to another site or before putting it onto a read-only CD-ROM.
You can reindex a collection, which locates each file that has an entry in the collection and reindexes its attributes and contents, extracting the META-tagged attributes if that option was selected when the files were originally indexed into the collection. This does not add entries for new documents but cleans up the collection by removing entries to files that have been deleted.
You can update a collection, by entering new indexing criteria for the collection, say *.html, which adds any new documents that match the criteria.
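The difference between reindexing and updating can be modeled with sets of file paths. This is only a conceptual sketch: the function names and file names are hypothetical, and Python's fnmatch stands in for the server's wildcard matching.

```python
import fnmatch


def reindex(entries, files_on_disk):
    """Drop entries whose files no longer exist; never adds new entries."""
    return {path for path in entries if path in files_on_disk}


def update(entries, files_on_disk, pattern):
    """Add entries for files that match the criteria; removes nothing."""
    matching = {p for p in files_on_disk if fnmatch.fnmatchcase(p, pattern)}
    return entries | matching


entries = {"a.html", "old.html"}              # old.html was deleted from disk
disk = {"a.html", "new.html", "notes.txt"}    # new.html was added later

print(sorted(reindex(entries, disk)))
print(sorted(update(entries, disk, "*.html")))

# Running both cleans out the stale entry and picks up the new document.
print(sorted(update(reindex(entries, disk), disk, "*.html")))
```

This is why the two operations are commonly scheduled as a pair: neither one alone leaves the collection matching what is on disk.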
To optimize, reindex, or update your collection, perform the following steps:
From the Server Manager, choose Search.
Click the Schedule Collection Maintenance link.
The web server displays the Schedule Collection Maintenance window.
Choose a collection from the drop-down list.
This lists all the collections that you have created.
Choose an action from the drop-down list: Reindex, Optimize, or Update.
You can set up different schedules for different operations on the same
collection.
If you choose to update your collection, two extra fields are displayed for entering the document
matching criteria and for including documents found in subdirectories that match your criteria.
In the Schedule Time field, type in the time of day when you want the scheduled maintenance
to take place.
Use the 24-hour (military) format, HH:MM. HH must be less than 24 and MM must be
less than 60. You must enter a time.
In the section labeled Schedule Day(s) of the Week, check one or more of the day checkboxes.
You can select all days. You must select at least one day.
Click OK to schedule the maintenance.
For Unix/Linux users, to make your newly scheduled maintenance take effect, you must restart the ns-cron process from the Administration Server.
To restart the ns-cron process, perform the following steps:
From the Administration Server, choose Global Settings.
Click the Cron Control link.
If ns-cron is already on, click Restart
to restart it. If ns-cron is not on, click Start to start it up.
In either case, your regularly scheduled maintenance will now be able to
take place.
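The scheduling rules described above (a required HH:MM time with HH below 24 and MM below 60, and at least one day selected) can be expressed as a small validation sketch. The function name, day abbreviations, and return shape are hypothetical. The form in the Server Manager performs its own checks.

```python
import re

DAYS = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")  # assumed labels


def validate_schedule(time_text, days):
    """Check a maintenance schedule and return (hours, minutes, ordered days)."""
    match = re.fullmatch(r"([0-9]{1,2}):([0-9]{2})", time_text)
    if not match:
        raise ValueError("time must be entered as HH:MM")
    hours, minutes = int(match.group(1)), int(match.group(2))
    if hours >= 24 or minutes >= 60:
        raise ValueError("HH must be less than 24 and MM less than 60")
    if not days:
        raise ValueError("select at least one day")
    unknown = set(days) - set(DAYS)
    if unknown:
        raise ValueError("unknown day: " + ", ".join(sorted(unknown)))
    # Remove duplicates and list the days in week order.
    return hours, minutes, sorted(set(days), key=DAYS.index)


print(validate_schedule("02:30", ["Sat", "Mon"]))
```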
Unscheduling Collection Maintenance
If you have scheduled regular reindexing or optimizing of a collection, you can remove the scheduled maintenance when you no longer want the collection to be maintained at regular intervals.
To unschedule collection maintenance, perform the following steps:
From the Server Manager, choose Search.
Click the Remove Scheduled Collection Maintenance link.
The web server displays the Remove Scheduled Collection Maintenance
window.
Choose a collection from the drop-down list for Choose Collection.
This lists all your collections for which you have set up regular
maintenance.
Choose an action from the drop-down list: Reindex or Optimize.
In the lower part of the frame, you can see the time and days of the week when the scheduled maintenance is currently scheduled to take place.
Click OK to remove the scheduled maintenance.
For Unix/Linux users, to make the removal of the scheduled maintenance take effect, you must restart the ns-cron process from the Administration Server.
To restart the ns-cron process, perform the following steps:
From the Administration Server, choose Global Settings.
Click the Cron Control link.
If ns-cron is already on, click Restart to
restart it. If ns-cron is not on, click Start to start it up.
In either case, your regularly scheduled maintenance will no longer take
place.