AquaLogic Interaction Administrator Guide

     Previous Next  Open TOC in new window   View as PDF - New Window  Get Adobe Reader - New Window
Content starts here

About the Thesaurus File

The thesaurus is a comma-delimited file, in which each line represents a single thesaurus entry.

The first comma-delimited element on a line is the name of the thesaurus entry. The remaining elements on that line are the search tokens that should be treated as synonyms for the thesaurus entry. Each synonym can be assigned a weight that determines the amount each match contributes to the overall query score. For example, a file that contains the following two lines defines thesaurus entries for couch and dog:

couch,sofa[0.9],divan[0.5],davenport[0.4]
dog,canine,doggy[0.85],pup[0.7],mutt[0.3]

Searches for couch generate results with text matching terms couch, sofa, divan, and davenport. Searches for dog generate results that have text matching terms dog, canine, doggy, pup, and mutt. In the example shown, the term dog has the same contribution to the relevance score of a matching item as the term canine. This is equivalent to a default synonym weighting of 1.0. In contrast, the presence of the term pup contributes less to the relevance score than the presence of the term dog, by a factor of 0.7 (70%).

The example thesaurus entries constitute a complete comma-delimited file. No other information is needed at the beginning or the end of the file. Entries can also contain spaces. For example, a file that contains the following text creates a thesaurus entry for New York City:
new york city,big apple[0.9],gotham[0.5]

Searches for the phrase “new york city” will return results that also include results containing “big apple” and “gotham.” Thesaurus expansion for phrase entries only occurs for searches on the complete phrase, not the individual words that constitute the phrase. Similarly, the synonym entries are treated as phrases and not as individual terms. So while a search for “new york city” returns items containing “big apple” and “gotham,” a search for new (or for york, or for city, or for “new york”) will not. Conversely, an item that contains big or apple but not the phrase “big apple” will not be returned by a search for “new york city.”

Comma-delimited files support all UTF8-encoded characters; they are not limited to ASCII. However, punctuation should not be included. For example, if you want to make ne’er-do-well a synonym of wastrel, replace the punctuation with whitespace:
wastrel,ne er do well[0.7]
This matches documents that contain ne’er-do-well, ne er do well or some combination of these punctuations and spaces (such as ne’er do well). If you want your synonym to match documents that contain neer-do-well, which does not separate the initial ne and er with an apostrophe, you must include a separate synonym for that, such as:
wastrel,ne er do well[0.7],neer do well[0.7]
Comment lines can be specified by beginning the line with a “#”:
# furniture entries 
couch,sofa[0.9],divan[0.5],davenport[0.4]
#chair,stool[5.0]
# animal entries
dog,canine[0.9],doggy[0.85],pup[0.7],mutt[0.3]

In this example, the Search Service parses two thesaurus entries: couch and dog. There will be no entry for chair.

These examples are of entries that contain only ASCII characters. This utility supports non-ASCII characters as well, as long as they are UTF8-encoded.
Note: Some editors, especially when encoding UTF-8, insert a byte order mark at the beginning of the file. Files with byte order marks are not supported, so remove the byte order mark before running the customize utility.

A CDF thesaurus file can have at most 50,000 distinct entries (lines). Each entry can have at most 50 comma-delimited elements (including the name of the entry). If either of these limits are exceeded, the customize utility will exit with an appropriate error message.


  Back to Top      Previous Next