Oracle Commerce Guided Search - Finding Indexed Terms That Are Unknown to OLT

Finding Indexed Terms That Are Unknown to OLT

This section describes how to generate vocabulary reports that list indexed terms (property values or dimension names) that are unknown to OLT, how to run queries on these reports, and how to correct problems so that OLT can recognize and process them. Unknown terms are reported only for the languages for which OLT analysis is enabled.

Terms can be unknown for any of the following reasons:

Incorrect language assignments: The terms belong to a language other than the language associated with the records and/or properties where they occur. For example, "millésime" (French for "vintage") is reported as unknown in a record incorrectly associated with the German language. For information about how to assign languages to records and properties, see Assigning Language IDs globally, per record, and per property.
Non-linguistic entities: Non-linguistic entities such as weights and measures, part numbers, or stock keeping units (SKUs) are not included in OLT dictionaries and to that extent are unknown to OLT. When OLT encounters non-linguistic entities, it treats them as literal strings that must be matched exactly. Language analysis is not performed on non-linguistic entities; it is not possible, for example, to perform stemming or decompounding on terms such as 12V, 10x15, mm², or 110004A846K.
Limitations of OLT: The terms are not included in the OLT dictionary, even though they are valid words in their proper languages. For information about how to customize your OLT dictionary to include unknown terms, see Creating an auxiliary OLT dictionary.
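
When reviewing a long list of unknown terms, it can help to separate probable non-linguistic entities from probable words before looking at dictionary customization. The following Python sketch applies a simple heuristic that is an assumption of this example and not a description of how OLT classifies terms: any term containing a character that is not a letter is treated as non-linguistic. The function name looks_non_linguistic is hypothetical.

```python
def looks_non_linguistic(term: str) -> bool:
    """Heuristic (an assumption, not OLT logic): a term is probably a
    non-linguistic entity if it contains any character that is not a letter."""
    return not term.isalpha()


# Terms taken from the examples in this section.
terms = ["12V", "10x15", "mm²", "110004A846K", "millésime"]

for term in terms:
    kind = "non-linguistic" if looks_non_linguistic(term) else "word"
    print(f"{term}\t{kind}")
```

A heuristic like this only sorts terms into review buckets; a term that passes as a word may still be unknown because of a wrong language assignment or a gap in the dictionary.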

The reports can include the following information about each unknown term:

The number of times that this term occurs in your indexed data.
The language associated with the records and/or properties where the term occurs.

Generating a Vocabulary Report of Unknown Terms

You can generate a vocabulary report listing the unknown terms in all your indexed data at baseline update time. You can also automatically generate vocabulary reports of all the unknown terms in the data affected by partial updates, whenever you run the partial updates.

Note

Vocabulary reports are for audit and review only. You cannot modify your OLT dictionaries by editing vocabulary reports; for information about auxiliary OLT dictionaries, see Creating an auxiliary OLT dictionary.

Generating Vocabulary Reports at Baseline Update Time

You can specify that a vocabulary report be generated whenever you run a baseline update. To do this, add the --vocabulary-report option to the dgidx command line:

dgidx --vocabulary-report

If you are running dgidx from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DataIngest.xml file; for example:

<dgidx id="..." host-id="...">
  . . .
  <args>	
    <arg>--vocabulary-report</arg>
    . . .
  </args>
</dgidx>

The report is written to a file whose name is of the form:

db_prefix.vocabulary_report.xml

where:

db_prefix is the name of the project; for example:

Discover.vocabulary_report.xml

The file is written to the directory specified in the <output-dir> element of the DataIngest.xml file; for example:

<output-dir>data/dgidx_output</output-dir>

Thus, the vocabulary report in this example would be written to a file with the following name in the following directory:

[appdir]/data/dgidx_output/Discover.vocabulary_report.xml

Note

In this guide, the abbreviation [appdir] stands for the directory that contains your application.

Generating Vocabulary Reports at Partial Update Time

To generate reports that include only the unknown terms added to your spelling dictionary through partial updates, add the --vocabulary-report option to the dgraph command line:

dgraph --vocabulary-report

A dgraph vocabulary report is generated whenever you run a partial update successfully. The report includes only those unknown terms that occur in records introduced or affected by the partial update.

If you are running Dgraph from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DgraphDefaults.xml file; for example:

<dgraph-defaults>
  <args>	
    <arg>--vocabulary-report</arg>
    . . .
  </args>
  . . .
</dgraph-defaults>

The report is written to a file whose name is of the form:

<db_prefix>.vocabulary_report.v<VERSION>.xml

where:

<db_prefix> is the name of the project.

<VERSION> is the generation number of the committed partial update.

The file is written to the directory specified in the <input-dir> element of the AuthoringDgraphCluster.xml and LiveDgraphCluster.xml files; thus, the vocabulary report for a partial update might be written to a file with the following name in the following directory:

[appdir]/data/dgraphs/AuthoringDgraph/dgraph_input/Discover.vocabulary_report.v27.xml
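
Because each successful partial update writes a new versioned file, a review script usually needs the most recent one. The following Python sketch selects the report with the highest generation number from a directory. It assumes that the version is a plain decimal integer, as in the v27 example above; the function name latest_partial_report is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_partial_report(input_dir: str, db_prefix: str) -> Optional[Path]:
    """Return the partial-update report with the highest generation number,
    or None if the directory contains no matching file."""
    # Matches names such as Discover.vocabulary_report.v27.xml.
    pattern = re.compile(
        re.escape(db_prefix) + r"\.vocabulary_report\.v(\d+)\.xml"
    )
    best = None
    best_version = -1
    for path in Path(input_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match and int(match.group(1)) > best_version:
            best_version = int(match.group(1))
            best = path
    return best
```

For example, latest_partial_report("data/dgraphs/AuthoringDgraph/dgraph_input", "Discover") would be called relative to [appdir]. The version is compared as a number rather than as text, so v100 ranks above v27.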

Sample Queries on Vocabulary Report

The following XQuery scripts illustrate how to query vocabulary reports for commonly useful information.

Sample 1: Find the Fifty Most Frequently Occurring Unknown Terms

The following XQuery code queries vocabulary_report.xml for the fifty most frequently occurring unknown words. This information can identify instances of language misconfiguration and words needing OLT customization.

declare namespace ene="http://xmlns.endeca.com/ene/dgraph";
let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $term in $x//ene:terms[@class="unknown"]/ene:term
  order by number($term/@count) descending
  return <unknown>{$term/../../@lang} {$term/@count} {$term/@value}</unknown>
(: Return the unknown terms as a table of count-lang-term tuples. :)
(: Sort by frequency of occurrence and limit to top 50. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 50)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(),
  $tab, $unk/@value/string())

The following is example output from running the Sample 1 XQuery against the vocabulary report for a Swedish-English technical catalog. Each row lists the count, the language, and the term; the unknowns fall into a variety of categories:

159711  sv  false
   496  sv  mp3
   270  sv  12v
    55  sv  creative
    51  sv  microfiber
    36  sv  gummiklätt

You might customize the OLT dictionary in response to this information as follows:

The word "false" is English metadata in the product catalog that is being tokenized as Swedish. Because this metadata is never searched by customers, no customization of the OLT dictionary is needed.
The words "mp3" and "12v" are common abbreviations that are only expected to match exactly and do not need customization.
The words "microfiber", a technical term, and "gummiklätt", a compound, are candidates for customization of the Swedish OLT dictionary. Unless these words are added to the OLT dictionary, they will be matched only by an exact match (or thesaurus entry); matching by inflection or component parts will not occur.
The English word "creative" may be in a property that is intended for English search but is mistakenly associated with Swedish. If this is the case, there will be more English words as unknowns. Analyzing the number of unknowns by property can be used to identify properties that may have a high number of unknowns due to a misconfiguration of the language assumed for the property.
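
If an XQuery processor is not available, the same ranking can be sketched in Python with the standard library. The namespace URI and the terms, term, class, count, value, and lang names come from the Sample 1 query; the report and language element names in the embedded fragment are placeholders, because this section does not show the full report layout. The function name top_unknown_terms is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"ene": "http://xmlns.endeca.com/ene/dgraph"}

# Hypothetical report fragment: the <report> and <language> element names
# are placeholders; only the names used by the XQuery come from the guide.
SAMPLE = """\
<report xmlns="http://xmlns.endeca.com/ene/dgraph">
  <language lang="sv">
    <terms class="unknown">
      <term count="270" value="12v"/>
      <term count="159711" value="false"/>
      <term count="496" value="mp3"/>
    </terms>
  </language>
</report>
"""


def top_unknown_terms(root, limit=50):
    """Return (count, lang, value) tuples, most frequent first."""
    rows = []
    # ElementTree has no parent pointers, so walk down from each element
    # that carries a lang attribute to its <terms class="unknown"> children.
    for holder in root.iter():
        lang = holder.get("lang")
        if lang is None:
            continue
        for terms in holder.findall("ene:terms[@class='unknown']", NS):
            for term in terms.findall("ene:term", NS):
                rows.append((int(term.get("count")), lang, term.get("value")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


root = ET.fromstring(SAMPLE)
for count, lang, value in top_unknown_terms(root):
    print(f"{count}\t{lang}\t{value}")
```

To run this against a real report, replace ET.fromstring(SAMPLE) with ET.parse("vocabulary_report.xml").getroot(). Unlike the XQuery, this sketch skips terms whose enclosing element has no lang attribute.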

Sample 2: Display the Properties that Include the Ten Most Frequently Occurring Unknown Terms

The following XQuery code queries vocabulary_report.xml for the ten properties with the most unknown words. This information is useful for identifying language misconfiguration.

declare namespace ene="http://xmlns.endeca.com/ene/dgraph";
let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $prop in $x//ene:terms[@class="unknown"]/ene:by_property
  order by number($prop/@count) descending
  return <unknown>{$prop/../../@lang} {$prop/@count} {$prop/@name}</unknown>
(: Return the unknown terms as a table of count-lang-property tuples. :)
(: Sort by frequency of occurrence and limit to top 10. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 10)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(),
  $tab, $unk/@name/string())

The following is example output from running the Sample 2 XQuery against the vocabulary report for a Swedish-English technical catalog. Each row lists the count, the language, and the property name; the unknowns fall into several different categories:

 30446  sv  ProductDescription_en
 28195  sv  ProductStockStatus
 10286  sv  ProductImageURL
   345  sv  ProductDescription_sv

You might respond to this information as follows:

The "ProductStockStatus" and "ProductImageURL" fields contain metadata that is not involved in text search, so the number of unknowns is not an issue.
The "ProductDescription_en" and "ProductDescription_sv" fields are text searchable. In this case, the property "ProductDescription_en" was incorrectly configured as Swedish. Reconfiguring the language associated with ProductDescription_en will correct the high unknown count.
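
Where property names carry a language suffix, as ProductDescription_en and ProductDescription_sv do in this example, a mismatch between the suffix and the reported language can be checked mechanically. The following Python sketch assumes that a trailing underscore followed by two lowercase letters is a language code. That naming convention is an assumption of this example, not a product rule, and the function name suffix_mismatches is hypothetical.

```python
import re

# Rows as produced by the Sample 2 query: (count, lang, property name).
ROWS = [
    (30446, "sv", "ProductDescription_en"),
    (28195, "sv", "ProductStockStatus"),
    (10286, "sv", "ProductImageURL"),
    (345, "sv", "ProductDescription_sv"),
]

# Assumed naming convention: a trailing "_xx" is a two-letter language code.
SUFFIX = re.compile(r"_([a-z]{2})$")


def suffix_mismatches(rows):
    """Return rows whose property-name suffix disagrees with the reported language."""
    flagged = []
    for count, lang, name in rows:
        match = SUFFIX.search(name)
        if match and match.group(1) != lang:
            flagged.append((count, lang, name))
    return flagged


for count, lang, name in suffix_mismatches(ROWS):
    print(f"{name}: count {count} reported under language '{lang}'")
```

With the rows above, only ProductDescription_en is flagged. Properties without a language suffix, such as ProductStockStatus, are ignored and still need manual review.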

Guided Search Internationalization Guide