This section describes how to generate vocabulary reports that list indexed terms (property values or dimension names) that are unknown to OLT, how to run queries on these reports, and how to correct problems so that OLT can recognize and process them. Unknown terms are reported only for the languages for which OLT analysis is enabled.

Terms can be unknown for any of the following reasons:

The reports can include the following information about each unknown term:

  • The number of times that this term occurs in your indexed data.

  • The language associated with the records and/or properties where the term occurs.

You can generate a vocabulary report listing the unknown terms in all your indexed data at baseline update time. You can also automatically generate vocabulary reports of all the unknown terms in the data affected by partial updates, whenever you run the partial updates.

Generating Vocabulary Reports at Baseline Update Time

You can specify that a vocabulary report be generated whenever you run a baseline update. To do this, add the --vocabulary-report option to the dgidx command line:

dgidx --vocabulary-report 

If you are running dgidx from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DataIngest.xml file; for example:

<dgidx id="..." host-id="...">
  . . .
  <args>	
    <arg>--vocabulary-report</arg>
    . . .
  </args>
</dgidx>

The report is written to a file whose name is of the form:

db_prefix.vocabulary_report.xml

where:

db_prefix is the name of the project; for example:

Discover.vocabulary_report.xml

The file is written to the directory specified in the <output-dir> element of the DataIngest.xml file; for example:

<output-dir>data/dgidx_output</output-dir>

Thus, the vocabulary report in this example would be written to a file with the following name in the following directory:

[appdir]/data/dgidx_output/Discover.vocabulary_report.xml

Note

In this guide, the abbreviation [appdir] stands for the directory that contains your application.
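As a minimal sketch of how the report location above is assembled, the following Python helper joins the application directory, the dgidx output directory, and the project prefix. The function name baseline_report_path and its parameters are hypothetical, introduced only for illustration; they are not part of any Endeca tool.

```python
from pathlib import PurePosixPath


def baseline_report_path(appdir: str, output_dir: str, db_prefix: str) -> str:
    """Build the expected location of a baseline vocabulary report.

    Hypothetical helper: it only mirrors the naming rule described above
    (db_prefix.vocabulary_report.xml inside the dgidx output directory).
    """
    file_name = f"{db_prefix}.vocabulary_report.xml"
    return str(PurePosixPath(appdir) / output_dir / file_name)


if __name__ == "__main__":
    # Values taken from the example above; "[appdir]" is the guide's placeholder.
    print(baseline_report_path("[appdir]", "data/dgidx_output", "Discover"))
```

In practice you would pass the real application directory instead of the [appdir] placeholder.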

Generating Vocabulary Reports at Partial Update Time

To generate reports that include only the unknown terms added to your spelling dictionary through partial updates, add the --vocabulary-report option to the dgraph command line:

dgraph --vocabulary-report

A dgraph vocabulary report is generated whenever you run a partial update successfully. The report includes only those unknown terms that occur in records introduced or affected by the partial update.

If you are running Dgraph from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DgraphDefaults.xml file; for example:

<dgraph-defaults>
  <args>	
    <arg>--vocabulary-report</arg>
    . . .
  </args>
  . . .
</dgraph-defaults>

The report is written to a file whose name is of the form:

<db_prefix>.vocabulary_report.v<VERSION>.xml

where:

<db_prefix> is the name of the project.

<VERSION> is the generation number of the committed partial update.

The file is written to the directory specified in the <input_dir> element of the AuthoringDgraphCluster.xml and LiveDgraphCluster.xml files; thus, the vocabulary report for a partial update might be written to a file with the following name in the following directory:

[appdir]/data/dgraphs/AuthoringDgraph/dgraph_input/Discover.vocabulary_report.v27.xml
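Because each committed partial update produces its own versioned file, it can be handy to pick out the most recent report. The Python sketch below parses file names of the form shown above and returns the one with the highest version. It assumes that the version is a plain decimal integer (as in v27); the function names are hypothetical and not part of any Endeca tool.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches names such as "Discover.vocabulary_report.v27.xml".
# Assumption: the version after "v" is a decimal integer.
_PARTIAL_REPORT = re.compile(
    r"^(?P<prefix>.+)\.vocabulary_report\.v(?P<version>\d+)\.xml$"
)


def parse_partial_report_name(file_name: str) -> Optional[Tuple[str, int]]:
    """Return (db_prefix, version), or None if the name is not a partial report."""
    match = _PARTIAL_REPORT.match(file_name)
    if match is None:
        return None
    return match.group("prefix"), int(match.group("version"))


def latest_partial_report(file_names: Iterable[str], db_prefix: str) -> Optional[str]:
    """Return the report name with the highest version for the given project."""
    best_name = None
    best_version = -1
    for name in file_names:
        parsed = parse_partial_report_name(name)
        if parsed is None or parsed[0] != db_prefix:
            continue  # Skip baseline reports and other projects.
        if parsed[1] > best_version:
            best_name, best_version = name, parsed[1]
    return best_name


if __name__ == "__main__":
    names = [
        "Discover.vocabulary_report.v9.xml",
        "Discover.vocabulary_report.v27.xml",
        "Discover.vocabulary_report.xml",
    ]
    print(latest_partial_report(names, "Discover"))
```

Comparing the versions as integers rather than as strings keeps v27 ahead of v9.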

The following XQuery scripts illustrate how to query vocabulary reports for commonly useful information.

Sample 1: Find the Fifty Most Frequently Occurring Unknown Terms

The following XQuery code queries vocabulary_report.xml for the fifty most frequently occurring unknown words. This information can help you identify instances of language misconfiguration and words that need OLT customization.

declare namespace ene="http://xmlns.endeca.com/ene/dgraph";
let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $term in $x//ene:terms[@class="unknown"]/ene:term
  order by number($term/@count) descending
  return <unknown>{$term/../../@lang} {$term/@count} {$term/@value}</unknown>
(: Return the unknown terms as a table of count-lang-term tuples. :)
(: Sort by frequency of occurrence and limit to top 50. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 50)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(),
  $tab, $unk/@value/string())
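Where no XQuery processor is available, the same ranking can be sketched in Python with the standard library. This is an illustrative equivalent, not an Endeca-provided script. It relies only on what the XQuery above implies: ene:terms elements with a class attribute, ene:term children with count and value attributes, and a lang attribute on the parent of ene:terms. The wrapper element names and the "known" class in the sample document are placeholders, because the full report schema is not shown here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

ENE = {"ene": "http://xmlns.endeca.com/ene/dgraph"}

# Placeholder document: ene:report, ene:language and class="known" are
# hypothetical; only the ene:terms/ene:term attributes come from the XQuery.
SAMPLE = """<ene:report xmlns:ene="http://xmlns.endeca.com/ene/dgraph">
  <ene:language lang="sv">
    <ene:terms class="unknown">
      <ene:term count="496" value="mp3"/>
      <ene:term count="159711" value="false"/>
      <ene:term count="270" value="12v"/>
    </ene:terms>
    <ene:terms class="known">
      <ene:term count="999999" value="och"/>
    </ene:terms>
  </ene:language>
</ene:report>"""


def top_unknown_terms(root: ET.Element, limit: int = 50) -> List[Tuple[int, str, str]]:
    """Return up to `limit` (count, lang, value) tuples, highest count first."""
    rows = []
    for holder in root.iter():
        # The lang attribute sits on the parent of ene:terms.
        lang = holder.get("lang", "")
        for terms in holder.findall("ene:terms", ENE):
            if terms.get("class") != "unknown":
                continue
            for term in terms.findall("ene:term", ENE):
                rows.append((int(term.get("count", "0")), lang, term.get("value", "")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for count, lang, value in top_unknown_terms(ET.fromstring(SAMPLE), limit=50):
        print(f"{count}\t{lang}\t{value}")
```

To run it against a real report, replace ET.fromstring(SAMPLE) with ET.parse("vocabulary_report.xml").getroot().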

The following is example output from running the Sample 1 XQuery against the vocabulary report of a Swedish-English technical catalog. Each row shows the count, the language, and the unknown term; the unknowns fall into a variety of categories:

159711  sv  false
   496  sv  mp3
   270  sv  12v
    55  sv  creative
    51  sv  microfiber
    36  sv  gummiklätt

You might customize the OLT dictionary in response to this information as follows:

Sample 2: Display the Properties that Include the Ten Most Frequently Occurring Unknown Terms

The following XQuery code queries vocabulary_report.xml for the ten properties with the most unknown words. This information is useful for identifying language misconfiguration.

declare namespace ene="http://xmlns.endeca.com/ene/dgraph";
let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $prop in $x//ene:terms[@class="unknown"]/ene:by_property
  order by number($prop/@count) descending
  return <unknown>{$prop/../../@lang} {$prop/@count} {$prop/@name}</unknown>
(: Return the unknown terms as a table of count-lang-property tuples. :)
(: Sort by frequency of occurrence and limit to top 10. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 10)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(),
  $tab, $unk/@name/string())

The following is example output from running the Sample 2 XQuery against the vocabulary report of a Swedish-English technical catalog. Each row shows the count, the language, and the property name; the unknowns fall into several different categories:

 30446  sv  ProductDescription_en
 28195  sv  ProductStockStatus
 10286  sv  ProductImageURL
   345  sv  ProductDescription_sv

You might customize the OLT dictionary in response to this information as follows:

  • The "ProductStockStatus" and "ProductImageURL" fields contain metadata that is not involved in text search, so the number of unknowns is not an issue.

  • The "ProductDescription_en" and "ProductDescription_sv" fields are text searchable. In this case, the property "ProductDescription_en" was incorrectly configured as Swedish. Reconfiguring the language associated with ProductDescription_en will correct the high unknown count.
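The kind of check described in the bullets above can be partly automated. The Python sketch below reads the per-property unknown counts and flags properties whose name ends in a two-letter language suffix that differs from the language the report associates with them. It assumes a naming convention like the one in this example (a trailing _en or _sv); that convention, the function names, and the wrapper elements in the sample document are illustrative assumptions, not part of the report format or of any Endeca tool.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

ENE = {"ene": "http://xmlns.endeca.com/ene/dgraph"}

# Assumed naming convention: a trailing "_xx" two-letter language code.
_LANG_SUFFIX = re.compile(r"_([a-z]{2})$")

# Placeholder document: ene:report and ene:language are hypothetical wrappers.
SAMPLE_REPORT = """<ene:report xmlns:ene="http://xmlns.endeca.com/ene/dgraph">
  <ene:language lang="sv">
    <ene:terms class="unknown">
      <ene:by_property count="345" name="ProductDescription_sv"/>
      <ene:by_property count="30446" name="ProductDescription_en"/>
      <ene:by_property count="28195" name="ProductStockStatus"/>
      <ene:by_property count="10286" name="ProductImageURL"/>
    </ene:terms>
  </ene:language>
</ene:report>"""


def unknown_counts_by_property(root: ET.Element) -> List[Tuple[int, str, str]]:
    """Return (count, lang, property name) tuples, highest count first."""
    rows = []
    for holder in root.iter():
        lang = holder.get("lang", "")
        for terms in holder.findall("ene:terms", ENE):
            if terms.get("class") != "unknown":
                continue
            for prop in terms.findall("ene:by_property", ENE):
                rows.append((int(prop.get("count", "0")), lang, prop.get("name", "")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


def suspect_language_config(rows: List[Tuple[int, str, str]]) -> List[str]:
    """List properties whose name suffix disagrees with the reported language."""
    suspects = []
    for _count, lang, name in rows:
        match = _LANG_SUFFIX.search(name)
        if match and match.group(1) != lang:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    rows = unknown_counts_by_property(ET.fromstring(SAMPLE_REPORT))
    print(suspect_language_config(rows))
```

A flagged property is only a candidate for review; properties without a language suffix, such as ProductStockStatus, are ignored by this heuristic and still need to be judged by hand.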

