This section describes how to generate vocabulary reports that list indexed terms (property values or dimension names) that are unknown to OLT, how to run queries on these reports, and how to correct problems so that OLT can recognize and process them. Unknown terms are reported only for the languages for which OLT analysis is enabled.
Terms can be unknown for any of the following reasons:
Incorrect language assignments: The terms belong to a language other than the language associated with the records and/or properties where they occur. For example, "millésime" (French for "vintage") is reported as unknown in a record incorrectly associated with the German language. For information about how to assign languages to records and properties, see Assigning Language IDs globally, per record, and per property.
Non-linguistic entities: Non-linguistic entities such as weights and measures, part numbers, or stock keeping units (SKUs) are not included in OLT dictionaries and to that extent are unknown to OLT. When OLT encounters non-linguistic entities, it treats them as literal strings that must be matched exactly. Language analysis is not performed on non-linguistic entities; it is not possible, for example, to perform stemming or decompounding on terms such as 12V, 10x15, mm², or 110004A846K.
Limitations of OLT: The terms are not included in the OLT dictionary, even though they are valid words in their proper languages. For information about how to customize your OLT dictionary to include unknown terms, see Creating an auxiliary OLT dictionary.
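When reviewing a report, it can help to separate likely non-linguistic entities from words that may be dictionary candidates. The following Python sketch uses a simple heuristic that is an assumption of this example and not OLT's own logic: any term containing a digit-like character is flagged. The function name looks_non_linguistic is hypothetical.

```python
def looks_non_linguistic(term: str) -> bool:
    """Heuristic only (an assumption, not OLT behavior): flag terms containing a digit.

    str.isdigit() is also true for superscript digits, so "mm²" is flagged.
    """
    return any(ch.isdigit() for ch in term)


for sample in ["12V", "10x15", "mm²", "110004A846K", "millésime"]:
    print(sample, looks_non_linguistic(sample))
```

A heuristic this coarse also flags abbreviations such as "mp3", so treat its result as a first pass that still needs manual review.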
The reports can include the following information about each unknown term:
The number of times that this term occurs in your indexed data.
The language associated with the records and/or properties where the term occurs.
You can generate a vocabulary report listing the unknown terms in all your indexed data at baseline update time. You can also automatically generate vocabulary reports of all the unknown terms in the data affected by partial updates, whenever you run the partial updates.
Note
Vocabulary reports are for audit and review only. You cannot modify your OLT dictionaries by editing vocabulary reports; for information about auxiliary OLT dictionaries, see Creating an auxiliary OLT dictionary.
Generating Vocabulary Reports at Baseline Update Time
You can specify that a vocabulary report be generated whenever you run a baseline update. To do this, add the --vocabulary-report option to the dgidx command line:
dgidx --vocabulary-report
If you are running dgidx from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DataIngest.xml file; for example:
<dgidx id="..." host-id="...">
. . .
<args>
<arg>--vocabulary-report</arg>
. . .
</args>
</dgidx>
The report is written to a file whose name is of the form:
db_prefix.vocabulary_report.xml
where db_prefix is the name of the project; for example:
Discover.vocabulary_report.xml
The file is written to the directory specified in the <output-dir> element of the DataIngest.xml file; for example:
<output-dir>data/dgidx_output</output-dir>
Thus, the vocabulary report in this example would be written to a file with the following name in the following directory:
[appdir]/data/dgidx_output/Discover.vocabulary_report.xml
Note
In this guide, the abbreviation [appdir] stands for the directory that contains your application.
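The way the baseline report location is composed can be sketched in Python as follows. This is only an illustration of the naming rule described above; the function name baseline_report_path and its parameters are hypothetical, and the values passed in are the ones from the example in the text.

```python
from pathlib import PurePosixPath


def baseline_report_path(app_dir: str, output_dir: str, db_prefix: str) -> PurePosixPath:
    """Join the application directory, the dgidx output directory, and the report file name."""
    return PurePosixPath(app_dir) / output_dir / f"{db_prefix}.vocabulary_report.xml"


# Values from the example in the text; "[appdir]" is the guide's placeholder.
print(baseline_report_path("[appdir]", "data/dgidx_output", "Discover"))
```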
Generating Vocabulary Reports at Partial Update Time
To generate reports that include only the unknown terms added to your spelling dictionary through partial updates, add the --vocabulary-report option to the dgraph command line:
dgraph --vocabulary-report
A dgraph vocabulary report is generated whenever you run a partial update successfully. The report includes only those unknown terms that occur in records introduced or affected by the partial update.
If you are running Dgraph from the Deployment Template, add an <arg>--vocabulary-report</arg> element to the DgraphDefaults.xml file; for example:
<dgraph-defaults>
<args>
<arg>--vocabulary-report</arg>
. . .
</args>
. . .
</dgraph-defaults>
The report is written to a file whose name is of the form:
<db_prefix>.vocabulary_report.v<VERSION>.xml
where:
<db_prefix> is the name of the project.
<VERSION> is the generation number of the committed partial update.
The file is written to the directory specified in the <input-dir> element of the AuthoringDgraphCluster.xml and LiveDgraphCluster.xml files; thus, the vocabulary report for a partial update might be written to a file with the following name in the following directory:
[appdir]/data/dgraphs/AuthoringDgraph/dgraph_input/Discover.vocabulary_report.v27.xml
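Because each successful partial update writes a new versioned file, you may want to pick out the most recent report in a Dgraph input directory. The following Python sketch does this under the assumption that <VERSION> is a decimal integer, as the v27 example suggests; the function name latest_partial_report is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# File-name pattern from the text: <db_prefix>.vocabulary_report.v<VERSION>.xml
# Assumption: <VERSION> is a decimal integer (for example, v27).
_REPORT_RE = re.compile(r"^(?P<prefix>.+)\.vocabulary_report\.v(?P<version>\d+)\.xml$")


def latest_partial_report(input_dir: str, db_prefix: str) -> Optional[Path]:
    """Return the report with the highest generation number, or None if there is none."""
    best: Optional[Path] = None
    best_version = -1
    for path in Path(input_dir).iterdir():
        match = _REPORT_RE.match(path.name)
        if match and match.group("prefix") == db_prefix:
            version = int(match.group("version"))  # compare numerically, so v9 < v27
            if version > best_version:
                best, best_version = path, version
    return best
```

Comparing the generation numbers as integers avoids the ordering problem that a plain alphabetical sort of file names would cause (v9 sorting after v27).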
The following XQuery scripts illustrate how to query vocabulary reports for commonly useful information.
Sample 1: Find the Fifty Most Frequently Occurring Unknown Terms
The following XQuery code queries vocabulary_report.xml for the fifty most frequently occurring unknown words. This information can identify instances of language misconfiguration and words needing OLT customization.
declare namespace ene="http://xmlns.endeca.com/ene/dgraph";

let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $term in $x//ene:terms[@class="unknown"]/ene:term
  order by number($term/@count) descending
  return <unknown>{$term/../../@lang} {$term/@count} {$term/@value}</unknown>

(: Return the unknown terms as a table of count-lang-term tuples. :)
(: Sort by frequency of occurrence and limit to top 50. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 50)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(), $tab, $unk/@value/string())
The following is example output of the Sample 1 XQuery, run against the vocabulary report from a Swedish-English technical catalog. The unknowns fall into a variety of categories:
159711	sv	false
496	sv	mp3
270	sv	12v
55	sv	creative
51	sv	microfiber
36	sv	gummiklätt
You might customize the OLT dictionary in response to this information as follows:
The word "false" is English metadata in the product catalog that is being tokenized as Swedish. Because this metadata is never searched by customers, no customization of the OLT dictionary is needed.
The words "mp3" and "12v" are common abbreviations that are only expected to match exactly and do not need customization.
The words "microfiber", a technical term, and "gummiklätt", a compound, are candidates for customization of the Swedish OLT dictionary. Unless these words are added to the OLT dictionary, they will be matched only by an exact match (or thesaurus entry); matching by inflection or component parts will not occur.
The English word "creative" may be in a property that is intended for English search but is mistakenly associated with Swedish. If this is the case, there will be more English words as unknowns. Analyzing the number of unknowns by property can be used to identify properties that may have a high number of unknowns due to a misconfiguration of the language assumed for the property.
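If no XQuery processor is available, the same listing can be approximated with Python's standard library. The sketch below mirrors the Sample 1 query. It assumes the report structure implied by that query: ene:term elements with count and value attributes inside ene:terms elements with class="unknown", and a lang attribute on the parent of ene:terms. The function name top_unknown_terms is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace URI taken from the Sample 1 XQuery.
NS = {"ene": "http://xmlns.endeca.com/ene/dgraph"}


def top_unknown_terms(xml_text: str, limit: int = 50) -> List[Tuple[int, str, str]]:
    """Return up to `limit` (count, lang, value) tuples, highest count first."""
    root = ET.fromstring(xml_text)
    rows = []
    # Assumption: @lang sits on the parent of ene:terms, as implied by
    # $term/../../@lang in the XQuery. ElementTree has no parent pointers,
    # so walk every element and inspect its ene:terms children.
    for parent in root.iter():
        lang = parent.get("lang", "")
        for terms in parent.findall("ene:terms[@class='unknown']", NS):
            for term in terms.findall("ene:term", NS):
                rows.append((int(term.get("count", "0")), lang, term.get("value", "")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

To use it, pass the contents of the report file as a string and print each tuple joined by tabs to get the same count-lang-term table as the XQuery.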
Sample 2: Display the Properties that Include the Ten Most Frequently Occurring Unknown Terms
The following XQuery code queries vocabulary_report.xml for the properties with the ten most frequently occurring unknown words. This information is useful for identifying language misconfiguration.
declare namespace ene="http://xmlns.endeca.com/ene/dgraph";

let $x := doc("vocabulary_report.xml")
let $sorted-unknowns :=
  for $prop in $x//ene:terms[@class="unknown"]/ene:by_property
  order by number($prop/@count) descending
  return <unknown>{$prop/../../@lang} {$prop/@count} {$prop/@name}</unknown>

(: Return the unknown terms as a table of count-lang-property tuples. :)
(: Sort by frequency of occurrence and limit to top 10. :)
let $tab := "&#9;"
for $unk in subsequence($sorted-unknowns, 1, 10)
return fn:concat($unk/@count/string(), $tab, $unk/@lang/string(), $tab, $unk/@name/string())
The following is example output of the Sample 2 XQuery, run against the vocabulary report from a Swedish-English technical catalog. The properties fall into several different categories:
30446	sv	ProductDescription_en
28195	sv	ProductStockStatus
10286	sv	ProductImageURL
345	sv	ProductDescription_sv
You might respond to this information as follows:
The "ProductStockStatus" and "ProductImageURL" fields contain metadata that is not involved in text search, so the number of unknowns is not an issue.
The "ProductDescription_en" and "ProductDescription_sv" fields are text searchable. In this case, the property "ProductDescription_en" was incorrectly configured as Swedish. Reconfiguring the language associated with ProductDescription_en will correct the high unknown count.
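The per-property listing can likewise be approximated in Python. The sketch below mirrors the Sample 2 query and assumes the structure that query implies: ene:by_property elements with count and name attributes inside ene:terms elements with class="unknown", and a lang attribute on the parent of ene:terms. The function name top_unknown_properties is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace URI taken from the Sample 2 XQuery.
NS = {"ene": "http://xmlns.endeca.com/ene/dgraph"}


def top_unknown_properties(xml_text: str, limit: int = 10) -> List[Tuple[int, str, str]]:
    """Return up to `limit` (count, lang, property name) tuples, highest count first."""
    root = ET.fromstring(xml_text)
    rows = []
    # Assumption: @lang sits on the parent of ene:terms, as implied by
    # $prop/../../@lang in the XQuery.
    for parent in root.iter():
        lang = parent.get("lang", "")
        for terms in parent.findall("ene:terms[@class='unknown']", NS):
            for prop in terms.findall("ene:by_property", NS):
                rows.append((int(prop.get("count", "0")), lang, prop.get("name", "")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

A property whose count is far higher than that of comparable text properties, such as ProductDescription_en in the example output, is a good first place to check the language assignment.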