Oracle Commerce Guided Search - Processing binary data with the Document Converter

Processing binary data with the Document Converter

One typical use of binary column data is to read documents such as PDF or Word files from the database. In this case, the Advanced JDBC Column Handler writes the binary column data to the temporary files mentioned above. The pipeline then invokes the Document Converter to convert these binary files into plain-text Guided Search properties that are indexed for search. The following example pipeline component could be used to perform this conversion:

<RECORD_MANIPULATOR FRC_PVAL_IDX="TRUE" NAME="BLOB Manip.">
  <RECORD_SOURCE>Records In</RECORD_SOURCE>
  <EXPRESSION LABEL="" NAME="IF" TYPE="VOID" URL="">

    <COMMENT>if a reference to a BLOB file exists...</COMMENT>
    <EXPRESSION LABEL="" NAME="PROP_EXISTS" TYPE="INTEGER" URL="">
      <EXPRNODE NAME="PROP_NAME" VALUE="BLOB_COL_NAME"/>
    </EXPRESSION>

    <EXPRESSION LABEL="" NAME="RENAME" TYPE="VOID" URL="">
      <COMMENT>… rename the BLOB property,</COMMENT>
      <EXPRNODE NAME="OLD_NAME" VALUE="BLOB_COL_NAME"/>
      <EXPRNODE NAME="NEW_NAME" VALUE="Endeca.Document.Body"/>
    </EXPRESSION>

    <EXPRESSION LABEL="" NAME="CONVERTTOTEXT" TYPE="VOID" URL="">
      <COMMENT>extract the searchable text from the file,</COMMENT>
      <EXPRNODE NAME="RESPONSE_TIMEOUT" VALUE="300"/>
    </EXPRESSION>

    <EXPRESSION TYPE="VOID" NAME="REMOVE_EXPORTED_PROP">
      <COMMENT>and then remove the file from the filesystem.</COMMENT>
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Body"/>
      <EXPRNODE NAME="REMOVE_PROPS" VALUE="TRUE"/>
    </EXPRESSION>

  </EXPRESSION>
</RECORD_MANIPULATOR>

Note that the binary column’s property should be renamed to Endeca.Document.Body, because this is the property that the Document Converter module looks for. After this manipulator processes a record, the record has properties such as Endeca.Document.Text, which contains the converted document text, and Endeca.Document.Encoding, which reflects the binary file format detected. For more information about the Document Converter module, see the VOID CONVERTTOTEXT section of the Data Foundry Expression Reference.
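
As a rough sketch of how a later pipeline component might consume the converter output, the following manipulator renames Endeca.Document.Text to an application-specific property, but only on records where conversion actually produced it. The manipulator name "Doc Text Manip." and the target property name Document_Text are hypothetical and chosen only for illustration. The IF, PROP_EXISTS, and RENAME expressions are used with the same attributes as in the example above, and the record source assumes this component reads from "BLOB Manip.". Treat it as an untested starting point rather than a definitive configuration.

```xml
<RECORD_MANIPULATOR FRC_PVAL_IDX="TRUE" NAME="Doc Text Manip.">
  <!-- Assumption: this component reads from the manipulator shown earlier -->
  <RECORD_SOURCE>BLOB Manip.</RECORD_SOURCE>
  <EXPRESSION LABEL="" NAME="IF" TYPE="VOID" URL="">

    <COMMENT>if the Document Converter produced text...</COMMENT>
    <EXPRESSION LABEL="" NAME="PROP_EXISTS" TYPE="INTEGER" URL="">
      <EXPRNODE NAME="PROP_NAME" VALUE="Endeca.Document.Text"/>
    </EXPRESSION>

    <EXPRESSION LABEL="" NAME="RENAME" TYPE="VOID" URL="">
      <COMMENT>... rename it to a hypothetical application property.</COMMENT>
      <EXPRNODE NAME="OLD_NAME" VALUE="Endeca.Document.Text"/>
      <EXPRNODE NAME="NEW_NAME" VALUE="Document_Text"/>
    </EXPRESSION>

  </EXPRESSION>
</RECORD_MANIPULATOR>
```

Guarding the rename with PROP_EXISTS leaves records without a converted document (for example, rows whose binary column was empty) untouched.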


Guided Search Platform Services Forge Guide