5.10 Working with General Architecture for Text Engineering (GATE)
General Architecture for Text Engineering (GATE) is an open source natural language processor and information extractor.
For details about GATE, see http://gate.ac.uk.
You can use GATE to perform semantic indexing of documents stored in the database. The extractor type mdsys.gatenlp_extractor is defined as a subtype of the RDFCTX_EXTRACTOR type. The implementation of this extractor type sends an unstructured document to a GATE engine over a TCP connection, receives the corresponding annotations, and converts them into RDF following a user-specified XML style sheet.
Requests for information extraction are handled by a server socket implementation, which instantiates the GATE components and listens for extraction requests on a predetermined port. The host and the port for the GATE listener are recorded in the database, as shown in the following example, for all instances of the mdsys.gatenlp_extractor type to use.
begin
  sem_rdfctx.set_extractor_param (
    param_key   => 'GATE_NLP_HOST',
    param_value => 'gateserver.example.com',
    param_desc  => 'Host for GATE NLP Listener ');
  sem_rdfctx.set_extractor_param (
    param_key   => 'GATE_NLP_PORT',
    param_value => '7687',
    param_desc  => 'Port for Gate NLP Listener');
end;
/
The server socket application receives an unstructured document and constructs an annotation set with the desired types of annotations. Each annotation in the set may be customized to include additional features, such as the relevant phrase from the input document and some domain-specific features. The resulting annotation set is serialized into XML (using the annotationSetToXml method of the gate.corpora.DocumentXmlUtils Java class) and returned to the socket client.
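The request and response flow described above can be sketched in plain Java. This is a minimal illustration and not the sample listener distributed by Oracle. It assumes a simple wire protocol that this section does not specify: the client sends the document as UTF-8 text and closes its output side, and the server replies with the XML and closes the connection. The class name GateListenerSketch, the Annotator interface, and the request helper are hypothetical names introduced here. Annotator stands in for the step where a real listener would run the GATE components and serialize the annotation set to XML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GateListenerSketch {

    /** Hypothetical hook: a real listener would run GATE here and return the serialized annotation set. */
    public interface Annotator {
        String annotate(String documentText) throws Exception;
    }

    /** Handles one request: read the whole document, annotate it, write the XML back, close. */
    static void handle(Socket client, Annotator annotator) throws Exception {
        try (Socket s = client) {
            String document = new String(readAll(s.getInputStream()), StandardCharsets.UTF_8);
            String xml = annotator.annotate(document);
            OutputStream out = s.getOutputStream();
            out.write(xml.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    /** Serves requests one at a time until the server socket is closed. */
    public static void serve(ServerSocket server, Annotator annotator) {
        while (!server.isClosed()) {
            try {
                handle(server.accept(), annotator);
            } catch (Exception e) {
                // accept() throws once the socket is closed; only report other failures.
                if (!server.isClosed()) {
                    System.err.println("Extraction request failed: " + e);
                }
            }
        }
    }

    /** Reads a stream until end of input. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    /** Minimal client for trying the listener; follows the assumed protocol above. */
    public static String request(String host, int port, String document) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(document.getBytes(StandardCharsets.UTF_8));
            s.shutdownOutput(); // assumed end-of-document signal
            return new String(readAll(s.getInputStream()), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 7687;
        // Placeholder annotator that returns an empty annotation set.
        Annotator placeholder = text -> "<AnnotationSet></AnnotationSet>";
        try (ServerSocket server = new ServerSocket(port)) {
            serve(server, placeholder);
        }
    }
}
```

The sketch handles one connection at a time to stay short. A listener that must serve concurrent extraction requests would typically hand each accepted socket to a worker thread.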
A sample Java implementation for the GATE listener is available for download from the code samples and examples page on OTN (see RDF Graph Management Examples (PL/SQL and Java) for information about this page).
The mdsys.gatenlp_extractor implementation in the database receives the annotation set encoded in XML, and converts it to RDF/XML using an XML style sheet. You can replace the default style sheet (listed in Default Style Sheet for GATE Extractor Output) used by the mdsys.gatenlp_extractor implementation with a custom style sheet when you instantiate the type.
The following example creates an extractor policy that uses a custom style sheet to generate RDF from the annotation set produced by the GATE extractor:
begin
  sem_rdfctx.create_policy (
    policy_name => 'GATE_EXTR',
    extractor   => mdsys.gatenlp_extractor(
      sys.XMLType('<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
  ..
</xsl:stylesheet>')));
end;
/
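Before registering a custom style sheet, it can help to preview its output outside the database. The following Java sketch applies a style sheet to a sample annotation set with the standard javax.xml.transform API. Everything in the sample data is an assumption made for illustration: the annotation XML is a simplified stand-in for real GATE output, the http://example.com/ URIs and the ex:type property are hypothetical, and the class name StylesheetPreview is introduced here. The sample style sheet declares version 1.0 because the transformer built into the JDK implements XSLT 1.0. The database performs its own transformation, so treat this only as a quick check of the style sheet's logic.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StylesheetPreview {

    // Hypothetical, simplified annotation set; real GATE output carries more detail.
    static final String SAMPLE_ANNOTATIONS =
        "<AnnotationSet>"
      + "<Annotation Id=\"1\" Type=\"Person\" StartNode=\"0\" EndNode=\"4\"/>"
      + "</AnnotationSet>";

    // Hypothetical style sheet: emits one rdf:Description per annotation.
    static final String SAMPLE_STYLESHEET =
        "<?xml version=\"1.0\"?>"
      + "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
      + " xmlns:ex=\"http://example.com/ns#\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/AnnotationSet\">"
      + "<rdf:RDF><xsl:apply-templates select=\"Annotation\"/></rdf:RDF>"
      + "</xsl:template>"
      + "<xsl:template match=\"Annotation\">"
      + "<rdf:Description rdf:about=\"http://example.com/annotation/{@Id}\">"
      + "<ex:type><xsl:value-of select=\"@Type\"/></ex:type>"
      + "</rdf:Description>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given style sheet to the given XML and returns the result as a string. */
    public static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints the RDF/XML produced from the sample annotation set.
        System.out.println(transform(SAMPLE_ANNOTATIONS, SAMPLE_STYLESHEET));
    }
}
```

To preview a real style sheet, pass its text and an annotation set captured from the GATE listener to transform in place of the two sample constants.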