This adapter provides functions to create full-text indexes and load them into Apache Solr servers. These functions call the Solr org.apache.solr.hadoop.MapReduceIndexerTool at run time to generate a full-text index on HDFS and optionally merge it into Solr servers. You can declare and use multiple custom put functions supplied by this adapter and the built-in put function within a single query. For example, you can load data into different Solr collections or into different Solr clusters.
The first time that you use the Solr adapter, ensure that Solr is installed and configured on your Hadoop cluster as described in "Installing Oracle XQuery for Hadoop".
Your Oracle XQuery for Hadoop query must use the following configuration properties or the equivalent annotation:
oracle.hadoop.xquery.solr.loader.zk-host
oracle.hadoop.xquery.solr.loader.collection
If the index is loaded into a live set of Solr servers, then this configuration property or the equivalent annotation is also required:
oracle.hadoop.xquery.solr.loader.go-live
You can set the configuration properties using either the -D or -conf options in the hadoop command when you run the query. See "Running Queries" and "Solr Adapter Configuration Properties".
This example sets OXH_SOLR_MR_HOME and uses the hadoop -D option in a query to set the configuration properties:
$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr
$ hadoop jar $OXH_HOME/lib/oxh.jar \
    -D oracle.hadoop.xquery.solr.loader.zk-host=/solr \
    -D oracle.hadoop.xquery.solr.loader.collection=collection1 \
    -D oracle.hadoop.xquery.solr.loader.go-live=true \
    ./myquery.xq -output ./myoutput
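The same three properties can also be supplied through the -conf option. The following is a minimal sketch, assuming the standard Hadoop configuration-file layout (a configuration element containing property entries). The file name solr-loader.xml is hypothetical, and the hadoop command is only printed so that the snippet can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical file name; any path readable by the hadoop client would do.
CONF_FILE=solr-loader.xml

# Write the loader properties in Hadoop's XML configuration-file format.
cat > "$CONF_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.zk-host</name>
    <value>/solr</value>
  </property>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.collection</name>
    <value>collection1</value>
  </property>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.go-live</name>
    <value>true</value>
  </property>
</configuration>
EOF

# Dry run: print the command instead of executing it.
# Remove "echo" (and the backslash before $OXH_HOME) to run it on a cluster.
echo hadoop jar "\$OXH_HOME/lib/oxh.jar" -conf "$CONF_FILE" ./myquery.xq -output ./myoutput
```

A configuration file keeps long property lists out of the command line and can be shared between queries, whereas -D is convenient for one-off overrides.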
To use the built-in functions in your query, you must import the Solr module as follows:
import module "oxh:solr";
The Solr module contains the following functions:
The solr prefix is bound to the oxh:solr namespace by default.
Writes a single document to the Solr index.
The XML format of this document is specified by Solr at https://wiki.apache.org/solr/UpdateXmlMessages
declare %solr:put function solr:put($value as element(doc)) external;
$value: A single XML element named doc, which contains one or more field elements, as shown here:
<doc>
   <field name="field_name_1">field_value_1</field>
   . . .
   <field name="field_name_N">field_value_N</field>
</doc>
A generated index that is written into the output_dir/solr-put directory, where output_dir is the query output directory.
You can use the following annotations to define functions that generate full-text indexes and load them into Solr.
Custom functions for generating Solr indexes must have the following signature:
declare %solr:put [additional annotations] function local:myFunctionName($value as node()) external;
Declares the solr put function. Required.
Name of the subdirectory under the query output directory where the index files will be written. Optional. The default value is the local name of the function.
Controls various aspects of index generation. You can specify multiple %solr-property annotations.
These annotations correspond to the command-line options of org.apache.solr.hadoop.MapReduceIndexerTool. Each MapReduceIndexerTool option has an equivalent Oracle XQuery for Hadoop configuration property and a %solr-property annotation. Annotations take precedence over configuration properties. See "Solr Adapter Configuration Properties" for more information about supported configuration properties and the corresponding annotations.
See Also:
For more information about MapReduceIndexerTool command-line options, see the Cloudera Search User Guide.
$value: An element or a document node conforming to the Solr XML syntax. See "solr:put" for details.
This example uses the following HDFS text file. The file contains user profile information such as user ID, full name, and age, separated by colons (:).
mydata/users.txt
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27
The first query creates a full-text index searchable by name.
import module "oxh:text";
import module "oxh:solr";

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
let $id := $split[1]
let $name := $split[2]
return solr:put(
   <doc>
      <field name="id">{ $id }</field>
      <field name="name">{ $name }</field>
   </doc>
)
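To preview the documents that this query would pass to solr:put, the split-and-select logic can be mimicked locally. This is a sketch that uses awk in place of Oracle XQuery for Hadoop. It assumes a local copy of the sample file, performs no XML escaping, and the function name preview_docs is hypothetical.

```shell
#!/bin/sh
# Recreate the sample file locally (same four lines as mydata/users.txt).
cat > users.txt <<'EOF'
john:John Doe:45
kelly:Kelly Johnson:32
laura:Laura Smith:
phil:Phil Johnson:27
EOF

# Split each line on ":" and keep only the first two fields (id and name),
# as the query does with $split[1] and $split[2]. The age field is ignored,
# so the line with an empty age still yields a document.
preview_docs() {
  awk -F: '{ printf "<doc><field name=\"id\">%s</field><field name=\"name\">%s</field></doc>\n", $1, $2 }' "$1"
}

preview_docs users.txt
```

Each input line yields one doc element, which is the unit that solr:put writes to the index.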
The second query accomplishes the same result, but uses a custom put function. It also defines all configuration parameters by using function annotations. Thus, setting configuration properties is not required when running this query.
import module "oxh:text";

declare
   %solr:put
   %solr-property:go-live
   %solr-property:zk-host("/solr")
   %solr-property:collection("collection1")
function local:my-solr-put($doc as element(doc)) external;

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
let $id := $split[1]
let $name := $split[2]
return local:my-solr-put(
   <doc>
      <field name="id">{ $id }</field>
      <field name="name">{ $name }</field>
   </doc>
)
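Because annotations apply per function, one query can declare several custom put functions that target different collections. The following untested sketch writes such a query to a file and prints the command that would run it. The collection names users_by_id and users_by_name, the function names, and the file name two-collections.xq are all hypothetical, and the query has not been run against Oracle XQuery for Hadoop.

```shell
#!/bin/sh
# Write a query with two custom put functions, one per (hypothetical) collection.
# The quoted 'EOF' keeps the shell from expanding $line and $split.
cat > two-collections.xq <<'EOF'
import module "oxh:text";

declare
   %solr:put
   %solr-property:go-live
   %solr-property:zk-host("/solr")
   %solr-property:collection("users_by_id")
function local:put-ids($doc as element(doc)) external;

declare
   %solr:put
   %solr-property:go-live
   %solr-property:zk-host("/solr")
   %solr-property:collection("users_by_name")
function local:put-names($doc as element(doc)) external;

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
return (
   local:put-ids(
      <doc><field name="id">{ $split[1] }</field></doc>
   ),
   local:put-names(
      <doc>
         <field name="id">{ $split[1] }</field>
         <field name="name">{ $split[2] }</field>
      </doc>
   )
)
EOF

# Dry run: no -D options are needed because the annotations carry the settings.
echo hadoop jar "\$OXH_HOME/lib/oxh.jar" ./two-collections.xq -output ./myoutput
```

With no directory annotation given, each index would be written to a subdirectory named after the function's local name, as described for the custom put annotations.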
The Solr adapter configuration properties correspond to the Solr MapReduceIndexerTool options.
MapReduceIndexerTool is a MapReduce batch job driver that creates Solr index shards from input files and writes the indexes into HDFS. It also supports merging the output shards into live Solr servers, typically a SolrCloud.
You can specify these properties with the generic -conf and -D hadoop command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Solr adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop command-line options.
Alternatively, you can specify these properties as Solr adapter put function annotations with the %solr-property prefix. These annotations are identified in the property descriptions. Annotations apply only to the particular Solr adapter put function that contains them in its declaration.
See Also:
For discussions about how Solr uses the MapReduceIndexerTool options, see the Cloudera Search User Guide.
oracle.hadoop.xquery.solr.loader.collection
Type: String
Default Value: Not defined
Equivalent Annotation: %solr-property:collection
Description: The SolrCloud collection for merging the index, such as mycollection. Use this property with oracle.hadoop.xquery.solr.loader.go-live and oracle.hadoop.xquery.solr.loader.zk-host. Required as either a property or an annotation.
oracle.hadoop.xquery.solr.loader.fair-scheduler-pool
Type: String
Default Value: Not defined
Equivalent Annotation: %solr-property:fair-scheduler-pool
Description: The name of the fair scheduler pool for submitting jobs. The job runs using fair scheduling instead of the default Hadoop scheduling method. Optional.
oracle.hadoop.xquery.solr.loader.go-live
Type: String values true or false
Default Value: false
Equivalent Annotation: %solr-property:go-live
Description: Set to true to merge the final index into a live Solr cluster. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.zk-host. Optional.
oracle.hadoop.xquery.solr.loader.go-live-threads
Type: Integer
Default Value: 1000
Equivalent Annotation: %solr-property:go-live-threads
Description: The maximum number of live merges that can run in parallel. Optional.
oracle.hadoop.xquery.solr.loader.log4j
Type: String
Default Value:
Equivalent Annotation: %solr-property:log4j
Description: The relative or absolute path to the log4j.properties configuration file on the local file system. For example, /path/to/log4j.properties. Optional.
This file is uploaded for each MapReduce task.
oracle.hadoop.xquery.solr.loader.mappers
Type: String
Default Value: -1
Equivalent Annotation: %solr-property:mappers
Description: The maximum number of mapper tasks that Solr uses. A value of -1 enables the use of all map slots available on the cluster.
oracle.hadoop.xquery.solr.loader.max-segments
Type: String
Default Value: 1
Equivalent Annotation: %solr-property:max-segments
Description: The maximum number of segments in the index generated by each reducer.
oracle.hadoop.xquery.solr.loader.reducers
Type: String
Default Value: -1
Equivalent Annotation: %solr-property:reducers
Description: The number of reducers to use:
-1: Uses all reduce slots available on the cluster.
-2: Uses one reducer for each Solr output shard. This setting disables the MapReduce M-tree merge algorithm, which typically improves scalability.
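Tuning properties such as the reducer count are passed on the command line like any other loader property. The sketch below builds the -D options from short names. It assumes that every property uses the oracle.hadoop.xquery.solr.loader prefix, following the pattern of the properties named earlier. The helper name solr_opts is hypothetical, and the command is only printed.

```shell
#!/bin/sh
# Turn short settings such as "reducers=-2" into -D options
# carrying the (assumed) common loader property prefix.
solr_opts() {
  for kv in "$@"; do
    printf ' -D oracle.hadoop.xquery.solr.loader.%s' "$kv"
  done
}

# Dry run: print a command that requests one reducer per output shard.
echo "hadoop jar \$OXH_HOME/lib/oxh.jar$(solr_opts zk-host=/solr collection=collection1 go-live=true reducers=-2) ./myquery.xq -output ./myoutput"
```

Properties set this way apply to every Solr put function in the query. A %solr-property annotation on a specific function would take precedence for that function.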
oracle.hadoop.xquery.solr.loader.zk-host
Type: String
Default Value: Not defined
Equivalent Annotation: %solr-property:zk-host
Description: The address of a ZooKeeper ensemble used by the SolrCloud cluster. Specify the address as a list of comma-separated host:port pairs, each corresponding to a ZooKeeper server. For example, 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183/solr.
If the address starts with a slash (/), such as /solr, then Oracle XQuery for Hadoop automatically prefixes the address with the ZooKeeper connection string.
This property enables Solr to determine the number of output shards to create and the Solr URLs in which to merge them. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.go-live. Required as either a property or an annotation.
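The two accepted address forms can be illustrated with a small helper that mimics the prefixing rule described above. This is an illustration of the documented behavior only, not the code that Oracle XQuery for Hadoop runs. The function name resolve_zk_host is hypothetical, and the connection string is passed in as an argument because the text does not say where it is obtained.

```shell
#!/bin/sh
# $1 = zk-host value as given in the property or annotation
# $2 = ZooKeeper connection string (hypothetical input for this sketch)
resolve_zk_host() {
  case "$1" in
    # Starts with a slash: prepend the connection string.
    /*) printf '%s%s\n' "$2" "$1" ;;
    # Already a full host:port list: use it unchanged.
    *)  printf '%s\n' "$1" ;;
  esac
}

resolve_zk_host /solr 127.0.0.1:2181,127.0.0.1:2182
resolve_zk_host 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183/solr ignored:2181
```

The short form keeps queries portable across clusters, while the full form pins the query to a specific ZooKeeper ensemble.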