Solr Adapter

This adapter provides functions to create full-text indexes and load them into Apache Solr servers. These functions call the Solr org.apache.solr.hadoop.MapReduceIndexerTool at run time to generate a full-text index on HDFS and optionally merge it into Solr servers. You can declare and use multiple custom put functions supplied by this adapter and the built-in put function within a single query. For example, you can load data into different Solr collections or into different Solr clusters.


Prerequisites for Using the Solr Adapter

Before you use the Solr adapter for the first time, ensure that Solr is installed and configured on your Hadoop cluster as described in "Installing Oracle XQuery for Hadoop".

Configuration Settings

Your Oracle XQuery for Hadoop query must use the following configuration properties or the equivalent annotation:

  • oracle.hadoop.xquery.solr.loader.zk-host

  • oracle.hadoop.xquery.solr.loader.collection

If the index is loaded into a live set of Solr servers, then this configuration property or the equivalent annotation is also required:

  • oracle.hadoop.xquery.solr.loader.go-live

You can set the configuration properties using either the -D or -conf option of the hadoop command when you run the query. See "Running Queries" and "Solr Adapter Configuration Properties".

Example Query Using the Solr Adapter

This example sets OXH_SOLR_MR_HOME and uses the hadoop -D option in a query to set the configuration properties:

$ export OXH_SOLR_MR_HOME=/usr/lib/solr/contrib/mr
$ hadoop jar $OXH_HOME/lib/oxh.jar \
    -D oracle.hadoop.xquery.solr.loader.zk-host=/solr \
    -D oracle.hadoop.xquery.solr.loader.collection=collection1 \
    -D oracle.hadoop.xquery.solr.loader.go-live=true \
    ./myquery.xq -output ./myoutput
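
The same three properties can instead be kept in a Hadoop configuration file and passed with the -conf option. The following is a minimal sketch, assuming the standard Hadoop configuration XML layout. The file name solr-loader-conf.xml is a hypothetical choice, and the script only prints the hadoop command (a dry run) rather than executing it.

```shell
#!/bin/sh
# Write the loader properties into a Hadoop configuration file.
# The file name solr-loader-conf.xml is hypothetical; any name works with -conf.
cat > solr-loader-conf.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.zk-host</name>
    <value>/solr</value>
  </property>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.collection</name>
    <value>collection1</value>
  </property>
  <property>
    <name>oracle.hadoop.xquery.solr.loader.go-live</name>
    <value>true</value>
  </property>
</configuration>
EOF

# Dry run: print the command instead of running it.
conf_cmd="hadoop jar \$OXH_HOME/lib/oxh.jar -conf ./solr-loader-conf.xml ./myquery.xq -output ./myoutput"
echo "$conf_cmd"
```

Keeping the properties in a file may be convenient when several queries load into the same collection, because the command line stays short and the settings can be versioned.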

Built-in Functions for Loading Data into Solr Servers

To use the built-in functions in your query, you must import the Solr module as follows:

import module "oxh:solr";

The solr prefix is bound to the oxh:solr namespace by default.

The Solr module contains the following function:

solr:put

Writes a single document to the Solr index.

The XML format of this document is specified by Solr at

https://wiki.apache.org/solr/UpdateXmlMessages

Signature

declare %solr:put function
   solr:put($value as element(doc)) external;

Parameters

$value: A single XML element named doc, which contains one or more field elements, as shown here:

<doc>
<field name="field_name_1">field_value_1</field>
     .
     .
     .
<field name="field_name_N">field_value_N</field>
</doc>

Returns

A generated index that is written into the output_dir/solr-put directory, where output_dir is the query output directory.

Custom Functions for Loading Data into Solr Servers

You can use the following annotations to define functions that generate full-text indexes and load them into Solr.

Signature

Custom functions for generating Solr indexes must have the following signature:

declare %solr:put [additional annotations] 
   function local:myFunctionName($value as node()) external;

Annotations

%solr:put

Declares the solr put function. Required.

%solr:file(directory_name)

Name of the subdirectory under the query output directory where the index files are written. Optional. The default value is the local name of the function.

%solr-property:property_name(value)

Controls various aspects of index generation. You can specify multiple %solr-property annotations.

These annotations correspond to the command-line options of org.apache.solr.hadoop.MapReduceIndexerTool. Each MapReduceIndexerTool option has an equivalent Oracle XQuery for Hadoop configuration property and a %solr-property annotation. Annotations take precedence over configuration properties. See "Solr Adapter Configuration Properties" for more information about supported configuration properties and the corresponding annotations.

See Also:

For more information about MapReduceIndexerTool command-line options, see the Cloudera Search User Guide at

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html

Parameters

$value: An element or a document node conforming to the Solr XML syntax. See "solr:put" for details.
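
As an illustration of the annotations above, the following sketch writes a query file whose custom put function uses %solr:file to name its index subdirectory. The function name local:put-users, the directory name users-index, and the output directory ./myoutput are hypothetical. Because go-live is not set, this sketch assumes the index is only generated on HDFS and not merged into live servers. The script prints the hadoop command rather than running it.

```shell
#!/bin/sh
# Write a query file. The quoted 'EOF' keeps $line and $split unexpanded.
cat > myquery.xq <<'EOF'
import module "oxh:text";

declare
   %solr:put
   %solr:file("users-index")
   %solr-property:zk-host("/solr")
   %solr-property:collection("collection1")
function local:put-users($value as node()) external;

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
return local:put-users(
   <doc>
      <field name="id">{ $split[1] }</field>
      <field name="name">{ $split[2] }</field>
   </doc>
)
EOF

# The index subdirectory is the %solr:file value under the query output directory.
output_dir=./myoutput
index_dir="$output_dir/users-index"

# Dry run: print the command and the expected index location.
echo "hadoop jar \$OXH_HOME/lib/oxh.jar ./myquery.xq -output $output_dir"
echo "index files expected under: $index_dir"
```

Without the %solr:file annotation, the subdirectory in this sketch would default to the function local name, put-users.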

Examples of Solr Adapter Functions

Example 1   Using the Built-in solr:put Function

This example uses the following HDFS text file. The file contains user profile information such as user ID, full name, and age, separated by colons (:).

mydata/users.txt
john:John Doe:45 
kelly:Kelly Johnson:32
laura:Laura Smith: 
phil:Phil Johnson:27

The first query creates a full-text index searchable by name.

import module "oxh:text";
import module "oxh:solr";
for $line in text:collection("mydata/users.txt") 
let $split := fn:tokenize($line, ":") 
let $id := $split[1]
let $name := $split[2]
return solr:put(
<doc>
<field name="id">{ $id }</field>
<field name="name">{ $name }</field>
</doc>
)

The second query accomplishes the same result, but uses a custom put function. It also defines all configuration parameters by using function annotations. Thus, setting configuration properties is not required when running this query.

import module "oxh:text";
declare %solr:put %solr-property:go-live %solr-property:zk-host("/solr") %solr-property:collection("collection1") 
function local:my-solr-put($doc as element(doc)) external;
for $line in text:collection("mydata/users.txt") 
let $split := fn:tokenize($line, ":") 
let $id := $split[1]
let $name := $split[2]
return local:my-solr-put(
<doc>
<field name="id">{ $id }</field>
<field name="name">{ $name }</field>
</doc>
)

Solr Adapter Configuration Properties

The Solr adapter configuration properties correspond to the Solr MapReduceIndexerTool options.

MapReduceIndexerTool is a MapReduce batch job driver that creates Solr index shards from input files, and writes the indexes into HDFS. It also supports merging the output shards into live Solr servers, typically a SolrCloud.

You can specify these properties with the generic -conf and -D hadoop command-line options in Oracle XQuery for Hadoop. Properties specified using this method apply to all Solr adapter put functions in your query. See "Running Queries" and especially "Generic Options" for more information about the hadoop command-line options.

Alternatively, you can specify these properties as Solr adapter put function annotations with the %solr-property prefix. These annotations are identified in the property descriptions. Annotations apply only to the particular Solr adapter put function that contains them in its declaration.
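
The difference in scope between the two methods can be sketched as follows. In this hypothetical query, zk-host and go-live are passed with -D and therefore apply to both put functions, while each function names its own collection in an annotation. The function names, the collection names, and the file name two-collections.xq are assumptions, as is returning both put calls as a single sequence. The script prints the command rather than running it.

```shell
#!/bin/sh
# Write a query with two custom put functions, one per Solr collection.
cat > two-collections.xq <<'EOF'
import module "oxh:text";

declare %solr:put %solr-property:collection("collection1")
function local:put-ids($value as node()) external;

declare %solr:put %solr-property:collection("collection2")
function local:put-names($value as node()) external;

for $line in text:collection("mydata/users.txt")
let $split := fn:tokenize($line, ":")
return (
   local:put-ids(
      <doc><field name="id">{ $split[1] }</field></doc>
   ),
   local:put-names(
      <doc>
         <field name="id">{ $split[1] }</field>
         <field name="name">{ $split[2] }</field>
      </doc>
   )
)
EOF

# Properties shared by every put function in the query.
prefix=oracle.hadoop.xquery.solr.loader
shared_opts="-D ${prefix}.zk-host=/solr -D ${prefix}.go-live=true"

# Dry run: print the command instead of running it.
echo "hadoop jar \$OXH_HOME/lib/oxh.jar $shared_opts ./two-collections.xq -output ./myoutput"
```

Because neither function declares %solr:file, each index would be written to a subdirectory named after the function local name (put-ids and put-names).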

See Also:

For discussions about how Solr uses the MapReduceIndexerTool options, see the Cloudera Search User Guide at

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html

oracle.hadoop.xquery.solr.loader.collection

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:collection

Description: The SolrCloud collection for merging the index, such as mycollection. Use this property with oracle.hadoop.xquery.solr.loader.go-live and oracle.hadoop.xquery.solr.loader.zk-host. Required as either a property or an annotation.

oracle.hadoop.xquery.solr.loader.fair-scheduler-pool

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:fair-scheduler-pool

Description: The name of the fair scheduler pool for submitting jobs. The job runs using fair scheduling instead of the default Hadoop scheduling method. Optional.

oracle.hadoop.xquery.solr.loader.go-live

Type: String values true or false

Default Value: false

Equivalent Annotation: %solr-property:go-live

Description: Set to true to enable the final index to merge into a live Solr cluster. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.zk-host. Optional.

oracle.hadoop.xquery.solr.loader.go-live-threads

Type: Integer

Default Value: 1000

Equivalent Annotation: %solr-property:go-live-threads

Description: The maximum number of live merges that can run in parallel. Optional.

oracle.hadoop.xquery.solr.loader.log4j

Type: String

Default Value:

Equivalent Annotation: %solr-property:log4j

Description: The relative or absolute path to the log4j.properties configuration file on the local file system. For example, /path/to/log4j.properties. Optional.

This file is uploaded for each MapReduce task.

oracle.hadoop.xquery.solr.loader.mappers

Type: String

Default Value: -1

Equivalent Annotation: %solr-property:mappers

Description: The maximum number of mapper tasks that Solr uses. A value of -1 enables the use of all map slots available on the cluster.

oracle.hadoop.xquery.solr.loader.max-segments

Type: String

Default Value: 1

Equivalent Annotation: %solr-property:max-segments

Description: The maximum number of segments in the index generated by each reducer.

oracle.hadoop.xquery.solr.loader.reducers

Type: String

Default Value: -1

Equivalent Annotation: %solr-property:reducers

Description: The number of reducers to use:

  • -1: Uses all reduce slots available on the cluster.

  • -2: Uses one reducer for each Solr output shard. This setting disables the MapReduce M-tree merge algorithm; that algorithm typically improves scalability.
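
For example, a run that uses one reducer for each output shard might be launched as sketched below. The query and output paths are reused from the earlier example, and the script only prints the command (a dry run) rather than executing it.

```shell
#!/bin/sh
# Build the -D options, including reducers=-2 (one reducer per output shard).
prefix=oracle.hadoop.xquery.solr.loader
opts="-D ${prefix}.zk-host=/solr"
opts="$opts -D ${prefix}.collection=collection1"
opts="$opts -D ${prefix}.reducers=-2"

# Dry run: print the command instead of running it.
echo "hadoop jar \$OXH_HOME/lib/oxh.jar $opts ./myquery.xq -output ./myoutput"
```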

oracle.hadoop.xquery.solr.loader.zk-host

Type: String

Default Value: Not defined

Equivalent Annotation: %solr-property:zk-host

Description: The address of a ZooKeeper ensemble used by the SolrCloud cluster. Specify the address as a list of comma-separated host:port pairs, each corresponding to a ZooKeeper server. For example, 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183/solr.

If the address starts with a slash (/), such as /solr, then Oracle XQuery for Hadoop automatically prefixes the address with the ZooKeeper connection string.

This property enables Solr to determine the number of output shards to create and the Solr URLs in which to merge them. Use this property with oracle.hadoop.xquery.solr.loader.collection and oracle.hadoop.xquery.solr.loader.go-live. Required as either a property or an annotation.
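
The two forms of the zk-host value described above can be told apart by the leading slash. The following sketch prints the -D option for each form along with a note on how it is treated. The helper name describe_zk_host is hypothetical, and the note for the full form is an assumption based on the description above.

```shell
#!/bin/sh
# Classify a zk-host value by its form. The function name is hypothetical.
describe_zk_host() {
    case "$1" in
        /*) echo "short form: the ZooKeeper connection string is prefixed automatically" ;;
        *)  echo "full form: the host:port list is supplied by the caller" ;;
    esac
}

# Print the -D option for the short form and for the full example address.
for zk in /solr 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183/solr; do
    echo "-D oracle.hadoop.xquery.solr.loader.zk-host=$zk"
    describe_zk_host "$zk"
done
```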