About the Bulk Load API

Data records can be added or replaced through the Dgraph Bulk Load Interface.

In addition to the Data Ingest API, you can use the Bulk Load Interface to ingest records into an Endeca data domain. The Bulk Load API is a collection of Java classes packaged in a single endeca_bulk_load.jar file, which ships in the Endeca Server's apis directory. The Javadoc for the Bulk Load API is located in the apis/doc/bulk_load directory.

Bulk Load characteristics

The characteristics of the interface are:
  • The API can load data source records only. It cannot load PDRs, DDRs, managed attribute values, the GCR, or the Dgraph configuration documents.
  • Existing records in the Endeca data domain are replaced, not updated. That is, the replace operation is not additive: the key-value pair list of the incoming record completely replaces the key-value pair list of the existing record.
  • A primary-key attribute (also called the record spec) is required for each record to be added or replaced.
  • If an assignment is for a standard attribute (property) that does not exist in the Endeca data domain, the new standard attribute is automatically created with system default values for the PDR. For these default values, see Default values for new attributes.
  • A source record can be at most 128 MB in size. An attempt to ingest a larger source record fails and returns an error (with the record spec of the rejected record), but the bulk load ingest continues with the records that follow the rejected one.
The interface rejects non-XML 1.0 characters upon ingest. That is, every character to be ingested must match production 2 (Char) of the XML 1.0 specification. If an invalid character is detected, an exception is thrown with this error message:
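The replace-not-update rule in the list above can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not the Bulk Load API: the class ReplaceSemanticsSketch, its methods, and the sample attribute names are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of replace-not-update semantics; it does not call the Bulk Load API.
class ReplaceSemanticsSketch {
    // Records keyed by primary key (record spec); each record maps an
    // attribute name to its assigned values.
    private final Map<String, Map<String, List<String>>> records = new HashMap<>();

    // Adds the record, or replaces an existing record with the same key wholesale.
    void addOrReplace(String primaryKey, Map<String, List<String>> assignments) {
        records.put(primaryKey, new HashMap<>(assignments));
    }

    Map<String, List<String>> get(String primaryKey) {
        return records.get(primaryKey);
    }

    public static void main(String[] args) {
        ReplaceSemanticsSketch domain = new ReplaceSemanticsSketch();
        domain.addOrReplace("wine-1",
                Map.of("Name", List.of("Merlot"), "Region", List.of("Napa")));
        // The incoming record has no Region assignment, so Region is gone afterwards.
        domain.addOrReplace("wine-1", Map.of("Name", List.of("Merlot Reserve")));
        System.out.println(domain.get("wine-1").containsKey("Region")); // false
    }
}
```

In practice this means that an incoming record must carry every assignment you want the stored record to keep, not only the values that changed.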
Character <c> is not legal in XML 1.0

The record with the invalid character is rejected.
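Because a single invalid character causes the whole record to be rejected, it can help to check values on the client before ingest. The following is a minimal sketch of such a pre-check, based on production 2 (Char) of the XML 1.0 specification. The class and method names are hypothetical, and the code is not taken from the Bulk Load API.

```java
// Client-side pre-check against production 2 (Char) of XML 1.0.
// Hypothetical helper; the Bulk Load Interface performs its own validation.
class Xml10CharCheck {
    // True if the code point is a legal XML 1.0 character.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if every character is legal.
    static int firstIllegalIndex(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("Merlot\tReserve")); // -1 (tab is legal)
        System.out.println(firstIllegalIndex("bad\u0001value"));  // 3
    }
}
```

Whether to strip, replace, or report an offending character is an application decision; the sketch only locates it.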

Implementing logging

The BulkIngester class is the primary entry point to the client-side Bulk Load Interface, which loads data into an Endeca data domain. BulkIngester uses the java.util.logging package to emit log output for debugging purposes. (Clients see the debugging messages through Callback classes and Exception messages.)

To implement logging for your program, you can use other logging libraries, such as Apache Commons logging, Simple Logging Facade for Java (SLF4J), and so on. For details, see the documentation that is provided with those logging components.
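Because BulkIngester logs through java.util.logging, its debugging output can be surfaced with standard JDK configuration. The sketch below raises a named logger to FINE and attaches a console handler. The logger name "com.endeca" is an assumption made for illustration; check the Javadoc or the actual log output for the name that BulkIngester really uses.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

class BulkLoadLoggingSketch {
    // Assumed logger name, for illustration only.
    static final String ASSUMED_LOGGER_NAME = "com.endeca";

    // Raises the named logger to FINE and attaches a console handler that
    // also passes FINE records. Keep a reference to the returned Logger,
    // because the LogManager holds loggers only weakly.
    static Logger enableDebugLogging(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.FINE);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        logger.setUseParentHandlers(false); // avoid duplicate output through the root handler
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = enableDebugLogging(ASSUMED_LOGGER_NAME);
        logger.fine("debug logging enabled");
    }
}
```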

Implementing SSL connections

The BulkIngester constructor's socketFactory parameter takes a socket factory that is used to create sockets for the bulk ingest operations. Clients can therefore specify their own pre-configured SocketFactory. This is especially useful for SSL: pass an SSLSocketFactory in this parameter to use SSL connections.

The sample program imports two Java socket classes and then sets the socketFactory parameter as follows:
import javax.net.SocketFactory;
import javax.net.ssl.SSLSocketFactory;
...
public BulkLoad(String host, int port, boolean ssl) throws IOException {
    _host = host;
    _port = port;
    BulkLoadCallback blcb = new BulkLoadCallback();

    // Use the default SSL socket factory when ssl is true; otherwise use the plain default.
    SocketFactory sf = ssl ? SSLSocketFactory.getDefault() : SocketFactory.getDefault();
...

If the value of the incoming ssl boolean parameter is true, then sf is set to a copy of the environment's default SSL socket factory; otherwise, it is set to a copy of the default non-SSL socket factory.
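The selection logic can be tried in isolation, without the rest of the sample program. The following sketch wraps it in a hypothetical helper class, SocketFactoryChoice, and prints which kind of factory is returned for each value of the flag.

```java
import javax.net.SocketFactory;
import javax.net.ssl.SSLSocketFactory;

// Standalone version of the factory selection; the class name is hypothetical.
class SocketFactoryChoice {
    // Returns the default SSL factory when ssl is true, else the plain default factory.
    static SocketFactory choose(boolean ssl) {
        return ssl ? SSLSocketFactory.getDefault() : SocketFactory.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(choose(true) instanceof SSLSocketFactory);  // true
        System.out.println(choose(false) instanceof SSLSocketFactory); // false
    }
}
```

The default SSL socket factory takes its trust settings from the JVM, for example from the javax.net.ssl.trustStore system property, so a client that needs different settings would build its own SSLSocketFactory and pass that instead.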

Post-ingest behavior

Two operations must occur at some point after each bulk-load ingest:
  • A merge of the ingested records to a single generation, which re-indexes the database to optimize query performance.
  • A rebuild of the aspell spelling dictionary, so that the newly added data is available for Did You Mean (DYM) spelling suggestions and autocorrect.
The BulkIngester constructor's doFinalMerge parameter allows you to set when the post-ingest merge occurs:
  • If set to true, the merge is forced immediately after ingest. This behavior is intended to maximize query performance at the end of a single, large, homogeneous data update that would occur during a regularly scheduled update window.
  • If set to false, no merge is forced at the end of the update; instead, the regular background merge process keeps the generations in order over time. This behavior is more suitable for parallel heterogeneous data updates where low overall update latency is paramount.
The BulkIngester constructor's doUpdateDictionary parameter lets you specify when the spelling dictionaries are updated:
  • A setting of true means a dictionary update is forced immediately after the ingest.
  • A setting of false means the dictionary update is disabled. You can later update the dictionaries. For information, see Updating spelling dictionaries for the data domain.

If you are doing multiple, consecutive bulk-load operations, you can set both parameters to false on all except the last one.
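The rule for consecutive loads can be sketched as a small planning helper that computes the two flag values for each load in a sequence. The class ConsecutiveLoadPlan and its nested Flags type are hypothetical; only the names doFinalMerge and doUpdateDictionary come from the text above, and the sketch does not construct a BulkIngester.

```java
import java.util.ArrayList;
import java.util.List;

// Computes per-load flag values for a run of consecutive bulk loads.
class ConsecutiveLoadPlan {
    // Hypothetical holder for the two constructor flags.
    static final class Flags {
        final boolean doFinalMerge;
        final boolean doUpdateDictionary;

        Flags(boolean doFinalMerge, boolean doUpdateDictionary) {
            this.doFinalMerge = doFinalMerge;
            this.doUpdateDictionary = doUpdateDictionary;
        }
    }

    // Both flags are false for every load except the last one.
    static List<Flags> plan(int loadCount) {
        List<Flags> flags = new ArrayList<>();
        for (int i = 0; i < loadCount; i++) {
            boolean last = (i == loadCount - 1);
            flags.add(new Flags(last, last));
        }
        return flags;
    }

    public static void main(String[] args) {
        for (Flags f : plan(3)) {
            System.out.println(f.doFinalMerge + " " + f.doUpdateDictionary);
        }
        // prints "false false", "false false", "true true"
    }
}
```

Deferring both operations to the last load means the merge and the dictionary rebuild each run once for the whole sequence instead of once per load.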