Data records can be added or replaced via the Dgraph's Bulk Load
Interface.
In addition to the Data Ingest API, you can use the Bulk Load Interface to
ingest records into an Endeca data domain. The Bulk Load API is a collection
of Java classes packaged in a single
endeca_bulk_load.jar file, which ships in the
Endeca Server's
apis directory. The Javadoc for the Bulk Load API is
located in the
apis/doc/bulk_load directory.
Bulk Load characteristics
The characteristics of the interface are:
- The API can load data
source records only. It cannot load PDRs, DDRs, managed attribute values, the
GCR, or the Dgraph configuration documents.
- Existing records in the
Endeca data domain are replaced, not updated. That is, the replace operation is
not additive. Therefore, the key-value pair list of the incoming record will
completely replace the key-value pair list of the existing record.
- A primary-key attribute
(also called the record spec) is required for each record to be added or
replaced.
- If an assignment is for a
standard attribute (property) that does not exist in the Endeca data domain,
the new standard attribute is automatically created with system default values
for the PDR. For these default values, see
Default values for new attributes.
- A source record can be at most 128MB
in size. An attempt to ingest a larger source record fails and returns an
error that includes the record spec of the rejected record. The bulk load
ingest then continues with the records that follow the rejected one.
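The replace-not-update behavior described in the list above can be illustrated with a toy in-memory model. This is a minimal sketch, not the Bulk Load API: the class name, the method names, and the simplification of a record to a single-valued Map are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model of replace-by-record-spec semantics.
// NOT the Bulk Load API; every name in this class is hypothetical.
final class ReplaceSemanticsSketch {
    // Records keyed by their primary-key (record spec) value.
    private final Map<String, Map<String, String>> store = new LinkedHashMap<>();

    // Adds the record, or replaces its whole key-value list if the spec exists.
    void addOrReplace(String recordSpec, Map<String, String> assignments) {
        if (recordSpec == null || recordSpec.isEmpty()) {
            throw new IllegalArgumentException("a record spec is required");
        }
        // put() discards the previous assignments; nothing is merged.
        store.put(recordSpec, new LinkedHashMap<>(assignments));
    }

    // Returns the current key-value list for the spec, or null if absent.
    Map<String, String> get(String recordSpec) {
        return store.get(recordSpec);
    }
}
```

If a record with the assignments color and size is loaded and then a second record with the same record spec carries only color, the stored record ends up with color alone. The size assignment is gone because the incoming list replaces the existing one.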
The interface rejects characters that are not valid in XML 1.0. That is,
each ingested character must match the Char production (production 2) of the
XML 1.0 specification. If an invalid character is detected, an exception is
thrown with this error message:
Character <c> is not legal in XML 1.0
The record with the invalid character is rejected.
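To avoid rejected records, a client could screen attribute values before handing them to the interface. The following is a minimal sketch of such a check, based on the Char production of the XML 1.0 specification. The class and method names are hypothetical and are not part of the Bulk Load API.

```java
// Client-side pre-check for XML 1.0 legal characters.
// Hypothetical helper; not part of the Bulk Load API.
final class Xml10Chars {
    // Production 2 of XML 1.0:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the char index of the first illegal code point, or -1 if none.
    static int firstIllegalIndex(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (!isLegal(cp)) {
                return i;
            }
            // Advance by 2 chars for supplementary code points, 1 otherwise.
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

The loop iterates by code point rather than by char, so a valid surrogate pair is treated as one supplementary character, while an unpaired surrogate is reported as illegal.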
Implementing logging
The
BulkIngester class is the primary entry point of
the client-side Bulk Load Interface for loading data into an Endeca data
domain.
BulkIngester uses the
java.util.logging package to emit log messages for
debugging purposes. (Clients see the debugging messages through Callback
classes and Exception messages.)
To implement logging for your program, you can use other logging
libraries, such as Apache Commons logging, Simple Logging Facade for Java
(SLF4J), and so on. For details, see the documentation that is provided with
those logging components.
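Because BulkIngester logs through java.util.logging, its debugging output can be made visible by lowering the level on the relevant logger and on a handler. The following is a minimal sketch that uses only standard JDK calls. The logger name shown is a placeholder assumption, since the name that BulkIngester actually uses is not stated here; check the Javadoc for it.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of enabling FINE-level output for a java.util.logging logger.
final class BulkLoadLoggingSketch {
    // Placeholder only: the real BulkIngester logger name is an unknown here.
    static final String ASSUMED_LOGGER_NAME = "example.bulkload";

    // Lowers the logger's level and attaches a console handler at that level.
    static Logger enableDebugLogging(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.FINE);
        ConsoleHandler handler = new ConsoleHandler();
        // ConsoleHandler defaults to INFO, so it must be lowered as well.
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        // Keep a reference to the returned logger so its settings are not
        // lost to garbage collection.
        return logger;
    }
}
```

A client would call enableDebugLogging once at startup, before constructing the BulkIngester, and hold on to the returned Logger for the lifetime of the program.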
Implementing SSL connections
The
BulkIngester constructor's
socketFactory parameter takes a socket factory
that is used to create sockets for the bulk ingest operations. Clients can
therefore supply their own pre-configured SocketFactory. This is especially
useful for SSL: pass an SSLSocketFactory in this parameter to make the bulk
ingest use SSL connections.
The sample program imports two Java socket classes and then sets the
socketFactory parameter as follows:
import javax.net.SocketFactory;
import javax.net.ssl.SSLSocketFactory;
...
public BulkLoad(String host, int port, boolean ssl) throws IOException {
    _host = host;
    _port = port;
    BulkLoadCallback blcb = new BulkLoadCallback();
    SocketFactory sf = ssl ? SSLSocketFactory.getDefault() : SocketFactory.getDefault();
    ...
If the value of the incoming
ssl Boolean parameter is
true, then
sf is set to the environment's default SSL
socket factory. Otherwise, it is set to a copy of the default non-SSL socket
factory.
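The factory selection can also be isolated in a small helper, which makes it easy to swap in a factory built from a custom SSLContext. The following is a sketch that uses only standard JDK classes. The class and method names are hypothetical, and the sketch does not show the BulkIngester constructor call itself.

```java
import java.security.GeneralSecurityException;
import javax.net.SocketFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper for choosing the socket factory to pass to BulkIngester.
final class SocketFactorySketch {
    // Same selection as the sample program: default SSL or default plain factory.
    static SocketFactory defaultFactory(boolean ssl) {
        return ssl ? SSLSocketFactory.getDefault() : SocketFactory.getDefault();
    }

    // Builds an SSL socket factory from an explicitly created TLS context.
    static SocketFactory fromTlsContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            // Passing nulls selects the default key managers, trust managers,
            // and secure random source; supply your own to customize trust.
            context.init(null, null, null);
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not create TLS context", e);
        }
    }
}
```

Going through an SSLContext is useful when the client needs its own keystore or truststore rather than the JVM-wide defaults. In that case, pass non-null KeyManager and TrustManager arrays to init.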
Post-ingest behavior
There are two operations that must occur at some time after each
bulk-load ingest:
- A merge of the ingested
records to a single generation, which re-indexes the database to optimize query
performance.
- A rebuild of the aspell
spelling dictionary, so that the newly added data is available for the
"Did You Mean" (DYM) and autocorrect spelling features.
The
BulkIngester constructor's
doFinalMerge parameter allows you to set when the
post-ingest merge occurs:
- If set to
true, the merge is forced immediately after ingest.
This behavior is intended to maximize query performance at the end of a single,
large, homogeneous data update that would occur during a regularly scheduled
update window.
- If set to
false, a merge is not forced at the end of the
update. Instead, the regular background merge process keeps the
generations in order over time. This behavior is more suitable for parallel
heterogeneous data updates where low overall update latency is paramount.
The
BulkIngester constructor's
doUpdateDictionary parameter lets you specify when
the spelling dictionaries are updated:
- A setting of
true means a dictionary update is forced immediately
after the ingest.
- A setting of
false means the dictionary update is disabled. You
can later update the dictionaries. For information, see
Updating spelling dictionaries for the data domain.
If you are doing multiple, consecutive bulk-load operations, you can
set both parameters to
false on all except the last one.
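For a sequence of consecutive loads, the two flag values can be derived from the position of each load in the sequence. The following is a minimal sketch of that rule. The helper class and its method are hypothetical, and it only computes the values that a client would then pass as doFinalMerge and doUpdateDictionary.

```java
// Hypothetical helper that computes post-ingest flags for consecutive loads.
final class PostIngestFlagsSketch {
    // Returns {doFinalMerge, doUpdateDictionary} for the load at the given
    // zero-based index out of `total` consecutive bulk-load operations.
    static boolean[] flagsFor(int index, int total) {
        if (total <= 0 || index < 0 || index >= total) {
            throw new IllegalArgumentException("index must be in [0, total)");
        }
        // Only the last load forces the merge and the dictionary update.
        boolean isLast = index == total - 1;
        return new boolean[] { isLast, isLast };
    }
}
```

With three consecutive loads, the first two get false for both flags and the third gets true for both, so the merge and the dictionary rebuild each run once, after all the data is in.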