Importing records from HDFS for ingest

The Dgraph HDFS Agent plays a major part in the loading of data from a Data Processing workflow into the Dgraph.

The Dgraph HDFS Agent's role in the ingest procedure is to read the output Avro files from the Data Processing workflow, format them for ingest, and send them to the Dgraph.

Specifically, the high-level steps in the ingest process are:

1. A Data Processing workflow finishes by writing a set of records to Avro files in the output directory.
2. The Spark client then locates the Dgraph leader node and the Bulk Load port for the ingest, based on the data set name. The Dgraph that ingests the records must be the leader node of the Dgraph cluster in the BDD deployment. Big Data Discovery elects the leader Dgraph node automatically.
3. The Dgraph HDFS Agent reads the Avro files and converts them to a format that the Bulk Load interface of the Dgraph can accept.
4. The Dgraph HDFS Agent sends the files to the Dgraph via the Bulk Load interface's port.
5. When the job completes successfully, the files holding the initial data are deleted.

Data sets are ingested using a round-robin multiplexing algorithm. The Dgraph HDFS Agent divides the records from a given data set into batches. Each batch is processed as a complete ingest before the next batch is processed. If two or more data sets are being processed, the round-robin algorithm alternates between the source data sets, sending one record batch from each in turn to the Dgraph. Therefore, although the Dgraph processes only one ingest operation at any one time, this multiplexing scheme schedules all active ingest operations fairly.
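The round-robin scheduling described above can be sketched as follows. This is only an illustration of the general technique, not the Dgraph HDFS Agent's actual code: the function names `batches` and `round_robin_batches`, the `batch_size` parameter, and the use of plain dictionaries as records are all assumptions made for the example.

```python
from collections import deque
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def round_robin_batches(
    data_sets: Dict[str, Iterable[dict]], batch_size: int
) -> Iterator[Tuple[str, List[dict]]]:
    """Alternate between data sets, yielding one batch at a time.

    Hypothetical helper: each yielded (name, batch) pair stands for one
    complete ingest that would be sent before the next batch is taken.
    """
    queue = deque(
        (name, batches(records, batch_size)) for name, records in data_sets.items()
    )
    while queue:
        name, batch_iter = queue.popleft()
        batch = next(batch_iter, None)
        if batch is None:
            continue  # this data set has no more records; drop it
        yield name, batch
        queue.append((name, batch_iter))  # go to the back of the line


if __name__ == "__main__":
    pending = {
        "sales": [{"id": 1}, {"id": 2}, {"id": 3}],
        "customers": [{"id": 10}],
    }
    for name, batch in round_robin_batches(pending, batch_size=2):
        print(name, batch)
```

With two data sets pending, the batches come out interleaved (one from the first data set, one from the second, then back to the first), so a large data set cannot block a small one until it finishes.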

Note that if Data Processing writes a NULL or empty value to the HDFS Avro file, the Dgraph HDFS Agent skips that value when constructing a record from the source data for consumption by the Bulk Load interface.
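As a rough illustration of this skipping behavior, the sketch below drops NULL and empty values while building a record from a row of source data. The helper names `is_blank` and `build_record` are hypothetical, and the exact definition of "empty" used here (empty strings and empty collections) is an assumption; the Dgraph HDFS Agent may apply a different rule.

```python
from typing import Any, Dict


def is_blank(value: Any) -> bool:
    """Return True for None and for empty strings or collections.

    Assumption: "empty" means zero-length. Values such as 0 or False
    are real data and are kept.
    """
    if value is None:
        return True
    if isinstance(value, (str, bytes, list, tuple, dict, set)):
        return len(value) == 0
    return False


def build_record(avro_row: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only the attributes that carry a value into the record."""
    return {name: value for name, value in avro_row.items() if not is_blank(value)}


if __name__ == "__main__":
    row = {"id": 1, "name": "", "price": None, "qty": 0, "tags": []}
    print(build_record(row))
```

In this sketch the resulting record simply has no assignment for a skipped attribute, rather than carrying an explicit null.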

Updating the spelling dictionaries

When the Dgraph HDFS Agent sends the ingest request to the Dgraph, it also sets the updateSpellingDictionaries flag in the bulk load request. The Dgraph thus updates the spelling dictionaries for the data set from the data corpus. This operation is performed after every successful ingest. The operation also enables spelling correction for search queries against the data set.

Post-ingest merge operation

After sending the record files to the Dgraph for ingest, the Dgraph HDFS Agent also requests a full merge of all generations of the Dgraph database files.

The merge operation consists of two actions:

1. The Dgraph HDFS Agent sends a URL merge request to the Dgraph.
2. If the Dgraph receives the request successfully, it performs the merge.

The final results of the merge are logged to the Dgraph out log.