The Dgraph HDFS Agent plays a major part in the loading of data from a Data Processing workflow into the Dgraph.
The Dgraph HDFS Agent's role in the ingest procedure is to read the output Avro files from the Data Processing workflow, format them for ingest, and send them to the Dgraph.
The ingest of data sets is done with a round-robin, multiplexing algorithm. The Dgraph HDFS Agent divides the records from a given data set into batches. Each batch is processed as a complete ingest before the next batch is processed. If two or more data sets are being processed, the round-robin algorithm alternates between sending record batches from each source data set to the Dgraph. Therefore, although only one given ingest operation is being processed by the Dgraph at any one time, this multiplexing scheme does allow all active ingest operations to be scheduled in a fair fashion.
Note that if Data Processing writes a NULL or empty value to the HDFS Avro file, the Dgraph HDFS Agent skips those values when constructing a record from the source data for the consumption by the Bulk Load interface.
Updating the spelling dictionaries
When the Dgraph HDFS Agent sends the ingest request to the Dgraph, it also sets the updateSpellingDictionaries
flag in the bulk load request. The Dgraph thus updates the spelling dictionaries for the data set from the data corpus. This operation is performed after every successful ingest. The operation also enables spelling correction for search queries against the data set.
Post-ingest merge operation
After sending the record files to the Dgraph for ingest, the Dgraph HDFS Agent also requests a full merge of all generations of the Dgraph database files.
The final results of the merge are logged to the Dgraph out log.