External Table-Based Data Loading

5.4.2.2 External Table-Based Data Loading

External table-based data loading uses an external table to load the graph data into Oracle Database. External table loading allows users to access the data in external sources as if it were in a regular relational table in the database. In this case, the vertices (or edges) in the given input stream will be spread among multiple chunks by the splitter thread. Each chunk will be processed by a different loader thread that is in charge of passing all the elements in the chunk to Oracle Database. The number of splitter and loader threads used is determined by the degree of parallelism (DOP) specified by the user.

After the external tables are automatically created by the data loading logic, the loader will read from the external tables and load all the data into the property graph schema tables (VT$ and GE$).

External-table based data loading requires a directory object where the files read by the external tables will be stored. This directory can be created by running the following scripts in a SQL*Plus environment:

create or replace directory tmp_dir as '/tmppath/';
grant read, write on directory tmp_dir to public;

The following code fragment loads the graph data from a vertex and edge files in Oracle Flat-file format using an external table-based parallel data loading with a degree of parallelism of 48.

    String szOPVFile = "../../data/connections.opv"; 
    String szOPEFile = "../../data/connections.ope"; 
    String szExtDir = "tmp_dir";
    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 
    opgdl = OraclePropertyGraphDataLoader.getInstance(); 
    opgdl.loadDataWithExtTab(opg, szOPVFile, szOPEFile, 48 /*DOP*/, 
                             true /*named pipe flag: setting the flag to true will use 
                                   named pipe based splitting; otherwise, regular file 
                                   based splitting would be used*/, 
                             szExtDir /* database directory object */, 
                             true /*rebuild index */, 
                             "pddl=t,pdml=t,NO_DUP=T" /*options */);

To optimize the performance of the data loading operations, a set of flags and hints can be specified when calling the External table-based data loading. These hints include:

DOP: The degree of parallelism to use when loading the data. This parameter determines the number of chunks to generate when splitting the file, as well as the number of loader threads to use when loading the data into the property graph VT$ and GE$ tables.
Rebuild index: If this flag is set to true, the data loader will disable all the indexes and constraints defined over the property graph where the data will be loaded. After all the data is loaded into the property graph, all the indexes and constraints will be rebuilt.
Load options: An option (or multiple options delimited by commas) to optimize the data loading operations. These options include:
- NO_DUP=T: Chooses a faster way to load the data into the property graph tables as no validation for duplicate Key/value pairs will be conducted.
- PDML=T: Enables parallel execution for DML operations for the database session used in the data loader. This hint is used to improve the performance of long-running batching jobs.
- PDDL=T: Enables parallel execution for DDL operations for the database session used in the data loader. This hint is used to improve the performance of long-running batching jobs.
- KEEP_WORK_TABS=T: Skips cleaning and deleting the working tables after the data loading is complete. This is for debugging use only.
- KEEP_TMP_FILES=T: Skips removing the temporary splitter files after the data loading is complete. This is for debugging use only.
Splitter Flag: An integer value defining the type of files or streams used in the splitting phase to generate the data chunks used in the graph loading phase. The temporary files can be created as regular files (0) or named pipes (1).

By default, External table-based data loading uses regular files to handle temporary files for data chunks. Named pipes can only be used on operating system that supports them. It is generally a good practice to use regular files together with DBFS.
Split File Prefix: The prefix used for the temporary files or pipes created when the splitting phase is generating the data chunks for the graph loading. By default, the prefix “Chunk” is used for regular files and “Pipe” is used for named pipes.
Tablespace: The name of the tablespace where all the temporary work tables will be created.

As with the JDBC-based data loading, external table-based data loading supports parallel data loading using a single file, multiple files, partitions, and fine-tuning.

Subtopics:

External Table-Based Data Loading with Multiple Files
External table-based Data Loading with Partitions
External Table-Based Parallel Data Loading Using Fine-Tuning

External Table-Based Data Loading with Multiple Files

External table-based data loading also supports loading vertices and edges from multiple files or input streams into the database. The following code fragment loads multiple vertex and edge files using the parallel data loading APIs. In the example, two string arrays szOPVFiles and szOPEFiles are used to hold the input files.

    String szOPVFile = "../../data/connections.opv"; 
    String szOPEFile = "../../data/connections.ope"; 
    String szExtDir = "tmp_dir";
    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 
    opgdl = OraclePropertyGraphDataLoader.getInstance(); 
    opgdl.loadDataWithExtTab(opg, szOPVFile, szOPEFile, 48 /* DOP */, 
                             true /* named pipe flag */, 
                             szExtDir /* database directory object */, 
                             true /* rebuild index flag */, 
                             "pddl=t,pdml=t" /* options */);

External table-based Data Loading with Partitions

When dealing with a very large property graph, the external table-based data loading API allows loading the graph data in Oracle flat file format into Oracle Database using logical partitioning. Each partition represents a subset of vertices (or edges) in the graph data file of size that is approximately the number of distinct element IDs in the file divided by the number of partitions. Each partition is identified by an integer ID in the range of [0, Number of partitions – 1].

To use parallel data loading with partitions, you must specify the total number of partitions to use and the partition offset besides the base parameters used in the loadDataWithExtTab API. To fully load a graph data file or input stream into the database, you must execute the data loading operation as many times as the defined number of partitions. For example, to load the graph data from a file using two partitions, there should be two data loading API calls using an offset of 0 and 1. Each call to the data loader can be processed using multiple threads or a separate Java client on a single system or multiple systems.

Note that this approach is intended to be used with a single vertex file (or input stream) and a single edge file (or input stream). Additionally, this option requires disabling the indexes and constraints on vertices and edges. These indices and constraints must be rebuilt after all partitions have been loaded.

The example for JDBC-based data loading with partitions can be easily migrated to work as external-table based loading with partitions. The only needed changes are to replace API loadData() with loadDataWithExtTab(), and supply some additional input parameters such as the database directory object.

External Table-Based Parallel Data Loading Using Fine-Tuning

External table-based data loading also supports fine-tuning the subset of data from a line to be loaded, as well as the ID offset to use when loading the elements into the property graph instance. You can specify the subset of data to load from a file by specifying the maximum number of lines to read from the file as well as the offset line number for both vertices and edges. This way, data will be loaded from the offset line number until the maximum number of lines has been read. If the maximum line number is -1, the loading process will scan the data until reaching the end of file.

Because graph data files may have some ID collisions, the external table-based data loading allows you to define a vertex and edge ID offset. This way, the ID of each loaded vertex will be obtained from the sum of the original vertex ID with the given vertex ID offset. Similarly, the ID of each loaded edge will be generated from the sum of the original edge ID with the given edge ID offset. Note that the vertices and edge files must be correlated, because the in/out vertex ID for the loaded edges will be modified with respect to the specified vertex ID offset. This operation is supported only in a data loading using a single partition.

The following code fragment loads the first 100 vertices and edges from the given graph data file. In this example, no ID offset is provided.

    String szOPVFile = "../../data/connections.opv"; 
    String szOPEFile = "../../data/connections.ope"; 

    // Run the data loading using fine tuning 
    long lVertexOffsetlines = 0; 
    long lEdgeOffsetlines = 0; 
    long lVertexMaxlines = 100; 
    long lEdgeMaxlines = 100;
    long lVIDOffset = 0;
    long lEIDOffset = 0;
    String szExtDir = "tmp_dir";

    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 
    OraclePropertyGraphDataLoader opgdl = OraclePropertyGraphDataLoader.getInstance();
    
    opgdl.loadDataWithExtTab(opg, szOPVFile, szOPEFile, 
                             lVertexOffsetlines /* offset of lines to start loading 
                                                   from partition, default 0 */, 
                             lEdgeOffsetlines /* offset of lines to start loading from 
                                                 partition, default 0 */, 
                             lVertexMaxlines /* maximum number of lines to start 
                                                loading from partition, default -1 
                                               (all lines in partition) */, 
                             lEdgeMaxlines /* maximum number of lines to start loading 
                                              from partition, default -1 (all lines in 
                                              partition) */, 
                             lVIDOffset /* vertex ID offset: the vertex ID will be 
                                          original vertex ID + offset, default 0 */, 
                             lEIDOffset /* edge ID offset: the edge ID will be 
                                          original edge ID + offset, default 0 */, 
                             4 /* DOP */, 
                             1 /* Total number of partitions, default 1 */, 
                             0 /* Partition to load (from 0 to totalPartitions - 1, 
                                  default 0) */, 
                             OraclePropertyGraphDataLoader.NAMEDPIPE 
                             /* splitter flag */, 
                             "chunkPrefix" /* prefix */, 
                             szExtDir /* database directory object */, 
                             true /* rebuild index flag */, 
                             "pddl=t,pdml=t" /* options */);

Parent topic: Parallel Loading of Graph Data