SQL*Loader-Based Data Loading

5.4.2.3 SQL*Loader-Based Data Loading

SQL*Loader-based data loading uses Oracle SQL*Loader to load the graph data into Oracle Database. SQL*Loader loads data from external files into Oracle Database tables. In this case, the vertices (or edges) in the given input stream will be spread among multiple chunks by the splitter thread. Each chunk will be processed by a different loader thread that inserts all the elements in the chunk into a temporary work table using SQL*Loader. The number of splitter and loader threads used is determined by the degree of parallelism (DOP) specified by the user.

After all the graph data is loaded into the temporary work table, the graph loader will load all the data stored in the temporary work tables into the property graph VT$ and GE$ tables.

The following code fragment loads the graph data from a vertex and edge files in Oracle flat-file format using a SQL-based parallel data loading with a degree of parallelism of 48. To use the APIs, the path to the SQL*Loader must be specified.

    String szUser = "username";
    String szPassword = "password";
    String szDbId = "db18c"; /*service name of the database*/
    String szOPVFile = "../../data/connections.opv"; 
    String szOPEFile = "../../data/connections.ope"; 
    String szSQLLoaderPath = "<YOUR_ORACLE_HOME>/bin/sqlldr";
    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 

    opgdl = OraclePropertyGraphDataLoader.getInstance(); 
    opgdl.loadDataWithSqlLdr(opg, szUser, szPassword, szDbId, 
                             szOPVFile, szOPEFile, 
                             48 /* DOP */, 
                             true /*named pipe flag */, 
                             szSQLLoaderPath /* SQL*Loader path: the path to 
                                                bin/sqlldr*/, 
                             true /*rebuild index */, 
                             "pddl=t,pdml=t" /* options */);

As with JDBC-based data loading, SQL*Loader-based data loading supports parallel data loading using a single file, multiple files, partitions, and fine-tuning.

Subtopics:

SQL*Loader-Based Data Loading with Multiple Files
SQL*Loader-Based Data Loading with Partitions
SQL*Loader-Based Parallel Data Loading Using Fine-Tuning

SQL*Loader-Based Data Loading with Multiple Files

SQL*Loader-based data loading supports loading vertices and edges from multiple files or input streams into the database. The following code fragment loads multiple vertex and edge files using the parallel data loading APIs. In the example, two string arrays szOPVFiles and szOPEFiles are used to hold the input files.

    String szUser = "username";
    String szPassword = "password";
    String szDbId = "db18c"; /*service name of the database*/
    String[] szOPVFiles = new String[] {"../../data/connections-p1.opv", 
                                        "../../data/connections-p2.opv"}; 
    String[] szOPEFiles = new String[] {"../../data/connections-p1.ope", 
                                        "../../data/connections-p2.ope"}; 
    String szSQLLoaderPath = "../../../dbhome_1/bin/sqlldr";
    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 

    opgdl = OraclePropertyGraphDataLoader.getInstance(); 
    opgdl. loadDataWithSqlLdr (opg, szUser, szPassword, szDbId, 
                               szOPVFiles, szOPEFiles, 
                               48 /* DOP */, 
                               true /* named pipe flag */, 
                               szSQLLoaderPath /* SQL*Loader path */, 
                               true /* rebuild index flag */, 
                               "pddl=t,pdml=t" /* options */);

SQL*Loader-Based Data Loading with Partitions

When dealing with a large property graph, the SQL*Loader-based data loading API allows loading the graph data in Oracle flat-file format into Oracle Database using logical partitioning. Each partition represents a subset of vertices (or edges) in the graph data file of size that is approximately the number of distinct element IDs in the file divided by the number of partitions. Each partition is identified by an integer ID in the range of [0, Number of partitions – 1].

To use parallel data loading with partitions, you must specify the total number of partitions to use and the partition offset, in addition to the base parameters used in the loadDataWithSqlLdr API. To fully load a graph data file or input stream into the database, you must execute the data loading operation as many times as the defined number of partitions. For example, to load the graph data from a file using two partitions, there should be two data loading API calls using an offset of 0 and 1. Each call to the data loader can be processed using multiple threads or a separate Java client on a single system or multiple systems.

Note that this approach is intended to be used with a single vertex file (or input stream) and a single edge file (or input stream). Additionally, this option requires disabling the indexes and constraints on vertices and edges. These indexes and constraints must be rebuilt after all partitions have been loaded.

The example for JDBC-based data loading with partitions can be easily migrated to work as SQL*Loader- based loading with partitions. The only changes needed are to replace API loadData() with loadDataWithSqlLdr(), and supply some additional input parameters such as the location of SQL*Loader.

SQL*Loader-Based Parallel Data Loading Using Fine-Tuning

SQL Loader-based data loading supports fine-tuning the subset of data from a line to be loaded, as well as the ID offset to use when loading the elements into the property graph instance. You can specify the subset of data to load from a file by specifying the maximum number of lines to read from the file and the offset line number for both vertices and edges. This way, data will be loaded from the offset line number until the maximum number of lines has been read. If the maximum line number is -1, the loading process will scan the data until reaching the end of file.

Because graph data files may have some ID collisions, the SQL Loader-based data loading allows you to define a vertex and edge ID offset. This way, the ID of each loaded vertex will be obtained from the sum of the original vertex ID with the given vertex ID offset. Similarly, the ID of each loaded edge will be generated from the sum of the original edge ID with the given edge ID offset. Note that the vertices and edge files must be correlated, because the in/out vertex ID for the loaded edges will be modified with respect to the specified vertex ID offset. This operation is supported only in a data loading using a single partition.

The following code fragment loads the first 100 vertices and edges from the given graph data file. In this example, no ID offset is provided.

    String szUser = "username";
    String szPassword = "password";
    String szDbId = "db18c"; /* service name of the database */
    String szOPVFile = "../../data/connections.opv"; 
    String szOPEFile = "../../data/connections.ope"; 
    String szSQLLoaderPath = "../../../dbhome_1/bin/sqlldr";
    
    // Run the data loading using fine tuning 
    long lVertexOffsetlines = 0; 
    long lEdgeOffsetlines = 0; 
    long lVertexMaxlines = 100; 
    long lEdgeMaxlines = 100;
    long lVIDOffset = 0;
    long lEIDOffset = 0;
    OraclePropertyGraph opg = OraclePropertyGraph.getInstance( args, szGraphName); 
    OraclePropertyGraphDataLoader opgdl = OraclePropertyGraphDataLoader.getInstance();
    
    opgdl.loadDataWithSqlLdr(opg, szUser, szPassword, szDbId, 
                             szOPVFile, szOPEFile, 
                             lVertexOffsetlines /* offset of lines to start loading 
                                                   from partition, default 0*/, 
                             lEdgeOffsetlines /* offset of lines to start loading from 
                                                 partition, default 0*/, 
                             lVertexMaxlines /* maximum number of lines to start 
                                                loading from partition, default -1 
                                                (all lines in partition)*/, 
                             lEdgeMaxlines /* maximum number of lines to start loading 
                                              from partition, default -1 (all lines in 
                                              partition) */, 
                             lVIDOffset /* vertex ID offset: the vertex ID will be 
                                           original vertex ID + offset, default 0 */, 
                             lEIDOffset /* edge ID offset: the edge ID will be 
                                           original edge ID + offset, default 0 */, 
                             48 /* DOP */, 
                             1 /* Total number of partitions, default 1 */, 
                             0 /* Partition to load (from 0 to totalPartitions - 1, 
                                  default 0) */, 
                             OraclePropertyGraphDataLoader.NAMEDPIPE 
                             /* splitter flag */, 
                             "chunkPrefix" /* prefix */, 
                             szSQLLoaderPath /* SQL*Loader path: the path to 
                                                bin/sqlldr*/, 
                             true /* rebuild index */,
                             "pddl=t,pdml=t" /* options */);

Parent topic: Parallel Loading of Graph Data