5.9.7 Converting CSV Files for Vertices and Edges to Oracle-Defined Property Graph Flat Files

Some applications use CSV (comma-separated values) format to encode the vertices and edges of a graph. In this format, each record of the CSV file represents a single vertex or edge, with all its properties. You can convert a CSV file representing the vertices or edges of a graph to the Oracle-defined flat file format (.opv for vertices, .ope for edges).

The CSV file to be converted may include a header line specifying, for each column, the column name and the data type of the attribute that the column represents. If the header includes only the attribute names, the converter assumes that all values are of type String.
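The header rule above can be illustrated with a minimal sketch. It is not the converter's own parser: the class and method names are hypothetical, and it assumes the name:type header syntax that the edge example in this section uses (for instance START_ID:long). Columns without a declared type fall back to String, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvHeaderSketch {
    /**
     * Hypothetical helper: splits a header line into column name -> declared type.
     * A column without a ":type" suffix defaults to "string".
     */
    static Map<String, String> parseHeader(String headerLine) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String token : headerLine.split(",")) {
            int colon = token.indexOf(':');
            if (colon < 0) {
                // No type given: treat the values as String
                columns.put(token.trim(), "string");
            } else {
                String name = token.substring(0, colon).trim();
                String type = token.substring(colon + 1).trim().toLowerCase();
                columns.put(name, type);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // prints {VID=long, name=string, weight=float}
        System.out.println(parseHeader("VID:long,name,weight:float"));
    }
}
```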

The Java APIs to convert CSV to OPV or OPE receive an InputStream from which they read the vertices or edges (from CSV), and write them in the .opv or .ope format to an OutputStream. The converter APIs also allow customization of the conversion process.

The following subtopics provide instructions for converting vertices and edges:

Vertices: Converting a CSV File to Oracle-Defined Flat File Format (.opv)
Edges: Converting a CSV File to Oracle-Defined Flat File Format (.ope)

The instructions for both are very similar, but with differences specific to vertices and edges.

Vertices: Converting a CSV File to Oracle-Defined Flat File Format (.opv)

If the CSV file does not include a header, you must specify a ColumnToAttrMapping array describing all the attribute names (each mapped to the data type of its values) in the same order in which they appear in the CSV file. The array must describe every column in the CSV file, including special columns such as the vertex ID. If you instead specify the column headers in the first line of the CSV file, this parameter must be set to null.

To convert a CSV file representing vertices, you can use one of the convertCSV2OPV APIs. The simplest of these APIs requires:

An InputStream to read vertices from a CSV file
The name of the column that is representing the vertex ID (this column must appear in the CSV file)
An integer offset to add to the VID (an offset is useful to avoid collision in ID values of graph elements)
A ColumnToAttrMapping array (which must be null if the headers are specified in the file)
Degree of parallelism (DOP)
An integer denoting offset (number of vertex records to skip) before converting
An OutputStream in which the vertex flat file (.opv) will be written
An optional DataConverterListener that can be used to keep track of the conversion progress and decide what to do if an error occurs

Additional parameters can be used to specify a different format of the CSV file:

The delimiter character, which is used to separate tokens in a record. The default is the comma character ','.
The quotation character, which is used to quote String values so they can contain special characters, for example, commas. If a quotation character appears in the value of the String itself, it must be escaped either by duplication or by placing a backslash character '\' before it. Some examples are:
- """Hello, world"", the screen showed…"
- "But Vader replied: \"No, I am your father.\""
The date format, which is used to parse date values. For the CSV conversion, this parameter can be null, but specifying it is recommended if the CSV file uses a specific date format. Providing a specific date format helps performance, because that format is tried first when parsing date values. Some example date formats are:
- "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
- "MM/dd/yyyy HH:mm:ss"
- "ddd, dd MMM yyyy HH':'mm':'ss 'GMT'"
- "dddd, dd MMMM yyyy hh:mm:ss"
- "yyyy-MM-dd"
- "MM/dd/yyyy"
A flag indicating if the CSV file contains String values with new line characters. If this parameter is set to true, all the Strings in the file that contain new lines or quotation characters as values must be quoted.
- "The first lines of Don Quixote are:""In a village of La Mancha, the name of which I have no desire to call to mind""."
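The quoting rules in the list above (a quotation character inside a value is escaped either by doubling it or by a preceding backslash) can be sketched as a small state machine. This is an illustration only, not the converter's implementation; the class and method names are hypothetical, and embedded newlines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvSketch {
    /** Hypothetical helper: splits one CSV record into fields, honoring quotes. */
    static List<String> splitRecord(String record, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            boolean nextIsQuote = i + 1 < record.length() && record.charAt(i + 1) == quote;
            if (inQuotes) {
                if ((c == '\\' || c == quote) && nextIsQuote) {
                    // Escaped quote (backslash or doubled): keep one literal quote
                    current.append(quote);
                    i++;
                } else if (c == quote) {
                    inQuotes = false; // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == quote) {
                inQuotes = true; // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // prints [3, Skywalker, Anakin, 5.0, 46] (four fields; the second contains a comma)
        System.out.println(splitRecord("3,\"Skywalker, Anakin\",5.0,46", ',', '"'));
    }
}
```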

The following code fragment shows how to create a ColumnToAttrMapping array and use the API to convert a CSV file into an .opv file.

    // Uses java.io.File, FileInputStream, FileOutputStream, InputStream, OutputStream
    String inputCSV             = "/path/mygraph-vertices.csv";
    String outputOPV            = "/path/mygraph.opv";

    // One mapping per CSV column, in file order (the file has no header line)
    ColumnToAttrMapping[] ctams = new ColumnToAttrMapping[4];
    ctams[0]                    = ColumnToAttrMapping.getInstance("VID",   Long.class);
    ctams[1]                    = ColumnToAttrMapping.getInstance("name",  String.class);
    ctams[2]                    = ColumnToAttrMapping.getInstance("score", Double.class);
    ctams[3]                    = ColumnToAttrMapping.getInstance("age",   Integer.class);
    String vidColumn            = "VID";

    InputStream isCSV  = new FileInputStream(inputCSV);
    OutputStream osOPV = new FileOutputStream(new File(outputOPV));

    // Convert Vertices: VID offset 0, DOP 1, skip 0 records, no listener
    OraclePropertyGraphUtilsBase.convertCSV2OPV(isCSV, vidColumn, 0, ctams, 1, 0, osOPV, null);
    isCSV.close();
    osOPV.close();

In this example, the CSV file to be converted must not include a header and must contain four columns (the vertex ID, name, score, and age). An example CSV file is as follows:

1,John,4.2,30
2,Mary,4.3,32
3,"Skywalker, Anakin",5.0,46
4,"Darth Vader",5.0,46
5,"Skywalker, Luke",5.0,53

The resulting .opv file is as follows:

1,name,1,John,,
1,score,4,,4.2,
1,age,2,,30,
2,name,1,Mary,,
2,score,4,,4.3,
2,age,2,,32,
3,name,1,Skywalker%2C%20Anakin,,
3,score,4,,5.0,
3,age,2,,46,
4,name,1,Darth%20Vader,,
4,score,4,,5.0,
4,age,2,,46,
5,name,1,Skywalker%2C%20Luke,,
5,score,4,,5.0,
5,age,2,,53,
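To make the mapping from CSV values to .opv lines easier to follow, here is a rough sketch that reproduces the lines shown above. It is not Oracle's converter. The layout (vertex ID, property name, type code, then String, numeric, and date value slots) and the type codes (1 for String, 2 for Integer, 4 for Double) are read off this example only, and the class and method names are hypothetical. The String encoding is approximated with java.net.URLEncoder, which may differ from the converter's behavior for some characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class OpvLineSketch {
    /** Percent-encodes a String value; spaces become %20 as in the example output. */
    static String encode(String s) {
        try {
            // URLEncoder emits '+' for spaces; a literal '+' is already emitted as %2B
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Hypothetical helper: formats one vertex property as an .opv line. */
    static String toOpvLine(long vid, String key, Object value) {
        if (value instanceof String) {
            return vid + "," + key + ",1," + encode((String) value) + ",,";
        } else if (value instanceof Integer) {
            return vid + "," + key + ",2,," + value + ",";
        } else if (value instanceof Double) {
            return vid + "," + key + ",4,," + value + ",";
        }
        throw new IllegalArgumentException("Type not covered by this sketch: " + value.getClass());
    }

    public static void main(String[] args) {
        System.out.println(toOpvLine(3, "name", "Skywalker, Anakin")); // 3,name,1,Skywalker%2C%20Anakin,,
        System.out.println(toOpvLine(3, "score", 5.0));                // 3,score,4,,5.0,
        System.out.println(toOpvLine(3, "age", 46));                   // 3,age,2,,46,
    }
}
```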

Edges: Converting a CSV File to Oracle-Defined Flat File Format (.ope)

If the CSV file does not include a header, you must specify a ColumnToAttrMapping array describing all the attribute names (each mapped to the data type of its values) in the same order in which they appear in the CSV file. The array must describe every column in the CSV file, including special columns such as the edge ID (if present) and the required START_ID, END_ID, and TYPE columns. If you instead specify the column headers in the first line of the CSV file, this parameter must be set to null.

To convert a CSV file representing edges, you can use one of the convertCSV2OPE APIs. The simplest of these APIs requires:

An InputStream to read edges from a CSV file
The name of the column that is representing the edge ID (this is optional in the CSV file; if it is not present, the line number will be used as the ID)
An integer offset to add to the EID (an offset is useful to avoid collision in ID values of graph elements)
Name of the column that is representing the source vertex ID (this column must appear in the CSV file)
Name of the column that is representing the destination vertex ID (this column must appear in the CSV file)
Offset to the VID (lOffsetVID). This offset will be added on top of the original SVID and DVID values. (A variation of this API takes in two arguments (lOffsetSVID and lOffsetDVID): one offset for SVID, the other offset for DVID.)
A boolean flag indicating if the edge label column is present in the CSV file.
Name of the column that is representing the edge label (if this column is not present in the CSV file, then this parameter will be used as a constant for all edge labels)
A ColumnToAttrMapping array (which must be null if the headers are specified in the file)
Degree of parallelism (DOP)
An integer denoting offset (number of edge records to skip) before converting
An OutputStream in which the edge flat file (.ope) will be written
An optional DataConverterListener that can be used to keep track of the conversion progress and decide what to do if an error occurs.

Additional parameters can be used to specify a different format of the CSV file:

The delimiter character, which is used to separate tokens in a record. The default is the comma character ','.
The quotation character, which is used to quote String values so they can contain special characters, for example, commas. If a quotation character appears in the value of the String itself, it must be escaped either by duplication or by placing a backslash character '\' before it. Some examples are:
- """Hello, world"", the screen showed…"
- "But Vader replied: \"No, I am your father.\""
The date format, which is used to parse date values. For the CSV conversion, this parameter can be null, but specifying it is recommended if the CSV file uses a specific date format. Providing a specific date format helps performance, because that format is tried first when parsing date values. Some example date formats are:
- "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
- "MM/dd/yyyy HH:mm:ss"
- "ddd, dd MMM yyyy HH':'mm':'ss 'GMT'"
- "dddd, dd MMMM yyyy hh:mm:ss"
- "yyyy-MM-dd"
- "MM/dd/yyyy"
A flag indicating if the CSV file contains String values with new line characters. If this parameter is set to true, all the Strings in the file that contain new lines or quotation characters as values must be quoted.
- "The first lines of Don Quixote are:""In a village of La Mancha, the name of which I have no desire to call to mind""."
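The performance note on the date format can be illustrated with a sketch of "try the preferred format first, then fall back". This is an assumption about the general technique, not a description of the converter's internals: the class name, the method names, and the contents and order of the fallback list are hypothetical. It uses only java.text.SimpleDateFormat from the JDK.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateParseSketch {
    // Hypothetical fallback list, tried in order when no preferred format matches
    static final String[] FALLBACK_FORMATS = {
        "yyyy-MM-dd'T'HH:mm:ss.SSSXXX", "MM/dd/yyyy HH:mm:ss", "yyyy-MM-dd", "MM/dd/yyyy"
    };

    /** Returns the parsed date, or null if the whole text does not match the pattern. */
    static Date tryParse(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false);
        ParsePosition position = new ParsePosition(0);
        Date date = format.parse(text, position);
        // Reject partial matches such as "yyyy-MM-dd" consuming only a prefix
        return (date != null && position.getIndex() == text.length()) ? date : null;
    }

    /** Tries the preferred format first (if any), then each fallback format. */
    static Date parse(String text, String preferredFormat) {
        if (preferredFormat != null) {
            Date date = tryParse(text, preferredFormat);
            if (date != null) {
                return date; // fast path: no other formats are attempted
            }
        }
        for (String pattern : FALLBACK_FORMATS) {
            Date date = tryParse(text, pattern);
            if (date != null) {
                return date;
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + text);
    }

    public static void main(String[] args) {
        // Both calls yield the same instant; the first succeeds on its first attempt
        System.out.println(parse("01/05/2016", "MM/dd/yyyy").equals(parse("2016-01-05", null)));
    }
}
```

With a preferred format that matches the file, every value is parsed on the first attempt; without one, each value may be tested against several patterns before one succeeds.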

The following code fragment shows how to use the API to convert a CSV file into an .ope file with a null ColumnToAttrMapping array.

    // Uses java.io.File, FileInputStream, FileOutputStream, InputStream, OutputStream
    String inputCSV    = "/path/mygraph-edges.csv";
    String outputOPE   = "/path/mygraph.ope";
    String eidColumn   = null;             // null implies that an integer sequence will be used
    String svidColumn  = "START_ID";
    String dvidColumn  = "END_ID";
    boolean hasLabel   = true;
    String labelColumn = "TYPE";

    InputStream isCSV  = new FileInputStream(inputCSV);
    OutputStream osOPE = new FileOutputStream(new File(outputOPE));

    // Convert Edges: the null ColumnToAttrMapping array means the header is read from the file
    OraclePropertyGraphUtilsBase.convertCSV2OPE(isCSV, eidColumn, 0, svidColumn, dvidColumn, hasLabel, labelColumn, null, 1, 0, osOPE, null);
    isCSV.close();
    osOPE.close();

Because the preceding example passes a null ColumnToAttrMapping array, the input CSV file must include a header specifying the column names and their types. An example CSV file is as follows.

START_ID:long,weight:float,END_ID:long,:TYPE
1,1.0,2,loves
1,1.0,5,admires
2,0.9,1,loves
1,0.5,3,likes
2,0.0,4,likes
4,1.0,5,is the dad of
3,1.0,4,turns to
5,1.0,3,saves from the dark side

The resulting .ope file is as follows.

1,1,2,loves,weight,3,,1.0,
2,1,5,admires,weight,3,,1.0,
3,2,1,loves,weight,3,,0.9,
4,1,3,likes,weight,3,,0.5,
5,2,4,likes,weight,3,,0.0,
6,4,5,is%20the%20dad%20of,weight,3,,1.0,
7,3,4,turns%20to,weight,3,,1.0,
8,5,3,saves%20from%20the%20dark%20side,weight,3,,1.0,
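The edge IDs 1 through 8 above come from the record position, because no edge ID column was given. The following sketch shows how such IDs and the ID offsets described earlier might combine. It is an illustration under assumptions, not the converter's code: the class and method names are hypothetical, the rows are taken as already split into the four columns of the example CSV, and the type code 3 for the float weight is read off the output above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class OpeLineSketch {
    /** Percent-encodes a label; spaces become %20 as in the example output. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /**
     * Hypothetical helper. Each row holds {START_ID, weight, END_ID, TYPE}
     * with the header already removed. The edge ID is the 1-based record
     * number plus eidOffset; vidOffset is added to both vertex IDs.
     */
    static List<String> toOpeLines(String[][] rows, long eidOffset, long vidOffset) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            long eid = (i + 1) + eidOffset;
            long svid = Long.parseLong(rows[i][0]) + vidOffset;
            float weight = Float.parseFloat(rows[i][1]);
            long dvid = Long.parseLong(rows[i][2]) + vidOffset;
            lines.add(eid + "," + svid + "," + dvid + "," + encode(rows[i][3])
                    + ",weight,3,," + weight + ",");
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] rows = { {"1", "1.0", "2", "loves"}, {"4", "1.0", "5", "is the dad of"} };
        // With zero offsets: 1,1,2,loves,weight,3,,1.0, and 2,4,5,is%20the%20dad%20of,weight,3,,1.0,
        toOpeLines(rows, 0, 0).forEach(System.out::println);
        // With offsets, IDs shift so they cannot collide with IDs already in a graph
        toOpeLines(rows, 100, 1000).forEach(System.out::println);
    }
}
```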
