C.4 Bulk Loading RDF Data Using SQL Developer

RDF bulk load operations can be invoked from SQL Developer.

Two major steps are required after some initial preparation: (1) loading data from the file system into a "staging" table and (2) loading data from a "staging" table into a semantic model.

Do the following to prepare for the actual bulk loading.

  1. Prepare the RDF dataset or datasets.

    • The data must be on the file system of the database server, not on the client system.

    • The data must be in N-triple or N-quad format. (Apache Jena, for example, can be used to convert other formats to N-triple/N-quad.)

    • A Unix named pipe can be used to decompress zipped files on the fly.

    For example, you can download RDF datasets from LinkedGeoData. For an introduction, see http://linkedgeodata.org/Datasets and http://linkedgeodata.org/RDFMapping.

    To download from LinkedGeoData, go to https://hobbitdata.informatik.uni-leipzig.de/LinkedGeoData/downloads.linkedgeodata.org/releases/ and browse the listed directories. For a fairly small dataset you can download https://hobbitdata.informatik.uni-leipzig.de/LinkedGeoData/downloads.linkedgeodata.org/releases/2014-09-09/2014-09-09-ontology.sorted.nt.bz2.

    Each .bz2 file is a compressed archive containing a comparably named .nt file. To specify an .nt file as a data source, you must first extract (decompress) the corresponding .bz2 file, unless you create a Unix named pipe to avoid having to store the uncompressed data.

  2. Create a regular, non-DBA user to perform the load.

    For example, using the DBA navigator (not the Connections navigator), expand the system connection, expand Security, right-click Users, and select Create New.

    Create a user (for example, named RDFUSER) with CONNECT, RESOURCE, and UNLIMITED TABLESPACE privileges.
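
    Alternatively, you can create the user from the system connection's SQL Worksheet; the user name, password placeholder, and tablespace names here are only examples:

    SQL> create user RDFUSER identified by <password>
         default tablespace USERS temporary tablespace TEMPTBS;
    SQL> grant connect, resource, unlimited tablespace to RDFUSER;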

  3. Add a connection for this regular, non-DBA user (for example, a connection named RDFUSER).

    Default Tablespace: USERS

    Temporary Tablespace: TEMPTBS

  4. As the system user, create a directory in the database that points to your RDF data directory.

    Using the Connections navigator (not the DBA navigator), expand the system connection, right-click Directory and select Create Directory.

    Directory Name: Desired directory name. For example: RDFDIR

    Database Server Directory: Desired location for the directory. For example: /home/oracle/RDF/MyData

    Click Apply.
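
    Equivalently, you can create the directory from the system connection's SQL Worksheet, using the example name and path above:

    SQL> create or replace directory RDFDIR as '/home/oracle/RDF/MyData';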

  5. Grant privileges on the directory to the regular, non-DBA user (for example, RDFUSER). For example, using the system connection's SQL Worksheet:

    SQL> grant read, write on directory RDFDIR to RDFUSER;

    Tip: you can use a named pipe to avoid having to store uncompressed data. Because writing to a named pipe blocks until a reader (here, the external table) opens it, run the decompression command in the background or in a separate session. For example:

    $ mkfifo named_pipe.nt
    $ bzcat myRdfFile.nt.bz2 > named_pipe.nt &

  6. Expand the regular, non-DBA user (for example, RDFUSER) connection and click RDF Semantic Graph.

  7. Create a model to hold the RDF data.

    Click Model, then New Model.

    Model Name: Enter a model name (for example, MY_ONTOLOGY)

    Application Table: Select "Create new <Model_Name>_TPL table" (that is, have an application table with a triple column named TRIPLE automatically created)

    Model Tablespace: tablespace to hold the RDF data (for example, RDFTBS)

    Click Apply.

    To see the model, expand Models in the object hierarchy, and click the model name to bring up the SPARQL editor for that model.

    You can run a query and see that the model is empty.
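
    If you prefer SQL, the model and its application table can also be created with the SEM_APIS package. This is a sketch using the example names above; exact parameters can vary by release, so check the SEM_APIS reference for your version:

    SQL> create table MY_ONTOLOGY_TPL (triple SDO_RDF_TRIPLE_S) tablespace RDFTBS;
    SQL> execute sem_apis.create_sem_model('MY_ONTOLOGY', 'MY_ONTOLOGY_TPL', 'TRIPLE');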

Perform a bulk load using the Models menu. Bulk load has two phases:

  • Loading data from the file system into a simple "staging" table in the database. This uses an external table to read from the file system.

  • Loading data from the staging table into a model (for example, MY_ONTOLOGY) in the semantic network.

To perform these two phases:

  1. Load data into the staging table.

    Right-click REGULAR_MODELS (under the network name) and select Load RDF Data into Staging Table from External Table.

    For Source External Table, Source Table: Desired table name (for example, MY_ONTOLOGY_EXT).

    Log File: Desired file name (for example, my_ontology.log)

    Bad File: Desired file name (for example, my_ontology.bad)

    Source Table Owner: Schema of the table with RDF data (for example, RDFUSER)

    For Input Files, Input Files: Input file (for example, named_pipe.nt).

    For Staging Table, Staging table: Name for the staging table (for example, MY_ONTOLOGY_STAGE).

    If the table does not exist, check Create Staging Table.

    Input Format: Desired format (for example, N-QUAD)

    Staging Table Owner: Schema for the staging table (for example, RDFUSER)
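
    This dialog corresponds roughly to the SEM_APIS external-table procedures. The following is a SQL sketch under the example names above; parameter names and defaults can differ by release, so verify against the SEM_APIS reference:

    SQL> begin
           sem_apis.create_source_external_table(
             source_table   => 'MY_ONTOLOGY_EXT',
             def_directory  => 'RDFDIR',
             log_file       => 'my_ontology.log',
             bad_file       => 'my_ontology.bad',
             src_file_names => sys.odcivarchar2list('named_pipe.nt'));
           -- assumes MY_ONTOLOGY_STAGE already exists (the dialog can create it)
           sem_apis.load_into_staging_table(
             staging_table  => 'MY_ONTOLOGY_STAGE',
             source_table   => 'MY_ONTOLOGY_EXT',
             input_format   => 'N-QUAD');
         end;
         /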

  2. Load from the staging table into the model.

    Note:

    Unicode data in the staging table should be escaped as specified in the W3C N-Triples format (http://www.w3.org/TR/rdf-testcases/#ntriples). You can use the SEM_APIS.ESCAPE_RDF_TERM function to escape Unicode values in the staging table. For example:

    create table esc_stage_tab(
      rdf$stc_sub  varchar2(4000),
      rdf$stc_pred varchar2(4000),
      rdf$stc_obj  varchar2(4000));

    insert /*+ append nologging parallel */ into esc_stage_tab
      (rdf$stc_sub, rdf$stc_pred, rdf$stc_obj)
    select sem_apis.escape_rdf_term(rdf$stc_sub,  options=>' UNI_ONLY=T '),
           sem_apis.escape_rdf_term(rdf$stc_pred, options=>' UNI_ONLY=T '),
           sem_apis.escape_rdf_term(rdf$stc_obj,  options=>' UNI_ONLY=T ')
      from stage_tab;
    

    Right-click REGULAR_MODELS (under the network name) and select Bulk Load into Model from Staging Table.

    Model: Name for the model (for example, MY_ONTOLOGY).

    (If the model does not exist, check Create Model. However, in this example, the model does already exist.)

    Staging Table Owner: Schema of the staging table (for example, RDFUSER)

    Staging Table Name: Name of the staging table (for example, MY_ONTOLOGY_STAGE)

    Parallel: Degree of parallelism (for example, 2)

    Suggestion: Check the following options: MBV_METHOD=SHADOW, Rebuild application table indexes, Create event trace table

    Click Apply.
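
    This step corresponds to SEM_APIS.BULK_LOAD_FROM_STAGING_TABLE. A SQL sketch using the example names and options above:

    SQL> begin
           sem_apis.bulk_load_from_staging_table(
             'MY_ONTOLOGY', 'RDFUSER', 'MY_ONTOLOGY_STAGE',
             flags => ' PARALLEL=2 MBV_METHOD=SHADOW ');
         end;
         /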

Do the following after the bulk load operation.

  1. Gather statistics for the whole semantic network.

    In the Connections navigator for a DBA user, expand the RDF Semantic Graph node for the connection and select Gather Statistics (DBA).
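
    Statistics can also be gathered in SQL with the SEM_PERF package, run as a suitably privileged user; check the SEM_PERF reference for the parameters available in your release:

    SQL> execute sem_perf.gather_stats;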

  2. Run some SPARQL queries on your model.

    In the Connections navigator, expand the RDF Semantic Graph node for the connection and select the model.

    Use the SPARQL Query Editor to enter and execute desired SPARQL queries.
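
    Outside the SPARQL Query Editor, the same queries can be run from a SQL Worksheet with the SEM_MATCH table function. This sketch lists triples from the example model (the trailing NULL arguments accept the defaults; newer releases add further optional parameters):

    SQL> select s, p, o
         from table(sem_match(
           'SELECT ?s ?p ?o WHERE { ?s ?p ?o }',
           sem_models('MY_ONTOLOGY'),
           null, null, null));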

  3. Optionally, check the bulk load trace to get information about each step.

    Expand RDF Semantic Graph and then expand Bulk Load Traces to display a list of bulk load traces. Clicking one of them will show useful information about the execution time for the load, number of distinct values and triples, number of duplicate triples, and other details.