3 Integrating Big Data Spatial and Graph with Oracle Database

You can use Oracle Big Data Connectors to facilitate spatial data access between Big Data Spatial and Graph and Oracle Database.

This chapter assumes that you have a working knowledge of Oracle SQL Connector for HDFS, Oracle Loader for Hadoop, and Oracle Big Data SQL.

3.1 Using Oracle SQL Connector for HDFS with Delimited Text Files

This topic is applicable when the files in HDFS are delimited text files (fields must be delimited using single-character markers, such as commas or tabs) and the spatial data is stored in GeoJSON or WKT format.

If such data is to be used by Big Data Spatial and Graph and is to be accessed from an Oracle database using Oracle SQL Connector for HDFS, certain configuration steps are needed.

For this example, assume that the files in HDFS contain records separated by new lines, and the fields within each record are separated by tabs, such as in the following:

"6703"       1       62      "Hong Kong"     3479846 POINT (114.18306 22.30693)
"6702"  57      166     "Singapore"     1765655 POINT (103.85387 1.29498) 
  1. Log in to a node of the Hadoop cluster.

  2. Create the configuration file required by OSCH (Oracle SQL Connector for HDFS), as in the following example:

    <?xml version="1.0"?>
     <configuration>
        <property>
          <name>oracle.hadoop.exttab.tableName</name>
          <value>TWEETS_EXT_TAB_FILE</value> 
        </property> 
       <property>
          <name>oracle.hadoop.exttab.sourceType</name>
          <value>text</value>
        </property>
       <property> 
          <name>oracle.hadoop.exttab.dataPaths</name>
          <value>/user/scott/simple_tweets_data/*.log</value> 
        </property>   
        <property>
          <name>oracle.hadoop.connection.url</name>
          <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value> 
        </property> 
        <property>
          <name>oracle.hadoop.connection.user</name>
          <value>scott</value> 
        </property>      
        <property>
          <name>oracle.hadoop.exttab.fieldTerminator</name>
          <value>\u0009</value> 
        </property>      
        <property>
          <name>oracle.hadoop.exttab.columnNames</name>
          <value>ID,FOLLOWERS_COUNT,FRIENDS_COUNT,LOCATION,USER_ID,GEOMETRY</value> 
        </property>      
        <property>
          <name>oracle.hadoop.exttab.defaultDirectory</name>
          <value>TWEETS_DT_DIR</value> 
        </property>      
    </configuration>
    
  3. Name the configuration file tweets_text.xml.

  4. On a node of the Hadoop cluster, execute the following command:

    hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
           oracle.hadoop.exttab.ExternalTable \
           -conf /home/oracle/tweets_text.xml \
           -createTable
    

    The command prompts for the database password.

    You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is: /opt/oracle/orahdfs-version
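
The oracle.hadoop.exttab.defaultDirectory property (TWEETS_DT_DIR in step 2) must name an Oracle directory object that already exists in the target database and that the connecting user can write to. The following is a minimal sketch of creating one, run as a suitably privileged user before the -createTable step; the file system path is a hypothetical placeholder:

CREATE OR REPLACE DIRECTORY TWEETS_DT_DIR AS '/u01/app/oracle/tweets_dt_dir';
GRANT READ, WRITE ON DIRECTORY TWEETS_DT_DIR TO scott;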

The table TWEETS_EXT_TAB_FILE is now ready to query. It can be queried like any other table in the database (the target database specified in the configuration file in step 2). The following query selects the count of rows in the table:

select count(*) from TWEETS_EXT_TAB_FILE;

You can perform spatial operations on that table just like any other spatial table in the database. The following example retrieves information about users that are tweeting within a quarter-mile (0.25-mile) radius of a specific movie theater:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 0.05, 'UNIT=MILE'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_FILE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'

Here the table CINEMA is a spatial table in the Oracle database, and the HDFS-backed table TWEETS_EXT_TAB_FILE can be queried against it. The data from the tweets table is read in as WKT (well-known text), and the WKT constructor of SDO_GEOMETRY is used to materialize this data as a geometry in the database.

Note that the SRID of the geometries is 8307. Also, if the spatial data is in GeoJSON format, then the query should be as follows:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 0.05, 'UNIT=MILE'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_FILE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'
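
Because spatial indexes cannot be created on external tables, a common pattern is to materialize the data into an internal table with an SDO_GEOMETRY column and then index it. The following is a sketch for the WKT case; the table and index names are illustrative, and the dimension bounds assume longitude/latitude data:

-- Materialize the external table, converting WKT to SDO_GEOMETRY.
CREATE TABLE TWEETS_SPATIAL AS
  SELECT id, followers_count, friends_count, location, user_id,
         SDO_GEOMETRY(geometry, 8307) AS geometry
  FROM TWEETS_EXT_TAB_FILE;

-- Register spatial metadata, then create a spatial index.
INSERT INTO USER_SDO_GEOM_METADATA VALUES ('TWEETS_SPATIAL', 'GEOMETRY',
  SDO_DIM_ARRAY(SDO_DIM_ELEMENT('X', -180, 180, 0.05),
                SDO_DIM_ELEMENT('Y', -90, 90, 0.05)), 8307);

CREATE INDEX TWEETS_SPATIAL_SIDX ON TWEETS_SPATIAL(geometry)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX;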

3.2 Using Oracle SQL Connector for HDFS with Hive Tables

Oracle SQL Connector for HDFS (OSCH) directly supports Hive tables defined on HDFS.

The Hive tables must be nonpartitioned, and defined using ROW FORMAT DELIMITED and FILE FORMAT TEXTFILE clauses. The spatial data must be in GeoJSON or WKT format.

Both Hive-managed tables and Hive external tables are supported.

For example, the following Hive command creates a table over the file described in Using Oracle SQL Connector for HDFS with Delimited Text Files. The procedure that follows assumes that such a Hive table has been defined on the HDFS data. The data in HDFS must be in the supported format, and the spatial data must be in GeoJSON or WKT format.

CREATE EXTERNAL TABLE IF NOT EXISTS TWEETS_HIVE_TAB(
  ID string, 
  FOLLOWERS_COUNT int, 
  FRIENDS_COUNT int, 
  LOCATION string, 
  USER_ID int, 
  GEOMETRY string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/scott/simple_tweets_data';

The following example queries the table.

select ID, FOLLOWERS_COUNT, FRIENDS_COUNT, LOCATION, USER_ID, GEOMETRY from TWEETS_HIVE_TAB limit 10;

The output looks like the following:

"6703"       1       62      "Hong Kong"     3479846 POINT (114.18306 22.30693)
"6702"  57      166     "Singapore"     1765655 POINT (103.85387 1.29498)
  1. Log in to a node of the Hadoop cluster.

  2. Create the configuration file required by OSCH (Oracle SQL Connector for HDFS), as in the following example:

    <?xml version="1.0"?>
     <configuration>
        <property>
          <name>oracle.hadoop.exttab.tableName</name>
          <value>TWEETS_EXT_TAB_HIVE</value> 
        </property> 
       <property> 
          <name>oracle.hadoop.exttab.sourceType</name>
          <value>hive</value> 
        </property>   
       <property> 
          <name>oracle.hadoop.exttab.hive.tableName</name>
          <value>TWEETS_HIVE_TAB</value> 
        </property>   
       <property> 
          <name>oracle.hadoop.exttab.hive.databaseName</name>
          <value>default</value> 
        </property>   
        <property>
          <name>oracle.hadoop.connection.url</name>
          <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value> 
        </property> 
        <property>
          <name>oracle.hadoop.connection.user</name>
          <value>scott</value> 
        </property>      
        <property>
          <name>oracle.hadoop.exttab.defaultDirectory</name>
          <value>TWEETS_DT_DIR</value> 
        </property>      
    </configuration> 
    
  3. Name the configuration file tweets_hive.xml.

  4. On a node of the Hadoop cluster, execute the following command:

    # Add HIVE_HOME/lib* to HADOOP_CLASSPATH.  
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
    hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
           oracle.hadoop.exttab.ExternalTable \
           -conf /home/oracle/tweets_hive.xml \
           -createTable
    

    The command prompts for the database password. You can either create the OSCH_HOME environment variable or replace OSCH_HOME in the command syntax with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is: /opt/oracle/orahdfs-version

    Before running the command, set the environment variable HIVE_HOME to point to the Hive installation directory (for example, /usr/lib/hive).

The table TWEETS_EXT_TAB_HIVE is now ready to query. It can be queried like any other table in the database. The following query selects the count of rows in the table:

select count(*) from TWEETS_EXT_TAB_HIVE;

You can perform spatial operations on that table just like any other spatial table in the database. The following example retrieves information about users that are tweeting within a quarter-mile (0.25-mile) radius of a specific movie theater:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 0.05, 'UNIT=MILE'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_HIVE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'

Here the table CINEMA is a spatial table in the Oracle database, and the Hive-backed table TWEETS_EXT_TAB_HIVE can be queried against it. The data from the tweets table is read in as WKT (well-known text), and the WKT constructor of SDO_GEOMETRY is used to materialize this data as a geometry in the database.

Note that the SRID of the geometries is 8307. Also, if the spatial data is in GeoJSON format, then the query should be as follows:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 0.05, 'UNIT=MILE'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_HIVE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'

3.3 Using Oracle SQL Connector for HDFS with Files Generated by Oracle Loader for Hadoop

To use Oracle SQL Connector for HDFS (OSCH) with files generated by Oracle Loader for Hadoop (OLH), you must understand how OLH is used to move data from HDFS to Oracle Database.

Modifications are required to move Big Data Spatial and Graph spatial data into the database. This solution applies generally to any kind of files in HDFS or any kind of Hive data. The spatial information can be in a well-known format or a custom format.

First, consider an example of creating external tables from files in HDFS that contain spatial information in a user-defined format. Assume that the files in HDFS have records in the following format:

{
        "type":"Feature",
        "id":"6703",
        "followers_count":1,
        "friends_count":62,
        "location":"Hong Kong",
        "user_id":3479846,
        "longitude":114.18306,
        "latitude":22.30693
}

{
        "type":"Feature",
        "id":"6702",
        "followers_count":57,
        "friends_count":166,
        "location":"Singapore",
        "user_id":1765655,
        "longitude":103.85387,
        "latitude":1.29498
}

The Hive command to create a table for those records is as follows:

add jar
     /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
     /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
     /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
     /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
     /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar
     … (add here jars containing custom SerDe and/or InputFormats);
CREATE EXTERNAL TABLE IF NOT EXISTS CUST_TWEETS_HIVE_TAB (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING, user_id STRING)                                         
ROW FORMAT SERDE 'mypackage.TweetsSerDe'              
STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/scott/simple_tweets_data';

The InputFormat object oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat can read those records even if they are not strict GeoJSON. Thus, the preceding example does not need a custom InputFormat specification. However, it does require a custom Hive Serializer and Deserializer (SerDe) to transform the latitude and longitude into a WKT or GeoJSON geometry. For that, the Spatial Java API can be used in the deserialize function of the SerDe, as in the following example:

    // columnNames, jsonMapper, and wkt (a WKT converter) are fields
    // initialized elsewhere in the SerDe.
    @Override
    public Object deserialize(Writable w) throws SerDeException {
        Text rowText = (Text) w;
        List<Text> row = new ArrayList<Text>(columnNames.size());

        // Default all values to null.
        for (int i = 0; i < columnNames.size(); i++) {
            row.add(null);
        }

        // Try parsing the row into a JSON object.
        JsonNode recordNode = null;

        try {
            String txt = rowText.toString().trim();
            recordNode = jsonMapper.readTree(txt);
            row.set(columnNames.indexOf("id"), new Text(recordNode.get("id").getTextValue()));
            row.set(columnNames.indexOf("followers_count"), new Text(recordNode.get("followers_count").toString()));
            row.set(columnNames.indexOf("friends_count"), new Text(recordNode.get("friends_count").toString()));
            row.set(columnNames.indexOf("location"), new Text(recordNode.get("location").getTextValue()));
            row.set(columnNames.indexOf("user_id"), new Text(recordNode.get("user_id").toString()));

            Double longitude = recordNode.get("longitude").getDoubleValue();
            Double latitude = recordNode.get("latitude").getDoubleValue();

            // Use the Spatial Java API to create the point geometry.
            JGeometry geom = JGeometry.createPoint(new double[]{
                            longitude,
                            latitude},
                            2,   // dimensions
                            8307 // SRID
                            );
            // Transform the JGeometry to WKT.
            String geoWKT = new String(wkt.fromJGeometry(geom));
            row.set(columnNames.indexOf("geometry"), new Text(geoWKT));
        } catch (Exception e) {
            throw new SerDeException("Exception parsing JSON: " + e.getMessage(), e);
        }

        return row;
    }

In the preceding example, to return the geometries in GeoJSON format, replace the following:

String geoWKT = new String(wkt.fromJGeometry(geom));
row.set(columnNames.indexOf("geometry"), new Text(geoWKT));

with this:

row.set(columnNames.indexOf("geometry"), new Text(geom.toGeoJson()));

More SerDe examples to transform data in GeoJSON, WKT, or ESRI Shapefiles with the Spatial Java API are available in the folder: /opt/oracle/oracle-spatial-graph/spatial/vector/examples/hive/java/src/oracle/spatial/hadoop/vector/hive/java/src/serde

The following example queries the Hive table:

select ID, FOLLOWERS_COUNT, FRIENDS_COUNT, LOCATION, USER_ID, GEOMETRY from CUST_TWEETS_HIVE_TAB limit 10;

The output looks like the following:

6703 1       62      Hong Kong       3479846 POINT (114.18306 22.30693)
6702    57      166     Singapore       1765655 POINT (103.85387 1.29498)

3.3.1 Creating HDFS Data Pump Files or Delimited Text Files

You can use the Hive table from Using Oracle SQL Connector for HDFS with Files Generated by Oracle Loader for Hadoop to create HDFS Data Pump files or delimited text files.

  1. Create a table in the Oracle database as follows:

    CREATE TABLE tweets_t(id INTEGER
      PRIMARY KEY, geometry VARCHAR2(4000), followers_count NUMBER,
      friends_count NUMBER, location VARCHAR2(4000), user_id NUMBER);
    

    This table will be used as the target table. Oracle Loader for Hadoop uses table metadata from the Oracle database to identify the column names, data types, partitions, and other information. For simplicity, create this table with the same columns (fields) as the Hive table. After the external table is created, you can remove this table or use it to insert the rows from the external table into the target table (a sketch of such an INSERT appears in Creating the SQL Connector for HDFS). For more information about target tables, see About the Target Table Metadata.

  2. Create the loader configuration file, as in the following example:

    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
    <!--                          Input settings                             -->
    <property>
            <name>mapreduce.inputformat.class</name>
            <value>oracle.hadoop.loader.lib.input.HiveToAvroInputFormat</value>
    </property>
    <property>
            <name>oracle.hadoop.loader.input.hive.databaseName</name>
            <value>default</value>
    </property>
    <property>
            <name>oracle.hadoop.loader.input.hive.tableName</name>
            <value>CUST_TWEETS_HIVE_TAB</value>
    </property>
    <!--                          Output settings                             -->
     <property>
       <name>mapreduce.outputformat.class</name>
       <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
     </property>
     <property>
       <name>mapred.output.dir</name>
       <value>/user/scott/data_output</value>
     </property>
    <!--                          Table information                            -->
    <property>
            <name>oracle.hadoop.loader.loaderMap.targetTable</name>
            <value>tweets_t</value>
    </property> 
    <!--                          Connection information                      -->
    <property>
      <name>oracle.hadoop.loader.connection.url</name>
      <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value>
    </property>
    <property>
        <name>oracle.hadoop.loader.connection.user</name>
        <value>scott</value>
    </property>
    <property>
        <name>oracle.hadoop.loader.connection.password</name>
        <value>welcome1</value>        
        <description> Having the password in cleartext is NOT RECOMMENDED. Use Oracle Wallet instead. </description>
    </property>
    </configuration>
    

    With this configuration, Data Pump files will be created in HDFS. If you want delimited text files as the output, then replace the following:

    oracle.hadoop.loader.lib.output.DataPumpOutputFormat
    

    with this:

    oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat
    
  3. Name the configuration file tweets_hive_to_data_pump.xml.

  4. Create the Data Pump files:

    # Add HIVE_HOME/lib* and the Hive configuration directory to HADOOP_CLASSPATH.
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*:$HIVE_CONF_DIR
    # Add Oracle Spatial libraries to HADOOP_CLASSPATH.
    export ORACLE_SPATIAL_VECTOR_LIB_PATH=/opt/oracle/oracle-spatial-graph/spatial/vector/jlib
    
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$ORACLE_SPATIAL_VECTOR_LIB_PATH/ojdbc8.jar:$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoutl.jar:$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoapi.jar:$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector.jar:$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector-hive.jar
    
    # The Oracle Spatial libraries need to be added to the libjars option as well.
    export LIBJARS=$ORACLE_SPATIAL_VECTOR_LIB_PATH/ojdbc8.jar,$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoutl.jar,$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoapi.jar,$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector.jar,$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector-hive.jar
    
    # And the following HIVE jar files have to be added to the libjars option.
    export LIBJARS=$LIBJARS,$HIVE_HOME/lib/hive-exec-*.jar,$HIVE_HOME/lib/hive-metastore-*.jar,$HIVE_HOME/lib/libfb303*.jar
    
    hadoop jar ${OLH_HOME}/jlib/oraloader.jar \
               oracle.hadoop.loader.OraLoader \
               -conf /home/oracle/tweets_hive_to_data_pump.xml \
               -libjars $LIBJARS
    

For the preceding example:

  • Set the environment variable OLH_HOME to point to the Oracle Loader for Hadoop installation directory.

  • Set the environment variable HIVE_HOME to point to the Hive installation directory (for example, /usr/lib/hive).

  • Set the environment variable HIVE_CONF_DIR to point to the Hive configuration directory (for example, /etc/hive/conf).

  • Add the following Hive jar files, in a comma-separated list, to the -libjars option of the hadoop command. Replace the asterisks (*) with the complete file names on your system:

    hive-exec-*.jar
    hive-metastore-*.jar
    libfb303*.jar
    
  • If oracle.kv.hadoop.hive.table.TableStorageHandler is used to create the Hive table (with the data coming from Oracle NoSQL Database), you must also add the following jar file to the -libjars option of the hadoop command: $KVHOME/lib/kvclient.jar (where KVHOME is the directory where the Oracle NoSQL Database is installed)

  • If org.apache.hadoop.hive.hbase.HBaseStorageHandler is used to create the Hive table (with the data coming from Apache HBase), you must also add the following JAR files, in a comma-separated list, to the -libjars option of the hadoop command:
    $HIVE_HOME/lib/hbase-server.jar
    $HIVE_HOME/lib/hive-hbase-handler.jar
    $HIVE_HOME/lib/hbase-common.jar
    $HIVE_HOME/lib/hbase-client.jar
    $HIVE_HOME/lib/hbase-hadoop2-compat.jar
    $HIVE_HOME/lib/hbase-hadoop-compat.jar
    $HIVE_HOME/lib/hbase-protocol.jar
    $HIVE_HOME/lib/htrace-core.jar
    

3.3.2 Creating the SQL Connector for HDFS

To create the external table with Oracle SQL Connector for HDFS, follow the instructions in this topic.

  1. Create the configuration file for the SQL Connector for HDFS, as in the following example:

    <?xml version="1.0"?>
     <configuration>
        <property>
          <name>oracle.hadoop.exttab.tableName</name>
          <value>TWEETS_EXT_TAB_DP</value> 
        </property> 
        <property>
          <name>oracle.hadoop.exttab.sourceType</name>
          <value>datapump</value> 
        </property> 
       <property> 
          <name>oracle.hadoop.exttab.dataPaths</name>
          <value>/user/scott/data_output/oraloader-0000*.dat</value>
        </property>   
        <property>
          <name>oracle.hadoop.connection.url</name>
          <value>jdbc:oracle:thin:@//myhost:1521/myservicename</value> 
        </property> 
        <property>
          <name>oracle.hadoop.connection.user</name>
          <value>scott</value> 
        </property>      
        <property>
          <name>oracle.hadoop.exttab.defaultDirectory</name>
          <value>TWEETS_DT_DIR</value> 
        </property>   
    </configuration>
    

    If the files are delimited text files, follow the steps in Using Oracle SQL Connector for HDFS with Delimited Text Files.

  2. Name the configuration file tweets_ext_from_dp.xml.

  3. Create the external table.

    hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
               oracle.hadoop.exttab.ExternalTable \
               -conf /home/oracle/tweets_ext_from_dp.xml \
               -createTable
    

    In the preceding command, you can either create the OSCH_HOME environment variable, or replace OSCH_HOME in the command with the full path to the installation directory for Oracle SQL Connector for HDFS. On Oracle Big Data Appliance, this directory is: /opt/oracle/orahdfs-version

The table TWEETS_EXT_TAB_DP is now ready to query. It can be queried like any other table in the database. For example:

select count(*) from TWEETS_EXT_TAB_DP;
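
For example, to populate the target table tweets_t created in Creating HDFS Data Pump Files or Delimited Text Files, you can insert directly from the external table. The following is a sketch; it assumes that the external table columns match the target table definition:

insert into tweets_t (id, geometry, followers_count, friends_count, location, user_id)
  select id, geometry, followers_count, friends_count, location, user_id
  from TWEETS_EXT_TAB_DP;
commit;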

You can perform spatial operations on that table just like any other spatial table in the database. For example, the following query retrieves the users that are tweeting within 200 miles of a cinema, returning each distance in yards:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_DP tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';

This information can be used further to customize advertising.

Note that the SRID of the geometries is 8307. Also, if the spatial data is in GeoJSON format, then the query should be as follows:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, TWEETS_EXT_TAB_DP tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';

3.4 Integrating HDFS Spatial Data with Oracle Database Using Oracle Big Data SQL

You can use Oracle Big Data SQL to facilitate spatial data access between HDFS and Oracle Database.

To enable the spatial features in Oracle Big Data SQL, update the file bigdata.properties to add the following lines at the end (replacing $ORACLE_SPATIAL_VECTOR_LIB_PATH with the path to the Oracle Spatial libraries):

java.classpath.user=$ORACLE_SPATIAL_VECTOR_LIB_PATH/ojdbc8.jar:
$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoutl.jar:
$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdoapi.jar:
$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector.jar:
$ORACLE_SPATIAL_VECTOR_LIB_PATH/sdohadoop-vector-hive.jar
(Also add here jars containing custom SerDe and/or InputFormat specifications.)

If the files are in HDFS, you can use the solution in Creating Oracle External Tables for HDFS Files with Big Data SQL.

If you are accessing spatial data from Oracle NoSQL Database or Apache HBase, you can use the solution in Creating Oracle External Tables Using Hive Tables with Big Data SQL.

3.4.1 Creating Oracle External Tables for HDFS Files with Big Data SQL

You can create Oracle external tables for any kind of files in HDFS. The spatial information can be in a well known format or a custom format.

If the geometry format is not WKT or GeoJSON, then use one of the provided SerDe examples in the folder /opt/oracle/oracle-spatial-graph/spatial/vector/examples/hive/java/src/oracle/spatial/hadoop/vector/hive/java/src/serde, or create a custom SerDe as in the example in Using Oracle SQL Connector for HDFS with Files Generated by Oracle Loader for Hadoop.

After that, create an Oracle external table, as in the following example:

CREATE TABLE SAMPLE_TWEETS (id VARCHAR2(4000), 
  geometry VARCHAR2(4000), 
  followers_count VARCHAR2(4000), 
  friends_count VARCHAR2(4000), 
  location VARCHAR2(4000), user_id VARCHAR2(4000)) ORGANIZATION EXTERNAL
             (TYPE oracle_hdfs DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS (
  com.oracle.bigdata.rowformat: \
     SERDE 'mypackage.TweetsSerDe'
  com.oracle.bigdata.fileformat: \
     INPUTFORMAT 'oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat' \
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' \
  )
LOCATION ('/user/scott/simple_tweets_data/*.log'));

The table SAMPLE_TWEETS is now ready to query. It can be queried like any other table in the database. For example:

select count(*) from SAMPLE_TWEETS;

You can perform spatial operations on that table just like any other spatial table in the database. For example, the following query retrieves the users that are tweeting within 200 miles of a cinema, returning each distance in yards:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, SAMPLE_TWEETS tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';

This information can be used further to customize advertising.

Note that the SRID of the geometries is 8307. Also, if the spatial data is in GeoJSON format, then the query should be as follows:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, SAMPLE_TWEETS tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';
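
To avoid repeating the geometry conversion in every query, you can wrap the external table in a view that exposes an SDO_GEOMETRY column. The following is a sketch for the WKT case; the view name is illustrative, and for GeoJSON data you would substitute SDO_UTIL.FROM_GEOJSON:

CREATE OR REPLACE VIEW SAMPLE_TWEETS_GEOM AS
  SELECT id, SDO_GEOMETRY(geometry, 8307) AS geometry,
         followers_count, friends_count, location, user_id
  FROM SAMPLE_TWEETS;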

3.4.2 Creating Oracle External Tables Using Hive Tables with Big Data SQL

You can create Oracle external tables using Hive tables with Big Data SQL. The spatial information can be in a well known format or a custom format.

A Hive table used to create an Oracle external table must be created as described in Using Oracle SQL Connector for HDFS with Files Generated by Oracle Loader for Hadoop.

Create an Oracle external table based on the Hive table. For example:

CREATE TABLE SAMPLE_TWEETS (id VARCHAR2(4000),  geometry VARCHAR2(4000),  followers_count VARCHAR2(4000),  friends_count VARCHAR2(4000),  location VARCHAR2(4000), user_id VARCHAR2(4000))  ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR 
 ACCESS PARAMETERS (
com.oracle.bigdata.cluster=cluster
com.oracle.bigdata.tablename=default.CUST_TWEETS_HIVE_TAB)
) PARALLEL 2 REJECT LIMIT UNLIMITED;

The table SAMPLE_TWEETS is now ready to query. It can be queried like any other table in the database. For example:

select count(*) from SAMPLE_TWEETS;

You can perform spatial operations on that table just like any other spatial table in the database. For example, the following query retrieves the users that are tweeting within 200 miles of a cinema, returning each distance in yards:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, SAMPLE_TWEETS tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';

This information can be used further to customize advertising.

Note that the SRID of the geometries is 8307. Also, if the spatial data is in GeoJSON format, then the query should be as follows:

select sdo_geom.SDO_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 0.5, 'UNIT=YARD'), ci.name, tw.user_id 
from CINEMA ci, SAMPLE_TWEETS tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_UTIL.FROM_GEOJSON(tw.geometry, '', 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE';
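
For example, the following sketch counts the distinct users tweeting near each cinema, an aggregation that could feed such an advertising analysis; it mirrors the 200-mile filter of the preceding queries:

select ci.name, count(distinct tw.user_id) as nearby_users
from CINEMA ci, SAMPLE_TWEETS tw
where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 8307), 'DISTANCE=200 UNIT=MILE') = 'TRUE'
group by ci.name;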