2 Using Big Data Spatial and Graph with Spatial Data

This chapter provides conceptual and usage information about loading, storing, accessing, and working with spatial data in a Big Data environment.

2.1 About Big Data Spatial and Graph Support for Spatial Data

Spatial data represents the location characteristics of real or conceptual objects in relation to real or conceptual space in a Geographic Information System (GIS) or other location-based application.

Oracle Big Data Spatial and Graph features enable spatial data to be stored, accessed, and analyzed quickly and efficiently for location-based decision making.

These features are used to geotag, enrich, visualize, transform, load, and process location-specific two- and three-dimensional geographical images, and to manipulate geometrical shapes for GIS functions.

2.1.1 What is Big Data Spatial and Graph on Apache Hadoop?

Oracle Big Data Spatial and Graph on Apache Hadoop is a framework that uses the MapReduce programs and analytic capabilities in a Hadoop cluster to store, access, and analyze spatial data. The spatial features provide a schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial data. Big Data Spatial and Graph on Hadoop supports storing and processing spatial images, which can be geometric shapes, raster images, or vector images, stored in one of several hundred supported formats.

Note:

See Oracle Spatial and Graph Developer's Guide for an introduction to spatial concepts, data, and operations.

2.1.2 Advantages of Oracle Big Data Spatial and Graph

The advantages of using Oracle Big Data Spatial and Graph include the following:

  • Unlike some of the GIS-centric spatial processing systems and engines, Oracle Big Data Spatial and Graph is capable of processing both structured and unstructured spatial information.

  • Customers are not forced or restricted to store only one particular form of data in their environment. They can store their data as either spatial or nonspatial business data and still use Oracle Big Data for their spatial processing.

  • This is a framework, and therefore customers can use the available APIs to custom-build their applications or operations.

  • Oracle Big Data Spatial can process both vector and raster types of information and images.

2.1.3 Oracle Big Data Spatial Features and Functions

The spatial data is loaded for query and analysis by the Spatial Server and the images are stored and processed by an Image Processing Framework. You can use the Oracle Big Data Spatial and Graph server on Hadoop for:

  • Cataloguing the geospatial information, such as geographical map-based footprints, availability of resources in a geography, and so on.

  • Topological processing to calculate distance operations, such as nearest neighbor in a map location.

  • Categorization to build hierarchical maps of geographies and enrich the map by creating demographic associations within the map elements.

The following functions are built into Oracle Big Data Spatial and Graph:

  • Indexing function for faster retrieval of the spatial data.

  • Map function to display map-based footprints.

  • Zoom function to zoom in and out of specific geographical regions.

  • Mosaic and Group function to group a set of image files for processing, to create a mosaic or perform subset operations.

  • Cartesian and geodetic coordinate functions to represent the spatial data in one of these coordinate systems.

  • Hierarchical function that builds and relates geometric hierarchies, such as country, state, city, postal code, and so on. This function can process the input data in the form of documents or latitude/longitude coordinates.

2.1.4 Oracle Big Data Spatial Files, Formats, and Software Requirements

The stored spatial data or images can be in one of these supported formats:

  • GeoJSON files

  • Shapefiles

  • Both Geodetic and Cartesian data

  • Other GDAL supported formats

You must have the following software to store and process the spatial data:

  • Java runtime

  • GCC Compiler - Only when the GDAL-supported formats are used

2.2 Oracle Big Data Vector and Raster Data Processing

Oracle Big Data Spatial and Graph supports the storage and processing of both vector and raster spatial data.

2.2.1 Oracle Big Data Spatial Raster Data Processing

For processing raster data, the GDAL loader loads the raster spatial data or images onto an HDFS environment. The following basic operations can be performed on raster spatial data:

  • Mosaic: Combine multiple raster images to create a single mosaic image.

  • Subset: Perform subset operations on individual images.

  • Raster algebra operations: Perform algebra operations on every pixel in the rasters (for example, add, divide, multiply, log, pow, sine, sinh, and acos).

  • User-specified processing: Raster processing is based on the classes that the user sets to be executed in the mapping and reducing phases.

This feature supports a MapReduce framework for raster analysis operations. Users can custom-build their own raster operations, such as performing an algebraic function on raster data. For example, you can calculate the slope at each base of a digital elevation model or a 3D representation of a spatial surface, such as a terrain. For details, see Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing.
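
As an illustration of the kind of per-pixel computation such a custom raster operation might perform, the following standalone sketch estimates the slope at a cell of a digital elevation model from its 3x3 neighborhood using central differences. It is not part of the framework API; the class name, window values, and cell size are hypothetical.

    // Standalone illustration of a per-pixel raster computation (slope from a DEM).
    // This is not the framework API; it only shows the kind of math a custom
    // raster operation could apply to each pixel of an elevation tile.
    public final class SlopeExample {

        // Estimates the slope (in degrees) at the center of a 3x3 elevation
        // window using central differences; cellSize is the ground distance per pixel.
        static double slopeDegrees(double[][] w, double cellSize) {
            double dzdx = (w[1][2] - w[1][0]) / (2 * cellSize); // elevation change along X
            double dzdy = (w[2][1] - w[0][1]) / (2 * cellSize); // elevation change along Y
            return Math.toDegrees(Math.atan(Math.sqrt(dzdx * dzdx + dzdy * dzdy)));
        }

        public static void main(String[] args) {
            double[][] window = {
                {100, 101, 102},
                {100, 101, 102},
                {100, 101, 102}
            };
            System.out.println(slopeDegrees(window, 30.0)); // about 1.9 degrees for this gentle ramp
        }
    }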

2.2.2 Oracle Big Data Spatial Vector Data Processing

This feature supports the processing of spatial vector data:

  • Loaded and stored on to a Hadoop HDFS environment

  • Stored either as Cartesian or geodetic data

The stored spatial vector data can be used for performing the following query operations and more:

  • Point-in-polygon

  • Distance calculation

  • Anyinteract

  • Buffer creation

Several data service operations are supported for the spatial vector data:

  • Data enrichment

  • Data categorization

  • Spatial join

In addition, there is limited Map Visualization API support, for the HTML5 format only. You can access these APIs to create custom operations. For details, see "Oracle Big Data Spatial Vector Analysis."
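
As a conceptual illustration of one of these queries, the following standalone sketch performs a point-in-polygon test using the even-odd ray-casting rule. It is not the Oracle Big Data Spatial Vector Analysis API; the class name and coordinates are made up for the example.

    // Conceptual point-in-polygon test (even-odd ray casting). This standalone
    // sketch only illustrates the query concept; it is not the Oracle Big Data
    // Spatial Vector Analysis API.
    public final class PointInPolygonExample {

        // Returns true if point (px, py) lies inside the polygon given by xs/ys.
        static boolean contains(double[] xs, double[] ys, double px, double py) {
            boolean inside = false;
            for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
                boolean crosses = (ys[i] > py) != (ys[j] > py)
                    && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i];
                if (crosses) {
                    inside = !inside; // each edge crossing toggles the state
                }
            }
            return inside;
        }

        public static void main(String[] args) {
            double[] xs = {0, 10, 10, 0};  // a 10 x 10 square
            double[] ys = {0, 0, 10, 10};
            System.out.println(contains(xs, ys, 5, 5));   // true
            System.out.println(contains(xs, ys, 15, 5));  // false
        }
    }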

2.3 Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing

Oracle Spatial Hadoop Image Processing Framework allows the creation of new combined images resulting from a series of processing phases in parallel with the following features:

  • HDFS Images storage, where every block size split is stored as a separate tile, ready for future independent processing

  • Subset, user-defined, and map algebra operations processed in parallel using the MapReduce framework

  • Ability to add custom processing classes to be executed in the mapping or reducing phases in parallel in a transparent way

  • Fast processing of georeferenced images

  • Support for GDAL formats, multiple bands images, DEMs (digital elevation models), multiple pixel depths, and SRIDs

  • Java API providing access to framework operations; useful for web services or standalone Java applications

  • Framework for testing and debugging user processing classes in the local environment

The Oracle Spatial Hadoop Image Processing Framework consists of two modules, a Loader and a Processor, each represented by a Hadoop job running at a different stage in a Hadoop cluster, as represented in the following diagram. You can also load and process the images using the Image Server web application, and you can use the Java API to expose the framework's capabilities.

For installation and configuration information, see:

2.3.1 Image Loader

The Image Loader is a Hadoop job that loads a specific image or a group of images into HDFS.

  • While importing, the image is tiled, and each tile is stored as an HDFS block.

  • GDAL is used to tile the image.

  • Each tile is loaded by a different mapper, so reading is parallel and faster.

  • Each tile includes a certain number of overlapping bytes (user input), so that each tile covers part of the area of its adjacent tiles.

  • A MapReduce job uses a mapper to load the information for each tile. The number of mappers depends on the number of tiles, the image resolution, and the block size.

  • A single reduce phase per image puts together all the information loaded by the mappers and stores the image into a special .ohif format, which contains the resolution, bands, offsets, and image data. This way, the file offset of each tile and its node location are known.

  • Each tile contains information for every band. This is helpful when there is a need to process only a few tiles; then, only the corresponding blocks are loaded.

The following diagram represents an Image Loader process:

2.3.2 Image Processor

The Image Processor is a Hadoop job that filters tiles to be processed based on the user input and performs processing in parallel to create a new image.

  • Processes specific tiles of the image identified by the user. You can identify zero, one, or multiple processing classes. These classes are executed in the mapping or reducing phase, depending on your configuration. For the mapping phase, after the execution of the processing classes, a mosaic operation is performed to adapt the pixels to the final output format requested by the user. If no mosaic operation was requested, the input raster is sent to the reduce phase as is. For the reduce phase, all the tiles are put together into a GDAL data set that is the input for the user reduce processing class, where the final output may be changed or analyzed according to user needs.

  • A mapper loads the data corresponding to one tile, conserving data locality.

  • Once the data is loaded, the mapper filters the bands requested by the user.

  • The filtered information is processed and sent to the reduce phase, where the bytes are put together and a final processed image is stored into HDFS or the regular file system, depending on the user request.

The following diagram represents an Image Processor job:

2.3.3 Image Server

The Image Server is a web application that enables you to load and process images from a variety of sources, especially from the Hadoop Distributed File System (HDFS). The Oracle Image Server has several main applications:

  • Visualization of rasters across the entire globe and the ability to create a mosaic from a direct selection on the map.

  • Raster image processing to create catalogs from the source images and process them into a single unit. You can also view the image thumbnails.

  • Hadoop console configuration, used to set up the cluster connection parameters and the jobs' initial setup.

2.4 Loading an Image to Hadoop Using the Image Loader

The first step in processing images with the Oracle Spatial and Graph Hadoop Image Processing Framework is to have the images in HDFS, separated into smart tiles. This allows the processing job to work on each tile independently. The Image Loader lets you import a single image or a collection of them into HDFS in parallel, which decreases the load time.

The Image Loader imports images from a file system into HDFS, where each block contains data for all the bands of the image, so that if further processing is required on specific positions, the information can be processed on a single node.

2.4.1 Image Loading Job

The image loading job has its own custom input format that splits the image into related image splits. The splits are calculated based on an algorithm that reads square blocks of the image covering a defined area, which is determined by

area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel.

For pieces that do not use the complete block size, the remaining bytes are filled with zeros.
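
For illustration, the following standalone sketch applies this split-area arithmetic; it is not framework code, and the metadata overhead and pixel depth are placeholder values.

    // Illustration of the split-area calculation described above:
    // area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel.
    public final class SplitAreaExample {

        static long splitArea(long blockSize, long metadataBytes, int bands, int bytesPerPixel) {
            return ((blockSize - metadataBytes) / bands) / bytesPerPixel;
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block
            long metadataBytes = 1024;            // hypothetical metadata overhead
            int bands = 3;                        // for example, an RGB raster
            int bytesPerPixel = 1;                // 8-bit pixel depth
            System.out.println("Pixels per split: "
                + splitArea(blockSize, metadataBytes, bands, bytesPerPixel));
        }
    }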

Splits are assigned to different mappers where every assigned tile is read using GDAL based on the ImageSplit information. As a result an ImageDataWritable instance is created and saved in the context.

The metadata set in the ImageDataWritable instance is used by the processing classes to set up the tiled image in order to manipulate and process it. Since the source images are read from multiple mappers, the load is performed in parallel and faster.

After the mappers finish reading, the reducer picks up the tiles from the context and puts them together to save the file into HDFS. A special reading process is required to read the image back.

2.4.2 Input Parameters

The following input parameters are supplied to the Hadoop command:

hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar 
  -files <SOURCE_IMGS_PATH>
  -out <HDFS_OUTPUT_FOLDER>
  -gdal <GDAL_LIB_PATH>
  -gdalData <GDAL_DATA_PATH>
  [-overlap <OVERLAPPING_PIXELS>]
  [-thumbnail <THUMBNAIL_PATH>]
  [-expand <false|true>]
  [-extractLogs <false|true>]
  [-logFilter <LINES_TO_INCLUDE_IN_LOG>]

Where:

  • SOURCE_IMGS_PATH is a path to the source image(s) or folder(s). For multiple inputs, use a comma separator. This path must be accessible via NFS to all nodes in the cluster.
  • HDFS_OUTPUT_FOLDER is the HDFS output folder where the loaded images are stored.
  • OVERLAPPING_PIXELS is an optional number of overlapping pixels on the borders of each tile. If this parameter is not specified, a default of two overlapping pixels is used.
  • GDAL_LIB_PATH is the path where the GDAL libraries are located.
  • GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible through NFS to all nodes in the cluster.
  • THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible through NFS to all nodes in the cluster and must have write access permission for the yarn user.
  • -expand controls whether the HDFS path of the loaded raster expands the source path, including all directories. If you set this to false, the .ohif file is stored directly in the output directory (specified using the -out option) without including that directory's path in the raster.
  • -extractLogs controls whether the logs of the executed application should be extracted to the system temporary directory. By default, it is not enabled. The extraction does not include logs that are not part of Oracle Framework classes.
  • -logFilter <LINES_TO_INCLUDE_IN_LOG> is a comma-separated String that lists all the patterns to include in the extracted logs, for example, to include custom processing classes packages.

For example, the following command loads all the georeferenced images under the images folder and adds an overlap of 10 pixels on every possible border. The HDFS output folder is ohiftest, and a thumbnail of the loaded image is stored in the processtest folder.

hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageloader.jar   -files /opt/shareddir/spatial/demo/imageserver/images/hawaii.tif -out ohiftest -overlap 10 -thumbnail /opt/shareddir/spatial/processtest -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data

By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed. This properties file may list all the configuration properties that you want to override. For example:

mapreduce.map.memory.mb=2560
mapreduce.reduce.memory.mb=2560
mapreduce.reduce.java.opts=-Xmx2684354560
mapreduce.map.java.opts=-Xmx2684354560

Java heap memory (java.opts properties) must be equal to or less than the total memory assigned to mappers and reducers (mapreduce.map.memory and mapreduce.reduce.memory). Thus, if you increase Java heap memory, you might also need to increase the memory for mappers and reducers.

2.4.3 Output Parameters

The reducer generates two output files per input image. The first is the .ohif file, which concentrates all the tiles for the source image; each tile may be processed as a separate instance by a processing mapper. Internally, each tile is stored as an HDFS block; blocks are located in several nodes, and one node may contain one or more blocks of a specific .ohif file. The .ohif file is stored in the folder that the user specifies with the -out flag, under /user/<USER_EXECUTING_JOB>/OUT_FOLDER/<PARENT_DIRECTORIES_OF_SOURCE_RASTER>, if the -expand flag was not used. Otherwise, the .ohif file is located at /user/<USER_EXECUTING_JOB>/OUT_FOLDER/, and the file can be identified as original_filename.ohif.

The second output is a related metadata file that lists all the pieces of the image and the coordinates that each one covers. The file is located in HDFS under the metadata location, and its name is a hash generated using the name of the .ohif file. This file is for Oracle internal use only, and it lists important metadata of the source raster. Some example lines from a metadata file:

srid:26904
datatype:1
resolution:27.90809458890406,-27.90809458890406
file:/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/hawaii.tif.ohif
bands:3
mbr:532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
0,532488.7648166901,4303164.583549625,582723.3350767174,4269619.053853762
thumbnailpath:/opt/shareddir/spatial/thumb/

If the -thumbnail flag was specified, a thumbnail of the source image is stored in the related folder. This is a way to visualize a translation of the .ohif file. Job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.

2.5 Processing an Image Using the Oracle Spatial Hadoop Image Processor

Once the images are loaded into HDFS, they can be processed in parallel using Oracle Spatial Hadoop Image Processing Framework. You specify an output, and the framework filters the tiles to fit into that output, processes them, and puts them all together to store them into a single file. Map algebra operations are also available and, if set, will be the first part of the processing phase. You can specify additional processing classes to be executed before the final output is created by the framework.

The image processor loads specific blocks of data, based on the input (mosaic description or a single raster), and selects only the bands and pixels that fit into the final output. All the specified processing classes are executed and the final output is stored into HDFS or the file system depending on the user request.

2.5.1 Image Processing Job

The image processing job has different flows depending on the type of processing requested by the user.

2.5.1.1 Default Image Processing Job Flow

The default image processing job flow is executed when any of the following processing is requested:

  • Mosaic operation

  • Single raster operation

  • Basic multiple raster algebra operation

The flow has its own custom FilterInputFormat, which determines the tiles to be processed, based on the SRID and coordinates. Only images with the same data type (pixel depth) as the mosaic input data type (pixel depth) are considered. Only the tiles that intersect with the coordinates specified by the user for the mosaic output are included. For processing of a single raster or a basic multiple raster algebra operation (excluding mosaic), the filter includes all the tiles of the input rasters, because the processing is executed on the complete images. Once the tiles are selected, a custom ImageProcessSplit is created for each image.

When a mapper receives the ImageProcessSplit, it reads the information based on what the ImageSplit specifies, performs a filter to select only the bands indicated by the user, and executes the list of map algebra operations and processing classes defined in the request, if any.

Each mapper process runs in the node where the data is located. After the map algebra operations and processing classes are executed, a validation verifies whether the user is requesting a mosaic operation or whether the analysis includes the complete image; if a mosaic operation is requested, the final process executes the operation. The mosaic operation selects from every tile only the pixels that fit into the output and makes the necessary resolution changes to add them to the mosaic output. The single-process operation just copies the previous raster tile bytes as they are. The resulting bytes are stored in NFS to be recovered by the reducer.

A single reducer picks up the tiles and puts them together. If you specified any basic multiple raster algebra operation, it is executed at the same time the tiles are merged into the final output. This operation affects only the intersecting pixels in the mosaic output, or every pixel if no mosaic operation was requested. If you specified a reducer processing class, the GDAL data set with the output raster is sent to this class for analysis and processing. If you selected HDFS output, the ImageLoader is called to store the result into HDFS. Otherwise, by default, the image is prepared using GDAL and is stored in the file system (NFS).

2.5.1.2 Multiple Raster Image Processing Job Flow

The multiple raster image processing job flow is executed when a complex multiple raster algebra operation is requested. It applies to rasters that have the same MBR, pixel type, pixel size, and SRID, since these operations are applied pixel by pixel in the corresponding cell, where every pixel represents the same coordinates.

The flow has its own custom MultipleRasterInputFormat, which determines the tiles to be processed, based on the SRID and coordinates. Only images with the same MBR, pixel type, pixel size, and SRID are considered. Only the rasters that match the coordinates specified by the first raster in the catalog are included. All the tiles of the input rasters are considered, because the processing is executed on the complete images.

Once the tiles are selected, a custom MultipleRasterSplit is created. This split contains a small area of every original tile, depending on the block size, because now all the rasters must be included in a split, even if it is only a small area. Each of these is called an IndividualRasterSplit, and they are contained in a parent MultipleRasterSplit.

When a mapper receives the MultipleRasterSplit, it reads the information of all the rasters' tiles that are included in the parent split, performs a filter to select only the bands indicated by the user and only the small corresponding area to process in this specific mapper, and then executes the complex multiple raster algebra operation.

Data locality may be lost in this part of the process, because multiple rasters are included for a single mapper that may not be in the same node. The resulting bytes for every pixel are put in the context to be recovered by the reducer.

A single reducer picks pixel values and puts them together. If you specified a reducer processing class, the GDAL data set with the output raster is sent to this class for analysis and processing. The list of tiles that this class receives is null for this scenario, and the class can only work with the output data set. If you selected HDFS output, the ImageLoader is called to store the result into HDFS. Otherwise, by default the image is prepared using GDAL and is stored in the file system (NFS).

2.5.2 Input Parameters

The following input parameters can be supplied to the hadoop command:

hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar 
  -config  <MOSAIC_CONFIG_PATH>
  -gdal  <GDAL_LIBRARIES_PATH>
  -gdalData  <GDAL_DATA_PATH>
  [-catalog  <IMAGE_CATALOG_PATH>]
  [-usrlib  <USER_PROCESS_JAR_PATH>]
  [-thumbnail  <THUMBNAIL_PATH>]
  [-nativepath  <USER_NATIVE_LIBRARIES_PATH>]
  [-params  <USER_PARAMETERS>]
  [-file  <SINGLE_RASTER_PATH>]

Where:

  • MOSAIC_CONFIG_PATH is the path to the mosaic configuration XML, which defines the features of the output.
  • GDAL_LIBRARIES_PATH is the path where GDAL libraries are located.
  • GDAL_DATA_PATH is the path where the GDAL data folder is located. This path must be accessible via NFS to all nodes in the cluster.
  • IMAGE_CATALOG_PATH is the path to the catalog XML that lists the HDFS image(s) to be processed. This is optional because you can also specify a single raster to process using the -file flag.
  • USER_PROCESS_JAR_PATH is an optional user-defined jar file or comma-separated list of jar files, each of which contains additional processing classes to be applied to the source images.
  • THUMBNAIL_PATH is an optional flag to activate the thumbnail creation of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and is valid only for an HDFS output.
  • USER_NATIVE_LIBRARIES_PATH is an optional comma-separated list of additional native libraries to use in the analysis. It can also be a directory containing all the native libraries to load in the application.
  • USER_PARAMETERS is an optional key/value list used to define input data for user processing classes. Use a semicolon to separate parameters. For example: azimuth=315;altitude=45
  • SINGLE_RASTER_PATH is an optional path to the .ohif file that will be processed by the job. If this is set, you do not need to set a catalog.

For example, the following command processes all the files listed in the catalog file input.xml, using the mosaic output definition set in the testFS.xml file.

hadoop jar /opt/oracle/oracle-spatial-graph/spatial/raster/jlib/hadoop-imageprocessor.jar -catalog /opt/shareddir/spatial/demo/imageserver/images/input.xml -config /opt/shareddir/spatial/demo/imageserver/images/testFS.xml -thumbnail /opt/shareddir/spatial/processtest -gdal /opt/oracle/oracle-spatial-graph/spatial/raster/gdal/lib -gdalData /opt/shareddir/data

By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed.

2.5.2.1 Catalog XML Structure

The following is an example of an input catalog XML used to list every source image considered for the mosaic operation generated by the image processing job.

<catalog>
  <image>
    <raster>/user/hdfs/ohiftest/opt/shareddir/spatial/data/rasters/maui.tif.ohif</raster>
    <bands datatype='1' config='1,2,3'>3</bands>
  </image>
</catalog>

A <catalog> element contains the list of <image> elements to process.

Each <image> element defines a source image or a source folder within the <raster> element. All the images within the folder are processed.

The <bands> element specifies the number of bands of the image. The datatype attribute has the raster data type, and the config attribute specifies which band should appear in the mosaic output band order. For example: 3,1,2 specifies that mosaic output band number 1 will have band number 3 of this raster, mosaic band number 2 will have source band 1, and mosaic band number 3 will have source band 2. This order may change from raster to raster.
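
The following standalone sketch (not framework code) shows how a config value such as "3,1,2" maps mosaic output band positions to source band numbers:

    // Illustration of the band-order rule for the config attribute, e.g. "3,1,2":
    // output band 1 takes source band 3, output band 2 takes source band 1, and
    // output band 3 takes source band 2. Not framework code.
    public final class BandConfigExample {

        public static void main(String[] args) {
            String config = "3,1,2";
            String[] sourceBands = config.split(",");
            for (int outBand = 1; outBand <= sourceBands.length; outBand++) {
                System.out.println("Mosaic output band " + outBand
                    + " <- source band " + sourceBands[outBand - 1]);
            }
        }
    }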

2.5.2.2 Mosaic Definition XML Structure

The following is an example of a mosaic configuration XML used to define the features of the output generated by the image processing job.

<mosaic exec="false">
  <output>
    <SRID>26904</SRID>
    <directory type="FS">/opt/shareddir/spatial/processOutput</directory>
    <!--directory type="HDFS">newData</directory-->
    <tempFSFolder>/opt/shareddir/spatial/tempOutput</tempFSFolder>
    <filename>littlemap</filename>
    <format>GTIFF</format>
    <width>1600</width>
    <height>986</height>
    <algorithm order="0">2</algorithm>
    <bands layers="3" config="3,1,2"/>
    <nodata>#000000</nodata>
    <pixelType>1</pixelType>
  </output>
  <crop>
    <transform>356958.985610072,280.38843650364862,0,2458324.0825054757,0,-280.38843650364862</transform>
  </crop>
  <process>
    <classMapper params="threshold=454,2954">oracle.spatial.hadoop.twc.FarmTransformer</classMapper>
    <classReducer params="plot_size=100400">oracle.spatial.hadoop.twc.FarmAlignment</classReducer>
  </process>
  <operations>
    <localif operator="<" operand="3" newvalue="6"/>
    <localadd arg="5"/>
    <localsqrt/>
    <localround/>
  </operations>
</mosaic>

The <mosaic> element defines the specifications of the processing output. The exec attribute specifies if the processing will include mosaic operation or not. If set to “false”, a mosaic operation is not executed and a single raster is processed; if set to “true” or not set, a mosaic operation is performed. Some of the following elements are required only for mosaic operations and ignored for single raster processing.

The <output> element defines the features, such as <SRID>, considered for the output. All the images in a different SRID are converted to the mosaic SRID in order to decide whether any of their tiles fit into the mosaic. This element is not required for single raster processing, because the output raster has the same SRID as the input.

The <directory> element defines where the output is located. It can be in an HDFS or in regular FileSystem (FS), which is specified in the tag type.

The <tempFSFolder> element sets the path to store the mosaic output temporarily. The attribute delete="false" can be specified to keep the output of the process even if the loader was executed to store it in HDFS.

The <filename> and <format> elements specify the output filename. <filename> is not required for single raster process; and if it is not specified, the name of the input file (determined by the -file attribute during the job call) is used for the output file. <format> is not required for single raster processing, because the output raster has the same format as the input.

The <width> and <height> elements set the mosaic output resolution. They are not required for single raster processing, because the output raster has the same resolution as the input.

The <algorithm> element sets the ordering algorithm for the images. An algorithm value of 1 means ordering by source last-modified date, and a value of 2 means ordering by image size. The order attribute specifies ascending or descending mode. (These properties are for mosaic operations where multiple rasters may overlap.)

The <bands> element specifies the number of bands in the output mosaic. Images with fewer bands than this number are discarded. The config attribute can be used for single raster processing to set the band configuration for output, because there is no catalog.

The <nodata> element specifies the color in the first three bands for all the pixels in the mosaic output that have no value.

The <pixelType> element sets the pixel type of the mosaic output. Source images that do not have the same pixel size are discarded for processing. This element is not required for single raster processing: if not specified, the pixel type will be the same as for the input.

The <crop> element defines the coordinates included in the mosaic output in the following order: startcoordinateX, pixelXWidth, RotationX, startcoordinateY, RotationY, and pixelheightY. This element is not required for single raster processing: if not specified, the complete image is considered for analysis.
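
Assuming these six values follow the usual affine geotransform layout implied by that order, the ground coordinates of a pixel can be derived as in the following standalone sketch (an illustration of the arithmetic only, not framework code):

    // Illustration of mapping a pixel (column, row) to ground coordinates using the
    // six <crop> transform values (startcoordinateX, pixelXWidth, RotationX,
    // startcoordinateY, RotationY, pixelheightY). Not framework code.
    public final class CropTransformExample {

        static double[] pixelToGround(double[] t, int col, int row) {
            double x = t[0] + col * t[1] + row * t[2];
            double y = t[3] + col * t[4] + row * t[5];
            return new double[] {x, y};
        }

        public static void main(String[] args) {
            // Values taken from the sample mosaic configuration above.
            double[] t = {356958.985610072, 280.38843650364862, 0,
                          2458324.0825054757, 0, -280.38843650364862};
            double[] upperLeft = pixelToGround(t, 0, 0);
            System.out.println(upperLeft[0] + ", " + upperLeft[1]);
        }
    }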

The <process> element lists all the classes to execute before the mosaic operation.

The <classMapper> element is used for classes that will be executed during mapping phase, and the <classReducer> element is used for classes that will be executed during reduce phase. Both elements have the params attribute, where you can send input parameters to processing classes according to your needs.
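
A processing class typically needs to parse the params value it receives, such as azimuth=315;altitude=45. The following standalone sketch (not framework code) shows one way to split such a semicolon-separated key/value string:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustration of parsing a semicolon-separated params value, such as
    // "azimuth=315;altitude=45", inside a user processing class. Not framework code.
    public final class ParamsParsingExample {

        static Map<String, String> parse(String params) {
            Map<String, String> values = new LinkedHashMap<>();
            for (String pair : params.split(";")) {
                String[] kv = pair.split("=", 2);  // split into key and value only
                values.put(kv[0].trim(), kv.length > 1 ? kv[1].trim() : "");
            }
            return values;
        }

        public static void main(String[] args) {
            System.out.println(parse("azimuth=315;altitude=45")); // {azimuth=315, altitude=45}
        }
    }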

The <operations> element lists all the map algebra operations that will be processed for this request.

2.5.3 Job Execution

The first step of the job is to filter the tiles that would fit into the output. As a start, the location files that hold the tile metadata are sent to the InputFormat.

By extracting the pixelType, the filter decides whether the related source image is valid for processing. Based on the user definition made in the catalog XML, one of the following happens:

  • If the image is valid for processing, then the SRID is evaluated next.

  • If it is different from the user definition, then the MBR coordinates of every tile are converted into the user SRID and evaluated.

This way, every tile is evaluated for intersection with the output definition.

  • For a mosaic processing request, only the intersecting tiles are selected, and a split is created for each one of them.

  • For a single raster processing request, all the tiles are selected, and a split is created for each one of them.

  • For a complex multiple raster algebra processing request, all the tiles are selected if the MBR and pixel size are the same. Depending on the number of rasters selected and the block size, a specific area of every tile's raster (which does not always include the complete original raster tile) is included in a single parent split.

A mapper processes each split in the node where it is stored. (For complex multiple raster algebra operations, data locality may be lost, because a split contains data from several rasters.) The mapper executes the sequence of map algebra operations and processing classes defined by the user, and then the mosaic process is executed if requested. A single reducer puts together the results of the mappers and, for user-specified reducer processing classes, sends the output data set to these classes for analysis or processing. Finally, the process stores the image into FS or HDFS, depending on the user request. If the user requested to store the output into HDFS, then the ImageLoader job is invoked to store the image as an .ohif file.

By default, the mappers and reducers are configured to get 1 GB of JVM, but you can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder location from which the command is executed.

2.5.4 Processing Classes and ImageBandWritable

The processing classes specified in the mosaic configuration XML must follow a set of rules to be correctly processed by the job. All the processing classes in the mapping phase must implement the ImageProcessorInterface interface. For the reducer phase, they must implement the ImageProcessorReduceInterface interface.

When implementing a processing class, you may manipulate the raster using its object representation, ImageBandWritable. An example of a processing class is provided with the framework to calculate the slope on DEMs. You can create mapping operations, for example, to transform the pixel values to another value by a function.

The ImageBandWritable instance defines the content of a tile, such as resolution, size, and pixels. These values must be reflected in the properties that create the definition of the tile. The integrity of the output depends on the correct manipulation of these properties.

Table 2-1 ImageBandWritable Properties

Type and Property: Description

IntWritable dstWidthSize: Width size of the tile
IntWritable dstHeightSize: Height size of the tile
IntWritable bands: Number of bands in the tile
IntWritable dType: Data type of the tile
IntWritable offX: Starting X pixel, in relation to the source image
IntWritable offY: Starting Y pixel, in relation to the source image
IntWritable totalWidth: Width size of the source image
IntWritable totalHeight: Height size of the source image
IntWritable bytesNumber: Number of bytes containing the pixels of the tile and stored into baseArray
BytesWritable[] baseArray: Array containing the bytes representing the tile pixels; each cell represents a band
IntWritable[][] basePaletteArray: Array containing the int values representing the tile palette; each array represents a band. Each integer represents an entry for each color in the color table; there are four entries per color
IntWritable[] baseColorArray: Array containing the int values representing the color interpretation; each cell represents a band
DoubleWritable[] noDataArray: Array containing the NODATA values for the image; each cell contains the value for the related band
ByteWritable isProjection: Specifies if the tile has projection information with Byte.MAX_VALUE
ByteWritable isTransform: Specifies if the tile has the geo transform array information with Byte.MAX_VALUE
ByteWritable isMetadata: Specifies if the tile has metadata information with Byte.MAX_VALUE
IntWritable projectionLength: Specifies the projection information length
BytesWritable projectionRef: Specifies the projection information in bytes
DoubleWritable[] geoTransform: Contains the geo transform array
IntWritable metadataSize: Number of metadata values in the tile
IntWritable[] metadataLength: Array specifying the length of each metadataValue
BytesWritable[] metadata: Array of metadata of the tile
GeneralInfoWritable mosaicInfo: The user-defined information in the mosaic XML. Do not modify the mosaic output features; modify the original XML file under a new name and run the process using the new XML
MapWritable extraFields: Map that lists key/value pairs of parameters specific to every tile to be passed to the reducer phase for analysis

Processing Classes and Methods

When modifying the pixels of the tile, first get the band information into an array using the following method:

byte[] bandData1 = (byte[]) img.getBand(0);

The bytes representing the tile pixels of band 1 are now in the bandData1 array. The base index is zero.

The getBand(int bandId) method will get the band of the raster in the specified bandId position. You can cast the object retrieved to the type of array of the raster; it could be byte, short (unsigned int 16 bits, int 16 bits), int (unsigned int 32 bits, int 32 bits), float (float 32 bits), or double (float 64 bits).

With the array of pixels available, it is now possible to transform them as requested by the user.

After processing the pixels, if the same instance of ImageBandWritable must be used, then execute the following method:

img.removeBands();

This removes the content of previous bands, and you can start adding the new bands. To add a new band use the following method:

img.addBand(Object band);

Otherwise, you may want to replace a specific band by using the following method:

img.replaceBand(Object band, int bandId)

In the preceding methods, band is an array containing the pixel information, and bandId is the identifier of the band to be replaced. Do not forget to update the instance size, data type, bytesNumber, and any other property that might be affected as a result of the processing operation. Setters are available for each property.
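
Putting these methods together, a mapping-phase processing class might adjust a band along the following lines. This is a minimal fragment rather than a complete processing class: it assumes an ImageBandWritable instance named img is available to the class, and the brightness offset is only an example transformation.

    // Minimal sketch of the band-manipulation sequence described above. The
    // brightness offset is an example transformation, not framework behavior.
    void adjustFirstBand(ImageBandWritable img) {
        // Read the pixels of band 1 (index 0) into a byte array.
        byte[] bandData1 = (byte[]) img.getBand(0);

        // Example transformation: add a constant offset to every pixel value.
        for (int i = 0; i < bandData1.length; i++) {
            bandData1[i] = (byte) (bandData1[i] + 10);
        }

        // Put the modified pixels back into the same band position.
        img.replaceBand(bandData1, 0);

        // If size, data type, bytesNumber, or any other property changed as a
        // result of the processing, update it here using the corresponding setter.
    }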

2.5.4.1 Location of the Classes and Jar Files

All the processing classes must be contained in a single jar file if you are using the Oracle Image Server Console. The processing classes might be placed in different jar files if you are using the command line option.

When new classes are visible in the classpath, they must be added to the mosaic XML in the <process><classMapper> or <process><classReducer> section. Every <class> element added is executed in order of appearance: for mappers, just before the final mosaic operation is performed; and for reducers, just after all the processed tiles are put together in a single data set.

2.5.5 Map Algebra Operations

You can process local map algebra operations on the input rasters, where pixels are altered depending on the operation. The order of operations in the configuration XML determines the order in which the operations are processed. After all the map algebra operations are processed, the processing classes are run, and finally the mosaic operation is performed.

The following map algebra operations can be added in the <operations> element in the mosaic configuration XML, with the operation name serving as an element name.

localnot: Gets the negation of every pixel, inverts the bit pattern. If the result is a negative value and the data type is unsigned, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set.

locallog: Returns the natural logarithm (base e) of a pixel. If the result is NaN, then the original pixel value is set; if the result is Infinite, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set.

locallog10: Returns the base 10 logarithm of a pixel. If the result is NaN, then the original pixel value is set; if the result is Infinite, then the NODATA value is set. If the raster does not have a specified NODATA value, then the original pixel is set.

localadd: Adds the value specified as argument to every pixel. Example: <localadd arg="5"/>

localdivide: Divides the value of each pixel by the specified value set as argument. Example: <localdivide arg="5"/>

localif: Modifies the value of each pixel based on the condition and value specified as arguments. Valid operators: =, <, >, >=, <=, !=. Example: <localif operator="<" operand="3" newvalue="6"/>, which modifies all the pixels whose value is less than 3, setting the new value to 6.

localmultiply: Multiplies the value of each pixel times the value specified as argument. Example: <localmultiply arg="5"/>

localpow: Raises the value of each pixel to the power of the value specified as argument. Example: <localpow arg="5"/>. If the result is infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localsqrt: Returns the correctly rounded positive square root of every pixel. If the result is infinite or NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localsubstract: Subtracts the value specified as argument from every pixel value. Example: <localsubstract arg="5"/>

localacos: Calculates the arc cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localasin: Calculates the arc sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localatan: Calculates the arc tangent of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localcos: Calculates the cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localcosh: Calculates the hyperbolic cosine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localsin: Calculates the sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localtan: Calculates the tangent of a pixel. The pixel is not modified if the cosine of this pixel is 0. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localsinh: Calculates the hyperbolic sine of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localtanh: Calculates the hyperbolic tangent of a pixel. If the result is NaN, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localdefined: Maps an integer typed pixel to 1 if the cell value is not NODATA; otherwise, 0.

localundefined: Maps an integer typed Raster to 0 if the cell value is not NODATA; otherwise, 1.

localabs: Returns the absolute value of a signed pixel. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localnegate: Multiplies the value of each pixel by -1.

localceil: Returns the smallest value that is greater than or equal to the pixel value and is equal to a mathematical integer. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localfloor: Returns the largest value that is less than or equal to the pixel value and is equal to a mathematical integer. If the result is Infinite, the NODATA value is set to this pixel. If the raster does not have a specified NODATA value, then the original pixel is set.

localround: Returns the closest integer value to every pixel.
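
Most of these operations share the NODATA fallback rule described above. The following standalone sketch (not the framework implementation) shows that pattern for a locallog-style transform on a double-typed band, using a hypothetical NODATA value:

    // Conceptual sketch of the NODATA fallback rule shared by many local operations,
    // applied to a locallog-style transform on a double-typed band.
    public final class LocalLogExample {

        // Applies log(pixel); NaN keeps the original pixel, Infinite becomes NODATA
        // (or the original pixel if no NODATA value is defined).
        static double[] localLog(double[] band, Double nodata) {
            double[] out = new double[band.length];
            for (int i = 0; i < band.length; i++) {
                double result = Math.log(band[i]);
                if (Double.isNaN(result)) {
                    out[i] = band[i];                             // keep the original pixel
                } else if (Double.isInfinite(result)) {
                    out[i] = (nodata != null) ? nodata : band[i]; // NODATA if defined
                } else {
                    out[i] = result;
                }
            }
            return out;
        }

        public static void main(String[] args) {
            double[] band = {Math.E, 0.0, -1.0};  // log gives 1, -Infinity, NaN
            System.out.println(java.util.Arrays.toString(localLog(band, -9999.0)));
            // prints [1.0, -9999.0, -1.0]
        }
    }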

2.5.6 Multiple Raster Algebra Operations

You can process raster algebra operations that involve more than one raster, where pixels are altered depending on the operation, taking into consideration the pixels of all the involved rasters in the same cell.

Only one operation can be processed at a time and it is defined in the configuration XML using the <multipleops> element. Its value is the operation to process.

There are two types of operations:

2.5.6.1 Basic Multiple Raster Algebra Operations

Basic multiple raster algebra operations are executed in the reducing phase of the job.

They can be requested along with a mosaic operation or just a process request. If requested along with a mosaic operation, the input rasters must have the same MBR, pixel size, SRID and data type.

When a mosaic operation is performed, only the intersecting pixels (pixels that are identical in both rasters) are affected.

The operation is processed at the time the mapping tiles are put together in the output data set: the pixel values that intersect (if a mosaic operation was requested), or all the pixels (when mosaic is not requested), are altered according to the requested operation.

The order in which rasters are added to the data set is the mosaic operation order if it was requested; otherwise, it is the order of appearance in the catalog.

The following basic multiple raster algebra operations are available:

  • add: Adds every pixel in the same cell for the raster sequence.

  • substract: Subtracts every pixel in the same cell for the raster sequence.

  • divide: Divides every pixel in the same cell for the raster sequence.

  • multiply: Multiplies every pixel in the same cell for the raster sequence.

  • min: Assigns the minimum value of the pixels in the same cell for the raster sequence.

  • max: Assigns the maximum value of the pixels in the same cell for the raster sequence.

  • mean: Calculates the mean value for every pixel in the same cell for the raster sequence.

  • and: Processes a binary "and" operation on every pixel in the same cell for the raster sequence; the "and" operation copies a bit to the result if it exists in both operands.

  • or: Processes a binary "or" operation on every pixel in the same cell for the raster sequence; the "or" operation copies a bit if it exists in either operand.

  • xor: Processes a binary "xor" operation on every pixel in the same cell for the raster sequence; the "xor" operation copies the bit if it is set in one operand but not both.
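
These operations are evaluated cell by cell across the raster sequence. The following standalone sketch (not the framework implementation) illustrates the idea for the add and mean operations on two aligned single-band tiles:

    import java.util.Arrays;

    // Conceptual cell-by-cell combination of aligned rasters, illustrating the
    // "add" and "mean" basic multiple raster algebra operations.
    public final class MultipleRasterAlgebraExample {

        // Adds the pixels that share the same cell across the raster sequence.
        static double[] add(double[][] rasters) {
            double[] out = new double[rasters[0].length];
            for (double[] raster : rasters) {
                for (int cell = 0; cell < out.length; cell++) {
                    out[cell] += raster[cell];
                }
            }
            return out;
        }

        // Mean of the pixels that share the same cell across the raster sequence.
        static double[] mean(double[][] rasters) {
            double[] sum = add(rasters);
            for (int cell = 0; cell < sum.length; cell++) {
                sum[cell] /= rasters.length;
            }
            return sum;
        }

        public static void main(String[] args) {
            double[][] rasters = {{1, 2, 3}, {3, 4, 5}};   // two aligned single-band tiles
            System.out.println(Arrays.toString(add(rasters)));  // [4.0, 6.0, 8.0]
            System.out.println(Arrays.toString(mean(rasters))); // [2.0, 3.0, 4.0]
        }
    }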

2.5.6.2 Complex Multiple Raster Algebra Operations

Complex multiple raster algebra operations are executed in the mapping phase of the job, and a job can only process this operation; any request for resizing, changing the SRID, or custom mapping must have been previously executed. The input for this job is a series of rasters with the same MBR, SRID, data type, and pixel size.

The tiles for this job include a piece of all the rasters in the catalog. Thus, every mapper has access to an area of cells in all the rasters, and the operation is processed there. The resulting pixel for every cell is written in the context, so that reducer can put results in the output data set before processing the reducer processing classes.

The order in which rasters are considered to evaluate the operation is the order of appearance in the catalog.

The following complex multiple raster algebra operations are available:

  • combine: Assigns a unique output value to each unique combination of input values in the same cell for the raster sequence.

  • majority: Assigns the value within the same cells of the raster sequence that is the most numerous. If there is a tie in values, the one on the right is selected.

  • minority: Assigns the value within the same cells of the raster sequence that is the least numerous. If there is a tie in values, the one on the right is selected.

  • variety: Assigns the count of unique values at each same cell in the sequence of rasters.

  • mask: Generates a raster with the values from the first raster, but only includes pixels in which the corresponding pixel in the rest of rasters of the sequence is set to the specified mask values. Otherwise, 0 is set.

  • inversemask: Generates a raster with the values from the first raster, but only includes pixels in which the corresponding pixel in the rest of rasters of the sequence is not set to the specified mask values. Otherwise, 0 is set.

  • equals: Creates a raster with data type byte, where cell values equal 1 if the corresponding cells for all input rasters have the same value. Otherwise, 0 is set.

  • unequal: Creates a raster with data type byte, where cell values equal 1 if the corresponding cells for all input rasters have a different value. Otherwise, 0 is set.

  • greater: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is greater than the rest of the corresponding cells for all inputs. Otherwise, 0 is set.

  • greaterorequal: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is greater than or equal to the rest of the corresponding cells for all inputs. Otherwise, 0 is set.

  • less: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is less than the rest of the corresponding cells for all inputs. Otherwise, 0 is set.

  • lessorequal: Creates a raster with data type byte, where cell values equal 1 if the cell value in the first raster is less than or equal to the rest of the corresponding cells for all inputs. Otherwise, 0 is set.
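
Like the basic operations, these are evaluated cell by cell. The following standalone sketch (not the framework implementation) illustrates the mask semantics described above, using one additional raster and a hypothetical mask value:

    import java.util.Arrays;

    // Conceptual sketch of the "mask" operation: keep the first raster's pixel
    // where the second raster holds one of the mask values, otherwise write 0.
    public final class MaskOperationExample {

        static int[] mask(int[] first, int[] second, int[] maskValues) {
            int[] out = new int[first.length];
            for (int cell = 0; cell < first.length; cell++) {
                final int value = second[cell];
                boolean matches = Arrays.stream(maskValues).anyMatch(v -> v == value);
                out[cell] = matches ? first[cell] : 0;
            }
            return out;
        }

        public static void main(String[] args) {
            int[] first = {10, 20, 30, 40};
            int[] second = {1, 2, 1, 3};
            System.out.println(Arrays.toString(mask(first, second, new int[] {1}))); // [10, 0, 30, 0]
        }
    }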

2.5.7 Output

When you specify an HDFS directory in the configuration XML, the output generated is an .ohif file, as in the case of an ImageLoader job.

When you specify an FS directory in the configuration XML, the output generated is an image with the filename and type specified, and it is stored in the regular file system.

In both scenarios, the output must comply with the specifications set in the configuration XML. The job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.

2.6 Loading and Processing an Image Using the Oracle Spatial Hadoop Raster Processing API

The framework provides a raster processing API that lets you load and process rasters without creating XML but instead using a Java application. The application can be executed inside the cluster or on a remote node.

The API provides access to the framework operations, and is useful for web service or standalone Java applications.

To execute any of the jobs, a HadoopConfiguration object must be created. This object is used to set the necessary configuration information (such as the jar file name and the GDAL paths) to create the job, manipulate rasters, and execute the job. The basic logic is as follows:

     //Creates Hadoop Configuration
     HadoopConfiguration hadoopConf = new HadoopConfiguration();
     //Assigns GDAL_DATA location based on specified SHAREDDIR, this data folder is required by gdal to look for data tables that allow SRID conversions
     String gdalData = sharedDir + ProcessConstants.DIRECTORY_SEPARATOR + "data";
     hadoopConf.setGdalDataPath(gdalData);
     //Sets jar name for processor
     hadoopConf.setMapreduceJobJar("hadoop-imageprocessor.jar");
     //Creates the job
     RasterProcessorJob processor = (RasterProcessorJob) hadoopConf.createRasterProcessorJob();

If the API is used on a remote node, you can set properties in the Hadoop Configuration object to connect to the cluster. For example:

        //Following config settings are required for standalone execution. (REMOTE ACCESS)
        hadoopConf.setUser("hdfs");
        hadoopConf.setHdfsPathPrefix("hdfs://den00btb.us.oracle.com:8020");
        hadoopConf.setResourceManagerScheduler("den00btb.us.oracle.com:8030");
        hadoopConf.setResourceManagerAddress("den00btb.us.oracle.com:8032");
        hadoopConf.setYarnApplicationClasspath("/etc/hadoop/conf/,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*," +
                          "/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*," +
                          "/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/* ");

After the job is created, the properties for its execution must be set depending on the job type. There are two job classes: RasterLoaderJob to load the rasters into HDFS, and RasterProcessorJob to process them.

The following example loads a Hawaii raster into the APICALL_HDFS directory. It creates a thumbnail in a shared folder, and specifies 10 pixels overlapping on each edge of the tiles.

    private static void executeLoader(HadoopConfiguration hadoopConf){
        hadoopConf.setMapreduceJobJar("hadoop-imageloader.jar");
        RasterLoaderJob loader = (RasterLoaderJob) hadoopConf.createRasterLoaderJob();
        loader.setFilesToLoad("/net/den00btb/scratch/zherena/hawaii/hawaii.tif");
        loader.setTileOverlap("10");
        loader.setOutputFolder("APICALL");
        loader.setRasterThumbnailFolder("/net/den00btb/scratch/zherena/processOutput");
        try{
        loader.setGdalPath("/net/den00btb/scratch/zherena/gdal/lib");
         
        boolean loaderSuccess = loader.execute();
            if(loaderSuccess){
                System.out.println("Successfully executed loader job");
            }
            else{
                System.out.println("Failed to execute loader job");
            }
        }catch(Exception e ){
        System.out.println("Problem when trying to execute raster loader " + e.getMessage());
        }
    }

The following example processes the loaded raster.

private static void executeProcessor(HadoopConfiguration hadoopConf){
    hadoopConf.setMapreduceJobJar("hadoop-imageprocessor.jar");
    RasterProcessorJob processor = (RasterProcessorJob) hadoopConf.createRasterProcessorJob();
     
    try{
    processor.setGdalPath("/net/den00btb/scratch/zherena/gdal/lib");
    MosaicConfiguration mosaic = new MosaicConfiguration();
        mosaic.setBands(3);
        mosaic.setDirectory("/net/den00btb/scratch/zherena/processOutput");
        mosaic.setFileName("APIMosaic");
        mosaic.setFileSystem(RasterProcessorJob.FS);
        mosaic.setFormat("GTIFF");
        mosaic.setHeight(3192);
        mosaic.setNoData("#FFFFFF");
        mosaic.setOrderAlgorithm(ProcessConstants.ALGORITMH_FILE_LENGTH);
        mosaic.setOrder("1");
        mosaic.setPixelType("1");
        mosaic.setPixelXWidth(67.457513);
        mosaic.setPixelYWidth(-67.457513);
        mosaic.setSrid("26904");
        mosaic.setUpperLeftX(830763.281336);
        mosaic.setUpperLeftY(2259894.481403);
        mosaic.setWidth(1300);
    processor.setMosaicConfigurationObject(mosaic.getCompactMosaic()); 
        RasterCatalog catalog = new RasterCatalog();
        Raster raster = new Raster();
        raster.setBands(3);
        raster.setBandsOrder("1,2,3");
        raster.setDataType(1);
        raster.setRasterLocation("/user/hdfs/APICALL/net/den00btb/scratch/zherena/hawaii/hawaii.tif.ohif");
        catalog.addRasterToCatalog(raster);
           
        processor.setCatalogObject(catalog.getCompactCatalog());
    boolean processorSuccess = processor.execute();
        if(processorSuccess){
            System.out.println("Successfully executed processor job");
        }
        else{
            System.out.println("Failed to execute processor job");
        }
    }catch(Exception e ){
    System.out.println("Problem when trying to execute raster processor " + e.getMessage());
    }
}

In the preceding example, the thumbnail is optional if the mosaic results will be stored in HDFS. If a processing jar file is specified (used when additional user processing classes are specified), the location of the jar file containing these classes must be specified. The other parameters are required for the mosaic to be generated successfully.

Several examples of using the processing API are provided in /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/src. Review the Java classes to understand their purpose. You can execute them by using the scripts provided for each example, located under /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/cmd.

After you have executed the scripts and validated the results, you can modify the Java source files to experiment with them, and compile them using the provided script /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/build.xml. Ensure that you have write access to the /opt/oracle/oracle-spatial-graph/spatial/raster/jlib directory.

2.7 Using the Oracle Spatial Hadoop Raster Simulator Framework to Test Raster Processing

When you create custom processing classes, you can use the Oracle Spatial Hadoop Raster Simulator Framework to do the following by "pretending" to plug them into the Oracle Raster Processing Framework:

  • Develop user processing classes on a local computer

  • Avoid the need to deploy user processing classes in a cluster or in Big Data Lite to verify their correct functioning

  • Debug user processing classes

  • Use small local data sets

  • Create local debug outputs

  • Automate unit tests

The Simulator framework emulates the loading and processing in your local environment, as if they were being executed in a cluster. You only need to create a JUnit test case that loads one or more rasters and processes them according to your specification in XML or a configuration object.

Tiles are generated according to the specified block size, so you must set a block size. The number of mappers and reducers to execute depends on the number of tiles, just as in regular cluster execution. OHIF files generated during the loading process are stored in a local directory, because no HDFS is required.

  • Simulator (“Mock”) Objects

  • User Local Environment Requirements

  • Sample Test Cases to Load and Process Rasters

Simulator (“Mock”) Objects

To load rasters and convert them into .OHIF files that can be processed, a RasterLoaderJobMock must be executed. This class constructor receives the HadoopConfiguration that must include the block size, the directory or rasters to load, the output directory to store the OHIF files, and the gdal directory. The parameters representing the input files and the user configuration vary in terms of how you specify them:

  • Location Strings for catalog and user configuration XML file

  • Catalog object (CatalogMock)

  • Configuration objects (MosaicProcessConfigurationMock or SingleProcessConfigurationMock)

  • Location for a single raster processing and a user configuration (MosaicProcessConfigurationMock or SingleProcessConfigurationMock)

User Local Environment Requirements

Before you create test cases, you need to configure your local environment.

  1. Ensure that a directory has the native gdal libraries, gdal-data and libproj.

    For Linux:

    1. Follow the steps in Getting and Compiling the Cartographic Projections Library to obtain libproj.so.

    2. Get the gdal distribution from the Spatial installation on your cluster or BigDataLite VM at /opt/oracle/oracle-spatial-graph/spatial/raster/gdal.

    3. Move libproj.so to your local gdal directory under gdal/lib with the rest of the native gdal libraries.

    For Windows:

    1. Get the gdal distribution from your Spatial install on your cluster or BigDataLite VM at /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/mock/lib/gdal_windows.x64.zip.

    2. Be sure that Visual Studio is installed. When you install it, make sure you select the Common Tools for Visual C++.

    3. Download the PROJ 4 source code, version branch 4.9 from https://trac.osgeo.org/proj/browser/branches/4.9.

    4. Open the Visual Studio Development Command Prompt and type:

      cd PROJ4/src_dir
      nmake /f makefile.vc
      
    5. Move proj.dll to your local gdal directory under gdal/bin with the rest of the native gdal libraries.

  2. Add GDAL native libraries to system path.

    For Linux: Export LD_LIBRARY_PATH with the corresponding native gdal libraries directory.

    For Windows: Add to the Path environment variable the native gdal libraries directory.

  3. Ensure that the Java project has Junit libraries.

  4. Ensure that the Java project has the following Hadoop and Oracle Image Processing Framework jar files in the classpath. You may get them from the Oracle BigDataLite VM or from your cluster; these are all jars included in the Hadoop distribution, and the framework-specific jars are in /opt/oracle/oracle-spatial-graph/spatial/raster/jlib:

    (In the following list, VERSION_INCLUDED refers to the version number from the Hadoop installation containing the files; it can be a BDA cluster or a BigDataLite VM.)

    commons-collections-VERSION_INCLUDED.jar
    commons-configuration-VERSION_INCLUDED.jar
    commons-lang-VERSION_INCLUDED.jar
    commons-logging-VERSION_INCLUDED.jar
    commons-math3-VERSION_INCLUDED.jar
    gdal.jar
    guava-VERSION_INCLUDED.jar
    hadoop-auth-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
    hadoop-common-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
    hadoop-imageloader.jar
    hadoop-imagemocking-fwk.jar
    hadoop-imageprocessor.jar
    hadoop-mapreduce-client-core-VERSION_INCLUDED-cdhVERSION_INCLUDED.jar
    hadoop-raster-fwk-api.jar
    jackson-core-asl-VERSION_INCLUDED.jar
    jackson-mapper-asl-VERSION_INCLUDED.jar
    log4j-VERSION_INCLUDED.jar
    slf4j-api-VERSION_INCLUDED.jar
    slf4j-log4j12-VERSION_INCLUDED.jar
    

Sample Test Cases to Load and Process Rasters

After your Java project is prepared for your test cases, you can test the loading and processing of rasters.

The following example creates a class with a setUp method to configure the directories for gdal, the rasters to load, your configuration XML files, the output thumbnails, ohif files, and process results. It also configures the block size (8 MB). (A small block size is recommended for single computers.)
/**
 * Set the basic directories before starting the test execution
 */
@Before
public void setUp(){
    String sharedDir = "C:\\Users\\zherena\\Oracle Stuff\\Hadoop\\Release 4\\MockTest";
    String allAccessDir = sharedDir + "/out/";
    gdalDir = sharedDir + "/gdal";
    directoryToLoad = allAccessDir + "rasters";
    xmlDir = sharedDir + "/xmls/";
    outputDir = allAccessDir;
    blockSize = 8;
}

The following example creates a RasterLoaderJobMock object, and sets the rasters to load and the output path for OHIF files:

/**
 * Loads a directory of rasters, and generates ohif files and thumbnails
 * for all of them
 * @throws Exception if there is a problem during the load process
 */
@Test
public void basicLoad() throws Exception {
    System.out.println("***LOAD OF DIRECTORY WITHOUT EXPANSION***");
    HadoopConfiguration conf = new HadoopConfiguration();
    conf.setBlockSize(blockSize);
    System.out.println("Set block size of: " +
            conf.getProperty("dfs.blocksize"));
    RasterLoaderJobMock loader = new RasterLoaderJobMock(conf,
            outputDir, directoryToLoad, gdalDir);
    //Puts the ohif file directly in the specified output directory
    loader.dontExpandOutputDir();
    System.out.println("Starting execution");
    System.out.println("------------------------------------------------------------------------------------------------------------");
    loader.waitForCompletion();
    System.out.println("Finished loader");
    System.out.println("LOAD OF DIRECTORY WITHOUT EXPANSION ENDED");
    System.out.println();
    System.out.println();
}

The following example specifies catalog and user configuration XML files to the RasterProcessorJobMock object. Make sure your catalog xml points to the correct location of your local OHIF files.

/**
 * Creates a mosaic raster by using configuration and catalog xmls.
 * Only two bands are selected per raster.
 * @throws Exception if there is a problem during the mosaic process.
 */
@Test
public void mosaicUsingXmls() throws Exception {
    System.out.println("***MOSAIC PROCESS USING XMLS***");
    HadoopConfiguration conf = new HadoopConfiguration();
    conf.setBlockSize(blockSize);
    System.out.println("Set block size of: " +
            conf.getProperty("dfs.blocksize"));
    String catalogXml = xmlDir + "catalog.xml";
    String configXml = xmlDir + "config.xml";
    RasterProcessorJobMock processor = new RasterProcessorJobMock(conf, configXml, catalogXml, gdalDir);
    System.out.println("Starting execution");
    System.out.println("------------------------------------------------------------------------------------------------------------");
    processor.waitForCompletion();
    System.out.println("Finished processor");
    System.out.println("***********************************************MOSAIC PROCESS USING XMLS ENDED***********************************************");
    System.out.println();
    System.out.println();
}

Additional examples using the different supported configurations for RasterProcessorJobMock are provided in /opt/oracle/oracle-spatial-graph/spatial/raster/examples/java/mock/src. They include an example using an external processing class, which is also included and can be debugged.

2.8 Oracle Big Data Spatial Vector Analysis

Oracle Big Data Spatial Vector Analysis is a Spatial Vector Analysis API, which runs as a Hadoop job and provides MapReduce components for spatial processing of data stored in HDFS. These components make use of the Spatial Java API to perform spatial analysis tasks. A web console is provided along with the API. The supported features, and the implementation details you should understand before using them, are described in the subtopics that follow.

2.8.1 Multiple Hadoop API Support

Oracle Big Data Spatial Vector Analysis provides classes for both the old and new (context objects) Hadoop APIs. In general, classes in the mapred package are used with the old API, while classes in the mapreduce package are used with the new API.

The examples in this guide use the old Hadoop API; however, all the old Hadoop Vector API classes have equivalent classes in the new API. For example, the old class oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing has the equivalent new class named oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing. In general, and unless stated otherwise, only the change from mapred to mapreduce is needed to use the new Hadoop API Vector classes.

Classes such as oracle.spatial.hadoop.vector.RecordInfo, which are not in the mapred or mapreduce package, are compatible with both Hadoop APIs.
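For example, the following fragment is a minimal sketch showing that switching APIs amounts to instantiating the job driver from the other package; the type parameters shown mirror the old-API example later in this section and are assumed to apply to the new-API class as well.

//old Hadoop API (mapred package)
oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing<LongWritable, Text> oldApiJob =
    new oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing<LongWritable, Text>();

//new Hadoop API (mapreduce package); only the package name changes
oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing<LongWritable, Text> newApiJob =
    new oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing<LongWritable, Text>();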

2.8.2 Spatial Indexing

A spatial index is in the form of a key/value pair and generated as a Hadoop MapFile. Each MapFile entry contains a spatial index for one split of the original data. The key and value pair contains the following information:

  • Key: a split identifier in the form: path + start offset + length.

  • Value: a spatial index structure containing the actual indexed records.

The following figure depicts a spatial index in relation to the user data. The records are represented as r1, r2, and so on. The records are grouped into splits (Split 1, Split 2, Split 3, Split n). Each split has a Key-Value pair where the key identifies the split and the value identifies an Rtree index on the records in that split.

2.8.2.1 Spatial Indexing Class Structure

Records in a spatial index are represented using the class oracle.spatial.hadoop.vector.RecordInfo. A RecordInfo typically contains a subset of the original record data and a way to locate the record in the file where it is stored. The specific RecordInfo data depends on two things:

  • InputFormat used to read the data

  • RecordInfoProvider implementation, which provides the record's data

The fields contained within a RecordInfo:

  • Id: Text field with the record Id.

  • Geometry: JGeometry field with the record geometry.

  • Extra fields: Additional optional fields of the record can be added as name-value pairs. The values are always represented as text.

  • Start offset: The position of the record in a file as a byte offset. This value depends on the InputFormat used to read the original data.

  • Length: The original record length in bytes.

  • Path: The file path can be added optionally. It is optional because the file path can be derived from the spatial index entry key. However, to add the path to the RecordInfo instances when a spatial index is created, set the configuration property oracle.spatial.recordInfo.includePathField to true (see the sketch after this list).
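
The following one-line sketch shows that setting; conf is assumed to be the JobConf used for the spatial indexing job, as in the other examples in this section.

//include the source file path in each RecordInfo stored in the spatial index
conf.setBoolean("oracle.spatial.recordInfo.includePathField", true);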

2.8.2.2 Configuration for Creating a Spatial Index

A spatial index is created using a combination of FileSplitInputFormat, SpatialIndexingMapper, InputFormat, and RecordInfoProvider, where the last two are provided by the user. The following code example shows part of the configuration needed to run a job that creates a spatial index for the data located in the HDFS folder /user/data.

//input
 
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, new Path("/user/data"));
FileSplitInputFormat.setInternalInputFormatClass(conf, GeoJsonInputFormat.class);
FileSplitInputFormat.setRecordInfoProviderClass(conf, GeoJsonRecordInfoProvider.class);

//output
 
conf.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/user/data_spatial_index"));
 
//mapper
 
conf.setMapperClass(SpatialIndexingMapper.class); 
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RTreeWritable.class);

In this example,

  • The FileSplitInputFormat is set as the job InputFormat. FileSplitInputFormat is a subclass of CompositeInputFormat (WrapperInputFormat in the new Hadoop API version), an abstract class that uses another InputFormat implementation (internalInputFormat) to read the data. The internal InputFormat and the RecordInfoProvider implementations are specified by the user and they are set to GeoJsonInputFormat and GeoJsonRecordInfoProvider, respectively.

  • The MapFileOutputFormat is set as the OutputFormat in order to generate a MapFile.

  • The mapper is set to SpatialIndexingMapper. The mapper output key and value types are Text (split identifiers) and RTreeWritable (the actual spatial indexes).

  • No reducer class is specified so it runs with the default reducer. The reduce phase is needed to sort the output MapFile keys.

Alternatively, this configuration can be set more easily by using the oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing class. SpatialIndexing is a job driver that creates a spatial index. In the following example, a SpatialIndexing instance is created, set up, and used to add the settings to the job configuration by calling the configure() method. Once the configuration has been set, the job is launched.

SpatialIndexing<LongWritable, Text> spatialIndexing = new SpatialIndexing<LongWritable, Text>();
 
//path to input data
 
spatialIndexing.setInput("/user/data");
 
//path of the spatial index to be generated
 
spatialIndexing.setOutput("/user/data_spatial_index");
 
//input format used to read the data
 
spatialIndexing.setInputFormatClass(TextInputFormat.class);
 
//record info provider used to extract records information
 
spatialIndexing.setRecordInfoProviderClass(TwitterLogRecordInfoProvider.class);
 
//add the spatial indexing configuration to the job configuration
 
spatialIndexing.configure(jobConf);
 
//run the job
 
JobClient.runJob(jobConf);

2.8.2.3 Spatial Index Metadata

A metadata file is generated for every spatial index that is created. The spatial index metadata can be used to quickly find information related to a spatial index, such as the number of indexed records, the minimum bounding rectangle (MBR) of the indexed data, and the paths of both the spatial index and the indexed source data. The spatial index metadata can be retrieved using the spatial index name.

A spatial index metadata file contains the following information:

  • Spatial index name

  • Path to the spatial index

  • Number of indexed records

  • Number of local indexes

  • Extra fields contained in the indexed records

  • Geometry layer information, such as the SRID, dimensions, tolerance, dimension boundaries, and whether the geometries are geodetic or not

  • The following information for each of the local spatial index files: path to the indexed data, path to the local index, and MBR of the indexed data

The following metadata properties can be set when creating a spatial index using the SpatialIndexing class:

  • indexName: Name of the spatial index. If not set, the output folder name is used.

  • metadataDir: Path to the directory where the metadata file will be stored. By default, it will be stored in the following path relative to the user directory: oracle_spatial/index_metadata. If the user is hdfs, it will be /user/hdfs/oracle_spatial/index_metadata.

  • overwriteMetadata: If set to true, then when a spatial index metadata file already exists for a spatial index with the same indexName in the current metadataDir, the spatial index metadata will be overwritten. If set to false and if a spatial index metadata file already exists for a spatial index with the same indexName in the current metadataDir, then an error is raised.

The following example sets the metadata directory and spatial index name, and specifies to overwrite any existing metadata if the index already exists:

spatialIndexing.setMetadataDir("/user/hdfs/myIndexMetadataDir");
spatialIndexing.setIndexName("testIndex");
spatialIndexing.setOverwriteMetadata(true);

An existing spatial index can be passed to other jobs by specifying only the indexName and optionally the indexMetadataDir where the index metadata can be found. When the index name is provided, there is no need to specify the spatial index path and the input format.

The following job drivers accept the indexName as a parameter:

  • oracle.spatial.hadoop.vector.mapred.job.Categorization

  • oracle.spatial.hadoop.vector.mapred.job.SpatialFilter

  • oracle.spatial.hadoop.vector.mapred.job.Binning

  • Any driver that accepts oracle.spatial.hadoop.vector.InputDataSet, such as SpatialJoin and Partitioning

If the index name is not found in the indexMetadataDir path, an error is thrown indicating that the spatial index could not be found.

The following example shows a spatial index being set as the input data set for a binning job:

Binning binning = new Binning();
binning.setIndexName("indexExample");
binning.setIndexMetadataDir("indexMetadataDir");

2.8.2.4 Input Formats for a Spatial Index

An InputFormat must meet the following requisites to be supported:

  • It must be a subclass of FileInputFormat.

  • The getSplits() method must return either FileSplit or CombineFileSplit split types.

  • For the old Hadoop API, the RecordReader’s getPos() method must return the current position to track back a record in the spatial index to its original record in the user file. If the current position is not returned, then the original record cannot be found using the spatial index.

    However, the spatial index still can be created and used in operations that do not require the original record to be read. For example, additional fields can be added as extra fields to avoid having to read the whole original record.

    Note:

    The spatial indexes are created for each split as returned by the getSplits() method. When the spatial index is used for filtering (see Spatial Filtering), it is recommended to use the same InputFormat implementation as the one used to create the spatial index, to ensure that the split indexes can be found.

The getPos() method has been removed from the new Hadoop API; however, org.apache.hadoop.mapreduce.lib.input.TextInputFormat and CombineTextInputFormat are supported, and it is still possible to get the record start offsets.

Other input formats from the new API are supported, but the record start offsets will not be contained in the spatial index. Therefore, it is not possible to find the original records. The requirements for a new API input format are the same as for the old API; however, they must be translated to the new API's FileInputFormat, FileSplit, and CombineFileSplit.

2.8.2.5 Support for GeoJSON and Shapefile Formats

The Vector API comes with InputFormat and RecordInfoProvider implementations for GeoJSON and Shapefile file formats.

The following InputFormat/RecordInfoProvider pairs can be used to read and interpret GeoJSON and ShapeFiles, respectively:

oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat / oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider

oracle.spatial.hadoop.vector.shapefile.mapred.ShapeFileInputFormat / oracle.spatial.hadoop.vector.shapefile.ShapeFileRecordInfoProvider

More information about the usage and properties is available in the Javadoc.

2.8.2.6 Removing a Spatial Index

A previously generated spatial index can be removed by executing the following.

oracle.spatial.hadoop.vector.util.Tools removeSpatialIndex indexName=<INDEX_NAME> [indexMetadataDir=<PATH>] [removeIndexFiles=<true|false*>]

Where:

  • indexName: Name of a previously generated index.

  • indexMetadataDir (optional): Path to the index metadata directory. If not specified, the following path relative to the user directory will be used: oracle_spatial/index_metadata

  • removeIndexFiles (optional): true if generated index map files need to be removed in addition to the index metadata file. By default, it is false.

2.8.3 Using MVSuggest

MVSuggest can be used at the time of spatial indexing to get an approximate location for records that do not have geometry but have some text field. This text field can be used to determine the record location. The geometry returned by MVSuggest is used to include the record in the spatial index.

Because it is important to know the field containing the search text for every record, the RecordInfoProvider implementation must also implement LocalizableRecordInfoProvider. Alternatively, the configuration parameter oracle.spatial.recordInfo.locationField can be set with the name of the field containing the search text. For more information, see the Javadoc for LocalizableRecordInfoProvider.
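
For example, the following sketch sets that property on the job configuration; the field name tweet_place is a hypothetical extra field of the input records.

//name of the record field whose text is sent to MVSuggest to resolve a location
conf.set("oracle.spatial.recordInfo.locationField", "tweet_place");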

A standalone version of MVSuggest is shipped with the Vector API and it can be used in some jobs that accept the MVSConfig as an input parameter.

The following job drivers can work with MVSuggest and all of them have the setMVSConfig() method which accepts an instance of MVSConfig:

  • oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing: has the option of using MVSuggest to get approximate spatial location for records which do not contain geometry.

  • oracle.spatial.hadoop.vector.mapred.job.Categorization: MVSuggest can be used to assign a record to a specific feature in a layer, for example, the feature California in the USA states layer.

  • oracle.spatial.hadoop.vector.mapred.job.SuggestService: A simple job that generates a file containing a search text and its match per input record.

The MVSuggest configuration is passed to a job using the MVSConfig or the LocalMVSConfig classes. The basic MVSuggest properties are:

  • serviceLocation: The minimum property required in order to use MVSuggest. It contains the path or URL where the MVSuggest directory is located or, in the case of a URL, where the MVSuggest service is deployed.

  • serviceInterfaceType: The type of MVSuggest implementation used. It can be LOCAL (default) for a standalone version or WEB for the web service version.

  • matchLayers: An array of layer names used to perform the searches.

When using the standalone version of MVSuggest, you must specify an MVSuggest directory or repository as the serviceLocation. An MVSuggest directory must have the following structure:

mvsuggest_config.json
repository folder
   one or more layer template files in .json format
   optionally, a _config_ directory
   optionally, a _geonames_ directory

The examples folder comes with many layer template files and a _config_ directory with the configuration for each template.

It is possible to set the repository folder (the one that contains the templates) as the mvsLocation instead of the whole MVSuggest directory. In order to do that, the class LocalMVSConfig can be used instead of MVSConfig and the repositoryLocation property must be set to true as shown in the following example:

LocalMVSConfig lmvsConf = new LocalMVSConfig();
lmvsConf.setServiceLocation("file:///home/user/mvs_dir/repository/");
lmvsConf.setRepositoryLocation(true);
lmvsConf.setPersistentServiceLocation("/user/hdfs/hdfs_mvs_dir");
spatialIndexingJob.setMvsConfig(lmvsConf);

The preceding example sets a repository folder as the MVS service location. setRepositoryLocation is set to true to indicate that the service location is a repository instead of the whole MVSuggest directory. When the job runs, a whole MVSuggest directory is created using the given repository location; the repository is indexed and placed in a temporary folder until the job finishes. The previously indexed MVSuggest directory can be persisted so it can be used later. The preceding example saves the generated MVSuggest directory in the HDFS path /user/hdfs/hdfs_mvs_dir. Use the MVSDirectory if the MVSuggest directory already exists.

2.8.4 Spatial Filtering

Once the spatial index has been generated, it can be used to spatially filter the data. The filtering is performed before the data reaches the mapper and while it is being read. The following sample code example demonstrates how the SpatialFilterInputFormat is used to spatially filter the data.

//set input path and format
 
FileInputFormat.setInputPaths(conf, new Path("/user/data/"));
conf.setInputFormat(SpatialFilterInputFormat.class);
 
//set internal input format
 
SpatialFilterInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);
if( spatialIndexPath != null )  
{
 
     //set the path to the spatial index and put it in the distributed cache
 
     boolean useDistributedCache = true;
     SpatialFilterInputFormat.setSpatialIndexPath(conf, spatialIndexPath, useDistributedCache);
} 
else 
{
     //as no spatial index is used a RecordInfoProvider is needed
 
     SpatialFilterInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);
}
 
//set spatial operation used to filter the records
 
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setJsonQueryWindow("{\"type\":\"Polygon\", \"coordinates\":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997 , -106.64595, 25.83997]]}");
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);

SpatialFilterInputFormat has to be set as the job's InputFormat. The InputFormat that actually reads the data must be set as the internal InputFormat. In this example, the internal InputFormat is TextInputFormat.

If a spatial index is specified, it is used for filtering. Otherwise, a RecordInfoProvider must be specified in order to get the records' geometries, in which case the filtering is performed record by record.

As a final step, the spatial operation and query window to perform the spatial filter are set. It is recommended to use the same internal InputFormat implementation used when the spatial index was created or, at least, an implementation that uses the same criteria to generate the splits. For details see "Input Formats for a Spatial Index."

If simple spatial filtering needs to be performed (that is, only retrieving records that interact with a query window), the built-in job driver oracle.spatial.hadoop.vector.mapred.job.SpatialFilter can be used instead. This job driver accepts indexed or non-indexed input and a SpatialOperationConfig to perform the filtering.
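
The following fragment is a minimal sketch of using the SpatialFilter job driver with a previously created spatial index. The setter names (setIndexName(), setOutput(), setSpatialOperationConfig(), configure()) are assumed to follow the same pattern as the other job drivers shown in this chapter; check the Javadoc for the exact method names.

SpatialFilter spatialFilter = new SpatialFilter();

//use a previously created spatial index as the input (assumed setter, as in Binning)
spatialFilter.setIndexName("indexExample");

//path where the filtered records will be written (assumed setter)
spatialFilter.setOutput("/user/data_filtered");

//spatial operation used to filter the records
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setJsonQueryWindow("{\"type\":\"Polygon\", \"coordinates\":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997 , -106.64595, 25.83997]]}");
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
spatialFilter.setSpatialOperationConfig(spatialOpConf);

//add the settings to the job configuration (jobConf is an org.apache.hadoop.mapred.JobConf) and run the job
spatialFilter.configure(jobConf);
JobClient.runJob(jobConf);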

2.8.4.1 Filtering Records

The following steps are executed when records are filtered using the SpatialFilterInputFormat and a spatial index.

  1. The SpatialFilterInputFormat getRecordReader() method is called when the mapper requests a RecordReader for the current split.

  2. The spatial index for the current split is retrieved.

  3. A spatial query is performed over the records contained in it using the spatial index.

    As a result, the ranges in the split that contain records meeting the spatial filter are known. For example, if a split goes from the file position 1000 to 2000, upon executing the spatial filter it can be determined that records that fulfill the spatial condition are in the ranges 1100-1200, 1500-1600 and 1800-1950. So the result of performing the spatial filtering at this stage is a set of smaller splits within the original split.

  4. An internal InputFormat RecordReader is requested for every small split from the resulting split subset.

  5. A RecordReader is returned to the caller mapper. The returned RecordReader is actually a wrapper RecordReader with one or more RecordReaders returned by the internal InputFormat.

  6. Every time the mapper calls the RecordReader, the call to the next() method to read a record is delegated to the internal RecordReader.

These steps are shown in the following spatial filter interaction diagram.

2.8.4.2 Filtering Using the Input Format

A previously generated Spatial Index can be read using the input format implementation oracle.spatial.hadoop.vector.mapred.input.SpatialIndexInputFormat (or its new Hadoop API equivalent with the mapreduce package instead of mapred). SpatialIndexInputFormat is used just like any other FileInputFormat subclass in that it takes an input path and it is set as the job’s input format. The key and values returned are the id (Text) and record information (RecordInfo) of the records stored in the spatial index.

Additionally, a spatial filter operation can be performed by specifying a spatial operation configuration to the input format, so that only the records matching some spatial interaction are returned to the mapper. The following example shows how to configure a job to read a spatial index and retrieve all the records that are inside a specific area.

JobConf conf = new JobConf();
conf.setMapperClass(MyMapper.class);
conf.setInputFormat(SpatialIndexInputFormat.class);
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setQueryWindow(JGeometry.createLinearPolygon(new double[]{47.70, -124.28, 47.70,  -95.12, 35.45, -95.12, 35.45, -124.28, 47.70, -124.28}, 2, 8307));
SpatialIndexInputFormat.setFilterSpatialOperationConfig(spatialOpConf, conf);

The mapper in the preceding example can add a nonspatial filter by using the RecordInfo extra fields, as shown in the following example.

public class MyMapper extends MapReduceBase implements Mapper<Text, RecordInfo, Text, RecordInfo>{
        @Override
        public void map(Text key, RecordInfo value, OutputCollector<Text, RecordInfo> output, Reporter reporter)
                        throws IOException {
                if( Integer.valueOf(value.getField("followers_count")) > 0){
                        output.collect(key, value);
                }
        }
}

2.8.5 Classifying Data Hierarchically

The Vector Analysis API provides a way to classify the data into hierarchical entities. For example, in a given set of catalogs with a defined level of administrative boundaries such as continents, countries and states, it is possible to join a record of the user data to a record of each level of the hierarchy data set. The following example generates a summary count for each hierarchy level, containing the number of user records per continent, country, and state or province:

Categorization catJob = new Categorization();
//set a spatial index as the input
                                 
catJob.setIndexName("indexExample");
 
//set the job's output
                                 
catJob.setOutput("hierarchy_count");
                                 
//set HierarchyInfo implementation which describes the world administrative boundaries hierarchy
                        
catJob.setHierarchyInfoClass( WorldDynaAdminHierarchyInfo.class );
                                 
//specify the paths of the hierarchy data
                                 
Path[] hierarchyDataPaths = {
                                new Path("file:///home/user/catalogs/world_continents.json"), 
                                new Path("file:///home/user/catalogs/world_countries.json"), 
                                new Path("file:///home/user/catalogs/world_states_provinces.json")};
catJob.setHierarchyDataPaths(hierarchyDataPaths);
                                 
//set the path where the index for the previous hierarchy data will be generated
                                 
catJob.setHierarchyIndexPath(new Path("/user/hierarchy_data_index/"));
                                 
//setup the spatial operation which will be used to join records from the two datasets (spatial index and hierarchy data).
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();                     
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
catJob.setSpatialOperationConfig(spatialOpConf);
                                 
//add the previous setup to the job configuration
                                 
catJob.configure(conf);
                                 
//run the job 
RunningJob rj = JobClient.runJob(conf);

The preceding example uses the Categorization job driver. The configuration can be divided into the following categories:

  • Input data: A previously generated spatial index (received as the job input).

  • Output data: A folder that contains the summary counts for each hierarchy level.

  • Hierarchy data configuration: This contains the following:

    • HierarchyInfo class: This is an implementation of the HierarchyInfo class, in charge of describing the current hierarchy data. It provides the number of hierarchy levels, level names, and the data contained at each level.

    • Hierarchy data paths: This is the path to each one of the hierarchy catalogs. These catalogs are read by the HierarchyInfo class.

    • Hierarchy index path: This is the path where the hierarchy data index is stored. Hierarchy data needs to be preprocessed to know the parent-child relationships between hierarchy levels. This information is processed once and saved at the hierarchy index, so it can be used later by the current job or even by any other jobs.

  • Spatial operation configuration: This is the spatial operation to be performed between records of the user data and the hierarchy data in order to join both datasets. The parameters to set here are the Spatial Operation type (IsInside), SRID (8307), Tolerance (0.5 meters), and whether the geometries are Geodetic (true).

Internally, the Categorization.configure() method sets the mapper and reducer to be SpatialHierarchicalCountMapper and SpatialHierarchicalCountReducer, respectively. SpatialHierarchicalCountMapper's output key is a hierarchy entry identifier in the form hierarchy_level + hierarchy_entry_id. The mapper output value is a single count for each output key. The reducer sums up all the counts for each key.

Note:

The entire hierarchy data may be read into memory and hence the total size of all the catalogs is expected to be significantly less than the user data. The hierarchy data size should not be larger than a couple of gigabytes.

If you want another type of output instead of counts (for example, a list of user records according to the hierarchy entry), the SpatialHierarchicalJoinMapper can be used. The SpatialHierarchicalJoinMapper output value is a RecordInfo instance, which can be gathered in a user-defined reducer to produce a different output. The following user-defined reducer generates a MapFile for each hierarchy level using the MultipleOutputs class. Each MapFile has the hierarchy entry ids as keys and ArrayWritable instances containing the matching records for each hierarchy entry as values. The following is a user-defined reducer that returns a list of records by hierarchy entry:

public class HierarchyJoinReducer extends MapReduceBase implements Reducer<Text, RecordInfo, Text, ArrayWritable> {

    private MultipleOutputs mos = null;
    private Text outKey = new Text();
    private ArrayWritable outValue = new ArrayWritable(RecordInfo.class);

    @Override
    public void configure(JobConf conf)
    {
        super.configure(conf);

        //use MultipleOutputs to generate different outputs for each hierarchy level

        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<RecordInfo> values,
                       OutputCollector<Text, ArrayWritable> output, Reporter reporter)
                       throws IOException
    {
        //Get the hierarchy level name and the hierarchy entry id from the key

        String[] keyComponents = HierarchyHelper.getMapRedOutputKeyComponents(key.toString());
        String hierarchyLevelName = keyComponents[0];
        String entryId = keyComponents[1];
        List<Writable> records = new LinkedList<Writable>();

        //load the values to memory to fill output ArrayWritable

        while(values.hasNext())
        {
            RecordInfo recordInfo = new RecordInfo(values.next());
            records.add(recordInfo);
        }

        if(!records.isEmpty())
        {
            //set the hierarchy entry id as key

            outKey.set(entryId);

            //list of records matching the hierarchy entry id

            outValue.set(records.toArray(new Writable[]{}));

            //get the named output for the given hierarchy level

            hierarchyLevelName = FileUtils.toValidMONamedOutput(hierarchyLevelName);
            OutputCollector<Text, ArrayWritable> mout = mos.getCollector(hierarchyLevelName, reporter);

            //Emit key and value

            mout.collect(outKey, outValue);
        }
    }

    @Override
    public void close() throws IOException
    {
        mos.close();
    }
}

The same reducer can be used in a job with the following configuration to generate a list of records according to the hierarchy levels:

JobConf conf = new JobConf(getConf());
 
//input path
 
FileInputFormat.setInputPaths(conf, new Path("/user/data_spatial_index/") );
 
//output path
 
FileOutputFormat.setOutputPath(conf, new Path("/user/records_per_hier_level/") );
 
//input format used to read the spatial index
 
conf.setInputFormat( SequenceFileInputFormat.class);
 
//output format: the real output format will be configured for each multiple output later
 
conf.setOutputFormat(NullOutputFormat.class);
 
//mapper
 
conf.setMapperClass( SpatialHierarchicalJoinMapper.class );
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(RecordInfo.class);
 
//reducer
 
conf.setReducerClass( HierarchyJoinReducer.class );
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(ArrayWritable.class);
 
////////////////////////////////////////////////////////////////////
 
//hierarchy data setup
 
//set HierarchyInfo class implementation
 
conf.setClass(ConfigParams.HIERARCHY_INFO_CLASS, WorldAdminHierarchyInfo.class, HierarchyInfo.class);
 
//paths to hierarchical catalogs
 
Path[] hierarchyDataPaths = {
new Path("file:///home/user/catalogs/world_continents.json"), 
new Path("file:///home/user/catalogs/world_countries.json"), 
new Path("file:///home/user/catalogs/world_states_provinces.json")};
 
//path to hierarchy index
 
Path hierarchyDataIndexPath = new Path("/user/hierarchy_data_index/");
 
//instantiate the HierarchyInfo class to index the data if needed.
 
HierarchyInfo hierarchyInfo = new WorldAdminHierarchyInfo();
hierarchyInfo.initialize(conf);
 
//Create the hierarchy index if needed. If it already exists, it will only load the hierarchy index to the distributed cache
 
HierarchyHelper.setupHierarchyDataIndex(hierarchyDataPaths, hierarchyDataIndexPath, hierarchyInfo, conf);
 
///////////////////////////////////////////////////////////////////
 
//setup the multiple named outputs:
 
int levels = hierarchyInfo.getNumberOfLevels();
for(int i=1; i<=levels; i++)
{
    String levelName = hierarchyInfo.getLevelName(i);
 
    //the hierarchy level name is used as the named output
 
    String namedOutput = FileUtils.toValidMONamedOutput(levelName);
    MultipleOutputs.addNamedOutput(conf, namedOutput, MapFileOutputFormat.class, Text.class, ArrayWritable.class);
}
 
//finally, setup the spatial operation
 
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();                     
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
spatialOpConf.store(conf);
 
//run job
 
JobClient.runJob(conf);

Suppose the output value should be an array of record ids instead of an array of RecordInfo instances. In that case, only a couple of changes to the previously defined reducer are needed.

The line where outValue is declared, in the previous example, changes to:

private ArrayWritable outValue = new ArrayWritable(Text.class);

The loop where the input values are retrieved, in the previous example, is changed so that only the record ids are kept instead of the whole records:

while(values.hasNext())
{
  records.add( new Text(values.next().getId()) );
}

Although only the record id is needed, the mapper emits the whole RecordInfo instance. Therefore, a better approach is to change the mapper's output value. The mapper's output value can be changed by extending AbstractSpatialHierarchicalMapper. In the following example, the mapper emits only the record ids instead of the whole RecordInfo instance every time a record matches some of the hierarchy entries:

public class IdSpatialHierarchicalMapper extends AbstractSpatialHierarchicalMapper< Text > 
{
       Text outValue = new Text();
 
       @Override
       protected Text getOutValue(RecordInfo matchingRecordInfo) 
       {
 
         //the out value is the record's id
 
         outValue.set(matchingRecordInfo.getId());
         return outValue;
       }
}

2.8.5.1 Changing the Hierarchy Level Range

By default, all the hierarchy levels defined in the HierarchyInfo implementation are loaded when performing the hierarchy search. The range of hierarchy levels loaded is from level 1 (parent level) to the level returned by the HierarchyInfo.getNumberOfLevels() method. The following example shows how to set up a job to load only levels 2 and 3.

conf.setInt( ConfigParams.HIERARCHY_LOAD_MIN_LEVEL, 2);
conf.setInt( ConfigParams.HIERARCHY_LOAD_MAX_LEVEL, 3);

Note:

These parameters are useful when only a subset of the hierarchy levels is required and when you do not want to modify the HierarchyInfo implementation.

2.8.5.2 Controlling the Search Hierarchy

The search is always performed only at the bottom hierarchy level (the higher level number). If a user record matches some hierarchy entry at this level, then the match is propagated to the parent entry in upper levels. For example, if a user record matches Los Angeles, then it also matches California, USA, and North America. If there are no matches for a user record at the bottom level, then the search does not continue into the upper levels.

This behavior can be modified by setting the configuration parameter ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS to true. Then, if a search at the bottom hierarchy level results in some unmatched user records, the search continues into the upper levels until the top hierarchy level is reached or there are no more user records to join. This behavior can be used when the geometries of parent levels do not perfectly enclose the geometries of their child entries.
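
For example, assuming the parameter is a boolean property like the level-range parameters shown in the previous topic, the behavior could be enabled as follows:

//search upper hierarchy levels for records that did not match the bottom level
conf.setBoolean(ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS, true);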

2.8.5.3 Using MVSuggest to Classify the Data

MVSuggest can be used instead of the spatial index to classify data. For this case, an implementation of LocalizableRecordInfoProvider must be known and sent to MVSuggest to perform the search. See the information about LocalizableRecordInfoProvider.

In the following example, the program option is changed from spatial to MVS. The input is the path to the user data instead of the spatial index. The InputFormat used to read the user record and an implementation of LocalizableRecordInfoProvider are specified. The MVSuggest service configuration is set. Notice that there is no spatial operation configuration needed in this case.

Categorization<LongWritable, Text> hierCount = new Categorization<LongWritable, Text>();

// the input path is the user's data

hierCount.setInput("/user/data/");

// set the job's output

hierCount.setOutput("/user/mvs_hierarchy_count");

// set HierarchyInfo implementation which describes the world
// administrative boundaries hierarchy

hierCount.setHierarchyInfoClass(WorldDynaAdminHierarchyInfo.class);

// specify the paths of the hierarchy data

Path[] hierarchyDataPaths = { new Path("file:///home/user/catalogs/world_continents.json"),
                new Path("file:///home/user/catalogs/world_countries.json"),
                new Path("file:///home/user/catalogs/world_states_provinces.json") };
hierCount.setHierarchyDataPaths(hierarchyDataPaths);

// set the path where the index for the previous hierarchy data will be
// generated

hierCount.setHierarchyIndexPath(new Path("/user/hierarchy_data_index/"));

// No spatial operation configuration is needed, Instead, specify the
// InputFormat used to read the user's data and the
// LocalizableRecordInfoProvider class.

hierCount.setInputFormatClass(TextInputFormat.class);
hierCount.setRecordInfoProviderClass(MyLocalizableRecordInfoProvider.class);

// finally, set the MVSuggest configuration

LocalMVSConfig lmvsConf = new LocalMVSConfig();
lmvsConf.setServiceLocation("file:///home/user/mvs_dir/oraclemaps_pub");
lmvsConf.setRepositoryLocation(true);
hierCount.setMvsConfig(lmvsConf);

// add the previous setup to the job configuration
hierCount.configure(conf);

// run the job

JobClient.runJob(conf);

Note:

When using MVSuggest, the hierarchy data files must be the same as the layer template files used by MVSuggest. The hierarchy level names returned by the HierarchyInfo.getLevelNames() method are used as the matching layers by MVSuggest.

2.8.6 Generating Buffers

The API provides a mapper to generate a buffer around each record's geometry. The following code sample shows how to run a job to generate a buffer for each record geometry by using the BufferMapper class.

//configure input
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, "/user/waterlines/");
FileSplitInputFormat.setRecordInfoProviderClass(conf, GeoJsonRecordInfoProvider.class);
 
//configure output
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(conf, new Path("/user/data_buffer/"));   

//set the BufferMapper as the job mapper
conf.setMapperClass(BufferMapper.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(RecordInfo.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RecordInfo.class);
 
//set the width of the buffers to be generated
conf.setDouble(ConfigParams.BUFFER_WIDTH, 0.2);
 
//run the job
JobClient.runJob(conf);

BufferMapper generates a buffer for each input record containing a geometry. The output key and values are the record id and a RecordInfo instance containing the generated buffer. The resulting file is a Hadoop MapFile containing the mapper output key and values. If necessary, the output format can be modified by implementing a reducer that takes the mapper’s output keys and values, and outputs keys and values of a different type.
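
For example, the following reducer is a minimal sketch of such a post-processing step; it assumes that a plain-text representation of each buffered record is acceptable and simply re-emits the mapper output with a Text value type.

public class BufferToTextReducer extends MapReduceBase implements Reducer<Text, RecordInfo, Text, Text> {

    private Text outValue = new Text();

    @Override
    public void reduce(Text key, Iterator<RecordInfo> values,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while(values.hasNext()){
            //emit the record id and a textual representation of the buffered record
            outValue.set(values.next().toString());
            output.collect(key, outValue);
        }
    }
}

To use a reducer like this one, the job would also need conf.setReducerClass(BufferToTextReducer.class), conf.setOutputValueClass(Text.class), and a text-based output format such as TextOutputFormat.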

BufferMapper accepts the following parameters:

Parameter                    ConfigParam constant  Type    Description
oracle.spatial.buffer.width  BUFFER_WIDTH          double  The buffer width
oracle.spatial.buffer.sma    BUFFER_SMA            double  The semi major axis for the datum used in the coordinate system of the input
oracle.spatial.buffer.iFlat  BUFFER_IFLAT          double  The flattening value
oracle.spatial.buffer.arcT   BUFFER_ARCT           double  The arc tolerance used for geodetic densification
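
For geodetic input data, the datum-related parameters can be set on the job configuration along with the buffer width, as in the following sketch. The numeric values are illustrative only; use the values that match the datum of your input coordinate system.

//buffer width
conf.setDouble(ConfigParams.BUFFER_WIDTH, 0.2);
//datum parameters (illustrative WGS84-like semi major axis and inverse flattening)
conf.setDouble(ConfigParams.BUFFER_SMA, 6378137.0);
conf.setDouble(ConfigParams.BUFFER_IFLAT, 298.257223563);
//arc tolerance used for geodetic densification
conf.setDouble(ConfigParams.BUFFER_ARCT, 0.05);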

2.8.7 Spatial Binning

The Vector API provides the class oracle.spatial.hadoop.vector.mapred.job.Binning to perform spatial binning over a spatial data set. The Binning class is a MapReduce job driver that takes an input data set (which can be spatially indexed or not), assigns each record to a bin, and generates a file containing all the bins (which contain one or more records and optionally aggregated values).

A binning job can be configured as follows:

  1. Specify the data set to be binned and the way it will be read and interpreted (InputFormat and RecordInfoProvider), or, specify the name of an existing spatial index.

  2. Set the output path.

  3. Set the grid MBR, that is, the rectangular area to be binned.

  4. Set the shape of the bins: RECTANGLE or HEXAGON.

  5. Specify the bin (cell) size. For rectangles, specify the width and height. For hexagon-shaped cells, specify the hexagon width. Each hexagon is always drawn with only one of its vertices as the base.

  6. Optionally, pass a list of numeric field names to be aggregated per bin.

The resulting output is a text file where each record is a bin (cell) in JSON format and contains the following information:

  • id: the bin id

  • geom: the bin geometry; always a polygon that is a rectangle or a hexagon

  • count: the number of points contained in the bin

  • aggregated fields: zero or more aggregated fields

The following example configures and runs a binning job:

//create job driver
Binning<LongWritable, Text> binJob = new Binning<LongWritable, Text>();
//setup input
binJob.setInput("/user/hdfs/input/part*");
binJob.setInputFormatClass(GeoJsonInputFormat.class);
binJob.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
//set binning output
binJob.setOutput("/user/hdfs/output/binning");
//create a binning configuration to produce rectangular cells
BinningConfig binConf = new BinningConfig();
binConf.setShape(BinShape.RECTANGLE);
//set the bin size
binConf.setCellHeight(0.2);
binConf.setCellWidth(0.2);
//specify the area to be binned
binConf.setGridMbr(new double[]{-50,10,50,40});
binJob.setBinConf(binConf);
//save configuration
binJob.configure(conf);
//run job
JobClient.runJob(conf);

2.8.8 Spatial Clustering

The job driver class oracle.spatial.hadoop.vector.cluster.kmeans.mapred.KMeansClustering can be used to find spatial clusters in a data set. This class uses a distributed version of the K-means algorithm.

Required parameters:

  • Path to the input data set, the InputFormat class used to read the input data set and the RecordInfoProvider used to extract the spatial information from records.

  • Path where the results will be stored.

  • Number of clusters to be found.

Optional parameters:

  • Maximum number of iterations before the algorithm finishes.

  • Criterion function used to determine when the clusters converge. It is given as an implementation of oracle.spatial.hadoop.vector.cluster.kmeans.CriterionFunction. The Vector API contains the following criterion function implementations: SquaredErrorCriterionFunction and EuclideanDistanceCriterionFunction.

  • An implementation of oracle.spatial.hadoop.vector.cluster.kmeans.ClusterShapeGenerator, which is used to generate a geometry for each cluster. The default implementation is ConvexHullClusterShapeGenerator and generates a convex hull for each cluster. If no cluster geometry is needed, the DummyClusterShapeGenerator class can be used.

  • The initial k cluster points as a sequence of x,y ordinates. For example: x1,y1,x2,y2,…xk,yk

The result is a file named clusters.json, which contains an array of clusters called features. Each cluster contains the following information:

  • id: Cluster id

  • memberCount: Number of elements in the cluster

  • geom: Cluster geometry

The following example runs the KMeansClustering algorithm to find 5 clusters. By default, the SquaredErrorCriterionFunction and ConvexHullClusterShapeGenerator are used, so you do not need to set these classes explicitly. Also note that runIterations() is called to run the algorithm; internally, it launches one MapReduce job per iteration. In this example, the number 20 is passed to runIterations() as the maximum number of iterations allowed.

//create the cluster job driver
KMeansClustering<LongWritable, Text> clusterJob = new KMeansClustering<LongWritable, Text>();
//set input properties:
//input dataset path
clusterJob.setInput("/user/hdfs/input/part*");
//InputFormat class
clusterJob.setInputFormatClass(GeoJsonInputFormat.class);
//RecordInfoProvider implementation
clusterJob.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
//specify where the results will be saved
clusterJob.setOutput("/user/hdfs/output/clusters");
//5 cluster will be found
clusterJob.setK(5);
//run the algorithm
success = clusterJob.runIterations(20, conf);

2.8.9 Spatial Join

The spatial join feature allows detecting spatial interactions between records of two different large data sets.

The driver class oracle.spatial.hadoop.vector.mapred.job.SpatialJoin can be used to execute or configure a job to perform a spatial join between two data sets. The job driver takes the following inputs:

  • Input data sets: Two input data sets are expected. Each input data set is represented using the class oracle.spatial.hadoop.vector.InputDataSet, which holds information about where to find and how to read a data set, such as path(s), spatial index, input format, and record info provider used to interpret records from the data set. It also accepts a spatial configuration for the data set.

  • Spatial operation configuration: The spatial operation configuration defines the spatial interaction used to determine if two records are related to each other. It also defines the area to cover (MBR), that is, only records within or intersecting the MBR will be considered in the search.

  • Partitioning result file path: An optional parameter that points to a previously generated partitioning result for both data sets. Data need to be partitioned in order to distribute the work; if this parameter is not provided, a partitioning process will be executed over the input data sets. (See Spatial Partitioning for more information.)

  • Output path: The path where the result file will be written.

The spatial join result is a text file where each line is a pair of records that meet the spatial interaction defined in the spatial operation configuration.

The following table shows the currently supported spatial interactions for the spatial join.

Spatial Operation  Extra Parameters                                                             Type
AnyInteract        None                                                                         (N/A)
IsInside           None                                                                         (N/A)
WithinDistance     oracle.spatial.hadoop.vector.util.SpatialOperationConfig.PARAM_WD_DISTANCE  double

For a WithinDistance operation, the distance parameter can be specified in the SpatialOperationConfig, as shown in the following example:

spatialOpConf.setOperation(SpatialOperation.WithinDistance);
spatialOpConf.addParam(SpatialOperationConfig.PARAM_WD_DISTANCE, 5.0);

The following example runs a Spatial Join job for two input data sets. The first data set, postal boundaries, is specified providing the name of its spatial index. For the second data set, tweets, the path to the file, input format, and record info provider are specified. The spatial interaction to detect is IsInside, so only tweets (points) that are inside a postal boundary (polygon) will appear in the result along with their containing postal boundary.

SpatialJoin spatialJoin = new SpatialJoin();
List<InputDataSet> inputDataSets = new ArrayList<InputDataSet>(2);

// set the spatial index of the 3-digit postal boundaries of the USA as the first input data set 
InputDataSet pbInputDataSet = new InputDataSet();
pbInputDataSet.setIndexName("usa_pcb3_index");

//no input format or record info provider are required here as a spatial index is provided
inputDataSets.add(pbInputDataSet);

// set the tweets data set in GeoJSON format as the second data set 
InputDataSet tweetsDataSet = new InputDataSet();
tweetsDataSet.setPaths(new Path[]{new Path("/user/example/tweets.json")});
tweetsDataSet.setInputFormatClass(GeoJsonInputFormat.class);
tweetsDataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
inputDataSets.add(tweetsDataSet); 
 
//set input data sets
spatialJoin.setInputDataSets(inputDataSets);

//spatial operation configuration
SpatialOperationConfig spatialOpConf = new SpatialOperationConfig();
spatialOpConf.setOperation(SpatialOperation.IsInside);
spatialOpConf.setBoundaries(new double[]{47.70, -124.28, 35.45, -95.12});
spatialOpConf.setSrid(8307);
spatialOpConf.setTolerance(0.5);
spatialOpConf.setGeodetic(true);
spatialJoin.setSpatialOperationConfig(spatialOpConf); 
    
//set output path
spatialJoin.setOutput("/user/example/spatialjoin");

// prepare job
JobConf jobConf = new JobConf(getConf());

//preprocess will partition both data sets as no partitioning result file was specified
spatialJoin.preprocess(jobConf);
spatialJoin.configure(jobConf);
JobClient.runJob(jobConf);

2.8.10 Spatial Partitioning

The partitioning feature is used to spatially partition one or more data sets.

Spatial partitioning consists of dividing the space into multiple rectangles, where each rectangle is intended to contain approximately the same number of points. Eventually these partitions can be used to distribute the work among reducers in other jobs, such as Spatial Join.

The spatial partitioning process is run or configured using the oracle.spatial.hadoop.vector.mapred.job.Partitioning driver class, which accepts the following input parameters:

  • Input data sets: One or more input data sets can be specified. Each input data set is represented using the class oracle.spatial.hadoop.vector.InputDataSet, which holds information about where to find and how to read a data set, such as path(s), spatial index, input format, and record info provider used to interpret records from the data set. It also accepts a spatial configuration for the data set.

  • Sampling ratio: Only a fraction of the entire data set or sets is used to perform the partitioning. The sample ratio is the ratio of the sample size to the whole input data set size. If it is not specified, 10 percent (0.1) of the input data set size is used.

  • Spatial configuration: Defines the spatial properties of the input data sets, such as the SRID. You must specify at least the dimensional boundaries.

  • Output path: The path where the result file will be written.

The generated partitioning result file is in GeoJSON format and contains information for each generated partition, including the partition’s geometry and the number of points contained (from the sample).

The following example partitions a tweets data set. Because the sampling ratio is not provided, 0.1 is used by default.

Partitioning partitioning = new Partitioning();
List<InputDataSet> inputDataSets = new ArrayList<InputDataSet>(1);

//define the input data set
InputDataSet dataSet = new InputDataSet();
dataSet.setPaths(new Path[]{new Path("/user/example/tweets.json")});
dataSet.setInputFormatClass(GeoJsonInputFormat.class);
dataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
inputDataSets.add(dataSet);
partitioning.setInputDataSets(inputDataSets);

//spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
spatialConf.setBoundaries(new double[]{-180,-90,180,90});
partitioning.setSpatialConfig(spatialConf);

//set output
partitioning.setOutput("/user/example/tweets_partitions.json");

//run the partitioning process
partitioning.runFullPartitioningProcess(new JobConf());

2.8.11 RecordInfoProvider

A record read by a MapReduce job from HDFS is represented in memory as a key-value pair using a Java type (typically a Writable subclass), such as LongWritable, Text, ArrayWritable, or some user-defined type. For example, records read using TextInputFormat are represented in memory as LongWritable, Text key-value pairs.

RecordInfoProvider is the component that interprets these in-memory record representations and returns the data needed by the Vector Analysis API. Thus, the API is not tied to any specific format or memory representation.

The RecordInfoProvider interface has the following methods:

  • void setCurrentRecord(K key, V value)

  • String getId()

  • JGeometry getGeometry()

  • boolean getExtraFields(Map<String, String> extraFields)

There is always one RecordInfoProvider instance per InputFormat. The method setCurrentRecord() is called passing the current key-value pair retrieved from the RecordReader. The RecordInfoProvider is then used to get the current record id, geometry, and extra fields. None of these fields are required. Only those records with a geometry participate in the spatial operations. The id is useful for differentiating records in operations such as categorization. The extra fields can be used to store any record information that can be represented as text and that needs to be quickly accessed without reading the original record, or for operations where MVSuggest is used.

Typically, the information returned by RecordInfoProvider is used to populate RecordInfo instances. A RecordInfo can be thought of as a lightweight version of a record; it contains the information returned by the RecordInfoProvider plus information to locate the original record in a file.

2.8.11.1 Sample RecordInfoProvider Implementation

This sample implementation, called JsonRecordInfoProvider, takes text records in JSON format, which are read using TextInputFormat. A sample record is shown here:

{ "_id":"ABCD1234", "location":" 119.31669, -31.21615", "locationText":"Boston, Ma", "date":"03-18-2015", "time":"18:05", "device-type":"cellphone", "device-name":"iPhone"}

When a JsonRecordInfoProvider is instantiated, a JSON ObjectMapper is created. The ObjectMapper is used to parse record values later, when setCurrentRecord() is called. The record key is ignored. The record id, geometry, and one extra field are retrieved from the _id, location, and locationText JSON properties. The geometry is represented as a latitude-longitude pair and is used to create a point geometry using the JGeometry.createPoint() method. The extra field (locationText) is added to the extraFields map, which serves as an out parameter, and true is returned to indicate that an extra field was added.

public class JsonRecordInfoProvider implements RecordInfoProvider<LongWritable, Text> {
private Text value = null;
private ObjectMapper jsonMapper = null;
private JsonNode recordNode = null;

public JsonRecordInfoProvider(){
        //json mapper used to parse all the records
        jsonMapper = new ObjectMapper();
}

@Override
public void setCurrentRecord(LongWritable key, Text value) throws Exception {
        try{
                //parse the current value
                recordNode = jsonMapper.readTree(value.toString());
        }catch(Exception ex){
                recordNode = null;
                throw ex;
        }
}
 
@Override
public String getId() {
        String id = null;
        if(recordNode != null ){
                id = recordNode.get("_id").getTextValue();
        }
        return id;
}
@Override
public JGeometry getGeometry() {
        JGeometry geom = null;
        if(recordNode!= null){
                //location is represented as a lat,lon pair
                String location = recordNode.get("location").getTextValue();
                String[] locTokens = location.split(",");
                double lat = Double.parseDouble(locTokens[0]);
                double lon = Double.parseDouble(locTokens[1]);
                geom =  JGeometry.createPoint( new double[]{lon, lat},  2, 8307);
        }
        return geom;
}
 
@Override
public boolean getExtraFields(Map<String, String> extraFields) {
        boolean extraFieldsExist = false;
        if(recordNode != null) {
                extraFields.put("locationText", recordNode.get("locationText").getTextValue() );
                extraFieldsExist = true;
        }
        return extraFieldsExist;
}
}

2.8.11.2 LocalizableRecordInfoProvider

This interface extends RecordInfoProvider and is used to identify which extra field can be used as the search text when MVSuggest is used.

The only method added by this interface is getLocationServiceField(), which returns the name of the extra field that will be sent to MVSuggest.

In addition, the following is an implementation based on "Sample RecordInfoProvider Implementation." The name returned in this example is locationText, which is the name of the extra field included in the parent class.

public class LocalizableJsonRecordInfoProvider extends JsonRecordInfoProvider implements LocalizableRecordInfoProvider<LongWritable, Text> {
 
@Override
public String getLocationServiceField() {
        return  "locationText";
}
}

An alternative to LocalizableRecordInfoProvider is to set the configuration property oracle.spatial.recordInfo.locationField with the name of the search field, whose value should be sent to MVSuggest. Example: configuration.set(LocalizableRecordInfoProvider.CONF_RECORD_INFO_LOCATION_FIELD, "locationField")

2.8.12 HierarchyInfo

The HierarchyInfo interface is used to describe a hierarchical dataset. An implementation of HierarchyInfo is expected to provide the number, names, and entries of the hierarchy levels of the hierarchy it describes.

The root hierarchy level is always hierarchy level 1. The entries in this level do not have parent entries, and this level is referred to as the top hierarchy level. Child hierarchy levels have higher level values. For example, the levels for a hierarchy composed of continents, countries, and states are 1, 2, and 3, respectively. Entries in the continent layer do not have a parent, but they have child entries in the countries layer. Entries at the bottom level, the states layer, do not have children.

A HierarchyInfo implementation is provided out of the box with the Vector Analysis API. The DynaAdminHierarchyInfo implementation can be used to read and describe the known hierarchy layers in GeoJSON format. A DynaAdminHierarchyInfo can be instantiated and configured, or it can be subclassed. The hierarchy layers to be contained are specified by calling the addLevel() method, which takes the following parameters:

  • The hierarchy level number

  • The hierarchy level name, which must match the file name (without extension) of the GeoJSON file that contains the data. For example, the hierarchy level name for the file world_continents.json must be world_continents, for world_countries.json it is world_countries, and so on.

  • Children join field: This is a JSON property that is used to join entries of the current level with child entries in the lower level. If a null is passed, then the entry id is used.

  • Parent join field: This is a JSON property used to join entries of the current level with parent entries in the upper level. This value is not used for the topmost level, because it has no upper level to join. If the value is set to null for any level greater than 1, an IsInside spatial operation is performed to join parent and child entries. In this scenario, it is assumed that an upper-level geometry entry can contain lower-level entries.

For example, let us assume a hierarchy containing the following levels from the specified layers: 1 - world_continents, 2 - world_countries, and 3 - world_states_provinces. A sample entry from each layer would look like the following:

world_continents:
 {"type":"Feature","_id":"NA","geometry": {"type":"MultiPolygon", "coordinates":[ x,y,x,y,x,y] }"properties":{"NAME":"NORTH AMERICA", "CONTINENT_LONG_LABEL":"North America"},"label_box":[-118.07998,32.21006,-86.58515,44.71352]}

world_countries: {"type":"Feature","_id":"iso_CAN","geometry":{"type":"MultiPolygon","coordinates":[x,y,x,y,x,y]},"properties":{"NAME":"CANADA","CONTINENT":"NA","ALT_REGION":"NA","COUNTRY CODE":"CAN"},"label_box":[-124.28092,49.90408,-94.44878,66.89287]}

world_states_provinces:
{"type":"Feature","_id":"6093943","geometry": {"type":"Polygon", "coordinates":[ x,y,x,y,x,y]},"properties":{"COUNTRY":"Canada", "ISO":"CAN", "STATE_NAME":"Ontario"},"label_box":[-91.84903,49.39557,-82.32462,54.98426]}

A DynaAdminHierarchyInfo can be configured to create a hierarchy with the above layers in the following way:

DynaAdminHierarchyInfo dahi = new DynaAdminHierarchyInfo();
 
dahi.addLevel(1, "world_continents", null /*_id is used by default to join with child entries*/, null /*not needed because there is no upper hierarchy level*/);
 
dahi.addLevel(2, "world_countries", "properties.COUNTRY CODE" /*field used to join with child entries*/, "properties.CONTINENT" /*the value "NA" will be used to find Canada's parent, North America, whose _id field value is also "NA"*/);
 
dahi.addLevel(3, "world_states_provinces", null /*not needed because no child entries are expected*/, "properties.ISO" /*field used to join with parent entries. For Ontario, it is the same value as the properties.COUNTRY CODE field specified for Canada*/);
 
//save the previous configuration to the job configuration
 
dahi.initialize(conf);

A similar configuration can be used to create hierarchies from different layers, such as countries, states and counties, or any other layers with a similar JSON format.

Alternatively, to avoid configuring a hierarchy every time a job is executed, the hierarchy configuration can be enclosed in a DynaAdminHierarchyInfo subclass as in the following example:

public class WorldDynaAdminHierarchyInfo extends DynaAdminHierarchyInfo 
{
       public WorldDynaAdminHierarchyInfo() 
       {
              super();
              addLevel(1, "world_continents", null, null);
              addLevel(2, "world_countries", "properties.COUNTRY CODE", "properties.CONTINENT");
              addLevel(3, "world_states_provinces", null, "properties.ISO");
       }
}

2.8.12.1 Sample HierarchyInfo Implementation

The HierarchyInfo interface contains the following methods, which must be implemented to describe a hierarchy. The methods can be divided into the following three categories:

  • Methods to describe the hierarchy

  • Methods to load data

  • Methods to supply data

Additionally, there is an initialize() method, which can be used to perform any initialization and to save and read data both to and from the job configuration.

void initialize(JobConf conf);       
 
//methods to describe the hierarchy
 
String getLevelName(int level); 
int getLevelNumber(String levelName);
int getNumberOfLevels();
 
//methods to load data
 
void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception;
void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception;
 
//methods to supply data
 
Collection<String> getEntriesIds(int level);
JGeometry getEntryGeometry(int level, String entryId);  
String getParentId(int childLevel, String childId);

The following is a sample HierarchyInfo implementation, which takes the previously mentioned world layers as the hierarchy levels. The first section contains the initialize() method and the methods used to describe the hierarchy. In this case, the initialize() method does nothing. The methods in the following example use the hierarchyLevelNames array to provide the hierarchy description. The instance variables entriesGeoms and entriesParents are arrays of java.util.Map, which contain the entries' geometries and the entries' parents, respectively. The entry ids are used as keys in both cases. Because the array indices are zero-based and the hierarchy levels are one-based, the array indices correlate to the hierarchy levels as array index + 1 = hierarchy level.

public class WorldHierarchyInfo implements HierarchyInfo 
{
 
       private String[] hierarchyLevelNames = {"world_continents", "world_countries", "world_states_provinces"};
       private Map<String, JGeometry>[] entriesGeoms = new Map[3];
       private Map<String, String>[] entriesParents = new Map[3];
 
       @Override
       public void initialize(JobConf conf) 
       {
          //do nothing for this implementation
       }
 
        @Override
        public int getNumberOfLevels() 
        {
          return hierarchyLevelNames.length;
        }
 
        @Override
        public String getLevelName(int level) 
        {
           String levelName = null;
           if(level >=1 && level <= hierarchyLevelNames.length)
           {
             levelName = hierarchyLevelNames[ level - 1];    
           }
         return levelName;
         }

        @Override
        public int getLevelNumber(String levelName) 
        {
           for(int i=0; i< hierarchyLevelNames.length; i++ ) 
           {
             if(hierarchyLevelNames[i].equals( levelName) ) return i+1;
           }
           return -1;
        }

The following example contains the methods that load the data for the different hierarchy levels. The load() method reads the data from the source files world_continents.json, world_countries.json, and world_states_provinces.json. For the sake of simplicity, the internally called loadLevel() method is not shown, but it is expected to parse and read the JSON files.

The loadFromIndex() method takes only the information provided by the HierarchyIndexReader instances passed as parameters. The load() method is expected to be executed only once per job, and only if a hierarchy index has not been created. Once the data is loaded, it is automatically indexed, and the loadFromIndex() method is called every time the hierarchy data is loaded into memory.

      @Override
      public void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception {
      int toLevel = fromLevel + hierDataPaths.length - 1;
      int levels = getNumberOfLevels();
 
      for(int i=0, level=fromLevel; i<hierDataPaths.length && level<=levels; i++, level++)
      {
 
         //load current level from the current path
 
         loadLevel(level, hierDataPaths[i]);
       }
    }
 
    @Override
    public void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf)
                 throws Exception 
    {
     Text parentId = new Text();
     RecordInfoArrayWritable records = new RecordInfoArrayWritable();
     int levels = getNumberOfLevels();
 
     //iterate through each reader to load each level's entries
 
     for(int i=0, level=fromLevel; i<readers.length && level<=levels; i++, level++)
     {
       entriesGeoms[ level - 1 ] = new Hashtable<String, JGeometry>();
       entriesParents[ level - 1 ] = new Hashtable<String, String>();
 
       //each entry is a parent record id (key) and a list of entries as RecordInfo (value)
 
       while(readers[i].nextParentRecords(parentId, records))
       {
          String pId = null;
 
          //entries with no parent will have the parent id UNDEFINED_PARENT_ID. Such is the case of the first level entries
 
           if( ! UNDEFINED_PARENT_ID.equals( parentId.toString() ) )
           {
           pId = parentId.toString();
           }
 
         //add the current level's entries
 
           for(Object obj : records.get())
           {
              RecordInfo entry = (RecordInfo) obj;
              entriesGeoms[ level - 1 ].put(entry.getId(), entry.getGeometry());
              if(pId != null) 
              {
              entriesParents[ level -1 ].put(entry.getId(), pId);
              }
            }//finish loading current parent entries
        }//finish reading single hierarchy level index
     }//finish iterating index readers
}

Finally, the following code listing contains the methods used to provide information about individual entries in each hierarchy level. The information provided is the ids of all the entries contained in a hierarchy level, the geometry of each entry, and the parent of each entry.

@Override
public Collection<String> getEntriesIds(int level) 
{
   Collection<String> ids = null;
 
   if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null)
   {
 
     //returns the ids of all the entries from the given level
 
     ids = entriesGeoms[ level - 1 ].keySet();
   }
   return ids;
}
 
@Override
public JGeometry getEntryGeometry(int level, String entryId) 
{
   JGeometry geom = null;
   if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null)
   {
 
      //returns the geometry of the entry with the given id and level
 
      geom = entriesGeoms[ level - 1 ].get(entryId);
    }
    return geom;
}
 
@Override
public String getParentId(int childLevel, String childId) 
{
   String parentId = null;
   if(childLevel >= 1 && childLevel <= getNumberOfLevels() && entriesGeoms[ childLevel - 1 ] != null)
   {
 
      //returns the parent id of the entry with the given id and level
 
      parentId = entriesParents[ childLevel - 1 ].get(childId);
   }
   return parentId;
   }
}//end of class

2.8.13 Using JGeometry in MapReduce Jobs

The Spatial Hadoop Vector Analysis API contains only a small subset of the functionality provided by the Spatial Java API, which can also be used in MapReduce jobs. This section provides some simple examples of how JGeometry can be used in Hadoop for spatial processing. The following example contains a simple mapper that performs the IsInside test between a dataset and a query geometry using the JGeometry class.

In this example, the query geometry ordinates, SRID, geodetic value, and tolerance used in the spatial operation are retrieved from the job configuration in the configure() method. The query geometry, which is a polygon, is preprocessed to quickly perform the IsInside operation.

The map() method is where the spatial operation is executed. Each input record value is tested against the query geometry, and the record id is emitted when the test succeeds.

public class IsInsideMapper extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text>
{
       private JGeometry queryGeom = null;
       private int srid = 0;
       private double tolerance = 0.0;
       private boolean geodetic = false;
       private Text outputValue = new Text();
       private double[] locationPoint = new double[2];
        
        
       @Override
       public void configure(JobConf conf) 
       {
           super.configure(conf);
           srid = conf.getInt("srid", 8307);
           tolerance = conf.getDouble("tolerance", 0.0);
           geodetic = conf.getBoolean("geodetic", true);
 
    //The ordinates are represented as a string of comma separated double values
           
            String[] ordsStr = conf.get("ordinates").split(",");
            double[] ordinates = new double[ordsStr.length];
            for(int i=0; i<ordsStr.length; i++)
            {
              ordinates[i] = Double.parseDouble(ordsStr[i]);
             }
 
    //create the query geometry as two-dimensional polygon and the given srid
 
             queryGeom = JGeometry.createLinearPolygon(ordinates, 2, srid);
 
    //preprocess the query geometry to make the IsInside operation run faster
 
              try 
              {
                queryGeom.preprocess(tolerance, geodetic, EnumSet.of(FastOp.ISINSIDE));
               } 
               catch (Exception e) 
               {
                 e.printStackTrace();
                }
                
          }
        
          @Override
          public void map(LongWritable key, Text value,
                      OutputCollector<NullWritable, Text> output, Reporter reporter)
                      throws IOException 
          {
 
     //the input value is a comma separated values text with the following columns: id, x-ordinate, y-ordinate
 
           String[] tokens = value.toString().split(",");
     
     //create a geometry representation of the record's location
 
            locationPoint[0] = Double.parseDouble(tokens[1]);//x ordinate
            locationPoint[1] = Double.parseDouble(tokens[2]);//y ordinate
            JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);
 
     //perform spatial test
 
            try 
            {
              if( location.isInside(queryGeom, tolerance, geodetic)){
 
               //emit the record's id
 
               outputValue.set( tokens[0] );
               output.collect(NullWritable.get(), outputValue);
             }
     } 
             catch (Exception e) 
             {
                e.printStackTrace();
              }
}
}

A similar approach can be used to perform a spatial operation on the geometry itself, for example, creating a buffer. The following example uses the same text value format and creates a buffer around each record location. The mapper output key and value are the record id and the generated buffer, which is represented as a JGeometryWritable. The JGeometryWritable is a Writable implementation contained in the Vector Analysis API that holds a JGeometry instance.

public class BufferMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, JGeometryWritable> 
{
       private int srid = 0;
       private double bufferWidth = 0.0;
       private Text outputKey = new Text();
       private JGeometryWritable outputValue = new JGeometryWritable();
       private double[] locationPoint = new double[2];
 
       @Override
       public void configure(JobConf conf)
       {
              super.configure(conf);
              srid = conf.getInt("srid", 8307);
 
              //get the buffer width 
 
              bufferWidth = conf.getDouble("bufferWidth", 0.0);
        }
 
        @Override
        public void map(LongWritable key, Text value,
               OutputCollector<Text, JGeometryWritable> output, Reporter reporter)
               throws IOException 
        {
 
               //the input value is a comma separated record with the following columns: id, longitude, latitude
 
               String[] tokens = value.toString().split(",");
 
               //create a geometry representation of the record's location
 
               locationPoint[0] = Double.parseDouble(tokens[1]);
               locationPoint[1] = Double.parseDouble(tokens[2]);
               JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);
 
               try 
               {
 
                  //create the location's buffer
 
                  JGeometry buffer = location.buffer(bufferWidth);
 
                  //emit the record's id and the generated buffer
 
                  outputKey.set( tokens[0] );
                  outputValue.setGeometry( buffer );
                  output.collect(outputKey, outputValue);
                }
 
                catch (Exception e)
                {
                   e.printStackTrace();
                 }
        }
}

2.8.14 Support for Different Data Sources

In addition to file-based data sources (that is, a file or a set of files from a local or a distributed file system), other types of data sources can be used as the input data for a Vector API job.

Data sources are referenced as input data sets in the Vector API. All the input data sets implement the interface oracle.spatial.hadoop.vector.data.AbstractInputDataSet. Input data set properties can be set directly for a Vector job using the methods setInputFormatClass(), setRecordInfoProviderClass(), and setSpatialConfig(). More information can be set, depending on the type of input data set. For example, setInput() can specify the input string for a file data source, or setIndexName() can be used for a spatial index. The job determines the type of input data source based on the properties that are set.

Input data set information can also be set directly for a Vector API job using the job's setInputDataSet() method. With this method, the input data source information is encapsulated, you have more control, and it is easier to identify the type of data source that is being used.
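
For illustration, the following minimal sketch configures a single file-based data set (using the FileInputDataSet class described in the list below) and assigns it to a Binning job through setInputDataSet(); the path shown is an example only.

//configure a file-based input data set (the path is an example only)
FileInputDataSet fileDataSet = new FileInputDataSet();
fileDataSet.setInputString("/user/hdfs/input/part*");
fileDataSet.setInputFormatClass(GeoJsonInputFormat.class);
fileDataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);

//assign the data set to the job
Binning binningJob = new Binning();
binningJob.setInputDataSet(fileDataSet);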

The Vector API provides the following implementations of AbstractInputDataSet:

  • SimpleInputDataSet: Contains the minimum information required by the Vector API for an input data set. Typically, this type of input data set should be used for non-file-based input data sets, such as Apache HBase, an Oracle database, or any other non-file-based data source.

  • FileInputDataSet: Encapsulates file-based input data sets from local or distributed file systems. It provides properties for setting the input path as an array of Path instances or as a string that can be a regular expression for selecting paths.

  • SpatialIndexInputDataSet: A subclass of FileInputDataSet optimized for working with spatial indexes generated by the Vector API. It is sufficient to specify the index name for this type of input data set.

  • NoSQLInputDataSet: Specifies Oracle NoSQL data sources. It should be used in conjunction with the Vector NoSQL API. If the NoSQL KVInputFormat or TableInputFormat classes need to be used, use SimpleInputDataSet instead.

  • MultiInputDataSet: Input data set that encapsulates two or more input data sets.

Multiple Input Data Sets

Most of the Hadoop jobs provided by the Vector API (except Categorization) are able to manage more than one input data set by using the class oracle.spatial.hadoop.vector.data.MultiInputDataSet.

To add more than one input data set to a job, follow these steps.

  1. Create and configure two or more instances of AbstractInputDataSet subclasses.

  2. Create an instance of oracle.spatial.hadoop.vector.data.MultiInputDataSet.

  3. Add the input data sets created in step 1 to the MultiInputDataSet instance.

  4. Set the MultiInputDataSet instance as the job's input data set.

The following code snippet shows how to set multiple input data sets for a Vector API job.

//file input data set
FileInputDataSet fileDataSet = new FileInputDataSet();
fileDataSet.setInputFormatClass(GeoJsonInputFormat.class);
fileDataSet.setRecordInfoProviderClass(GeoJsonRecordInfoProvider.class);
fileDataSet.setInputString("/user/myUser/geojson/*.json");
                
//spatial index input data set
SpatialIndexInputDataSet indexDataSet = new SpatialIndexInputDataSet();
indexDataSet.setIndexName("myIndex");
                
//create multi input data set
MultiInputDataSet multiDataSet = new MultiInputDataSet();
                
//add the previously defined input data sets
multiDataSet.addInputDataSet(fileDataSet);
multiDataSet.addInputDataSet(indexDataSet);
                
Binning binningJob = new Binning();
//set multiple input data sets to the job
binningJob.setInputDataSet(multiDataSet);

NoSQL Input Data Set

The Vector API provides classes to read data from Oracle NoSQL Database. The Vector NoSQL components let you group multiple key-value pairs into single records, which are passed to Hadoop mappers as RecordInfo instances. They also let you map NoSQL entries (key and value) to Hadoop records fields (RecordInfo’s id, geometry, and extra fields).

The NoSQL parameters are passed to a Vector job using the NoSQLInputDataSet class. You only need to fill in and set a NoSQLConfiguration instance that contains the KV store name, hosts, parent key, and additional information for the NoSQL data source. The InputFormat and RecordInfoProvider classes do not need to be set, because the default ones are used.

The following example shows how to configure a job to use NoSQL as the data source, using the Vector NoSQL classes.

//create NoSQL configuration
NoSQLConfiguration nsqlConf = new NoSQLConfiguration();
// set connection data
nsqlConf.setKvStoreName("mystore");
nsqlConf.setKvStoreHosts(new String[] { "myserver:5000" });
nsqlConf.setParentKey(Key.createKey("tweets"));
// set NoSQL entries to be included in the Hadoop records
// the entries with the following minor keys will be set as the
// RecordInfo's extra fields
nsqlConf.addTargetEntries(new String[] { "friendsCount", "followersCount" });
// add an entry processor to map the spatial entry to a RecordInfo's
// geometry
nsqlConf.addTargetEntry("geometry", NoSQLJGeometryEntryProcessor.class);
//create and set the NoSQL input data set
NoSQLInputDataSet nsqlDataSet = new NoSQLInputDataSet();
//set noSQL configuration
nsqlDataSet.setNoSQLConfig(nsqlConf);
//set spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
nsqlDataSet.setSpatialConfig(spatialConf);

Target entries refer to the NoSQL entries that will be part of the Hadoop records and are specified by the NoSQL minor keys. In the preceding example, the entries with the minor keys friendsCount and followersCount will be part of a Hadoop record. These NoSQL entries will be parsed as text values and assigned to the Hadoop RecordInfo as the extra fields called friendsCount and followersCount. By default, the major key is used as record id. The entries that contain “geometry” as minor key are used to set the RecordInfo’s geometry field.

In the preceding example, the value type of the geometry NoSQL entries is JGeometry, so it is necessary to specify a class to parse the value and assign it to the RecordInfo’s geometry field. This requires setting an implementation of the NoSQLEntryProcessor interface. In this case, the NoSQLJGeometryEntryProcessor class is used, and it reads the value from the NoSQL entry and sets that value to the current RecordInfo’s geometry field. You can provide your own implementation of NoSQLEntryProcessor for parsing specific entry formats.

By default, NoSQL entries sharing the same major key are grouped into the same Hadoop record. This behavior can be changed by implementing the interface oracle.spatial.hadoop.nosql.NoSQLGrouper and setting the NoSQLConfiguration property entryGrouperClass with the new grouper class.

The Oracle NoSQL library kvstore.jar is required when running Vector API jobs that use NoSQL as the input data source.

Other Non-File-Based Data Sources

Other non-file-based data sources can be used with the Vector API, such as NoSQL (using the Oracle NoSQL classes) and Apache HBase. Although the Vector API does not provide specific classes to manage every type of data source, you can associate the specific data source with the job configuration and specify the following information to the Vector job:

  • InputFormat: The InputFormat implementation used to read data from the data source.

  • RecordInfoProvider: An implementation of RecordInfoProvider to extract required information such as id, spatial information, and extra fields from the key-value pairs returned by the current InputFormat.

  • Spatial configuration: Describes the spatial properties of the input data, such as the SRID and the dimension boundaries.

The following example shows how to use Apache HBase data in a Vector job.

//create job
Job job = Job.getInstance(getConf());
job.setJobName(getClass().getName());
job.setJarByClass(getClass());

//Setup hbase parameters
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes("location_data"), Bytes.toBytes("geometry"));
scan.addColumn(Bytes.toBytes("other_data"), Bytes.toBytes("followers_count"));
scan.addColumn(Bytes.toBytes("other_data"), Bytes.toBytes("user_id"));

//initialize job configuration with hbase parameters
TableMapReduceUtil.initTableMapperJob(
                "tweets_table",
                scan,
                null,
                null,
                null,
                job);
//create binning job
Binning<ImmutableBytesWritable, Result> binningJob = new Binning<ImmutableBytesWritable, Result>();
//setup the input data set 
SimpleInputDataSet inputDataSet = new SimpleInputDataSet();
//use HBase's TableInputFormat
inputDataSet.setInputFormatClass(TableInputFormat.class);
//Set a RecordInfoProvider which can extract information from HBase TableInputFormat's returned key and values
inputDataSet.setRecordInfoProviderClass(HBaseRecordInfoProvider.class);
//set spatial configuration
SpatialConfig spatialConf = new SpatialConfig();
spatialConf.setSrid(8307);
inputDataSet.setSpatialConfig(spatialConf);
binningJob.setInputDataSet(inputDataSet);

//job output
binningJob.setOutput("hbase_example_output");

//binning configuration
BinningConfig binConf = new BinningConfig();
binConf.setGridMbr(new double[]{-180, -90, 180, 90});
binConf.setCellHeight(5);
binConf.setCellWidth(5);
binningJob.setBinConf(binConf);

//configure the job
binningJob.configure(job);

//run
boolean success = job.waitForCompletion(true);

The RecordInfoProvider class set in the preceding example is a custom implementation called HBaseRecordInfoProvider, the definition of which is as follows.

public class HBaseRecordInfoProvider implements RecordInfoProvider<ImmutableBytesWritable, Result>, Configurable{
        
        private Result value = null;
        private Configuration conf = null;
        private int srid = 0;

        @Override
        public void setCurrentRecord(ImmutableBytesWritable key, Result value) throws Exception {
                this.value = value;
        }

        @Override
        public String getId() {
                byte[] idb = value.getValue(Bytes.toBytes("other_data"), Bytes.toBytes("user_id"));
                String id = idb != null ? Bytes.toString(idb) : null;
                return id;
        }

        @Override
        public JGeometry getGeometry() {
                byte[] geomb = value.getValue(Bytes.toBytes("location_data"), Bytes.toBytes("geometry"));
                String geomStr = geomb!=null ? Bytes.toString(geomb) : null;
                JGeometry geom = null;
                if(geomStr != null){
                        String[] pointsStr = geomStr.split(",");
                        geom = JGeometry.createPoint(new double[]{Double.valueOf(pointsStr[0]), Double.valueOf(pointsStr[1])}, 2, srid);
                }
                return geom;
        }

        @Override
        public boolean getExtraFields(Map<String, String> extraFields) {
                byte[] fcb =  value.getValue(Bytes.toBytes("other_data"), Bytes.toBytes("followers_count"));
                if(fcb!=null){
                        extraFields.put("followers_count", Bytes.toString(fcb));
                }
                return fcb!=null;
        }

        @Override
        public Configuration getConf() {
                return conf;
        }

        @Override
        public void setConf(Configuration conf) {
                this.conf = conf;
                srid = conf.getInt(ConfigParams.SRID, 0);
        }
        
} 

2.8.15 Job Registry

Every time a Vector API job is launched using the command line interface or the web console, a registry file is created for that job. A job registry file contains the following information about the job:

  • Job name

  • Job ID

  • User that executed the job

  • Start and finish time

  • Parameters used to run the job

  • Jobs launched by the first job (called child jobs). Child jobs contain the same fields as the parent job.

A job registry file preserves the parameters used to run the job, which can be used as an aid for running an identical job even when it was not initially run using the command line interface.

By default, job registry files are created under the HDFS path relative to the user folder oracle_spatial/job_registry (for example, /user/hdfs/oracle_spatial/job_registry for the hdfs user).

Job registry files can be removed directly using HDFS commands or by using the following utility methods from the class oracle.spatial.hadoop.commons.logging.registry.RegistryManager, as shown in the example after this list:

  • public static int removeJobRegistry(long beforeDate, Configuration conf): Removes all the job registry files that were created before the specified time stamp from the default job registry folder.

  • public static int removeJobRegistry(Path jobRegDirPath, long beforeDate, Configuration conf): Removes all the job registry files that were created before the specified time stamp from a specified job registry folder.
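
For example, the following minimal sketch (assuming the current Hadoop Configuration is available) removes from the default job registry folder all registry files created more than 30 days ago.

//remove job registry files created more than 30 days ago from the default folder
long thirtyDaysAgo = System.currentTimeMillis() - (30L * 24L * 60L * 60L * 1000L);
int removedFiles = RegistryManager.removeJobRegistry(thirtyDaysAgo, new Configuration());
System.out.println(removedFiles + " job registry files were removed");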

2.8.16 Tuning Performance Data of Job Running Times Using the Vector Analysis API

The following table lists some running times for jobs built using the Vector Analysis API. The jobs were executed using a 4-node cluster. The times may vary depending on the characteristics of the cluster. The test dataset contains over one billion records, and its size is above 1 terabyte.

Table 2-2 Performance time for running jobs using Vector Analysis API

Job Type                                   Time taken (approximate value)
Spatial Indexing                           2 hours
Spatial Filter with Spatial Index          1 hour
Spatial Filter without Spatial Index       3 hours
Hierarchy count with Spatial Index         5 minutes
Hierarchy count without Spatial Index      3 hours

The time taken for the jobs can be decreased by increasing the maximum split size using either of the following configuration parameters:

mapred.max.split.size
mapreduce.input.fileinputformat.split.maxsize

This results in more splits being processed by each mapper, which improves the execution time. This applies when using SpatialFilterInputFormat (spatial indexing) or FileSplitInputFormat (spatial hierarchical join, buffer). The same results can be achieved by using an implementation of CombineFileInputFormat as the internal InputFormat.
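
For example, the following sketch sets both properties to 512 MB on the job configuration before the job is configured and submitted; the 512 MB value is only an illustration and should be tuned for the cluster.

//increase the maximum split size to 512 MB (example value; tune for your cluster)
JobConf jobConf = new JobConf(getConf());
jobConf.setLong("mapred.max.split.size", 512L * 1024L * 1024L);
jobConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512L * 1024L * 1024L);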

2.9 Oracle Big Data Spatial Vector Analysis for Spark

Oracle Big Data Spatial Vector Analysis for Apache Spark is a Spatial Vector Analysis API that provides spatially-enabled RDDs (Resilient Distributed Datasets) that support spatial transformations and actions, spatial partitioning, and indexing. These components make use of the Spatial Java API to perform spatial analysis tasks. The supported features are described in the following topics.

    2.9.1 Spatial RDD (Resilient Distributed Dataset)

    A spatial RDD is a Spark RDD that allows you to perform spatial transformations and actions.

    The current spatial RDD implementation is the class oracle.spatial.spark.vector.rdd.SpatialJavaRDD, and it can be created from an existing instance of RDD or JavaRDD, as shown in the following example:

    //create a regular RDD
    JavaRDD<String> rdd = sc.textFile("someFile.txt");
    //create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
    SparkRecordInfoProvider recordInfoProvider = new MySparkRecordInfoProvider();
    //create a spatial RDD
    SpatialJavaRDD<String> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, recordInfoProvider, String.class);
    

    A spatial RDD takes an implementation of the interface oracle.spatial.spark.vector.SparkRecordInfoProvider, which is used for extracting spatial information from each RDD element.

    A regular RDD can be transformed into a spatial RDD of the same generic type; that is, if the source RDD contains records of type String, the spatial RDD will also contain String records.

    You can also create a spatial RDD with records of type oracle.spatial.spark.vector.SparkRecordInfo. A SparkRecordInfo is an abstraction of a record from the source RDD; it holds the source record’s spatial information and may contain a subset of the source record’s data. The following example shows how to create an RDD of SparkRecordInfo records by omitting the last parameter when calling the SpatialJavaRDD.fromJavaRDD() method.

    //create a regular RDD
    JavaRDD<String> rdd = sc.textFile("someFile.txt");
    //create a SparkRecordInfoProvider to extract spatial information from the source RDD’s records
    SparkRecordInfoProvider recordInfoProvider = new MySparkRecordInfoProvider();
    //create a spatial RDD
    SpatialJavaRDD<SparkRecordInfo> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, recordInfoProvider);
    

    A spatial RDD of SparkRecordInfo records has the advantage that spatial information does not need to be extracted from each record every time it is needed for a spatial operation.

    You can accelerate spatial searches by spatially indexing a spatial RDD. Spatial indexing is described in Spatially Indexing a Spatial RDD.

    The spatial RDD provides the following spatial transformations and actions, which are described in Spatial Transformations and Spatial Actions.

    Spatial transformations:

    • filter

    • flatMap

    • join (available when creating a spatial index)

    Spatial Actions:

    • MBR

    • nearestNeighbors

    Spatial Pair RDD

    A pair version of SpatialJavaRDD is provided and is implemented as the class oracle.spatial.spark.vector.rdd.SpatialJavaPairRDD. A spatial pair RDD is created from an existing pair RDD and contains the same spatial transformations and actions as the single spatial RDD. A SparkRecordInfoProvider used for a spatial pair RDD should receive records of type scala.Tuple2<K,V>, where K and V correspond to the pair RDD key and value types, respectively.
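
    The following is a minimal sketch of a SparkRecordInfoProvider for a pair RDD, written in the style of Example 2-1. It assumes a pair RDD with LongWritable keys and Text values whose text is a comma-separated id,x,y record; the class name and record layout are illustrative only, not part of the API.

    //hypothetical provider for a pair RDD of <LongWritable, Text> records
    public class PairCSVRecordInfoProvider implements SparkRecordInfoProvider<Tuple2<LongWritable, Text>> {
        private int srid = 8307;
    
        //receives a key-value pair from the pair RDD and fills the given recordInfo
        public boolean getRecordInfo(Tuple2<LongWritable, Text> record, SparkRecordInfo recordInfo) {
            try {
                //only the value part is parsed; the assumed format is: id,x,y
                String[] tokens = record._2().toString().split(",");
                recordInfo.addField("id", tokens[0]);
                recordInfo.setGeometry(JGeometry.createPoint(
                    new double[] { Double.parseDouble(tokens[1]), Double.parseDouble(tokens[2]) }, 2, srid));
            } catch (Exception ex) {
                //return false so that records that cannot be parsed are ignored
                return false;
            }
            return true;
        }
    
        public void setSrid(int srid) {this.srid = srid;}
        public int getSrid() {return srid;}
    }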

    Example 2-1 SparkRecordInfoProvider to Read Information from a CSV File

    The following example shows how to implement a simple SparkRecordInfoProvider to read information from a CSV file.

    public class CSVRecordInfoProvider implements SparkRecordInfoProvider<String>{
        private int srid = 8307;
    
        //receives an RDD record and fills the given recordInfo
        public boolean getRecordInfo(String record, SparkRecordInfo recordInfo) {
            try {
                String[] tokens = record.split(",");
                //expected records have the format: id,name,last_name,x,y where x and y are optional 
                //output recordInfo will contain the fields id, last name and geometry
                recordInfo.addField("id", tokens[0]);
                recordInfo.addField("last_name", tokens[2]);
                if (tokens.length == 5) {
                    recordInfo.setGeometry(JGeometry.createPoint(new double[] { Double.parseDouble(tokens[3]), Double.parseDouble(tokens[4]) }, 2, srid));
                }
            } catch (Exception ex) {
                //return false when there is an error extracting data from the input value
                return false;
            }
            return true;
        }
    
        public void setSrid(int srid) {this.srid = srid;}   
        public int getSrid() {return srid;}
    }
    

    In this example, the record’s ID and last-name fields are extracted along with the spatial information, and they are set on the SparkRecordInfo instance used as an out parameter. Extracting additional information is needed only when the goal is to create a spatial RDD containing SparkRecordInfo elements and it is necessary to preserve a subset of the original record’s information. Otherwise, it is only necessary to extract the spatial information.

    The call to SparkRecordInfoProvider.getRecordInfo() should return true whenever the record should be included in a transformation or considered in a search. If SparkRecordInfoProvider.getRecordInfo() returns false, the record is ignored.

    2.9.2 Spatial Transformations

    The transformations described in the following subtopics are available for spatial RDD, spatial pair RDD, and a distributed spatial index unless stated otherwise (for example, a join transformation is only available for a distributed spatial index).

    2.9.2.1 Filter Transformation

    A filter transformation is a spatial version of the regular RDD filter() transformation. In addition to a user-provided filtering function, it takes an instance of oracle.spatial.hadoop.vector.util.SpatialOperationConfig, which describes the spatial operation used to filter spatial records. A SpatialOperationConfig contains a query window, which is the geometry used as the reference, and a spatial operation. The spatial operation is executed in the form: (RDD record’s geometry) (spatial operation) (query window). For example: (RDD record) IsInside (queryWindow)

    Spatial operations available are AnyInteract, IsInside, Contains, and WithinDistance.

    The following example returns an RDD containing only records that are inside the given query window and that have a non-null ID.

    SpatialOperationConfig soc = new SpatialOperationConfig();
    soc.setOperation(SpatialOperation.IsInside);
    soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
    SpatialJavaRDD<SparkRecordInfo> filteredSpatialRDD = spatialRDD.filter(
    (record) -> {
    return record.getField("id") != null;
    }, soc);
    

    2.9.2.2 FlatMap Transformation

    A FlatMap transformation is a spatial version of the regular RDD’s flatMap() transformation. In addition to the user-provided function, it takes a SpatialOperationConfig to perform a spatial filtering. It works like the Filter Transformation, except that spatially filtered results are passed to the map function and flattened.

    The following example creates an RDD that contains only the elements that interact with the given query window; the geometries of those elements are buffered.

    SpatialOperationConfig soc = new SpatialOperationConfig();
    soc.setOperation(SpatialOperation.AnyInteract);
    soc.setQueryWindow(JGeometry.createLinearPolygon(new double[] { 2.0, 1.0, 2.0, 3.0, 6.0, 3.0, 6.0, 1.0, 2.0, 1.0 }, 2, srid));
    JavaRDD<SparkRecordInfo> mappedRDD = spatialRDD.flatMap(
    (record) -> {
            JGeometry buffer = record.getGeometry().buffer(2.5);
            record.setGeometry(buffer);
    return Collections.singletonList(record);
    }, soc);
    

    2.9.2.3 Join Transformation

    A join transformation joins two spatial RDDs based on a spatial relationship between their records. In order to perform this transformation, one of the two RDDs must be spatially indexed. (See Spatial Indexing for more information about indexing a spatial RDD.)

    The following example joins all the records from both data sets that interact in any way.

    DistributedSpatialIndex index = DistributedSpatialIndex.createIndex(sparkContext, spatialRDD1, new QuadTreeConfiguration());
    SpatialJavaRDD<SparkRecordInfo> spatialRDD2 = SpatialJavaRDD.fromJavaRDD(rdd2, new RegionsRecordInfoProvider(srid));
    SpatialOperationConfig soc = new SpatialOperationConfig();
    soc.setOperation(SpatialOperation.AnyInteract);
    JavaRDD<Tuple2<SparkRecordInfo, SparkRecordInfo>> joinedRDD = index.join( spatialRDD2,
    (recordRDD1, recordRDD2) -> {
    return Collections.singletonList( new Tuple2<>(recordRDD1, recordRDD2));
    }, soc);
    

    2.9.2.4 Spatially Enabled Transformations

    Spatial operations can be performed in regular transformations by creating a SpatialTransformationContext before executing any transformation.

    Once the SpatialTransformationContext instance is available in the transformation function, it can be used to get the record’s geometry and apply spatial operations, as shown in the following example, which transforms an RDD of String records into a pair RDD where the key and value correspond to the source record ID and a buffered geometry, respectively.

    SpatialJavaRDD<String> spatialRDD = SpatialJavaRDD.fromJavaRDD(rdd, new CSVRecordInfoProvider(srid), String.class);
    SpatialTransformationContext stCtx = spatialRDD.createSpatialTransformationContext();
    JavaPairRDD<String, JGeometry> bufferedRDD = spatialRDD.mapToPair(
    (record) -> {
            SparkRecordInfo recordInfo = stCtx.getRecordInfo(record);
            String id = (String) recordInfo.getField("id");
            JGeometry geom = recordInfo.getGeometry();
            JGeometry buffer = geom.buffer(0.5);
    return new Tuple2<>(id, buffer);
    });
    

    2.9.3 Spatial Actions

    Spatial RDDs and spatial pair RDDs provide the following spatial actions.

    • MBR: Calculates the RDD’s minimum bounding rectangle (MBR). The MBR is calculated only once and cached, so it is not recalculated on subsequent calls. The following example shows how to get the MBR from a spatial RDD.

      double[] mbr = spatialRDD.getMBR();
      
    • NearestNeighbors: Returns a list with the K nearest elements from the RDD to a given geometry. The following example shows how to get the five records closest to the given point.

      JGeometry point = JGeometry.createPoint(new double[] { 2.0, 1.0 }, 2, srid);
      List<SparkRecordInfo> nearestNeighbors = spatialRDD.nearestNeighbors(point, 5, 0.0);
      

    2.9.4 Spatially Indexing a Spatial RDD

    A spatial RDD can be spatially indexed to speed up spatial searches when performing spatial transformations.

    A spatial index repartitions the spatial RDD so that each partition contains only records located in a specific area. This allows partitions that do not contain results for a spatial search to be quickly discarded, making the search faster.

    A spatial index is created through the abstract class oracle.spatial.spark.vector.index.DistributedSpatialIndex, which uses a specific implementation to create the actual spatial index. The following example shows how to create a spatial index using a QuadTree-based spatial index implementation.

    DistributedSpatialIndex index = DistributedSpatialIndex.createIndex(sparkContext, spatialRDD1, new QuadTreeConfiguration());
    

    The type of spatial index implementation is determined by the last parameter, which is a subtype of oracle.spatial.spark.vector.index.SpatialPartitioningConfiguration. Depending on the index implementation, the configuration parameter may accept different settings for performing partitioning and indexing. Currently, the only implementation of a spatial index is the class oracle.spatial.spark.vector.index.quadtree.QuadTreeDistIndex, and it receives a configuration of type oracle.spatial.spark.vector.index.quadtree.QuadTreeConfiguration.

    The DistributedSpatialIndex class currently supports the filter, flatMap, and join transformations, which are described in Spatial Transformations.

    A spatial index can be persisted using the method DistributedSpatialIndex.save(), which takes an existing SparkContext and a path where the index will be stored. The path may be in a local or a distributed (HDFS) file system. Similarly, a persisted spatial index can be loaded by calling the method DistributedSpatialIndex.load(), which also takes an existing SparkContext and the path where the index is stored.
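
    For example, assuming that the index variable created earlier is in scope, that save() is invoked on the index instance, and that the path is passed as a string (these are assumptions, not confirmed signatures), persisting and reloading an index might look like the following sketch.

    //persist the spatial index (the path is an example only; signatures are assumed)
    index.save(sparkContext, "/user/example/spatialIndexDir");

    //later, load the persisted index from the same path
    DistributedSpatialIndex loadedIndex = DistributedSpatialIndex.load(sparkContext, "/user/example/spatialIndexDir");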

    2.9.4.1 Spatial Partitioning of a Spatial RDD

    A spatial RDD can be partitioned through an implementation of oracle.spatial.spark.vector.index.SpatialPartitioning. The SpatialPartitioning class represents a spatial partitioning algorithm that transforms a spatial RDD into a spatially partitioned spatial pair RDD whose keys point to a spatial partition.

    A SpatialPartitioning algorithm is used internally by a spatial index, or it can be used directly by creating a concrete class. Currently, there is a QuadTree-based implementation called oracle.spatial.spark.vector.index.quadtree.QuadTreePartitioning. The following example shows how to spatially partition a spatial RDD.

    QuadTreePartitioning<T> partitioning = new QuadTreePartitioning<>(sparkContext, spatialRDD, new QuadTreeConfiguration());
    SpatialJavaPairRDD<PartitionKey, T> partRDD = partitioning.getPartitionedRDD();
    

    2.9.4.2 Local Spatial Indexing of a Spatial RDD

    A local spatial index can be created for each partition of a spatial RDD. Locally partitioning the content of each partition helps to improve spatial searches when working on a partition basis.

    A local index can be created for each partition by setting the parameter useLocalIndex to true when creating a distributed spatial index. A spatially partitioned RDD can also be transformed so each partition is locally indexed by calling the utility method oracle.spatial.spark.vector.index.local.LocalIndex.spatiallyIndexPartitions(SpatialJavaPairRDD<PartitionKey, T> rdd).

    2.9.5 Input Format Support

    The Spark Vector API provides implementations of SparkRecordInfoProvider for data sets in GeoJSON and shapefile formats. Data sets in these formats can be read using the existing InputFormat classes provided by the Vector API for Hadoop. See Support for GeoJSON and Shapefile Formats for more information.

    Both JSON and shapefile SparkRecordInfoProvider implementations can be configured to include fields in addition to the geometry field when calling getRecordInfo(). The following example configures a ShapeFileRecordInfoProvider so that it returns SparkRecordInfo instances containing the country and province fields.

    //create an RDD from an existing shape file
    JavaPairRDD<LongWritable, MapWritable> shapeFileRDD = sparkContext.hadoopRDD(jobConf, ShapeFileInputFormat.class, LongWritable.class, MapWritable.class); 
    //list of fields to be included in the SparkRecordInfo filled by the SparkRecordInfoProvider implementation
    List<String> attributes = new ArrayList<String>();
    attributes.add("country");
    attributes.add("province");
    //create a ShapeFileRecordInfoProvider
    ShapeFileRecordInfoProvider recordInfoProvider = new ShapeFileRecordInfoProvider(attributes, 8307);
    //create a spatial RDD
    SpatialJavaRDD<SparkRecordInfo> spatialRDD = SpatialJavaRDD.fromJavaRDD(shapeFileRDD.values(), recordInfoProvider);
    

    2.9.6 Spatial Spark SQL API

    The Spatial Spark SQL API supports Spark SQL DataFrame objects containing spatial information in any format.

    Oracle Big Data Spatial Vector Hive Analysis (see Oracle Big Data Spatial Vector Hive Analysis) can be used with Spark SQL.

    Example 2-2 Creating a Spatial DataFrame for Querying Tweets

    In the following example, if the data is loaded using a spatial RDD, then a DataFrame can be created using the function SpatialJavaRDD.createSpatialDataFrame.

    //create HiveContext
    HiveContext sqlContext = new HiveContext(sparkContext.sc());
    //get the spatial DataFrame from the SpatialRDD
    //the geometries are in GeoJSON format
    DataFrame spatialDataFrame = spatialRDD.createSpatialDataFrame(sqlContext, properties);
    // Register the DataFrame as a table.
    spatialDataFrame.registerTempTable("tweets");
    //register UDFs
    sqlContext.sql("create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon'");
    sqlContext.sql("create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point'");
    sqlContext.sql("create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains'");
    // SQL can be run over RDDs that have been registered as tables.
    StringBuffer query = new StringBuffer();
    query.append("SELECT geometry, friends_count, location, followers_count FROM tweets ");
    query.append("WHERE ST_Contains( ");
    query.append("  ST_Polygon('{\"type\": \"Polygon\",\"coordinates\": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307) ");
    query.append("  , ST_Point(geometry, 8307) ");
    query.append("  , 0.05)");
    query.append("  and followers_count > 50");
    DataFrame results = sqlContext.sql(query.toString());           
    //Filter the tweets in a query window (somewhere in the north of Mexico) 
    //and with more than 50 followers.
    //Note that since the geometries are in GeoJSON format it is possible to create the ST_Point like
    //ST_Point(geometry, 8307)
    //instead of
    //ST_Point(geometry, 'oracle.spatial.hadoop.vector.hive.json.GeoJsonHiveRecordInfoProvider')
    List<String> filteredTweets = results.javaRDD().map(new Function<Row, String>() {
      public String call(Row row) {
              StringBuffer sb = new StringBuffer();
              sb.append("Geometry: ");
              sb.append(row.getString(0));
              
              sb.append("\nFriends count: ");
              sb.append(row.getString(1));
              sb.append("\nLocation: ");
              sb.append(row.getString(2));
              sb.append("\nFollowers count: ");
              sb.append(row.getString(3));
              return sb.toString();
      }
    }).collect();
    //print the filtered tweets
    filteredTweets.forEach(tweet -> System.out.println("Tweet: "+tweet)); 
    

    2.10 Oracle Big Data Spatial Vector Hive Analysis

    Oracle Big Data Spatial Vector Hive Analysis provides spatial functions to analyze the data using Hive. The spatial data can be in any Hive supported format. You can also use a spatial index created with the Java analysis API (see Spatial Indexing) for fast processing.

    The supported features include:

    • Using the Hive Spatial API

    • Using Spatial Indexes in Hive

    See also HiveRecordInfoProvider for details about the implementation of these features.

    Hive Spatial Functions provides reference information about the available functions.

    Prerequisite Libraries

    The following libraries are required by the Spatial Vector Hive Analysis API.

    • sdohadoop-vector-hive.jar

    • sdohadoop-vector.jar

    • sdoutl.jar

    • sdoapi.jar

    • ojdbc.jar

    2.10.1 HiveRecordInfoProvider

    A record in a Hive table may contain a geometry field in any format, such as JSON, WKT, or a user-specified format. Geometry constructors such as ST_Geometry can create a geometry from the GeoJSON, WKT, or WKB representation of the geometry. If the geometry is stored in another format, a HiveRecordInfoProvider can be used.

    HiveRecordInfoProvider is a component that interprets the geometry field representation and returns the geometry in a GeoJSON format.

    The returned geometry must contain the geometry SRID, as in the following example format:

    {"type":<geometry-type", "crs": {"type": "name", "properties": {"name": "EPSG:4326"}}"coordinates":[c1,c2,....cn]} 
    

    The HiveRecordInfoProvider interface has the following methods:

    • void setCurrentRecord(Object record)

    • String getGeometry()

    The setCurrentRecord() method is called with the current geometry field value that is provided when a geometry is created in Hive. The HiveRecordInfoProvider is then used to get the geometry, or it returns null if the record has no spatial information.

    The information returned by the HiveRecordInfoProvider is used by the Hive Spatial functions to create geometries (see Hive Spatial Functions).

    Sample HiveRecordInfoProvider Implementation

    This sample implementation, named SimpleHiveRecordInfoProvider, takes text records in JSON format. The following is a sample input record:
    {"longitude":-71.46, "latitude":42.35}
    

    When SimpleHiveRecordInfoProvider is instantiated, a JSON ObjectMapper is created. The ObjectMapper is used to parse record values later when setCurrentRecord() is called. The geometry is represented as a latitude-longitude pair and is used to create a point geometry using the JsonUtils.readGeometry() method. The GeoJSON format to be returned is then created using GeoJsonGen.asGeometry(), and the SRID is added to the GeoJSON using JsonUtils.addSRIDToGeoJSON().

    public class SimpleHiveRecordInfoProvider implements HiveRecordInfoProvider{
      private static final Log LOG =
        LogFactory.getLog(SimpleHiveRecordInfoProvider.class.getName());
    
      private JsonNode recordNode = null;
      private ObjectMapper jsonMapper = null;
    
      public SimpleHiveRecordInfoProvider(){
        jsonMapper = new ObjectMapper();
      }
      
      @Override
      public void setCurrentRecord(Object record) throws Exception {
        try{
          if(record != null){
            //parse the current value
            recordNode = jsonMapper.readTree(record.toString());
          }
        }catch(Exception ex){
          recordNode = null;
          LOG.warn("Problem reading JSON record value: " + record.toString(), ex);
        }
      }
    
      @Override
      public String getGeometry() {
        if(recordNode == null){
          return null;
        }
        
        JGeometry geom = null;
    
        try{
          geom = JsonUtils.readGeometry(recordNode, 
              2, //dimensions
              8307 //SRID
              );
        }catch(Exception ex){
          LOG.warn("Problem reading JSON record geometry: " + recordNode.toString(), ex);
          recordNode = null;
        }
        
        if(geom != null){
          StringBuilder res = new StringBuilder();
          //Get a GeoJSON representation of the JGeometry
          GeoJsonGen.asGeometry(geom, res);
          String result = res.toString();
          //add SRID to GeoJSON and return the result
          return JsonUtils.addSRIDToGeoJSON(result, 8307);
        }
        
         return null;
      }
    }
    

    2.10.2 Using the Hive Spatial API

    The Hive Spatial API consists of Oracle-supplied Hive User Defined Functions that can be used to create geometries and perform operations using one or two geometries.

    The functions can be grouped into logical categories: types, single-geometry, and two-geometries. (Hive Spatial Functions lists the functions in each category and provides reference information about each function.)

    Example 2-3 Hive Script

    The following example script returns information about Twitter users in a data set who are within a specified geographical polygon and who have more than 50 followers. It does the following:

    1. Adds the necessary jar files:

      add jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
      
    2. Creates the Hive user-defined functions that will be used:

      create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
      create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
      create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
      
    3. Creates a Hive table based on the files under the HDFS directory /user/oracle/twitter. The InputFormat used in this case is org.apache.hadoop.mapred.TextInputFormat, and the Hive SerDe is a user-provided SerDe, package.TwitterSerDe. The geometry of the tweets will be saved in the geometry column with the format {"longitude":n, "latitude":n}:

      CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)                                         
      ROW FORMAT SERDE 'package.TwitterSerDe'              
      STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION '/user/oracle/twitter';
      
    4. Runs a spatial query that receives an ST_Polygon query area and the ST_Point tweet geometries, using 0.5 as the tolerance value for the spatial operation. The HiveRecordInfoProvider implementation that translates the custom geometry to GeoJSON format is the one described in HiveRecordInfoProvider. The output will be information about Twitter users in the query area who have more than 50 followers.

      SELECT id, followers_count, friends_count, location FROM sample_tweets
      WHERE ST_Contains(
        ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307)
        , ST_Point(geometry, 'package.SimpleHiveRecordInfoProvider')
        , 0.5)
        and followers_count > 50;
      

    The complete script is as follows:

    add jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
    
    create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
    create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
    create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
     
    CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)                                         
    ROW FORMAT SERDE 'package.TwitterSerDe'              
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/user/oracle/twitter';
    
    
    SELECT id, followers_count, friends_count, location FROM sample_tweets
    WHERE ST_Contains(
      ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307)
      , ST_Point(geometry, 'package.SimpleHiveRecordInfoProvider')
      , 0.5)
      and followers_count > 50;
    

    2.10.3 Using Spatial Indexes in Hive

    Hive spatial queries can use a previously created spatial index, which you can create using the Java API (see Spatial Indexing).

    If you do not need to use the index in API functions that will access the original data, you can specify isMapFileIndex=false when you call oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing, or you can use the function setMapFileIndex(false). In these cases, the index will have the following structure:

    HDFSIndexDirectory/part-xxxxx
    

    In these cases, when creating a Hive table, just provide the folder where you created the index.

    If you need to access the original data and you do not set the parameter isMapFileIndex=false, the index structure is as follows:

    part-xxxxx
      data
      index
    

    In such cases, the data files of the index are needed to create a Hive table. Copy the data files into a new HDFS folder, giving each data file a different name, such as data1, data2, and so on. The new folder is used to create the Hive table.

    The index contains the geometry records and extra fields. That data can be used when creating the Hive table.

    (Note that Spatial Indexing Class Structure describes the index structure, and RecordInfoProvider provides an example of a RecordInfoProvider adding extra fields.)

    The InputFormat class oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat is used to read the index. The output of this InputFormat is GeoJSON.

    Before running any query, you can specify a minimum bounding rectangle (MBR) that performs a first filtering of the data in SpatialIndexTextInputFormat.

    Example 2-4 Hive Script Using a Spatial Index

    The following example script returns information about Twitter users in a data set who are within a specified geographical polygon and who have more than 50 followers. It does the following:

    1. Adds the necessary jar files:

      add jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
        /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
      
    2. Creates the Hive user-defined functions that will be used:

      create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
      create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
      create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
      
    3. Sets the data maximum and minimum boundaries (dim1Min,dim2Min,dim1Max,dim2Max):

      set oracle.spatial.boundaries=-180,-90,180,90;
      
    4. Sets the extra fields contained in the spatial index that will be included in the table creation:
      set oracle.spatial.index.includedExtraFields=followers_count,friends_count,location;
      
    5. Creates a Hive table based on the files under the HDFS directory /user/oracle/twitter. The InputFormat used in this case is oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat, and the Hive SerDe is a user-provided SerDe, oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe. (The code for oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe is included with the Hive examples.) The geometry of the tweets will be saved in the geometry column with the format {"longitude":n, "latitude":n}:

      CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets_index (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)                                         
      ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe'              
      STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION '/user/oracle/twitter/index';
      
    6. Defines the minimum bounding rectangle (MBR) used to filter in the SpatialIndexTextInputFormat. Any spatial query will only have access to the data in this MBR. If no MBR is specified, then the data boundaries are used. This setting is recommended to improve performance.

      set oracle.spatial.spatialQueryWindow={"type": "Polygon","coordinates": [[[-107, 24], [-107, 31], [-103, 31], [-103, 24], [-107, 24]]]};
      
    7. Runs a spatial query that receives an ST_Polygon query area and the ST_Point tweet geometries, using 0.5 as the tolerance value for the spatial operation. The tweet geometries are in GeoJSON format, and the ST_Point function is used with the SRID specified as 8307. The output will be information about Twitter users in the query area who have more than 50 followers.

      SELECT id, followers_count, friends_count, location FROM sample_tweets
      WHERE ST_Contains(
        ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307)
        , ST_Point(geometry, 8307)
        , 0.5)
        and followers_count > 50;
      

    The complete script is as follows. (Differences between this script and the one in Using the Hive Spatial API are described in the preceding list.)

    add jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/ojdbc8.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoutl.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdoapi.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector.jar
      /opt/oracle/oracle-spatial-graph/spatial/vector/jlib/sdohadoop-vector-hive.jar;
    
    create temporary function ST_Polygon as 'oracle.spatial.hadoop.vector.hive.ST_Polygon';
    create temporary function ST_Point as 'oracle.spatial.hadoop.vector.hive.ST_Point';
    create temporary function ST_Contains as 'oracle.spatial.hadoop.vector.hive.function.ST_Contains';
    
    set oracle.spatial.boundaries=-180,-90,180,90;
    set oracle.spatial.index.includedExtraFields=followers_count,friends_count,location;
    
    CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets_index (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING)                                         
    ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.GeoJsonSerDe'              
    STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.mapred.input.SpatialIndexTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/user/oracle/twitter/index';
    
    set oracle.spatial.spatialQueryWindow={"type": "Polygon","coordinates": [[[-107, 24], [-107, 31], [-103, 31], [-103, 24], [-107, 24]]]};
    
    SELECT id, followers_count, friends_count, location FROM sample_tweets
    WHERE ST_Contains(
      ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307)
      , ST_Point(geometry, 8307)
      , 0.5)
      and followers_count > 50;
    

    2.11 Using the Oracle Big Data Spatial and Graph Vector Console

    You can use the Oracle Big Data Spatial and Graph Vector Console to perform tasks related to spatial indexing and creating and showing thematic maps.

    2.11.1 Creating a Spatial Index Using the Console

    To create a spatial index using the Oracle Big Data Spatial and Graph Vector Console, follow these steps.

    1. Open the console: http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/

    2. Click Spatial Index.

    3. Specify all the required details:

      1. Index name.

      2. Path of the file or files to index in HDFS. For example, hdfs://<server_name_to_store_index>:8020/user/oracle/bdsg/tweets.json.

      3. New index path: This is the job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/index.

      4. SRID of the geometries to be indexed. Example: 8307

      5. Tolerance of the geometries to be indexed. Example: 0.5

      6. Input Format class: The input format class. For example: oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat

      7. Record Info Provider class: The class that provides the spatial information. For example: oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider.

        Note:

        If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.

      8. Whether the MVSuggest service must be used. Use the MVSuggest service if the geometry has to be found from a location string. In this case, the provided RecordInfoProvider must implement the interface oracle.spatial.hadoop.vector.LocalizableRecordInfoProvider.

      9. MVSuggest service URL (Optional): If the geometry has to be found from a location string, then use the MVSuggest service. If the service URL is localhost, then each data node must have the MVSuggest application started and running. In this case, the new index contains the point geometry and the layer provided by MVSuggest for each record. If the geometry is a polygon, then the geometry is the centroid of the polygon. For example: http://localhost:8080

      10. MVSuggest Templates (Optional): When using the MVSuggest service, the user can define the templates used to create the index.

      11. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@example.com

    4. Click Create.

      The submitted job is listed; wait for an email notification that the job completed successfully.

    2.11.2 Exploring the Indexed Spatial Data

    To explore indexed spatial data, follow these steps.

    1. Open the console: http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/

    2. Click Explore Data.

    For example, you can:

    • Select the desired indexed data and click Refresh Map to see the data on the map.

    • Change the background map style.

    • Change the real data zoom level.

    • Show data using a heat map.

    2.11.3 Running a Categorization Job Using the Console

    You can run a categorization job with or without the spatial index. Follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.

    2. Click Categorization, then Run Job.

    3. Select either With Index or Without Index and provide the following details, as required:

      • With Index

        1. Index name

      • Without Index

        1. Path of the data: Provide the HDFS data path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/tweets.json.

        2. JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.

        3. Input Format class: The input format class. For example: oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat

        4. Record Info Provider class: The class that will provide the spatial information. For example: oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider.

        5. Whether the MVSuggest service must be used. Use the MVSuggest service if the geometry must be found from a location string. In this case, the provided RecordInfoProvider has to implement the interface oracle.spatial.hadoop.vector.LocalizableRecordInfoProvider.

    4. Templates: The templates to create the thematic maps.

      Note:

      If a template refers to point geometries (for example, cities), the result returned for that template is empty if MVSuggest is not used. This is because the spatial operations return results only for polygons.

      Tip:

      When using the MVSuggest service, the results will be more accurate if all the templates that could match the results are provided. For example, if the data can refer to any city, state, country, or continent in the world, then the better choice of templates is World Continents, World Countries, World State Provinces, and World Cities. On the other hand, if the data is from the USA states and counties, then the suitable templates are USA States and USA Counties. If an index that was created using the MVSuggest service is selected, then select the top hierarchy for an optimal result. For example, if it was created using World Countries, World State Provinces, and World Cities, then use World Countries as the template.

    5. Output path: The Hadoop job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/catoutput

    6. Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example, Tweets test.

    7. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.

    2.11.4 Viewing the Categorization Results

    To view the categorization results, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.
    2. Click Categorization, then View Results.
    3. Click any one of the Templates. For example, World Continents.

      The World Continents template is displayed.

    4. Click any one of the Results displayed.

      Different continents appear with different patches of colors.

    5. Click any continent from the map. For example, North America.

      The template changes to World Countries and the focus changes to North America with the results by country.

    2.11.5 Saving Categorization Results to a File

    You can save categorization results to a file (for example, the result file created with a job executed from the command line) on the local system for possible future uploading and use. The templates are located in the folder $JETTY_HOME/webapps/spatialviewer/templates. The templates are GeoJSON files with features and all the features have ids. For example, the first feature in the template USA States starts with: {"type":"Feature","_id":"WYOMING",...

    The results must be JSON files with the following format: {"id":"JSONFeatureId","result":result}.

    For example, if the template USA States is selected, then a valid result is a file containing: {"id":"WYOMING","result":3232} {"id":"SOUTH DAKOTA","result":74968}

    1. Click Categorization, then View Results.
    2. Select a Template.
    3. Specify a Name.
    4. Click Choose File and select the file location.
    5. Click Save.

      The results can be located in the $JETTY_HOME/webapps/spatialviewer/results folder.

    2.11.6 Creating and Deleting Templates

    To create new templates do the following:

    1. Add the template JSON file in the folder $JETTY_HOME/webapps/spatialviewer/templates/.
    2. Add the template configuration file in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_.

    To delete the template, delete the JSON and configuration files added in steps 1 and 2.

    2.11.7 Configuring Templates

    Each template has a configuration file. The template configuration files are located in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_. The name of the configuration file is the same as the template file, suffixed with config.json instead of .json. For example, the configuration file name of the template file usa_states.json is usa_states.config.json. The configuration parameters are:

    • name: Name of the template to be shown on the console. For example, name: USA States.

    • display_attribute: When displaying a categorization result, moving the cursor over a feature displays this property and the result for the feature. For example, display_attribute: STATE NAME.

    • point_geometry: True if the template contains point geometries; false in the case of polygons. For example, point_geometry: false.

    • child_templates (optional): The possible child templates, separated by a comma. For example, child_templates: ["world_states_provinces, usa_states(properties.COUNTRY CODE:properties.PARENT_REGION)"].

      If a child template does not specify a linked field, then all the features inside the parent features are considered child features. In this case, world_states_provinces does not specify any fields. If the link between parent and child is specified, then the spatial relationship does not apply and the feature property links are checked instead. In the above example, the relationship with usa_states is found with the property COUNTRY CODE in the current template and the property PARENT_REGION in the template file usa_states.json.

    • srid: The SRID of the template's geometries. For example, srid: 8307.

    • back_polygon_template_file_name (optional): A template with polygon geometries to set as background when showing the defined template. For example, back_polygon_template_file_name: usa_states.

    • vectorLayers: Configuration specific to the MVSuggest service. For example:

      {
        "vectorLayers": [
          {
            "gnidColumns": ["_GNID"],
            "boostValues": [2.0, 1.0, 1.0, 2.0]
          }
        ]
      }
      

      Where:

      • gnidColumns is the name of the column(s) within the JSON file that represents the Geoname ID. This value is used to support multiple languages with MVSuggest. (See references to that value in the file templates/_geonames_/alternateNames.json.) There is no default value for this property.

      • boostValues is an array of float numbers that represent how important a column is within the "properties" values for a given row. The higher the number, the more important that field is. A value of zero means the field will be ignored. When boostValues is not present, all fields receive a default value of 1.0, meaning they all are equally important properties. The MVSuggest service may return different results depending on those values. For a JSON file with the following properties, the boost values might be as follows:

        "properties":{"Name":"New York City","State":"NY","Country":"United States","Country Code":"US","Population":8491079,"Time Zone":"UTC-5"}
        "boostValues":[3.0,2.0,1.0,1.0,0.0,0.0]
        

    2.11.8 Running a Clustering Job Using the Console

    To run a clustering job, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.

    2. Click Clustering, then Run Job.

    3. Provide the following details, as required:

      1. Path of the data: Provide the HDFS data path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/tweets.json.

      2. The SRID of the geometries. For example: 8307

      3. The tolerance of the geometries. For example: 0.5

      4. JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.

      5. Input Format class: The input format class. For example: oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat

      6. Record Info Provider class: The class that will provide the spatial information. For example: oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider.

      7. Output path: The Hadoop job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/catoutput

      8. Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example, Tweets test.

      9. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.

    2.11.9 Viewing the Clustering Results

    To view the clustering results, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.
    2. Click Clustering, then View Results.
    3. Click any one of the Results displayed.

    2.11.10 Saving Clustering Results to a File

    You can save clustering results to a file on your local system, for later uploading and use. To save the clustering results to a file, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.
    2. Click Clustering, then View Results.
    3. Click the icon for saving the results.
    4. Specify the SRID of the geometries. For example: 8307
    5. Click Choose File and select the file location.
    6. Click Save.

    2.11.11 Running a Binning Job Using the Console

    You can run a binning job with or without the spatial index. Follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.

    2. Click Binning, then Run Job.

    3. Select either With Index or Without Index and provide the following details, as required:

      • With Index

        1. Index name

      • Without Index

        1. Path of the data: Provide the HDFS data path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/tweets.json.

        2. The SRID of the geometries. For example: 8307

        3. The tolerance of the geometries. For example: 0.5

        4. JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.

        5. Input Format class: The input format class. For example: oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat

        6. Record Info Provider class: The class that will provide the spatial information. For example: oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider.

    4. Binning grid minimum bounding rectangle (MBR). You can click the icon to see the MBR on the map.

    5. Binning shape: hexagon (specify the hexagon width) or rectangle (specify the width and height).

    6. Thematic attribute: If the job uses an index, double-click to see the possible values, which are those returned by the function getExtraFields of the RecordInfoProvider used when creating the index. If the job does not use an index, then the field can be one of the fields returned by the function getExtraFields of the specified RecordInfoProvider class. In any case, the count attribute is always available and specifies the number of records in the bin.

    7. Output path: The Hadoop job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/oracle/bdsg/binningOutput

    8. Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example, Tweets test.

    9. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.

    2.11.12 Viewing the Binning Results

    To view the binning results, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.
    2. Click Binning, then View Results.
    3. Click any of the Results displayed.

    2.11.13 Saving Binning Results to a File

    You can save binning results to a file on your local system, for later uploading and use. To save the binning results to a file, follow these steps.

    1. Open http://<oracle_big_data_spatial_vector_console>:8045/spatialviewer/.
    2. Click Binning, then View Results.
    3. Click the icon for saving the results.
    4. Specify the SRID of the geometries. For example: 8307
    5. Specify the thematic attribute, which must be a property of the features in the result. For example, the count attribute can be used to create results depending on the number of results per bin.
    6. Click Choose File and select the file location.
    7. Click Save.

    2.11.14 Running a Job to Create an Index Using the Command Line

    To create a spatial index, use a command in the following format:

    hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing [generic options] input=<path|comma_separated_paths|path_pattern> output=<path> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<minX,minY,maxX,maxY>] [indexName=<index_name>] [indexMetadataDir=<path>] [overwriteIndexMetadata=<true|false>] [  mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>][mvsMatchCountry=<country_name>][mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>][mvsInterfaceType=<LOCAL, WEB>][mvsIsRepository=<true|false>][rebuildMVSIndex=<true|false>][mvsPersistentLocation=<hdfs_path>][mvsOverwritePersistentLocation=<true|false>] ]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing with oracle.spatial.hadoop.vector.mapreduce.job.SpatialIndexing.

    Input/output arguments:

    • input: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression.

    • inputFormat: the inputFormat class implementation used to read the input data.

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class.

    • output: the path where the spatial index will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY

    Spatial index metadata arguments:

    • indexName (optional, default=output folder name): The name of the index to be generated.

    • indexMetadataDir (optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata will be stored.

    • overwriteIndexMetadata (optional, default=false): boolean argument that indicates whether the index metadata can be overwritten if an index with the same name already exists.

    MVSuggest arguments:

    • mvsLocation: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest.

    • mvsMatchLayers (optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search.

    • mvsMatchCountry (optional, default=none): a country name which MVSuggest will give higher priority when performing matches.

    • mvsSpatialResponse (optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID.

    • mvsInterfaceType (optional: default=LOCAL): the type of MVSuggest service used, it can be LOCAL or WEB.

    • mvsIsRepository (optional, default=false) (LOCAL only): boolean value which specifies whether mvsLocation points to a whole MVS directory (false) or only to a repository (true). An MVS repository contains only JSON templates; it may or may not contain a _config_ and _geonames_ folder.

    • mvsRebuildIndex (optional, default=false) (LOCAL only): boolean value specifying whether the repository index should be rebuilt or not.

    • mvsPersistentLocation (optional, default=none)(LOCAL only): an HDFS path where the MVSuggest directory will be saved.

    • mvsIsOverwritePersistentLocation (optional, default=false): boolean argument that indicates whether an existing mvsPersistentLocation must be overwritten in case it already exists.

    Example: Create a spatial index called indexExample. The index metadata will be stored in the HDFS directory indexMetadataDir.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/spatial_index inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 indexName=indexExample indexMetadataDir=indexMetadataDir overwriteIndexMetadata=true
    

    Example: Create a spatial index using MVSuggest to assign a spatial location to records that do not contain geometries.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialIndexing input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/spatial_index inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=mypackage.SimpleLocationRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 indexName=indexExample indexMetadataDir=indexMetadataDir overwriteIndexMetadata=true mvsLocation=file:///local_folder/mvs_dir/oraclemaps_pub/ mvsIsRepository=true
    

    2.11.15 Running a Job to Create a Categorization Result

    To create a categorization result, use a command in one of the following formats.

    With a Spatial Index

    hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] ( indexName=<indexName> [indexMetadataDir=<path>] )  | ( input=<path|comma_separated_paths|path_pattern> isInputIndex=true [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]  ) output=<path> hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass> [hierarchyDataPaths=<level1_path,level2_path,,levelN_path>] spatialOperation=<[None, IsInside, AnyInteract]>
    

    Without a Spatial Index

    hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>] output=<path> hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass> [hierarchyDataPaths=<level1_path,level2_path,...,levelN_path>] spatialOperation=<[None, IsInside, AnyInteract]>
    

    Using MVSuggest

    hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization [generic options] (indexName=<indexName> [indexMetadataDir=<path>])  | 
    ( 
    (input=<path|comma_separated_paths|path_pattern> isInputIndex=true)  | (input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass>)
    [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]  
    ) output=<path> 
    mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>] [mvsMatchCountry=<country_name>] [mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>] [mvsInterfaceType=<[UNDEFINED, LOCAL, WEB]>] [mvsIsRepository=<true|false>] [mvsRebuildIndex=<true|false>] [mvsPersistentLocation=<hdfs_path>] [mvsOverwritePersistentLocation=<true|false>] [mvsMaxRequestRecords=<integer_number>]  hierarchyIndex=<hdfs_hierarchy_index_path> hierarchyInfo=<HierarchyInfo_subclass>
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Categorization with oracle.spatial.hadoop.vector.mapreduce.job.Categorization.

    Input/output arguments:

    • indexName: the name of an existing spatial index. The index information is looked up at the path given by indexMetadataDir. When this argument is used, the argument input is ignored.

    • indexMetadataDir (optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located

    • input: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression. (Ignored if indexName is specified.)

    • inputFormat: the inputFormat class implementation used to read the input data. (Ignored if indexName is specified.)

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class. (Ignored if indexName is specified.)

    • output: the path where the results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY

    • spatialOperation: the spatial operation to perform between the input data set and the hierarchical data set. Allowed values are IsInside and AnyInteract.

    Hierarchical data set arguments:

    • hierarchyIndex: the HDFS path of an existing hierarchical index or where it can be stored if it needs to be generated.

    • hierarchyInfo: the fully qualified name of a HierarchyInfo subclass which is used to describe the hierarchical data.

    • hierarchyDataPaths (optional, default=none): a comma separated list of paths of the hierarchy data. The paths should be sorted in ascending order by hierarchy level. If a hierarchy index path does not exist for the given hierarchy data, this argument is required.

    MVSuggest arguments:

    • mvsLocation: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest.

    • mvsMatchLayers (optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search.

    • mvsMatchCountry (optional, default=none): a country name which MVSuggest will give higher priority when performing matches.

    • mvsSpatialResponse (optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID.

    • mvsInterfaceType (optional: default=LOCAL): the type of MVSuggest service used, it can be LOCAL or WEB.

    • mvsIsRepository (optional, default=false) (LOCAL only): Boolean value that specifies whether mvsLocation points to a whole MVS directory (false) or only to a repository (true). An MVS repository contains only JSON templates; it may or may not contain a _config_ and _geonames_ folder.

    • mvsRebuildIndex (optional, default=false) (LOCAL only): boolean value specifying whether the repository index should be rebuilt or not.

    • mvsPersistentLocation (optional, default=none)(LOCAL only): an HDFS path where the MVSuggest directory will be saved.

    • mvsIsOverwritePersistentLocation (optional, default=false): boolean argument that indicates whether an existing mvsPersistentLocation must be overwritten in case it already exists.

    Example: Run a Categorization job to create a summary containing the records counts by continent, country, and state/provinces. The input is an existing spatial index called indexExample. The hierarchical data will be indexed and stored in HDFS at the path hierarchyIndex.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization indexName=indexExample output=/user/hdfs/demo_vector/tweets/hier_count_spatial hierarchyInfo=vectoranalysis.categorization.WorldAdminHierarchyInfo hierarchyIndex=hierarchyIndex hierarchyDataPaths=file:///templates/world_continents.json,file:///templates/world_countries.json,file:///templates/world_states_provinces.json spatialOperation=IsInside
    

    Example: Run a Categorization job to create a summary of tweet counts per continent, country, states/provinces, and cities using MVSuggest.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Categorization input="/user/hdfs/demo_vector/tweets/part*" inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass>  output=/user/hdfs/demo_vector/tweets/hier_count_mvs hierarchyInfo=vectoranalysis.categorization.WorldAdminHierarchyInfo hierarchyIndex=hierarchyIndex mvsLocation=file:///mvs_dir mvsMatchLayers=world_continents,world_countries,world_states_provinces spatialOperation=IsInside
    

    2.11.16 Running a Job to Create a Clustering Result

    To create a clustering result, use a command in the following format:

    hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.KMeansClustering [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]  k=<number_of_clusters> [clustersPoints=<comma_separated_points_ordinates>] [deleteClusterFiles=<true|false>] [maxIterations=<integer_value>] [critFunClass=<CriterionFunction_subclass>] [shapeGenClass=<ClusterShapeGenerator_subclass>] [maxMemberDistance=<double_value>]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.KMeansClustering with oracle.spatial.hadoop.vector.mapreduce.job.KMeansClustering.

    Input/output arguments:

    • input: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression.

    • inputFormat: the inputFormat class implementation used to read the input data.

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class.

    • output: the path where the results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY

    Clustering arguments:

    • k: the number of clusters to be found.

    • clusterPoints (optional, default=none): the initial cluster centers as a comma-separated list of point ordinates in the form: p1_x,p1_y,p2_x,p2_y,…,pk_x,pk_y

    • deleteClusterFiles (optional, default=true): Boolean argument that specifies whether the intermediate cluster files generated between iterations should be deleted or not

    • maxIterations (optional, default=calculated based on the number k): the maximum number of iterations allowed before the job completes.

    • critFunClass (optional, default=oracle.spatial.hadoop.vector.cluster.kmeans.SquaredErrorCriterionFunction): a fully qualified name of a CriterionFunction subclass.

    • shapeGenClass (optional, default=oracle.spatial.hadoop.vector.cluster.kmeans.ConvexHullClusterShapeGenerator): a fully qualified name of a ClusterShapeGenerator subclass used to generate the geometry of the clusters.

    • maxMemberDistance (optional, default=undefined): a double value that specifies the maximum distance between a cluster center and a cluster member.

    Example: Run a Clustering job to generate 5 clusters. The geometry of each generated cluster will be the convex hull of all its cluster members.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.KMeansClustering input="/user/hdfs/demo_vector/tweets/part*" output=/user/hdfs/demo_vector/tweets/result inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 k=5 shapeGenClass=oracle.spatial.hadoop.vector.cluster.kmeans.ConvexHullClusterShapeGenerator
    

    2.11.17 Running a Job to Create a Binning Result

    To create a binning result, use a command in the following format:

    hadoop jar <HADOOP_LIB_PATH >/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Binning [generic options] (indexName=<INDEX_NAME> [indexMetadataDir=<INDEX_METADATA_DIRECTORY>]) | (input=<DATA_PATH> inputFormat=<INPUT_FORMAT_CLASS> recordInfoProvider=<RECORD_INFO_PROVIDER_CLASS> [srid=<SRID>] [geodetic=<GEODETIC>] [tolerance=<TOLERANCE>]) output=<RESULT_PATH> cellSize=<CELL_SIZE> gridMbr=<GRID_MBR> [cellShape=<CELL_SHAPE>] [aggrFields=<EXTRA_FIELDS>]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Binning with oracle.spatial.hadoop.vector.mapreduce.job.Binning.

    Input/output arguments:

    • indexName: the name of an existing spatial index. The index information is looked up at the path given by indexMetadataDir. When this argument is used, the argument input is ignored.

    • indexMetadataDir (optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located

    • input: the location of the input data. It can be expressed as a path, a comma separated list of paths, or a regular expression.

    • inputFormat: the inputFormat class implementation used to read the input data.

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class.

    • output: the path where the results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    Binning arguments:

    • cellSize: the size of the cells in the format: width,height

    • gridMbr: the minimum and maximum dimension values for the grid in the form: minX,minY,maxX,maxY

    • cellShape (optional, default=RECTANGLE): the shape of the cells. It can be RECTANGLE or HEXAGON

    • aggrFields (optional, default=none): a comma-separated list of field names that will be aggregated.

    Example: Run a spatial binning job to generate a grid of hexagonal cells and aggregate the value of the field SALES.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Binning indexName=indexExample indexMetadataDir=indexMetadataDir output=/user/hdfs/demo_vector/result cellShape=HEXAGON cellSize=5 gridMbr=-175,-85,175,85 aggrFields=SALES
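
    For comparison, the following sketch runs the binning job directly over GeoJSON input instead of an existing spatial index and uses rectangular cells; the input and output paths are illustrative, and the cell size follows the width,height format described above.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Binning input="/user/hdfs/demo_vector/tweets/part*" inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 output=/user/hdfs/demo_vector/binning_result cellShape=RECTANGLE cellSize=5,5 gridMbr=-175,-85,175,85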
    

    2.11.18 Running a Job to Perform Spatial Filtering

    To perform spatial filtering, use a command in the following format:

    hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialFilter [generic options] ( indexName=<indexName> [indexMetadataDir=<path>] )  | 
    ( 
    (input=<path|comma_separated_paths|path_pattern> isInputIndex=true)  | (input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass>)
    [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
    ) output=<path>  spatialOperation=<[IsInside, AnyInteract]> queryWindow=<json-geometry>
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialFilter with oracle.spatial.hadoop.vector.mapreduce.job.SpatialFilter.

    Input/output arguments:

    • indexName: the name of an existing spatial index. The index information is looked up at the path given by indexMetadataDir. When indexName is used, the input argument is ignored.

    • indexMetadataDir (optional, default=hdfs://server:port/user/<current_user>/oracle_spatial/index_metadata/): the directory where the spatial index metadata is located

    • input: the location of the input data. It can be expressed as a path, a comma-separated list of paths, or a regular expression.

    • inputFormat: the inputFormat class implementation used to read the input data.

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class.

    • output: the path where the filtering results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): Boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    Filtering arguments:

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma-separated values in the form: minX,minY,maxX,maxY

    • spatialOperation: the operation to be applied between the queryWindow and the geometries from the input data set

    • queryWindow: the geometry used to filter the input dataset.

    Example: Perform a spatial filtering operation using an existing spatial index.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialFilter indexName=indexExample indexMetadataDir=indexMetadataDir output=/user/hdfs/demo_vector/result spatialOperation=IsInside queryWindow='{"type":"Polygon", "coordinates":[[-106, 25, -106, 30, -104, 30, -104, 25, -106, 25]]}'
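
    The same filter can be applied directly to a GeoJSON data set, without a prebuilt index, by supplying the input, inputFormat, and recordInfoProvider arguments. The following sketch assumes illustrative input and output paths.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialFilter input="/user/hdfs/demo_vector/tweets/part*" inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider srid=8307 geodetic=true tolerance=0.5 output=/user/hdfs/demo_vector/filter_result spatialOperation=AnyInteract queryWindow='{"type":"Polygon", "coordinates":[[-106, 25, -106, 30, -104, 30, -104, 25, -106, 25]]}'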
    

    2.11.19 Running a Job to Get Location Suggestions

    To create a job to get location suggestions, use a command in the following format.

    hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SuggestService [generic options] input=<path|comma_separated_paths|path_pattern> inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> output=<path> mvsLocation=<path|URL> [mvsMatchLayers=<comma_separated_layers>] [mvsMatchCountry=<country_name>] [mvsSpatialResponse=<[NONE, FEATURE_GEOMETRY, FEATURE_CENTROID]>] [mvsInterfaceType=<[UNDEFINED, LOCAL, WEB]>] [mvsIsRepository=<true|false>] [mvsRebuildIndex=<true|false>] [mvsPersistentLocation=<hdfs_path>] [mvsOverwritePersistentLocation=<true|false>] [mvsMaxRequestRecords=<integer_number>]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SuggestService with oracle.spatial.hadoop.vector.mapreduce.job.SuggestService.

    Input/output arguments:

    • input: the location of the input data. It can be expressed as a path, a comma-separated list of paths, or a regular expression.

    • inputFormat: the inputFormat class implementation used to read the input data.

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class.

    • output: the path where the suggestion results will be stored

    MVSuggest arguments:

    • mvsLocation: The path to the MVSuggest directory or repository for local standalone instances of MVSuggest or the service URL when working with a remote instance. This argument is required when working with MVSuggest.

    • mvsMatchLayers (optional, default=all): comma separated list of layers. When provided, MVSuggest will only use these layers to perform the search.

    • mvsMatchCountry (optional, default=none): a country name which MVSuggest will give higher priority when performing matches.

    • mvsSpatialResponse (optional, default=CENTROID): the type of the spatial results contained in each returned match. It can be one of the following values: NONE, FEATURE_GEOMETRY, FEATURE_CENTROID.

    • mvsInterfaceType (optional, default=LOCAL): the type of MVSuggest service used. It can be LOCAL or WEB.

    • mvsIsRepository (optional, default=false) (LOCAL only): Boolean value that specifies whether mvsLocation points to a whole MVS directory (false) or only to a repository (true). An MVS repository contains only JSON templates; it may or may not contain a _config_ and _geonames_ folder.

    • mvsRebuildIndex (optional, default=false) (LOCAL only): Boolean value specifying whether the repository index should be rebuilt.

    • mvsPersistentLocation (optional, default=none) (LOCAL only): an HDFS path where the MVSuggest directory will be saved.

    • mvsOverwritePersistentLocation (optional, default=false): Boolean value that indicates whether an existing mvsPersistentLocation must be overwritten in case it already exists.

    Example: Get suggestions based on location texts from the input data set.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SuggestService input="/user/hdfs/demo_vector/tweets/part*" inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass> output=/user/hdfs/demo_vector/tweets/suggest_res mvsLocation=file:///mvs_dir mvsMatchLayers=world_continents,world_countries,world_states_provinces
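
    When a remote MVSuggest instance is used instead of a local directory, mvsInterfaceType is set to WEB and mvsLocation takes the service URL. The following sketch keeps the InputFormat and RecordInfoProvider placeholders and uses a placeholder service URL; the output path is illustrative.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SuggestService input="/user/hdfs/demo_vector/tweets/part*" inputFormat=<InputFormat_subclass> recordInfoProvider=<LocalizableRecordInfoProvider_subclass> output=/user/hdfs/demo_vector/tweets/suggest_res_web mvsInterfaceType=WEB mvsLocation=<MVSuggest_service_URL> mvsMatchLayers=world_continents,world_countries,world_states_provinces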
    

    2.11.20 Running a Job to Perform a Spatial Join

    To perform a spatial join operation on two data sets, use a command in the following format.

    hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialJoin [generic options] 
    inputList={ 
     { 
      ( indexName=<dataset1_spatial_index_name>  indexMetadataDir=<dataset1_spatial_index_metadata_dir_path> ) 
      |  
      ( input=<dataset1_path|comma_separated_paths|path_pattern> inputFormat=<dataset1_InputFormat_subclass> recordInfoProvider=<dataset1_RecordInfoProvider_subclass> )  
      [boundaries=<min_x,min_y,max_x,max_y>]  
     }  
     { 
      (indexName=<dataset2_spatial_index_name> indexMetadataDir=<dataset2_spatial_index_metadata_dir_path> 
      ) 
      |  
      ( input=<dataset2_path|comma_separated_paths|path_pattern> inputFormat=<dataset2_InputFormat_subclass> recordInfoProvider=<dataset2_RecordInfoProvider_subclass> 
      )  
      [boundaries=<min_x,min_y,max_x,max_y>]  
     } 
    } output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] boundaries=<min_x,min_y,max_x,max_y> spatialOperation=<AnyInteract|IsInside|WithinDistance> [distance=<double_value>] [samplingRatio=<decimal_value_between_0_and_1> | partitioningResult=<path>]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.SpatialJoin with oracle.spatial.hadoop.vector.mapreduce.job.SpatialJoin.

    InputList: A list of two input data sets. The list is enclosed by curly braces ({}). Each list element is an input data set, which is enclosed by curly braces. An input data set can contain the following information, depending on whether the data set is specified as a spatial index.

    If specified as a spatial index:

    • indexName: the name of an existing spatial index.

    • indexMetadataDir: the directory where the spatial index metadata is located

    If not specified as a spatial index:

    • input: the location of the input data. It can be expressed as a path, a comma-separated list of paths, or a regular expression. (Ignored if indexName is specified.)

    • inputFormat: the inputFormat class implementation used to read the input data. (Ignored if indexName is specified.)

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class. (Ignored if indexName is specified.)

    output: the path where the results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY

    • spatialOperation: the spatial operation to perform between the two input data sets. Allowed values are AnyInteract, IsInside, and WithinDistance.

    • distance: distance used for WithinDistance operations.

    Partitioning arguments:

    • samplingRatio (optional, default=0.1): ratio used to sample the data sets when partitioning needs to be performed

    • partitioningResult (optional, default=none): Path to a previously generated partitioning result file

    Example: Perform a spatial join on two data sets.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialJoin inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/spatial_join srid=8307 spatialOperation=AnyInteract boundaries=-180,-90,180,90
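
    A WithinDistance join over the same two data sets is sketched below; the distance value and output path are illustrative.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.SpatialJoin inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/spatial_join_wd srid=8307 spatialOperation=WithinDistance distance=500 boundaries=-180,-90,180,90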
    

    2.11.21 Running a Job to Perform Partitioning

    To perform a spatial partitioning, use a command in the following format.

    hadoop jar <HADOOP_LIB_PATH>/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Partitioning [generic options] 
    inputList={ 
     { 
      ( indexName=<dataset1_spatial_index_name>  indexMetadataDir=<dataset1_spatial_index_metadata_dir_path> ) 
      |  
      ( input=<dataset1_path|comma_separated_paths|path_pattern> inputFormat=<dataset1_InputFormat_subclass> recordInfoProvider=<dataset1_RecordInfoProvider_subclass> )  
      [boundaries=<min_x,min_y,max_x,max_y>]  
     }
    [  
     { 
      (indexName=<dataset2_spatial_index_name> indexMetadataDir=<dataset2_spatial_index_metadata_dir_path> 
      ) 
      |  
      ( input=<dataset2_path|comma_separated_paths|path_pattern> inputFormat=<dataset2_InputFormat_subclass> recordInfoProvider=<dataset2_RecordInfoProvider_subclass> 
      )  
      [boundaries=<min_x,min_y,max_x,max_y>]  
     }
      ...
     { 
      (indexName=<datasetN_spatial_index_name> indexMetadataDir=<datasetN_spatial_index_metadata_dir_path> 
      ) 
      |  
      ( input=<datasetN_path|comma_separated_paths|path_pattern> inputFormat=<datasetN_InputFormat_subclass> recordInfoProvider=<datasetN_RecordInfoProvider_subclass> 
      )  
      [boundaries=<min_x,min_y,max_x,max_y>]  
     }
     
    }
    ] output=<path> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] boundaries=<min_x,min_y,max_x,max_y> [samplingRatio=<decimal_value_between_0_and_1>]
    

    To use the new Hadoop API format, replace oracle.spatial.hadoop.vector.mapred.job.Partitioning with oracle.spatial.hadoop.vector.mapreduce.job.Partitioning.

    InputList: A list of one or more input data sets. The list is enclosed by curly braces ({}). Each list element is an input data set, which is also enclosed by curly braces. An input data set can contain the following information, depending on whether the data set is specified as a spatial index.

    If specified as a spatial index:

    • indexName: the name of an existing spatial index.

    • indexMetadataDir: the directory where the spatial index metadata is located

    If not specified as a spatial index:

    • input: the location of the input data. It can be expressed as a path, a comma-separated list of paths, or a regular expression. (Ignored if indexName is specified.)

    • inputFormat: the inputFormat class implementation used to read the input data. (Ignored if indexName is specified.)

    • recordInfoProvider: the recordInfoProvider implementation used to extract information from the records read by the InputFormat class. (Ignored if indexName is specified.)

    output: the path where the results will be stored

    Spatial arguments:

    • srid (optional, default=0): the spatial reference system (coordinate system) ID of the spatial data.

    • geodetic (optional, default depends on the srid): boolean value that indicates whether the geometries are geodetic or not.

    • tolerance (optional, default=0.0): double value that represents the tolerance used when performing spatial operations.

    • boundaries (optional, default=unbounded): the minimum and maximum values for each dimension, expressed as comma separated values in the form: minX,minY,maxX,maxY

    Partitioning arguments:

    • samplingRatio (optional, default=0.1): ratio used to sample the data sets when partitioning needs to be performed

    Example: Partition two data sets.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Partitioning inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/partitioning srid=8307 boundaries=-180,-90,180,90
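
    The optional samplingRatio argument controls the fraction of each data set that is sampled when computing the partitions (default 0.1). The following sketch raises it; the sampling ratio value and output path are illustrative.

    hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector.jar oracle.spatial.hadoop.vector.mapred.job.Partitioning inputList="{{input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider} {input=/user/hdfs/demo_vector/tweets/part* inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}" output=/user/hdfs/demo_vector/partitioning_sampled srid=8307 boundaries=-180,-90,180,90 samplingRatio=0.25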
    

    2.11.22 Using Multiple Inputs

    Multiple input data sets can be specified to a Vector job through the command line interface using the inputList parameter. The inputList parameter value is a group of input data sets. The inputList parameter format is as follows:

    inputList={ {input_data_set_1_params} {input_data_set_2_params} … {input_data_set_N_params} }
    

    Each individual input data set can be one of the following types (a combined example follows this list):

    • Non-file input data set: inputFormat=<InputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]

    • File input data set: input=<path|comma_separated_paths|path_pattern> inputFormat=<FileInputFormat_subclass> recordInfoProvider=<RecordInfoProvider_subclass> [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]

    • Spatial index input data set: ( ( indexName=<<indexName>> [indexMetadataDir=<<path>>]) | ( isInputIndex=<true> input=<path|comma_separated_paths|path_pattern> ) ) [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]

    • NoSQL input data set: kvStore=<kv store name> kvStoreHosts=<comma separated list of hosts> kvParentKey=<parent key> [kvConsistency=<Absolute|NoneRequired|NoneRequiredNoMaster>] [kvBatchSize=<integer value>] [kvDepth=<CHILDREN_ONLY|DESCENDANTS_ONLY|PARENT_AND_CHILDREN|PARENT_AND_DESCENDANTS>] [kvFormatterClass=<fully qualified class name>] [kvSecurity=<properties file path>] [kvTimeOut=<long value>] [kvDefaultEntryProcessor=<fully qualified class name>] [kvEntryGrouper=<fully qualified class name>] [ kvResultEntries={ { minor key 1: a minor key name relative to the major key [fully qualified class name: a subclass of NoSQLEntryProcessor class used to process the entry with the given key] } * } ] [srid=<integer_value>] [geodetic=<true|false>] [tolerance=<double_value>] [boundaries=<min_x,min_y,max_x,max_y>]
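
    For example, the following inputList value (the index name and paths are illustrative) combines a spatial index data set with a GeoJSON file data set, as could be passed to a SpatialJoin job:

    inputList="{{indexName=indexExample indexMetadataDir=indexMetadataDir} {input=/user/hdfs/demo_vector/world_countries.json inputFormat=oracle.spatial.hadoop.vector.geojson.mapred.GeoJsonInputFormat recordInfoProvider=oracle.spatial.hadoop.vector.geojson.GeoJsonRecordInfoProvider}}"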

    Notes:

    • A Categorization job does not support multiple input data sets.

    • A SpatialJoin job only supports two input data sets.

    • A SpatialIndexing job does not accept input data sets of type spatial index.

    • NoSQL input data sets can only be used when kvstore.jar is present in the classpath.

    2.12 Using Oracle Big Data Spatial and Graph Image Server Console

    You can use the Oracle Big Data Spatial and Graph Image Server Console to perform tasks such as loading images to an HDFS Hadoop cluster to create a mosaic.

    2.12.1 Loading Images to an HDFS Hadoop Cluster to Create a Mosaic

    Follow these steps to create a mosaic:

    1. Open http://<oracle_big_data_image_server_console>:8080/spatialviewer/.
    2. Type the username and password.
    3. Click the Configuration tab and review the Hadoop configuration section.

      By default the application is configured to work with the Hadoop cluster and no additional configuration is required.

      Note:

      Only an admin user can make changes to this section.

    4. Click the Hadoop Loader tab and review the displayed instructions or alerts.
    5. Follow the instructions and update the runtime configuration, if necessary.
    6. Click the Folder icon.

      The File System dialog displays the list of image files and folders.

    7. Select the folders or files as required and click OK.

      The complete path to the image file is displayed.

    8. Click Load Images.

      Wait for the images to load; a message is displayed when they have been loaded successfully.

    9. If no errors are displayed, proceed to create a mosaic.