2 Using Big Data Spatial and Graph with Spatial Data

This chapter provides conceptual and usage information about loading, storing, accessing, and working with spatial data in a Big Data environment.

2.1 About Big Data Spatial and Graph Support for Spatial Data

Spatial data represents the location characteristics of real or conceptual objects in relation to the real or conceptual space in a Geographic Information System (GIS) or other location-based application.

Oracle Big Data Spatial and Graph features enable spatial data to be stored, accessed, and analyzed quickly and efficiently for location-based decision making.

These features are used to geotag, enrich, visualize, transform, load, and process location-specific two-dimensional and three-dimensional geographic images, and to manipulate geometric shapes for GIS functions.

2.1.1 What is Big Data Spatial and Graph on Apache Hadoop?

Oracle Big Data Spatial and Graph on Apache Hadoop is a framework that uses the MapReduce programs and analytic capabilities in a Hadoop cluster to store, access, and analyze spatial data. The spatial features provide a schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial data. Big Data Spatial and Graph on Hadoop supports storing and processing spatial images, which can be geometric shapes, raster images, or vector images, stored in one of several hundred supported formats.

See Also:

Oracle Spatial and Graph Developer's Guide for an introduction to spatial concepts, data, and operations

2.1.2 Advantages of Oracle Big Data Spatial and Graph

The advantages of using Oracle Big Data Spatial and Graph include the following:

  • Unlike some of the GIS-centric spatial processing systems and engines, Oracle Big Data Spatial and Graph is capable of processing both structured and unstructured spatial information.

  • Customers are not forced or restricted to store only one particular form of data in their environment. They can store their data as spatial or nonspatial business data and still use Oracle Big Data for their spatial processing.

  • This is a framework, and therefore customers can use the available APIs to custom-build their applications or operations.

  • Oracle Big Data Spatial can process both vector and raster types of information and images.

2.1.3 Oracle Big Data Spatial Features and Functions

The spatial data is loaded for query and analysis by the Spatial Server and the images are stored and processed by an Image Processing Framework. You can use the Oracle Big Data Spatial and Graph server on Hadoop for:

  • Cataloguing the geospatial information, such as geographical map-based footprints, availability of resources in a geography, and so on.

  • Topological processing to calculate distance operations, such as nearest neighbor in a map location.

  • Categorization to build hierarchical maps of geographies and enrich the map by creating demographic associations within the map elements.

The following functions are built into Oracle Big Data Spatial and Graph:

  • Indexing function for faster retrieval of the spatial data.

  • Map function to display map-based footprints.

  • Zoom function to zoom in and out on specific geographical regions.

  • Mosaic and Group function to group a set of image files for processing, to create a mosaic, or to perform subset operations.

  • Cartesian and geodetic coordinate functions to represent the spatial data in one of these coordinate systems.

  • Hierarchical function that builds and relates geometric hierarchy, such as country, state, city, postal code, and so on. This function can process the input data in the form of documents or latitude/longitude coordinates.

2.1.4 Oracle Big Data Spatial Files, Formats, and Software Requirements

The stored spatial data or images can be in one of these supported formats:

  • GeoJSON files

  • Shapefiles

  • Both Geodetic and Cartesian data

  • Other GDAL supported formats

You must have the following software to store and process the spatial data:

  • Java runtime

  • GCC Compiler - Only when the GDAL-supported formats are used

2.2 Oracle Big Data Vector and Raster Data Processing

Oracle Big Data Spatial and Graph supports the storage and processing of both vector and raster spatial data.

2.2.1 Oracle Big Data Spatial Raster Data Processing

To process raster data, the GDAL loader loads the raster spatial data or images into an HDFS environment. The following basic operations can be performed on raster spatial data:

  • Mosaic: Combine multiple raster images to create a single mosaic image.

  • Subset: Perform subset operations on individual images.

This feature supports a MapReduce framework for raster analysis operations. Users can custom-build their own raster operations, such as performing an algebraic function on raster data; for example, calculating the slope at each base of a digital elevation model or a 3D representation of a spatial surface, such as a terrain. For details, see "Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing."

2.2.2 Oracle Big Data Spatial Vector Data Processing

This feature supports the processing of spatial vector data:

  • Loaded and stored in a Hadoop HDFS environment

  • Stored either as Cartesian or geodetic data

The stored spatial vector data can be used for performing the following query operations and more:

  • Point-in-polygon

  • Distance calculation

  • Anyinteract

  • Buffer creation

Two different data service operations are supported for the spatial vector data:

  • Data enrichment

  • Data categorization

In addition, there is limited Map Visualization API support, for the HTML5 format only. You can access these APIs to create custom operations. For details, see "Oracle Big Data Spatial Vector Analysis."

2.3 Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing

The Oracle Spatial Hadoop Image Processing Framework allows the creation of new combined images, resulting from a series of processing phases executed in parallel, with the following features:

  • HDFS Images storage, where every block size split is stored as a separate image

  • Subset and user-defined operations processed in parallel using the MapReduce framework

  • Ability to add custom processing classes to be executed in parallel in a transparent way

  • Fast processing of georeferenced images

  • Support for GDAL formats, multiple bands images, DEMs (digital elevation models), multiple pixel depths, and SRIDs

The Oracle Spatial Hadoop Image Processing Framework consists of two modules, a Loader and a Processor, each one represented by a Hadoop job running at a different stage in a cluster, as represented in the following diagram. You can also load and process the images using the Image Server web application.

[Figure: image_process_framework.png]


2.3.1 Image Loader

The Image Loader is a Hadoop job that loads a specific image or a group of images into HDFS.

  • While importing, the image is tiled, and each tile is stored as an HDFS block.

  • GDAL is used to tile the image.

  • Each tile is loaded by a different mapper, so reading is parallel and faster.

  • Each tile includes a certain number of overlapping bytes (user input), so that each tile covers an area that overlaps the adjacent tiles.

  • A MapReduce job uses a mapper to load the information for each tile. The number of mappers depends on the number of tiles, the image resolution, and the block size.

  • A single reduce phase per image puts together all the information loaded by the mappers and stores the images into a special .ohif format, which contains the resolution, bands, offsets, and image data. This way, the file offset of each tile and its node location are known.

  • Each tile contains information for every band. This is helpful when there is a need to process only a few tiles; then, only the corresponding blocks are loaded.

The following diagram represents an Image Loader process:

[Figure: image_loader_job.png]

2.3.2 Image Processor

The Image Processor is a Hadoop job that filters tiles to be processed based on the user input and performs processing in parallel to create a new image.

  • Processes specific tiles of the image identified by the user. You can specify zero, one, or multiple processing classes. After the processing classes are executed, a mosaic operation is performed to adapt the pixels to the final output format requested by the user.

  • A mapper loads the data corresponding to one tile, conserving data locality.

  • Once the data is loaded, the mapper filters the bands requested by the user.

  • The filtered information is processed and sent to the reduce phase, where the bytes are put together and the final processed image is stored into HDFS or the regular file system, depending on the user request.

The following diagram represents an Image Processor job:

[Figure: image_processor_job.png]

2.3.3 Image Server

The Image Server is a web application that enables you to load and process images from a variety of sources, especially from the Hadoop File System (HDFS). This Oracle Image Server has two main applications:

  • Raster image processing, to create catalogs from the source images and process them into a single unit. You can also view the image thumbnails.

  • Hadoop console configuration, both server and console. It connects to the Hadoop cluster to load images to HDFS for further processing.

2.4 Loading an Image to Hadoop Using the Image Loader

The first step in processing images using the Oracle Spatial and Graph Hadoop Image Processing Framework is to have the images in HDFS, separated into smart tiles. This allows the processing job to work on each tile independently. The Image Loader lets you import a single image or a collection of them into HDFS in parallel, which decreases the load time.

The Image Loader imports images from a file system into HDFS, where each block contains data for all the bands of the image, so that if further processing is required on specific positions, the information can be processed on a single node.

2.4.1 Image Loading Job

The image loading job has its own custom input format that splits the image into related image splits. The splits are calculated based on an algorithm that reads square blocks of the image covering a defined area, which is determined by

area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel

For those pieces that do not use the complete block size, the remaining bytes are filled with zeros.
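
For illustration only, the following sketch evaluates that formula with hypothetical numbers (a 64 MB block, 1 KB of metadata, 3 bands, 1 byte per pixel); none of these values are framework defaults.

// Hypothetical numbers: 64 MB HDFS block, 1 KB reserved for metadata,
// a 3-band image, 1 byte per pixel.
long blockSize = 64L * 1024 * 1024;
long metadataBytes = 1024;
int numberOfBands = 3;
int bytesPerPixel = 1;

// area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel
long area = ((blockSize - metadataBytes) / numberOfBands) / bytesPerPixel;

// side of the square block covering that area, in pixels
int side = (int) Math.floor(Math.sqrt(area));

System.out.println("Pixels per split: " + area + ", square side: " + side + " px");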

Splits are assigned to different mappers where every assigned tile is read using GDAL based on the ImageSplit information. As a result an ImageDataWritable instance is created and saved in the context.

The metadata set in the ImageDataWritable instance is used by the processing classes to set up the tiled image in order to manipulate and process it. Since the source images are read from multiple mappers, the load is performed in parallel and faster.

After the mappers complete reading, the reducer picks up the tiles from the context and puts them together to save the file into HDFS. A special reading process is required to read the image back.

2.4.2 Input Parameters

The following input parameters are supplied to the Hadoop command:

hadoop jar HADOOP_LIB_PATH/hadoop-imageloader.jar 
  -files <SOURCE_IMGS_PATH>
  -out <HDFS_OUTPUT_FOLDER>
  [-overlap <OVERLAPPING_PIXELS>]
  [-thumbnail <THUMBNAIL_PATH>]
  [-gdal <GDAL_LIBRARIES_PATH>]

Where:


HADOOP_LIB_PATH for Oracle Big Data Appliance distribution is under /opt/cloudera/parcels/CDH/lib/hadoop/lib/, and for the standard distributions it is under /usr/lib/hadoop/lib.

SOURCE_IMGS_PATH is a path to the source image(s) or folder(s). For multiple inputs use a comma separator. This path must be accessible via NFS to all nodes in the cluster.

HDFS_OUTPUT_FOLDER is the HDFS output folder where the loaded images are stored.

OVERLAPPING_PIXELS is an optional number of overlapping pixels on the borders of each tile. If this parameter is not specified, a default of two overlapping pixels is used.

THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and must be writable by the yarn user.

GDAL_LIBRARIES_PATH is an optional path for the GDAL native libraries, in case the cluster has them in a path different from the default Oracle Big Data Appliance path (/opt/cloudera/parcels/CDH/lib/hadoop/lib).

For example, this command loads all the georeferenced images under the images folder and adds an overlap of 10 pixels on every border where possible. The HDFS output folder is ohiftest, and a thumbnail of each loaded image is stored in the processtest folder. Because no -gdal flag was specified, the GDAL libraries are loaded from the default path /opt/cloudera/parcels/CDH/lib/hadoop/lib/native.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/hadoop-imageloader.jar   -files /opt/shareddir/spatial/demo/imageserver/images/hawaii.tif -out ohiftest -overlap 10 -thumbnail /opt/shareddir/spatial/processtest

By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder from which the command is being executed. This properties file may list all the configuration properties that you want to override. For example:

mapreduce.map.memory.mb=2560
mapreduce.reduce.memory.mb=2560
mapreduce.reduce.java.opts=-Xmx2684354560
mapreduce.map.java.opts=-Xmx2684354560

2.4.3 Output Parameters

The reducer generates two output files per input image. The first one is the .ohif file that concentrates all the tiles for the source image; each tile may be processed as a separate instance by a processing mapper. Internally, each tile is stored as an HDFS block; blocks are located on several nodes, and one node may contain one or more blocks of a specific .ohif file. The .ohif file is stored in the folder specified with the -out flag, under the /user/<USER_EXECUTING_JOB> folder, and can be identified as original_filename.ohif.

The second output is a related metadata file that lists all the pieces of the image and the coordinates that each of them covers. The file can be identified as SRID_dataType_original_filename.loc, where the SRID and data type are extracted from the source image. The following is an example of a couple of lines from a metadata file:

3,565758.0971152118,4302455.175478269,576788.4332668484,4291424.839326633
 
29,576757.962724993,4247455.847429363,582303.6013426667,4240485.710979952

The first element is the piece number; the second is the upper-left X, followed by the upper-left Y, lower-right X, and lower-right Y. The .loc file is stored in the metadata folder under the user's HDFS folder. If the -thumbnail flag was specified, a thumbnail of the source image is stored in the related folder. This is a way to visualize a translation of the .ohif file. Job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.
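
As an illustration only (this parsing snippet is not part of the framework), each line can be read as a comma-separated list of the piece number followed by the four corner coordinates:

// Parse one line of a .loc metadata file:
// pieceNumber,upperLeftX,upperLeftY,lowerRightX,lowerRightY
String line = "3,565758.0971152118,4302455.175478269,576788.4332668484,4291424.839326633";
String[] parts = line.split(",");

int pieceNumber = Integer.parseInt(parts[0]);
double upperLeftX = Double.parseDouble(parts[1]);
double upperLeftY = Double.parseDouble(parts[2]);
double lowerRightX = Double.parseDouble(parts[3]);
double lowerRightY = Double.parseDouble(parts[4]);

System.out.println("Piece " + pieceNumber + " covers (" + upperLeftX + ", " + upperLeftY
    + ") to (" + lowerRightX + ", " + lowerRightY + ")");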

2.5 Processing an Image Using the Oracle Spatial Hadoop Image Processor

Once the images are loaded into HDFS, they can be processed in parallel using Oracle Spatial Hadoop Image Processing Framework. You specify an output, and the framework filters the tiles to fit into that output, processes them, and puts them all together to store them into a single file. You can specify additional processing classes to be executed before the final output is created by the framework.

The Image Processor loads specific blocks of data, based on the input (mosaic description), and selects only the bands and pixels that fit into the final output. All the specified processing classes are executed, and the final output is stored into HDFS or the file system, depending on the user request.

2.5.1 Image Processing Job

The image processing job has its own custom FilterInputFormat, which determines the tiles to be processed, based on the SRID and coordinates. Only images with the same data type (pixel depth) as the mosaic input data type (pixel depth) are considered. Only the tiles that intersect with the coordinates specified by the user for the mosaic output are included. Once the tiles are selected, a custom ImageProcessSplit is created for each of them.

When a mapper receives the ImageProcessSplit, it reads the information based on what the ImageSplit specifies, performs a filter to select only the bands indicated by the user, and executes the list of processing classes defined in the request.

Each mapper process runs on the node where the data is located. Once the processing classes are executed, the final process executes the mosaic operation. The mosaic operation selects from every tile only the pixels that fit into the output and makes the necessary resolution changes to add them to the mosaic output. The resulting bytes are set in the context, wrapped in the ImageBandWritable type.

A single reducer picks up the tiles and puts them together. If the user selected HDFS output, the ImageLoader is called to store the result into HDFS. Otherwise, by default, the image is prepared using GDAL and stored in the file system.

2.5.2 Input Parameters

The following input parameters are supplied to the Hadoop command:

hadoop jar HADOOP_LIB_PATH/hadoop-imageprocessor.jar 
-catalog  <IMAGE_CATALOG_PATH>
-config  <MOSAIC_CONFIG_PATH>
[-usrlib  <USER_PROCESS_JAR_PATH>]
[-thumbnail  <THUMBNAIL_PATH>]
[-gdal  <GDAL_LIBRARIES_PATH>]

Where:


HADOOP_LIB_PATH for Oracle Big Data Appliance distribution is under /opt/cloudera/parcels/CDH/lib/hadoop/lib/, and for the standard distributions it is under /usr/lib/hadoop/lib.

IMAGE_CATALOG_PATH is the path to the catalog xml that lists the HDFS image(s) to be processed.

MOSAIC_CONFIG_PATH is the path to the mosaic configuration XML, which defines the features of the output mosaic.

USER_PROCESS_JAR_PATH is an optional user-defined jar file, which contains additional processing classes to be applied to the source images.

THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and is valid only for an HDFS output.

GDAL_LIBRARIES_PATH is an optional path for the GDAL native libraries, in case the cluster has them in a path different from the default Oracle Big Data Appliance path (/opt/cloudera/parcels/CDH/lib/hadoop/lib).

For example, this command processes all the files listed in the catalog file input.xml, using the mosaic output definition set in the testFS.xml file.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/hadoop-imageprocessor.jar -catalog /opt/shareddir/spatial/demo/imageserver/images/input.xml -config /opt/shareddir/spatial/demo/imageserver/images/testFS.xml -thumbnail /opt/shareddir/spatial/processtest

By default, the Mappers and Reducers are configured to get 2 GB of JVM, but users can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder from which the command is being executed.

2.5.2.1 Catalog XML Structure

This is an example of an input catalog XML file used to list every source image considered for the mosaic operation generated by the image processing job.

<catalog>
  <image>
    <source>HDFS</source>
    <type>File</type>
    <raster>/user/hdfs/ohif/opt/shareddir/spatial/imageserver/hawaii.tif.ohif</raster>
    <metadata>/user/hdfs/metadata/opt/shareddir/spatial/imageserver/26904_1_hawaii.tif.loc</metadata>
    <bands>3</bands>
    <defaultRed>1</defaultRed>
    <defaultGreen>2</defaultGreen>
    <defaultBlue>3</defaultBlue>
  </image>
</catalog>

A <catalog> element contains the list of <image> elements to process.

Each <image> element defines a source image or a source folder within the <raster> element. All the images within the folder are processed.

The <metadata> element specifies a .loc location file related to the source .ohif file. This location file contains the list of tiles for each source image and the coordinates covered by each tile.

The <bands> element specifies the number of bands of the image, and the <defaultRed>, <defaultGreen>, and <defaultBlue> elements specify the bands used for the first three channels of the image. In this example, band 1 is used for the red channel, band 2 for the green channel, and band 3 for the blue channel.

2.5.2.2 Mosaic Definition XML Structure

This is an example of a mosaic configuration XML file used to define the features of the mosaic output generated by the image processing job.

<mosaic>
  <output>
    <SRID>26904</SRID>
    <directory type="FS">/opt/shareddir/spatial/processOutput</directory>
    <!--directory type="HDFS">newData</directory-->
    <tempFSFolder>/opt/shareddir/spatial/tempOutput</tempFSFolder>
    <filename>littlemap</filename>
    <format>GTIFF</format>
    <width>1600</width>
    <height>986</height>
    <algorithm order="0">2</algorithm>
    <bands layers="3"/>
    <nodata>#000000</nodata>
    <pixelType>1</pixelType>
  </output>
  <crop>
    <transform>356958.985610072,280.38843650364862,0,2458324.0825054757,0,-280.38843650364862</transform>
  </crop>
  <process>
    <class>oracle.spatial.imageprocessor.hadoop.process.ImageSlope</class>
  </process>
</mosaic>

The <mosaic> element defines the specifications of the processing output.

The <output> element defines features such as the <SRID> considered for the output. All images in a different SRID are converted to the mosaic SRID in order to decide whether any of their tiles fit into the mosaic.

The <directory> element defines where the output is located. It can be in HDFS or in the regular file system (FS), as specified in the type attribute.

The <tempFSFolder> element sets the path used to store the mosaic output temporarily.

The <filename> and <format> elements specify the output filename and format.

The <width> and <height> elements set the mosaic output resolution.

The <algorithm> element sets the ordering algorithm for the images. A value of 1 means ordering by source last-modified date, and a value of 2 means ordering by image size. The order attribute specifies ascending or descending mode.

The <bands> element specifies the number of bands in the output mosaic. Images with fewer bands than this number are discarded.

The <nodata> element specifies the color in the first three bands for all the pixels in the mosaic output that have no value.

The <pixelType> element sets the pixel type of the mosaic output. Source images that do not have the same pixel type are discarded from processing.

The <crop> element defines the coordinates included in the mosaic output in the following order: startcoordinateX, pixelXWidth, RotationX, startcoordinateY, RotationY, and pixelheightY.
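
The six values appear to follow the common GDAL-style geotransform layout, so, as a sketch (the variable names and the sample pixel are illustrative only), the world coordinate of a pixel (column, row) inside the crop can be derived as follows:

// Crop transform values taken from the example above.
double startX = 356958.985610072;          // startcoordinateX
double pixelWidth = 280.38843650364862;    // pixelXWidth
double rotX = 0;                           // RotationX
double startY = 2458324.0825054757;        // startcoordinateY
double rotY = 0;                           // RotationY
double pixelHeight = -280.38843650364862;  // pixelheightY (negative: Y decreases downward)

// World coordinates of an illustrative pixel at (column, row) = (100, 50).
int col = 100, row = 50;
double worldX = startX + col * pixelWidth + row * rotX;
double worldY = startY + col * rotY + row * pixelHeight;

System.out.println("Pixel (" + col + "," + row + ") -> (" + worldX + ", " + worldY + ")");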

The <process> element lists all the classes to execute before the mosaic operation. In this example, a slope calculation is applied; it is valid only for 32-bit images of Digital Elevation Model (DEM) files.

You can specify any other user-created processing classes. When no processing class is defined, only the mosaic operation is performed.

2.5.3 Job Execution

The first step of the job is to filter the tiles that would fit into the mosaic. As a start, the location files listed in the catalog XML are sent to the InputFormat. The location file name has the following structure: SRID_pixelType_fileName.loc.

By extracting the pixelType, the filter decides whether the related source image is valid for processing. Based on the user definition made in the catalog XML, one of the following happens:

  • If the image is valid for processing, then the SRID is evaluated next.

  • If the SRID is different from the user definition, then the MBR coordinates of every tile are converted into the user SRID and evaluated.

This way, every tile is evaluated for intersection with mosaic definition. Only the intersecting tiles are selected, and a split is created for each one of them.

A mapper processes each split on the node where it is stored. The mapper executes the sequence of processing classes defined by the user, and then the mosaic process is executed. A single reducer puts together the result of the mappers and stores the image into FS or HDFS, as requested by the user. If the user requested the output to be stored into HDFS, then the ImageLoaderOverlap job is invoked to store the image as a .ohif file.

By default, the Mappers and Reducers are configured to get 2 GB of JVM, but you can override these settings or any other job configuration properties by adding an imagejob.prop properties file in the same folder from which the command is being executed.

2.5.4 Processing Classes and ImageBandWritable

The processing classes specified in the mosaic XML must follow a set of rules to be correctly processed by the job. All the processing classes must implement the ImageProcessorInterface and manipulate the ImageBandWritable type properties correctly.

The ImageBandWritable instance defines the content of a tile, such as resolution, size, and pixels. These values must be reflected in the properties that create the definition of the tile. The integrity of the mosaic output depends on the correct manipulation of these properties.

Table 2-1 ImageBandWritable Properties

Type - Property: Description

IntWritable dstWidthSize: Width size of the tile
IntWritable dstHeightSize: Height size of the tile
IntWritable bands: Number of bands in the tile
IntWritable dType: Data type of the tile
IntWritable offX: Starting X pixel, in relation to the source image
IntWritable offY: Starting Y pixel, in relation to the source image
IntWritable totalWidth: Width size of the source image
IntWritable totalHeight: Height size of the source image
IntWritable bytesNumber: Number of bytes containing the pixels of the tile and stored into baseArray
BytesWritable[] baseArray: Array containing the bytes representing the tile pixels; each cell represents a band
IntWritable[][] basePaletteArray: Array containing the int values representing the tile palette; each array represents a band. Each integer represents an entry for each color in the color table; there are four entries per color
IntWritable[] baseColorArray: Array containing the int values representing the color interpretation; each cell represents a band
DoubleWritable[] noDataArray: Array containing the NODATA values for the image; each cell contains the value for the related band
ByteWritable isProjection: Specifies whether the tile has projection information with Byte.MAX_VALUE
ByteWritable isTransform: Specifies whether the tile has the geo transform array information with Byte.MAX_VALUE
ByteWritable isMetadata: Specifies whether the tile has metadata information with Byte.MAX_VALUE
IntWritable projectionLength: Specifies the projection information length
BytesWritable projectionRef: Specifies the projection information in bytes
DoubleWritable[] geoTransform: Contains the geo transform array
IntWritable metadataSize: Number of metadata values in the tile
IntWritable[] metadataLength: Array specifying the length of each metadata value
BytesWritable[] metadata: Array of metadata of the tile
GeneralInfoWritable mosaicInfo: The user-defined information in the mosaic XML. Do not modify the mosaic output features here; instead, modify the original XML file under a new name and run the process using the new XML


Processing Classes and Methods

When modifying the pixels of the tile, first get the bytes into an array using the following method:

byte [] bandData1 = img.getBand(0);

The bytes representing the tile pixels of band 1 are now in the bandData1 array. The base index is zero.

If the tile has a Float32 data type, then use the following method to get the pixels in a float array:

float[] floatBand1 = img.getFloatBand(0);

For any other data types, you must convert the byte array into the corresponding type using GDAL.

After processing the pixels, if the same instance of ImageBandWritable must be used, then execute the following method:

img.removeBands();

This removes the content of previous bands, and you can start adding the new bands. To add a new band use the following method:

img.addBand(Object band);

Where band is a byte or float array containing the pixel information. Do not forget to update the instance size, data type, bytesNumber, and any other property that might be affected as a result of the processing operation. Setters are available for each property.
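
As a minimal sketch only: the getBand(), removeBands(), and addBand() methods come from this section, while the process() method name, the getter used for the band count, and the class packaging are assumptions to be checked against the API. A processing class that inverts the pixel values of every byte band might look like this:

// Sketch of a processing class that inverts every byte band of a tile.
// ImageProcessorInterface and ImageBandWritable are the framework types named
// in this section; the process(ImageBandWritable) signature and the
// getBands() getter are assumptions.
public class InvertBands implements ImageProcessorInterface {

    @Override
    public void process(ImageBandWritable img) {
        int bands = img.getBands().get();      // assumed getter for the bands property

        // Read and invert the bytes of every band.
        byte[][] inverted = new byte[bands][];
        for (int b = 0; b < bands; b++) {
            byte[] bandData = img.getBand(b);  // zero-based band index
            byte[] out = new byte[bandData.length];
            for (int i = 0; i < bandData.length; i++) {
                out[i] = (byte) (255 - (bandData[i] & 0xFF));
            }
            inverted[b] = out;
        }

        // Reuse the same ImageBandWritable instance: clear the old bands, add the new ones.
        img.removeBands();
        for (int b = 0; b < bands; b++) {
            img.addBand(inverted[b]);
        }
        // Size, data type, and bytesNumber are unchanged by this operation,
        // so no further property updates are needed here.
    }
}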

2.5.4.1 Location of the Classes and Jar Files

  • All the processing classes must be contained in a single jar file if you are using the Oracle Image Server Console.

  • The processing classes might be placed in different jar files if you are using the command line option.

  • In both cases, for Oracle Big Data Appliance, the jar files must be placed under /opt/cloudera/parcels/CDH/lib/hadoop/lib/.

Once new classes are visible in the classpath, they must be added to the mosaic XML, in the <process><class> section. Every <class> element added is executed in order of appearance before the final mosaic operation is performed.

2.5.5 Output

When you specify an HDFS directory in the mosaic configuration XML, the output generated is an .ohif file, as in the case of an ImageLoader job.

When you specify an FS directory in the mosaic configuration XML, the output generated is an image with the filename and type specified, stored in the regular file system.

In both scenarios, the output must comply with the specifications set in the mosaic configuration XML. The job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.

2.6 Oracle Big Data Spatial Vector Analysis

Oracle Big Data Spatial Vector Analysis is a Spatial Vector Analysis API that runs as a Hadoop job and provides MapReduce components for spatial processing of data stored in HDFS. These components make use of the Spatial Java API to perform spatial analysis tasks. A web console is provided along with the API. The supported features, described in the following subsections, include spatial indexing, spatial filtering, hierarchical classification of data, and buffer generation, together with the RecordInfoProvider component that ties the API to the user's data format.

2.6.1 Spatial Indexing

A spatial index is generated as a Hadoop MapFile, in the form of key/value pairs. Each MapFile entry contains a spatial index for one split of the original data. The key and value pair contain the following information:

  • Key: a split identifier in the form: path + start offset + length.

  • Value: a spatial index structure containing the actual indexed records.

The following figure depicts a spatial index in relation to the user data. The records are represented as r1, r2, and so on.

[Figure: spatial_index_rep.png]
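
To inspect a generated index, a sketch like the following could iterate the entries of one of the output MapFiles and print the split identifier keys; the path and class name are illustrative, and the value class is resolved by reflection rather than referenced directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SpatialIndexDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path: one MapFile produced by the spatial indexing job.
    MapFile.Reader reader = new MapFile.Reader(fs, "/user/data_spatial_index/part-00000", conf);
    Text key = new Text();
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

    while (reader.next(key, value)) {
      // Each key is a split identifier: path + start offset + length.
      System.out.println("Split: " + key);
    }
    reader.close();
  }
}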


2.6.1.1 Spatial Indexing Class Structure

Records in a spatial index are represented using the class oracle.spatial.hadoop.vector.RecordInfo. A RecordInfo typically contains a subset of the original record data and a way to locate the record in the file where it is stored. The specific RecordInfo data depends on two things:

  • InputFormat used to read the data

  • RecordInfoProvider implementation, which provides the record's data

The fields contained within a RecordInfo are:

  • Id: Text field with the record Id.

  • Geometry: JGeometry field with the record geometry.

  • Extra fields: Additional optional fields of the record can be added as name-value pairs. The values are always represented as text.

  • Start offset: The position of the record in a file as a byte offset. This value depends on the InputFormat used to read the original data.

  • Length: The original record length in bytes.

  • Path: The file path can be added optionally. This is optional because the file path can be known from the spatial index entry key. However, to add the path to the RecordInfo instances when a spatial index is created, set the value of the configuration property oracle.spatial.recordInfo.includePathField to true.

2.6.1.2 Configuration for Creating a Spatial Index

A spatial index is created using a combination of FileSplitInputFormat, SpatialIndexingMapper, InputFormat, and RecordInfoProvider, where the last two are provided by the user. The following code example shows part of the configuration needed to run a job that creates a spatial index for the data located in the HDFS folder /user/data.

//input
 
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, new Path("/user/data"));
FileSplitInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);
FileSplitInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);
 
//output
 
conf.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/user/data_spatial_index"));
 
//mapper
 
conf.setMapperClass(SpatialIndexingMapper.class); 
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RTreeWritable.class);

In this example,

  • The FileSplitInputFormat is set as the job InputFormat. FileSplitInputFormat is a subclass of CompositeInputFormat, an abstract class which uses another InputFormat implementation (internalInputFormat) to read the data. The internal InputFormat and the RecordInfoProvider implementations are specified by the user and they are set to TextInputFormat and TwitterLogRecordInfoProvider respectively.

  • The OutputFormat used is MapFileOutputFormat, so the resulting output is a MapFile.

  • The mapper is set to SpatialIndexingMapper. The mapper output key and value types are Text (split identifiers) and RTreeWritable (the actual spatial indices).

  • No reducer class is specified, so the job runs with the default reducer. The reduce phase is needed to sort the output MapFile keys.

Alternatively, this configuration can be set more easily by using the SpatialIndexing class. SpatialIndexing is a job driver that creates a spatial index. In the following example, a SpatialIndexing instance is created, set up, and used to add the settings to the job configuration by calling the configure() method. Once the configuration has been set, the job is launched.

SpatialIndexing<LongWritable, Text> spatialIndexing = new SpatialIndexing<LongWritable, Text>();
 
//path to input data
 
spatialIndexing.setInput("/user/data");
 
//path of the spatial index to be generated
 
spatialIndexing.setOutput("/user/data_spatial_index");
 
//input format used to read the data
 
spatialIndexing.setInputFormatClass(TextInputFormat.class);
 
//record info provider used to extract records information
 
spatialIndexing.setRecordInfoProviderClass(TwitterLogRecordInfoProvider.class);
 
//add the spatial indexing configuration to the job configuration
 
spatialIndexing.configure(jobConf);
 
//run the job
 
JobClient.runJob(jobConf);

2.6.1.3 Input Formats for Spatial Index

An InputFormat must meet the following requisites to be supported:

  • It must be a subclass of FileInputFormat.

  • The getSplits() method must return either FileSplit or CombineFileSplit split types.

  • The RecordReader provided by the InputFormat must implement the getPos() method, which returns the current position, so that a record in the spatial index can be tracked back to its original record in the user file. If the current position is not returned, then the original record cannot be found using the spatial index.

    However, the spatial index can still be created and used in operations that do not require the original record to be read. For example, additional fields can be added as extra fields to avoid having to read the whole original record.

    Note:

    The spatial indices are created for each split as returned by the getSplits() method. When the spatial index is used for filtering (see Spatial Filtering), it is recommended to use the same InputFormat implementation as the one used to create the spatial index, to ensure the split indices can be found.

The getPos() method has been removed from the new Hadoop API; however, org.apache.hadoop.mapreduce.lib.input.TextInputFormat and CombineTextInputFormat are supported, and it is still possible to get the record start offsets.

Other input formats from the new API are supported, but the record start offsets will not be contained in the spatial index. Therefore, it is not possible to find the original records. The requisites for a new API input format are the same as for the old API. However, it must be translated to the new API's FileInputFormat, FileSplit, and CombineFileSplit. The following example shows an input format from the new Hadoop API being used as the internal input format:

AdapterInputFormat.setInternalInputFormatClass(conf, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
spatialIndexing.setInputFormatClass(AdapterInputFormat.class);

2.6.1.4 MVSuggest for Locating Records

MVSuggest can be used at the time of spatial indexing to get an approximate location for records that do not have a geometry but have some text field. This text field can be used to determine the record location. The geometry returned by MVSuggest is used to include the record in the spatial index.

Since it is important to know the field containing the search text for every record, the RecordInfoProvider implementation must also implement LocalizableRecordInfoProvider. For more information see the "LocalizableRecordInfoProvider."

MVSuggest can be used in the earlier spatial indexing example by calling the following method in the SpatialIndexing class:

spatialIndexing.setMVSUrl("http://host:port");

If the MVSuggest service is deployed at every cluster node, then the host can be set to localhost. The layers used by MVSuggest to perform the search can be specified by passing the names of the layers as an array of strings to the following method:

spatialIndexing.setMatchingLayers(new String[]{"world_continents", "world_countries", "world_cities"});

These settings can also be added directly to the job configuration using the following parameters:

conf.set(ConfigParams.MVS_URL, "http://host:port");
conf.set(ConfigParams.MVS_MATCHING_LAYERS, "world_continents, world_countries, world_cities");

2.6.2 Spatial Filtering

Once the spatial index has been generated, it can be used to spatially filter the data. The filtering is performed before the data reaches the mapper, while it is being read. The following code example demonstrates how the SpatialFilterInputFormat is used to spatially filter the data.

//set input path and format
 
FileInputFormat.setInputPaths(conf, new Path("/user/data/"));
conf.setInputFormat(SpatialFilterInputFormat.class);
 
//set internal input format
 
SpatialFilterInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);
if( spatialIndexPath != null )  
{
 
     //set the path to the spatial index and put it in the distributed cache
 
     boolean useDistributedCache = true;
     SpatialFilterInputFormat.setSpatialIndexPath(conf, spatialIndexPath, useDistributedCache);
} 
else 
{
     //as no spatial index is used a RecordInfoProvider is needed
 
     SpatialFilterInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);
}
 
//set spatial operation used to filter the records
 
SpatialFilterInputFormat.setSpatialOperation(conf, SpatialOperation.IsInside);
SpatialFilterInputFormat.setSpatialQueryWindow
(conf,"{\"type\":\"Polygon\", \"coordinates\":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997 , -106.64595, 25.83997]]}");
SpatialFilterInputFormat.setSRID(conf, 8307);
SpatialFilterInputFormat.setTolerance(conf, 0.5);
SpatialFilterInputFormat.setGeodetic(conf, true);

SpatialFilterInputFormat has to be set as the job's InputFormat. The InputFormat that actually reads the data must be set as the internal InputFormat. In this example, the internal InputFormat is TextInputFormat.

If a spatial index is specified, it is used for filtering. Otherwise, a RecordInfoProvider must be specified in order to get the records' geometries, in which case the filtering is performed record by record.

As a final step, the spatial operation and query window to perform the spatial filter are set. It is recommended to use the same internal InputFormat implementation used when the spatial index was created or, at least, an implementation that uses the same criteria to generate the splits. For details see "Input Formats for Spatial Index."

2.6.2.1 Filtering Records

The following steps are executed when records are filtered using the SpatialFilterInputFormat and a spatial index.

  1. The SpatialFilterInputFormat getRecordReader() method is called when the mapper requests a RecordReader for the current split.

  2. The spatial index for the current split is retrieved.

  3. A spatial query is performed over the records contained in the split, using the spatial index.

    As a result, the ranges in the split that contain records meeting the spatial filter are known. For example, if a split goes from the file position 1000 to 2000, upon executing the spatial filter it can be determined that records that fulfill the spatial condition are in the ranges 1100-1200, 1500-1600, and 1800-1950. So the result of performing the spatial filtering at this stage is a subset of the original split containing smaller splits.

  4. An internal InputFormat RecordReader is requested for every small split from the resulting split subset.

  5. A RecordReader is returned to the caller mapper. The returned RecordReader is actually a wrapper RecordReader with one or more RecordReaders returned by the internal InputFormat.

  6. Every time the mapper calls the RecordReader, the call to the next method to read a record is delegated to the internal RecordReader.

These steps are shown in the following spatial filter interaction diagram.

[Figure: spatial_interact_diag.png]

2.6.3 Classifying Data Hierarchically

The Vector Analysis API provides a way to classify the data into hierarchical entities. For example, given a set of catalogs with defined levels of administrative boundaries such as continents, countries, and states, it is possible to join a record of the user data to a record of each level of the hierarchy data set. The following example generates a summary count for each hierarchy level, containing the number of user records per continent, country, and state or province:

HierarchicalCount<LongWritable,Text> hierCount = new HierarchicalCount<LongWritable,Text>();
 
//when spatial option is set the job will use a spatial index
 
hierCount.setOption(Option.parseOption(Option.Spatial));
 
//set the path to the spatial index
 
hierCount.setInput(""/user/data_spatial_index/"");
 
//set the job's output
 
hierCount.setOutput("/user/hierarchy_count");
 
//set HierarchyInfo implementation which describes the world administrative boundaries hierarchy
 
hierCount.setHierarchyInfoClass( WorldAdminHierarchyInfo.class );
 
//specify the paths of the hierarchy data
 
Path[] hierarchyDataPaths = {
new Path("file:///home/user/catalogs/world_continents.json"), 
new Path("file:///home/user/catalogs/world_countries.json"), 
new Path("file:///home/user/catalogs/world_states_provinces.json")};
hierCount.setHierarchyDataPaths(hierarchyDataPaths);
 
//set the path where the index for the previous hierarchy data will be generated
 
hierCount.setHierarchyDataIndexPath("/user/hierarchy_data_index/");
 
//setup the spatial operation which will be used to join records from the two datasets (spatial index and hierarchy data).
 
hierCount.setSpatialOperation(SpatialOperation.IsInside);
hierCount.setSrid(8307);
hierCount.setTolerance(0.5);
hierCount.setGeodetic(true);
 
//add the previous setup to the job configuration
 
hierCount.configure(conf);
 
//run the job
 
RunningJob rj = JobClient.runJob(conf);
 
//call jobFinished to properly rename the output files
 
hierCount.jobFinished(rj.isSuccessful());

This example uses the HierarchicalCount demo job driver to facilitate the configuration. The configuration can be divided into the following categories:

  • Input data: This is a previously generated spatial index.

  • Output data: This is a folder which contains the summary counts for each hierarchy level.

  • Hierarchy data configuration: This contains the following:

    • HierarchyInfo class: This is an implementation of the HierarchyInfo class, in charge of describing the current hierarchy data. It provides the number of hierarchy levels, the level names, and the data contained at each level.

    • Hierarchy data paths: This is the path to each one of the hierarchy catalogs. These catalogs are read by the HierarchyInfo class.

    • Hierarchy index path: This is the path where the hierarchy data index is stored. Hierarchy data needs to be preprocessed to know the parent-child relationships between hierarchy levels. This information is processed once and saved in the hierarchy index, so it can be used later by the current job or even by other jobs.

  • Spatial operation configuration: This is the spatial operation to be performed between records of the user data and the hierarchy data in order to join both datasets. The parameters to set here are the Spatial Operation type (IsInside), SRID (8307), Tolerance (0.5 meters), and whether the geometries are Geodetic (true).

The HierarchicalCount.jobFinished() method is called at the end to properly name the reducer output, so that only one file is left for each hierarchy level instead of multiple files with the -r-NNNNN suffix. Internally, the HierarchicalCount.configure() method sets the mapper and reducer to SpatialHierarchicalCountMapper and SpatialHierarchicalCountReducer, respectively. SpatialHierarchicalCountMapper's output key is a hierarchy entry identifier in the form hierarchy_level + hierarchy_entry_id. The mapper output value is a single count for each output key. The reducer sums up all the counts for each key.

Note:

The entire hierarchy data may be loaded into memory; therefore, the total size of all the catalogs is expected to be significantly less than the user data. The hierarchy data size should not be larger than a couple of gigabytes.

If you want another type of output instead of counts (for example, a list of user records according to the hierarchy entry), the SpatialHierarchicalJoinMapper can be used. The SpatialHierarchicalJoinMapper output value is a RecordInfo instance, which can be gathered in a user-defined reducer to produce a different output. The following user-defined reducer generates a MapFile for each hierarchy level using the MultipleOutputs class. Each MapFile has the hierarchy entry ids as keys and, as values, ArrayWritable instances containing the matching records for each hierarchy entry; it returns a list of records by hierarchy entry:

public class HierarchyJoinReducer extends MapReduceBase implements Reducer<Text, RecordInfo, Text, ArrayWritable> {
 
       private MultipleOutputs mos = null;
       private Text outKey = new Text();
       private ArrayWritable outValue = new ArrayWritable( RecordInfo.class );
 
       @Override
       public void configure(JobConf conf) 
       {
         super.configure(conf);
 
         //use MultipleOutputs to generate different outputs for each hierarchy level
    
         mos = new MultipleOutputs(conf);
        }
        @Override
        public void reduce(Text key, Iterator<RecordInfo> values,
                          OutputCollector<Text, ArrayWritable> output, Reporter reporter)
                          throws IOException 
        {
 
          //Get the hierarchy level name and the hierarchy entry id from the key
 
          String[] keyComponents = HierarchyHelper.getMapRedOutputKeyComponents(key.toString());
          String hierarchyLevelName = keyComponents[0];
          String entryId = keyComponents[1];
          List<Writable> records = new LinkedList<Writable>();
 
          //load the values to memory to fill output ArrayWritable
      
          while(values.hasNext())
          {
            RecordInfo recordInfo = new RecordInfo( values.next() );
            records.add( recordInfo );          
          }
          if(!records.isEmpty())
          {
 
            //set the hierarchy entry id as key
 
            outKey.set(entryId);
 
            //list of records matching the hierarchy entry id
 
            outValue.set( records.toArray(new Writable[]{} ) );
 
            //get the named output for the given hierarchy level
 
            hierarchyLevelName = FileUtils.toValidMONamedOutput(hierarchyLevelName);
            OutputCollector<Text, ArrayWritable> mout = mos.getCollector(hierarchyLevelName, reporter);
 
           //Emit key and value
 
           mout.collect(outKey, outValue);
          }
}
 
        @Override
        public void close() throws IOException 
        {
          mos.close();
        }
}

The same reducer can be used in a job with the following configuration to generate a list of records according to the hierarchy levels:

JobConf conf = new JobConf(getConf());
 
//input path
 
FileInputFormat.setInputPaths(conf, new Path("/user/data_spatial_index/") );
 
//output path
 
FileOutputFormat.setOutputPath(conf, new Path("/user/records_per_hier_level/") );
 
//input format used to read the spatial index
 
conf.setInputFormat( SequenceFileInputFormat.class);
 
//output format: the real output format will be configured for each multiple output later
 
conf.setOutputFormat(NullOutputFormat.class);
 
//mapper
 
conf.setMapperClass( SpatialHierarchicalJoinMapper.class );
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(RecordInfo.class);
 
//reducer
 
conf.setReducerClass( HierarchyJoinReducer.class );
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(ArrayWritable.class);
 
////////////////////////////////////////////////////////////////////
 
//hierarchy data setup
 
//set HierarchyInfo class implementation
 
conf.setClass(ConfigParams.HIERARCHY_INFO_CLASS, WorldAdminHierarchyInfo.class, HierarchyInfo.class);
 
//paths to hierarchical catalogs
 
Path[] hierarchyDataPaths = {
new Path("file:///home/user/catalogs/world_continents.json"), 
new Path("file:///home/user/catalogs/world_countries.json"), 
new Path("file:///home/user/catalogs/world_states_provinces.json")};
 
//path to hierarchy index
 
Path hierarchyDataIndexPath = new Path("/user/hierarchy_data_index/");
 
//instantiate the HierarchyInfo class to index the data if needed.
 
HierarchyInfo hierarchyInfo = new WorldAdminHierarchyInfo();
hierarchyInfo.initialize(conf);
 
//Create the hierarchy index if needed. If it already exists, it will only load the hierarchy index to the distributed cache
 
HierarchyHelper.setupHierarchyDataIndex(hierarchyDataPaths, hierarchyDataIndexPath, hierarchyInfo, conf);
 
///////////////////////////////////////////////////////////////////
 
//setup the multiple named outputs:
 
int levels = hierarchyInfo.getNumberOfLevels();
for(int i=1; i<=levels; i++)
{
    String levelName = hierarchyInfo.getLevelName(i);
 
    //the hierarchy level name is used as the named output
 
    String namedOutput = FileUtils.toValidMONamedOutput(levelName);
    MultipleOutputs.addNamedOutput(conf, namedOutput, MapFileOutputFormat.class, Text.class, ArrayWritable.class);
}
 
//finally, setup the spatial operation
 
conf.setInt(ConfigParams.SPATIAL_OPERATION, SpatialOperation.IsInside.getOrdinal());
conf.setInt(ConfigParams.SRID, 8307);
conf.setFloat(ConfigParams.SPATIAL_TOLERANCE, 0.5f);
conf.setBoolean(ConfigParams.GEODETIC, true);
 
//run job
 
JobClient.runJob(conf);

Suppose that the output value should be an array of record ids instead of an array of RecordInfo instances. It is enough to perform a couple of changes in the previously defined reducer.

The line where outValue is declared, in the previous example, changes to:

private ArrayWritable outValue = new ArrayWritable(Text.class);

The loop where the input values are retrieved, in the previous example, is changed so that the record ids are retrieved instead of the whole records:

while(values.hasNext())
{
  records.add( new Text(values.next().getId()) );
}

Although only the record id is needed, the mapper emits the whole RecordInfo instance. Therefore, a better approach is to change the mapper's output value. The mapper's output value can be changed by extending AbstractSpatialJoinMapper. In the following example, the mapper emits only the record ids instead of the whole RecordInfo instance every time a record matches some of the hierarchy entries:

public class IdSpatialHierarchicalMapper extends AbstractSpatialHierarchicalMapper< Text > 
{
       Text outValue = new Text();
 
       @Override
       protected Text getOutValue(RecordInfo matchingRecordInfo) 
       {
 
         //the out value is the record's id
 
         outValue.set(matchingRecordInfo.getId());
         return outValue;
       }
}

2.6.3.1 Changing the Hierarchy Level Range

By default, all the hierarchy levels defined in the HierarchyInfo implementation are loaded when performing the hierarchy search. The range of hierarchy levels loaded is from level 1 (the parent level) to the level returned by the HierarchyInfo.getNumberOfLevels() method. The following example shows how to set up a job to load only levels 2 and 3:

conf.setInt( ConfigParams.HIERARCHY_LOAD_MIN_LEVEL, 2);
conf.setInt( ConfigParams.HIERARCHY_LOAD_MAX_LEVEL, 3);

Note:

These parameters are useful when only a subset of the hierarchy levels is required and you do not want to modify the HierarchyInfo implementation.

2.6.3.2 Controlling the Search Hierarchy

The search is always performed only at the bottom hierarchy level (the higher level number). If a user record matches some hierarchy entry at this level, then the match is propagated to the parent entry in upper levels. For example, if a user record matches Los Angeles, then it also matches California, USA, and North America. If there are no matches for a user record at the bottom level, then the search does not continue into the upper levels.

This behavior can be modified by setting the configuration parameter ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS to true. Then, if a search at the bottom hierarchy level results in some unmatched user records, the search continues into the upper levels until the top hierarchy level is reached or there are no more user records to join. This behavior can be used when the geometries of parent levels do not perfectly enclose the geometries of their child entries.
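
For example, a sketch of enabling this on the job configuration, assuming the parameter is a boolean property set like the other ConfigParams flags shown earlier:

//continue searching upper hierarchy levels for records unmatched at the bottom level
conf.setBoolean(ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS, true);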

2.6.3.3 Using MVSuggest to Classify the Data

MVSuggest can be used instead of the spatial index to classify data. In this case, an implementation of LocalizableRecordInfoProvider must be provided so that the search text can be sent to MVSuggest to perform the search. See the information about LocalizableRecordInfoProvider.

In the following example, the program option is changed from spatial to MVS. The input is the path to the user data instead of the spatial index. The InputFormat used to read the user records and an implementation of LocalizableRecordInfoProvider are specified, and the MVSuggest service URL is set. Notice that no spatial operation configuration is needed in this case, as shown in the following code listing:

HierarchicalCount<LongWritable,Text> hierCount = new HierarchicalCount<LongWritable,Text>();

//set option to MVS

hierCount.setOption(Option.parseOption(Option.MVS));

//the input path is the user's data

hierCount.setInput(""/user/data/"");

//set the job's output

hierCount.setOutput("/user/mvs_hierarchy_count");

//set HierarchyInfo implementation which describes the world administrative boundaries hierarchy

hierCount.setHierarchyInfoClass( WorldAdminHierarchyInfo.class );

//specify the paths of the hierarchy data

Path[] hierarchyDataPaths = {
new Path("file:///home/user/catalogs/world_continents.json"), 
new Path("file:///home/user/catalogs/world_countries.json"), 
new Path("file:///home/user/catalogs/world_states_provinces.json")};
hierCount.setHierarchyDataPaths(hierarchyDataPaths);

//set the path where the index for the previous hierarchy data will be generated

hierCount.setHierarchyDataIndexPath("/user/hierarchy_data_index/");

//No spatial operation configuration is needed. Instead, specify the InputFormat used to read the user's data and the LocalizableRecordInfoProvider class.

hierCount.setInputFormatClass( TextInputFormat.class );
hierCount.setRecordInfoProviderClass( MyLocalizableRecordInfoProvider.class );

//finally, set the MVSuggest service URL

hierCount.setMVSUrl("http://localhost:8080");

//add the previous setup to the job configuration
hierCount.configure(conf);

//run the job

RunningJob rj = JobClient.runJob(conf);

//call jobFinished to properly rename the output files

hierCount.jobFinished(rj.isSuccessful());

Note:

When using MVSuggest, the hierarchy data files must be the same as the layer files used by MVSuggest. The hierarchy level names returned by the HierarchyInfo.getLevelNames() method are used as the matching layers by MVSuggest.

2.6.4 Generating Buffers

The API provides a mapper and a demo job to generate a buffer around each record's geometry. The following code sample shows how to run a job to generate a buffer around each record geometry.

Buffer<LongWritable, Text> buffer = new Buffer<LongWritable, Text>();
 
//set path to input data
 
buffer.setInput("/user/waterlines/");
 
//set path to the resulting data
 
buffer.setOutput("/user/data_buffer/");          
 
//set input format used to read the data
 
buffer.setInputFormatClass( TextInputFormat.class );
 
//set the record info provider implementation
 
buffer.setRecordInfoProviderClass(WorldSampleLineRecordInfoProvider.class);
 
//set the width of the buffers to be generated
 
buffer.setBufferWidth(Double.parseDouble(args[4]));
 
//set the previous configuration to the job configuration
 
buffer.configure(conf);
 
//run the job
 
JobClient.runJob(conf);

This example job uses BufferMapper as the job mapper. BufferMapper generates a buffer for each input record containing a geometry. The output key and values are the record id and a RecordInfo instance containing the generated buffer. The resulting file is a Hadoop MapFile containing the mapper output key and values.

BufferMapper accepts the following parameters:

Parameter | ConfigParam constant | Type | Description
oracle.spatial.buffer.width | BUFFER_WIDTH | double | The buffer width
oracle.spatial.buffer.sma | BUFFER_SMA | double | The semi major axis for the datum used in the coordinate system of the input
oracle.spatial.buffer.iFlat | BUFFER_IFLAT | double | The flattening value
oracle.spatial.buffer.arcT | BUFFER_ARCT | double | The arc tolerance used for geodetic densification
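
These parameters can also be set directly in the job configuration using the property names listed above (a minimal sketch; conf is the job's JobConf and the numeric values shown are placeholders only):

//buffer width
conf.setDouble("oracle.spatial.buffer.width", 5.0);
//semi major axis (placeholder value)
conf.setDouble("oracle.spatial.buffer.sma", 6378137.0);
//flattening value (placeholder value)
conf.setDouble("oracle.spatial.buffer.iFlat", 298.257223563);
//arc tolerance for geodetic densification (placeholder value)
conf.setDouble("oracle.spatial.buffer.arcT", 0.05);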

2.6.5 RecordInfoProvider

A record read by a MapReduce job from HDFS is represented in memory as a key-value pair using a Java type, typically a Writable subclass, such as LongWritable, Text, ArrayWritable, or some user-defined type. For example, records read using TextInputFormat are represented in memory as LongWritable, Text key-value pairs.

RecordInfoProvider is the component that interprets these in-memory record representations and returns the data needed by the Vector Analysis API. Thus, the API is not tied to any specific format or memory representation.

The RecordInfoProvider interface has the following methods:

  • void setCurrentRecord(K key, V value)

  • String getId()

  • JGeometry getGeometry()

  • boolean getExtraFields(Map<String, String> extraFields)

When a record is read from HDFS, the setCurrentRecord() method sets the current record in the RecordInfoProvider instance. The key-value pair passed as arguments is the same as the pair returned by the RecordReader provided by the InputFormat. The RecordInfoProvider is used to get the current record id, geometry, and extra fields. None of these fields is required. Only records with a geometry participate in the spatial operations. The id is useful for differentiating records in operations such as categorization. The extra fields can be used to store any record information that can be represented as text and that needs to be accessed quickly without reading the original record, or for operations where MVSuggest is used.

Typically, the information returned by RecordInfoProvider is used to populate RecordInfo instances. A RecordInfo can be thought of as a lightweight version of a record; it contains the information returned by the RecordInfoProvider plus information to locate the original record in a file.

2.6.5.1 Sample RecordInfo Implementation

The following is a JsonRecordInfoProvider implementation, which takes text records in JSON format that are read using TextInputFormat. A sample record is shown here:

{ "_id":"ABCD1234", "location":" 119.31669, -31.21615", "locationText":"Boston, Ma", "date":"03-18-2015", "time":"18:05", "device-type":"cellphone", "device-name":"iPhone"}

When a JsonRecordInfoProvider is instantiated, a JSON ObjectMapper is created. The ObjectMapper is used to parse only the record value when the setCurrentRecord() method is called; the record key is ignored. The record id, geometry, and one extra field are retrieved from the _id, location, and locationText JSON properties. The geometry is represented as a latitude-longitude pair and is used to create a point geometry using the JGeometry.createPoint() method. The extra field (locationText) is added to the extraFields map, which serves as an out parameter, and true is returned to indicate that an extra field was added.

public class JsonRecordInfoProvider implements RecordInfoProvider<LongWritable, Text> {
private Text value = null;
private ObjectMapper jsonMapper = null;
private JsonNode recordNode = null;
 
public JsonRecordInfoProvider(){
 
//json mapper used to parse all the records
 
jsonMapper = new ObjectMapper();
 
}
 
@Override
public void setCurrentRecord(LongWritable key, Text value) throws Exception {
        try{
 
           //parse the current value
 
           recordNode = jsonMapper.readTree(value.toString());
           }catch(Exception ex){
              recordNode = null;
              throw ex;
        }
}
 
@Override
public String getId() {
        String id = null;
        if(recordNode != null ){
                id = recordNode.get("_id").getTextValue();
        }
        return id;
}
@Override
public JGeometry getGeometry() {
        JGeometry geom = null;
        if(recordNode!= null){
                //location is represented as a lat,lon pair
                String location = recordNode.get("location").getTextValue();
                String[] locTokens = location.split(",");
                double lat = Double.parseDouble(locTokens[0]);
                double lon = Double.parseDouble(locTokens[1]);
                geom =  JGeometry.createPoint( new double[]{lon, lat},  2, 8307);
        }
        return geom;
}
 
@Override
public boolean getExtraFields(Map<String, String> extraFields) {
        boolean extraFieldsExist = false;
        if(recordNode != null) {
                extraFields.put("locationText", recordNode.get("locationText").getTextValue() );
                extraFieldsExist = true;
        }
        return extraFieldsExist;
}
}

2.6.5.2 LocalizableRecordInfoProvider

This interface extends RecordInfoProvider and is used to identify the extra field that can be used as the search text when MVSuggest is used.

The only method added by this interface is getLocationServiceField(), which returns the name of the extra field that is sent to MVSuggest.

The following is an implementation based on "Sample RecordInfo Implementation." The name returned in this example is locationText, which is the name of the extra field included in the parent class.

public class LocalizableJsonRecordInfoProvider extends JsonRecordInfoProvider implements LocalizableRecordInfoProvider<LongWritable, Text> {
 
@Override
public String getLocationServiceField() {
        return  "locationText";
}
}

2.6.6 HierarchyInfo

The HierarchyInfo interface is used to describe a hierarchical dataset. An implementation of HierarchyInfo is expected to provide the number, names, and entries of the hierarchy levels of the hierarchy it describes.

The root hierarchy level is always hierarchy level 1. The entries in this level do not have parent entries, and this level is referred to as the top hierarchy level. Child hierarchy levels have higher level numbers. For example, the levels for a hierarchy formed by continents, countries, and states are 1, 2, and 3 respectively. Entries in the continent layer do not have a parent, but have child entries in the countries layer. Entries at the bottom level, the states layer, do not have children.

A HierarchyInfo implementation is provided out of the box with the Vector Analysis API. The DynaAdminHierarchyInfo implementation can be used to read and describe the known hierarchy layers in GeoJSON format. A DynaAdminHierarchyInfo can be instantiated and configured, or it can be subclassed. The hierarchy layers to be included are specified by calling the addLevel() method, which takes the following parameters:

  • The hierarchy level number

  • The hierarchy level name, which must match the file name (without extension) of the GeoJSON file that contains the data. For example, the hierarchy level name for the file world_continents.json must be world_continents, for world_countries.json it is world_countries, and so on.

  • Children join field: A JSON property that is used to join entries of the current level with child entries in the lower level. If null is passed, then the entry id is used.

  • Parent join field: A JSON property used to join entries of the current level with parent entries in the upper level. This value is not used for the topmost level, which has no upper level to join. If the value is set to null for any level greater than 1, an IsInside spatial operation is performed to join parent and child entries. In this scenario, it is assumed that an upper-level entry geometry can contain lower-level entries.

For example, let us assume a hierarchy containing the following levels from the specified layers: 1- world_continents, 2 - world_countries and 3 - world_states_provinces. A sample entry from each layer would look like the following:

world_continents:
 {"type":"Feature","_id":"NA","geometry": {"type":"MultiPolygon", "coordinates":[ x,y,x,y,x,y] }"properties":{"NAME":"NORTH AMERICA", "CONTINENT_LONG_LABEL":"North America"},"label_box":[-118.07998,32.21006,-86.58515,44.71352]}

world_countries: {"type":"Feature","_id":"iso_CAN","geometry":{"type":"MultiPolygon","coordinates":[x,y,x,y,x,y]},"properties":{"NAME":"CANADA","CONTINENT":"NA","ALT_REGION":"NA","COUNTRY CODE":"CAN"},"label_box":[-124.28092,49.90408,-94.44878,66.89287]}

world_states_provinces:
{"type":"Feature","_id":"6093943","geometry": {"type":"Polygon", "coordinates":[ x,y,x,y,x,y]},"properties":{"COUNTRY":"Canada", "ISO":"CAN", "STATE_NAME":"Ontario"},"label_box":[-91.84903,49.39557,-82.32462,54.98426]}

A DynaAdminHierarchyInfo can be configured to create a hierarchy with the above layers in the following way:

DynaAdminHierarchyInfo dahi = new DynaAdminHierarchyInfo();
 
dahi.addLevel(1, "world_continents", null /*_id is used by default to join with child entries*/, null /*not needed as there are not upper hierarchy levels*/);
 
dahi.addLevel(2, "world_countries", "properties.COUNTRY CODE"/*field used to join with child entries*/, "properties.CONTINENT" /*the value "NA" will be used to find Canada's parent which is North America and which _id field value is also "NA" */);
 
dahi.addLevel(3, "world_states_provinces", null /*not needed as not child entries are expected*/, "properties.ISO"/*field used to join with parent entries. For Ontario, it is the same value than the field properties.COUNTRY CODE specified for Canada*/);
 
//save the previous configuration to the job configuration
 
dahi.initialize(conf);

A similar configuration can be used to create hierarchies from different layers, such as countries, states, and counties, or any other layers with a similar JSON format.
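
For example, a hierarchy built from hypothetical usa_states and usa_counties GeoJSON layers could be configured as follows (a minimal sketch; the layer names and the properties.STATE_ID join field are illustrative assumptions, not part of the shipped catalogs):

DynaAdminHierarchyInfo usaHierarchy = new DynaAdminHierarchyInfo();
 
usaHierarchy.addLevel(1, "usa_states", null /*_id is used by default to join with child entries*/, null /*top level, no parent to join*/);
 
usaHierarchy.addLevel(2, "usa_counties", null /*no child entries expected*/, "properties.STATE_ID" /*illustrative field whose value matches the parent state's _id*/);
 
//save the configuration to the job configuration
 
usaHierarchy.initialize(conf);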

Alternatively, to avoid configuring a hierarchy every time a job is executed, the hierarchy configuration can be encapsulated in a DynaAdminHierarchyInfo subclass, as in the following example:

public class WorldDynaAdminHierarchyInfo extends DynaAdminHierarchyInfo
 
{
       public WorldDynaAdminHierarchyInfo() 
 
       {
              super();
              addLevel(1, "world_continents", null, null);
              addLevel(2, "world_countries", "properties.COUNTRY CODE", "properties.CONTINENT");
              addLevel(3, "world_states_provinces", null, "properties.ISO");
       }
 
}

2.6.6.1 Sample HierarchyInfo Implementation

The HierarchyInfo interface contains the following methods, which must be implemented to describe a hierarchy. The methods can be divided into the following three categories:

  • Methods to describe the hierarchy

  • Methods to load data

  • Methods to supply data

Additionally, there is an initialize() method, which can be used to perform any initialization and to save and read data both to and from the job configuration.

void initialize(JobConf conf);  
 
//methods to describe the hierarchy
 
String getLevelName(int level); 
int getLevelNumber(String levelName);
int getNumberOfLevels();
 
//methods to load data
 
void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception;
void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception;
 
//methods to supply data
 
Collection<String> getEntriesIds(int level);
JGeometry getEntryGeometry(int level, String entryId);  
String getParentId(int childLevel, String childId);

The following is a sample HierarchyInfo implementation, which takes the previously mentioned world layers as the hierarchy levels. The first section contains the initialize() method and the methods used to describe the hierarchy. In this case, the initialize() method does nothing. The methods in the following example use the hierarchyLevelNames array to provide the hierarchy description. The instance variables entriesGeoms and entriesParents are arrays of java.util.Map, which contain the entry geometries and entry parents, respectively. The entry ids are used as keys in both cases. Since the array indices are zero-based and the hierarchy levels are one-based, the array indices correlate to the hierarchy levels as array index + 1 = hierarchy level.

public class WorldHierarchyInfo implements HierarchyInfo 
{
 
       private String[] hierarchyLevelNames = {"world_continents", "world_countries", "world_states_provinces"};
       private Map<String, JGeometry>[] entriesGeoms = new Map[3];
       private Map<String, String>[] entriesParents = new Map[3];
 
       @Override
       public void initialize(JobConf conf) 
      {

         //do nothing for this implementation
}

        @Override
        public int getNumberOfLevels() 
        {
          return hierarchyLevelNames.length;
}
 
        @Override
        public String getLevelName(int level) 
        {
           String levelName = null;
           if(level >=1 && level <= hierarchyLevelNames.length)
           {
             levelName = hierarchyLevelNames[ level - 1];    
           }
         return levelName;
         }

        @Override
        public int getLevelNumber(String levelName) 
        {
           for(int i=0; i< hierarchyLevelNames.length; i++ ) 
           {
             if(hierarchyLevelNames[i].equals( levelName) ) return i+1;
   }
   return -1;
}

The following example contains the methods that load the data for the different hierarchy levels. The load() method reads the data from the source files world_continents.json, world_countries.json, and world_states_provinces.json. For the sake of simplicity, the internally called loadLevel() method is not shown, but it is assumed to parse and read the JSON files.

The loadFromIndex() method only takes the information provided by the HierarchyIndexReader instances passed as parameters. The load() method is executed only once in a job, and only if a hierarchy index has not already been created. Once the data is loaded, it is automatically indexed, and the loadFromIndex() method is called every time the hierarchy data is loaded into memory.

      @Override
      public void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception {
      int toLevel = fromLevel + hierDataPaths.length - 1;
      int levels = getNumberOfLevels();
 
      for(int i=0, level=fromLevel; i<hierDataPaths.length && level<=levels; i++, level++)
      {
 
         //load current level from the current path
 
         loadLevel(level, hierDataPaths[i]);
       }
    }
 
    @Override
    public void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf)
                 throws Exception 
    {
     Text parentId = new Text();
     RecordInfoArrayWritable records = new RecordInfoArrayWritable();
     int levels = getNumberOfLevels();
 
     //iterate through each reader to load each level's entries
 
     for(int i=0, level=fromLevel; i<readers.length && level<=levels; i++, level++)
     {
       entriesGeoms[ level - 1 ] = new Hashtable<String, JGeometry>();
       entriesParents[ level - 1 ] = new Hashtable<String, String>();
 
       //each entry is a parent record id (key) and a list of entries as RecordInfo (value)
 
       while(readers[i].nextParentRecords(parentId, records))
       {
          String pId = null;
 
          //entries with no parent will have the parent id UNDEFINED_PARENT_ID. Such is the case of the first level entries
 
           if( ! UNDEFINED_PARENT_ID.equals( parentId.toString() ) )
           {
           pId = parentId.toString();
           }
 
         //add the current level's entries
 
           for(Object obj : records.get())
           {
              RecordInfo entry = (RecordInfo) obj;
              entriesGeoms[ level - 1 ].put(entry.getId(), entry.getGeometry());
              if(pId != null) 
              {
              entriesParents[ level -1 ].put(entry.getId(), pId);
              }
           }//finish loading current parent entries
        }//finish reading single hierarchy level index
     }//finish iterating index readers
}

Finally, the following code listing contains the methods used to provide information of individual entries in each hierarchy level. The information provided is the ids of all the entries contained in a hierarchy level, the geometry of each entry, and the parent of each entry.

@Override
public Collection<String> getEntriesIds(int level) 
{
   Collection<String> ids = null;
 
   if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null)
   {
 
     //returns the ids of all the entries from the given level
 
     ids = entriesGeoms[ level - 1 ].keySet();
   }
   return ids;
}
 
@Override
public JGeometry getEntryGeometry(int level, String entryId) 
{
   JGeometry geom = null;
   if(level >= 1 && level <= getNumberOfLevels() && entriesGeoms[ level - 1 ] != null)
   {
 
      //returns the geometry of the entry with the given id and level
 
      geom = entriesGeoms[ level - 1 ].get(entryId);
    }
    return geom;
}
 
@Override
public String getParentId(int childLevel, String childId) 
{
   String parentId = null;
   if(childLevel >= 1 && childLevel <= getNumberOfLevels() && entriesGeoms[ childLevel - 1 ] != null)
   {
 
      //returns the parent id of the entry with the given id and level
 
      parentId = entriesParents[ childLevel - 1 ].get(childId);
   }
   return parentId;
   }
}//end of class

2.6.7 Using JGeometry in MapReduce Jobs

The Spatial Hadoop Vector Analysis contains only a small subset of the functionality provided by the Spatial Java API, which can also be used in MapReduce jobs. This section provides some simple examples of how JGeometry can be used in Hadoop for spatial processing. The following example contains a simple mapper that performs the IsInside test between a dataset and a query geometry using the JGeometry class.

In this example, the query geometry ordinates, SRID, geodetic value, and tolerance used in the spatial operation are retrieved from the job configuration in the configure method; an example of setting these values in the job driver is shown after the mapper code. The query geometry, which is a polygon, is preprocessed to quickly perform the IsInside operation.

The map method is where the spatial operation is executed. Each input record value is tested against the query geometry, and its id is emitted when the test succeeds.

public class IsInsideMapper extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text>
{
       private JGeometry queryGeom = null;
       private int srid = 0;
       private double tolerance = 0.0;
       private boolean geodetic = false;
       private Text outputValue = new Text();
       private double[] locationPoint = new double[2];
        
        
       @Override
       public void configure(JobConf conf) 
       {
           super.configure(conf);
           srid = conf.getInt("srid", 8307);
           tolerance = conf.getDouble("tolerance", 0.0);
           geodetic = conf.getBoolean("geodetic", true);
 
    //The ordinates are represented as a string of comma separated double values
           
            String[] ordsStr = conf.get("ordinates").split(",");
            double[] ordinates = new double[ordsStr.length];
            for(int i=0; i<ordsStr.length; i++)
            {
              ordinates[i] = Double.parseDouble(ordsStr[i]);
             }
 
    //create the query geometry as two-dimensional polygon and the given srid
 
             queryGeom = JGeometry.createLinearPolygon(ordinates, 2, srid);
 
    //preprocess the query geometry to make the IsInside operation run faster
 
              try 
              {
                queryGeom.preprocess(tolerance, geodetic, EnumSet.of(FastOp.ISINSIDE));
               } 
               catch (Exception e) 
               {
                 e.printStackTrace();
                }
                
          }
        
          @Override
          public void map(LongWritable key, Text value,
                      OutputCollector<NullWritable, Text> output, Reporter reporter)
                      throws IOException 
          {
 
     //the input value is a comma separated values text with the following columns: id, x-ordinate, y-ordinate
 
           String[] tokens = value.toString().split(",");
     
     //create a geometry representation of the record's location
 
            locationPoint[0] = Double.parseDouble(tokens[1]);//x ordinate
            locationPoint[1] = Double.parseDouble(tokens[2]);//y ordinate
            JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);
 
     //perform spatial test
 
            try 
            {
              if( location.isInside(queryGeom, tolerance, geodetic)){
 
               //emit the record's id
 
               outputValue.set( tokens[0] );
               output.collect(NullWritable.get(), outputValue);
             }
     } 
             catch (Exception e) 
             {
                e.printStackTrace();
              }
}
}
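
The query parameters read in the configure() method above could be set in the job driver before the job is submitted, for example as follows (a minimal sketch; conf is the job's JobConf and the polygon ordinates shown are only illustrative):

//query polygon ordinates as a comma separated list of x,y values (illustrative values)
conf.set("ordinates", "-106.64595,25.83997,-93.51001,25.83997,-93.51001,36.50061,-106.64595,36.50061,-106.64595,25.83997");
conf.setInt("srid", 8307);
conf.setDouble("tolerance", 0.05);
conf.setBoolean("geodetic", true);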

A similar approach can be used to perform a spatial operation on the geometry itself, for example, by creating a buffer. The following example uses the same text value format and creates a buffer around each record location. The mapper output key and value are the record id and the generated buffer, which is represented as a JGeometryWritable. The JGeometryWritable is a Writable implementation contained in the Vector Analysis API that holds a JGeometry instance.

public class BufferMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, JGeometryWritable> 
{
       private int srid = 0;
       private double bufferWidth = 0.0;
       private Text outputKey = new Text();
       private JGeometryWritable outputValue = new JGeometryWritable();
       private double[] locationPoint = new double[2];
 
       @Override
       public void configure(JobConf conf)
       {
              super.configure(conf);
              srid = conf.getInt("srid", 8307);
 
              //get the buffer width 
 
              bufferWidth = conf.getDouble("bufferWidth", 0.0);
        }
 
        @Override
        public void map(LongWritable key, Text value,
               OutputCollector<Text, JGeometryWritable> output, Reporter reporter)
               throws IOException 
        {
 
               //the input value is a comma separated record with the following columns: id, longitude, latitude
 
               String[] tokens = value.toString().split(",");
 
               //create a geometry representation of the record's location
 
               locationPoint[0] = Double.parseDouble(tokens[1]);
               locationPoint[1] = Double.parseDouble(tokens[2]);
               JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);
 
               try 
               {
 
                  //create the location's buffer
 
                  JGeometry buffer = location.buffer(bufferWidth);
 
                  //emit the record's id and the generated buffer
 
                  outputKey.set( tokens[0] );
                  outputValue.setGeometry( buffer );
                  output.collect(outputKey, outputValue);
                }
 
                catch (Exception e)
                {
                   e.printStackTrace();
                 }
        }
}

2.6.8 Tuning the Performance Data of Job Running Times Using the Vector Analysis API

The following table lists running times for some jobs built using the Vector Analysis API. The jobs were executed using a 4-node cluster. The times may vary depending on the characteristics of the cluster. The test dataset contains over one billion records, and its size is above 1 terabyte.

Table 2-2 Performance time for running jobs using Vector Analysis API

Job Type | Time taken (approximate value)
Spatial Indexing | 2 hours
Spatial Filter with Spatial Index | 1 hour
Spatial Filter without Spatial Index | 3 hours
Hierarchy count with Spatial Index | 5 minutes
Hierarchy count without Spatial Index | 3 hours


The time taken for the jobs can be decreased by increasing the maximum split size using any of the following configuration parameters.

mapred.max.split.size
mapreduce.input.fileinputformat.split.maxsize

This results in more data being processed by each mapper, which improves the execution time. This is possible because the jobs use SpatialFilterInputFormat (spatial indexing) or FileSplitInputFormat (spatial hierarchical join, buffer). Also, the same results can be achieved by using an implementation of CombineFileInputFormat as the internal InputFormat.
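
For example, a larger maximum split size can be set in the job configuration before the job is submitted (a minimal sketch; the 512 MB value is only illustrative):

//use a 512 MB maximum split size; both property names are set to cover older and newer Hadoop releases
conf.setLong("mapred.max.split.size", 512 * 1024 * 1024);
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512 * 1024 * 1024);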

2.7 Using Oracle Big Data Spatial and Graph Vector Console

You can use the Oracle Big Data Spatial and Graph Vector Console to perform tasks related to spatial indexing and creating and showing thematic maps.

2.7.1 Creating a Spatial Index Using the Console

To create a spatial index using the Oracle Big Data Spatial and Graph Vector Console, follow these steps.

  1. Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/

  2. Click Create Index.

  3. Specify all the required details:

    1. Path of the data to index: Path in HDFS of the data to be indexed. For example, hdfs://<server_name_to_store_index>:8020/user/hsaucedo/twitter_logs/part-m-00000.

    2. New index path: This is the job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/rinfante/Twitter/index.

    3. Input Format class: The input format class. For example: org.apache.hadoop.mapred.TextInputFormat

    4. Record Info Provider class: The class that provides the spatial information. For example: oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider.

      Note:

      If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To be able to use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.
    5. MVSuggest service URL (Optional): If the geometry has to be found from a location string, then use the MVSuggest service. If the service URL is localhost, then each data node must have the MVSuggest application started and running. In this case, the new index contains the point geometry and the layer provided by MVSuggest for each record. If the geometry is a polygon, then the centroid of the polygon is used. For example: http://localhost:8080

    6. MVSuggest Templates (Optional): The user can define the templates used to create the index with MVSuggest. For optimal results, it is recommended to select the same templates for the index creation and when running the hierarchy job.

    7. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.

  4. Click Create.

    The submitted job is listed, and you should wait for an email notification that the job completed successfully.

2.7.2 Running a Hierarchy Job Using the Console

To run a hierarchy job with or without the spatial index, follow these steps.

  1. Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/.

  2. Click Run Job.

  3. Select either With Index or Without Index and provide the following details, as required:

    • With Index

      1. Index path: The index file path in HDFS. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/Twitter/index/part*/data.

      2. SRID: The SRID of the geometries used to build the index. For example, 8307.

      3. Tolerance: The tolerance of the geometries used to build the index. For example, 0.5.

      4. Geodetic: Whether the geometries used are geodetic or not. Select Yes or No.

    • Without Index

      1. Path of the data: Provide the HDFS data path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/*/data.

      2. JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a JAR with the user-defined classes must be provided. To be able to use this JAR, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.

      3. Input Format class: The input format class. For example: org.apache.hadoop.mapred.FileInputFormat

      4. Record Info Provider class: The class that will provide the spatial information. For example: oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider.

      5. MVSuggest service URL: If the geometry has to be found from a location string, then use the MVSuggest service. If the service URL is localhost, then each data node must have the MVSuggest application started and running. In this case, the result contains the point geometry and the layer provided by MVSuggest for each record. If the geometry is a polygon, then the centroid of the polygon is used. For example: http://localhost:8080

  4. Templates: The templates to create the thematic maps. For optimal results, it is recommended to select the same templates for the index creation and when running the hierarchy job. For example, World Continents, World Countries, and so on.

    Note:

    If a template refers to point geometries (for example, cities) and MVSuggest is not used, the result returned for that template is empty. This is because the spatial operations return results only for polygons.

    Tip:

    When using the MVSuggest service, the results are accurate only if all the templates that could match the data are selected. For example, if the data can refer to any city, state, country, or continent in the world, then the better choice of templates is World Continents, World Countries, World States Provinces, and World Cities. On the other hand, if the data is restricted to USA states, counties, and cities, then the suitable templates are USA States, USA Counties, and USA Cities.
  5. Output path: The Hadoop job output path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/rinfante/Twitter/myoutput.

  6. Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example, Tweets test.

  7. Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.

2.7.3 Viewing the Index Results

To view the results, follow these steps.

  1. Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/.

  2. Click Show Hadoop Results.

  3. Click any one of the Templates. For example, World Continents.

    The World Continents template is displayed.

  4. Click any one of the Results displayed.

    Different continents appear with different patches of colors.

  5. Click any continent from the map. For example, North America.

    The template changes to World Countries and the focus changes to North America with the results by country.

2.7.4 Creating Results Manually

It is possible to create and upload result files that are stored locally, for example, a result file created by a job executed from the command line. The templates are located in the folder $JETTY_HOME/webapps/spatialviewer/templates. The templates are GeoJSON files with features, and all the features have ids. For example, the first feature in the template USA States starts with: {"type":"Feature","_id":"WYOMING",...

The results must be JSON files with the following format: {"id":"JSONFeatureId","result":result}.

For example, if the template USA States is selected, then a valid result is a file containing: {"id":"WYOMING","result":3232} {"id":"SOUTH DAKOTA","result":74968}

  1. From Show Hadoop Results, click the New Results icon.

  2. Select a Template from the drop-down list.

  3. Specify a Name.

  4. Click Choose File to select the File location.

  5. Click Save.

    The results can be located in the $JETTY_HOME/webapps/spatialviewer/results folder.

2.7.5 Creating and Deleting Templates

To create new templates, do the following:

  1. Add the template JSON file in the folder $JETTY_HOME/webapps/spatialviewer/templates/.

  2. Add the template configuration file in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_.

  3. If MVSuggest is used, then add the new templates to the MVSuggest templates.

To delete the template, delete the JSON and configuration files added in steps 1 and 2.

2.7.6 Configuring Templates

Each template has a configuration file. The template configuration files are located in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_. The name of the configuration file is the same as the template file, suffixed with config.json instead of .json. For example, the configuration file for the template file usa_states.json is usa_states.config.json. The configuration parameters are listed below; a sample configuration file follows the list.

  • name: Name of the template to be shown on the console. For example, name: USA States.

  • display_attribute: When displaying a result, moving the cursor over a feature displays this property and the result of the feature. For example, display_attribute: STATE NAME.

  • point_geometry: True if the template contains point geometries, and false in the case of polygons. For example, point_geometry: false.

  • child_templates (optional): The possible child templates, separated by commas. For example, child_templates: ["world_states_provinces, usa_states(properties.COUNTRY CODE:properties.PARENT_REGION)"].

    If a child template does not specify a linked field, then all the features inside the parent features are considered child features. In this case, world_states_provinces does not specify any fields. If the link between parent and child is specified, then the spatial relationship does not apply and the feature property links are checked instead. In the above example, the relationship with usa_states is established through the property COUNTRY CODE in the current template and the property PARENT_REGION in the template file usa_states.json.

  • srid: The SRID of the template's geometries. For example, srid: 8307.

  • bounds: The bounds of the map when showing the template. For example, bounds: [-180, -90, 180, 90].

  • numberOfZoomLevels: The number of zoom levels for the map when showing the template. For example, numberOfZoomLevels: 19.

  • back_polygon_template_file_name (optional): A template with polygon geometries to set as background when showing the defined template. For example, back_polygon_template_file_name: usa_states.
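
Putting these parameters together, a usa_states.config.json file might look like the following (a minimal sketch assembled from the example values listed above; the optional child_templates and back_polygon_template_file_name parameters are omitted):

{
  "name": "USA States",
  "display_attribute": "STATE NAME",
  "point_geometry": false,
  "srid": 8307,
  "bounds": [-180, -90, 180, 90],
  "numberOfZoomLevels": 19
}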

2.7.7 Running a Job to Create an Index Using the Command Line

Run the following command to generate an index:

hadoop jar <HADOOP_LIB_PATH>/<jarfile> oracle.spatial.hadoop.vector.demo.job.SpatialIndexing <DATA_PATH> <SPATIAL_INDEX_PATH> <INPUT_FORMAT_CLASS> <RECORD_INFO_PROVIDER_CLASS>

Where,

  • jarfile is the jar file to be specified by the user for spatial indexing.

  • DATA_PATH is the location of the data to be indexed.

  • SPATIAL_INDEX_PATH is the location of the resulting spatial index.

  • INPUT_FORMAT_CLASS is the InputFormat implementation used to read the data.

  • RECORD_INFO_PROVIDER_CLASS is the implementation used to extract information from the records.

The following example uses the command line to create an index.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.SpatialIndexing "/user/hdfs/demo_vector/tweets/part*" /user/hdfs/demo_vector/tweets/spatial_index org.apache.hadoop.mapred.TextInputFormat oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider 

Note:

The preceding example uses the demo job that comes preloaded.

2.7.8 Running a Job to Perform a Spatial Filtering

Run the following command to perform a spatial filter:

hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.TwitterLogSearch <DATA_PATH> <RESULT_PATH> <SEARCH_TEXT> <SPATIAL_INTERACTION> <GEOMETRY> <SRID> <TOLERANCE> <GEODETIC> <SPATIAL_INDEX_PATH>

Where,

  • DATA_PATH is the data to be filtered.

  • RESULT_PATH is the path where the results are generated.

  • SEARCH_TEXT is the text to be searched.

  • SPATIAL_INTERACTION is the type of spatial interaction. It can be either 1 (is inside) or 2 (any interact).

  • GEOMETRY is the geometry used to perform the spatial filter. It should be expressed in JSON format.

  • SRID is the SRS id of the query geometry.

  • TOLERANCE is the tolerance used for the spatial search.

  • GEODETIC specifies whether the geometries are geodetic or not.

  • SPATIAL_INDEX_PATH is the path to a previously generated spatial index.

The following example counts all tweets containing a specific text and interacting with a given geometry.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.TwitterLogSearch "/user/hdfs/demo_vector/tweets/part*" /user/hdfs/demo_vector/tweets/text_search feel 2 '{"type":"Polygon", "coordinates":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997, -106.64595, 25.83997]]}' 8307 0.0 true /user/hdfs/demo_vector/tweets/spatial_index

Note:

The preceding example uses the demo job that comes preloaded.

2.7.9 Running a Job to Create a Hierarchy Result

Run the following command to create a hierarchy result:

hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.HierarchicalCount spatial <SPATIAL_INDEX_PATH> <RESULT_PATH> <HIERARCHY_INFO_CLASS> <SPATIAL_INTERACTION> <SRID> <TOLERANCE> <GEODETIC> <HIERARCHY_DATA_INDEX_PATH> <HIERARCHY_DATA_PATHS>

Where,

  • SPATIAL_INDEX_PATH is the path to a previously generated spatial index. If there are files other than the spatial index path files in this path (for example, the hadoop generated _SUCCESS file), then add the following pattern: SPATIAL_INDEX_PATH/part*/data.

  • RESULT_PATH is the path where the results are generated. There should be a file named XXXX_count.json for each hierarchy level.

  • HIERARCHY_INFO_CLASS is the HierarchyInfo implementation. It defines the structure of the hierarchy data being used.

  • SPATIAL_INTERACTION, SRID, TOLERANCE, and GEODETIC have the same meanings as defined previously for the spatial filtering job. They are used to define the spatial operation between the data and the hierarchy information geometries.

  • HIERARCHY_DATA_INDEX_PATH is the path where the hierarchy data index will be placed. This index is used by the job to avoid recomputing parent-child relationships each time they are required.

  • HIERARCHY_DATA_PATHS is a comma-separated list of paths to the hierarchy data. If a hierarchy index was created previously for the same hierarchy data, this parameter can be omitted.

The following example uses the command line to create a hierarchy result.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.HierarchicalCount spatial "/user/hdfs/demo_vector/tweets/spatial_index/part*/data" /user/hdfs/demo_vector/tweets/hier_count_spatial oracle.spatial.hadoop.vector.demo.usr.WorldAdminHierarchyInfo 1 8307 0.5 true /user/hdfs/demo_vector/world_hier_index file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_continents.json,file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_countries.json,file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_states_provinces.json

Note:

The preceding example uses the demo job that comes preloaded.

2.7.10 Running a Job to Generate Buffer

Run the following command to generate a buffer:

hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.Buffer <DATA_PATH> <RESULT_PATH> <INPUT_FORMAT_CLASS> <RECORD_INFO_PROVIDER_CLASS> <BUFFER_WIDTH> <BUFFER_SMA> <BUFFER_FLAT> <BUFFER_ARCTOL>

Where,

  • DATA_PATH is the path to the input data to be processed.

  • RESULT_PATH is the path where the results are generated.

  • INPUT_FORMAT_CLASS is the InputFormat implementation used to read the data.

  • RECORD_INFO_PROVIDER_CLASS is the RecordInfoProvider implementation used to extract data from each record.

  • BUFFER_WIDTH specifies the buffer width.

  • BUFFER_SMA is the semi major axis.

  • BUFFER_FLAT is the buffer flattening.

  • BUFFER_ARCTOL is the arc tolerance for geodetic arc densification.

The following example demonstrates generating a buffer around each record geometry. The resulting file is a MapFile where each entry corresponds to a record from the input data. The entry key is the record id and the value is a RecordInfo instance holding the generated buffer and the record location (path, start offset and length).

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.Buffer "/user/hdfs/demo_vector/waterlines/part*"  /user/hdfs/demo_vector/waterlines/buffers org.apache.hadoop.mapred.TextInputFormat oracle.spatial.hadoop.vector.demo.usr.WorldSampleLineRecordInfoProvider 5.0

Note:

The preceding example uses the demo job that comes preloaded.

2.8 Using Oracle Big Data Spatial and Graph Image Server Console

You can use the Oracle Big Data Spatial and Graph Image Server Console to perform tasks such as loading images to an HDFS Hadoop cluster to create a mosaic.

2.8.1 Loading Images to HDFS Hadoop Cluster to Create a Mosaic

Follow these steps to create a mosaic:

  1. Open http://<oracle_big_data_image_server_console>:8080/spatialviewer/.

  2. Type the username and password.

  3. Click the Configuration tab and review the Hadoop configuration section.

    By default the application is configured to work with the Hadoop cluster and no additional configuration is required.

    Note:

    Only an admin user can make changes to this section.
  4. Click the Hadoop Loader tab and review the displayed instructions or alerts.

  5. Follow the instructions and update the runtime configuration, if necessary.

  6. Click the Folder icon.

    The File System dialog displays the list of image files and folders.

  7. Select the folders or files as required and click Ok.

    The complete path to the image file is displayed.

  8. Click Load Images.

    Wait for the images to be loaded successfully. A message is displayed.

  9. Proceed to create a mosaic, if there are no errors displayed.