This chapter provides conceptual and usage information about loading, storing, accessing, and working with spatial data in a Big Data environment.
Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing
Processing an Image Using the Oracle Spatial Hadoop Image Processor
Using Oracle Big Data Spatial and Graph Image Server Console
Spatial data represents the location characteristics of real or conceptual objects in relation to the real or conceptual space on a Geographic Information System (GIS) or other location-based application.
Oracle Big Data Spatial and Graph features enable spatial data to be stored, accessed, and analyzed quickly and efficiently for location-based decision making.
These features are used to geotag, enrich, visualize, transform, load, and process the location-specific two and three dimensional geographical images, and manipulate geometrical shapes for GIS functions.
Oracle Big Data Spatial and Graph on Apache Hadoop is a framework that uses the MapReduce programs and analytic capabilities in a Hadoop cluster to store, access, and analyze the spatial data. The spatial features provide a schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial data. Big Data Spatial and Graph on Hadoop supports storing and processing spatial images, which can be geometric shapes, raster, or vector images, stored in any of several hundred supported formats.
See Also:
Oracle Spatial and Graph Developer's Guide for an introduction to spatial concepts, data, and operations

The advantages of using Oracle Big Data Spatial and Graph include the following:
Unlike some of the GIS-centric spatial processing systems and engines, Oracle Big Data Spatial and Graph is capable of processing both structured and unstructured spatial information.
Customers are not forced or restricted to store only one particular form of data in their environment. They can store their data as spatial or nonspatial business data and still use Oracle Big Data Spatial and Graph for spatial processing.
This is a framework, and therefore customers can use the available APIs to custom-build their applications or operations.
Oracle Big Data Spatial can process both vector and raster types of information and images.
The spatial data is loaded for query and analysis by the Spatial Server and the images are stored and processed by an Image Processing Framework. You can use the Oracle Big Data Spatial and Graph server on Hadoop for:
Cataloguing the geospatial information, such as geographical map-based footprints, availability of resources in a geography, and so on.
Topological processing to calculate distance operations, such as nearest neighbor in a map location.
Categorization to build hierarchical maps of geographies and enrich the map by creating demographic associations within the map elements.
The following functions are built into Oracle Big Data Spatial and Graph:
Indexing function for faster retrieval of the spatial data.
Map function to display map-based footprints.
Zoom function to zoom in and out of specific geographical regions.
Mosaic and Group function to group a set of image files for processing, to create a mosaic or perform subset operations.
Cartesian and geodetic coordinate functions to represent the spatial data in one of these coordinate systems.
Hierarchical function that builds and relates geometric hierarchy, such as country, state, city, postal code, and so on. This function can process the input data in the form of documents or latitude/longitude coordinates.
The stored spatial data or images can be in one of these supported formats:
GeoJSON files
Shapefiles
Both Geodetic and Cartesian data
Other GDAL supported formats
You must have the following software to store and process the spatial data:
Java runtime
GCC Compiler - Only when the GDAL-supported formats are used
Oracle Big Data Spatial and Graph supports the storage and processing of both vector and raster spatial data.
For processing raster data, the GDAL loader loads the raster spatial data or images into an HDFS environment. The following basic operations can be performed on raster spatial data:
Mosaic: Combine multiple raster images to create a single mosaic image.
Subset: Perform subset operations on individual images.
This feature supports a MapReduce framework for raster analysis operations. Users can custom-build their own raster operations, such as performing an algebraic function on raster data. For example, they can calculate the slope at each base of a digital elevation model or a 3D representation of a spatial surface, such as a terrain. For details, see "Oracle Big Data Spatial Hadoop Image Processing Framework for Raster Data Processing."
This feature supports the processing of spatial vector data:
Loaded and stored onto a Hadoop HDFS environment
Stored either as Cartesian or geodetic data
The stored spatial vector data can be used for performing the following query operations and more:
Point-in-polygon
Distance calculation
Anyinteract
Buffer creation
Two different data service operations are supported for the spatial vector data:
Data enrichment
Data categorization
In addition, there is limited Map Visualization API support, for the HTML5 format only. You can access these APIs to create custom operations. For details, see "Oracle Big Data Spatial Vector Analysis."
Oracle Spatial Hadoop Image Processing Framework allows the creation of new combined images resulting from a series of processing phases in parallel with the following features:
HDFS Images storage, where every block size split is stored as a separate image
Subset and user-defined operations processed in parallel using the MapReduce framework
Ability to add custom processing classes to be executed in parallel in a transparent way
Fast processing of georeferenced images
Support for GDAL formats, multiple bands images, DEMs (digital elevation models), multiple pixel depths, and SRIDs
The Oracle Spatial Hadoop Image Processing Framework consists of two modules, a Loader and a Processor, each represented by a Hadoop job that runs at a different stage in a cluster, as represented in the following diagram. You can also load and process the images using the Image Server web application.
For installation and configuration information, see:
Installing Oracle Big Data Spatial and Graph on an Oracle Big Data Appliance
Installing and Configuring the Big Data Spatial Image Processing Framework
Installing and Configuring the Big Data Spatial Image Server
The Image Loader is a Hadoop job that loads a specific image or a group of images into HDFS.
While importing, the image is tiled, and each tile is stored as an HDFS block.
GDAL is used to tile the image.
Each tile is loaded by a different mapper, so reading is parallel and faster.
Each tile includes a certain number of overlapping bytes (a user input), so that each tile's coverage area overlaps that of the adjacent tiles.
A MapReduce job uses a mapper to load the information for each tile. The number of mappers depends on the number of tiles, the image resolution, and the block size.
A single reduce phase per image puts together all the information loaded by the mappers and stores the images into a special .ohif
format, which contains the resolution, bands, offsets, and image data. This way, the file offset containing each tile and the node location are known.
Each tile contains information for every band. This is helpful when there is a need to process only a few tiles; then, only the corresponding blocks are loaded.
The following diagram represents an Image Loader process:
The Image Processor is a Hadoop job that filters tiles to be processed based on the user input and performs processing in parallel to create a new image.
The job processes specific tiles of the image, as identified by the user. You can specify zero, one, or multiple processing classes. After the processing classes are executed, a mosaic operation is performed to adapt the pixels to the final output format requested by the user.
A mapper loads the data corresponding to one tile, conserving data locality.
Once the data is loaded, the mapper filters the bands requested by the user.
The filtered information is processed and sent to the reduce phase, where the bytes are put together and the final processed image is stored into HDFS or the regular file system, depending on the user request.
The following diagram represents an Image Processor job:
The Image Server is a web application that enables you to load and process images from a variety of sources, especially from the Hadoop File System (HDFS). This Oracle Image Server has two main applications:
Raster image processing, to create catalogs from the source images and process them into a single unit. You can also view the image thumbnails.
Hadoop console configuration, both server and console. It connects to the Hadoop cluster to load images to HDFS for further processing.
The first step to process images using the Oracle Spatial and Graph Hadoop Image Processing Framework is to actually have the images in HDFS, followed by having the images separated into smart tiles. This allows the processing job to work on each tile independently. The Image Loader lets you import a single image or a collection of images into HDFS in parallel, which decreases the load time.
The Image Loader imports images from a file system into HDFS, where each block contains data for all the bands of the image, so that if further processing is required on specific positions, the information can be processed on a single node.
The image loading job has its own custom input format that splits the image into related image splits. The splits are calculated based on an algorithm that reads square blocks of the image covering a defined area, which is determined by
area = ((blockSize - metadata bytes) / number of bands) / bytes per pixel.
For those pieces that do not use the complete block size, the remaining bytes are filled with zeros.
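For example, with an HDFS block size of 64 MB, roughly 1 KB of tile metadata, a 3-band image, and a pixel depth of 1 byte (illustrative values, not framework defaults), the area covered by one split can be computed as follows:

// Illustrative calculation of the square pixel area covered by one split.
// The block size, metadata size, band count, and pixel depth are assumed values.
long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block
long metadataBytes = 1024;            // metadata stored with the tile (assumed)
int numberOfBands = 3;                // for example, an RGB image
int bytesPerPixel = 1;                // 8-bit pixel depth

long area = ((blockSize - metadataBytes) / numberOfBands) / bytesPerPixel;
// area == 22,369,280 pixels, that is, a square tile of roughly 4,729 x 4,729 pixels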
Splits are assigned to different mappers where every assigned tile is read using GDAL based on the ImageSplit
information. As a result an ImageDataWritable
instance is created and saved in the context.
The metadata set in the ImageDataWritable
instance is used by the processing classes to set up the tiled image in order to manipulate and process it. Because the source images are read by multiple mappers, the load is performed in parallel and is therefore faster.
After the mappers complete reading, the reducer picks up the tiles from the context and puts them together to save the file into HDFS. A special reading process is required to read the image back.
The following input parameters are supplied to the Hadoop command:
hadoop jar HADOOP_LIB_PATH/hadoop-imageloader.jar -files <SOURCE_IMGS_PATH> -out <HDFS_OUTPUT_FOLDER> [-overlap <OVERLAPPING_PIXELS>] [-thumbnail <THUMBNAIL_PATH>] [-gdal <GDAL_LIBRARIES_PATH>]
Where:
HADOOP_LIB_PATH for the Oracle Big Data Appliance distribution is under /opt/cloudera/parcels/CDH/lib/hadoop/lib/, and for standard distributions it is under /usr/lib/hadoop/lib.

SOURCE_IMGS_PATH is a path to the source image(s) or folder(s). For multiple inputs, use a comma separator. This path must be accessible via NFS to all nodes in the cluster.

HDFS_OUTPUT_FOLDER is the HDFS output folder where the loaded images are stored.

OVERLAPPING_PIXELS is an optional number of overlapping pixels on the borders of each tile. If this parameter is not specified, a default of two overlapping pixels is used.

THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and must have write access permission for yarn users.

GDAL_LIBRARIES_PATH is an optional path for the GDAL native libraries, in case the cluster has them in a path different from the default Oracle Big Data Appliance path (/opt/cloudera/parcels/CDH/lib/hadoop/lib).

For example, the following command loads all the georeferenced images under the images folder and adds an overlap of 10 pixels on every border where possible. The HDFS output folder is ohiftest, and thumbnails of the loaded images are stored in the processtest folder. Because no -gdal flag was specified, the GDAL libraries are loaded from the default path /opt/cloudera/parcels/CDH/lib/hadoop/lib/native.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/hadoop-imageloader.jar -files /opt/shareddir/spatial/demo/imageserver/images/hawaii.tif -out ohiftest -overlap 10 -thumbnail /opt/shareddir/spatial/processtest
By default, the mappers and reducers are configured to get 2 GB of JVM memory, but users can override these settings or any other job configuration properties by adding an imagejob.prop
properties file in the same folder from which the command is executed. This properties file may list all the configuration properties that you want to override. For example:
mapreduce.map.memory.mb=2560
mapreduce.reduce.memory.mb=2560
mapreduce.reduce.java.opts=-Xmx2684354560
mapreduce.map.java.opts=-Xmx2684354560
The reducer generates two output files per input image. The first one is the .ohif file that concentrates all the tiles for the source image; each tile may be processed as a separate instance by a processing mapper. Internally, each tile is stored as an HDFS block; blocks are located on several nodes, and one node may contain one or more blocks of a specific .ohif file. The .ohif file is stored in the folder specified by the -out flag, under /user/<USER_EXECUTING_JOB>, and can be identified as original_filename.ohif.

The second output is a related metadata file that lists all the pieces of the image and the coordinates that each of them covers. The file can be identified as SRID_dataType_original_filename.loc, where the SRID and data type are extracted from the source image. The following is an example of a couple of lines from a metadata file:
3,565758.0971152118,4302455.175478269,576788.4332668484,4291424.839326633
29,576757.962724993,4247455.847429363,582303.6013426667,4240485.710979952
The first element is the piece number; the second is the upper-left X, followed by the upper-left Y, lower-right X, and lower-right Y. The .loc file is stored in the metadata folder under the user's HDFS folder. If the -thumbnail flag was specified, a thumbnail of the source image is stored in the related folder; this is a way to visualize a translation of the .ohif file. Job execution logs can be accessed using the command yarn logs -applicationId <applicationId>.
Once the images are loaded into HDFS, they can be processed in parallel using Oracle Spatial Hadoop Image Processing Framework. You specify an output, and the framework filters the tiles to fit into that output, processes them, and puts them all together to store them into a single file. You can specify additional processing classes to be executed before the final output is created by the framework.
The image processor loads specific blocks of data, based on the input (mosaic description), and selects only the bands and pixels that fit into the final output. All the specified processing classes are executed, and the final output is stored into HDFS or the file system, depending on the user request.
The image processing job has its own custom FilterInputFormat
, which determines the tiles to be processed, based on the SRID and coordinates. Only images with the same data type (pixel depth) as the mosaic input data type (pixel depth) are considered. Only the tiles that intersect with the coordinates specified by the user for the mosaic output are included. Once the tiles are selected, a custom ImageProcessSplit
is created for each of them.
When a mapper receives the ImageProcessSplit
, it reads the information based on what the ImageSplit
specifies, performs a filter to select only the bands indicated by the user, and executes the list of processing classes defined in the request.
Each mapper process runs on the node where the data is located. Once the processing classes are executed, a final process executes the mosaic operation. The mosaic operation selects from every tile only the pixels that fit into the output and makes the necessary resolution changes to add them to the mosaic output. The resulting bytes are set in the context in the ImageBandWritable
type.
A single reducer picks up the tiles and puts them together. If the user selected HDFS output, the ImageLoader is called to store the result into HDFS. Otherwise, by default, the image is prepared using GDAL and stored in the file system.
The following input parameters are supplied to the Hadoop command:
hadoop jar HADOOP_LIB_PATH/hadoop-imageprocessor.jar -catalog <IMAGE_CATALOG_PATH> -config <MOSAIC_CONFIG_PATH> [-usrlib <USER_PROCESS_JAR_PATH>] [-thumbnail <THUMBNAIL_PATH>] [-gdal <GDAL_LIBRARIES_PATH>]
Where:
HADOOP_LIB_PATH for the Oracle Big Data Appliance distribution is under /opt/cloudera/parcels/CDH/lib/hadoop/lib/, and for standard distributions it is under /usr/lib/hadoop/lib.

IMAGE_CATALOG_PATH is the path to the catalog XML that lists the HDFS image(s) to be processed.

MOSAIC_CONFIG_PATH is the path to the mosaic configuration XML that defines the features of the output mosaic.

USER_PROCESS_JAR_PATH is an optional user-defined JAR file, which contains additional processing classes to be applied to the source images.

THUMBNAIL_PATH is an optional path to store a thumbnail of the loaded image(s). This path must be accessible via NFS to all nodes in the cluster and is valid only for an HDFS output.

GDAL_LIBRARIES_PATH is an optional path for the GDAL native libraries, in case the cluster has them in a path different from the default Oracle Big Data Appliance path (/opt/cloudera/parcels/CDH/lib/hadoop/lib).

For example, the following command processes all the files listed in the catalog file input.xml, using the mosaic output definition set in the testFS.xml file.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/hadoop-imageprocessor.jar -catalog /opt/shareddir/spatial/demo/imageserver/images/input.xml -config /opt/shareddir/spatial/demo/imageserver/images/testFS.xml -thumbnail /opt/shareddir/spatial/processtest
By default, the mappers and reducers are configured to get 2 GB of JVM memory, but users can override these settings or any other job configuration properties by adding an imagejob.prop
properties file in the same folder from which the command is executed.
This is an example of the input catalog XML used to list every source image considered for the mosaic operation generated by the image processing job.
<catalog>
   <image>
      <source>HDFS</source>
      <type>File</type>
      <raster>/user/hdfs/ohif/opt/shareddir/spatial/imageserver/hawaii.tif.ohif</raster>
      <metadata>/user/hdfs/metadata/opt/shareddir/spatial/imageserver/26904_1_hawaii.tif.loc</metadata>
      <bands>3</bands>
      <defaultRed>1</defaultRed>
      <defaultGreen>2</defaultGreen>
      <defaultBlue>3</defaultBlue>
   </image>
</catalog>
A <catalog>
element contains the list of <image> elements to process.
Each <image>
element defines a source image or a source folder within the <raster> element. All the images within the folder are processed.
The <metadata>
element specifies a .loc
location file related to the source .ohif
file. This location file contains the list of tiles for each source image and the coordinates covered by each tile.
The <bands>
element specifies the number of bands of the image, and the <defaultRed>
, <defaultGreen>
, and <defaultBlue>
elements specify the bands used for the first three channels of the image. In this example, band 1 is used for the red channel, band 2 for the green channel, and band 3 for the blue channel.
This is an example of a mosaic configuration xml used to define the features of the mosaic output generated by the image processing job.
<mosaic>
   <output>
      <SRID>26904</SRID>
      <directory type="FS">/opt/shareddir/spatial/processOutput</directory>
      <!--directory type="HDFS">newData</directory-->
      <tempFSFolder>/opt/shareddir/spatial/tempOutput</tempFSFolder>
      <filename>littlemap</filename>
      <format>GTIFF</format>
      <width>1600</width>
      <height>986</height>
      <algorithm order="0">2</algorithm>
      <bands layers="3"/>
      <nodata>#000000</nodata>
      <pixelType>1</pixelType>
   </output>
   <crop>
      <transform>356958.985610072,280.38843650364862,0,2458324.0825054757,0,-280.38843650364862</transform>
   </crop>
   <process>
      <class>oracle.spatial.imageprocessor.hadoop.process.ImageSlope</class>
   </process>
</mosaic>
The <mosaic>
element defines the specifications of the processing output.
The <output>
element defines the features, such as <SRID>, considered for the output. All images with a different SRID are converted to the mosaic SRID in order to decide whether any of their tiles fit into the mosaic.
The <directory>
element defines where the output is located. It can be in HDFS or in the regular file system (FS), as specified in the type attribute.
The <tempFSFolder>
element sets the path to store the mosaic output temporarily.
The <filename>
and <format>
elements specify the output filename and output format.
The <width>
and <height>
elements set the mosaic output resolution.
The <algorithm>
element sets the ordering algorithm for the images. A value of 1 means ordering by the source last-modified date, and a value of 2 means ordering by image size. The order attribute represents ascending or descending mode.
The <bands>
element specifies the number of bands in the output mosaic. Images with fewer bands than this number are discarded.
The <nodata>
element specifies the color in the first three bands for all the pixels in the mosaic output that have no value.
The <pixelType>
element sets the pixel type of the mosaic output. Source images that do not have this pixel type are discarded for processing.
The <crop>
element defines the coordinates included in the mosaic output in the following order: startcoordinateX
, pixelXWidth
, RotationX
, startcoordinateY
, RotationY
, and pixelheightY
.
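Mapping this order to the sample <transform> value shown in the mosaic configuration above: 356958.985610072 is the start coordinate X, 280.38843650364862 is the pixel X width, the third value (0) is the X rotation, 2458324.0825054757 is the start coordinate Y, the fifth value (0) is the Y rotation, and -280.38843650364862 is the pixel height Y.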
The <process>
element lists all the classes to execute before the mosaic operation. In this example, a slope calculation is applied; it is valid only for 32-bit images of Digital Elevation Model (DEM) files.
You can specify any other user-created processing classes. When no processing class is defined, only the mosaic operation is performed.
The first step of the job is to filter the tiles that fit into the mosaic. To start, the location files listed in the catalog XML are sent to the InputFormat
. The location file name has the following structure: SRID_pixelType_fileName.loc
.
By extracting the pixelType
, the filter decides whether the related source image is valid for processing. Based on the user definition made in the catalog XML, one of the following happens:
If the image is valid for processing, then the SRID is evaluated next.
If the SRID is different from the user definition, then the MBR coordinates of every tile are converted into the user SRID and evaluated.
This way, every tile is evaluated for intersection with the mosaic definition. Only the intersecting tiles are selected, and a split is created for each one of them.
A mapper processes each split on the node where it is stored. The mapper executes the sequence of processing classes defined by the user, and then the mosaic process is executed. A single reducer puts together the result of the mappers and stores the image into FS or HDFS, as requested by the user. If the user requested the output to be stored into HDFS, then the ImageLoaderOverlap
job is invoked to store the image as a .ohif
file.
By default, the mappers and reducers are configured to get 2 GB of JVM memory, but you can override these settings or any other job configuration properties by adding an imagejob.prop
properties file in the same folder from which the command is executed.
The processing classes specified in the mosaic XML must follow a set of rules to be correctly processed by the job. All the processing classes must implement the ImageProcessorInterface
and must handle the ImageBandWritable
type properties correctly.
The ImageBandWritable
instance defines the content of a tile, such as resolution, size, and pixels. These values must be reflected in the properties that create the definition of the tile. The integrity of the mosaic output depends on the correct manipulation of these properties.
Table 2-1 ImageBandWritable Properties
Type - Property | Description
---|---
IntWritable dstWidthSize | Width size of the tile
IntWritable dstHeightSize | Height size of the tile
IntWritable bands | Number of bands in the tile
IntWritable dType | Data type of the tile
IntWritable offX | Starting X pixel, in relation to the source image
IntWritable offY | Starting Y pixel, in relation to the source image
IntWritable totalWidth | Width size of the source image
IntWritable totalHeight | Height size of the source image
IntWritable bytesNumber | Number of bytes containing the pixels of the tile and stored into baseArray
BytesWritable[] baseArray | Array containing the bytes representing the tile pixels; each cell represents a band
IntWritable[][] basePaletteArray | Array containing the int values representing the tile palette; each array represents a band. Each integer represents an entry for each color in the color table; there are four entries per color
IntWritable[] baseColorArray | Array containing the int values representing the color interpretation; each cell represents a band
DoubleWritable[] noDataArray | Array containing the NODATA values for the image; each cell contains the value for the related band
ByteWritable isProjection | Specifies if the tile has projection information with Byte.MAX_VALUE
ByteWritable isTransform | Specifies if the tile has the geo transform array information with Byte.MAX_VALUE
ByteWritable isMetadata | Specifies if the tile has metadata information with Byte.MAX_VALUE
IntWritable projectionLength | Specifies the projection information length
BytesWritable projectionRef | Specifies the projection information in bytes
DoubleWritable[] geoTransform | Contains the geo transform array
IntWritable metadataSize | Number of metadata values in the tile
IntWritable[] metadataLength | Array specifying the length of each metadataValue
BytesWritable[] metadata | Array of metadata of the tile
GeneralInfoWritable mosaicInfo | The user-defined information in the mosaic XML. Do not modify the mosaic output features here; instead, copy the original XML file to a new name, modify the copy, and run the process using the new XML
Processing Classes and Methods
When modifying the pixels of the tile, first get the bytes into an array using the following method:
byte [] bandData1 = img.getBand(0);
The bytes representing the tile pixels of band 1 are now in the bandData1 array. The base index is zero.
If the tile has a Float32 data type, then use the following method to get the pixels in a float array:
float[] floatBand1 = img.getFloatBand(0);
For any other data types, you must convert the byte array into a corresponding type using gdal.
After processing the pixels, if the same instance of ImageBandWritable must be used, then execute the following method:
img.removeBands();
This removes the content of previous bands, and you can start adding the new bands. To add a new band use the following method:
img.addBand(Object band);
Where band is a byte array or a float array containing the pixel information. Do not forget to update the instance size, data type, bytesNumber, and any other property that might be affected as a result of the processing operation. Setters are available for each property.
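The following minimal sketch shows the general shape of a processing class that inverts the pixels of band 1, using the methods described above. The process() method name and signature and the setBytesNumber() setter name are assumptions made for illustration only; refer to the ImageProcessorInterface definition for the actual contract and package names.

// Hypothetical sketch of a user-defined processing class (signatures assumed).
public class InvertBandProcessor implements ImageProcessorInterface {

   public void process(ImageBandWritable img) throws Exception {
      // Get the pixels of band 1 (the base index is zero)
      byte[] bandData1 = img.getBand(0);

      // Invert every pixel value of the band
      byte[] inverted = new byte[bandData1.length];
      for (int i = 0; i < bandData1.length; i++) {
         inverted[i] = (byte) (255 - (bandData1[i] & 0xFF));
      }

      // Reuse the same ImageBandWritable instance: remove the previous bands,
      // add the modified band, and update any property affected by the change.
      img.removeBands();
      img.addBand(inverted);
      img.setBytesNumber(inverted.length); // assumed setter name
   }
}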
All the processing classes must be contained in a single jar file if you are using the Oracle Image Server Console.
The processing classes might be placed in different jar files if you are using the command line option.
In both cases, for Oracle Big Data Appliance, the JAR files must be placed under /opt/cloudera/parcels/CDH/lib/hadoop/lib/.
Once new classes are visible in the classpath, they must be added to the mosaic XML, in the <process><class> section. Every <class>
element added is executed in order of appearance before the final mosaic operation is performed.
When you specify an HDFS directory in the mosaic configuration XML, the output generated is an .ohif
file, as in the case of an ImageLoader
job.
When you specify an FS directory in the mosaic configuration XML, the output generated is an image with the specified filename and type, stored in the regular file system.
In both scenarios, the output must comply with the specifications set in the mosaic configuration XML. The job execution logs can be accessed using the command yarn logs -applicationId <applicationId>
.
Oracle Big Data Spatial Vector Analysis is a Spatial Vector Analysis API, which runs as a Hadoop job and provides MapReduce components for spatial processing of data stored in HDFS. These components make use of the Spatial Java API to perform spatial analysis tasks. There is a web console provided along with the API. The supported features include:
In addition, read the following information to understand the complete implementation details:
A spatial index is in the form of a key/value pair and generated as a Hadoop MapFile. Each MapFile entry contains a spatial index for one split of the original data. The key and value pair contain the following information:
Key: a split identifier in the form: path + start offset + length.
Value: a spatial index structure containing the actual indexed records.
The following figure depicts a spatial index in relation to the user data. The records are represented as r1, r2, and so on.
Records in a spatial index are represented using the class oracle.spatial.hadoop.vector.RecordInfo
. A RecordInfo
typically contains a subset of the original record data and a way to locate the record in the file where it is stored. The specific RecordInfo
data depends on two things:
InputFormat
used to read the data
RecordInfoProvider
implementation, which provides the record's data
The fields contained within a RecordInfo
:
Id: Text field with the record Id.
Geometry: JGeometry
field with the record geometry.
Extra fields: Additional optional fields of the record can be added as name-value pairs. The values are always represented as text.
Start offset: The position of the record in a file as a byte offset. This value depends on the InputFormat
used to read the original data.
Length: The original record length in bytes.
Path: The file path can be added optionally. This is optional because the file path can be determined from the spatial index entry key. However, to add the path to the RecordInfo
instances when a spatial index is created, set the configuration property oracle.spatial.recordInfo.includePathField
to true
, as shown in the snippet after this list.
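For example, assuming the same JobConf object (conf) used in the indexing examples that follow, the property can be enabled with a single call; this is a minimal sketch using the property name given above:

//include the file path in each RecordInfo when the spatial index is created
conf.setBoolean("oracle.spatial.recordInfo.includePathField", true);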
A spatial index is created using a combination of FileSplitInputFormat
, SpatialIndexingMapper
, InputFormat
, and RecordInfoProvider
, where the last two are provided by the user. The following code example shows part of the configuration needed to run a job that creates a spatial index for the data located in the HDFS folder /user/data
.
//input
conf.setInputFormat(FileSplitInputFormat.class);
FileSplitInputFormat.setInputPaths(conf, new Path("/user/data"));
FileSplitInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);
FileSplitInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);

//output
conf.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/user/data_spatial_index"));

//mapper
conf.setMapperClass(SpatialIndexingMapper.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(RTreeWritable.class);
In this example,
The FileSplitInputFormat
is set as the job InputFormat
. FileSplitInputFormat
is a subclass of CompositeInputFormat
, an abstract class which uses another InputFormat
implementation (internalInputFormat
) to read the data. The internal InputFormat
and the RecordInfoProvider
implementations are specified by the user and they are set to TextInputFormat
and TwitterLogRecordInfoProvider
respectively.
The OutputFormat
used is MapFileOutputFormat
, because the resulting spatial index is a MapFile
.
The mapper is set to SpatialIndexingMapper
. The mapper output key and value types are Text
(splits identifiers) and RTreeWritable
(the actual spatial indices).
No reducer class is specified, so the default reducer is used. The reduce phase is needed to sort the output MapFile keys.
Alternatively, this configuration can be set more easily by using the SpatialIndexing
class. SpatialIndexing
is a job driver that creates a spatial index. In the following example, a SpatialIndexing
instance is created, set up, and used to add the settings to the job configuration by calling the configure()
method. Once the configuration has been set, the job is launched.
SpatialIndexing<LongWritable, Text> spatialIndexing = new SpatialIndexing<LongWritable, Text>();

//path to input data
spatialIndexing.setInput("/user/data");

//path of the spatial index to be generated
spatialIndexing.setOutput("/user/data_spatial_index");

//input format used to read the data
spatialIndexing.setInputFormatClass(TextInputFormat.class);

//record info provider used to extract records information
spatialIndexing.setRecordInfoProviderClass(TwitterLogRecordInfoProvider.class);

//add the spatial indexing configuration to the job configuration
spatialIndexing.configure(jobConf);

//run the job
JobClient.runJob(jobConf);
An InputFormat
must meet the following requisites to be supported:
It must be a subclass of FileInputFormat
.
The getSplits()
method must return either FileSplit
or CombineFileSplit
split types.
The getPos()
method of the RecordReader
provided by the InputFormat
must return the current position, so that a record in the spatial index can be tracked back to its original record in the user file. If the current position is not returned, then the original record cannot be found using the spatial index.
However, the spatial index still can be created and used in operations that do not require the original record to be read. For example, additional fields can be added as extra fields to avoid having to read the whole original record.
Note:
The spatial indices are created for each split as returned by the getSplits()
method. When the spatial index is used for filtering (see Spatial Filtering), it is recommended to use the same InputFormat
implementation as the one used to create the spatial index, to ensure the split indices can be found.
The getPos()
method has been removed from the new Hadoop API; however, org.apache.hadoop.mapreduce.lib.input.TextInputFormat
and CombineTextInputFormat
are supported, and it is still possible to get the record start offsets.
Other input formats from the new API are supported, but the record start offsets will not be contained in the spatial index; therefore, it is not possible to find the original records. The requisites for a new API input format are the same as for the old API, but translated to the new API's FileInputFormat
, FileSplit
, and CombineFileSplit
. The following example shows an input format from the new Hadoop API that is used as internal input format:
AdapterInputFormat.setInternalInputFormatClass(conf, org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
spatialIndexing.setInputFormatClass(AdapterInputFormat.class);
MVSuggest
can be used at the time of spatial indexing to get an approximate location for records that do not have a geometry but have some text field. This text field can be used to determine the record location. The geometry returned by MVSuggest
is used to include the record in the spatial index.
Since it is important to know the field containing the search text for every record, the RecordInfoProvider
implementation must also implement LocalizableRecordInfoProvider
. For more information see the "LocalizableRecordInfoProvider."
MVSuggest can be used in the "Input Formats for Spatial Index" example by calling the following method in the SpatialIndexing
class:
spatialIndexing.setMVSUrl("http://host:port")
If the MVSuggest service is deployed on every cluster node, then the host can be set to localhost. The layers used by MVSuggest to perform the search can be specified by passing the layer names as an array of strings to the following method:
spatialIndexing.setMatchingLayers(new String[]{"world_continents", "world_countries", "world_cities"})
These settings can also be added directly to the job configuration using the following parameters:
conf.set(ConfigParams.MVS_URL, "http://host:port")
conf.set(ConfigParams.MVS_MATCHING_LAYERS, "world_continents, world_countries, world_cities")
Once the spatial index has been generated, it can be used to spatially filter the data. The filtering is performed before the data reaches the mapper, while it is being read. The following code example demonstrates how the SpatialFilterInputFormat
is used to spatially filter the data.
//set input path and format
FileInputFormat.setInputPaths(conf, new Path("/user/data/"));
conf.setInputFormat(SpatialFilterInputFormat.class);

//set internal input format
SpatialFilterInputFormat.setInternalInputFormatClass(conf, TextInputFormat.class);

if( spatialIndexPath != null )
{
   //set the path to the spatial index and put it in the distributed cache
   boolean useDistributedCache = true;
   SpatialFilterInputFormat.setSpatialIndexPath(conf, spatialIndexPath, useDistributedCache);
}
else
{
   //as no spatial index is used a RecordInfoProvider is needed
   SpatialFilterInputFormat.setRecordInfoProviderClass(conf, TwitterLogRecordInfoProvider.class);
}

//set spatial operation used to filter the records
SpatialFilterInputFormat.setSpatialOperation(conf, SpatialOperation.IsInside);
SpatialFilterInputFormat.setSpatialQueryWindow(conf, "{\"type\":\"Polygon\", \"coordinates\":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997, -106.64595, 25.83997]]}");
SpatialFilterInputFormat.setSRID(conf, 8307);
SpatialFilterInputFormat.setTolerance(conf, 0.5);
SpatialFilterInputFormat.setGeodetic(conf, true);
SpatialFilterInputFormat
has to be set as the job's InputFormat
. The InputFormat
that actually reads the data must be set as the internal InputFormat
. In this example, the internal InputFormat
is TextInputFormat
.
If a spatial index is specified, it is used for filtering. Otherwise, a RecordInfoProvider
must be specified in order to get the records geometries, in which case the filtering is performed record by record.
As a final step, the spatial operation and query window to perform the spatial filter are set. It is recommended to use the same internal InputFormat
implementation used when the spatial index was created or, at least, an implementation that uses the same criteria to generate the splits. For details see "Input Formats for Spatial Index."
The following steps are executed when records are filtered using the SpatialFilterInputFormat
and a spatial index.
The SpatialFilterInputFormat getRecordReader()
method is called when the mapper requests a RecordReader
for the current split.
The spatial index for the current split is retrieved.
A spatial query is performed over the records contained in it using the spatial index.
As a result, the ranges in the split that contain records meeting the spatial filter are known. For example, if a split goes from file position 1000 to 2000, after executing the spatial filter it can be determined that the records that fulfill the spatial condition are in the ranges 1100-1200, 1500-1600, and 1800-1950. So the result of performing the spatial filtering at this stage is a subset of the original split, consisting of smaller splits.
An InternalInputFormat
RecordReader
is requested for every small split from the resulting split subset.
A RecordReader is returned to the caller mapper. The returned RecordReader is actually a wrapper RecordReader with one or more RecordReaders returned by the internal InputFormat
.
Every time the mapper calls the RecordReader, the call to next method to read a record is delegated to the internal RecordReader.
These steps are shown in the following spatial filter interaction diagram.
The Vector Analysis API provides a way to classify the data into hierarchical entities. For example, in a given set of catalogs with a defined level of administrative boundaries such as continents, countries, and states, it is possible to join a record of the user data to a record of each level of the hierarchy data set. The following example generates a summary count for each hierarchy level, containing the number of user records per continent, country, and state or province:
HierarchicalCount<LongWritable, Text> hierCount = new HierarchicalCount<LongWritable, Text>();

//when spatial option is set the job will use a spatial index
hierCount.setOption(Option.parseOption(Option.Spatial));

//set the path to the spatial index
hierCount.setInput("/user/data_spatial_index/");

//set the job's output
hierCount.setOutput("/user/hierarchy_count");

//set HierarchyInfo implementation which describes the world administrative boundaries hierarchy
hierCount.setHierarchyInfoClass( WorldAdminHierarchyInfo.class );

//specify the paths of the hierarchy data
Path[] hierarchyDataPaths = {
   new Path("file:///home/user/catalogs/world_continents.json"),
   new Path("file:///home/user/catalogs/world_countries.json"),
   new Path("file:///home/user/catalogs/world_states_provinces.json")};
hierCount.setHierarchyDataPaths(hierarchyDataPaths);

//set the path where the index for the previous hierarchy data will be generated
hierCount.setHierarchyDataIndexPath("/user/hierarchy_data_index/");

//setup the spatial operation which will be used to join records from the two datasets (spatial index and hierarchy data)
hierCount.setSpatialOperation(SpatialOperation.IsInside);
hierCount.setSrid(8307);
hierCount.setTolerance(0.5);
hierCount.setGeodetic(true);

//add the previous setup to the job configuration
hierCount.configure(conf);

//run the job
RunningJob rj = JobClient.runJob(conf);

//call jobFinished to properly rename the output files
hierCount.jobFinished(rj.isSuccessful());
This example uses the HierarchicalCount
demo job driver to facilitate the configuration. The configuration can be divided into the following categories:
Input data: This is a previously generated spatial index.
Output data: This is a folder which contains the summary counts for each hierarchy level.
Hierarchy data configuration: This contains the following:
HierarchyInfo
class: This is an implementation of HierarchyInfo
class in charge of describing the current hierarchy data. It provides the number of hierarchy levels, level names, and the data contained at each level.
Hierarchy data paths: This is the path to each one of the hierarchy catalogs. These catalogs are read by the HierarchyInfo
class.
Hierarchy index path: This is the path where the hierarchy data index is stored. Hierarchy data needs to be preprocessed to know the parent-child relationships between hierarchy levels. This information is processed once and saved at the hierarchy index, so it can be used later by the current job or even by any other jobs.
Spatial operation configuration: This is the spatial operation to be performed between records of the user data and the hierarchy data in order to join both datasets. The parameters to set here are the Spatial Operation type (IsInside), SRID (8307), Tolerance (0.5 meters), and whether the geometries are Geodetic (true).
The HierarchicalCount.jobFinished()
method is called at the end to properly name the reducer output, so that only one file is left for each hierarchy level instead of multiple files with the -r-NNNNN postfix. Internally, the HierarchicalCount.configure()
method sets the mapper and reducer to be SpatialHierarchicalCountMapper
and SpatialHierarchicalCountReducer
respectively. SpatialHierarchicalCountMapper's
output key is a hierarchy entry identifier in the form hierarchy_level
+ hierarchy_entry_id
. The mapper output value is a single count for each output key. The reducer sums up all the counts for each key.
Note:
The entire hierarchy data may be loaded into memory; hence, the total size of all the catalogs is expected to be significantly less than the user data. The hierarchy data size should not be larger than a couple of gigabytes.
If you want another type of output instead of counts (for example, a list of user records according to the hierarchy entry), the SpatialHierarchicalJoinMapper
can be used. The SpatialHierarchicalJoinMapper
output value is a RecordInfo
instance, which can be gathered in a user-defined reducer to produce a different output. The following user-defined reducer generates a MapFile for each hierarchy level using the MultipleOutputs
class. Each MapFile has the hierarchy entry ids as keys and ArrayWritable
instances containing the matching records for each hierarchy entry as values. The following is a user-defined reducer that returns a list of records by hierarchy entry:
public class HierarchyJoinReducer extends MapReduceBase implements Reducer<Text, RecordInfo, Text, ArrayWritable> { private MultipleOutputs mos = null; private Text outKey = new Text(); private ArrayWritable outValue = new ArrayWritable( RecordInfo.class ); @Override public void configure(JobConf conf) { super.configure(conf); //use MultipleOutputs to generate different outputs for each hierarchy level mos = new MultipleOutputs(conf); } @Override public void reduce(Text key, Iterator<RecordInfo> values, OutputCollector<Text, RecordInfoArrayWritable> output, Reporter reporter) throws IOException { //Get the hierarchy level name and the hierarchy entry id from the key String[] keyComponents = HierarchyHelper.getMapRedOutputKeyComponents(key.toString()); String hierarchyLevelName = keyComponents[0]; String entryId = keyComponents[1]; List<Writable> records = new LinkedList<Writable>(); //load the values to memory to fill output ArrayWritable while(values.hasNext()) { RecordInfo recordInfo = new RecordInfo( values.next() ); records.add( recordInfo ); } if(!records.isEmpty()) { //set the hierarchy entry id as key outKey.set(entryId); //list of records matching the hierarchy entry id outValue.set( records.toArray(new Writable[]{} ) ); //get the named output for the given hierarchy level hierarchyLevelName = FileUtils.toValidMONamedOutput(hierarchyLevelName); OutputCollector<Text, ArrayWritable> mout = mos.getCollector(hierarchyLevelName, reporter); //Emit key and value mout.collect(outKey, outValue); } } @Override public void close() throws IOException { mos.close(); } }
The same reducer can be used in a job with the following configuration to generate a list of records according to the hierarchy levels:
JobConf conf = new JobConf(getConf()); //input path FileInputFormat.setInputPaths(conf, new Path("/user/data_spatial_index/") ); //output path FileOutputFormat.setOutputPath(conf, new Path("/user/records_per_hier_level/") ); //input format used to read the spatial index conf.setInputFormat( SequenceFileInputFormat.class); //output format: the real output format will be configured for each multiple output later conf.setOutputFormat(NullOutputFormat.class); //mapper conf.setMapperClass( SpatialHierarchicalJoinMapper.class ); conf.setMapOutputKeyClass(Text.class); conf.setMapOutputValueClass(RecordInfo.class); //reducer conf.setReducerClass( HierarchyJoinReducer.class ); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(ArrayWritable.class); //////////////////////////////////////////////////////////////////// //hierarchy data setup //set HierarchyInfo class implementation conf.setClass(ConfigParams.HIERARCHY_INFO_CLASS, WorldAdminHierarchyInfo.class, HierarchyInfo.class); //paths to hierarchical catalogs Path[] hierarchyDataPaths = { new Path("file:///home/user/catalogs/world_continents.json"), new Path("file:///home/user/catalogs/world_countries.json"), new Path("file:///home/user/catalogs/world_states_provinces.json")}; //path to hierarchy index Path hierarchyDataIndexPath = new Path("/user/hierarchy_data_index/"); //instantiate the HierarchyInfo class to index the data if needed. HierarchyInfo hierarchyInfo = new WorldAdminHierarchyInfo(); hierarchyInfo.initialize(conf); //Create the hierarchy index if needed. If it already exists, it will only load the hierarchy index to the distributed cache HierarchyHelper.setupHierarchyDataIndex(hierarchyDataPaths, hierarchyDataIndexPath, hierarchyInfo, conf); /////////////////////////////////////////////////////////////////// //setup the multiple named outputs: int levels = hierarchyInfo.getNumberOfLevels(); for(int i=1; i<=levels; i++) { String levelName = hierarchyInfo.getLevelName(i); //the hierarchy level name is used as the named output String namedOutput = FileUtils.toValidMONamedOutput(levelName); MultipleOutputs.addNamedOutput(conf, namedOutput, MapFileOutputFormat.class, Text.class, ArrayWritable.class); } //finally, setup the spatial operation conf.setInt(ConfigParams.SPATIAL_OPERATION, SpatialOperation.IsInside.getOrdinal()); conf.setInt(ConfigParams.SRID, 8307); conf.setFloat(ConfigParams.SPATIAL_TOLERANCE, 0.5f); conf.setBoolean(ConfigParams.GEODETIC, true); //run job JobClient.runJob(conf);
Supposing the output value should be an array of record ids instead of an array of RecordInfo
instances, it would be enough to perform a couple of changes in the previously defined reducer.
The line where outValue
is declared, in the previous example, changes to:
private ArrayWritable outValue = new ArrayWritable(Text.class);
The loop where the input values are retrieved, in the previous example, is changed so that the record ids are retrieved instead of the whole records:
while(values.hasNext()) { records.add( new Text(values.next().getId()) ); }
Although only the record id is needed, the mapper emits the whole RecordInfo
instance. Therefore, a better approach is to change the mapper's output value. The mapper's output value can be changed by extending AbstractSpatialJoinMapper
. In the following example, the mapper emits only the record ids instead of the whole RecordInfo
instance every time a record matches some of the hierarchy entries:
public class IdSpatialHierarchicalMapper extends AbstractSpatialHierarchicalMapper<Text>
{
   Text outValue = new Text();

   @Override
   protected Text getOutValue(RecordInfo matchingRecordInfo)
   {
      //the out value is the record's id
      outValue.set(matchingRecordInfo.getId());
      return outValue;
   }
}
By default, all the hierarchy levels defined in the HierarchyInfo
implementation are loaded when performing the hierarchy search. The range of hierarchy levels loaded is from level 1 (parent level) to the level returned by HierarchyInfo.getNumberOfLevels()
method. The following example shows how to set up a job to load only levels 2 and 3.
conf.setInt(ConfigParams.HIERARCHY_LOAD_MIN_LEVEL, 2);
conf.setInt(ConfigParams.HIERARCHY_LOAD_MAX_LEVEL, 3);
Note:
These parameters are useful when only a subset of the hierarchy levels is required and you do not want to modify the HierarchyInfo
implementation.
The search is always performed only at the bottom hierarchy level (the highest level number). If a user record matches some hierarchy entry at this level, then the match is propagated to the parent entry in upper levels. For example, if a user record matches Los Angeles, then it also matches California, USA, and North America. If there are no matches for a user record at the bottom level, then the search does not continue into the upper levels.
This behavior can be modified by setting the configuration parameter ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS
to true. In that case, if a search at the bottom hierarchy level results in some unmatched user records, the search continues into the upper levels until the top hierarchy level is reached or there are no more user records to join. This behavior can be used when the geometries of parent levels do not perfectly enclose the geometries of their child entries.
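For example, the multiple-level search can be enabled in the job configuration as follows (a minimal sketch, using the same conf object as in the previous examples):

//continue searching upper hierarchy levels for records not matched at the bottom level
conf.setBoolean(ConfigParams.HIERARCHY_SEARCH_MULTIPLE_LEVELS, true);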
MVSuggest
can be used instead of the spatial index to classify data. For this case, an implementation of LocalizableRecordInfoProvider
must be known and sent to MVSuggest to perform the search. See the information about LocalizableRecordInfoProvider
.
In the following example, the program option is changed from spatial to MVS. The input is the path to the user data instead of the spatial index. The InputFormat
used to read the user record and an implementation of LocalizableRecordInfoProvider
are specified. The MVSuggest service URL is set. Notice that no spatial operation configuration is needed in this case, as shown in the following code listing:
HierarchicalCount<LongWritable, Text> hierCount = new HierarchicalCount<LongWritable, Text>();

//set option to MVS
hierCount.setOption(Option.parseOption(Option.MVS));

//the input path is the user's data
hierCount.setInput("/user/data/");

//set the job's output
hierCount.setOutput("/user/mvs_hierarchy_count");

//set HierarchyInfo implementation which describes the world administrative boundaries hierarchy
hierCount.setHierarchyInfoClass( WorldAdminHierarchyInfo.class );

//specify the paths of the hierarchy data
Path[] hierarchyDataPaths = {
   new Path("file:///home/user/catalogs/world_continents.json"),
   new Path("file:///home/user/catalogs/world_countries.json"),
   new Path("file:///home/user/catalogs/world_states_provinces.json")};
hierCount.setHierarchyDataPaths(hierarchyDataPaths);

//set the path where the index for the previous hierarchy data will be generated
hierCount.setHierarchyDataIndexPath("/user/hierarchy_data_index/");

//No spatial operation configuration is needed. Instead, specify the InputFormat
//used to read the user's data and the LocalizableRecordInfoProvider class.
hierCount.setInputFormatClass( TextInputFormat.class );
hierCount.setRecordInfoProviderClass( MyLocalizableRecordInfoProvider.class );

//finally, set the MVSuggest service URL
hierCount.setMVSUrl("http://localhost:8080");

//add the previous setup to the job configuration
hierCount.configure(conf);

//run the job
RunningJob rj = JobClient.runJob(conf);

//call jobFinished to properly rename the output files
hierCount.jobFinished(rj.isSuccessful());
Note:
When using MVSuggest
, the hierarchy data files must be the same as the layer files used by MVSuggest. The hierarchy level names returned by the HierarchyInfo.getLevelNames()
method are used as the matching layers by MVSuggest.
The API provides a mapper and a demo job to generate a buffer around each record's geometry. The following code sample shows how to run a job to generate a buffer around each record geometry:
Buffer<LongWritable, Text> buffer = new Buffer<LongWritable, Text>();

//set path to input data
buffer.setInput("/user/waterlines/");

//set path to the resulting data
buffer.setOutput("/user/data_buffer/");

//set input format used to read the data
buffer.setInputFormatClass( TextInputFormat.class );

//set the record info provider implementation
buffer.setRecordInfoProviderClass(WorldSampleLineRecordInfoProvider.class);

//set the width of the buffers to be generated
buffer.setBufferWidth(Double.parseDouble(args[4]));

//set the previous configuration to the job configuration
buffer.configure(conf);

//run the job
JobClient.runJob(conf);
This example job uses BufferMapper
as the job mapper. BufferMapper
generates a buffer for each input record containing a geometry. The output key and values are the record id and a RecordInfo
instance containing the generated buffer. The resulting file is a Hadoop MapFile
containing the mapper output key and values.
BufferMapper
accepts the following parameters:
Parameter | ConfigParam constant | Type | Description |
---|---|---|---|
oracle.spatial.buffer.width | BUFFER_WIDTH | double | The buffer width |
oracle.spatial.buffer.sma | BUFFER_SMA | double | The semi major axis for the datum used in the coordinate system of the input |
oracle.spatial.buffer.iFlat | BUFFER_IFLAT | double | The flattening value |
oracle.spatial.buffer.arcT | BUFFER_ARCT | double | The arc tolerance used for geodetic densification |
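These parameters can also be supplied through the job configuration by property name. The following is a minimal sketch with illustrative values only (the width, semi-major axis, flattening, and arc tolerance shown are assumptions, not defaults); the Buffer driver shown earlier sets the buffer width through setBufferWidth() instead.

//buffer parameters set directly in the job configuration (illustrative values)
conf.setDouble("oracle.spatial.buffer.width", 0.001);          //buffer width
conf.setDouble("oracle.spatial.buffer.sma", 6378137.0);        //semi-major axis (assumed WGS84 value)
conf.setDouble("oracle.spatial.buffer.iFlat", 298.257223563);  //flattening value (assumed WGS84 value)
conf.setDouble("oracle.spatial.buffer.arcT", 0.05);            //arc tolerance for geodetic densification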
A record read by a MapReduce job from HDFS is represented in memory as a key-value pair using a Java type (typically a Writable subclass), such as LongWritable, Text, ArrayWritable, or some user-defined type. For example, records read using TextInputFormat are represented in memory as LongWritable, Text key-value pairs.
RecordInfoProvider
is the component that interprets these in-memory record representations and returns the data needed by the Vector Analysis API. Thus, the API is not tied to any specific format or memory representation.
The RecordInfoProvider
interface has the following methods:
void setCurrentRecord(K key, V value)
String getId()
JGeometry getGeometry()
boolean getExtraFields(Map<String, String> extraFields)
When a record is read from HDFS, the setCurrentRecord()
method is called to set the current record in the RecordInfoProvider
instance. The key-value pair passed as arguments is the same as the values returned by the RecordReader
provided by the InputFormat
. The RecordInfoProvider is then used to get the current record id, geometry, and extra fields. None of these fields are required. Only records with a geometry participate in the spatial operations. The Id is useful for differentiating records in operations such as categorization. The extra fields can be used to store any record information that can be represented as text and that needs to be quickly accessed without reading the original record, or for operations where MVSuggest
is used.
Typically, the information returned by RecordInfoProvider is used to populate RecordInfo instances. A RecordInfo can be thought of as a lightweight version of a record; it contains the information returned by the RecordInfoProvider plus information to locate the original record in a file.
The following is a JsonRecordInfoProvider implementation, which takes text records in JSON format that are read using TextInputFormat. A sample record is shown here:
{ "_id":"ABCD1234", "location":" 119.31669, -31.21615", "locationText":"Boston, Ma", "date":"03-18-2015", "time":"18:05", "device-type":"cellphone", "device-name":"iPhone"}
When a JsonRecordInfoProvider is instantiated, a JSON ObjectMapper is created. The ObjectMapper is used to parse only the record value when the setCurrentRecord() method is called; the record key is ignored. The record id, geometry, and one extra field are retrieved from the _id, location, and locationText JSON properties. The geometry is represented as a latitude-longitude pair and is used to create a point geometry using the JGeometry.createPoint() method. The extra field (locationText) is added to the extraFields map, which serves as an out parameter, and true is returned to indicate that an extra field was added.
public class JsonRecordInfoProvider implements RecordInfoProvider<LongWritable, Text> {

   private Text value = null;
   private ObjectMapper jsonMapper = null;
   private JsonNode recordNode = null;

   public JsonRecordInfoProvider() {
      //json mapper used to parse all the records
      jsonMapper = new ObjectMapper();
   }

   @Override
   public void setCurrentRecord(LongWritable key, Text value) throws Exception {
      try {
         //parse the current value
         recordNode = jsonMapper.readTree(value.toString());
      } catch (Exception ex) {
         recordNode = null;
         throw ex;
      }
   }

   @Override
   public String getId() {
      String id = null;
      if (recordNode != null) {
         id = recordNode.get("_id").getTextValue();
      }
      return id;
   }

   @Override
   public JGeometry getGeometry() {
      JGeometry geom = null;
      if (recordNode != null) {
         //location is represented as a lat,lon pair
         String location = recordNode.get("location").getTextValue();
         String[] locTokens = location.split(",");
         double lat = Double.parseDouble(locTokens[0]);
         double lon = Double.parseDouble(locTokens[1]);
         geom = JGeometry.createPoint(new double[]{lon, lat}, 2, 8307);
      }
      return geom;
   }

   @Override
   public boolean getExtraFields(Map<String, String> extraFields) {
      boolean extraFieldsExist = false;
      if (recordNode != null) {
         extraFields.put("locationText", recordNode.get("locationText").getTextValue());
         extraFieldsExist = true;
      }
      return extraFieldsExist;
   }
}
The LocalizableRecordInfoProvider interface extends RecordInfoProvider and is used to identify the extra field that can be used as the search text when MVSuggest is used.
The only method added by this interface is getLocationServiceField(), which returns the name of the extra field that is sent to MVSuggest.
In addition, the following is an implementation based on "Sample RecordInfo Implementation." The name returned in this example is locationText, which is the name of the extra field included in the parent class.
public class LocalizableJsonRecordInfoProvider extends JsonRecordInfoProvider implements LocalizableRecordInfoProvider<LongWritable, Text> {

   @Override
   public String getLocationServiceField() {
      return "locationText";
   }
}
The HierarchyInfo interface is used to describe a hierarchical dataset. An implementation of HierarchyInfo is expected to provide the number, names, and entries of the hierarchy levels of the hierarchy it describes.
The root hierarchy level is always level 1. The entries in this level do not have parent entries, and this level is referred to as the top hierarchy level. Children hierarchy levels have higher level values. For example, the levels for a hierarchy formed by continents, countries, and states are 1, 2, and 3 respectively. Entries in the continent layer do not have a parent, but they have children entries in the countries layer. Entries at the bottom level, the states layer, do not have children.
A HierarchyInfo implementation is provided out of the box with the Vector Analysis API. The DynaAdminHierarchyInfo implementation can be used to read and describe the known hierarchy layers in GeoJSON format. A DynaAdminHierarchyInfo can be instantiated and configured, or it can be subclassed. The hierarchy layers to be contained are specified by calling the addLevel() method, which takes the following parameters:
The hierarchy level number
The hierarchy level name, which must match the file name (without extension) of the GeoJSON file that contains the data. For example, the hierarchy level name for the file world_continents.json must be world_continents, for world_countries.json it is world_countries, and so on.
Children join field: This is a JSON property that is used to join entries of the current level with child entries in the lower level. If null is passed, then the entry id is used.
Parent join field: This is a JSON property used to join entries of the current level with parent entries in the upper level. This value is not used for the topmost level, which has no upper level to join. If the value is set to null for any other level greater than 1, an IsInside spatial operation is performed to join parent and child entries. In this scenario, it is assumed that an upper-level geometry can contain lower-level entries.
For example, let us assume a hierarchy containing the following levels from the specified layers: 1- world_continents, 2 - world_countries and 3 - world_states_provinces. A sample entry from each layer would look like the following:
world_continents: {"type":"Feature","_id":"NA","geometry":{"type":"MultiPolygon","coordinates":[ x,y,x,y,x,y ]},"properties":{"NAME":"NORTH AMERICA","CONTINENT_LONG_LABEL":"North America"},"label_box":[-118.07998,32.21006,-86.58515,44.71352]}

world_countries: {"type":"Feature","_id":"iso_CAN","geometry":{"type":"MultiPolygon","coordinates":[ x,y,x,y,x,y ]},"properties":{"NAME":"CANADA","CONTINENT":"NA","ALT_REGION":"NA","COUNTRY CODE":"CAN"},"label_box":[-124.28092,49.90408,-94.44878,66.89287]}

world_states_provinces: {"type":"Feature","_id":"6093943","geometry":{"type":"Polygon","coordinates":[ x,y,x,y,x,y ]},"properties":{"COUNTRY":"Canada","ISO":"CAN","STATE_NAME":"Ontario"},"label_box":[-91.84903,49.39557,-82.32462,54.98426]}
A DynaAdminHierarchyInfo can be configured to create a hierarchy with the above layers in the following way:
DynaAdminHierarchyInfo dahi = new DynaAdminHierarchyInfo();

dahi.addLevel(1, "world_continents",
    null /*_id is used by default to join with child entries*/,
    null /*not needed as there are no upper hierarchy levels*/);

dahi.addLevel(2, "world_countries",
    "properties.COUNTRY CODE" /*field used to join with child entries*/,
    "properties.CONTINENT" /*the value "NA" will be used to find Canada's parent, which is North America and whose _id field value is also "NA"*/);

dahi.addLevel(3, "world_states_provinces",
    null /*not needed as no child entries are expected*/,
    "properties.ISO" /*field used to join with parent entries. For Ontario, it is the same value as the field properties.COUNTRY CODE specified for Canada*/);

//save the previous configuration to the job configuration
dahi.initialize(conf);
A similar configuration can be used to create hierarchies from different layers, such as countries, states, and counties, or any other layers with a similar JSON format.
Alternatively, to avoid configuring a hierarchy every time a job is executed, the hierarchy configuration can be enclosed in a DynaAdminHierarchyInfo subclass as in the following example:
public class WorldDynaAdminHierarchyInfo extends DynaAdminHierarchyInfo {

   public WorldDynaAdminHierarchyInfo() {
      super();
      addLevel(1, "world_continents", null, null);
      addLevel(2, "world_countries", "properties.COUNTRY CODE", "properties.CONTINENT");
      addLevel(3, "world_states_provinces", null, "properties.ISO");
   }
}
The HierarchyInfo interface contains the following methods, which must be implemented to describe a hierarchy. The methods can be divided into the following three categories:
Methods to describe the hierarchy
Methods to load data
Methods to supply data
Additionally, there is an initialize() method, which can be used to perform any initialization and to save and read data both to and from the job configuration.
void initialize(JobConf conf);

//methods to describe the hierarchy
String getLevelName(int level);
int getLevelNumber(String levelName);
int getNumberOfLevels();

//methods to load data
void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception;
void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception;

//methods to supply data
Collection<String> getEntriesIds(int level);
JGeometry getEntryGeometry(int level, String entryId);
String getParentId(int childLevel, String childId);
The following is a sample HierarchyInfo implementation, which takes the previously mentioned world layers as the hierarchy levels. The first section contains the initialize() method and the methods used to describe the hierarchy. In this case, the initialize() method does nothing. The methods in the following example use the hierarchyLevelNames array to provide the hierarchy description. The instance variables entriesGeoms and entriesParents are arrays of java.util.Map, which contain the entry geometries and entry parents, respectively. The entry ids are used as keys in both cases. Since the array indices are zero-based and the hierarchy levels are one-based, the array indices correlate to the hierarchy levels as array index + 1 = hierarchy level.
public class WorldHierarchyInfo implements HierarchyInfo {

   private String[] hierarchyLevelNames = {"world_continents", "world_countries", "world_states_provinces"};

   private Map<String, JGeometry>[] entriesGeoms = new Map[3];
   private Map<String, String>[] entriesParents = new Map[3];

   @Override
   public void initialize(JobConf conf) {
      //do nothing for this implementation
   }

   @Override
   public int getNumberOfLevels() {
      return hierarchyLevelNames.length;
   }

   @Override
   public String getLevelName(int level) {
      String levelName = null;
      if (level >= 1 && level <= hierarchyLevelNames.length) {
         levelName = hierarchyLevelNames[level - 1];
      }
      return levelName;
   }

   @Override
   public int getLevelNumber(String levelName) {
      for (int i = 0; i < hierarchyLevelNames.length; i++) {
         if (hierarchyLevelNames[i].equals(levelName)) {
            return i + 1;
         }
      }
      return -1;
   }
The following example contains the methods that load the data for the different hierarchy levels. The load() method reads the data from the source files world_continents.json, world_countries.json, and world_states_provinces.json. For the sake of simplicity, the internally called loadLevel() method is not shown, but it is supposed to parse and read the JSON files.
The loadFromIndex() method only takes the information provided by the HierarchyIndexReader instances passed as parameters. The load() method is executed only once in a job, and only if a hierarchy index has not been created. Once the data is loaded, it is automatically indexed, and the loadFromIndex() method is called every time the hierarchy data is loaded into memory.
   @Override
   public void load(Path[] hierDataPaths, int fromLevel, JobConf conf) throws Exception {
      int toLevel = fromLevel + hierDataPaths.length - 1;
      int levels = getNumberOfLevels();

      for (int i = 0, level = fromLevel; i < hierDataPaths.length && level <= levels; i++, level++) {
         //load current level from the current path
         loadLevel(level, hierDataPaths[i]);
      }
   }

   @Override
   public void loadFromIndex(HierarchyDataIndexReader[] readers, int fromLevel, JobConf conf) throws Exception {
      Text parentId = new Text();
      RecordInfoArrayWritable records = new RecordInfoArrayWritable();
      int levels = getNumberOfLevels();

      //iterate through each reader to load each level's entries
      for (int i = 0, level = fromLevel; i < readers.length && level <= levels; i++, level++) {
         entriesGeoms[level - 1] = new Hashtable<String, JGeometry>();
         entriesParents[level - 1] = new Hashtable<String, String>();

         //each entry is a parent record id (key) and a list of entries as RecordInfo (value)
         while (readers[i].nextParentRecords(parentId, records)) {
            String pId = null;

            //entries with no parent will have the parent id UNDEFINED_PARENT_ID. Such is the case of the first level entries
            if (!UNDEFINED_PARENT_ID.equals(parentId.toString())) {
               pId = parentId.toString();
            }

            //add the current level's entries
            for (Object obj : records.get()) {
               RecordInfo entry = (RecordInfo) obj;
               entriesGeoms[level - 1].put(entry.getId(), entry.getGeometry());
               if (pId != null) {
                  entriesParents[level - 1].put(entry.getId(), pId);
               }
            } //finish loading current parent entries
         } //finish reading single hierarchy level index
      } //finish iterating index readers
   }
Finally, the following code listing contains the methods used to provide information about individual entries in each hierarchy level. The information provided is the ids of all the entries contained in a hierarchy level, the geometry of each entry, and the parent of each entry.
   @Override
   public Collection<String> getEntriesIds(int level) {
      Collection<String> ids = null;
      if (level >= 1 && level <= getNumberOfLevels() && entriesGeoms[level - 1] != null) {
         //returns the ids of all the entries from the given level
         ids = entriesGeoms[level - 1].keySet();
      }
      return ids;
   }

   @Override
   public JGeometry getEntryGeometry(int level, String entryId) {
      JGeometry geom = null;
      if (level >= 1 && level <= getNumberOfLevels() && entriesGeoms[level - 1] != null) {
         //returns the geometry of the entry with the given id and level
         geom = entriesGeoms[level - 1].get(entryId);
      }
      return geom;
   }

   @Override
   public String getParentId(int childLevel, String childId) {
      String parentId = null;
      if (childLevel >= 1 && childLevel <= getNumberOfLevels() && entriesGeoms[childLevel - 1] != null) {
         //returns the parent id of the entry with the given id and level
         parentId = entriesParents[childLevel - 1].get(childId);
      }
      return parentId;
   }
} //end of class
The Spatial Hadoop Vector Analysis API contains only a small subset of the functionality provided by the Spatial Java API, which can also be used in MapReduce jobs. This section provides some simple examples of how JGeometry can be used in Hadoop for spatial processing. The following example contains a simple mapper that performs the IsInside test between a dataset and a query geometry using the JGeometry class.
In this example, the query geometry ordinates, SRID, geodetic value, and tolerance used in the spatial operation are retrieved from the job configuration in the configure() method. The query geometry, which is a polygon, is preprocessed to quickly perform the IsInside operation.
The map() method is where the spatial operation is executed. Each input record value is tested against the query geometry, and the record id is emitted when the test succeeds.
public class IsInsideMapper extends MapReduceBase implements Mapper<LongWritable, Text, NullWritable, Text> {

   private JGeometry queryGeom = null;
   private int srid = 0;
   private double tolerance = 0.0;
   private boolean geodetic = false;
   private Text outputValue = new Text();
   private double[] locationPoint = new double[2];

   @Override
   public void configure(JobConf conf) {
      super.configure(conf);
      srid = conf.getInt("srid", 8307);
      tolerance = conf.getDouble("tolerance", 0.0);
      geodetic = conf.getBoolean("geodetic", true);

      //The ordinates are represented as a string of comma separated double values
      String[] ordsStr = conf.get("ordinates").split(",");
      double[] ordinates = new double[ordsStr.length];
      for (int i = 0; i < ordsStr.length; i++) {
         ordinates[i] = Double.parseDouble(ordsStr[i]);
      }

      //create the query geometry as a two-dimensional polygon with the given srid
      queryGeom = JGeometry.createLinearPolygon(ordinates, 2, srid);

      //preprocess the query geometry to make the IsInside operation run faster
      try {
         queryGeom.preprocess(tolerance, geodetic, EnumSet.of(FastOp.ISINSIDE));
      } catch (Exception e) {
         e.printStackTrace();
      }
   }

   @Override
   public void map(LongWritable key, Text value, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
      //the input value is a comma separated text value with the following columns: id, x-ordinate, y-ordinate
      String[] tokens = value.toString().split(",");

      //create a geometry representation of the record's location
      locationPoint[0] = Double.parseDouble(tokens[1]); //x ordinate
      locationPoint[1] = Double.parseDouble(tokens[2]); //y ordinate
      JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);

      //perform the spatial test
      try {
         if (location.isInside(queryGeom, tolerance, geodetic)) {
            //emit the record's id
            outputValue.set(tokens[0]);
            output.collect(NullWritable.get(), outputValue);
         }
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}
A similar approach can be used to perform a spatial operation on the geometry itself, for example, to create a buffer. The following example uses the same text value format and creates a buffer around each record location. The mapper output key and value are the record id and the generated buffer, which is represented as a JGeometryWritable. The JGeometryWritable is a Writable implementation contained in the Vector Analysis API that holds a JGeometry instance.
public class BufferMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, JGeometryWritable> {

   private int srid = 0;
   private double bufferWidth = 0.0;
   private Text outputKey = new Text();
   private JGeometryWritable outputValue = new JGeometryWritable();
   private double[] locationPoint = new double[2];

   @Override
   public void configure(JobConf conf) {
      super.configure(conf);
      srid = conf.getInt("srid", 8307);

      //get the buffer width
      bufferWidth = conf.getDouble("bufferWidth", 0.0);
   }

   @Override
   public void map(LongWritable key, Text value, OutputCollector<Text, JGeometryWritable> output, Reporter reporter) throws IOException {
      //the input value is a comma separated record with the following columns: id, longitude, latitude
      String[] tokens = value.toString().split(",");

      //create a geometry representation of the record's location
      locationPoint[0] = Double.parseDouble(tokens[1]);
      locationPoint[1] = Double.parseDouble(tokens[2]);
      JGeometry location = JGeometry.createPoint(locationPoint, 2, srid);

      try {
         //create the location's buffer
         JGeometry buffer = location.buffer(bufferWidth);

         //emit the record's id and the generated buffer
         outputKey.set(tokens[0]);
         outputValue.setGeometry(buffer);
         output.collect(outputKey, outputValue);
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}
The following table lists some running times for jobs built using the Vector Analysis API. The jobs were executed using a 4-node cluster. The times may vary depending on the characteristics of the cluster. The test dataset contains over one billion records, and its size is above 1 terabyte.
Table 2-2 Performance time for running jobs using Vector Analysis API
Job Type | Time taken (approximate value) |
---|---|
Spatial Indexing | 2 hours |
Spatial Filter with Spatial Index | 1 hour |
Spatial Filter without Spatial Index | 3 hours |
Hierarchy count with Spatial Index | 5 minutes |
Hierarchy count without Spatial Index | 3 hours |
The time taken for the jobs can be decreased by increasing the maximum split size using either of the following configuration parameters:
mapred.max.split.size
mapreduce.input.fileinputformat.split.maxsize
This results in more splits being processed by each single mapper and improves the execution time. This is done by using the SpatialFilterInputFormat (spatial indexing) or FileSplitInputFormat (spatial hierarchical join, buffer). Also, the same results can be achieved by using an implementation of CombineFileInputFormat as the internal InputFormat.
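For example, a minimal sketch that sets both properties to the same value on the job configuration might look like the following. The 512 MB value shown in the usage comment is illustrative and should be tuned for the cluster.
import org.apache.hadoop.mapred.JobConf;

public class SplitSizeSettings {
   //illustrative helper: raise the maximum split size so that each mapper processes more data
   public static void setMaxSplitSize(JobConf conf, long maxSplitSizeBytes) {
      conf.setLong("mapred.max.split.size", maxSplitSizeBytes);
      conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitSizeBytes);
   }
}

//usage: SplitSizeSettings.setMaxSplitSize(conf, 512L * 1024L * 1024L);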
You can use the Oracle Big Data Spatial and Graph Vector Console to perform tasks related to spatial indexing and creating and showing thematic maps.
To create a spatial index using the Oracle Big Data Spatial and Graph Vector Console, follow these steps.
Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/
Click Create Index.
Specify all the required details:
Path of the data to index: Path of the data in HDFS. For example, hdfs://<server_name_to_store_index>:8020/user/hsaucedo/twitter_logs/part-m-00000.
New index path: This is the job output path. For example: hdfs://<oracle_big_data_spatial_vector_console>:8020/user/rinfante/Twitter/index.
Input Format class: The input format class. For example: org.apache.hadoop.mapred.TextInputFormat
Record Info Provider class: The class that provides the spatial information. For example: oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider.
Note:
If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.
MVSuggest service URL (Optional): If the geometry has to be found from a location string, then use the MVSuggest service. If the service URL is localhost, then each data node must have the MVSuggest application started and running. In this case, the new index contains the point geometry and the layer provided by MVSuggest for each record. If the geometry is a polygon, then the point geometry used is the centroid of the polygon. For example: http://localhost:8080
MVSuggest Templates (Optional): The user can define the templates used to create the index with MVSuggest. For optimal results, it is recommended to select the same templates for the index creation and when running the hierarchy job.
Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com
Click Create.
The submitted job is listed, and you should wait to receive an email notification that the job completed successfully.
You can run a hierarchy job with or without a spatial index. Follow these steps.
Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/.
Click Run Job.
Select either With Index or Without Index and provide the following details, as required:
With Index
Index path: The index file path in HDFS. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/Twitter/index/part*/data.
SRID: The SRID of the geometries used to build the index. For example, 8307.
Tolerance: The tolerance of the geometries used to build the index. For example, 0.5.
Geodetic: Whether the geometries used are geodetic or not. Select Yes or No.
Without Index
Path of the data: Provide the HDFS data path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/*/data.
JAR with user classes (Optional): If the InputFormat class or the RecordInfoProvider class is not in the API or in the Hadoop API classes, then a jar with the user-defined classes must be provided. To use this jar, add it to the $JETTY_HOME/webapps/spatialviewer/WEB-INF/lib directory and restart the server.
Input Format class: The input format class. For example: org.apache.hadoop.mapred.FileInputFormat
Record Info Provider class: The class that provides the spatial information. For example: oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider.
MVSuggest service URL: If the geometry has to be found from a location string, then use the MVSuggest service. If the service URL is localhost, then each data node must have the MVSuggest application started and running. In this case, the result contains the point geometry and the layer provided by MVSuggest for each record. If the geometry is a polygon, then the point geometry used is the centroid of the polygon. For example: http://localhost:8080
Templates: The templates to create the thematic maps. For optimal results, it is recommended to select the same templates for the index creation and when running the hierarchy job. For example, World Continents, World Countries, and so on.
Note:
If a template refers to point geometries (for example, cities) and MVSuggest is not used, the result returned for that template is empty. This is because the spatial operations return results only for polygons.
Tip:
When using the MVSuggest service, the results will be accurate provided that all the templates that could match the results are set. For example, if the data can refer to any city, state, country, or continent in the world, then the better choice of templates to build results are World Continents, World Countries, World State Provinces, and World Cities. On the other hand, if the data is from USA states, counties, and cities, then the suitable templates are USA States, USA Counties, and USA Cities.
Output path: The Hadoop job output path. For example, hdfs://<oracle_big_data_spatial_vector_console>:8020/user/rinfante/Twitter/myoutput.
Result name: The result name. If a result exists for a template with the same name, it is overwritten. For example, Tweets test.
Outcome notification email sent to (Optional): Provide email IDs to receive notifications when the job finishes. Separate multiple email IDs with a semicolon. For example, mymail@abccorp.com.
To view the results, follow these steps.
Open http://<oracle_big_data_spatial_vector_console>:8080/spatialviewer/.
Click Show Hadoop Results.
Click any one of the Templates. For example, World Continents.
The World Continents template is displayed.
Click any one of the Results displayed.
Different continents appear with different patches of colors.
Click any continent from the map. For example, North America.
The template changes to World Countries and the focus changes to North America with the results by country.
It is possible to create and upload result files stored locally, for example, a result file created by a job executed from the command line. The templates are located in the folder $JETTY_HOME/webapps/spatialviewer/templates. The templates are GeoJSON files with features, and all the features have ids. For example, the first feature in the template USA States starts with: {"type":"Feature","_id":"WYOMING", ...
The results must be JSON files with the following format: {"id":"JSONFeatureId","result":result}.
For example, if the template USA States is selected, then a valid result is a file containing: {"id":"WYOMING","result":3232} {"id":"SOUTH DAKOTA","result":74968}
From Show Hadoop Results, click the icon.
Select a Template from the drop down.
Specify a Name.
Click Choose File to select the File location.
Click Save.
The results can be located in the $JETTY_HOME/webapps/spatialviewer/results folder.
To create new templates, do the following:
Add the template JSON file in the folder $JETTY_HOME/webapps/spatialviewer/templates/.
Add the template configuration file in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_.
If MVSuggest is used, then add the new templates to the MVSuggest templates.
To delete the template, delete the JSON and configuration files added in steps 1 and 2.
Each template has a configuration file. The template configuration files are located in the folder $JETTY_HOME/webapps/spatialviewer/templates/_config_. The name of the configuration file is the same as the template file name, with the suffix config.json instead of .json. For example, the configuration file name for the template file usa_states.json is usa_states.config.json. The configuration parameters are:
name: Name of the template to be shown on the console. For example, name: USA States.
display_attribute: When displaying a result, moving the cursor over a feature displays this property and the result of the feature. For example, display_attribute: STATE NAME.
point_geometry: True if the template contains point geometries, and false in the case of polygons. For example, point_geometry: false.
child_templates (optional): A template can have several possible child templates, separated by a comma. For example, child_templates: ["world_states_provinces, usa_states(properties.COUNTRY CODE:properties.PARENT_REGION)"].
If a child template does not specify a linked field, then all the features inside the parent feature are considered child features. In this case, world_states_provinces does not specify any fields. If the link between parent and child is specified, then the spatial relationship does not apply and the feature property links are checked instead. In the above example, the relationship with usa_states is found with the property COUNTRY CODE in the current template and the property PARENT_REGION in the template file usa_states.json.
srid: The SRID of the template's geometries. For example, srid: 8307.
bounds: The bounds of the map when showing the template. For example, bounds: [-180, -90, 180, 90].
numberOfZoomLevels: The number of zoom levels of the map when showing the template. For example, numberOfZoomLevels: 19.
back_polygon_template_file_name (optional): A template with polygon geometries to set as background when showing the defined template. For example, back_polygon_template_file_name: usa_states.
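Putting the preceding parameters together, a hypothetical usa_states.config.json built only from the example values listed above might look like the following. The optional child_templates and back_polygon_template_file_name entries are omitted, and the exact file layout may differ in a particular installation.
{
  "name": "USA States",
  "display_attribute": "STATE NAME",
  "point_geometry": false,
  "srid": 8307,
  "bounds": [-180, -90, 180, 90],
  "numberOfZoomLevels": 19
}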
Run the following command to generate an index:
hadoop jar <HADOOP_LIB_PATH>/<jarfile> oracle.spatial.hadoop.vector.demo.job.SpatialIndexing <DATA_PATH> <SPATIAL_INDEX_PATH> <INPUT_FORMAT_CLASS> <RECORD_INFO_PROVIDER_CLASS>
Where,
jarfile is the jar file to be specified by the user for spatial indexing.
DATA_PATH is the location of the data to be indexed.
SPATIAL_INDEX_PATH is the location of the resulting spatial index.
INPUT_FORMAT_CLASS is the InputFormat implementation used to read the data.
RECORD_INFO_PROVIDER_CLASS is the implementation used to extract information from the records.
The following example uses the command line to create an index.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.SpatialIndexing "/user/hdfs/demo_vector/tweets/part*" /user/hdfs/demo_vector/tweets/spatial_index org.apache.hadoop.mapred.TextInputFormat oracle.spatial.hadoop.vector.demo.usr.TwitterLogRecordInfoProvider
Note:
The preceding example uses the demo job that comes preloaded.
Run the following command to perform a spatial filter:
hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.TwitterLogSearch <DATA_PATH> <RESULT_PATH> <SEARCH_TEXT> <SPATIAL_INTERACTION> <GEOMETRY> <SRID> <TOLERANCE> <GEODETIC> <SPATIAL_INDEX_PATH>
Where,
DATA_PATH is the data to be filtered.
RESULT_PATH is the path where the results are generated.
SEARCH_TEXT is the text to be searched.
SPATIAL_INTERACTION is the type of spatial interaction. It can be either 1 (is inside) or 2 (any interact).
GEOMETRY is the geometry used to perform the spatial filter. It should be expressed in JSON format.
SRID is the SRS id of the query geometry.
TOLERANCE is the tolerance used for the spatial search.
GEODETIC specifies whether the geometries are geodetic or not.
SPATIAL_INDEX_PATH is the path to a previously generated spatial index.
The following example counts all tweets containing some text and interacting with a given geometry.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.TwitterLogSearch "/user/hdfs/demo_vector/tweets/part*" /user/hdfs/demo_vector/tweets/text_search feel 2 '{"type":"Polygon", "coordinates":[[-106.64595, 25.83997, -106.64595, 36.50061, -93.51001, 36.50061, -93.51001, 25.83997, -106.64595, 25.83997]]}' 8307 0.0 true /user/hdfs/demo_vector/tweets/spatial_index
Note:
The preceding example uses the demo job that comes preloaded.
Run the following command to create a hierarchy result:
hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.HierarchicalCount spatial <SPATIAL_INDEX_PATH> <RESULT_PATH> <HIERARCHY_INFO_CLASS> <SPATIAL_INTERACTION> <SRID> <TOLERANCE> <GEODETIC> <HIERARCHY_DATA_INDEX_PATH> <HIERARCHY_DATA_PATHS>
Where,
SPATIAL_INDEX_PATH is the path to a previously generated spatial index. If there are files other than the spatial index files in this path (for example, the Hadoop-generated _SUCCESS file), then add the following pattern: SPATIAL_INDEX_PATH/part*/data.
RESULT_PATH is the path where the results are generated. There should be a file named XXXX_count.json for each hierarchy level.
HIERARCHY_INFO_CLASS is the HierarchyInfo implementation. It defines the structure of the current hierarchy data.
SPATIAL_INTERACTION, SRID, TOLERANCE, and GEODETIC have the same meaning as defined previously for the text search job. They are used to define the spatial operation between the data and the hierarchical information geometries.
HIERARCHY_DATA_INDEX_PATH is the path where the hierarchy data index will be placed. This index is used by the job to avoid computing parent-child relationships each time they are required.
HIERARCHY_DATA_PATHS is a comma-separated list of paths to the hierarchy data. If a hierarchy index was previously created for the same hierarchy data, this parameter can be omitted.
The following example uses the command line to create a hierarchy count result.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.HierarchicalCount spatial "/user/hdfs/demo_vector/tweets/spatial_index/part*/data" /user/hdfs/demo_vector/tweets/hier_count_spatial oracle.spatial.hadoop.vector.demo.usr.WorldAdminHierarchyInfo 1 8307 0.5 true /user/hdfs/demo_vector/world_hier_index file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_continents.json,file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_countries.json,file:///net/den00btb/scratch/hsaucedo/spatial_bda/demo/vector/catalogs/world_states_provinces.json
Note:
The preceding example uses the demo job that comes preloaded.
Run the following command to generate a buffer:
hadoop jar HADOOP_LIB_PATH/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.Buffer <DATA_PATH> <RESULT_PATH> <INPUT_FORMAT_CLASS> <RECORD_INFO_PROVIDER_CLASS> <BUFFER_WIDTH> <BUFFER_SMA> <BUFFER_FLAT> <BUFFER_ARCTOL>
Where,
DATA_PATH is the path of the input data.
RESULT_PATH is the path where the results are generated.
INPUT_FORMAT_CLASS is the InputFormat implementation used to read the data.
RECORD_INFO_PROVIDER_CLASS is the RecordInfoProvider implementation used to extract data from each record.
BUFFER_WIDTH specifies the buffer width.
BUFFER_SMA is the semi major axis.
BUFFER_FLAT is the buffer flattening.
BUFFER_ARCTOL is the arc tolerance for geodetic arc densification.
The following example demonstrates generating a buffer around each record geometry. The resulting file is a MapFile where each entry corresponds to a record from the input data. The entry key is the record id, and the value is a RecordInfo instance holding the generated buffer and the record location (path, start offset, and length).
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/sdohadoop-vector-demo.jar oracle.spatial.hadoop.vector.demo.job.Buffer "/user/hdfs/demo_vector/waterlines/part*" /user/hdfs/demo_vector/waterlines/buffers org.apache.hadoop.mapred.TextInputFormat oracle.spatial.hadoop.vector.demo.usr.WorldSampleLineRecordInfoProvider 5.0
Note:
The preceding example uses the demo job that comes preloaded.
You can use the Oracle Big Data Spatial and Graph Image Server Console to perform tasks such as loading images to the HDFS Hadoop cluster to create a mosaic.
Follow the instructions to create a mosaic:
Open http://<oracle_big_data_image_server_console>:8080/spatialviewer/.
Type the username and password.
Click the Configuration tab and review the Hadoop configuration section.
By default the application is configured to work with the Hadoop cluster and no additional configuration is required.
Note:
Only an admin user can make changes to this section.
Click the Hadoop Loader tab and review the displayed instructions or alerts.
Follow the instructions and update the runtime configuration, if necessary.
Click the Folder icon.
The File System dialog displays the list of image files and folders.
Select the folders or files as required and click Ok.
The complete path to the image file is displayed.
Click Load Images.
Wait for the images to be loaded successfully. A message is displayed.
Proceed to create a mosaic, if there are no errors displayed.