Indexing GeoJson Data

Indexing GeoJson data is similar to indexing other json data. In the GeoJson case, the GEOMETRY or POINT keyword must be used after an index path that leads to geometry objects. POINT should be used only if all rows in the table are expected to have single point geometries at the indexed field (GEOMETRY can also be used in this case, but POINT is recommended for better performance). As in the case of other json data, an error will be raised if for some row the value of the index path is not a valid GeoJson point or geometry, unless that value is NULL, json null, or EMPTY.

An index that includes a path to geometry objects is called a geometry index. A geometry index can index other fields as well, but some restrictions apply: (a) a geometry index cannot index more than one GeoJson field, (b) the GeoJson field cannot be inside an array, unless it is a POINT field, and (c) a geometry index cannot be a multi-key index, unless the GeoJson field is a POINT field and the array or map being indexed is the one that contains the POINT field.

Indexing of geometries is based on geohashing. Geohashing is an encoding of a longitude/latitude pair to a string. It works by recursively partitioning the 2-D longitude/latitude coordinate system into a hierarchy of rectangulars called cells. The initial (level-0) cell is the whole world, i.e., all points with a longitude between -180 and +180 and latitude between -90 and +90. The first (level-0) split creates the 32 level-1 cells shown in the following figure. Each cell is assigned a "name", which is a character out of this 32-char-long string G = "0123456789bcdefghjkmnpqrstuvwxyz". This name is called the geohash of the cell.

Figure 11-4 32 Level-1 Geohash Cells

Description of Figure 11-4 follows
Description of "Figure 11-4 32 Level-1 Geohash Cells"

The next (level-1) split splits each level-1 cell into 32 level-2 cells. The following figure shows all the level-2 cells inside the "g" cell.

Figure 11-5 Level-1 and Level-2 Geohash Cells

Description of Figure 11-5 follows
Description of "Figure 11-5 Level-1 and Level-2 Geohash Cells"

As shown, the geohash of each level-2 is 2 chars long, where the 1st char is the geohash of the parent cell, and the 2nd char is again drawn from the same char set. This process continues down to some given level L, called the geohash length. During an even-numbered split, each cell is split into 8 vertical slices and 4 horizontal slices. During an odd-numbered split, each cell is split into 4 vertical slices and 8 horizontal slices. In both cases, for each of the 32 sub-cells, its geohash is formed by using the parent-cell geohash as a prefix and appending a char out of G. The extra char for each sub-cell is chosen the same way as shown in both the earlier figures for even and odd splits respectively.

Oracle NoSQL Database uses a geohash length of 10. Cells at level 10 have an area of about 1 square meter. When indexing a point, the level-10 cell that contains the point is computed and the geohash of that cell is placed in the index entry. So, for points, a single index entry is generated for each point, and a geometry index on a POINT field behaves like a simple (non-multikey) index, unless the POINT field itself is inside an array or map that is being indexed. Notice that all points inside the same level-10 cell will have the same geohash.

With the geohashing algorithm described above, points that are close to each other will usually (but not always) be close together in the geometry index as well, i.e., have long common prefixes. So, searching for points using one of the functions described in the previous section translates to one or more range scans in the geometry index. These range scans may return false positives, so the search function itself must still be applied on the rows returned by index scans to eliminate the false positives.

When indexing a LineString or Polygon, the geometry’s minimum bounding box (MBR) is computed first, and then a set of cells is found that completely cover the MBR. The level of the covering cells depends on the size and shape and position of the MBR (usually it will be less than 10). Then, an index entry is created for each of the covering cells containing the geohash of that cell. So, a geometry index on a LineString or Polygon is always a multi-key index since multiple index entries will be created for a single row. For MultiPoints, MultiLineStrings, MultiPolygons, and GeometryCollections each of the constituent geometries is indexed separately and the index for such geometries it also a multi-key index.