About Spatial Pipeline

The spatial pipeline extends the existing scikit-learn pipeline to include spatial information such as geometry data and spatial weights.

The SpatialPipeline class can easily chain together both spatial and non-spatial steps, and is composed of estimators. An estimator can be one of the following:

  • Transformer: An estimator with the fit and transform methods that are described in the following table.
    Method          Description
    fit             Computes statistics and other properties from the training data.
    transform       Applies the values calculated by the fit method to change the data.
    fit_transform   Calls the fit and transform methods sequentially with the training data.

    One typical example of a transformer is the StandardScaler, which standardizes the data so that each feature has zero mean and unit variance. Usually, transformers are part of the pre-processing step in a pipeline.

  • Classifier/Regressor: An estimator that performs a classification or regression task and must be the last step in a pipeline. The methods available on a pipeline correspond to those of its final step, so a pipeline that ends in a classifier or regressor exposes the fit, predict, and score methods, along with any other methods of that estimator. Usually, the data passes through multiple transformers before reaching this estimator.

  • Composite Estimator: An estimator that combines multiple estimators and can itself be chained with other estimators. For example, a pre-processing pipeline that applies several transformations to the data can become one step of another pipeline that performs a regression task. There are three composite estimators:
    Estimator                  Description
    SpatialPipeline            A pipeline that includes spatial information.
    SpatialFeatureUnion        Concatenates the resulting columns (features) from different estimators to create a single input while sharing spatial information.
    SpatialColumnTransformer   Selects a subset of columns (features) from the input and passes these columns to an estimator while sharing spatial information.
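
The estimator roles described above can be illustrated with plain scikit-learn classes, which the spatial classes are described as extending. This is a minimal sketch that uses only standard scikit-learn (Pipeline, StandardScaler, LinearRegression) and made-up toy data. It does not use the spatial classes, whose constructor arguments are not shown in this section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: 6 samples, 2 features.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0],
              [4.0, 1.0], [5.0, 9.0], [6.0, 2.0]])
y = 2.0 * X[:, 0] - X[:, 1] + 1.0

# Transformer: fit learns the per-feature mean and standard deviation,
# transform applies them. fit_transform does both in one call.
scaler = StandardScaler()
X_two_step = scaler.fit(X).transform(X)
X_one_step = StandardScaler().fit_transform(X)

# Composite estimator: a pre-processing pipeline used as one step of
# another pipeline whose final step is a regressor.
preprocessing = Pipeline([("scale", StandardScaler())])
model = Pipeline([
    ("preprocess", preprocessing),
    ("regress", LinearRegression()),
])

# The outer pipeline exposes the methods of its final step.
model.fit(X, y)
predictions = model.predict(X)
r2 = model.score(X, y)
```

Because the last step is a regressor, the outer pipeline offers fit, predict, and score. If the last step were a transformer instead, the pipeline would offer transform.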

A spatial pipeline takes the same input as a regular scikit-learn pipeline, plus the spatial information required by its spatial processes (spatial transformers and spatial models or predictors). This additional spatial information falls into two categories:

  • Data location/geometries: The geometries associated with the samples in the input data, X, form a vector with one geometry per sample. This vector can be embedded in X if X is either a geopandas GeoDataFrame or a SpatialDataFrame. Otherwise, it can be passed separately in the geometries parameter.
  • Spatial parameters: These are additional parameters used to provide context about geometries (CRS), describe/quantify spatial relationships (spatial weights definition, spatial weights objects), or help perform faster spatial searches (spatial index).

The following figure shows the data flow in a spatial pipeline.



As seen in the preceding figure, the spatial pipeline receives the input data, which comprises X, y, and (optionally) spatial parameters. The input X can be split into X' (non-spatial data) and geometries. The spatial parameters and the geometries are then extracted and passed to all the spatial steps in the pipeline.
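
The data flow described above can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation: the names split_input, run_steps, CenterValues, AddCoordinates, and the is_spatial flag are all hypothetical, and the geometries are simplified to (x, y) tuples.

```python
def split_input(rows, geometry_key="geometry"):
    """Split X into X' (non-spatial columns) and a vector of geometries."""
    x_prime = [{k: v for k, v in row.items() if k != geometry_key}
               for row in rows]
    geometries = [row[geometry_key] for row in rows]
    return x_prime, geometries


class CenterValues:
    """Hypothetical non-spatial step: subtracts the training mean."""
    is_spatial = False

    def fit(self, x_prime, y=None):
        values = [row["value"] for row in x_prime]
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, x_prime):
        return [{**row, "value": row["value"] - self.mean_}
                for row in x_prime]


class AddCoordinates:
    """Hypothetical spatial step: appends point coordinates as features."""
    is_spatial = True

    def fit(self, x_prime, y=None, geometries=None, crs=None):
        self.crs_ = crs  # spatial parameter received from the pipeline
        return self

    def transform(self, x_prime, geometries=None):
        return [{**row, "x": gx, "y": gy}
                for row, (gx, gy) in zip(x_prime, geometries)]


def run_steps(steps, rows, y=None, crs=None):
    """Route geometries and spatial parameters to spatial steps only."""
    x_prime, geometries = split_input(rows)
    for step in steps:
        if step.is_spatial:
            step.fit(x_prime, y, geometries=geometries, crs=crs)
            x_prime = step.transform(x_prime, geometries=geometries)
        else:
            x_prime = step.fit(x_prime, y).transform(x_prime)
    return x_prime


rows = [
    {"value": 1.0, "geometry": (10.0, 20.0)},
    {"value": 3.0, "geometry": (11.0, 21.0)},
]
steps = [CenterValues(), AddCoordinates()]
result = run_steps(steps, rows, crs="EPSG:4326")
```

In this sketch the non-spatial step never sees the geometries or the CRS, while the spatial step receives both, which mirrors the routing the figure describes.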