oraclesai.preprocessing

class CategoricalLagTransformer(spatial_weights_definition=None)

The categorical lag is used for categorical variables and represents the most common value in the neighborhood. For example, given a feature representing the property type, the categorical lag is the most common property type in the surroundings.

Parameters:

spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.

fit(X, y=None, geometries=None, spatial_weights=None)

Calculates the spatial weights of the training data using the algorithm associated with the parameter spatial_weights_definition and the geometry column. It stores the training data and geometries.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

Returns:

self. Fitted estimator.

transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)

Returns the most common value among each location’s neighbors. The parameter use_fit_lag determines whether the neighbors come from the training set or from the data passed into the transform method. The output is a NumPy array.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

  • use_fit_lag – Boolean, default=False. If True, it calculates the spatial lag from the training data; otherwise, it uses the given data to obtain the spatial lag.

Returns:

The categorical lag of the given data.
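The idea behind the categorical lag can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function categorical_lag and the neighbors dictionary are hypothetical stand-ins for the transformer and for a spatial weights structure.

```python
from collections import Counter


def categorical_lag(values, neighbors):
    """Return, for each observation, the most common value among its neighbors.

    values: list with one categorical value per observation.
    neighbors: dict mapping an observation index to a list of neighbor
        indices (hypothetical stand-in for spatial weights).
    """
    lag = []
    for i in range(len(values)):
        neighbor_values = [values[j] for j in neighbors.get(i, [])]
        if not neighbor_values:
            lag.append(None)  # no neighbors, so there is nothing to summarize
            continue
        # most_common(1) returns the value with the highest count;
        # ties are resolved in favor of the value seen first.
        lag.append(Counter(neighbor_values).most_common(1)[0][0])
    return lag


property_type = ["house", "apartment", "house", "apartment", "house"]
neighbors = {0: [1, 2, 4], 1: [0, 2, 4], 2: [0, 1, 4], 3: [0, 2, 4], 4: [1, 3]}

lag = categorical_lag(property_type, neighbors)
print(lag)  # ['house', 'house', 'house', 'house', 'apartment']
```

Each output entry depends only on the neighbors of the observation, never on the observation's own value.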

class SCoordTransformer(crs=None)

Transformer that returns the centroid of the geometry of each observation.

fit(X, y=None, geometries=None)

Not implemented since no calculations are required for training.

transform(X, y=None, geometries=None)

Returns the XY coordinates of the geometries; for non-point geometries, it returns the coordinates of their centroids.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

Returns:

The transformed data.
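The behavior described for transform (points pass through, other geometries are reduced to their centroids) can be illustrated without any geometry library. The sketch below is an assumption-laden example: the ("point", ...) and ("polygon", ...) tuples and both function names are hypothetical, and the centroid uses the standard shoelace formula for a simple polygon.

```python
def polygon_centroid(ring):
    """Centroid of a simple polygon given as a list of (x, y) vertices."""
    area2 = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(ring)
    for k in range(n):
        x0, y0 = ring[k]
        x1, y1 = ring[(k + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) and area = area2 / 2, hence 3 * area2.
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def xy_coordinates(geometries):
    """Return one (x, y) pair per geometry (hypothetical representation)."""
    coords = []
    for kind, data in geometries:
        if kind == "point":
            coords.append((float(data[0]), float(data[1])))
        else:
            coords.append(polygon_centroid(data))
    return coords


geometries = [
    ("point", (3, 4)),
    ("polygon", [(0, 0), (2, 0), (2, 2), (0, 2)]),
]
coords = xy_coordinates(geometries)
print(coords)  # [(3.0, 4.0), (1.0, 1.0)]
```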

class SpatialImputer(missing_values=nan, spatial_weights_definition=None, strategy='mean')

Fills the missing values of each observation using the values from its neighbors.

Parameters:
  • missing_values – int, float, str, np.nan, None or pandas.NA, default=np.nan. The placeholder for the missing values. All occurrences of missing_values will be imputed

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification

  • strategy – {“mean”, “median”, “maximum”, “minimum”}, default=“mean”. It calculates the specified statistic from the neighbors to fill the missing value.

fit(X, y=None, geometries=None, spatial_weights=None)

Calculates the spatial weights according to spatial_weights_definition. If the spatial weights cannot be computed, it falls back to a SimpleImputer from scikit-learn. It stores the training data and geometries.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

Returns:

self. Fitted estimator.

property mask_

A boolean array with True in the cells with missing values and False everywhere else.

transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)

Returns a NumPy array containing the given data with its missing values filled according to the specified strategy. The parameter use_fit_lag determines whether the neighbors are taken from the training set or from the given data.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

  • use_fit_lag – Boolean, default=False. If True, it imputes using the neighbors from the training data; otherwise, it uses the neighbors from the given data.

Returns:

The transformed data.
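Neighbor-based imputation and the boolean mask can be sketched for a single numeric column as follows. This is a simplified illustration under stated assumptions, not the library's code: spatial_impute and the neighbors dictionary are hypothetical, missing values are represented by NaN, and an observation whose neighbors are all missing is left unchanged instead of falling back to a SimpleImputer.

```python
import math
from statistics import mean, median

# Strategy names follow the documented options.
STRATEGIES = {"mean": mean, "median": median, "maximum": max, "minimum": min}


def spatial_impute(column, neighbors, strategy="mean"):
    """Fill NaN entries of column with a statistic of the neighbors' values.

    Returns the filled column and a mask that is True where a value
    was missing in the input.
    """
    statistic = STRATEGIES[strategy]
    mask = [math.isnan(v) for v in column]
    filled = list(column)
    for i, missing in enumerate(mask):
        if not missing:
            continue
        # Only neighbors with an observed value contribute to the statistic.
        observed = [column[j] for j in neighbors.get(i, []) if not mask[j]]
        if observed:
            filled[i] = statistic(observed)
    return filled, mask


prices = [100.0, float("nan"), 300.0, 200.0]
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 2]}

filled, mask = spatial_impute(prices, neighbors, strategy="mean")
print(filled)  # [100.0, 200.0, 300.0, 200.0]
print(mask)    # [False, True, False, False]
```

Switching strategy to "maximum" would fill the same cell with 300.0, the largest observed neighbor value.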

class SpatialLagTransformer(spatial_weights_definition=None, strategy='mean')

The spatial lag of a particular feature reflects the average value of that feature in the neighborhood around each observation. For example, given a neighborhood, the spatial lag of the house price for a specific house is the average house price in its surroundings.

Parameters:
  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification

  • strategy – {“mean”, “median”}, default=“mean”. For “median”, it calculates the median from the neighbors. For “mean”, it calculates the average from the neighbors.

fit(X, y=None, geometries=None, spatial_weights=None)

Computes the spatial weights according to the parameter spatial_weights_definition. It stores the training data and the geometries.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

Returns:

self. Fitted estimator.

transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)

Replaces the values of the given data with their spatial lag. If use_fit_lag=True, it calculates the spatial lag from the training set; otherwise, it computes the spatial lag from the data passed into the transform method. The function returns a NumPy array.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – Ignored. Not used, present for API consistency by convention.

  • geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as a string, X is expected to be a DataFrame.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

  • use_fit_lag – Boolean, default=False. If True, it calculates the spatial lag from the training data; otherwise, it uses the given data to obtain the spatial lag.

Returns:

The transformed data.
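The difference between computing the lag from the training set and from the data passed to transform can be illustrated with a small sketch. Everything here is an assumption made for the example: the functions k_nearest and spatial_lag are hypothetical, neighbors are defined as the k nearest points by Euclidean distance, and the train argument plays the role that use_fit_lag=True is described as playing.

```python
import math
from statistics import mean, median


def k_nearest(point, candidates, k, skip=None):
    """Indices of the k candidates closest to point (skip excludes one index)."""
    order = sorted(
        (i for i in range(len(candidates)) if i != skip),
        key=lambda i: math.dist(point, candidates[i]),
    )
    return order[:k]


def spatial_lag(values, coords, k=2, strategy="mean", train=None):
    """Spatial lag of values observed at coords.

    train: optional (train_values, train_coords) pair. When given, the
    neighbors come from the training set; otherwise they come from the
    data itself, excluding the observation being processed.
    """
    statistic = {"mean": mean, "median": median}[strategy]
    lag = []
    for i, point in enumerate(coords):
        if train is not None:
            ref_values, ref_coords = train
            idx = k_nearest(point, ref_coords, k)
        else:
            ref_values = values
            idx = k_nearest(point, coords, k, skip=i)
        lag.append(statistic([ref_values[j] for j in idx]))
    return lag


train_values = [100.0, 200.0, 300.0]
train_coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
new_values = [10.0, 20.0, 60.0]
new_coords = [(0.4, 0.0), (1.6, 0.0), (5.0, 0.0)]

lag_from_data = spatial_lag(new_values, new_coords)
lag_from_train = spatial_lag(new_values, new_coords,
                             train=(train_values, train_coords))
print(lag_from_data)   # [40.0, 35.0, 15.0]
print(lag_from_train)  # [150.0, 250.0, 250.0]
```

The two results differ because the same locations are summarized against different reference sets.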

spatial_train_test_split(X, y=None, geometries=None, test_size=0.3, numpy_result=False, random_state=None) → Tuple

Splits data into train and test subsets. Each subset consists of the explanatory variables X, the geometries, and the target variable y, where X is a multidimensional array of shape (n_samples, n_features), while geometries and y are one-dimensional arrays of length n_samples.

Parameters:
  • X – An oraclesai.SpatialDataFrame, geopandas.GeoDataFrame, pandas.DataFrame or numpy array. When X is a SpatialDataFrame or a DataFrame, it can also contain the columns for geometries and y.

  • y – The name of the target variable column in X or a 1-d numpy array

  • geometries – The name of the spatial column in X or a 1-d numpy array of shapely geometries

  • test_size – default=0.3. Proportion of the data assigned to the test set. A value from 0 to 1.

  • numpy_result – default=False. If True, the returned values will always be NumPy arrays. If False, the returned types will match the types of the input data.

  • random_state – default=None. The seed used by the random number generator.

Returns:

A tuple containing X_train, X_test, y_train, y_test, geometries_train, geometries_test.
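The essential property of such a split is that X, y and the geometries stay aligned row by row. The following plain-Python sketch shows one way to achieve that. It is illustrative only: aligned_split is a hypothetical name, the inputs are plain lists, and rounding the test-set size with round is an assumption rather than documented behavior.

```python
import random


def aligned_split(X, y, geometries, test_size=0.3, random_state=None):
    """Split three aligned sequences into train and test parts."""
    n = len(X)
    indices = list(range(n))
    # A seeded generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(indices)
    n_test = round(n * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    # Same order as the documented return value.
    return (
        take(X, train_idx), take(X, test_idx),
        take(y, train_idx), take(y, test_idx),
        take(geometries, train_idx), take(geometries, test_idx),
    )


X = [[i, 2 * i] for i in range(10)]
y = [10 * i for i in range(10)]
geometries = [(float(i), float(i)) for i in range(10)]  # toy point coordinates

X_train, X_test, y_train, y_test, g_train, g_test = aligned_split(
    X, y, geometries, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```

Because a single shuffled index list drives all three selections, every row of X_train still matches its entry in y_train and g_train.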