oraclesai.preprocessing
- class CategoricalLagTransformer(spatial_weights_definition=None)
The categorical lag applies to categorical variables and is the most common value in the neighborhood of each observation. For example, for a feature representing the property type, the categorical lag is the most common property type in the surroundings.
- Parameters:
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.
- fit(X, y=None, geometries=None, spatial_weights=None)
Calculates the spatial weights of the training data using the algorithm associated with the parameter spatial_weights_definition and the geometry column. It stores the training data and geometries.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
- Returns:
self. Fitted estimator.
- transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)
Returns the most common value among each location’s neighbors. Depending on the parameter use_fit_lag, the neighbors come from the training set or from the data passed into the transform method. The output is a NumPy array.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
use_fit_lag – Boolean, default=False. If True, it calculates the spatial lag from the training data; otherwise, it uses the given data to obtain the spatial lag.
- Returns:
The categorical lag of the given data.
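The idea behind the categorical lag can be sketched in plain Python. This is a minimal illustration and not the oraclesai implementation: the function `categorical_lag` and the `neighbors` adjacency mapping are hypothetical stand-ins for the transformer and for the spatial weights.

```python
from collections import Counter

def categorical_lag(values, neighbors):
    """Most common category among each observation's neighbors.

    values    -- list of category labels, one per observation
    neighbors -- hypothetical adjacency mapping: index -> list of neighbor indices
    """
    lag = []
    for i in range(len(values)):
        labels = [values[j] for j in neighbors.get(i, [])]
        if not labels:
            lag.append(None)  # no neighbors: the lag is left undefined in this sketch
            continue
        # most_common(1) yields the (label, count) pair with the highest count
        lag.append(Counter(labels).most_common(1)[0][0])
    return lag

property_type = ["flat", "house", "house", "flat", "flat"]
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}
lag = categorical_lag(property_type, neighbors)
```

Here the first observation is a flat whose two neighbors are houses, so its categorical lag is "house" even though its own value differs.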
- class SCoordTransformer(crs=None)
Transformer that returns the centroid of the geometries for each observation
- fit(X, y=None, geometries=None)
Not implemented since no calculations are required for training.
- transform(X, y=None, geometries=None)
Returns the XY coordinates of the geometries; in the case of non-point geometries, it returns the geometries’ centroids.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
- Returns:
The transformed data.
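As a rough illustration of what "XY coordinates or centroids" means, the sketch below handles points and simple polygons without any geometry library. The geometry representation (a tuple for a point, a list of vertices for a polygon) and the function names are assumptions made for this example; the area-weighted centroid formula is the standard one for simple polygons, and the transformer itself may compute centroids differently.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as a list of (x, y) vertices."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))

def xy_coordinates(geometries):
    """Return one (x, y) pair per geometry."""
    coords = []
    for geom in geometries:
        if isinstance(geom, tuple):  # a point: use its coordinates directly
            coords.append((float(geom[0]), float(geom[1])))
        else:                        # a polygon ring: use its centroid
            coords.append(polygon_centroid(geom))
    return coords

geoms = [(2.0, 3.0), [(0, 0), (4, 0), (4, 2), (0, 2)]]
coords = xy_coordinates(geoms)
```

The point passes through unchanged, while the 4 by 2 rectangle is reduced to its center.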
- class SpatialImputer(missing_values=nan, spatial_weights_definition=None, strategy='mean')
Fill all the missing values using the values from the neighbors for each observation.
- Parameters:
missing_values – int, float, str, np.nan, None or pandas.NA, default=np.nan. The placeholder for the missing values. All occurrences of missing_values will be imputed
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification
strategy – {“mean”, “median”, “maximum”, “minimum”}, default=“mean”. It calculates the specified statistic from the neighbors to fill the missing value.
- fit(X, y=None, geometries=None, spatial_weights=None)
Calculates the spatial weights according to spatial_weights_definition. If the spatial weights cannot be computed, it uses a SimpleImputer from scikit-learn. It stores the training data and geometries.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
- Returns:
self. Fitted estimator.
- property mask_
A boolean array with True in the cells with missing values, and False everywhere else.
- transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)
Returns a NumPy array with the given data filled according to the specified strategy. The parameter use_fit_lag determines whether to use the neighbors from the training set.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
use_fit_lag – Boolean, default=False. If True, it executes imputation from the training data; otherwise, it uses the given data.
- Returns:
The transformed data.
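Neighbor-based imputation can be sketched for a single numeric feature as follows. This is not the library's code: `spatial_impute`, the `neighbors` mapping, and the fallback to a global statistic when no neighbor has a value are all assumptions of this example (the class itself is documented to fall back to scikit-learn's SimpleImputer when spatial weights cannot be computed).

```python
import math
from statistics import mean, median

# Strategy names follow the documented options.
STRATEGIES = {"mean": mean, "median": median, "maximum": max, "minimum": min}

def spatial_impute(values, neighbors, strategy="mean"):
    """Fill NaN entries with a statistic of the observed neighbor values."""
    stat = STRATEGIES[strategy]
    observed = [v for v in values if not math.isnan(v)]
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        near = [values[j] for j in neighbors.get(i, []) if not math.isnan(values[j])]
        # Assumed fallback: use the global statistic when no neighbor has a value.
        filled[i] = stat(near) if near else stat(observed)
    return filled

prices = [100.0, float("nan"), 140.0, 120.0]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

mask = [math.isnan(v) for v in prices]        # analogous to the mask_ property
filled = spatial_impute(prices, neighbors)    # the gap is filled from neighbors 0 and 2
```

With strategy "mean", the missing price is replaced by the average of its two neighbors' prices; with "maximum", it would take the larger of the two.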
- class SpatialLagTransformer(spatial_weights_definition=None, strategy='mean')
The spatial lag of a particular feature reflects the average value of that feature in the neighborhood around each observation. For example, given a neighborhood, the spatial lag of the house price for a specific house is the average house price in its surroundings.
- Parameters:
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification
strategy – {“mean”, “median”}, default=“mean”. For “median”, it calculates the median from the neighbors. For “mean”, it calculates the average from the neighbors
- fit(X, y=None, geometries=None, spatial_weights=None)
Computes the spatial weights according to the parameter spatial_weights_definition. It stores the training data and the geometries.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
- Returns:
self. Fitted estimator.
- transform(X, y=None, geometries=None, spatial_weights=None, use_fit_lag=False)
Replaces the values of the given data with their spatial lag. If use_fit_lag=True, it calculates the spatial lag from the training set; otherwise, it computes the spatial lag from the data passed into the transform method. The function returns a NumPy array.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – Ignored. Not used, present for API consistency by convention.
geometries – shapely array or string, default=None. Geometry data for each sample in X. If specified as string, X is expected to be a DataFrame
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
use_fit_lag – Boolean, default=False. If True, it calculates the spatial lag from the training data; otherwise, it uses the given data to obtain the spatial lag.
- Returns:
The transformed data.
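The house-price example can be worked through with a small sketch. It assumes every neighbor counts equally and uses a hypothetical `neighbors` mapping in place of a spatial weights matrix; the actual transformer derives its weights from spatial_weights_definition and may weight neighbors differently.

```python
from statistics import mean, median

def spatial_lag(values, neighbors, strategy="mean"):
    """Replace each value with the mean or median of its neighbors' values."""
    stat = mean if strategy == "mean" else median
    return [stat([values[j] for j in neighbors[i]]) for i in range(len(values))]

house_price = [200.0, 220.0, 180.0, 400.0]
# Hypothetical neighborhood structure: index -> neighbor indices
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

lag_mean = spatial_lag(house_price, neighbors)
lag_median = spatial_lag(house_price, neighbors, strategy="median")
```

The second house has one expensive neighbor (400.0), which pulls its mean lag up to 260.0 while its median lag stays at 200.0. This is one reason to prefer the "median" strategy when neighborhoods contain outliers.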
- spatial_train_test_split(X, y=None, geometries=None, test_size=0.3, numpy_result=False, random_state=None) → Tuple
Splits data into train and test subsets. Each subset consists of the explanatory variables X, the geometries, and the target variable y, where X is a multidimensional array of shape (n_samples, n_features), while the geometries and y are one-dimensional arrays of length n_samples.
- Parameters:
X – An oraclesai.SpatialDataFrame, geopandas.GeoDataFrame, pandas.DataFrame or numpy array. When X is a SpatialDataFrame or a DataFrame, it can contain the columns for geometries and y too.
y – The name of the target variable column in X or a 1-d numpy array
geometries – The name of the spatial column in X or a 1-d numpy array of shapely geometries
test_size – (default=0.3) proportion of the test set. A value from 0 to 1
numpy_result – If True, the returned values will always be numpy arrays. If False, the returned types will match the types of the input data.
random_state – (default=None) the seed for the random number generator.
- Returns:
A tuple containing X_train, X_test, y_train, y_test, geometries_train, geometries_test.
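The shape of the result can be illustrated with a stand-alone sketch that keeps X, y and the geometries aligned while splitting. The helper `split_rows` is hypothetical and uses a plain seeded shuffle, which is an assumption; the library function may select the test rows differently. Only the order of the returned tuple is taken from the description above.

```python
import random

def split_rows(X, y, geometries, test_size=0.3, random_state=None):
    """Randomly split aligned rows of X, y and geometries (illustrative only)."""
    indices = list(range(len(X)))
    # A seeded generator makes the shuffle reproducible for a given random_state.
    random.Random(random_state).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same order as the documented return value.
    return (pick(X, train_idx), pick(X, test_idx),
            pick(y, train_idx), pick(y, test_idx),
            pick(geometries, train_idx), pick(geometries, test_idx))

X = [[i, 2 * i] for i in range(10)]
y = [10 * i for i in range(10)]
geometries = [(float(i), float(i)) for i in range(10)]

X_train, X_test, y_train, y_test, g_train, g_test = split_rows(
    X, y, geometries, test_size=0.3, random_state=42)
```

Because the same index lists are applied to all three inputs, each row of X_train still corresponds to the matching entries of y_train and g_train.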