oraclesai.classification

class GWRClassifier(spatial_weights_definition=None, bandwidth=None, fixed=True)

Geographically Weighted Regression (GWR) for binary classification. A logistic model is trained for each observation using the target and explanatory variables of the observations that fall within the specified bandwidth.

Parameters:
  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.

  • bandwidth – scalar, default=None. Bandwidth value, expressed as either a distance or a number of nearest neighbors. If provided, it overrides the parameter spatial_weights_definition, and the parameter fixed determines how the value is interpreted.

  • fixed – boolean, default=True. Only used when bandwidth is defined. If True, a DistanceBandWeightsDefinition is used as the spatial weights; otherwise, a KNNWeightsDefinition is used.
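
To illustrate the two ways a bandwidth can be interpreted, the following sketch selects the neighbors of one point in plain Python. The helpers distance_band_neighbors and knn_neighbors are hypothetical names used only for this example; they are not part of oraclesai, and the library's own weights definitions may differ in details such as tie handling or whether the distance bound is inclusive.

```python
import math

def distance_band_neighbors(points, i, threshold):
    """Indices of points within `threshold` of points[i] (distance bandwidth)."""
    return [j for j, p in enumerate(points)
            if j != i and math.dist(points[i], p) <= threshold]

def knn_neighbors(points, i, k):
    """Indices of the k points nearest to points[i] (nearest-neighbor bandwidth)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

# Four illustrative points; neighbors are computed for the first one.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
band = distance_band_neighbors(points, 0, threshold=2.5)
knn = knn_neighbors(points, 0, k=1)
```

With a distance bandwidth, the number of neighbors varies with the local density of the data; with a nearest-neighbor bandwidth, the count is constant and the effective radius varies instead.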

property betas
Returns:

A 2D-array with the estimated parameters (n x k) for the binomial GWR model

fit(X, y, geometries=None, crs=None)

Searches for the bandwidth and fits a binomial GWR model with the given data. A logistic classifier is executed at every observation using the neighbors' data. If the parameter spatial_weights_definition is defined, the bandwidth is retrieved from it; otherwise, the bandwidth is estimated based on the geometries.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).

Returns:

self. Fitted estimator.

property k
Returns:

The number of variables for which coefficients are estimated (including the constant)

property model_type
Returns:

The type of the classification model

predict(X, geometries=None)

Evaluates the binomial GWR model with the given data. If no model is defined, it returns None. A logistic model is built for each observation of the prediction set using neighboring observations from the training data; those models are then used to estimate the target variable.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • geometries – shapely array, default=None. Geometry data for each sample in X.

Returns:

A 1D numpy array with the prediction for each element of the prediction set.

property predy
Returns:

An array with the predictions for the training data

score(X, y, sample_weight=None, geometries=None)

Returns the accuracy of the model.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame

  • sample_weight – Weighted contribution to the score for each sample

  • geometries – shapely array, default=None. Geometry data for each sample in X.

Returns:

The accuracy of the model.
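
As a rough illustration of how sample_weight can affect an accuracy score, the sketch below computes a weighted accuracy in plain Python. It assumes the weights follow the convention used by scikit-learn's accuracy_score (weighted hits divided by total weight); the source does not confirm that this class uses that exact convention, and weighted_accuracy is a hypothetical helper.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, with each sample scaled by its weight."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # Sum the weights of the samples whose prediction matches the label.
    hits = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return hits / sum(sample_weight)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

plain = weighted_accuracy(y_true, y_pred)
# Doubling the weight of the misclassified sample lowers the score.
weighted = weighted_accuracy(y_true, y_pred, [1.0, 1.0, 2.0, 1.0])
```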

property summary
Returns:

A summary of the trained model

property u
Returns:

An array with the residuals of the trained binomial GWR model

class GeographicalClassifier(global_model=None, model_cls=None, spatial_weights_definition=None, bandwidth=None, fixed=True, local_weight=0.25, **kwargs)

Geographical classification algorithm. It uses a global model and multiple local models to perform classification.

Parameters:
  • global_model – A scikit-learn estimator instance, default=None. A trained model used as global model. Local models will be of the same type as this model. Required when model_cls is None.

  • model_cls – Class of scikit-learn estimator, default=None. Type of the global model and local models. When model_cls is provided (instead of global_model), a global model will be trained. Required when global_model=None. The creation parameters for model_cls are specified as kwargs.

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. These criteria are used to group data into neighborhoods and train local models.

  • bandwidth – int or float, default=None. Distance (fixed=True) or number of nearest neighbors (fixed=False). bandwidth + fixed is another way to set the spatial relationship specification. It is ignored if spatial_weights_definition was set.

  • fixed – bool, default=True. True if bandwidth represents a distance. False for number of nearest neighbors.

  • local_weight – float (0.0 to 1.0), default=0.25. Weight given to the local models' predictions.

  • kwargs – Additional parameters for the inner models created with parameter model_cls.

fit(X, y, geometries=None, crs=None, spatial_weights=None, spatial_weights_definition=None, column_map=None, fit_global_model=True, n_jobs=1, backend=None, batch_size=None)

Trains a geographical classification model. Internally, a global model (if fit_global_model=True) and several local models are trained. A local model is created for each neighborhood. A neighborhood is a spatial region containing multiple samples from X that are spatially related. Neighborhoods are built using the spatial relationship specified at model’s creation time (spatial_weights_definition, bandwidth) or using the spatial weights matrix object passed as parameter for training.

Parameters:
  • X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Training data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y, the parameter y must specify the name of that column.

  • y – A 1d array or string. Target values. If X contains a column with the target values this parameter will specify the name of that column instead.

  • geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column, or it can be None (in case X has a column called ‘geometry’).

  • crs – pyproj.crs.CRS or string, default=None. Spatial reference system of geometries. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. X is a SpatialDataFrame or GeoDataFrame).

  • spatial_weights – SpatialWeightsDefinition or pysal weights object, default=None. A pre-computed spatial weights matrix for the training data. If not None, any spatial relationship specification provided at the model’s creation time will be ignored.

  • fit_global_model – bool, default=True. If False, the global model will not be trained.

  • n_jobs – int, default=1. Number of processor cores used to parallelize local models training. Set -1 to use all the available cores.

  • backend – string, default=None. The Joblib backend to use when n_jobs != 1. If None, Joblib’s default backend will be used (typically loky).

  • batch_size – ‘auto’ or int, default=’auto’. Number of batch tasks per parallel job.

Returns:

self. Fitted estimator.

predict(X, y=None, geometries=None, crs=None)

Predict the target class for X using the global model and the local models that are closer to geometries. The returned predicted class is the class with highest probability resulting when calling predict_proba().

Parameters:
  • X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a SpatialDataFrame or DataFrame used for training or testing), the parameter y must specify the name of that column, so it can be excluded.

  • y – A string or None, default=None. If X contains a column with the target values, this parameter specifies the name of that column so it can be excluded from the prediction; otherwise, this parameter is not used.

  • geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column or it can be omitted (in case X has a column called ‘geometry’).

Returns:

An array of shape n_samples containing the value of the predicted class.

predict_proba(X, y=None, geometries=None, crs=None)

Predict the probability of each class for X using the global model and the local models that are closer to geometries. The returned probabilities are calculated as follows: local_model_probabilities * local_weight + global_model_probabilities * (1.0 - local_weight)

Parameters:
  • X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a SpatialDataFrame or DataFrame used for training or testing), the parameter y must specify the name of that column so it can be excluded.

  • y – A string or None, default=None. If X contains a column with the target values, this parameter specifies the name of that column so it can be excluded from the prediction; otherwise, this parameter is not used.

  • geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column, or it can be omitted (in case X has a column called ‘geometry’).

Returns:

An array of shape n_samples containing a tuple with the probabilities for each class.
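
The weighting formula above, together with the highest-probability rule that predict() applies to the result, can be sketched in plain Python as follows. The functions blend_probabilities and argmax_class and the sample probabilities are illustrative assumptions, not part of oraclesai; how the library chooses which local models contribute to each sample is not shown here.

```python
def blend_probabilities(local_probs, global_probs, local_weight=0.25):
    """Combine per-class probabilities from local and global models."""
    return [
        [lp * local_weight + gp * (1.0 - local_weight)
         for lp, gp in zip(local_row, global_row)]
        for local_row, global_row in zip(local_probs, global_probs)
    ]

def argmax_class(prob_rows, classes):
    """Pick, for each sample, the class with the highest blended probability."""
    return [classes[max(range(len(row)), key=row.__getitem__)]
            for row in prob_rows]

# Two samples, two classes; the rows are made-up probabilities.
local_probs = [[0.9, 0.1], [0.2, 0.8]]
global_probs = [[0.4, 0.6], [0.4, 0.6]]

blended = blend_probabilities(local_probs, global_probs, local_weight=0.25)
labels = argmax_class(blended, classes=["a", "b"])
```

In the first sample, the local model disagrees with the global model strongly enough to flip the predicted class even with local_weight=0.25; a smaller local_weight makes such flips less likely.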

score(X, y, geometries=None, crs=None)

Compute the F1 score on the given test data and labels.

Parameters:
  • X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a SpatialDataFrame or DataFrame used for training or testing), the parameter y must specify the name of that column so it can be excluded.

  • y – A string or None, default=None. If X contains a column with the target values, this parameter specifies the name of that column so it can be excluded from the prediction; otherwise, this parameter is not used.

  • geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column or it can be omitted (in case X has a column called ‘geometry’).

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system.

Returns:

The mean accuracy of self.predict(X).

class SLXClassifier(spatial_weights_definition=None, random_state=None, balance_method=None, balance_ratio=1.0)

Implementation of the SLX Logistic Regression model. Executes Logistic Regression involving a feature engineering step to add features that provide a spatial context to the data. The algorithm adds one or more columns with the spatial lag of certain features, representing the average from neighboring observations.
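
The feature engineering step can be pictured with a small plain-Python sketch. It assumes row-standardized weights, so the lag of a feature is the simple mean over each observation's neighbors, and it uses a hand-written neighbor dictionary in place of a real spatial weights object. The function spatial_lag and the variable names are hypothetical.

```python
def spatial_lag(values, neighbors):
    """Mean of the neighbors' values for each observation."""
    lag = []
    for i in range(len(values)):
        nbrs = neighbors.get(i, [])
        # Observations without neighbors get a lag of 0.0 in this sketch.
        lag.append(sum(values[j] for j in nbrs) / len(nbrs) if nbrs else 0.0)
    return lag

# One feature for four observations arranged in a chain: 0 - 1 - 2 - 3.
income = [10.0, 20.0, 30.0, 40.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

income_lag = spatial_lag(income, neighbors)

# The lagged column is appended to the original feature before fitting.
X_extended = [[x, lag] for x, lag in zip(income, income_lag)]
```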

Parameters:
  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Establishes the interaction between neighboring observations.

  • random_state – RandomState instance or None, default=None. Determines random number generation.

  • balance_method – {None, ‘random’, ‘smote’}. The method chosen to balance the dataset. ‘random’ creates duplicates from random samples (with replacement) from the minority class. ‘smote’ selects a random sample from the minority class, A, and from its k nearest neighbors, it selects a random neighbor, B. The vector AB is multiplied by a random number in the range [0, 1], and the result is added to A, generating a new synthetic instance.

  • balance_ratio – float, default=1.0. A number between 0 and 1 representing the desired ratio of observations from minority classes during the balancing process. A value of 1 results in the same number of observations for both classes.
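
The ‘smote’ interpolation described for balance_method can be sketched in plain Python as below. This is a minimal illustration of the A + r * (B - A) step only; smote_sample is a hypothetical helper, the choice of k=2 is arbitrary, and the library's implementation may differ (for example in how neighbors are searched or how many synthetic samples are generated).

```python
import math
import random

def smote_sample(minority, k, rng):
    """Generate one synthetic point between a minority sample and a neighbor."""
    a = rng.choice(minority)
    # Rank the remaining minority samples by distance to A.
    others = sorted((p for p in minority if p is not a),
                    key=lambda p: math.dist(a, p))
    b = rng.choice(others[:k])   # random neighbor among the k nearest
    gap = rng.random()           # random factor in [0, 1)
    # Move from A toward B by the random factor.
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)  # seeded for repeatability
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
synthetic = smote_sample(minority, k=2, rng=rng)
```

Because the new point lies on the segment between A and B, it always stays inside the region already spanned by the minority class.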

property betas
Returns:

An array with the estimated parameters for the trained model. For multi_class, a (m x k) matrix is returned, representing the parameters for each fitted model, where m is the number of classes in the target variable.

fit(X, y, geometries=None, crs=None, spatial_weights=None, column_ids=None)

Trains the SLX logistic regression model. Gets the column indexes from column_ids and adds the spatial lag of those columns to the training data. Finally, it fits a logistic regression classifier with the extended data. For the multi-class scenario, a model is fitted for each class, following a one-vs-rest strategy.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • column_ids – List of strings or list of integers, default=None. A list of column names or column indexes, indicating the columns that will be used to compute the spatial lag.

Returns:

self. Fitted estimator

property k
Returns:

The number of variables for which coefficients are estimated (including the constant)

property model_type
Returns:

The type of the classification model

predict(X, geometries=None, spatial_weights=None, use_fit_lag=False)

Calculates the spatial lag of the dataset using the same columns defined in the fit process and returns the category with the highest probability according to Logistic Regression.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • use_fit_lag – boolean, default=False. If False, the spatial lag is calculated from the prediction data; otherwise, it is calculated from the training data.

Returns:

The predicted class for each element of the prediction set.

property predy
Returns:

An array with the predictions for the training data. For multi_class, the prediction represents the class with the highest probability.

score(X, y, sample_weight=None, geometries=None, use_fit_lag=False)

Returns the accuracy of the model.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables

  • y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame

  • sample_weight – Weighted contribution to the score for each sample

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • use_fit_lag – boolean, default=False. If False, the spatial lag is calculated from the prediction data; otherwise, it is calculated from the training data.

Returns:

The accuracy of the model.

property summary
Returns:

A string containing statistics and estimated parameters of the fitted models.

property u
Returns:

An array with the residuals of the trained model