oraclesai.classification
- class GWRClassifier(spatial_weights_definition=None, bandwidth=None, fixed=True)
Geographically Weighted Regression (GWR) for binary classification. A logistic model is trained for each observation using the target and explanatory variables of the observations that fall within a specified bandwidth.
- Parameters:
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.
bandwidth – scalar, default=None. Bandwidth value consisting of either a distance or K nearest neighbors. If the bandwidth is provided, it overrides the parameter spatial_weights_definition according to fixed.
fixed – boolean, default=True. Works only when bandwidth is defined. If True, it uses DistanceBandWeightsDefinition as spatial weights; otherwise, it uses KNNWeightsDefinition.
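The two interpretations of bandwidth can be illustrated with a minimal, library-independent sketch. The helper `neighbors` and the sample coordinates are hypothetical and are not part of oraclesai; the sketch only assumes that fixed=True selects every observation within a distance band and fixed=False selects the K nearest observations, as the parameter descriptions above suggest.

```python
import math

def neighbors(points, i, bandwidth, fixed=True):
    """Return the indices of the observations that would feed the local model at points[i]."""
    # Distance from observation i to every other observation.
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    if fixed:
        # Distance band: keep every observation within `bandwidth` units.
        return sorted(j for d, j in dists if d <= bandwidth)
    # K nearest neighbors: keep the `bandwidth` closest observations.
    return sorted(j for d, j in sorted(dists)[:int(bandwidth)])

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
band = neighbors(points, 0, bandwidth=2.5, fixed=True)   # observations 1 and 2
knn = neighbors(points, 0, bandwidth=1, fixed=False)     # observation 1 only
```

With a distance band, the number of neighbors varies with the local density of observations; with K nearest neighbors, the count is constant and the effective distance varies instead.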
- property betas
- Returns:
A 2D-array with the estimated parameters (n x k) for the binomial GWR model
- fit(X, y, geometries=None, crs=None)
Searches for the bandwidth and fits a binomial GWR model with the given data. A logistic classifier is executed at every observation using neighbors’ data. If the parameter spatial_weights_definition is defined, the bandwidth is retrieved from it; otherwise, the bandwidth is estimated based on the geometries.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame.
geometries – shapely array, default=None. Geometry data for each sample in X.
crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).
- Returns:
self. Fitted estimator.
- property k
- Returns:
The number of variables for which coefficients are estimated (including the constant)
- property model_type
- Returns:
The type of the classification model
- predict(X, geometries=None)
Evaluates the binomial GWR model with the given data. If no model is defined, returns None. A logistic model is built for each observation of the prediction set using neighboring observations from the training data; those models are then used to estimate the target variable.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
geometries – shapely array, default=None. Geometry data for each sample in X.
- Returns:
A 1D numpy array with the prediction for each element of the prediction set.
- property predy
- Returns:
An array with the predictions for the training data
- score(X, y, sample_weight=None, geometries=None)
Returns the accuracy of the model.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame
sample_weight – Weighted contribution to the score for each sample
geometries – shapely array, default=None. Geometry data for each sample in X.
- Returns:
The accuracy of the model.
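As a rough illustration of how sample_weight can change an accuracy score, the sketch below computes a weighted fraction of correct predictions. It assumes the same semantics as scikit-learn's accuracy_score with sample_weight; the source does not confirm that score() is computed this way, and `weighted_accuracy` is a hypothetical helper.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, each sample counted by its weight."""
    if sample_weight is None:
        # Without weights, every sample contributes equally.
        sample_weight = [1.0] * len(y_true)
    hits = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return hits / sum(sample_weight)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
plain = weighted_accuracy(y_true, y_pred)                    # 3 of 4 correct
weighted = weighted_accuracy(y_true, y_pred, [1, 1, 2, 1])   # the miss counts double
```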
- property summary
- Returns:
A summary of the trained model
- property u
- Returns:
An array with the residuals of the trained binomial GWR model
- class GeographicalClassifier(global_model=None, model_cls=None, spatial_weights_definition=None, bandwidth=None, fixed=True, local_weight=0.25, **kwargs)
Geographical classification algorithm. It uses a global model and multiple local models to perform classification.
- Parameters:
global_model – A scikit-learn estimator instance, default=None. A trained model used as global model. Local models will be of the same type as this model. Required when model_cls is None.
model_cls – Class of scikit-learn estimator, default=None. Type of the global model and local models. When model_cls is provided (instead of global_model), a global model will be trained. Required when global_model=None. model_cls creation parameters are specified as kwargs.
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. These criteria are used to group data into neighborhoods and train local models.
bandwidth – int or float, default=None. Distance (fixed=True) or number of nearest neighbors (fixed=False). bandwidth + fixed is another way to set the spatial relationship specification. It is ignored if spatial_weights_definition was set.
fixed – bool, default=True. True if bandwidth represents a distance. False for number of nearest neighbors.
local_weight – float (0.0 to 1.0), default=0.25. Weight associated to the local models predictions.
kwargs – Additional parameters for the inner models created with parameter model_cls.
- fit(X, y, geometries=None, crs=None, spatial_weights=None, spatial_weights_definition=None, column_map=None, fit_global_model=True, n_jobs=1, backend=None, batch_size=None)
Trains a geographical classification model. Internally, a global model (if fit_global_model=True) and several local models are trained. A local model is created for each neighborhood. A neighborhood is a spatial region containing multiple samples from X that are spatially related. Neighborhoods are built using the spatial relationship specified at model’s creation time (spatial_weights_definition, bandwidth) or using the spatial weights matrix object passed as parameter for training.
- Parameters:
X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Training data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y, the parameter y must specify the name of that column.
y – A 1d array or string. Target values. If X contains a column with the target values, this parameter specifies the name of that column instead.
geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column, or it can be None (in case X has a column called ‘geometry’).
crs – pyproj.crs.CRS or string, default=None. Spatial reference system of geometries. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. X is a SpatialDataFrame or GeoDataFrame).
spatial_weights – SpatialWeightsDefinition or pysal weights object, default=None. A pre-computed spatial weights matrix for the training data. If not None, any spatial relationship specification provided at the model’s creation time is ignored.
fit_global_model – bool, default=True. If False, the global model will not be trained.
n_jobs – int, default=1. Number of processor cores used to parallelize local models training. Set -1 to use all the available cores.
backend – string, default=None. The Joblib backend to use when n_jobs != 1. If None, Joblib’s default backend will be used (typically loky).
batch_size – ‘auto’ or int, default=‘auto’. Number of batch tasks per parallel job.
- Returns:
self. Fitted estimator.
- predict(X, y=None, geometries=None, crs=None)
Predict the target class for X using the global model and the local models that are closer to geometries. The returned predicted class is the class with highest probability resulting when calling predict_proba().
- Parameters:
X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a SpatialDataFrame or DataFrame used for training or testing), the parameter y must specify the name of that column, so it can be excluded.
y – A string or None, default=None. If X contains a column with the target values, this parameter will specify the name of that column so it can be excluded for the prediction; otherwise, this parameter is not used.
geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column or it can be omitted (in case X has a column called ‘geometry’).
- Returns:
An array of shape n_samples containing the value of the predicted class.
- predict_proba(X, y=None, geometries=None, crs=None)
Predict the probability of each class for X using the global model and the local models that are closer to geometries. The returned probabilities are calculated as follows: local_model_probabilities * local_weight + global_model_probabilities * (1.0 - local_weight)
- Parameters:
X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a SpatialDataFrame or DataFrame used for training or testing), the parameter y must specify the name of that column so it can be excluded.
y – A string or None, default=None. If X contains a column with the target values, this parameter will specify the name of that column so it can be excluded for the prediction; otherwise, this parameter is not used.
geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column, or it can be omitted (in case X has a column called ‘geometry’).
- Returns:
An array of shape n_samples containing a tuple with the probabilities for each class.
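The weighting formula given for predict_proba(), and the rule that predict() returns the class with the highest resulting probability, can be sketched in plain Python. The helper names, the probability values and the class labels below are illustrative assumptions; how the library selects and combines the local models for each sample is not shown here.

```python
def blend_probabilities(local_probs, global_probs, local_weight=0.25):
    """Apply local * local_weight + global * (1.0 - local_weight) per sample and class."""
    return [
        [l * local_weight + g * (1.0 - local_weight) for l, g in zip(lrow, grow)]
        for lrow, grow in zip(local_probs, global_probs)
    ]

def predict_classes(probs, classes):
    """Pick, for each sample, the class whose blended probability is highest."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in probs]

# Hypothetical per-class probabilities for two samples.
local_probs = [[0.9, 0.1], [0.2, 0.8]]
global_probs = [[0.4, 0.6], [0.3, 0.7]]

blended = blend_probabilities(local_probs, global_probs, local_weight=0.25)
labels = predict_classes(blended, classes=["no", "yes"])
```

In the first sample the local model overrides the global model's preference even at local_weight=0.25, because the local probability (0.9) is far more confident than the global one (0.6 for the other class).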
- score(X, y, geometries=None, crs=None)
Compute the F1 score on the given test data and labels.
- Parameters:
X – A SpatialDataFrame, DataFrame, GeoDataFrame or a 2d numpy array. Expected shape is (n_samples, n_features). Predicting data. For SpatialDataFrame or GeoDataFrame, the geometries can be found in X, as a column. If X contains the column y (e.g., a proxy or DataFrame used for training or testing), the parameter y must specify the name of that column so it can be excluded.
y – A string or None, default=None. If X contains a column with the target values, this parameter will specify the name of that column so it can be excluded for the prediction; otherwise, this parameter is not used.
geometries – A list of Shapely geometries, a string (column name) or None, default=None. The geometries associated to X. If X is a SpatialDataFrame or a GeoDataFrame and X contains the geometries as one of its columns, this parameter may contain the name of that column or it can be omitted (in case X has a column called ‘geometry’).
crs – pyproj.crs.CRS, default=None. Coordinate reference system.
- Returns:
The mean accuracy of self.predict(X).
- class SLXClassifier(spatial_weights_definition=None, random_state=None, balance_method=None, balance_ratio=1.0)
Implementation of the SLX Logistic Regression model. Executes Logistic Regression involving a feature engineering step to add features that provide a spatial context to the data. The algorithm adds one or more columns with the spatial lag of certain features, representing the average from neighboring observations.
- Parameters:
spatial_weights_definition – SpatialWeightsDefinition, default=None. Establishes the interaction between neighboring observations.
random_state – RandomState instance or None, default=None. Determines random number generation.
balance_method – {None, ‘random’, ‘smote’}. The method chosen to balance the dataset. ‘random’ creates duplicates from random samples (with replacement) from the minority class. ‘smote’ selects a random sample from the minority class, A, and from its k nearest neighbors, it selects a random neighbor, B. The vector AB is multiplied by a random number in the range [0, 1], and the result is added to A, generating a new synthetic instance.
balance_ratio – float, default=1.0. A number between 0 and 1 representing the desired ratio of observations from minority classes during the balancing process. A value of 1 results in the same number of observations for both classes.
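The two balancing methods described above can be sketched with the standard library alone. Everything here is illustrative: the helper names are hypothetical, the neighbors of A are assumed to be drawn from the minority class (as in standard SMOTE), and the way n_new is derived from balance_ratio is an assumption about its meaning (minority count divided by majority count), not confirmed library behavior.

```python
import math
import random

def random_oversample(minority, n_new, rng):
    """'random': duplicate minority samples drawn with replacement."""
    return [rng.choice(minority) for _ in range(n_new)]

def smote_sample(minority, k, rng):
    """'smote': move from a minority sample A toward one of its k nearest neighbors B."""
    i = rng.randrange(len(minority))
    a = minority[i]
    # k nearest minority neighbors of A, excluding A itself.
    others = [p for j, p in enumerate(minority) if j != i]
    nearest = sorted(others, key=lambda p: math.dist(a, p))[:k]
    b = rng.choice(nearest)
    gap = rng.random()  # random factor in [0, 1)
    # A + gap * AB lies on the segment between A and B.
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
n_majority = 5
balance_ratio = 1.0
# Assumed reading of balance_ratio: target minority size = ratio * majority size.
n_new = max(0, round(balance_ratio * n_majority) - len(minority))

duplicates = random_oversample(minority, n_new, rng)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(n_new)]
```

Random oversampling only repeats existing rows, whereas the SMOTE-style step creates new points between existing minority samples.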
- property betas
- Returns:
An array with the estimated parameters for the trained model. For multi_class, a (m x k) matrix is returned, representing the parameters for each fitted model, where m is the number of classes in the target variable.
- fit(X, y, geometries=None, crs=None, spatial_weights=None, column_ids=None)
Trains the SLX logistic regression model. Gets the column indexes from column_ids and adds the spatial lag of those columns to the training data. Finally, it fits a logistic regression classifier with the extended data. For the multi-class scenario, a model is fitted for each class, following a one-vs-rest strategy.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame.
geometries – shapely array, default=None. Geometry data for each sample in X.
crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).
spatial_weights – SpatialWeights, default=None. A spatial weights matrix.
column_ids – List of strings or list of integers, default=None. A list of column names or column indexes, indicating the columns that will be used to compute the spatial lag.
- Returns:
self. Fitted estimator
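The feature engineering step behind fit() can be illustrated with a small sketch that appends, for each selected column, the average value among each observation's neighbors. The function `add_spatial_lag` and the neighbor lists are hypothetical; in the library the neighbors come from spatial_weights_definition or spatial_weights, and this sketch assumes equal (row-standardized) weights.

```python
def add_spatial_lag(rows, neighbors, column_ids):
    """Append the neighbor average of each column in column_ids to every row."""
    extended = []
    for i, row in enumerate(rows):
        lag = []
        for c in column_ids:
            values = [rows[j][c] for j in neighbors[i]]
            # An observation without neighbors gets a lag of 0.0 in this sketch.
            lag.append(sum(values) / len(values) if values else 0.0)
        extended.append(list(row) + lag)
    return extended

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Hypothetical neighbor lists, indexed by observation.
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}

extended = add_spatial_lag(rows, neighbors, column_ids=[1])
```

The logistic regression is then trained on the extended rows, so each observation carries both its own value and a summary of its surroundings.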
- property k
- Returns:
The number of variables for which coefficients are estimated (including the constant)
- property model_type
- Returns:
The type of the classification model
- predict(X, geometries=None, spatial_weights=None, use_fit_lag=False)
Calculates the spatial lag of the dataset using the same columns defined in the fit process and returns the category with the highest probability according to Logistic Regression.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
geometries – shapely array, default=None. Geometry data for each sample in X.
spatial_weights – SpatialWeights, default=None. A spatial weights matrix.
use_fit_lag – boolean, default=False. If False, the spatial lag is calculated from the prediction data; otherwise, it is calculated from the training data.
- Returns:
The predicted class for each element of the prediction set.
- property predy
- Returns:
An array with the predictions for the training data. For multi_class, the prediction represents the class with the highest probability.
- score(X, y, sample_weight=None, geometries=None, use_fit_lag=False)
Returns the accuracy of the model.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Independent variables
y – {pandas.DataFrame, numpy 1D array or string}. If specified as string, X is expected to be a DataFrame
sample_weight – Weighted contribution to the score for each sample
geometries – shapely array, default=None. Geometry data for each sample in X.
use_fit_lag – boolean, default=False. If False, the spatial lag is calculated from the prediction data; otherwise, it is calculated from the training data.
- Returns:
The accuracy of the model.
- property summary
- Returns:
A string containing statistics and estimated parameters of the fitted models.
- property u
- Returns:
An array with the residuals of the trained model