oraclesai.outliers

class LocalOutlierFactor(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination='auto', novelty=False, n_jobs=None, spatial_weights_definition=None, threshold=1.5)

The Local Outlier Factor (LOF) identifies outliers in a dataset. An LOF score is calculated for each training observation by comparing its local density with the local densities of its neighbors. The larger the LOF score, the more isolated the observation is.
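To make the neighbor-based score concrete, the following is a minimal pure-Python sketch of the textbook LOF definition (k-distance, reachability distance, local reachability density). It is not this class's implementation. The helper names `minkowski` and `lof_scores` are hypothetical, and the sketch assumes there are no duplicate points, which would make a reachability sum zero.

```python
def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def lof_scores(points, n_neighbors=2, p=2):
    """Return one LOF score per point (hypothetical helper, brute force)."""
    n = len(points)
    # If n_neighbors exceeds the available samples, use all other samples.
    k = min(n_neighbors, n - 1)
    dist = [[minkowski(points[i], points[j], p) for j in range(n)]
            for i in range(n)]

    neighbors, k_distance = [], []
    for i in range(n):
        # Other points ordered by distance to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        # Distance to the k-th nearest neighbor.
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


# Four corners of a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, n_neighbors=2)
```

In this toy data set the four corners score 1.0 (their density matches that of their neighbors), while the distant point scores far above 1.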

Parameters:
  • n_neighbors – int, default=20. The number of neighbors to use for KNN. If n_neighbors is larger than the number of samples provided, all samples will be used. Ignored if spatial_weights_definition is not None

  • algorithm – {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to compute the nearest neighbors:
      - ‘ball_tree’. Uses BallTree.
      - ‘kd_tree’. Uses KDTree.
      - ‘brute’. Uses a brute-force search.
      - ‘auto’. Attempts to choose the most appropriate algorithm based on the values passed to the fit method.
    Ignored if spatial_weights_definition is not None.

  • leaf_size – int, default=30. Leaf size passed to BallTree or KDTree. Ignored if spatial_weights_definition is not None.

  • metric – str or callable, default=’minkowski’. Metric to use for distance computation. The default, ‘minkowski’, results in the standard Euclidean distance when p = 2. If metric is ‘precomputed’, X is assumed to be a square distance matrix. X may be a sparse graph, in which case only non-zero elements may be considered neighbors. If metric is a callable, it takes two arrays representing 1D vectors as inputs and must return a single value indicating the distance between those vectors. Ignored if spatial_weights_definition is not None.

  • p – int, default=2. Parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Ignored if spatial_weights_definition is not None

  • metric_params – dict, default=None. Additional keyword arguments for the metric function.

  • contamination – ‘auto’ or float, default=’auto’. The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the scores of the samples. If a float, the contamination should be in the range (0, 0.5]. Ignored if spatial_weights_definition is not None.

  • novelty – bool, default=False. Set novelty to True if you want to use LocalOutlierFactor for novelty detection. In this case you should only use predict on new, unseen data and not on the training set.

  • n_jobs – int, default=None. The number of parallel jobs to run

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.

  • threshold – float, default=1.5. LOF scores above this value are considered outliers.
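As a rough illustration of the threshold parameter, the sketch below turns LOF scores into labels. The function name `label_by_threshold` is hypothetical, and mapping outliers to -1 and inliers to 1 is an assumption borrowed from the scikit-learn convention; the parameter description above only states that scores above the threshold are considered outliers.

```python
def label_by_threshold(lof_scores, threshold=1.5):
    """Label each score: -1 for outlier, 1 for inlier (assumed convention)."""
    # Only scores strictly above the threshold are treated as outliers.
    return [-1 if score > threshold else 1 for score in lof_scores]


# Illustrative scores only; they do not come from a real data set.
labels = label_by_threshold([0.98, 1.02, 1.5, 2.7], threshold=1.5)
```

A score exactly equal to the threshold stays an inlier in this sketch, because the comparison is strict.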

fit(X, y=None, geometries=None, spatial_weights=None, crs=None)

Computes the LOF score for each observation in the training set

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Training instances.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system

fit_predict(X, y=None, geometries=None, crs=None)

Computes the LOF score for each observation in the training set and returns an array with values in {-1, 1}, indicating whether each element of the training set is an inlier or an outlier.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Training instances.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system

property negative_outlier_factor_
Returns:

The negative LOF score of each observation in the training set

property outlier_factor_
Returns:

The LOF score of each observation in the training set
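The two properties carry the same information with opposite sign. The sketch below uses made-up scores and assumes the negative score is simply the sign-flipped LOF score. It shows that ranking observations from most to least anomalous gives the same order either way.

```python
# Hypothetical LOF scores for five observations (illustrative values only).
outlier_factor = [1.0, 0.97, 6.03, 1.1, 2.4]

# Assumption: the negative score is the sign-flipped LOF score.
negative_outlier_factor = [-score for score in outlier_factor]

# Most anomalous first: largest LOF score, or smallest negative score.
by_lof = sorted(range(len(outlier_factor)),
                key=lambda i: outlier_factor[i], reverse=True)
by_negative = sorted(range(len(negative_outlier_factor)),
                     key=lambda i: negative_outlier_factor[i])
```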

predict(X, geometries=None, crs=None)

Classifies each element of the prediction set as an inlier or an outlier depending on its LOF score. The LOF score of each new element is calculated relative to the training set.

Parameters:
  • X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). The prediction set

  • geometries – shapely array, default=None. Geometry data for each sample in X

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system
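Scoring unseen data against the training set can be sketched as follows. This is a pure-Python illustration of the standard LOF definition using Euclidean distance, not this class's implementation. The helpers `fit_reference` and `score_new_point` are hypothetical, and the sketch assumes k is smaller than the number of training points and that there are no duplicate points.

```python
import math


def fit_reference(train, k):
    """Compute k-distances and local reachability densities of the training set."""
    n = len(train)
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(train[i], train[j]))
        neighbors.append(order[:k])
        k_distance.append(math.dist(train[i], train[order[k - 1]]))
    lrd = [k / sum(max(k_distance[j], math.dist(train[i], train[j]))
                   for j in neighbors[i])
           for i in range(n)]
    return k_distance, lrd


def score_new_point(query, train, k_distance, lrd, k):
    """LOF of an unseen point; its neighbors come only from the training set."""
    nearest = sorted(range(len(train)),
                     key=lambda j: math.dist(query, train[j]))[:k]
    reach = [max(k_distance[j], math.dist(query, train[j])) for j in nearest]
    query_lrd = k / sum(reach)
    return sum(lrd[j] for j in nearest) / (k * query_lrd)


train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
k_distance, lrd = fit_reference(train, k=2)
score_inside = score_new_point((0.5, 0.5), train, k_distance, lrd, k=2)
score_far = score_new_point((5.0, 5.0), train, k_distance, lrd, k=2)
```

The point in the middle of the square scores 1.0, while the distant point scores well above the default threshold of 1.5. The training densities are computed once and reused for every query, which is why this mode is applied only to new data.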