oraclesai.outliers
- class LocalOutlierFactor(n_neighbors=20, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, contamination='auto', novelty=False, n_jobs=None, spatial_weights_definition=None, threshold=1.5)
The Local Outlier Factor (LOF) identifies outliers in a dataset. A LOF score is computed for each training observation from its nearest neighbors: the larger the score, the more isolated the observation.
- Parameters:
n_neighbors – int, default=20. The number of neighbors to use for KNN. If n_neighbors is larger than the number of samples provided, all samples will be used. Ignored if spatial_weights_definition is not None.
algorithm – {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to compute the nearest neighbors: ‘ball_tree’ uses BallTree; ‘kd_tree’ uses KDTree; ‘brute’ uses a brute-force search; ‘auto’ attempts to choose the most appropriate algorithm based on the values passed to the fit method. Ignored if spatial_weights_definition is not None.
leaf_size – int, default=30. Leaf size passed to BallTree or KDTree. Ignored if spatial_weights_definition is not None.
metric – str or callable, default=’minkowski’. Metric to use for distance computation. The default, ‘minkowski’, gives the standard Euclidean distance when p = 2. If metric is ‘precomputed’, X is assumed to be a square distance matrix. X may be a sparse graph, in which case only non-zero elements may be considered neighbors. If metric is a callable, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. Ignored if spatial_weights_definition is not None.
p – int, default=2. Parameter for the Minkowski metric. p = 1 is equivalent to manhattan_distance (l1) and p = 2 to euclidean_distance (l2). For arbitrary p, minkowski_distance (l_p) is used. Ignored if spatial_weights_definition is not None.
metric_params – dict, default=None. Additional keyword arguments for the metric function.
contamination – ‘auto’ or float, default=’auto’. The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the scores of the samples. If a float, the contamination should be in the range (0, 0.5]. Ignored if spatial_weights_definition is not None.
novelty – bool, default=False. Set novelty to True to use LocalOutlierFactor for novelty detection. In this case, use predict only on new, unseen data and not on the training set.
n_jobs – int, default=None. The number of parallel jobs to run.
spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification.
threshold – float, default=1.5. LOF scores above this value are considered outliers.
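The scoring described for this class can be illustrated with a small, pure-Python sketch of the standard LOF computation. It is not the library's implementation: the helper names minkowski and lof_scores are hypothetical, duplicate points are not handled, and the spatial-weights path is left out. It only shows how p selects the distance and how a threshold of 1.5 separates an isolated point from a tight group.

```python
def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan (l1), p=2 is Euclidean (l2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def lof_scores(points, n_neighbors=2, p=2):
    """Return one LOF score per point (illustrative, O(n^2), no duplicates)."""
    n = len(points)
    k = min(n_neighbors, n - 1)  # cap k when few samples are available
    dist = [[minkowski(points[i], points[j], p) for j in range(n)]
            for i in range(n)]

    neighbors, k_distance = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, n_neighbors=2)
flags = [s > 1.5 for s in scores]  # → [False, False, False, False, True]
```

The four clustered points score exactly 1.0 here, while the isolated point scores roughly 6, well above the default threshold.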
- fit(X, y=None, geometries=None, spatial_weights=None, crs=None)
Computes the LOF score for each observation in the training set
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Training instances.
y – Ignored. Not used; present for API consistency by convention.
geometries – shapely array, default=None. Geometry data for each sample in X.
spatial_weights – SpatialWeights, default=None. A spatial weights matrix
crs – pyproj.crs.CRS, default=None. Coordinate reference system
- fit_predict(X, y=None, geometries=None, crs=None)
Computes the LOF score for each observation in the training set and returns an array with values in {-1, 1}, indicating whether each element of the training set is an inlier or an outlier.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). Training instances.
y – Ignored. Not used; present for API consistency by convention.
geometries – shapely array, default=None. Geometry data for each sample in X.
crs – pyproj.crs.CRS, default=None. Coordinate reference system
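A minimal usage sketch for fit_predict, built only from the signatures documented here. The wrapper name label_training_set is hypothetical, and the sketch assumes a plain numpy array is accepted without geometries or spatial weights, so that the n_neighbors-based search is used.

```python
def label_training_set(X, n_neighbors=20, threshold=1.5):
    """Return (labels, scores) for the rows of X (hypothetical helper)."""
    # Imported lazily so the sketch can be loaded without oraclesai installed.
    from oraclesai.outliers import LocalOutlierFactor

    model = LocalOutlierFactor(n_neighbors=n_neighbors, threshold=threshold)
    labels = model.fit_predict(X)   # array with values in {-1, 1}
    scores = model.outlier_factor_  # LOF score of each training observation
    return labels, scores
```

Returning the scores alongside the labels makes it easier to check how close each observation is to the threshold.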
- property negative_outlier_factor_
- Returns:
The negative LOF score of each observation in the training set
- property outlier_factor_
- Returns:
The LOF score of each observation in the training set
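The two properties and the threshold parameter can be related with a short sketch. The score values below are made up for illustration, the helper to_labels is hypothetical, and the mapping of -1 to outliers and 1 to inliers is an assumption borrowed from the scikit-learn convention rather than something stated in this reference.

```python
def to_labels(outlier_factor, threshold=1.5):
    """Map LOF scores to labels; assumes -1 = outlier, 1 = inlier."""
    return [-1 if score > threshold else 1 for score in outlier_factor]


# Illustrative scores only, not produced by the library.
outlier_factor = [1.0, 0.98, 1.1, 6.03]

# The negative factor is the same score with the sign flipped.
negative_outlier_factor = [-score for score in outlier_factor]

labels = to_labels(outlier_factor)  # → [1, 1, 1, -1]
```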
- predict(X, geometries=None, crs=None)
Classifies each element of the prediction set as an inlier or an outlier according to its LOF score. The LOF score is calculated using the training set.
- Parameters:
X – {numpy array, geopandas dataframe, vector dataframe} of shape (n_samples, n_features). The prediction set
geometries – shapely array, default=None. Geometry data for each sample in X
crs – pyproj.crs.CRS, default=None. Coordinate reference system
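For novelty detection, the documented workflow is to fit on a training set and call predict only on new, unseen data. The sketch below follows that workflow using the signatures above. The wrapper name label_new_observations is hypothetical, and it assumes numpy arrays are accepted without geometries.

```python
def label_new_observations(X_train, X_new, n_neighbors=20):
    """Fit on X_train, then classify the rows of X_new (hypothetical helper)."""
    # Imported lazily so the sketch can be loaded without oraclesai installed.
    from oraclesai.outliers import LocalOutlierFactor

    # novelty=True enables predict on data that was not used for fitting.
    model = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    model.fit(X_train)
    return model.predict(X_new)
```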