mlm_insights.core.sfcs package
Subpackages
Submodules
mlm_insights.core.sfcs.cpc_sfc module
- class mlm_insights.core.sfcs.cpc_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)
Bases:
object
- estimate: int
- lower_bound: float
- upper_bound: float
- class mlm_insights.core.sfcs.cpc_sfc.CompressedProbabilityCountingSFC(sketch: cpc_sketch, log_k: int = 11)
Bases:
ShareableFeatureComponent, Serializable
Provides a single-pass cardinality estimate, with associated lower and upper bounds.
Reference: https://datasketches.apache.org/docs/CPC/CPC.html
CompressedProbabilityCountingSFC contains only one state, i.e., sketch: datasketches.cpc_sketch.
Note: Use the create method instead of the constructor.
- compute(column: Series, **kwargs: Any) None
Update the state of the CompressedProbabilityCountingSFC using input series.
Parameters
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) CompressedProbabilityCountingSFC
Factory method to create a CompressedProbabilityCountingSFC. Supported configurable parameters:
DEFAULT_LOG_K: K-value used to initialize the cpc_sketch, default = 11
Returns
- CompressedProbabilityCountingSFC
An Instance of CompressedProbabilityCountingSFC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSFC
Create a new instance of CompressedProbabilityCountingSFC from serialized bytes.
Parameters
- serialized_bytes: bytes
Serialized bytes as input.
Returns
- CompressedProbabilityCountingSFC
New instance of CompressedProbabilityCountingSFC
- get_cardinality() CardinalityItem
Returns the cardinality of input data, with associated lower and upper bound.
Returns
CardinalityItem
- log_k: int = 11
- merge(other: CompressedProbabilityCountingSFC, **kwargs: Any) CompressedProbabilityCountingSFC
Merge two CompressedProbabilityCountingSFC instances into one using a CPC union. The union is updated with both instances and a merged CompressedProbabilityCountingSFC is returned.
Parameters
- other: CompressedProbabilityCountingSFC
Other CompressedProbabilityCountingSFC to be merged.
Returns
- CompressedProbabilityCountingSFC
A new instance of CompressedProbabilityCountingSFC after merging.
- serialize(**kwargs: Any) bytes
Serialize the CompressedProbabilityCountingSFC to bytes. Because its only state is the cpc_sketch, this uses the default datasketches serialization.
Returns
- bytes
Serialized bytes of the CompressedProbabilityCountingSFC.
- sketch: cpc_sketch
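The create / compute / merge / get_cardinality lifecycle described above can be illustrated with a toy stand-in. This is only a sketch under stated assumptions: ExactDistinctCounter is a hypothetical class that keeps an exact Python set instead of a CPC sketch, it accepts any iterable rather than a pd.Series, and it does not use mlm_insights or datasketches at all.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class CardinalityItem:
    # Mirrors the three fields documented above.
    estimate: int
    lower_bound: float
    upper_bound: float


class ExactDistinctCounter:
    """Hypothetical stand-in: an exact set instead of a CPC sketch."""

    def __init__(self, seen: Set[str]) -> None:
        self.seen = seen

    @classmethod
    def create(cls) -> "ExactDistinctCounter":
        # Factory method, mirroring the "use create, not the constructor" note.
        return cls(set())

    def compute(self, column: Iterable[object]) -> None:
        # Update the state in place, one batch at a time.
        self.seen.update(str(value) for value in column)

    def merge(self, other: "ExactDistinctCounter") -> "ExactDistinctCounter":
        # Return a new instance; neither input is mutated.
        return ExactDistinctCounter(self.seen | other.seen)

    def get_cardinality(self) -> CardinalityItem:
        n = len(self.seen)
        # An exact count has zero-width bounds; a real sketch would not.
        return CardinalityItem(estimate=n, lower_bound=float(n), upper_bound=float(n))


left = ExactDistinctCounter.create()
left.compute(["a", "b", "a"])
right = ExactDistinctCounter.create()
right.compute(["b", "c"])
merged = left.merge(right)
```

A real sketch trades this exactness for bounded memory, which is why its estimate comes with a lower and an upper bound.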
mlm_insights.core.sfcs.descriptive_statistics_sfc module
- class mlm_insights.core.sfcs.descriptive_statistics_sfc.DescriptiveStatisticsSFC(total_count: int, mean: float, minimum: float, maximum: float, central_moments: List[float])
Bases:
ShareableFeatureComponent
DescriptiveStatisticsSFC calculates a few descriptive statistics of the data.
It contains the following states:
total_count: Size of the data.
mean: the statistical mean of the data
minimum: the minimum element of the data
maximum: the maximum element of the data
central_moments: the central moments of the data, up to order MAXIMUM_MOMENT_ORDER.
Mathematically: central_moments[i] = sum{( x - mean )^i} /N
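The formula above can be evaluated directly for a small in-memory list. This is a minimal sketch, not the library's implementation: the function name central_moments, the value 4 used for MAXIMUM_MOMENT_ORDER, and the convention that list index i holds the order-i moment are all assumptions made for illustration.

```python
from typing import List

MAXIMUM_MOMENT_ORDER = 4  # assumed value, for illustration only


def central_moments(data: List[float], max_order: int = MAXIMUM_MOMENT_ORDER) -> List[float]:
    """Return [m_0, m_1, ..., m_max_order] with m_i = sum((x - mean)^i) / N."""
    if not data:
        raise ValueError("data must not be empty")
    n = len(data)
    mean = sum(data) / n
    # Index 0 is always 1.0 and index 1 is 0.0 up to rounding error.
    return [sum((x - mean) ** i for x in data) / n for i in range(max_order + 1)]


moments = central_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

This two-pass form needs the whole data set in memory; a streaming component would have to update the moments incrementally instead.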
- central_moments: List[float]
- compute(column: Series, **kwargs: Any) None
Update the state of the DescriptiveStatisticsSFC using input series.
Parameters
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DescriptiveStatisticsSFC
Factory method to create a DescriptiveStatisticsSFC. No config parameters are supported.
Returns
- DescriptiveStatisticsSFC
An instance of DescriptiveStatisticsSFC.
- get_central_moments(order: int) float | None
Get the central moment of the given order.
Parameters
- order: int
Order of the moment; must be less than or equal to MAXIMUM_MOMENT_ORDER.
Returns
float : Central moment of the given order.
- get_kurtosis() float | None
Get the Excess Kurtosis of data
Returns
float : Excess Kurtosis of the data
- get_maximum() float | None
Get the Maximum of the data
Returns
float : Maximum value of the data.
- get_minimum() float | None
Get the Minimum of the data
Returns
float : Minimum value of the data.
- get_standard_deviation() float | None
Get the Standard Deviation of data
Returns
float : Standard Deviation of the data
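Standard deviation and excess kurtosis are conventionally derived from the second and fourth central moments. The sketch below uses the population forms, which match the divide-by-N definition of central_moments; whether the library applies a sample correction is not stated here, and the helper names and the choice to return None for constant data are assumptions.

```python
import math
from typing import Optional


def standard_deviation(m2: float) -> float:
    """Population standard deviation from the second central moment."""
    return math.sqrt(m2)


def excess_kurtosis(m2: float, m4: float) -> Optional[float]:
    """Excess kurtosis m4 / m2^2 - 3; None when the variance is zero."""
    if m2 == 0.0:
        # Kurtosis is undefined for constant data.
        return None
    return m4 / (m2 * m2) - 3.0


# Central moments of [1, 2, 3, 4, 5]: m2 = 2.0, m4 = 6.8
std = standard_deviation(2.0)
kurt = excess_kurtosis(2.0, 6.8)
```

Subtracting 3 makes a normal distribution score zero, so negative values indicate lighter tails than normal.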
- maximum: float
- mean: float
- merge(other: DescriptiveStatisticsSFC, **kwargs: Any) DescriptiveStatisticsSFC
Merge two DescriptiveStatisticsSFC instances into one, without mutating either input.
Parameters
- other: DescriptiveStatisticsSFC
Other DescriptiveStatisticsSFC to be merged.
Returns
- DescriptiveStatisticsSFC
A new instance of DescriptiveStatisticsSFC after merging.
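One way such a merge can work without revisiting the raw data is the standard pairwise update for count, mean, and second central moment. The sketch below is an assumption about the general technique, not the library's code: Stats and merge_stats are hypothetical names, and only the second moment is combined (higher orders need longer formulas).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    total_count: int
    mean: float
    minimum: float
    maximum: float
    variance: float  # second central moment, divided by N


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two summaries into a new one; inputs are left untouched."""
    if a.total_count == 0:
        return b
    if b.total_count == 0:
        return a
    n = a.total_count + b.total_count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.total_count / n
    # Sum of squared deviations about the combined mean.
    m2_sum = (
        a.total_count * a.variance
        + b.total_count * b.variance
        + delta * delta * a.total_count * b.total_count / n
    )
    return Stats(n, mean, min(a.minimum, b.minimum), max(a.maximum, b.maximum), m2_sum / n)


first = Stats(3, 2.0, 1.0, 3.0, 2.0 / 3.0)   # summary of [1, 2, 3]
second = Stats(2, 4.5, 4.0, 5.0, 0.25)       # summary of [4, 5]
merged = merge_stats(first, second)
```

The merged summary matches what a single pass over [1, 2, 3, 4, 5] would produce, which is the property that makes per-partition profiles combinable.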
- minimum: float
- total_count: int
mlm_insights.core.sfcs.distinct_count_sfc module
- class mlm_insights.core.sfcs.distinct_count_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)
Bases:
object
- estimate: int
- lower_bound: float
- upper_bound: float
- class mlm_insights.core.sfcs.distinct_count_sfc.DistinctCountSFC(sketch: hll_sketch)
Bases:
ShareableFeatureComponent, Serializable
Provides estimates in a single pass for identifying cardinality estimate (with associated lower and upper bound)
Reference: https://datasketches.apache.org/docs/HLL/HLL.html
DistinctCountSFC contains only one state i.e., sketch: datasketches.hll_sketch.
Note: Use create method instead of constructor
- compute(column: Series, **kwargs: Any) None
Update the state of the DistinctCountSFC using input series.
Parameters
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCountSFC
Factory Method to create a DistinctCountSFC. Supported configurable parameters
DEFAULT_LOG_K: K-value to initialize hll_sketch, default = 12
Additionally, the HLL Target for the resulting sketch is set to HLL_4 and is non-configurable
Returns
- DistinctCountSFC
An Instance of DistinctCountSFC.
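As a rough guide to what DEFAULT_LOG_K controls, the classic HyperLogLog analysis puts the relative standard error near 1.04 / sqrt(2^log_k). The helper below is hypothetical and only evaluates that textbook formula; the datasketches estimator may achieve different constants, so treat the result as an order-of-magnitude indication.

```python
import math


def hll_relative_standard_error(log_k: int) -> float:
    """Textbook HyperLogLog error estimate for 2**log_k buckets."""
    if log_k < 1:
        raise ValueError("log_k must be positive")
    buckets = 1 << log_k
    return 1.04 / math.sqrt(buckets)


rse_default = hll_relative_standard_error(12)
rse_larger = hll_relative_standard_error(14)
```

Each increase of log_k by 2 quadruples the bucket count and halves the expected error, at the cost of a larger sketch.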
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) DistinctCountSFC
Create a new instance of DistinctCountSFC from serialized bytes.
Parameters
- serialized_bytes: bytes
Serialized bytes as input.
Returns
- DistinctCountSFC
New instance of DistinctCountSFC
- get_cardinality() CardinalityItem
Returns the cardinality of input data, with associated lower and upper bound.
Returns
CardinalityItem
- merge(other: DistinctCountSFC, **kwargs: Any) DistinctCountSFC
Merge two DistinctCountSFC instances into one using an HLL union. The union is updated with both instances and a merged DistinctCountSFC is returned.
Parameters
- other: DistinctCountSFC
Other DistinctCountSFC to be merged.
Returns
- DistinctCountSFC
A new instance of DistinctCountSFC after merging.
- serialize(**kwargs: Any) bytes
Serialize the DistinctCountSFC to bytes. Because its only state is the hll_sketch, this uses the default datasketches serialization.
Returns
- bytes
Serialized bytes of the DistinctCountSFC.
- sketch: hll_sketch
mlm_insights.core.sfcs.framework_sfc module
Bases:
Enum
Enum to store all framework-specific ShareableFeatureComponent types
mlm_insights.core.sfcs.frequent_items_sfc module
- class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemEstimate(value: str, estimate: int, lower_bound: int, upper_bound: int)
Bases:
object
- estimate: int
- lower_bound: int
- upper_bound: int
- value: str
- class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemsSFC(sketch: frequent_strings_sketch)
Bases:
ShareableFeatureComponent, Serializable
Provides estimates in a single pass for:
identifying frequent items (aka heavy hitters) and
answering point queries (approximately how many times an item appears in a stream/dataset)
Reference: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
- compute(column: Series, **kwargs: Any) None
Update the state of the FrequentItemsSFC using input series.
Parameters
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequentItemsSFC
Factory method to create the SFC. Use the create method instead of the constructor. Supported configurable parameters:
CONFIG_MAX_SIZE_KEY: Maximum size of the counters, default = 7. Tune this parameter to trade space against error: a larger size uses more space and gives less error.
Returns
- FrequentItemsSFC
An Instance of FrequentItemsSFC.
- classmethod deserialize(serialized_sketch: bytes, **kwargs: Any) FrequentItemsSFC
Create a new instance from serialized bytes.
Parameters
- serialized_sketch: bytes
Serialized bytes as input.
Returns
- FrequentItemsSFC
New instance of FrequentItemsSFC
- get_frequency_estimate(item: Any) FrequentItemEstimate
Get a frequency estimate for a specific item, i.e., approximately how many times the item appeared in the stream/dataset.
Parameters
- item: Any
Item value to get the frequency estimate for
Returns
FrequentItemEstimate
- get_frequent_items_estimates() List[FrequentItemEstimate]
Get a list of all the frequent item estimates from the processed data stream/data set
Returns
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_frequent_items_estimates_no_false_negatives() List[FrequentItemEstimate]
Get a list of all the frequent item estimates using the no-false-negatives mode of the Frequent Items sketch.
Returns
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_top_k_elements(k: int) List[FrequentItemEstimate]
Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise exactly ‘k’ frequent items are returned.
Parameters
- k: int
Count of how many top frequently occurring items to return.
Returns
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_top_k_elements_using_no_false_negatives(k: int) List[FrequentItemEstimate]
Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise exactly ‘k’ frequent items are returned. This variant computes the top ‘k’ elements from the no-false-negatives result of the Frequent Items sketch, which is useful when the sketch returns an empty list in the no-false-positives case.
Parameters
- k: int
Count of how many top frequently occurring items to return.
Returns
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
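The top-k behaviour described above, including the case where k exceeds the number of captured items, can be sketched in a few lines. This is an illustration only: the local FrequentItemEstimate dataclass just mirrors the documented fields, top_k is a hypothetical helper, and ranking by the estimate field is an assumption about how ordering is decided.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrequentItemEstimate:
    value: str
    estimate: int
    lower_bound: int
    upper_bound: int


def top_k(estimates: List[FrequentItemEstimate], k: int) -> List[FrequentItemEstimate]:
    """Return at most k items, highest estimate first."""
    if k <= 0:
        return []
    ranked = sorted(estimates, key=lambda item: item.estimate, reverse=True)
    # Slicing past the end is safe, so k > len(ranked) returns everything.
    return ranked[:k]


captured = [
    FrequentItemEstimate("red", 40, 38, 42),
    FrequentItemEstimate("blue", 75, 73, 77),
    FrequentItemEstimate("green", 12, 10, 14),
]
top_two = top_k(captured, 2)
top_ten = top_k(captured, 10)
```

The same helper works on either the no-false-positives or the no-false-negatives list; only the input list differs.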
- get_total_count() int
Returns the total count of input data.
Returns
- int :
total count of items in the data.
- merge(other: FrequentItemsSFC, **kwargs: Any) FrequentItemsSFC
Merge two SFCs to produce a correct union, without mutating either input.
Parameters
- other: FrequentItemsSFC
Other FrequentItemsSFC to be merged.
Returns
- FrequentItemsSFC
A new instance of FrequentItemsSFC after merging.
- serialize(**kwargs: Any) bytes
Serialize the FrequentItemsSFC to bytes. This allows the SFC to be persisted in a Profile.
Returns
- bytes
Serialized bytes of the FrequentItemsSFC.
- sketch: frequent_strings_sketch
mlm_insights.core.sfcs.quantiles_sfc module
- class mlm_insights.core.sfcs.quantiles_sfc.QuantilesSFC(kll_sketch: kll_doubles_sketch)
Bases:
ShareableFeatureComponent, Serializable
QuantilesSFC uses a streaming quantiles algorithm. It can be used to find quantiles, ranks, PMF and CDF.
QuantilesSFC contains only one state, i.e., kll_sketch: datasketches.kll_doubles_sketch.
Reference: https://datasketches.apache.org/docs/KLL/KLLSketch.html
Note: Use the create method instead of the constructor.
- compute(column: Series, **kwargs: Any) None
Update the state of the QuantilesSFC using input series.
Parameters
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) QuantilesSFC
Factory method to create a QuantilesSFC. Supported configurable parameters:
KLL_K: K-value used to initialize the kll_doubles_sketch, default = 200
Returns
- QuantilesSFC
An Instance of QuantilesSFC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) QuantilesSFC
Create a new instance of QuantilesSFC from serialized bytes.
Parameters
- serialized_bytes: bytes
Serialized bytes as input.
Returns
- QuantilesSFC
New instance of QuantilesSFC
- get_quantile(rank: float) float
Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input data.
Returns
- float
Quantile of the data of given rank
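To make "normalized rank in a hypothetical sorted version of the input data" concrete, the sketch below computes the exact answer on a small list. It is not how a KLL sketch works internally (KLL approximates this without storing all values), exact_quantile is a hypothetical name, and the index convention used here is one simple choice among several.

```python
from typing import List


def exact_quantile(data: List[float], rank: float) -> float:
    """Value at a normalized rank in [0, 1] of the sorted data."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must be between 0 and 1")
    ordered = sorted(data)
    # Clamp so that rank == 1.0 maps to the largest value.
    index = min(int(rank * len(ordered)), len(ordered) - 1)
    return ordered[index]


values = [5.0, 1.0, 3.0, 2.0, 4.0]
median = exact_quantile(values, 0.5)
smallest = exact_quantile(values, 0.0)
largest = exact_quantile(values, 1.0)
```

An approximate sketch returns a value whose true rank is close to the requested one, rather than this exact element.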
- kll_sketch: kll_doubles_sketch
- merge(other: QuantilesSFC, **kwargs: Any) QuantilesSFC
Merge two QuantilesSFC instances into one, without mutating either input.
Parameters
- other: QuantilesSFC
Other QuantilesSFC to be merged.
Returns
- QuantilesSFC
A new instance of QuantilesSFC after merging.
mlm_insights.core.sfcs.sfc_merge_exception module
- exception mlm_insights.core.sfcs.sfc_merge_exception.SFCMergeException(message: str)
Bases:
Exception
Exception raised when merging two ShareableFeatureComponent instances fails
- Attributes:
message – explanation of the error
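The pattern of a merge exception that carries a message attribute can be sketched as follows. ToyMergeException and checked_merge are hypothetical stand-ins written for this example; they do not import or reproduce the library's SFCMergeException, and the type check is just one plausible reason a merge might fail.

```python
from typing import Optional, Set


class ToyMergeException(Exception):
    """Stand-in exception exposing the explanation as .message."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


def checked_merge(left: Set[str], right: object) -> Set[str]:
    # Refuse to merge states of incompatible types.
    if not isinstance(right, set):
        raise ToyMergeException(
            f"cannot merge set with {type(right).__name__}"
        )
    return left | right


error_message: Optional[str] = None
try:
    checked_merge({"a"}, ["b"])
except ToyMergeException as exc:
    error_message = exc.message

ok = checked_merge({"a"}, {"b"})
```

Catching a dedicated exception type lets callers separate merge failures from unrelated errors raised during profiling.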
mlm_insights.core.sfcs.sfc_registry module
- class mlm_insights.core.sfcs.sfc_registry.SFCMetaData(klass: Type[ShareableFeatureComponent], config: Dict[str, Any] = <factory>)
Bases:
object
SFCMetaData to store class type and config of ShareableFeatureComponent
- config: Dict[str, Any]
- get_hash() str
Get the hash of the SFCMetaData. The hash value is derived from the MD5 hash of SFCMetaData.config.
Returns
str: The calculated hash of the SFCMetaData.
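A config dictionary can be turned into a stable MD5 hex digest along these lines. This is a sketch under assumptions: config_hash is a hypothetical helper, and canonicalising the config as sorted-key JSON is a choice made here, not a documented detail of get_hash.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """MD5 hex digest of a config, independent of key order."""
    # sort_keys gives equal configs the same text; default=str covers
    # values that JSON cannot encode natively.
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


hash_a = config_hash({"log_k": 12, "mode": "hll"})
hash_b = config_hash({"mode": "hll", "log_k": 12})
hash_c = config_hash({"log_k": 11, "mode": "hll"})
```

Canonicalising before hashing matters because two dictionaries with the same content but different insertion order should map to the same key.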
- klass: Type[ShareableFeatureComponent]
- class mlm_insights.core.sfcs.sfc_registry.SFCRegistry
Bases:
object
- add_sfc(sfc_metadata: SFCMetaData) SFCRegistry
Add ShareableFeatureComponent to the SFCRegistry
Parameters
sfc_metadata : SFCMetaData
Returns
SFCRegistry
- static create_from_sfc_map(sfc_map: Dict[str, ShareableFeatureComponent]) SFCRegistry
Factory method to create an SFC Registry using an SFC map. Use this method to create the SFC registry directly from the SFC map.
Parameters
- sfc_map: Dict[str, ShareableFeatureComponent]
Dictionary of sfc_map, hash as the Key and ShareableFeatureComponent as value.
- static create_from_sfc_meta(sfc_metas: List[SFCMetaData]) SFCRegistry
Factory method to create an SFC Registry from a list of SFC metadata. For each SFC metadata, a hash is computed and a new SFC instance is created. If two metadata entries are identical, only one key is stored.
Parameters
- sfc_metas: List[SFCMetaData]
List of SFCMetaData
- classmethod deserialize(sfc_registry_message: SFCRegistryMessage) SFCRegistry
Deserialize the Protobuf message to an SFCRegistry
Returns
SFCRegistry
- get_sfc(sfc_meta: SFCMetaData) ShareableFeatureComponent
Get the ShareableFeatureComponent from the SFCMetaData.
Parameters
sfc_meta : SFCMetaData
Returns
ShareableFeatureComponent
Raises
- KeyError
Raised if the SFCMetaData is not found in the registry.
- get_sfc_map() Dict[str, ShareableFeatureComponent]
Get the mapping from SFCMetaData hash to ShareableFeatureComponent.
Returns
Dict[str, ShareableFeatureComponent]
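The registry behaviour described in this module (one entry per distinct metadata, lookup by metadata, KeyError on a miss) can be sketched with a small stand-in. ToyRegistry and meta_key are hypothetical; built-in types stand in for SFC classes, and keying on the class name plus the config is an assumption of this sketch rather than the library's hashing scheme.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple

Meta = Tuple[type, Dict[str, Any]]  # (class, config) stand-in for SFCMetaData


def meta_key(klass: type, config: Dict[str, Any]) -> str:
    payload = json.dumps({"klass": klass.__name__, "config": config}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


class ToyRegistry:
    def __init__(self) -> None:
        self._components: Dict[str, Any] = {}

    @staticmethod
    def create_from_metas(metas: List[Meta]) -> "ToyRegistry":
        registry = ToyRegistry()
        for klass, config in metas:
            key = meta_key(klass, config)
            # Identical metadata produces the same key, so only the
            # first occurrence creates an instance.
            if key not in registry._components:
                registry._components[key] = klass()
        return registry

    def get(self, klass: type, config: Dict[str, Any]) -> Any:
        # Raises KeyError when the metadata was never registered.
        return self._components[meta_key(klass, config)]

    def get_map(self) -> Dict[str, Any]:
        return dict(self._components)


registry = ToyRegistry.create_from_metas([(set, {"k": 1}), (set, {"k": 1}), (list, {})])
found = registry.get(set, {"k": 1})
try:
    registry.get(dict, {})
    missing_raises = False
except KeyError:
    missing_raises = True
```

Deduplicating by hash is what lets several metrics that need the same component share a single computed instance.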