mlm_insights.core.sfcs package

Subpackages

Submodules

mlm_insights.core.sfcs.cpc_sfc module

class mlm_insights.core.sfcs.cpc_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)

Bases: object

estimate: int
lower_bound: float
upper_bound: float
class mlm_insights.core.sfcs.cpc_sfc.CompressedProbabilityCountingSFC(sketch: cpc_sketch, log_k: int = 11)

Bases: ShareableFeatureComponent, Serializable

Provides estimates in a single pass for:
  • identifying cardinality estimate (with associated lower and upper bound)

Reference: https://datasketches.apache.org/docs/CPC/CPC.html

CompressedProbabilityCountingSFC contains only one state, i.e., sketch: datasketches.cpc_sketch.

Note: Use the create method instead of the constructor.

compute(column: Series, **kwargs: Any) None

Update the state of the CompressedProbabilityCountingSFC using input series.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) CompressedProbabilityCountingSFC

Factory method to create a CompressedProbabilityCountingSFC. Supported configurable parameters:

DEFAULT_LOG_K: log_k value used to initialize cpc_sketch, default = 11

Returns

CompressedProbabilityCountingSFC

An Instance of CompressedProbabilityCountingSFC.

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSFC

Create a new instance of CompressedProbabilityCountingSFC from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

CompressedProbabilityCountingSFC

New instance of CompressedProbabilityCountingSFC.

get_cardinality() CardinalityItem

Returns the cardinality of input data, with associated lower and upper bound.

Returns

CardinalityItem

log_k: int = 11
merge(other: CompressedProbabilityCountingSFC, **kwargs: Any) CompressedProbabilityCountingSFC

Merge two CompressedProbabilityCountingSFC instances into one with the help of a CPC union. The CPC union is updated with both instances and a merged CompressedProbabilityCountingSFC is returned.

Parameters

other : CompressedProbabilityCountingSFC

Other CompressedProbabilityCountingSFC to be merged.

Returns

CompressedProbabilityCountingSFC

A new instance of CompressedProbabilityCountingSFC after merging.

serialize(**kwargs: Any) bytes

Serialize the CompressedProbabilityCountingSFC to bytes. Since it has only one state, i.e., cpc_sketch, the default serialization of datasketches is used.

Returns

bytes

Serialized bytes of the CompressedProbabilityCountingSFC.

sketch: cpc_sketch

mlm_insights.core.sfcs.descriptive_statistics_sfc module

class mlm_insights.core.sfcs.descriptive_statistics_sfc.DescriptiveStatisticsSFC(total_count: int, mean: float, minimum: float, maximum: float, central_moments: List[float])

Bases: ShareableFeatureComponent

DescriptiveStatisticsSFC calculates a few descriptive statistics of the data.

It contains the following states:

total_count: Size of the data.

mean: the statistical mean of the data

minimum: the minimum element of the data

maximum: the maximum element of the data

central_moments: the central moments of the data up to order MAXIMUM_MOMENT_ORDER.

Mathematically: central_moments[i] = sum((x - mean)^i) / N, where N is total_count.
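
The central moment definition above can be sketched in plain Python as follows. This only illustrates the formula and is not the library's implementation; the function name central_moments and the max_order default of 4 (a stand-in for MAXIMUM_MOMENT_ORDER, whose value is not stated here) are assumptions.

```python
from typing import List, Sequence


def central_moments(data: Sequence[float], max_order: int = 4) -> List[float]:
    """Return [m_0, ..., m_max_order], where m_i = sum((x - mean)^i) / N."""
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    mean = sum(data) / n
    # Index i holds the central moment of order i (order 0 is always 1.0).
    return [sum((x - mean) ** i for x in data) / n for i in range(max_order + 1)]


moments = central_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(moments[2])  # second central moment, i.e. the population variance
```

A single-pass implementation would update these values incrementally rather than holding all of the data in memory.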

central_moments: List[float]
compute(column: Series, **kwargs: Any) None

Update the state of the DescriptiveStatisticsSFC using input series.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) DescriptiveStatisticsSFC

Factory method to create a DescriptiveStatisticsSFC. No config parameter is supported.

Returns

DescriptiveStatisticsSFC

An instance of DescriptiveStatisticsSFC.

get_central_moments(order: int) float | None

Get the central moment of the given order.

Parameters

order : int

Order of the moment; must be less than or equal to MAXIMUM_MOMENT_ORDER.

Returns

float : Central moment of the given order.

get_kurtosis() float | None

Get the Excess Kurtosis of data

Returns

float : Excess Kurtosis of the data

get_maximum() float | None

Get the Maximum of the data

Returns

float : Maximum value of the data.

get_mean() float | None

Get the mean of the data

Returns

float : Mean of the data.

get_minimum() float | None

Get the Minimum of the data

Returns

float : Minimum value of the data.

get_skewness() float | None

Get the Skewness of data

Returns

float : Skewness of the data

get_standard_deviation() float | None

Get the Standard Deviation of data

Returns

float : Standard Deviation of the data

get_total_count() int

Get the total count of the data

Returns

int : Total count of the data.

get_variance() float | None

Get the variance of data

Returns

float : Variance of the data
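
As a rough illustration of how the getters above relate to the stored central moments, the sketch below derives variance, standard deviation, skewness and excess kurtosis from moments of orders 0 through 4. It uses the population (divide by N) forms, which is an assumption; the library may apply a different convention. The name derived_statistics is hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def derived_statistics(moments: Sequence[float]) -> Dict[str, Optional[float]]:
    """Derive statistics from central moments m_0..m_4 (population form)."""
    m2, m3, m4 = moments[2], moments[3], moments[4]
    if m2 <= 0:
        # Constant data: skewness and kurtosis are undefined.
        return {"variance": m2, "std": 0.0, "skewness": None, "kurtosis": None}
    return {
        "variance": m2,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }


# Central moments of [1, 2, 3, 4, 5], orders 0 through 4.
stats = derived_statistics([1.0, 0.0, 2.0, 0.0, 6.8])
print(stats)
```

Returning None for undefined values mirrors the float | None return types of the getters, although the exact conditions under which the library returns None are not stated here.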

maximum: float
mean: float
merge(other: DescriptiveStatisticsSFC, **kwargs: Any) DescriptiveStatisticsSFC

Merge two DescriptiveStatisticsSFC instances into one, without mutating either of them.

Parameters

other : DescriptiveStatisticsSFC

Other DescriptiveStatisticsSFC to be merged.

Returns

DescriptiveStatisticsSFC

A new instance of DescriptiveStatisticsSFC after merging.
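
One common way to merge such summaries without revisiting the raw data is the pairwise update of count, mean and second central moment. The sketch below shows that technique for illustration only; PartialStats, from_values and merge are hypothetical names, and the library's actual merge (which also covers higher-order moments) may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PartialStats:
    count: int
    mean: float
    minimum: float
    maximum: float
    m2: float  # second central moment: sum((x - mean)^2) / count


def from_values(values: Sequence[float]) -> PartialStats:
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    return PartialStats(n, mean, min(values), max(values), m2)


def merge(a: PartialStats, b: PartialStats) -> PartialStats:
    """Combine two summaries into a new one; neither input is modified."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    # Pooled sum of squared deviations around the combined mean.
    ss = a.m2 * a.count + b.m2 * b.count + delta ** 2 * a.count * b.count / n
    return PartialStats(
        n, mean, min(a.minimum, b.minimum), max(a.maximum, b.maximum), ss / n
    )


merged = merge(from_values([1.0, 2.0, 3.0]), from_values([4.0, 5.0]))
print(merged)
```

The merged summary matches what from_values would produce on the concatenated data, up to floating-point rounding.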

minimum: float
total_count: int

mlm_insights.core.sfcs.distinct_count_sfc module

class mlm_insights.core.sfcs.distinct_count_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)

Bases: object

estimate: int
lower_bound: float
upper_bound: float
class mlm_insights.core.sfcs.distinct_count_sfc.DistinctCountSFC(sketch: hll_sketch)

Bases: ShareableFeatureComponent, Serializable

Provides estimates in a single pass for identifying cardinality estimate (with associated lower and upper bound)

Reference: https://datasketches.apache.org/docs/HLL/HLL.html

DistinctCountSFC contains only one state i.e., sketch: datasketches.hll_sketch.

Note: Use the create method instead of the constructor.

compute(column: Series, **kwargs: Any) None

Update the state of the DistinctCountSFC using input series.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCountSFC

Factory method to create a DistinctCountSFC. Supported configurable parameters:

DEFAULT_LOG_K: K-value to initialize hll_sketch, default = 12

Additionally, the HLL target for the resulting sketch is set to HLL_4 and is not configurable.

Returns

DistinctCountSFC

An Instance of DistinctCountSFC.

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) DistinctCountSFC

Create a new instance of DistinctCountSFC from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

DistinctCountSFC

New instance of DistinctCountSFC

get_cardinality() CardinalityItem

Returns the cardinality of input data, with associated lower and upper bound.

Returns

CardinalityItem

merge(other: DistinctCountSFC, **kwargs: Any) DistinctCountSFC

Merge two DistinctCountSFC instances into one with the help of an HLL union. The HLL union is updated with both instances and a merged DistinctCountSFC is returned.

Parameters

other : DistinctCountSFC

Other DistinctCountSFC to be merged.

Returns

DistinctCountSFC

A new instance of DistinctCountSFC after merging.
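
The point of a union-based merge is that the merged cardinality describes the union of the inputs rather than the sum of the two estimates. The sketch below illustrates that behaviour with exact Python sets standing in for the approximate HLL sketch; distinct_values and merge_distinct are hypothetical helpers, not part of the library.

```python
from typing import Iterable, Set


def distinct_values(column: Iterable[object]) -> Set[object]:
    """Exact stand-in for a cardinality sketch: the set of values seen."""
    return set(column)


def merge_distinct(a: Set[object], b: Set[object]) -> Set[object]:
    """Union of two summaries; returns a new set and leaves both inputs intact."""
    return a | b


day_1 = distinct_values(["a", "b", "c"])
day_2 = distinct_values(["b", "c", "d"])
merged = merge_distinct(day_1, day_2)
print(len(merged))  # 4 distinct values, not 3 + 3 = 6
```

An HLL sketch provides the same union behaviour in bounded memory, at the cost of returning an estimate with lower and upper bounds instead of an exact count.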

serialize(**kwargs: Any) bytes

Serialize the DistinctCountSFC to bytes. Since it has only one state, i.e., hll_sketch, the default serialization of datasketches is used.

Returns

bytes

Serialized bytes of the DistinctCountSFC.

sketch: hll_sketch

mlm_insights.core.sfcs.framework_sfc module

class mlm_insights.core.sfcs.framework_sfc.FrameworkShareableFeatureComponent(value)

Bases: Enum

Enum to store all framework-specific ShareableFeatureComponent classes.

CompressedProbabilityCountingSFC = <class 'mlm_insights.core.sfcs.cpc_sfc.CompressedProbabilityCountingSFC'>
DescriptiveStatisticsSFC = <class 'mlm_insights.core.sfcs.descriptive_statistics_sfc.DescriptiveStatisticsSFC'>
DistinctCountSFC = <class 'mlm_insights.core.sfcs.distinct_count_sfc.DistinctCountSFC'>
FrequentItemsSFC = <class 'mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemsSFC'>
QuantilesSFC = <class 'mlm_insights.core.sfcs.quantiles_sfc.QuantilesSFC'>

mlm_insights.core.sfcs.frequent_items_sfc module

class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemEstimate(value: str, estimate: int, lower_bound: int, upper_bound: int)

Bases: object

estimate: int
lower_bound: int
upper_bound: int
value: str
class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemsSFC(sketch: frequent_strings_sketch)

Bases: ShareableFeatureComponent, Serializable

Provides estimates in a single pass for:

  • identifying frequent items (aka heavy hitters) and

  • answering point queries (approximately how many times an item appears in a stream/dataset)

Reference: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

compute(column: Series, **kwargs: Any) None

Update the state of the FrequentItemsSFC using input series.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequentItemsSFC

Factory method to create the SFC. Use the create method instead of the constructor. Supported configurable parameters:

CONFIG_MAX_SIZE_KEY: Maximum size of counters, default = 7. Tune this parameter to control both space usage and error (a larger size means more space and less error).

Returns

FrequentItemsSFC

An Instance of FrequentItemsSFC.

classmethod deserialize(serialized_sketch: bytes, **kwargs: Any) FrequentItemsSFC

Create a new instance from serialized bytes.

Parameters

serialized_sketch : bytes

Serialized bytes as input.

Returns

FrequentItemsSFC

New instance of FrequentItemsSFC

get_frequency_estimate(item: Any) FrequentItemEstimate

Get a frequency estimate for a specific item, i.e., approximately how many times the item appears in the stream/dataset.

Parameters

item: Any

Item value to get the frequency estimate for

Returns

FrequentItemEstimate

get_frequent_items_estimates() List[FrequentItemEstimate]

Get a list of all the frequent item estimates from the processed data stream/data set

Returns

List[FrequentItemEstimate]

List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.

get_frequent_items_estimates_no_false_negatives() List[FrequentItemEstimate]

Get a list of all the frequent item estimates using the no-false-negatives mode of the Frequent Items sketch.

Returns

List[FrequentItemEstimate]

List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.

get_top_k_elements(k: int) List[FrequentItemEstimate]

Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise ‘k’ frequent items are returned.

Parameters

k: int

Count of how many top frequently occurring items to return.

Returns

List[FrequentItemEstimate]

List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
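
The top-k behaviour described above, including the case where k exceeds the number of captured items, can be illustrated with exact counts. This sketch uses collections.Counter as an exact stand-in for the approximate sketch; top_k is a hypothetical helper and returns plain (value, count) tuples rather than FrequentItemEstimate objects.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k(items: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return up to k (value, count) pairs, most frequent first."""
    # most_common(k) returns every item when k exceeds the number of keys.
    return Counter(items).most_common(k)


data = ["a"] * 5 + ["b"] * 3 + ["c"]
print(top_k(data, 2))   # the two most frequent values
print(top_k(data, 10))  # only three distinct values exist, so three pairs
```

Unlike this exact version, the sketch-backed method returns estimates together with lower and upper bounds.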

get_top_k_elements_using_no_false_negatives(k: int) List[FrequentItemEstimate]

Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise ‘k’ frequent items are returned. This method computes the top ‘k’ elements from the no-false-negatives result of the Frequent Items sketch, for cases where the sketch returns an empty list in the no-false-positives scenario.

Parameters

k: int

Count of how many top frequently occurring items to return.

Returns

List[FrequentItemEstimate]

List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.

get_total_count() int

Returns the total count of input data.

Returns

int :

total count of items in the data.

merge(other: FrequentItemsSFC, **kwargs: Any) FrequentItemsSFC

Merge two SFCs to produce a correct union, without mutating the others.

Parameters

other : FrequentItemsSFC

Other FrequentItemsSFC to be merged.

Returns

FrequentItemsSFC

A new instance of FrequentItemsSFC after merging.

serialize(**kwargs: Any) bytes

Serialize the FrequentItemsSFC to bytes. This allows the SFC to be persisted in a Profile.

Returns

bytes

Serialized bytes of the FrequentItemsSFC.

sketch: frequent_strings_sketch

mlm_insights.core.sfcs.quantiles_sfc module

class mlm_insights.core.sfcs.quantiles_sfc.QuantilesSFC(kll_sketch: kll_doubles_sketch)

Bases: ShareableFeatureComponent, Serializable

QuantilesSFC uses a streaming quantiles algorithm. It can be used to find quantiles, ranks, PMF and CDF.

QuantilesSFC contains only one state, i.e., kll_sketch: datasketches.kll_doubles_sketch.

Reference: https://datasketches.apache.org/docs/KLL/KLLSketch.html

Note: Use the create method instead of the constructor.

compute(column: Series, **kwargs: Any) None

Update the state of the QuantilesSFC using input series.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) QuantilesSFC

Factory method to create a QuantilesSFC. Supported configurable parameters:

KLL_K: K-value to initialize kll_doubles_sketch, default = 200

Returns

QuantilesSFC

An Instance of QuantilesSFC.

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) QuantilesSFC

Create a new instance of QuantilesSFC from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

QuantilesSFC

New instance of QuantilesSFC

get_maximum_value() float

Returns

float

Maximum value in the sketch

get_median() float

Returns an approximate median of input data.

Returns

float

Median of the data.

get_minimum_value() float

Returns

float

Minimum value in the sketch

get_quantile(rank: float) float

Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input data.

Returns

float

Quantile of the data of given rank
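
To make the notion of a normalized rank concrete, the sketch below computes an exact quantile from a sorted copy of the data. It is only an illustration: exact_quantile is a hypothetical helper, the index convention is one of several in common use, and the KLL-backed method returns an approximation without sorting the full data.

```python
from typing import Sequence


def exact_quantile(data: Sequence[float], rank: float) -> float:
    """Value at a normalized rank in [0, 1] of the sorted data."""
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must be in [0, 1]")
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)
    # Map the normalized rank to an index, clamping rank 1.0 to the last item.
    index = min(int(rank * len(ordered)), len(ordered) - 1)
    return ordered[index]


values = [50.0, 10.0, 40.0, 20.0, 30.0]
print(exact_quantile(values, 0.5))  # middle of the sorted data
```

With this convention, rank 0.0 yields the minimum and rank 1.0 the maximum of the data.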

get_size() int

Returns

int

Number of elements in the SFC

kll_sketch: kll_doubles_sketch
merge(other: QuantilesSFC, **kwargs: Any) QuantilesSFC

Merge two QuantilesSFC instances into one, without mutating either of them.

Parameters

other : QuantilesSFC

Other QuantilesSFC to be merged.

Returns

QuantilesSFC

A new instance of QuantilesSFC after merging.

serialize(**kwargs: Any) bytes

Serialize the QuantilesSFC to bytes. Since it has only one state, i.e., kll_sketch, the default serialization of datasketches is used.

Returns

bytes

Serialized bytes of the QuantilesSFC.

mlm_insights.core.sfcs.sfc_merge_exception module

exception mlm_insights.core.sfcs.sfc_merge_exception.SFCMergeException(message: str)

Bases: Exception

Exception raised when merging two ShareableFeatureComponent instances fails.

Attributes:

message – explanation of the error

mlm_insights.core.sfcs.sfc_registry module

class mlm_insights.core.sfcs.sfc_registry.SFCMetaData(klass: Type[ShareableFeatureComponent], config: Dict[str, Any] = <factory>)

Bases: object

SFCMetaData to store class type and config of ShareableFeatureComponent

config: Dict[str, Any]
get_hash() str

Get the hash of the SFCMetaData. The hash value is derived from the md5 hash of SFCMetaData.config.

Returns

str: The calculated hash of the SFCMetaData.
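
A minimal sketch of deriving a stable md5 hash from a config dictionary is shown below. How the library canonicalizes the config before hashing is not stated here, so the JSON serialization with sorted keys and the function name config_hash are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Return a hex md5 digest that is independent of key order."""
    # Sorting keys makes logically equal configs serialize identically.
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


print(config_hash({"KLL_K": 200}))
```

Because the digest depends only on the config content, two metadata objects with equal configs map to the same key.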

klass: Type[ShareableFeatureComponent]
class mlm_insights.core.sfcs.sfc_registry.SFCRegistry

Bases: object

add_sfc(sfc_metadata: SFCMetaData) SFCRegistry

Add ShareableFeatureComponent to the SFCRegistry

Parameters

sfc_metadata : SFCMetaData

Returns

SFCRegistry

static create_from_sfc_map(sfc_map: Dict[str, ShareableFeatureComponent]) SFCRegistry

Factory method to create an SFC Registry from an SFC map. Use this method to create the SFC registry directly from the SFC map.

Parameters

sfc_map : Dict[str, ShareableFeatureComponent]

Dictionary of sfc_map, hash as the Key and ShareableFeatureComponent as value.

static create_from_sfc_meta(sfc_metas: List[SFCMetaData]) SFCRegistry

Factory method to create an SFC Registry from a list of SFC metadata. For each SFC metadata, a hash is computed and a new SFC instance is created. If two metadata entries are the same, only one key is stored.

Parameters

sfc_metas : List[SFCMetaData]

List of SFCMetaData
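
The deduplication behaviour described above can be sketched with a dictionary keyed by a hash of the metadata. ToySFC, meta_key and build_registry are hypothetical names used only for illustration; including the class name in the key is also an assumption made so that different component types with identical configs do not collide.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple, Type


class ToySFC:
    """Hypothetical stand-in for a ShareableFeatureComponent."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = dict(config)


def meta_key(klass: Type[Any], config: Dict[str, Any]) -> str:
    payload = json.dumps(
        {"class": klass.__name__, "config": config}, sort_keys=True, default=str
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def build_registry(metas: List[Tuple[Type[Any], Dict[str, Any]]]) -> Dict[str, Any]:
    registry: Dict[str, Any] = {}
    for klass, config in metas:
        key = meta_key(klass, config)
        # Identical metadata hashes to the same key, so only one instance is kept.
        if key not in registry:
            registry[key] = klass(config)
    return registry


registry = build_registry(
    [(ToySFC, {"k": 200}), (ToySFC, {"k": 200}), (ToySFC, {"k": 400})]
)
print(len(registry))  # 2: the duplicate metadata shares one entry
```

Sharing one instance per distinct metadata is what allows several consumers with the same requirements to reuse a single component.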

classmethod deserialize(sfc_registry_message: SFCRegistryMessage) SFCRegistry

Deserialize the Protobuf message to an SFCRegistry.

Returns

SFCRegistry

get_sfc(sfc_meta: SFCMetaData) ShareableFeatureComponent

Get the ShareableFeatureComponent from the SFCMetaData.

Parameters

sfc_meta : SFCMetaData

Returns

ShareableFeatureComponent

Raises

KeyError

Raised if the SFCMetaData is not found in the registry.

get_sfc_map() Dict[str, ShareableFeatureComponent]

Get the mapping from SFCMetaData hash to ShareableFeatureComponent.

Returns

Dict[str, ShareableFeatureComponent]:

get_sfcs() Any

Get all ShareableFeatureComponent instances stored in the registry.

Returns

All values of ShareableFeatureComponent

serialize() SFCRegistryMessage

Serialize the SFCRegistry to a Protobuf message.

Returns

SFCRegistryMessage

Module contents