mlm_insights.core.metrics.drift_metrics package

Submodules

mlm_insights.core.metrics.drift_metrics.chi_square module

class mlm_insights.core.metrics.drift_metrics.chi_square.ChiSquare(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, epsilon_value: float = 0.0001, _max_size_k: int = 7)

Bases: MetricBase

Data drift metric that computes the chi-square goodness-of-fit test.
The chi-square test evaluates the null hypothesis that the categorical data has the given frequencies.
It can process only categorical data types (nominal, ordinal, binary).
It is an approximate metric.
It is used for model drift computation and takes both the reference and current profiles into account.

Configuration

epsilon_value: float, default = 0.0001
  • Replacement value for zero or near-zero entries. Any array element less than or equal to epsilon_value is replaced with epsilon_value. Some drift algorithms need this to avoid invalid results; for example, a zero denominator would cause a division-by-zero error.

_max_size_k: int, default = 7
  • Maximum size, in log2, of k. The value must be between 7 and 21, inclusive
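
The epsilon replacement described above can be sketched in a few lines of plain Python. This is only an illustration of the rule as documented (elements less than or equal to epsilon become epsilon); the helper name replace_small_values is hypothetical and is not part of mlm_insights.

```python
from typing import List, Sequence


def replace_small_values(values: Sequence[float], epsilon: float = 0.0001) -> List[float]:
    """Return a copy of `values` where every element <= epsilon becomes epsilon.

    Hypothetical helper for illustration; not the library's implementation.
    """
    return [epsilon if v <= epsilon else float(v) for v in values]


# Hypothetical category counts; the last category was never observed.
reference_counts = [3, 1, 1, 1, 0]
safe_counts = replace_small_values(reference_counts)
print(safe_counts)
```

After the replacement, every entry is strictly positive, so it can be used safely as a denominator.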

Returns

  • algorithm: string: Drift Algorithm Name
    • “Chi Squared Goodness of Fit Test”

  • test_statistic: float: Test Statistic
    • The chi-squared test statistic

  • p_value: float: p value
    • The P-value is the area under the density curve of this chi-square distribution to the right of the value of the test statistic.
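
As a rough illustration of the test statistic, the sketch below computes a chi-square goodness-of-fit statistic directly from raw category values. It assumes that expected counts are the reference proportions scaled to the current sample size and that zero expected counts are floored at epsilon; both are assumptions made for this example, and the library itself works from profile sketches rather than raw data. The function name chi_square_statistic is hypothetical. The p-value is omitted because the standard library has no chi-square distribution.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def chi_square_statistic(reference: Iterable[Hashable],
                         current: Iterable[Hashable],
                         epsilon: float = 0.0001) -> float:
    """Chi-square goodness-of-fit statistic of `current` against `reference`.

    Hypothetical helper for illustration; not the library's implementation.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    ref_total = sum(ref_counts.values())
    cur_total = sum(cur_counts.values())
    if ref_total == 0 or cur_total == 0:
        raise ValueError("both samples must be non-empty")

    statistic = 0.0
    for category in set(ref_counts) | set(cur_counts):
        observed = cur_counts.get(category, 0)
        # Expected count: reference proportion scaled to the current sample size.
        expected = ref_counts.get(category, 0) / ref_total * cur_total
        # Floor the expected count so the division below is always defined.
        expected = max(expected, epsilon)
        statistic += (observed - expected) ** 2 / expected
    return statistic


reference = ['bus', 'bus', 'car', 'car']
current = ['bus', 'bus', 'bus', 'car']
print(chi_square_statistic(reference, current))
```

Note that a category seen only in the current data has an expected count of epsilon, so it contributes a very large term to the statistic.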

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.chi_square import ChiSquare
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'mode_of_transport': FeatureType(
        data_type=DataType.TEXT,
        variable_type=VariableType.NOMINAL,
        column_type=ColumnType.INPUT)
}


def get_metrics():
    uni_variate_metrics = {
        "mode_of_transport": [MetricMetadata(klass=ChiSquare)]

    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details


def do_run(data_frame):
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile


def main():
    reference_data_frame = pd.DataFrame({'mode_of_transport': ['bus', 'bus', 'train', 'walk', 'bus', 'car']})
    target_data_frame = pd.DataFrame({'mode_of_transport': ['bus', 'bus', 'bus', 'cycle', 'bus', 'car']})

    # profile the reference data, then the target data
    reference_profile = do_run(data_frame=reference_data_frame)
    target_profile = do_run(data_frame=target_data_frame)

    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['mode_of_transport']["ChiSquare"])


if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'ChiSquare',
    'metric_description': 'Data Drift Metric to compute Chi-square goodness of fit test',
    'variable_count': 3,
    'variable_names': ['algorithm', 'test_statistic', 'p_value'],
    'variable_types': [TEXT, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': ['ChiSquare', 0.5, 0.5],
    'metadata': {},
    'error': None
}
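
Because variable_names and metric_data are parallel lists, the values are easier to read when paired by name. The sketch below assumes the result is available as a plain dictionary with the keys shown above; the sample dictionary is copied from that structure, and the helper name to_named_values is hypothetical.

```python
from typing import Any, Dict

# Sample result shaped like the structure shown above. Only the keys used
# here are included so the snippet runs without the library.
standard_result: Dict[str, Any] = {
    'metric_name': 'ChiSquare',
    'variable_count': 3,
    'variable_names': ['algorithm', 'test_statistic', 'p_value'],
    'metric_data': ['ChiSquare', 0.5, 0.5],
    'metadata': {},
    'error': None,
}


def to_named_values(result: Dict[str, Any]) -> Dict[str, Any]:
    """Pair each variable name with its value (hypothetical helper)."""
    if result.get('error') is not None:
        raise ValueError(f"metric failed: {result['error']}")
    return dict(zip(result['variable_names'], result['metric_data']))


named = to_named_values(standard_result)
print(named['test_statistic'], named['p_value'])
```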
classmethod create(config: Dict[str, ConfigParameter] | None = None) ChiSquare

Factory Method to create an object.

Returns

ChiSquare: An instance of ChiSquare.

epsilon_value: float = 0.0001
get_required_shareable_feature_components() List[SFCMetaData]

Returns the list of Shareable Feature Components (SFCs) that this metric requires: a single SFC, the Frequent Items SFC.

Returns

List[SFCMetaData]: A list containing a single entry, the Frequent Items SFC.

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: ChiSquare, **kwargs: Any) ChiSquare

Merges two ChiSquare metrics into a new one without mutating either input.

Parameters

other_metric: ChiSquare

The other ChiSquare metric to merge.

Returns

TypeMetric

A new instance of ChiSquare
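
The merge contract described here (return a new object and leave both inputs untouched) can be illustrated with a toy class. CategoryCounts below is hypothetical and says nothing about how ChiSquare stores its state; it only shows the non-mutating pattern.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CategoryCounts:
    """Hypothetical mergeable state; not part of mlm_insights."""
    counts: Counter = field(default_factory=Counter)

    def merge(self, other: "CategoryCounts") -> "CategoryCounts":
        # Counter addition builds a new Counter, so neither operand is modified.
        return CategoryCounts(counts=self.counts + other.counts)


left = CategoryCounts(Counter({'bus': 3, 'car': 1}))
right = CategoryCounts(Counter({'bus': 1, 'walk': 2}))
merged = left.merge(right)
print(merged.counts)
```

Returning a new instance keeps partial results reusable, which matters when the same partition is merged into more than one aggregate.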

mlm_insights.core.metrics.drift_metrics.drift_metrics_helper module

mlm_insights.core.metrics.drift_metrics.drift_metrics_helper.get_quantiles_sfcs(metric_metadata: MetricMetadata, metric: MetricBase, kwargs: Any) Tuple[Any, Any, float, float]
mlm_insights.core.metrics.drift_metrics.drift_metrics_helper.validate_metric_can_be_computed(current_profile: Profile, reference_profile: Profile, feature_name: str, metric_metadata: MetricMetadata) None

mlm_insights.core.metrics.drift_metrics.jensen_shannon module

class mlm_insights.core.metrics.drift_metrics.jensen_shannon.JensenShannon(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

Data drift metric that computes the Jensen-Shannon distance between two probability distributions.
This is the square root of the Jensen-Shannon divergence.
It can process only numerical data types (int, float).
It is an approximate metric.
It is used for model drift computation and takes both the reference and current profiles into account.

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats
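
For reference, Sturges' rule derives the number of bins from the sample size alone. The sketch below uses the common form ceil(log2(n) + 1); whether the library computes its bins exactly this way is an assumption, and the function name sturges_bin_count is hypothetical.

```python
import math


def sturges_bin_count(sample_size: int) -> int:
    """Number of histogram bins suggested by Sturges' rule (illustrative helper)."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return math.ceil(math.log2(sample_size) + 1)


for n in (4, 100, 10_000):
    print(n, sturges_bin_count(n))
```

The bin count grows only logarithmically, so even large samples produce a modest number of bins.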

Returns

  • algorithm: string: Drift Algorithm Name
    • “Jensen Shannon Distance”

  • drift_score: float: Drift Score
    • The Jensen-Shannon distance between two probability distributions
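
To make the drift score concrete, here is a minimal sketch of the Jensen-Shannon distance for two already-binned distributions. It assumes both inputs are probability vectors over the same bins (each summing to 1) and uses the natural logarithm; the log base used by the library is not stated here, so treat that as an assumption. The function names are hypothetical.

```python
import math
from typing import Sequence


def _kl_terms(p: Sequence[float], q: Sequence[float]) -> float:
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence (illustrative helper)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_terms(p, m) + 0.5 * _kl_terms(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(divergence, 0.0))


reference_bins = [0.5, 0.5, 0.0]
current_bins = [0.25, 0.25, 0.5]
print(jensen_shannon_distance(reference_bins, current_bins))
```

With the natural logarithm the distance ranges from 0 for identical distributions to sqrt(ln 2) for distributions with no overlap.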

Examples

import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.jensen_shannon import JensenShannon
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}


def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=JensenShannon)]

    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details


def do_run(data_frame):
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile


def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # profile the reference data, then the target data
    reference_profile = do_run(data_frame=reference_data_frame)
    target_profile = do_run(data_frame=target_data_frame)

    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["JensenShannon"])


if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'JensenShannon',
    'metric_description': 'Data Drift Metric to compute Jensen Shannon distance between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['JensenShannon', 0.5],
    'metadata': {},
    'error': None
}
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) JensenShannon

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns the list of SFCs required to compute the Jensen-Shannon metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: JensenShannon, **kwargs: Any) JensenShannon

Merges two JensenShannon metrics into a new one without mutating either input.

Parameters

other_metric: JensenShannon

The other JensenShannon metric to merge.

Returns

TypeMetric

A new instance of JensenShannon

mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov module

class mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov.KolmogorovSmirnov(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 100, _kll_k: int = 500)

Bases: MetricBase

Performs the two-sample Kolmogorov-Smirnov test for goodness of fit.
The asymptotic Kolmogorov-Smirnov distribution is used to compute an approximate p-value.
The Kolmogorov-Smirnov test is a nonparametric statistical test of whether two probability distributions differ,
that is, whether two data samples come from the same distribution.
The test statistic for the two-sample test is the greatest distance between the cumulative distribution functions (CDFs) of the two samples.
Null hypothesis: the samples are drawn from the same distribution.
It can process only numerical data types (int, float).
It is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 500.
It is used for model drift computation and takes both the reference and current profiles into account.

Configuration

bins: Union[str, int, List[float]], default=100
One of the following values
- Number of bins
- Binning algorithm, for example 'sturges'
- Bins: List of floats

_kll_k: int, default=500
  • Buffer size for the KLL sketch

Returns

  • algorithm: string: Drift Algorithm Name
    • “Kolmogorov Smirnov”

  • test_statistic: float: Test Statistic
    • The KS test statistic

  • p_value: float: p value
    • The probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true
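
The two quantities above can be sketched on raw samples with the standard library alone. The statistic is the largest gap between the two empirical CDFs, and the p-value uses the asymptotic Kolmogorov series. This is an exact computation on small lists for illustration only; the library works from sketches, so its numbers may differ. The function name ks_two_sample and the terms parameter are hypothetical.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_two_sample(reference: Sequence[float],
                  current: Sequence[float],
                  terms: int = 100) -> Tuple[float, float]:
    """Two-sample KS statistic and asymptotic p-value (illustrative helper)."""
    ref = sorted(reference)
    cur = sorted(current)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # The largest gap between two step functions occurs at an observed value.
    statistic = max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in ref + cur
    )

    # Asymptotic Kolmogorov distribution: Q(lam) = 2 * sum (-1)^(j-1) exp(-2 j^2 lam^2)
    lam = math.sqrt(n * m / (n + m)) * statistic
    if lam == 0.0:
        return statistic, 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    # The truncated series is inaccurate for very small lam, so clamp to [0, 1].
    p_value = min(1.0, max(0.0, 2.0 * series))
    return statistic, p_value


statistic, p_value = ks_two_sample([10, 10, 10, 10], [20, 21.2, 10, 11.3])
print(statistic, p_value)
```

For these two samples the statistic is 0.75: at the value 10 the reference CDF is already 1.0 while the target CDF is only 0.25.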

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov import KolmogorovSmirnov
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}


def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=KolmogorovSmirnov)]

    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details


def do_run(data_frame):
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile


def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # profile the reference data, then the target data
    reference_profile = do_run(data_frame=reference_data_frame)
    target_profile = do_run(data_frame=target_data_frame)

    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["KolmogorovSmirnov"])


if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'KolmogorovSmirnov',
    'metric_description': 'Data Drift Metric to compute two-sample Kolmogorov-Smirnov test for goodness of fit',
    'variable_count': 3,
    'variable_names': ['algorithm', 'test_statistic', 'p_value'],
    'variable_types': [TEXT, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': ['KolmogorovSmirnov', 0.5, 0.5],
    'metadata': {},
    'error': None
}
bins: str | int | List[float] = 100
classmethod create(config: Dict[str, ConfigParameter] | None = None) KolmogorovSmirnov

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns the list of SFCs required to compute the KS metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: KolmogorovSmirnov, **kwargs: Any) KolmogorovSmirnov

Merges two KolmogorovSmirnov metrics into a new one without mutating either input.

Parameters

other_metric: KolmogorovSmirnov

The other KolmogorovSmirnov metric to merge.

Returns

TypeMetric

A new instance of KolmogorovSmirnov

mlm_insights.core.metrics.drift_metrics.kullback_leibler module

class mlm_insights.core.metrics.drift_metrics.kullback_leibler.KullbackLeibler(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

Data drift metric that computes the Kullback-Leibler divergence between two probability distributions.
It is an approximate metric. It can process only numerical data types (int, float).
It is used for model drift computation and takes both the reference and current profiles into account.

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats

Returns

  • algorithm: string: Drift Algorithm Name
    • “Kullback Leibler Divergence”

  • drift_score: float: Drift Score
    • The KL divergence between two probability distributions
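
A minimal sketch of the Kullback-Leibler divergence for two binned distributions follows. It assumes both inputs are probability vectors over the same bins and uses the natural logarithm. Which profile plays the role of p and which of q inside the library is not stated here, so the argument order is an assumption, and the function name kl_divergence is hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) with the natural logarithm (illustrative helper)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ because KL divergence is not symmetric, which is one reason it is called a divergence rather than a distance.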

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.kullback_leibler import KullbackLeibler
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}


def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=KullbackLeibler)]

    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details


def do_run(data_frame):
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile


def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # profile the reference data, then the target data
    reference_profile = do_run(data_frame=reference_data_frame)
    target_profile = do_run(data_frame=target_data_frame)

    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["KullbackLeibler"])


if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'KullbackLeibler',
    'metric_description': 'Data Drift Metric to compute Kullback-Leibler divergence between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['KullbackLeibler', 0.5],
    'metadata': {},
    'error': None
}
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) KullbackLeibler

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns the list of SFCs required to compute the KL metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: KullbackLeibler, **kwargs: Any) KullbackLeibler

Merges two KullbackLeibler metrics into a new one without mutating either input.

Parameters

other_metric: KullbackLeibler

The other KullbackLeibler metric to merge.

Returns

TypeMetric

A new instance of KullbackLeibler

mlm_insights.core.metrics.drift_metrics.population_stability_index module

class mlm_insights.core.metrics.drift_metrics.population_stability_index.PopulationStabilityIndex(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges', _kll_k: int = 500)

Bases: MetricBase

Data drift metric that computes the Population Stability Index (PSI) between two probability distributions.
It can process only numerical data types (int, float).
It is an approximate metric.
It is used for model drift computation and takes both the reference and current profiles into account.

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats

Returns

  • algorithm: string: Drift Algorithm Name
    • “Population Stability Index”

  • drift_score: float: Drift Score
    • The PSI between a probability distribution and a reference probability distribution
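
The usual PSI formula sums (current - reference) * ln(current / reference) over the bins. The sketch below applies it to two binned distributions. Flooring empty bins at a small value is an assumption added here so the logarithm stays defined; the floor parameter and the function name population_stability_index are hypothetical and are not taken from the library.

```python
import math
from typing import Sequence


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float],
                               floor: float = 0.0001) -> float:
    """PSI between two binned distributions (illustrative helper)."""
    if len(reference) != len(current):
        raise ValueError("both distributions must have the same number of bins")
    psi = 0.0
    for ref_share, cur_share in zip(reference, current):
        # Assumed floor: keeps the ratio and the logarithm finite for empty bins.
        r = max(ref_share, floor)
        c = max(cur_share, floor)
        psi += (c - r) * math.log(c / r)
    return psi


reference_bins = [0.5, 0.5]
current_bins = [0.25, 0.75]
print(population_stability_index(reference_bins, current_bins))
```

Each term is non-negative because the difference and the logarithm always share the same sign, so PSI is zero only when the two distributions match bin by bin.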

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.population_stability_index import PopulationStabilityIndex
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}


def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=PopulationStabilityIndex)]

    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details


def do_run(data_frame):
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile


def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # profile the reference data, then the target data
    reference_profile = do_run(data_frame=reference_data_frame)
    target_profile = do_run(data_frame=target_data_frame)

    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["PopulationStabilityIndex"])


if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'PopulationStabilityIndex',
    'metric_description': 'Data Drift Metric to compute Population Stability Index(PSI) distance between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['PopulationStabilityIndex', 0.5],
    'metadata': {},
    'error': None
}
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) PopulationStabilityIndex

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns the list of SFCs required to compute the PSI metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: PopulationStabilityIndex, **kwargs: Any) PopulationStabilityIndex

Merges two PopulationStabilityIndex metrics into a new one without mutating either input.

Parameters

other_metric: PopulationStabilityIndex

The other PopulationStabilityIndex metric to merge.

Returns

TypeMetric

A new instance of PopulationStabilityIndex