mlm_insights.core.metrics.data_quality package


mlm_insights.core.metrics.data_quality.cramers_v_correlation module

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CorrelationSummary(cramers_v_correlation: float, p_value: float = nan)

Bases: object

cramers_v_correlation: float
p_value: float = nan
class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This metric computes the Cramer’s V correlation matrix and the p-value matrix for the user-provided features.
It is a dataset-level metric that processes categorical data types.
This is an approximate multivariate metric.
Internally, it uses a sketch data structure with a default K value of 1024.
Cramer’s V is used as the measure of association between n categorical features.
This metric handles NaN values and can be used for feature importance.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values = [ True False  True False  True]

Applying valid_corresponding_column_values to a and b:
a = a[valid_corresponding_column_values]
b = b[valid_corresponding_column_values]
# a = [1, 8, 9]
# b = [5, 7, 10]
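Cramer’s V ranges from 0 to 1, where:
  • 0 indicates no association between the two variables.

  • 1 indicates a perfect association between the two variables.

Cramer’s V is computed as the square root of the chi-squared statistic divided by the product of the sample size and the smaller table dimension minus 1: V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the sample size and r and c are the numbers of distinct categories in the two features.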
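The formula can be checked by hand on a small dataset. The following is a minimal sketch in plain Python, using the transport and gender columns from the usage example in this section. It computes the exact statistic from the full contingency table, whereas the metric itself is sketch-based and approximate; the helper name cramers_v is illustrative and not part of mlm_insights.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Exact Cramer's V for two equal-length sequences of categories.

    Illustrative helper only; assumes each feature has at least two categories.
    """
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))  # observed contingency table
    x_counts = Counter(xs)              # row totals
    y_counts = Counter(ys)              # column totals

    # Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for x, row_total in x_counts.items():
        for y, col_total in y_counts.items():
            expected = row_total * col_total / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * min_dim))


transport = ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car']
gender = ['M', 'M', 'F', 'F', 'M', 'M', 'F']
v = cramers_v(transport, gender)
print(round(v, 4))  # 0.6455
```

The value agrees with the off-diagonal entry of the matrix in the documented result for this data.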


lg_max_k: int, default=10
  • Maximum size of k, expressed as log2(k). The value must be between 7 and 21, inclusive. The default of 10 corresponds to k = 1024.

ignore_invalid_data_types: bool, default=True
  • Flag for ignoring invalid data types.

  • If set to True, non-categorical features are ignored; otherwise, the metric raises an error. For example, Cramer’s V only handles categorical data types, so all non-categorical features are dropped.

feature_list: List[str]
  • List of feature names; the correlation is computed for each pair of the provided features. Between 2 and 50 features are supported, inclusive.


feature_list: List[str]
  • list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix

p_values: numpy.typing.NDArray[np.float64]
  • Matrix of p-values. The p-value is the probability of observing an association at least as strong as the one in the sample data when the null hypothesis (no association between the two features) is in fact true. A low p-value leads you to reject the null hypothesis; a typical rejection threshold is a p-value of 0.05.


Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 categorical features are supported for computation.


  • InvalidParameterException - if a column name is not present in the provided dataset

  • MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT

  • ValueError - if the compared columns have no corresponding values to compare (all are NaN)

  • TypeError - if feature_list is not passed as a list


from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation
import pandas as pd

def main():
    # The FeatureType arguments after data_type are truncated in the source;
    # variable_type=VariableType.NOMINAL is assumed here.
    input_schema = {
        'transport': FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
        'gender': FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL)
    }

    data_frame = pd.DataFrame({'transport': ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'],
                               'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F']})
    feature1: str = 'transport'
    feature2: str = 'gender'
    correlation_metrics = [
        MetricMetadata(klass=CramersVCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    # The remaining MetricDetail arguments are truncated in the source;
    # dataset_metrics=correlation_metrics is assumed here.
    metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)

    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    # The right-hand side is missing in the source; runner.run() is assumed here.
    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    cramers_actual_value = dataset_metrics.get_result()['value']
    cramers_correlation_matrix = cramers_actual_value['matrix']
    p_value_matrix = cramers_actual_value['p_values']

    # Map each feature name to its row/column index in the matrices
    feature_map = {value: index for index, value in enumerate(cramers_actual_value[FEATURE_LIST])}

    cramers_v_value_for_feature1_feature2 = round(
        cramers_correlation_matrix[feature_map[feature1]][feature_map[feature2]], 4)
    p_value_for_feature1_feature2 = round(
        p_value_matrix[feature_map[feature1]][feature_map[feature2]], 4)

Returns the metric result as:
  {
      'value': {
          'matrix': array([[1.        , 0.64549722],
                           [0.64549722, 1.        ]]),
          'p_values': array([[0.00815097, 0.40465279],
                             [0.40465279, 0.01265042]]),
          'feature_list': ['transport', 'gender']
      }
  }

compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the CramersVCorrelation using dataset


dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CramersVCorrelation

Create a CramersVCorrelation data quality metric using the configuration and kwargs


config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CramersVCorrelation

Create a new instance from serialized bytes.



Serialized bytes as input.



New instance of Serializable

feature_list: List[str]
feature_pair_mapping: Dict[str, CramersVCorrelationState]
get_result(**kwargs: Any) Dict[str, Any]

Returns CramersVCorrelation data quality metric


Json object: CramersVCorrelation of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns CramersVCorrelation Metric and P_values in Standard format.


StandardMetricResult: CramersVCorrelation Metric and P_values in standard format.

merge(other: CramersVCorrelation, **kwargs: Any) CramersVCorrelation

Merge two CramersVCorrelation instances into one without mutating either. The sketch is updated with the new partition’s pair values from column1 and column2.



The other CramersVCorrelation to be merged.



A new instance of CramersVCorrelation

serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.


bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState(sketch: _datasketches.frequent_strings_sketch, total_count: int = 0, feature1: str = '', feature2: str = '')

Bases: object

feature1: str = ''
feature2: str = ''
sketch: frequent_strings_sketch
total_count: int = 0

mlm_insights.core.metrics.data_quality.pearson_correlation module

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This metric computes the Pearson correlation coefficient matrix for the user-provided features.
The Pearson correlation coefficient takes values between -1 and 1.
It is a dataset-level metric that processes numeric data types.
This is an exact multivariate metric.
This metric handles NaN values.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values = [ True False  True False  True]

Applying valid_corresponding_column_values to a and b:
a = a[valid_corresponding_column_values]
b = b[valid_corresponding_column_values]
# a = [1, 8, 9]
# b = [5, 7, 10]

This metric can be used for feature importance. The coefficient ranges from -1 to 1, where:

  • -1 indicates a perfect negative linear relationship between variables

  • 0 indicates no linear relationship between variables

  • 1 indicates a perfect positive linear relationship between variables

The Pearson correlation coefficient is computed as the covariance of the two variables divided by the square root of the product of their variances.
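As a minimal sketch of this computation in plain Python, the helper below drops every pair in which either value is NaN (as in the NaN handling example above) and then divides the covariance numerator by the square root of the product of the variance numerators. The function name pearson is illustrative and not part of mlm_insights; the data is taken from the usage example in this section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation with pairwise NaN removal (illustrative helper)."""
    # Keep only the positions where both values are present
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n

    # The 1/n factors of covariance and variance cancel, so numerators suffice
    cov_num = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x_num = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y_num = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov_num / math.sqrt(var_x_num * var_y_num)


house_price = [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7]
square_feet = [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]
r = pearson(house_price, square_feet)
print(round(r, 4))  # 1.0
```

In this data square_feet is always house_price + 4, so the relationship is perfectly linear and the coefficient is 1.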


ignore_invalid_data_types: bool, default=True
  • Flag for ignoring invalid data types.

  • If set to True, non-numeric features are ignored; otherwise, the metric raises an error. For example, Pearson correlation only handles numerical data types, so all non-numerical features are dropped.

feature_list: List[str]
  • List of feature names; the correlation is computed for each pair of the provided features. Between 2 and 50 features are supported, inclusive.


feature_list: List[str]
  • list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix


Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 numerical features are supported for computation.


  • InvalidParameterException - if a column name is not present in the provided dataset

  • MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT

  • TypeError - if feature_list is not passed as a list


from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.pearson_correlation import PearsonCorrelation
import pandas as pd

def main():
    # The FeatureType arguments after data_type are truncated in the source;
    # variable_type=VariableType.CONTINUOUS is assumed here.
    input_schema = {
        'square_feet': FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS),
        'house_price': FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS)
    }

    data_frame = pd.DataFrame({'house_price': [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7],
                               'square_feet': [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]})
    feature1: str = 'house_price'
    feature2: str = 'square_feet'
    correlation_metrics = [
        MetricMetadata(klass=PearsonCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    # The remaining MetricDetail arguments are truncated in the source;
    # dataset_metrics=correlation_metrics is assumed here.
    metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)

    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    # The right-hand side is missing in the source; runner.run() is assumed here.
    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

Returns the metric result as:
  {
      'value': {
          'matrix': array([[1., 1.],
                           [1., 1.]]),
          'feature_list': ['house_price', 'square_feet']
      }
  }

compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the PearsonCorrelation metric using dataset


dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) PearsonCorrelation

Create a PearsonCorrelation data quality metric using the configuration and kwargs


config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) PearsonCorrelation

Create a new instance from serialized bytes.



Serialized bytes as input.



New instance of Serializable

feature_list: List[str]
feature_pair_mapping: Dict[str, PearsonCorrelationState]
get_required_shareable_feature_components(**kwargs: Any) Dict[str, List[SFCMetaData]]

Returns the Shareable Feature Components for 2 input features

get_result(**kwargs: Any) Dict[str, Any]

Returns Pearson’s Correlation 2-D matrix for set of features, using the DescriptiveStatisticsSFC


Json object: Pearson’s Correlation 2-D matrix for n features

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Pearson’s Correlation Metric and P_values in Standard format.


StandardMetricResult: Pearson’s Correlation Metric and P_values in standard format.

merge(other: PearsonCorrelation, **kwargs: Any) PearsonCorrelation

Merge two PearsonCorrelation instances into one without mutating either. The merge proceeds in three steps:
  1. Calculate cumulative_col12_count
  2. Calculate the combined mean for feature column1 and column2
  3. Calculate the numerator of the covariance of column1 and column2



The other PearsonCorrelation to be merged.



A new instance of PearsonCorrelation
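The three merge steps can be illustrated with the standard pairwise update for a covariance numerator (co-moment). The sketch below is an assumption about how such a merge can work, not the library's implementation: PairState, state_from and merge_states are hypothetical names, and PairState is only a simplified stand-in for PearsonCorrelationState.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairState:
    """Hypothetical per-pair state; not the library's PearsonCorrelationState."""
    count: int
    mean1: float
    mean2: float
    comoment: float  # sum of (x - mean1) * (y - mean2), the covariance numerator


def state_from(xs, ys):
    """Build the state for one partition of paired values."""
    n = len(xs)
    m1, m2 = sum(xs) / n, sum(ys) / n
    c = sum((x - m1) * (y - m2) for x, y in zip(xs, ys))
    return PairState(n, m1, m2, c)


def merge_states(a, b):
    """Combine two partition states into a new one; inputs are not mutated."""
    n = a.count + b.count                                  # step 1: cumulative count
    mean1 = (a.count * a.mean1 + b.count * b.mean1) / n    # step 2: combined means
    mean2 = (a.count * a.mean2 + b.count * b.mean2) / n
    comoment = (a.comoment + b.comoment                    # step 3: covariance numerator
                + (a.mean1 - b.mean1) * (a.mean2 - b.mean2)
                * a.count * b.count / n)
    return PairState(n, mean1, mean2, comoment)


house_price = [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7]
square_feet = [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]
part_a = state_from(house_price[:4], square_feet[:4])
part_b = state_from(house_price[4:], square_feet[4:])
merged = merge_states(part_a, part_b)
full = state_from(house_price, square_feet)
print(merged.count)  # 11
print(abs(merged.comoment - full.comoment) < 1e-9)  # True
```

Merging the two partition states gives the same count, means and covariance numerator as processing the whole dataset at once, which is what allows the metric to be computed per partition.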

serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.


bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState(cumulative_partition_count: int = 0, column1_mean: float = nan, column2_mean: float = nan, covariance_col1_col2: float = nan, feature1: str = '', feature2: str = '')

Bases: object

column1_mean: float = nan
column2_mean: float = nan
covariance_col1_col2: float = nan
cumulative_partition_count: int = 0
feature1: str = ''
feature2: str = ''

mlm_insights.core.metrics.data_quality.correlation_ratio module

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState] = <factory>, categorical_features: ~typing.List[str] = <factory>, numerical_features: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This dataset-level metric computes the correlation matrix for user-provided categorical and numerical features.
This is an approximate multivariate metric.
The correlation ratio is used as the correlation measure between n categorical and m numerical features.
This metric handles NaN values.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values = [ True False  True False  True]

Applying valid_corresponding_column_values to a and b:
a = a[valid_corresponding_column_values]
b = b[valid_corresponding_column_values]
# a = [1, 8, 9]
# b = [5, 7, 10]
The correlation ratio ranges from 0 to 1, where:
  • 0 indicates no dispersion among the means of the different categories

  • 1 indicates no dispersion within the respective categories

  • NaN is returned when all data points of the complete population take the same value

The correlation ratio (η) is a measure of the relationship between the statistical dispersion within individual categories and the dispersion across the whole population or sample.
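A minimal sketch of the exact statistic in plain Python, assuming the usual definition of η as the square root of the between-category sum of squares divided by the total sum of squares. The helper name correlation_ratio is illustrative and not part of mlm_insights; the data is taken from the usage example in this section.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Exact correlation ratio (eta) for a categorical/numerical pair (illustrative)."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)
    # Dispersion of the category means around the overall mean
    ss_between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                     for g in groups.values())
    # Dispersion of all data points around the overall mean
    ss_total = sum((v - overall_mean) ** 2 for v in values)

    if ss_total == 0:
        return math.nan  # every data point has the same value
    return math.sqrt(ss_between / ss_total)


pclass = [3, 3, 2, 3, 3, 3, 3, 2, 3, 3]
age = [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]
eta = correlation_ratio(pclass, age)
print(round(eta, 5))  # 0.50199
```

The value agrees with the matrix entry in the documented result for this data.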


feature_list: List[str]
  • List of feature names; the correlation is computed for each pair of the provided features. Between 2 and 50 features are supported, inclusive.


matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix

categorical_features: List[str]
  • list of user provided categorical feature inputs

numerical_features: List[str]
  • list of user provided numerical feature inputs


Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 features are supported, counting categorical and numerical features together.


  • InvalidParameterException - if a column name is not present in the provided dataset

  • MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT, or if at least 1 numerical and 1 categorical feature column name are not provided

  • ValueError - if the compared columns have no corresponding values to compare (all are NaN)

  • TypeError - if feature_list is not passed as a list


import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST, CATEGORICAL_FEATURES, NUMERICAL_FEATURES
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.data_quality.correlation_ratio import CorrelationRatio
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        "Pclass": FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
        "age": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
    }

    data_frame = pd.DataFrame({'Pclass': [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
                               'age': [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]})
    feature1: str = 'Pclass'
    feature2: str = 'age'
    correlation_metrics = [
        MetricMetadata(klass=CorrelationRatio, config={FEATURE_LIST: [feature1, feature2]})
    ]

    # The remaining MetricDetail arguments are truncated in the source;
    # dataset_metrics=correlation_metrics is assumed here.
    metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    # The right-hand side is missing in the source; runner.run() is assumed here.
    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    # Collect the shareable feature components of every feature
    sfc_registry = {}
    for feature in profile.features.values():
        sfc_registry[feature.get_name()] = feature.sfc_registry

    correlation_ratio_actual_value = dataset_metrics.get_result(sfc_registry=sfc_registry)['value']
    correlation_matrix = correlation_ratio_actual_value['matrix']
    assert correlation_matrix is not None

    # The iterables on the next two statements are truncated in the source;
    # they are assumed to be the feature lists returned in the result.
    categorical_feature_map = {value: index for index, value in
                               enumerate(correlation_ratio_actual_value[CATEGORICAL_FEATURES])}
    numerical_feature_map = {value: index for index, value in
                             enumerate(correlation_ratio_actual_value[NUMERICAL_FEATURES])}

    # Rows are indexed by categorical feature, columns by numerical feature
    correlation_ratio_value = round(
        correlation_matrix[categorical_feature_map[feature1]][
            numerical_feature_map[feature2]], 4)

Returns the metric result as:
  {
      'value': {
          'matrix': array([[0.50199]]),
          'categorical_features': ['Pclass'],
          'numerical_features': ['age']
      }
  }

categorical_features: List[str]
compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the CorrelationRatio using dataset


dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CorrelationRatio

Create a CorrelationRatio data quality metric using the configuration and kwargs


config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CorrelationRatio

Create a new instance from serialized bytes.



Serialized bytes as input.



New instance of Serializable

feature_pair_mapping: Dict[str, CorrelationRatioState]
get_required_shareable_feature_components(**kwargs: Any) Dict[str, List[SFCMetaData]]

Returns the Shareable Feature Components (SFCs) that a metric requires to compute its state and values. Metrics that do not require SFCs need not override this property.


Dict with feature_name as key and a list of SFCMetadata as value. Each SFCMetadata must contain the klass attribute, which points to the SFC class.

get_result(**kwargs: Any) Dict[str, Any]

Returns CorrelationRatio data quality metric


Json object: CorrelationRatio of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns CorrelationRatio Metric in Standard format.


StandardMetricResult: CorrelationRatio Metric in standard format.

merge(other: CorrelationRatio, **kwargs: Any) CorrelationRatio

Merge two CorrelationRatio instances into one without mutating either.



The other CorrelationRatio to be merged.



A new instance of CorrelationRatio
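One plausible way to merge per-category state is to add the sums and counts of matching categories, since both are additive across partitions. The sketch below assumes this; it is not taken from the library's source. CategoryDetails is a hypothetical mirror of the documented CorrelationRatioDetails fields (total_sum, total_count), and merge_category_details is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryDetails:
    """Hypothetical mirror of CorrelationRatioDetails (total_sum, total_count)."""
    total_sum: float = 0.0
    total_count: int = 0


def merge_category_details(left, right):
    """Return a new per-category mapping; neither input is mutated."""
    merged = dict(left)
    for category, details in right.items():
        current = merged.get(category, CategoryDetails())
        merged[category] = CategoryDetails(
            current.total_sum + details.total_sum,
            current.total_count + details.total_count,
        )
    return merged


# Per-category sums and counts of 'age' by 'Pclass' for the two halves
# of the dataset used in the usage example of this class.
first_half = {3: CategoryDetails(130.5, 4), 2: CategoryDetails(62.0, 1)}
second_half = {3: CategoryDetails(83.0, 4), 2: CategoryDetails(26.0, 1)}

merged = merge_category_details(first_half, second_half)
print(merged[3])  # CategoryDetails(total_sum=213.5, total_count=8)
print(merged[2])  # CategoryDetails(total_sum=88.0, total_count=2)
```

The merged sums and counts give the category means (26.6875 and 44.0) that the correlation ratio needs, without revisiting the raw rows.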

numerical_features: List[str]
serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.


bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails(total_sum: float = 0.0, total_count: int = 0)

Bases: object

total_count: int = 0
total_sum: float = 0.0
class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState(category_details: Dict[Union[int, str, float], mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails] = <factory>, categorical_feature: str = '', numerical_feature: str = '')

Bases: object

categorical_feature: str = ''
category_details: Dict[int | str | float, CorrelationRatioDetails]
numerical_feature: str = ''

Module contents