Work with Data Bias Detection

Bias in data can occur, for example, when groups defined by attributes such as age, gender, or race are unequally represented. Bias can exist in both datasets and models, and at every stage of data inference. The Data Bias Detector feature of the OML Services REST API helps you detect different types of bias in your data at an early stage of the machine learning lifecycle.

Biased labels in data can come from various sources, such as human annotators and social stereotypes. If machine learning models are trained on biased datasets, there is a high chance that the biases are reproduced and reinforced during the inference stage. Proactively addressing data bias has multiple benefits:

- As AI/ML models are under increasing scrutiny and regulation, detecting and mitigating potential biases and issues can help in complying with relevant laws and guidelines.
- Detecting data biases, especially in the early stages of the machine learning lifecycle, is important to understand potential impacts on fairness and equity.

The OML Services Data Bias Detector provides REST endpoints for creating bias detector jobs. To help address data biases and mitigate their effects in later stages of the modeling process, the bias mitigation method Reweighing has been added to the data_bias API. The Data Bias Detector calculates metrics to identify common types of data bias: Class Imbalance (CI), Statistical Parity (SP), and Conditional Demographic Disparity (CDD).

Class Imbalance (CI): This metric evaluates the mismatch between the available training data and the population on which the model will be applied. If the groups are imbalanced, consider collecting more representative data. Normalized range: [-1, +1].
- Positive values indicate that group 1 has more training samples in the dataset.
- Values near 0 indicate the groups are balanced in the number of training samples in the dataset.
- Negative values indicate that group 2 has more training samples in the dataset.
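
The sign behavior described above can be illustrated with a minimal Python sketch. It assumes the common normalized definition CI = (n1 - n2) / (n1 + n2), where n1 and n2 are the sample counts of group 1 and group 2. This section does not state the exact formula used by the Data Bias Detector, so the formula and the function name `class_imbalance` are assumptions for illustration.

```python
def class_imbalance(n_group1: int, n_group2: int) -> float:
    """Normalized class imbalance in [-1, +1] (assumed definition)."""
    total = n_group1 + n_group2
    if total == 0:
        raise ValueError("at least one sample is required")
    # Positive when group 1 has more samples, negative when group 2 does.
    return (n_group1 - n_group2) / total


print(class_imbalance(75, 25))  # group 1 dominates: positive value
print(class_imbalance(50, 50))  # balanced groups: 0.0
print(class_imbalance(10, 90))  # group 2 dominates: negative value
```
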
Statistical Parity (SP): This metric checks whether each sensitive group has the same probability of inclusion in the positive predicted class. The statistical parity difference is the Difference in Proportions of Labels (DPL). Range of the difference: [-1, +1].
- A perfect score indicates that no subgroup is predicted positively at a different rate than the rest of the population. The perfect score for the difference is 0.
- Positive values indicate an outcome where group 1 is accepted more than rejected.
- Negative values indicate an outcome where group 1 is rejected more than accepted.
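
As a rough illustration of the difference described above, the following Python sketch computes DPL as q1 - q2, where q1 is the share of positive labels in group 1 and q2 is the share of positive labels in all remaining samples (a one-versus-rest comparison). This is the commonly used definition and is an assumption here; the function name and the toy data are hypothetical.

```python
def difference_in_proportions(labels, groups, positive, group1):
    """DPL = q1 - q2, with q the share of positive labels per group (assumed definition)."""
    pos1 = n1 = pos2 = n2 = 0
    for label, group in zip(labels, groups):
        if group == group1:
            n1 += 1
            pos1 += label == positive
        else:
            # Every sample outside group 1 is treated as group 2.
            n2 += 1
            pos2 += label == positive
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one sample")
    return pos1 / n1 - pos2 / n2


# Toy data: group "a" has 3 of 4 positive labels, group "b" has 1 of 4.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(difference_in_proportions(labels, groups, positive=1, group1="a"))  # 0.5
```
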
Conditional Demographic Disparity (CDD): This is the weighted average of all disparities found in the subgroups defined by a dataset attribute. The goal of the CDD metric is to rule out Simpson's Paradox in a dataset. Simpson's Paradox is a phenomenon in probability and statistics where a trend appears in several groups of data, but disappears or reverses when the groups are combined. Range: [-1, +1].
- Positive values indicate an outcome where the chance of group 1 getting accepted is more than getting rejected.
- Near zero indicates no demographic disparity on average.
- Negative values indicate an outcome where the chance of group 1 getting rejected is more than getting accepted.
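
The weighted average described above can be sketched in Python as follows. The sketch assumes that the disparity within each stratum is the share of accepted samples that belong to group 1 minus the share of rejected samples that belong to group 1, and that each stratum is weighted by its sample count. The sign is chosen to match the interpretation given above; some references define the disparity with the opposite sign. The function names, the handling of strata without both outcomes, and the toy records are all assumptions.

```python
from collections import defaultdict


def demographic_disparity(rows, group1, positive):
    """Share of group 1 among accepted minus its share among rejected."""
    accepted = [g for g, y in rows if y == positive]
    rejected = [g for g, y in rows if y != positive]
    if not accepted or not rejected:
        return 0.0  # assumption: a stratum without both outcomes contributes 0
    return accepted.count(group1) / len(accepted) - rejected.count(group1) / len(rejected)


def conditional_demographic_disparity(records, group1, positive):
    """Average of per-stratum disparities, weighted by stratum size."""
    by_stratum = defaultdict(list)
    for stratum, group, label in records:
        by_stratum[stratum].append((group, label))
    weighted = sum(
        len(rows) * demographic_disparity(rows, group1, positive)
        for rows in by_stratum.values()
    )
    return weighted / len(records)


# Toy records: (stratum, group, label). Stratum "A" favors g1; stratum "C" is neutral.
records = [
    ("A", "g1", 1), ("A", "g1", 1), ("A", "g2", 0), ("A", "g2", 0),
    ("C", "g1", 1), ("C", "g2", 1), ("C", "g1", 0), ("C", "g2", 0),
]
print(conditional_demographic_disparity(records, group1="g1", positive=1))  # 0.5
```
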

Input Parameters for Data Bias Detection

To run the DATA_BIAS API, pass the input parameters listed in the Input Parameters for Data Bias Detection table below in the job. The Mandatory column indicates which parameters are required.

Table - Input Parameters for Data Bias Detection

Input Parameters	Type	Default Value	Description	Mandatory
`inputData`	String	NA	This is the name of the input data table.	Yes
`outputData`	String	NA	This is the name of the output data table.	Yes
`outcomeColumn`	String	NA	The name of the feature in the input data that is the outcome of training a machine learning model. The outcome must be either numerical or categorical.	Yes
`sensitiveFeatures`	CLOB	250 (limit on the number of features; a list with more than 250 features results in an error)	A list of features on which data bias detection and mitigation is performed. There is a limit on the number of features in the list. Default = `250`. The features must be either numerical or categorical. Note: Text, nested, and date data types are not supported in the current release.	Yes
`positiveOutcome`	String	NA	The value of the positive outcome. The value must exist in the column specified by `outcomeColumn`. In the Data Bias Detection - Examples, which analyze income, `>50K` is the positive outcome and `<50K` is the negative outcome.	Mandatory if `outcomeThreshold` is NULL. Note: Either `positiveOutcome` or `outcomeThreshold` must be provided. If both are provided, you will receive an error.
`outcomeThreshold`	Number (FLOAT)	NULL	This is a numerical value that divides the data into two parts, that is, `>= outcomeThreshold` and `< outcomeThreshold`. The threshold must be within the range of the outcome column.	Mandatory if `positiveOutcome` is NULL. Note: Either `positiveOutcome` or `outcomeThreshold` must be provided. If both are provided, you will receive an error.
`strata`	CLOB	NULL	This is a list of columns needed to calculate the Conditional Demographic Disparity (CDD). There is a limit on the number of strata in the list. Default = `20`. If the list contains more than 20 strata, an error is returned. CDD is not calculated if the strata list is not specified. The strata can be either numerical or categorical. Note: Text, nested, and date data types are not supported in the current release.	No
`replaceResultTable`	Boolean	0, 1	Indicates whether an existing result table with the same name is replaced. This is applicable when the result table already exists.	No
`pairwiseMode`	Boolean	0, 1	Indicates whether the metrics are calculated between all possible group pairs within a feature (`pairwiseMode = true`) or between one group and the remaining samples.	No
`jobName`	String	NULL	The name of the OML job.	Yes, when the procedure is called through the job scheduler. No, when the PL/SQL API is used.
`categoricalBinNum`	Integer	4	The number of bins generated for categorical variables, both for `sensitiveFeatures` and `strata`.	No
`numericalBinNum`	Integer	5	The number of bins generated for numerical variables, both for `sensitiveFeatures` and `strata`.	No
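
The constraints in the table can be checked on the client side before a job is submitted. The following Python sketch is one possible way to do that: it enforces the mandatory parameters, the rule that exactly one of `positiveOutcome` and `outcomeThreshold` is provided, and the limits of 250 sensitive features and 20 strata. The helper name `build_data_bias_parameters`, the flat dictionary layout, and the table and column names are hypothetical. How these parameters are wrapped in the actual job request (endpoint, enclosing keys, authentication) is not described in this section and is intentionally left out.

```python
import json

# Limits taken from the parameter table.
MAX_SENSITIVE_FEATURES = 250
MAX_STRATA = 20


def build_data_bias_parameters(input_data, output_data, outcome_column,
                               sensitive_features, positive_outcome=None,
                               outcome_threshold=None, strata=None,
                               job_name=None):
    """Hypothetical helper: validate and collect Data Bias Detector inputs."""
    if not sensitive_features:
        raise ValueError("sensitiveFeatures must list at least one feature")
    if len(sensitive_features) > MAX_SENSITIVE_FEATURES:
        raise ValueError("sensitiveFeatures exceeds the limit of 250")
    # Exactly one of positiveOutcome and outcomeThreshold may be supplied.
    if (positive_outcome is None) == (outcome_threshold is None):
        raise ValueError("provide exactly one of positiveOutcome or outcomeThreshold")
    if strata is not None and len(strata) > MAX_STRATA:
        raise ValueError("strata exceeds the limit of 20")

    params = {
        "inputData": input_data,
        "outputData": output_data,
        "outcomeColumn": outcome_column,
        "sensitiveFeatures": list(sensitive_features),
    }
    if positive_outcome is not None:
        params["positiveOutcome"] = positive_outcome
    else:
        params["outcomeThreshold"] = outcome_threshold
    if strata:
        params["strata"] = list(strata)
    if job_name is not None:
        params["jobName"] = job_name
    return params


# Hypothetical table and column names, for illustration only.
params = build_data_bias_parameters(
    input_data="ADULT",
    output_data="ADULT_BIAS_RESULT",
    outcome_column="INCOME",
    sensitive_features=["SEX", "RACE"],
    positive_outcome=">50K",
    strata=["MARITAL_STATUS"],
)
print(json.dumps(params, indent=2))
```

Validating locally mirrors the errors the table says the service raises, so a malformed request can be caught before a job is created.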