Work with Data Bias Detection
Bias in data can occur, for example, when there is an unequal representation of different groups, such as age, gender, or race. Bias can exist in both datasets and models, and can be introduced at any stage, from the data itself to the inferences drawn from it. The Data Bias Detector feature of the OML Services REST API helps you detect different types of bias in your data at an early stage of the machine learning lifecycle.
- As AI/ML models are under increasing scrutiny and regulation, detecting and mitigating potential biases and issues can help in complying with relevant laws and guidelines.
- Detecting data biases, especially in the early stages of the machine learning lifecycle, is important to understand potential impacts on fairness and equity.
The OML Services Data Bias Detector provides REST endpoints for creating bias detector jobs. To help address data biases and mitigate their effects in later stages of the modeling process, the bias mitigation method Reweighing has been added to the DATA_BIAS API. The Data Bias Detector calculates metrics to identify common types of data bias: Class Imbalance (CI), Statistical Parity (SP), and Conditional Demographic Disparity (CDD), described in the list below and illustrated in the sketch that follows it.
- Class Imbalance (CI): This metric evaluates the mismatch between the available training data and the population on which the model will be applied. If the data is imbalanced, you may consider collecting more representative data. Normalized range: [-1, +1]
  - Positive values indicate that group 1 has more training samples in the dataset.
  - Values near 0 indicate that the groups are balanced in the number of training samples in the dataset.
  - Negative values indicate that group 2 has more training samples in the dataset.
- Statistical Parity (SP): This metric ensures the same probability of inclusion in the positive predicted class for each sensitive group. The statistical parity difference is the Difference in Proportions of Labels (DPL). Range of the difference: [-1, +1]
  - A perfect score for the difference is 0, which indicates that the model does not predict positively for any of the subgroups at a rate different from the rest of the population.
  - Positive values indicate an outcome where group 1 is accepted more than rejected.
  - Negative values indicate an outcome where group 1 is rejected more than accepted.
- Conditional Demographic Disparity (CDD): This is the weighted average of all disparities found in subgroups defined by a dataset attribute. The goal of the CDD metric is to rule out Simpson's Paradox from a dataset. Simpson's Paradox is a phenomenon in probability and statistics where a trend appears in several groups of data but disappears or reverses when the groups are combined. Range: [-1, +1]
  - Positive values indicate an outcome where the chance of group 1 getting accepted is greater than getting rejected.
  - Values near 0 indicate no demographic disparity on average.
  - Negative values indicate an outcome where the chance of group 1 getting rejected is greater than getting accepted.
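For illustration only, the following sketch computes the three metrics on a small DataFrame using common definitions from the fairness literature; the exact computations performed by the DATA_BIAS API may differ, and the column names, group labels, and example data are hypothetical.

```python
# Illustrative sketch only: these are common textbook definitions of CI, DPL,
# and CDD; the exact computations performed by the DATA_BIAS API may differ.
# Column names (GENDER, AGE_BAND, APPROVED) and group labels are hypothetical.
import pandas as pd

def class_imbalance(df, group_col, group1, group2):
    # CI = (n1 - n2) / (n1 + n2): positive when group 1 has more samples.
    n1 = (df[group_col] == group1).sum()
    n2 = (df[group_col] == group2).sum()
    return (n1 - n2) / (n1 + n2)

def statistical_parity_difference(df, group_col, group1, group2,
                                  outcome_col, positive_outcome):
    # DPL = P(positive outcome | group 1) - P(positive outcome | group 2).
    p1 = (df.loc[df[group_col] == group1, outcome_col] == positive_outcome).mean()
    p2 = (df.loc[df[group_col] == group2, outcome_col] == positive_outcome).mean()
    return p1 - p2

def conditional_demographic_disparity(df, group_col, group1,
                                      outcome_col, positive_outcome, strata_col):
    # Weighted average over strata of the per-stratum disparity
    # DD_i = P(group 1 | accepted in stratum i) - P(group 1 | rejected in stratum i),
    # so positive values mean group 1 is accepted more than rejected on average.
    total = len(df)
    cdd = 0.0
    for _, stratum in df.groupby(strata_col):
        accepted = stratum[stratum[outcome_col] == positive_outcome]
        rejected = stratum[stratum[outcome_col] != positive_outcome]
        if accepted.empty or rejected.empty:
            continue  # disparity is undefined for this stratum; skip it
        dd = ((accepted[group_col] == group1).mean()
              - (rejected[group_col] == group1).mean())
        cdd += (len(stratum) / total) * dd
    return cdd

# Hypothetical example data.
df = pd.DataFrame({
    "GENDER":   ["F", "F", "F", "M", "M", "M", "M", "M"],
    "AGE_BAND": ["young", "old", "young", "old", "young", "old", "young", "old"],
    "APPROVED": ["yes", "no", "no", "yes", "yes", "yes", "no", "yes"],
})
print(class_imbalance(df, "GENDER", "M", "F"))                                   # CI
print(statistical_parity_difference(df, "GENDER", "M", "F", "APPROVED", "yes"))  # DPL
print(conditional_demographic_disparity(df, "GENDER", "F", "APPROVED", "yes", "AGE_BAND"))  # CDD
```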
Input Parameters for Data Bias Detection
When you submit a job to the DATA_BIAS API, you must pass the following parameters in the job. The input parameters are listed in the Input Parameters for Data Bias Detection table below.
Table - Input Parameters for Data Bias Detection
Input Parameters | Type | Default Value | Description | Mandatory |
---|---|---|---|---|
inputData | String | NA | This is the name of the input data table. | Yes |
outputData | String | NA | This is the name of the output data table. | Yes |
outcomeColumn | String | NA | The name of the feature in the input data that is the outcome of training a machine learning model. The outcome must be either numerical or categorical. | Yes |
sensitiveFeatures | CLOB | 250 | A list of features on which data bias detection and mitigation are performed. The list is limited to 250 features; if there are more than 250 features, the job errors out. The features must be either numerical or categorical. Note: Text, nested, and date data types are not supported in the current release. | Yes |
positiveOutcome | String | NA | The value of the positive outcome. The value must exist in the column specified by outcomeColumn. | Mandatory if outcomeThreshold is NULL. Note: Either positiveOutcome or outcomeThreshold must be provided. If both are provided, you will receive an error. |
outcomeThreshold | Number (FLOAT) | NULL | This is a numerical value that divides the data into two parts, that is, >= outcomeThreshold and < outcomeThreshold. The threshold must be within the range of the outcome column. | Mandatory if positiveOutcome is NULL. Note: Either positiveOutcome or outcomeThreshold must be provided. If both are provided, you will receive an error. |
strata | CLOB | NULL | This is a list of columns used to calculate Conditional Demographic Disparity (CDD). The list is limited to 20 strata; if there are more than 20 strata, the job errors out. CDD is not calculated if the strata list is not specified. The strata can be either numerical or categorical. Note: Text, nested, and date data types are not supported in the current release. | No |
replaceResultTable | Boolean | 0, 1 | Indicates whether the result table is replaced when a table with the specified output name already exists. | No |
pairwiseMode | Boolean | 0, 1 | Indicates whether the metrics are calculated pairwise between groups. | No |
jobName | String | NULL | The name of the OML job. | No |
categoricalBinNum | Integer | 4 | The number of bins generated for categorical variables, both for sensitiveFeatures and strata. | No |
numericalBinNum | Integer | 5 | The number of bins generated for numerical variables, both for sensitiveFeatures and strata. | No |
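As a minimal sketch, the snippet below assembles the parameters from the table into a job request body and posts it with Python. The endpoint path, request envelope, and token handling shown here are assumptions for illustration only, and the table and column names are hypothetical; consult the OML Services REST API reference for the exact request format.

```python
# Minimal sketch of submitting a data bias detection job.
# ASSUMPTIONS: the endpoint path, payload shape, and bearer-token handling below
# are illustrative placeholders, not the documented OML Services request format.
# Table and column names (CUSTOMER_DATA, GENDER, AGE, APPROVED) are hypothetical.
import json
import requests

OML_SERVER = "https://<oml-cloud-service-url>"   # placeholder server URL
TOKEN = "<access-token>"                          # obtained from the OML Services token endpoint

# Input parameters documented in the table above.
data_bias_params = {
    "inputData": "CUSTOMER_DATA",            # input data table
    "outputData": "CUSTOMER_DATA_BIAS",      # output (result) table
    "outcomeColumn": "APPROVED",              # outcome of the ML task
    "sensitiveFeatures": ["GENDER", "AGE"],   # features checked for bias
    "positiveOutcome": "yes",                 # provide this OR outcomeThreshold, not both
    "strata": ["AGE"],                        # required for CDD to be calculated
    "categoricalBinNum": 4,
    "numericalBinNum": 5,
}

# Hypothetical job submission; the real endpoint and job envelope may differ.
response = requests.post(
    f"{OML_SERVER}/omlmod/v1/jobs",           # placeholder path
    headers={"Authorization": f"Bearer {TOKEN}",
             "Content-Type": "application/json"},
    data=json.dumps(data_bias_params),
)
print(response.status_code, response.text)
```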