Data Bias Detection - Example
Bias in data can occur when some elements of the data are overrepresented or overweighted. The Data Bias Detector provides quantitative measures of data bias without predefining a threshold for what is considered bias. Data bias assessment depends on the distinctive features of the data and the particular problem that is being addressed. You must set your own acceptable bias levels based on the goals you want to achieve.
Data Bias Detection Workflow
- Obtain the access token.
- Create and run a data bias detection job.
- View the details of the data bias job.
- Query the output table to view the data bias details detected for the sensitive features.
1: Obtain the Access Token
You must obtain an authentication token by using your Oracle Machine Learning (OML) account credentials to send requests to OML Services. To authenticate and obtain a token, use cURL with the -d option to pass the credentials for your Oracle Machine Learning account to the Oracle Machine Learning User Management Cloud Service REST endpoint /oauth2/v1/token. Run the following command to obtain the access token:
$ curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"
- -X POST specifies to use a POST request when communicating with the HTTP server.
- --header defines the headers required for the request (application/json).
- -d sends the username and password authentication credentials as data in a POST request to the HTTP server.
- Content-Type defines the request format (JSON).
- Accept defines the response format (JSON).
- yourusername is the user name of an Oracle Machine Learning user with the default OML_DEVELOPER role.
- yourpassword is the password for the user name.
- oml-cloud-service-location-url is a URL containing the REST server portion of the Oracle Machine Learning User Management Cloud Service instance URL that includes the tenancy ID and database name. You can obtain the URL from the Development tab in the Service Console of your Oracle Autonomous Database instance.
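Because the token is passed in an Authorization header on every subsequent request, it is convenient to capture it in an environment variable. Here is a minimal sketch; it assumes the jq utility is installed and that the endpoint returns the token in an accessToken field:

# Capture the access token in an environment variable for later requests.
# Assumes jq is installed and the response carries an "accessToken" field.
$ export token=$(curl -s -X POST --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' \
    "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token" | jq -r '.accessToken')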
2: Create a Data Bias Detection Job
The dataset used in this example, the Adult dataset (also known as the Census Income dataset), is a multivariate dataset. It contains census data of 30,940 adults. The prediction task associated with the dataset is to determine whether a person makes over 50K a year.
Table - Attributes of the Adult dataset
Attributes | Type | Description |
---|---|---|
Income | Binary | >50K or <=50K. Note: This attribute is used as the outcome for Classification models. |
Age | Continuous | Range 17 to 90 years |
Gender | Binary | Male, Female |
Marital Status | Categorical | Married-civ-spouse, Never-married, Divorced, Separated, Widowed, Married-spouse-absent |
To create a data bias detection job, send a POST request to the /omlmod/v1/jobs endpoint in OML Services.
Note: OML Services interacts with DBMS_SCHEDULER to perform actions on jobs.
Here is an example of a data bias detection job request:
curl -v -X POST <oml-cloud-service-location-url>/omlmod/v1/jobs \
-H "Content-Type: application/json" -H "accept: application/json" -d \
'{"jobProperties":{
"jobName":"jobNametest",
"jobType":"DATA_BIAS",
"jobServiceLevel":"MEDIUM",
"inputSchemaName":"OMLUSER",
"outputSchemaName":"OMLUSER",
"outputData":"adultbias_tab",
"jobDescription":"Data_Bias job, specify all parameters",
"inputData":"ADULT",
"sensitiveFeatures":["\"GENDER\""],
"strata":["\"MARITAL_STATUS\""],
"outcomeColumn":"INCOME",
"positiveOutcome":">50K",
"categoricalBinNum":6,
"numericalBinNum":10}}' \
-H 'Authorization:Bearer <token>'
Response of the data bias detection job creation request:
"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","links":[{"rel":"self","href":"http://<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}]}
The parameters used in this command are:
- jobName: The name of the OML job. In this example, the name is jobNametest.
- jobType: Specifies the type of job to be run. It is set to DATA_BIAS for data bias jobs.
- jobServiceLevel: MEDIUM
- inputSchemaName: OMLUSER
- outputSchemaName: OMLUSER
- outputData: The name of the output data table. In this example, the output data table name is adultbias_tab.
- jobDescription: A description of the Data_Bias job.
- inputData: The name of the input data table. In this example, the table name is ADULT.
- sensitiveFeatures: A list of features on which data bias detection and mitigation is performed. By default, up to 250 features can be monitored for data bias detection; if there are more than 250 features, the job fails with an error. The features can be either numeric or categorical. In this example, the attribute passed as a sensitive feature is GENDER. Note: Text, Nested, and Date data types are not supported in this release.
- strata: An array of strata names used to calculate Conditional Demographic Disparity (CDD), which mitigates the impact of data bias from confounding variables by conditioning on a third variable, the strata. In this example, the name provided for strata is MARITAL_STATUS.
- outcomeColumn: The name of the feature in the input data that is the outcome of training a machine learning model. The outcome must be either numeric or categorical. In this example, it is INCOME.
- positiveOutcome: A value that is in favor of a specific group in a dataset. It essentially indicates a positive outcome for that group. In this example, the positive outcome value is >50K.
- categoricalBinNum: Indicates whether to perform binning on categorical features. In this example, the number of bins is set to 6.
- numericalBinNum: Indicates whether to perform binning on numerical features. In this example, the number of bins is set to the default value 10.
3: View Details of the Submitted Job
Run the following command to view job details:
curl -v -X GET <oml-cloud-service-location-url>/omlmod/v1/jobs/'OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53'
-H "Content-Type: application/json" -H 'Authorization:Bearer <token>'
In this command:
- <token> is the access token obtained through the Authorization API. You can store it in an environment variable such as $token.
- OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53 is the job ID.
Response of the Job Detail Request
Here is the response to the job details request. If your job has already run, the response includes information about the last job run.
{"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","jobRequest":{"jobSchedule":null,"jobProperties":{"jobType":"DATA_BIAS","inputSchemaName":"OMLUSER","outputSchemaName":"OMLUSER","outputData":"adultbias_tab","jobDescription":"Data_Bias
job test case400,specify all
parameters","jobName":"jobNametest","disableJob":false,"jobServiceLevel":"MEDIUM","inputData":"ADULT","sensitiveFeatures":["\"GENDER\""],"strata":["\"MARITAL_STATUS\""],"outcomeColumn":"INCOME","outcomeThreshold":null,"positiveOutcome":">50K","replaceResultTable":null,"pairwiseMode":null,"categoricalBinNum":6,"numericalBinNum":10}},"jobStatus":"CREATED","dateSubmitted":"2024-08-06T08:20:05.688706Z","links":[{"rel":"self","href":"http:<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}],"jobFlags":[],"state":"SUCCEEDED","enabled":false,"lastStartDate":"2024-08-06T08:20:05.837534Z","runCount":1,"lastRunDetail":{"jobRunStatus":"SUCCEEDED","errorMessage":nul*
Connection #0 to host <oml-cloud-service-location-url> left intact
l,"requestedStartDate":"2024-08-06T08:20:05.752235Z","actualStartDate":"2024-08-06T08:20:05.837615Z","duration":"0
0:0:1.0"}}
Note: Make a note of the jobId and the outputData name in the job response. You will need these to query the output table and view the details of the data bias detected for the sensitive features defined in the job request.
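Because the job runs asynchronously, you can poll the job details endpoint until the run completes before querying the output table. Here is a minimal polling sketch; it assumes the access token is stored in the $token environment variable and that the jq utility is installed:

# Poll the job until the last run reports SUCCEEDED or FAILED.
# Assumes $token holds the access token and jq is installed.
while true; do
  status=$(curl -s -X GET <oml-cloud-service-location-url>/omlmod/v1/jobs/'OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53' \
    -H "Content-Type: application/json" -H 'Authorization:Bearer '$token | jq -r '.lastRunDetail.jobRunStatus')
  echo "jobRunStatus: $status"
  # Stop polling once the run has finished.
  if [ "$status" = "SUCCEEDED" ] || [ "$status" = "FAILED" ]; then break; fi
  sleep 10
done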
4: Connect to the Database to Access the Output Table
Connect to the database and access the output table by using the schema and table details specified in the job request:
- inputSchemaName: OMLUSER
- outputSchemaName: OMLUSER
- outputData: The name of the output data table. In this example, it is adultbias_tab.
- Run the following SQL query to view the records in the output table:
select * from OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53_ADULTBIAS_TAB;
In this example, OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53 is the job ID and ADULTBIAS_TAB is the output table name.
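If you only want to count the number of records in the output table, you can run a COUNT query instead:

select count(*) from OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53_ADULTBIAS_TAB;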
Here is the data bias result for GENDER, passed for the parameter sensitiveFeatures, and MARITAL_STATUS, passed for the parameter strata:

{
  "metric": [
    { "group_a": "MALE", "CI": 0.33841, "SP": 0.19628 },
    { "group_a": "FEMALE", "CI": -0.33841, "SP": -0.19628 }
  ],
  "cdd": [
    {
      "strata": "MARITAL_STATUS",
      "result": [
        {
          "group_a": "MALE",
          "CDD": 0.092269,
          "detail": [
            { "subgroup": "MARRIED-CIV-SPOUSE", "DD": -0.0036665 },
            { "subgroup": "NEVER-MARRIED", "DD": 0.11335 },
            { "subgroup": "DIVORCED", "DD": 0.23977 },
            { "subgroup": "SEPARATED", "DD": 0.38267 },
            { "subgroup": "WIDOWED", "DD": 0.31675 },
            { "subgroup": "MARRIED-SPOUSE-ABSENT", "DD": 0.18168 },
            { "subgroup": "", "DD": 0.015385 }
          ]
        },
        {
          "group_a": "FEMALE",
          "CDD": -0.092269,
          "detail": [
            { "subgroup": "MARRIED-CIV-SPOUSE", "DD": 0.0036665 },
            { "subgroup": "NEVER-MARRIED", "DD": -0.11335 },
            { "subgroup": "DIVORCED", "DD": -0.23977 },
            { "subgroup": "SEPARATED", "DD": -0.38267 },
            { "subgroup": "WIDOWED", "DD": -0.31675 },
            { "subgroup": "MARRIED-SPOUSE-ABSENT", "DD": -0.18168 },
            { "subgroup": "", "DD": -0.015385 }
          ]
        }
      ]
    }
  ],
  "reweighing_matrix": [
    {
      "group_a": "MALE",
      "group_a_pos_weight": 0.78764,
      "non_group_a_pos_weight": 2.2,
      "group_a_neg_weight": 1.0935,
      "non_group_a_neg_weight": 0.85251
    },
    {
      "group_a": "FEMALE",
      "group_a_pos_weight": 2.2,
      "non_group_a_pos_weight": 0.78764,
      "group_a_neg_weight": 0.85251,
      "non_group_a_neg_weight": 1.0935
    }
  ]
}
The following observations can be made from the data bias results for GENDER and MARITAL_STATUS:
- The Class Imbalance (CI) and Statistical Parity (SP) computed for the attribute GENDER and the group MALE are positive, while the CI and SP values for the group FEMALE are negative. This indicates that the disadvantaged group (female) has fewer data points than the advantaged group. During the machine learning training process, models try to minimize the overall error rate. As a result, a model tends to be good at predicting the majority classes and to perform poorly on the minority or disadvantaged classes.
- The Conditional Demographic Disparity (CDD) values for the groups MALE and FEMALE are 0.092269 and -0.092269 respectively. Note that the sum of the computed CDD for the strata MARITAL_STATUS is zero. This indicates that even after taking MARITAL_STATUS into account, the data is still biased towards the MALE group (as its CDD is positive) and against the FEMALE group (as its CDD is negative).
- For GENDER, a reweighing matrix is computed. Reweighing is a method to mitigate data bias: it reduces bias by assigning greater weights to instances in the disadvantaged groups and to instances with positive labels. Essentially, the classifier pays more attention to the instances with greater weights and prioritizes their correct classification. The goal is to ensure that the classifier does not favor the advantaged groups or perpetuate existing biases against disadvantaged groups. Here is how you can apply the reweighing matrix: some machine learning packages accept row or sample weights as a training parameter for certain models. For example, in the Oracle DBMS_DATA_MINING package, you can set ODMS_ROW_WEIGHT_COLUMN_NAME in Global Settings while training a generalized linear model (GLM), as shown in the sketch after this list. For Classification algorithms that cannot incorporate row weights in the training process, the weighing matrices can serve as guidance for resampling the data.
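Here is a minimal sketch of that approach in SQL and PL/SQL. The view name ADULT_WEIGHTED, the settings table GLM_SETTINGS, and the model name ADULT_GLM are illustrative, not part of this example; the weight values are taken from the reweighing_matrix in the job output above, and the input is the ADULT table with the GENDER and INCOME columns described earlier.

-- Attach a weight column derived from the reweighing matrix.
-- The weight values come from the reweighing_matrix in the job output.
CREATE OR REPLACE VIEW adult_weighted AS
SELECT a.*,
       CASE
         WHEN gender = 'FEMALE' AND income = '>50K'  THEN 2.2     -- disadvantaged group, positive outcome
         WHEN gender = 'MALE'   AND income = '>50K'  THEN 0.78764 -- advantaged group, positive outcome
         WHEN gender = 'FEMALE' AND income = '<=50K' THEN 0.85251 -- disadvantaged group, negative outcome
         ELSE 1.0935                                              -- advantaged group, negative outcome
       END AS row_weight
FROM adult a;

-- Settings table that points the GLM at the weight column.
CREATE TABLE glm_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO glm_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);
  INSERT INTO glm_settings VALUES
    (dbms_data_mining.odms_row_weight_column_name, 'ROW_WEIGHT');

  -- Train a GLM classification model that uses the row weights.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'ADULT_GLM',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'ADULT_WEIGHTED',
    case_id_column_name => NULL,
    target_column_name  => 'INCOME',
    settings_table_name => 'GLM_SETTINGS');
END;
/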
The Adult dataset is biased, and the outcomes are dependent on the biased groups. That is why the group MALE has a higher positive income ratio than the disadvantaged group FEMALE. To mitigate the bias, the advantaged groups are assigned lower weights during training, and vice versa. In an ideal scenario, when a dataset is unbiased, the groups (MALE, WHITE, and so on) and the outcomes (>50K, <=50K) are independent. In other words, an individual should be assigned an outcome regardless of their group; in this case, GENDER.