Data Bias Detection - Example
Biases in data can occur when some elements of the data are overrepresented or overweighted. The Data Bias Detector provides quantitative measures of data bias without predefining a threshold for what is considered bias. Data bias assessment depends on the distinctive features of the data and the particular problem that is being addressed. You must set your own acceptable bias levels based on the goals you want to achieve.
Data Bias Detection Workflow
- Obtain the access token.
- Create and run a data bias detection job.
- View the details of the data bias job.
- Query the output table to view the data bias details detected for the sensitive features.
1: Obtain the Access Token
You must obtain an authentication token by using your Oracle Machine Learning (OML) account credentials to send requests to OML Services. To authenticate and obtain a token, use cURL with the -d option to pass the credentials for your Oracle Machine Learning account against the Oracle Machine Learning user management cloud service REST endpoint /oauth2/v1/token. Run the following command to obtain the access token:
```
$ curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"
```

- `-X POST` specifies to use a POST request when communicating with the HTTP server.
- `--header` defines the headers required for the request (application/json).
- `-d` sends the username and password authentication credentials as data in a POST request to the HTTP server.
- `Content-Type` defines the request body format (JSON).
- `Accept` defines the response format (JSON).
- `yourusername` is the user name of an Oracle Machine Learning user with the default OML_DEVELOPER role.
- `yourpassword` is the password for the user name.
- `oml-cloud-service-location-url` is a URL containing the REST server portion of the Oracle Machine Learning User Management Cloud Service instance URL, including the tenancy ID and database name. You can obtain the URL from the Development tab in the Service Console of your Oracle Autonomous AI Database instance.
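The token arrives in the JSON body of the response. As a minimal sketch, assuming the body carries the token in an `accessToken` field (verify the field name against the actual response from your OML instance), you can extract it and build the `Authorization` header for subsequent requests:

```python
import json

def extract_token(response_body: str) -> str:
    """Pull the bearer token out of the /oauth2/v1/token JSON response.

    Assumes the token is returned in an "accessToken" field; check the
    actual response from your OML instance.
    """
    payload = json.loads(response_body)
    return payload["accessToken"]

# Sample response body (illustrative, not a real token):
sample = '{"accessToken": "eyJhbGci...", "expiresIn": 3600, "tokenType": "Bearer"}'
token = extract_token(sample)

# Use the token in an Authorization header on subsequent requests:
auth_header = {"Authorization": f"Bearer {token}"}
```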
2: Create a Data Bias Detection Job
The dataset used in this example, the Adult dataset (also known as the Census Income dataset), is a multivariate dataset. It contains census data of 30,940 adults. The prediction task associated with the dataset is to determine whether a person makes over 50K a year.
Table - Attributes of the Adult dataset

| Attributes | Type | Description |
|---|---|---|
| Income | Binary | >50K or <=50K. Note: This attribute is used as the outcome for Classification models. |
| Age | Continuous | Range 17 to 90 years |
| Gender | Binary | Male, Female |
| Marital Status | Categorical | Married-civ-spouse, Never-married, Divorced, Separated, Widowed, Married-spouse-absent |
To create a data bias detection job, send a POST request to the /omlmod/v1/jobs endpoint in OML Services.
Note: OML Services interacts with the DBMS_SCHEDULER to perform actions on jobs.
Here is an example of a data bias detection job request:
```
curl -v -X POST <oml-cloud-service-location-url>/omlmod/v1/jobs \
  -H "Content-Type: application/json" \
  -H "accept: application/json" \
  -H 'Authorization: Bearer <token>' \
  -d '{"jobProperties":{
        "jobName":"jobNametest",
        "jobType":"DATA_BIAS",
        "jobServiceLevel":"MEDIUM",
        "inputSchemaName":"OMLUSER",
        "outputSchemaName":"OMLUSER",
        "outputData":"adultbias_tab",
        "jobDescription":"Data_Bias job, specify all parameters",
        "inputData":"ADULT",
        "sensitiveFeatures":["\"GENDER\""],
        "strata":["\"MARITAL_STATUS\""],
        "outcomeColumn":"INCOME",
        "positiveOutcome":">50K",
        "categoricalBinNum":6,
        "numericalBinNum":10}}'
```
Response of the data bias detection job creation request:
```
{"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","links":[{"rel":"self","href":"http://<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}]}
```
The parameters used in this command are:

- `jobName`: The name of the OML job. In this example, the name is jobNametest.
- `jobType`: Specifies the type of job to be run. It is set to DATA_BIAS for data bias jobs.
- `jobServiceLevel`: MEDIUM
- `inputSchemaName`: OMLUSER
- `outputSchemaName`: OMLUSER
- `outputData`: The name of the output data table. In this example, the output data table name is adultbias_tab.
- `jobDescription`: A description of the Data_Bias job.
- `inputData`: The name of the input data table. In this example, the table name is ADULT.
- `sensitiveFeatures`: A list of features on which data bias detection and mitigation is performed. By default, 250 features can be monitored for data bias detection. If there are more than 250 features, the job errors out. The features can be either numeric or categorical. In this example, the attribute passed as a sensitive feature is GENDER. Note: Text, Nested, and Date data types are not supported in this release.
- `strata`: An array of strata names used to calculate the Conditional Demographic Disparity (CDD), which mitigates the impact of data bias from confounding variables by conditioning on a third variable, called strata. In this example, the name provided for strata is MARITAL_STATUS.
- `outcomeColumn`: The name of the feature in the input data that is the outcome of training a machine learning model. The outcome must be either numeric or categorical. In this example, it is INCOME.
- `positiveOutcome`: A value that is in favor of a specific group in a dataset. It essentially indicates a positive outcome for that group. In this example, the positive outcome value is >50K.
- `categoricalBinNum`: Indicates whether to perform binning on categorical features. The number of bins is set to 6.
- `numericalBinNum`: Indicates whether to perform binning on numerical features. The number of bins is set to the default value 10.
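The request body above can be assembled programmatically before posting it to the jobs endpoint. The following is a minimal sketch; the `build_job_request` helper is illustrative and not part of the OML API, but the field names and values mirror the documented request:

```python
import json

def build_job_request(input_data, sensitive_features, strata,
                      outcome_column, positive_outcome,
                      output_data="adultbias_tab"):
    """Assemble the jobProperties body for a DATA_BIAS job (illustrative)."""
    props = {
        "jobName": "jobNametest",
        "jobType": "DATA_BIAS",
        "jobServiceLevel": "MEDIUM",
        "inputSchemaName": "OMLUSER",
        "outputSchemaName": "OMLUSER",
        "outputData": output_data,
        "jobDescription": "Data_Bias job, specify all parameters",
        "inputData": input_data,
        # Column names are wrapped in escaped quotes, as in the cURL example.
        "sensitiveFeatures": ['"%s"' % f for f in sensitive_features],
        "strata": ['"%s"' % s for s in strata],
        "outcomeColumn": outcome_column,
        "positiveOutcome": positive_outcome,
        "categoricalBinNum": 6,
        "numericalBinNum": 10,
    }
    return json.dumps({"jobProperties": props})

body = build_job_request("ADULT", ["GENDER"], ["MARITAL_STATUS"],
                         "INCOME", ">50K")
```

The resulting `body` string can then be passed as the `-d` payload of the cURL command shown earlier.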
3: View Details of the Submitted Job
Run the following command to view job details:
```
curl -v -X GET <oml-cloud-service-location-url>/omlmod/v1/jobs/'OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53' \
  -H "Content-Type: application/json" -H 'Authorization: Bearer <token>'
```

- `$token` represents an environment variable that is assigned the token obtained through the Authorization API.
- `OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53` is the job ID.
Response of the Job Detail Request
Here is a response to the job details request. If your job has already run before, the response includes information about the last job run.
```
{"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","jobRequest":{"jobSchedule":null,"jobProperties":{"jobType":"DATA_BIAS","inputSchemaName":"OMLUSER","outputSchemaName":"OMLUSER","outputData":"adultbias_tab","jobDescription":"Data_Bias job test case400,specify all parameters","jobName":"jobNametest","disableJob":false,"jobServiceLevel":"MEDIUM","inputData":"ADULT","sensitiveFeatures":["\"GENDER\""],"strata":["\"MARITAL_STATUS\""],"outcomeColumn":"INCOME","outcomeThreshold":null,"positiveOutcome":">50K","replaceResultTable":null,"pairwiseMode":null,"categoricalBinNum":6,"numericalBinNum":10}},"jobStatus":"CREATED","dateSubmitted":"2024-08-06T08:20:05.688706Z","links":[{"rel":"self","href":"http://<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}],"jobFlags":[],"state":"SUCCEEDED","enabled":false,"lastStartDate":"2024-08-06T08:20:05.837534Z","runCount":1,"lastRunDetail":{"jobRunStatus":"SUCCEEDED","errorMessage":null,"requestedStartDate":"2024-08-06T08:20:05.752235Z","actualStartDate":"2024-08-06T08:20:05.837615Z","duration":"0 0:0:1.0"}}
```
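Before querying the output table, you can check the run status in the job-details response programmatically. A minimal sketch; the field names (`lastRunDetail`, `jobRunStatus`) are taken from the sample response above:

```python
import json

def job_finished(details_json: str) -> bool:
    """Return True when the last run of the job succeeded.

    Field names (lastRunDetail, jobRunStatus) are taken from the
    sample job-details response in this example.
    """
    details = json.loads(details_json)
    last_run = details.get("lastRunDetail") or {}
    return last_run.get("jobRunStatus") == "SUCCEEDED"

# Trimmed-down version of the sample response:
sample = json.dumps({
    "jobId": "OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53",
    "jobStatus": "CREATED",
    "runCount": 1,
    "lastRunDetail": {"jobRunStatus": "SUCCEEDED", "errorMessage": None},
})
ready = job_finished(sample)
```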
Note:
Make a note of the jobId and outputData name in the job response. You will need these to query the output table and view the details of data bias detected for the sensitive features defined in the job request.
4: Connect to the Database to Access the Output Table
Connect to the database using the schema details from the job request:

- `inputSchemaName`: OMLUSER
- `outputSchemaName`: OMLUSER
- `outputData`: The output data table. In this example, the name is adultbias_tab.
- Run the following SQL query to view the records in the output table:

```
select * from OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53_ADULTBIAS_TAB;
```

In this example, `OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53` is the job ID, and `ADULTBIAS_TAB` is the output table name.
Here is the data bias result for GENDER, passed for the parameter sensitiveFeatures, and MARITAL_STATUS, passed for the parameter strata:

```
{
  "metric": [
    {"group_a": "MALE", "CI": 0.33841, "SP": 0.19628},
    {"group_a": "FEMALE", "CI": -0.33841, "SP": -0.19628}
  ],
  "cdd": [
    {
      "strata": "MARITAL_STATUS",
      "result": [
        {
          "group_a": "MALE",
          "CDD": 0.092269,
          "detail": [
            {"subgroup": "MARRIED-CIV-SPOUSE", "DD": -0.0036665},
            {"subgroup": "NEVER-MARRIED", "DD": 0.11335},
            {"subgroup": "DIVORCED", "DD": 0.23977},
            {"subgroup": "SEPARATED", "DD": 0.38267},
            {"subgroup": "WIDOWED", "DD": 0.31675},
            {"subgroup": "MARRIED-SPOUSE-ABSENT", "DD": 0.18168},
            {"subgroup": "", "DD": 0.015385}
          ]
        },
        {
          "group_a": "FEMALE",
          "CDD": -0.092269,
          "detail": [
            {"subgroup": "MARRIED-CIV-SPOUSE", "DD": 0.0036665},
            {"subgroup": "NEVER-MARRIED", "DD": -0.11335},
            {"subgroup": "DIVORCED", "DD": -0.23977},
            {"subgroup": "SEPARATED", "DD": -0.38267},
            {"subgroup": "WIDOWED", "DD": -0.31675},
            {"subgroup": "MARRIED-SPOUSE-ABSENT", "DD": -0.18168},
            {"subgroup": "", "DD": -0.015385}
          ]
        }
      ]
    }
  ],
  "reweighing_matrix": [
    {
      "group_a": "MALE",
      "group_a_pos_weight": 0.78764,
      "non_group_a_pos_weight": 2.2,
      "group_a_neg_weight": 1.0935,
      "non_group_a_neg_weight": 0.85251
    },
    {
      "group_a": "FEMALE",
      "group_a_pos_weight": 2.2,
      "non_group_a_pos_weight": 0.78764,
      "group_a_neg_weight": 0.85251,
      "non_group_a_neg_weight": 1.0935
    }
  ]
}
```
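To see where numbers like CI and SP come from, the following sketch computes both metrics from group counts using their standard definitions. This is an assumption: OML's exact computation (for example, after binning) is not shown in this example, and the counts below are illustrative rather than the real Adult-dataset figures.

```python
def class_imbalance(n_group, n_rest):
    """CI = (n_group - n_rest) / (n_group + n_rest): positive when the
    group is over-represented, negative when under-represented."""
    return (n_group - n_rest) / (n_group + n_rest)

def statistical_parity(pos_group, n_group, pos_rest, n_rest):
    """SP = P(positive | group) - P(positive | rest): positive when the
    group receives the positive outcome (>50K) more often."""
    return pos_group / n_group - pos_rest / n_rest

# Illustrative counts (not the real Adult-dataset figures):
ci_male = class_imbalance(20700, 10240)
ci_female = class_imbalance(10240, 20700)
sp_male = statistical_parity(6400, 20700, 1130, 10240)

# Both metrics are antisymmetric between the two groups, which is why
# the MALE and FEMALE rows in the result carry equal and opposite values.
assert abs(ci_male + ci_female) < 1e-12
```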
Here is an analysis of the data bias results for GENDER and MARITAL_STATUS:
- The Class Imbalance (CI) and Statistical Parity (SP) computed for the attribute GENDER and for the group MALE are positive, while the CI and SP values for the group FEMALE are negative. This indicates that the disadvantaged group (female) has fewer data points than the advantaged group. During the machine learning training process, models try to minimize the overall error rate. As a result, a model tends to be good at predicting the majority classes and to perform poorly on the minority or disadvantaged classes.
- The Conditional Demographic Disparity (CDD) values for the groups MALE and FEMALE are 0.092269 and -0.092269, respectively. Note that the sum of the computed CDD values for the strata MARITAL_STATUS is zero. This indicates that even after taking MARITAL_STATUS into account, the data is still biased towards the MALE group (as the CDD is positive) and against the FEMALE group (as the CDD is negative).
- For GENDER, a reweighing matrix is computed. Reweighing is a method to mitigate data bias. It reduces the bias by assigning greater weights to instances in the disadvantaged groups and to instances with positive labels. Essentially, the classifier pays more attention to the instances with greater weights and prioritizes their correct classification. The goal is to ensure that the classifier does not favor the advantaged groups or perpetuate existing biases against disadvantaged groups. Here is how you can apply the reweighing matrix: some machine learning packages accept row or sample weights as a training parameter for certain models. For example, in the Oracle DBMS_DATA_MINING package, you can set ODMS_ROW_WEIGHT_COLUMN_NAME in Global Settings while training a generalized linear model (GLM). For classification algorithms that cannot incorporate row weights in the training process, the weighing matrices can serve as guidance for re-sampling the data.
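As a minimal sketch of that approach, the snippet below attaches the weights from the reweighing matrix in this example to training rows as a weight column, such as one referenced by ODMS_ROW_WEIGHT_COLUMN_NAME. The row format and column name are illustrative; take the actual weights from the reweighing_matrix in your own job output.

```python
# Weights taken from the reweighing_matrix in the sample result:
# group_a_pos_weight / group_a_neg_weight for each GENDER group.
weights = {
    ("MALE", ">50K"): 0.78764,
    ("MALE", "<=50K"): 1.0935,
    ("FEMALE", ">50K"): 2.2,
    ("FEMALE", "<=50K"): 0.85251,
}

# Illustrative training rows (column names follow the ADULT table):
rows = [
    {"GENDER": "MALE", "INCOME": ">50K"},
    {"GENDER": "FEMALE", "INCOME": "<=50K"},
    {"GENDER": "FEMALE", "INCOME": ">50K"},
]

# Add a weight column: the under-represented (FEMALE, >50K) combination
# gets the largest weight, so the classifier pays more attention to it.
for row in rows:
    row["ROW_WEIGHT"] = weights[(row["GENDER"], row["INCOME"])]
```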
The Adult dataset is biased, and the outcomes are dependent on the biased groups. That is why the group MALE has a higher positive income ratio than the disadvantaged group FEMALE. To mitigate the bias, the advantaged groups are assigned lower weights during training, and vice versa. In an ideal, unbiased dataset, the biased groups (MALE, WHITE, and so on) and the outcomes (>50K, <=50K) would be independent. In other words, an individual should be assigned an outcome regardless of their biased group; in this case, GENDER.