Data Bias Detection - Example
Bias in data can occur when some elements of the data are overrepresented or overweighted. The Data Bias Detector provides quantitative measures of data bias without predefining a threshold for what is considered bias. Data bias assessment depends on the distinctive features of the data and the particular problem that is being addressed. You must set your own acceptable bias levels based on the goals you want to achieve.
Data Bias Detection Workflow
- Obtain the access token.
- Create and run a data bias detection job.
- View the details of the data bias job.
- Query the output table to view the data bias details detected for the sensitive features.
1: Obtain the Access Token
You must obtain an authentication token by using your Oracle Machine Learning (OML) account credentials to send requests to OML Services. To authenticate and obtain a token, use cURL with the -d option to pass the credentials for your Oracle Machine Learning account to the Oracle Machine Learning User Management Cloud Service REST endpoint /oauth2/v1/token. Run the following command to obtain the access token:
$ curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"
- -X POST specifies to use a POST request when communicating with the HTTP server.
- --header defines the headers required for the request (application/json).
- -d sends the username and password authentication credentials as data in a POST request to the HTTP server.
- Content-Type defines the request format (JSON).
- Accept defines the response format (JSON).
- yourusername is the user name of an Oracle Machine Learning user with the default OML_DEVELOPER role.
- yourpassword is the password for the user name.
- oml-cloud-service-location-url is a URL containing the REST server portion of the Oracle Machine Learning User Management Cloud Service instance URL that includes the tenancy ID and database name. You can obtain the URL from the Development tab in the Service Console of your Oracle Autonomous Database instance.
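Because the token is passed in an Authorization header on every subsequent request, it is convenient to capture it in an environment variable. Here is a minimal sketch; it assumes the jq utility is installed and that the endpoint returns the token in an accessToken field:

# Capture the access token in an environment variable for later requests.
# Assumes jq is installed and the response carries an "accessToken" field.
$ export token=$(curl -s -X POST --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' \
    "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token" | jq -r '.accessToken')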
2: Create a Data Bias Detection Job
The dataset used in this example, the Adult dataset (also known as the Census Income dataset), is a multivariate dataset. It contains census data of 30,940 adults. The prediction task associated with the dataset is to determine whether a person makes over 50K a year.
Table - Attributes of the Adult dataset
Attributes | Type | Description |
---|---|---|
Income | Binary | >50K or <=50K. Note: This attribute is used as the outcome for Classification models. |
Age | Continuous | Range 17 to 90 years |
Gender | Binary | Male, Female |
Marital Status | Categorical | Married-civ-spouse, Never-married, Divorced, Separated, Widowed, Married-spouse-absent |
To create a data bias detection job, send a POST request to the /omlmod/v1/jobs endpoint in OML Services.
Note: OML Services interacts with DBMS_SCHEDULER to perform actions on jobs.
Here is an example of a data bias detection job request:
curl -v -X POST <oml-cloud-service-location-url>/omlmod/v1/jobs \
-H "Content-Type: application/json" -H "accept: application/json" -d \
'{"jobProperties":{
"jobName":"jobNametest",
"jobType":"DATA_BIAS",
"jobServiceLevel":"MEDIUM",
"inputSchemaName":"OMLUSER",
"outputSchemaName":"OMLUSER",
"outputData":"adultbias_tab",
"jobDescription":"Data_Bias job, specify all parameters",
"inputData":"ADULT",
"sensitiveFeatures":["\"GENDER\""],
"strata":["\"MARITAL_STATUS\""],
"outcomeColumn":"INCOME",
"positiveOutcome":">50K",
"categoricalBinNum":6,
"numericalBinNum":10}}' \
-H 'Authorization:Bearer <token>'
Response of the data bias detection job creation request:
"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","links":[{"rel":"self","href":"http://<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}]}
The parameters used in this command are:
- jobName: The name of the OML job. In this example, the name is jobNametest.
- jobType: Specifies the type of job to be run. It is set to DATA_BIAS for data bias jobs.
- jobServiceLevel: MEDIUM
- inputSchemaName: OMLUSER
- outputSchemaName: OMLUSER
- outputData: The name of the output data table. In this example, the output data table name is adultbias_tab.
- jobDescription: A description of the Data_Bias job.
- inputData: The name of the input data table. In this example, the table name is ADULT.
- sensitiveFeatures: A list of features on which data bias detection and mitigation is performed. By default, up to 250 features can be monitored for data bias detection; if there are more than 250 features, the job fails with an error. The features can be either numeric or categorical. In this example, the attribute passed as a sensitive feature is GENDER. Note: Text, Nested, and Date data types are not supported in this release.
- strata: An array of strata names used to calculate Conditional Demographic Disparity (CDD), which mitigates the impact of data bias from confounding variables by conditioning on a third variable, the strata. In this example, the name provided for strata is MARITAL_STATUS.
- outcomeColumn: The name of the feature in the input data that is the outcome of training a machine learning model. The outcome must be either numeric or categorical. In this example, it is INCOME.
- positiveOutcome: A value that is in favor of a specific group in a dataset. It essentially indicates a positive outcome for that group. In this example, the positive outcome value is >50K.
- categoricalBinNum: Indicates whether to perform binning on categorical features. In this example, the number of bins is set to 6.
- numericalBinNum: Indicates whether to perform binning on numerical features. In this example, the number of bins is set to the default value 10.
3: View Details of the Submitted Job
Run the following command to view job details:
curl -v -X GET <oml-cloud-service-location-url>/omlmod/v1/jobs/'OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53'
-H "Content-Type: application/json" -H 'Authorization:Bearer <token>'
In this command:
- <token> is the access token obtained through the Authorization API. You can store it in an environment variable such as $token.
- OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53 is the job ID.
Response of the Job Detail Request
Here is the response to the job details request. If your job has already run, the response includes information about the last job run.
{"jobId":"OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53","jobRequest":{"jobSchedule":null,"jobProperties":{"jobType":"DATA_BIAS","inputSchemaName":"OMLUSER","outputSchemaName":"OMLUSER","outputData":"adultbias_tab","jobDescription":"Data_Bias
job test case400,specify all
parameters","jobName":"jobNametest","disableJob":false,"jobServiceLevel":"MEDIUM","inputData":"ADULT","sensitiveFeatures":["\"GENDER\""],"strata":["\"MARITAL_STATUS\""],"outcomeColumn":"INCOME","outcomeThreshold":null,"positiveOutcome":">50K","replaceResultTable":null,"pairwiseMode":null,"categoricalBinNum":6,"numericalBinNum":10}},"jobStatus":"CREATED","dateSubmitted":"2024-08-06T08:20:05.688706Z","links":[{"rel":"self","href":"http:<oml-cloud-service-location-url>/omlmod/v1/jobs/OML%2453D60B34_A275_4B2B_831C_2C8AE40BCB53"}],"jobFlags":[],"state":"SUCCEEDED","enabled":false,"lastStartDate":"2024-08-06T08:20:05.837534Z","runCount":1,"lastRunDetail":{"jobRunStatus":"SUCCEEDED","errorMessage":nul*
Connection #0 to host <oml-cloud-service-location-url> left intact
l,"requestedStartDate":"2024-08-06T08:20:05.752235Z","actualStartDate":"2024-08-06T08:20:05.837615Z","duration":"0
0:0:1.0"}}
Note: Make a note of the jobId and the outputData name in the job response. You will need these to query the output table and view the details of the data bias detected for the sensitive features defined in the job request.
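Because the job runs asynchronously, you can poll the job details endpoint until the run completes before querying the output table. Here is a minimal polling sketch; it assumes the access token is stored in the $token environment variable and that the jq utility is installed:

# Poll the job until the last run reports SUCCEEDED or FAILED.
# Assumes $token holds the access token and jq is installed.
while true; do
  status=$(curl -s -X GET <oml-cloud-service-location-url>/omlmod/v1/jobs/'OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53' \
    -H "Content-Type: application/json" -H 'Authorization:Bearer '$token | jq -r '.lastRunDetail.jobRunStatus')
  echo "jobRunStatus: $status"
  # Stop polling once the run has finished.
  if [ "$status" = "SUCCEEDED" ] || [ "$status" = "FAILED" ]; then break; fi
  sleep 10
done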
4: Connect to the Database to Access the Output Table
Connect to the database and access the output table by using the schema and table details specified in the job request:
- inputSchemaName: OMLUSER
- outputSchemaName: OMLUSER
- outputData: The name of the output data table. In this example, it is adultbias_tab.
- Run the following SQL query to view the records in the output table:
select * from OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53_ADULTBIAS_TAB;
In this example, OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53 is the job ID and ADULTBIAS_TAB is the output table name.
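If you only want to count the number of records in the output table, you can run a COUNT query instead:

select count(*) from OML$53D60B34_A275_4B2B_831C_2C8AE40BCB53_ADULTBIAS_TAB;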
Here is the data bias result for GENDER, passed for the parameter sensitiveFeatures, and MARITAL_STATUS, passed for the parameter strata:

{
  "metric": [
    { "group_a": "MALE", "CI": 0.33841, "SP": 0.19628 },
    { "group_a": "FEMALE", "CI": -0.33841, "SP": -0.19628 }
  ],
  "cdd": [
    {
      "strata": "MARITAL_STATUS",
      "result": [
        {
          "group_a": "MALE",
          "CDD": 0.092269,
          "detail": [
            { "subgroup": "MARRIED-CIV-SPOUSE", "DD": -0.0036665 },
            { "subgroup": "NEVER-MARRIED", "DD": 0.11335 },
            { "subgroup": "DIVORCED", "DD": 0.23977 },
            { "subgroup": "SEPARATED", "DD": 0.38267 },
            { "subgroup": "WIDOWED", "DD": 0.31675 },
            { "subgroup": "MARRIED-SPOUSE-ABSENT", "DD": 0.18168 },
            { "subgroup": "", "DD": 0.015385 }
          ]
        },
        {
          "group_a": "FEMALE",
          "CDD": -0.092269,
          "detail": [
            { "subgroup": "MARRIED-CIV-SPOUSE", "DD": 0.0036665 },
            { "subgroup": "NEVER-MARRIED", "DD": -0.11335 },
            { "subgroup": "DIVORCED", "DD": -0.23977 },
            { "subgroup": "SEPARATED", "DD": -0.38267 },
            { "subgroup": "WIDOWED", "DD": -0.31675 },
            { "subgroup": "MARRIED-SPOUSE-ABSENT", "DD": -0.18168 },
            { "subgroup": "", "DD": -0.015385 }
          ]
        }
      ]
    }
  ],
  "reweighing_matrix": [
    {
      "group_a": "MALE",
      "group_a_pos_weight": 0.78764,
      "non_group_a_pos_weight": 2.2,
      "group_a_neg_weight": 1.0935,
      "non_group_a_neg_weight": 0.85251
    },
    {
      "group_a": "FEMALE",
      "group_a_pos_weight": 2.2,
      "non_group_a_pos_weight": 0.78764,
      "group_a_neg_weight": 0.85251,
      "non_group_a_neg_weight": 1.0935
    }
  ]
}
The following observations can be made from the data bias results for GENDER and MARITAL_STATUS:
- The Class Imbalance (CI) and Statistical Parity (SP) computed for the attribute GENDER and the group MALE are positive, while the CI and SP values for the group FEMALE are negative. This indicates that the disadvantaged group (female) has fewer data points than the advantaged group. During the machine learning training process, models try to minimize the overall error rate. As a result, a model tends to be good at predicting the majority classes and to perform poorly on the minority or disadvantaged classes.
- The Conditional Demographic Disparity (CDD) values for the groups MALE and FEMALE are 0.092269 and -0.092269 respectively. Note that the sum of the computed CDD for the strata MARITAL_STATUS is zero. This indicates that even after taking MARITAL_STATUS into account, the data is still biased towards the MALE group (as its CDD is positive) and against the FEMALE group (as its CDD is negative).
- For GENDER, a reweighing matrix is computed. Reweighing is a method to mitigate data bias: it reduces bias by assigning greater weights to instances in the disadvantaged groups and to instances with positive labels. Essentially, the classifier pays more attention to the instances with greater weights and prioritizes their correct classification. The goal is to ensure that the classifier does not favor the advantaged groups or perpetuate existing biases against disadvantaged groups. Here is how you can apply the reweighing matrix: some machine learning packages accept row or sample weights as a training parameter for certain models. For example, in the Oracle DBMS_DATA_MINING package, you can set ODMS_ROW_WEIGHT_COLUMN_NAME in Global Settings while training a generalized linear model (GLM), as shown in the sketch after this list. For Classification algorithms that cannot incorporate row weights in the training process, the weighing matrices can serve as guidance for resampling the data.
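Here is a minimal sketch of that approach in SQL and PL/SQL. The view name ADULT_WEIGHTED, the settings table GLM_SETTINGS, and the model name ADULT_GLM are illustrative, not part of this example; the weight values are taken from the reweighing_matrix in the job output above, and the input is the ADULT table with the GENDER and INCOME columns described earlier.

-- Attach a weight column derived from the reweighing matrix.
-- The weight values come from the reweighing_matrix in the job output.
CREATE OR REPLACE VIEW adult_weighted AS
SELECT a.*,
       CASE
         WHEN gender = 'FEMALE' AND income = '>50K'  THEN 2.2     -- disadvantaged group, positive outcome
         WHEN gender = 'MALE'   AND income = '>50K'  THEN 0.78764 -- advantaged group, positive outcome
         WHEN gender = 'FEMALE' AND income = '<=50K' THEN 0.85251 -- disadvantaged group, negative outcome
         ELSE 1.0935                                              -- advantaged group, negative outcome
       END AS row_weight
FROM adult a;

-- Settings table that points the GLM at the weight column.
CREATE TABLE glm_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO glm_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);
  INSERT INTO glm_settings VALUES
    (dbms_data_mining.odms_row_weight_column_name, 'ROW_WEIGHT');

  -- Train a GLM classification model that uses the row weights.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'ADULT_GLM',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'ADULT_WEIGHTED',
    case_id_column_name => NULL,
    target_column_name  => 'INCOME',
    settings_table_name => 'GLM_SETTINGS');
END;
/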
The Adult dataset is biased, and the outcomes are dependent on the biased groups. That is why the group MALE has a higher positive income ratio than the disadvantaged group FEMALE. To mitigate the bias, the advantaged groups are assigned lower weights during training, and vice versa. In an ideal scenario, when a dataset is unbiased, the groups (MALE, WHITE, and so on) and the outcomes (>50K, <=50K) are independent. In other words, an individual should be assigned an outcome regardless of their group; in this case, GENDER.