Data Monitoring Example - Data Drift Detection
Data drift occurs when data diverges from the original baseline data over time. Data drift can happen for a variety of reasons, such as a changing business environment, evolving user behavior and interest, data modifications from third-party sources, data quality issues, or issues with upstream data processing pipelines.
The key to accurately interpreting your models, and to ensuring that they can solve business problems, is understanding how data evolves over time. Data monitoring complements model monitoring: understanding changes in the data is critical to understanding changes in the efficacy of the models. The ability to quickly and reliably detect changes in your data's statistical properties helps ensure that your machine learning models continue to meet business objectives.
Data drift can be quantified using basic descriptive statistics. For numeric columns, these include:
- Mean
- Standard deviation
- Range (minimum, maximum)
- Number of nulls
- Number of unique values
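As a sketch of the idea, these statistics can be computed for a numeric column and compared between a baseline sample and a new sample. This is a plain-Python illustration of the statistics listed above, not part of OML Services:

```python
import math

def column_stats(values):
    """Compute the basic descriptive statistics used for drift detection.

    `values` is a list that may contain None to represent nulls.
    """
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    mean = sum(non_null) / n
    variance = sum((v - mean) ** 2 for v in non_null) / n
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "min": min(non_null),
        "max": max(non_null),
        "nulls": len(values) - n,
        "unique": len(set(non_null)),
    }

# Hypothetical voltage readings: baseline vs. a drifted new sample
baseline = [230.1, 231.4, None, 229.8, 230.5]
new = [240.2, 241.0, 239.5, None, None]
drift_in_mean = abs(column_stats(new)["mean"] - column_stats(baseline)["mean"])
```

A shift in any of these statistics between the baseline and new windows is a simple signal that the column's distribution has moved.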
Data Monitoring Workflow
- Obtain the access token
- Create a data monitoring job
- View the job details
- Update the job (optional)
- Enable the job to run
- View job output
1: Obtain the Access Token
You must obtain an authentication token by using your Oracle Machine Learning (OML) account credentials to send requests to OML Services. To authenticate and obtain a token, use cURL with the -d option to pass the credentials for your Oracle Machine Learning account to the Oracle Machine Learning user management cloud service REST endpoint /oauth2/v1/token. Run the following command to obtain the access token:

$ curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"grant_type":"password", "username":"<yourusername>", "password":"<yourpassword>"}' "<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"
- -X POST specifies a POST request when communicating with the HTTP server
- --header defines the headers required for the request (application/json)
- -d sends the username and password authentication credentials as data in a POST request to the HTTP server
- Content-Type defines the request format (JSON)
- Accept defines the response format (JSON)
- yourusername is the user name of an Oracle Machine Learning user with the default OML_DEVELOPER role
- yourpassword is the password for the user name
- oml-cloud-service-location-url is a URL containing the REST server portion of the Oracle Machine Learning User Management Cloud Service instance URL, which includes the tenancy ID and database name. You can obtain the URL from the Development tab in the Service Console of your Oracle Autonomous Database instance.
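The same token request can be issued programmatically. The sketch below builds the request with the Python standard library; the URL and credentials are placeholders, and the token field name in the commented line is the one the OML user management service is expected to return:

```python
import json
import urllib.request

def token_request(base_url, username, password):
    """Build (but do not send) the POST request for /oauth2/v1/token."""
    payload = {"grant_type": "password", "username": username, "password": password}
    return urllib.request.Request(
        url=f"{base_url}/omlusers/api/oauth2/v1/token",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )

req = token_request("https://example.adb.oraclecloud.com", "OMLUSER", "secret")
# Sending the request requires a live service:
# token = json.loads(urllib.request.urlopen(req).read())["accessToken"]
```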
2: Create a Data Monitoring Job
To create a data monitoring job, send a POST request to the /omlmod/v1/jobs endpoint in OML Services. Data monitoring runs on all features by default, or on a list of features specified by the user; a maximum of 250 features can be monitored. The data monitoring job is submitted asynchronously: it runs as scheduled, and the results can be retrieved when the job completes.
Note:
OML Services interacts with the DBMS_SCHEDULER to perform actions on jobs.
The details for data monitoring are specified in the jobProperties parameter, which includes:
- Data monitoring job name and type
- Autonomous Database service level
- Table where the data monitoring details will be saved
- Drift alert trigger
- Threshold
- Maximum number of runs
- Baseline and new data to be used
- Performance metric
- Start date (optional) and end date (optional), which correspond to a DATE or TIMESTAMP column in the table or view denoted by newData and contained in the timeColumn field. If the start and end dates are not specified, the earliest and latest dates and times in the timeColumn are used.
Note:
The command uses jq, a command-line JSON processor available on Linux and macOS systems, to extract relevant components from the response.
$ curl -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs" \
--header "Authorization: Bearer ${token}" \
--header 'Content-Type: application/json' \
--data '{
"jobSchedule": {
"jobStartDate": "2023-03-24T20:30:26Z", # job start date and time
"repeatInterval": "FREQ=HOURLY", # job frequency
"jobEndDate": "2023-03-24T23:30:26Z", # job end date and time
"maxRuns": "10" # max runs within the schedule
},
"jobProperties": {
"jobName": "HouseholdPowerDataMonitoring", # job name
"jobType": "DATA_MONITORING", # job type; DATA_MONITORING
"disableJob": false, # flag to disable the job at submission
"outputData": "householdPowerConsumption", # table where the job results will be saved in the format {jobID}_{outputData}
"baselineData": "HOUSEHOLD_POWER_BASE", # table/view containing baseline data
"newData": "HOUSEHOLD_POWER_NEW", # table/view with new data to compare against baseline
"inputSchemaName": "OMLUSER", # database schema that owns the input table/view
"outputSchemaName": "OMLUSER", # database schema that owns the output table/view
"jobDescription": "Monitor household power", # job description
"jobServiceLevel": "LOW", # Autonomous Database service level; either LOW, MEDIUM, or HIGH
"timeColumn": "DATES", # date or timestamp column in newData
"startDate": "2008-01-01T00:00:00Z", # the start date of the monitoring in the new data
"endDate": "2010-11-26T00:00:00Z", # the end date of the monitoring in the new data
"frequency": "Year", # the time window unit on which monitoring is performed on the new data
"threshold": 0.8, # threshold to trigger drift alert
"recompute": false, # flag to determine whether to replace the output table
"caseidColumn": null, # case identifier column in the baseline and new data
"anchorColumn": null, # anchor column for bivariate analysis
"featureList": [ # features to perform data monitoring on
"GLOBAL_ACTIVE_POWER",
"GLOBAL_REACTIVE_POWER",
"VOLTAGE",
"SUB_METERING_1",
"SUB_METERING_2",
"SUB_METERING_3"
]
}
}' | jq
The inline comments in the JSON payload above are for illustration only; remove them before running the command, since JSON does not support comments. The parameters in this command are:
Required Parameters
- jobName specifies the name of the submitted job.
- jobType specifies the type of job to be run; for data monitoring jobs, this is DATA_MONITORING. Note: For model monitoring jobs, this parameter is set to MODEL_MONITORING.
- outputData is the output data identifier. The results of the job will be written to a table named {jobId}_{outputData}.
- baselineData is a table or view name that contains the baseline data to monitor. At least 50 rows per period are required for data monitoring, otherwise analysis is skipped.
- newData is a table or view name with new data to be compared against the baseline. At least 100 rows per period are required for data monitoring, otherwise analysis is skipped.
Optional Parameters
- disableJob: A flag to disable the job at submission. If not set, the default is false and the job is enabled at submission.
- inputSchemaName: The database schema that owns the input table or view. If not specified, the input schema is the same as the username in the request token.
- outputSchemaName: The database schema that owns the output table. If not specified, the output schema is the same as the input schema.
- jobDescription: A description of the job.
- jobServiceLevel: The service level for the job, which can be LOW, MEDIUM, or HIGH.
- timeColumn: The name of a date or timestamp column in the new data. If not provided, the entire newData is treated as one period.
- startDate: The start date or timestamp of monitoring in the newData. The timeColumn field is mandatory for startDate. If startDate is not provided, its value depends on whether frequency is provided: if frequency is not provided, the earliest date in timeColumn is used as the startDate; if frequency is provided, the more recent of the earliest date in timeColumn and the starting date of the 10th most recent cycle is used as the startDate. Note: The supported date and time format is the ISO-8601 date and time format, for example 2022-05-12T02:33:16Z.
- endDate: The end date or timestamp of monitoring in the newData. The timeColumn field is mandatory for endDate. If endDate is not provided, the most recent date in timeColumn is used. The same ISO-8601 date and time format applies.
- frequency: Indicates the time window unit on which monitoring is performed on the new data. This can be "day", "week", "month", or "year". If not provided, the whole newData is used as a single time period.
- threshold: The threshold to trigger a drift alert.
- recompute: A flag that indicates whether to update already computed periods. The default is false, meaning only time periods not present in the output result table are computed.
- caseidColumn: A case identifier column in the baseline and new data. It improves the reproducibility of results.
- featureList: A list of features to monitor; a maximum of 250 features can be listed. Monitoring runs on all features if the feature list is not specified, and the features must be either numeric or categorical.
- anchorColumn: The anchor column in the baseline and new data used for bivariate analysis. The target column in supervised problems can be passed as the anchor column.
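A small validation helper can catch common mistakes in jobProperties before the request is sent. This is an illustrative sketch, not part of OML Services; the rules it checks (the 250-feature maximum, ISO-8601 dates, the allowed frequency values, and timeColumn being required for startDate/endDate) are the ones described above:

```python
from datetime import datetime

ALLOWED_FREQUENCIES = {"day", "week", "month", "year"}

def validate_job_properties(props):
    """Return a list of problems found in a jobProperties dict."""
    errors = []
    features = props.get("featureList") or []
    if len(features) > 250:
        errors.append("featureList exceeds the 250-feature maximum")
    freq = props.get("frequency")
    if freq is not None and freq.lower() not in ALLOWED_FREQUENCIES:
        errors.append(f"unsupported frequency: {freq}")
    for key in ("startDate", "endDate"):
        value = props.get(key)
        if value is not None:
            if props.get("timeColumn") is None:
                errors.append(f"{key} requires timeColumn to be set")
            try:
                datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            except ValueError:
                errors.append(f"{key} is not in ISO-8601 format: {value}")
    return errors

props = {"frequency": "Year", "startDate": "2008-01-01T00:00:00Z",
         "timeColumn": "DATES", "featureList": ["VOLTAGE"]}
```

Running the check against the example payload above returns no errors; an out-of-format date or a missing timeColumn would be reported before the REST call fails.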
Response of a data monitoring job creation request
When your job is submitted successfully, you will receive a response with a jobId. Note the jobId so you can use it in requests to retrieve job details or to perform any other actions on the job. Here is an example of a data monitoring job creation response:
{
"jobId": "OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1",
"links": [
{
"rel": "self",
"href": "<OML Service URL>/omlmod/v1/jobs/OML%247ABB6308_1664_4CB4_84B1_598A6EA599D1"
}
]
}
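The jobId can be extracted from the creation response programmatically. One detail worth noting, shown in this sketch: the $ in the job ID is percent-encoded as %24 in the self link:

```python
import json
from urllib.parse import unquote

# The creation response from the example above
response_body = '''{
  "jobId": "OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1",
  "links": [{"rel": "self",
             "href": "<OML Service URL>/omlmod/v1/jobs/OML%247ABB6308_1664_4CB4_84B1_598A6EA599D1"}]
}'''

job = json.loads(response_body)
job_id = job["jobId"]
# Decoding the last path segment of the href ("%24" -> "$") recovers the same ID.
decoded = unquote(job["links"][0]["href"].rsplit("/", 1)[-1])
```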
3: View Details of the Submitted Job
To view details of your submitted job, send a GET request to the /omlmod/v1/jobs/{jobId} endpoint. Here, jobId is the ID returned in response to the successful submission of your data monitoring job in the previous step.
Run the following command to view job details:
$ export jobid='OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1' # save the job ID to a single-quoted variable
$ curl -X GET "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${token}" | jq
Response of the Job Detail Request
Here is a response to the job details request. If your job has already run at least once, you will also see information about the last job run.
{
"jobId": "OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1",
"jobRequest": {
"jobSchedule": {
"jobStartDate": "2023-03-24T20:30:26Z",
"repeatInterval": "FREQ=HOURLY",
"jobEndDate": "2023-03-24T23:30:26Z",
"maxRuns": 3
},
"jobProperties": {
"jobType": "DATA_MONITORING",
"inputSchemaName": "OMLUSER",
"outputSchemaName": "OMLUSER",
"outputData": "householdPowerConsumption",
"jobDescription": "Monitor household power",
"jobName": "HouseholdPowerDataMonitoring",
"disableJob": false,
"jobServiceLevel": "LOW",
"baselineData": "HOUSEHOLD_POWER_BASE",
"newData": "HOUSEHOLD_POWER_NEW",
"timeColumn": "DATES",
"startDate": "2008-01-01T00:00:00Z",
"endDate": "2010-11-26T00:00:00Z",
"frequency": "Year",
"threshold": 0.8,
"recompute": false,
"caseidColumn": null,
"featureList": [
"GLOBAL_ACTIVE_POWER",
"GLOBAL_REACTIVE_POWER",
"VOLTAGE",
"SUB_METERING_1",
"SUB_METERING_2",
"SUB_METERING_3"
],
"anchorColumn": null
}
},
"jobStatus": "CREATED",
"dateSubmitted": "2023-03-24T20:23:31.53651Z",
"links": [
{
"rel": "self",
"href": "<OML Service URL>/omlmod/v1/jobs/OML%247ABB6308_1664_4CB4_84B1_598A6EA599D1"
}
],
"jobFlags": [],
"state": "SCHEDULED",
"enabled": true,
"runCount": 0,
"nextRunDate": "2023-03-24T20:30:26Z"
}
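A quick check on the details response tells you whether the job is still scheduled, how many of its runs have completed, and when the next run is due. This sketch works over a trimmed-down copy of the example response above:

```python
import json

details = json.loads('''{
  "jobStatus": "CREATED", "state": "SCHEDULED", "enabled": true,
  "runCount": 0, "nextRunDate": "2023-03-24T20:30:26Z",
  "jobRequest": {"jobSchedule": {"maxRuns": 3}}
}''')

max_runs = details["jobRequest"]["jobSchedule"]["maxRuns"]
finished = details["runCount"] >= max_runs  # all scheduled runs completed?
status_line = (f"state={details['state']} enabled={details['enabled']} "
               f"runs={details['runCount']}/{max_runs}")
```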
4: Edit a Data Monitoring Job (optional)
This is an optional task. After you submit an asynchronous job, you have the option to update it. To update a job, send a POST request to the /omlmod/v1/jobs/{jobId} endpoint with the updated options in the updateProperties parameter.
Here is an example of a POST request to update your data monitoring job. The following parameters are updated under updateProperties:
- startDate
- endDate
- threshold
- featureList
$ curl -i -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}" \
--header "Authorization: Bearer ${token}" \
--header 'Content-Type: application/json' \
--data '{
"updateProperties": {
"startDate": "2022-01-01T19:40:24Z",
"endDate": "2023-01-01T19:20:24Z",
"threshold": 0.80,
"featureList": ["GLOBAL_ACTIVE_POWER", "GLOBAL_REACTIVE_POWER", "VOLTAGE"]
}
}'
In this example, some of the parameters are updated to differ from the original parameters set in the initial request. When you successfully submit your job, you will receive a 204 response with no body.
5: Perform an Action on a Data Monitoring Job (Optional)
OML Services interacts with the DBMS_SCHEDULER to perform actions on jobs. There are four options for actions that you can send to the /omlmod/v1/jobs/{jobid}/action
endpoint:
- DISABLE: Disables the job at submission. The force property can be used with this action to forcefully interrupt a running job.
- ENABLE: Enables a job. After a disabled job is enabled, the scheduler begins to run it automatically according to its schedule.
- RUN: Runs the job immediately, either to test it or to run it outside of its schedule.
- STOP: Stops a job that is currently running.
Note:
Jobs can be disabled at submission by setting the disableJob flag to true.
Once your job has been successfully submitted, its state is set to ENABLED by default. This means that it will run as per the schedule you specified when submitting the job unless its state is changed to another state such as DISABLED. To change its state, send a request to the /omlmod/v1/jobs/{jobid}/action
endpoint.
$ curl -i -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}/action" \
--header "Authorization: Bearer ${token}" \
--header 'Content-Type: application/json' \
--data '{
"action": "DISABLE"
}'
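The same pattern works for any of the four actions. As an illustrative sketch mirroring the curl call above, a small helper can guard against typos in the action name; the shape of the force field here is an assumption based on the DISABLE description above:

```python
import json

VALID_ACTIONS = {"DISABLE", "ENABLE", "RUN", "STOP"}

def action_payload(action, force=False):
    """Build the JSON body for POST /omlmod/v1/jobs/{jobid}/action."""
    action = action.upper()
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    body = {"action": action}
    if force and action == "DISABLE":  # force is described only for DISABLE
        body["force"] = True
    return json.dumps(body)
```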
6: View the Data Monitoring Job Output
Once your job has run, either on its own schedule or through the RUN action, you can view its output in the table you specified in your job request with the outputData parameter. The full name of the table is {jobid}_{outputData}. You can check whether your job is complete by sending a request to view its details, as shown in step 3.
Note:
If your job has run at least once, you should see the lastRunDetail field with information about that run.
Here is an example query of the output table associated with this example. After you run the query, scroll through the output to verify that there is information for the baseline time period and the newData time period for each dataset feature being monitored for drift. Many columns may be empty in the baseline rows, as data monitoring is performed on the new data, not the baseline data.
%sql
SELECT START_TIME, END_TIME, IS_BASELINE, THRESHOLD, HAS_DRIFT, round(DRIFT, 3),
FEATURE_NAME, ROUND(IMPORTANCE_VALUE, 3)
FROM OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1_householdPowerConsumption
ORDER BY FEATURE_NAME, IS_BASELINE DESC
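If you pull the queried rows into a client, summarizing which features drifted in any period is straightforward. A sketch, with hypothetical rows whose keys follow the columns selected in the query above:

```python
# Each dict mirrors a row of the output table from the query above.
rows = [
    {"FEATURE_NAME": "VOLTAGE", "IS_BASELINE": 1, "HAS_DRIFT": None, "DRIFT": None},
    {"FEATURE_NAME": "VOLTAGE", "IS_BASELINE": 0, "HAS_DRIFT": 1, "DRIFT": 0.91},
    {"FEATURE_NAME": "SUB_METERING_1", "IS_BASELINE": 0, "HAS_DRIFT": 0, "DRIFT": 0.42},
]

# Features flagged as drifted in any non-baseline period
drifted = sorted({r["FEATURE_NAME"] for r in rows
                  if not r["IS_BASELINE"] and r["HAS_DRIFT"]})
```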
7: Delete a Data Monitoring Job
To delete a previously submitted job, send a DELETE request with the jobId to the /omlmod/v1/jobs/{jobId} endpoint.
Here is an example of a DELETE request:
$ curl -X DELETE "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${token}" | jq
8: Recreate this Example (Optional)
To recreate the data monitoring example described here, follow these steps:
- Run this command in an R paragraph in an OML notebook to drop the tables, if they exist, and suppress warnings:

%r

options(warn=-1)
try(ore.drop(table="HOUSEHOLD_POWER_BASE"))
try(ore.drop(table="HOUSEHOLD_POWER_NEW"))
- Run the following command in an R paragraph to read and transform the data:

%r

test <- read.csv("https://objectstorage.us-sanjose-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/household_power_consumption.txt", sep=";")
test <- transform(test, Date = as.Date(Date, format = "%d/%m/%Y"))
test <- transform(test, Global_active_power = as.numeric(Global_active_power))
test <- transform(test, Global_reactive_power = as.numeric(Global_reactive_power))
test <- transform(test, Voltage = as.numeric(Voltage))
test <- transform(test, Global_intensity = as.numeric(Global_intensity))
test <- transform(test, Sub_metering_1 = as.numeric(Sub_metering_1))
test <- transform(test, Sub_metering_2 = as.numeric(Sub_metering_2))
test <- transform(test, Sub_metering_3 = as.numeric(Sub_metering_3))
colnames(test) <- c("DATES", "TIMES", "GLOBAL_ACTIVE_POWER", "GLOBAL_REACTIVE_POWER", "VOLTAGE", "GLOBAL_INTENSITY", "SUB_METERING_1", "SUB_METERING_2", "SUB_METERING_3")
- Now create the baseline data HOUSEHOLD_POWER_BASE and new data HOUSEHOLD_POWER_NEW. For this, run this R script:

%r

test_base <- test[test$DATES < "2008-01-01",]
test_new <- test[test$DATES > "2007-12-31",]

# Create OML proxy objects
ore.create(test_base, table="HOUSEHOLD_POWER_BASE")
ore.create(test_new, table="HOUSEHOLD_POWER_NEW")
- To view the baseline data HOUSEHOLD_POWER_BASE, run this SQL command in a SQL paragraph in the notebook:

%sql

SELECT * FROM HOUSEHOLD_POWER_BASE FETCH FIRST 5 ROWS ONLY;
- View the new data HOUSEHOLD_POWER_NEW by running this SQL command:

%sql

SELECT * FROM HOUSEHOLD_POWER_NEW FETCH FIRST 5 ROWS ONLY;