Data Monitoring Example - Data Drift Detection

Data drift occurs when data diverges from the original baseline data over time. Data drift can happen for a variety of reasons, such as a changing business environment, evolving user behavior and interest, data modifications from third-party sources, data quality issues, or issues with upstream data processing pipelines.

The key to accurately interpret your models and to ensure that the models are able to solve business problems is to understand how data evolves over time. Data monitoring is complementary to successful model monitoring, as understanding the changes in data is critical in understanding the changes in the efficacy of the models. The ability to quickly and reliably detect changes in your data's statistical properties help to ensure that your machine learning models are able to meet business objectives.

Data drift can be quantified using basic descriptive statistics. For numeric columns, these include:

  • Mean
  • Standard deviation
  • Range (minimum, maximum)
  • Number of nulls
For categorical columns, these include:
  • Number of unique values
  • Number of nulls

Data Monitoring Workflow

To monitor data through OML REST Services, follow these following steps:
  1. Obtain the access token
  2. Create a data monitoring job
  3. View the job details
  4. Update the job (optional)
  5. Enable the job to run
  6. View job output

1: Obtain the Access Token

You must obtain an authentication token by using your Oracle Machine Learning (OML) account credentials to send requests to OML Services. To authenticate and obtain a token, use cURL with the -d option to pass the credentials for your Oracle Machine Learning account against the Oracle Machine Learning user management cloud service REST endpoint /oauth2/v1/token. Run the following command to obtain the access token:

$ curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"grant_type":"password", "username":"'<yourusername>'", 
"password":"' <yourpassword>'"}'"<oml-cloud-service-location-url>/omlusers/api/oauth2/v1/token"
Here,
  • -X POST specifies to use a POST request when communicating with the HTTP server
  • -header defines the headers required for the request (application/json)
  • -d sends the username and password authentication credentials as data in a POST request to the HTTP server
  • Content-Type defines the response format (JSON)
  • Accept defines the response format (JSON)
  • yourusername is the user name of a Oracle Machine Learning user with the default OML_DEVELOPER role
  • yourpassword is the password for the user name
  • oml-cloud-service-location-url is a URL containing the REST server portion of the Oracle Machine Learning User Management Cloud Service instance URL that includes the tenancy ID and Autonomous AI Database name. You can obtain the URL from the Development tab in the Service Console of your Oracle Autonomous AI Database instance.

2: Create a Data Monitoring Job

To create a data monitoring job, send the following POST request to the /omlmod/v1/jobs endpoint in OML Services. Data monitoring will run on all features by default, or on a list of features specified by the user. There is a maximum of 250 features that can be monitored. The data monitoring job is submitted asynchronously. This means it will run as scheduled, and the results can be retrieved when the job completes.

Note:

OML Services interacts with the DBMS_SCHEDULER to perform actions on jobs.

The details for data monitoring are specified in jobProperties parameter, that includes:

  • Data monitoring job name and type
  • Autonomous AI Database service level
  • Table where the data monitoring details will be saved
  • Drift alert trigger
  • Threshold
  • Maximum number of runs
  • Baseline and new data to be used
  • Performance metric
  • Start date (optional ) and end date (optional ) correspond to a DATE or TIMESTAMP column in the table or view denoted by newData, and contained in the timeColumn field. If the start and end dates are not specified, the earliest and latest dates and times in the timeColumn are used.

Note:

The command uses jq, a command-line JSON processor available on Linux and Mac OS systems to extract relevant components from the response.
Here is an example of a data monitoring job request using these parameters:
$ curl -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs" \
     --header "Authorization: Bearer ${token}" \
     --header 'Content-Type: application/json' \
     --data '{
         "jobSchedule": {
             "jobStartDate": "2023-03-24T20:30:26Z",            # job start date and time 
             "repeatInterval": "FREQ=HOURLY",                   # job frequency
             "jobEndDate": "2023-03-24T23:30:26Z",              # job end date and time
             "maxRuns": "10"                                    # max runs within the schedule
         },
         "jobProperties": {
             "jobName": "HouseholdPowerDataMonitoring",         # job name
             "jobType": "DATA_MONITORING",                      # job type; DATA_MONITORING
             "disableJob": false,                               # flag to disable the job at submission
             "outputData": "householdPowerConsumption",         # table where the job results will be saved in the format {jobID}_{outputData}
             "baselineData": "HOUSEHOLD_POWER_BASE",            # table/view containing baseline data 
             "newData": "HOUSEHOLD_POWER_NEW",                  # table/view with new data to compare against baseline
             "inputSchemaName": "OMLUSER",                      # Autonomous AI Database schema that owns the input table/view
             "outputSchemaName": "OMLUSER",                     # Autonomous AI Database schema that owns the output table/view
             "jobDescription": "Monitor household power",       # job description
             "jobServiceLevel": "LOW",                          # Autonomous Autonomous AI Database service level; either LOW, MEDIUM, or HIGH
             "timeColumn": "DATES",                             # date or timestamp column in newData
             "startDate": "2008-01-01T00:00:00Z",               # the start date of the monitoring in the new data
             "endDate": "2010-11-26T00:00:00Z",                 # the end date of the monitoring in the new data
             "frequency": "Year",                               # the time window unit on which monitoring is performed on the new data
             "threshold": 0.8,                                  # threshold to trigger drift alert
             "recompute": false,                                # flag to determine whether to replace the output table
             "caseidColumn": null,                              # case identifier column in the baseline and new data
             "anchorColumn": null,                              # anchor column for bivariate analysis
             "featureList": [                                   # features to perform data monitoring on
                "GLOBAL_ACTIVE_POWER",
                "GLOBAL_REACTIVE_POWER",
                "VOLTAGE",
                "SUB_METERING_1",
                "SUB_METERING_2",
                "SUB_METERING_3"
            ]
         }
     }' | jq
     

The parameters in this command are:

Required Parameters

The required parameters are:
  • jobName specifies the name of the submitted job.
  • jobType specifies the type of job to be run.

    Note:

    For model monitoring jobs, this parameter is set to MODEL_MONITORING
  • outputData is the output data identifier. The results of the job will be written to a table named {jobId}_{ouputData}
  • baselineData is a table or view name that contains baseline data to monitor. At least 50 rows per period are required for model monitoring, otherwise analysis is skipped.
  • newData is a table or view name with new data to be compared against the baseline. At least 100 rows per period are required for data monitoring, otherwise analysis is skipped.

Optional Parameters

The optional parameters are:
  • disableJob: A flag to disable the job at submission. If not set, the default is false and the job is enabled at submission.
  • inputSchemaName: The Autonomous AI Database schema that owns the input table or view. If not specified, the input schema will be the same as the username in the request token.
  • outputSchemaName: The Autonomous AI Database schema that owns the output table. If not specified, the output schema will be the same as the input schema.
  • jobDescription: A description of the job.
  • jobServiceLevel: The service level for the job, which can be LOW, MEDIUM, or HIGH.
  • timeColumn: The name of a date or timestamp column in the new data. If not provided, the entire newData: Is treated as one period.
  • startDate: The start date or timestamp of monitoring in the newData column. The column timeColumn is mandatory for startDate. If startDate is not provided, then startDate depends on whether frequency is provided. If frequency is not provided, then the earliest date in timeColumn is used as the startDate. If both startDate and frequency are not provided, then the most recent of the earliest date in timeColumn and the starting date of the 10th most recent cycle is considered as the startDate.

    Note:

    The supported date and time format is the ISO-8601 date and time format. For example: 2022-05-12T02:33:16Z
  • endDate: The end date or timestamp of monitoring in the newData. The column timeColumn is mandatory for endDate. If endDate is not provided, then the most recent date in timeColumn will be used.

    Note:

    The supported date and time format is the ISO-8601 date and time format. For example: 2022-05-12T02:33:16Z
  • frequency: Indicates the time window unit on which the monitoring is performed on the new data. This can be "day", "week", "month", or "year". If not provided, the whole "new" data is used as a single time period.
  • threshold: The threshold to trigger a drift alert.
  • recompute: A flag to indicate whether to update the already computed periods. The default is False. It means only time periods not present in the output result table will be computed.
  • caseidColumn: A case identifier column in the baseline and new data. It improves the reproducibility of results.
  • featureList: A list of features to monitor. The default number of maximum features in the list is 250. Monitoring will run on all features if the feature list is not specified, and the features need to be either numeric or categorical.
  • anchorColumn: The anchor column in the baseline and new data used for bi-variate analysis. The target column in supervised problems can be passed as the anchor column.

Response of a data monitoring job creation request

When your job is submitted successfully, you will receive a response with a jobid. Note the jobId to use it in submit requests to retrieve job details or to perform any other actions on the job. Here is an example of a data monitoring job creation response:

{
  "jobId": "OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1",
  "links": [
    {
      "rel": "self",
      "href": "<OML Service URL>/omlmod/v1/jobs/OML%247ABB6308_1664_4CB4_84B1_598A6EA599D1"
    }
  ]
}

3: View Details of the Submitted Job

To view details of your submitted job, send a GET request to the /omlmod/v1/jobs/{jobID} endpoint. Here, jobId is the ID provided in response to the successful submission of your data monitoring job in the previous step.

Run the following command to view job details:

$ export jobid='OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1'  # save the job ID to a single-quoted variable

$ curl -X GET "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}"  \
       --header 'Accept: application/json' \
       --header 'Content-Type: application/json' \
       --header "Authorization: Bearer ${token}" | jq

Response of the Job Detail Request

Here is a response of the job details request. If your job has already run once earlier, you will see information returned about the last job run.

{
  "jobId": "OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1",
  "jobRequest": {
    "jobSchedule": {
      "jobStartDate": "2023-03-24T20:30:26Z",
      "repeatInterval": "FREQ=HOURLY",
      "jobEndDate": "2023-03-24T23:30:26Z",
      "maxRuns": 3
    },
    "jobProperties": {
      "jobType": "DATA_MONITORING",
      "inputSchemaName": "OMLUSER",
      "outputSchemaName": "OMLUSER",
      "outputData": "householdPowerConsumption",
      "jobDescription": "Monitor household power",
      "jobName": "HouseholdPowerDataMonitoring",
      "disableJob": false,
      "jobServiceLevel": "LOW",
      "baselineData": "HOUSEHOLD_POWER_BASE",
      "newData": "HOUSEHOLD_POWER_NEW",
      "timeColumn": "DATES",
      "startDate": "2008-01-01T00:00:00Z",
      "endDate": "2010-11-26T00:00:00Z",
      "frequency": "Year",
      "threshold": 0.8,
      "recompute": false,
      "caseidColumn": null,
      "featureList": [
        "GLOBAL_ACTIVE_POWER",
        "GLOBAL_REACTIVE_POWER",
        "VOLTAGE",
        "SUB_METERING_1",
        "SUB_METERING_2",
        "SUB_METERING_3"
      ],
      "anchorColumn": null
    }
  },
  "jobStatus": "CREATED",
  "dateSubmitted": "2023-03-24T20:23:31.53651Z",
  "links": [
    {
      "rel": "self",
      "href": "<OML Service URL>/omlmod/v1/jobs/OML%247ABB6308_1664_4CB4_84B1_598A6EA599D1"
    }
  ],
  "jobFlags": [],
  "state": "SCHEDULED",
  "enabled": true,
  "runCount": 0,
  "nextRunDate": "2023-03-24T20:30:26Z"
}

4: Edit a Data Monitoring Job (optional)

This is an optional task. After you submit an asynchronous job, you have the option to update your job. To update a job, send a POST request to the /omlmod/v1/jobs/{jobID} endpoint with the updated options in the updateProperties parameters.

Here is an example of a POST request to update your data monitoring job. Here, the following parameters are updated under updateProperties:

  • startDate
  • endDate
  • threshold
  • recompute
  • featureList
$ curl -i -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}" \
         --header "Authorization: Bearer ${token}" \
         --header 'Content-Type: application/json' \
         --data '{
            "updateProperties": {
                "startDate": "2022-01-01T19:40:24Z",
                "endDate":   "2023-01-01T19:20:24Z",
                "threshold":  0.80,
                "featureList": ["GLOBAL_ACTIVE_POWER", "GLOBAL_REACTIVE_POWER", "VOLTAGE"]
            }
         }'

In this example, some of the parameters are updated to differ from the original parameters set in the initial request. When you successfully submit your job, you will receive a 204 response with no body.

5: Perform an Action on a Data Monitoring Job (Optional)

OML Services interacts with the DBMS_SCHEDULER to perform actions on jobs. There are four options for actions that you can send to the /omlmod/v1/jobs/{jobid}/action endpoint:

  • DISABLE: Disables the job at submission. The force property can be used with this action to forcefully interrupt any running job.
  • ENABLE: Enables a job. After a disabled job is enabled, the scheduler begins to automatically run the job according to its schedule.
  • RUN: Runs the job immediately as a way to test a job, or run it outside of its schedule.
  • STOP: Stops a job that is running currently.

Note:

Jobs can be set to DISABLED at submission by setting the disableJob flag to true.

Once your job has been successfully submitted, its state is set to ENABLED by default. This means that it will run as per the schedule you specified when submitting the job unless its state is changed to another state such as DISABLED. To change its state, send a request to the /omlmod/v1/jobs/{jobid}/action endpoint.

Here is an example to change the job status to DISABLED. When you successfully submit your job you will receive a 204 response with no body.
$ curl -i -X POST "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}/action" \
       --header "Authorization: Bearer ${token}" \
       --header 'Content-Type: application/json' \
       --data '{
            "action": "DISABLE"
         }'

6: View the Data Monitoring Job Output

Once your job has run, either according to its schedule or by the RUN action, you can view its output in the table you specified in your job request with the outputData parameter. The full name of the table is {jobid}_{outputData}. You can check if your job is complete by sending a request to view its details, as done in .

Note:

If your job has run at least once, you should see the lastRunDetail field with information about that run.

Here is an example to query the output table associated with this example. After you run the query, scroll down the output table to view if there is information for the baseline time period and newdata time period for each of the dataset features being monitored for drift. Many of the columns may be empty in the baseline rows, as the data monitoring is done on the new data, not the baseline data.

%sql


SELECT START_TIME, END_TIME, IS_BASELINE, THRESHOLD, HAS_DRIFT, round(DRIFT, 3), 
       FEATURE_NAME, ROUND(IMPORTANCE_VALUE, 3) 
FROM OML$7ABB6308_1664_4CB4_84B1_598A6EA599D1_householdPowerConsumption
ORDER BY FEATURE_NAME, IS_BASELINE DESC

7: Delete a Data Monitoring Job

To delete a previously submitted job, send a DELETE request along with the jobId to the /omlmod/v1/jobs endpoint.

Here is an example of a DELETE request:

$ curl -X DELETE "<oml-cloud-service-location-url>/omlmod/v1/jobs/${jobid}"  \
       --header 'Accept: application/json' \
       --header 'Content-Type: application/json' \
       --header "Authorization: Bearer ${token}" | jq

8: Recreate this Example (Optional)

To recreate the data monitoring example described here, follow these steps:

  1. Run this command in a R paragraph in an OML notebook to drop any tables, if it exists, and suppress warnings:
    %r
    
    options(warn=-1)
    
    try(ore.drop(table="HOUSEHOLD_POWER_BASE"))
    try(ore.drop(table="HOUSEHOLD_POWER_NEW"))
    
  2. Run the following command in a R paragraph to read and transform data:
    %r
    
    test <- read.csv("https://objectstorage.us-sanjose-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/household_power_consumption.txt", sep=";")
    
    test <- transform(test, Date = as.Date(Date, format = "%d/%m/%Y"))
    test <- transform(test, Global_active_power = as.numeric(Global_active_power))
    test <- transform(test, Global_reactive_power = as.numeric(Global_reactive_power))
    test <- transform(test, Voltage = as.numeric(Voltage))
    test <- transform(test, Global_intensity = as.numeric(Global_intensity))
    test <- transform(test, Sub_metering_1 = as.numeric(Sub_metering_1))
    test <- transform(test, Sub_metering_2 = as.numeric(Sub_metering_2))
    test <- transform(test, Sub_metering_3 = as.numeric(Sub_metering_3))
    
    colnames(test) <- c("DATES", "TIMES", "GLOBAL_ACTIVE_POWER", "GLOBAL_REACTIVE_POWER", "VOLTAGE", "GLOBAL_INTENSITY", "SUB_METERING_1", "SUB_METERING_2", "SUB_METERING_3") 
  3. Now create the baseline data HOUSEHOLD_POWER_BASE and new data HOUSEHOLD_POWER_NEW. For this, run this R script:
    %r
    
    test_base <- test[test$DATES < "2008-01-01",]
    test_new <- test[test$DATES > "2007-12-31",]
    
    # Create OML proxy objects
    
    ore.create(test_base, table="HOUSEHOLD_POWER_BASE")
    ore.create(test_new, table="HOUSEHOLD_POWER_NEW")
  4. To view the baseline data HOUSEHOLD_POWER_BASE, run this SQL command in a SQL paragraph in the notebook:

    %sql
    
    SELECT * FROM HOUSEHOLD_POWER_BASE
    FETCH FIRST 5 ROWS ONLY;
  5. View the new data HOUSEHOLD_POWER_NEW by running this SQL command:
    %sql
    
    SELECT * FROM HOUSEHOLD_POWER_NEW
    FETCH FIRST 5 ROWS ONLY;