Requirements and Data Preparation

The Oracle MSET algorithm can detect early symptoms of failure, such as temperature anomalies and changes in vibration profiles.

To detect these early symptoms effectively, the data must meet certain criteria. The criteria are explained in greater detail later, but briefly: the sensor readings must be sequential. Timestamps are not essential, but the readings must be in strict chronological order and must all be numeric. Also, the training data should consist of sensor readings that are free of anomalies and within normal operating parameters.

These requirements mean that you might need to process the raw data from the sensors before feeding it into the anomaly detection model.
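For example, a typical preparation step sorts the readings, removes duplicate timestamps, and drops non-numeric fields. Here is a minimal sketch in Python using pandas; the file and column names are hypothetical:

import pandas as pd

# Hypothetical raw export; the file and column names are illustrative only.
raw = pd.read_csv("raw_sensor_export.csv", parse_dates=["timestamp"])

# Enforce strict chronological order and remove duplicate timestamps.
raw = raw.sort_values("timestamp").drop_duplicates(subset="timestamp")

# Keep the timestamp plus numeric columns only; categorical fields are
# not supported, so anything non-numeric must be dropped or encoded.
prepared = pd.concat(
    [raw["timestamp"], raw.select_dtypes(include="number")], axis=1)

# Map any Boolean readings to 1/0, as the service expects numeric values.
for col in raw.select_dtypes(include="bool").columns:
    prepared[col] = raw[col].astype(int)

# Write timestamps in ISO 8601 form (assuming the readings are in UTC).
prepared.to_csv("training_data.csv", index=False,
                date_format="%Y-%m-%dT%H:%M:%SZ")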

Service Requirements

To get valid results from the service, you must prepare proper training and testing data.

The training and testing data should contain only timestamps and other numeric attributes, typically sensor and signal readings. Categorical fields are not supported in the current version.

At a high level, the service has three major data quality requirements for the training data:

  • The training data should be free of anomalies and outliers. It should contain observations from normal operating conditions only.
  • The training data should cover all normal business scenarios and span the full range of values on every attribute.
  • The attributes in the data should be closely related or belong to the same system or asset. We recommend training separate models if the attributes come from different systems.

The detection data must have the same attributes as the training data and must come from the same system or asset. Unlike the training data, the detection data can contain anomalous data points.
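A quick way to confirm that a detection file carries the same attributes as the training file is a header comparison like this sketch (the file names are hypothetical):

import pandas as pd

train = pd.read_csv("training_data.csv", nrows=0)    # header only
detect = pd.read_csv("detection_data.csv", nrows=0)  # header only

# Detection data must carry exactly the same attributes as the training data.
missing = set(train.columns) - set(detect.columns)
extra = set(detect.columns) - set(train.columns)
if missing or extra:
    raise ValueError(f"attribute mismatch: missing={missing}, extra={extra}")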

Data Quality

Model training and testing data must represent values from multiple attributes, such as signals and sensors, recorded in chronological order.

To create a high-quality model, make sure that the data in your training set adheres to the following requirements.

Timestamps
A timestamp column is optional. However, if present, it must be the first column in the table.
  • The timestamp column must have the label "timestamp", all lowercase with no spaces.
  • The timestamps must be sorted in ascending order.
  • There must be no duplicate timestamps.
  • The timestamps can have variable frequency. For example, 50 observations in one hour and 200 observations in the next hour.
  • If there is no timestamp column, the data is assumed to be sorted sequentially by time.
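These timestamp rules can be verified before upload with a short check like the following sketch (the file name is hypothetical):

import pandas as pd

df = pd.read_csv("training_data.csv")

if "timestamp" in df.columns:
    # The label must be exactly "timestamp", and it must be the first column.
    assert df.columns[0] == "timestamp", "timestamp must be the first column"

    ts = pd.to_datetime(df["timestamp"])
    assert ts.is_monotonic_increasing, "timestamps must be in ascending order"
    assert ts.is_unique, "duplicate timestamps are not allowed"
# With no timestamp column, rows are assumed to already be in time order.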
Attributes
Each row of data is a single observation at the given timestamp.
  • The attribute value must be numeric. For Boolean values, use 1 for True and 0 for False.
  • Missing values are represented by null in JSON files and by an empty field in CSV files.
  • Each row must have at least one attribute that is not missing. That is, you cannot have a row that is only the timestamp.
  • The data should have at least three highly correlated attributes.
  • Each attribute name must be unique.
  • The number of attributes should be no more than 300.
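The attribute rules can be checked in a similar way; this sketch again assumes a hypothetical file name:

import csv
import pandas as pd

path = "training_data.csv"

# pandas renames duplicate CSV headers on read, so check the raw header.
with open(path, newline="") as f:
    header = next(csv.reader(f))
assert len(header) == len(set(header)), "attribute names must be unique"

df = pd.read_csv(path)
attrs = df.drop(columns=["timestamp"], errors="ignore")
assert len(attrs.columns) <= 300, "no more than 300 attributes"

# Every attribute must be numeric (Booleans should already be 1/0).
bad = [c for c in attrs.columns if not pd.api.types.is_numeric_dtype(attrs[c])]
assert not bad, f"non-numeric attributes: {bad}"

# Each row needs at least one attribute value that is not missing.
assert not attrs.isna().all(axis=1).any(), "found a row with only a timestamp"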
Training
To determine the number of rows that you should have in the training set, multiply the number of attributes by eight. You must have a minimum of 40 rows in the training set.
For example, if you have 100 sensors, then the number of rows is 100 × 8 = 800. If you have only 4 sensors, then 4 × 8 = 32 falls below the minimum, so you need 40 rows.
Detection
When using batch processing, the maximum number of data points in the batch is 30,000. The number of data points is the number of signals times the number of rows.
For example, if you have 50 sensors, then a maximum of 30,000 / 50 = 600 rows is allowed in a single batch.
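Both sizing rules reduce to simple arithmetic; a small Python helper like this sketch captures the training minimum and the per-batch row limit:

def required_training_rows(num_attributes: int) -> int:
    """Rows needed for training: 8 per attribute, with a floor of 40."""
    return max(8 * num_attributes, 40)

def max_batch_rows(num_attributes: int, max_points: int = 30_000) -> int:
    """Rows allowed in one detection batch: at most 30,000 data points."""
    return max_points // num_attributes

print(required_training_rows(100))  # 800
print(required_training_rows(4))    # 40
print(max_batch_rows(50))           # 600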
Other Considerations
If one or more attributes are added at some point in the future, the model must be retrained with the new attributes included in the training set.
During training, attributes that are determined to be flat signals, monotonic signals, low-correlated signals, or duplicate signals are automatically dropped by the Anomaly Detection Service. A dropped attribute can still be present in the detection data, but it is ignored.
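The service applies its own internal criteria when dropping signals, but you can get a rough preview of obviously flat or monotonic attributes with a sketch like this (the exact thresholds the service uses are not published, and the file name is hypothetical):

import pandas as pd

df = pd.read_csv("training_data.csv")
attrs = df.drop(columns=["timestamp"], errors="ignore")

for col in attrs.columns:
    s = attrs[col].dropna()
    if s.nunique() <= 1:
        print(f"{col}: flat signal, likely to be dropped")
    elif s.is_monotonic_increasing or s.is_monotonic_decreasing:
        print(f"{col}: monotonic signal, likely to be dropped")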

Data Schema

Anomaly Detection Service accepts two data formats: CSV and JSON.

For CSV files, each column represents one sensor's data, and each row holds the values from every sensor at a particular time.

Timestamp values must be in ISO 8601 format. Use as precise a time as possible to avoid duplicates in the training data.
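For reference, a Python datetime can be rendered in this format as follows (a minimal sketch, assuming the reading is in UTC):

from datetime import datetime, timezone

# Render a reading time in the ISO 8601 form used in the examples below.
t = datetime(2020, 7, 13, 14, 3, 46, tzinfo=timezone.utc)
print(t.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2020-07-13T14:03:46Z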

CSV-formatted data should have comma-separated lines, with the first line as the header and the remaining lines as data. The first column is the timestamp column. Here is an example of CSV-formatted data:

timestamp,sensor1,sensor2,sensor3,sensor4,sensor5
2020-07-13T14:03:46Z,,0.6459,-0.0016,-0.6792,0
2020-07-13T14:04:46Z,0.1756,-0.5364,-0.1524,-0.6792,1
2020-07-13T14:05:46Z,0.4132,-0.029,,0.679,0

Note:

The CSV file must not have any blank lines, including the last line.

Here is the same data, except in JSON format:

{
    "requestType": "INLINE",
    "signalNames": ["sensor1", "sensor2", "sensor3", "sensor4", "sensor5"],
    "data": [{
            "timestamp": "2020-07-13T14:03:46Z",
            "values": [null, 0.6459, -0.0016, -0.6792, 0]
        },
        {
            "timestamp": "2020-07-13T14:04:46Z",
            "values": [0.1756, -0.5364, -0.1524, -0.6792, 1]
        },
        {
            "timestamp": "2020-07-13T14:05:46Z",
            "values": [0.4132, -0.029, null, 0.679, 0]
        }
    ]
}
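If your data is already in CSV form, converting it to the inline JSON format is mechanical. Here is a minimal sketch using only the Python standard library; the file name is hypothetical, and the envelope fields match the example above:

import csv
import json

with open("training_data.csv", newline="") as f:
    rows = list(csv.reader(f))

header, records = rows[0], rows[1:]
payload = {
    "requestType": "INLINE",
    "signalNames": header[1:],  # every column after the timestamp
    "data": [
        {
            "timestamp": r[0],
            # Empty CSV fields become null; everything else becomes a number.
            "values": [float(v) if v != "" else None for v in r[1:]],
        }
        for r in records
    ],
}

print(json.dumps(payload, indent=4))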