Training and Testing Data Requirements

To use the service, prepare suitable training and testing data to build the model and test it.

The training and testing data can contain only timestamps and numeric attributes, which typically represent sensor or signal readings.

At a high level, this algorithm has three major data quality requirements on the training data:

  • The training data has to be anomaly free (without outliers) and contain only observations made under normal business conditions.
  • The training data has to cover all the normal business scenarios, including the full value range of every attribute.
  • The attributes in the data have to be well related or belong to the same system or asset. For attributes from different systems, we suggest training separate models.

The detection data has to have the same attributes, from the same system or asset, as the training data. It can contain anomalous data points.

Data Quantity and Quality Requirements

In general, the training and testing data have to represent values from multiple attributes (such as signals or sensors) recorded at timestamps in chronological order, where:

  • The timestamp is the first column, followed by the other numeric attributes, signals, and sensors.
  • Each row represents one observation of those attributes, signals, and sensors at the given timestamp.

These requirements ensure that your training is successful, and that the trained model is of high quality:

Timestamp
  • The timestamp column is optional. Either provide a timestamp for every row or provide none at all.
    • If a timestamp column is provided, it must be the first column and be named timestamp (all lowercase, without any spaces).
    • The timestamps must be in strictly increasing order, with no duplicates.
    • The timestamps can have different frequencies. For example, 50 observations in one hour and 200 observations in the next hour.
  • If no timestamp is given, the data is assumed to be sorted sequentially by time.
Attribute
  • The attribute values must be of a numeric data type. Missing values are allowed and are represented as null.
  • No attribute, signal, or sensor column can consist entirely of missing values.
  • The data has to have at least three highly correlated attributes.
  • The attribute names are required to be unique. The total number of attributes can't be more than 256.
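
The timestamp and attribute rules above can be pre-checked locally before any data is uploaded. The following is a minimal sketch, not part of the service: check_table, parse_timestamp, and MISSING are hypothetical names, the input is assumed to be CSV text, and both an empty field and the literal null are treated as missing values. It also assumes that all timestamps use the same form (all with a time zone or all without). It does not test the requirement for three highly correlated attributes.

```python
import csv
import io
from datetime import datetime

MAX_ATTRIBUTES = 256          # attribute limit stated in the requirements
MISSING = {"", "null"}        # assumption: both spellings mean "missing"


def parse_timestamp(text):
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def check_table(csv_text):
    """Return a list of problems found in the CSV text (empty list = none found)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    problems = []

    has_timestamp = header[0] == "timestamp"
    if "timestamp" in header[1:]:
        problems.append("timestamp must be the first column")
    names = header[1:] if has_timestamp else header
    if len(set(names)) != len(names):
        problems.append("attribute names must be unique")
    if len(names) > MAX_ATTRIBUTES:
        problems.append("too many attributes")

    previous = None
    seen_value = [False] * len(names)
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields")
            continue
        if has_timestamp:
            current = parse_timestamp(row[0])
            if previous is not None and current <= previous:
                problems.append(f"line {line_no}: timestamp not strictly increasing")
            previous = current
        values = row[1:] if has_timestamp else row
        for i, cell in enumerate(values):
            if cell.strip() in MISSING:
                continue
            try:
                float(cell)
                seen_value[i] = True
            except ValueError:
                problems.append(f"line {line_no}: {names[i]} is not numeric")

    for name, seen in zip(names, seen_value):
        if not seen:
            problems.append(f"{name}: every value is missing")
    return problems
```

An empty result only means that none of these particular checks failed; the service may apply further validation of its own.
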
Training Phase

The number of observations (timestamps) in the training data must be at least 8 x the number of attributes, or 40, whichever is greater.

For example, with 100 sensors the minimum number of rows required is Max(8 x 100, 40) = 800. With four sensors, the minimum is Max(8 x 4, 40) = 40.
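
The minimum row count can be computed directly from the rule above. This is a small sketch; the function name min_training_rows is a hypothetical helper, not a service API.

```python
def min_training_rows(num_attributes: int) -> int:
    """Minimum number of training rows: 8 x attributes, or 40, whichever is greater."""
    return max(8 * num_attributes, 40)


print(min_training_rows(100))  # 800
print(min_training_rows(4))    # 40
```
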

Detection Phase
  • For a batch detection call, a detection payload can contain up to 256 signals and up to 100 timestamps (rows).

    Data points = Number of signals x Number of rows
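
As an illustration of the formula above, the sketch below computes the number of data points in a payload and splits a longer list of rows into pieces that stay within the row limit. The constants simply repeat the limits quoted in this section, and data_points and split_into_batches are hypothetical helper names. How the resulting batches are sent to the service is outside the scope of this sketch.

```python
MAX_SIGNALS = 256  # per-call signal limit quoted in this section
MAX_ROWS = 100     # per-call row (timestamp) limit quoted in this section


def data_points(num_signals, num_rows):
    """Data points = number of signals x number of rows."""
    return num_signals * num_rows


def split_into_batches(rows, max_rows=MAX_ROWS):
    """Split rows into consecutive chunks of at most max_rows rows each."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
```

For example, 250 rows of 5 signals would be split into batches of 100, 100, and 50 rows, and the first batch would hold 5 x 100 = 500 data points.
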

Other Recommendations

  • If a new attribute is added in the future, the model has to be trained again on data that includes the new attribute so that the attribute is considered during detection.

  • If, during training, an attribute is detected to be one of the following, it is automatically dropped:

    • flat signal

    • monotonic signal

    • low correlated signal

    • duplicate of another signal

    The attribute can be present in detection data, but is skipped during detection.

  • More data in the detection call is better as long as it is within the limits of the maximum allowed data.
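
To see in advance which attributes are likely to be dropped, a rough local screen can be run on the training data. The sketch below is only an approximation under stated assumptions: the exact criteria the service uses are not given here, screen_signals is a hypothetical name, monotonic is taken to mean never decreasing or never increasing, and duplicate means an identical list of values. The low-correlation case is left out because no threshold is stated.

```python
def screen_signals(columns):
    """columns maps a signal name to a list of numbers (None = missing).

    Returns a dict of signal name -> reason for the signals that look
    flat, monotonic, or identical to an earlier signal.
    """
    flagged = {}
    first_seen = {}  # tuple of values -> name of the first signal with them
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        pairs = list(zip(present, present[1:]))
        key = tuple(values)
        if len(set(present)) <= 1:
            flagged[name] = "flat"
        elif all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs):
            flagged[name] = "monotonic"
        elif key in first_seen:
            flagged[name] = f"duplicate of {first_seen[key]}"
        else:
            first_seen[key] = name
    return flagged
```

A signal that passes this screen can still be dropped by the service, for example because of low correlation with the other signals.
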

Data Format Requirements

The Anomaly Detection service supports the CSV and JSON file formats that contain data with timestamps and numeric attributes.

The service also supports data from ATP and InfluxDB, which have similar requirements in terms of number and format of timestamps, and number of numeric attributes.

Note

Timestamps must follow the ISO 8601 format. We recommend that you use the precise time up to seconds or milliseconds as in the file format examples.
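
One way to produce timestamps in this form is with Python's standard datetime module. The sketch below is an illustration only; to_iso8601 is a hypothetical helper, and it assumes that timestamps are written in UTC with a trailing Z, as in the file format examples.

```python
from datetime import datetime, timezone


def to_iso8601(moment, milliseconds=False):
    """Format a time-zone-aware datetime as an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        # A naive datetime would be silently interpreted as local time.
        raise ValueError("moment must carry time zone information")
    timespec = "milliseconds" if milliseconds else "seconds"
    text = moment.astimezone(timezone.utc).isoformat(timespec=timespec)
    return text.replace("+00:00", "Z")


print(to_iso8601(datetime(2020, 7, 13, 14, 3, 46, tzinfo=timezone.utc)))
# 2020-07-13T14:03:46Z
```
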

CSV Format

Each column represents the data of one sensor, and each row represents the values of every sensor at a particular timestamp.

CSV formatted data must have comma-separated lines, with the first line as the header and the remaining lines as data. When timestamps are specified, the Anomaly Detection service requires that the first column is named timestamp.

For example:

timestamp,sensor1,sensor2,sensor3,sensor4,sensor5
2020-07-13T14:03:46Z,,0.6459,-0.0016,-0.6792,0
2020-07-13T14:04:46Z,0.1756,-0.5364,-0.1524,-0.6792,1
2020-07-13T14:05:46Z,0.4132,-0.029,,0.679,0

Note

  • Missing values are permitted (as null), the data has to be sorted by timestamp, and Boolean flag values have to be converted to numeric values (0 or 1).

  • The file can't end with an empty line. The last line has to be an observation, like the other data lines.
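
A file that follows these notes can be generated with Python's standard csv module. The sketch below is a hedged illustration: to_csv_text and encode are hypothetical names, missing values are written as empty fields as in the example above, and the rows are assumed to be already sorted by timestamp.

```python
import csv
import io


def encode(value):
    """Convert one cell: None -> empty field, Boolean -> 0 or 1."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return int(value)
    return value


def to_csv_text(signal_names, rows):
    """rows is a list of (timestamp_string, values) pairs, sorted by time."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", *signal_names])
    for timestamp, values in rows:
        writer.writerow([timestamp, *(encode(v) for v in values)])
    # Drop the final line terminator so the text ends with an observation.
    return buffer.getvalue().rstrip("\n")
```

When saving the result, write the returned text as is (for example with open(path, "w", newline="")) so that no extra line ending is appended.
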

JSON Format

Similarly, JSON formatted data must also contain timestamps and numeric attributes only. Use the following keys:

{ "requestType": "INLINE",
  "signalNames": ["sensor1", "sensor2", "sensor3", "sensor4", "sensor5", "sensor6", "sensor7", "sensor8", "sensor9", "sensor10"],
  "data": [
      { "timestamp" : "2012-01-01T08:01:01.000Z", "values" : [1, 2.2, 3, 1, 2.2, 3, 1, 2.2, null, 4] },
      { "timestamp" : "2012-01-02T08:01:02.000Z", "values" : [1, 2.2, 3, 1, 2.2, 3, 1, 2.2, 3, null] }
  ]
}

Note

A missing value is coded as null, without quotes.
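
The same structure can be built in Python and serialized with the standard json module, which writes None as an unquoted null. This is a minimal sketch using only the keys shown in the example above; build_inline_payload is a hypothetical helper name, and Boolean values are converted to 0 or 1 on the assumption that the numeric-only rule also applies to JSON data.

```python
import json


def build_inline_payload(signal_names, rows):
    """rows is a list of (timestamp_string, values) pairs; None = missing."""
    data = []
    for timestamp, values in rows:
        if len(values) != len(signal_names):
            raise ValueError(f"{timestamp}: expected {len(signal_names)} values")
        # Convert Boolean flags to 0/1; leave numbers and None unchanged.
        cleaned = [int(v) if isinstance(v, bool) else v for v in values]
        data.append({"timestamp": timestamp, "values": cleaned})
    return {
        "requestType": "INLINE",
        "signalNames": list(signal_names),
        "data": data,
    }


payload = build_inline_payload(
    ["sensor1", "sensor2", "sensor3"],
    [("2012-01-01T08:01:01.000Z", [1, 2.2, None])],
)
# allow_nan=False rejects NaN and infinity, which are not valid JSON.
text = json.dumps(payload, allow_nan=False)
```
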

Data Quantity and Quality Requirements

Besides the format requirements, follow these data quantity and quality requirements:

  • Your data has to be strictly ordered by timestamp, with no duplicate timestamps.
  • The data must have at least three highly correlated attributes.
  • At least one attribute must have no missing values.
  • The number of observations (timestamps) in your training data has to be at least 8 * the number of attributes, or 40, whichever is greater.

Best Practices

  • The training dataset should cover all the business scenarios that are believed to be normal.
  • The training dataset should be stable and must not contain anomalies.
  • The attributes in the data should be well related or belong to the same system or asset. We suggest training separate models for attributes from different systems.