Baselines and Anomalies

All collected metric data has a baseline: the data pattern that is considered normal. Once you establish a baseline, you can use that pattern to look for anomalies, which are outliers within your data. An anomaly is a metric value that should grab your attention because it stands out from the expected pattern; it should not have happened.

Within APMCS a baseline for a particular metric is marked by a high and low value. This baseline is a value range that was established based on historic values collected over time, and within which data can be considered to be normal. Some metric data may fall outside of this range and, sometimes, these outliers are considered to be statistical anomalies.

The APMCS REST API provides a query which you can use to retrieve high and low baseline values for a particular metric, as well as anomalies (sometimes called anomalous periods) for that same metric. On this page the term baselines generally refers to these high and low baseline values and anomalies.

Using the historic values of your metric, APMCS determines the baselines. In baseline terminology, the historic values are referred to as training values, while the current data that you are evaluating for baselines are referred to as the evaluation values. For a particular period of evaluation values (for example, the last day of activity on a particular page), APMCS will retrieve and analyze the corresponding training values and, using a statistical engine, will create a data model based on detected trends and patterns. The more training data there is, the more reliable the baseline analysis will be for your evaluation period.

Though you may be retrieving (or, when in the UI, viewing) the metric data over longer aggregation periods, such as buckets of 15 minutes, 1 hour, or even longer, the baseline analysis is always done on the finest granularity of data, which is a 1-minute aggregation period. This allows the system to return consistent baseline results for any particular point in time, whether that point was retrieved as part of the last week or as part of the last hour.

Baseline retrieval is split into two categories:
  • On-Demand Baseline Retrieval
  • Scheduled Baseline Retrieval

On-Demand Baseline Retrieval

On-demand baseline retrieval refers to the process of the REST API user retrieving the APM metrics for a resource, such as a web page, and running them directly through the baseline analysis engine in real time. For example, a few weeks after the initial installation of a web application, the administrator may be interested in seeing the usage patterns. Does this week's usage pattern match recent historic usage? Is there anything unusual (an anomaly) about the page load spike that was observed early yesterday morning? What trends can be expected for the rest of the week? On-demand baseline analysis can answer these and other questions.

The /baselines Query

The REST API /baselines query is similar to the /timeSeries query described in the Time Series page. In fact, in order to calculate baselines, the APMCS server first retrieves the corresponding time series data for the training and evaluation periods. That same data is then used for the final analysis.

This example retrieves the baselines for the last 12 hours of activity on a particular page:
curl -i -X GET --insecure \
  -H 'X-USER-IDENTITY-DOMAIN-NAME: apm_testtenantx1' \
  -u 'apm_testtenantx1.emcsadmin:Welcome1!' \
  'https://slc02ovb:4443/serviceapi/apm.dataserver/api/v1/pages/95D52F790980945/baselines?since=2016-08-27T02:40:00.000Z&until=2016-08-27T14:40:00.000Z&aggregationPeriod=5&attributes=completedCount'
The parameters of the query are:
  • since and until - These are required parameters that specify the range of actual values that you want to evaluate. This time range does not include the training values that are retrieved internally to create a historic data model. The time range is much like the time range in any other REST API operation, described in Since and Until Time Range.
  • aggregationPeriod - Unlike in other queries, this is a required parameter and there is no default aggregation period value. This value is specified in units of 1 minute and must be less than the full time range specified between your since and until parameters. As noted earlier, even though your query's aggregation period may be longer than 1 minute, the internal analysis is done on the 1-minute data stored in the APMCS data repository. See the Time Series page for a more complete description of aggregation periods.
  • attributes - This required parameter is a comma-separated list of the metric attributes for which you want a baseline analysis. Current valid values are averageResponseTime and completedCount; a more complete set of values can be found by querying the /metadata resource (the Get metadata operation) with the Baselines type parameter and the correct resource type (for example /pages or /requests).
  • information - This optional parameter is a comma-separated list of informational values which are used to gather diagnostic information while the baseline analysis is being run. If set, the resulting JSON will include more data elements. Current valid informational values are:
    • analytics - return internal training and evaluation period segments and models that were used for the analysis.
    • timings - return timings of the different phases during the analysis.
    • settings - return system settings used to initiate the analysis.
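
The parameter rules above can be checked on the client before a request is sent. The following Python sketch is illustrative only: the host in BASE_URL and the helper names are assumptions made for this example, and the lower bound of 1 on aggregationPeriod is inferred from the 1-minute unit rather than stated by the API.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical base URL; substitute your own APMCS host and port.
BASE_URL = "https://apm.example.com:4443/serviceapi/apm.dataserver/api/v1"


def parse_utc(stamp):
    """Parse a timestamp such as 2016-08-27T02:40:00.000Z."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def build_baselines_url(page_id, since, until, aggregation_period,
                        attributes, information=()):
    """Build a /baselines URL for a page, checking required parameters first."""
    if not attributes:
        raise ValueError("attributes is required")
    range_minutes = (parse_utc(until) - parse_utc(since)).total_seconds() / 60
    if range_minutes <= 0:
        raise ValueError("until must be later than since")
    # aggregationPeriod is in minutes and must be shorter than the time range.
    # The lower bound of 1 is an assumption based on the 1-minute unit.
    if not 1 <= aggregation_period < range_minutes:
        raise ValueError("aggregationPeriod must be shorter than the time range")
    params = {
        "since": since,
        "until": until,
        "aggregationPeriod": aggregation_period,
        "attributes": ",".join(attributes),
    }
    if information:  # optional diagnostic values, e.g. ("timings",)
        params["information"] = ",".join(information)
    # Leave ':' and ',' unescaped so the URL reads like the curl example.
    query = urlencode(params, safe=":,")
    return f"{BASE_URL}/pages/{page_id}/baselines?{query}"


url = build_baselines_url(
    "95D52F790980945",
    "2016-08-27T02:40:00.000Z",
    "2016-08-27T14:40:00.000Z",
    5,
    ["completedCount"],
)
print(url)
```

The resulting URL can be passed to curl or to any HTTP client together with the identity domain header and credentials shown in the example above.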

Baseline Analysis

Internally, the server first determines the time range between your since and until parameters. Then, after working out the corresponding training period to use for that time range, the server retrieves the 1-minute data that comprises both the training and evaluation periods. This data is then analyzed, and each evaluation data point is marked with a baseline high and low value and with whether or not its own metric value is anomalous. Note that a metric value merely falling outside the range between the baseline high and low bands does not necessarily mark that value as an anomaly.

The more data that is available for the analysis, the more reliable the result. When, for example, there are more than a few days of training data, the system can start considering daily seasonality. At what hours did it detect certain trends? Does page load tend to increase before noon and then taper off?

When there is even more training data, at a certain point the system can augment the daily seasonality with weekly seasonality. What typical trends were detected on specific days of the week? A page may be idle on Saturday, but spike between the hours of 7 a.m. and noon on Monday and Friday.

Earlier we noted that all analysis is done using the finest granularity of data in the APMCS data repository. Since data is stored internally at 1-minute intervals, the baseline analysis is done on 1-minute metric data. Your /baselines query, however, may have an aggregation period that is longer than 1 minute. If, for example, you want to analyze the last 6 hours of data, you may choose to aggregate your results into buckets of 5 minutes (aggregationPeriod=5).

This aggregation period does not affect the baseline analysis. Rather, once all baseline values have been collected, the aggregation period is used to roll up those values so that your results are grouped into more manageable (and more intuitive) sizes. So, even though a particular 1-minute evaluation point cannot hold more than one anomaly, a single evaluation period with a longer aggregation period (say, 5 minutes) may contain multiple anomalies. Of course, you may always choose to retrieve results at the 1-minute aggregation period, whatever time range you specify.
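
The roll-up of anomalies can be pictured with a small Python sketch. It only illustrates how per-minute anomaly flags add up inside a bucket; the function name and the input flags are made up for this example, and the server's actual bucket alignment and its handling of the baseline high and low values are not described on this page.

```python
def roll_up_anomalies(high_flags, low_flags, aggregation_period):
    """Group per-minute anomaly flags into buckets of aggregation_period minutes.

    high_flags and low_flags hold one 0/1 value per 1-minute evaluation point.
    Returns (high_counts, low_counts) with one count per bucket.
    """
    if aggregation_period < 1:
        raise ValueError("aggregation_period must be at least 1 minute")
    if len(high_flags) != len(low_flags):
        raise ValueError("flag lists must have the same length")
    high_counts, low_counts = [], []
    for start in range(0, len(high_flags), aggregation_period):
        end = start + aggregation_period
        high_counts.append(sum(high_flags[start:end]))
        low_counts.append(sum(low_flags[start:end]))
    return high_counts, low_counts


# Ten hypothetical 1-minute points rolled up into two 5-minute buckets.
high = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
low = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(roll_up_anomalies(high, low, 5))  # ([2, 0], [0, 1])
```

Each count stays between 0 and the aggregation period, because a bucket of N minutes contains at most N one-minute points.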

The /baselines JSON Response

The JSON response of the query will contain the following:
  • timeSeries - The response contains the original time-series data object that was used to determine the baseline values. This object is identical to the time series object had the query not been a baselines query. For example, the timeSeries object returned for the query:
    /pages/95D52F790980945/baselines?since=2016-08-27T02:40:00.000Z&until=2016-08-27T14:40:00.000Z&aggregationPeriod=5&attributes=completedCount
    is identical to the object returned directly from the /timeSeries query itself:
    /pages/95D52F790980945/timeSeries?since=2016-08-27T02:40:00.000Z&until=2016-08-27T14:40:00.000Z&aggregationPeriod=5
    Refer to the Time Series page for an explanation of that returned object. You can consider the /baselines query as augmenting the /timeSeries query with more baseline values.
  • <attribute>Baselines - For each attribute listed in the comma-separated attributes list in the input REST API query, there will be one element called <attribute>Baselines. For example, for the completedCount attribute there will be a completedCountBaselines element. This element will contain all baseline-pertinent information that is relevant to the named attribute for your evaluation period. The returned values in each such element are all arrays, and each single element in an array corresponds to the same numbered element in the time (or formattedTime) array in the returned timeSeries object (described just above). This will be better demonstrated in the later example.
    • baselinesHigh - array of decimal values that lists the high baseline.
    • baselinesLow - array of decimal values that lists the low baseline.
    • anomalyHighCount - array of integer values listing the number of anomalies above the high value. Each value will be between 0 and the input aggregationPeriod value. Note again that a metric value merely falling outside the range between the baseline high and low bands does not necessarily mark that value as an anomaly. However, an anomalous value for a 1-minute evaluation point will always be above the baseline high mark or below the baseline low mark.
    • anomalyLowCount - array of integer values listing the number of anomalies below the low value. Each value will be between 0 and the input aggregationPeriod value. The note just above applies here too.
    • analytics - (optional) If the REST API query includes the &information=analytics parameter, then the JSON response will include an analytics object describing the internal training and evaluation periods used for the analysis. This object includes summary information as well as the specific data used for the analysis and the lower level details returned from the analysis. This can be used to help understand the baseline results that are returned from your REST API query. This is particularly useful in understanding the analysis as you move in time closer to the beginning of the historic data that is stored in the APMCS data repository.
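
Because every array in an <attribute>Baselines element lines up index by index with the time arrays in timeSeries, a client can zip them into one record per evaluation period. The Python sketch below assumes the response has already been parsed into a dictionary. The helper name and the three-point sample are made up for illustration, and the key names baselinesHigh and baselinesLow follow the example response later on this page.

```python
def baseline_rows(response, attribute):
    """Pair each timestamp with the values found at the same array index."""
    series = response["timeSeries"]
    baselines = response[f"{attribute}Baselines"]
    times = series["formattedTime"]
    columns = {
        "value": series[attribute],
        "low": baselines["baselinesLow"],
        "high": baselines["baselinesHigh"],
        "anomalyLowCount": baselines["anomalyLowCount"],
        "anomalyHighCount": baselines["anomalyHighCount"],
    }
    # Every array must have one element per time bucket.
    for name, values in columns.items():
        if len(values) != len(times):
            raise ValueError(f"{name}: {len(values)} elements, expected {len(times)}")
    return [
        {"time": stamp, **{name: values[i] for name, values in columns.items()}}
        for i, stamp in enumerate(times)
    ]


# Made-up three-point response, trimmed to the keys this sketch reads.
sample = {
    "timeSeries": {
        "formattedTime": [
            "2016-08-23T06:50:00.000Z",
            "2016-08-23T06:51:00.000Z",
            "2016-08-23T06:52:00.000Z",
        ],
        "completedCount": [50, 61, 39],
    },
    "completedCountBaselines": {
        "baselinesHigh": [54.0, 54.0, 54.0],
        "baselinesLow": [45.0, 45.0, 45.0],
        "anomalyHighCount": [0, 1, 0],
        "anomalyLowCount": [0, 0, 1],
    },
}

for row in baseline_rows(sample, "completedCount"):
    print(row)
```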

Extra /baselines Response Status Codes

The status codes already specified in the Status Codes page all apply. Baseline processing adds some extra meanings to the following status codes:
Code  Status                 Description
200   OK                     Request completed successfully with baseline results.
204   No Content             No baseline results are available for the resource (for example, a web page) in the given time range. Besides the named resource not being found, this could also be because the resource was not active during the given time range.
500   Internal Server Error  APMCS encountered an internal error in the baseline analysis engine. The specifics of the error can be found in the APMCS server logs.
501   Not Implemented        The APMCS server was configured to temporarily disable baseline analysis and retrieval.
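
A client might map these codes to actions along the following lines. This is a minimal Python sketch: the function name and the message wording are invented for the example, and any code not listed in the table above is simply referred back to the Status Codes page.

```python
# Baseline-specific meanings taken from the table above.
BASELINE_STATUS = {
    200: (True, "baseline results returned"),
    204: (False, "no baseline results for this resource in the given time range"),
    500: (False, "baseline analysis engine error; check the APMCS server logs"),
    501: (False, "baseline analysis and retrieval are temporarily disabled"),
}


def interpret_baselines_status(status_code):
    """Return (has_results, message) for a /baselines response status code."""
    fallback = (False, f"status {status_code}: see the Status Codes page")
    return BASELINE_STATUS.get(status_code, fallback)


has_results, message = interpret_baselines_status(204)
print(has_results, message)
```

Treating 204 separately from the error codes lets a caller distinguish an idle or unknown resource from a failed or disabled analysis.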

An Example /baselines Query & JSON Response

The following example uses the query below:
curl -i -X GET --insecure \
  -H 'X-USER-IDENTITY-DOMAIN-NAME: apm_testtenantx1' \
  -u 'apm_testtenantx1.emcsadmin:Welcome1!' \
  'https://slc02ovb:4443/serviceapi/apm.dataserver/api/v1/pages/E7E826A8/baselines?since=2016-08-23T06:50:00.000Z&until=2016-08-23T07:15:00.000Z&aggregationPeriod=1&attributes=averageResponseTime,completedCount'
For the sake of the example, we use a short time range of 25 minutes and an aggregation period of one minute. The example gathers baseline values for both the averageResponseTime and completedCount page attributes.

The JSON result is below with some brief annotations:

{
# baseline results for the averageResponseTime metric in the time range specified in the query.
# the number of elements (25) in these arrays matches the number of elements in the time series arrays.
# the baseline high and the baseline low values do not vary in this example;
# this is common for a time range as short as 25 minutes.

"averageResponseTimeBaselines" : {
"anomalyHighCount" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"anomalyLowCount" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"baselinesHigh" : [ 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91, 2180.91 ],
"baselinesLow" : [ 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58, 1819.58 ]
},

# baseline results for the completedCount metric in the time range specified in the query.
# notice also the one high anomaly and the one low anomaly.

"completedCountBaselines" : {
"anomalyHighCount" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"anomalyLowCount" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"baselinesHigh" : [ 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0, 54.0 ],
"baselinesLow" : [ 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0 ]
},

# time series object used for the baseline evaluation analysis.

"timeSeries" : {
"totalFirstByteTime" : [ 49100, 41192, 53697, 47544, 47331, 48044, 46025, 46560, 53758, 53232, 74563, 34895, 49820, 49972, 53145, 51525, 50701, 58564, 54743, 48094, 51343, 39906, 65788, 51028, 51054 ],
"minFirstByteTime" : [ 19, 23, 13, 9, 46, 4, 14, 27, 74, 76, 39, 54, 14, 16, 12, 11, 49, 13, 46, 34, 38, 4, 27, 28, 1 ],
"maxFirstByteTime" : [ 2580, 2375, 2604, 2349, 2568, 2981, 2387, 2761, 2295, 2722, 2972, 2542, 2337, 2360, 3032, 2567, 2803, 2837, 2923, 2437, 2720, 2382, 2814, 2637, 2817 ],
"averageFirstByteTime" : [ 982, 823.84, 1073.94, 932.24, 928.06, 906.49, 979.26, 895.38, 1054.08, 1086.37, 1222.34, 894.74, 976.86, 1019.84, 1155.33, 1010.29, 1014.02, 1126.23, 1052.75, 1023.28, 1047.82, 814.41, 1241.28, 1085.7, 911.68 ],
"totalInteractiveTime" : [ 73603, 71120, 76822, 70511, 69830, 74033, 72711, 78319, 81985, 79269, 97545, 56375, 73693, 74777, 76908, 75932, 79927, 81540, 76611, 70882, 79852, 63516, 86525, 74431, 79217 ],
"maxInteractiveTime" : [ 3088, 2604, 2716, 2477, 2738, 2993, 2692, 3053, 2791, 2742, 3002, 2968, 2844, 2651, 3052, 2744, 2883, 2849, 2972, 2692, 2792, 2514, 2878, 2794, 2917 ],
"minInteractiveTime" : [ 104, 217, 180, 39, 291, 203, 288, 225, 406, 634, 491, 484, 231, 341, 320, 280, 557, 269, 422, 404, 547, 144, 258, 233, 12 ],
"averageInteractiveTime" : [ 1472.06, 1422.4, 1536.44, 1382.57, 1369.22, 1396.85, 1547.04, 1506.13, 1607.55, 1617.73, 1599.1, 1445.51, 1444.96, 1526.06, 1671.91, 1488.86, 1598.54, 1568.08, 1473.29, 1508.13, 1629.63, 1296.24, 1632.55, 1583.64, 1414.59 ],
"errorPercentage" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"averageResponseTime" : [ 2009.78, 1859.50, 2048.58, 1967.29, 1931.59, 1930.98, 2002.81, 2006.21, 2105.92, 1973.73, 2046.98, 1942.51, 2008.67, 2058.27, 2103.74, 2026.96, 2060.50, 1984.06, 1981.13, 2035.66, 2053.20, 1794.10, 2093.96, 1931.26, 2018.29 ],
"maxResponseTime" : [ 3166, 3156, 2950, 3015, 2976, 3049, 2927, 3107, 3125, 3011, 3024, 3101, 3042, 3090, 3074, 2856, 3126, 2880, 3116, 2959, 2977, 3037, 2946, 3135, 3104 ],
"minResponseTime" : [ 1074, 922, 863, 944, 949, 983, 884, 893, 1091, 1024, 979, 835, 927, 1139, 824, 1041, 1024, 1054, 827, 1133, 1046, 949, 1071, 847, 914 ],
"completedCount" : [ 50, 50, 50, 51, 51, 53, 47, 52, 51, 49, 61, 39, 51, 49, 46, 51, 50, 52, 52, 47, 49, 49, 53, 47, 56 ],
"failureCount" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
"count" : 25,
"totalTime" : [ 100489, 92975, 102429, 100332, 98511, 102342, 94132, 104323, 107402, 96713, 124866, 75758, 102442, 100855, 96772, 103375, 103025, 103171, 103019, 95676, 100607, 87911, 110980, 90769, 113024 ],
"lengthMin" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ],
"formattedTime" : [ "2016-08-23T06:50:00.000Z", "2016-08-23T06:51:00.000Z", "2016-08-23T06:52:00.000Z", "2016-08-23T06:53:00.000Z", "2016-08-23T06:54:00.000Z", "2016-08-23T06:55:00.000Z", "2016-08-23T06:56:00.000Z", "2016-08-23T06:57:00.000Z", "2016-08-23T06:58:00.000Z", "2016-08-23T06:59:00.000Z", "2016-08-23T07:00:00.000Z", "2016-08-23T07:01:00.000Z", "2016-08-23T07:02:00.000Z", "2016-08-23T07:03:00.000Z", "2016-08-23T07:04:00.000Z", "2016-08-23T07:05:00.000Z", "2016-08-23T07:06:00.000Z", "2016-08-23T07:07:00.000Z", "2016-08-23T07:08:00.000Z", "2016-08-23T07:09:00.000Z", "2016-08-23T07:10:00.000Z", "2016-08-23T07:11:00.000Z", "2016-08-23T07:12:00.000Z", "2016-08-23T07:13:00.000Z", "2016-08-23T07:14:00.000Z" ],
"time" : [ 1471935000000, 1471935060000, 1471935120000, 1471935180000, 1471935240000, 1471935300000, 1471935360000, 1471935420000, 1471935480000, 1471935540000, 1471935600000, 1471935660000, 1471935720000, 1471935780000, 1471935840000, 1471935900000, 1471935960000, 1471936020000, 1471936080000, 1471936140000, 1471936200000, 1471936260000, 1471936320000, 1471936380000, 1471936440000 ],
"timeRange" : {
"since" : 1471935000000,
"until" : 1471936500000
}
},
"links" : [ {
"href" : "api/v1/pages/E7E826A8/baselines?since=2016-08-23T06:50:00.000Z&until=2016-08-23T07:15:00.000Z&aggregationPeriod=1&attributes=averageResponseTime,completedCount",
"rel" : "self"
} ]
}
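
The completedCount values in this response also illustrate the earlier note that a value outside the baseline band is not automatically an anomaly. The Python sketch below copies the completedCount series, the two anomaly count arrays, and the constant band limits from the response above. The variable names are chosen for this example only.

```python
# Values copied from the example response above.
completed_count = [50, 50, 50, 51, 51, 53, 47, 52, 51, 49, 61, 39, 51,
                   49, 46, 51, 50, 52, 52, 47, 49, 49, 53, 47, 56]
anomaly_high = [0] * 10 + [1] + [0] * 14   # single high anomaly at index 10
anomaly_low = [0] * 11 + [1] + [0] * 13    # single low anomaly at index 11
baseline_high, baseline_low = 54.0, 45.0   # constant across the 25 minutes

# Points that fall outside the baseline band.
out_of_band = [i for i, value in enumerate(completed_count)
               if value > baseline_high or value < baseline_low]

# Points that the analysis flagged as anomalous.
anomalous = [i for i in range(len(completed_count))
             if anomaly_high[i] or anomaly_low[i]]

# Out-of-band points that were not flagged.
not_flagged = sorted(set(out_of_band) - set(anomalous))

print(out_of_band)   # [10, 11, 24]
print(anomalous)     # [10, 11]
print(not_flagged)   # [24]
```

Index 24 corresponds to 2016-08-23T07:14:00.000Z, where the count of 56 exceeds the high baseline of 54 without being reported as an anomaly.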
        

Scheduled Baseline Retrieval

Scheduled baseline retrieval refers to the process of the REST API user retrieving the existing baseline results of APM metrics for a resource. During normal running of Application Performance Monitoring, while the different server and browser agents are gathering and storing APM metrics, those same metrics are forwarded to the baseline analysis engine for processing. That analysis engine runs on a schedule that keeps training models up to date, and maintains an up-to-date analysis of evaluation data. The results of the baseline analysis are themselves stored in a repository where they are available for later retrieval.

Unlike the on-demand baseline retrieval described above, the scheduled baseline retrieval merely retrieves the already-persisted baseline results. Other than when the baseline analysis is performed, and where the baseline analysis results are retrieved from, scheduled baseline retrieval and on-demand baseline retrieval are functionally equivalent:
  • The /baselines query syntax is the same as described above, allowing you to specify a resource, an attribute and a time range. The baseline results that were persisted for the specified time range will be returned.
  • The baseline analysis, which generates the baseline results, was previously run and its results were persisted externally. These results are retrieved in order to create the REST API response.
  • The /baselines JSON Response is identical to the on-demand baseline retrieval.