## 3.1 Aggregation Concepts

An aggregating function is one that has the following property:

`func(func(x0) U func(x1) U ... U func(xn)) = func(x0 U x1 U ... U xn)`

where `x0` through `xn` are sets of arbitrary data. In other words, applying an aggregating function to subsets of the whole, and then applying it again to the results, yields the same result as applying it to the whole itself. For example, consider the `SUM` function, which yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying `SUM` to the entire set is {29}. Similarly, the result of applying `SUM` to the subset consisting of the first three elements is {5}, the result of applying `SUM` to the subsequent three elements is {12}, and the result of applying `SUM` to the remaining three elements is also {12}. `SUM` is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying `SUM` to the original data.
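The `SUM` walkthrough above can be checked with a minimal Python sketch. The variable names are chosen here for illustration and are not part of DTrace.

```python
# Raw data from the example in the text.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]

# Split the raw data into three consecutive subsets of three elements.
subsets = [data[i:i + 3] for i in range(0, len(data), 3)]

# Apply SUM to each subset, then apply SUM again to the partial results.
partial_sums = [sum(s) for s in subsets]
sum_of_partials = sum(partial_sums)

# Apply SUM directly to the whole data set for comparison.
sum_of_whole = sum(data)

print(partial_sums, sum_of_partials, sum_of_whole)
```

Both routes give the same total, which is the defining property of an aggregating function.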

Not all functions are aggregating functions. An example of a non-aggregating function is the `MEDIAN` function, which determines the median element of a set. The median is the element for which as many elements in the set are greater than it as are less than it. The `MEDIAN` is derived by sorting the set and selecting the middle element. Returning to the original raw data, if `MEDIAN` is applied to the set consisting of the first three elements, the result is {2}: the sorted set is {1, 2, 2}, and {2} is the set consisting of the middle element. Likewise, applying `MEDIAN` to the next three elements yields {4}, and applying `MEDIAN` to the final three elements yields {4}. Thus, applying `MEDIAN` to each of the subsets yields the set {2, 4, 4}, and applying `MEDIAN` to this set yields {4}. Sorting the original set, however, yields {1, 2, 2, 2, 3, 4, 4, 5, 6}, so applying `MEDIAN` to the whole yields {3}. Because these results do not match, `MEDIAN` is not an aggregating function. Nor is `MODE`, the most common element of a set.
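The `MEDIAN` counterexample can be reproduced the same way. This sketch uses `statistics.median` from the Python standard library, which returns the middle element for an odd number of values.

```python
import statistics

# Raw data from the example in the text.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Median of each subset, then the median of those medians.
partial_medians = [statistics.median(s) for s in subsets]
median_of_partials = statistics.median(partial_medians)

# Median of the whole data set.
median_of_whole = statistics.median(data)

print(partial_medians, median_of_partials, median_of_whole)
```

The median of the partial medians differs from the median of the whole, so the aggregating property does not hold.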

Many common functions that are used to understand a set of data are aggregating functions. These functions include the following:

• Counting the number of elements in the set.

• Computing the minimum value of the set.

• Computing the maximum value of the set.

• Summing all of the elements in the set.

• Histogramming the values in the set, as quantized into certain bins.
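As an illustration of the last item, the following sketch builds a histogram per subset and merges the partial histograms by adding the counts in matching bins. The bin width and the `histogram` helper are assumptions made for this example; they do not reflect how DTrace quantizes values.

```python
from collections import Counter

BIN_WIDTH = 2  # hypothetical bin width, chosen only for illustration


def histogram(values):
    """Count the values in each bin, keyed by the bin's lower bound."""
    return Counter((v // BIN_WIDTH) * BIN_WIDTH for v in values)


data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Merge the per-subset histograms by summing the counts bin by bin.
merged = Counter()
for s in subsets:
    merged += histogram(s)

# Histogram computed directly over the whole data set.
whole = histogram(data)

print(sorted(merged.items()), sorted(whole.items()))
```

The merged histogram matches the histogram of the whole data set, so only the bin counts need to be kept, not the individual values.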

Moreover, some functions that are not, strictly speaking, aggregating functions themselves can nonetheless be constructed from aggregating functions. For example, average (arithmetic mean) can be constructed by aggregating the count of the number of elements in the set and the sum of all elements in the set, and reporting the ratio of the two aggregates as the final result. Another important example is standard deviation, which can be derived in a similar way from the count, the sum, and the sum of the squares of the elements.
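The construction described above can be sketched as follows. The `summarize`, `merge`, `mean`, and `stddev` helpers are hypothetical names introduced for this example. The standard deviation uses the textbook identity based on count, sum, and sum of squares; this is one common approach and is not claimed to be DTrace's own implementation.

```python
import math
import statistics


def summarize(values):
    """Return the aggregates (count, sum, sum of squares) for one subset."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two partial summaries by adding them component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean(summary):
    """Arithmetic mean as the ratio of the sum and count aggregates."""
    count, total, _ = summary
    return total / count


def stddev(summary):
    """Population standard deviation: sqrt(E[x^2] - E[x]^2)."""
    count, total, total_sq = summary
    return math.sqrt(total_sq / count - (total / count) ** 2)


data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Fold the per-subset summaries into one summary for the whole data set.
merged = (0, 0, 0)
for s in subsets:
    merged = merge(merged, summarize(s))

print(mean(merged), statistics.mean(data))
print(stddev(merged), statistics.pstdev(data))
```

Only three numbers per summary are carried between steps, yet the final mean and standard deviation agree with the values computed from the full data set (up to floating-point rounding).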

Applying aggregating functions to data as it is traced has a number of advantages, including the following:

• The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated, given the set consisting of the current intermediate result and the new element. When the new result is calculated, the new element can be discarded. This process reduces the amount of storage that is required by a factor of the number of data points, which is often quite large.

• Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
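Both advantages can be illustrated with a small simulation. The `(cpu_id, value)` events below are made up for this example, and a dictionary stands in for per-CPU storage; this is a sketch of the idea under those assumptions, not a description of DTrace internals.

```python
# Hypothetical trace events as (cpu_id, value) pairs, invented for illustration.
events = [(0, 2), (1, 1), (0, 2), (2, 5), (1, 4), (2, 3), (0, 6), (1, 4), (2, 2)]

# One intermediate result per CPU, so no CPU updates another CPU's value.
per_cpu_sum = {}

for cpu, value in events:
    # Fold the new element into that CPU's running result; the element
    # itself is not stored anywhere after this step.
    per_cpu_sum[cpu] = per_cpu_sum.get(cpu, 0) + value

# Apply the aggregating function to the per-CPU results for the final answer.
system_wide_sum = sum(per_cpu_sum.values())

print(per_cpu_sum, system_wide_sum)
```

Storage here grows with the number of CPUs rather than with the number of events, and the final result equals the sum over all events.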