3.1 Aggregation Concepts

An aggregating function is one that has the following property:

func(func(x0) U func(x1) U ... U func(xn)) = func(x0 U x1 U ... U xn)

where each of x0 through xn is a set of arbitrary data. That is, applying an aggregating function to subsets of the whole and then applying it again to the results yields the same result as applying it to the whole itself. For example, consider the SUM function, which yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying SUM to the entire set is {29}. Similarly, the result of applying SUM to the subset consisting of the first three elements is {5}, the result of applying SUM to the set consisting of the subsequent three elements is {12}, and the result of applying SUM to the remaining three elements is also {12}. SUM is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying SUM to the original data.
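The SUM walkthrough above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name holds_for_partition is hypothetical, and checking one particular partition demonstrates the property for that partition; it does not prove it in general.

```python
def holds_for_partition(func, subsets):
    """Check func(func(x0) U ... U func(xn)) == func(x0 U ... U xn)
    for one specific partition of the data (hypothetical helper)."""
    whole = [value for subset in subsets for value in subset]
    partials = [func(subset) for subset in subsets]
    return func(partials) == func(whole)

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# SUM applied to each subset, then to the set of partial results.
partial_sums = [sum(subset) for subset in subsets]
total = sum(partial_sums)

print(partial_sums, total, holds_for_partition(sum, subsets))
```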

Not all functions are aggregating functions. An example of a non-aggregating function is the MEDIAN function, which determines the median element of the set. The median is defined as the element of a set for which as many elements in the set are greater than it as are less than it. The MEDIAN is derived by sorting the set and selecting the middle element. Returning to the original raw data, if MEDIAN is applied to the set consisting of the first three elements, the result is {2}: the sorted set is {1, 2, 2}, and {2} is the set consisting of the middle element. Likewise, applying MEDIAN to the next three elements yields {4}, and applying MEDIAN to the final three elements yields {4}. Thus, applying MEDIAN to each of the subsets yields the set {2, 4, 4}, and applying MEDIAN to this set yields the result {4}. Sorting the original set, however, yields {1, 2, 2, 2, 3, 4, 4, 5, 6}, so applying MEDIAN to the original set yields {3}. Because these results do not match, MEDIAN is not an aggregating function. MODE, which yields the most common element of a set, is not an aggregating function either.
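The MEDIAN counterexample can be checked the same way. The sketch below assumes the definition used in the text (sort, then take the middle element) and is only meant for sets with an odd number of elements, as in the example.

```python
def median(values):
    """Middle element of the sorted values (assumes an odd element count)."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# MEDIAN of each subset, then MEDIAN of those partial results.
partial_medians = [median(subset) for subset in subsets]
median_of_partials = median(partial_medians)

# MEDIAN of the whole data set, for comparison.
median_of_whole = median(data)

print(partial_medians, median_of_partials, median_of_whole)
```

The two final values differ, which is exactly why MEDIAN fails the defining property.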

Many common functions that are used to understand a set of data are aggregating functions. These functions include the following:

  • Counting the number of elements in the set.

  • Computing the minimum value of the set.

  • Computing the maximum value of the set.

  • Summing all of the elements in the set.

  • Histogramming the values in the set, as quantized into certain bins.
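As a rough illustration of the functions listed above, the following Python sketch computes each one from per-subset partial results. The bin width and the histogram helper are assumptions made for this example. Note that in this sketch the partial minimums and maximums are combined by re-applying MIN and MAX, while partial counts and partial histograms are combined by addition.

```python
from collections import Counter

BIN_WIDTH = 2  # hypothetical bin size chosen for this illustration

def histogram(values, width=BIN_WIDTH):
    """Count how many values fall into each bin, keyed by the bin's lower bound."""
    return Counter((value // width) * width for value in values)

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Minimum and maximum: re-apply the same function to the partial results.
overall_min = min(min(subset) for subset in subsets)
overall_max = max(max(subset) for subset in subsets)

# Count and histogram: combine the partial results by addition.
overall_count = sum(len(subset) for subset in subsets)
overall_hist = sum((histogram(subset) for subset in subsets), Counter())

print(overall_min, overall_max, overall_count, dict(overall_hist))
```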

Moreover, some functions that are not, strictly speaking, aggregating functions themselves can nonetheless be constructed from aggregating functions. For example, average (arithmetic mean) can be constructed by aggregating the count of the number of elements in the set and the sum of all elements in the set, and then reporting the ratio of the two aggregates as the final result. Another important example is standard deviation.
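The construction of average described above can be sketched as follows. The standard deviation part is an assumption: the text does not say how it is constructed, so this example uses one common approach, keeping a sum of squares alongside the count and the sum and computing the population standard deviation from the three aggregates. The names summarize and merge are hypothetical.

```python
import math

def summarize(values):
    """Partial aggregate for one subset: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(value * value for value in values))

def merge(left, right):
    """Combine two partial aggregates component by component."""
    return tuple(a + b for a, b in zip(left, right))

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

totals = (0, 0, 0)
for subset in subsets:
    totals = merge(totals, summarize(subset))

count, total_sum, total_squares = totals

# Average is the ratio of two aggregates.
mean = total_sum / count

# Population standard deviation from the aggregates (assumed construction).
stddev = math.sqrt(total_squares / count - mean * mean)

print(totals, mean, stddev)
```

This sum-of-squares form is simple but can lose precision when values are large relative to their spread; it is shown here only because it maps directly onto count and sum.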

Applying aggregating functions to data as it is traced has a number of advantages, including the following:

  • The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated, given the set consisting of the current intermediate result and the new element. When the new result is calculated, the new element can be discarded. This process reduces the amount of storage that is required by a factor of the number of data points, which is often quite large.

  • Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
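Both advantages can be illustrated with a small simulation. This is a sketch in Python, not a description of how DTrace is implemented: the CPU count, the record function, and the round-robin assignment of data points to CPUs are all assumptions made for the example.

```python
NUM_CPUS = 3  # hypothetical CPU count for this simulation

# One running intermediate result per CPU; no shared structure is updated.
per_cpu_sum = [0] * NUM_CPUS

def record(cpu, value):
    """Fold one new data point into that CPU's running sum.
    The data point itself is not stored anywhere."""
    per_cpu_sum[cpu] += value

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
for index, value in enumerate(data):
    # Pretend the data points arrive round-robin across the CPUs.
    record(index % NUM_CPUS, value)

# Apply the aggregating function to the per-CPU intermediate results
# to obtain the system-wide result.
system_wide_sum = sum(per_cpu_sum)

print(per_cpu_sum, system_wide_sum)
```

Storage here is one number per CPU regardless of how many data points are recorded, and each simulated CPU touches only its own slot until the final merge.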