An aggregating function is one that has the following property:
    func(func(x0) ∪ func(x1) ∪ ... ∪ func(xn)) = func(x0 ∪ x1 ∪ ... ∪ xn)
where each xi is a set of arbitrary data. That is, applying an aggregating function to subsets of the whole and then applying it again to the results yields the same result as applying it to the whole itself. For example, consider the SUM function, which yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying SUM to the entire set is {29}. Similarly, the result of applying SUM to the subset consisting of the first three elements is {5}, the result of applying SUM to the set consisting of the subsequent three elements is {12}, and the result of applying SUM to the remaining three elements is also {12}. SUM is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying SUM to the original data.
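The SUM example above can be checked mechanically. The following is a minimal Python sketch, not DTrace code; the variable names are chosen here for illustration only.

```python
# Raw data from the example, split into three subsets of three elements.
data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# Apply SUM to each subset, then apply SUM again to the partial results.
partial_sums = [sum(s) for s in subsets]
sum_of_partials = sum(partial_sums)

# Apply SUM directly to the whole data set.
sum_of_whole = sum(data)

print(partial_sums)     # [5, 12, 12]
print(sum_of_partials)  # 29
print(sum_of_whole)     # 29
```

Both routes produce 29, which is the defining property of an aggregating function.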
Not all functions are aggregating functions. An example of a non-aggregating function is the MEDIAN function, which determines the median element of the set. The median is the element of a set for which as many elements are greater than it as are less than it. The MEDIAN is derived by sorting the set and selecting the middle element. Returning to the original raw data, if MEDIAN is applied to the set consisting of the first three elements, the result is {2}: the sorted set is {1, 2, 2}, and {2} is the set consisting of the middle element. Likewise, applying MEDIAN to the next three elements yields {4}, and applying MEDIAN to the final three elements yields {4}. Thus, applying MEDIAN to each of the subsets yields the set {2, 4, 4}, and applying MEDIAN to this set yields the result {4}. Sorting the original set, however, yields {1, 2, 2, 2, 3, 4, 4, 5, 6}, so applying MEDIAN to the original set yields {3}. Because these results do not match, MEDIAN is not an aggregating function. Nor is MODE, the most common element of a set.
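The MEDIAN counterexample can be reproduced in the same way. This sketch uses Python's standard statistics.median on the same raw data and the same three subsets; it is an illustration, not DTrace code.

```python
import statistics

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# MEDIAN of each subset, then MEDIAN of those partial results.
partial_medians = [statistics.median(s) for s in subsets]
median_of_partials = statistics.median(partial_medians)

# MEDIAN of the whole data set.
median_of_whole = statistics.median(data)

print(partial_medians)     # [2, 4, 4]
print(median_of_partials)  # 4
print(median_of_whole)     # 3
```

The two results differ (4 versus 3), so the property does not hold for MEDIAN.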
Many common functions that are used to understand a set of data are aggregating functions. These functions include the following:
Counting the number of elements in the set.
Computing the minimum value of the set.
Computing the maximum value of the set.
Summing all of the elements in the set.
Histogramming the values in the set, as quantized into certain bins.
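As a rough illustration of the functions listed above, the following Python sketch checks each one against the same raw data. Two points are assumptions made here rather than statements from the text: partial counts and per-bin histogram counts are combined by addition, and the histogram uses hypothetical bins of width 2.

```python
from collections import Counter

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

# MIN and MAX: applying the function to the partial results gives
# the same answer as applying it to the whole set.
min_of_partials = min(min(s) for s in subsets)
max_of_partials = max(max(s) for s in subsets)

# COUNT: partial counts are combined by adding them (assumption).
total_count = sum(len(s) for s in subsets)

# Histogram with hypothetical bins of width 2: each value is mapped to
# the lower bound of its bin, and per-bin counts are combined by addition.
def histogram(values, bin_width=2):
    return Counter((v // bin_width) * bin_width for v in values)

merged_histogram = Counter()
for s in subsets:
    merged_histogram += histogram(s)

print(min_of_partials == min(data))              # True
print(max_of_partials == max(data))              # True
print(total_count == len(data))                  # True
print(merged_histogram == histogram(data))       # True
```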
Moreover, some functions that strictly speaking are not aggregating functions themselves can nonetheless be constructed from aggregating functions. For example, average (arithmetic mean) can be constructed by aggregating the count of the number of elements in the set and the sum of all elements in the set, reporting the ratio of the two aggregates as the final result. Another important example is standard deviation, which can be constructed from the count, the sum, and the sum of the squares of the elements.
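One way to sketch this construction in Python is to keep a small tuple of aggregates per subset, combine the tuples, and compute the final value only at the end. The helper names (summarize, combine, mean, stddev) are hypothetical, and the use of the population form of standard deviation is an assumption made for this example.

```python
import math
import statistics

def summarize(values):
    """Return the aggregates (count, sum, sum of squares) for a subset."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Combine two aggregate tuples by component-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean(agg):
    count, total, _ = agg
    return total / count

def stddev(agg):
    """Population standard deviation from the combined aggregates."""
    count, total, total_sq = agg
    m = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, total_sq / count - m * m))

data = [2, 1, 2, 5, 4, 3, 6, 4, 2]
subsets = [data[0:3], data[3:6], data[6:9]]

combined = (0, 0, 0)
for s in subsets:
    combined = combine(combined, summarize(s))

print(combined)  # (9, 29, 115)
print(math.isclose(mean(combined), statistics.mean(data)))      # True
print(math.isclose(stddev(combined), statistics.pstdev(data)))  # True
```

Only the three aggregates are carried between subsets; the ratio and the square root are applied once, to the final combined values.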
Applying aggregating functions to data as it is traced has a number of advantages, including the following:
The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated, given the set consisting of the current intermediate result and the new element. When the new result is calculated, the new element can be discarded. This process reduces the amount of storage that is required by a factor of the number of data points, which is often quite large.
Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
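Both advantages can be illustrated with a small simulation. The Python sketch below is not how DTrace is implemented; the event stream, the CPU numbering, and the dictionary of per-CPU results are all hypothetical stand-ins. Each value is folded into a per-CPU running SUM and then discarded, and the per-CPU results are combined once at the end.

```python
# Hypothetical stream of (cpu_id, value) events; the values are the
# raw data used earlier, spread across three CPUs.
events = [(0, 2), (1, 1), (0, 2), (2, 5), (1, 4),
          (2, 3), (0, 6), (1, 4), (2, 2)]

# One intermediate result per CPU; no shared structure is updated
# while data is being collected.
per_cpu_sum = {}
for cpu, value in events:
    # Fold the new element into that CPU's intermediate result.
    # The element itself is not stored anywhere.
    per_cpu_sum[cpu] = per_cpu_sum.get(cpu, 0) + value

# Apply the aggregating function to the per-CPU intermediate results
# to obtain the system-wide result.
system_wide_sum = sum(per_cpu_sum.values())

print(per_cpu_sum)      # {0: 10, 1: 9, 2: 10}
print(system_wide_sum)  # 29
```

Storage here is one number per CPU regardless of how many events arrive, and the final result matches the SUM of the full data set.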