GROUP/GROUP BY clauses

The GROUP and GROUP BY clauses specify how to map source records to result records in order to group statement output.

Some of the ways to use these clauses in a query are:

Omitting the GROUP clause maps each source record to its own result record.
GROUP maps all source records to a single result record.
GROUP BY <attributeList> maps source records to result records by the combination of values in the listed attributes.

You can also use other grouping functions (such as CUBE or GROUPING SETS) with the GROUP and GROUP BY clauses. Details on these functions are given in later in this chapter.

BNF grammar for grouping

The BNF grammar representation for GROUP and the family of group functions is:

GroupByList ::= GROUP | GROUP BY GroupByList_ | GROUP BY GroupAll
GroupByList_ ::= GroupByElement | GroupByList_ ,  GroupByElement
GroupByElement ::= GroupBySingle | GroupingSets | CubeRollup

GroupingSets ::= GROUPING SETS (GroupingSetList)
GroupingSetList ::=  GroupingSetElement |  GroupingSetList ,  GroupingSetElement
GroupingSetElement ::=  GroupBySingle | GroupByComposite | CubeRollup | GroupAll

CubeRollup ::= {CUBE | ROLLUP} (CubeRollupList)
CubeRollupList ::=  CubeRollupElement |  CubeRollupList ,  CubeRollupElement
CubeRollupElement ::=  GroupBySingle | GroupByComposite

GroupBySingle ::= Identifier
GroupByComposite ::= (GroupByCompositeList)
GroupByCompositeList ::= GroupBySingle | GroupByCompositeList, GroupBySingle

GroupAll ::= ()

Note that the use of GroupAll results in the following being all equivalent:

GROUP = GROUP BY() = GROUP BY GROUPING SETS(())

Specifying only GROUP

You can use a GROUP clause to aggregate results into a single bucket. As the BNF grammar shows, the GROUP clause does not take an argument.

For example, the following statement uses the SUM statement to return a single sum across a set of records:

RETURN ReviewCount AS 
SELECT SUM(NumReviews) AS NumberOfReviews
GROUP

This statement returns one record for NumberOfReviews. The value is the sum of the values for the NumReviews attribute.

Specifying GROUP BY

You can use GROUP BY to aggregate results into buckets with common values for the grouping keys. The GROUP BY syntax is:

GROUP BY(attributeList)

where attributeList is a single attribute, a comma-separated list of multiple attributes, GROUPING SETS, CUBE, ROLLUP, or () to specify an empty group. The empty group generates a total.

Grouping is allowed on source and locally-defined attributes.

Note: If you group by a locally-defined attribute, that attribute cannot refer to non-grouping attributes and cannot contain any aggregates. However, LOOKUP and IN expressions are valid in this context.

All grouping attributes are part of the result records. A NULL value in any grouping attribute is treated like any other value, which means the source record is mapped to result records. (However, note that NULL values are ignored if selecting from the corpus.) For information about user-defined NULL-value handling in EQL, see COALESCE.

For example, suppose we have sales transaction data with records consisting of the following attributes:

{ TransId, ProductType, Amount, Year, Quarter, Region,
  SalesRep, Customer }

For example:

{ TransId = 1, ProductType = "Widget", Amount = 100.00,
  Year = 2011, Quarter = "11Q1", Region  = "East",
  SalesRep = "J. Smith", Customer = "Customer1" }

If an EQL statement uses Region and Year as GROUP BY attributes, the statement results contain an aggregated record for each valid, non-empty combination of Region and Year. In EQL, this example is expressed as:

DEFINE RegionsByYear AS
GROUP BY Region, Year

resulting in the aggregates of the form { Region, Year }, for example:

{ "East", "2010" }
{ "West", "2011" }
{ "East", "2011" }

Note that using duplicated columns in GROUP BY clauses is allowed. This means that the following two queries are treated as equivalent:

RETURN Results AS
SELECT SUM(PROMO_COST) AS PR_Cost
GROUP BY PROMO_NAME

RETURN Results AS
SELECT SUM(PROMO_COST) AS PR_Cost
GROUP BY PROMO_NAME, PROMO_NAME

Using a GROUP BY that is an output of a SELECT expression

A GROUP BY key can be the output of a SELECT expression, as long as that expression itself does not contain an aggregation function.

For example, the following syntax is a correct usage of GROUP BY:

SELECT COALESCE(Person, 'Unknown Person') AS Person2, ... GROUP BY Person2

The following syntax is incorrect and results in an error, because Sales2 contains an aggregation function (SUM):

SELECT SUM(Sales) AS Sales2, ... GROUP BY Sales2

Specifying the hierarchy level for a managed attribute

You can group by a specified depth of each managed attribute. However, GROUP BY statements cannot use the ANCESTOR function (because you cannot group by an expression in EQL). Therefore, you must first use ANCESTOR with the SELECT statement and then specify the aliased results in the GROUP BY clause.

For example, assume that the Region attribute contains the hierarchy Country, State, and City. We want to group the results at the State level (one level below the root of the managed attribute hierarchy). An abbreviated query would look like this:

SELECT ANCESTOR("Region", 1) AS StateInfo
...
GROUP BY StateInfo

Grouping by a multi-assign attribute

If you group by a multi-assign attribute, each source record will map to multiple corresponding output records. For example, the record [A:1, A:2, B:3, B:4, B:5] will map to:

Two output records if you group by A
Three output records if you group by B
Six output records if you group by both A and B
Six output records for SELECT A + B AS C GROUP BY C, because all six possible values of A + B will be computed prior to grouping.

This can only occur with a corpus source, because result records are always single assign.

In this example, UserTag is multi-assign:

RETURN "Example" AS SELECT
   AVG("Gross") AS "AvgGross",
   SUM("Gross") AS "TotalGross"
GROUP BY UserTag

To define the set of resulting buckets, a statement must specify a set of GROUP BY attributes. The cross product of all values in these grouping attributes defines the set of candidate buckets.

The results are automatically pruned to include only non-empty buckets.

If an attribute reference appears in a statement with a GROUP clause in the definition of an attribute not in the GROUP clause, the attribute will have an implicit ARB aggregate applied.