GROUP BY Clause

The GROUP BY clause is used in a SELECT statement to collect data across multiple rows and group the result by one or more columns or expressions. The GROUP BY clause is often used with aggregate functions. Oracle NoSQL Database applies the aggregate functions to each group of rows and returns a single row for each group.

Syntax

groupby_clause ::= GROUP BY expression ("," expression)*

Semantics

Each (grouping) expression must return at most one atomic value. If a grouping expression returns an empty result on an input row, that row is skipped. Equality among grouping values is defined according to the semantics of the "=" operator, with the exception that two NULL values are considered equal. See Value Comparison Operators section. Then, for each group, a single record is constructed and returned by the GROUP BY clause. If the clause has N grouping expressions, the first N fields of the returned record store the values of the grouping expressions. The remaining M fields (M >= 0) store the result of zero or more aggregate functions. In general, aggregate functions iterate over the rows of a group, evaluate an expression for each such row, and aggregate the returned values into a single value per group. Oracle NoSQL Database supports many aggregate functions as described in the Using Aggregate Functions section.

Syntactically, aggregate functions are not actually listed in the GROUP BY clause, but appear in the SELECT clause instead. In fact, aggregate functions can appear only in the SELECT or ORDER BY clauses, and they cannot be nested. Semantically, however, every aggregate function that appears in the SELECT or ORDER BY list is actually evaluated by the GROUP BY clause. If the SELECT clause contains any aggregate functions, but the SELECT expression does not contain a GROUP BY clause, the whole set of rows produced by the FROM or the WHERE clauses is considered as one group and the aggregate functions are evaluated over this single group.

The implementation of the GROUP BY clause may be index-based or generic. Index-based grouping is possible only if an index exists that sorts the rows by the values of the grouping expressions. More precisely, let e1, e2, …, eN where ei is the ith expression (i is a number in the range 1,2,3, ...N) be the grouping expressions as they appear in the GROUP BY clause (from left to right). Then, for index-based grouping, there must exist an index (which may be the primary-key index or one of the existing secondary indexes) such that for each i in 1,2,...,N, ei matches the definition of the i-th index field. If such an index does not exist or is not selected by the query optimizer, the GROUP BY will be generic. A generic GROUP BY uses a hash table to find rows belonging to the same group and stores all groups before returning any results to the application. The hash table is stored in the client driver memory (local hash tables, of limited size, may be used at the servers as well). As a result, a generic GROUP BY may consume a large amount of driver memory. In contrast, index-based grouping exploits the row sorting provided by the index to avoid the materialization and caching of any intermediate results. It is therefore recommended to create appropriate indexes for use in GROUP BY queries. See Using Indexes for Query Optimization. Finally, notice that when you use index-based grouping, the results of a grouping SELECT expression are ordered by the grouping expressions.

Example 6-10 Group by age in the application that maintains the user data

SELECT 
age, 
count(*) AS count, 
avg(income) AS income 
FROM users 
GROUP BY age

Explanation: In this query, you group users by their age using the GROUP BY clause. For each group, the query returns the associated age and the average income of the users in the group. You use the aggregate functions count to count the users in each age group and avg to calculate the average income.

Output:

{"age":null,"count":1,"income":75000.0}
{"age":22,"count":2,"income":50000.0}
{"age":45,"count":1,"income":75000.0}

Example 6-11 Display the number of bags for each passenger in the airline application

SELECT 
bag.confNo,
count(bag.bagInfo.id) AS TOTAL_BAGS 
FROM Baggageinfo bag 
GROUP BY bag.confNo

Explanation: In the airline application, every passenger has a unique reservation code (confNo). A passenger can have more than one piece of luggage. Each piece of luggage is attached with a unique id during check-in that is stored in the id attribute of the bagInfo field.

In this query, you group the data based on the confNo using the GROUP BY clause. To fetch the number of bags per passenger, you get the count of the bagInfo.id field associated with each confNo using the count aggregate function.

Output:

{"confNo":"FH7G1W","TOTAL_BAGS":1}
{"confNo":"PQ1M8N","TOTAL_BAGS":1}
{"confNo":"XT6K7M","TOTAL_BAGS":1}
{"confNo":"DN3I4Q","TOTAL_BAGS":1}
{"confNo":"QB1O0J","TOTAL_BAGS":1}
{"confNo":"TX1P7E","TOTAL_BAGS":1}
{"confNo":"CG6O1M","TOTAL_BAGS":1}
{"confNo":"OH2F8U","TOTAL_BAGS":1}
{"confNo":"BO5G3H","TOTAL_BAGS":1}
{"confNo":"ZG8Z5N","TOTAL_BAGS":1}
{"confNo":"LE6J4Z","TOTAL_BAGS":1}
{"confNo":"XT1O7T","TOTAL_BAGS":1}
{"confNo":"QI3V6Q","TOTAL_BAGS":2}
{"confNo":"RL3J4Q","TOTAL_BAGS":1}
{"confNo":"HJ4J4P","TOTAL_BAGS":1}
{"confNo":"CR2C8MY","TOTAL_BAGS":1}
{"confNo":"LN0C8R","TOTAL_BAGS":1}
{"confNo":"MZ2S5R","TOTAL_BAGS":1}
{"confNo":"KN4D1L","TOTAL_BAGS":1}
{"confNo":"MC0E7R","TOTAL_BAGS":1}

Example 6-12 Select the total baggage originating from each airport (excluding the transit baggage) in the airline application

SELECT $flt_src as SOURCE, 
count(*) as COUNT 
FROM BaggageInfo $bag, 
$bag.bagInfo.flightLegs[0].fltRouteSrc $flt_src 
GROUP BY $flt_src

Explanation: The fltRouteSrc field in the flightLegs array of the BaggageInfo table includes the details of the originating station in a passenger record.

In the above query, you get the total count of baggage originating from each airport. However, you don't want to consider the airports that are part of the transit. You group the data with the flight source values of the first element of the flightLegs array (as the first element is the source station). You then determine the count of baggage using the count function.

Output:

{"SOURCE":"SFO","COUNT":6}
{"SOURCE":"BZN","COUNT":1}
{"SOURCE":"GRU","COUNT":1}
{"SOURCE":"LAX","COUNT":1}
{"SOURCE":"YYZ","COUNT":1}
{"SOURCE":"MEL","COUNT":1}
{"SOURCE":"MIA","COUNT":4}
{"SOURCE":"MSQ","COUNT":2}
{"SOURCE":"MXP","COUNT":2}
{"SOURCE":"JFK","COUNT":3}