Example 7: Group data with fields not part of the index

Fetch the age and sum of income of all users whose state is CA, grouping the data by age.
SELECT age, sum(income) FROM Users u 
WHERE u.address.state = "CA" GROUP BY age
Query execution plan:
{
  "iterator kind" : "GROUP",
  "input variable" : "$gb-2",
  "input iterator" :
  {
    "iterator kind" : "RECEIVE",
    "distribution kind" : "ALL_SHARDS",
    "input iterator" :
    {
      "iterator kind" : "GROUP",
      "input variable" : "$gb-1",
      "input iterator" :
      {
        "iterator kind" : "SELECT",
        "FROM" :
        {
          "iterator kind" : "TABLE",
          "target table" : "users",
          "row variable" : "$$u",
          "index used" : "idx_state_city_income",
          "covering index" : false,
          "index scans" : [
            {
              "equality conditions" : {"address.state":"CA"},
              "range conditions" : {}
            }
          ]
        },
        "FROM variable" : "$$u",
        "SELECT expressions" : [
          {
            "field name" : "age",
            "field expression" :
            {
              "iterator kind" : "FIELD_STEP",
              "field name" : "age",
              "input iterator" :
              {
                "iterator kind" : "VAR_REF",
                "variable" : "$$u"
              }
            }
          },
          {
            "field name" : "Column_2",
            "field expression" :
            {
              "iterator kind" : "FIELD_STEP",
              "field name" : "income",
              "input iterator" :
              {
                "iterator kind" : "VAR_REF",
                "variable" : "$$u"
              }
            }
          }
        ]
      },
      "grouping expressions" : [
        {
          "iterator kind" : "FIELD_STEP",
          "field name" : "age",
          "input iterator" :
          {
            "iterator kind" : "VAR_REF",
            "variable" : "$gb-1"
          }
        }
      ],
      "aggregate functions" : [
        {
          "iterator kind" : "FUNC_SUM",
          "input iterator" :
          {
            "iterator kind" : "FIELD_STEP",
            "field name" : "Column_2",
            "input iterator" :
            {
              "iterator kind" : "VAR_REF",
              "variable" : "$gb-1"
            }
          }
        }
      ]
    }
  },
  "grouping expressions" : [
    {
      "iterator kind" : "FIELD_STEP",
      "field name" : "age",
      "input iterator" :
      {
        "iterator kind" : "VAR_REF",
        "variable" : "$gb-2"
      }
    }
  ],
  "aggregate functions" : [
    {
      "iterator kind" : "FUNC_SUM",
      "input iterator" :
      {
        "iterator kind" : "FIELD_STEP",
        "field name" : "Column_2",
        "input iterator" :
        {
          "iterator kind" : "VAR_REF",
          "variable" : "$gb-2"
        }
      }
    }
  ]
}
Explanation of the query execution plan :
  • In this example , you group all users in the state CA based on their age and determine the sum of the income of users belonging to each age group.
  • As the GROUP BY field (age in this example) is not part of any index, you need a separate GROUP operator to do the grouping. This is indicated by the existence of the GROUP iterators in the execution plan. There are two GROUP iterators: one that operates at the driver (above the RECEIVE iterator) and another that operates at the RNs (below the RECEIVE iterator).
  • The lower GROUP iterator has a SELECT iterator as input. The SELECT returns the age and income of each user in the state CA. The GROUP iterator will operate until the batch limit is reached. If the batch limit is defined as the max number N of results produced, the GROUP iterator will stop when up to N age groups have been created. If the batch limit is defined as the max number of bytes read, it will stop when this max is reached. The GROUP operator has an input variable. For the inner GROUP operator the input variable is $gb-1 and for the outer GROUP operator it is $gb-2.
    "iterator kind" : "GROUP","input variable" : "$gb-1",
  • The index idx_state_city_income is used here and in this example, it is not a covering index as the query has the age field, which is not part of the entries of the index.
  • The index scan property contains the start and stop conditions that define the index scans to be performed. In this example, only one index scan will be performed. The conditions correspond to the predicate u.address.state = "CA". Specifically, the starting index entry must have the value CA on the address.state field. All subsequent entries must have CA as the value of their address.state field, and the scan will stop as soon as an entry with a different state value is encountered.
  • When the FROM iterator is a TABLE iterator, the FROM variable is the same as either the index row variable or the row variable of the TABLE iterator, depending on whether the used index is covering or not. In this example, the FROM variable is same as the row variable as the index is not covering.
  • This row variable ($$u) will be referenced by iterators implementing the other clauses of the inner SELECT expression.
  • The GROUP iterator creates an internal variable ($gb-1) that iterates over the records produced by the SELECT expression.
  • The result set produced by the lower GROUP iterator is partial: it may not contain all the age groups and for the age groups that it does contain, the income may be a partial sum (because all rows for a given age may not have been retrieved when query execution stops). The upper GROUP iterator receives the partial results from each RN and performs the final grouping and aggregation. It operates the same way as the lower GROUP iterators and will keep operating until there are no more partial results from the RNs. At that point, the full and final result set is cached at the upper GROUP iterator and is returned to the application.
  • The upper GROUP iterator creates an internal variable ($gb-2) that iterates over the records produced by the outer SELECT expression. The $gb-2 variable has the age and sum of income of all users of the CA state, grouped by age.
  • In the SELECT expression, two fields are fetched: age and sum(income). These correspond to two field names and field expressions in the SELECT expression clause.
  • For the age field, the field expression is computed by a FIELD_STEP iterator. The FIELD_STEP iterator extracts and returns the value of the age field from the records produced by its input iterator.
    {
       "iterator kind" : "FIELD_STEP",
       "field name" : "age",
       "input iterator" :
       {
          "iterator kind" : "VAR_REF",
          "variable" : "$gb-2"
       }
    }
  • The second field to be fetched is an aggregate function sum of income values. The FUNC_SUM iterator is used for this. It iterates over the value of the income field and determines the sum of all incomes from the result of its input iterator.
    "aggregate functions" : [
       {
          "iterator kind" : "FUNC_SUM",
          "input iterator" :
          {
            "iterator kind" : "FIELD_STEP",
            "field name" : "Column_2",
            "input iterator" :
            {
              "iterator kind" : "VAR_REF",
              "variable" : "$gb-2"
            }
          }
       }
    ]