Incremental flag syntax

This topic describes the syntax of the --incrementalUpdate flag.

The DP CLI flag syntax for an Incremental update operation is one of the following:

./data_processing_CLI --incrementalUpdate <logicalName> <filter>

./data_processing_CLI --incrementalUpdate <logicalName> <filter> --table <tableName>

./data_processing_CLI --incrementalUpdate <logicalName> <filter> --table <tableName> \
   --database <dbName>

where:

--incrementalUpdate (abbreviated as -incremental) is mandatory and specifies the Data Set Logical Name (logicalName) of the data set to be updated.
filter is mandatory and is a filter predicate that limits the records to be selected from the Hive table.
--table (abbreviated as -t) is optional and specifies a Hive table to be used for the source data. This flag allows you to override the source Hive table that was used to create the original data set (the name of the original Hive table is stored in the data set's metadata).
--database (abbreviated as -d) is optional and specifies the database of the Hive table specified with the --table flag. This flag allows you to override the database that was used to create the original data set. The --database flag can be used only if the --table flag is also used.

The logicalName value is available in the Data Set Logical Name property in Studio. For details, see Obtaining the Data Set Logical Name.
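
As an illustration of the third syntax form, the following sketch assembles a command line that overrides both the source table and its database. The table name warranty_claims_v2 and the database name default are hypothetical placeholders, and the logical name and filter are borrowed from the examples later in this topic. The command is only printed, not run, so that it can be reviewed before use.

```shell
logical_name="10133:WarrantyClaims"   # logical name borrowed from this topic's examples
filter="claimyear > 1970"             # filter predicate, passed as a single argument
table="warranty_claims_v2"            # hypothetical Hive table name
database="default"                    # hypothetical Hive database name

# Build the command for review; the filter is shown in double quotes
# because it must reach the DP CLI as one quoted argument.
dp_command=$(printf './data_processing_CLI --incrementalUpdate %s "%s" --table %s --database %s' \
  "$logical_name" "$filter" "$table" "$database")
printf '%s\n' "$dp_command"
```

Dropping the --database part of the format string gives the second syntax form. The reverse is not valid, because --database can be used only together with --table.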

Filter predicate format

The filter predicate is mandatory and must be a single, simple Boolean expression (compound expressions are not supported), with this format:

"columnName operator filterValue"

where:

columnName is the name of a column in the source Hive table.
operator is one of the following comparison operators:
- =
- <>
- >
- >=
- <
- <=
filterValue is a primitive value. Only primitive data types are supported, which are: integers (TINYINT, SMALLINT, INT, and BIGINT), floating point numbers (FLOAT and DOUBLE), Booleans (BOOLEAN), and strings (STRING). Note that expressions (such as "amount+1") are not supported.

You should enclose the entire filter predicate in either double quotes or single quotes. If you need to use quotes within the filter predicate, use the other quotation format. For example, if you use double quotes to enclose the filter predicate, then use single quotes within the predicate itself.
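
The two quoting styles can be sketched as follows. The column name state and the value CA are hypothetical and are used only to show a string literal inside the predicate; the script prints the text that the DP CLI would receive as the filter argument and does not invoke the DP CLI itself.

```shell
# "state" is a hypothetical STRING column used only for illustration.
filter_double_outer="state = 'CA'"   # double quotes outside, single quotes inside
filter_single_outer='state = "CA"'   # single quotes outside, double quotes inside

# Show the filter text that would be handed to the DP CLI in each case.
printf '%s\n' "$filter_double_outer" "$filter_single_outer"

# On a machine with the DP CLI installed, the filter would be passed like this:
# ./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims "$filter_double_outer"
```

In both cases the shell strips only the outer quotes, so the inner quotes arrive intact as part of the predicate.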

If columnName is configured as a DATE or TIMESTAMP data type, you can use the unix_timestamp date function, with one of these syntaxes:

columnName operator unix_timestamp(dateValue)

columnName operator unix_timestamp(dateValue, dateFormat)

If dateFormat is not specified, then the DP CLI uses one of two default date formats:

// date-time format:
yyyy-MM-dd HH:mm:ss

// time-only format:
HH:mm:ss

The date-time format is used for columns that map to Dgraph mdex:dateTime attributes, while the time-only format is used for columns that map to Dgraph mdex:time attributes.

If dateFormat is specified, use a pattern described here: http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html

Note on data types in the filter predicate

You should pay close attention to the Hive column data types when constructing a filter for Incremental update, because the result of a comparison depends on the data type of the column. This is especially important for columns of type String, because String values are compared lexicographically rather than numerically.

Take, as an example, this filter that uses the "age" column in the Hive table:

./data_processing_CLI -incremental 10133:WarrantyClaims "age<18"

If the "age" column is a String column, then the results from the filter will be different than if "age" were a Number column (such as Int or Tinyint). The results would differ because:

If "age" is a Number column, then "age < 18" means the column value must be numerically less than 18. The value 6, for example, is numerically less than 18.
If "age" is a String column, then "age < 18" means the column value must be lexicographically less than "18". The value "6" is lexicographically greater than "18", because the character "6" sorts after the character "1".

Therefore, the number of filtered records will differ depending on the data type of the "age" column.

Also keep in mind that if the data set was originally created using File Upload in Studio, then the underlying Hive table for that data set will have all columns of type String.
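
The difference between the two comparison rules can be reproduced outside of Hive. The following sketch uses awk (not Hive or the DP CLI) purely to illustrate numeric versus lexicographic ordering for the values 6 and 18 discussed above; it assumes a POSIX awk is available.

```shell
# Compare 6 with 18 as numbers, then as strings (C locale for a stable ordering).
numeric_result=$(LC_ALL=C awk 'BEGIN { r = (6 < 18) ? "true" : "false"; print r }')
string_result=$(LC_ALL=C awk 'BEGIN { r = ("6" < "18") ? "true" : "false"; print r }')

echo "6 < 18 as numbers: $numeric_result"   # prints: 6 < 18 as numbers: true
echo "6 < 18 as strings: $string_result"    # prints: 6 < 18 as strings: false
```

A record with an "age" of 6 would therefore pass the "age<18" filter on a Number column but be excluded on a String column.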

Examples

Example 1: If the Hive "claimyear" column contains the year in which a claim was made, then the command can be:

./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims "claimyear > 1970"

In the example, only the records of claims made after 1970 are processed.

Example 2: Using the unix_timestamp function with a supplied date-time format:

./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
 "factsales_shipdatekey_date >= unix_timestamp('2006-01-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')"

Example 3: Another example of using the unix_timestamp function with a supplied date-time format:

./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
 "creation_date >= unix_timestamp('2015-06-01 20:00:00', 'yyyy-MM-dd HH:mm:ss')"

Example 4: An invalid example of using the unix_timestamp function with a date that does not contain a time:

./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
 "claim_date >= unix_timestamp('2000-01-01')"

The error will be:

16:41:29.375 main ERROR: Failed to parse date / time value '2000-01-01' using the format 'yyyy-MM-dd HH:mm:ss'
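
The error arises because no dateFormat was supplied, so the default date-time format was applied to a value that has no time part. Based on the unix_timestamp syntax described earlier, supplying a dateFormat that matches the value (or adding a time to the value) should avoid it. The following sketch chooses between the two forms before the command is run. The helper name matches_default_datetime is hypothetical, and it checks only the shape of the value (digit positions and separators), not whether the date is valid.

```shell
# Hypothetical helper: succeeds if $1 has the shape yyyy-MM-dd HH:mm:ss.
# It checks digit positions and separators only, not whether the date is real.
matches_default_datetime() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

date_value='2000-01-01'
if matches_default_datetime "$date_value"; then
  # The value fits the default date-time format, so dateFormat can be omitted.
  filter="claim_date >= unix_timestamp('$date_value')"
else
  # Date only: supply a matching SimpleDateFormat pattern explicitly.
  filter="claim_date >= unix_timestamp('$date_value', 'yyyy-MM-dd')"
fi
printf '%s\n' "$filter"
```

For the value in Example 4, this prints a predicate with the explicit 'yyyy-MM-dd' pattern, which can then be passed as the quoted filter argument.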