This topic describes the syntax of the
--incrementalUpdate flag.
The DP CLI flag syntax for an Incremental update operation is one of
the following:
./data_processing_CLI --incrementalUpdate <logicalName> <filter>
or
./data_processing_CLI --incrementalUpdate <logicalName> <filter> --table <tableName>
or
./data_processing_CLI --incrementalUpdate <logicalName> <filter> --table <tableName> \
    --database <dbName>
where:
- --incrementalUpdate
(abbreviated as
-incremental) is mandatory and specifies the
Data Set Logical Name (logicalName) of the data set to be
updated.
- filter is a filter predicate that limits the
records to be selected from the Hive table.
- --table
(abbreviated as
-t) is optional and specifies a Hive table to
be used for the source data. This flag allows you to override the source Hive
table that was used to create the original data set (the name of the original
Hive table is stored in the data set's metadata).
- --database
(abbreviated as
-d) is optional and specifies the database of
the Hive table specified with the
--table flag. This flag allows you to override
the database that was used to create the original data set. The
--database flag can be used only if the
--table flag is also used.
The
logicalName value is available in the
Data Set Logical Name property in Studio. For
details, see
Obtaining the Data Set Logical Name.
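The flag rules above can be sketched as a small shell helper that assembles the command line and rejects --database when --table is absent. The function name build_incremental_cmd and the table and database names in the usage line are hypothetical, introduced only for illustration. The helper prints the command rather than running it, and it assumes the filter itself contains no double quotes.

```shell
# Hypothetical helper: prints a DP CLI incremental-update command line.
# Usage: build_incremental_cmd <logicalName> <filter> [tableName] [dbName]
build_incremental_cmd() {
    logical_name=$1
    filter=$2
    table=${3:-}
    database=${4:-}

    # The logical name and the filter predicate are both mandatory.
    if [ -z "$logical_name" ] || [ -z "$filter" ]; then
        echo "error: logicalName and filter are both required" >&2
        return 1
    fi
    # --database is only valid together with --table.
    if [ -n "$database" ] && [ -z "$table" ]; then
        echo "error: --database requires --table" >&2
        return 1
    fi

    cmd="./data_processing_CLI --incrementalUpdate $logical_name \"$filter\""
    if [ -n "$table" ]; then
        cmd="$cmd --table $table"
    fi
    if [ -n "$database" ]; then
        cmd="$cmd --database $database"
    fi
    printf '%s\n' "$cmd"
}

# Example with a hypothetical table name and database name.
build_incremental_cmd 10133:WarrantyClaims "claimyear > 1970" warrantyclaims_new default
```

Printing the command first makes it easy to review the flags before running the update against a live data set.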
Filter predicate format
A filter predicate is mandatory and is one simple Boolean expression
(not compounded), with this format:
"columnName operator filterValue"
where:
- columnName is the
name of a column in the source Hive table.
- operator is one of
the following comparison operators:
- filterValue is a
primitive value. Only primitive data types are supported, which are: integers
(TINYINT,
SMALLINT,
INT, and
BIGINT), floating point numbers
(FLOAT and
DOUBLE), Booleans (BOOLEAN), and
strings (STRING). Note that expressions (such as "amount+1")
are not supported.
You should enclose the entire filter predicate in either double quotes
or single quotes. If you need to use quotes within the filter predicate, use
the other quotation format. For example, if you use double quotes to enclose
the filter predicate, then use single quotes within the predicate itself.
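The quoting rule can be checked in the shell before the DP CLI is involved at all. The sketch below uses a hypothetical show_args function (not part of the DP CLI) that prints each argument it receives, so you can confirm that the quoted predicate arrives as a single argument.

```shell
# Hypothetical helper: prints the argument count, then each argument in brackets.
show_args() {
    echo "$# argument(s)"
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Double quotes outside, single quotes inside: the predicate stays one argument.
show_args 10133:WarrantyClaims "creation_date >= unix_timestamp('2015-06-01 20:00:00', 'yyyy-MM-dd HH:mm:ss')"

# Single quotes outside also keep the predicate as one argument.
show_args 10133:WarrantyClaims 'age<18'

# Without any quotes, the shell would treat < and > in the predicate as
# input/output redirections, so the predicate would never reach the program.
```

Both calls report two arguments: the logical name and the complete predicate.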
If
columnName is configured as a
DATE or
TIMESTAMP data type, you can use the
unix_timestamp date function, with one of these
syntaxes:
columnName operator unix_timestamp(dateValue)
columnName operator unix_timestamp(dateValue, dateFormat)
If
dateFormat is not specified, then the DP CLI uses
one of two default date formats:
// date-time format:
yyyy-MM-dd HH:mm:ss
// time-only format:
HH:mm:ss
The date-time format is used for columns that map to Dgraph
mdex:dateTime attributes, while the time-only format
is used for columns that map to Dgraph
mdex:time attributes.
If
dateFormat is specified, use a pattern described
here:
http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
Note on data types in the filter predicate
You should pay close attention to the Hive column data types when
constructing a filter for Incremental update, because the results of a
comparison can differ. This is especially important for columns of type String,
because results of String comparison are different from results of Number
comparison.
Take, as an example, this filter that uses the "age" column in the
Hive table:
./data_processing_CLI -incremental 10133:WarrantyClaims "age<18"
If the "age" column is a String column, then the results from the
filter will be different than if "age" were a Number column (such as Int or
Tinyint). The results would differ because:
- If "age" is a Number
column, then "age < 18" means the column value must be numerically less than
18. The value 6, for example, is numerically less than 18.
- If "age" is a String
column, then "age < 18" means the column value must be lexicographically
less than 18. The value 6 is lexicographically greater than 18, because the
character "6" sorts after the character "1".
Therefore, the number of filtered records will differ depending on the
data type of the "age" column.
Also keep in mind that if the data set was originally created using
File Upload in Studio, then the underlying Hive table for that data set will
have all columns of type String.
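The difference between the two orderings can be reproduced outside Hive with the standard sort utility. This is only an illustration of lexicographic versus numeric ordering, not of how the DP CLI or Hive evaluates the filter.

```shell
# Smallest value when 6 and 18 are ordered as strings (byte order).
smallest_as_string=$(printf '6\n18\n' | LC_ALL=C sort | head -n 1)

# Smallest value when 6 and 18 are ordered as numbers.
smallest_as_number=$(printf '6\n18\n' | LC_ALL=C sort -n | head -n 1)

echo "smallest as string: $smallest_as_string"
echo "smallest as number: $smallest_as_number"
```

As strings, 18 sorts first, so a record with the value 6 does not satisfy "age<18". As numbers, 6 sorts first and the record does satisfy the filter.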
Examples
Example 1: If the Hive "claimyear" column contains the year in which a
claim was made, then the command can be:
./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims "claimyear > 1970"
In the example, only the records of claims made after 1970 are
processed.
Example 2: Using the
unix_timestamp function with a supplied date-time
format:
./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
    "factsales_shipdatekey_date >= unix_timestamp('2006-01-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')"
Example 3: Another example of using the
unix_timestamp function with a supplied date-time
format:
./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
    "creation_date >= unix_timestamp('2015-06-01 20:00:00', 'yyyy-MM-dd HH:mm:ss')"
Example 4: An invalid example of using the
unix_timestamp function with a date that does not
contain a time:
./data_processing_CLI --incrementalUpdate 10133:WarrantyClaims \
    "claim_date >= unix_timestamp('2000-01-01')"
The error will be:
16:41:29.375 main ERROR: Failed to parse date / time value '2000-01-01' using the format 'yyyy-MM-dd HH:mm:ss'
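An error like this can be caught before the command is run by checking that the date value has the shape of the default date-time format. The sketch below uses a hypothetical function, has_default_datetime_shape, which is not part of the DP CLI. It checks digit positions only and does not validate that the value is a real calendar date.

```shell
# Hypothetical helper: returns 0 if the value has the shape
# yyyy-MM-dd HH:mm:ss, and 1 otherwise. Digit positions only.
has_default_datetime_shape() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' '[0-9][0-9]:[0-9][0-9]:[0-9][0-9])
            return 0 ;;
        *)
            return 1 ;;
    esac
}

for value in '2000-01-01' '2000-01-01 00:00:00'; do
    if has_default_datetime_shape "$value"; then
        echo "ok: '$value' matches the default date-time format"
    else
        echo "mismatch: '$value' lacks the default date-time shape"
    fi
done
```

For a value that fails the check, either add the time part (for example '2000-01-01 00:00:00') or pass a dateFormat argument that matches the value, following the two-argument unix_timestamp syntax described earlier.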