Data type discovery

When Data Processing retrieves data from a Hive table, the Hive data types are mapped to Dgraph data types when the data is ingested into the Dgraph.

The discovery phase of a workflow means that Data Processing discovers the data set metadata in order to determine the Dgraph attribute schema. Once Data Processing can ascertain what the data type is of a given Hive table column, it can map that Hive column data type to a Dgraph attribute data type.

For most types of workflows, the discovery phase is performed on the sample file. The exception is a Refresh update, which is a full data refresh on a BDD data set from the original Hive table.

Hive-to-Dgraph data conversions

When a Hive table is created, a data type is specified for each column (such as BOOLEAN or DOUBLE). During a Data Processing workflow, a Dgraph attribute is created for each Hive column. The Dgraph data type for the created attribute is based on the Hive column data type. For more information on the data model, including information about what are Dgraph records, and what are Dgraph attributes, see the section Dgraph Data Model.

This table lists the mappings for supported Hive data types to Dgraph data types. If a Hive data type is not listed, it is not supported by Data Processing and the data in that column will not be provisioned.
Hive Data Type Hive Description Dgraph Data Type Conversion
ARRAY<data_type> Array of values of a Hive data type (such as, ARRAY<STRING>) mdex:data_type-set

where data_type is a Dgraph data type in this column. These -set data types are for multi-assign attributes (such as mdex:string-set).

BIGINT 8-byte signed integer. mdex:long
BOOLEAN Choice of TRUE or FALSE. mdex:boolean
CHAR Character string with a fixed length (maximum length is 255) mdex:string
DATE Represents a particular year/month/day, in the form:
YYYY-MM-DD
Date types do not have a time-of-day component. The range of values supported is 0000-01-01 to 9999-12-31.
mdex:dateTime
DECIMAL Numeric with a precision of 38 digits. mdex:double
DOUBLE 8-byte (double precision) floating point number. mdex:double
FLOAT 4-byte (single precision) floating point number. mdex:double
INT 4-byte signed integer. mdex:long
SMALLINT 2-byte signed integer. mdex:long
STRING String values with a maximum of 32,767 bytes. mdex:string

A String column can be mapped as a Dgraph non-string data type if 100% of the values are actually in another data format, such as long, dateTime, and so on.

TIMESTAMP Represents a point in time, with an optional nanosecond precision. Allowed date values range from 1400-01-01 to 9999-12-31. mdex:dateTime
TINYINT 1-byte signed integer. mdex:long
VARCHAR Character string with a length specifier (between 1 and 65355) mdex:string

Data type discovery for Hive string columns

If a Hive column is configured with a data type other than STRING, Data Processing assumes that the formats of the record values in that column are valid. In this case, a Dgraph attributes derived from the column automatically use the mapped Dgraph data type listed in the table above.

String columns, however, often store data that really is non-string data (for example, integers can be stored as strings). When it analyzes the content of Hive table string columns, Data Processing makes a determination as to what type of data is actually stored in each column, using this algorithm:
  • If 100% of the column values are of a certain type, then the column values are ingested into the Dgraph as their Dgraph data type equivalents (see the table above).
  • If the data types in the column are mixed (such as integers and dates), then the Dgraph data type for that column is string (mdex:string). The only exception to this rule is if the column has a mixture of integers and doubles (or floats); in this case, the data type maps to mdex:double (because an integer can be ingested as a double but not vice-versa).

For example, if the Data Processing discoverer concludes that a given string column actually stores geocodes (because 100% of the column values are proper geocodes), then those geocode values are ingested as Dgraph mdex:geocode data types. If however, 95% of the column values are geocodes but the other 5% are another data type, then the data type for the column defaults to the Dgraph mdex:string data type. Note, however, that double values that are in scientific notation (such as "1.4E-4") are evaluated as strings, not as doubles.

To take another example, if 100% of a Hive string column consists of integer values, then the values are ingested as Dgraph mdex:long data types. Any valid integer format is accepted, such as "10", "-10", "010", and "+10".

Space-padded values

Hive values that are padded with spaces are treated as follows:
  • All integers with spaces are converted to strings (mdex:string)
  • Doubles with spaces are converted to strings (mdex:string)
  • Booleans with spaces are converted to strings (mdex:string)
  • Geocodes are not affected even if they are padded with spaces.
  • All date/time/timestamps are not affected even if they are padded with spaces.

Supported geocode formats

The following Hive geocode formats are supported during the discovery phase and are mapped to the Dgraph mdex:geocode data type:
Latitude Longitude
Latitude, Longitude
(Latitude Longitude)
(Latitude, Longitude)
For example:
40.55467767 -54.235
40.55467767, -54.235
(40.55467767 -54.235)
(40.55467767, -54.235)

Note that the comma-delimited format requires a space after the comma.

If Data Processing discovers any of these geocode formats in the column data, the value is ingested into the Dgraph as a geocode (mdex:geocode) attribute.

Supported date formats

Dates that are stored in Hive tables as DATE values are assumed to be valid dates for ingest. These DATE values are ingested as Dgraph mdex:dateTime data types.

For a date that is stored in a Hive table as a string, Data Processing checks it against a list of supported date formats. If the string date matches one of the supported date formats, then it is ingested as an mdex:dateTime data type. The date formats that are supported by Data Processing are listed in the dateFormats.txt file. Details on this file are provided in the topic Date format configuration.

In addition, Data Processing verifies that each date in a string column is a valid date. If a date is not valid, then the column is considered a string column, not a date column.

As an example of how a Hive column date is converted to a Dgraph date, a Hive date value of:
2013-10-23 01:23:24.1234567
will be converted to a Dgraph dateTime value of:
2013-10-23T05:23:24.123Z

The date will be ingested as a Dgraph mdex:dateTime data type.

Support of timestamps

Hive TIMESTAMP values are assumed to be valid dates and are ingested as Dgraph mdex:dateTime data types. Therefore, their format is not checked against the formats in the dateFormats.txt file.

When shown in Studio, Hive TIMESTAMP values will be formatted as "yyyy-MM-dd" or "yyyy-MM-dd HH:mm:ss" (depending on if the values in that column have times).

Note that if all values in a Hive timestamp column are not in the same format, then the time part in the Dgraph record becomes zero. For example, assume that a Hive column contains the following values:
2013-10-23 01:23:24
2012-09-22 02:24:25
Because both timestamps are in the same format, the corresponding values created in the Dgraph records are:
2013-10-23T01:23:24.000Z
2012-09-22T02:24:25.000Z
Now suppose a third row is inserted into that Hive table without the time part. The Hive column now has:
2013-10-23 01:23:24
2012-09-22 02:24:25
2007-07-23
In this case, the time part of the Dgraph records (the mdex:dateTime value) becomes zero:
2013-10-23T00:00:00.000Z
2012-09-22T00:00:00.000Z
2007-07-23T00:00:00.000Z

The reason is that if there are different date formats in the input data, then the Data Processing discoverer selects the more general format that matches all of the values, and as a result, the values that have more specific time information may end up losing some information.

To take another example, the pattern "yyyy-MM-dd" can parse both "2001-01-01" and "2001-01-01 12:30:23". However, a pattern like "yyyy-MM-dd hh:mm:ss" will throw an error when applied on the short string "2001-01-01". Therefore, the discoverer picks the best (longest possible) choice of "yyyy-MM-dd" that can match both "2001-01-01" and "2001-01-01 12:30:23". Because the picked pattern does not have time in it, there will be loss of precision.

Handling of unconvertible values

It is possible for your data to have column values that result in conversion errors (that is, where the original value cannot be converted to a Dgraph data type). Warnings are logged for the columns that contain conversion errors. For each column, one of the values that could not be converted is logged, as well as the total number of records that contained values that could not be converted. In addition, the values from the data set.

The following are examples of these log messages for unconvertible values:
[2016-03-16T16:01:43.315-04:00] [DataProcessing] [WARN] [] 
[com.oracle.endeca.pdi.logging.ProvisioningLogger] [tid:Driver] [userID:yarn] 
Found 2 records containing unconvertible values (such as "2.718") for data source key type_tinyint. 
These values could not be converted to type mdex:long and have been removed from the data set.

[2016-03-16T16:01:43.315-04:00] [DataProcessing] [WARN] [] 
[com.oracle.endeca.pdi.logging.ProvisioningLogger] [tid:Driver] [userID:yarn] 
Found 4 records containing unconvertible values (such as "maybe") for data source key
type_string_as_boolean. These values could not be converted to type mdex:boolean and 
have been removed from the data set.