When creating a new data set, you can specify the maximum number
of records that the Data Processing workflow should process from the Hive
table.
The number of records sampled from a Hive table is set by the Studio or DP CLI configuration:
- In Studio, by the bdd.sampleSize parameter on the Data Processing Settings page of Studio's Control Panel.
- In DP CLI, by the maxRecordsForNewDataSet configuration parameter or the --maxRecords flag.
If these settings are greater than the number of records in the Hive table, then all of the Hive records are processed. In this case, the data set is considered a full data set.
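As an illustration only, the following Python sketch shows how the configured record limit could interact with the table's row count; the function name plan_sample and the numbers used are hypothetical, and max_records stands in for bdd.sampleSize (Studio) or maxRecordsForNewDataSet / --maxRecords (DP CLI).

```python
# Hypothetical sketch of how the sampling cutoff is applied.
def plan_sample(total_hive_records: int, max_records: int) -> dict:
    """Decide how many records to process and whether the result is a full data set."""
    records_to_process = min(total_hive_records, max_records)
    return {
        "records_to_process": records_to_process,
        # If the configured limit meets or exceeds the table size,
        # every Hive record is processed and the data set is full.
        "is_full_data_set": max_records >= total_hive_records,
    }

# Example: a 750,000-row table with a 1,000,000-record limit is processed in full.
print(plan_sample(total_hive_records=750_000, max_records=1_000_000))
```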
Discovery for attributes
The Data Processing discovery phase discovers the data set metadata in
order to suggest a Dgraph attribute schema. For detailed information on the
Dgraph schema, see
Dgraph Data Model.
Record and value search settings for string attributes
When the DP data type discoverer determines that an attribute should be a string attribute, the record search and value search settings for the attribute are configured according to the settings of two properties in the bdd.conf file:
- The attribute is configured as record searchable if the average string length is greater than the RECORD_SEARCH_THRESHOLD property value.
- The attribute is configured as value searchable if the average string length is equal to or less than the VALUE_SEARCH_THRESHOLD property value.
In both cases, "average string length" refers to the average string
length of the values for that column.
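The following Python sketch illustrates how these two thresholds could be applied to a column's values. The function name search_settings and the threshold values shown are hypothetical, not defaults from bdd.conf; only the property names come from the documentation above.

```python
# Hypothetical illustration of the record/value search decision for a string attribute.
RECORD_SEARCH_THRESHOLD = 200   # example value only; the real value comes from bdd.conf
VALUE_SEARCH_THRESHOLD = 200    # example value only; the real value comes from bdd.conf

def search_settings(values,
                    record_threshold=RECORD_SEARCH_THRESHOLD,
                    value_threshold=VALUE_SEARCH_THRESHOLD):
    """Return (record_searchable, value_searchable) based on average string length."""
    non_null = [v for v in values if v is not None]
    avg_len = sum(len(v) for v in non_null) / len(non_null) if non_null else 0
    record_searchable = avg_len > record_threshold    # long text: suited to record search
    value_searchable = avg_len <= value_threshold     # short text: suited to value search
    return record_searchable, value_searchable

print(search_settings(["red", "green", "blue"]))   # short values -> (False, True)
```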
You can override this behavior by using the --disableSearch flag with the DP CLI. With this flag, the record search and value search settings for string attributes are set to false, regardless of the average string length of the attribute values. Note the following about using the --disableSearch flag:
- The flag can be used only for provisioning workflows (when a new data set is created from a Hive table) and for refresh update workflows (when the DP CLI --refreshData flag is used). The flag cannot be used with any other type of workflow (for example, workflows that use the --incrementalUpdate flag are not supported with the --disableSearch flag).
- A disable search workflow can be run only with the DP CLI. This functionality is not available in Studio.
Effect of NULL values on column conversion
When a Hive table is being sampled, a Dgraph attribute is created for each column. The data type of the Dgraph attribute depends on how Data Processing interprets the values in the Hive column. For example, if the Hive column is of type String but it contains only Boolean values, the Dgraph attribute is of type mdex:boolean. NULL values are ignored in the Data Processing calculation that determines the data type of the Dgraph attribute.
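A minimal Python sketch of this idea follows, assuming a simplified discoverer that only distinguishes Boolean strings from other strings; the actual Data Processing type discovery covers more types, and the function name discover_dgraph_type is hypothetical.

```python
# Simplified, hypothetical sketch of type discovery that ignores NULLs.
def discover_dgraph_type(hive_values):
    """Return a Dgraph type name based on the non-NULL values of a Hive String column."""
    non_null = [v for v in hive_values if v is not None]
    if non_null and all(v.lower() in ("true", "false") for v in non_null):
        return "mdex:boolean"   # every non-NULL value is Boolean
    return "mdex:string"

# NULLs do not affect the outcome: this column is still detected as Boolean.
print(discover_dgraph_type(["true", None, "false", None]))  # mdex:boolean
```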
Handling of Hive column names that are invalid Avro names
Data Processing uses Avro files to store data that should be ingested
into the Dgraph (via the Dgraph HDFS Agent). In Avro, attribute names must
start with an alphabetic or underscore character (that is, [A-Za-z_]), and the
rest of the name can contain only alphanumeric characters and underscores (that
is, [A-Za-z0-9_]).
Hive column names, however, can contain almost any Unicode character, including characters that are not allowed in Avro attribute names. (Support for such column names was introduced in Hive 0.13.0.) Because Data Processing uses Avro files for ingest, Dgraph attribute names are limited to the same rules as Avro. This means that the following changes are made to column names when they are stored as Avro attributes:
- Any character in a Hive column name that is not an ASCII alphanumeric character or underscore is changed to _ (the underscore).
- If the leading character is disallowed, that character is changed to an underscore and then the name is prefixed with "A_". As a result, the name actually begins with "A__" (an A followed by two underscores).
- If the resulting name is a duplicate of an already-processed column name, a number is appended to the attribute name to make it unique. This is especially likely with non-English column names.
For example:
Hive column name: @first-name
Changed name: A__first_name
In this example, the leading character (@) is not a valid Avro
character and is, therefore, converted to an underscore (the name is also
prefixed with "A_"). The hyphen is replaced with an underscore and the other
characters are unchanged.
Attribute names for non-English tables would probably have quite a few
underscore replacements and there could be duplicate names. Therefore, a
non-English attribute name may look like this: A_______2
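The following Python sketch approximates the renaming rules described above. It is an illustration of the documented rules, not the actual Data Processing implementation, and the function name to_avro_name is hypothetical.

```python
import re

def to_avro_name(column_name: str, seen: set) -> str:
    """Approximate the documented conversion of a Hive column name to an Avro-safe name."""
    # Characters that are not ASCII alphanumerics or underscores become underscores.
    name = re.sub(r"[^A-Za-z0-9_]", "_", column_name)
    # A disallowed leading character is changed to an underscore and the whole
    # name is prefixed with "A_", so the result begins with "A__".
    if not re.match(r"[A-Za-z_]", column_name[0]):
        name = "A__" + name[1:]
    # Duplicates of already-processed names get a numeric suffix.
    candidate, n = name, 1
    while candidate in seen:
        n += 1
        candidate = f"{name}{n}"
    seen.add(candidate)
    return candidate

seen = set()
print(to_avro_name("@first-name", seen))  # A__first_name
```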