Sampling and attribute handling

When creating a new data set, you can specify the maximum number of records that the Data Processing workflow should process from the Hive table.

The number of sampled records from a Hive table is set by the Studio or DP CLI configuration:
  • In Studio, the bdd.sampleSize parameter in the Data Processing Settings page on Studio's Control Panel.
  • In DP CLI, the maxRecordsForNewDataSet configuration parameter or the --maxRecords flag.

If the configured value is greater than the number of records in the Hive table, then all of the Hive records are processed. In this case, the data set is considered a full data set.
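The sampling decision can be sketched as follows. This is a minimal illustration, not the actual Data Processing code; plan_sampling and its parameter names are hypothetical stand-ins for the configured limit (bdd.sampleSize or maxRecordsForNewDataSet) and the Hive table's record count:

```python
def plan_sampling(hive_record_count: int, max_records: int) -> dict:
    """Decide how many records to process and whether the result
    is a sample or a full data set (hypothetical sketch)."""
    if max_records >= hive_record_count:
        # The limit covers the whole table: every record is processed,
        # and the data set is considered a full data set.
        return {"records_to_process": hive_record_count, "full_data_set": True}
    # Otherwise only a sample of max_records is processed.
    return {"records_to_process": max_records, "full_data_set": False}
```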

Discovery for attributes

The Data Processing discovery phase discovers the data set metadata in order to suggest a Dgraph attribute schema. For detailed information on the Dgraph schema, see Dgraph Data Model.

Record and value search settings for string attributes

When the DP data type discoverer determines that an attribute should be a string attribute, the record search and value search settings for the attribute are configured according to the settings of two properties in the bdd.conf file:
  • The attribute is configured as record searchable if the average string length is greater than the RECORD_SEARCH_THRESHOLD property value.
  • The attribute is configured as value searchable if the average string length is equal to or less than the VALUE_SEARCH_THRESHOLD property value.

In both cases, "average string length" refers to the average string length of the values for that column.

You can override this behavior by using the --disableSearch flag with the DP CLI. With this flag, the record search and value search settings for string attributes are set to false, regardless of the average string length of the attribute values. Note the following about using the --disableSearch flag:
  • The flag can be used only for provisioning workflows (when a new data set is created from a Hive table) and for refresh update workflows (when the DP CLI --refreshData flag is used). The flag cannot be used with any other type of workflow (for example, workflows that use the --incrementalUpdate flag are not supported with the --disableSearch flag).
  • A disable search workflow can be run only with the DP CLI. This functionality is not available in Studio.
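The threshold logic and the --disableSearch override can be sketched as follows. This is a minimal illustration, not the actual Data Processing implementation; search_settings is a hypothetical helper, and record_threshold and value_threshold stand in for the RECORD_SEARCH_THRESHOLD and VALUE_SEARCH_THRESHOLD properties:

```python
def search_settings(values, record_threshold, value_threshold,
                    disable_search=False):
    """Decide the record search and value search settings for one
    string column (hypothetical sketch)."""
    if disable_search:
        # --disableSearch forces both settings to false, regardless
        # of the average string length.
        return {"record_searchable": False, "value_searchable": False}
    # "Average string length" is computed over the column's values.
    non_null = [v for v in values if v is not None]
    avg_len = sum(len(v) for v in non_null) / len(non_null) if non_null else 0
    return {
        # Record searchable if the average length exceeds the threshold.
        "record_searchable": avg_len > record_threshold,
        # Value searchable if the average length is at or below the threshold.
        "value_searchable": avg_len <= value_threshold,
    }
```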

Effect of NULL values on column conversion

When a Hive table is being sampled, a Dgraph attribute is created for each column. The data type of the Dgraph attribute depends on how Data Processing interprets the values in the Hive column. For example, if the Hive column is of type String but contains only Boolean values, the Dgraph attribute is of type mdex:boolean. NULL values are ignored in the Data Processing calculation that determines the data type of the Dgraph attribute.
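As a rough sketch of this behavior (infer_dgraph_type is a hypothetical name, and only the Boolean-versus-string distinction from the example above is shown):

```python
def infer_dgraph_type(values):
    """Infer a Dgraph attribute type for a sampled column,
    ignoring NULL values (hypothetical sketch)."""
    # NULLs play no part in the type determination.
    non_null = [v for v in values if v is not None]
    # A String column holding only Boolean values becomes mdex:boolean.
    if non_null and all(v.lower() in ("true", "false") for v in non_null):
        return "mdex:boolean"
    return "mdex:string"
```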

Handling of Hive column names that are invalid Avro names

Data Processing uses Avro files to store data that should be ingested into the Dgraph (via the Dgraph HDFS Agent). In Avro, attribute names must start with an alphabetic or underscore character (that is, [A-Za-z_]), and the rest of the name can contain only alphanumeric characters and underscores (that is, [A-Za-z0-9_]).

Hive column names, however, can contain almost any Unicode character, including characters that are not allowed in Avro attribute names. (Support for such characters in column names was introduced in Hive 0.13.0.)

Because Data Processing uses Avro files to do ingest, this limits the names of Dgraph attributes to the same rules as Avro. This means that the following changes are made to column names when they are stored as Avro attributes:
  • Any character (in a Hive column name) that is not an ASCII alphanumeric character or an underscore is changed to _ (the underscore).
  • If the leading character is disallowed, that character is changed to an underscore and the name is then prefixed with "A_". As a result, the name actually begins with "A__" (an "A" followed by two underscores).
  • If the resulting name duplicates an already-processed column name, a number is appended to the attribute name to make it unique. This is especially likely with non-English column names.
For example:
Hive column name: @first-name

Changed name: A__first_name

In this example, the leading character (@) is not a valid Avro character and is, therefore, converted to an underscore (the name is also prefixed with "A_"). The hyphen is replaced with an underscore and the other characters are unchanged.

Attribute names for non-English tables are likely to contain many underscore replacements, which can also produce duplicate names that must be numbered. A converted non-English attribute name may therefore look like this: A_______2
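The conversion rules above can be sketched as follows. This is a hypothetical helper (to_avro_name), not the actual Data Processing code, but it reproduces the @first-name example and the numbered-duplicate behavior:

```python
import re

def to_avro_name(column_name: str, seen: set) -> str:
    """Convert a Hive column name to a valid Avro attribute name
    (hypothetical sketch of the rules described above)."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    name = re.sub(r"[^A-Za-z0-9_]", "_", column_name)
    # If the original leading character was not a valid Avro start
    # character ([A-Za-z_]), prefix the name with "A_".
    if not re.match(r"[A-Za-z_]", column_name[0]):
        name = "A_" + name
    # Disambiguate duplicates by appending a number.
    candidate, n = name, 1
    while candidate in seen:
        candidate = f"{name}{n}"
        n += 1
    seen.add(candidate)
    return candidate
```

For instance, "@first-name" becomes "A__first_name": the @ is replaced with an underscore, the name is prefixed with "A_", and the hyphen becomes an underscore. Two distinct all-non-ASCII names that collapse to the same string of underscores are disambiguated with an appended number.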