When creating a new data set, you can specify the maximum number of records that the Data Processing workflow should process from the Hive table.
You can set this limit with either of the following:
The bdd.sampleSize parameter in the Data Processing Settings page on Studio's Control Panel.
The maxRecordsForNewDataSet configuration parameter or the --maxRecords flag.
If the value of the parameter is greater than the number of records in the Hive table, then all the Hive records are processed. In this case, the data set is considered a full data set.
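The effect of the record limit can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not Data Processing code: the function name and its parameters are hypothetical, it ignores how the records are actually selected from the table, and treating a limit equal to the table size as a full data set is an assumption.

```python
from typing import Tuple


def records_to_process(hive_row_count: int, max_records: int) -> Tuple[int, bool]:
    """Return (number of records processed, whether the result is a full data set).

    hive_row_count: number of records in the Hive table.
    max_records: the configured limit (for example, the value of bdd.sampleSize).
    """
    # Assumption: a limit equal to the table size also yields a full data set.
    if max_records >= hive_row_count:
        return hive_row_count, True
    # Otherwise only max_records records are processed (a sample).
    return max_records, False


if __name__ == "__main__":
    print(records_to_process(500, 1000))   # limit exceeds the table size
    print(records_to_process(5000, 1000))  # limit is smaller than the table
```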
Discovery for attributes
The Data Processing discovery phase discovers the data set metadata in order to suggest a Dgraph attribute schema. For detailed information on the Dgraph schema, see Dgraph Data Model.
Record and value search settings for string attributes
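The record search and value search settings for a string attribute are determined by comparing the attribute's average string length with two properties in the bdd.conf file:
The RECORD_SEARCH_THRESHOLD property value.
The VALUE_SEARCH_THRESHOLD property value.
In both cases, "average string length" refers to the average string length of the values for that column.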
Effect of NULL values on column conversion
When a Hive table is being sampled, a Dgraph attribute is created for each column. The data type of the Dgraph attribute depends on how Data Processing interprets the values in the Hive column. For example, if the Hive column is of type String but it contains Boolean values only, the Dgraph attribute is of type mdex:boolean. NULL values are basically ignored in the Data Processing calculation that determines the data type of the Dgraph attribute.
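The following Python sketch illustrates the idea of skipping NULL values when deciding on an attribute type. It is a simplified illustration, not the Data Processing discoverer: the function name, the set of strings treated as Boolean, and the fallback to mdex:string (including for a column that contains only NULL values) are all assumptions.

```python
from typing import Iterable, Optional

# Assumption: which string literals count as Boolean values.
BOOLEAN_LITERALS = {"true", "false"}


def suggest_attribute_type(values: Iterable[Optional[str]]) -> str:
    """Suggest a Dgraph type for a Hive String column, ignoring NULLs (None)."""
    # NULL values take no part in the decision.
    non_null = [v for v in values if v is not None]
    # Only Boolean-looking values remain: suggest a Boolean attribute.
    if non_null and all(v.strip().lower() in BOOLEAN_LITERALS for v in non_null):
        return "mdex:boolean"
    # Assumption: anything else, including an all-NULL column, stays a string.
    return "mdex:string"


if __name__ == "__main__":
    print(suggest_attribute_type(["true", None, "FALSE"]))
    print(suggest_attribute_type(["true", "maybe", None]))
```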
Handling of Hive column names that are invalid Avro names
Data Processing uses Avro files to store data that should be ingested into the Dgraph (via the Dgraph HDFS Agent). In Avro, attribute names must start with an alphabetic or underscore character (that is, [A-Za-z_]), and the rest of the name can contain only alphanumeric characters and underscores (that is, [A-Za-z0-9_]).
Hive column names, however, can contain almost any Unicode character, including characters that are not allowed in Avro attribute names. This naming format was introduced in Hive 0.13.0. Data Processing therefore changes such column names when it stores them as Avro attributes. For example:
Hive column name: @first-name
Changed name: A__first_name
In this example, the leading character (@) is not a valid Avro character and is, therefore, converted to an underscore (the name is also prefixed with "A_"). The hyphen is replaced with an underscore and the other characters are unchanged.
Attribute names for non-English tables are likely to contain many underscore replacements, and duplicate names are possible. A non-English attribute name may therefore look like this: A_______2
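The name conversion described in this section can be approximated with the short Python sketch below. It reproduces the @first-name example, but it is an assumption-based illustration rather than the actual Data Processing logic: the function name is hypothetical, the "A_" prefix is assumed to apply whenever the first character is not in [A-Za-z_], and the numbering of duplicate names is not modeled.

```python
import re


def to_avro_name(hive_column: str) -> str:
    """Convert a Hive column name into a name that is valid in Avro."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", hive_column)
    # Assumption: prefix "A_" when the original first character cannot
    # start an Avro name (that is, it is not in [A-Za-z_]).
    if not re.match(r"[A-Za-z_]", hive_column):
        cleaned = "A_" + cleaned
    return cleaned


if __name__ == "__main__":
    print(to_avro_name("@first-name"))  # A__first_name
    print(to_avro_name("first_name"))   # first_name (already valid)
```

Under these assumptions, a name made up entirely of non-ASCII characters becomes "A_" followed by one underscore per character, which is consistent with the heavily underscored names mentioned above.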