Preparing your data for ingest

Although it is not required, cleaning your source data before ingest is recommended, because clean data helps Data Processing workflows run more smoothly and prevents ingest errors.

Data Processing has no component that manipulates the source data as it is being ingested. For example, Data Processing cannot remove invalid characters that are stored in the Hive table as they are being ingested. Therefore, if you choose to clean your source data, use Hive or third-party tools to do so.

Note that after a data set is created in Big Data Discovery, you can manipulate its contents in Studio by using Transform and its functions.

Removing invalid XML characters

During the ingest procedure that is run by Data Processing, a record that contains invalid data is skipped (that is, it is not ingested into the Dgraph). Typically, the invalid data consists of invalid XML characters. To be valid for ingest, a character must match the Char production (production 2) of the XML 1.0 specification. If an invalid character is detected, an exception is thrown with this error message:
Character <c> is not legal in XML 1.0

The record with that character is rejected.
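Because Data Processing does not remove such characters itself, one option is to filter them out of the source text before it is loaded into Hive. The following is a minimal Python sketch of that idea, not part of Big Data Discovery; the function names are hypothetical. The character class is the complement of the Char production (production 2) of the XML 1.0 specification.

```python
import re

# Complement of the XML 1.0 Char production (production 2):
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters not legal in XML 1.0."""
    return [(m.start(), hex(ord(m.group())))
            for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with every character outside the Char production removed."""
    return _INVALID_XML_CHARS.sub("", text)
```

Running find_invalid_xml_chars first shows which values would be changed, which is useful when silently dropping characters is not acceptable for your data.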

Fixing date formats

Ingested date values originate from one or more of the following kinds of Hive table columns:
  • Columns configured as DATE data types.
  • Columns configured as TIMESTAMP data types.
  • Columns configured as STRING data types but having date values. The date formats that are supported via this data type discovery method are listed in the dateFormats.txt file. For details on this file, see Date format configuration.

Make sure that dates in STRING columns are well-formed and conform to a format in the dateFormats.txt file. Otherwise, they are ingested as string values, not as Dgraph mdex:dateTime data types.

In addition, make sure that the dates in a STRING column are valid dates. For example, the date Mon, Apr 07, 1925 is invalid because April 7, 1925 is a Tuesday, not a Monday. Therefore, this invalid date would cause the column to be detected as a STRING column, not a DATE column.
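A check of this kind can be run on the source values before ingest. The sketch below is illustrative only: it assumes that the values use the English "Mon, Apr 07, 1925" layout from the example above, and the function name is hypothetical. It does not read dateFormats.txt and does not reproduce the detection logic of Data Processing.

```python
import re
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Assumed layout: "Mon, Apr 07, 1925" (weekday, month, two-digit day, year).
_LAYOUT = re.compile(r"^([A-Z][a-z]{2}), ([A-Z][a-z]{2}) (\d{2}), (\d{4})$")


def is_consistent_date(value):
    """Return True if value is a real calendar date whose weekday name matches."""
    match = _LAYOUT.match(value)
    if match is None:
        return False
    weekday, month, day, year = match.groups()
    if weekday not in WEEKDAYS or month not in MONTHS:
        return False
    try:
        parsed = date(int(year), MONTHS[month], int(day))
    except ValueError:
        # Not a calendar date at all, for example Feb 30.
        return False
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[parsed.weekday()] == weekday
```

With this sketch, is_consistent_date("Mon, Apr 07, 1925") returns False and is_consistent_date("Tue, Apr 07, 1925") returns True.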

Uploading Excel and CSV files

In Studio, you can create a new data set by uploading data from an Excel or CSV file. Data from these file types is always uploaded as STRING data types.

For this reason, make sure that the data in each column of the file is of a consistent data type. For example, if a column is supposed to store integers, check that the column contains no non-integer data. Likewise, check that date input conforms to the formats in the dateFormats.txt file.
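The integer check can be done with a short script before the file is uploaded. The following Python sketch assumes a CSV file with a header row; the function name and the sample column names are hypothetical, and empty cells are reported as non-integer.

```python
import csv
import io
import re

# Optional sign followed by ASCII digits only.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def find_non_integer_cells(csv_text, column):
    """Return (data row number, value) pairs where column is not an integer.

    Data rows are numbered from 1; the header row is not counted.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    bad_cells = []
    for row_number, row in enumerate(reader, start=1):
        value = (row.get(column) or "").strip()
        if _INTEGER.fullmatch(value) is None:
            bad_cells.append((row_number, value))
    return bad_cells


sample = "id,qty\n1,10\n2,ten\n3,7.5\n4,-3\n"
print(find_non_integer_cells(sample, "qty"))
```

For the sample data, the script reports data rows 2 and 3, whose qty values are "ten" and "7.5". To check a file on disk, read its contents and pass them in as csv_text.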