Although not required, it is recommended that you clean your source data so that Data Processing workflows run more smoothly and ingest errors are avoided.
Data Processing does not have a component that manipulates the source data as it is being ingested. For example, Data Processing cannot remove invalid characters (that are stored in the Hive table) as they are being ingested. Therefore, you should use Hive or third-party tools to clean your source data, if you choose to do so.
Note that after a data set is created in Big Data Discovery, you can manipulate its contents in Studio by using Transform and its functions.
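If a record contains a character that is not valid in XML 1.0, an error of the following form is reported:
Character <c> is not legal in XML 1.0
The record with that character is rejected.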
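As an illustration of this kind of cleanup, the following minimal Python sketch removes characters that fall outside the Char production of the XML 1.0 specification (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The function name strip_invalid_xml_chars is hypothetical, and the sketch assumes that the values are cleaned with an external script before they are loaded into the Hive table; it is not part of Data Processing.

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value: str) -> str:
    """Return value with every character that is illegal in XML 1.0 removed.

    Hypothetical helper for pre-ingest cleanup; it drops the offending
    characters instead of escaping or replacing them.
    """
    return _INVALID_XML_CHARS.sub("", value)


# NUL (\x00) and vertical tab (\x0b) are illegal; the tab is legal and is kept.
print(repr(strip_invalid_xml_chars("abc\x00def\x0b\tghi")))
```

Whether to drop, replace, or log the offending characters is a design choice; dropping them is simply the shortest variant to show.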
Make sure that dates in STRING columns are well-formed and conform to a format in the dateFormats.txt file. Otherwise, they will be ingested as string values, not as Dgraph mdex:dateTime data types.
In addition, make sure that the dates in a STRING column are valid dates. For example, the date Mon, Apr 07, 1925 is invalid because April 7, 1925 is a Tuesday, not a Monday. Therefore, this invalid date would cause the column to be detected as a STRING column, not a DATE column.
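The weekday check described above can be sketched in Python as follows. This is only an illustration: the function name weekday_matches_date is hypothetical, and the sketch assumes a single layout (abbreviated English day name, comma, then a date such as Apr 07, 1925) and the default C locale for month names. It does not check whether that layout is listed in dateFormats.txt.

```python
from datetime import datetime

# English day abbreviations indexed by datetime.weekday() (Monday == 0).
_DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_matches_date(text: str) -> bool:
    """Return True if text such as 'Tue, Apr 07, 1925' names the correct weekday.

    Hypothetical helper; only the layout '<day>, <Mon> <dd>, <yyyy>' is handled.
    """
    day_name, _, rest = text.partition(",")
    try:
        parsed = datetime.strptime(rest.strip(), "%b %d, %Y")
    except ValueError:
        # Not a real calendar date, or not in the assumed layout.
        return False
    return _DAY_NAMES[parsed.weekday()] == day_name.strip()


print(weekday_matches_date("Mon, Apr 07, 1925"))  # False: that date is a Tuesday
print(weekday_matches_date("Tue, Apr 07, 1925"))  # True
```

The explicit comparison is needed because datetime.strptime does not itself verify that a parsed day name agrees with the rest of the date.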
In Studio, you can create a new data set by uploading data from an Excel or CSV file. For these file types, the data is always uploaded as STRING data types.
For this reason, you should make sure that the file's column data are of consistent data types. For example, if a column is supposed to store integers, check that the column does not have non-integer data. Likewise, check that date input conforms to the formats in the dateFormats.txt file.
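A consistency check of this kind can be run on a CSV file before it is uploaded. The following Python sketch is one possible approach; the function name find_non_integer_values and the sample column names (id, quantity) are hypothetical, and the sketch assumes that the file has a header row.

```python
import csv
import io
import re

# An optional sign followed by ASCII digits only.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def find_non_integer_values(csv_file, column):
    """Return (line_number, value) pairs for values in column that are not integers.

    Hypothetical helper; empty values are reported as well, because they
    also break the "integers only" expectation for the column.
    """
    problems = []
    reader = csv.DictReader(csv_file)
    for row in reader:
        value = (row[column] or "").strip()
        if not _INTEGER.fullmatch(value):
            # line_num is the line of the source file that was just read.
            problems.append((reader.line_num, value))
    return problems


# Small in-memory sample standing in for a real file opened with open(...).
sample = io.StringIO("id,quantity\n1,10\n2,ten\n3,7.5\n4,\n")
print(find_non_integer_values(sample, "quantity"))
# [(3, 'ten'), (4, '7.5'), (5, '')]
```

The same pattern can be adapted to other expectations, such as checking a date column against the layouts you intend to use.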