Preparing your data for ingest

Although it is not required, cleaning your source data before ingest is recommended. Clean source data helps Data Processing workflows run more smoothly and prevents ingest errors.

Data Processing does not have a component that manipulates the source data as it is being ingested. For example, Data Processing cannot remove invalid characters stored in a Hive table as that table is being ingested. Therefore, you should use Hive or third-party tools to clean your source data before ingest.
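
As an illustration of such cleaning, the following PySpark sketch shows one way to strip control characters from a string column of a Hive table before ingest. This is only a sketch: the table name (src_db.orders), the column name (description), and the character range to remove are assumptions that you would adapt to your own data.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = (SparkSession.builder
           .appName("pre-ingest-clean")
           .enableHiveSupport()
           .getOrCreate())

  # Hypothetical source table and column. Remove ASCII control characters
  # (except tab, newline, and carriage return) before the table is ingested.
  df = spark.table("src_db.orders")
  cleaned = df.withColumn(
      "description",
      F.regexp_replace("description", "[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "")
  )
  cleaned.write.mode("overwrite").saveAsTable("src_db.orders_clean")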

After a data set is created, you can manipulate the contents of the data set by using the Transform functions in Studio.

Removing invalid XML characters

During the ingest operation run by Data Processing, the Dgraph may detect records that contain invalid data. Typically, the invalid data consists of invalid XML characters. To be valid for ingest, a character must conform to production 2 of the XML 1.0 specification.

If an invalid XML character is detected, it is replaced with an escaped version. In the escaped version, the invalid character is represented as its decimal value, preceded by two hash characters (##) and followed by a semicolon (;). For example, a control character whose 32-bit value is decimal 15 would be represented as
##15;

The record with the replaced character would then be ingested.
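
Data Processing performs this replacement automatically, so no action is required on your part. Purely as an illustration of the convention, the following Python sketch applies the same substitution, using the character ranges from production 2 of the XML 1.0 specification:

  import re

  # Match any character outside production 2 (Char) of the XML 1.0 spec.
  INVALID_XML_CHAR = re.compile(
      '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
  )

  def escape_invalid_xml(text):
      # Replace each invalid character with ##<decimal value>;
      return INVALID_XML_CHAR.sub(lambda m: '##%d;' % ord(m.group()), text)

  print(escape_invalid_xml('bad\x0Fvalue'))   # prints: bad##15;value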

Fixing date formats

Ingested date values come from one (or more) Hive table columns:
  • Columns configured as DATE data types.
  • Columns configured as TIMESTAMP data types.
  • Columns configured as STRING data types that contain date values. The date formats supported by this data type discovery method are listed in the dateFormats.txt file. For details on this file, see Date format configuration.

Make sure that dates in STRING columns are well-formed and conform to a format in the dateFormats.txt file, or else they will be ingested as string values, not as Dgraph mdex:dateTime data types.

In addition, make sure that the dates in a STRING column are valid dates. For example, the date Mon, Apr 07, 1925 is invalid because April 7, 1925 is a Tuesday, not a Monday. Therefore, this invalid date would cause the column to be detected as a STRING column, not a DATE column.
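
One way to catch such inconsistencies before ingest is to round-trip each value through its expected format, as in the Python sketch below. The format string is an assumption for illustration; consult dateFormats.txt for the formats your installation actually supports, and note that %a and %b are locale-dependent (an English locale is assumed here).

  from datetime import datetime

  FMT = '%a, %b %d, %Y'   # hypothetical format; see dateFormats.txt

  def is_consistent_date(value, fmt=FMT):
      try:
          parsed = datetime.strptime(value, fmt)
      except ValueError:
          return False   # malformed or impossible date
      # strptime does not cross-check a weekday name against the rest of the
      # date, so render the parsed date back out and compare.
      return parsed.strftime(fmt) == value

  print(is_consistent_date('Tue, Apr 07, 1925'))   # True
  print(is_consistent_date('Mon, Apr 07, 1925'))   # False: it was a Tuesday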

Uploading Excel and CSV files

In Studio, you can create a new data set by uploading data from an Excel or CSV file. Data from these file types is always uploaded as STRING data types.

For this reason, you should make sure that the data in each of the file's columns is of a consistent data type. For example, if a column is supposed to store integers, check that the column does not contain non-integer data. Likewise, check that date input conforms to the formats in the dateFormats.txt file.
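
As an illustration, a quick pre-upload check along the following lines can flag stray non-integer values in a column that is supposed to store integers. The file name (sales.csv) and column name (quantity) are hypothetical:

  import csv

  def find_non_integers(path, column):
      # Report rows whose value in `column` does not parse as an integer.
      with open(path, newline='') as f:
          for line_no, row in enumerate(csv.DictReader(f), start=2):
              value = (row.get(column) or '').strip()
              try:
                  int(value)
              except ValueError:
                  print('line %d: non-integer value %r' % (line_no, value))

  find_non_integers('sales.csv', 'quantity')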

Note that BDD cannot load multimedia or binary files (other than Excel).

Non-splittable input data handling for Hive tables

Hive tables support input data that has been compressed with a non-splittable codec at the individual file level. However, Oracle discourages using a non-splittable input format for Hive tables that will be processed by BDD. Because non-splittable compressed input files have no clear split points, Spark (and Hadoop) cannot honor the input data split size suggested by the DP configuration. Instead, each compressed file is read and treated as a single partition, which results in a large amount of resources being consumed during the workflow.

If you must use non-splittable compression, use block-based compression, in which the data is first divided into smaller blocks and then compressed within each block. More information is available at: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage

In summary, you are encouraged to use splittable compression, such as BZip2. For information on choosing a data compression format, see: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/admin_data_compression_performance.html
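
If you want to observe this partitioning behavior yourself, a small PySpark check such as the following (the HDFS paths are hypothetical) shows the difference: a gzip-compressed file is read as a single partition, while a BZip2-compressed file of comparable size can be split into several.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("split-check").getOrCreate()
  sc = spark.sparkContext

  # Non-splittable codec (gzip): one partition per file, regardless of the
  # split size suggested by the configuration.
  gz = sc.textFile("hdfs:///data/input.txt.gz")
  print(gz.getNumPartitions())   # 1

  # Splittable codec (BZip2): Spark can divide the file into partitions.
  bz = sc.textFile("hdfs:///data/input.txt.bz2")
  print(bz.getNumPartitions())   # typically > 1 for a large file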

Anti-Virus and Malware

Oracle strongly encourages you to use anti-virus products prior to uploading files into Big Data Discovery. The Data Processing component of BDD either finds and loads Hive tables that are already present, or lets you load data from new Hive tables using the DP CLI. In either case, use anti-virus software to ensure the quality of the data being loaded.