7 Add Data to Data Lake

A Data Lake is a massive repository that can include structured, semi-structured, unstructured, and even raw data. It enables you to store and organize large volumes of highly diverse data from a variety of sources.

What is an Add Data to Data Lake Task?

By adding data to a Data Lake, you can store vast amounts of data for deep analytics, big data processing, and machine learning. You can add data from a variety of sources into the Data Lake.

Data can be ingested from a variety of data sources, including relational data sources or flat files. Harvested metadata is stored in the Data Integration Platform Cloud Catalog, and the data is transformed and secured within the target Data Lake for downstream activities.

Data Lake Builder provides the capability to:

  • Add data from various sources into the Data Lake, using a simple, configurable, and flexible workflow.

  • Configure a Data Lake.

  • Configure file format and file options for files added to Data Lake.

  • Configure a connection to sources outside of the Data Lake.

What’s Certified for Add Data to Data Lake?

Review the supported agents, data sources, and limitations before choosing your source and target for the Add Data to Data Lake Task (Data Lake Builder) in Oracle Data Integration Platform Cloud.

Note:

  • All data sources must run on x86_64 (the 64-bit version of x86) operating systems with the latest upgrade.

  • The only target that you can use to build a Data Lake and then add data to it is Oracle Object Storage Classic.

Connection Type               | Data Source Version | OEL | RHEL     | SLES    | Windows    | Source | Target
Oracle Database Cloud Classic | 12.2                | 6.x | no       | no      | no         | yes    | no
Oracle Database Cloud Classic | 12.1                | 6.x | no       | no      | no         | yes    | no
Oracle Database Cloud Classic | 11.2                | 6.x | no       | no      | no         | yes    | no
Oracle Database               | 12.2                | 6.x | 6.x, 7.x | 11 & 12 | 2012, 2016 | yes    | no
Oracle Database               | 12.1                | 6.x | 6.x, 7.x | 11 & 12 | 2012, 2016 | yes    | no
Oracle Database               | 11.2.0.4            | 6.x | 6.x, 7.x | 11 & 12 | 2012       | yes    | no
Flat Files                    | n/a                 | yes | yes      | yes     | yes        | yes    | no
Oracle Object Storage Classic | Latest              | n/a | n/a      | n/a     | n/a        | no     | yes
Autonomous Data Warehouse     | 18.1                | 6.x | 6.x, 7.x | 11, 12  | no         | no     | yes
Amazon S3                     | Latest              | n/a | n/a      | n/a     | n/a        | yes    | no

After you verify your data source, operating systems, and versions, you must set up agents only for data sources that are certified for your tasks.

See Agent Certifications.

What’s Supported for Add Data to Data Lake Task?

Before adding data to the Data Lake, you must consider what source and target data sources are supported for the Add Data to Data Lake Task.

You can use the following source and target data sources for the Add Data to Data Lake Task:

Supported Source Data Sources

  • Oracle Databases

  • Relational Databases

  • Amazon S3

  • Flat files

  • Parquet

  • JSON

  • Delimited (csv, tsv) (see the format example after these lists)

Supported Target Data Sources

  • Parquet

  • Delimited (csv, tsv)

  • Autonomous Data Warehouse (ADWC)
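
For reference, here's a minimal, hypothetical example of the two file-based formats listed above: a delimited file whose fields are wrapped in a Text Qualifier, and a JSON file where each record is a single valid JSON object on its own line (the form supported for Spark Execution, as noted in the limitations below). The file and column names are illustrative only, not part of the product.

  import csv
  import json

  # Hypothetical sample records (illustrative only).
  rows = [
      {"order_id": 101, "customer": "Acme, Inc.", "amount": 250.00},
      {"order_id": 102, "customer": "Globex", "amount": 99.95},
  ]

  # Delimited source: comma delimiter, double-quote Text Qualifier, header row.
  with open("orders.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(
          f,
          fieldnames=["order_id", "customer", "amount"],
          quoting=csv.QUOTE_ALL,  # wrap every field in the Text Qualifier
      )
      writer.writeheader()
      writer.writerows(rows)

  # JSON source: one complete, valid JSON record per line (no multi-line records).
  with open("orders.json", "w", encoding="utf-8") as f:
      for row in rows:
          f.write(json.dumps(row) + "\n")

Parquet files are typically produced by big data tools rather than written by hand, so they're omitted from this sketch.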

Limitations for Add Data to Data Lake Using an Execution Environment

Consider the following limitations when you're using an Execution Environment to add data to a Data Lake.

  • You can't create an external table in Oracle Autonomous Data Warehouse Cloud (ADWC) if any of the following conditions are true:

    • Your target file type is Parquet. ADWC doesn't support Parquet.

    • Your target file name starts with a number or a special character, which doesn't follow ADWC naming conventions.

      See Schema Object Naming Guidelines.

    • Your target file name contains a floating decimal with suffix f, for example, 2.4f.

    • Your varchar column size is greater than 4000, or your decimal column size is greater than a precision of 35 and a scale of 15.

  • If your source or target file name contains a space, it is replaced by an underscore (_).

  • If the source file uses a colon ( : ) as a delimiter, the task fails unless timestamps are enclosed within the Text Qualifier. Use a delimiter other than a colon.
  • If the data within a column is inconsistent (for example, a mix of integers, decimals, and strings), then the number of rows processed will be less than the actual number of rows in your file. This is because Data Integration Platform Cloud reads a subset of rows to determine the column type used to parse and write the data. If some rows have data types different from the rows sampled, they are omitted. Cleanse your data so that the data types are consistent to avoid any loss of data (see the pre-flight sketch after this list).
  • For Spark Execution:

    • Only Amazon S3 is supported as a source endpoint
    • Limited JSON support. Currently, only JSON files where each record is a valid JSON object are supported. Multi-line JSON records and files consisting of a single JSON document are not supported.
    • Output can be in a directory or in a file, based on input format.
    • Using the agent's JAVA_OPTS property, you can configure the following Spark application properties (for example, export JAVA_OPTS="-Dspark.executor.cores=2"):
      • spark.executor.memory
      • spark.executor.cores
      • spark.cores.max
      • spark.driver.cores
      • spark.driver.memory
      • spark.app.name
    • Only Spark on Big Data Cloud (OCI Classic) with basic authentication is supported, not BDC (OCI-C) with IDCS authentication.
    • For Spark on YARN, Spark 2.2+ with HDFS, YARN, and Spark should be preinstalled on the cluster.
    • For Spark on YARN, YARN REST port (default 8088) and HDFS port (default 8020) should be open and accessible from the DIPC Agent.
    • Only Spark on YARN with no authentication is supported.
    • DIPC uploads a jar file containing the Data Lake copy Spark application to a path in HDFS that was specified as part of the Spark Execution Environment configuration. DIPC assumes that you have the proper permissions to copy the jar file to this path in HDFS.
    • DIPC allows you to configure up to eight additional properties when configuring a Spark Execution Environment.
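
Several of these limitations surface only when the task runs, so it can help to check a delimited source file before you create the task. The following sketch is not part of Data Integration Platform Cloud; it's a standalone pre-flight check, with hypothetical file and table names, covering a few of the limitations above: a colon used as the delimiter, spaces in the file name, a target name that doesn't follow ADWC naming conventions, and columns whose values mix data types.

  import csv
  import os
  import re

  def preflight(path, delimiter=",", target_name=None):
      """Report likely problems before adding a delimited file to the Data Lake."""
      problems = []

      # The task fails when a colon is used as the delimiter (unless timestamps
      # are enclosed in the Text Qualifier); prefer a different delimiter.
      if delimiter == ":":
          problems.append("Colon delimiter: use a delimiter other than a colon.")

      # Spaces in source or target file names are replaced by underscores.
      if " " in os.path.basename(path):
          problems.append("File name contains spaces; they will become underscores.")

      # ADWC table names should start with a letter and contain only alphanumeric
      # characters or underscores (see Schema Object Naming Guidelines).
      if target_name and not re.match(r"^[A-Za-z][A-Za-z0-9_]*$", target_name):
          problems.append(f"Target name {target_name!r} may not follow ADWC naming conventions.")

      # Columns that mix data types can cause rows to be dropped, because only a
      # subset of rows is sampled to determine each column's type.
      def kind(value):
          for cast, label in ((int, "integer"), (float, "decimal")):
              try:
                  cast(value)
                  return label
              except (TypeError, ValueError):
                  pass
          return "string"

      with open(path, newline="", encoding="utf-8") as f:
          seen = {}
          for row in csv.DictReader(f, delimiter=delimiter):
              for column, value in row.items():
                  seen.setdefault(column, set()).add(kind(value))

      for column, kinds in seen.items():
          if len(kinds) > 1:
              problems.append(f"Column {column!r} mixes types {sorted(kinds)}; cleanse it before loading.")

      return problems

  # Example: check the hypothetical file written in the format example above.
  for issue in preflight("orders.csv", delimiter=",", target_name="ORDERS_RAW"):
      print("WARNING:", issue)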

Before You Add Data to Data Lake

Before you perform an Add Data to Data Lake Task, you'll need to download and configure your agents and create connections to your source and target data sources.

Make sure that you’ve done the following:

  • Downloaded and configured your agents

  • Created Connections to your source and target data sources

You must also create a Data Lake using the Data Lake Configuration Element. Optionally, you can create an Execution Environment to execute the Add Data to Data Lake job using Spark on Big Data Cloud or YARN.

Create a Data Lake

Before you add data to a Data Lake, you must create one using the Data Lake Configuration Element.

To create a Data Lake:
  1. From the Home Page Getting Started section, click Create in the Data Lake tile or click Create and select Data Lake in the Catalog.
  2. On the Create Data Lake page, complete the General Information fields.
  3. For Connection, select an existing Oracle Object Storage Classic Connection or create a new one.
  4. For Default Data Management Settings,
    1. For Type, select the file format of the target file that you're adding to the Data Lake.
    2. Click Options, and specify the Encoding, Delimiter, Text Qualifier, and Header.
  5. Click Save.
The Data Lake Configuration Element is stored in Data Integration Platform Cloud as a Data Asset. You can now use it as a target when you create an Add Data to Data Lake task.

Set Up an Execution Environment

An Execution Environment enables you to run an Add Data to Data Lake task using Spark Execution on Big Data Cloud or YARN.

To create an Execution Environment:
  1. From the Home page Getting Started section, click Create in the Execution Environment tile or click Create and select Execution Environment in the Catalog.
  2. On the Create Execution Environment Configuration page, complete the General Information fields.
  3. For Environment Settings, depending on your selection of Execution Environment Type in the General Information section, provide the environment connection details.
  4. Click Save.
After you save your Execution Environment configuration, you can find it in the Catalog, or use it when creating an Add Data to Data Lake task.

Add Data to the Data Lake

After your Data Lake is created, you can add data to it from a variety of data sources.

To add data to a data lake:
  1. From the Getting Started section of the Data Integration Platform Cloud Home page, click Create in the Add Data to Data Lake tile or click Create and select Add Data to Data Lake in the Catalog.
  2. On the Add Data to Data Lake page, complete the General Information fields.
  3. To use a Spark Execution Environment, select Spark Settings, and then select an Execution Environment for Spark Environment.

    To learn more about creating Execution Environments, see Set Up an Execution Environment.

  4. For Source Configuration, select your source Connection or create a new one.
    To learn more about creating Connections, see Create a Connection.
    1. If you select a relational database as your source Connection, click the drop-down and choose a Schema, and then select or enter the Data Entity (table) name.
    2. If you select a File Type Connection, the Directory populates with the default location specified when the Connection was created. For File, click Select to choose the file. Click Options to select Delimited, Parquet, or JSON.
  5. In the Target Configuration section,
    1. Select a Data Lake to copy your data to and enter a name for your new Data Entity.
      To learn more about creating a Data Lake, see Create a Data Lake.
    2. For File Path, enter a Directory/File path for your target file in the Data Lake.
    3. Select Use Default Data Lake Settings if you want to use the configuration settings set when you created the Data Lake, or deselect this option to override the default settings.
    4. Select Update ADWC Table if you want to query the data in the Data Lake using Autonomous Data Warehouse.
  6. Click Save to save the task, or click Save & Execute to save and start the task.