7 Add Data to Data Lake
A Data Lake is a massive repository that can include structured, semi-structured, unstructured, and even raw data. It enables you to store and organize large volumes of highly diverse data from a variety of sources.
Topics:
This chapter includes the following topics:
- What is an Add Data to Data Lake Task?
- What's Certified for Add Data to Data Lake?
- What's Supported for Add Data to Data Lake Task?
- Limitations for Add Data to Data Lake Using an Execution Environment
- Before You Add Data to Data Lake
What is an Add Data to Data Lake Task?
By adding data to Data Lake, you can store vast amounts of data, enabling deep analytics, big data processing, and machine learning. You can add data from a variety of sources into the Data Lake.
Data can be ingested from a variety of data sources, including relational data sources and flat files. Harvested metadata is stored in the Data Integration Platform Cloud Catalog, and the data is transformed and secured within the target Data Lake for downstream activities.
Data Lake Builder provides the capability to:
- Add data from various sources into the Data Lake, using a simple, configurable, and flexible workflow.
- Configure a Data Lake.
- Configure the file format and file options for files added to the Data Lake.
- Configure a connection to sources outside of the Data Lake.
What’s Certified for Add Data to Data Lake?
Review the supported agents, data sources, and limitations before choosing your source and target for the Add Data to Data Lake Task (Data Lake Builder) in Oracle Data Integration Platform Cloud.
Note:
- All data sources must run on x86_64 (the 64-bit version of x86) operating systems with the latest updates.
- The only target that you can use to build a Data Lake and then add data to it is Oracle Object Storage Classic.
Connection Type | Data Source Version | OEL | RHEL | SLES | Windows | Source | Target
---|---|---|---|---|---|---|---
Oracle Database Cloud Classic | 12.2 | 6.x | no | no | no | yes | no
Oracle Database Cloud Classic | 12.1 | 6.x | no | no | no | yes | no
Oracle Database Cloud Classic | 11.2 | 6.x | no | no | no | yes | no
Oracle Database | 12.2 | 6.x | 6.x, 7.x | 11, 12 | 2012, 2016 | yes | no
Oracle Database | 12.1 | 6.x | 6.x, 7.x | 11, 12 | 2012, 2016 | yes | no
Oracle Database | 11.2.0.4 | 6.x | 6.x, 7.x | 11, 12 | 2012 | yes | no
Flat Files | n/a | yes | yes | yes | yes | yes | no
Oracle Object Storage Classic | Latest | n/a | n/a | n/a | n/a | no | yes
Autonomous Data Warehouse | 18.1 | 6.x | 6.x, 7.x | 11, 12 | no | no | yes
Amazon S3 | Latest | n/a | n/a | n/a | n/a | yes | no
After you verify your data sources, operating systems, and versions, you need only set up agents for the data sources that are certified for your tasks.
See Agent Certifications.
What’s Supported for Add Data to Data Lake Task?
Before adding data to the Data Lake, you must consider what source and target data sources are supported for the Add Data to Data Lake Task.
You can use the following source and target data sources for the Add Data to Data Lake Task:
Supported Source Data Sources
- Oracle Databases
- Relational Databases
- Amazon S3
- Flat files:
  - Parquet
  - JSON
  - Delimited (CSV, TSV)
Supported Target Data Sources
- Parquet
- Delimited (CSV, TSV)
- Autonomous Data Warehouse (ADWC)
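As a quick illustration of two of the supported flat-file source formats, the sketch below (sample data and field names are hypothetical) parses a delimited file with a header row and a JSON file where each record is a valid JSON document on its own line:

```python
import csv
import io
import json

# A delimited (CSV) source with a header row:
csv_text = "id,name,amount\n1,Alice,10.5\n2,Bob,20.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# A JSON source with one valid JSON record per line:
json_lines = '{"id": 1, "name": "Alice"}\n{"id": 2, "name": "Bob"}\n'
records = [json.loads(line) for line in json_lines.splitlines()]

print(rows[1]["name"], records[0]["id"])  # Bob 1
```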
Limitations for Add Data to Data Lake Using an Execution Environment
Consider the following limitations when you're using an Execution Environment to add data to a Data Lake.
- You can't create an external table in Oracle Autonomous Data Warehouse Cloud (ADWC) if any of the following conditions are true:
  - Your target file type is Parquet. ADWC doesn't support Parquet.
  - Your target file name starts with a number or a special character, as such names don't follow ADWC naming conventions.
  - Your target file name contains a floating decimal with the suffix f, for example, 2.4f.
  - Your varchar column size is greater than 4000 and your decimal column size is greater than a precision of 35 and a scale of 15.
- If your source or target file name contains a space, it is replaced by an underscore (_).
- If the source file uses a colon (:) as a delimiter, the task fails unless the colon appears within a timestamp inside the Text Qualifier. Use a delimiter other than a colon.
- If the data within a column is inconsistent (for example, a mix of integers, decimals, and strings), then the number of rows processed will be less than the actual number of rows in your file. This is because Data Integration Platform Cloud reads a subset of rows to determine the column type used to parse and write the data. Rows with data types different from the sampled rows are omitted. Cleanse your data so that the data types are consistent to avoid any loss of data.
- For Spark Execution:
  - Only Amazon S3 is supported as a source endpoint.
  - JSON support is limited. Currently, only JSON files where each record is a valid JSON document are supported. Multi-line JSON records, or files that are a single JSON document, are not supported.
  - Output can be in a directory or in a file, based on the input format.
  - Using the agent's JAVA_OPTS property, you can configure the following Spark application properties (for example, export JAVA_OPTS="-Dspark.executor.cores=2"):
    - spark.executor.memory
    - spark.executor.cores
    - spark.cores.max
    - spark.driver.cores
    - spark.driver.memory
    - spark.app.name
  - Only Spark on Big Data Cloud (OCI Classic) with basic authentication is supported, not BDC (OCI-C) with IDCS authentication.
  - For Spark on YARN, Spark 2.2+ with HDFS, YARN, and Spark should be preinstalled on the cluster.
  - For Spark on YARN, the YARN REST port (default 8088) and HDFS port (default 8020) should be open and accessible from the DIPC Agent.
  - Only Spark on YARN with no authentication is supported.
  - DIPC uploads a JAR file containing the Data Lake copy Spark application to the HDFS path specified as part of the Spark Execution Environment configuration. DIPC assumes that you have the proper permissions to copy the JAR file to this path in HDFS.
  - DIPC allows you to configure up to eight additional properties when configuring a Spark Execution Environment.
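As a sketch, the Spark application properties listed above could be set through the agent's JAVA_OPTS before starting the agent. The values below are illustrative placeholders, not tuning recommendations:

```shell
# Illustrative values only; size these for your cluster. Each Spark
# property is passed as a -D system property in the agent's JAVA_OPTS.
export JAVA_OPTS="-Dspark.executor.memory=4g \
-Dspark.executor.cores=2 \
-Dspark.cores.max=8 \
-Dspark.driver.cores=1 \
-Dspark.driver.memory=2g \
-Dspark.app.name=datalake-copy"
echo "$JAVA_OPTS"
```

Restart the DIPC Agent after changing JAVA_OPTS so the new properties take effect.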
Before You Add Data to Data Lake
Before you perform an Add Data to Data Lake Task, you'll need to download and configure your agents and create connections to your source and target data sources.
You must also create a Data Lake using the Data Lake Configuration Element. Optionally, you can create an Execution Environment to run the Add Data to Data Lake job using Spark on Big Data Cloud or YARN.
Create a Data Lake
Before you add data to a Data Lake, you must create one using the Data Lake Configuration Element.
- From the Home Page Getting Started section, click Create in the Data Lake tile or click Create and select Data Lake in the Catalog.
- On the Create Data Lake page, complete the General Information fields.
- For Connection, select an existing Oracle Object Storage Classic Connection or create a new one.
- For Default Data Management Settings:
  - For Type, select the file format of the target file that you're adding to the Data Lake.
  - Click Options, and specify the Encoding, Delimiter, Text Qualifier, and Header.
- Click Save.
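To illustrate how the Delimiter, Text Qualifier, and Header options interact, here is a minimal sketch (the sample data is hypothetical): a comma-delimited file with a header row, where the double-quote text qualifier protects a delimiter character embedded in a value:

```python
import csv
import io

# Header row names the columns; the text qualifier (double quote) keeps the
# comma inside "Austin, TX" from being treated as a field delimiter.
raw = 'id,city\n1,"Austin, TX"\n2,Oslo\n'
rows = list(csv.DictReader(io.StringIO(raw), delimiter=",", quotechar='"'))
print(rows[0]["city"])  # Austin, TX
```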
Set Up an Execution Environment
An Execution Environment enables you to run an Add Data to Data Lake task using Spark Execution on Big Data Cloud or YARN.
- From the Home page Getting Started section, click Create in the Execution Environment tile or click Create and select Execution Environment in the Catalog.
- On the Create Execution Environment Configuration page, complete the General Information fields.
- For Environment Settings, depending on your selection of Execution Environment Type in the General Information section, provide the environment connection details.
- Click Save.