Data Migration Options

HDFS Migration

You can migrate data from an external HDFS to Oracle Cloud Infrastructure in a few different ways.

The primary consideration is how much data needs to move and whether it is practical to move it "over the wire" given the time and resources required. If sufficient bandwidth and source cluster resources are available, two options are relevant:

DistCp to Object Storage
DistCp to HDFS

To copy to Object Storage, only the source cluster needs internet connectivity, along with either the HDFS Connector (Apache Hadoop) or an S3 Compatibility setup (Cloudera and Hortonworks). If you use S3 Compatibility, data can be copied only into the home region for the tenancy.

After the prerequisites are in place, you transfer data by running DistCp with an HDFS path as the source and an Object Storage bucket as the destination. The following syntax demonstrates a copy into Object Storage in the US East (Ashburn) region (replace the placeholders with your own values):

hadoop distcp -Dfs.s3a.secret.key='<SECRET_KEY>' \
-Dfs.s3a.access.key='<ACCESS_KEY>' \
-Dfs.s3a.path.style.access=true \
-Dfs.s3a.paging.maximum=1000 \
-Dfs.s3a.endpoint='https://<TENANCY>.compat.objectstorage.us-ashburn-1.oraclecloud.com' \
/hdfs_target s3a://<BUCKET_NAME>/

Conversely, the HDFS path and the S3 target can be swapped to copy data from Object Storage into HDFS. This method works for Cloudera, Hortonworks, and Apache Hadoop.
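
As a rough sketch of the reverse direction, the same options can be kept while the bucket becomes the source and an HDFS path becomes the destination. The function name reverse_distcp_cmd and the variable names are illustrative assumptions, not part of any Oracle or Hadoop tooling. The function only prints the command so that it can be reviewed before it is run.

```shell
# Sketch only: print a DistCp command that copies from an Object Storage
# bucket into HDFS. ACCESS_KEY, SECRET_KEY, and TENANCY are placeholder
# variables (assumptions); set them to real values before use.
ACCESS_KEY="${ACCESS_KEY:-<ACCESS_KEY>}"
SECRET_KEY="${SECRET_KEY:-<SECRET_KEY>}"
TENANCY="${TENANCY:-<TENANCY>}"

reverse_distcp_cmd() {
  # $1 = bucket name, $2 = destination path in HDFS
  bucket="$1"
  hdfs_dest="$2"
  endpoint="https://${TENANCY}.compat.objectstorage.us-ashburn-1.oraclecloud.com"
  printf "hadoop distcp -Dfs.s3a.secret.key='%s'" "$SECRET_KEY"
  printf " -Dfs.s3a.access.key='%s'" "$ACCESS_KEY"
  printf " -Dfs.s3a.path.style.access=true -Dfs.s3a.paging.maximum=1000"
  printf " -Dfs.s3a.endpoint='%s'" "$endpoint"
  # The bucket is now the source and the HDFS path is the destination.
  printf ' s3a://%s/ %s\n' "$bucket" "$hdfs_dest"
}

# Print the command for review; run it manually once it looks right.
reverse_distcp_cmd '<BUCKET_NAME>' /hdfs_target
```

Printing rather than executing keeps the sketch safe to run on a machine without Hadoop installed, and avoids starting a large copy with placeholder values.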

The second option is to establish a Hadoop cluster in Oracle Cloud Infrastructure, ensure that the source cluster and the Oracle Cloud Infrastructure cluster have connectivity, and run DistCp between the clusters. This approach also works for Apache Hadoop, Cloudera, and Hortonworks.
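
A cluster-to-cluster copy of this kind might look like the following sketch. The host placeholders, the function name cluster_distcp_cmd, and port 8020 (a commonly used NameNode RPC port; check the fs.defaultFS setting of each cluster) are assumptions. The -update option copies only files that are missing or different at the destination, and -m caps the number of simultaneous map tasks; the value 20 is only illustrative.

```shell
# Sketch only: print a DistCp command for an HDFS-to-HDFS copy between
# a source cluster and a cluster in Oracle Cloud Infrastructure.
cluster_distcp_cmd() {
  # $1 = source NameNode host, $2 = destination NameNode host,
  # $3 = path to copy, $4 = maximum number of map tasks (optional)
  src_nn="$1"
  dst_nn="$2"
  copy_path="$3"
  maps="${4:-20}"
  # Port 8020 is an assumption; use the port from each cluster's fs.defaultFS.
  printf 'hadoop distcp -update -m %s hdfs://%s:8020%s hdfs://%s:8020%s\n' \
    "$maps" "$src_nn" "$copy_path" "$dst_nn" "$copy_path"
}

# Print the command for review; it is not executed here.
cluster_distcp_cmd '<SOURCE_NAMENODE>' '<OCI_NAMENODE>' /data
```

Lowering the map count is one way to limit the load that the copy places on a busy source cluster.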

For MapR clusters, you migrate data by setting up volume remote mirroring between clusters.

Data Transfer Appliance

The Oracle Data Transfer Appliance is another option for data transfer when moving data over the wire is not feasible.

Bandwidth or resource constraints might exist on the source cluster, or proximity to an Oracle Cloud Infrastructure region might limit FastConnect availability. The data set could also be so large that it would take too long to copy. In these cases, Oracle can send you a Data Transfer Appliance that you can deploy in your data center and use as a DistCp target for HDFS data.
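
To judge whether a copy would take too long, a back-of-the-envelope estimate can help. The following sketch assumes decimal terabytes, a fully sustained link, and no protocol or cluster overhead, so real transfers will be slower. The function name estimate_transfer_hours is hypothetical.

```shell
# Sketch only: estimate whole hours needed to move a data set over the wire.
estimate_transfer_hours() {
  # $1 = data size in decimal terabytes, $2 = sustained bandwidth in Gbit/s
  size_tb="$1"
  bandwidth_gbps="$2"
  # 1 TB = 8000 Gbit, so seconds = TB * 8000 / (Gbit/s).
  seconds=$(( size_tb * 8000 / bandwidth_gbps ))
  # Integer division truncates to whole hours.
  echo $(( seconds / 3600 ))
}

estimate_transfer_hours 100 1   # prints 222 (a little over nine days)
```

If the estimate runs into weeks even at the best bandwidth available, that points toward the appliance rather than a network copy.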

Cluster Metadata Migration

The approach for migrating cluster metadata to Oracle Cloud Infrastructure varies between Cloudera, Hortonworks, MapR, and Apache.

Cloudera

For Cloudera clusters, three types of databases are supported for cluster metadata: PostgreSQL, MySQL, and Oracle.

Steps to back up Cloudera Manager Databases are included in Cloudera Enterprise documentation. You can then import this data to a cluster running Cloudera on Oracle Cloud Infrastructure.

Hortonworks

For Hortonworks, the same databases are supported as for Cloudera. For Ambari, you can export a blueprint from the existing cluster and use it to configure the Oracle Cloud Infrastructure Hortonworks deployment.
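
As an illustration, Ambari exposes a cluster's blueprint through its REST API by adding format=blueprint to the cluster resource URL. The sketch below only builds that URL and prints a curl command for review. The host, user, and cluster placeholders, the function name ambari_blueprint_url, the output file name blueprint.json, and port 8080 (a common Ambari HTTP port) are assumptions to adjust for your environment.

```shell
# Sketch only: build the Ambari REST URL that returns a cluster blueprint.
ambari_blueprint_url() {
  # $1 = Ambari server host, $2 = cluster name
  # Port 8080 and plain HTTP are assumptions; adjust for your server.
  printf 'http://%s:8080/api/v1/clusters/%s?format=blueprint\n' "$1" "$2"
}

# Print the curl command for review; it is not executed here.
# With only a user name given to -u, curl prompts for the password.
printf "curl -u '<AMBARI_USER>' -H 'X-Requested-By: ambari' -o blueprint.json '%s'\n" \
  "$(ambari_blueprint_url '<AMBARI_HOST>' '<CLUSTER_NAME>')"
```

The saved JSON can then be reviewed and adapted, for example to change host group sizes, before it is used for the Oracle Cloud Infrastructure deployment.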

MapR

Follow the steps in the MapR Best Practices for Backing Up MapR documentation. You can then import this data into an Oracle Cloud Infrastructure MapR cluster.

Apache

For Apache Hadoop, the same databases are supported as for Cloudera and Hortonworks, using the same procedures as for Ambari, Hive, and HBase.