Data Migration Options

Oracle provides several options for migrating HDFS data, for bulk data migration using the Oracle Data Transfer Appliance, and for migrating cluster metadata.

Data Migration Guidelines

After you decide what data needs to move and how it will be structured in Oracle Cloud Infrastructure, determine the method to use to move the data from its current location to Oracle Cloud Infrastructure. A critical component of this process is the connection to Oracle Cloud Infrastructure, because the achievable throughput depends on the bandwidth of that connection.

Oracle Cloud Infrastructure supports many levels of connectivity. Connections can range anywhere from 10 Mbps to 10 Gbps. Taking into account the size of the data set and the connection throughput, the migration of the data might be as simple as a direct copy, or you might need specialized appliances (such as the Data Transfer service) to move the data.

The following table presents a reasonable expectation of how long it will take to move the data to Oracle Cloud Infrastructure, based on the connection bandwidth and the size of the data set.

Approximate Data Upload Time

Data Set Size   10 Mbps       100 Mbps     1 Gbps     10 Gbps    Data Transfer Service
10 TB           92 days       9 days       22 hours   2 hours    1 week
100 TB          1,018 days    101 days     10 days    24 hours   1 week
500 TB          5,092 days    509 days     50 days    5 days     1 week
1 PB            10,185 days   1,018 days   101 days   10 days    2 weeks
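
Estimates like these follow from dividing the data set size in bits by the link rate. As a rough sketch, the hypothetical helper below (the name estimate_days is not part of any Oracle tool) assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link with no protocol overhead, so real transfers will usually take longer:

```shell
# Hypothetical helper: estimate the ideal transfer time in days.
# Usage: estimate_days <size_in_TB> <link_rate_in_Mbps>
# Assumes decimal units and 100% link utilization (no protocol overhead).
estimate_days() {
  awk -v tb="$1" -v mbps="$2" 'BEGIN {
    bits = tb * 1e12 * 8            # data set size in bits
    seconds = bits / (mbps * 1e6)   # ideal transfer time in seconds
    printf "%.1f\n", seconds / 86400
  }'
}
```

For example, `estimate_days 10 10` prints 92.6, in line with the 92 days listed for 10 TB at 10 Mbps. Results for other rows can differ somewhat from the table depending on the unit convention (decimal versus binary terabytes) and on how much overhead is assumed.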

Data Transfer Service

Oracle offers offline data transfer solutions that let you migrate data to Oracle Cloud Infrastructure. You can also export data currently residing in Oracle Cloud Infrastructure to your data center offline. Moving data over the public internet is not always feasible because of high network costs, unreliable network connectivity, long transfer times, and security concerns. Our transfer solutions address these pain points, are easy to use, and provide faster data upload compared to over-the-wire data transfer.
  • Disk-based data transfer - You send your data as files on encrypted commodity disks to an Oracle transfer site. Operators at the Oracle transfer site upload the files into your designated Object Storage or Archive Storage bucket in your tenancy.
  • Appliance-based data transfer - You send your data as files on secure, high-capacity, Oracle-supplied storage appliances to an Oracle transfer site. Operators at the Oracle transfer site upload the data into your designated Object Storage or Archive Storage bucket in your tenancy.

HDFS Migration

You can migrate data from an external HDFS to Oracle Cloud Infrastructure in a few different ways.

The primary consideration is how much data needs to move, and whether it's practical to move that data "over the wire" given the time and resources required. If sufficient bandwidth and source cluster resources are available to support an over-the-wire copy, two options are relevant:

  • DistCp to Object Storage
  • DistCp to HDFS

To copy into Object Storage, only the source cluster needs internet connectivity, along with either the HDFS Connector (for Apache Hadoop) or an S3 Compatibility setup (for Cloudera and Hortonworks). If you use S3 Compatibility, data can be copied only into the home region for the tenancy.

After the prerequisites are in place, you transfer data by running DistCp from a source HDFS path (/hdfs_target in the example) into an Object Storage bucket. The following syntax demonstrates a copy into Object Storage in the US East (Ashburn) region (replace the placeholders with your own values):

hadoop distcp -Dfs.s3a.secret.key='<SECRET_KEY>' \
-Dfs.s3a.access.key='<ACCESS_KEY>' \
-Dfs.s3a.path.style.access=true \
-Dfs.s3a.paging.maximum=1000 \
-Dfs.s3a.endpoint='https://<object_storage_namespace>.compat.objectstorage.us-ashburn-1.oraclecloud.com' \
/hdfs_target s3a://<BUCKET_NAME>/

Conversely, you can swap the HDFS path and the s3a:// URI in the command to copy data from Object Storage into HDFS. This method works for Cloudera, Hortonworks, and Apache Hadoop.
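
As a sketch of the reverse direction, the same options can be reused with the source and destination arguments swapped. The function name copy_bucket_to_hdfs, the DRY_RUN switch, and the environment variable names below are illustrative assumptions; the -D properties and the endpoint format are the ones shown in the preceding command. By default the function only prints the command so that it can be reviewed before anything is copied:

```shell
# Hypothetical wrapper: copy an Object Storage bucket into an HDFS directory.
# Expects ACCESS_KEY, SECRET_KEY, and NAMESPACE to be set by the caller.
# Usage: copy_bucket_to_hdfs <bucket_name> <hdfs_destination_dir>
copy_bucket_to_hdfs() {
  bucket=$1
  hdfs_dir=$2
  endpoint="https://${NAMESPACE}.compat.objectstorage.us-ashburn-1.oraclecloud.com"

  # Same properties as the HDFS-to-bucket copy; only source and destination swap.
  set -- hadoop distcp \
    "-Dfs.s3a.secret.key=${SECRET_KEY}" \
    "-Dfs.s3a.access.key=${ACCESS_KEY}" \
    -Dfs.s3a.path.style.access=true \
    -Dfs.s3a.paging.maximum=1000 \
    "-Dfs.s3a.endpoint=${endpoint}" \
    "s3a://${bucket}/" "${hdfs_dir}"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"   # print the command for review (output includes the keys)
  else
    "$@"                 # run DistCp
  fi
}
```

Calling the function with DRY_RUN=0 runs the copy; the bucket name and destination directory are whatever you pass as arguments.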

The second option is to establish a Hadoop cluster in Oracle Cloud Infrastructure, ensure that the source cluster and the Oracle Cloud Infrastructure cluster have connectivity, and run DistCp between the clusters. This approach also works for Apache Hadoop, Cloudera, and Hortonworks.
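
A cluster-to-cluster copy of this kind might look like the following sketch. The function name copy_between_clusters, the DRY_RUN switch, and the NameNode addresses in the usage comment are hypothetical; substitute the addresses and paths of your own clusters. As written, the function prints the command by default rather than running it:

```shell
# Hypothetical wrapper: copy one HDFS path between two clusters with DistCp.
# Usage: copy_between_clusters <source_namenode> <target_namenode> <hdfs_path>
# NameNode addresses are host:port pairs, for example source-nn.example.com:8020.
copy_between_clusters() {
  src="hdfs://$1$3"
  dst="hdfs://$2$3"

  # -update copies only files that are missing or differ at the destination,
  # so an interrupted run can be repeated.
  set -- hadoop distcp -update "$src" "$dst"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"   # print the command for review
  else
    "$@"                 # run DistCp
  fi
}
```

The command is typically launched from a node that can reach both clusters; whether to run it on the source or the target side depends on where spare compute capacity is available.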

For MapR clusters, you migrate data by setting up volume remote mirroring between clusters.

Data Transfer Appliance

The Oracle Data Transfer Appliance is another option for data transfer when moving data over the wire is not feasible.

Bandwidth or resource constraints might exist on the source cluster, or proximity to an Oracle Cloud Infrastructure region might limit FastConnect availability. The data set could also be so large that it would take too long to copy. In these cases, Oracle can send you a Data Transfer Appliance that you can deploy in your data center and use as a DistCp target for HDFS data.

Cluster Metadata Migration

The approach for migrating cluster metadata to Oracle Cloud Infrastructure varies among Cloudera, Hortonworks, MapR, and Apache Hadoop.

Cloudera

For Cloudera clusters, three types of databases are supported for cluster metadata: PostgreSQL, MySQL, and Oracle.

The Cloudera Enterprise documentation describes how to back up the Cloudera Manager databases. You can then import this data to a cluster running Cloudera on Oracle Cloud Infrastructure.

Hortonworks

For Hortonworks, the same databases are supported as for Cloudera. For Ambari, you can export a blueprint from the existing cluster and use it to configure the Oracle Cloud Infrastructure Hortonworks deployment.
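
One way to export a blueprint is through the Ambari REST API, which returns a cluster's blueprint when the cluster resource is requested with format=blueprint. The helper names, host, and cluster name below are hypothetical, and the sketch assumes Ambari is reachable over plain HTTP; adjust the scheme and port for your installation:

```shell
# Hypothetical helper: build the Ambari REST URL that exports a cluster blueprint.
# Usage: blueprint_url <ambari_host:port> <cluster_name>
blueprint_url() {
  printf 'http://%s/api/v1/clusters/%s?format=blueprint\n' "$1" "$2"
}

# Hypothetical helper: download the blueprint to a JSON file.
# Usage: export_blueprint <ambari_host:port> <cluster_name> <ambari_user> <output_file>
export_blueprint() {
  # Only the user name is passed to -u, so curl prompts for the password.
  curl -u "$3" -H 'X-Requested-By: ambari' -o "$4" "$(blueprint_url "$1" "$2")"
}
```

For example, `export_blueprint ambari.example.com:8080 prod admin blueprint.json` would save the blueprint of a cluster named prod. Review the exported file before reuse, because host-specific settings may need to change for the Oracle Cloud Infrastructure deployment.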

MapR

Follow the steps in the MapR Best Practices for Backing Up MapR documentation. You can then import this data into an Oracle Cloud Infrastructure MapR cluster.

Apache

For Apache Hadoop, the same databases are supported as for Cloudera and Hortonworks, using the same procedures as for Ambari, Hive, and HBase.