Migrate from Big Data Appliance (BDA) or Big Data Cloud Service (BDCS)

Find out how to migrate from Oracle Big Data Appliance or Big Data Cloud Service to Big Data Service.

At a high level, you do the following:
  • Back up your Big Data Appliance or Big Data Cloud Service data and metadata to Oracle Object Storage.
  • Create a Big Data Service cluster on Oracle Cloud Infrastructure.
  • Restore your Big Data Appliance or Big Data Cloud Service data from Oracle Object Storage to Big Data Service.
Note

We recommend that, even after migrating to OCI, you keep your Big Data Appliance or Big Data Cloud Service clusters (in a stopped state) for at least three months as a backup.

Migrating Resources Using WANdisco LiveData Migrator

Ensure that port 8020 is open at the destination.
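For example, you can verify from the source cluster that the port is reachable before you start the migration (a minimal check, assuming nc is available; the host name is a placeholder):

    # Confirm that the destination NameNode port is reachable from the source cluster.
    nc -vz <target-namenode-host> 8020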

For more information, see the WANdisco LiveData Migrator documentation.

To migrate resources using WANdisco LiveData Migrator, follow these steps:

  1. Install LiveData Migrator on any edge node of the source cluster by running the following commands:
    wget https://wandisco.com/downloads/livedata-migrator.sh
     
    chmod +x livedata-migrator.sh && ./livedata-migrator.sh
     
    service livedata-migrator status
    service hivemigrator status
    service livedata-ui status
  2. After the installation and setup of LiveData Migrator are complete, access the UI and create your user account. The URL of the UI is as follows:
    http://<LDM-Installation-Host.com>:8081
  3. Do the following to migrate data:
    1. Configure source filesystem.
      To add a source filesystem, on your LiveData Migrator dashboard, do the following:
      1. From the Products panel, select the relevant instance.
      2. In the Filesystem Configuration page, click Add source filesystem.
    2. Configure target filesystem.
      To add a target filesystem, on your LiveData Migrator dashboard, do the following:
      1. From the Products panel, select the relevant instance.
      2. In the Filesystem Configuration page, click Add target filesystem.
      3. Select Apache Hadoop as the target for the BDS cluster (CDH or ODH) and provide the default filesystem path. Make sure that the source and target connect to the destination on port 8020.
    3. Create a path mapping.
      Path mapping enables migrated data to be stored at an equivalent default location on the target. To create path mappings using the UI, follow these steps:
      1. From the Products list on the dashboard, select the LiveData Migrator instance for which you want to create a path mapping.
      2. From the Migrations menu, select Path Mappings.
      3. At the top right of the interface, click the Add New Path button.
    4. Create a migration.
      Migrations transfer existing data from the defined source to a target. To create a new migration from the UI, follow these steps:
      1. Provide a name for the migration.
      2. From your filesystems, select a source and target.
      3. Select the Path on your source filesystem that you want to migrate. Use the folder browser and select the path name you want to migrate. Select the grey folder next to a path name to view its subdirectories.
  4. Migrate the metadata.
    To migrate the metadata, follow these steps:
    1. Connect metastores.
      Hive Migrator, which comes bundled with LiveData Migrator, lets you transfer metadata from a source metastore to target metastores. Connect to metastores by creating local or remote metadata agents.
    2. Create a metadata migration.
      To create a metadata migration, follow these steps:
      1. On the dashboard, select Add a Hive migration.
      2. Provide a name for the migration.
      3. Select the source and target agents.
      4. Create a database pattern and a table pattern based on Hive DDL that matches the databases and tables you want to migrate.

        For example, using test* for the database pattern matches any database name that starts with test, such as test01, test02, test03.

      5. Click Create.

Migrating Resources Using BDR

Back up BDA Data to Oracle Object Storage

Before you back up your Oracle Big Data Appliance cluster, ensure the following:

  • You have administrator access to your Big Data Appliance cluster.

    • You need the administrator credentials for Cloudera Manager.

    • You need a Hadoop administrator user with full access to the HDFS data and Hive metadata that's getting backed up to Oracle Object Storage.

  • Set up the Oracle Cloud Infrastructure object store to which HDFS data is getting copied. For more information, see Overview of Object Storage.

  • Set up your Oracle Cloud Infrastructure tenancy with the following details:

    • The administrator has created a user in Oracle Cloud Infrastructure and has added the user to the required groups.

    • The user has permission and can access the Oracle Cloud Infrastructure console.

    • The user has permission and can create a bucket. For more information, see Let Object Storage admins manage buckets and objects in Common Policies.

    • The user can inspect the configuration of the Oracle Cloud Infrastructure object store.

To back up a BDA cluster, follow these steps:

  1. Create a bucket in Oracle Object Storage.

    Buckets are used to store and organize data. The bucket that you create is the container for the data that's backed up to Oracle Object Storage. For more information about managing Oracle Object Storage, see Overview of Object Storage.

    1. Log on to the OCI Console.
    2. Open the navigation menu and click Storage. Under Object Storage, click Buckets.
    3. Click Create Bucket.
    4. Create a bucket with its name in uppercase, for example, BDA-MIGRATION. The Hadoop S3 connector works only if the bucket name is in uppercase. Create the bucket in your home region, even if the cluster is in a different region. For information about buckets, see Putting Data into Object Storage.
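      Alternatively, the bucket can be created with the OCI CLI, as in the following sketch (it assumes the CLI is installed and configured; the compartment OCID is a placeholder):

      # Create the migration bucket in your home region; the name must be uppercase.
      oci os bucket create \
          --compartment-id <compartment-ocid> \
          --name BDA-MIGRATION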
  2. Create a customer secret key.
    A secret key is required to make a secure connection from your Hadoop cluster to Oracle Object Storage. The key is created for a user with privileges to perform updates to the target bucket.
    1. Log on to the OCI Console.
    2. Open the navigation menu and click Identity & Security. Under Identity, click Users.
    3. Locate the user in the list, and then click the user's name to view the details.
    4. On the left side of the page, click Customer Secret Keys.
    5. Click Generate Secret Key.
    6. Enter a friendly description for the key (such as mysecretkey) and click Generate Secret Key.
    7. Copy the secret key immediately, because for security reasons you can't retrieve it after you close the dialog box.
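      The key can also be generated with the OCI CLI, as in the following sketch (the user OCID is a placeholder); save the key value from the response immediately:

      # Generate a customer secret key for the migration user.
      oci iam customer-secret-key create \
          --user-id <user-ocid> \
          --display-name mysecretkey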
  3. Create an external account in Cloudera Manager for backing up.

    Use the access key and secret key to create an external account in Cloudera Manager. Ensure that you use the keys for an Oracle Cloud Infrastructure user that has access to the target bucket in Oracle Object Storage.

    1. Log on to Cloudera Manager on the source cluster.
    2. Go to Administration and click External Accounts.
    3. On the AWS Credentials tab, click Add Access Key Credentials and specify the following:
      • Name: Specify a name for the credentials. For example, oracle-credential.

      • AWS Access Key ID: Specify a name for the access key. For example, myaccesskey.

      • AWS Secret Key: Enter the secret key value that you noted down when you created a customer secret key.

    4. Click Add. The Edit S3Guard page appears. Do not select Enable S3Guard.
    5. Click Save.
  4. Update core-site.xml with the Object Storage credentials.

    Here is a sample of the core-site.xml properties that must be added to connect to Object Storage while backing up the data. After updating core-site.xml on both the source and destination clusters, save the changes and restart HDFS. Ensure that the staging directory /tmp/s3-temp exists on the local cluster nodes of the source and destination (a sketch for creating it follows the sample).

      <property>
        <name>fs.s3a.secret.key</name>
        <value><SECRET-KEY></value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value><ACCESS-KEY></value>
      </property>
      <property>
        <name>fs.s3a.bucket.<BUCKET-NAME>.endpoint</name>
        <value>https://<TENANCY>.compat.objectstorage.us-ashburn-1.oraclecloud.com</value>
      </property>
      <property>
        <name>fs.s3a.acl.default</name>
        <value>Private</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.paging.maximum</name>
        <value>1000</value>
      </property>
      <property>
        <name>fs.s3a.buffer.dir</name>
        <value>/tmp/s3-temp</value>
      </property>
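    A minimal sketch for creating the staging directory referenced by fs.s3a.buffer.dir (run it on every node of the source and destination clusters; adjust the permissions to your security policy):

    # Create the local staging directory used by the S3A connector.
    sudo mkdir -p /tmp/s3-temp
    sudo chmod 1777 /tmp/s3-temp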
  5. Create a Hive replication schedule for backing up.

    To back up Hive data and metadata from your Hadoop cluster to Oracle Object Storage, you must create a Hive replication schedule in Cloudera Manager.

    Before running the replication, we recommend that you enable snapshots for the /user/hive/warehouse directory. Enabling snapshots ensures that any changes made to files during the replication process do not cause replication failures. For more information, see Using Snapshots with Replication in the Cloudera documentation.

    To create a Hive replication schedule, follow these steps:

    1. Log on to Cloudera Manager on the source cluster.
    2. Enable snapshots for the /user/hive/warehouse directory.
      1. From the list of services on the left, click HDFS.
      2. Click the File Browser tab towards the top of the page.
      3. Navigate to the /user/hive/warehouse directory.
      4. Click Enable Snapshots.
    3. Create the Hive replication schedule.
      1. Go to Backup and click Replication Schedules.
      2. Click Create Schedule and select Hive Replication.
      3. Specify the following details for the replication schedule:
        • Name: Enter a name. For example, hive-rep1.

        • Source: Select the credential that you defined earlier. For example, oracle-credential.

        • Destination: Select Hive (cluster name).

        • Cloud Root Path: Specify the root location where you want your data backed up. For example, s3a://BDCS-BACKUP/.

        • HDFS Destination: Leave this empty, because you're backing up to object storage.
        • Databases: Select Replicate All.

        • Replication Option: Select Metadata and Data.

        • Schedule: Select Immediate.

        • Run As Username: Specify a user with access to the data and metadata that will be backed up. This is typically a Hadoop superuser and Sentry administrator.

          Note

          If you don't have a user with access to the required data and metadata, create one. Do not use the hdfs superuser for this step.
          Note

          If you have Hadoop encryption enabled, which stores encrypted data in HDFS, ensure that the replication runs as a user who has decrypt access to the encryption zone.
    4. Click Save Schedule. You can monitor the replication on the Replication Schedules page.
  6. Review the source cluster configuration.

    Note down any custom configuration settings on your source cluster. You may need to make similar updates to your Big Data Service cluster. This includes installation of third-party products, additional JAR files, updates to the Hadoop configuration (such as auxiliary JARs for Hive), and so on.

    The Cloudera Manager API allows you to see all the changes that are made to the default cluster configuration. Download the cluster deployment and review the output to understand the changes that may need to be applied to Big Data Service.

    Use the following URL to log on to Cloudera Manager and download the cluster deployment:

    https://your-CM-host:7183/api/v18/cm/deployment

    You get the cluster configuration in JSON format, listing any overrides made on the cluster. Go through the file and verify if there are any overrides to the default. For example, search for hive.aux.jars.path to see the list of application JARs required to run your Hive queries. Copy these files to your Big Data Service cluster, and make an equivalent update to hive.aux.jars.path.
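    For example, you can download the deployment with curl and search it for overrides (a sketch; the admin password and Cloudera Manager host are placeholders):

    # Download the full cluster configuration and look for Hive auxiliary JAR overrides.
    curl -k -u admin:<password> \
        "https://your-CM-host:7183/api/v18/cm/deployment" -o cm_deployment.json
    grep -i "hive.aux.jars.path" cm_deployment.json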

Restore the BDA Backup to the Big Data Service Cluster

Before you restore your Oracle Big Data Appliance backup to Oracle Big Data Service, ensure that the backup to Oracle Object Storage has completed and that you have administrator access to the target Big Data Service cluster.

Restore the BDA Backup

To restore the BDA backup, follow these steps:
  1. Log on to Cloudera Manager on the Big Data Service cluster.
    1. Log on to https://your-utility-node-1:7183, where your-utility-node is the public or private IP address for the utility node. If high availability is used, this is the first utility node on the cluster. If high availability is not used, this is the only utility node.
    2. Enter the user name admin, and the password specified during cluster creation.
  2. Create an external account in Cloudera Manager for restore.

    Use the access key and secret key to create an external account in Cloudera Manager. You set up an external account to allow the cluster to access data in Oracle Object Storage.

    To create an external account, follow these steps:
    1. Log on to Cloudera Manager on the Oracle Big Data Service cluster.
    2. Go to Administration and click External Accounts.
    3. On the AWS Credentials tab, click Add Access Key Credentials and specify the following:
      • Name: Specify a name for the credentials. For example, oracle-credential.

      • AWS Access Key ID: Specify a name for the access key. For example, myaccesskey.

      • AWS Secret Key: Enter the secret key value generated earlier when you created a customer secret key.

    4. Click Add. The Edit S3Guard page appears. Do not select Enable S3Guard.
    5. Click Save.
    6. On the page that appears, enable cluster access to S3:
      1. Select Enable for the cluster name.
      2. Select the More Secure credential policy and click Continue.
      3. On the Restart Dependent Services page, select Restart Now, and then click Continue. Restart details are displayed. Restarting the cluster can take a few minutes.
      4. After the restart, click Continue, and then click Finish.
  3. Update the s3a endpoint.
    Note

    Skip this step if you have already updated the core-site.xml file.

    The endpoint URI enables your Hadoop cluster to connect to the object store that contains your source data. Specify this URI in Cloudera Manager.

    To update the endpoint, follow these steps:

    1. Log on to Cloudera Manager on the Oracle Big Data Service cluster.
    2. From the list of services on the left, click S3 Connector.
    3. Click the Configuration tab.
    4. Update the Default S3 Endpoint property with the following:
      https://your-tenancy.compat.objectstorage.your-region.oraclecloud.com

      For example, https://oraclebigdatadb.compat.objectstorage.us-phoenix-1.oraclecloud.com

    5. Save your changes.
    6. Update the cluster:
      1. Go to your cluster, click Actions, select Deploy Client Configuration, and then confirm the action.
      2. When complete, click Close.
    7. Restart the cluster (Click Actions and click Restart).
  4. Create an HDFS replication schedule for restore.

    Restore HDFS data that's backed up to Oracle Object Storage. Restore the HDFS data to the HDFS file system root directory to mirror the source.

    If Hive has external data that's captured in HDFS and not managed by Hive, create the HDFS replication schedule before you create the Hive replication schedule.

    To create an HDFS replication schedule:

    1. Log in to Cloudera Manager on the Oracle Big Data Service cluster.
    2. Create an HDFS replication schedule:
      1. Go to Backup and click Replication Schedules.
      2. Click Create Schedule and select HDFS Replication.
      3. Specify details for the replication schedule:
        • Name: Enter a name. For example, hdfs-rep1.

        • Source: Select the credential that you defined earlier. For example, oracle-credential.

        • Source Path: Specify the root location where your data was backed up. For example, s3a://BDA-BACKUP/.

        • Destination: Select HDFS (cluster name).

        • Destination Path: Enter /

        • Schedule: Select Immediate.

        • Run As Username: Specify a user with access to the data and metadata that's being restored. This is typically a Hadoop superuser and Sentry administrator.
          Note

          If you don't have a user with access to the required data and metadata, you should create one. Do not use the hdfs superuser for this step.
          Note

          If Hadoop encryption is used, ensure that the destination directory is created with the appropriate keys and that the command is executed as a user who has encrypt access.
    3. Click Save Schedule. You can monitor the replication on the Replication Schedules page.
  5. Create a Hive replication schedule for restore.

    To restore Hive data and metadata from Oracle Object Storage to the Hadoop cluster, create a Hive replication schedule in Cloudera Manager.

    To create a Hive replication schedule, follow these steps:

    1. Log on to Cloudera Manager on the Oracle Big Data Service cluster.
    2. Create the replication schedule:
      1. Go to Backup and click Replication Schedules.
      2. Click Create Schedule and select Hive Replication.
      3. Specify details for the Hive replication schedule:
        • Name: Enter a name. For example, hive-rep1.

        • Source: Specify the credential that you defined earlier. For example, oracle-credential.

        • Destination: Select Hive (cluster name).

        • Cloud Root Path: Select the root location where you backed up your data. For example, s3a://BDA-BACKUP/.

        • HDFS Destination Path: Enter /

        • Databases: Select Replicate All.

        • Replication Option: Select Metadata and Data.

        • Schedule: Select Immediate.

        • Run As Username: Specify a user with access to the data and metadata that will be restored. This is typically a Hadoop and Hive superuser, and Sentry administrator.
          Note

          If you don't have a user with access to the required data and metadata, you should create one. Do not use the hdfs superuser for this step.
    3. Click Save Schedule. You can monitor the replication on the Replication Schedules page.
Review Service Migrations

Spark

Review the Spark jobs and update them based on the new cluster details.

Yarn

To use the same queue names as the source cluster, do the following:
  • From the source cluster, copy the Fair Scheduler Allocations (Deployed) content from Cloudera Manager (go to YARN, click Configuration, and then click Fair Scheduler Allocations (Deployed)) to the same location on the target cluster.
  • If you can't copy it, create the queues manually. (In Cloudera Manager, go to Clusters and select Dynamic Resource Pool Configuration.)

Sentry

  • Migrate the HDFS data and Hive metadata using BDR, WANdisco, or Hadoop Distcp.
  • To export the Sentry data from the source Sentry database and restore it in the destination Sentry database, you need the Sentry metadata migration tool. Contact Oracle Support and refer to MOS note Doc ID 2879665.1 for the Sentry metadata migration tooling.

Migrating Resources Using the Distcp Tool

You can also export data and metadata from BDA and import them into Big Data Service using the Distcp tool. Distcp is an open source tool that can be used to copy large data sets between distributed file systems, both within and across clusters.

Preparing the BDA or BDCS Cluster

To prepare the BDA or BDCS cluster for export, follow these steps:

  1. Set the HDFS service in maintenance mode to block alerts and bulk operations. For more information, see Set Maintenance Mode for a Service.
  2. Connect to the BDA or BDCS cluster through SSH as an opc user. For more information, see Connect to a Cluster Node Through Secure Shell (SSH).
  3. Run the following commands to set the HDFS service in safe mode:
    $ sudo su hdfs -c 'hdfs dfsadmin -safemode get'
    Safe mode is OFF
     
    $ sudo su hdfs -c 'hdfs dfsadmin -safemode enter'
    Safe mode is ON
  4. Do the following:
    1. As a root user, copy your OCI private key (for example: ~/.oci/oci_api_key.pem) to the /home/opc/.oci directory on the BDA or BDCS cluster.
      sudo su -
      cp <private-key> /home/opc/.oci/oci_api_key.pem
    2. Change the private key file permissions to read-only.
      chmod 444 /home/opc/.oci/oci_api_key.pem
    3. Change the ownership of the private PEM key file to hdfs user.
      chown -R hdfs:hdfs /home/opc/.oci/oci_api_key.pem
    4. Create a directory named /data that the hdfs user can access.
    5. Copy the OCI private key from the /home/opc/.oci directory to the /data directory.
      cp /home/opc/.oci/oci_api_key.pem /data
    6. As a root user, change the private key file in the /data directory to read-only.
      chmod 444 /data/oci_api_key.pem
    If the BDA or BDCS cluster has multiple nodes, ensure that the OCI private key is available on all the nodes at the same path. You can use the otools user to copy files to all the nodes. However, ensure that the file is readable by otools. After you copy the files to all nodes, set the original permissions back.

    To install the OCI private key on all the hosts, as the otools user, run the following commands in the given sequence for each host:

    sudo su - otools
    ssh <hostname> sudo mkdir /data
    scp oci_api_key.pem <hostname>:/data
    ssh <hostname> sudo chown hdfs:hdfs /data/oci_api_key.pem
    ssh <hostname> sudo chmod 444 /data/oci_api_key.pem
Exporting Data from HDFS

To export data from HDFS, follow these steps:

  1. Set up the distcp tool. The distcp tool requires an updated oci-hdfs driver. Therefore, you must download the driver to the host where the distcp tool needs to be run. To download the driver, use the following commands:
    1. Run the following commands to create a directory named hdfsjar under /data and to navigate to the directory.
      mkdir -p /data/hdfsjars
      cd /data/hdfsjars
    2. Download oci-hdfs.zip.
      wget https://github.com/oracle/oci-hdfs-connector/releases/download/v2.7.7.0/oci-hdfs.zip
    3. Extract the oci-hdfs.zip file.
      1. Unzip the file under the /data/hdfsjars/lib directory.
        unzip oci-hdfs.zip lib/oci-hdfs-full-2.7.7.0.jar
      2. Move lib/oci-hdfs-full-2.7.7.0.jar to the current working directory, that is, /data/hdfsjars.
        mv lib/oci-hdfs-full-2.7.7.0.jar .
      3. Remove the lib directory and the oci-hdfs.zip file.
        rm -rf lib/ oci-hdfs.zip
  2. Set the environment variables to point to the oci-hdfs-full-2.7.7.0.jar.
    export LIBJARS=/data/hdfsjars/oci-hdfs-full-2.7.7.0.jar
  3. (Optional) Configure the proxy by setting the following environment variable, which also sets sun.net.inetaddr.ttl.
    export DISTCP_PROXY_OPTS="-Dhttps.proxyHost=<proxy-host> -Dhttps.proxyPort=<proxy-port> -Dsun.net.inetaddr.ttl=60"
    
  4. Test the configuration by listing the contents of the OCI Object Storage bucket.
    hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
    -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
    -Dfs.oci.client.auth.passphrase=<passphrase> \
    -Dfs.oci.client.auth.tenantId=<OCID forTenancy> \
    -Dfs.oci.client.auth.userId=<OCID forUser> \
    -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
    -Dfs.oci.client.proxy.uri=http://<proxyhost>:<proxyport> \
    -ls oci://<bucket>@<namespace>/
  5. To copy all the HDFS data under a specific directory in the OCI Object Store bucket, create a directory under the bucket.
    hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
          -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
          -Dfs.oci.client.auth.passphrase=<passphrase> \
          -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
          -Dfs.oci.client.auth.userId=<OCID for User> \
          -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
          -Dfs.oci.client.proxy.uri=http://<proxyhost>:<proxyport>  \
          -mkdir oci://<bucket>@<namespace>/<Dir>
  6. Run the following commands to verify that the directory is created in Object Store.
    hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
          -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
          -Dfs.oci.client.auth.passphrase=<passphrase> \
          -Dfs.oci.client.auth.tenantId=<OCID forTenancy> \
          -Dfs.oci.client.auth.userId=<OCID forUser> \
          -Dfs.oci.client.hostname=<HostName>\
          -ls oci://<bucket>@<namespace>/
  7. Do one of the following to export HDFS data:
    • Use hadoop fs -cp to export HDFS data from a single node cluster to Object Store as an HDFS user.
      sudo su - hdfs
      hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
            -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
            -Dfs.oci.client.auth.passphrase=<passphrase> \
            -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
            -Dfs.oci.client.auth.userId=<OCID for User> \
            -Dfs.oci.client.hostname=<HostName> \
            -Dfs.oci.client.proxy.uri=http://<proxyhost>:<proxyport>  \
            -cp <hdfs-dir-to-copy> oci://<bucket>@<namespace>/<dir>/
    • Use hadoop distcp to export HDFS data from multi-node clusters to Object Store. Validate it against a specific directory first. For example, use /user/oozie/ for <hdfs-dir-to-copy> in the following command, and then run it for the entire HDFS (/).
      hadoop distcp -libjars ${LIBJARS} \
      -Dfs.client.socket-timeout=3000000 \
      -Dfs.oci.client.auth.fingerprint=<fingerprint> \
      -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
      -Dfs.oci.client.auth.passphrase=<passphrase> \
      -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
      -Dfs.oci.client.auth.userId=<OCID for User> \
      -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
      -Dfs.oci.client.multipart.allowed=true \
      -Dfs.oci.client.proxy.uri=<http://proxy-host>:port \
      -Dmapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
      -Dmapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
      -Dmapreduce.task.timeout=6000000 \
      -skipcrccheck -m 40 -bandwidth 500 \
      -update -strategy dynamic -i <hdfs-dir-to-copy> oci://<bucket>@<namespace>/<exportedHdfsDir>/
  8. Verify the contents of Object Store.
    hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
         -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
         -Dfs.oci.client.auth.passphrase=<passphrase> \
         -Dfs.oci.client.auth.tenantId=<OCID forTenancy> \
         -Dfs.oci.client.auth.userId=<OCID forUser> \
         -Dfs.oci.client.hostname=<HostName>\
         -ls oci://<bucket>@<namespace>/<dir>
  9. Create metadata files and export them to Object Store.
    1. Download the following zip file:

      files/create_metadata.zip

    2. Unzip the file and place create_metadata.sh and create_permissions.py in one directory.
    3. Run the following shell script. This script creates multiple metadata files and places them under the /metadata directory in HDFS.
      bash create_metadata.sh <hdfs_directory>
    4. Export the metadata files to Object Store.
      hadoop distcp -Dfs.client.socket-timeout=3000000 \
          -Dfs.oci.client.auth.fingerprint=67:3a:86:d2:8b:61:b0:fd:c4:21:58:92:5a:d2:e8:83 \
          -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
          -Dfs.oci.client.auth.tenantId=ocid1.tenancy.oc1..aaaaaaaajh6gxgbuudok7odjpwe5wyv2iq74ihhac5lyulxdgrhsssxc4hna \
          -Dfs.oci.client.auth.userId=ocid1.user.oc1..aaaaaaaaeslrldhdvniuue3dsi7p3xblpfpsluxtx2g6txg5iiukfhzg77ua \
          -Dfs.oci.client.hostname=https://objectstorage.us-ashburn-1.oraclecloud.com/ \
          -Dfs.oci.client.multipart.allowed=true -Dmapreduce.task.timeout=6000000  \
          -skipcrccheck -m 20 -bandwidth 100 -update -strategy dynamic -i /metadata/Metadata* oci://<bucket>@<namespace>/metadata
      
Incrementally Exporting Data from HDFS

Migrate HDFS data incrementally by using distcp to send only the additions, updates, and deletions made to the source data to the target after each interval of time.

Note

  • Be sure that the snapshot names in the source and target clusters are the same.
  • Don't delete or change the HDFS data in the target cluster. Doing so can cause replication errors.
  1. Identify the source and target parent directory.
  2. Allow snapshots on the source directory:
    hdfs dfsadmin -allowSnapshot /data/a
  3. Create a Snapshot of /data/a:
    hdfs dfs -createSnapshot /data/a s1
  4. Make changes to the source directory /data/a that must be replicated to /data/a_target.
  5. Take a new snapshot of /data/a:
    hdfs dfs -createSnapshot /data/a s2
  6. (Optional) Verify differences between the two snapshots:
    hdfs snapshotDiff /data/a s1 s2
  7. Migrate the changes to /data/a_target:
    hadoop distcp -diff s1 s2 -update /data/a /data/a_target
  8. Create a matching snapshot on /data/a_target:
    hdfs dfs -createSnapshot /data/a_target s2
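  The cycle can be repeated for each interval. The following is a minimal sketch of one incremental round, assuming the previous snapshot (s1) already exists on both the source and the target:

  #!/bin/bash
  # One incremental replication round from /data/a to /data/a_target.
  PREV=s1   # snapshot already present on source and target
  CURR=s2   # new snapshot for this round

  hdfs dfs -createSnapshot /data/a ${CURR}
  hdfs snapshotDiff /data/a ${PREV} ${CURR}                    # optional: review the changes
  hadoop distcp -diff ${PREV} ${CURR} -update /data/a /data/a_target
  hdfs dfs -createSnapshot /data/a_target ${CURR}              # keep snapshot names in sync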
Exporting Hive Metadata

To export Hive metadata, follow these steps:

  1. Get the target Big Data Service HDFS path to update hive_migrate.properties.
    1. To retrieve the TARGET_HDFS_PATH, log on to the target BDS cluster through the bastion host and run the following command:
      cat /etc/hadoop/conf/core-site.xml | grep -A1 "fs.defaultFS"
    2. Note down the TARGET_HDFS_PATH.

  2. Update the TARGET_HDFS_PATH and TARGET_OBJECTSTORE_PATH in the hive_migrate.properties script to the location where the Hive table data is available after the migration. You need not escape the forward slashes in the path.
    #!/bin/bash
    #
    # Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
    # Properties file to override
    #
    # This file will be sourced by the generate_target_ddl_from_source.sh
    # to customize/override properties
    # SRC_HDFS_PATH and SRC_OBJECTSTORE_PATH will be derived automatically from the cluster.
    # You will be customizing
    #     RUNDIR , TARGET_HDFS_PATH, TARGET_OBJECTSTORE_PATH
    #
    # ------------------------------------------------------------
    # Location to hold intermediate and final scripts to be generated.
    # You must have write privileges on this directory
    export RUNDIR=/tmp/hivemigrate
     
    # ------------------------------------------------------------
    # Specify current hive.metastore.warehouse.dir from Hive Config.
    #
    export SRC_WAREHOUSE_DIR=/apps/hive/warehouse
     
     
    # ------------------------------------------------------------
    # Modify expected target BDS hdfs path in your hive DDL script
    #
    export TARGET_HDFS_PATH=hdfs://testbdmn0.bmbdcsad1.bmbdcs.oraclevcn.com:8020/warehouse/tablespace/external/hive
     
     
    # ------------------------------------------------------------
    # Modify expected target BDS Object Store path in your hive DDL script
    #
    export TARGET_OBJECTSTORE_PATH=oci://bdc2odh@bdsdevcluster/warehouse
  3. Connect to the BDA or BDCS cluster through ssh as an opc user. For more information, see Connect to a Cluster Node Through Secure Shell (SSH).
  4. Create a script named generate_target_ddl_from_source.sh in the ~opc home directory with the following code. This script generates the DDL statements that you can run on the target cluster to create the hive metadata.
    #!/bin/bash
    #
    # Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
    #
    # Script to generate queries that can be run on a target cluster for hive metadata migration.
    #
    # Usage :
    #   sudo generate_target_ddl_from_source.sh
    # The generated scripts are available under /tmp/hivemigrate/generatedSQL_target_Hive
    #  - 1_CREATE_DB.hql
    #  - 2_CREATE_TABLE.hql
    #  - 3_ADD_PARTITION.hql
    #
    # These scripts can be run on a target cluster AS IS to create hive metadata that matches the source..
    #
     
     
    #------------------------------------------------------------
    # Set Environment
    detect_basedir()
    {
      DIRNAME=`dirname $0`
      BASENAME=`basename $0`
      BASEDIR=$(cd "$DIRNAME" ; pwd)
    }
    detect_basedir
     
     
    #------------------------------------------------------------
    # Initialize any default config properties that have been overridden in
    #     ${BASEDIR}/hive_migrate.properties
    # DO NOT MODIFY. PLEASE OVERRIDE ANY PARAMETER IN ${BASEDIR}/hive_migrate.properties
    export RUNDIR=/tmp/hivemigrate
    export TARGET_HDFS_PATH=hdfs://Some_cluster_name/tmp
    export TARGET_OBJECTSTORE_PATH=oci://mybucket@mytenancy/tmp/
    #
    #------------------------------------------------------------
    if [ -f ${BASEDIR}/hive_migrate.properties ]; then
        echo -e "*** Sourcing Overide proprties file ${BASEDIR}/hive_migrate.properties"
        source ${BASEDIR}/hive_migrate.properties
    else
        echo -e "*** Warning : Missing Overide proprties file ${BASEDIR}/hive_migrate.properties"
        echo -e "*** Continue with default settings\n"
    fi
    #
    usage()
    {
      echo -e "Usage : ${BASENAME} \n"
      echo -e "**  This script is to be run on your source cluster and can be run as many times as you wish."
      echo -e "**  It cleans up any older generated scripts located in ${RUNDIR} before starting \n"
      echo -e "**  The generated output hive query scripts will be available in RUNDIR: ${RUNDIR} ...\n"
      echo -e "*** Modify RUNDIR if you would like to override\n"
      echo -e "**  Currently TARGET_HDFS_PATH is set to hdfs://Some_cluster_name/"
      echo -e "*** Modify TARGET_HDFS_PATH to match your hdfs path in Hive DDL \n"
    }
     
     
    # Make sure only root can run our script
    if [ "$(id -u)" != "0" ]; then
       echo -e "ERROR: *** This script ${BASENAME} must be run as root" 1>&2
       usage
       exit 1
    fi
     
    if [ "$1" != "" ]; then
       usage
       exit 1
    fi
     
     
    #------------------------------------------------------------
    #
    # Figure out the HDFS Path in source cluster.
    # Depending on HA or non-HA cluster the hdfs path syntax is different
    if ! grep -q "dfs.ha.namenodes.mycluster" /etc/bdcsce/conf/blueprint/*.json
    then
        echo -e "INFO:**  Source: Non-HA cluster..."
        nn_fqdn=$(hostname -f)
        SRC_HDFS_PATH=hdfs://${nn_fqdn}:8020${SRC_WAREHOUSE_DIR}
    else
        echo -e 'INFO:**  Source: HA cluster...'
        SRC_HDFS_PATH=hdfs://mycluster${SRC_WAREHOUSE_DIR}
    fi
    echo -e "INFO:**  Assumed source hdfspath : $SRC_HDFS_PATH"
     
     
    #------------------------------------------------------------
    #
    # Figure out the source Object Store path
    source /etc/bdcsce/conf/datasources.properties
    export SRC_OBJECTSTORE_PATH=${bdcsce_default_objectstore_url}
     
     
     
     
    #------------------------------------------------------------
    # Step-1 : Setup/Clean source and target directories to hold intermediate and final scripts
     
     
    #Path to store intermediate hive queries
    outputSourcePath=${RUNDIR}/generatedSQL_source_Hive
    rm -rf $outputSourcePath
    mkdir -m 777 -p $outputSourcePath
     
     
    #Path to store final target hive queries to be used at target system
    outputTargetPath=${RUNDIR}/generatedSQL_target_Hive
    rm -rf $outputTargetPath
    mkdir -m 777 -p $outputTargetPath
     
     
    #------------------------------------------------------------
    # Step-2 : Extract Create DB statements from MetaStore to Run in Target Hive Installation.
    #
    mysql -Dhive --skip-column-names -e"
     
     
    -- Query for creating databases
    SELECT
        CONCAT('CREATE DATABASE IF NOT EXISTS ',
                D.NAME,
                '\;') AS CREATE_DB_STATEMENTS
    FROM
        DBS D;" > $outputTargetPath/1_CREATE_DB.hql
     
     
    ret=$?
    if [ $ret -ne 0 ];
    then
        echo -e "Error:** - MySQL Error code is $ret while trying to extract DB creation scripts"
        exit $ret
    else
        echo -e "**  DB Creation extraction script $outputTargetPath/1_CREATE_DB.hql completed successfully"
    fi
     
     
    #------------------------------------------------------------
    # Extract Show Create Table statements from MetaStore to Run in Source Hive Installation.
     
     
    mysql -Dhive --skip-column-names -e"
     
     
    -- Query for Creating Show CT Statements to be run in the Source Hive Installation
    SELECT
        CONCAT('USE ', D.name, '\; ','SHOW CREATE TABLE ',
                T.tbl_name,
                '\;') AS SHOW_CT_STATEMENTS
    FROM
        TBLS T
            INNER JOIN
        DBS D ON D.DB_ID = T.DB_ID
    WHERE
        T.TBL_TYPE != 'INDEX_TABLE';" > $outputSourcePath/1_SHOW_CREATE_TABLE.hql
     
     
     
     
    ret=$?
    if [ $ret -ne 0 ];
    then
        echo -e "Error:** - MySQL Error code is $ret while trying to extract SHOW CT scripts"
        exit $ret
    else
        echo -e "\n\n"
        echo -e "**  SHOW CREATE TABLE extraction script $outputSourcePath/1_SHOW_CREATE_TABLE.hql completed successfully"
    fi
     
     
     
     
    #------------------------------------------------------------
    # Run Show Create Table Command on Source Hive Installation to create target Hive CT scripts
    echo -e "**  Running show create table in source cluster"
    echo -e "**  This may take a while...Depends on hive metadata size\n\n"
     
     
     
     
    sudo -u hive hive -f $outputSourcePath/1_SHOW_CREATE_TABLE.hql > $outputTargetPath/2_CREATE_TABLE_temp.hql
    ret=$?
    if [ $ret -ne 0 ];
    then
        echo -e "Error - Hive Error code is $ret while trying to execute SHOW CT scripts"
        exit $ret
    else
        echo -e "\n\n"
        echo -e "**  SHOW CT execution script $outputTargetPath/2_CREATE_TABLE_temp.hql completed successfully"
    fi
     
     
    # Add semicolon suffix ; after CT statements
    echo -e "USE default;" > $outputTargetPath/2_CREATE_TABLE.hql
    cat $outputTargetPath/2_CREATE_TABLE_temp.hql | tr '\n' '\f' | sed -e 's/)\fCREATE/);\fCREATE/g'  | tr '\f' '\n' >> $outputTargetPath/2_CREATE_TABLE.hql
    echo -e ";">>$outputTargetPath/2_CREATE_TABLE.hql
    rm -rf $outputTargetPath/2_CREATE_TABLE_temp.hql
    echo -e "**  Create Table execution script $outputTargetPath/2_CREATE_TABLE.hql created successfully"
     
    #------------------------------------------------------------
    # Convert Managed Table to External table for HIVE 3
    #
    #echo -e "INFO:** Convering Managed Table to External table for HIVE 3 in CT $outputTargetPath/2_CREATE_TABLE.hql scripts"
    #sed -i "s/^CREATE TABLE/CREATE EXTERNAL TABLE/g" $outputTargetPath/2_CREATE_TABLE.hql
     
    #------------------------------------------------------------
    # Replace HDFS Path
    #
    echo -e "INFO:** Replacing $SRC_HDFS_PATH with $TARGET_HDFS_PATH in CT $outputTargetPath/2_CREATE_TABLE.hql scripts"
    sed -i "s,$SRC_HDFS_PATH,$TARGET_HDFS_PATH,g" $outputTargetPath/2_CREATE_TABLE.hql
     
    #------------------------------------------------------------
    # Replace Target Object Store Path
    #
    echo -e "INFO:** Replacing $SRC_OBJECTSTORE_PATH with $TARGET_OBJECTSTORE_PATH in CT $outputTargetPath/2_CREATE_TABLE.hql scripts"
    sed -i "s,$SRC_OBJECTSTORE_PATH,$TARGET_OBJECTSTORE_PATH,g" $outputTargetPath/2_CREATE_TABLE.hql
     
     
    echo -e "INFO:** Replaced $SRC_OBJECTSTORE_PATH with $TARGET_OBJECTSTORE_PATH in CT $outputTargetPath/2_CREATE_TABLE.hql scripts"
     
     
     
    #------------------------------------------------------------
    # Extract Add Partition statements if any from MetaStore to Run in Target Hive Installation.
     
     
    mysql -Dhive --skip-column-names -e"
    -- Query to create add partition scripts to run in target Hive Installation
     
     
     
     
    SET @oldNameNode = '$SRC_HDFS_PATH'; -- Old NameNode address
    SET @newNameNode = '$TARGET_HDFS_PATH'; -- New NameNode address
     
     
     
     
    SELECT
        REPLACE(add_partition_query,
            @oldNameNode,
            @newNameNode) AS add_partition_query
     
     
     
     
    FROM
            (SELECT
                    CONCAT('USE ', D.name, '\; ', ' ALTER TABLE ', T.TBL_NAME, ' ADD PARTITION (', GROUP_CONCAT(PK.PKEY_NAME, '=', '\'', PKV.PART_KEY_VAL, '\''
                        ORDER BY PK.INTEGER_IDX), ') LOCATION \'', S.location, '\'\; ') AS add_partition_query
        FROM
            TBLS T
        INNER JOIN DBS D ON T.DB_ID = D.DB_ID
        INNER JOIN PARTITION_KEYS PK ON T.TBL_ID = PK.TBL_ID
        INNER JOIN PARTITIONS P ON P.TBL_ID = T.TBL_ID
        INNER JOIN PARTITION_KEY_VALS PKV ON P.PART_ID = PKV.PART_ID
            AND PK.INTEGER_IDX = PKV.INTEGER_IDX
        INNER JOIN SDS S ON P.SD_ID = S.SD_ID
        GROUP BY P.PART_ID
        ORDER BY D.name) alias1;" > $outputTargetPath/3_ADD_PARTITION.hql
    ret=$?
    if [ $ret -ne 0 ];
    then
        echo -e "Error - MySQL Error code is $ret while trying to extract ADD PARTITION scripts"
        exit $ret
    else
        echo -e "\n\n"
        echo -e "**  ADD PARTITION extraction script $outputTargetPath/3_ADD_PARTITION.hql completed successfully"
    fi
     
     
     
     
    echo -e "\n\n"
    echo -e "INFO: ** Here are your Target installation scripts"
    ls -al $outputTargetPath/*
    echo -e "\n\n"
    echo -e "INFO: ** Run these hql scripts on your Target hive shell..."
     
     
    exit 0
  5. As the root user, run the generate_target_ddl_from_source.sh script.
    sudo ./generate_target_ddl_from_source.sh
    The following scripts are generated under /tmp/hivemigrate/generatedSQL_target_Hive:
    • 1_CREATE_DB.hql
    • 2_CREATE_TABLE.hql
    • 3_ADD_PARTITION.hql

    During import, you copy these scripts from /tmp/hivemigrate/generatedSQL_target_Hive to a master node in the target BDS cluster with root and database access.
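    For example, the generated scripts can be copied to the target master node with scp (a sketch; the key file and host IP are placeholders):

    # Copy the generated DDL scripts to a master node of the target BDS cluster.
    scp -i <private_key> -r /tmp/hivemigrate/generatedSQL_target_Hive \
        opc@<target-master-node-ip>:/home/opc/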

Importing Resources to Big Data Service

You now import the exported data and metadata to Big Data Service.

Prerequisites
Before importing resources, do the following:
  1. Set up a fresh target environment on Big Data Service with the same BDA or BDCS Hadoop version (Hadoop 2.7.x) as the source cluster.
    Note

    Note the following:
      • Define the Big Data Service cluster on OCI with the same size as the source BDA or BDCS cluster. However, you must review your computing and storage needs before deciding the size of the target cluster.
      • For Oracle Cloud Infrastructure VM shapes, see Compute Shapes. Not all shapes are supported.
      • If any software other than the BDA or BDCS stack is installed on the source system using the bootstrap script or some other method, you must install and maintain that software on the target system as well.
  2. Copy the PEM private key (oci_api_key.pem) file to all the nodes of the Big Data Service cluster, and set the appropriate permissions (see the sketch after this list).
  3. Export the artifacts from the source BDA or BDCS cluster.
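  A minimal sketch for step 2, assuming SSH access as the opc user (with passwordless sudo) to every node; the node names are placeholders:

  # Distribute the OCI private key to every node and restrict its permissions.
  for host in <node1> <node2> <node3>; do
      scp ~/.oci/oci_api_key.pem opc@${host}:/tmp/
      ssh opc@${host} 'sudo mkdir -p /data && \
          sudo mv /tmp/oci_api_key.pem /data/ && \
          sudo chown hdfs:hdfs /data/oci_api_key.pem && \
          sudo chmod 444 /data/oci_api_key.pem'
  done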
Importing Data to HDFS

To import data to HDFS, follow these steps:

  1. Connect to the Big Data Service (BDS) cluster through SSH as the opc user.
    ssh -i id_rsa <private_node_ip>
  2. Determine the HDFS location where the data needs to be stored.
    
    grep -A1 -i fs.defaultFS /etc/hadoop/conf/core-site.xml
  3. As an HDFS user, move the data from Object Store to the target HDFS.
    1. Use hadoop fs -cp to import HDFS data from Object Store to a destination BDS cluster as an HDFS user.
      sudo su - hdfs
       
      hadoop fs -Dfs.oci.client.auth.fingerprint=<fingerprint> \
      -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
      -Dfs.oci.client.auth.passphrase=<passphrase> \
      -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
      -Dfs.oci.client.auth.userId=<OCID for User> \
      -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
      -Dfs.oci.client.proxy.uri=http://<proxyhost>:<proxyport> \
      -cp oci://<bucket>@<tenancy>/<dir>/ <hdfs-dir-to-copy>
    2. Use hadoop distcp to import HDFS data from Object Store to a multi-node BDS cluster. Validate it against a specific directory first. For example, use /user/oozie for <hdfs-dir-to-copy> in the following command, and then run it for the entire HDFS (/).
      hadoop distcp -libjars ${LIBJARS} \
      -Dfs.client.socket-timeout=3000000 \
      -Dfs.oci.client.auth.fingerprint=<fingerprint> \
      -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
      -Dfs.oci.client.auth.passphrase=<passphrase> \
      -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
      -Dfs.oci.client.auth.userId=<OCID for User> \
      -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
      -Dfs.oci.client.multipart.allowed=true \
      -Dfs.oci.client.proxy.uri=<http://proxy-host>:port \
      -Dmapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
      -Dmapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
      -Dmapreduce.task.timeout=6000000 \
      -skipcrccheck -m 40 -bandwidth 500 \
      -update -strategy dynamic -i oci://<bucket>@<tenancy>/<exportedHdfsDir>/ <hdfs-dir-to-copy>
  4. List the contents of the copied directory to verify the import.
    hadoop fs -ls /<hdfs-dir-to-copy>
Importing the Metadata Files and Applying the Permissions
Do the following:
  1. Import metadata files from Object Store to /metadata in HDFS.
    
    hadoop distcp -libjars ${LIBJARS} \
    -Dfs.client.socket-timeout=3000000 \
    -Dfs.oci.client.auth.fingerprint=<fingerprint> \
    -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
    -Dfs.oci.client.auth.passphrase=<passphrase> \
    -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
    -Dfs.oci.client.auth.userId=<OCID for User> \
    -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
    -Dfs.oci.client.multipart.allowed=true \
    -Dfs.oci.client.proxy.uri=<http://proxy-host>:port \
    -Dmapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
    -Dmapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
    -Dmapreduce.task.timeout=6000000 \
    -skipcrccheck -m 40 -bandwidth 500 \
    -update -strategy dynamic -i oci://<bucket>@<tenancy>/metadata/ /metadata
  2. Move files to the local directory.
    hdfs dfs -get /metadata/Metadata*
  3. Run the files in parallel in the background or in multiple terminals.
    bash Metadataaa &
     
    bash Metadataab &
     
    bash Metadataac &...
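  Alternatively, a minimal sketch that starts every metadata file in the background and waits for all of them to finish:

  # Run each generated metadata script in parallel, then wait for completion.
  for f in Metadata*; do
      bash "${f}" &
  done
  wait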
Importing Metadata

To import metadata, follow these steps:

  1. Connect to the BDS cluster through the bastion host as the opc user.
  2. Copy the Hive DDL query scripts that were created in the source BDA cluster to the target BDS cluster:
    1. 1_CREATE_DB.hql
    2. 2_CREATE_TABLE.hql
    3. 3_ADD_PARTITION.hql
  3. Run the 1_CREATE_DB.hql script.
    hive -f 1_CREATE_DB.hql
  4. Run the 2_CREATE_TABLE.hql script.
    hive -f 2_CREATE_TABLE.hql
  5. Run the 3_ADD_PARTITION.hql script.
    hive -f 3_ADD_PARTITION.hql
Updating the Hive Metastore Warehouse Directory

Do the following:

  1. Find the value of hive.metastore.warehouse.dir. This is a Hive property configured in hive-site.xml, set to /apps/hive/warehouse in the BDA or BDCS cluster.
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/hive/warehouse</value>
  2. Re-configure hive.metastore.warehouse.dir to <exportedHdfsDir>/apps/hive/warehouse in hive-site.xml in the BDS cluster.
  3. For example, if you copied your HDFS data under /tmp in HDFS, configure the hive.metastore.warehouse.dir property as /tmp/apps/hive/warehouse, as shown below.
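    The updated property in hive-site.xml on the BDS cluster would then look like the following (a sketch based on the /tmp example above):

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/tmp/apps/hive/warehouse</value>
    </property>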

Validating the Migration

To validate the migration, do the following:
  • Verify that you see the same set of Hive tables in the target cluster as in the source cluster.
    1. Connect to the hive shell.
      hive
    2. Run the following command to list the tables:
      show tables;
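    To compare the two clusters more systematically, you can dump the list of databases and tables on each cluster and diff the outputs (a sketch; run it on one node of each cluster and copy both files to a single host):

    # List every database.table into a file named after the local host.
    hive -S -e 'SHOW DATABASES;' | while read db; do
        hive -S -e "USE ${db}; SHOW TABLES;" | sed "s/^/${db}./"
    done > tables_$(hostname -s).txt
    # Then compare the source and target listings:
    # diff tables_<source-host>.txt tables_<target-host>.txt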