9.2.12 Azure Data Lake Storage

9.2.12.1 Azure Data Lake Gen1 (ADLS Gen1)

Microsoft Azure Data Lake supports streaming data through the Hadoop client. Therefore, data files can be sent to Azure Data Lake using either the Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) Hadoop Distributed File System (HDFS) Handler or the File Writer Handler in conjunction with the HDFS Event Handler.

The preferred mechanism for ingest to Microsoft Azure Data Lake is the File Writer Handler in conjunction with the HDFS Event Handler.

Use these steps to connect to Microsoft Azure Data Lake from GG for DAA.

  1. Download Hadoop 2.9.1 from http://hadoop.apache.org/releases.html.
  2. Unzip the file in a temporary directory. For example, /ggwork/hadoop/hadoop-2.9.1.
  3. Edit the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop/hadoop-env.sh file in that directory.
  4. Add entries for the JAVA_HOME and HADOOP_CLASSPATH environment variables:
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
    export HADOOP_CLASSPATH=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH

    These entries point to Java 8 and add share/hadoop/tools/lib to the Hadoop classpath. That directory is not on the classpath by default, and it contains the required Azure libraries.

  5. Edit the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop/core-site.xml file and add:
    <configuration>
    <property>
      <name>fs.adl.oauth2.access.token.provider.type</name>
      <value>ClientCredential</value>
    </property>
    <property>
      <name>fs.adl.oauth2.refresh.url</name>
      <value>Insert the Azure https URL here to obtain the access token</value>
    </property>
    <property>
      <name>fs.adl.oauth2.client.id</name>
      <value>Insert the client id here</value>
    </property>
    <property>
      <name>fs.adl.oauth2.credential</name>
      <value>Insert the password here</value>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>adl://{insert your Data Lake Storage Gen1 account name here}.azuredatalakestore.net</value>
    </property>
    </configuration>
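
    With the ClientCredential provider, the refresh URL is typically the Azure AD OAuth 2.0 token endpoint for your tenant. The following form is illustrative; substitute your own tenant id:

    https://login.microsoftonline.com/{insert your Azure Tenant id here}/oauth2/token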
  6. Open your firewall to allow connections to both the Azure URL used to obtain the token and the Azure Data Lake URL, or disconnect from your network or VPN. Per the Apache Hadoop documentation, access to Azure Data Lake does not currently support using a proxy server.
  7. Use the Hadoop shell commands to prove connectivity to Azure Data Lake. For example, from the Hadoop 2.9.1 installation directory, run this command to list the root directory of the Data Lake store.
    ./bin/hadoop fs -ls /
  8. Verify connectivity to Azure Data Lake.
  9. Configure either the HDFS Handler or the File Writer Handler with the HDFS Event Handler to push data to Azure Data Lake; see Flat Files. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler. A sample HDFS Handler configuration follows the gg.classpath example below.

    Example gg.classpath setting:

    gg.classpath=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/lib/*:
    /ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/lib/*:
    /ggwork/hadoop/hadoop-2.9.1/etc/hadoop:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*
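
    The following is a minimal sketch of an HDFS Handler configuration that writes through the adl:// file system configured in core-site.xml. The handler name, format, and target path are illustrative assumptions; see the HDFS Handler documentation for the authoritative property list.

    gg.handlerlist=hdfs
    gg.handler.hdfs.type=hdfs
    # Illustrative target directory; resolved against fs.defaultFS (the adl:// URI) from core-site.xml
    gg.handler.hdfs.rootFilePath=/ogg
    gg.handler.hdfs.format=avro_op
    # gg.classpath as shown above; jvm.bootoptions must include ggjava.jar
    jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar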

See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.

9.2.12.2 Azure Data Lake Gen2 using Hadoop Client and ABFS

Microsoft Azure Data Lake Gen 2 (using Hadoop Client and ABFS) supports streaming data via the Hadoop client. Therefore, data files can be sent to Azure Data Lake Gen 2 using either the Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler.

Hadoop 3.3.0 (or higher) is recommended for connectivity to Azure Data Lake Gen 2. Hadoop 3.3.0 contains an important fix to correctly fire Azure events on file close using the "abfss" scheme. For more information, see Hadoop Jira issue HADOOP-16182.

The preferred mechanism for ingest to Azure Data Lake Gen 2 is the File Writer Handler in conjunction with the HDFS Event Handler.

Prerequisites

Part 1:

  1. Connectivity to Azure Data Lake Gen 2 assumes that you have correctly provisioned an Azure Data Lake Gen 2 account in the Azure portal.

    From the Azure portal select Storage Accounts from the commands on the left to view/create/delete storage accounts.

    In the Azure Data Lake Gen 2 provisioning process, it is recommended that you enable Hierarchical namespace in the Advanced tab.

    Enabling Hierarchical namespace is not mandatory for the Azure storage account.

  2. Ensure that you have created a Web app/API App Registration to connect to the storage account.

    From the Azure portal select All services from the list of commands on the left, type app into the filter command box and select App registrations from the filtered list of services. Create an App registration of type Web app/API.

    Add permissions to access Azure Storage. Assign the App registration to an Azure account. Generate a Key for the App Registration as follows:
    1. Navigate to the respective App registration page.
    2. On the left pane, select Certificates & secrets.
    3. Click + New client secret. The new key appears under the Value column.
    The generated key string is your client secret and is available only at the time the key is created. Therefore, ensure that you record the generated key string.

Part 2:

  1. In the Azure Data Lake Gen 2 account, ensure that the App Registration is given access.

    In the Azure portal, select Storage accounts from the left panel. Select the Azure Data Lake Gen 2 account that you have created.

    Select the Access Control (IAM) command to bring up the Access Control (IAM) panel. Select the Role Assignments tab and add a role assignment for the created App Registration.

    The app registration assigned to the storage account must be provided with read and write access into the Azure storage account.

    You can use either of the following roles: the built-in Azure role Storage Blob Data Contributor or a custom role with the required permissions.
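
    Alternatively, if you use the Azure CLI instead of the portal, a role assignment can be granted with a command like the following sketch. The subscription id, resource group, storage account name, and application id are placeholders you must supply:

    az role assignment create \
      --assignee {insert your App Registration application id here} \
      --role "Storage Blob Data Contributor" \
      --scope /subscriptions/{subscription id}/resourceGroups/{resource group}/providers/Microsoft.Storage/storageAccounts/{storage account name}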
  2. Connectivity to Azure Data Lake Gen 2 can be routed through a proxy server.
    Three parameters must be set in the Java boot options (jvm.bootoptions) to enable this:
    jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar -DproxySet=true -Dhttps.proxyHost={insert your proxy server} -Dhttps.proxyPort={insert your proxy port}
  3. Two connectivity schemes to Azure Data Lake Gen 2 are supported: abfs and abfss.

    The abfss scheme is preferred because it uses HTTPS calls, thereby providing security and payload encryption.
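
    For example, the same file system (container) and storage account are addressed as follows under the two schemes; the container and account names are placeholders:

    abfs://{insert your container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net
    abfss://{insert your container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net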

Connecting to Microsoft Azure Data Lake Gen 2

To connect to Microsoft Azure Data Lake Gen 2 from GG for DAA:

  1. Download Hadoop 3.3.0 from http://hadoop.apache.org/releases.html.
  2. Unzip the file in a temporary directory. For example, /usr/home/hadoop/hadoop-3.3.0.
  3. Edit the {hadoop install dir}/etc/hadoop/hadoop-env.sh file to point to Java 8 and add the Azure Hadoop libraries to the Hadoop classpath. Add these entries to the hadoop-env.sh file:
    export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_202
    export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
  4. Private networks often require routing through a proxy server to access the public internet. Therefore, you may have to configure proxy server settings for the hadoop command line utility to test the connectivity to Azure. To configure proxy server settings, set the following in the hadoop-env.sh file:
    export HADOOP_CLIENT_OPTS="-Dhttps.proxyHost={insert your proxy server} -Dhttps.proxyPort={insert your proxy port}"

    Note:

    These proxy settings only work for the hadoop command line utility. The proxy server settings for GG for DAA connectivity to Azure are set in jvm.bootoptions, as described in the Prerequisites section.
  5. Edit the {hadoop install dir}/etc/hadoop/core-site.xml file and add the following configuration:
    <configuration>
    <property>
      <name>fs.azure.account.auth.type</name>
      <value>OAuth</value>
    </property>
    <property>
      <name>fs.azure.account.oauth.provider.type</name>
      <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
    </property>
    <property>
      <name>fs.azure.account.oauth2.client.endpoint</name>
      <value>https://login.microsoftonline.com/{insert the Azure Tenant id here}/oauth2/token</value>
    </property>
    <property>
      <name>fs.azure.account.oauth2.client.id</name>
      <value>{insert your client id here}</value>
    </property>
    <property>
      <name>fs.azure.account.oauth2.client.secret</name>
      <value>{insert your client secret here}</value>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>abfss://{insert your container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net</value>
    </property>
    <property>
      <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
      <value>true</value>
    </property>
    </configuration>

    To obtain your Azure Tenant Id, go to the Microsoft Azure portal. Enter Azure Active Directory in the Search bar and select Azure Active Directory from the list of services. The Tenant Id is located in the center of the main Azure Active Directory service page.
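
    Alternatively, if the Azure CLI is installed and you are logged in, the following command is one way to display the tenant id of the current subscription; it is offered as a convenience and is not required:

    az account show --query tenantId --output tsv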

    To obtain your Azure Client Id and Client Secret go to the Microsoft Azure portal. Select All Services from the list on the left to view the Azure Services Listing. Type App into the filter command box and select App Registrations from the listed services. Select the App Registration that you have created to access Azure Storage. The Application Id displayed for the App Registration is the Client ID. The Client Secret is the generated key string when a new key is added. This generated key string is available only once when the key is created. If you do not know the generated key string, create another key making sure you capture the generated key string.

    The ADL gen2 account name is the storage account name that you specified when you created the Azure ADL gen2 account.

    File systems are sub-partitions within an Azure Data Lake Gen 2 storage account. You can create and access new file systems on the fly, but only if the following Hadoop configuration is set (an example command follows the property below):

    <property>
      <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
      <value>true</value>
    </property>
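
    As an illustration of the on-the-fly behavior described above, with this property set you can address a file system other than the one named in fs.defaultFS by using its full abfss URI; the container and account names below are placeholders:

    ./bin/hadoop fs -mkdir abfss://{insert a new container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net/tmp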
  6. Verify connectivity using Hadoop shell commands.
    ./bin/hadoop fs -ls /
    ./bin/hadoop fs -mkdir /tmp
  7. Configure either the HDFS Handler or the File Writer Handler with the HDFS Event Handler to push data to Azure Data Lake; see Flat Files. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler. A sample File Writer Handler configuration follows the gg.classpath example below.

    Example gg.classpath setting:

    gg.classpath=/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/*:
    /ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-3.3.0/etc/hadoop/:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/tools/lib/*
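
    The following is a minimal sketch of a File Writer Handler configuration chained to an HDFS Event Handler that uploads rolled files to Azure Data Lake Gen 2 through the abfss file system configured in core-site.xml. The handler and event handler names, format, and path templates are illustrative assumptions; see the Flat Files and HDFS Event Handler documentation for the authoritative property list.

    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    # Local staging directories for in-progress files and state (illustrative paths)
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.format=avro_row_ocf
    # Hand completed files to the HDFS Event Handler, then remove the local copy
    gg.handler.filewriter.finalizeAction=delete
    gg.handler.filewriter.eventHandler=adlsg2
    # The event handler name (adlsg2) is arbitrary; type hdfs writes through the Hadoop client
    gg.eventhandler.adlsg2.type=hdfs
    gg.eventhandler.adlsg2.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
    gg.eventhandler.adlsg2.finalizeAction=none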

See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.

9.2.12.3 Azure Data Lake Gen2 using BLOB endpoint

Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) can connect to ADLS Gen2 using the BLOB endpoint. GG for DAA ADLS Gen2 replication using the BLOB endpoint does not require a Hadoop installation. For more information, see Azure Blob Storage.