29 Connecting to Microsoft Azure Data Lake Gen 2

Microsoft Azure Data Lake Gen 2 supports streaming data via the Hadoop client. Therefore, data files can be sent to Azure Data Lake Gen 2 using either the Oracle GoldenGate for Big Data HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler.

Hadoop 3.3.0 (or higher) is recommended for connectivity to Azure Data Lake Gen 2. Hadoop 3.3.0 contains an important fix to correctly fire Azure events on file close when using the abfss scheme. For more information, see Hadoop Jira issue HADOOP-16182.

Use the File Writer Handler in conjunction with the HDFS Event Handler. This is the preferred mechanism for ingest to Azure Data Lake Gen 2.

Prerequisites

Part 1:

  1. Connectivity to Azure Data Lake Gen 2 assumes that you have correctly provisioned an Azure Data Lake Gen 2 account in the Azure portal.

    From the Azure portal select Storage Accounts from the commands on the left to view/create/delete storage accounts.

    In the Azure Data Lake Gen 2 provisioning process, it is recommended that the Hierarchical namespace is enabled in the Advanced tab.

    It is not mandatory to enable the Hierarchical namespace for the Azure storage account.

  2. Ensure that you have created a Web app/API App Registration to connect to the storage account.

    From the Azure portal select All services from the list of commands on the left, type app into the filter command box and select App registrations from the filtered list of services. Create an App registration of type Web app/API.

    Add permissions to access Azure Storage. Assign the App registration to an Azure account. Generate a Key for the App Registration.

    The generated key string is your client secret and is only available at the time the key is created. Therefore, ensure you document the generated key string.

Part 2:

  1. In the Azure Data Lake Gen 2 account, ensure that the App Registration is given access.

    In the Azure portal, select Storage accounts from the left panel. Select the Azure Data Lake Gen 2 account that you have created.

    Select the Access Control (IAM) command to bring up the Access Control (IAM) panel. Select the Role Assignments tab and add a role assignment for the created App Registration.

    The App Registration assigned to the storage account must be given read and write access to the Azure storage account.

    You can use either of the following roles: the built-in Azure role Storage Blob Data Contributor or a custom role with the required permissions. A scripted alternative using the Azure CLI is sketched below.
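
    For example, the following is a sketch of the role assignment using the Azure CLI; the placeholder values are illustrative, and the scope must reference your own subscription, resource group, and storage account:

    az role assignment create \
      --assignee {insert your App Registration application id} \
      --role "Storage Blob Data Contributor" \
      --scope "/subscriptions/{subscription id}/resourceGroups/{resource group}/providers/Microsoft.Storage/storageAccounts/{storage account name}"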
  2. Connectivity to Azure Data Lake Gen 2 can be routed through a proxy server.
    Three parameters must be set in the Java boot options to enable this:
    javawriter.bootoptions=-Xmx512m 
    -Xms32m -Djava.class.path=ggjava/ggjava.jar 
    -DproxySet=true 
    -Dhttps.proxyHost={insert your proxy server} 
    -Dhttps.proxyPort={insert your proxy port}
  3. Two connectivity schemes to Azure Data Lake Gen 2 are supported: abfs and abfss.

    The preferred method is abfss because it uses HTTPS calls, thereby providing security and payload encryption.
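
    For example, with a hypothetical container named mycontainer in a storage account named mystorageaccount, the same path is addressed as follows under each scheme:

    abfs://mycontainer@mystorageaccount.dfs.core.windows.net/ogg/data
    abfss://mycontainer@mystorageaccount.dfs.core.windows.net/ogg/data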

Connecting to Microsoft Azure Data Lake Gen 2

To connect to Microsoft Azure Data Lake Gen 2 from Oracle GoldenGate for Big Data:

  1. Download Hadoop 3.3.0 from http://hadoop.apache.org/releases.html.
  2. Unzip the file in a temporary directory. For example, /usr/home/hadoop/hadoop-3.3.0.
  3. Edit the {hadoop install dir}/etc/hadoop/hadoop-env.sh file to point to Java 8 and to add the Azure Hadoop libraries to the Hadoop classpath. The following are the required entries in the hadoop-env.sh file:
    export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_202
    export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
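
    To confirm that the hadoop-azure libraries are picked up, one quick sanity check (run from the Hadoop installation directory) is to expand the generated classpath and search it:

    ./bin/hadoop classpath --glob | tr ':' '\n' | grep azure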
  4. Private networks often require routing through a proxy server to access the public internet. Therefore, you may have to configure proxy server settings for the hadoop command line utility to test the connectivity to Azure. To configure proxy server settings, set the following in the hadoop-env.sh file:
    export HADOOP_CLIENT_OPTS="
    -Dhttps.proxyHost={insert your proxy server} 
    -Dhttps.proxyPort={insert your proxy port}"

    Note:

    These proxy settings only work for the hadoop command line utility. The proxy server settings for Oracle GoldenGate for Big Data connectivity to Azure are set in the javawriter.bootoptions property, as described in the Prerequisites section.
  5. Edit the {hadoop install dir}/etc/hadoop/core-site.xml file and add the following configuration:
    <configuration>
      <property>
        <name>fs.azure.account.auth.type</name>
        <value>OAuth</value>
      </property>
      <property>
        <name>fs.azure.account.oauth.provider.type</name>
        <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
      </property>
      <property>
        <name>fs.azure.account.oauth2.client.endpoint</name>
        <value>https://login.microsoftonline.com/{insert the Azure instance id here}/oauth2/token</value>
      </property>
      <property>
        <name>fs.azure.account.oauth2.client.id</name>
        <value>{insert your client id here}</value>
      </property>
      <property>
        <name>fs.azure.account.oauth2.client.secret</name>
        <value>{insert your client secret here}</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>abfss://{insert your container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net</value>
      </property>
      <property>
        <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
        <value>true</value>
      </property>
    </configuration>

    To obtain your Azure instance id, go to the Microsoft Azure portal. Select Azure Active Directory from the list on the left to view the Azure Active Directory panel. Select Properties in the Azure Active Directory panel to view the Azure Active Directory properties. The Azure instance id is the field marked as Directory ID.

    To obtain your Azure Client Id and Client Secret, go to the Microsoft Azure portal. Select All Services from the list on the left to view the Azure Services Listing. Type App into the filter command box and select App Registrations from the listed services. Select the App Registration that you created to access Azure Storage. The Application Id displayed for the App Registration is the Client Id. The Client Secret is the key string that is generated when a new key is added; this generated key string is available only once, when the key is created. If you do not know the generated key string, create another key, making sure you capture the generated key string.

    The ADL gen2 account name is the storage account name that you specified when you created the Azure ADL gen2 account.

    File systems are sub-partitions within an Azure Data Lake Gen 2 storage account. You can create and access new file systems on the fly, but only if the following Hadoop configuration is set:

    <property>
      <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
      <value>true</value>
    </property>
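
    With this property set, addressing a file system that does not yet exist creates it on first access. For example, using a hypothetical file system name:

    ./bin/hadoop fs -ls abfss://newfilesystem@{insert your ADL gen2 storage account name here}.dfs.core.windows.net/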
  6. Verify connectivity using Hadoop shell commands.
    ./bin/hadoop fs -ls /
    ./bin/hadoop fs -mkdir /tmp
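
    A small round trip (the local file name is illustrative) additionally verifies write and read access:

    echo "connectivity test" > /tmp/azure-test.txt
    ./bin/hadoop fs -put /tmp/azure-test.txt /tmp/
    ./bin/hadoop fs -cat /tmp/azure-test.txt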
  7. Configure either the HDFS Handler or the File Writer Handler using the HDFS Event Handler to push data to Azure Data Lake; see Using the File Writer Handler. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler.

    Example of setting the gg.classpath:

    gg.classpath=/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-3.3.0/etc/hadoop/:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/tools/lib/*

    The share/hadoop/tools/lib entry is required because the Azure client libraries (hadoop-azure) are located there.
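
    As a rough sketch of how the two handlers chain together (the handler name filewriter and the format, interval, and path values below are illustrative, not definitive; see Using the File Writer Handler for the authoritative property reference), the replicat properties might look like the following:

    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.format=avro_row_ocf
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.fileRollInterval=7m
    gg.handler.filewriter.finalizeAction=delete
    gg.handler.filewriter.eventHandler=hdfs
    gg.eventhandler.hdfs.type=hdfs
    gg.eventhandler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
    gg.eventhandler.hdfs.finalizeAction=none

    In this arrangement, the File Writer Handler stages files on the local file system and, on each file roll, the chained HDFS Event Handler uploads the completed file to the abfss location configured through fs.defaultFS.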

See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.