28 Connecting to Microsoft Azure Data Lake
Microsoft Azure Data Lake supports streaming data through the Hadoop client. Therefore, data files can be sent to Azure Data Lake using either the Oracle GoldenGate for Big Data Hadoop Distributed File System (HDFS) Handler or the File Writer Handler in conjunction with the HDFS Event Handler.
The preferred mechanism for ingest to Microsoft Azure Data Lake is the File Writer Handler in conjunction with the HDFS Event Handler.
Use these steps to connect to Microsoft Azure Data Lake from Oracle GoldenGate for Big Data.
- Download Hadoop 2.9.1 from http://hadoop.apache.org/releases.html.
- Unzip the file in a temporary directory. For example, /ggwork/hadoop/hadoop-2.9.1.
- Edit the hadoop-env.sh file in the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop directory.
- Add entries for the JAVA_HOME and HADOOP_CLASSPATH environment variables:

  export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
  export HADOOP_CLASSPATH=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH

  This points to Java 8 and adds the share/hadoop/tools/lib directory to the Hadoop classpath. The library path is not in the variable by default, and the required Azure libraries are in this directory.
- Edit the
/ggwork/hadoop/hadoop-2.9.1/etc/hadoop/core-site.xml file and add:

  <configuration>
    <property>
      <name>fs.adl.oauth2.access.token.provider.type</name>
      <value>ClientCredential</value>
    </property>
    <property>
      <name>fs.adl.oauth2.refresh.url</name>
      <value>Insert the Azure https URL here to obtain the access token</value>
    </property>
    <property>
      <name>fs.adl.oauth2.client.id</name>
      <value>Insert the client id here</value>
    </property>
    <property>
      <name>fs.adl.oauth2.credential</name>
      <value>Insert the password here</value>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>adl://Account Name.azuredatalakestore.net</value>
    </property>
  </configuration>
- Open your firewall to connect to both the Azure URL that provides the access token and the Azure Data Lake URL, or disconnect from your network or VPN. Per the Apache Hadoop documentation, access to Azure Data Lake does not currently support using a proxy server.
- Use the Hadoop shell commands to prove connectivity to Azure Data Lake. For example, from the 2.9.1 Hadoop installation directory, execute this command to get a listing of the root HDFS directory:

  ./bin/hadoop fs -ls /
- Verify connectivity to Azure Data Lake.
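  Beyond a simple listing, a round-trip write test confirms that the credentials also grant write access. The commands below are a sketch run from the Hadoop installation directory; the /ogg_test path and the local file name are illustrative, not required values.

  ```shell
  # Create a small local file to upload (path is an example only).
  echo "connectivity test" > /tmp/adl_test.txt

  # Create a test directory in the Data Lake store and upload the file.
  ./bin/hadoop fs -mkdir -p /ogg_test
  ./bin/hadoop fs -put /tmp/adl_test.txt /ogg_test/

  # Read the file back; the contents should match the local copy.
  ./bin/hadoop fs -cat /ogg_test/adl_test.txt

  # Remove the test artifacts when finished.
  ./bin/hadoop fs -rm -r /ogg_test
  ```

  If any of these commands fail with an authentication error, recheck the fs.adl.oauth2 properties in core-site.xml before configuring the handlers.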
- Configure either the HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler to push data to Azure Data Lake; see Using the File Writer Handler. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler.
Setting the gg.classpath example:

gg.classpath=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-2.9.1/etc/hadoop:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*
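As a starting point, the recommended File Writer Handler plus HDFS Event Handler pairing might look like the following fragment of the replicat properties file. This is a sketch under stated assumptions: the handler and event handler names (filewriter, hdfs) and the target path template are illustrative choices, and the full property set depends on your Oracle GoldenGate for Big Data release; consult Using the File Writer Handler for the authoritative list.

```properties
gg.handlerlist=filewriter

# The File Writer Handler stages output files on the local file system.
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
# Hand completed files to the HDFS Event Handler for upload.
gg.handler.filewriter.eventHandler=hdfs

# The HDFS Event Handler loads finished files into Azure Data Lake
# through the Hadoop client configured in the earlier steps.
gg.eventhandler.hdfs.type=hdfs
# Target directory template in the Data Lake store (illustrative).
gg.eventhandler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
```

Because fs.defaultFS in core-site.xml points at the adl:// URI, the HDFS Event Handler writes to Azure Data Lake without any Azure-specific handler configuration.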
See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.