28 Connecting to Microsoft Azure Data Lake
Microsoft Azure Data Lake supports streaming data through the Hadoop client. Therefore, data files can be sent to Azure Data Lake using either the Oracle GoldenGate for Big Data Hadoop Distributed File System (HDFS) Handler or the File Writer Handler in conjunction with the HDFS Event Handler.
The preferred mechanism for ingest to Microsoft Azure Data Lake is the File Writer Handler in conjunction with the HDFS Event Handler.
Use these steps to connect to Microsoft Azure Data Lake from Oracle GoldenGate for Big Data.
- Download Hadoop 2.9.1 from http://hadoop.apache.org/releases.html.
- Unzip the file in a temporary directory. For example, /ggwork/hadoop/hadoop-2.9.1.
- Edit the hadoop-env.sh file in the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop directory.
- Add entries for the JAVA_HOME and HADOOP_CLASSPATH environment variables:

  export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
  export HADOOP_CLASSPATH=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH

  This points to Java 8 and adds the share/hadoop/tools/lib directory to the Hadoop classpath. The library path is not in the variable by default, and the required Azure libraries are in this directory.
- Edit the
/ggwork/hadoop/hadoop-2.9.1/etc/hadoop/core-site.xml file and add:

  <configuration>
    <property>
      <name>fs.adl.oauth2.access.token.provider.type</name>
      <value>ClientCredential</value>
    </property>
    <property>
      <name>fs.adl.oauth2.refresh.url</name>
      <value>Insert the Azure https URL here to obtain the access token</value>
    </property>
    <property>
      <name>fs.adl.oauth2.client.id</name>
      <value>Insert the client id here</value>
    </property>
    <property>
      <name>fs.adl.oauth2.credential</name>
      <value>Insert the password here</value>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>adl://Account Name.azuredatalakestore.net</value>
    </property>
  </configuration>
- Open your firewall to connect to both the Azure URL that provides the access token and the Azure Data Lake URL, or disconnect from your network or VPN. Per the Apache Hadoop documentation, access to Azure Data Lake does not currently support using a proxy server.
- Use the Hadoop shell commands to prove connectivity to Azure Data Lake. For example, from the 2.9.1 Hadoop installation directory, execute this command to get a listing of the root HDFS directory:

  ./bin/hadoop fs -ls /
- Verify connectivity to Azure Data Lake.
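  Beyond a simple listing, a round-trip write test confirms that the credentials also grant write access. The commands below are a sketch run from the Hadoop installation directory; the /ogg_test path and the local file name are illustrative, not required values.

  ```shell
  # Create a small local file to upload (path is an example only).
  echo "connectivity test" > /tmp/adl_test.txt

  # Create a test directory in the Data Lake store and upload the file.
  ./bin/hadoop fs -mkdir -p /ogg_test
  ./bin/hadoop fs -put /tmp/adl_test.txt /ogg_test/

  # Read the file back; the contents should match the local copy.
  ./bin/hadoop fs -cat /ogg_test/adl_test.txt

  # Remove the test artifacts when finished.
  ./bin/hadoop fs -rm -r /ogg_test
  ```

  If any of these commands fail with an authentication error, recheck the fs.adl.oauth2 properties in core-site.xml before configuring the handlers.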
- Configure either the HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler to push data to Azure Data Lake; see Using the File Writer Handler. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler.
Setting the gg.classpath example:

gg.classpath=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-2.9.1/etc/hadoop:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*
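As a starting point, the recommended File Writer Handler plus HDFS Event Handler pairing might look like the following fragment of the replicat properties file. This is a sketch under stated assumptions: the handler and event handler names (filewriter, hdfs) and the target path template are illustrative choices, and the full property set depends on your Oracle GoldenGate for Big Data release; consult Using the File Writer Handler for the authoritative list.

```properties
gg.handlerlist=filewriter

# The File Writer Handler stages output files on the local file system.
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
# Hand completed files to the HDFS Event Handler for upload.
gg.handler.filewriter.eventHandler=hdfs

# The HDFS Event Handler loads finished files into Azure Data Lake
# through the Hadoop client configured in the earlier steps.
gg.eventhandler.hdfs.type=hdfs
# Target directory template in the Data Lake store (illustrative).
gg.eventhandler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
```

Because fs.defaultFS in core-site.xml points at the adl:// URI, the HDFS Event Handler writes to Azure Data Lake without any Azure-specific handler configuration.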
See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.