3 Oracle Loader for Hadoop

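This chapter explains how to use Oracle Loader for Hadoop to copy data from Hadoop files into tables in Oracle Database. It contains the following sections:

What Is Oracle Loader for Hadoop?
Using Oracle Loader for Hadoop
Output Modes During OraLoader Invocation
Error Handling and Diagnostics
Balancing Loads When Loading Data into Partitioned Tables
OraLoader Configuration Properties
Example of Using Oracle Loader for Hadoop
Target Table Characteristics
Loader Map XML Schema Definition
Configuration Properties XML Document
Third-Party Licenses for Bundled Software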

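3.1 What Is Oracle Loader for Hadoop?

Oracle Loader for Hadoop is an efficient, high-performance loader for moving data quickly from a Hadoop cluster into a table in Oracle Database. It prepartitions the data if necessary and transforms it into a database-ready format. It can also sort records by primary key or by user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility.

Note:

Partitioning is a database feature for managing and efficiently querying very large tables. It provides a way to decompose a large table into smaller, more manageable pieces called partitions, in a manner that is entirely transparent to applications. For more information about partitioning, see Oracle Database VLDB and Partitioning Guide.

Oracle Loader for Hadoop works with a range of input data formats. It compensates for skew in the input data so that performance can be maximized.

After the prepartitioning and transformation steps, there are two modes for loading the data from a Hadoop cluster into Oracle Database:

Online database mode: The data is loaded into the database using either the JDBC output format or the OCI Direct Path output format. The OCI Direct Path output format performs a high-performance direct path load of the target table. The JDBC output format performs a conventional path load. For details about the online load methods, including the restrictions of the OCI Direct Path output format, see "Output Modes During OraLoader Invocation".
Offline database mode: The reducer tasks create output files in binary or text format. The Oracle Data Pump output format creates binary files that can be loaded into Oracle Database using an external table and the ORACLE_DATAPUMP access driver. The Delimited Text output format creates text files in delimited record format. (When the delimiter is a comma, this is usually called comma-separated values (CSV) format.) These text files can be loaded into Oracle Database using an external table and the ORACLE_LOADER access driver. The files can also be loaded using the SQL*Loader utility.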

3.2 Using Oracle Loader for Hadoop

This section describes the following steps for using Oracle Loader for Hadoop:

Implementing InputFormat
Creating the loaderMap Document
Accessing Table Metadata
Invoking OraLoader
Loading Files into Oracle Database (Offline Loads Only)

For installation instructions, see "Oracle Loader for Hadoop Setup".

3.2.1 Implementing InputFormat

Oracle Loader for Hadoop is a MapReduce application that gets its input from an org.apache.hadoop.mapreduce.RecordReader implementation, as provided by the org.apache.hadoop.mapreduce.InputFormat class specified in the mapreduce.inputformat.class configuration property. Oracle Loader for Hadoop requires the RecordReader to return an Avro IndexedRecord input object from the getCurrentValue() method. The method signature is:

public org.apache.avro.generic.IndexedRecord getCurrentValue()     
throws IOException, InterruptedException;

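Oracle Loader for Hadoop uses the schema of the IndexedRecord input object to discover the names of the input fields and map them to the columns of the table to load. This mapping is described in detail in the following sections.

Oracle Loader for Hadoop also uses the object returned by the RecordReader's getCurrentKey() method as a way to give feedback when an error occurs while processing the corresponding IndexedRecord value. In that case, the key's toString() method is called and the result is formatted into the error message. The developer of an InputFormat can help users identify the records that Oracle Loader for Hadoop rejected by returning one of the following pieces of information:

The URI of the data file
InputSplit information
The data file and the offset of the record in that file

If the data might contain sensitive information, do not return the actual text of the record, because the key is printed in Hadoop logs throughout the cluster. Instead, see "Logging Rejected Records in Bad Files" for how to enable bad-file logging of rejected records.

If the key is null, no identifying information is printed when a record fails.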
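The following is a minimal sketch of a key object whose toString() identifies a record by data file and offset rather than by its contents. The class name RecordLocationKey and the "uri@offset" layout are hypothetical choices for illustration; Oracle Loader for Hadoop only calls toString() on whatever key the RecordReader returns.

```java
// Hypothetical key class for a custom RecordReader; not part of Oracle Loader for Hadoop.
final class RecordLocationKey {
    private final String fileUri; // URI of the data file being read
    private final long offset;    // offset of the record within that file

    RecordLocationKey(String fileUri, long offset) {
        this.fileUri = fileUri;
        this.offset = offset;
    }

    // Identifies the record without exposing its (possibly sensitive) contents.
    @Override
    public String toString() {
        return fileUri + "@" + offset;
    }

    public static void main(String[] args) {
        System.out.println(new RecordLocationKey("hdfs://namenode/data/part-00000", 4096L));
    }
}
```

A RecordReader could return such an object from getCurrentKey() so that error messages point at the location of a rejected record.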

Oracle Loader for Hadoop comes with four built-in input formats. Oracle NoSQL Database also provides an input format. In addition, Oracle Loader for Hadoop includes the source code of an example InputFormat. The example source code is in the examples/jsrc/ directory. Table 3-1 lists all of these input formats, the input types they handle, and how the Avro schema field names are generated. (You must understand how an InputFormat generates field names to load the data into the target table.)

The built-in input format classes are described in the following sections. For the example InputFormat, see the source code and the Javadoc of the class for details.

Table 3-1 InputFormat Classes, Input Types, and Field Names

Class	Input Type	Avro Schema Field Names
`oracle.hadoop.loader.lib.input.HiveToAvroInputFormat`	Hive table sources	Hive table column names (uppercase)
`oracle.hadoop.loader.lib.input.DelimitedTextInputFormat`	Delimited text files	Comma-separated list from the `oracle.hadoop.loader.input.fieldNames` property (or F0, F1,… if the property is not defined)
`oracle.hadoop.loader.lib.input.RegexInputFormat`	Text files	Comma-separated list from the `oracle.hadoop.loader.input.fieldNames` property (or F0, F1,… if the property is not defined)
`oracle.hadoop.loader.lib.input.AvroInputFormat`	Binary Avro record files	Field names of the Avro schema of the input files
`oracle.hadoop.loader.examples.CSVInputFormat`	Simple delimited text files	F0, F1,...
`oracle.kv.hadoop.KVAvroInputFormat`	Oracle NoSQL Database	Field names from the value part of the key-value pairs of the Oracle NoSQL Database Avro records

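3.2.1.1 HiveToAvroInputFormat

This class is an input format that reads data from a Hive table. The names of the Hive database and table must be specified using the following configuration properties:

oracle.hadoop.loader.input.hive.tableName
oracle.hadoop.loader.input.hive.databaseName

HiveToAvroInputFormat accesses the HiveMetaStoreClient to obtain information about the table, such as its columns, location, InputFormat, and SerDe. Depending on how Hive is configured, you might need to set additional Hive-specific properties (such as hive.metastore.uris and hive.metastore.local).

Note that Oracle Loader for Hadoop does not currently support Hive partitioned tables.

HiveToAvroInputFormat imports the entire table (all files in the Hive table directory). All other (file-based) input formats described in this document allow globbing, that is, restricting the input by adding wildcard patterns to the input directory.

The rows of the Hive table are converted into Avro records whose field names are the uppercase column names of the Hive table. This makes it more likely that the fields match the database column names. See "Creating the loaderMap Document".

3.2.1.2 DelimitedTextInputFormat

This is an InputFormat class for delimited files, such as comma-separated or tab-separated files. DelimitedTextInputFormat requires that records are separated by newline characters and that fields are delimited by a single-character marker.

The DelimitedTextInputFormat class emulates the behavior of the SQL*Loader clause "terminated by t [optionally enclosed by ie [and te]]", where t is the field terminator, ie is the initial field encloser, and te is the trailing field encloser.

DelimitedTextInputFormat uses a parser based on the following grammar: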

Line = Token t Line | Token\n
Token = EnclosedToken | UnenclosedToken
EnclosedToken = (white-space)* ie [(non-te)* te te ]* (non-te)* te (white-space)*
UnenclosedToken = (white-space)* (non-t)*
white-space = {c | Character.isWhitespace(c) and c!=t}

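A trailing field encloser that is embedded in an enclosed token must be encoded by doubling it (writing it twice).

White space around enclosed tokens is discarded. For unenclosed tokens, leading white space is discarded, but trailing white space (if any) is kept.

Empty-string tokens (whether enclosed or unenclosed) are replaced with nulls.

This implementation allows custom enclosers and terminator characters (see Table 3-2), but the record terminator and white space are hard-coded (to newline and to Java's Character.isWhitespace(), respectively). The enclosers must be different from the terminator and from white space (the two enclosers can be the same character). The terminator can be a white space character (its value is then removed from the class of white space characters).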
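The rules above can be sketched as a small stand-alone parser. This is only an illustration written from the grammar and the surrounding description, not the DelimitedTextInputFormat source; the class and method names are hypothetical, and it handles a single line with the record terminator already removed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for one line of delimited text; names are hypothetical.
final class DelimitedLineSketch {

    // White space excludes the terminator itself, as in the grammar.
    private static boolean isWs(char c, char t) {
        return c != t && Character.isWhitespace(c);
    }

    // t = field terminator; ie/te = initial/trailing enclosers (ie == null means no enclosers).
    static List<String> parse(String line, char t, Character ie, Character te) {
        char close = 0;
        if (ie != null) close = (te != null) ? te : ie; // te defaults to ie
        List<String> tokens = new ArrayList<>();
        int i = 0, n = line.length();
        while (true) {
            while (i < n && isWs(line.charAt(i), t)) i++;       // drop leading white space
            String token = null;
            if (ie != null && i < n && line.charAt(i) == ie) {  // try EnclosedToken first
                StringBuilder sb = new StringBuilder();
                int j = i + 1;
                boolean closed = false;
                while (j < n && !closed) {
                    char c = line.charAt(j);
                    if (c != close) { sb.append(c); j++; }
                    else if (j + 1 < n && line.charAt(j + 1) == close) { sb.append(close); j += 2; } // doubled encloser
                    else { closed = true; j++; }
                }
                while (closed && j < n && isWs(line.charAt(j), t)) j++; // drop white space after encloser
                if (closed && (j == n || line.charAt(j) == t)) { token = sb.toString(); i = j; }
            }
            if (token == null) {                                 // fall back to UnenclosedToken
                int j = i;
                while (j < n && line.charAt(j) != t) j++;
                token = line.substring(i, j);                    // trailing white space is kept
                i = j;
            }
            tokens.add(token.isEmpty() ? null : token);          // empty tokens become null
            if (i >= n) return tokens;
            i++;                                                 // consume the terminator
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("\"x, \"\"y\"\"\" , z ,", ',', '"', '"'));
    }
}
```

With a comma terminator and double-quote enclosers, the sample line yields three tokens: the enclosed text with its doubled quotes collapsed, the unenclosed token with its trailing space kept, and a null for the empty last field.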

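Table 3-2 lists the delimiters available for DelimitedTextInputFormat. In the table, HHHH is the big-endian hexadecimal representation of the character in UTF-16.

Table 3-2 Delimiters for DelimitedTextInputFormat

Delimiter Type	Property	Possible Values	Default
Field terminator	oracle.hadoop.loader.input.fieldTerminator	A single character, or \uHHHH	, (comma)
Initial field encloser	oracle.hadoop.loader.input.initialFieldEncloser	A single character, \uHHHH, or none	No default
Trailing field encloser	oracle.hadoop.loader.input.trailingFieldEncloser	A single character, \uHHHH, or none	Value of the initial field encloser

If you do not set the trailing field encloser, the parser uses the value of the initial field encloser. This includes the empty string, which means that no enclosers are used.

If you do not set the initial field encloser, do not set the trailing field encloser either.

If no field enclosers are set, the EnclosedToken nonterminal is effectively removed from the preceding grammar. If you set the initial field encloser, the parser tries to read each field as an EnclosedToken before reading it as an UnenclosedToken.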
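As a rough illustration of the value forms in Table 3-2, the following sketch converts a property value into a delimiter character. The class and method names are hypothetical, and the handling of malformed values is an assumption; the source does not describe how the loader reports them.

```java
// Illustrative decoder for delimiter property values; names are hypothetical.
final class DelimiterSpecSketch {

    // Returns the delimiter, or null when the value is unset or empty (no encloser).
    static Character parse(String value) {
        if (value == null || value.isEmpty()) return null;
        if (value.length() == 1) return value.charAt(0);
        if (value.length() == 6 && value.startsWith("\\u")) {
            // HHHH is the UTF-16 code unit in hexadecimal.
            return (char) Integer.parseInt(value.substring(2), 16);
        }
        throw new IllegalArgumentException("Expected one character or \\uHHHH: " + value);
    }

    public static void main(String[] args) {
        System.out.println((int) parse("\\u0009")); // code unit of the tab character
        System.out.println(parse(","));
    }
}
```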

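DelimitedTextInputFormat reads the field names as a comma-separated list from the oracle.hadoop.loader.input.fieldNames configuration property. If parsing a line produces more tokens (fields) than there are field names, the extra tokens are discarded. If a line has fewer tokens than field names, the missing tokens are set to null.

If the oracle.hadoop.loader.input.fieldNames property is not set, the RecordReader of DelimitedTextInputFormat uses F0, F1,…Fn as the field names, where n is the maximum number of tokens per line that the RecordReader has seen so far.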
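The pairing of tokens with field names can be sketched as follows. This is an illustration only: the names are hypothetical, and the sketch assumes that generated names start at F0 with one name per token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative pairing of field names with parsed tokens; names are hypothetical.
final class FieldNamingSketch {

    // Extra tokens are dropped; missing tokens become null.
    static Map<String, String> toRecord(List<String> names, List<String> tokens) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            record.put(names.get(i), i < tokens.size() ? tokens.get(i) : null);
        }
        return record;
    }

    // Generated names used when fieldNames is not set: F0, F1, ... (one per token).
    static List<String> defaultNames(int tokenCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < tokenCount; i++) names.add("F" + i);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(toRecord(List.of("A", "B", "C"), List.of("1", "2")));
        System.out.println(defaultNames(3));
    }
}
```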

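3.2.1.3 RegexInputFormat

RegexInputFormat can be thought of as a generalization of DelimitedTextInputFormat. It is useful for data that the DelimitedTextInputFormat of Oracle Loader for Hadoop cannot handle, for example a web log in which one field is delimited by quotation marks and another by square brackets. Records must still be separated by newline characters, but the fields in each line of text are identified using the regular expression pattern-matching engine of java.util.regex. For more information about java.util.regex, see the Java Platform Standard Edition 6 Javadoc at

http://docs.oracle.com/javase/6/docs/api/java/util/regex/package-summary.html

Table 3-3 describes the properties that control the parsing behavior of RegexInputFormat.

Table 3-3 Properties That Control Pattern Matching in RegexInputFormat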

Property	Description and Possible Values	Default
`oracle.hadoop.loader.input.regexPattern`	Sets the regular expression pattern. For the specification and a description of the possible values, see the Java Platform Standard Edition 6 Javadoc at `http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html`	(empty string)
`oracle.hadoop.loader.input.regexCaseInsensitive`	Specifies whether matching ignores case. Possible values: `true` `false`	`false`

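Note that the regular expression must match each line of text in its entirety. For example, the correct pattern for the input line "a,b,c," is "([^,]*),([^,]*),([^,]*)," and not "([^,]*),". This is because the regular expression is not applied repeatedly to a line of input text.

RegexInputFormat uses the capturing groups of the regular expression match as the fields. The special group zero is ignored because it stands for the entire input line. Capturing groups are described at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#cg.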
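The whole-line requirement and the use of capturing groups can be tried directly with java.util.regex. The sketch below is an illustration, not the RegexInputFormat implementation; the class and method names are hypothetical, and returning null for a non-matching line is just this sketch's way of signaling a record that would be rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative use of capturing groups as fields; names are hypothetical.
final class RegexFieldsSketch {

    // Returns groups 1..n, or null if the pattern does not match the entire line.
    static List<String> fields(String regex, String line, boolean caseInsensitive) {
        Pattern pattern = Pattern.compile(regex, caseInsensitive ? Pattern.CASE_INSENSITIVE : 0);
        Matcher m = pattern.matcher(line);
        if (!m.matches()) return null;       // matches() requires a whole-line match
        List<String> out = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) out.add(m.group(g)); // group 0 is skipped
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("([^,]*),([^,]*),([^,]*),", "a,b,c,", false));
        System.out.println(fields("([^,]*),", "a,b,c,", false));
    }
}
```

The first call returns the three fields a, b, and c; the second returns null because the shorter pattern matches only the first field and not the whole line.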

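If the oracle.hadoop.loader.input.fieldNames property is set, RegexInputFormat interprets it as a comma-separated list of field names. In this case, the number of field names determines the number of fields in each record: extra fields are discarded, and missing trailing fields are set to null.

If the oracle.hadoop.loader.input.fieldNames property is not set, the RecordReader of RegexInputFormat uses F0, F1,…Fn as the field names, where n is the maximum number of fields per line that the RecordReader has seen so far.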

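3.2.1.4 AvroInputFormat

AvroInputFormat is an InputFormat class for standard Avro data files that contain Avro records.

To process only files with the .avro extension, append *.avro to the directories specified in the mapred.input.dir configuration property.

3.2.1.5 KVAvroInputFormat

Oracle NoSQL Database 11g Release 2 includes a class named oracle.kv.hadoop.KVAvroInputFormat. Oracle Loader for Hadoop can use this class to read data directly from Oracle NoSQL Database. The class is a subclass of org.apache.hadoop.mapreduce.InputFormat<oracle.kv.Key, org.apache.avro.generic.IndexedRecord> and can be specified as the value of the mapreduce.inputformat.class property. It lets Oracle Loader for Hadoop read Avro values stored in Oracle NoSQL Database.

The KVAvroInputFormat class passes the value part of the key-value pairs of Oracle NoSQL Database Avro records directly to Oracle Loader for Hadoop, but it does not pass the keys of the Oracle NoSQL Database records. The record key is available as the MapReduce key in the Oracle Loader for Hadoop MapReduce job, but it is not available as a field that can be specified in an Oracle Loader for Hadoop mapping. If Oracle Loader for Hadoop must access the keys of Oracle NoSQL Database records (for example, to store them in the target table), you can use the oracle.kv.formatterClass property in the Oracle Loader for Hadoop configuration file to specify a class that implements oracle.kv.hadoop.AvroFormatter.

See Also:

The Javadoc for the KVInputFormatBase class at

http://docs.oracle.com/cd/NOSQL/html/index.html

3.2.2 Creating the loaderMap Document

Oracle Loader for Hadoop loads data into a single database table, which is called the target table. You can specify the target table, the columns to load, how input fields map to database columns, and the date format in the following ways:

To load all columns of the database table when the input field names exactly match the database column names, use the oracle.hadoop.loader.targetTable configuration property to name the target table. For each database column, the loader uses the column name to find an input field with the same name and then loads the value of that field into the column.
To load a subset of the target table columns, or to create explicit mappings when the input field names do not exactly match the database column names, create a loaderMap document that specifies the target table, the columns, and how input fields map to database columns. Specify the location of the loaderMap document with the oracle.hadoop.loader.loaderMapFile configuration property.
To specify a default date format that applies to all date input fields, use the oracle.hadoop.loader.defaultDateFormat configuration property. With a loaderMap document, you can specify different date formats for different date input fields.

See Also:

"Target Table Characteristics"
"Loader Map XML Schema Definition" for the contents of the XML schema definition (XSD) document

3.2.2.1 Example loaderMap Document

The following example loaderMap document lists the columns of the HR.EMPLOYEES table to load. It also contains the mapping between input data field names and table column names, and it specifies the input data format used for the HIRE_DATE column.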

<?xml version="1.0" encoding="UTF-8"?>
<LOADER_MAP>
<SCHEMA>HR</SCHEMA>
<TABLE>EMPLOYEES</TABLE>
<COLUMN field="empId">EMPLOYEE_ID</COLUMN>
<COLUMN field="lastName">LAST_NAME</COLUMN>
<COLUMN field="email">EMAIL</COLUMN>
<COLUMN field="hireDate" format="MM-dd-yyyy">HIRE_DATE</COLUMN>
<COLUMN field="jobId">JOB_ID</COLUMN>
</LOADER_MAP>

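Note:

If all columns of the target table are used in the load and the input data field names in the IndexedRecord input object exactly match the column names, then a loaderMap file is not required unless the table has columns of the DATE data type. Input fields that map to DATE columns are parsed using the default Java date format. If the input uses a different format, you must create a loaderMap document and use the format attribute to specify the Java date format string to use when parsing the input values.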
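To check a format string such as the MM-dd-yyyy value used in the example loaderMap document before running a job, a small stand-alone test can help. The sketch below assumes that the format attribute follows java.text.SimpleDateFormat pattern syntax; the source says only "Java date format string", so treat that as an assumption. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative check of a date format string; names are hypothetical.
final class DateFormatSketch {

    // Parses value with the given pattern and returns it as yyyy-MM-dd, or null if it does not parse.
    static String toIso(String value, String pattern) {
        SimpleDateFormat in = new SimpleDateFormat(pattern);
        in.setLenient(false); // reject values such as month 13
        try {
            Date parsed = in.parse(value);
            return new SimpleDateFormat("yyyy-MM-dd").format(parsed);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toIso("06-17-2003", "MM-dd-yyyy"));
        System.out.println(toIso("not-a-date", "MM-dd-yyyy"));
    }
}
```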

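3.2.3 Accessing Table Metadata

Oracle Loader for Hadoop uses the metadata of the Oracle Database table to control the execution of the loader job. If a JDBC connection can be established, the loader fetches the metadata automatically. Sometimes the loader job cannot access the database, for example when the Hadoop cluster is on a different network from the database. In that case, use the OraLoaderMetadata utility program to extract the table metadata from the database into an XML document, and transfer the metadata document to the Hadoop cluster. Use the oracle.hadoop.loader.tableMetadataFile configuration property to specify the location of the metadata document. When the loader job runs, it reads this document to find all the metadata it needs about the target table.

3.2.3.1 Running the OraLoaderMetadata Utility

To run the OraLoaderMetadata Java utility, add the following JAR files to the CLASSPATH variable:

$OLH_HOME/jlib/oraloader.jar
$OLH_HOME/jlib/ojdbc6.jar
$OLH_HOME/jlib/oraclepki.jar

Note:

The oraclepki.jar library is required only if you connect to the database using credentials stored in an Oracle wallet.

Run the following command:

java oracle.hadoop.loader.metadata.OraLoaderMetadata \
-user <username> -connection_url <connection URL> [-schema <schemaName>] \
-table <tableName> -output <output filename>

OraLoaderMetadata Parameters

The OraLoaderMetadata utility accepts the following parameters:

-user is the Oracle Database user name. The user is prompted for a password.
-connection_url is the connection URL for connecting to Oracle Database.
-schema is the name of the schema that contains the target table. If it is not specified, the target table is assumed to be in the schema of the user specified in the connection URL.
-table is the name of the target table.
-output is the name of the output file in which the metadata document is stored.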

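3.2.4 Invoking OraLoader

OraLoader is a Hadoop job that you run using the standard Hadoop tools. OraLoader implements the org.apache.hadoop.util.Tool interface and follows the standard Hadoop way of building MapReduce applications. OraLoader performs the following actions:

Reads and checks the input configuration parameters.
Retrieves and checks the table and column metadata of the target table. If a JDBC connection can be established, the metadata is retrieved from the database. Otherwise, the loader looks for the metadata stored at the location specified by the oracle.hadoop.loader.tableMetadataFile property.
Prepares internal configuration information for the OraLoader MapReduce tasks, and stores the table metadata and the dependent Java libraries in the distributed cache so that they are available to the map and reduce tasks throughout the cluster.
Submits the MapReduce job to Hadoop.
After the map and reduce tasks complete, consolidates the reporting information from the individual tasks into a common log file for the job. The log file is written to the _olh subdirectory of the job output directory and is named oraloader-report.txt.

OraLoader is invoked from the command line and accepts the generic command-line options. An example invocation follows:

HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OLH_HOME/jlib/*"

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
-conf MyConf.xml

If you use one of the example input formats listed in Table 3-1, you must include the -libjars option when invoking OraLoader:

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
-conf MyConf.xml -libjars $OLH_HOME/jlib/oraloader-examples.jar

If you use the KVAvroInputFormat of Oracle NoSQL Database 11g Release 2, you must include $KVHOME/lib/kvstore-2.0.22.jar in HADOOP_CLASSPATH and include the -libjars option when invoking OraLoader:

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
-conf MyConf.xml -libjars $KVHOME/lib/kvstore-2.0.22.jar

See Also:

The Apache Hadoop documentation for the location of the Hadoop executable and for setting the HADOOP_CLASSPATH variable
The Javadoc for the generic options on the Apache site (http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html)

3.2.5 Loading Files into Oracle Database (Offline Loads Only)

For offline loads, Oracle Loader for Hadoop generates files that are copied to the database server and loaded into Oracle Database. The following section describes the available offline load method.

3.2.5.1 Loading from Delimited Text Files into Oracle Database

After copying the delimited text files to the database system, invoke SQL*Loader with the generated control files to load the data from the delimited text files into the database. Alternatively, you can use the generated SQL scripts to load the data through external tables. See "Delimited Text Output".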

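3.3 Output Modes During OraLoader Invocation

This section describes the following output options:

JDBC Output
Oracle OCI Direct Path Output
Delimited Text Output
Oracle Data Pump Output

3.3.1 JDBC Output

JDBC is an output option of the online database mode. The output records of the loader job are loaded directly into the target table by the map or reduce tasks as part of the OraLoader process. No additional steps are required to load the data. This output option requires a JDBC connection between the Hadoop system and Oracle Database.

The JDBC output option uses standard JDBC batching to improve performance and efficiency. If an error such as a constraint violation occurs while a batch is being executed, the JDBC driver stops execution at the first error. Thus, if a batch has 100 rows and an error occurs at row 10, then 9 rows are inserted and 91 rows are not. In addition, the JDBC driver does not provide information that identifies the row that caused the error. In this case, Oracle Loader for Hadoop does not know the insert status of any row in the batch. It treats all rows in the batch as being in question and continues with the next batch. At the end of the job, it generates a load report that shows the number of batch errors and the number of rows whose insert status is in question. One way to handle this problem is to use a unique key on the data. After the data is loaded, enable the key and find the missing key values. Once you know why the load failed, you must locate the missing rows in the input files and reload them.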
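The bookkeeping described above can be illustrated with a small simulation. This sketch does not use JDBC and is not how the loader is implemented; it only reproduces the arithmetic under the stated behavior (a failing batch puts all of its rows in question). The class and method names are hypothetical.

```java
// Illustrative tally of batch errors and rows in question; names are hypothetical.
final class BatchReportSketch {

    // failingRows holds 1-based row numbers that would cause an error.
    // Returns {number of batch errors, number of rows whose insert status is in question}.
    static int[] tally(int totalRows, int batchSize, int... failingRows) {
        int batchErrors = 0;
        int rowsInQuestion = 0;
        for (int start = 1; start <= totalRows; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalRows);
            boolean failed = false;
            for (int row : failingRows) {
                if (row >= start && row <= end) failed = true;
            }
            if (failed) {
                batchErrors++;
                rowsInQuestion += end - start + 1; // every row of a failed batch is in question
            }
        }
        return new int[] { batchErrors, rowsInQuestion };
    }

    public static void main(String[] args) {
        int[] report = tally(250, 100, 10);
        System.out.println(report[0] + " batch error(s), " + report[1] + " row(s) in question");
    }
}
```

For 250 rows in batches of 100 with an error at row 10, the sketch reports one batch error and 100 rows in question, even though 9 of those rows were actually inserted.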

To select the JDBC output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
</property>

A configuration property relevant to JDBC output is oracle.hadoop.loader.connection.defaultExecuteBatch, which controls the batch size.

To optimize data movement over InfiniBand between Big Data Appliance (BDA) and Exadata, you can use Sockets Direct Protocol (SDP). To specify the SDP protocol:

Add JVM options to enable JDBC SDP export:
```
HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
```

Use a connection URL with the SDP protocol. Port 1522 is intentional here, because it points to the SDP-enabled listener.

<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
   (HOST=example.com) (PORT=1522))
   (CONNECT_DATA=(SERVICE_NAME=example_service)))</value>
</property>

See Also:

The Big Data Appliance (BDA) documentation for details about configuring BDA and Exadata for the best performance over InfiniBand

3.3.2 Oracle OCI Direct Path Output

The Oracle OCI Direct Path output format is available in online database mode. This output format uses the OCI Direct Path interface to load rows into the target table. Each reducer loads into a different database partition, which enables parallel direct path loads.

To select the Oracle OCI Direct Path output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
</property>

The size of the direct path stream buffer is controlled by the following property:

<property>
  <name>oracle.hadoop.loader.output.dirpathBufsize</name>
  <value>131072</value>
  <description>
   This property is used to set the size, in bytes, of the direct path 
   stream buffer for OCIOutputFormat. If needed, values are rounded 
   up to the next nearest multiple of 8k.
  </description>
</property>
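The rounding mentioned in the property description can be worked through as follows. This is a sketch of the arithmetic only, assuming that "8k" means 8192 bytes; the class and method names are hypothetical.

```java
// Illustrative rounding of a buffer size to a multiple of 8 KB; names are hypothetical.
final class DirpathBufsizeSketch {
    static final int STEP = 8 * 1024; // assumption: "8k" means 8192 bytes

    // Rounds bytes up to the next multiple of STEP; exact multiples are unchanged.
    static int roundUp(int bytes) {
        return ((bytes + STEP - 1) / STEP) * STEP;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(131072)); // the default is already a multiple of 8192
        System.out.println(roundUp(100000));
    }
}
```

Under that assumption, the default of 131072 is unchanged, and a value of 100000 would become 106496.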

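As with JDBC output, you can use Sockets Direct Protocol (SDP) to optimize data movement over InfiniBand between Big Data Appliance (BDA) and Exadata. To specify the SDP protocol:

Add JVM options to enable JDBC SDP export:
```
HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
```

Use a connection URL with the SDP protocol. Port 1522 is intentional here, because it points to the SDP-enabled listener.

<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
   (HOST=example.com) (PORT=1522))
   (CONNECT_DATA=(SERVICE_NAME=example_service)))</value>
</property>

See Also:

The Big Data Appliance (BDA) documentation for details about configuring BDA and Exadata for the best performance over InfiniBand

The Oracle OCI Direct Path output format has the following restrictions:

It is available only on the Linux x86-64 platform.
The load target table must be partitioned.
The number of reducers must be greater than zero.
OCI Direct Path output cannot load a composite interval partitioned table in which the subpartitioning key contains a CHAR, VARCHAR2, NCHAR, or NVARCHAR2 column. The loader checks for this condition and stops with an error if the target table meets it. Composite interval partitions in which the subpartitioning key does not contain a character-type column are supported.

The Oracle OCI Direct Path output format requires the following configuration steps, which enable the loader to find the C shared libraries that implement the output format. These libraries are distributed automatically to the compute nodes using the Hadoop distributed cache mechanism.

Create an environment variable named JAVA_LIBRARY_PATH that points to the $OLH_HOME/lib directory. This environment variable is needed only on the node where the job is submitted. When the job is created, the $HADOOP_HOME/bin/hadoop command of CDH automatically inserts the value of this variable into the Java system property java.library.path. For the Apache Hadoop distribution, you must edit the $HADOOP_HOME/bin/hadoop command so that the new value is concatenated with the existing value. The Apache hadoop command starts with an empty JAVA_LIBRARY_PATH value and does not import the value from the environment.
Add $OLH_HOME/lib to the LD_LIBRARY_PATH variable on the client where the loader job is submitted.

3.3.3 Delimited Text Output

Delimited text is an output option of the offline database mode. The map or reduce tasks generate comma-separated values (CSV) files or other delimited text files. These files are then loaded into the target table using SQL*Loader or external tables.

To select the Delimited Text output format, set the following Hadoop property: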

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat</value>
</property>

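Each output task generates a delimited text file and a SQL*Loader control file (see Example 3-1). If the table is not partitioned, or if oracle.hadoop.loader.loadByPartition=false, then a single SQL*Loader control file is generated for the entire job. In addition, one SQL script is generated for loading the delimited text files into the target table.

Delimited text file names follow this template:

oraloader-${taskId}-csv-${partitionId}.dat

SQL*Loader control file names follow this template:

oraloader-${taskId}-csv-${partitionId}.ctl

The template parameters are defined as follows:

${taskId}: The mapper (reducer) ID

${partitionId}: The partition identifier

If the table is not partitioned, or if oracle.hadoop.loader.loadByPartition=false, then a single SQL*Loader control file named oraloader-csv.ctl is generated.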
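When scripting the copy of output files to the database server, it can help to build the expected file names from the templates. The following sketch simply substitutes the two template parameters; the class and method names are hypothetical, and the task ID and partition ID values shown are made-up placeholders, since the source does not specify their exact format.

```java
// Illustrative expansion of the file name templates; names and sample IDs are hypothetical.
final class OutputFileNameSketch {

    // ext is "dat" for the delimited text file or "ctl" for the SQL*Loader control file.
    static String name(String taskId, String partitionId, String ext) {
        return "oraloader-" + taskId + "-csv-" + partitionId + "." + ext;
    }

    public static void main(String[] args) {
        System.out.println(name("00001", "0", "dat"));
        System.out.println(name("00001", "0", "ctl"));
    }
}
```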

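The SQL script for loading through an external table is named oraloader-csv.sql.

The format of the records and fields in the delimited text files is controlled by the following properties:

oracle.hadoop.loader.output.fieldTerminator: Identifies the single character that separates fields.
oracle.hadoop.loader.output.initialFieldEncloser: Identifies the start of a field. When set, fields are always enclosed between this character and the trailingFieldEncloser character.
oracle.hadoop.loader.output.trailingFieldEncloser: If not set, it is assumed to have the value of initialFieldEncloser.
oracle.hadoop.loader.output.escapeEnclosers: Specifies whether embedded trailing field enclosers are escaped.

Example 3-1 shows a sample SQL*Loader control file generated by an output task.

Example 3-1 Sample SQL*Loader Control File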

LOAD DATA CHARACTERSET AL32UTF8
INFILE 'oraloader-csv-1-0.dat'
BADFILE 'oraloader-csv-1-0.bad'
DISCARDFILE 'oraloader-csv-1-0.dsc'
INTO TABLE "SCOTT"."CSV_PART" PARTITION(10) APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
"ID"      DECIMAL EXTERNAL,
"NAME"    CHAR,
"DOB"     DATE 'SYYYY-MM-DD HH24:MI:SS'
)

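3.3.4 Oracle Data Pump Output

The Oracle Data Pump output format is available in offline database mode. The loader generates binary files that are loaded into the target table using an external table and the ORACLE_DATAPUMP access driver. The output files must be copied from the HDFS file system to a local file system that Oracle Database can access.

To select the Oracle Data Pump output format, set the following Hadoop property: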

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
</property>

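Oracle Data Pump output file names follow this template:

oraloader-${taskId}-dp-${partitionId}.dat

Oracle Loader for Hadoop also generates a SQL file that contains commands to perform the following tasks:

Create an external table definition that uses the ORACLE_DATAPUMP access driver. The binary Oracle Data Pump output files are listed in the LOCATION clause of the external table.

Create the directory object used by the external table. This command must be uncommented before it is used. To specify the directory name that is generated in the SQL file, set the following property: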

<property>
  <name>oracle.hadoop.loader.extTabDirectoryName</name>
  <value>OLH_EXTTAB_DIR</value>
  <description>
   The name of the Oracle directory object for the external table's
   LOCATION data files. This property applies only to the CSV and 
   DataPump output formats.
  </description>
</property>

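Insert the rows from the external table into the target table. This command must be uncommented before it is used.

The SQL file is placed in the _olh directory under the output directory of the job.

See Also:

Oracle Database Administrator's Guide for more information about creating and managing external tables
Oracle Database Utilities for more information about the ORACLE_DATAPUMP access driver

3.4 Error Handling and Diagnostics

Oracle Loader for Hadoop rejects input records for a variety of reasons, such as:

An incorrect loaderMap file
Missing fields
Records mapped to invalid partitions of the table
Badly formed records (for example, a field that does not match its date format, or a record that does not match the regular expression pattern)

For instructions on explicitly logging the records that cause problems, along with the error messages, see "Logging Rejected Records in Bad Files".

For instructions on configuring Oracle Loader for Hadoop to stop a job early when it encounters too many errors, see "Setting a Job Reject Limit".

3.4.1 Logging Rejected Records in Bad Files

By default, Oracle Loader for Hadoop does not log rejected records in the Hadoop logs; it logs only information that identifies them. This prevents sensitive user information from being stored in Hadoop logs that are scattered throughout the cluster. For details about how rejected records are identified, see the discussion of getCurrentKey() in "Implementing InputFormat".

To direct Oracle Loader for Hadoop to log rejected records, set the oracle.hadoop.loader.logBadRecords configuration property to true (the default is false). When this property is true, Oracle Loader for Hadoop logs bad records in one or more "bad" files in the _olh/ directory under the output directory of the job.

This property applies to records rejected by the input format and the mapper. It does not apply to errors encountered by the sampling feature (see "Balancing Loads When Loading Data into Partitioned Tables") or by the output formats (see "JDBC Output" and "Oracle OCI Direct Path Output").

3.4.2 Setting a Job Reject Limit

Certain problems, such as using an incorrect loaderMap file, can cause Oracle Loader for Hadoop to reject every record in the input. To limit the time wasted by such problems, Oracle Loader for Hadoop aborts the job when more than 1000 records are rejected. To change the maximum number of rejected records allowed, use the oracle.hadoop.loader.rejectLimit configuration property. Setting this property to a negative value turns off the reject limit, so that the job runs to completion regardless of the number of rejected records.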
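The reject-limit rule can be expressed as a one-line check. The sketch below mirrors only the behavior stated above (abort when the count exceeds the limit, no abort for a negative limit); the class and method names are hypothetical and this is not the loader's own code.

```java
// Illustrative reject-limit check; names are hypothetical.
final class RejectLimitSketch {
    static final int DEFAULT_LIMIT = 1000; // default stated in the text

    // True when the job should be aborted. A negative limit disables the check.
    static boolean shouldAbort(long rejectedRecords, int rejectLimit) {
        return rejectLimit >= 0 && rejectedRecords > rejectLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldAbort(1000, DEFAULT_LIMIT));
        System.out.println(shouldAbort(1001, DEFAULT_LIMIT));
        System.out.println(shouldAbort(5_000_000, -1));
    }
}
```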

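Note that input format errors do not count toward the reject limit, because they are fatal and cause the map task to abort. Errors encountered by the sampling feature (see "Balancing Loads When Loading Data into Partitioned Tables") or by the output formats (see "JDBC Output" and "Oracle OCI Direct Path Output") do not count toward the reject limit either.

3.5 Balancing Loads When Loading Data into Partitioned Tables

To balance the load across reducers when loading data into a partitioned database table, use the sampling feature of Oracle Loader for Hadoop.

The execution time of a reducer is usually proportional to the number of records that it processes: the more records, the longer the execution time. When the sampling feature is disabled, all records for a given database partition are sent to one reducer. Because database partitions can have different numbers of records, this can result in unevenly loaded reducers. The execution time of a Hadoop job is the execution time of its slowest reducer, so unevenly loaded reducers slow down the entire job.

Dividing the records evenly among the reducers balances the reducer load, but it does not necessarily group the records by database partition before they are inserted into the database.

The sampling feature of Oracle Loader for Hadoop generates an efficient MapReduce partitioning scheme that balances the reducer load and, at the same time, groups the records by database partition.

3.5.1 Using the Sampling Feature

To enable the sampling feature, set the oracle.hadoop.loader.sampler.enableSampling configuration property to true.

Even if the enableSampling property is set to true, the loader automatically disables the sampling feature if it determines that sampling is unnecessary or that a good sample cannot be created. For example, sampling is disabled automatically if the table is not partitioned, if the number of reducer tasks is less than two, or if there is too little input data to compute a good load balance. In these cases, the loader returns an informational message.

Note:

The sampler is multithreaded, and each sampler thread instantiates its own copy of the specified InputFormat class. Any new InputFormat implementation that you provide to Oracle Loader for Hadoop must ensure that static, mutable data structures are synchronized for access by multiple threads.

The sampler might return an out-of-memory error on the client node where the loader job is submitted. This can happen if the input splits returned by the InputFormat do not fit in memory.

Possible solutions to this problem are:

Increase the heap size of the JVM where the job is submitted.
Adjust the following properties:
```
oracle.hadoop.loader.sampler.hintMaxSplitSize
oracle.hadoop.loader.sampler.hintNumMapTasks
```
For descriptions of these properties, see "Configuration Properties XML Document".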

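3.5.2 Tuning Load Balancing and Sampling Behavior

Oracle Loader for Hadoop provides properties that you can use to tune the load balancing and sampling behavior. These properties are summarized in "Key Configuration Properties for the Load Balancing Feature".

3.5.2.1 Properties for Tuning Load Balancing

The goal of load balancing is to generate a MapReduce partitioning scheme that assigns approximately the same amount of work to all reducers. This scheme is used in the partitioning step when the Oracle Loader for Hadoop job runs.

Two properties, maxLoadFactor and loadCI, control the quality of load balancing. The sampler uses the expected reducer load factor to evaluate the quality of its partitioning scheme. The load factor is a metric that indicates how much a reducer's load deviates from a perfectly balanced reducer load. A load factor of 1 represents a perfectly balanced load (no overload).

Small load factors indicate good load balancing. The maxLoadFactor default of 0.05 means that no reducer is overloaded by more than 5%, that is, no reducer's load factor exceeds 1 + maxLoadFactor = 1.05. The sampler guarantees this maxLoadFactor with a statistical confidence level determined by the value of loadCI. The default value of loadCI is 0.95, which means that a reducer's load factor exceeds the limit in only 5% of the cases.

There is a trade-off between the execution time of the sampler and the quality of load balancing. Lower values of maxLoadFactor and higher values of loadCI achieve more balanced reducer loads at the expense of longer sampling times. The default values of maxLoadFactor=0.05 and loadCI=0.95 provide a good balance between load balancing quality and execution time.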
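To make the load factor concrete, the following sketch computes it for a given assignment of records to reducers. The source does not give the exact formula, so this assumes that the load factor is the busiest reducer's record count divided by the ideal, perfectly even share; the class and method names are hypothetical.

```java
// Illustrative load factor calculation under the stated assumption; names are hypothetical.
final class LoadFactorSketch {

    // Ratio of the busiest reducer's load to the ideal, perfectly even load.
    // Expects at least one reducer and a positive total record count.
    static double loadFactor(long[] recordsPerReducer) {
        long total = 0;
        long max = 0;
        for (long r : recordsPerReducer) {
            total += r;
            max = Math.max(max, r);
        }
        double ideal = (double) total / recordsPerReducer.length;
        return max / ideal;
    }

    // True when no reducer is overloaded by more than maxLoadFactor.
    static boolean acceptable(long[] recordsPerReducer, double maxLoadFactor) {
        return loadFactor(recordsPerReducer) <= 1.0 + maxLoadFactor;
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(new long[] { 100, 100, 100 }));
        System.out.println(loadFactor(new long[] { 150, 75, 75 }));
        System.out.println(acceptable(new long[] { 104, 98, 98 }, 0.05));
    }
}
```

Under this assumption, three reducers with 150, 75, and 75 records have a load factor of 1.5, which is well above the 1.05 allowed by the default maxLoadFactor.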

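3.5.2.2 Properties for Tuning Sampling Behavior

By default, the sampler runs until it has collected enough samples to generate a partitioning scheme that satisfies the maxLoadFactor and loadCI criteria.

However, you can limit the sampler's running time with the maxSamplesPct property, which specifies the maximum number of records after which the sampler stops sampling.

3.5.3 Does Oracle Loader for Hadoop Always Use the Sampler's Partitioning Scheme?

Oracle Loader for Hadoop uses the generated partitioning scheme only if sampling is successful. Sampling is successful if it generates a partitioning scheme with a maximum reducer load factor of (1 + maxLoadFactor), guaranteed at a statistical confidence level of loadCI. With the default values of maxLoadFactor, loadCI, and maxSamplesPct, the sampler can successfully generate high-quality partitioning schemes for a variety of input data distributions. However, the sampler might fail to generate a partitioning scheme that meets the constraints, for example if the constraints are too strict or if the number of samples required exceeds the user-specified maximum, maxSamplesPct. In such cases, Oracle Loader for Hadoop prints a log message stating that there were not enough samples. It then falls back to its default of partitioning the records by database partition, which does not guarantee load balancing (see "Tuning Load Balancing and Sampling Behavior").

An alternative is to relax the values of the configuration properties. You can do this by increasing maxSamplesPct, by increasing maxLoadFactor, by decreasing loadCI, or by a combination of these.

3.5.4 What Happens When a Sampling Feature Property Has an Invalid Value?

If a configuration property of the sampling feature is set to a value outside its accepted range, no exception is returned. Instead, the sampler prints a warning message, resets the property to its default value, and continues executing.

3.5.5 Key Configuration Properties for the Load Balancing Feature

The following are the key properties that you can use to tune the sampling behavior. For more details about these properties, see "Configuration Properties XML Document".

Hadoop Sampling Configuration Properties

oracle.hadoop.loader.sampler.maxSamplesPct

Type: Float

Default: 0.01

Accepted range: [0, 1]

A value less than or equal to 0 disables this property.

Description: The maximum sample size, as a fraction of the number of records in the input data. A value of 0.05 indicates that the sampler never samples more than 5% of the total number of records. The sampler might collect fewer samples than this.

oracle.hadoop.loader.sampler.maxLoadFactor

Type: Float

Default: 0.05

Accepted range: >0

A value less than or equal to 0 resets the property to its default.

Description: The maximum acceptable load factor for a reducer's workload.

oracle.hadoop.loader.sampler.loadCI

Type: Float

Default: 0.95

Accepted range: >=0.5 and <1

The recommended values are >=0.9. A value <0.5 resets the property to its default.

Description: The statistical confidence level for the maximum reducer load factor. Commonly used values other than the default are 0.90 and 0.99.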
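The reset-to-default behavior for out-of-range values can be sketched as follows. This is an illustration based only on the ranges and defaults listed above; the class and method names are hypothetical, and the warning text is made up.

```java
// Illustrative validation of sampler properties; names and messages are hypothetical.
final class SamplerPropertySketch {
    static final double DEFAULT_MAX_LOAD_FACTOR = 0.05;
    static final double DEFAULT_LOAD_CI = 0.95;

    // Accepted range: > 0. Anything else falls back to the default.
    static double maxLoadFactor(double value) {
        if (value > 0) return value;
        System.err.println("Warning: invalid maxLoadFactor " + value + ", using default");
        return DEFAULT_MAX_LOAD_FACTOR;
    }

    // Accepted range: >= 0.5 and < 1. Anything else falls back to the default.
    static double loadCI(double value) {
        if (value >= 0.5 && value < 1) return value;
        System.err.println("Warning: invalid loadCI " + value + ", using default");
        return DEFAULT_LOAD_CI;
    }

    public static void main(String[] args) {
        System.out.println(maxLoadFactor(-1));
        System.out.println(loadCI(0.99));
        System.out.println(loadCI(0.3));
    }
}
```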

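3.6 OraLoader Configuration Properties

OraLoader uses the standard Hadoop methods for specifying configuration properties. You can specify these properties in a configuration file or by using the -D property=value option of GenericOptionsParser and ToolRunner.

The key job configuration properties of Oracle Loader for Hadoop are described briefly below. For a complete list and description of all configuration properties, see the oraloader-conf.xml document in "Configuration Properties XML Document".

3.6.1 Key Job Configuration Properties

This section describes the key job configuration properties of Oracle Loader for Hadoop.

oracle.hadoop.loader.targetTable

Type: String

Default: Not defined.

Description: The schema-qualified name of the table to load. Use this option to indicate that all columns of the table are loaded and that the names of the input fields match the column names. This property takes precedence over the oracle.hadoop.loader.loaderMapFile property. If the table name is not schema-qualified, Oracle Loader for Hadoop uses the schema of the connection user.

oracle.hadoop.loader.loaderMapFile

Type: String

Default: Not defined.

Description: The path to the loader map file.

oracle.hadoop.loader.tableMetadataFile

Type: String

Default: Not defined.

Description: The path to the metadata file of the target table. Use this option when running in disconnected mode. The table metadata file is created by running the OraLoaderMetadata utility.

oracle.hadoop.loader.olhcachePath

Type: String

Default: ${mapred.output.dir}/../olhcache

Description: The path to a directory where Oracle Loader for Hadoop can create files that are loaded into the DistributedCache. In distributed mode, the value must be an HDFS path.

oracle.hadoop.loader.extTabDirectoryName

Type: String

Default: OLH_EXTTAB_DIR

Description: The name of the database directory object for the LOCATION data files of the external table. This property applies only to the Delimited Text and Data Pump output formats.

oracle.hadoop.loader.sampler.enableSampling

Type: Boolean

Default: true

Description: Indicates whether the sampling feature is enabled.

oracle.hadoop.loader.enableSorting

Type: Boolean

Default: true

Description: Indicates whether the output records within each reducer group are sorted by the primary key of the table.

oracle.hadoop.loader.connection.url

Type: String

Default: Not defined.

Description: Specifies the URL of the database connection string. This property takes precedence over all other connection properties. If an Oracle wallet is configured as an external password store, the property value must start with the driver prefix jdbc:oracle:thin:@, and the db_connect_string must exactly match the credentials defined in the wallet.

oracle.hadoop.loader.connection.user

Type: String

Default: Not defined.

Description: The database login name.

oracle.hadoop.loader.connection.password

Type: String

Default: Not defined.

Description: The password of the connecting user.

oracle.hadoop.loader.connection.wallet_location

Type: String

Default: Not defined.

Description: The file path to the Oracle wallet directory where the connection credentials are stored.

When using an Oracle wallet as an external password store, set the following three properties:

oracle.hadoop.loader.connection.wallet_location
oracle.hadoop.loader.connection.url
oracle.hadoop.loader.connection.tns_admin

Alternatively, set the following three properties:

oracle.hadoop.loader.connection.wallet_location
oracle.hadoop.loader.connection.tnsEntryName
oracle.hadoop.loader.connection.tns_admin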
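A quick pre-flight check of the two wallet property sets might look like the following sketch. It only tests that one of the two combinations listed above is present; the class and method names are hypothetical, the sample values are placeholders, and the loader itself may validate its configuration differently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative check of the wallet-related property combinations; names are hypothetical.
final class WalletConfigSketch {
    static final String PREFIX = "oracle.hadoop.loader.connection.";

    // Small helper to build a configuration map from key/value pairs.
    static Map<String, String> conf(String... keyValues) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) m.put(keyValues[i], keyValues[i + 1]);
        return m;
    }

    private static boolean isSet(Map<String, String> conf, String shortName) {
        String v = conf.get(PREFIX + shortName);
        return v != null && !v.isEmpty();
    }

    // wallet_location and tns_admin are needed in both sets, plus either url or tnsEntryName.
    static boolean walletConfigComplete(Map<String, String> conf) {
        return isSet(conf, "wallet_location") && isSet(conf, "tns_admin")
                && (isSet(conf, "url") || isSet(conf, "tnsEntryName"));
    }

    public static void main(String[] args) {
        System.out.println(walletConfigComplete(conf(
                PREFIX + "wallet_location", "/path/to/wallet",
                PREFIX + "tnsEntryName", "example_entry",
                PREFIX + "tns_admin", "/path/to/tns_admin")));
    }
}
```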

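oracle.hadoop.loader.connection.tnsEntryName

Type: String

Default: Not defined.

Description: Specifies a TNS entry name defined in the tnsnames.ora file. This property is used together with the oracle.hadoop.loader.connection.tns_admin property.

oracle.hadoop.loader.connection.tns_admin

Type: String

Default: Not defined.

Description: The file path to a directory that contains SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. If this property is not defined, the value of the TNS_ADMIN environment variable is used. Define this property if you use a TNS entry name in the database connection string.

You must set this property when using an Oracle wallet as an external password store. See the oracle.hadoop.loader.connection.wallet_location property.

oracle.hadoop.loader.connection.defaultExecuteBatch

Type: Integer

Default: 100

Description: Applies only to the JDBC and OCI Direct Path output formats. The default number of records inserted in a batch for each trip to the database. Specify a value of 1 or greater to override the default. If the specified value is less than 1, this property takes the default value. Although there is no maximum value, very large batch sizes are not recommended, because they increase the memory footprint without much gain in performance.

oracle.hadoop.loader.connection.sessionTimeZone

Type: String

Default: LOCAL

Description: This property is used to change the session time zone of the database connection. Valid values are:

[+|-] hh:mm: The hours and minutes offset from UTC
LOCAL: The default time zone of the JVM
time_zone_region: A valid time zone region

This property also determines the default time zone used when parsing input data that is loaded into TIMESTAMP, TIMESTAMP WITH TIME ZONE, and TIMESTAMP WITH LOCAL TIME ZONE database columns.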
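The three value forms can be illustrated with java.time. This sketch is not how Oracle Loader for Hadoop interprets the property; it uses the JVM's own zone identifiers as a stand-in, which is an assumption, since the set of region names accepted by the database may differ. The class and method names are hypothetical.

```java
import java.time.ZoneId;

// Illustrative mapping of the three value forms to a JVM time zone; names are hypothetical.
final class SessionTimeZoneSketch {

    static ZoneId resolve(String value) {
        if ("LOCAL".equals(value)) {
            return ZoneId.systemDefault();       // default time zone of the JVM
        }
        // Handles offsets such as +05:30 or -08:00 and region names such as America/New_York.
        return ZoneId.of(value.replace(" ", ""));
    }

    public static void main(String[] args) {
        System.out.println(resolve("+05:30"));
        System.out.println(resolve("America/New_York"));
        System.out.println(resolve("LOCAL"));
    }
}
```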

oracle.hadoop.loader.output.dirpathBufsize

型: Integer

デフォルト: 131072

説明: このプロパティは、OCIOutputFormatのダイレクト・パス・ストリーム・バッファのサイズ(バイト)の設定に使用されます。必要に応じて、値は8KBの倍数に切り上げられます。

oracle.hadoop.loader.output.fieldTerminator

型: String

デフォルト: , (カンマ)

説明: DelimitedTextOutputFormatのフィールドを区切る1文字。

代替表記: \uHHHH (HHHHは文字のUTF-16エンコーディング)。

oracle.hadoop.loader.output.initialFieldEncloser

型: String

デフォルト: なし

説明: この値が設定されている場合、指定された文字(デフォルトでこのプロパティの値に設定)と${oracle.hadoop.loader.output.trailingFieldEncloser}でフィールドは常に囲まれます。

この値を設定する場合、1文字または\uHHHH (HHHHは文字のUTF-16エンコーディング)である必要があります。ゼロ長値は囲み文字がないことを表します(デフォルト値)。

${oracle.hadoop.loader.output.initialFieldEncloser}および${oracle.hadoop.loader.output.trailingFieldEncloser}は、両方とも設定しないか、両方とも設定する必要があります。

一部のフィールドにfieldTerminatorが含まれる場合はこれらのプロパティを使用します。一部のフィールドにtrailingFieldEncloserも含まれる場合は、escapeEnclosersプロパティをtrueに設定します。

oracle.hadoop.loader.output.trailingFieldEncloser

型: String

デフォルト: なし

説明: この値が設定されている場合、フィールドは常に${oracle.hadoop.loader.output.initialFieldEncloser}とこのプロパティに指定された文字で囲まれます。

この値が設定されない場合、${oracle.hadoop.loader.output.initialFieldEncloser}の値がかわりに使用されます。

${oracle.hadoop.loader.output.initialFieldEncloser}が設定されない場合、このプロパティを設定しないでください。

oracle.hadoop.loader.output.escapeEnclosers

型: Boolean

デフォルト: false

説明: これがtrueに設定され、開始と終了の両方のフィールド囲み文字が設定されている場合、フィールドが走査され、埋込みの終了囲み文字がエスケープされます。フィールド値に終了囲み文字が含まれている可能性がある場合、このオプションを使用します。

oracle.hadoop.loader.input.fieldTerminator

型: String

デフォルト: , (カンマ)

説明: DelimitedTextInputFormatのフィールドを区切る1文字。

代替表記: \uHHHH (HHHHは文字のUTF-16エンコーディング)。

oracle.hadoop.loader.input.initialFieldEncloser

型: String

デフォルト: なし

説明: この値が設定されている場合、指定された文字(デフォルトでこのプロパティの値に設定)と${oracle.hadoop.loader.input.trailingFieldEncloser}でフィールドを囲むことができます。

この値を設定する場合、1文字または\uHHHH (HHHHは文字のUTF-16エンコーディング)である必要があります。

ゼロ長値は囲み文字がないことを表します(デフォルト値)。

oracle.hadoop.loader.input.trailingFieldEncloser

Type: String

Default: None

Description: When this value is set, fields can be enclosed between ${oracle.hadoop.loader.input.initialFieldEncloser} and the specified character.

${oracle.hadoop.loader.input.initialFieldEncloser} and ${oracle.hadoop.loader.input.trailingFieldEncloser} must either both be set or both be left unset.
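
As a sketch of how these input properties combine, the fragment below configures DelimitedTextInputFormat to read tab-separated fields that may be wrapped in double quotation marks. The choice of tab and double quotation mark is an assumption made for illustration; both characters are written in the \uHHHH form.

```xml
<configuration>
  <!-- \u0009 is the tab character, written in the \uHHHH form -->
  <property>
    <name>oracle.hadoop.loader.input.fieldTerminator</name>
    <value>\u0009</value>
  </property>
  <!-- \u0022 is the double quotation mark; both enclosers are set together -->
  <property>
    <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
    <value>\u0022</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
    <value>\u0022</value>
  </property>
</configuration>
```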

oracle.hadoop.loader.input.fieldNames

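Type: Comma-separated list of strings

Default: F0,F1,F2,...

Description: The names to assign to the input fields. The names are used to create the Avro schema for the record. The strings must be valid JSON name strings.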
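
For illustration, the fragment below names five input fields instead of relying on the default F0,F1,F2,... names. The names are borrowed from the loader map example in section 3.7 and are assumptions about the input data, not values required by the loader.

```xml
<configuration>
  <!-- Names are assigned to input fields in the order listed -->
  <property>
    <name>oracle.hadoop.loader.input.fieldNames</name>
    <value>empId,lastName,email,hireDate,jobId</value>
  </property>
</configuration>
```

With names like these, the field attribute of each COLUMN element in the loader map can refer to a meaningful name rather than a positional one.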

3.6.2 General Properties

The following are the general properties for Oracle Loader for Hadoop.

mapreduce.inputformat.class

The name of the class that implements InputFormat.

mapreduce.outputformat.class

The output option to use, chosen from those that Oracle Loader for Hadoop supports. The possible values are:

oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat

Writes data records to delimited text format files, such as comma-separated values (CSV) files.

oracle.hadoop.loader.lib.output.JDBCOutputFormat

Inserts data records into the target table using JDBC.

oracle.hadoop.loader.lib.output.OCIOutputFormat

Inserts rows into the target table using the Oracle OCI direct path interface.

oracle.hadoop.loader.lib.output.DataPumpOutputFormat

Writes rows to binary format files that are loaded into the target table using an external table.
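
As a sketch of selecting one of these output options, the fragment below chooses the Data Pump output format for an offline load. The second property names the Oracle directory object used for the external table's data files; OLH_EXTTAB_DIR is the default value shown in oraloader-conf.xml, repeated here only to make the dependency visible.

```xml
<configuration>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
  </property>
  <!-- Applies only to the DelimitedText and DataPump output formats -->
  <property>
    <name>oracle.hadoop.loader.extTabDirectoryName</name>
    <value>OLH_EXTTAB_DIR</value>
  </property>
</configuration>
```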

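3.7 Example of Using Oracle Loader for Hadoop

The example in this section uses Oracle Loader for Hadoop in online database mode with JDBC. It consists of the following steps.

Create a table in the database. This example uses the HR.EMPLOYEES table, which is available as part of the HR sample schema in Oracle Database.
Implement an InputFormat class similar to the examples in the oracle.hadoop.loader.examples package.

Set the configuration properties. The MyLoaderMap.xml document contains the mapping between the input data fields and the columns of the HR.EMPLOYEES table, as follows: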

<?xml version="1.0" encoding="UTF-8"?>
<LOADER_MAP>
<SCHEMA>HR</SCHEMA>
<TABLE>EMPLOYEES</TABLE>
<COLUMN field="empId">EMPLOYEE_ID</COLUMN>
<COLUMN field="lastName">LAST_NAME</COLUMN>
<COLUMN field="email">EMAIL</COLUMN>
<COLUMN field="hireDate" format="MM-dd-yyyy">HIRE_DATE</COLUMN>
<COLUMN field="jobId">JOB_ID</COLUMN>
</LOADER_MAP>

The configuration properties in MyConf.xml are as follows:

<configuration>
  <property>
    <name>mapreduce.inputformat.class</name>
    <value><full_class_name>.MyInputFormat</value>
    <description> Name of the class implementing InputFormat </description>
  </property>
 
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
    <description> Output mode after the loader job executes on Hadoop  </description>
  </property>
 
  <property>
    <name>oracle.hadoop.loader.loaderMapFile</name>
    <value>MyLoaderMap.xml</value>
    <description> The loaderMap file specifying the mapping of input data
     fields to the table columns </description>
  </property>
 
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>HR</value>
    <description> Name of the user connecting to the database</description>
  </property>
 
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>[HR password]</value>
    <description>Password of the user connecting to the database</description>
  </property>
 
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//example.com:1521/serviceName</value>
    <description> Database connection string </description>
  </property>
</configuration>

Invoke OraLoader:

bin/hadoop jar oraloader.jar oracle.hadoop.loader.OraLoader -libjars \
avro-1.4.1.jar,MyInputFormat.jar -conf MyConf.xml \
-fs [<local|namenode:port>] \
-jt [<local|jobtracker:port>]

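3.8 Target Table Characteristics

Oracle Loader for Hadoop supports loading into a single table, called the target table. The target table must exist in Oracle Database. It can contain data or be empty.

3.8.1 Supported Data Types

Oracle Loader for Hadoop supports the following built-in database data types: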

VARCHAR2
CHAR
NVARCHAR2
NCHAR
NUMBER
FLOAT
RAW
BINARY_FLOAT
BINARY_DOUBLE
DATE
TIMESTAMP
TIMESTAMP WITH TIME ZONE
TIMESTAMP WITH LOCAL TIME ZONE
INTERVAL YEAR TO MONTH
INTERVAL DAY TO SECOND

The target table can contain columns with unsupported data types, but these columns must either be nullable or otherwise be set to a value.

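3.8.2 Supported Partitioning Strategies

Oracle Loader for Hadoop supports the following single-level and composite-level partitioning strategies:

Range
List
Hash
Interval
Range-Range
Range-Hash
Range-List
List-Range
List-Hash
List-List
Hash-Range
Hash-Hash
Hash-List
Interval-Range
Interval-Hash
Interval-List

Oracle Loader for Hadoop does not support reference partitioning or virtual column-based partitioning.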

3.9 Loader Map XML Schema Definition

This section contains the XML schema definition (XSD) for the loader map, which specifies the columns to load into the target table.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="columnAttrs">
    <xs:annotation>
      <xs:documentation>Column attributes define how to map input fields to the
                        database column. field - is the name of the field in the
                        IndexedRecord input object. The field name need not be
                        unique. This means that the same input field can map to
                        different columns in the database table. format - is a
                        format string for interpreting the input. For example,
                        if the field is a date then the format is a date format
                        string suitable for interpreting dates</xs:documentation>
    </xs:annotation>
    <xs:attribute name="field" type="xs:token" use="optional"/>
    <xs:attribute name="format" type="xs:token" use="optional"/>
  </xs:attributeGroup>
  <xs:simpleType name="TOKEN_T">
    <xs:restriction base="xs:token">
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="LOADER_MAP">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="SCHEMA" type="TOKEN_T" minOccurs="0"/>
        <xs:element name="TABLE" type="TOKEN_T" nillable="false"/>
        <xs:element name="COLUMN" maxOccurs="unbounded" minOccurs="0">
          <xs:annotation>
            <xs:documentation>specifies the database column name that will be
                              loaded. Each column name must be unique.
            </xs:documentation>
          </xs:annotation>
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="TOKEN_T">
                <xs:attributeGroup ref="columnAttrs"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
 </xs:schema>
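
To make the schema concrete, the document below is a minimal sketch of a loader map that should validate against it. The table, column, and field names (ORDERS, ORDER_ID, CREATED_ON, UPDATED_ON, orderId, ts) are hypothetical and chosen only for illustration. The sketch omits the optional SCHEMA element and maps one input field to two columns, which the schema's annotation permits because field names need not be unique, whereas each column name must be.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<LOADER_MAP>
  <!-- SCHEMA is omitted; the XSD declares it with minOccurs="0" -->
  <TABLE>ORDERS</TABLE>
  <COLUMN field="orderId">ORDER_ID</COLUMN>
  <!-- The same input field "ts" feeds two distinct columns -->
  <COLUMN field="ts" format="yyyy-MM-dd HH:mm:ss">CREATED_ON</COLUMN>
  <COLUMN field="ts" format="yyyy-MM-dd HH:mm:ss">UPDATED_ON</COLUMN>
</LOADER_MAP>
```

The element order matters: the XSD uses xs:sequence, so SCHEMA (if present) must come first, followed by TABLE and then the COLUMN elements.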

3.10 Configuration Properties XML Document

The following is the oraloader-conf.xml document, which lists the configuration properties for Oracle Loader for Hadoop.

<?xml version="1.0"?>
<!-- 
 Copyright (c) 2011, 2012, Oracle and/or its affiliates. All rights reserved. 
 
   NAME
     oraloader-conf.xml 
 
   DESCRIPTION     
     This file is loaded as the very first conf resource.
-->
<configuration>
  <property>
    <name>oracle.hadoop.loader.libjars</name>
    <value>${oracle.hadoop.loader.olh_home}/jlib/ojdbc6.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-utility.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-mapping.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-collation.jar,
${oracle.hadoop.loader.olh_home}/jlib/oraclepki.jar,
${oracle.hadoop.loader.olh_home}/jlib/osdt_cert.jar,
${oracle.hadoop.loader.olh_home}/jlib/osdt_core.jar,
${oracle.hadoop.loader.olh_home}/jlib/commons-math-2.2.jar,
${oracle.hadoop.loader.olh_home}/jlib/jackson-core-asl-1.8.8.jar,
${oracle.hadoop.loader.olh_home}/jlib/jackson-mapper-asl-1.8.8.jar,
${oracle.hadoop.loader.olh_home}/jlib/avro-1.6.3.jar,
${oracle.hadoop.loader.olh_home}/jlib/avro-mapred-1.6.3.jar</value> 
    <description>
      Comma separated list of library jar files. These jars are 
      appended to the value of the "-libjars" command-line argument. 
      Users can distribute their application jars using this property
      in place of, or in combination with, the "-libjars" option.
                 
      It is invalid for this list to have a leading comma or 
      consecutive commas.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sharedLibs</name>
    <value>
${oracle.hadoop.loader.olh_home}/lib/libolh11.so,
${oracle.hadoop.loader.olh_home}/lib/libclntsh.so.11.1,
${oracle.hadoop.loader.olh_home}/lib/libnnz11.so,
${oracle.hadoop.loader.olh_home}/lib/libociei.so
</value>
    <description>
      These files are appended to the value of the "-files" 
      command-line argument.
     
      It is invalid for this list to have a leading comma or 
      consecutive commas.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.olh_home</name>
    <value/>
    <description>
      A path to the OLH_HOME on the node where the OraLoader job
      is initiated. OraLoader uses this path to locate required libraries.
      If this property is not defined, OraLoader will use the value in the
      environment variable OLH_HOME.
    </description>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>OraLoader</value>
    <description>
      Hadoop job name for this Oracle loader job.
    </description>
  </property>
  <property> 
    <name>oracle.hadoop.loader.targetTable</name>
    <value/>
    <description>
      A schema qualified name for the table to be loaded. Use this 
      property to indicate that all columns of the table will be 
      loaded and that the names of the input fields match the 
      column names. This property takes precedence over the
      oracle.hadoop.loader.loaderMapFile property. By default
      this property is not defined. If the table is not schema-qualified, 
      then Oracle Loader for Hadoop uses the connection user.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.defaultDateFormat</name>
    <value>yyyy-MM-dd HH:mm:ss</value>
    <description>
      A java.text.SimpleDateFormat pattern that is used to parse any input 
      field into a DATE column. The format is constructed using 
      the default locale. If you need different patterns for different input 
      fields, then specify the pattern for each input field using the "format" 
      attribute of the COLUMN element definition in the loader map file 
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.loaderMapFile</name>
    <value/>
    <description>
      Path to the loader map file. 
      Use a file:// scheme to indicate a local file.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.tableMetadataFile</name>
    <value/>
    <description>
      Path to the target table metadata file. Use this property when
      running in disconnected mode. The table metadata file is
      created by running the OraLoaderMetadata utility.
      Use a file:// scheme to indicate a local file.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.olhcachePath</name>
    <value>${mapred.output.dir}/../olhcache</value>
    <description>
      Path to a directory where Oracle Loader for Hadoop can create 
      files that will be loaded into the DistributedCache. 
              
      The default value is a directory called 'olhcache' in the parent directory 
      of the job's output directory (i.e. ${mapred.output.dir}).
      
      In distributed mode, the value must be a hdfs path
      (see javaDoc for org.apache.hadoop.filecache.DistributedCache).    
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.loadByPartition</name>
    <value>true</value>
    <description>
      Instructs the output format to perform a partition-aware load.
      For DelimitedText output format, this option controls whether the 
      keyword "PARTITION" appears in the generated .ctl file(s).
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.extTabDirectoryName</name>
    <value>OLH_EXTTAB_DIR</value>
    <description>
      The name of the Oracle directory object for the external table's
      LOCATION data files. This property applies only to the DelimitedText 
      and DataPump output formats.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.enableSampling</name>
    <value>true</value>
    <description>
      Indicates whether the sampling feature is enabled. 
      Set the value to false to disable this feature.
    </description>
  </property>  
  <property>
     <name>oracle.hadoop.loader.enableSorting</name>
     <value>true</value>
     <description>
      Indicates whether output records within each reducer group 
      should be sorted. Use the property oracle.hadoop.loader.sortKey
      to specify the columns to use when sorting. If no value is given,
      then records will be sorted by primary key if the table has one.
     </description>
 </property>
 <property>
     <name>oracle.hadoop.loader.sortKey</name>
     <value/>
     <description>
      A comma separated list of column names that is used to form a 
      key for sorting output records within a reducer group. If no value
      is given, and oracle.hadoop.loader.enableSorting is true, then 
      records will be sorted by primary key if the table has one.
      
      A name may be provided as a quoted or non-quoted identifier. A 
      quoted identifier begins and ends with a double quotation mark.
      The name between the quotes is used to identify the column.  
      A non-quoted identifier will be converted to uppercase before use.
     </description>
  </property>
  <property>
     <name>oracle.hadoop.loader.rejectLimit</name>
     <value>1000</value>
     <description>
      The allowed number of rejected (skipped) records. If this value is 
      exceeded, the job is aborted.
       
      A negative value signals that no limit is imposed on the number of
      rejected records.
 
      Note that if mapper speculative execution is turned on
      (${mapred.map.tasks.speculative.execution}=true by default),
      the number of rejected records may be temporarily inflated, and
      the job may be prematurely aborted.
     </description>
  </property>
 
  <property>
     <name>oracle.hadoop.loader.logBadRecords</name>
     <value>false</value>
     <description>
      When set to true, Oracle Loader for Hadoop will log bad records to a file.
      This applies to records rejected by input formats and mappers. 
     </description>
  </property>
  
  <property>
     <name>oracle.hadoop.loader.badRecordFlushInterval</name>
     <value>500</value>
     <description>
      Specifies the number of records logged by a task attempt before 
      flushing (sync-ing) the log.
      
      This limits the number of records logged by a task attempt 
      that can be lost when the job is killed due to 
      ${oracle.hadoop.loader.rejectLimit}.
      
      This is ignored unless ${oracle.hadoop.loader.logBadRecords} 
      is set to true.
     </description>
  </property>
  
  <property>
    <name>oracle.hadoop.loader.log4j.propertyPrefix</name>
    <value>log4j.logger.oracle.hadoop.loader</value>
    <description>
      Oracle Loader for Hadoop allows you to specify log4j properties
      using Hadoop's job configuration mechanism (-conf, -D).
      
      All configuration properties starting with this prefix are loaded into
      log4j. They will override the settings with the same property names 
      log4j loaded from ${log4j.configuration}. 
      
      These overrides apply to the Oracle Loader for Hadoop job driver
      and all its map and reduce tasks.
      
      Example: -D log4j.logger.oracle.hadoop.loader.OraLoader=DEBUG
               -D log4j.logger.oracle.hadoop.loader.metadata=INFO
               
      Expert: properties are copied from the conf to log4j with their
      RAW values; any variable expansion is done in the context of log4j.
      In order to use conf variables in the expansion, the variables would have
      to start with this prefix.
    </description>
  </property>
 
  <!-- CONNECTION properties -->
  
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value/>
    <description>
      Specifies the URL of the database connection string. This property 
      takes precedence and overrides all other connection properties.      
    
      If Oracle Wallet is configured as an external password store,
      the property value must start with the driver prefix: jdbc:oracle:thin:@ 
      and the db_connect_string must exactly match the credential defined in the 
      wallet.
 
        Example 1: ( using oracle net syntax) 
        jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
            (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
                     (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
        
        Example 2: ( using TNS entry)
          jdbc:oracle:thin:@myTNS
        
          - Also see documentation for 
            oracle.hadoop.loader.connection.wallet_location          
    
      If Oracle Wallet is NOT used, then set the following conf properties     
      oracle.hadoop.loader.connection.url. 
      Examples of connection URL styles:
        thin-style: 
          jdbc:oracle:thin:@//myhost:1521/my_db_service_name  
          jdbc:oracle:thin:user/password@//myhost:1521/my_db_service_name
         
        Oracle Net:
          jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
              (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
                       (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
        
        TNSEntry Name:
          jdbc:oracle:thin:@myTNSEntryName
    
     oracle.hadoop.loader.connection.user  
     oracle.hadoop.loader.connection.password  
       
     If OCIOutputFormat is configured, and Oracle Wallet is not used, then 
     username and password must be specified in these separate properties.     
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value/>
    <description>
      Name for the database login. If OCIOutputFormat is configured and this 
      user property is not defined, then logon with Oracle Wallet is assumed.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value/>
    <description>Password for the connecting user.</description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.wallet_location</name>
    <value/>
    <description>
     File path to an Oracle wallet directory where the connection 
     credential is stored.
     
     When using Oracle Wallet as an external password store 
     set the three properties
      oracle.hadoop.loader.connection.wallet_location
      oracle.hadoop.loader.connection.url
      oracle.hadoop.loader.connection.tns_admin
      
     or set the three properties      
      oracle.hadoop.loader.connection.wallet_location
      oracle.hadoop.loader.connection.tnsEntryName
      oracle.hadoop.loader.connection.tns_admin     
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.tnsEntryName</name>
    <value/>
    <description>
      Specifies a TNS entry name defined in the tnsnames.ora file.
      This property is used together with the 
      oracle.hadoop.loader.connection.tns_admin property.
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.connection.tns_admin</name>
    <value/>
    <description>
      File path to a directory containing
      SQL*Net configuration files like sqlnet.ora and tnsnames.ora.
      If this property is not defined, the value of the environment
      variable TNS_ADMIN will be used. Define this property in order
      to use TNS entry names in database connect strings.
 
      This property must be set when using Oracle Wallet as an
      external password store. See the property 
      oracle.hadoop.loader.connection.wallet_location
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.connection.defaultExecuteBatch</name>
    <value>100</value>
    <description>
       Applicable only for JDBC and OCI output formats. The default
       value for the number of records to be inserted in a batch for
       each trip to the database. Specify a value >= 1 to
       override the default value. If the specified value is less than 1,
       this property assumes the default value. Though the maximum
       value is unlimited, using very large batch sizes is not
       recommended, as it results in a large memory footprint without
       much increase in performance.
     </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.sessionTimezone</name>
    <value>LOCAL</value>
    <description>
      This property is used to alter the session time zone for 
      database connections.  Valid values are:
      
        [+|-] hh:mm      - hours and minutes before or after UTC
        LOCAL            - the default timezone of the JVM 
        time_zone_region - a valid time zone region
      
      This property also determines the default timezone when parsing
      input data that will be loaded to database column types:
      TIMESTAMP, TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE
    </description>
  </property>
   
  <!-- DEBUG properties 
       These properties should not compromise security or
       expose customer data or Oracle proprietary information.
       Remember these properties will be available to anyone.
       
       Current NO DEBUG properties
  -->
  <!-- properties for OCIOutputFormat -->
  <property>
    <name>oracle.hadoop.loader.output.dirpathBufsize</name>
    <value>131072</value>
    <description>
      This property is used to set the size, in bytes, of the direct path stream
      buffer for OCIOutputFormat.  If needed, values are rounded up to the next 
      nearest multiple of 8k. The default is 128k.
    </description>
  </property>
  
  <!-- 
    This is currently only used by parallel dirpath output, but it's conceivable
    that this could be used in a read scenario as well. That is why it does not
    have an output prefix.
  -->
  <property>
    <name>oracle.hadoop.loader.compressionFactors</name>
    <value>BASIC=5.0,OLTP=5.0,QUERY_LOW=10.0,QUERY_HIGH=10.0,
           ARCHIVE_LOW=10.0,ARCHIVE_HIGH=10.0</value>
    <description>
      This property is used to define the compression factor for different types
      of compression. The format is a comma separated list of name=value pairs
      where name is one of BASIC, OLTP, QUERY_LOW, QUERY_HIGH, ARCHIVE_LOW, or
      ARCHIVE_HIGH.  Value is a decimal number.
    </description>
  </property>
 
  
  <!-- properties for DelimitedTextOutputFormat -->
    <property>
      <name>oracle.hadoop.loader.output.fieldTerminator</name>
      <value>,</value>
      <description>
        A single character to delimit fields for DelimitedTextOutputFormat.
        Alternate representation: \uHHHH (where HHHH is the character's UTF-16 
        encoding).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.initialFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are always enclosed between the
        specified character and 
        ${oracle.hadoop.loader.output.trailingFieldEncloser} (which defaults
        to this property's value).
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        A zero length value means no enclosers (default value).
                           
        Use this when some field may contain the fieldTerminator. 
        If some field may also contain the trailingFieldEncloser, then
        the escapeEnclosers property should be set to true.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.trailingFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are always enclosed between 
        ${oracle.hadoop.loader.output.initialFieldEncloser} and the
        specified character for this property.
 
        If this property is not defined, the value of 
        ${oracle.hadoop.loader.output.initialFieldEncloser} is used instead.
 
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        Do not set this property if
        ${oracle.hadoop.loader.output.initialFieldEncloser} is not defined.
                
        Use this when some field may contain the fieldTerminator. 
        If some field may also contain the trailingFieldEncloser,then
        the escapeEnclosers property should be set to true.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.escapeEnclosers</name>
      <value>false</value>
      <description>
        When this is set to true and both initial and trailing field enclosers 
        are set, fields will be scanned, and embedded trailing encloser 
        characters will be escaped. Use this option when some of the field
        values may contain the trailing encloser character.
      </description>
    </property>
 
  <!-- property for DataPumpInputFormat -->
    <property>
      <name>oracle.hadoop.loader.output.granuleSize</name>
      <value>10240000</value>
      <description>
 
        Granule size (in bytes) used in the generated data pump files.
        A granule determines the work load for a pq-slave when loading the 
        file through the ORACLE_DATAPUMP access driver.
      </description>
    </property>
    
  <!-- common property for DelimitedTextInputFormat and RegexInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.fieldNames</name>
      <value/>
      <description>
        Comma-separated list of names to assign to input fields. 
        The names are used to create the Avro schema for the record. 
        The strings must be valid JSON name strings.
 
        If this property is not defined, the names F0,F1,F2,... will be used
        (consistent with oracle.hadoop.loader.examples.CSVInputFormat).
      </description>
    </property>
 
 
  <!-- properties for DelimitedTextInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.fieldTerminator</name>
      <value>,</value>
      <description>
        A single character to delimit fields for DelimitedTextInputFormat.
        Alternate representation: \uHHHH (where HHHH is the character's UTF-16 
        encoding).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are allowed to be enclosed
        between the specified character and 
        ${oracle.hadoop.loader.input.trailingFieldEncloser} (which defaults
        to this property's value).
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        A zero length value means no enclosers (default value).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are allowed to be enclosed
        between ${oracle.hadoop.loader.input.initialFieldEncloser} 
        and the specified character.
        
        If this property is not defined, the value of 
        ${oracle.hadoop.loader.input.initialFieldEncloser} is used instead.
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        Do not set this property if
        ${oracle.hadoop.loader.input.initialFieldEncloser} is not defined.
      </description>
    </property>
    
    <!-- properties for RegexInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.regexPattern</name>
      <value/>
      <description>
        The pattern string for a regular expression as defined in 
        http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html.
        
        The pattern must match the entire text line (the matching is applied 
        only once per line).
        
        Same as org.apache.hadoop.hive.contrib.serde2.RegexSerDe's 
        "input.regex" property.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.regexCaseInsensitive</name>
      <value>false</value>
      <description>
         Directs the pattern matching to be case insensitive. 
         
         Same as org.apache.hadoop.hive.contrib.serde2.RegexSerDe's 
         "input.regex.case.insensitive" property.
      </description>
    </property>    
 
    <!--Properties for tuning the sampler-->
    <!-- set numThreads > 1 for large datasets -->
    <property>
      <name>oracle.hadoop.loader.sampler.numThreads</name>
          <value>5</value>
      <description>
        Number of sampler threads.  
        
        This value should be set based on the processor and memory resources 
        available on the node where the Oracle Loader for Hadoop job is initiated. 
        A higher number of sampler threads implies higher concurrency in sampling.
        Set this value to 1 to disable multi-threading in the sampler. 
        The default value is 5 threads.  
      </description>
   </property> 
   <property>
     <name>oracle.hadoop.loader.sampler.maxLoadFactor</name>
     <value>0.05</value>
     <description> 
       The maximum acceptable reducer load factor.
       In a perfectly load balanced job, every reducer is assigned 
       an equal amount of work (or load). 
       Load factor is the percent overload per reducer 
       i.e. (assigned load - ideal load)%
       For example: a value of 0.05, indicates that it is acceptable for 
       reducers to be assigned up to 5% more data than their ideal load. 
       If load balancing is successful, it guarantees this 
       maximum load factor at the specified confidence.
       (see oracle.hadoop.loader.sampler.loadCI)
       Default = 0.05, another common value is 0.1.
     </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.loadCI</name>
    <value>0.95</value>
    <description> 
      The confidence level for the specified 
      maximum reducer load factor.
      (See oracle.hadoop.loader.sampler.maxLoadFactor)
      Default = 0.95, other common values = 0.90, 0.99
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.minSplits</name>
    <value>5</value>
    <description>
      The minimum number of splits that will be 
      read by the sampler. If the total number of splits 
      is lesser than this value, then the sampler will read
      all splits. Splits may be read partially. 
      A non-positive value is equivalent to minSplits=1. 
      The default value is 5.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.hintMaxSplitSize</name>
    <value>1048576</value>
    <description> 
      The sampler sets Hadoop configuration parameter
      mapred.max.split.size to this value before it calls the InputFormat's 
      getSplits() method.
      
      The value of mapred.max.split.size is only set to this value for the 
      duration of sampling, it is not changed in the actual job 
      configuration. Some InputFormats (e.g. FileInputFormat) use the 
      maximum split size as a hint to determine the number of splits 
      returned by getSplits(). Smaller split sizes imply that more
      chunks of data will be sampled at random (good). While large splits are 
      better for IO performance, they are not necessarily better for sampling. 
      Set this value to be small enough for good sampling performance, 
      but not any smaller: extremely small values can cause inefficient IO 
      performance and cause getSplits() to run out of memory by returning too
      many splits. 
      
      The recommended minimum value for this property is 1048576 bytes (1 MB).
      Note that org.apache.hadoop.mapreduce.lib.input.FileInputFormat will
      always return splits of size at least FileInputFormat::getMinSplitSize(),
      regardless of the value of this property.
 
      This value can be increased for larger datasets (e.g. tens of terabytes) 
      or if the InputFormat's getSplits() method throws an OutOfMemoryError.  
      If the specified value is less than 1, this property is ignored. 
      The default value is 1048576 bytes (1 MB).
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.hintNumMapTasks</name>
    <value>100</value>
    <description> 
      The sampler sets Hadoop configuration parameter
      mapred.map.tasks to this value for the duration of sampling. 
      The value of mapred.map.tasks is not changed in the actual job 
      configuration. Some InputFormats (e.g. DBInputFormat) use the 
      number of map tasks parameter as a hint to determine the number of 
      splits returned by getSplits(). Higher values imply that more chunks 
      of data will be sampled at random, which is desirable. The default value is 100. 
      This value should typically be increased for large datasets (e.g. more 
      than a million rows), while keeping in mind that extremely large values
      can cause the InputFormat's getSplits() method to run out of memory by
      returning too many splits.
      
      If the specified value is less than 1, this property is ignored.  
      The default value is 100.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.maxSamplesPct</name>
    <value>0.01</value>
    <description> 
      This property specifies the maximum data to sample, as a
      percentage of the total amount of data. In general, the 
      sampler will stop sampling if any one of the following is true:
      (1) it has collected the minimum number of samples 
          required for optimal load-balancing, or 
      (2) the percent of data sampled exceeds 
          oracle.hadoop.loader.sampler.maxSamplesPct, or 
      (3) the number of bytes sampled exceeds 
          oracle.hadoop.loader.sampler.maxHeapBytes.
      If this parameter is set to a negative value, 
      condition (2) is not imposed.
      The default value is 0.01 (1%).
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.maxHeapBytes</name>
    <value>-1</value>
    <description> 
      This value specifies the maximum memory available to 
      the sampler in bytes. In general, the sampler will 
      stop sampling when any one of these conditions is true:
      (1) it has collected the minimum number of samples 
          required for optimal load-balancing, or 
      (2) the percent of data sampled exceeds 
          oracle.hadoop.loader.sampler.maxSamplesPct, or 
      (3) the number of bytes sampled exceeds 
          oracle.hadoop.loader.sampler.maxHeapBytes.
      If this parameter is set to a negative value, 
      condition (3) is not imposed.
      Default = -1 (no memory restrictions on the sampler).
    </description>
  </property>
  
</configuration>
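
The description of oracle.hadoop.loader.sampler.hintMaxSplitSize warns that very small values can make getSplits() return too many splits. The following Python sketch makes that trade-off concrete with rough arithmetic. It is illustrative only and is not part of Oracle Loader for Hadoop: the function name estimate_split_count, the 128 MB block size, and the sample input sizes are assumptions. It treats the input as one splittable file and uses the FileInputFormat-style rule max(minimum split size, min(maximum split size, block size)), ignoring per-file boundaries and the handling of the final partial split.

```python
MB = 1024 * 1024
TB = 1024 ** 4


def estimate_split_count(total_bytes, hint_max_split_size=1 * MB,
                         min_split_size=1, block_size=128 * MB):
    """Approximate the number of splits seen by the sampler.

    Hypothetical helper for illustration. Assumes a single splittable input
    of total_bytes and an assumed HDFS block size of 128 MB.
    """
    if hint_max_split_size < 1:
        # The property is ignored for values below 1; that case is not modeled.
        raise ValueError("hint_max_split_size must be at least 1")
    if total_bytes <= 0:
        return 0
    # FileInputFormat-style split size: never below the minimum split size,
    # never above the smaller of the hint and the block size.
    split_size = max(min_split_size, min(hint_max_split_size, block_size))
    # Ceiling division: a trailing partial chunk still counts as a split.
    return (total_bytes + split_size - 1) // split_size


if __name__ == "__main__":
    for label, size in (("10 GB", 10 * 1024 ** 3), ("10 TB", 10 * TB)):
        print(label, estimate_split_count(size))
```

Under these assumptions, the default 1 MB hint yields about ten thousand splits for 10 GB of input but more than ten million for 10 TB, which is consistent with the advice to increase the value for datasets in the tens of terabytes.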

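The descriptions of oracle.hadoop.loader.sampler.maxSamplesPct and oracle.hadoop.loader.sampler.maxHeapBytes list the same three stopping conditions. The following Python sketch restates them as a single predicate. It is a reading of the property descriptions, not the actual sampler implementation: the function name should_stop_sampling and its parameters are hypothetical, the minimum number of samples is passed in as an input because the listing does not say how it is computed, and the fraction of data sampled is assumed to be measured in bytes.

```python
def should_stop_sampling(samples_collected, min_samples_needed,
                         bytes_sampled, total_bytes,
                         max_samples_pct=0.01, max_heap_bytes=-1):
    """Return True if any documented stopping condition holds.

    Hypothetical helper for illustration. max_samples_pct is a fraction
    (0.01 means 1%); a negative limit disables the matching condition.
    """
    # (1) Enough samples have been collected for load balancing.
    if samples_collected >= min_samples_needed:
        return True
    # (2) The sampled fraction exceeds maxSamplesPct (skipped if negative).
    if max_samples_pct >= 0 and total_bytes > 0:
        if bytes_sampled / total_bytes > max_samples_pct:
            return True
    # (3) The sampled bytes exceed maxHeapBytes (skipped if negative).
    if max_heap_bytes >= 0 and bytes_sampled > max_heap_bytes:
        return True
    return False


if __name__ == "__main__":
    # 20 MB sampled out of 1 GB is 2%, which is above the 1% default limit.
    print(should_stop_sampling(10, 1000, 20_000_000, 1_000_000_000))
```

With both limits set to negative values, only condition (1) can end sampling in this sketch.
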
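3.11 Third-Party Licenses for Bundled Software

Oracle Loader for Hadoop installs the following third-party products:

Apache Avro
Apache Commons Mathematics Library
Jackson JSON Processor

Oracle Loader for Hadoop includes Oracle 11g Release 2 (11.2) client libraries. For information about third-party products included with Oracle Database 11g Release 2 (11.2), see Oracle Database Licensing Information.

Unless otherwise specifically noted, or as required under the terms of the third-party license (for example, LGPL), the licenses and statements in this section, including all statements regarding Apache Licensed Code, are intended as notices only.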

3.11.1 Apache Licensed Code

The following is included as a notice in compliance with the terms of the Apache 2.0 License, and applies to all programs licensed under the Apache 2.0 license:

You may not use the identified files except in compliance with the Apache License, Version 2.0 (the "License.")

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

A copy of the license is also reproduced below.

See the License for the specific language governing permissions and limitations under the License.

Apache License

Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

Definitions

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
2. You must cause any modified files to carry prominent notices stating that You changed the files; and
3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Do not include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software developed by The Apache Software Foundation (http://www.apache.org/), as listed below.

3.11.2 Apache Avro 1.6.3

Licensed under the Apache License, Version 2.0 (the "License"); you may not use Apache Avro except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

3.11.3 Apache Commons Mathematics Library 2.2

Licensed under the Apache License, Version 2.0 (the "License"); you may not use the Apache Commons Mathematics library except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

3.11.4 Jackson JSON 1.8.8

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this library except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0