Oracle® Big Data Appliance Software User's Guide
Release 1 (1.0)

Part Number E25961-04



balancer

A service that ensures that all nodes in the cluster store about the same amount of data, within a set range. Data is balanced over the nodes in the cluster, not over the disks in a node.
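As an illustrative sketch only (not the actual HDFS balancer algorithm), the balance condition can be pictured as each node's disk utilization staying within a set threshold of the cluster-wide average. The function name and data layout below are hypothetical:

```python
def is_balanced(node_used, node_capacity, threshold=0.10):
    """Hypothetical check: is every node's utilization within `threshold`
    of the cluster-wide average utilization? The real HDFS balancer applies
    a similar criterion, then moves blocks between DataNodes to restore
    balance."""
    cluster_util = sum(node_used.values()) / sum(node_capacity.values())
    return all(
        abs(node_used[n] / node_capacity[n] - cluster_util) <= threshold
        for n in node_used
    )

# Three nodes, each with 100 units of capacity, near the 50% average.
used = {"node1": 52, "node2": 48, "node3": 50}
capacity = {"node1": 100, "node2": 100, "node3": 100}
print(is_balanced(used, capacity))  # True: every node is within 10% of average
```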

Cloudera's Distribution including Apache Hadoop (CDH)

See CDH.


CDH

The version of Apache Hadoop and related components installed on Oracle Big Data Appliance.


cluster

A group of servers on a network that are configured to work together. A server is either a master node or a worker node.

All servers in an Oracle Big Data Appliance rack form a cluster. Servers 1, 2, and 3 are master nodes. Servers 4 to 18 are worker nodes.

See Hadoop.


DataNode

A server in a CDH cluster that stores data in HDFS. A DataNode performs file system operations assigned by the NameNode.

See also HDFS; NameNode.


Flume

A distributed service in CDH for collecting and aggregating data from almost any source into a data store such as HDFS or HBase.

See also HBase; HDFS.


JobTracker

A service that assigns MapReduce tasks to specific nodes in the CDH cluster, preferably those nodes storing the data.

See also Hadoop; MapReduce.


Hadoop

A batch processing infrastructure that stores files and distributes work across a group of servers. Oracle Big Data Appliance uses Cloudera's Distribution including Apache Hadoop (CDH).

Hadoop Distributed File System (HDFS)

See HDFS.

Hadoop User Experience (HUE)

See HUE.


HBase

An open-source, column-oriented database that provides random, read/write access to large amounts of sparse data stored in a CDH cluster. It provides fast lookup of values by key and can perform thousands of insert, update, and delete operations per second.

See also cluster.


HDFS

An open-source file system designed to store extremely large data files (megabytes to petabytes) with streaming data access patterns. HDFS splits these files into data blocks and distributes the blocks across a CDH cluster.

When a data set is larger than the storage capacity of a single computer, then it must be partitioned across several computers. A distributed file system can manage the storage of a data set across a network of computers.

See also cluster.
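The splitting and distribution described above can be sketched in a few lines. This is a deliberately simplified illustration with hypothetical helper names, a tiny block size, and round-robin placement; real HDFS uses large blocks (64 MB by default in CDH) and replicates each block on several DataNodes:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks (tiny size here,
    purely for illustration)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes):
    """Assign each block to a node round-robin -- a simplification of
    HDFS placement, which also replicates blocks for fault tolerance."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
print(len(blocks))  # 4 blocks: 256 + 256 + 256 + 232 bytes
placement = assign_blocks(blocks, ["node1", "node2", "node3"])
```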


Hive

An open-source data warehouse in CDH that supports data summarization, ad hoc querying, and analysis of data stored in HDFS. It uses a SQL-like language called HiveQL. An interpreter generates MapReduce code from the HiveQL queries.

By using Hive, you can avoid writing MapReduce programs in Java.

See also Hive Thrift; HiveQL; MapReduce.

Hive Thrift

An RPC interface for remote access to CDH for Hive queries.

See also Hive.


HiveQL

A SQL-like query language used by Hive.

See also Hive.


HotSpot

HotSpot is a Java Virtual Machine (JVM) that is maintained and distributed by Oracle. It automatically optimizes code that is executed frequently, leading to high performance. HotSpot is the standard JVM for the other components of the Oracle Big Data Appliance stack.


HUE

A web user interface in CDH that includes several applications, including a file browser for HDFS, a job browser, an account management tool, a MapReduce job designer, and Hive wizards. Cloudera Manager runs on HUE.

See also HDFS; Hive.

Java HotSpot Virtual Machine

See HotSpot.


MapReduce

A programming model that enables the MapReduce engine to distribute work across the nodes of a cluster. MapReduce programs can run massively in parallel in CDH.

A MapReduce program contains two kinds of tasks: map tasks, which process the records of the input data set, and reduce tasks, which merge and condense the output of the map tasks.
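The model can be illustrated with the classic word-count example. The sketch below runs in a single process purely to show the map, shuffle, and reduce phases; in CDH the map and reduce tasks run in parallel on many nodes, and the function names here are illustrative, not a real MapReduce API:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map task: emit a (word, 1) pair for each word in an input record."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce task: merge all the counts emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group the mapper output by key before reducing.
groups = defaultdict(list)
for word, count in chain.from_iterable(mapper(line) for line in lines):
    groups[word].append(count)

result = dict(reducer(word, counts) for word, counts in groups.items())
print(result["the"])  # 3
```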

MySQL Database

A SQL-based relational database management system. On Oracle Big Data Appliance, Cloudera Manager, Oracle Data Integrator, Hive, and Oozie use MySQL Database as a metadata repository.


NameNode

A service that maintains a directory of all files in HDFS and tracks where data is stored in the CDH cluster.

See also HDFS.


node

A server in a CDH cluster.

See cluster.


Oozie

An open-source workflow and coordination service for managing data processing jobs in CDH.

Oracle Database Instant Client

A small-footprint client that enables Oracle applications to run without a standard Oracle client.

Oracle Linux

An open-source operating system. Oracle Linux 5.6 is the same version used by Exalogic 1.1. It features the Oracle Unbreakable Enterprise Kernel.

Oracle Wallet Manager

An application for managing the security credentials stored in Oracle wallets. A wallet is a password-protected container that stores authentication and signing credentials.


Pig

An open-source platform for analyzing large data sets. It consists of the Pig Latin language for expressing data analysis programs, together with an infrastructure for evaluating those programs by compiling them into MapReduce jobs.

Pig runs as a client application.

See also MapReduce.


Puppet

A configuration management tool for deploying and configuring software components across a cluster. The Oracle Big Data Appliance initial software installation uses Puppet.

The Puppet tool consists of four components: puppet agents, typically just called puppets; the puppet master server; a console; and a cloud provisioner.

See also puppet agent; puppet master.

puppet agent

Daemons that pull configurations from the puppet master and apply them. Puppet agents run on every server in Oracle Big Data Appliance.

puppet master

A server that compiles configurations and serves them to the puppet agents.

See also Puppet; puppet agent.


R

An open-source language and environment for statistical analysis and graphing.

Oracle Auto Service Request for Sun Systems

A software tool that monitors the health of the hardware and automatically generates a service request if it detects a problem. This tool is a feature of an Oracle Warranty.

See also Oracle Automated Service Monitor (OASM).

Oracle Automated Service Monitor (OASM)

A service for monitoring the health of Oracle Sun hardware systems. Formerly named Sun Automatic Service Monitor (SASM).


Sqoop

A command-line tool that imports and exports data between HDFS or Hive and structured databases. The name Sqoop comes from "SQL to Hadoop." Oracle R Connector for Hadoop uses the Sqoop executable to move data between HDFS and Oracle Database.


table

In Hive, all files in a directory stored in HDFS.

See also HDFS.


ZooKeeper

A centralized coordination service for CDH distributed processes that maintains configuration information and naming, and provides distributed synchronization and group services.