Glossary
Apache Flume
A distributed service for collecting and aggregating data from almost any source into a data store such as HDFS or HBase.
See also Apache HBase; HDFS.
Apache HBase
An open-source, column-oriented database that provides random, read/write access to large amounts of sparse data stored in a CDH cluster. It provides fast lookup of values by key and can perform thousands of insert, update, and delete operations per second.
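The access pattern described above (sparse rows, lookup by key, frequent inserts, updates, and deletes) can be pictured with a small in-memory model. This is only a sketch in plain Python: the class name SparseTable and the column names are hypothetical, and it is not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory stand-in for a sparse, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> {column name -> value}; columns that are never set take no space
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Insert a new cell or update an existing one."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Look up a single cell by row key and column name."""
        return self._rows.get(row_key, {}).get(column, default)

    def delete(self, row_key, column):
        """Remove a cell if it exists; missing cells are ignored."""
        self._rows.get(row_key, {}).pop(column, None)


table = SparseTable()
table.put("user1", "info:name", "Ada")
print(table.get("user1", "info:name"))   # Ada
print(table.get("user1", "info:email"))  # None, because the cell was never written
```

Each row stores only the columns that were written, which is what makes sparse data cheap to hold.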
Apache Hive
An open-source data warehouse in CDH that supports summarization, ad hoc querying, and analysis of data stored in HDFS. It uses a SQL-like language called HiveQL. An interpreter generates MapReduce code from the HiveQL queries.
By using Hive, you can avoid writing MapReduce programs in Java.
See also Hive Thrift; MapReduce.
Apache Sentry
Integrates with the Hive and Impala SQL-query engines to provide fine-grained authorization to data and metadata stored in Hadoop.
Apache Solr
Provides an enterprise search platform that includes full-text search, faceted search, geospatial search, and hit highlighting.
Apache Spark
A fast engine for processing large-scale data. It supports Java, Scala, and Python applications. Because it provides primitives for in-memory cluster computing, it is particularly suited to machine-learning algorithms. It promises performance up to 100 times faster than MapReduce.
Apache Sqoop
A command-line tool that imports and exports data between HDFS or Hive and structured databases. The name Sqoop comes from "SQL to Hadoop." Oracle R Advanced Analytics for Hadoop uses the Sqoop executable to move data between HDFS and Oracle Database.
Apache YARN
An updated version of MapReduce, also called MapReduce 2. The acronym stands for Yet Another Resource Negotiator.
ASR
Oracle Auto Service Request, a software tool that monitors the health of the hardware and automatically generates a service request if it detects a problem.
See also OASM.
Balancer
A service that ensures that all nodes in the cluster store about the same amount of data, within a set range. Data is balanced over the nodes in the cluster, not over the disks in a node.
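The idea of keeping every node "within a set range" can be sketched as a check of each node's utilization against the cluster-wide average. The function name, the node names, and the 10 percent threshold below are assumptions chosen for illustration, not values taken from this appliance.

```python
def nodes_out_of_balance(used_by_node, capacity_by_node, threshold_pct=10.0):
    """Return nodes whose utilization differs from the cluster average
    by more than threshold_pct percentage points (threshold is an assumed value)."""
    cluster_util = 100.0 * sum(used_by_node.values()) / sum(capacity_by_node.values())
    out_of_range = {}
    for node, used in used_by_node.items():
        node_util = 100.0 * used / capacity_by_node[node]
        deviation = node_util - cluster_util
        if abs(deviation) > threshold_pct:
            out_of_range[node] = round(deviation, 1)
    return out_of_range


# Hypothetical three-node cluster, sizes in TB
used = {"node04": 90, "node05": 50, "node06": 40}
capacity = {"node04": 100, "node05": 100, "node06": 100}
print(nodes_out_of_balance(used, capacity))  # {'node04': 30.0, 'node06': -20.0}
```

A positive deviation marks a node that holds more than its share and a negative one marks a node that holds less. The check works per node, not per disk, matching the definition above.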
CDH
Cloudera's Distribution including Apache Hadoop, the version of Apache Hadoop and related components installed on Oracle Big Data Appliance.
Cloudera Hue
Hadoop User Experience, a web user interface in CDH that provides several applications, including a file browser for HDFS, a job browser, an account management tool, a MapReduce job designer, and Hive wizards. Cloudera Manager runs on Hue.
See also HDFS; Apache Hive.
Cloudera Impala
A massively parallel processing query engine that delivers better performance for SQL queries against data in HDFS and HBase, without moving or transforming the data.
Cloudera Manager
Cloudera Manager enables you to monitor, diagnose, and manage CDH services in a cluster.
The Cloudera Manager agents on Oracle Big Data Appliance also provide information to Oracle Enterprise Manager, which you can use to monitor both software and hardware.
Cloudera Navigator
Verifies access privileges and audits access to data stored in Hadoop, including Hive metadata and HDFS data accessed through HDFS, Hive, or HBase.
Cloudera Search
Provides search and navigation tools for data stored in Hadoop. Based on Apache Solr.
cluster
A group of servers on a network that are configured to work together. A server is either a master node or a worker node.
All servers in an Oracle Big Data Appliance rack form a cluster. Servers 1, 2, and 3 are master nodes. Servers 4 to 18 are worker nodes.
See also Hadoop.
DataNode
A server in a CDH cluster that stores data in HDFS. A DataNode performs file system operations assigned by the NameNode.
Hadoop
A batch processing infrastructure that stores files and distributes work across a group of servers. Oracle Big Data Appliance uses Cloudera's Distribution including Apache Hadoop (CDH).
HDFS
Hadoop Distributed File System, an open-source file system designed to store extremely large data files (megabytes to petabytes) with streaming data access patterns. HDFS splits these files into data blocks and distributes the blocks across a CDH cluster.
When a data set is larger than the storage capacity of a single computer, it must be partitioned across several computers. A distributed file system manages the storage of a data set across a network of computers.
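Splitting a file into blocks and spreading the blocks over a cluster can be sketched in a few lines. The 128 MB block size, the node names, and the round-robin placement are assumptions for illustration only; actual placement in HDFS also replicates each block and takes rack layout into account.

```python
import itertools

MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks that a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplification with no replication)."""
    placement = {node: [] for node in nodes}
    for index, node in zip(range(num_blocks), itertools.cycle(nodes)):
        placement[node].append(index)
    return placement


blocks = split_into_blocks(300 * MB, 128 * MB)
print([size // MB for size in blocks])                  # [128, 128, 44]
print(place_blocks(len(blocks), ["node04", "node05"]))  # {'node04': [0, 2], 'node05': [1]}
```

Only the last block is smaller than the block size, so a 300 MB file occupies three blocks that can live on different servers.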
See also cluster.
Hive Thrift
A remote procedure call (RPC) interface for remote access to CDH for Hive queries.
See also CDH; Apache Hive.
HotSpot
A Java Virtual Machine (JVM) that is maintained and distributed by Oracle. It automatically optimizes code that executes frequently, leading to high performance. HotSpot is the standard JVM for the other components of the Oracle Big Data Appliance stack.
JobTracker
A service that assigns tasks to specific nodes in the CDH cluster, preferably those nodes storing the data. MRv1 only.
Kerberos
A network authentication protocol that helps prevent malicious impersonation. It was developed at the Massachusetts Institute of Technology (MIT).
Mahout
Apache Mahout is a machine learning library that includes core algorithms for clustering, classification, and batch-based collaborative filtering.
MapReduce
A parallel programming model for processing data on a distributed system. Two versions of MapReduce are available: MapReduce 1 and YARN (MapReduce 2). The default version on Oracle Big Data Appliance 3.0 and later is YARN.
A MapReduce program contains these functions:
- Mappers: Process the records of the data set.
- Reducers: Merge the output from several mappers.
- Combiners: Optimize the result sets from the mappers before sending them to the reducers (optional and not supported by all applications).
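The three roles can be illustrated with a word count that runs in a single Python process. This is a sketch of the programming model only, not the Hadoop API; the function names and the sample records are chosen for illustration.

```python
from collections import defaultdict


def mapper(record):
    """Process one record: emit a (word, 1) pair for each word."""
    return [(word.lower(), 1) for word in record.split()]


def reducer(pairs):
    """Merge (key, count) pairs by summing the counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


def combiner(pairs):
    """Optional step: pre-aggregate one mapper's output before it reaches the reducer.
    Summation is associative, so the reducer logic can be reused here."""
    return reducer(pairs)


def run_job(records, use_combiner=True):
    mapped = [mapper(record) for record in records]   # one mapper output per record
    if use_combiner:
        mapped = [combiner(pairs) for pairs in mapped]
    shuffled = [pair for output in mapped for pair in output]
    return dict(reducer(shuffled))


print(run_job(["to be or not to be", "to do"]))
# {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```

The result is the same with or without the combiner; the combiner only reduces the number of pairs that must be passed from the mappers to the reducer.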
See also Apache YARN.
MySQL Database
A SQL-based relational database management system. Cloudera Manager, Oracle Data Integrator, Hive, and Oozie use MySQL Database as a metadata repository on Oracle Big Data Appliance.
NameNode
A service that maintains a directory of all files in HDFS and tracks where data is stored in the CDH cluster.
See also HDFS.
NodeManager
A service that runs on each node and executes the tasks assigned to it by the ResourceManager. YARN only.
See also ResourceManager; Apache YARN.
OASM
Oracle Automated Service Manager, a service for monitoring the health of Oracle Sun hardware systems. Formerly named Sun Automatic Service Manager (SASM).
Oozie
An open-source workflow and coordination service for managing data processing jobs in CDH.
Oracle Database Instant Client
A small-footprint client that enables Oracle applications to run without a standard Oracle Database client.
Oracle Linux
Oracle Linux is Oracle’s commercial version of the Linux operating system. Oracle Linux is free to download, use, and redistribute without a support contract.
Oracle NoSQL Database
A distributed key-value database that supports fast querying of the data, typically by key lookup.
Oracle R Distribution
An Oracle-supported distribution of the R open-source language and environment for statistical analysis and graphing.
Oracle R Enterprise
A component of the Oracle Advanced Analytics Option. It enables R users to run R commands and scripts for statistical and graphical analyses on data stored in an Oracle database.
Pig
An open-source platform for analyzing large data sets that consists of the following:
- Pig Latin scripting language
- Pig interpreter that converts Pig Latin scripts into MapReduce jobs
Pig runs as a client application.
See also MapReduce.
Puppet
A configuration management tool for deploying and configuring software components across a cluster. The Oracle Big Data Appliance initial software installation uses Puppet.
The Puppet tool consists of these components: puppet agents, typically just called puppets; the puppet master server; a console; and a cloud provisioner.
See also puppet agent; puppet master.
puppet agent
A service that primarily pulls configurations from the puppet master and applies them. Puppet agents run on every server in Oracle Big Data Appliance.
See also Puppet; puppet master.
puppet master
A service that primarily serves configurations to the puppet agents.
See also Puppet; puppet agent.
ResourceManager
A service that assigns tasks to specific nodes in the CDH cluster, preferably those nodes storing the data. YARN only.
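The preference for "nodes storing the data" can be sketched as a two-pass choice: first try a node that holds the task's data, then fall back to any node with spare capacity. The function name, node names, and slot counts are hypothetical, and the real scheduler is considerably more elaborate than this.

```python
def assign_task(data_nodes, free_slots):
    """Choose a node for one task.

    data_nodes: nodes that store the task's input data, in preference order.
    free_slots: dict of node -> number of free task slots (updated in place).
    Returns (node, is_data_local), or (None, False) if the cluster is full.
    """
    # First pass: prefer a node that already stores the data.
    for node in data_nodes:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Second pass: fall back to any node with a free slot.
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


slots = {"node04": 1, "node05": 0, "node06": 2}
print(assign_task(["node05", "node04"], slots))  # ('node04', True)
print(assign_task(["node05", "node04"], slots))  # ('node06', False)
```

The second call falls back to node06 because both nodes that store the data are already busy, so the task must read its input over the network.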
TaskTracker
A service that runs on each node and executes the tasks assigned to it by the JobTracker service. MRv1 only.
See also JobTracker.
ZooKeeper
A MapReduce 1 centralized coordination service for CDH distributed processes that maintains configuration information and naming, and provides distributed synchronization and group services.