BDD integration with Hadoop

This topic discusses how BDD fits into the Hadoop environment.

Hadoop is a platform for storing, accessing, and analyzing all kinds of data: structured, unstructured, and data from the Internet Of Things. Hadoop is broadly adopted by IT organizations, especially those that have high volumes of data.

As a data scientist, you often must practice two kinds of analytics work:

In operational analytics, you may work on model fitting and its analysis. For this, you may write code for machine-learning models, and issue queries to these models at scale, with real-time incoming updates to the data. Such work involves relying on the Hadoop ecosystem. Big Data Discovery allows you to work without leaving the Hadoop environment that the rest of your work takes place in. BDD supports an enterprise-quality business intelligence experience directly on Hadoop data, with high numbers of concurrent requests and low latency of returned results.
In investigative analytics, you may use interactive statistical environments, such as R to answer ad-hoc, exploratory questions and gain insights. BDD also lets you export your data from BDD back into Hadoop, for further investigative analysis with other tools within your Hadoop deployment.

By coupling tightly with Hadoop, Oracle Big Data Discovery achieves data discovery for any data, at significantly-large scale, with high query-processing performance.

About Hadoop distributions

Big Data Discovery works with very large amounts of data which may already be stored within HDFS. A Hadoop distribution is a prerequisite for the product, and it is critical for the functionality provided by the product.

Big Data Discovery supports:

CLoudera Distribution for Hadoop (CDH). Cloudera CDH is a complete, tested, and popular distribution of Apache Hadoop and related projects. CDH is 100% Apache-licensed open source and offers unified batch processing, interactive SQL and interactive search, and role-based access controls. CDH delivers the core elements of Hadoop — scalable storage and distributed computing — along with additional components, such as a user interface, plus necessary enterprise capabilities, such as security.
HortonWorks Data Platform (HDP). HDP is a data platform for multi-workload data processing across an array of processing methods, supported by key capabilities required of an enterprise data platform, including data governance, security and operations.

BDD uses the HDFS, Hive, Spark, and YARN components packaged with a specific Hadoop distribution (CDH or HDP). For detailed information on version support and packages, see the Installation and Deployment Guide.

BDD inside the Hadoop Infrastructure

Big Data Discovery brings itself to the data that is natively available in Hadoop.

BDD maintains a list of all of a company’s data sources found in Hive and registered in HCatalog. When new data arrives, BDD lists it in Studio's Catalog, decorates it with profiling and enrichment metadata, and, when you take this data for further exploration, takes a sample of it. It also lets you explore the source data further by providing an automatically-generated list of powerful visualizations that illustrate the most interesting characteristics of this data. This helps you cut down on time spent for identifying useful source data sets, and on data set preparation time; it increases the amount of time your team spends on analytics leading to insights and new ideas.

BDD is embedded into your data infrastructure, as part of Hadoop ecosystem. This provides operational simplicity:

Nodes in the BDD cluster deployment can share hardware infrastructure with the existing Hadoop cluster at your site. Note that the existing Hadoop cluster at your site may still be larger than a subset of Hadoop nodes on which data-processing-centric components of BDD are deployed.
Automatic indexing, data profiling, and enrichments take place when your source Hive tables are discovered by BDD. This eliminates the need for a traditional approach of cleaning and loading data into the system, prior to analyzing it.
BDD performs distributed query evaluation at a high scale, letting you interact with data while analyzing it.
A Studio component of BDD also takes advantage of being part of Hadoop ecosystem:
- It brings you insights without having to work for them — this is achieved by data discovery, sampling, profiling, and enrichments.
- It lets you create links between data sets.
- It utilizes its access to Hadoop as an additional processing engine for data analysis.

Benefits of integration of BDD with Hadoop ecosystem

Big Data Discovery is deployed directly on a subset of nodes in the pre-existing Hadoop cluster where you store the data you want to explore, prepare, and analyze.

By analyzing the data in the Hadoop cluster itself, BDD eliminates the cost of moving data around an enterprise’s systems — a cost that becomes prohibitive when enterprises begin dealing with hundreds of Terabytes of data. Furthermore, a tight integration of BDD with HDFS allows profiling, enriching, and indexing data as soon as the data enters the Hadoop cluster in the original file format. By the time you want to see a data set, BDD has already prepared it for exploration and analysis. BDD leverages the resource management capabilities in Hadoop to let you run mixed-workload clusters that provide optimal performance and value.

Finally, direct integration of BDD with the Hadoop ecosystem streamlines the transition between the data preparation done in BDD and the advanced data analysis done in tools such as Oracle R Advanced Analytics for Hadoop (ORAAH), or other 3rd party tools. BDD lets you export a cleaned, sampled data set as a Hive table, making it immediately available for users to analyze in ORAAH. BDD can also export data as a file and register it in Hadoop, so that it is ready for future custom analysis.