Hadoop provides a number of components and tools that BDD requires to process and manage data. The Hadoop Distributed File System (HDFS) stores your source data, and Spark on YARN runs all Data Processing jobs. This topic discusses how BDD fits into the Spark and Hadoop environment.
Hadoop is a platform for the distributed storage, access, and analysis of all kinds of data: structured, unstructured, and data from the Internet of Things. It is broadly adopted by IT organizations, especially those with high volumes of data.
Through its tight coupling with Spark and Hadoop, Oracle Big Data Discovery provides data discovery for any data, at very large scale, with high query-processing performance.
About Hadoop distributions
Big Data Discovery works with very large amounts of data stored in HDFS. A Hadoop distribution is a prerequisite for the product and is critical to the functionality the product provides.
BDD uses the HDFS, Hive, Spark, and YARN components packaged with a specific Hadoop distribution. For detailed information on Hadoop version support and packages, see the Installation Guide.
BDD inside the Hadoop Infrastructure
Big Data Discovery brings discovery to the data where it natively resides in Hadoop.
BDD maintains a list of all of a company's data sources found in Hive and registered in HCatalog. When new data arrives, BDD lists it in Studio's Catalog, decorates it with profiling and enrichment metadata, and takes a sample of it when you select it for further exploration. It also lets you explore the source data further through an automatically generated list of visualizations that illustrate the most interesting characteristics of the data. This cuts the time spent identifying useful source data sets and preparing them, and increases the time your team spends on the analytics that lead to insights and new ideas.
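BDD's sampling internals are not documented in this topic, but reservoir sampling is one standard technique for drawing a uniform random sample from a data set too large to hold in memory, which is the situation the sampling step above addresses. The sketch below is illustrative only; the `reservoir_sample` helper is a hypothetical name, not a BDD API.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw a uniform random sample of up to k rows from an iterable
    of unknown length, holding at most k rows in memory at once."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every row seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample

# Sampling 5 rows from a "source" of 1,000 rows.
rows = ({"id": n, "value": n * n} for n in range(1000))
sample = reservoir_sample(rows, 5, seed=42)
print(len(sample))  # 5
```

Because the algorithm makes a single pass and never materializes the full data set, it scales to sources of arbitrary size, which is why single-pass sampling schemes like this are common in big-data tooling.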
Benefits of integration of BDD with Hadoop and Spark ecosystem
Big Data Discovery is deployed directly on a subset of nodes in the pre-existing Hadoop cluster where you store the data you want to explore, prepare, and analyze.
By analyzing the data in the Hadoop cluster itself, BDD eliminates the cost of moving data around an enterprise's systems — a cost that becomes prohibitive when enterprises begin dealing with hundreds of terabytes of data. Furthermore, tight integration of BDD with HDFS allows data to be profiled, enriched, and indexed as soon as it enters the Hadoop cluster, in its original file format. By the time you want to see a data set, BDD has already prepared it for exploration and analysis. BDD leverages the resource management capabilities of Spark on YARN to let you run mixed-workload clusters that provide optimal performance and value.
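To make the profiling step above concrete, the sketch below computes the kind of per-column metadata a profiler typically gathers: an inferred type, a null count, and a distinct-value count. The `profile_column` function and its metric names are illustrative assumptions, not BDD's actual profiling schema.

```python
def profile_column(values):
    """Compute simple profiling metadata for one column: inferred type,
    null count, and distinct-value count (a hypothetical illustration,
    not BDD's real profiler)."""
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    # Infer a coarse type: numeric only if every non-null value is a number.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        inferred = "numeric"
    else:
        inferred = "string"
    return {"type": inferred, "nulls": nulls, "distinct": distinct}

stats = profile_column([10, 25, None, 10, 42])
print(stats)  # {'type': 'numeric', 'nulls': 1, 'distinct': 3}
```

Metadata like this is what lets a catalog surface interesting characteristics of a data set before anyone has opened it.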
Finally, direct integration of BDD with the Hadoop ecosystem streamlines the transition between the data preparation done in BDD and the advanced data analysis done in tools such as Oracle R Advanced Analytics for Hadoop (ORAAH) or other third-party tools. BDD lets you export a cleaned, sampled data set as a Hive table, making it immediately available for analysis in ORAAH. BDD can also export data as a file and register it in Hadoop, so that it is ready for future custom analysis.
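As a rough illustration of the file-export path described above, the sketch below writes a sampled data set as a CSV file, a format a downstream tool could then register (for example, as a Hive external table pointed at the file's directory). The `export_sample` helper is a hypothetical name, not BDD's export API.

```python
import csv
import io

def export_sample(rows, fieldnames, out):
    """Write a cleaned, sampled data set as CSV to a writable text
    stream so a downstream tool can pick it up (illustrative only)."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Exporting two sampled rows to an in-memory buffer for demonstration;
# in practice the target would be a file in HDFS.
buf = io.StringIO()
export_sample([{"id": 1, "value": 9}, {"id": 2, "value": 16}],
              ["id", "value"], buf)
print(buf.getvalue().splitlines()[0])  # id,value
```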