Oracle Big Data Discovery comprises several distinct components, which are installed and deployed together. These components are described below.
Studio is Big Data Discovery's front-end web application. It provides tools that enable users to create and manage data sets and projects, as well as administrator tools for managing user access and other settings. Studio stores its project data and the majority of its configuration in a relational database.
Studio is a Java-based application. It runs inside WebLogic Server, along with the Dgraph Gateway.
The Dgraph Gateway is a Java-based interface that routes requests to the Dgraph instances and provides caching and business logic. It also uses the Cloudera Distribution for Hadoop (CDH) ZooKeeper package to handle cluster services for the Dgraph instances.
The Dgraph Gateway runs inside WebLogic Server, along with Studio.
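To make the ZooKeeper role concrete, the sketch below shows how a Java client might look up the currently registered Dgraph instances. This is a minimal illustration only: the connection string and znode path are assumptions, not the Gateway's actual implementation.

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class DgraphDiscoverySketch {
        public static void main(String[] args) throws Exception {
            // The connection string and znode path below are hypothetical placeholders.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {});
            try {
                // List the children of the znode under which Dgraph instances
                // would register themselves, yielding the live cluster membership.
                List<String> instances = zk.getChildren("/bdd/dgraph/instances", false);
                instances.forEach(System.out::println);
            } finally {
                zk.close();
            }
        }
    }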
Data Processing collectively refers to a set of processes and jobs that perform discovery, sampling, profiling, and enrichment of source data. Many of the processes run within Hadoop, so Data Processing must be deployed to CDH nodes.
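As a rough illustration of what a profiling step involves, the sketch below counts the occurrences of each distinct value in one column of a set of sampled rows. It is a single-JVM simplification with assumed types; the actual Data Processing jobs run as distributed workflows within Hadoop.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ProfilingSketch {
        // Profiles one column of sampled rows by counting how often each
        // distinct value occurs. The row and column representations here
        // are assumptions made for the example.
        public static Map<String, Long> profileColumn(List<Map<String, String>> sampleRows,
                                                      String column) {
            return sampleRows.stream()
                    .map(row -> row.getOrDefault(column, ""))
                    .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        }
    }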
The Data Processing Command Line Interface (CLI) provides a way to manually launch Data Processing jobs and invoke the Hive Table Detector (see below). Because the CLI shares configuration information with Studio, it is automatically deployed to all Managed Server nodes. It can later be moved to any node that has access to the Big Data Discovery deployment.
The Hive Table Detector is a Data Processing component that monitors the Hive database for new or deleted tables and launches the appropriate Data Processing workflow when it finds one. If you enable the CLI to run as a cron job, the Big Data Discovery installer starts the Hive Table Detector immediately after deployment.
The Hive Table Detector is invoked by the CLI, either manually by the Hive administrator or via the CLI cron job.
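For illustration, a cron entry that drives the CLI (and through it the Hive Table Detector) might look like the line below. The install path is an assumption, and the exact invocation and flags vary by release, so treat this as a sketch rather than the installer's actual output.

    # Hypothetical crontab entry: run the Data Processing CLI hourly.
    # The CLI path and log location are placeholders, not actual defaults.
    0 * * * * /opt/bdd/dataprocessing/data_processing_CLI >> /var/log/bdd_cli.log 2>&1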
The Dgraph indexes the data sets produced by Data Processing and stores the resulting indexes on shared NFS storage. It also responds to user requests for records in those data sets.
The Dgraph is designed to be stateless, which allows each Dgraph instance to respond to requests independently of the others. Queries are routed to the Dgraph instances by the Dgraph Gateway.
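Statelessness is what makes a simple routing strategy such as round-robin safe, because any healthy instance can serve any query. The sketch below illustrates the idea; the class and method names are assumptions, not the Dgraph Gateway's actual API.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class DgraphRouterSketch {
        private final List<String> instances;      // e.g. "dgraph1:7010", "dgraph2:7010"
        private final AtomicInteger next = new AtomicInteger();

        public DgraphRouterSketch(List<String> instances) {
            this.instances = instances;
        }

        // Round-robin selection: because instances are stateless, no session
        // affinity is needed, and Math.floorMod keeps the index valid even if
        // the counter wraps around to a negative value.
        public String pickInstance() {
            return instances.get(Math.floorMod(next.getAndIncrement(), instances.size()));
        }
    }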
The Dgraph can be hosted on any node in the Big Data Discovery deployment, although it is recommended that you dedicate specific nodes to hosting it. The nodes that host Dgraph instances form a Dgraph cluster inside the BDD cluster.
The Dgraph HDFS Agent acts as a data transport layer between the Dgraph and the HDFS environment. It exports records to HDFS on behalf of the Dgraph, and imports records from HDFS during data ingest operations.
The HDFS Agent is dependent on the Dgraph: it is deployed to the same nodes as the Dgraph, starts when the Dgraph starts, and shuts down when the Dgraph shuts down.
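The sketch below shows the kind of transport the HDFS Agent performs, using the standard Hadoop FileSystem API to write records out to HDFS and read them back during ingest. The paths and the line-per-record format are assumptions made purely for illustration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTransportSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml and hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Export: write records out to a (hypothetical) HDFS path.
            try (PrintWriter out = new PrintWriter(fs.create(new Path("/bdd/export/records.txt")))) {
                out.println("record-1");
                out.println("record-2");
            }

            // Import (ingest): read records back from HDFS.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/bdd/export/records.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println("ingesting: " + line);
                }
            }
        }
    }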