18 Integrating with Druid Timeseries Database for Realtime Interactive Analytics

A cube is a data structure that helps you quickly analyze data across multiple dimensions. GoldenGate Stream Analytics cubes are powered by Druid, a distributed, in-memory OLAP data store.

A pipeline outputs the processed data to a Kafka stream, which in turn feeds the cube.

You can use cubes for:
  • Interactive analysis of historical data
  • Analysis of univariate, bivariate, and multivariate data
  • Exploration of historical data, using a rich set of 30 visualizations. These range from simple charts such as tables, lines, and bars to advanced visualizations such as sankey diagrams, box plots, and maps. You can save the results of cube explorations for use in dashboards, serving both the operational and strategic analysis needs of business users.

18.1 Creating a Connection to Druid

To create a connection to Druid:

  1. On the Catalog page, click Create New Item.
  2. Hover the mouse over Connection and select Druid from the submenu.
  3. On the Type Properties screen, enter the following details:
    • Name: Enter a unique name for the connection. This is a mandatory field.
    • Display Name: Enter a display name for the connection. If left blank, the Name field value is copied.
    • Description
    • Tags
    • Connection Type: The selected connection is displayed.
  4. Click Next.
  5. On the Connection Details screen, enter the following details:
    • Zookeepers: Enter the ZooKeeper URL.

  6. Click Test Connection to ensure that you have successfully created a connection.
  7. Click Save.
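The ZooKeeper field typically accepts a comma-separated connect string of host:port pairs. As an illustration (the hostnames below are hypothetical, and this helper is not part of the product), such a string can be split into individual endpoints like this:

```python
def parse_zookeeper_url(url):
    """Split a ZooKeeper connect string into (host, port) pairs."""
    endpoints = []
    for part in url.split(","):
        host, _, port = part.strip().partition(":")
        # 2181 is the conventional ZooKeeper client port, used when none is given
        endpoints.append((host, int(port) if port else 2181))
    return endpoints

print(parse_zookeeper_url("zk1.example.com:2181,zk2.example.com"))
```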

18.2 Creating a Cube

Druid must be running before you create a cube. Visit http://druid.io/downloads.html to download and install Druid.

To create a cube:

  1. On the Catalog page, click Create New Item, and select Cube from the drop-down list.
  2. On the Type Properties screen, enter the following details:
    • Name
    • Description
    • Tags
    • Source Type: Select Published Pipeline from the drop-down list.
  3. Click Next.
  4. On the Ingestion Details screen, enter the following details:
    • Connection: Select a connection from the drop-down list.
    • Pipeline: Select a pipeline from the drop-down list.
    • Kafka Target: Select a Kafka target from the drop-down list.
    • Timestamp: Select a column from the pipeline to be used as the timestamp.
    • Timestamp format: Select or set a suitable format for the timestamp using Joda time format. This is a mandatory field. The default value is auto.
    • Metrics: Select metrics for creating measures.
    • Dimensions: Select dimensions for group by.
    • High Cardinality Dimensions: Select high cardinality dimensions such as unique IDs. Hyperlog approximation will be used.
  5. Click Next.
  6. Select the required values for the metric on the Metric Capabilities screen.
  7. On the Advanced Settings screen, enter the following details:
    • Segment granularity: Select the granularity with which you want to create segments.
    • Query granularity: Select the minimum granularity at which results can be queried, and the granularity of the data inside the segment.
    • Task count: Select the maximum number of reading tasks in a replica set. This means that the maximum number of reading tasks is taskCount * replicas, and the total number of tasks (reading + publishing) is higher than this. The number of reading tasks is less than taskCount if taskCount is greater than the number of Kafka partitions.
    • Task duration: Select the length of time before tasks stop reading and begin publishing their segment. The segments are only pushed to deep storage and loadable by historical nodes when the indexing task completes.
    • Maximum rows in memory: Enter a number greater than or equal to 0. This number indicates the number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size. Maximum heap memory usage for indexing scales with maxRowsInMemory*(2 + maxPendingPersists).
    • Maximum rows per segment: Enter a number greater than or equal to 0. This is the number of rows to aggregate into a segment; this number is post-aggregation rows.
    • Immediate Persist Period: Select the period that determines the rate at which intermediate persists occur. This allows the cube to be ready for queries before the indexing task finishes.
    • Report Parse Exception: Select this option to throw exceptions encountered during parsing and halt ingestion.
    • Advanced IO Config: Specify name-value pairs in CSV format. Available configurations are replicas, startDelay, period, useEarliestOffset, completionTimeout, and lateMessageRejectionPeriod.
    • Advanced Tuning Config: Specify name-value pairs in CSV format. Available configurations are maxPendingPersists, handoffConditionTimeout, resetOffsetAutomatically, workerThreads, chatThreads, httpTimeout, and shutdownTimeout.
  8. Click Save.
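Behind the scenes, the settings above correspond to fields in a Druid Kafka ingestion (supervisor) spec. The following is a minimal sketch only; the data source name, topic, and timestamp column are hypothetical, and the values are illustrative rather than recommended:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "sales_cube",
    "timestampSpec": { "column": "eventTime", "format": "auto" },
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  },
  "ioConfig": {
    "topic": "sales_topic",
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H",
    "useEarliestOffset": false
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 100000,
    "maxRowsPerSegment": 5000000,
    "intermediatePersistPeriod": "PT10M",
    "reportParseExceptions": false,
    "maxPendingPersists": 0
  }
}
```

With these example values, maximum heap usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists), that is, with 100000 * 2 aggregated rows.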

18.3 Exploring a Cube

After you create a Druid-based cube, you can explore the data in it.

To explore a cube:

  1. In the Catalog, click the cube that you want to explore.
  2. On the Cube Exploration screen, construct a query by setting the following parameters:
    • Visualization Type: Select a visualization type from the drop-down list.
    • Time: Set the time-related form attributes such as time granularity, origin (starting point of time), and time range.
    • Group By: Select the parameters to aggregate the query data.
    • Not Grouped By: Select the parameters to query atomic rows.
    • Options
    • Filters: Columns that you can use in filters.
    • Result Filters: Columns that you can use in result filters.
  3. Click Query to run the query with the defined parameters.
  4. Click Save As to save the cube exploration. You can save it as a visualization, and choose whether to add it to an existing dashboard, to a new dashboard, or to no dashboard.
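The exploration parameters above map loosely onto a Druid native query body. The following sketch assembles a groupBy query as a plain dictionary, ready to be serialized as JSON; the data source, column names, and values are hypothetical, and this helper is an illustration, not part of the product:

```python
import json

def build_groupby_query(datasource, dimensions, granularity, intervals,
                        aggregations, filter_spec=None):
    """Assemble a Druid native groupBy query body as a plain dict."""
    query = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "dimensions": dimensions,      # the "Group By" columns
        "granularity": granularity,    # the time granularity
        "intervals": intervals,        # the time range, as ISO-8601 intervals
        "aggregations": aggregations,  # the measures computed per group
    }
    if filter_spec is not None:
        query["filter"] = filter_spec  # the "Filters" columns
    return query

body = build_groupby_query(
    datasource="sales_cube",
    dimensions=["productCategory"],
    granularity="hour",
    intervals=["2024-01-01T00:00:00Z/2024-01-02T00:00:00Z"],
    aggregations=[{"type": "doubleSum", "name": "revenue",
                   "fieldName": "orderTotal"}],
    filter_spec={"type": "selector", "dimension": "region", "value": "EMEA"},
)
print(json.dumps(body, indent=2))
```

A body like this could be POSTed to a Druid broker's /druid/v2 endpoint to run the query directly, outside the exploration UI.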