7 Using Data Flows to Create Curated Data Sets

You can use data flows to produce curated (combined, organized, and integrated) data sets.

Topics:

Typical Workflow for Creating Curated Data Sets with Data Flows

Here are the common tasks for creating curated data sets with data flows.

Task	Description	More Information
Create a data flow	Create data flows from one or more data sets.	Creating a Data Flow
Add filters	Use filters to limit the data in a data flow output.	Adding Filters to a Data Flow
Add aggregates	Apply aggregate functions to group data in a data flow.	Adding Aggregates to a Data Flow
Merge columns and rows of data sets	Combine two or more columns and rows of data sets in a data flow.	Merging Columns in a Data Flow Merging Rows in a Data Flow
Create a binning column	Assign a value to add a binning column to the data set.	Creating a Binning Column in a Data Flow
Create a sequence of data flows	Create and save a sequential list of data flows.	Creating a Sequence of Data Flows
Create a group	Create a group column of attribute values in a data set.	Creating a Group in a Data Flow
Add cumulative values	Group data by applying cumulative aggregate functions in a data flow.	Adding Cumulative Values to a Data Flow
Save output data to a database	Connect to a database and save the output data from a data flow to a table in a database.	Saving Output Data from a Data Flow to a Database
Execute a data flow	Execute data flows to create data sets.	Executing a Data Flow
Run a data flow	Run a saved data flow to create data sets or to refresh the data in a data set.	Running a Data Flow

About Data Flows

Data flows let you take one or more data sets and organize and integrate them to produce a curated set of data that you can use to easily create effective visualizations.

You use the Data Visualization's data flow editor to select specific data from existing data sets, apply transformations, add joins and filters, remove unwanted columns, add new derived measures, add derived columns, and add other operations. The data flow is then run to produce a data set that you can use to create complex visualizations.

See Creating a Data Flow and Running a Data Flow.

About Editing a Data Flow

Build your data flow by adding steps to select, limit, and customize your data.

The following image shows the Data Flow editor.

Description of GUID-50222F10-C622-45AF-A41F-FFB1EB04BEF8-default.gif follows

Description of the illustration GUID-50222F10-C622-45AF-A41F-FFB1EB04BEF8-default.gif

What Can You Do in the Data Flow Editor?

The Data Flow editor is a flexible tool, designed to help you easily create data flows. You can also:

select, add, and rename columns
add or adjust aggregates
add filters
create a merge column
merge rows
create a binning column
add a sequence
create a group
add cumulative values
customize step names
schedule a data flow
add another data set

You add steps in the workflow diagram pane and specify details for those steps in the Step editor pane.

These helpful tips should help you to use the Step Editor pane more effectively:

You can hide or display the Step editor pane by clicking Step editor at the bottom of the Data Flow editor.

You can hide or display the Preview data columns pane by clicking Preview data at the bottom of the Data Flow editor.
The Preview data columns pane updates automatically as you make changes to the data flow.

For example, you could add a Select Columns step, remove some columns, and then add an Aggregate step. While working on the Aggregate step, the Preview data columns pane already shows the columns and data that you just specified in the Select Columns step.
You can specify whether or not to automatically refresh step changes in the Preview data columns pane by clicking Auto apply .
You can add another data set and join it to the existing data sets in your data flow by selecting Add Data in the Data Flow Steps panel.

Joins are created automatically when you add a data set; however, you can edit the join details in the Join dialog.
Oracle Data Visualization validates data flow steps as you add them to or delete them from the data flow.
If you’re adding an expression (in an Add Column step or a Filter step), then you must click Apply to finalize the step.

If you add a new step to the workflow diagram without clicking Apply, then your expression won’t be applied, and the next step that you add won’t use the correct data.

Creating a Data Flow

You can create a data flow from one or more data sets. With a data flow, you produce a curated data set that you can use to easily and efficiently create meaningful visualizations.

In the Home Page, click Create, then click Data Flow to display the Add Data Set dialog, follow the on-screen instructions to specify a data set, then click Add.
To add steps to your data flow, in the Data Flow editor, go to the workflow diagram pane and click Add a step (+) next to the data set step. See Adding Filters to a Data Flow, Adding Aggregates to a Data Flow, and Merging Columns in a Data Flow.
In the Add step dialog, select the step that you want to add and provide the required details in the Step editor pane.
(Optional) To delete a step from the workflow diagram, click X or right-click the step and select Delete. Note that deleting a step might make the other steps in the data flow invalid, as indicated by red X icons displayed for the invalid steps.
Click Save to save but not run the data flow. Note that you can save a data flow that contains validation errors. When you save a data flow, it’s displayed in the Display pane of the Data page, in the Data Flows area.
When you’ve finished adding steps to the data flow diagram, you can also execute the data flow without saving it, or save the data flow as a database connection. See Executing a Data Flow, Saving Output Data from a Data Flow to a Database.

Adding Filters to a Data Flow

You can use filters to limit the amount of data included in the data flow output. For example, limiting sales revenue data of a column to the years 2010 through 2017.

You can filter a data element by adding the Filter step in the Step Editor pane.

Create or open the data flow that you want to apply a filter to.
Click Add a step (+), and select Filter.

In the Filter pane, select the data element you want to filter:

Field	Description
Add Filter (+)	Select the data element you want to filter, in the Available Data dialog. Alternatively, click Data Elements in the Data Panel, and drag and drop a data element to the Filter pane.
Filter fields	Change the values, data or selection of the filter (for example, maximum and minimum range). Based on the data element, specific filter fields are displayed. You can apply multiple filters to a data element.
Filter menu icon	Select a function to clear the filter selection and disable or delete a filter.
Filter pane menu icon	Select a function to clear all filter selections, remove all filters, and auto-apply filters. You can select to add an expression filter.
Add Expression Filter	Select to add an Expression Filter. Click f(x), select a function type, and then double-click to add a function in the Expression field. Click Apply.
Auto-Apply Filters	Select an auto-apply option for the filters, such as Default (On).

Note:

Based on the applied filter, the data preview (for example, the displayed sales data in a column) is updated.

Click Save.

Adding Aggregates to a Data Flow

You can group data by applying aggregate functions such as count, sum, and average.

If the data set already contains aggregates, then they’re displayed when you add an aggregate step. You can add an aggregate in the Step Editor pane.

Create or open the data flow that you want to add an aggregate step to.
Click Add a step (+), and select Aggregate.
In the Aggregate pane, to add a column to the aggregate, click Actions then click Aggregate.
To select an aggregate function to apply to an aggregate column, click the arrow in the Function field for the selected column and select a value to aggregate by. For example, for the Profit column you could choose Sum.
To remove an aggregate from the selected aggregate list, hover the mouse pointer over the aggregate’s name, click Actions, and click Group By.
To save your changes, click Save Data Flow.

Merging Columns in a Data Flow

You can combine two or more columns to display as one. For example, you can merge the street address, street name, state, and ZIP code columns so that they display as one item in the visualizations using the data flow’s output.

You create a merged column by adding a merge column step in the Step Editor pane.

Create or open the data flow that you want to add a merge column to.
Click Add a step (+), and select Merge Columns.

In the Merge Columns pane, specify the options for combining the columns:

Field	Description
New column name	Change the name of the merge column.
Merge column	Select the first column you want to merge.
With	Select the second column you want to merge.
(+) Add Column	Select more columns you want to merge.
Delimiter	Select a delimiter to separate column names (for example, Space, Comma, Dot, or Custom Delimiter).

Click Save.

Merging Rows in a Data Flow

You can merge the rows of two data sets. The result can include all the rows from both data sets, the unique rows from each data set, the overlapping rows from both data sets, or the rows unique to one data set.

Before you merge the rows, do the following:

Confirm that each data set has the same number of columns.
Check that the data types of the corresponding columns of the data sets match. For example, column 1 of data set 1 must have the same data type as column 1 of data set 2.

You can add a Merge Rows step in the Step Editor pane.

Create or open the data flow that contains the data sets you want to merge.
Select the two data sets, right-click, and select Merge rows.

Select the option for merging the rows, as described in the following table:

Option	Description
All rows from Input 1 and Input 2 (Union All)	All the rows of both the data sets are displayed.
Unique rows from Input 1 and Input 2 (Union)	The data of each unique rows are merged and displayed with the other rows.
Rows common to Input 1 and Input 2 (Intersect)	Only the common rows are displayed with the merged data.
Rows unique to Input 1 (Except)	Only the unique rows of data set 1 are displayed.
Rows unique to Input 2 (Except)	Only the unique rows of data set 2 are displayed.

Click Save.

Creating a Binning Column in a Data Flow

Binning a measure creates a new column based on the value of the measure. You can assign a value to the bin dynamically by creating the number of equal size bins (such as the same number of values in each bin), or by explicitly specifying the range of values for each bin.

You can add a Bin step in the Step Editor pane.

Create or open the data flow in which you want to create a bin column.
Click Add a step (+), and select Bin.
Alternatively, you can select Add Columns, and then click (+) Column to select Bin.
In the Bin pane, click Select Column.
In the Available Columns dialog, select the data element.

In the Bin pane, specify the options for the bin column:

Field	Description
Bin	Select a different data element.
New element name	Change the name of the bin column.
Number of Bins	Enter a number, or use the arrows to increment or decrement the number of bins.
Method	Select one of the methods, Manual, Equal Width, or Equal Height.
Histogram View	Based on your selection in the Method field, the histogram range (width) and histogram count (height) of the bins are updated. In the Manual method, you can move the slider to select the boundary; that is, the histogram range and count. The number of sliders changes based on the histogram count. You can switch to the List view and enter the range manually along with the bin names. In the Equal Width method, the histogram range is divided into intervals of the same size. For equal width binning, the column values are measured, and the range is divided into equal-sized intervals. The edge bins can accommodate very low or very high values in the column. In the Equal Height method, the height of each bin is same or very slightly different but the histogram range is equal. For equal height or frequency binning, the intervals of each bin is based on each interval containing approximately the equal number of elements (that is, records). Equal Height method is preferred specifically for the skewed data.
List View	If you select the Manual method, you can change the name of the bins, and you can define the range for each bin.

Note:

Based on your changes, the data preview (for example, the bin column name) is updated.

Click Save.

Creating a Sequence of Data Flows

A sequence is a saved sequential list of specified data flows and is useful when you want to run multiple data flows as a single transaction. If any flow within a sequence fails, then all the changes done in the sequence are rolled back.

Click Create, and select Sequence.
Drag and drop the data flows and sequences to the Sequence pane.
Click the menu icon to move an item up or down in the list, and to remove an item.
Click Save. When you save a sequence, it’s displayed in the Sequence area of the Data page.
Go to the Sequence area of the Data page, select the sequence, and click Execute Sequence.
After you run a sequence, the resulting data sets are displayed in the Data page.
Go to the Data page and click Data Sets to see the list of resulting data sets.

Creating a Group in a Data Flow

You can use binning attributes to define groups of attribute values in a data set.

Create or open the data flow in which you want to create a group column.
Click Add a step (+), and select Group.
Select the data element in the Available Columns dialog. You can’t select the numbered type data element.

Specify the options for the new group column in the Group pane:

Field	Description
Group	Change the name of a group (for example, Group1).
Available values list	Select the values you want to include in a group. The selected values are displayed in the Selections list. Based on your selection, the histogram is updated. The height of the horizontal bar is based on the count of a group in the data set.
Name	Change the name of the new group column.
Selections	Contains all columns selected for this group.
(+) Group	Add a new group. You can select a group, and click X to delete it.
Include Others	Group values that haven't been added to any of the other groups.
Add all	Add all the values in the available list to a group.
Remove all	Remove all the selected values from a group.

Note:

Based on your changes, the data preview (for example, the group column name) is updated.

Click Save.

Adding Cumulative Values to a Data Flow

You can group data by applying the cumulative aggregate functions such as the moving and running aggregate. A moving aggregate aggregates values over a row and a specific number of preceding rows. A running aggregate aggregates values over all the preceding rows. Because both the moving and running aggregates are based on the preceding rows, the sort order of rows is important. You can specify the order as part of the aggregate.

You can add a Cumulative Value step in the Step Editor pane.

Create or open the data flow in which you want to add a cumulative value column.
Click Add a step (+), and select Cumulative Value.

In the Cumulative Value pane, specify the cumulative aggregate functions for the new column:

Field	Description
Aggregate	Select a data column.
Function	Select a function. The available types of function are based on the data column. If the column data type is incompatible with the function, an error message is displayed.
Rows	Select the value. You can edit this field only for specific functions. If the value isn't a positive integer, an error message is displayed.
New column name	Change the aggregate column name. If two columns have the same name, an error message is displayed.
(+) Aggregate	Create a new aggregate column. Select a column. If no aggregate column is defined, an error message is displayed.
(+) Sort Column	Select a sort by column for the data column. Click Options to move a sort column up or down in the list. Select a sort order. If you add two sort orders to the same column, an error message is displayed.
Sort order list	Select the sort order type. The available types of sort order are based on the selected data element.
(+) Restart Column	Select a restart column for the data column. Click Options to move a restart element up or down in the list. Select a restart element. If you add a duplicate restart column, an error message is displayed.

Note:

Based on your defined values, the data preview (for example, New column name) is updated.

Click Save.

Customizing Data Flow Step Names and Descriptions

You can rename a data flow step and add or edit the description.

Create or open a data flow or data set.
Click Add a step (+), and select a step.
Click the step name (for example, Merge Columns) in the step pane header.
Enter a new name or edit the existing name in the Name field, and enter a description if required.
To save your changes, click Enter, or click outside the header fields.

Executing a Data Flow

Executing a data flow produces a data set that you can use to create visualizations.

To successfully execute a data flow, it must be free of validation errors.

Create or open the data flow that you want to execute and produce a data set from.
Click Add a step (+) and select Save Data.
In the Save Data Set pane enter the Data Set name and Description to identify your data set. Don't change the Save data to field.
Click Execute Data Flow to run the data flow. If there is no validation error, a completion message is displayed.

Note:
When you execute a data flow without saving it, the data flow isn’t saved and isn't displayed in the Data Flows list. Therefore, the data flow isn’t available for you to modify or run.

Go to the Data page and select Data Sets to see your resulting data set in the list.
Click Save As. In the Save Data Flow As dialog enter a Name and Description to identify your data flow.
Go to the Data page and select Data Flows to see your resulting data flow in the list.

Saving Output Data from a Data Flow to a Database

You can connect to a database and save output data from a data flow to a table in that database.

You can save a data flow to a database to securely store the data flow in the database and take advantage of its managed backup and recovery facility. You can transform the data source by overwriting it with the data flow data. The data source and data flow tables must be in the same database and have the same name. To successfully save a data flow to a database, it must be free of validation errors.

Create or open the data flow that you want to save as a database connection.
Click Add a step (+) and select Save Data . Or if you have already saved the data flow, then click the Save Data step.
In the Save Data Set pane, enter the Data Set name and Description to identify your data set.
Click Save data to list and select Database Connection.
Click Select connection to display the Save Data to Database Connection dialog.
Select a connection to save your data flow.
You must already have created a database connection before you can select one. See Creating Database Connections.

For example, you can save to an Oracle database, Apache Hive database, Hortonworks Hive database, or Map R Hive database.
Enter a name in the Table field.
The table name must conform to the naming convention of the selected database. For example, the name of a table in an Oracle database can’t begin with numeric characters.
Click Execute data flow to run the data flow. If there is no validation error, a completion message is displayed and the data is saved in the selected database using the table name that you specified.
- If you have a table in the database with the same name, the data in the table is overwritten when you save to the database.
- When you execute the data flow, the data is only saved to a database, but the data flow isn’t saved and isn't displayed in the Data Flows list. Therefore, the data flow isn’t available for you to modify or run.
Click Save As. In the Save Data Flow As dialog enter a Name and Description to identify your data flow.
Go to the Data page and select Data Flows to see your resulting data flow in the list.

Running a Data Flow

You can run a saved data flow to create a corresponding data set or to refresh the data in the data set created from the data flow.

You run the data flow manually to create or refresh the corresponding data set. For an existing data set, run the data flow if you know that the columns and data from the data set that was used to build the data flow have changed.

In the Data page, go to the Data Flows section, and locate the data flow that you want to run.
Click the data flow’s Actions menu and select Run.

Notes about running data flows:

To run a saved data flow, you must specify a Save Data step as its final step. To add this step to the data flow, click the data flow’s Actions menu and select Open. After you’ve added the step, save the data flow and try to run it again.
When creating a new database data source, set the database’s query mode to Live. Setting the query mode to Live allows the data flow to access data from the database (versus the data cache) and pushes any expensive operations such as joins to the database. See Managing Data Sets.
When you update a data flow that uses data from a database source, the data is either cached or live depending on the query mode of the source database.
Complex data flows take longer to run. While the data flow is running, you can go to and use other parts of the application, and then come back to the Data Flows pane to check the status of the data flow.
You can cancel a long-running data flow. To do so, go to the Data Flows section, click the data flow’s Action menu and select Cancel.
If it’s the first time you’ve run the data flow, then a new data set is created and you can find it in the Data Sets section of the Data page. The data set contains the name that you specify on the data flow’s Save Data step. If you’ve run the data flow before, then the resulting data source already exists and its data is refreshed.