Using Steps

You build data flows using steps to curate your data. Steps are functions that change your data in a specific way. For example, steps can aggregate values, perform time series analysis, or apply machine learning algorithms.

Step Use this step to: More Information
Add Columns Add a new output data column to your data flow using a wide range of functions, conditional expressions, and SQL operators. Add Columns in a Data Flow
Add Data Add a data source to your data flow. Add Data in a Data Flow
Aggregate Apply aggregate functions to group data in a data flow. Add Aggregates to a Data Flow
Analyze Sentiment Detect sentiment for a text column by applying a sentiment analysis to the data flow. Add a Sentiment Analysis to a Data Flow
Apply Model Apply a machine learning model to your data (also known as scoring a data model). Apply a Predictive or Oracle Machine Learning Model to a Data Set
Bin Assign your data values into categories, such as high, low, or medium. Create a Bin Column in a Data Flow
Branch Create multiple outputs from a data flow using a branch. Create Multiple Pipelines in a Data Flow Using a Branch
Create Essbase Cube Create an Essbase cube from a data set. Create and Customize an Essbase Cube in a Data Flow
Cumulative Value Group data by applying cumulative aggregate functions in a data flow. Add Cumulative Values to a Data Flow
Database Analytics Use advanced analytic functions, such as anomaly detection, unpivot, sampling, and advanced clustering (Requires Oracle database or Oracle Autonomous Data Warehouse). Add Database Analytics to a Data Flow
Filters Use filters to limit the data in a data flow output. Filter Your Data in a Data Flow
Group Create a group column of attribute values in a data set. Create a Group in a Data Flow
Join Join multiple tables or data sets. Add a Join in a Data Flow
Merge Columns Combine two or more columns in your data flow. Merge Columns in a Data Flow
Merge Rows Combine two or more rows in your data flow. Merge Rows in a Data Flow

Rename Columns Change the name of data columns to something more meaningful. Rename Columns in a Data Flow
Save Data Before running a data flow, modify or select the database name, attribute or measure, and aggregation rules for each column of the output data set. Save Output Data from a Data Flow
Save Model Change the default model name (untitled) and provide a description. Save Model
Select Columns Specify which data columns to include in your data flow. Select Columns to Include in a Data Flow
Split Columns Extract useful data from within data columns. Split Columns in a Data Flow
Time Series Forecast Apply a time series forecast calculation to a data set to create additional rows. Add a Time Series Forecast to a Data Flow
Train Binary-Classifier Train a machine learning model to classify your data into one of two predefined categories. Train a Binary Classifier Model in a Data Flow
Train Clustering Train a machine learning model to segregate groups with similar traits and assign them into clusters. Train a Clustering Model in a Data Flow
Train Multi-Classifier Train a machine learning model to classify your data into three or more predefined categories. Train a Multi-Classifier Model in a Data Flow
Train Numeric Prediction Train a machine learning model to predict a numeric value based on known data values. Train a Numeric Prediction Model in a Data Flow
Transform Column Modify data in a column using a wide range of functions, conditional expressions, and SQL operators. Transform Data in a Data Flow

Add Columns in a Data Flow

You can add columns to your target data and customize the format. For example, you might calculate the value of your stock by multiplying the number of units in a UNITS column by the sale price in a RETAIL_PRICE column.

Use the Add Columns step in the data flow editor.
  1. Click Add a step (+), and select Add Columns.
  2. In the Add Columns pane, use the expression builder to define your column. For example, to calculate the value of stock items you might specify UNITS * RETAIL_PRICE.
    Select SQL operators, functions, and conditional expressions from the expression pick list.
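If it helps to picture what the expression builder computes, here's the same UNITS * RETAIL_PRICE calculation sketched in pandas. This is an illustration only, not the product's implementation; the data values are invented:

```python
import pandas as pd

# Invented sample data using the UNITS and RETAIL_PRICE columns
# from the example above.
df = pd.DataFrame({
    "UNITS": [10, 4, 25],
    "RETAIL_PRICE": [2.50, 10.00, 1.20],
})

# The Add Columns step evaluates an expression such as UNITS * RETAIL_PRICE
# for every row and appends the result as a new output column.
df["STOCK_VALUE"] = df["UNITS"] * df["RETAIL_PRICE"]
```

Functions and conditional expressions work the same way: the expression is evaluated per row, and the result becomes the new column's value.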

Add Data in a Data Flow

When you create a new data flow and select a data set, you'll see a step with the name of your data set. You can add additional data from multiple data sources to your data flow.

Use the Add Data step in the data flow editor. If you've created a new data flow project, your data set will be selected.
  1. Use the options on the Add Data pane to configure the data set. For example, change the default name, or include and exclude columns.
  2. To add another data set to your flow, click Add a step (+), and select Add Data.
    If matching columns are found in data sets, a Join step is automatically added to enable you to define the relationship between the data sets. For example, you might want to combine rows from two data sets where the CustomerID in the first data set matches the CustomerID in the second data set.
  3. If you don't get a Join step automatically, click Add a step (+), and select Join.
    To complete the join, on the data flow diagram, click the circle on the dotted line between the data source step and the Join step. Then use the Join pane to configure the relationship between the data sets.
  4. Click your data set step again and use the options on the Add Data pane to configure the data set.
    Field Description
    Add Data - <Data source name> Click this pane heading to edit the step name and description.
    Select... Use this option to change the data set or data source. Changing the data set or data source might break other steps in your flow.
    When Run Prompt to select Data Set Select this option to be prompted to select the data set when the data flow is executed. For example, you might want to run the flow against a different data set each time it's executed.

Add Aggregates to a Data Flow

Create group totals by applying aggregate functions such as count, sum, and average.

Use the Aggregate step in the data flow editor.
  1. Click Add a step (+) and select Aggregate.
    In the Aggregate pane you'll see a suggested aggregate column for each numeric column.
  2. Use the options on the Aggregate pane to configure your aggregate:
    Field Description
    Aggregate Select the column that you want to aggregate.
    Function Select an aggregate function such as Sum, Average, Minimum, or Count to apply to the selected column.
    New column name Change the name of the aggregate column.
  3. Add or remove aggregates.
    • To remove an aggregate, select the aggregate and click X.
    • To see the Add Aggregate option, scroll to the bottom of the Aggregate pane.
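Outside the product, an Aggregate step configured with functions such as Sum and Count behaves roughly like a grouped aggregation. This pandas sketch uses invented REGION and REVENUE columns purely for illustration:

```python
import pandas as pd

# Invented sales data: the Aggregate step groups rows and applies
# functions such as Sum, Average, Minimum, or Count to each group.
df = pd.DataFrame({
    "REGION": ["East", "East", "West", "West"],
    "REVENUE": [100, 150, 200, 50],
})

# Roughly what an Aggregate step configured with Sum and Count produces:
# one row per group, one aggregate column per configured function.
agg = df.groupby("REGION", as_index=False).agg(
    TOTAL_REVENUE=("REVENUE", "sum"),
    ORDER_COUNT=("REVENUE", "count"),
)
```

The New column name field corresponds to the output names (TOTAL_REVENUE, ORDER_COUNT) chosen here.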

Add a Sentiment Analysis to a Data Flow

You can detect sentiment for a given text column by applying a sentiment analysis to your data flow.

Sentiment analysis evaluates text based on words and phrases that indicate a positive, neutral, or negative emotion. Based on the outcome of the analysis, a new column contains a Positive, Neutral, or Negative string type result.
Use the Analyze Sentiment step in the data flow editor.
  1. Click Add a step (+), and select Analyze Sentiment.
  2. In the Analyze Sentiment pane and Output section, specify an output column for the emotion result value.
  3. Optionally change the default column name 'emotion'.
  4. In the Analyze Sentiment pane and Parameters section, specify the value for Text to Analyze.
    Select a text column with natural language content to analyze.
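Oracle Analytics uses its own sentiment engine; the toy lexicon-based sketch below (invented word lists and review text) only illustrates the shape of the output, a new string column containing Positive, Neutral, or Negative for each row:

```python
import pandas as pd

# Invented toy lexicons; a real sentiment engine is far more sophisticated.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate"}

def emotion(text: str) -> str:
    """Classify text as Positive, Neutral, or Negative by word counts."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

df = pd.DataFrame({"REVIEW": ["great product, love it",
                              "terrible experience",
                              "arrived on time"]})
# The step adds a new string column (default name 'emotion') per row.
df["emotion"] = df["REVIEW"].apply(emotion)
```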

Create a Bin Column in a Data Flow

Use a bin to categorize your data by creating a new column based on the value of a measure. For example, you might categorize values for RISK into three bins for low, medium, and high.

Use the Bin step in the data flow editor.
  1. Click Add a step (+), and select Bin.
    You can also create bins when you add columns using the Add Columns step.
  2. Select the column whose values you want to categorize.
  3. Use the options on the Bin pane to configure your bin:
    Field Description
    Bin You'll see the column that you selected in Step 2. To categorize values in a different column, click the column name and select a different column.
    Method Specify how the data boundaries are calculated.
    • In the Manual method, the range is initially divided equally by the number of bins you specify; you can then rename each bin and adjust its range in the List View.
    • In the Equal Width method, the range of column values is divided into intervals of the same size, so each bin can contain a different number of rows. The edge bins accommodate very low or very high values in the column.
    • In the Equal Height method, each bin contains approximately the same number of elements (that is, records), so the interval widths can differ. This method is preferred for skewed data.
    Histogram View Based on the Method selected, the histogram range (width) and histogram count (height) of the bins are updated.
    List View If you select the Manual method, you can change the name of the bins, and you can define the range for each bin.

    Based on your changes, the data preview (for example, the bin column name) is updated.
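The difference between the Equal Width and Equal Height methods can be seen with pandas, whose cut and qcut functions implement the same two binning ideas. A sketch with an invented RISK measure:

```python
import pandas as pd

risk = pd.Series([5, 12, 18, 35, 60, 95])

# Equal Width: the overall value range is split into equal-sized
# intervals, so bins can hold different numbers of rows.
equal_width = pd.cut(risk, bins=3, labels=["Low", "Medium", "High"])

# Equal Height: each bin holds roughly the same number of rows,
# so the interval widths differ (better for skewed data).
equal_height = pd.qcut(risk, q=3, labels=["Low", "Medium", "High"])
```

Note how the outlier 95 sits alone in the equal-width High bin, while equal-height binning spreads the rows evenly across the three bins.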

Create Multiple Pipelines in a Data Flow Using a Branch

Create multiple outputs from a data flow using a branch. For example, if you have sales transactions data based on country, you might save data for United States in the first branch and data for Canada in the second branch.

Use the Branch step in the data flow editor.
  1. Click Add a step (+) and select Branch.
    You'll see a Branch step and two Save Data steps added to the data flow. Select the Branch step and use the Branch into option to add or remove branches. The minimum number of branches is two, and the maximum is five.
  2. To configure each branch, click the connection line between the Branch step and the Save Data step, click Add a step (+) and select a step type that processes your branch.
    For example, you might add a Filter to the first branch that saves data from United States, and add a Filter to the second branch that saves data from Canada. Or, you might use the Split Columns step to save some columns in the first branch and other columns in the second branch.
  3. Click each Save Data step and in the Save Data Set pane specify the properties for saving the output data sets.
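Conceptually, a branch with two filters splits one input into two saved outputs. A minimal pandas sketch of the United States/Canada example (invented data, not the product's implementation):

```python
import pandas as pd

# Invented sales transactions with a COUNTRY column.
sales = pd.DataFrame({
    "COUNTRY": ["United States", "Canada", "United States"],
    "AMOUNT": [100, 80, 120],
})

# Each branch applies its own Filter step to the same input,
# then saves its own output data set.
us_branch = sales[sales["COUNTRY"] == "United States"]
canada_branch = sales[sales["COUNTRY"] == "Canada"]
```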

Create and Customize an Essbase Cube in a Data Flow

Create an Essbase cube from a spreadsheet or database.

You can create Essbase cubes only for Oracle Analytics Cloud – Essbase. When selecting an Essbase connection for creating a cube, you might see remote connections to on-premises Oracle Essbase instances. You can't create a cube from data in on-premises Oracle Essbase instances.
Use the Create Essbase Cube step in the data flow editor.
  1. Click Add a step (+), and select Create Essbase Cube.
  2. In the Create Essbase Cube pane, specify the values for creating the cube such as connection and application name.
  3. To configure the input columns, do the following:
    1. Move the slider to enable the Customize Cube option.
    2. Select the number of rows you want to analyze and click Configure.
    3. Perform the following actions for each column in the Dimensions, Measure, and Skip sections:
      • Cut
      • Paste as Sibling
      • Paste as Child
      • Skip
      • Delete
      Section and Designation Type Cut Paste as Sibling Paste as Child Skip
      Dimension Header No No Yes No
      Dimension Yes Yes Yes Yes
      Generation Yes Yes Yes Yes
      Alias, Attribute, UDA Yes Yes No Yes
      Measure Header No No Yes No
      Measure Yes Yes Yes Yes
      Skip Header No No Yes No
      Skip Yes Yes No No
    4. Change the following column values:
      • Column name in the Data Elements column.
      • Designation type in the Treat As column.
  4. Select the When Run Prompt to specify Data Set option to apply parameters to change the default values when creating the Essbase cube.

Cut, Paste, and Skip Rules

The cut, paste, and skip actions you perform for each column follow pre-configured rules.

  • When you skip a column, it moves to the Skip section of the table. You can only paste a column as a sibling of the Skip header, or as a sibling of any skipped column.
  • Any columns that are pasted as a Measure follow the rule of the paste command. Measure hierarchies are allowed, but the designation type doesn’t change.
  • Paste as Child action for Dimension columns:
    • When a column is pasted as a child of the Dimensions header, the cut column is pasted as a Dimension.
    • When a column is pasted as a child of the Dimension column:
      • The cut column is pasted as a Generation.
      • If the Dimension column already has a Generation child, the existing Generation (and its children) becomes the children of the new Generation column.
    • When a column is pasted as a child of the Generation column:
      • The cut column is pasted as a child of the Generation if the cut column is an Alias, Attribute, or UDA.
      • The cut column is pasted as a Generation if the cut column isn’t an Alias, Attribute, or UDA.
    • Paste as Child for any Dimension column isn't allowed if the target is an Alias, Attribute, or UDA.
  • Paste as Sibling action for Dimension columns:
    • When a column is pasted as a sibling of a Dimension column, it's pasted as a Dimension.
    • When a column is pasted as a sibling of an Attribute, Alias, or UDA and it isn’t an Alias, Attribute, or UDA, the column is pasted as an Attribute.
    • Paste as Sibling for any Dimension column isn't allowed if the target is a Generation.

Designation Change Rules for Generation Columns

The Generation columns follow specific pre-configured rules when you change their designation type.

  • Generation to Attribute/Alias/UDA - If the Generation column has any children, they move up a level and become children of the Generation column’s parent.
  • Attribute/Alias/UDA to Generation - If the new Generation column has a sibling Generation column, the existing Generation column (and its children) become children of the new Generation column.

Add Cumulative Values to a Data Flow

You can calculate cumulative totals such as moving aggregate or running aggregate.

Use the Cumulative Value step in the data flow editor.
  1. Click Add a step (+), and select Cumulative Value.
  2. Use the options on the Cumulative Value pane to configure your aggregate.
    Field Description
    Aggregate Select the data column to calculate.
    Function Select the cumulative function to apply.
    Rows You can edit this field only for specific functions.
    New column name Change the aggregate column name.
    (+) Aggregate Add an aggregate column.
    (+) Sort Column Specify how you'd like to sort each new cumulative column.
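The two families of cumulative functions, running aggregates and moving aggregates, can be sketched in pandas (invented MONTH/REVENUE data, for illustration only):

```python
import pandas as pd

df = pd.DataFrame({"MONTH": [1, 2, 3, 4],
                   "REVENUE": [100, 80, 120, 90]})

# Running aggregate: the cumulative sum over all rows so far.
df["RUNNING_TOTAL"] = df["REVENUE"].cumsum()

# Moving aggregate: the average over a sliding window (here the
# last 2 rows); the Rows field controls the window size.
df["MOVING_AVG"] = df["REVENUE"].rolling(window=2, min_periods=1).mean()
```

Sort order matters for cumulative calculations, which is why the step lets you specify sort columns for each new cumulative column.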

Add Database Analytics to a Data Flow

Database analytics enable you to detect anomalies, cluster data, sample data, and unpivot data. Database analytics are executed in the database, not in Oracle Analytics, so you must be connected to an Oracle database or Oracle Autonomous Data Warehouse.

Use the Database Analytics step in the data flow editor.
Before you start, create a connection to your Oracle database or Oracle Autonomous Data Warehouse and use it to create a data set.
  1. In the data flow editor, click Add a step (+), and select Database Analytics.
    If you aren't connected to an Oracle database or Oracle Autonomous Data Warehouse, you won't see the Database Analytics option.
  2. At the Select Database Analytics page, select a function type then click OK.
    Function Types Description
    Dynamic Anomaly Detection Detect anomalies in your input data without a pre-defined model. For example, you might want to highlight unusual financial transactions.

    When you deploy this function with large data sets, configure the partition columns to maximize performance.

    Dynamic Clustering Cluster your input data without a pre-defined model. For example, you might want to characterize and discover customer segments for marketing purposes.

    When you deploy this function with large data sets, configure the partition columns to maximize performance.

    Un-pivoting Data Transpose data that's stored in columns into row format. For example, you might want to transpose multiple columns showing a revenue metric value for each year to a single revenue column with multiple value rows for the year dimension. You simply select the metric columns to transpose and specify a name for the new column. You'll get a new data set with fewer columns and more rows.
    Sampling Data Select a random sample of data from a table. You simply specify the percentage of rows you want to sample. For example, you might want to randomly sample ten percent of your data.
  3. On the Analytics Operation <type> pane, configure the operation.
    • Use the Outputs area to specify the data columns to analyze.
    • Use the Parameters area to configure options for the operation.
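The un-pivot and sampling operations have direct pandas analogues, which may help clarify what they produce (invented wide-format revenue data; this is a conceptual sketch, the real work happens in the database):

```python
import pandas as pd

# Wide data: one revenue column per year.
wide = pd.DataFrame({"PRODUCT": ["A", "B"],
                     "REV_2018": [10, 20],
                     "REV_2019": [30, 40]})

# Un-pivot: fewer columns, more rows -- one YEAR/REVENUE pair per row.
long = wide.melt(id_vars="PRODUCT", var_name="YEAR", value_name="REVENUE")

# Sampling: select a random fraction of the rows (here 50%).
sample = long.sample(frac=0.5, random_state=42)
```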

Filter Your Data in a Data Flow

You use filters to limit the amount of data included in the data flow output. For example, you might create a filter to limit sales revenue data to the years 2017 through 2019.

Use the Filter step in the data flow editor.
  1. Click Add a step (+), and select Filter.
  2. In the Filter pane, select the data element you want to filter:
    Field Description
    Add Filter (+) In the Available Data dialog, select the data element you want to filter. Alternatively, click Data Elements in the Data Panel, and drag and drop a data element onto the Filter pane.
    Filter fields Change the values, data or selection of the filter (for example, maximum and minimum range). Based on the data element, specific filter fields are displayed. You can apply multiple filters to a data element.
    Filter menu icon Select a function to clear the filter selection, or to disable or delete the filter.
    Filter pane menu icon Select a function to clear all filter selections, remove all filters, or auto-apply filters. You can also select to add an expression filter.
    Add Expression Filter Select to add an Expression Filter. Click f(x), select a function type, and then double-click to add a function in the Expression field.

    Click Apply.

    Auto-Apply Filters Select an auto-apply option for the filters, such as Default (On).

    The data preview is updated using the applied filter.
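The years-2017-through-2019 example above amounts to a simple range filter. Sketched in pandas with invented data (an illustration, not the product's filter engine):

```python
import pandas as pd

df = pd.DataFrame({"YEAR": [2016, 2017, 2018, 2019, 2020],
                   "REVENUE": [50, 60, 70, 80, 90]})

# Limit the data flow output to the years 2017 through 2019.
filtered = df[df["YEAR"].between(2017, 2019)]
```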

Create a Group in a Data Flow

You can categorize non-numeric data into groups that you define. For example, you might put orders for lines of business Communication and Digital into a group named Technology, and orders for Games and Stream into a group named Entertainment.

Use the Group step in the data flow editor.
  1. Click Add a step (+), and select Group.
  2. For each group that you want to create, use the Group pane:
    1. Use the list of columns to select the column you'd like to categorize. For example, to categorize orders by line of business, you might select LINE_OF_BUSINESS.
    2. (Optional) Click the group name to change the default name Group 1. For example, you might change Group 1 to Technology.
    3. (Optional) In the Name field, change the default name of the new column from new_name1 to a more meaningful name.
    4. In the center box, select one or more categories to add to the group. For example, to analyze line of business you might put Communication and Digital in a group named Technology.
      In the Preview Data pane, you'll see a new column with the groups that you defined displayed as the value for each row. For example, values might be Technology or Entertainment.
  3. To add more groups, click Group (+).
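A Group step is essentially a mapping from attribute values to group names. The Technology/Entertainment example sketched in pandas (invented data, illustration only):

```python
import pandas as pd

orders = pd.DataFrame({
    "LINE_OF_BUSINESS": ["Communication", "Digital", "Games", "Stream"],
})

# Each group maps the selected attribute values to a group name;
# the result appears as a new column in the data preview.
groups = {"Communication": "Technology", "Digital": "Technology",
          "Games": "Entertainment", "Stream": "Entertainment"}
orders["BUSINESS_GROUP"] = orders["LINE_OF_BUSINESS"].map(groups)
```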

Add a Join in a Data Flow

When you add data from multiple data sources to your data flow, you can join them on a common column. For example, you might join an Orders data set to a Customer_orders data set using a customer ID field.

When you use the Add Data step to add an extra data source, a Join step is automatically added to your data flow. But you can also manually add a Join step if you have more than one data source defined in your data flow.
Use the Join step in the data flow editor.
  1. Add the data sources you'd like to join.
  2. Select a data source, click Add a step, then click Join.
    You'll see a suggested connection with a node on the connection line.
  3. Click the node on the connection line to complete the connection.
  4. Use the options on the Join pane to configure your step.
    Field Description
    Keep rows Use these options to specify how you want to join your data. Click an option to preview your merged data (if you're displaying the Data Preview pane).
    Match columns Specify the common field on which you'd like to join the data sources.
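The Orders/Customer_orders example from the start of this section corresponds to a join on a common key. A pandas sketch with invented data (the Keep rows options correspond to the join type, inner, left, right, or outer):

```python
import pandas as pd

orders = pd.DataFrame({"CustomerID": [1, 2, 3], "Total": [100, 200, 300]})
customers = pd.DataFrame({"CustomerID": [1, 2], "Name": ["Ada", "Grace"]})

# Match columns = CustomerID; an inner join keeps only the rows
# whose CustomerID appears in both data sets (customer 3 is dropped).
joined = orders.merge(customers, on="CustomerID", how="inner")
```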

Merge Columns in a Data Flow

You can combine multiple columns into a single column. For example, you might merge the street address, street name, state, and ZIP code columns so that they display as one item in visualizations.

Use the Merge Columns step in the data flow editor.
  1. Click Add a step (+), and select Merge Columns.
  2. Use the options on the Merge Columns pane to configure your merge:
    • (+) Column field - Select more columns you want to merge.
    • Delimiter field - Select a delimiter to separate column names (for example, Space, Comma, Dot, or Custom Delimiter).
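Merging columns concatenates their values with the chosen delimiter. The address example sketched in pandas with invented columns (illustration only):

```python
import pandas as pd

df = pd.DataFrame({"STREET": ["12 Main St", "9 Elm St"],
                   "STATE": ["CA", "NY"],
                   "ZIP": ["94105", "10001"]})

# Concatenate the selected columns, separated by the chosen
# delimiter (here Comma followed by a space).
df["ADDRESS"] = df[["STREET", "STATE", "ZIP"]].agg(", ".join, axis=1)
```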

Rename Columns in a Data Flow

Rename columns to create more meaningful data column names in your generated data sets.

Use the Rename Columns step in the data flow editor.
  1. Click Add a step (+), and select Rename Columns.
  2. Use the Rename fields to specify a more meaningful name for columns in your generated data set.

Save Output Data from a Data Flow

For the data created by a data flow you can change the default name and description, specify where to save the data, and specify runtime parameters. If you're saving the output from your data flow to a database, before you start, create a connection to one of the supported database types.

Use the Save Data step in the data flow editor.
  1. Click Add a step (+) and select Save Data. Or, if you’ve already saved the data flow, then click the Save Data step.
  2. In the Save Data Set pane, optionally change the default Name and add a Description.
    If you don't change the default Name value, you'll generate a data set named 'untitled'. After you run this data flow, you'll see the generated data set in the Data Sets page (click Data from the navigator on the Home page).
  3. Click Save data to and select a location:
    • Choose Data Set Storage to save the output data in a data set in Oracle Analytics.
    • Choose Database Connection to save the output data in one of the supported database types.
  4. If you’ve selected Database Connection, specify details about the database connection.
    Before you start, create a connection to one of the supported database types.
    1. Click Select connection to display the Save Data to Database Connection dialog, and select a connection.

      You can save to a range of databases, including Oracle, Oracle Autonomous Data Warehouse, Apache Hive, Hortonworks Hive, and MapR Hive.

      To find out which databases you can write to, refer to the More Information column in Supported Data Sources.

    2. In the Table field, optionally change the default table name.
      The table name must conform to the naming conventions of the selected database. For example, the name of a table in an Oracle database can’t begin with numeric characters.
    3. In the When run field, specify whether you'd like to replace existing data or add new data to existing data.
  5. Select the When Run Prompt to specify Data Set option if you want to specify the name of the output data set or table at run time.
  6. In the Columns table, change or select the database name, the attribute or measure, and the aggregation rules for each column in the output data set:
    Column name Description
    Treat As Select how each output column is treated, as an attribute or measure.
    Default Aggregation

    Select the aggregation rules for each output column (such as Sum, Average, Minimum, Maximum, Count, or Count Distinct).

    You can select the aggregation rules if a specific column is treated as a measure in the output data set.

    Database Name

    Change the database name of the output columns.

    You can change the column name if you’re saving the output data from a data flow to a database.

When you run the data flow:
  • If you’ve selected data set storage, go to the Data page and select Data Sets to see your output data set in the list.

    • Click the Actions menu, or right-click, and select Inspect to open the data set dialog.

    • In the data set dialog, click Data Elements and check the Treat As and Aggregation rules that you’ve selected for each column in the Save Data step.

  • If you're saving output data to a database, go to the table in that database and inspect the output data.

Save Model

You can change the default name of your model and add a description.

Use the Save Model step in the data flow editor. You'll see this step added automatically in the data flow editor when you add one of the train model steps, for example, Train Numeric Prediction, or Train Binary Classifier.
  1. Add one of the train model steps to your data flow. For example, Train Numeric Prediction, or Train Binary Classifier.
  2. Click the Save Model step.
  3. In the Save Model pane, optionally change the default Model name, and specify a Model description to identify the model type and script used.
    If you don't change the default Model name value, you'll save a model named untitled. After you run this data flow, you'll see your new model in the Machine Learning page. Click Machine Learning from the navigator on the Home page to apply a saved model to your data.

Select Columns to Include in a Data Flow

Select which columns to include in your data flow. By default, all data columns are included in your data flow.

Use the Select Columns step in the data flow editor.
  1. Click Add a step (+), and select Select Columns.
  2. Use the on-screen options to select or remove columns.

Split Columns in a Data Flow

You can strip out useful data from columns of concatenated data. For example, if a column contains 001011Black, you might split this data into two separate columns, 001011 and Black.

Use the Split Columns step in the data flow editor.
Before you start, turn on Data Preview so that you can see the new columns as you configure the split. If your data source has many columns, use a Select Columns step to remove extraneous columns first to improve the preview.
  1. Click Add a step (+), and select Split Columns.
  2. Use the options on the Split Columns pane to configure the data flow.
    Field Description
    Split Column Click Select Column to specify the data column you'd like to split. If a column is already chosen, click the column name to choose a different column.
    On Specify whether to split the column by delimiter or by position.

    Select Delimiter if the column has separator characters, such as commas or spaces.

    Select Position if the column doesn't have separator characters. If you split on position, you can only create two new columns.

    Delimiter (Displayed when On is set to Delimiter) Specify the separator used in your data column (for example, space, comma, custom).
    Position (Displayed when On is set to Position) Specify where the second column starts. For example, if your column contains AABBBCCCDDD, specify 6 to put AABBB in the first column and CCCDDD in the second column.
    Number of parts to create Specify the number of new columns to create when On is set to Delimiter (you can't change the default value 2 if On is set to Position). For example, if your source data column contains AA BBBBB CCC DD, you might select 4 to put each sub-string into a different column.
    Occurrence Specify how many of the sub-strings in the source column to include in each new column.

    Examples based on data AA BBBBB CCC DD with Delimiter set to Space:

    • If you set Occurrence to 1, Number of parts to create to 1, the new column contains AA. If you set the Occurrence to 2, the new column contains AA BBBBB.
    • If you set Occurrence to 1, Number of parts to create to 2, the first new column contains AA and the second new column contains BBBBB CCC DD.
    • If you set Occurrence to 1, Number of parts to create to 4, the first new column contains AA, the second new column contains BBBBB, the third new column contains CCC, and the fourth new column contains DD.
    New column <number> name Change the default name (New column <number>) for new columns to a more meaningful name.

    Use the adjacent check box to display or hide new columns.
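Both split modes have simple string-operation analogues. A pandas sketch of the examples above (delimiter split of AA BBBBB CCC DD, and position split of 001011Black); invented frames, illustration only:

```python
import pandas as pd

df = pd.DataFrame({"CODE": ["AA BBBBB CCC DD"]})

# Delimiter = Space, Occurrence = 1, Number of parts to create = 2:
# split at the first space only.
parts = df["CODE"].str.split(" ", n=1, expand=True)
df["New column 1"], df["New column 2"] = parts[0], parts[1]

# On = Position, Position = 6: the second column starts at character 7.
pos = pd.DataFrame({"ITEM": ["001011Black"]})
pos["PART1"], pos["PART2"] = pos["ITEM"].str[:6], pos["ITEM"].str[6:]
```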

Add a Time Series Forecast to a Data Flow

You can calculate forecasted values by applying a Time Series Forecast calculation.

A forecast takes a time column and a target column from a given data set and calculates forecasted values for the target column and puts the values in a new column. All additional columns are used to create groups. For example, if an additional column 'Department' with values 'Sales', 'Finance', and 'IT' is present, the forecasted values of the target column are based on the past values of the given group. Multiple columns with diverse values lead to a large number of groups that affect the precision of the forecast. Select only columns that are relevant to the grouping of the forecast.
Use a Time Series Forecast step in the data flow editor.
  1. Click Add a step (+), and select Time Series Forecast.
  2. In the Time Series Forecast pane and Output section, specify an output column for the forecasted value.
  3. In the Time Series Forecast pane, configure your forecast calculation:
    Field Description
    Target Select a data column with historical values.
    Time Select a column with date information. Forecasted values use a daily grain.
    Periods Select the value that indicates how many periods (days) are forecasted per group.
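The forecasting algorithm Oracle Analytics uses isn't specified here; the sketch below substitutes a naive linear-trend extrapolation (in pandas/NumPy, with invented data) purely to show the shape of the result, new rows appended for the forecasted periods at a daily grain:

```python
import pandas as pd
import numpy as np

# Invented daily history for one group; SALES is the Target column
# and DAY is the Time column.
history = pd.DataFrame({
    "DAY": pd.date_range("2019-01-01", periods=5, freq="D"),
    "SALES": [10.0, 12.0, 14.0, 16.0, 18.0],
})

periods = 3  # the Periods field: how many days to forecast

# Naive stand-in for the real algorithm: fit a straight line to the
# history and extend it forward.
x = np.arange(len(history))
slope, intercept = np.polyfit(x, history["SALES"], 1)

future_x = np.arange(len(history), len(history) + periods)
forecast = pd.DataFrame({
    "DAY": pd.date_range(history["DAY"].iloc[-1] + pd.Timedelta(days=1),
                         periods=periods, freq="D"),
    "SALES": intercept + slope * future_x,
})

# The step's output: the original rows plus the forecasted rows.
result = pd.concat([history, forecast], ignore_index=True)
```

With grouping columns present, the step repeats this per group, which is why many diverse grouping columns dilute the data behind each forecast.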

Train a Binary Classifier Model in a Data Flow

You train a machine learning model using your existing data to evaluate how accurate the model is in predicting known outcomes.

Train a Binary Classifier model to evaluate how accurately it classifies your data into one of two predefined categories. For example, you might predict whether a product instance will pass or fail a quality control test.
Use the Train Binary Classifier step in the data flow editor.
  1. Click Add a step (+), and select Train Binary Classifier.
  2. At the Select Train Two-Classification Model Script dialog, select a script type, then click OK. For example, you might select Naive Bayes.
  3. Click Select a column and select the data column to analyze.
  4. Use the on-screen options to configure the script parameters.

Train a Clustering Model in a Data Flow

You train a machine learning model using your existing data to evaluate how accurate the model is in predicting known outcomes.

Train a Clustering model to evaluate how accurately it segregates groups with similar traits and assigns them into clusters. For example, you might assign your customers into clusters (such as big spenders, regular spenders, and so on) based on their purchasing habits.
Use the Train Clustering step in the data flow editor.
  1. Click Add a step (+), and select Train Clustering.
  2. At the Select Train Clustering Model Script dialog, select a script type, then click OK. For example, you might select Hierarchical Clustering for model training.
  3. Use the on-screen options to configure the script parameters.

Train a Multi-Classifier Model in a Data Flow

You train a machine learning model using your existing data to evaluate how accurate the model is in predicting known outcomes.

Train a Multi-Classifier model to evaluate how accurately it classifies your data into three or more predefined categories. For example, you might predict whether a piece of fruit is an orange, apple, or pear.
Use the Train Multi-Classifier step in the data flow editor.
  1. Click Add a step (+), and select Train Multi-Classifier.
  2. In the Select Train Multi-Classifier Model Script dialog, select a script type, then click OK. For example, you might select Naive Bayes.
  3. Click Select a column and select the data column to analyze.
  4. Use the on-screen options to configure the script parameters.
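Multi-classification assigns each row to one of three or more labels. As a deliberately simplified stand-in for the trained model (a nearest-centroid classifier, not Oracle's actual algorithm, with invented fruit measurements), the fruit example might look like this:

```python
from statistics import mean
from math import dist

def train_centroids(rows, labels):
    """Per class, compute the mean feature vector (centroid)."""
    return {cls: [mean(col) for col in
                  zip(*[r for r, l in zip(rows, labels) if l == cls])]
            for cls in set(labels)}

def classify(centroids, row):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda cls: dist(centroids[cls], row))

# [weight_g, diameter_cm] for three fruit categories (toy data).
X = [[190, 7.5], [200, 7.8], [150, 7.0], [140, 7.2], [170, 6.0], [175, 6.2]]
y = ["orange", "orange", "apple", "apple", "pear", "pear"]
model = train_centroids(X, y)
print(classify(model, [195, 7.6]))  # → orange
```

Scoring rows whose true category is already known is how you evaluate the classifier's accuracy.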

Train a Numeric Prediction Model in a Data Flow

You train a machine learning model using your existing data to evaluate how accurate the model is in predicting known outcomes.

Train a Numeric Prediction model to evaluate how accurately it predicts a numeric value based on known data values. For example, you might predict the value of a property based on square-footage, number of rooms, zip code, and so on.
Use the Train Numeric Prediction step in the data flow editor.
  1. Click Add a step (+), and select Train Numeric Prediction.
  2. In the Select Train Numeric Prediction Model Script dialog, select a script type, then click OK. For example, you might select Random Forest for Numeric model training.
  3. Click Select a column and select the data column to analyze.
  4. Use the on-screen options to configure the script parameters.
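Numeric prediction fits a function from known feature values to a numeric target. The smallest possible version of that idea is a one-variable least-squares regression on the property example (toy figures; a Random Forest model would fit an ensemble of decision trees over many features instead):

```python
def fit_linear(xs, ys):
    """Least-squares line y = a + b*x for predicting a numeric target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Square footage vs. sale price (toy data where price = 100 * sqft).
sqft  = [1000, 1500, 2000, 2500]
price = [100000, 150000, 200000, 250000]
a, b = fit_linear(sqft, price)
print(round(a + b * 1800))  # → 180000
```

Comparing predictions like `a + b * 1800` against known sale prices is the evaluation the training step supports.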

Transform Data in a Data Flow

You can transform the column data of a data set in a data flow.

You can also quickly transform the data in a column by using the column menu option in Data Preview. The list of available menu options for a column depends on the type of data in that column. You can perform the following types of data transforms:
  • Update or modify the data in a column.
  • Group or merge multiple columns in a data set.
  • Add a column to or remove a column from a data set.
Use the Transform Column step in the data flow editor.
  1. To add a data transform step, do one of the following:
    • Click Add a step (+), select Transform Column, then select a column.
    • Drag and drop the Transform Column step from the Data Flow Steps panel to the workflow diagram panel and select a column.
    • Select a column in the Preview data panel and click Options, then select a transform option. See Column Menu Options for Quick Data Transformations.
  2. In the Step editor pane, compose an expression or update the fields to configure the changes. You can review the changes in the Preview data panel.
    If you’re composing an expression, do the following:
    • Click Validate to check if the syntax is correct.
    • If the expression is valid, click Apply to transform the column data.
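Conceptually, a validated transform applies an expression to every value in the chosen column while leaving the other columns untouched. A minimal sketch of that behavior (the helper name and sample rows are invented, not part of the product):

```python
def transform_column(rows, column, expr):
    """Replace each value in `column` with expr(value), the way an applied
    expression in the Transform Column step rewrites one column per row."""
    return [{**r, column: expr(r[column])} for r in rows]

rows = [
    {"name": "alice smith", "amount": "120"},
    {"name": "bob jones",   "amount": "75"},
]

# An UPPER(name)-style transform, then a numeric cast on another column.
rows = transform_column(rows, "name", str.upper)
rows = transform_column(rows, "amount", int)
print(rows[0])  # → {'name': 'ALICE SMITH', 'amount': 120}
```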

Merge Rows in a Data Flow

You can merge the rows of two data sources (known as a UNION operation in SQL).

Before you merge the rows, do the following:

  • Confirm that each data set has the same number of columns.
  • Check that the data types of the corresponding columns of the data sets match. For example, column 1 of data set 1 must have the same data type as column 1 of data set 2.
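These two prerequisites, and the union itself, can be sketched as a pre-flight check before concatenating rows. The function and sample data sets are illustrative only:

```python
def union_rows(ds1, ds2):
    """UNION-style merge of two data sets (lists of row tuples), enforcing
    the prerequisites: same column count and matching column types,
    position by position (checked against the first row of each set)."""
    if ds1 and ds2:
        if len(ds1[0]) != len(ds2[0]):
            raise ValueError("data sets must have the same number of columns")
        for i, (a, b) in enumerate(zip(ds1[0], ds2[0])):
            if type(a) is not type(b):
                raise TypeError(f"column {i + 1} data types don't match")
    return ds1 + ds2

order  = [("A-1", 100.0), ("A-2", 250.0)]
orders = [("B-9", 75.0)]
print(union_rows(order, orders))
# → [('A-1', 100.0), ('A-2', 250.0), ('B-9', 75.0)]
```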
Use the Union Rows step in the data flow editor.
  1. In your data flow, add the data sources you want to merge.
    For example, you might add data sets named Order and Orders.
  2. On one of the data sources, click Add a step (+) and select Union Rows.
    You'll see a suggested connection with a node on the connection line.
  3. Click the node on the connection line to complete the connection.
  4. Use the options on the Union Rows pane to configure your step.
    Field Description
    Keep Use these options to specify how you want to merge your data. Click an option to display an explanatory diagram and preview your merged data (if you're displaying the Data Preview pane).