Transformations

The following transformations can be applied from the UI.

Table 8-5 Transformations

No. Transformation Function
1. Add New Feature

A new feature, derived from the existing features using Script, can be added to the dataset.

Physical Feature Name and Feature Name specify the names of the new feature.

Script is used to create a pandas Series for the new feature.

2. Encode Categorical Features

This function performs One Hot Encoding on a categorical feature and replaces it with multiple numerical features in the dataset.

3. Encode Datetime Features

This function encodes a datetime feature and replaces it with multiple numerical features having the following information derived from the datetime feature: year, month, week, day, hour, minute, dayofweek.

4. Encode Cyclical Features

This function encodes a cyclical feature (for example, hour or minute data) and returns two features carrying the sine and cosine transformations of the cyclical data.

'fmax' denotes the maximum possible value of the cyclical feature data.

5. Impute Missing Data

This function imputes missing data within a feature.

For numerical features, there are four methods for imputing missing data:

simple - imputes with the mean, median, or most_frequent value, depending on the chosen arg value, using the SimpleImputer in sklearn.

const - fills the missing values with the value given in arg.

knn - imputes using the KNNImputer in sklearn with the k value given in arg.

mice - imputes using the IterativeImputer in sklearn.

For non-numerical data, missing values can be imputed using the 'const' method, which replaces all missing values with the value given in arg.

6. Feature Scaling

This function is used to scale multiple selected numerical features using the StandardScaler in sklearn.

7. Dimensionality Reduction

This function performs PCA on the selected numerical features to reduce dimensionality, using the PCA class in sklearn.decomposition.

The number of output features can be specified using the dim field.

The names of the output features can be specified in the 'Physical Feature Name' and 'Feature Name' fields.

8. Outlier Removal

This function is used to remove outliers present in a feature based on the specified zscore value.

Non-numerical features are label encoded before removing the outliers.

9. Duplicates Removal - Data Frame

This function removes all duplicate rows in the data frame.

10. Duplicates Removal - Feature

This function identifies rows that are duplicates within a specified subset of features and removes those rows from the data frame.

11. Filter Features

This function filters the data frame based on conditions specified on features.

Operations allowed: >, >=, =, !=, <, <=, isin

When the chosen operation is 'isin', the input to 'Filter Value' is a list of values; only rows whose feature value appears in that list are kept in the output data frame.
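
As an illustration of transformations 2 and 3 in Table 8-5, the sketch below reproduces the described behavior with plain pandas. It is not the product's implementation: the sample data, the use of `pd.get_dummies`, the ISO week number, and the output column names (`ts_year` and so on) are all assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "ts": pd.to_datetime(["2024-01-15 08:30", "2024-06-03 17:45", "2024-12-31 23:59"]),
})

# Transformation 2: replace the categorical feature with one numerical
# column per category (assumed equivalent of One Hot Encoding).
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

# Transformation 3: derive numerical parts from the datetime feature.
# Column names are hypothetical; week is taken as the ISO week number.
ts = df["ts"].dt
datetime_parts = pd.DataFrame({
    "ts_year": ts.year,
    "ts_month": ts.month,
    "ts_week": ts.isocalendar().week,
    "ts_day": ts.day,
    "ts_hour": ts.hour,
    "ts_minute": ts.minute,
    "ts_dayofweek": ts.dayofweek,  # Monday is 0 in pandas
})
```

The names the UI actually gives to the generated features may differ from the ones used here.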
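
The sine and cosine encoding of transformation 4 can be sketched as follows. This is a minimal example under stated assumptions: the helper name `encode_cyclical`, the output column suffixes, and the mapping 2π·x/fmax onto the unit circle are illustrative choices, and the product may normalize the values differently.

```python
import numpy as np
import pandas as pd

def encode_cyclical(df: pd.DataFrame, feature: str, fmax: float) -> pd.DataFrame:
    """Return the sine and cosine features for a cyclical column (hypothetical helper)."""
    # Assumed mapping: a value of fmax corresponds to one full turn.
    angle = 2 * np.pi * df[feature] / fmax
    return pd.DataFrame({
        f"{feature}_sin": np.sin(angle),
        f"{feature}_cos": np.cos(angle),
    })

hours = pd.DataFrame({"hour": [0, 6, 12, 18]})
encoded = encode_cyclical(hours, "hour", fmax=24)
```

Using both components keeps values that are close on the cycle (for example, hour 23 and hour 0) close in the encoded space, which a single raw number does not.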
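
The four imputation methods of transformation 5 map onto scikit-learn roughly as shown below. This is a sketch, not the product's code: the sample data and the chosen arg values are made up, and using `fillna` for the 'const' method is an assumption. Note that `IterativeImputer` is still experimental in scikit-learn and needs the explicit enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come first
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

numeric = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

def as_frame(values):
    """Wrap an imputer's NumPy output back into a DataFrame."""
    return pd.DataFrame(values, columns=numeric.columns, index=numeric.index)

# simple: arg selects the strategy ("mean", "median" or "most_frequent")
simple = as_frame(SimpleImputer(strategy="median").fit_transform(numeric))

# const: arg is the fill value (plain fillna is an assumption here)
const = numeric.fillna(0.0)

# knn: arg is the number of neighbours k
knn = as_frame(KNNImputer(n_neighbors=2).fit_transform(numeric))

# mice: iterative, model-based imputation
mice = as_frame(IterativeImputer(random_state=0).fit_transform(numeric))

# Non-numerical data: only the 'const' method applies
labels = pd.Series(["red", None, "blue"])
labels_filled = labels.fillna("unknown")
```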
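
Transformations 6 and 7 can be illustrated with the two scikit-learn classes named in the table. In this sketch the sample data, the output names `pc_1` and `pc_2`, and the decision to scale before running PCA are assumptions; in the UI the two transformations are applied independently.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.1, 5.9, 8.0],
    "x3": [0.5, 0.1, 0.4, 0.2],
})
selected = ["x1", "x2", "x3"]

# Transformation 6: rescale each selected feature to zero mean and unit variance.
scaled = pd.DataFrame(StandardScaler().fit_transform(df[selected]), columns=selected)

# Transformation 7: project the selected features onto `dim` principal components.
dim = 2
components = PCA(n_components=dim).fit_transform(scaled)
reduced = pd.DataFrame(components, columns=[f"pc_{i + 1}" for i in range(dim)])
```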
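
The z-score rule behind transformation 8 can be sketched as below. The table does not spell out the exact rule, so this example assumes that a row is dropped when the absolute z-score of its value exceeds the specified threshold, and it uses `pd.factorize` as a stand-in for label encoding of non-numerical features. The helper name `remove_outliers` is hypothetical.

```python
import pandas as pd

def remove_outliers(df: pd.DataFrame, feature: str, zscore: float) -> pd.DataFrame:
    """Drop rows whose |z-score| on `feature` exceeds `zscore` (assumed rule)."""
    values = df[feature]
    if not pd.api.types.is_numeric_dtype(values):
        # Stand-in for label encoding: map each distinct value to an integer code.
        codes, _ = pd.factorize(values)
        values = pd.Series(codes, index=df.index)
    std = values.std(ddof=0)
    if std == 0:
        # A constant feature has no outliers under this rule.
        return df.copy()
    z = (values - values.mean()) / std
    return df[z.abs() <= zscore]

readings = pd.DataFrame({"value": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]})
cleaned = remove_outliers(readings, "value", zscore=2.0)
```

In this sample the reading of 100 has a z-score of about 3.0 and is removed, while the other nine rows stay well below the threshold of 2.0.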
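
Transformations 9 to 11 correspond closely to standard pandas operations, as the following sketch shows. The helper `filter_features`, the sample data, and the choice to keep the first occurrence of each duplicate are assumptions made for illustration; the product may keep or drop duplicates differently.

```python
import operator
import pandas as pd

# Comparison operations listed for Filter Features ('isin' is handled separately).
OPS = {
    ">": operator.gt, ">=": operator.ge, "=": operator.eq,
    "!=": operator.ne, "<": operator.lt, "<=": operator.le,
}

def filter_features(df: pd.DataFrame, feature: str, op: str, value) -> pd.DataFrame:
    """Keep the rows where `feature <op> value` holds (hypothetical helper)."""
    if op == "isin":
        mask = df[feature].isin(value)  # value is a list of allowed values
    else:
        mask = OPS[op](df[feature], value)
    return df[mask]

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
    "temp": [3, 18, 3, 22],
})

deduped = df.drop_duplicates()                 # 9: whole-row duplicates
by_city = df.drop_duplicates(subset=["city"])  # 10: duplicates within a subset of features
warm = filter_features(df, "temp", ">=", 18)   # 11: comparison filter
picked = filter_features(df, "city", "isin", ["Oslo", "Lima"])  # 11: isin filter
```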