Transformer Component
=============================

The transformer component provides an easy way to perform simple in-memory transformations on the input data. Some examples of how a transformer can be used are:

* To normalise an input feature.
* To change the scale of the selected items in a feature.
* To modify or append columns based on existing columns.
* To convert the data type of a feature.

Transformers work as a chain and so take in a list as input. The order of the list is important: the framework runs the input data sequentially through the list of transformers and sends the final output of the chain to metrics and other components.

.. warning::
    The order in which you pass your transformers must be correct. The final dataframe must contain all the feature columns defined in the schema provided to Insights.

As with other components, the transformer interface can be extended by the user with custom logic.

How do they work
--------------------

Transformers expect a dataframe as input and produce a dataframe as output. The first transformer in the chain has access to the full dataframe created by the reader (based on the input data and the input schema). The interface method to transform is as follows:

.. code-block:: python

    def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:

As mentioned above, be careful not to unintentionally change the columns so that they no longer match the input schema; for example, don't drop a feature column. If the change is intentional, however, you can provide a modified schema through the transformer's interface:

.. code-block:: python

    def get_output_schema(self, input_schema: pa.Schema, **kwargs: Any) -> pa.Schema:

.. warning::
    Transformers are meant to be applied on a single row. Don't create a transformation that requires a backward seek, a forward seek, or access to the entire data set, for example a group by.
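As an illustration, here is a minimal sketch of what a custom transformer could look like. It only follows the two method signatures shown above; the class name, the derived CommuteLengthKm column, and the omission of the framework's base transformer class are assumptions made for this example.

.. code-block:: python

    from typing import Any

    import pandas as pd
    import pyarrow as pa


    class CommuteLengthKmTransformer:
        # Hypothetical transformer: in practice it would extend the framework's
        # transformer interface and be passed to the builder like any other transformer.

        def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
            # Row-level operation only: no group by, no backward or forward seek.
            data_frame["CommuteLengthKm"] = data_frame["CommuteLength"] * 1.609344
            return data_frame

        def get_output_schema(self, input_schema: pa.Schema, **kwargs: Any) -> pa.Schema:
            # Declare the appended column so downstream components see the modified schema.
            return input_schema.append(pa.field("CommuteLengthKm", pa.float64()))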
Conditional Feature
-------------------

One of the most important transformers is the conditional feature. The conditional feature lets you write Python expressions to transform the data without the need to write custom transformer classes. The conditional feature has many use cases. For example:

* Transform unstructured data into structured data.
* Create a composite feature by applying some logic to many columns (of the same row).
* Create variations of a single feature, for example normalisation.

How does it work
~~~~~~~~~~~~~~~~~~~~

Here is an example to demonstrate how the conditional feature works. Let's assume we have data with the following input features:

* Gender
* JobFunction
* CommuteLength

These features are present in the input data. While we can create many metrics for them, such as sum or mean, users often have additional use cases to better understand their data. For example, a user might want a metric on how many female software developers are present in the data, or how many employees have a very long commute.

One option is for the user to run their own ETL and add this logic to the original input data. However, this can be quite time-consuming and may require additional infrastructure or set-up. ML Insights offers a simpler solution: you can write Python expressions (with row-level operations) to create additional features in memory, and you can define any available metric (based on the created feature's data type, variable type, and column type) on them.

.. note::
    Keeping true to the design principle of transformers, conditional features in no way alter the original data, nor do they persist any copies of it in external storage. All conditional features are generated in-memory and hence are transient.

Keeping this in mind, let's see how we can define new conditional features and pass them on to the builder.

#. First, import the required classes.

   .. code-block:: python

       from mlm_insights.core.transformers.conditional_feature_transformer import ConditionalFeatureMetadata, ConditionalFeatureTransformer
       from mlm_insights.core.transformers.expression_evaluator import Expression, ExpressionType
       # FeatureMetadata, FeatureType, DataType and VariableType (used in the next step)
       # must also be imported from their respective mlm_insights modules.

#. Next, define the feature types (as we would have done for a feature coming through the input data).

   .. code-block:: python

       transformers = []
       feature_female_sde = FeatureMetadata(
           feature_name="FemaleSoftwareDeveloper",
           feature_type=FeatureType(data_type=DataType.INTEGER,
                                    variable_type=VariableType.CONTINUOUS))
       feature_long_travel_time = FeatureMetadata(
           feature_name="LongTravelTime",
           feature_type=FeatureType(data_type=DataType.INTEGER,
                                    variable_type=VariableType.CONTINUOUS))

#. Construct the conditional feature objects with the appropriate logic expressions.

   .. code-block:: python

       conditional_features = [
           ConditionalFeatureMetadata(
               expression=Expression(
                   value="(df['Gender'] == 'Female') & (df['JobFunction'] == 'Software Developer')",
                   type=ExpressionType.python),
               feature_metadata=feature_female_sde),
           ConditionalFeatureMetadata(
               expression=Expression(
                   value="df['CommuteLength'] > 5",
                   type=ExpressionType.python),
               feature_metadata=feature_long_travel_time)]

       transformers.append(ConditionalFeatureTransformer.create(
           config={'conditional_features_metadata_config': conditional_features}))

#. Finally, pass the transformer list containing the conditional feature transformer to the builder object.

   .. code-block:: python

       InsightsBuilder().with_transformers(transformers=transformers)

What we did here is create two conditional features:

* FemaleSoftwareDeveloper - We used the logic Gender = 'Female' and JobFunction = 'Software Developer' to create a new feature which is 1 if the row represents a female software developer and 0 otherwise.
* LongTravelTime - We used the logic CommuteLength > 5 to identify employees with a long commute time.

We can now add any metric, such as count, on these features to gain more insight into them.
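To see concretely what the two expressions compute, here is a plain-pandas sketch on a tiny, made-up dataframe. It only illustrates the row-level logic (including the cast of the boolean result to the declared integer type); it is not part of the ML Insights API.

.. code-block:: python

    import pandas as pd

    # Tiny, hypothetical sample of the input data.
    df = pd.DataFrame({
        "Gender": ["Female", "Male", "Female"],
        "JobFunction": ["Software Developer", "Accountant", "Software Developer"],
        "CommuteLength": [7, 3, 2],
    })

    # The same row-level expressions as above, cast to 0/1 integers.
    df["FemaleSoftwareDeveloper"] = ((df["Gender"] == "Female")
                                     & (df["JobFunction"] == "Software Developer")).astype(int)
    df["LongTravelTime"] = (df["CommuteLength"] > 5).astype(int)

    print(df[["FemaleSoftwareDeveloper", "LongTravelTime"]])
    # Row 0 is a female software developer with a long commute, row 1 is neither,
    # and row 2 is a female software developer with a short commute.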