Applying Time-series Clustering

The historical transactional behavior of a customer and how it evolves can have a bearing on the risk profile of the customer. Certain patterns of behavior may be more indicative of suspicious activity than others.

The Time Series Clustering feature allows the models to identify the patterns of behavior that correlate with suspicious activity, as determined by a case or a Suspicious Activity Report (SAR).

This feature allows you to partition time series of transactional activity into clusters. The model then learns how each cluster correlates to the likelihood of a SAR or a CASE. Based on which cluster a given customer’s transactional activity falls into, the customer’s risk score is appropriately adjusted.
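One simple way to picture how a cluster can relate to the likelihood of a SAR is the share of customers in each cluster that resulted in one. The sketch below is illustrative only: the column names `CLUSTER_ID` and `SAR_FLAG` and the sample values are assumptions, and the model itself may learn this relationship in a different way.

```python
import pandas as pd

# Hypothetical data: one row per customer with an assigned cluster and a SAR flag.
customers = pd.DataFrame({
    "CLUSTER_ID": [1, 1, 2, 2, 2, 2],
    "SAR_FLAG": [1, 0, 0, 0, 1, 0],
})

# Share of customers in each cluster that resulted in a SAR.
sar_rate = customers.groupby("CLUSTER_ID")["SAR_FLAG"].mean()
print(sar_rate)
```

A cluster with a higher rate would push the risk score of its members up, and a cluster with a lower rate would push it down.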

Stage-1 data is time-series data, so it must be collapsed into a single observation for each group-by level, such as customer and model group. The time-series function returns the following variable types to cover both aspects of time-series data:

1. Trend variable to focus more on the magnitude (above or below the mean).

2. Direction variable to focus more on the direction (increasing or decreasing).
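As a rough illustration of the two variable types, the sketch below collapses monthly rows into one observation per entity and derives a magnitude pattern and a direction pattern. The helper name `collapse_series`, the 0/1 encoding, and the sample values are assumptions made for this illustration; the encoding that TsClustering uses internally is not described here.

```python
from collections import defaultdict
from statistics import fmean


def collapse_series(rows):
    """Collapse (entity_id, month_id, value) rows into one record per entity.

    Hypothetical helper for illustration only, not part of ofs_auto_ml.
    """
    by_entity = defaultdict(list)
    for entity_id, month_id, value in rows:
        by_entity[entity_id].append((month_id, value))

    collapsed = {}
    for entity_id, points in by_entity.items():
        values = [v for _, v in sorted(points)]  # order by month
        mean = fmean(values)
        # Magnitude: 1 where the month is above the entity's own mean, else 0.
        trend = [1 if v > mean else 0 for v in values]
        # Direction: 1 where the value rose versus the previous month, else 0.
        direction = [1 if b > a else 0 for a, b in zip(values, values[1:])]
        collapsed[entity_id] = {"trend": trend, "direction": direction}
    return collapsed


sample_rows = [
    ("E1", 202301, 10.0), ("E1", 202302, 20.0),
    ("E1", 202303, 15.0), ("E1", 202304, 30.0),
]
result = collapse_series(sample_rows)
print(result)
```

Each entity ends up with a single record, which is the shape a clustering step needs, while the two patterns keep the magnitude and direction information separate.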

Outliers in the dataset can lead the clustering process to form a misleading number of clusters. To avoid this, a two-step approach is followed:

· Uni-variate outlier capping: Cap the outliers at the 95th percentile.

· Multi-variate outlier treatment: Identify the outliers in the dataset based on the low-frequency count passed in the input.

NOTE

· These observations (outliers) are not part of the actual clustering process.

· The outliers are scored later, after the clusters have been identified using the inliers.
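The two-step treatment above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, the percentile uses the pandas default linear interpolation, the cluster labels are assumed to come from an earlier clustering pass, and treating clusters with fewer than `min_freq` members as outliers is one reading of the low-frequency rule.

```python
import pandas as pd


def cap_at_percentile(features, q=0.95):
    """Uni-variate step: cap each numeric column at its q-th quantile."""
    upper = features.quantile(q)
    return features.clip(upper=upper, axis=1)


def flag_low_frequency(labels, min_freq=5):
    """Multi-variate step: mark rows whose cluster has fewer than min_freq members.

    Assumption: 'fewer than' is the comparison; the product may use a different rule.
    """
    cluster_sizes = labels.map(labels.value_counts())
    return cluster_sizes < min_freq


# Hypothetical feature column with values 1.0 to 100.0.
features = pd.DataFrame({"ATM_TRXN_IN_AM": [float(v) for v in range(1, 101)]})
capped = cap_at_percentile(features)

# Hypothetical labels from a prior clustering pass.
labels = pd.Series([0, 0, 0, 1, 2, 2])
is_outlier = flag_low_frequency(labels, min_freq=2)
```

Rows flagged by the second step would be set aside before the final clustering and scored afterwards against the clusters formed from the remaining inliers.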

The following are the inputs:

· Class constructor arguments: TsClustering

§ bmp_type: Type of feature extraction - clip or trend

§ max_clus: Maximum number of clusters to be considered (Default=20)

§ multiv_max_clus: Maximum number of clusters to be considered for outlier treatment (Default=100)

§ multiv_max_freq: Value of the lowest frequency to be considered for moving observations to outliers. This is the minimum number of observations in a cluster on the basis of which the cluster can be considered an outlier. (Default=5)

· Method arguments: fit_transform and transform

§ X: Behavioral data as pandas data frame

§ key_var: Name of the ID variable present in the input data frame. The default is ENTITY_ID.

§ ts_var: Name of the time series variable present in the input data frame. The default is MONTH_ID.

§ feature_include: List of features to be included for Time-series Clustering

§ feature_exclude: List of features to be excluded for Time-series Clustering

The following illustration shows an example of transformation applied to time-series clustering on the OSIT model:

%python

# Required imports specific to the implementation.
from ofs_auto_ml.feature_transform import ts_clustering as ts

# Create the clustering object.
ts_obj = ts.TsClustering(bmp_type=['clip', 'trend'], multiv_max_clus=100, multiv_max_freq=5, max_clus=20)

# Call the fit_transform method on the behavioral data.
ts_pdf = ts_obj.fit_transform(
    B_OSIT_PDF,
    key_var="ENTITY_ID",
    ts_var="MONTH_ID",
    feature_include=['ATM_TRXN_IN_AM', 'ATM_TRXN_IN_CT', 'ATM_TRXN_OUT_AM', 'ATM_TRXN_OUT_CT'],
)

The output is the returned pandas data frame, which contains the transformed time-series variables, as shown in the following example:

%python

print( "Dimension : ", list( ts_pdf.shape ) )

z.show( ts_pdf.head() )

NOTE

· New customers (a scenario where the data is not available for all the months) are assigned a constant Cluster ID, that is, 0.

· Either the feature_include or the feature_exclude parameter must be NULL. If both are NULL, all the input attributes are considered for clustering.
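The include/exclude rule in the note can be expressed as a small selection routine. The helper `resolve_features` below is hypothetical and only mirrors the rule as stated; Python's `None` stands in for NULL, and leaving the key and time-series columns out of the candidate list is an assumption of this sketch.

```python
def resolve_features(columns, key_var="ENTITY_ID", ts_var="MONTH_ID",
                     feature_include=None, feature_exclude=None):
    """Hypothetical helper mirroring the include/exclude rule; not the library's code."""
    if feature_include is not None and feature_exclude is not None:
        raise ValueError("Set feature_include or feature_exclude, not both.")
    # Assumption: the ID and time-series columns are never clustered on.
    candidates = [c for c in columns if c not in (key_var, ts_var)]
    if feature_include is not None:
        return [c for c in candidates if c in feature_include]
    if feature_exclude is not None:
        return [c for c in candidates if c not in feature_exclude]
    return candidates  # both NULL: use every input attribute


columns = ["ENTITY_ID", "MONTH_ID", "ATM_TRXN_IN_AM", "ATM_TRXN_IN_CT", "ATM_TRXN_OUT_AM"]
all_features = resolve_features(columns)
included = resolve_features(columns, feature_include=["ATM_TRXN_IN_AM"])
excluded = resolve_features(columns, feature_exclude=["ATM_TRXN_IN_AM"])
```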