8 Managing Rulesets

A ruleset is a set of rules that is applied to the defined source and target entities. It compares the attributes of the entities to derive a match. For information on matching rulesets, see the Financial Crime Graph Model Matching Guide.

Accessing the Rulesets

To access the Rulesets page, follow these steps:

1. Navigate to the FCC Studio workspace.

2. Click the Navigation Menu in the upper-left corner.

The menu items are listed.

3. Click Ruleset.

The Ruleset page is displayed with all the out-of-the-box rulesets.

Figure 1: Ruleset Page

Creating Rulesets

To create a ruleset, follow these steps:

1. Navigate to the Ruleset page.

2. Click Create.

The Ruleset Details page is displayed.

3. Enter the following details.


Field	Description
Name	Indicates the name of the ruleset.
Description	Indicates the additional description given for the ruleset.
Scoring Aggregation Type	Indicates the scoring aggregation method. Select one of the following options: · Maximum: Considers the highest score obtained out of all the rules created for a ruleset. · Minimum: Considers the lowest score obtained out of all the rules created for a ruleset.
Set Threshold	Indicates the threshold value set for a ruleset. A Similarity Edge is generated only when the maximum score obtained for a ruleset is equal to or higher than the threshold value.
Source	Indicates the source entity (node). The values are auto-populated from the metadata table that contains the Elasticsearch index names generated by running the Sqoop job.
Target	Indicates the target entity (node). The values are auto-populated from the metadata table that contains the Elasticsearch index names generated by running the Sqoop job.
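
The interaction between Scoring Aggregation Type and Set Threshold can be sketched as follows. This is a minimal illustration and not FCC Studio code: the function names are hypothetical, and the scores are assumed to lie on a 0 to 1 scale.

```python
def aggregate_score(rule_scores, aggregation_type):
    """Combine the per-rule scores of a ruleset (hypothetical helper)."""
    if not rule_scores:
        raise ValueError("a ruleset needs at least one rule score")
    if aggregation_type == "Maximum":
        return max(rule_scores)  # highest score among all rules
    if aggregation_type == "Minimum":
        return min(rule_scores)  # lowest score among all rules
    raise ValueError(f"unknown aggregation type: {aggregation_type}")


def generates_similarity_edge(rule_scores, aggregation_type, threshold):
    """An edge is generated only when the ruleset score reaches the threshold."""
    return aggregate_score(rule_scores, aggregation_type) >= threshold


# Assumed example scores for three rules in one ruleset.
scores = [0.72, 0.91, 0.85]
print(generates_similarity_edge(scores, "Maximum", 0.9))  # True
print(generates_similarity_edge(scores, "Minimum", 0.9))  # False
```

With the same rule scores, the Maximum option reaches a threshold of 0.9 and the Minimum option does not, so the aggregation type directly affects how many Similarity Edges are generated.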

Creating Rules in a Ruleset

To create rules in a ruleset, follow these steps:

1. Navigate to a Ruleset Details page.

2. Click Create to add a new rule.

A New Rule section is displayed.

3. Enter the following details:


Field	Description
Name	Indicates the name of the rule.
Description	Indicates the description of the rule.
Rule Threshold	Indicates the threshold value set for a rule. The rule contributes to the matching only when the maximum score obtained for the rule is equal to or higher than the threshold value.

4. Click Create to add new mappings, and enter the following details:


Field	Description
Source Attribute	Indicates the source attribute.
Target Attribute	Indicates the target attribute.
Match Type	Indicates the match type. Select one of the following options: · Exact: Returns only the matches in which the attribute values are identical (100% match). · Fuzzy: Returns the matches in which the attribute values are similar but not identical (less than 100% match).
Scoring Method	The scoring methods used are as follows: · Default · Jaro Winkler For more information, see Scoring Method.
Threshold	Indicates the minimum score for the mapping. A score below this value does not generate a result from Elasticsearch.
Weightage	Indicates the weightage given for the attributes in the rule.
Condition	Indicates that this attribute cannot have a null value. This attribute must be populated and must return a value for the matching.
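
One way the Threshold, Weightage, and Condition fields could combine into a rule score is sketched below. This guide does not state the exact formula, so the weighted average, the handling of null values, and all class and function names here are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical container for one row of the mappings table."""
    source_attribute: str
    target_attribute: str
    threshold: float
    weightage: float
    mandatory: bool = False  # stands in for the Condition field


def exact_score(a, b):
    """Exact match type: 1.0 for identical values, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def rule_score(source, target, mappings, score_fn=exact_score):
    """Assumed weighted average of per-attribute scores."""
    weighted_sum = 0.0
    total_weight = 0.0
    for m in mappings:
        s = source.get(m.source_attribute)
        t = target.get(m.target_attribute)
        if s is None or t is None:
            if m.mandatory:
                return 0.0  # a mandatory attribute must be populated
            continue  # assumption: optional null attributes are skipped
        score = score_fn(s, t)
        if score < m.threshold:
            score = 0.0  # below the mapping threshold: no result
        weighted_sum += m.weightage * score
        total_weight += m.weightage
    return weighted_sum / total_weight if total_weight else 0.0


mappings = [
    Mapping("name", "name", threshold=0.8, weightage=0.6, mandatory=True),
    Mapping("dob", "dob", threshold=1.0, weightage=0.4),
]
source = {"name": "Ann", "dob": "1990-01-01"}
target = {"name": "Ann", "dob": "1991-01-01"}
print(rule_score(source, target, mappings))
```

In this sketch the name matches and the date of birth does not, so only the weightage of the name mapping contributes to the rule score.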

Scoring Method

The scoring methods used in the entity resolution component are as follows:

· Default Method

The distance is computed by finding the number of edits that transform one string into another. The transformations allowed are as follows:

§ Insertion: Adding a new character

§ Deletion: Deleting a character

§ Substitution: Replacing one character with another

By performing these operations, the algorithm transforms the first string into the second one. The minimum number of operations required is the edit distance.

For example:

a. textdistance.levenshtein('arrow', 'arow')

1

b. textdistance.levenshtein.normalized_similarity('arrow', 'arow')

0.8

Here, inserting a single ‘r’ into string 2, that is, ‘arow’, makes it the same as string 1. Hence, the edit distance is 1. As with Hamming distance, you can derive a bounded similarity score between 0 and 1 by dividing the edit distance by the length of the longer string and subtracting the result from 1, that is, 1 – (1/5) = 0.8. The similarity score obtained is 80%.
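
The computation described above can be reproduced without the textdistance library. The following is a minimal sketch of the standard dynamic-programming approach; the function names are illustrative and are not part of FCC Studio.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(s1, s2):
    """Scale the edit distance to a similarity score between 0 and 1."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(s1, s2) / longest


print(levenshtein("arrow", "arow"))            # 1
print(normalized_similarity("arrow", "arow"))  # 0.8
```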

· Jaro Winkler

This algorithm gives high scores to strings that meet the following conditions:

a. The strings contain the same characters, within a certain distance of one another.

b. The matching characters are in the same order.

To be precise, the distance within which a matching character is searched for is one character less than half of the length of the longest string, rounded down. So if the longest string has a length of five, the distance is (5/2) – 1, rounded down to 1, and a character at the start of string 1 must be found on or before the 2nd position in string 2. This is considered a valid match. In addition, the algorithm is directional and gives a high score if the strings match from the beginning.

For example:

a. textdistance.jaro_winkler("mes", "messi")

0.86

b. textdistance.jaro_winkler("crate", "crat")

0.96

c. textdistance.jaro_winkler("crate", "atcr")

0.0

In the first case, the strings match from the beginning, so a high score is given. Similarly, in the second case, only one character is missing, and it is at the end of string 2, so a very high score is given. In the third case, the last two characters of string 2 are moved to the front, which results in 0% similarity.
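
The matching window and the bonus for a common beginning can be sketched in plain Python as follows. This is a textbook-style illustration, not the textdistance implementation: the function names, the prefix weight of 0.1, and the prefix limit of four characters are assumptions based on the common definition of the algorithm.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if not s1 or not s2:
        return 0.0
    # Search distance: half of the longer length (rounded down) minus one.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len(s2), i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # transpositions (two out-of-order characters make one transposition).
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro score with a bonus for a common prefix (assumed parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro_winkler("crate", "crat"), 2))  # 0.96
print(jaro_winkler("crate", "atcr"))            # 0.0
```

For "crate" and "atcr", no character of string 1 falls within the search distance of its counterpart in string 2, so the score is 0.0. For "mes" and "messi", the plain jaro function in this sketch returns about 0.87. Implementations differ in when they apply the prefix bonus, so values for short strings may not match the library output quoted above.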