About Constructing Datasets
Data miners should apply the following tips when constructing datasets:
- Apply as much filtering as possible at the dataset level. If certain cases cannot satisfy the criteria for a scenario, it is best to eliminate them from consideration through dataset filtering rather than processing these records in the pattern.
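A dataset-level filter of this kind might look like the following sketch; the ACCT table and ACCT_STAT_CD column are hypothetical names used for illustration:

```sql
-- Hypothetical example: exclude closed accounts in the dataset itself,
-- so the pattern never has to examine those records.
SELECT a.ACCT_INTRL_ID,
       a.ACCT_TYPE_CD
  FROM ACCT a
 WHERE a.ACCT_STAT_CD = 'ACT'   -- dataset-level filter: active accounts only
```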
- Create dataset thresholds for any filtering criteria that may require adjustment over time. When setting the minimum and maximum values for these thresholds, the data miner should verify that the scenario still performs acceptably at the extremes. For example, if the lookback period is a configurable parameter, the data miner must ensure that the scenario performs adequately when this threshold is set to its maximum value.
- Make appropriate considerations for null values when creating dataset filters. Remember that null values cannot be joined across tables, and null values will not satisfy the criteria for comparisons to specific values. For example, the dataset filter ACCT_TYPE_CD != 'IRA' eliminates cases where the ACCT_TYPE_CD field is null. We recommend using the NVL expression when checking that a nullable field is not equal to a particular value (for example, NVL(ACCT_TYPE_CD, 'NONE') != 'IRA'). Use NVL expressions when performing computations with nullable attributes.
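Both recommendations can be combined in a single filter. The ACCT_TYPE_CD column follows the example in the text; TXN_AMT and FEE_AMT are hypothetical nullable attributes:

```sql
-- NVL substitutes 'NONE' when ACCT_TYPE_CD is null, so null-typed accounts
-- are retained by the inequality test instead of being silently dropped.
WHERE NVL(ACCT_TYPE_CD, 'NONE') != 'IRA'
  -- NVL also protects computations that involve nullable attributes,
  -- which would otherwise evaluate to null:
  AND NVL(TXN_AMT, 0) + NVL(FEE_AMT, 0) > 10000
```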
- When creating derived attributes, specify the correct data type (that is, String, Float, Integer, or Date). The most common mistake is to specify a numeric field as a string, which causes problems when making comparisons in patterns. For example, when compared as strings, '100' is less than '99'.
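The pitfall arises because string comparison proceeds character by character. A small illustration:

```sql
-- Compared as strings, '100' < '99' because the first character '1'
-- sorts before '9'. Compared as numbers, 100 > 99 as expected.
SELECT CASE WHEN '100' < '99' THEN 'string: 100 < 99' END AS string_cmp,
       CASE WHEN  100  >  99  THEN 'number: 100 > 99' END AS number_cmp
  FROM DUAL
```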
- Avoid any situation that can cause a divide-by-zero error. If a zero value is not expected in the denominator of a calculation, the data miner should apply specific logic to handle this situation should it occur.
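One common Oracle idiom for this is NULLIF, which converts a zero denominator into null so the division yields null rather than raising an error. The TXN_AMT and TXN_CNT columns below are hypothetical:

```sql
-- NULLIF(TXN_CNT, 0) returns null when TXN_CNT = 0, so the ratio
-- evaluates to null instead of causing a divide-by-zero error.
SELECT TXN_AMT / NULLIF(TXN_CNT, 0) AS AVG_TXN_AMT
  FROM ACCT_SMRY_MONTH
```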
- Remove references to tables that the detection process does not require. In some cases, information that is used for display—in places such as the alert highlights—is not directly needed to identify matches. Retrieve this information after a match is generated using the augmentation process, rather than bringing extra tables into the original pattern dataset. For example, if a dataset uses the Security Master table only so that the pattern can display a ticker symbol in the alert highlights, the data miner should instead use binding augmentation to add the ticker symbol to the highlights as part of post processing.
- Rely on summarized data whenever possible, rather than calculating aggregate statistics by searching through a large volume of data. For example, if a scenario requires aggregate trade or transactional data, it is best to retrieve it from the summary tables (for example, ACCT_SMRY_MONTH) rather than computing it in a dataset. If possible, align the frequency and lookback periods of scenarios with the periods covered by the summary tables.
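For example, a monthly transaction total can be read directly from the summary table instead of being recomputed from the detail records. The column names below are illustrative assumptions:

```sql
-- Preferred: read the precomputed monthly total from the summary table.
SELECT ACCT_INTRL_ID, TOT_TXN_AMT
  FROM ACCT_SMRY_MONTH
 WHERE MONTH_KEY = 200601         -- hypothetical month key

-- Avoid: recomputing the same total by scanning the detail table.
-- SELECT ACCT_INTRL_ID, SUM(TXN_AMT)
--   FROM TXN
--  GROUP BY ACCT_INTRL_ID
```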
- Whenever possible, create the dataset by using the graphical icons. Using Edit SQL mode is required whenever the user needs to apply advanced SQL clauses (for example, exists, not exists, union), aggregate functions, or performance hints. However, datasets that are in Edit SQL mode are somewhat harder to maintain, so data miners should try other options before moving to Edit SQL.
- For sequence scenarios, consider using temporary tables to narrow down the records included in a dataset. In some cases, a relatively simple set of SQL logic can remove a large percentage of situations that have no chance of satisfying the criteria for a particular scenario. Under these conditions, it is often an efficient strategy to perform an initial screening step that identifies candidate entities and stores them in a temporary table. The other datasets can then use the temporary table as a starting point, thereby limiting the volume of data they need to retrieve.
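A sketch of the screening approach, using hypothetical table and column names:

```sql
-- Step 1: a simple screening query stores candidate accounts in a
-- temporary table, eliminating entities that cannot match.
INSERT INTO TMP_CANDIDATE_ACCT (ACCT_INTRL_ID)
SELECT t.ACCT_INTRL_ID
  FROM TXN t
 GROUP BY t.ACCT_INTRL_ID
HAVING SUM(t.TXN_AMT) > 100000;   -- hypothetical screening criterion

-- Step 2: subsequent datasets join to the temporary table, limiting
-- the volume of data they must retrieve.
SELECT t.*
  FROM TXN t,
       TMP_CANDIDATE_ACCT c
 WHERE t.ACCT_INTRL_ID = c.ACCT_INTRL_ID;
```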
- Use the comment boxes to document your datasets rather than writing in-line comments directly in the SQL; putting comments into the SQL is unnecessary and should be avoided. In particular, never write dataset names (for example, dsview12345) into SQL comments, because this causes problems during job runs.
- If it is necessary to paste in the dataset SQL, do not copy it from an application that can embed unprintable formatting characters in the text, such as Microsoft Word. It is much safer to paste from a plain-text editor such as Notepad. Unprintable characters can cause problems during scenario extraction and migration.
- Review the execution plan for each query, either with a Database Engineer (DBE) or via a SQL tuning utility. If the execution plan indicates that the query is inefficient, either modify the query or have the DBE add appropriate indexes. As a last resort, if the query does not use the indexes it should, apply database hints to instruct the query optimizer appropriately.
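In Oracle, for instance, the plan can be inspected with EXPLAIN PLAN, and a hint applied as a last resort. The table, column, and index names here are hypothetical:

```sql
-- Inspect the optimizer's plan for a dataset query.
EXPLAIN PLAN FOR
SELECT a.ACCT_INTRL_ID
  FROM ACCT a
 WHERE a.ACCT_STAT_CD = 'ACT';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Last resort: an index hint directing the optimizer to a specific
-- index (the index name XIE1ACCT is hypothetical).
SELECT /*+ INDEX(a XIE1ACCT) */ a.ACCT_INTRL_ID
  FROM ACCT a
 WHERE a.ACCT_STAT_CD = 'ACT';
```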
- Verify that all SQL expressions conform to the Oracle Financial Services SQL Usage Guide. This ensures that the datasets are compatible with both Oracle and DB2/UDB.