Processing of Extracts

This chapter provides an overview of the procedures involved in extracting large volumes of data from the system.

Overview

When an extract is performed, the application creates a global activity that generates multiple child activities. Each child activity fetches a specific number of entities and stores them in a data file.

The ohi.extract.datafilecount system property controls the number of data files the system generates during the extract process. This property has the following impact:

  • The property value determines the number of parallel child activities that select the items to be extracted.

  • The entities or items are evenly distributed among these child activities based on the datafilecount property.

  • A higher value allows more child activities to run in parallel, which can improve performance by making the extraction process multi-threaded. However, it also generates more data files.

  • Setting the value too low, such as 1 or 10, can cause the extract process to function as a single-threaded operation, potentially reducing performance.

  • The default value of this property is 100.
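The even distribution described above can be illustrated with a minimal sketch. The function name distribute_items is hypothetical, and the handling of a remainder (when the item count is not an exact multiple of datafilecount) is an assumption, since this chapter does not describe it.

```python
def distribute_items(total_items, datafilecount=100):
    """Return the number of items assigned to each child activity.

    Hypothetical helper for illustration only; not part of the application.
    """
    if datafilecount < 1:
        raise ValueError("datafilecount must be at least 1")
    base, remainder = divmod(total_items, datafilecount)
    # Assumption: leftover items are spread one at a time over the
    # first `remainder` child activities.
    return [base + (1 if i < remainder else 0) for i in range(datafilecount)]
```

For example, distribute_items(1_000_000, 100) yields 100 entries of 10,000 items each, while a value of 1 yields a single entry that holds every item.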

Processing of Entities

The following steps are performed when the extraction is processed:

  1. Entities or items fetched by a child activity are processed by dividing them into chunks.

  2. Initially, each child activity extracts records in chunks based on the system property ohi.processing.chunksize.SELECT_EXTRACT_ITEMS.

  3. The system dynamically adjusts the chunk size based on memory utilization. It starts with the value configured in ohi.processing.chunksize.SELECT_EXTRACT_ITEMS and splits chunks into smaller chunks as needed.

  4. Additionally, memory is cleaned during processing if the cumulative memory of the objects exceeds the memory threshold. Together, these mechanisms help manage memory usage and prevent out-of-memory issues.

  5. Each chunk is processed individually, and a data file containing the information on the fetched items is created for each chunk.

  6. Once there are no more chunks to process, the individual per-chunk data files are merged into a single data file. All the entities fetched by the child activity are stored in this data file.
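The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the application's implementation: the function name, the estimate_memory callback, and the use of in-memory lists in place of data files are all hypothetical.

```python
from collections import deque

def process_child_activity(items, chunk_size, estimate_memory, memory_threshold):
    """Return (merged_data_file, number_of_chunk_files) for one child activity.

    Hypothetical sketch: lists stand in for data files, and
    estimate_memory is an assumed callback that returns the memory
    cost of a chunk.
    """
    # Steps 1-2: divide the fetched items into chunks of the configured size.
    pending = deque(items[i:i + chunk_size]
                    for i in range(0, len(items), chunk_size))
    chunk_files = []
    while pending:
        chunk = pending.popleft()
        # Steps 3-4: split a chunk in two when it exceeds the threshold.
        if len(chunk) > 1 and estimate_memory(chunk) > memory_threshold:
            mid = len(chunk) // 2
            # Push the second half first so the first half is processed next.
            pending.appendleft(chunk[mid:])
            pending.appendleft(chunk[:mid])
            continue
        # Step 5: one data file per chunk that is not split further.
        chunk_files.append(list(chunk))
    # Step 6: merge the per-chunk files into a single data file.
    merged = [item for chunk_file in chunk_files for item in chunk_file]
    return merged, len(chunk_files)
```

In this sketch the item order is preserved across splits, so the merged data file contains the same items in the same order regardless of how many chunks were split.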

Use Case

Consider an example of performing an extract on policy resources with one (1) million policy items. If ohi.extract.datafilecount is set to the default value of 100, then the following steps occur:

  1. The application creates a global activity that generates 100 child activities, matching the datafilecount value.

  2. The one (1) million policy items are evenly distributed among these 100 child activities, resulting in 10,000 items per child activity.

  3. Assume that ohi.processing.chunksize.SELECT_EXTRACT_ITEMS is set to 500 and that there are 8 threads per JVM. Each child activity then divides its 10,000 items into 20 chunks (10,000 / 500), where each chunk carries 500 items.

  4. If memory utilization exceeds the threshold limit while a chunk is being processed, the chunk is split into two smaller chunks. For example, if 5 of the 20 chunks are each split into two smaller chunks, the total number of chunks becomes 25: 15 unsplit chunks plus 10 smaller chunks (5 × 2).

  5. Each chunk that is not split further creates a data file with the information on the fetched items.

  6. Once there are no more chunks to process, all these individual data files are merged into a single final data file for each child activity. As a result, 100 data files are generated, each containing 10,000 policy items.
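The arithmetic in this use case can be checked with a short calculation. The function and parameter names are hypothetical, and the sketch assumes that each affected chunk is split exactly once, as in the example above.

```python
def use_case_chunk_counts(total_items=1_000_000, datafilecount=100,
                          chunk_size=500, chunks_split_once=5):
    """Reproduce the numbers of the use case (illustrative helper only)."""
    items_per_activity = total_items // datafilecount
    # Ceiling division: a partly filled last chunk still counts as a chunk.
    initial_chunks = -(-items_per_activity // chunk_size)
    # Assumption: each split chunk is replaced by exactly two smaller chunks.
    final_chunks = (initial_chunks - chunks_split_once) + 2 * chunks_split_once
    return {
        "items_per_activity": items_per_activity,
        "initial_chunks": initial_chunks,
        "final_chunks": final_chunks,
        "final_data_files": datafilecount,  # one merged file per child activity
    }
```

With the default arguments, the result is 10,000 items per child activity, 20 initial chunks, 25 chunks after splitting, and 100 final data files.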

Configuration Properties

The following configuration properties are relevant to extract processing:

  • ohi.extract.datafilecount
    This is an optional configuration property that controls the number of data files generated by the extract activity. It also determines the number of child activities executed. The default value is 100.

  • ohi.extract.datafilecount.resourcename.{0}
    The behavior of this property is similar to that of ohi.extract.datafilecount, but it controls the number of data files generated for a specific resource name or entity. The placeholder value {0} refers to the resourceName used in the extract request.

  • ohi.extract.datafilecount.notificationkey.{0}
    The behavior of this property is similar to that of ohi.extract.datafilecount, but it controls the number of data files generated for a specific notification key. The placeholder value {0} refers to the notificationKey used in the extract request.

The value set in the notification key-specific property (ohi.extract.datafilecount.notificationkey.{0}) takes precedence over the resource name-specific property (ohi.extract.datafilecount.resourcename.{0}), which in turn takes precedence over the default ohi.extract.datafilecount property.
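The precedence order can be sketched as a simple lookup. The function name and the dictionary that stands in for the configured system properties are assumptions made for illustration, as are the sample resource name and notification key in the usage note; the application's actual lookup mechanism is not described here.

```python
def resolve_datafilecount(properties, resource_name=None,
                          notification_key=None, default=100):
    """Return the effective data file count for an extract request.

    Hypothetical helper: `properties` is a plain dict that stands in
    for the configured system properties.
    """
    candidates = []
    # Highest precedence: the notification key-specific property.
    if notification_key is not None:
        candidates.append(
            f"ohi.extract.datafilecount.notificationkey.{notification_key}")
    # Next: the resource name-specific property.
    if resource_name is not None:
        candidates.append(
            f"ohi.extract.datafilecount.resourcename.{resource_name}")
    # Lowest precedence: the general property, then the default of 100.
    candidates.append("ohi.extract.datafilecount")
    for key in candidates:
        if key in properties:
            return int(properties[key])
    return default
```

For example, if both ohi.extract.datafilecount.resourcename.policies and ohi.extract.datafilecount.notificationkey.nightly are set, a request that carries the notification key nightly uses the notification key-specific value.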