4.6 Data Slicing

Data Slicing is a utility that segments instrument data into equal parts by populating a numeric value in the DATA SLICE ID column. Use Data Slicing together with multi-processing, which is described in detail under Appendix F – Process Tuning in the FTP User Guide and Appendix B – Performance Tuning in the ALM User Guide.
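
The idea of populating a numeric slice value so that segments come out equal can be illustrated with a minimal Python sketch. The utility's actual assignment logic is not described here, so the round-robin rule and the names assign_slice_ids, row_count and num_slices are assumptions made purely for illustration.

```python
from collections import Counter


def assign_slice_ids(row_count, num_slices):
    """Return one slice ID (1..num_slices) per row, assigned round-robin.

    Hypothetical helper: it only shows how a numeric slice value can
    split rows into near-equal segments, not how the utility does it.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    return [(i % num_slices) + 1 for i in range(row_count)]


# Ten rows spread over four slices.
slice_ids = assign_slice_ids(10, 4)
segment_sizes = Counter(slice_ids)
print(slice_ids)
print(dict(segment_sizes))
```

Under this assumed rule, segment sizes never differ by more than one row, which is the property that matters for balancing the sub-processes.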

The purpose of segmenting data into equal parts is to balance the data volumes handled by each sub-process that is launched when multi-processing is enabled. The goal of multi-processing is to use as much of the application server's processing power as possible during peak processing, which leads to significantly shorter overall processing time. Through benchmark testing, we have found that breaking instrument data down into equal segments via the Data Slice ID column is the most efficient way to use multi-processing. The alternative is to use one or more key dimension columns, such as Organization Unit and Product ID. The shortcoming of segmenting by these dimensions is that data is not evenly distributed across them, so you end up with a few large segments and a large number of small segments, which is not optimal for processing. Because the engine handles each data segment in a dedicated sub-process, evenly distributing the data into equal segments provides the best results.
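
The difference between segmenting by a dimension and segmenting by slice can be made concrete with a small Python sketch. The product names and row counts below are invented for illustration and are not benchmark data; largest_segment_share is a hypothetical helper, and the slice IDs are assumed to be assigned round-robin.

```python
from collections import Counter


def largest_segment_share(segment_keys):
    """Fraction of all rows that fall into the single largest segment."""
    counts = Counter(segment_keys)
    return max(counts.values()) / len(segment_keys)


# Hypothetical instrument table: 1,000 rows, heavily skewed by product.
product_ids = (["MORTGAGE"] * 700 + ["AUTO"] * 200
               + ["CARD"] * 60 + ["OTHER"] * 40)

# The same rows split into four equal slices (assumed round-robin rule).
slice_ids = [(i % 4) + 1 for i in range(len(product_ids))]

share_by_product = largest_segment_share(product_ids)
share_by_slice = largest_segment_share(slice_ids)
print(share_by_product, share_by_slice)
```

In this made-up data set, one product segment holds 70% of the rows, so the sub-process that receives it dominates the run time, while each of the four slices holds 25%.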

Data Slices are used by the FTP, ALM and BSP engines only when multi-processing is enabled. The default multi-process setting is 1 process. This means that a single sub-process is launched when an FTP, ALM or BSP process is started, and this one sub-process iterates through all of the data until processing is complete. Using a single process is fine for implementation and testing, but in production, users should identify the number of sub-processes that leads to the best performance. During Process Tuning, users can increase the number of sub-processes to 2, 4, 8 or more, depending on the number of CPUs available on the server. In multi-processing, a value of 8 processes means that 8 sub-processes are launched automatically and each sub-process is responsible for processing all of the data for an entire data segment (Data Slice Cd). When a sub-process finishes processing the data for a segment, a new sub-process is launched and handles the next available data segment. This repeats until all data segments have been processed. In terms of relative improvements in performance, we have observed that a multi-process value of 2 is approximately 2x faster than 1, 4 is approximately 4x faster and 8 is approximately 8x faster. However, we have noticed diminishing returns as the number of processes increases, so users will need to iterate on the number of processes setting to find the optimal value.
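
The dispatch pattern described above, where each free sub-process picks up the next available segment, can be approximated with a short Python simulation. This is an idealized sketch and not the engine's scheduler: simulate_makespan and the segment costs are hypothetical, and the model ignores start-up cost, database contention and CPU limits, so it cannot reproduce the diminishing returns seen in practice.

```python
import heapq


def simulate_makespan(segment_costs, num_processes):
    """Total elapsed time when each free process takes the next segment.

    segment_costs: processing time of each data segment, in arbitrary units.
    num_processes: number of sub-processes running at the same time.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # Min-heap of the times at which each process next becomes free.
    free_at = [0.0] * num_processes
    heapq.heapify(free_at)
    for cost in segment_costs:
        start = heapq.heappop(free_at)          # earliest free process
        heapq.heappush(free_at, start + cost)   # busy until start + cost
    return max(free_at)


# Sixteen equal slices versus one large and fifteen small segments,
# both with the same total work of 160 units.
equal_slices = [10.0] * 16
skewed_segments = [100.0] + [4.0] * 15

baseline = simulate_makespan(equal_slices, 1)
for n in (2, 4, 8):
    speedup = baseline / simulate_makespan(equal_slices, n)
    print(n, "processes, equal slices:", speedup)

skewed_speedup = baseline / simulate_makespan(skewed_segments, 8)
print("8 processes, skewed segments:", skewed_speedup)
```

With equal slices, the simulated speedup tracks the process count, which is consistent with the approximate 2x, 4x and 8x figures above. With the skewed segments, the single large segment bounds the run time regardless of how many sub-processes are available.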