Using Set Processing

This section provides an overview of set processing and discusses how to:

  • Use set processing effectively.

  • Avoid row-by-row processing.

  • Use set processing examples.

Set processing is a SQL technique used to process groups, or sets of rows, at one time rather than processing each row individually. Set processing enables you to apply a business rule directly to the data (preferably while it resides in a temporary table) in the database using an Update or Insert/Select statement. Most of the performance gain comes from the processing occurring in the database, instead of loading the data into the application program, processing it, and then inserting the results back into the database tables. Because the data never leaves the database with set processing (whether or not it remains in the same table), you effectively eliminate the network round-trip and database API overhead.

Note: Because the updates in set processing occur within the database, use temporary tables to hold transient data while your program runs. Although temporary tables are not required for set processing, they are often essential to achieve optimum performance of your batch program.

The information in the topics that follow applies whether you are developing new Application Engine programs or upgrading older ones to adhere to a set-based model.

SQL Expertise

You should be a SQL expert if you are developing row-by-row programs with Application Engine and especially if you are developing set-based programs. The following concepts are particularly important:

  • Group by and Having clauses.

  • Complex joins.

  • Subqueries (correlated and non-correlated).

  • Tools for your database to analyze complex SQL statements for performance analysis.

Typically, you use these SQL constructs to refine or filter a set to contain only the rows that meet specific criteria. In Application Engine, you code using SQL, and Application Engine passes that SQL directly to the database, where it is processed. If you have a complex SQL statement that works functionally, it may not perform well if it is not tuned properly.

Planning

Well-constructed, robust, and efficient Application Engine programs are usually the product of a detailed planning stage in which loops, program flow, the use of temporary tables, sections, steps, and so on are discussed.

Ideally, you should address batch processing as a whole while you are designing the system. Sometimes, system analysts and developers focus primarily on the online system during the database design, and then they consider the batch component within the existing database design. Set processing works best in an environment in which data models are optimized for set processing.

For example, rather than repeatedly updating existing rows in a table, you could keep a separate staging table for new data that has not been processed. In set processing, it is easier to move the data into a temporary table with an Insert/Select statement and process it there than to update rows in place. Avoid performing updates on real application tables; perform your updates on temporary tables instead, and structure your data model to minimize updating real application tables.
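
A minimal sketch of this pattern, using hypothetical staging (PS_ORDER_STG) and temporary (PS_ORDER_TAO) table and column names:

    INSERT INTO PS_ORDER_TAO (PROCESS_INSTANCE, BUSINESS_UNIT, ORDER_ID, ORDER_AMT)
    SELECT %BIND(PROCESS_INSTANCE), BUSINESS_UNIT, ORDER_ID, ORDER_AMT
    FROM PS_ORDER_STG
    WHERE PROCESS_FLAG = 'N'

    UPDATE PS_ORDER_TAO
    SET ORDER_AMT = ORDER_AMT * 1.1
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)

The business rule (here, a hypothetical 10 percent adjustment) runs entirely against the temporary table; the real application table is never updated row by row.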

Another important consideration is keeping historical data separate from active transactions. After the life cycle of a given piece of transaction data is over, so that no more updates are possible, consider moving that data to an archive or history table and deleting it from the real transaction table. This action minimizes the number of rows in the table, which improves performance for queries and updates to your active data.
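
A sketch of this archiving pattern, assuming hypothetical transaction (PS_TXN_TBL) and history (PS_TXN_HIST) tables with identical columns and a status value of 'C' marking completed rows; run both statements in the same unit of work so that a failure cannot lose or duplicate rows:

    INSERT INTO PS_TXN_HIST
    SELECT * FROM PS_TXN_TBL
    WHERE TXN_STATUS = 'C'

    DELETE FROM PS_TXN_TBL
    WHERE TXN_STATUS = 'C'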

Temporary Tables

Although temporary tables are not required for set processing, well-designed temporary tables complement your set-based program in a variety of ways.

Creating temporary tables enables you to achieve one of the main objectives of set-based processing: the processing remains on the database server. By storing transient data in temporary tables, you avoid having the batch program fetch the data row by row, apply the business rule, process the data, and then pass the updated data back to the database. If the program ran on the client, you would encounter performance issues caused by the network round-trip and the slower processing speed of a client compared to the database platform.

Design your temporary tables to:

  • Hold transaction data for the current run or iteration of your program.

  • Contain only those rows of data affected by the business rule.

  • Present key information in a denormalized, or flattened, form, which provides the most efficient processing.

  • Switch the keys for rows coming from the master tables, if needed.

    A transaction may use a different key than what appears in the master tables.

Denormalized Tables

The most efficient temporary tables store data in denormalized form. Because most programs need to access data that resides in multiple tables, you should consolidate all of the affected and related data into one table, a temporary table. The program runs more efficiently against a flattened, temporary table rather than relying on the system to materialize complex joins and views to retrieve or update necessary data for each transaction.

If your program requires the use of a complex view to process transactions, then resolve the view into a temporary table for your program to run against. Each join or view that needs to materialize for each transaction consumes system resources and affects performance. In this approach, the system applies the join or view once (during the filtering process), populates the temporary table with the necessary information that the program needs to complete the transaction, and then runs the program against the temporary table as needed.
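
A minimal sketch of resolving a view into a temporary table, assuming a hypothetical complex view (PS_CUST_BAL_VW) and a temporary table (PS_CUST_BAL_TAO) with matching columns:

    INSERT INTO PS_CUST_BAL_TAO (PROCESS_INSTANCE, SETID, CUST_ID, BAL_AMT)
    SELECT %BIND(PROCESS_INSTANCE), SETID, CUST_ID, BAL_AMT
    FROM PS_CUST_BAL_VW
    WHERE SETID = %BIND(SETID)

The view materializes only once, during this filtering step; every later statement in the run joins directly to PS_CUST_BAL_TAO.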

For example, consider the following situation:

A program needs to update 10,000 rows in the Customer table, which contains 100,000 rows of data. The Customer table is keyed by setID. To complete the transaction, the program references data that resides in a related table called PS_SET_CNTRL_REC. PS_SET_CNTRL_REC is used to associate setID and BUSINESS_UNIT values. The transaction is keyed by BUSINESS_UNIT.

Given this set of circumstances, the most efficient processing method would be similar to the following:

  • Isolate affected or necessary data from both tables and insert it into a temporary table.

    Now, instead of dealing with a 100,000-row Customer table and a join to a related table, the program faces a 10,000-row temporary table that contains all of the required data to join directly to the transaction data, which can also be in a temporary table. If all necessary columns reside in the temporary tables, then the program can modify all the rows at once in a simple Update statement.

    This example presents two different uses of temporary tables. In one situation, the temporary table is designed to hold setup and control data in a modified form. In the other situation, the temporary table is designed to hold transaction data in a denormalized form, perhaps with additional work columns to hold intermediate calculations.

  • Make sure the data appears in a denormalized form for optimum processing.

  • Because the transaction is keyed by BUSINESS_UNIT, you should also key the temporary table that holds the control data by BUSINESS_UNIT.

    In this case, the table that holds the control data is the Customer table.

Avoiding Row-by-Row Processing

A set-based program and row-by-row processing are not mutually exclusive: some rules do call for row-by-row processing, but these rules are the exceptions. You can have a row-by-row component within a mostly set-based program.

For example, suppose your program contains five rules that you will run against your data. Four of those rules lend themselves well to a set-based approach, while the fifth requires a row-by-row process. In this situation, run the four set-based steps or rules first, and then run the row-by-row step last to resolve the exceptions. Although this is not pure set-based processing, you obtain better performance than if the entire program used a row-by-row approach.

When performing a row-by-row update, reduce the number of rows and the number of columns that you select to an absolute minimum to decrease the data transfer time.

For logic that cannot be coded entirely in a set-based manner, try to process most of the transactions in a set and handle only the exceptions in a row-by-row loop. A good example of an exception is the sequence numbering of detail lines within a transaction when most transactions have only a single detail line. You can default the sequence number on all the detail lines to 1 in an initial set-based operation, and then run a Select statement that retrieves only the exceptions (duplicates) and update their sequence numbers to 2, 3, and so on, as sketched below.
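
A sketch of that sequence-numbering approach, using a hypothetical detail temporary table (PS_TXN_DTL_TAO):

    UPDATE PS_TXN_DTL_TAO
    SET SEQ_NBR = 1
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)

    SELECT TXN_ID
    FROM PS_TXN_DTL_TAO
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)
    GROUP BY TXN_ID
    HAVING COUNT(*) > 1

Only the transactions returned by the Select, typically a small minority, need the row-by-row loop that renumbers their detail lines to 2, 3, and so on.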

Avoid the tendency to expand row-by-row processing beyond what is necessary. For example, if you are touching all of the rows in a table in a specific row-based process, you do not necessarily gain efficiency by running the rest of your logic against that table in a row-based manner.

When updating a table, you can add another column to be set in the Update statement. However, do not add another SQL statement to your loop simply because your program is looping. If you can apply that SQL in a set-based manner, then in most cases you achieve better performance with a set-based SQL statement outside the loop.
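
For example (hypothetical table and columns), set both columns in a single set-based Update rather than issuing a second statement inside a loop:

    UPDATE PS_ORDER_TAO
    SET ORDER_STATUS = 'P',
        PROCESS_FLAG = 'Y'
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)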

The rest of this section describes techniques for avoiding row-by-row processing and enhancing performance.

Filtering

Using SQL, filter the set to contain only those rows that are affected or meet the criteria and then run the rule on them. Use a Where clause to minimize the number of rows to reflect only the set of affected rows.
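
For example, in this minimal sketch (hypothetical inventory temporary table and columns), the Where clause restricts the update to only the rows the rule actually affects:

    UPDATE PS_INV_ITEM_TAO
    SET RESTOCK_FLAG = 'Y'
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)
    AND QTY_ON_HAND < QTY_REORDER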

Two-Pass Approach

Use a two-pass approach, wherein the first pass runs a rule on all of the rows and the second pass resolves any rows that are exceptions to the rule. For instance, bypass exceptions to the rule during the first pass, and then address the exceptions individually in a row-by-row manner.
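
A sketch of the two-pass pattern, with hypothetical names: the first, set-based pass bypasses rows flagged as exceptions, and the second pass selects only those exceptions for row-by-row resolution.

    UPDATE PS_PAY_CALC_TAO
    SET NET_PAY = GROSS_PAY - TAX_AMT
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)
    AND EXCEPTION_FLAG = 'N'

    SELECT EMPLID
    FROM PS_PAY_CALC_TAO
    WHERE PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)
    AND EXCEPTION_FLAG = 'Y'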

Parallel Processes

Divide sets into distinct groups and then run the appropriate rules or logic against each set in parallel processes. For example, you could split an employee data population into distinct sets of hourly and salary employees, and then you could run the appropriate logic for each set in parallel.
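
A sketch of such a split, assuming a hypothetical employee table with an EMPL_TYPE column; each temporary table can then be worked on by its own parallel process instance:

    INSERT INTO PS_EMPL_HRLY_TAO (PROCESS_INSTANCE, EMPLID, EMPL_TYPE)
    SELECT %BIND(PROCESS_INSTANCE), EMPLID, EMPL_TYPE
    FROM PS_EMPLOYEE_TBL WHERE EMPL_TYPE = 'H'

    INSERT INTO PS_EMPL_SAL_TAO (PROCESS_INSTANCE, EMPLID, EMPL_TYPE)
    SELECT %BIND(PROCESS_INSTANCE), EMPLID, EMPL_TYPE
    FROM PS_EMPLOYEE_TBL WHERE EMPL_TYPE = 'S'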

Flat Temporary Tables

Flatten your temporary tables. The best temporary tables are denormalized and follow a flat file model for improved transaction processing.

For example, payroll control data might be keyed by setID and effective date rather than by business unit and accounting date. Use the temporary table to denormalize the data and switch the keys to business unit and accounting date. Afterward, you can construct a straight join to the Time Clock table, which is keyed by business unit and date.
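
A sketch of that key switch, using hypothetical control (PS_PAY_CTL_TBL), temporary (PS_PAY_CTL_TAO), and time clock (PS_TIME_CLOCK) table names; the insert resolves the setID and effective date once so that later statements join on business unit and date alone:

    INSERT INTO PS_PAY_CTL_TAO (BUSINESS_UNIT, ACCOUNTING_DT, PAY_RATE)
    SELECT DISTINCT S.SETCNTRLVALUE, T.ACCOUNTING_DT, C.PAY_RATE
    FROM PS_PAY_CTL_TBL C, PS_SET_CNTRL_REC S, PS_TIME_CLOCK T
    WHERE S.RECNAME = 'PAY_CTL_TBL'
     AND S.SETID = C.SETID
     AND S.SETCNTRLVALUE = T.BUSINESS_UNIT
     AND C.EFFDT = (SELECT MAX(EFFDT) FROM PS_PAY_CTL_TBL Z
                     WHERE Z.SETID = C.SETID
                       AND Z.EFFDT <= T.ACCOUNTING_DT)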

Techniques to Avoid

Note that:

  • If you have a series of identical temporary tables, examine your refinement process.

  • You should not attempt to accomplish a task that your database platform does not support, such as complex mathematics, nonstandard SQL, or complex analytical modeling.

    Use standard SQL for set processing.

  • Although subqueries are a useful tool for refining your set, make sure that you are not running the same subquery multiple times.

    If the same subquery appears in more than one statement, denormalize the query results into a temporary table: identify the subqueries that appear frequently and, where possible, replace them with joins to that table.

Set Processing Examples

Each of the following topics contains an example of set processing.

Payroll

In this example, suppose the payroll department needs to give a 1000 USD salary increase to everyone whose department made more than 50,000 USD profit. The following pseudocode enables you to compare the row-by-row and set-based approaches.

  • Row-by-Row:

    declare A cursor for select dept_id from department where profit > 50000;
    open A;
    fetch A into p_dept_id;
    while sql_status == OK
       update personnel set salary = (salary+1000) where dept_id = p_dept_id;
       fetch A into p_dept_id;
    end while;
    close A;
    free A;
  • Set-Based:

    update personnel set salary = (salary + 1000)
       where exists
          (select 'X' from department
             where profit > 50000
             and personnel.dept_id = department.dept_id)

Note: The set-based example employs a correlated subquery, which is important in set-based processing.

Temporary Tables

One technique for improving database performance is to use temporary tables to hold the results of common subqueries. Effective dating and setID indirection are common types of subqueries that you can replace with joins to temporary tables. With the joins in place, you can access the temporary table instead of performing the subquery multiple times. Not only do most databases handle joins better than subqueries, but combining multiple subqueries into a single join can also yield significant performance benefits.

In this setID indirection example, you see a join from a transaction table (keyed by BUSINESS_UNIT and ACCOUNTING_DT) to a setup table (keyed by SETID and EFFDT).

If you use a single SQL statement, you need to bring in PS_SET_CNTRL_REC to map the business unit to a corresponding setID, typically in a subquery. You also need to bring in the setup table a second time in a subquery to get the effective date (MAX(EFFDT) <= ACCOUNTING_DT). If you have a series of similar statements, performance may suffer.

The alternative is to use a temporary table that is the equivalent of the setup table. The temporary table is keyed by BUSINESS_UNIT and ACCOUNTING_DT instead of SETID and EFFDT. You populate it initially by joining in your batch of transactions (presumably also a temporary table) once, as described previously, to get all the business units and accounting dates for this batch. From then on, your transaction and setup temporary tables have common keys, which allow a straight join with no subqueries.

For the example, the original setup table (PS_ITEM_ENTRY_TBL) is keyed by SETID, ENTRY_TYPE and EFFDT.

The denormalized temporary table version (PS_ITEM_ENTRY_TAO) is keyed by PROCESS_INSTANCE, BUSINESS_UNIT, ENTRY_TYPE and ACCOUNTING_DT, and carries the original keys (SETID and EFFDT) as simple attributes for joining to other related setup tables, as in PS_ITEM_LINES_TBL for this example.

If the program references the setup table in only one Insert/Select or Select statement, you would not see increased performance by denormalizing the temporary table. But if several SQL statements in a single run all join to the same setup table with similar setID and effective date considerations, then the one-time cost of populating the temporary table yields a net performance gain.

  • Original setup table version:

    INSERT INTO PS_PG_PENDDST_TAO (...)
    SELECT 
    . . . . . 
     ( (I.ENTRY_AMT_BASE - I.VAT_AMT_BASE) * L.DST_LINE_MULTIPLR * L.DST_LINE_PERCENT / 100 ),
     ( (I.ENTRY_AMT - I.VAT_AMT) * L.DST_LINE_MULTIPLR * L.DST_LINE_PERCENT / 100 ),
    . . . . .
    FROM  PS_PENDING_ITEM I, PS_PG_REQUEST_TAO R, PS_ITEM_LINES_TBL L,
          PS_ITEM_ENTRY_TBL E, PS_SET_CNTRL_REC S, PS_BUS_UNIT_TBL_AR B
    WHERE
    . . . . .
     AND L.ENTRY_REASON = I.ENTRY_REASON
     AND L.SETID = E.SETID
     AND L.ENTRY_TYPE = E.ENTRY_TYPE
     AND L.EFFDT = E.EFFDT
    . . . . .
     AND E.EFF_STATUS = 'A'
     AND S.RECNAME = 'ITEM_ENTRY_TBL'
     AND S.SETID = E.SETID
     AND S.SETCNTRLVALUE = I.BUSINESS_UNIT
     AND E.ENTRY_TYPE = I.ENTRY_TYPE
     AND E.EFFDT = (SELECT MAX(EFFDT) FROM PS_ITEM_ENTRY_TBL Z
                       WHERE Z.SETID = E.SETID
                         AND Z.ENTRY_TYPE = E.ENTRY_TYPE
                         AND Z.EFF_STATUS = 'A'
                         AND Z.EFFDT <= I.ACCOUNTING_DT )
     AND B.BUSINESS_UNIT = I.BUSINESS_UNIT
    /
  • Denormalized temporary table version:

    INSERT INTO PS_ITEM_ENTRY_TAO
    . . . . .  
    SELECT DISTINCT %BIND(PROCESS_INSTANCE), I.BUSINESS_UNIT, I.ACCOUNTING_DT,
     E.ENTRY_TYPE...
    . . . 
    FROM  PS_PENDING_ITEM I, PS_PG_REQUEST_TAO R, 
          PS_ITEM_ENTRY_TBL E, PS_SET_CNTRL_REC S, PS_BUS_UNIT_TBL_AR B 
    WHERE R.PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE)
     AND R.PGG_GROUP_TYPE = 'B'
     AND I.POSTED_FLAG = 'N'
     AND R.GROUP_BU = I.GROUP_BU
     AND R.GROUP_ID = I.GROUP_ID
     AND E.EFF_STATUS = 'A'
     AND S.RECNAME = 'ITEM_ENTRY_TBL'
     AND S.SETID = E.SETID
     AND S.SETCNTRLVALUE = I.BUSINESS_UNIT
     AND E.ENTRY_TYPE = I.ENTRY_TYPE
     AND E.EFFDT =  ( SELECT MAX(EFFDT) FROM PS_ITEM_ENTRY_TBL Z
                       WHERE Z.SETID = E.SETID
                         AND Z.ENTRY_TYPE = E.ENTRY_TYPE
                         AND Z.EFF_STATUS = 'A'
                         AND Z.EFFDT <= I.ACCOUNTING_DT )
     AND B.BUSINESS_UNIT = I.BUSINESS_UNIT
    /
    INSERT INTO PS_PG_PENDDST_TAO (...)
    SELECT ... 
     ( (I.ENTRY_AMT_BASE - I.VAT_AMT_BASE) * L.DST_LINE_MULTIPLR * L.DST_LINE_PERCENT / 100 ),
     ( (I.ENTRY_AMT - I.VAT_AMT) * L.DST_LINE_MULTIPLR * L.DST_LINE_PERCENT / 100 ),
    . . . . .
    FROM  PS_PENDING_ITEM I, PS_PG_REQUEST_TAO R, PS_ITEM_LINES_TBL L, 
          PS_ITEM_ENTRY_TAO E
    . . . . .
    WHERE
    . . . . .
     AND L.ENTRY_REASON = I.ENTRY_REASON
     AND L.SETID = E.SETID
     AND L.ENTRY_TYPE = E.ENTRY_TYPE
     AND L.EFFDT = E.EFFDT
    . . . . .
     AND E.BUSINESS_UNIT = I.BUSINESS_UNIT
     AND E.ACCOUNTING_DT = I.ACCOUNTING_DT
     AND E.ENTRY_TYPE = I.ENTRY_TYPE
    /

Platform Issues

Set processing does not behave the same on every database platform. On some platforms, set processing can encounter performance breakdowns. Some platforms do not optimize update statements that include subqueries.

For example, platforms that optimize updates with subqueries retrieve all the qualifying department IDs from the Department table and then, using an index designed by an application developer, update the Personnel table. Other platforms read through every employee row in the Personnel table and query the Department table for each row.

On platforms where these types of updates are a problem, try adding some selectivity to the outer query. In the following example, examine the SQL in the basic version, and then notice how it is modified in the optimized version to run smoothly on all platforms. You can use this approach to work around platforms that have difficulty with updates that include subqueries.

Note: In general, set processing capabilities vary by database platform. The performance characteristics of each database platform differ with more complex SQL and set processing constructs. Some database platforms allow additional set processing constructs that enable you to process even more data in a set-based manner. If performance needs improvement, you must tailor or tune the SQL for your environment. You should be familiar with the capabilities and limitations of your database platform and be able to recognize, through tracing and performance results, the types of modifications you need to incorporate with the basic set processing constructs described.

  • Basic version:

    UPDATE PS_REQ_LINE
    SET SOURCE_STATUS = 'I'
    WHERE
    EXISTS
    (SELECT 'X' FROM PS_PO_ITM_STG STG
    WHERE
    STG.PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE) AND
    STG.PROCESS_INSTANCE = PS_REQ_LINE.PROCESS_INSTANCE AND
    STG.STAGE_STATUS = 'I' AND
    STG.BUSINESS_UNIT = PS_REQ_LINE.BUSINESS_UNIT AND
    STG.REQ_ID = PS_REQ_LINE.REQ_ID AND
    STG.REQ_LINE_NBR = PS_REQ_LINE.LINE_NBR)
  • Optimized for platform compatibility:

    UPDATE PS_REQ_LINE
    SET SOURCE_STATUS = 'I'
    WHERE
    PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE) AND
     EXISTS
    (SELECT 'X' FROM PS_PO_ITM_STG STG
    WHERE
    STG.PROCESS_INSTANCE = %BIND(PROCESS_INSTANCE) AND
    STG.PROCESS_INSTANCE = PS_REQ_LINE.PROCESS_INSTANCE AND
    STG.STAGE_STATUS = 'I' AND
    STG.BUSINESS_UNIT = PS_REQ_LINE.BUSINESS_UNIT AND
    STG.REQ_ID = PS_REQ_LINE.REQ_ID AND
    STG.REQ_LINE_NBR = PS_REQ_LINE.LINE_NBR)

Note: This example assumes that the transaction table (PS_REQ_LINE) has a PROCESS_INSTANCE column to lock rows that are in process. This is another example of designing your database with batch performance and set processing in mind.

This modification enables the system to limit its scan through PS_REQ_LINE to only those rows that the program is currently processing. At the same time, it enables a more set-friendly environment to first scan the smaller staging table and then update the larger outer table.