4 Tuning EDQ Performance

This chapter describes the server properties that can be used to optimize the performance of the EDQ system and how these properties should be configured in various circumstances.


EDQ has a large number of properties that are used to configure various aspects of the system. A relatively small number of these are used to control the performance characteristics of the system.

Performance tuning in EDQ is often discussed in terms of CPU cores. In this chapter, this refers to the number of logical CPUs reported by the Java Virtual Machine, as returned by a call to the Runtime.availableProcessors() method.
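As an illustration (a standalone sketch, not part of EDQ itself), the value that the JVM reports can be checked with a few lines of Java:

```java
public class CoreCount {
    public static void main(String[] args) {
        // The same call that EDQ's tuning properties rely on when left
        // at their default of 0 (one thread per reported CPU core).
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Logical CPUs reported to the JVM: " + cores);
    }
}
```

Note that on virtualized or containerized hardware this value reflects the CPUs visible to the JVM, which may differ from the physical core count of the host.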

4.1 Understanding the Properties File

The tuning controls are exposed as properties in the director.properties file. This file is found in the oedq_local_home configuration directory.

Note:

In most cases, it is not necessary to tune these properties, and their default settings are intended to use as much of the available hardware as possible, which is normally desirable for optimal performance. You should normally only alter these properties if advised to do so by an EDQ expert.

For most use cases, there is little performance advantage in values larger than 16 for these settings, so it may be advisable to cap them at 16 when deploying EDQ on a very large server. For example, on a single ExaLogic node with perhaps 70 logical CPUs available, multiple EDQ managed servers (in a cluster, if required), each running up to 16 threads, will normally provide better overall performance than a single unconstrained server.

The most important tuning properties are as follows:

runtime.threads

This property determines the number of threads that will be used for each batch job that is invoked. The default value of this property is zero, meaning that the system should start one thread for each available CPU core. You can specify an explicit number of threads by supplying a positive, non-zero integer as the value of this property. For example, if you know that you want to start a total of four threads for each batch process, set runtime.threads to 4.

runtime.intervalthreads

This property determines the number of threads that will be used by each process when running in interval (real-time) mode. This also defines the number of requests that can be processed simultaneously. The default behavior is to run a single thread for each process running in interval mode.

workunitexecutor.outputThreads

This property determines the number of threads that will be used to write data to the results database. These threads service the queue of results and output data for the whole system, and so are shared by all the processes running on the system. The default value of this property is zero, meaning that the system should use one output thread for each available CPU core. You can specify an explicit number of output threads by supplying a positive, non-zero integer as the value of this property. For example, if you know that you want to use a total of four output threads, set workunitexecutor.outputThreads to 4.
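For example, a director.properties fragment that pins all three controls to explicit values might look as follows (the property names are those described above; the values shown are purely illustrative, not recommendations):

```properties
# Four threads per batch job, rather than one per CPU core
runtime.threads = 4

# Two threads per process running in interval (real-time) mode
runtime.intervalthreads = 2

# Four threads writing results and staged data to the repository database
workunitexecutor.outputThreads = 4
```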

4.2 Tuning for Batch Processing

The default tuning settings provided with EDQ are appropriate for most systems that are primarily used for batch processing. Enough threads are started when running a job to use all available cores. If multiple jobs are started, the operating system can schedule the work for efficient sharing between the cores. It is best practice to allow the operating system to perform the scheduling of these kinds of workloads.

4.3 Tuning for Real-Time Processing

When a production system is used for a significant amount of real-time processing, it should not be used for simultaneous batch and real-time processing unless real-time response is not critical. In general, we recommend running only those batch processes that are needed to prepare data for real-time processing (for example, the preparation of reference data for real-time reference data matching).

4.3.1 Tuning Batch Processing On Real-Time Systems

If batch processing must be run on a system that is being used for real-time processing, it is best practice to run the batch work when the real-time processes are stopped, such as during a scheduled maintenance window. In this case, the default setting of runtime.threads is appropriate. If it is necessary to run batch processing while real-time services are running, set runtime.threads to a value that is less than the total number of cores. By reducing the number of threads started for the batch processes, you prevent those processes from placing a load on all of the available cores when they run, so real-time service requests that arrive while the batch is running will not be competing with it for CPU time.
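For example, on an 8-core server that must stay responsive to real-time requests, you might reserve half the cores for real-time work (the value shown is illustrative only; choose it based on your own workload):

```properties
# Batch jobs use at most 4 of the 8 available cores, leaving
# headroom for real-time service requests
runtime.threads = 4
```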

4.4 Tuning JVM Parameters

JVM parameters should be configured during the installation of EDQ. For more information, see section Setting Server Parameters to Support Enterprise Data Quality in Oracle Fusion Middleware Installing and Configuring Enterprise Data Quality. If it becomes necessary to tune these parameters post-installation to improve performance, follow the instructions in this section.

Note:

All of the recommendations in this section are based on EDQ installations using the Java HotSpot Virtual Machine. Depending on the nature of the implementations, these recommendations may also apply to other JVMs.

4.4.1 Setting the Maximum Heap Memory

If an OutOfMemory error message is generated in the log file, it may be necessary to increase the maximum heap space parameter, -Xmx. For most use cases, a setting of 8GB is sufficient. However, large EDQ installations may require a larger maximum heap size; in this case, the normal recommendation is to set the -Xmx parameter to half the physical memory of the server.
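For example, on a WebLogic-based installation, the heap size is typically raised by setting the managed server's memory arguments in its start script (the exact file name and mechanism vary by installation; the fragment below is an assumption for illustration):

```shell
# In setUserOverrides.sh (or the equivalent start script),
# on a server with 16GB of physical memory:
export USER_MEM_ARGS="-Xmx8g"
```

After changing this setting, restart the managed server for it to take effect.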

4.5 Tuning Database Parameters

The EDQ parameter with the most significant impact on database performance is workunitexecutor.outputThreads. This parameter determines the number of threads, and hence the number of database connections, that will be used to write results and staged data to the database. All processes running on the application server share this pool of threads, so there is a risk of processing becoming I/O bound in some circumstances. If there are processes that are particularly I/O intensive relative to their CPU usage, and the database machine is more powerful than the machine hosting the EDQ application server, it may be worth increasing the value of workunitexecutor.outputThreads. The additional database threads will use more connections to the database and place more load on it.

4.6 Adjusting the Client Heap Size

Under certain conditions, client heap size issues can occur; for example, when:

  • attempting to export a large amount of data to a client-side Excel file, or

  • opening up Match Review when there are many groups.

EDQ allows the client heap size to be adjusted using a property in the blueprints.properties file.

To double the default maximum client heap space for all Java Web Start client applications, create (or edit if it exists) the file blueprints.properties in the local configuration directory of the EDQ server. For more information about the EDQ configuration directories, see "EDQ Directory Requirements" in Installing Oracle Enterprise Data Quality.

Add the line:

*.jvm.memory = 512m

Note:

Increasing this value will cause all connecting clients to change their heap sizes to 512MB. This could have a corresponding impact on client performance if other applications are in use.

To adjust the heap size for a specific application, replace the asterisk, *, with the blueprint name of the client application from the following list:

  • director - (Director)

  • matchreviewoverview - (Match Review)

  • casemanager - (Case Management)

  • casemanageradmin - (Case Management Administration)

  • opsui - (Server Console)

  • diff - (Configuration Analysis)

  • issues - (Issue Manager)

For example, to double the maximum client heap space for Director, add the following line:

director.jvm.memory = 512m

When doubling the client heap space for more than one application, simply repeat the property; for example, for Director and Match Review:

director.jvm.memory = 512m

matchreviewoverview.jvm.memory = 512m

4.7 Designing Fast Jobs: General Performance Options

You can use four general techniques to maximize the performance of your jobs, as described in the following sections.

4.7.1 Streaming Data and Disabling Staging

You can develop jobs that stream imported data directly into processes instead of, or as well as, staging the imported data in the EDQ repository database. Where only a small number of threads are available to a job, streaming data into that job may enable it to process the data more quickly. This is because bypassing the staging of imported data reduces a job's I/O load. Depending on your job's technical and business requirements, and the resources available to it, you may be able to stream data into it, or stage the data and stream it in, to improve performance. Note, however, that where a large number of threads are available to a job, it may run more quickly if you snapshot the data, so that it is all available from the outset.

For the avoidance of doubt: you can stream data into a job with or without staging it. However, you cannot disable the staging of imported data unless you are streaming data.

A job that streams imported data directly into and out of a process or chain of processes without staging it acts as a pipe, reading records directly from a data store and writing records to a data target.

Configuration

To stream data into a process:

  • Create a job.

  • Add both the snapshot and the process as tasks within the same phase of the job, ensuring that the snapshot is directly connected to the process.

To additionally disable the staging of imported data in the EDQ repository:

  • Right-click the snapshot within the job and select Configure Task… or Configure Connector...

  • Within the Configure Task dialog box, de-select the Stage data? check box.


Note:

Any record selection criteria (snapshot filtering or sampling options) will still apply when streaming data.

To stream an export:

  • Create a Process that finishes with a Writer that writes to a Data Interface.

  • Create an Export that reads from the same Data Interface.

  • Within a Job, add the Process and the Export as tasks in the same phase of the Job.

  • Ensure that the Process that writes to your Data Interface is directly connected to the Export.

If you have configured EDQ as outlined above, then, by default, data will not be staged in the repository. This is because you have not selected a Data Interface Output Mapping that points at a set of staged data.


If you want to enable staging of the data that is to be exported:

  • Create a Data Interface Mapping that points to a set of staged data.

  • Right-click the Process within your Job and select Configure Task… or Configure Connector...

  • In the Configure Task dialog, navigate to the Writers tab.

  • Ensure that the Enabled? check box beside the writer is ticked (it should be ticked by default).

  • Select the Data Interface Mapping that points to a set of staged data.


When to Stage Data, and When to Disable Staging

Whilst designing a process, you will often run it against data that has been staged in the EDQ repository via a snapshot. Streaming data into a job without staging it may be appropriate when:

  • You are dealing with a production environment.

  • You have a large number of records to process.

  • You always want to use the latest records from the source system.

However, streaming a snapshot without staging it is not always the quickest or best option. If you need to run several processes on the same set of data, or if you need to lookup on staged data, it may be more efficient to stage the data via a snapshot as the first task of a job, and then run the dependent processes. If your job has a large number of threads available to it, it may run more quickly if all of the data is staged at the outset. Additionally, if the source system for the snapshot is live, it may be best to run the snapshot in a phase on its own so that the impact on the source system is minimized. In this case, the data will not be streamed into a process, since the snapshot and process need to be directly connected to each other within the same job phase for streaming to occur.

For the avoidance of doubt: if you connect a process directly to a snapshot, then the data will always be streamed into that process, regardless of whether it is also staged in the repository (which is determined by the Stage data? check box). Streaming the data into EDQ and also staging it may, in some cases, be an efficient approach - for example, if the data is used again later in the job.

Streaming an Export

When an export of a set of staged data is configured to run in the same job after the process that writes the staged data, the export will always write records as they are processed, regardless of whether records are also staged in the repository. However, it is possible to realize a small performance gain by disabling staging so that data is only streamed to its target.

You may choose to disable staging of output data:

  • For deployed data cleansing jobs.

  • If you are writing to an external staging database that is shared between applications. (For example when running a data quality job as part of a larger ETL process, and using an external staging database to pass the data between EDQ and the ETL tool.)

4.7.2 Minimized Results Writing

Minimizing results writing reduces the amount of Results Drilldown data that EDQ writes to the repository from processes, and so saves on I/O.

Each process in EDQ runs in one of three Results Drilldown modes:

  • All (all records in the process are written in the drilldowns)

  • Sample (a sample of records are written at each level of drilldown)

  • None (metrics only are written - no drilldowns will be available)

    All mode should be used only on small volumes of data, to ensure that all records can be fully tracked in the process at every processing point. This mode is useful when processing small data sets, or when debugging a complex process using a small number of records.

    Sample mode is suitable for high volumes of data, ensuring that a limited number of records are written for each drilldown. The System Administrator can set the number of records to write per drilldown; by default this is 1000 records. Sample mode is the default when running processes interactively from the Director User Interface.

    None mode should be used to maximize the performance of tested processes that are running in production, and where users will not need to interact with results. None is the default when processes are run within jobs.

    To change the Results Drilldown mode when executing a process, use the Run Preferences screen, or create a Job and double click the process task to configure it.

    For example, the following process is configured so that it does not write drilldown results when it is deployed in production via a job (this is the default when a process is run within a job):


The Effect of Run Labels

Note that jobs that are run with run labels from either the Server Console user interface or the command line do not generate results drill-downs.

4.7.3 Disabling Sorting and Filtering

When working with large data volumes, it can take a long time to index snapshots and staged data in order to enable users to sort and filter the data in the Results Browser. In many cases, this sorting and filtering capability will not be needed, or will only be needed when working with smaller samples of the data.

The system applies intelligent sorting and filtering, where it will enable sorting and filtering when working with smaller data sets, but will disable sorting and filtering for large data sets. However, you can choose to override these settings - for example to achieve maximum throughput when working with a number of small data sets.

Snapshot Sort/Filter options

When a snapshot is created, the default setting is to 'Use intelligent Sort/Filtering options', so that the system will decide whether or not to enable sorting and filtering based on the size of the snapshot.

However, if you know that no users will need to sort or filter results that are based on a snapshot in the Results Browser, or if you only want to enable sorting or filtering at the point when the user needs to do it, you can disable sorting and filtering on the snapshot when adding or editing it.

To do this, edit the snapshot, and on the third screen (Column Selection), uncheck the option to Use intelligent Sort/Filtering, and leave all columns unchecked in the Sort/Filter column:


Alternatively, if you know that sorting and filtering will only be needed on a sub-selection of the available columns, use the tick boxes to select the relevant columns. Note that any columns that are used as lookup columns by a Lookup and Return processor should be indexed to boost performance.

Disabling sorting and filtering means that the total processing time of the snapshot will be less, as the additional task to enable sorting and filtering will be skipped.

Note that if a user attempts to sort or filter results based on a column that has not been enabled, the user will be presented with an option to enable it at that point.

Staged Data Sort/Filter options

When staged data is written by a process, the server does not enable sorting or filtering of the data by default. The default setting is therefore maximized for performance.

If you need to enable sorting or filtering on written staged data - for example, because the written staged data is being read by another process which requires interactive data drilldowns - you can enable this by editing the staged data definition, either to apply intelligent sort/filtering options (varying whether or not to enable sorting and filtering based on the size of the staged data table), or to enable it on selected columns (as below):


Match Processor Sort/Filter options

It is possible to set sort/filter enablement options for the outputs of matching.

Note:

This should only be enabled if you wish to review the results of match processing using the Match Review UI.

4.7.4 Resource-Intensive Processors

The following processors are highly resource intensive because they need to write all of the data they process to the EDQ repository before they work on it:

  • Quickstats Profiler

  • Record Duplication Profiler

  • Duplicate Check

  • All match processors

  • Group and Merge

  • Phrase Profiler

  • Merge Data Streams

Note:

The Merge Data Streams processor should only be used to merge records from separate readers; it is NOT necessary to use it to connect multiple paths from the same reader.

The following processors work on a record-by-record basis, but are also highly resource intensive:

  • Parse

    Note:

    The Parse processor's performance is highly dependent upon its configuration; it can be fast or slow.
  • Address Verification

Clearly, there are situations in which you will need to use one or more of these resource-intensive processors. For example, a deduplication process requires a match processor. However, when optimal performance is required, you should avoid their use where possible. See below for specific guidance on how to tune the matching, Parse and Address Verification processors.

4.8 Performance Tuning for Parsing and Matching

In the case of Parsing and Matching, a large amount of work is performed by an individual processor, as each processor has many stages of processing. In these cases, options are available to optimize performance at the processor level.

See below for more information on how to maximize performance when parsing or matching data:

4.8.1 Place Parse and Match processors in their own Processes

Both parsing and matching are inherently resource-intensive, and can take time to run. For this reason, it is advisable to place parse and match processors in processes on their own (or with only a small number of other processors). This will enable you to isolate and therefore accurately measure their performance, which should in turn make it easier to tune them.

4.8.2 Parsing performance options

By default, the Parse processor works in Parse and Profile mode. This is useful during configuration, as the parser will output the Token Checks and Unclassified Tokens results views. These will help you to define parsing rules. In production, however, when maximum performance is required from a Parse processor, it should be run in Parse mode, rather than Parse and Profile mode. To change the Parser's run mode, click its Advanced Options link, and then set the run mode in the Options dialog box.


For even better performance where only metrics and data output are required from a Parse processor, the process that includes the parser may be run with no drilldowns - see Minimized results writing above.

When designing a Parse configuration iteratively, where fast drilldowns are required, it is generally best to work with small volumes of data. If a Parse processor's configuration drives it to generate a number of different patterns for a given input record (for example, because it has many classification and reclassification rules), it may be possible to improve performance by reducing the number of patterns produced, using the Patterns limit option (for example, setting it to 8), without altering results. If you change this option, test the parsing results before and after the change to confirm that they are unaffected.

4.8.3 Matching performance options

The following techniques may be used to maximize matching performance:

4.8.3.1 Optimized Clustering

Matching performance may vary greatly depending on the configuration of the match processor, which in turn depends on the characteristics of the data involved in the matching process. The most important aspect of configuration to get right is the configuration of clustering in a match processor.

In general, there is a balance to be struck between ensuring that as many potential matches as possible are found and ensuring that redundant comparisons (between records that are not likely to match) are not performed. Finding the right balance may involve some trial and error - for example, assessment of the difference in match statistics when clusters are widened (perhaps by using fewer characters of an identifier in the cluster key) or narrowed (perhaps by using more characters of an identifier in a cluster key), or when a cluster is added or removed.

The following two general guidelines may be useful:

  • If you are working with data with a large number of well-populated identifiers, such as customer data with address and other contact details such as e-mail addresses and phone numbers, you should aim for clusters with a maximum size of 20 for every million records, and counter sparseness in some identifiers by using multiple clusters rather than widening a single cluster.

  • If you are working with data with a small number of identifiers, for example, where you can only match individuals or entities based on name and approximate location, wider clusters may be inevitable. In this case, you should aim to standardize, enhance and correct the input data in the identifiers you do have as much as possible so that your clusters can be tight using the data available. In this case, you should still aim for clusters with a maximum size of around 500 records if possible (bearing in mind that every record in the cluster will need to be compared with every other record in the cluster - so for a single cluster of 500 records, there will be 500 x 499 = 249500 comparisons performed).
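The quadratic cost of wide clusters can be seen by reproducing the arithmetic above. The following standalone sketch (not EDQ code) uses the ordered n x (n - 1) comparison count given in the text:

```java
public class ClusterCost {
    // Ordered comparisons required within a single cluster of n records,
    // following the n x (n - 1) count used in the guideline above.
    public static long comparisons(int n) {
        return (long) n * (n - 1);
    }

    public static void main(String[] args) {
        // Doubling the cluster size roughly quadruples the work.
        System.out.println("cluster of 20:   " + comparisons(20));   // 380
        System.out.println("cluster of 500:  " + comparisons(500));  // 249500
        System.out.println("cluster of 5000: " + comparisons(5000)); // 24995000
    }
}
```

This is why narrowing clusters (or splitting one wide cluster into several tighter ones) usually matters more than any other matching optimization.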

4.8.3.2 Disabling Sort/Filter options in Match processors

By default, sorting, filtering and searching are enabled on all match results to ensure that they are available for user review. However, with large data sets, the indexing process required to enable sorting, filtering and searching may be very time-consuming, and in some cases, may not be required.

If you do not require the ability to review the results of matching using the Match Review Application, and you do not need to be able to sort or filter the outputs of matching in the Results Browser, you should disable sorting and filtering to improve performance. For example, the results of matching may be written and reviewed externally, or matching may be fully automated when deployed in production.

The setting to enable or disable sorting and filtering is available both on the individual match processor level, available from the Advanced Options of the processor (see Sort/Filter options for match processors for details), and as a process or job level override.

To override the individual settings on all match processors in a process, and disable the sorting, filtering and review of match results, deselect the option to Enable Sort/Filter in Match processors in a job configuration, or process execution preferences:


Note:

Sort / Filter in Match is disabled by default when processes are included in jobs.

4.8.3.3 Minimizing Output

Match processors may write out up to three types of output:

  • Match (or Alert) Groups (records organized into sets of matching records, as determined by the match processor. If the match processor uses Match Review, it will produce Match Groups; if it uses Case Management, it will produce Alert Groups.)

  • Relationships (links between matching records)

  • Merged Output (a merged master record from each set of matching records)

By default, all available output types are written. (Merged Output cannot be written from a Link processor.)

However, not all of the available outputs may be needed in your process. For example, you should disable Merged Output if you only want to identify sets of matching records.

Note that disabling any of the outputs will not affect the ability of users to review the results of a match processor.

To disable Match (or Alert) Groups output:

  • Open the match processor on the canvas and open the Match sub-processor.

  • Select the Match (or Alert) Groups tab at the top.

  • Un-check the option to Generate Match Groups report, or to Generate Alert Groups report.

    Or, if you know you only want to output the groups of related or unrelated records, use the other tick boxes on the same part of the screen.

To disable Relationships output:

  • Open the match processor on the canvas and open the Match sub-processor.

  • Select the Relationships tab at the top.

  • Un-check the option to Generate Relationships report.

    Or, if you know you only want to output some of the relationships (such as only Review relationships, or only relationships generated by certain rules), use the other tick boxes on the same part of the screen.

To disable Merged Output:

  • Open the match processor on the canvas and open the Merge sub-processor.

  • Un-check the option to Generate Merged Output.

    Or, if you know you only want to output the merged output records from related records, or only the unrelated records, use the other tick boxes on the same part of the screen.

4.8.3.4 Streaming Inputs

Batch matching processes require a copy of the data in the EDQ repository in order to compare records efficiently.

As data may be transformed between the Reader and the match processor in a process, and in order to preserve the capability to review match results if a snapshot used in a matching process is refreshed, match processors always generate their own snapshots of data (except from real time inputs) to work from. For large data sets, this can take some time.

Where you want to use the latest source data in a matching process, therefore, it may be advisable to stream the snapshot rather than running it first and then feeding the data into a match processor, which will generate its own internal snapshot (effectively copying the data twice). See Streaming Data and Disabling Staging above.

4.9 Performance Tuning for Address Verification

EDQ's Address Verification processor is a conduit to the Enterprise Data Quality Address Verification Server (EDQ AV). EDQ AV attempts to match each input record against all of the addresses that exist for that country in its Global Knowledge Repository. This operation is inherently resource-intensive, and it does take time to run. For this reason, it is advisable to place the Address Verification processor in a process on its own, or with only a small number of other processors. This will enable you to isolate and therefore accurately measure its performance, which should in turn make it easier to tune.

The EDQ Address Verification server requires a substantial amount of memory outside of the EDQ Application Server's Java Heap. Address Verification's performance may suffer if insufficient memory is available. See Application Server Tuning for more information about tuning the EDQ Application Server's Java Heap.

You can adjust Address Verification performance by tuning its caching options. You can control these parameters using the Address Verification processor's Additional Options field, which is available from the processor's Options tab. Two parameters that it may be beneficial to adjust are:

  • ReferenceDatasetCacheSize

  • ReferencePageCacheSize

Full information on the available options is available on Loqate's support site: https://www.loqate.com/support/options/

You should seek advice from Loqate Support before adjusting these parameters.

In addition to adjusting AV's caching parameters, Address Verification performance can be significantly improved by creating a different EDQ process for each country that you want to screen addresses from.

4.10 What Makes Processes Slow? Common Pitfalls

The most common causes of slow processes are described below.

4.10.1 Poor Matching Processor Configuration

EDQ Match processors are inherently very efficient, and feature many automatic optimizations. However, performance can be severely compromised by poor configuration. If a matching process takes a long time to run, this may be caused by its cluster configuration, as too many large clusters often result in too many comparisons, which will slow matching down.

4.10.2 Unnecessary Merge Data Streams Processors

A common misconception is that the Merge Data Stream processor is required to join up multiple paths from the same reader. It is not. Whilst the Merge Data Stream processor should be used to join up genuinely different data streams from different readers, any regular EDQ processor can join multiple paths from the same reader, and will simply work with all the distinct records from all of the joined paths.

4.10.3 Doing Too Much in a Single Process

It can be very difficult to identify the cause of a performance issue in a very large, complex process. Instead, you should create distinct modular processes for distinct operations, chaining them together with Data Interfaces. If you take this approach, you can easily see how long each process takes to run, which makes diagnoses much easier. An added benefit is that smaller processes are easier to understand and maintain.

4.10.4 Using the Script Processor when You Could Use a Core Processor

It is sometimes necessary to use the Script processor, but in nearly all cases this will result in slower performance than using a core processor, which runs compiled Java code. Do not use the Script processor unless you really have to.

4.10.5 Using Matching Processors Unnecessarily

EDQ's audit and transformation processors work on a single record at a time. They can do all of their processing in memory, and can scale to as much CPU power as the application server has at its disposal. EDQ match processors, on the other hand, operate on sets of data. (This is also true of some profiling processors - see the list of Resource Intensive processors, above, for details.) In order to assess the similarity of the records in the data set, match processors write these data sets to the EDQ repository database. This I/O overhead is a requirement of match processors (and also of the Record Duplication Profiler, the Quick Stats Profiler, the Record Duplication Check processor and the Phrase profiler). There are a number of scenarios in which match processors are absolutely necessary. For example, you should use match processors:

  • To identify fuzzy matches in large sets of data.

  • To identify matches using multiple fields.

  • Where you need to review possible matches.

However, if you simply need to return records in which a single field matches exactly, the Lookup and Return processor is likely to run more quickly than a match processor.

4.11 Tuning EDQ's Platform

Beyond designing efficient processes, EDQ itself does not require extensive tuning. There are only a few parameters you can usefully alter, and in most cases you can simply leave these set to their default values. (See the 'Oracle Fusion Middleware Administering Oracle Enterprise Data Quality' guide for more information). However, EDQ exists within an ecosystem. Aside from the physical hardware and network infrastructure, the most critical aspect of this ecosystem is the platform that EDQ runs on: specifically its application server and its database repository. Most performance issues are caused by sub-optimal process and job configuration, and it is not necessary to tune the platform to resolve them. However, in some cases, a few simple platform optimization steps can provide a performance boost.

Before tuning the platform, note that when configuring a new EDQ installation, you should run a realistic load of test data through the system and observe the results before finalizing your settings.

4.11.1 The Application Server and the Database Repository

4.11.1.1 Relative Importance of the Application Server and the Database Repository

Tuning the application server's maximum Java heap size can provide performance benefits. When tuning the maximum Java Heap size, please bear the following points in mind:

  • The more processor cores (and therefore threads) available, the more memory you should allocate to the Java Heap. We recommend that, for optimal performance, you should allocate 2GB of memory for each runtime thread. (Note that EDQ will employ a runtime thread for every logical CPU that it detects).

  • For most use cases, a setting of 8 GB is sufficient.

    Note:

    EDQ Customer Data Services Pack (CDS) has intensive memory requirements, and may require more than 8 GB.
  • Note that allocating too much of your server's overall memory to the Java Heap can cause performance problems, as there may not be enough spare memory left to run applications that require non-Java Heap memory (an example is the EDQ Address Verification Server). Each thread also requires non-Java Heap memory, and so where EDQ has access to many threads, this will also increase the non-Java Heap memory requirement. As a rule of thumb, you should not allocate more than two thirds of your server's overall memory to the Java Heap.
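The sizing guidelines above (2 GB of heap per runtime thread, capped at two thirds of the server's overall memory) can be sketched as a simple calculation. This is an illustrative sketch only; the class and method names are hypothetical, not part of EDQ:

```java
public class HeapSizing {
    // Rule of thumb from the guidelines above: 2 GB per runtime thread,
    // but never more than two thirds of the server's physical memory.
    static long recommendedHeapGb(int logicalCpus, long totalMemoryGb) {
        long byThreads = 2L * logicalCpus;      // 2 GB per thread
        long ceiling = (totalMemoryGb * 2) / 3; // two-thirds cap
        return Math.min(byThreads, ceiling);
    }

    public static void main(String[] args) {
        // Example: a 16-core server with 48 GB of RAM
        System.out.println(recommendedHeapGb(16, 48)); // prints 32
    }
}
```

For the 16-core, 48 GB example, both rules yield 32 GB, so a maximum heap of `-Xmx32g` would be consistent with the guidelines; on smaller servers the 8 GB default discussed above is usually sufficient.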

For more information about tuning the application server, see the 'Oracle Fusion Middleware Administering Oracle Enterprise Data Quality' guide.

4.11.1.2 Database Tuning

Whilst EDQ does require optimal database I/O to perform certain operations, such as matching, efficiently, how to tune your database depends on your specific use case and circumstances. It is, however, possible to offer some general advice. Database Administrators should:

  • Ensure that they allocate sufficient Tablespace for EDQ. (You can find guidelines about Tablespaces sizes and other Database settings in the 'Oracle Fusion Middleware Installing and Configuring Oracle Enterprise Data Quality' guide.)

  • Adopt an experiential approach to tuning the database's I/O performance.

  • Set key database parameters, such as PGA, SGA, Processes and Sessions, in line with the recommendations in the EDQ Installation Guide.

    It also may be worth noting that on WebLogic, the configured Data Sources control the maximum number of connections to the database. On large systems running many threads, it may be necessary to adjust the maximum connections to the Results database (where data is written and read by EDQ processes) from the default value of 200. A value of 500 is sufficient for nearly all use cases.

  • Archive (redo) logging is resource expensive. In some cases, where all EDQ processing on a server is entirely stateless and the server can be re-provisioned automatically with no loss of service, it may be appropriate to turn off archive logging in the database for better performance. Or, if this is not possible, you may be able to mount redo log files on separate disks to improve performance.

4.11.2 Processor Cores and Process Threads

EDQ will create a runtime thread for each logical CPU (or 'core') that it detects. It will, where possible, divide processing amongst parallel threads. In general, process run times decrease as cores are added. However, after a certain number of cores have been made available to the Java Virtual Machine (JVM), the decrease in processing time for each extra core tends to become marginal, as the increased processing power is offset by greater contention. This is the case even when other aspects of your system, such as reading and writing to files and databases, have been optimized. In typical batch processing, improvements in run-times for each core added to the JVM become marginal when the total number of cores used by the JVM exceeds 16.
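The default thread count described above derives from the `Runtime.availableProcessors()` call mentioned at the start of this chapter. The following sketch shows that call together with the 16-core guideline; the `suggestedThreads` helper is a hypothetical illustration, not an EDQ API:

```java
public class CoreCount {
    // Cap the thread count at 16, per the observation above that
    // per-core gains become marginal beyond 16 cores.
    static int suggestedThreads(int logicalCpus) {
        return Math.min(logicalCpus, 16);
    }

    public static void main(String[] args) {
        // availableProcessors() reports the logical CPUs visible to the JVM;
        // EDQ uses this value as its default runtime thread count.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(suggestedThreads(cores));
    }
}
```

Note that `availableProcessors()` reports the CPUs visible to the JVM, which on containerized or CPU-pinned deployments may differ from the physical core count of the host.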

4.11.2.1 Process Threads

The number of process threads used to execute jobs is automatically set to the number of available cores, and should not usually be changed. However, when EDQ is installed on servers that have more than 16 cores and which are running heavy batch processing workloads, you may want to manually set the number of threads used by EDQ to 16. This is because a higher number of threads may lead to contention for resources when the threads have finished their work.

In order to set the number of threads manually, amend the following parameters in the director.properties file, which should be located in your EDQ instance's oedq.local.home folder:

  • runtime.threads = 16

  • runtime.indexingthreads = 16

  • workunitexecutor.outputThreads = 16
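Taken together, the three settings above would appear in the director.properties file as follows (a minimal fragment; any other properties already in the file should be left unchanged):

```properties
# director.properties - cap batch processing at 16 threads
runtime.threads = 16
runtime.indexingthreads = 16
workunitexecutor.outputThreads = 16
```

A restart of the EDQ managed server is typically required for changes to director.properties to take effect.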

When EDQ is installed on servers with more than 16 cores, you can scale by adding additional managed servers. Every managed server should be placed within the same WebLogic Cluster, but each will run within a separate Java Virtual Machine. For more information about how to add additional managed servers see the High Availability section of the Oracle® Fusion Middleware Understanding Oracle Enterprise Data Quality guide. A single EDQ batch job will always run on a single managed server, so adding managed servers will not necessarily enable individual jobs to run more quickly. The advantage of having multiple managed servers is that it enables different jobs to run on different managed servers concurrently.

Note that the database server may be remote from the application server where EDQ will run, but it must be on a fast network connection.