1 Overview of Oracle Enterprise Data Quality

This chapter gives an overview of Oracle Fusion Middleware and Oracle Enterprise Data Quality.

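This chapter includes the following sections:

  • Oracle Fusion Middleware Overview

  • About Oracle Enterprise Data Quality

  • Understanding the Software Components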

1.1 Oracle Fusion Middleware Overview

Oracle Fusion Middleware is a collection of standards-based software products that spans a range of tools and services: from Java EE and developer tools, to integration services, business intelligence, and collaboration. Oracle Fusion Middleware offers complete support for development, deployment, and management of applications. Oracle Fusion Middleware components are monitored at run time using Oracle Enterprise Manager Fusion Middleware Control Console.

1.2 About Oracle Enterprise Data Quality

Oracle Enterprise Data Quality (EDQ) provides a comprehensive data quality management environment that is used to understand, improve, protect, and govern data quality. EDQ facilitates best practice master data management, data integration, business intelligence, and data migration initiatives. EDQ also provides integrated data quality in customer relationship management and other applications.

Following are the key features of EDQ:

  • Integrated data profiling, auditing, cleansing and matching

  • Browser-based client access

  • Ability to handle all types of data (for example, customer, product, asset, financial, and operational)

  • Connection to any Java Database Connectivity (JDBC) compliant data sources and targets

  • Multi-user project support (role-based access, issue tracking, process annotation, and version control)

  • Service-Oriented Architecture (SOA) support for designing processes that may be exposed to external applications as a service

  • Designed to process large data volumes

  • A single repository to hold data along with gathered statistics and project tracking information, with shared access

  • Intuitive graphical user interface designed to help you solve real world information quality issues quickly

  • Easy, data-led creation and extension of validation and transformation rules

  • Fully extensible architecture allowing the insertion of any required custom processing

1.3 Understanding the Software Components

EDQ is a Java Web Application that uses a Java Servlet Engine, a Java Web Start graphical user interface, and a Structured Query Language (SQL) relational database management system (RDBMS) for data storage.

EDQ has a client-server architecture. It comprises several client applications that are Graphical User Interfaces (GUIs), a data repository, and a business layer. This section provides details on the architecture of these components, their data storage, data access, and I/O requirements.

1.3.1 What Are the Client Applications?

EDQ provides a number of client applications that are used to configure and operate the product. Most are Java Web Start applications, and the remainder are simple web pages. The following table lists all the client applications, how they are started, and what each does:

Application Name                 Starts In   Purpose
-------------------------------  ----------  ------------------------------------------------------------
Director                         Web Start   Design and test data quality processing
Server Console                   Web Start   Operate and monitor jobs
Match Review                     Web Start   Review match results and make manual match decisions
Dashboard                        Browser     Monitor data quality key performance indicators and trends
Case Management                  Web Start   Perform detailed investigations into data issues through configurable workflows
Case Management Administration   Web Start   Configure workflows and permissions for Case Management
Web Service Tester               Browser     Test EDQ Web Services
Configuration Analysis           Web Start   Analyze project configurations and report on differences between versions
Issue Manager                    Web Start   Manage a list of data quality issues
Administration                   Browser     Administer the EDQ server (users, groups, extensions, launchpad configuration)
Change Password                  Browser     Change password


The client applications can be accessed from the EDQ Launchpad on the EDQ server. When a client launches one of the Java Web Start applications, such as Director, the application is downloaded, installed, and run on the client machine. The application communicates with the EDQ server to instantiate changes and receive messages from the server, such as information about tasks that are running and changes made by other users.

Because EDQ is extensible, an installation that targets a particular use case can add further user applications. For example, Oracle Watchlist Screening extends EDQ to add a user application for screening data against watchlists.

Note:

Many of the client applications are available either separately (for dedicated use) or within another application. For example, the Configuration Analysis, Match Review and Issue Manager applications are also available in Director.

1.3.1.1 Where Is Data Stored?

The client computer only stores user preferences for the presentation of the client applications, while all other information is stored on the EDQ server.

1.3.1.2 Network Communications

The client applications communicate over either a Hypertext Transfer Protocol (HTTP) or a Hypertext Transfer Protocol Secure (HTTPS) connection, as determined by the application configuration on start-up. For simplicity, this connection is referred to as 'the HTTP connection' in the remainder of this document.

1.3.2 How Is Data Stored in the EDQ Repository?

EDQ uses a repository that contains two database schemas: the Config schema and the Results schema.

Note:

Each EDQ server must have its own Config and Results schemas. If multiple servers are deployed in a High Availability architecture, then the configuration cannot be shared by pointing both servers to the same schemas.

1.3.2.1 What Is the Config Schema?

The Config schema stores configuration data for EDQ. It is generally used in the typical transactional manner common to many web applications: queries are run to access small numbers of records, which are then updated as required.

Normally, only a small amount of data is held in this schema. In simple implementations, it is likely to be on the order of several megabytes. In an exceptionally large EDQ system, especially one where Case Management is heavily used, the storage requirements could reach 10 GB.

Access to the data held in the Config schema is typical of configuration data in other relational database management system (RDBMS) applications. Most database access is in the form of read requests, with relatively few data update and insert requests.

1.3.2.2 What Is the Results Schema?

The Results schema stores snapshot, staged, and results data. It is highly dynamic, with tables being created and dropped as required to store the data handled by processors running on the server. Temporary working tables are also created and dropped during process execution to store any working data that cannot be held in the available memory.

The amount of data held in the Results schema will vary significantly over time, and data capture and processing can involve gigabytes of data. Data may also be stored in the Results schema on a temporary basis while a process or a job runs. In the case of a job, several versions of the data may be written to the database during processing.

The Results schema shows a very different data access profile to the Config schema, and is extremely atypical of a conventional web-based database application. Typically, tables in the Results schema are:

  • Created on demand

  • Populated with data using bulk JDBC application programming interfaces (APIs)

  • Queried using full table scans to support process execution

  • Indexed

  • Queried using complex SQL statements in response to user interactions with the client applications

  • Dropped when the process or snapshot they are associated with is run again

The dynamic nature of this schema means that it must be handled carefully. For example, it is often advisable to mount redo log files on a separate disk.
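The table lifecycle listed above can be illustrated with a minimal sketch. It is not EDQ code: it uses Python's built-in sqlite3 module as a stand-in for the Results schema, `executemany` as a stand-in for bulk JDBC inserts, and a hypothetical table named `staged_customers`. It only shows the order of operations (drop on re-run, create, bulk load, scan, index, query).

```python
import sqlite3


def run_process(conn, rows):
    """Walk one hypothetical results table through its lifecycle.

    Returns (row_count, duplicate_names) for the rows that were loaded.
    """
    # Dropped when the process is run again, then created on demand.
    conn.execute("DROP TABLE IF EXISTS staged_customers")
    conn.execute(
        "CREATE TABLE staged_customers (id INTEGER, name TEXT, country TEXT)"
    )

    # Populated in bulk (stand-in for a JDBC batch insert).
    conn.executemany("INSERT INTO staged_customers VALUES (?, ?, ?)", rows)
    conn.commit()

    # Queried with a full table scan to support process execution.
    row_count = conn.execute(
        "SELECT COUNT(*) FROM staged_customers"
    ).fetchone()[0]

    # Indexed so that later interactive queries can use the index.
    conn.execute("CREATE INDEX idx_staged_name ON staged_customers (name)")

    # Queried with a more complex statement, as a results browser might.
    duplicates = conn.execute(
        "SELECT name, COUNT(*) FROM staged_customers "
        "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
    ).fetchall()
    return row_count, duplicates
```

Calling `run_process` twice on the same connection shows the re-run behavior: the second call drops the table (and its index) before recreating and reloading it.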

1.3.3 Where Does EDQ Store Working Data on Disk?

EDQ uses two configuration directories, which are separate from the installation directory that contains the program files. These directories are:

  • The base configuration directory: This directory contains default configuration data. This directory is named oedq.home in an Oracle WebLogic installation but can be named anything in an Apache Tomcat installation.

  • The local configuration directory: This directory contains overrides to the base configuration, such as data for extension packs or overrides to default settings. EDQ looks in this directory first, for any overrides, and then looks in the base directory if it does not find the data it needs in the local directory. The local configuration directory is named oedq.local.home in an Oracle WebLogic installation but can be named anything in an Apache Tomcat installation.
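The lookup order described above (local directory first, then base directory) can be sketched as follows. This is an illustration only, not the EDQ implementation: the function name `resolve_config_file` and the directory and file names used with it are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_config_file(
    relative_name: str, local_home: Path, base_home: Path
) -> Optional[Path]:
    """Return the file that takes effect for relative_name.

    The local configuration directory is checked first so that an
    override wins; the base directory is used only as a fallback.
    Returns None when neither directory contains the file.
    """
    for directory in (local_home, base_home):
        candidate = directory / relative_name
        if candidate.is_file():
            return candidate
    return None
```

With this ordering, placing a file in the local directory overrides the default without modifying the base directory, which keeps the shipped defaults intact.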

Some of the files in the configuration directories are used when processing data from and to file-based data stores. Other files are used to store server configuration properties, such as which functional packs are enabled, how EDQ connects to its repository databases, and other critical information.

The names and locations of the home and local home directories are important to know in the event that you need to perform any manual updates to templates or other individual components.

These directories are created when you install EDQ. For more information, see Oracle Fusion Middleware Installing and Configuring Enterprise Data Quality.

1.3.4 What Is the Business Layer?

The business layer fulfills three main functions:

  • Provides the API that the client applications use to interact with the rest of the system.

  • Notifies the client applications of server events that may require client application updates.

  • Runs the processes that capture and process data.

The business layer stores configuration data in the Config schema, and working data and results in the Results schema.

When passing data to and from the client application, the business layer behaves in a manner common to most traditional Java Web Applications. The business layer makes small database transactions and sends small volumes of information to the front end using the HTTP connection. What is somewhat unusual is that the application front ends are mostly rich GUIs rather than browsers. Therefore, the data sent to the client application consists mostly of serialized Java objects rather than the more traditional HTML.

However, when running processes and creating snapshots, the business layer behaves more like a traditional batch application. In its default configuration, it spawns multiple threads and database connections in order to handle potentially very large volumes of data, and uses all available CPU cores and database I/O capacity.

It is possible to configure EDQ to limit its use of available resources, but this has clear performance implications. For further information, see the EDQ Installation Guide and EDQ Admin Guide.
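The trade-off between using every core and capping resource use can be sketched as follows. This is an illustration under stated assumptions, not how EDQ is implemented or configured: the names `worker_count`, `process_in_batches`, `batch_size`, and `max_threads` are hypothetical, and the sketch shows only the sizing policy, not real throughput.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def worker_count(max_threads: Optional[int] = None) -> int:
    """Use every available core by default, or fewer when a cap is set."""
    cores = os.cpu_count() or 1
    if max_threads is None:
        return cores
    return max(1, min(cores, max_threads))


def process_in_batches(
    records: List[str],
    transform: Callable[[str], str],
    batch_size: int = 1000,
    max_threads: Optional[int] = None,
) -> List[str]:
    """Apply transform to every record, one batch per worker task."""
    batches = [
        records[i:i + batch_size] for i in range(0, len(records), batch_size)
    ]
    with ThreadPoolExecutor(max_workers=worker_count(max_threads)) as pool:
        # map() returns results in batch order even though the
        # batches themselves may run concurrently.
        results = pool.map(lambda batch: [transform(r) for r in batch], batches)
        return [item for batch in results for item in batch]
```

Passing a small `max_threads` value leaves capacity for other workloads on the same host, at the cost of longer run times for large data sets.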