1 Introduction to Oracle Data Masking and Subsetting

This chapter explains the basics of the Oracle Data Masking and Subsetting pack by providing an overview of its concepts, major components, and use cases.

It is recommended that you understand these concepts before implementing Oracle Data Masking and Subsetting. If you are already familiar with them or want to start masking and subsetting data right away, refer to the following chapters:
  • Start with the Prerequisites chapter to understand the privileges, roles, and storage requirements.
  • Refer to the Application Data Modeling chapter to discover and model sensitive columns.
  • Refer to the Data Masking chapter to define and execute masking criteria.
  • Refer to the Data Subsetting chapter to define and execute subsetting criteria.

Note:

For Oracle Data Masking and Subsetting licensing information, refer to the Oracle Database Licensing Guide.

Why Data Masking and Subsetting (DMS)

The reasons to mask and subset data include the following:

  • Limit sensitive data proliferation: Growing security threats have increased the need to limit the exposure of sensitive information. At the same time, copying production data for non-production purposes such as test and development proliferates sensitive data, expands the security and compliance boundary, and increases the likelihood of data breaches.

  • Share what is necessary: Companies often have to share a production data set with internal and external parties for various reasons. For example, a Cloud application provider may have to extract and share information specific to individual subscribers on demand. Another example is a company served with a court order that must extract and share a subset of production data with the court. In many cases, it is more efficient to extract and share a portion or subset of the information instead of sharing the entire production data set.

  • Comply with data privacy laws and standards: Data privacy standards such as PCI-DSS and the European Union (EU) General Data Protection Regulation (GDPR) emphasize protecting sensitive information in non-production environments because these environments are typically not as protected or monitored as production systems. EU GDPR also mandates an individual's rights to erasure (the right to be forgotten), portability, and rectification, which require identifying and processing a subset of information.

  • Minimize storage costs: Using the entire production data set for test, development, and QA purposes incurs additional storage costs and prolongs test and development cycles, increasing the overall storage and operational cost.

Masking and subsetting data addresses the above use cases. Data Masking is the process of replacing sensitive data with fictitious yet realistic-looking data. Data Subsetting is the process of downsizing a data set, either by discarding or by extracting data. Masking limits sensitive data proliferation by anonymizing sensitive production data. Subsetting helps minimize storage costs by deleting data or by extracting a subset of data for sharing or archival. Data Masking is also known as Static Data Masking, and Data Subsetting is also known as Test Data Management.

Challenges Organizations Face when Masking and Subsetting Data

Organizations typically mask and subset data using custom scripts or in-house solutions. While these solutions might work for a few columns, they may not scale to large applications with distributed databases and thousands of columns, resulting in challenges such as:

  • How to locate sensitive data in the midst of numerous applications, databases, and environments?

  • How to accurately protect sensitive data, given that data has different shapes and forms such as VISA, AMEX, Discover, MasterCard, Social Security Numbers, and more?

  • Is the protected data usable to developers, testers, and applications?

  • Will the applications continue to work after masking and subsetting is done?

In addition to the above challenges, organizations may not have the resources to develop and maintain such a solution in this ever-changing IT landscape.

How Oracle Data Masking and Subsetting Addresses Masking and Subsetting Challenges

Oracle Data Masking and Subsetting addresses the above challenges by providing an automated, flexible, and easy-to-use solution that masks and subsets sensitive production data, thereby allowing data to be shared safely across non-production environments.

The Oracle Data Masking and Subsetting (DMS) pack of Oracle Enterprise Manager (EM) helps:

  • maximize the business value of data by masking sensitive information

  • minimize the compliance boundary by not proliferating the sensitive production information

  • lower the storage costs on test and development environments by subsetting data

  • automate the discovery of sensitive data and parent-child relationships

  • provide a comprehensive library of masking formats, masking transformations, subsetting techniques, and select application templates

  • mask and subset data In-Database, or at the file level (In-Export) by extracting the data from a source database

  • mask and subset both Oracle and non-Oracle databases

  • mask and subset Oracle Databases hosted on the Oracle cloud

  • preserve data integrity during masking and subsetting, and provide many more unique features

  • integrate with select Oracle testing, security, and integration products.

Figure 1-1 Oracle Data Masking and Subsetting Used in a Production and Test Database Setup

This flowchart provides an overview of how Oracle Data Masking and Subsetting works, focusing on the core functionality of performing Data Masking and Data Subsetting. It does not include all features, such as importing and exporting an Application Data Model and managing dictionary and non-dictionary referential relationships.

Oracle Data Masking and Subsetting Use Cases

What to Know About DMS

Oracle Data Masking and Subsetting unlocks the value of data without increasing risk and minimizes storage costs by enabling secure, cost-effective data provisioning for scenarios such as test, development, and partner environments. It protects sensitive data by discovering and masking it, and ensures that only relevant data is shared. This approach accelerates regulatory compliance by demonstrating data protection by design and by default, maximizes data value by safely sharing realistic and functionally intact data with stakeholders, and reduces time, storage, and infrastructure costs by extracting and sharing only the necessary data.

The Challenges

  1. How to Locate Sensitive Data in the Short and Long Term? Finding sensitive data across a sprawling landscape of numerous applications, databases, and environments can be a daunting task. Organizations must implement robust discovery tools and methodologies to ensure all sensitive data is accurately identified, both initially and on an ongoing basis.
  2. How to Accurately Protect Sensitive Data? Sensitive data comes in various shapes and forms, such as VISA, AMEX, Discover, MasterCard, SSN, and more. Protecting such diverse data types necessitates sophisticated masking and encryption techniques tailored to each specific data format to ensure comprehensive data security.
  3. Is the Protected Data Usable? For developers, testers, applications, and more, it's crucial that protected data remains usable. Ensuring that data masking and subsetting solutions preserve the functionality and realism of the data is key to maintaining productivity and the integrity of testing and development processes.
  4. Will the Applications Continue to Work? Developing and maintaining data protection solutions in an ever-changing IT landscape, encompassing both on-premises and cloud environments, requires continuous adaptation. Ensuring that applications continue to work seamlessly despite these changes is vital for ongoing operations and development.
  5. Limit Sensitive Data Proliferation: Sensitive data often proliferates across environments such as Test, Dev, QA, Training, Research, Outsourced, Cloud, and more. Implementing strict controls and monitoring to limit the spread of sensitive data is essential for reducing exposure and maintaining data security.
  6. Save Storage Costs: Managing data storage efficiently in non-production environments, such as Test/Dev and large data warehouses, is crucial for controlling costs. By extracting and sharing only relevant data, organizations can significantly reduce their storage expenses while maintaining necessary access to data.
  7. Share What is Necessary: When sharing data with subscribers, auditors, courts, partners, testers, developers, and more, it is essential to ensure that only necessary information is provided. This minimizes the risk of unnecessary data exposure while complying with regulatory and operational requirements.
  8. Right to be Forgotten/Erasure (GDPR in Europe): Compliance with regulations such as GDPR in Europe mandates the right to be forgotten, requiring organizations to effectively erase personal data upon request. Implementing efficient data management processes to handle such requests is critical for regulatory compliance and maintaining customer trust.

Benefits

  • Sensitive Data Protection: Ensure the confidentiality of sensitive information such as Personally Identifiable Information (PII) or financial data by masking it before it's used in non-production environments. This maximizes the business value of data by safeguarding against unauthorized access.

  • Compliance Boundary Management: Ensure enterprise-wide compliance with regulations and laws for managing sensitive data by allowing access only to non-sensitive data in testing and development environments, thereby reducing the risk of compliance violations.
  • Cost-Effective Test and Development Environments: Reduce storage costs in test and development environments by subsetting data. By only including relevant subsets of data necessary for testing and development purposes, storage requirements are minimized without sacrificing the quality of testing.
  • Automated Sensitive Data Discovery: Automatically identify sensitive data and its relationships within databases. This feature streamlines compliance efforts by providing a clear understanding of where sensitive data resides and how it's connected within the database.

  • Comprehensive Masking and Subsetting Capabilities: Access a wide range of masking formats, transformations, subsetting techniques, and application templates. This empowers users to tailor data masking and subsetting strategies to their specific requirements, ensuring comprehensive data protection and optimization.
  • Flexible Deployment Options: Mask and subset data either directly in the database or from extracted files. This flexibility accommodates diverse data environments and preferences, allowing users to choose the method that best suits their infrastructure and workflows.
  • Support for Oracle and Non-Oracle Databases: Seamlessly mask and subset data in both Oracle and non-Oracle databases. This ensures consistent data protection and optimization across heterogeneous database environments.
  • Oracle Cloud Integration: Mask and subset Oracle Databases hosted on the Oracle Cloud. This enables organizations to leverage cloud-based database services while ensuring data security and compliance.
  • Data Integrity Preservation: Maintain data integrity throughout the masking and subsetting process by ensuring a consistent format and retaining the relationships between production and masked data. This feature provides greater confidence in any conclusions drawn from the masked data and its applicability to the original production data.
  • Integration with Oracle Products: Integrate with select Oracle testing, security, and integration products. This facilitates seamless interoperability with existing Oracle ecosystems, enhancing efficiency and streamlining data management processes.

Use Cases

  1. Automated data discovery for Dictionary and Non-Dictionary Based Referential Relationships: It allows organizations to efficiently identify and manage sensitive information within their databases using Oracle's predefined and user-defined sensitive data types. It automatically scans and identifies sensitive data, enabling users to review and classify it accordingly. For instance, in a business database, Data Discovery can identify explicit dictionary-based relationships between tables through primary and foreign keys, ensuring that data masking preserves these relationships for accurate testing. Additionally, it detects non-dictionary, application-level relationships by analyzing data patterns and application behavior. This ensures the preservation of both explicit and implicit data relationships during masking, providing a robust and realistic dataset for non-production environments while safeguarding sensitive information and ensuring compliance with data protection regulations.
  2. Export-Edit-Import data models for manual sensitive data management: It facilitates the seamless import and export of Application Data Models (ADM) across Enterprise Manager environments by exporting an ADM, which can subsequently be imported into a new repository. You can edit an exported ADM XML file prior to import. When exporting an ADM for subsequent import, it is best to have one that uses most or all of the features—applications, tables, table types, referential relationships, sensitive columns. This way, if you are going to edit the exported file prior to import, it is clear which XML tags are required and where they belong in the file. Given the complexity and size of the ADM, users can efficiently manage sensitive columns by manually editing them in a text editor, bypassing the Enterprise Manager interface. This approach ensures consistency in data masking practices across regions while providing flexibility in managing intricate data structures efficiently.
  3. Limit sensitive data proliferation: Growing security threats have increased the need to limit the exposure of sensitive information. At the same time, copying production data for non-production purposes proliferates sensitive data, expands the security and compliance boundary, and increases the likelihood of data breaches. Oracle Data Masking helps reduce this risk by irreversibly replacing the original sensitive data with fictitious yet realistic data so that production data can be shared safely with non-production use cases.
  4. Share what is necessary and protect sensitive data: Often, companies need to share a production dataset with internal and external parties for various reasons. For example, a Cloud application provider may need to extract and share information specific to their individual subscribers on demand. Another example is a company serving a court order that must extract and share a subset of production data with the court. In many cases, it is efficient to extract and share a portion or subset of information instead of sharing the entire production dataset. Using the subsetting feature, customers can extract only the necessary data and then apply masking to sensitive information, ensuring data privacy while fulfilling sharing requirements.

Major Components of Oracle Data Masking and Subsetting

Oracle Data Masking and Subsetting consists of the following major components.

Architecture

Oracle Data Masking and Subsetting is part of the Oracle Enterprise Manager infrastructure. Organizations using Oracle Enterprise Manager do not need to download and install the Oracle Data Masking and Subsetting pack separately. Oracle Enterprise Manager provides a unified, browser-based user interface for administration. All Data Masking and Subsetting objects are centrally located in the Oracle Enterprise Manager repository, which facilitates centralized creation and administration of Application Data Models and Data Masking and Subsetting rules and definitions. In addition to its intuitive Cloud Control graphical user interface (GUI), Oracle Enterprise Manager also provides a command line interface (EMCLI) to automate select Data Masking and Subsetting tasks.

Figure 1-2 Architecture of the Oracle Data Masking and Subsetting

For more details on Oracle Enterprise Manager Architecture, please refer to the Oracle Enterprise Manager Introduction Guide.

Methodology

Oracle Data Masking and Subsetting uses the following methodology to secure non-production databases and replace sensitive data with fictitious but relevant data that meets compliance requirements:

  • Creating an Application Data Model — Discover sensitive data and data relationships, and then create or assign an Application Data Model.

  • Selecting Masking Formats and Criteria — Create a data masking definition with masking format types and templates based on the sensitive data that is discovered.

  • Previewing and Validating — Secure sensitive data by previewing the masking algorithm results and the subset reduction results.

  • Executing Masking Transformations — Execute In-Database or In-Export masking and subsetting transformations and validate the masked data.

The following figure describes the methodology used in Oracle Data Masking and Subsetting.

Application Data Modeling

The Application Data Modeling (ADM) module of the Oracle Data Masking and Subsetting pack simplifies the effort of sensitive data discovery through automated discovery procedures and sensitive column types. These procedures not only discover columns holding sensitive information, but also discover the parent-child relationships between these columns that are defined in the database.

A sensitive column type defines the characteristics of a sensitive data element, such as national insurance numbers, using a combination of data patterns in column names, column data, and column comments. Automated discovery procedures leverage sensitive column types to sample data in database table columns while scanning the database for sensitive information.

Figure 1-4 ADM Workflow

This flowchart provides an overview of how to create an Application Data Model with sensitive columns. It does not include additional features such as Editing Referential Relationships, Creating and Managing Custom Sensitive Types, and Importing/Exporting an ADM.

Application Data Modeling provides several out-of-the-box sensitive column types, such as credit card numbers, social security numbers, and phone numbers. Custom sensitive column types can be easily created using regular expressions.
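
As a rough illustration of what such a regular expression does, the following query counts values that match a simplified UK National Insurance Number pattern. In the product, the pattern is entered in the sensitive column type definition rather than run as ad hoc SQL; the EMPLOYEES table, the NATIONAL_ID column, and the pattern itself are illustrative assumptions only.

  -- Illustrative only: EMPLOYEES and NATIONAL_ID are hypothetical, and the
  -- pattern is a simplified sketch of a UK National Insurance Number.
  SELECT COUNT(*) AS candidate_rows
  FROM   employees
  WHERE  REGEXP_LIKE(national_id, '^[A-Z]{2}[0-9]{6}[A-D]$');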

Figure 1-5 Editing an Application Data Model

Data Masking Format Library

While Application Data Modeling automates the task of sensitive data discovery and modeling, a comprehensive library of masking formats and transformations simplifies the effort of defining masking criteria for the sensitive columns that are discovered.

One of the key aspects of the data masking exercise is to replace the sensitive information with fake data without breaking the semantics and structure of the data element. The masked data must be realistic and pass format-specific checks such as the Luhn check. For example, a masked credit card number must not only be a valid credit card number, but also a valid Visa, MasterCard, American Express, or Discover card number. Failing to maintain this data integrity will affect the development or test processes and may break the corresponding application.
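
For reference, the Luhn check mentioned above can be expressed as a small PL/SQL function. This is a minimal sketch of the standard algorithm, not product code, and the function name is an arbitrary choice.

  -- A minimal sketch of the Luhn check a masked card number must still satisfy.
  CREATE OR REPLACE FUNCTION passes_luhn (p_number IN VARCHAR2)
    RETURN NUMBER  -- 1 if the digit string passes the Luhn check, else 0
  IS
    l_digits VARCHAR2(64) := REGEXP_REPLACE(p_number, '[^0-9]', '');
    l_sum    NUMBER := 0;
    l_digit  NUMBER;
  BEGIN
    FOR i IN 1 .. LENGTH(l_digits) LOOP
      -- Walk from the rightmost digit; double every second digit.
      l_digit := TO_NUMBER(SUBSTR(l_digits, -i, 1));
      IF MOD(i, 2) = 0 THEN
        l_digit := l_digit * 2;
        IF l_digit > 9 THEN
          l_digit := l_digit - 9;
        END IF;
      END IF;
      l_sum := l_sum + l_digit;
    END LOOP;
    RETURN CASE WHEN MOD(l_sum, 10) = 0 THEN 1 ELSE 0 END;
  END;
  /

For example, SELECT passes_luhn('4111111111111111') FROM dual; returns 1 for this widely published test card number.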

Oracle Data Masking and Subsetting is packed with a comprehensive library of masking formats that covers most Personally Identifiable Information (PII) and Payment Card Information (PCI) data. Whether the requirement is to mask different types of credit card numbers, national identifiers of different countries, or bank account numbers, the masking format library meets enterprise needs.

In addition to the many out-of-the-box masking formats, the product provides various built-in tools to easily create custom masking formats. There are simple tools that generate fixed or random numbers, strings, and dates. There are tools that facilitate substitution from lookup tables. There are also tools, such as SQL Expression and User Defined Function, that accommodate complex user-defined masking logic.
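
As a rough sketch of the user-defined function approach, the function below masks an e-mail address while keeping its domain. The documentation for data masking user-defined functions specifies a three-argument signature (row identifier, column name, original value) that returns the masked value; verify the exact requirements against your Enterprise Manager release. The function body and all names here are illustrative assumptions.

  -- A minimal sketch of a user-defined masking function; names and logic are
  -- illustrative. The three parameters (row identifier, column name, original
  -- value) follow the documented user-defined function signature; confirm it
  -- against your Enterprise Manager release before use.
  CREATE OR REPLACE FUNCTION mask_email (
    p_rowid          IN VARCHAR2,
    p_column_name    IN VARCHAR2,
    p_original_value IN VARCHAR2
  ) RETURN VARCHAR2
  IS
  BEGIN
    -- Replace the local part with a hash-derived token and keep the domain so
    -- the masked value still looks like an e-mail address.
    RETURN 'user'
           || TO_CHAR(DBMS_UTILITY.GET_HASH_VALUE(p_original_value, 1, 1000000))
           || SUBSTR(p_original_value, INSTR(p_original_value, '@'));
  END;
  /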

Figure 1-6 Data Masking Format Library

Data Masking Transformations

The data masking format library and application templates accelerate the task of defining masking rules and preserving the integrity and structure of data elements.

Depending on the business use cases, organizations may have different requirements when mapping masking formats to sensitive columns. For example, one requirement in a large, distributed database environment is to generate consistent masked output for a given input across multiple databases. Oracle Data Masking and Subsetting provides sophisticated masking transformations that fit into the broader business context. If masking formats are considered the building blocks of a data masking definition, then masking transformations align these masking formats according to the different business requirements.

Figure 1-7 Data Masking Workflow

This flowchart highlights the core steps to perform data masking. Please note that additional features, such as Customized Masking Formats and Import Masking Template, are not shown in the flowchart.

Conditional Masking

Conditional transformation provides the ability to apply different masking formats to the same column depending on a condition. For example, consider masking a column containing unique person identifiers: identifiers that belong to the USA can be masked using the Social Security Number format, and those that belong to the UK can be masked using the National Insurance Number format.
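
A rough way to picture this is a per-row choice of format. The query below is only a conceptual preview; the PERSONS table and COUNTRY_CODE column are hypothetical, and in the product the conditions are defined in the masking definition rather than written as SQL.

  -- Conceptual preview only: which masking format each row would receive.
  SELECT person_id,
         CASE
           WHEN country_code = 'US' THEN 'Social Security Number format'
           WHEN country_code = 'GB' THEN 'National Insurance Number format'
           ELSE 'Default format'
         END AS format_applied
  FROM   persons;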

Compound Masking

Compound transformation (also known as the grouping option) masks related columns as a group, ensuring that the masked data across the related columns retains the same relationship. For example, consider masking address fields such as city, state, and postal code: these values must remain consistent with one another after masking.

Deterministic/Consistent Masking

Deterministic transformation generates consistent output for a given input across databases. This transformation helps maintain data integrity across multiple applications and preserve system integrity in a single sign-on environment. For example, consider three applications: a human capital management application, a customer relationship management application, and a sales data warehouse. These applications may share key fields, such as EMPLOYEE ID, that must be masked consistently across all of them. The Substitute and Encrypt masking formats provide deterministic masking transformations.
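
The following query sketches why determinism matters: the same input always maps to the same substitute, so joins on the masked key still line up across databases. The EMPLOYEES and SUBSTITUTE_NAMES tables, and the use of ORA_HASH, are illustrative assumptions; the product's Substitute and Encrypt formats implement this behavior internally.

  -- Conceptual sketch: a hash-driven lookup gives every EMPLOYEE_ID the same
  -- masked name in every database that uses the same lookup table.
  SELECT e.employee_id,
         s.fake_name AS masked_name
  FROM   employees e
  JOIN   substitute_names s
    ON   s.id = MOD(ORA_HASH(e.employee_id), 1000) + 1;  -- assumes IDs 1..1000 in the lookup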

Shuffle

The shuffle transformation randomly reorders the values within a column. This transformation is helpful for breaking the one-to-one mapping between sensitive data elements. For example, a column containing personal health records can be shuffled while masking health care information.
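
To make the effect concrete, the statement below permutes the values of one column among the rows of a hypothetical PATIENTS table. It is a conceptual sketch of what a shuffle does, not the product's implementation.

  -- Conceptual sketch only: randomly permute DIAGNOSIS values across rows so
  -- no value stays paired with its original row. PATIENTS and DIAGNOSIS are
  -- hypothetical names.
  MERGE INTO patients p
  USING (
    SELECT src.rid, shuf.diagnosis AS new_diagnosis
    FROM  (SELECT ROWID AS rid,
                  ROW_NUMBER() OVER (ORDER BY ROWID) AS rn
           FROM   patients) src
    JOIN  (SELECT diagnosis,
                  ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS rn
           FROM   patients) shuf
      ON  src.rn = shuf.rn
  ) m
  ON (p.ROWID = m.rid)
  WHEN MATCHED THEN UPDATE SET p.diagnosis = m.new_diagnosis;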

Key Based Reversible Masking (Encrypt Format)

This transformation encrypts and decrypts the original data using a secure key string, preserving the input data format during encryption and decryption. It uses the industry-standard 3DES algorithm. This transformation is helpful when a business needs to mask data and send it to a third party for analysis, reporting, or other business processing. After the processed data is received back from the third party, the original data can be recovered using the same key string that was used to encrypt it.
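
As a minimal sketch of key-based reversibility, the PL/SQL block below encrypts and then decrypts a value with 3DES using DBMS_CRYPTO. Unlike the product's Encrypt format, this sketch does not preserve the input format; it only illustrates that the same key recovers the original value. It assumes EXECUTE privilege on DBMS_CRYPTO, and the key string is a placeholder.

  -- Illustrative only: round-trip 3DES encryption with a caller-supplied key.
  DECLARE
    l_mods PLS_INTEGER := DBMS_CRYPTO.ENCRYPT_3DES
                        + DBMS_CRYPTO.CHAIN_CBC
                        + DBMS_CRYPTO.PAD_PKCS5;
    l_key  RAW(24)  := UTL_RAW.CAST_TO_RAW('24-byte-key-string-here!');  -- placeholder key
    l_src  RAW(64)  := UTL_RAW.CAST_TO_RAW('4111111111111111');
    l_enc  RAW(128);
    l_dec  RAW(128);
  BEGIN
    l_enc := DBMS_CRYPTO.ENCRYPT(l_src, l_mods, l_key);
    l_dec := DBMS_CRYPTO.DECRYPT(l_enc, l_mods, l_key);
    DBMS_OUTPUT.PUT_LINE(UTL_RAW.CAST_TO_VARCHAR2(l_dec));  -- prints the original value
  END;
  /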

Format Preserving Randomization (Auto Mask Format)

This transformation randomizes the data, preserving the input length, position of the characters and numbers, case of the character (upper or lower), and special characters in the input.

Data Subsetting

Subsetting modern, enterprise-class applications is a challenging task. Oracle Data Masking and Subsetting simplifies this effort through its easy-to-define, goal-based and condition-based subsetting techniques. Data can be subsetted based on different goals; a goal can be a relative table size, for example, extracting a 1% subset of a table containing one million rows. Data can also be subsetted based on different conditions. A condition can be based on time, for example, discarding all user records created prior to a particular year. A condition can be based on region, for example, extracting Asia Pacific information for a new application development project. Conditions are specified as SQL WHERE clauses, which also support bind variables.
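
The query below previews what a condition-based rule of that kind would keep. Only the WHERE-clause text (including the :start_year bind variable) is entered in the subsetting rule; the ORDERS table and its columns are hypothetical and are wrapped in a query here purely for readability.

  -- Illustrative only: preview the rows a region- and time-based condition keeps.
  SELECT COUNT(*) AS rows_kept
  FROM   orders
  WHERE  region_code = 'APAC'
  AND    order_date >= TO_DATE(:start_year || '-01-01', 'YYYY-MM-DD');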

Figure 1-8 Subsetting based on a condition

Data Subsetting generates a real-time dynamic view of the application schema with before and after storage size and the percentage of data within the tables being subsetted, including the dependent tables. Administrators can use this view to validate the subsetting criteria even before subsetting the data.

Figure 1-9 Data Subsetting Workflow

This flowchart covers the steps to perform data subsetting. Please note that several additional associated features, such as Importing/Exporting Subset Templates, are not highlighted here.

Application Templates

Oracle Data Masking Application Templates deliver pre-identified sensitive columns, their relationships, and industry-standard best-practice masking techniques out-of-the-box for packaged applications such as Oracle E-Business Suite and Oracle Fusion Applications. Use the Self Update feature to get the latest masking and subsetting templates available from Oracle.

Deployment Options

Oracle Data Masking and Subsetting provides the following modes for masking and subsetting data:

  • In-Database mode directly masks and subsets data within a non-production database, with minimal or zero impact on production environments. Because In-Database mode permanently changes the data in the database, it is recommended for non-production environments such as staging, test, or development databases, rather than production databases.

  • In-Export mode masks and subsets data in near real-time while extracting it from a database. The masked and subsetted data is written to Data Pump export files, which can then be imported into test, development, or QA databases. In general, In-Export mode is used against production databases. The In-Export method of masking and subsetting is a unique offering from Oracle that sanitizes sensitive information within the product perimeter.

  • Heterogeneous mode masks and subsets data in non-Oracle databases. Target production data is first copied from the non-Oracle environment into an Oracle database using an Oracle Database Gateway, then masked and subsetted within the Oracle database, and finally copied back to the non-Oracle environment (a minimal SQL sketch of this flow follows this list). The approach is very similar to the steps used in various Extract, Transform, and Load (ETL) tools, except that the Oracle database is the intermediary that transforms the data. Oracle Database Gateways enable Oracle Data Masking and Subsetting to operate on data from Oracle MySQL, Microsoft SQL Server, Sybase SQL Server, IBM DB2 (UDB, 400, z/OS), IBM Informix, and Teradata.
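
The sketch below outlines the heterogeneous flow under some assumptions: a database link named SQLSERVER_LINK created over an Oracle Database Gateway for SQL Server, a gateway TNS alias, and a staging table in the Oracle database. All object names and credentials are placeholders, and the masking and subsetting in step 2 are performed with Oracle Data Masking and Subsetting rather than shown as SQL.

  -- Illustrative only: names, credentials, and the TNS alias are placeholders.
  CREATE DATABASE LINK sqlserver_link
    CONNECT TO "gateway_user" IDENTIFIED BY "gateway_password"
    USING 'dg4msql_alias';

  -- 1. Copy the production data from the non-Oracle source into an Oracle staging table.
  CREATE TABLE stg_customers AS
    SELECT * FROM "CUSTOMERS"@sqlserver_link;

  -- 2. Mask and subset STG_CUSTOMERS using Oracle Data Masking and Subsetting (In-Database mode).

  -- 3. Copy the sanitized data back to the non-Oracle target environment.
  INSERT INTO "CUSTOMERS_TEST"@sqlserver_link
    SELECT * FROM stg_customers;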

Note:

For information on licensing, refer to the Oracle Database Licensing Guide.

Workflow

The following diagram explains the Oracle Data Masking and Subsetting workflow.

Figure 1-10 Oracle Data Masking and Subsetting Workflow

This flowchart explains the Oracle Data Masking and Subsetting workflow. The three important steps of the Data Masking and Subsetting workflow are: a) creating an Application Data Model, b) creating a data masking definition, and c) creating a data subsetting definition.

The following steps describe the Oracle Data Masking and Subsetting Workflow:

  1. Create an Application Data Model — To begin using Oracle Data Masking and Subsetting, you must create an Application Data Model (ADM). An ADM captures application metadata and referential relationships, and discovers sensitive data in the source database.

  2. Create a Data Masking Definition — After an ADM is created, the next step is to create a data masking definition. A masking definition includes information about the table columns and the masking format for each of these columns. The mask can be created by writing the masked data to an export file.

  3. Create a Data Subsetting Definition — Create a data subsetting definition to define the table rules and rule parameters. The subset can be created by writing the subset data to an export file.