Synthetic Data Generation

Generate synthetic data using random generators, algorithms, statistical models, and Large Language Models (LLMs) to simulate real data for developing and testing solutions effectively.

Note:

Support for Synthetic Data Generation is available in Oracle Database 19c starting with version 19.29, and in Oracle AI Database 26ai starting with version 23.26.

Synthetic data can be a powerful tool when developing and testing solutions, especially when actual data doesn’t yet exist or isn’t allowed to be used. Synthetic, or artificially generated, data can have many of the characteristics of real data. Synthetic data is typically created using random generators, algorithms, or statistical models to simulate the characteristics and distributions of real data. However, this can be complex to produce or rely on tools with features of varying sophistication. With the availability of Large Language Models (LLMs), more relevant and schema-specific data may be generated that considers characteristics expressed in natural language.

Topics

Benefits of Synthetic Data Generation

Synthetic data generation enables populating database metadata clones, supporting development, testing, and machine learning projects without using sensitive data from original tables.

Synthetic Data Generation offers the following benefits:
  • Populating metadata clones with synthetic data: A metadata clone replicates the structure of a database or schema without including actual data. Select AI allows synthetic data generation to populate these clones, protecting sensitive data while enabling development, testing, and creating templates. This approach supports performance and scalability testing.

  • Starting new projects: When starting a new project, actual data may not be available. Synthetic data provides realistic samples to help demonstrate concepts and gain support for project proposals.

  • Validating user experience: Synthetic data aids in testing user interfaces by providing diverse data sets to uncover design flaws, performance, and scalability issues.

  • Supporting AI and machine learning projects: Synthetic data is useful for training AI and machine learning models when real data is unavailable or restricted. LLMs can generate data with specific patterns to facilitate model training and scoring.

Generate Synthetic Data

Use DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA function to generate synthetic data and query the data using Select AI actions.

Monitor and Troubleshoot Synthetic Data Generation

When generating large amounts of data across many tables, Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in parallel. The status of each chunk is tracked in the SYNTHETIC_DATA$<operation_id>_STATUS table.

Synthetic data generation operations are logged in the tables DBA_LOAD_OPERATIONS and USER_LOAD_OPERATIONS. Use these tables to monitor the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA operation. See Track DBMS_CLOUD Load Operations for more details.

After running synthetic data generation in a given session, you can get the latest <operation_id> from USER_LOAD_OPERATION using the following:
SELECT max(id) FROM user_load_operations;

To view the synthetic data operations running in a different session, use the DBA_LOAD_OPERATIONS view.

View Status of Synthetic Data Generation

The status table of synthetic data generation operation shows the progress of each table and its corresponding chunk. The STATUS_TABLE column in USER_LOAD_OPERATIONS or DBA_LOAD_OPERATIONS shows the status table name. The table name is SYNTHETIC_DATA$<operation_id>_STATUS and it has following columns:

Name Datatype Description

ID

NUMBER

Unique identifier of the record.

NAME

VARCHAR2

Qualified name of the table, such as "ADB_USER"."EMPLOYEES"

BYTES

NUMBER

Desired number of records for this data generation task

ROWS_LOADED

NUMBER

Actual number of records generated.

CHECKSUM

VARCHAR2

Starting value for the primary key during this data generation task.

LAST_MODIFIED

TIMESTAMP WITH TIME ZONE

Timestamp indicating when the record was last modified.

STATUS

VARCHAR2

Status of the data generation task. The valid values are:
  • COMPLETED
  • FAILED
  • PENDING
  • SKIPPED

ERROR_CODE

NUMBER

Error code, if the data generation task fails.

ERROR_MESSAGE

VARCHAR2

Error message provided if the task fails.

END_TIME

TIMESTAMP WITH TIME ZONE

Timestamp marking the end of the data generation task.
Example: Check the Number of Records Generated for Each Table

To check the number of records generated for each table, issue the following:

SELECT name, SUM(rows_loaded) FROM synthetic_data$<operation_id>_status group by name;

Query ROWS_LOADED to confirm how many number of rows are loaded for each chunk, and SUM(ROWS_LOADED) for rows for each table.

BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name => 'GENAI',
        object_list => '[{"owner": "ADB_USER", "name": "DIRECTOR","record_count":150},
                         {"owner": "ADB_USER", "name": "MOVIE_ACTOR","record_count":300},
                         {"owner": "ADB_USER", "name": "CLASSES", "user_prompt":"all in fall semester","record_count":5},
                         {"owner": "ADB_USER", "name": "ACTOR","record_count":220},
                         {"owner": "ADB_USER", "name": "MOVIE","record_count":50}]'
    );
END;
/
 
 
-- Check loaded rows for each chunk
SQL> SELECT name, rows_loaded FROM synthetic_data$141_status order by name;
NAME                           ROWS_LOADED
------------------------------------------
"ADB_USER"."ACTOR"                        188
"ADB_USER"."ACTOR"                         32
"ADB_USER"."CLASSES"                        5
"ADB_USER"."DIRECTOR"                     150
"ADB_USER"."MOVIE"                         50
"ADB_USER"."MOVIE_ACTOR"                   38
"ADB_USER"."MOVIE_ACTOR"                  114
"ADB_USER"."MOVIE_ACTOR"                  148
 
 
 
-- Check loaded rows for each table
SQL> SELECT name, SUM(rows_loaded) FROM synthetic_data$141_status group by name;
 
NAME                      SUM(ROWS_LOADED)
------------------------------------------
"ADB_USER"."DIRECTOR"                     150
"ADB_USER"."MOVIE_ACTOR"                  300
"ADB_USER"."CLASSES"                        5
"ADB_USER"."ACTOR"                        220
"ADB_USER"."MOVIE"                         50

Example: Generate Synthetic Data

This example explores how you can generate synthetic data mimicking the characteristics and distribution of real data.

Note:

Support for Synthetic Data Generation is available in Oracle Database 19c starting with version 19.29, and in Oracle Database 26ai starting with version 23.26.

The following example shows how to create a few tables in your schema, use OCI Generative AI as your AI provider to create an AI profile, synthesize data into those tables using the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA function, and query or generate responses to natural language prompts with Select AI.

--Create tables or use cloned tables

CREATE TABLE ADB_USER.Director (
    director_id     INT PRIMARY KEY,
    name            VARCHAR(100)
);
CREATE TABLE ADB_USER.Movie (
    movie_id        INT PRIMARY KEY,
    title           VARCHAR(100),
    release_date    DATE,
    genre           VARCHAR(50),
    director_id     INT,
    FOREIGN KEY (director_id) REFERENCES ADB_USER.Director(director_id)
);
CREATE TABLE ADB_USER.Actor (
    actor_id        INT PRIMARY KEY,
    name            VARCHAR(100)
);
CREATE TABLE ADB_USER.Movie_Actor (
    movie_id        INT,
    actor_id        INT,
    PRIMARY KEY (movie_id, actor_id),
    FOREIGN KEY (movie_id) REFERENCES ADB_USER.Movie(movie_id),
    FOREIGN KEY (actor_id) REFERENCES ADB_USER.Actor(actor_id)
);

-- Create the GenAI credential
BEGIN                                                                       
  DBMS_CLOUD.create_credential(                                             
    credential_name => 'GENAI_CRED',                                        
    user_ocid       => 'ocid1.user.oc1....',
    tenancy_ocid    => 'ocid1.tenancy.oc1....',
    private_key     => 'vZ6cO...',
    fingerprint     => '86:7d:...'    
  );                                                                        
END;                                                                       
/
 
-- Create a profile
BEGIN                                                                      
  DBMS_CLOUD_AI.CREATE_PROFILE(                                            
      profile_name =>'GENAI',                                                           
      attributes  =>'{"provider": "oci",                                                                 
        "credential_name": "GENAI_CRED",                                   
        "object_list": [{"owner": "ADB_USER", 
		"oci_compartment_id": "ocid1.compartment.oc1...."}]          
       }');                                                                
END;                                                                       
/
 
 
EXEC DBMS_CLOUD_AI.set_profile('GENAI');

-- Run the API for single table
BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name => 'GENAI',
        object_name  => 'Director',
        owner_name   => 'ADB_USER',
        record_count => 5
    );
END;
/
PL/SQL procedure successfully completed.
 
 
-- Query the table to see results
SQL> SELECT * FROM ADB_USER.Director;
 
DIRECTOR_ID NAME
----------- ----------------------------------------------------------------------------------------------------
          1 John Smith
          2 Emily Chen
          3 Michael Brown
          4 Sarah Taylor
          5 David Lee
 
 
-- Or ask select ai to show the results
SQL> select ai how many directors are there;
 
NUMBER_OF_DIRECTORS
-------------------
                  5
Example: Generate Synthetic Data for Multiple Tables

After you create and set your AI provider profile, use the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA to generate data for multiple tables. You can query or use Select AI to respond to the natural language prompts.

BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name => 'GENAI',
        object_list => '[{"owner": "ADB_USER", "name": "Director","record_count":5},
                         {"owner": "ADB_USER", "name": "Movie_Actor","record_count":5},
                         {"owner": "ADB_USER", "name": "Actor","record_count":10},
                         {"owner": "ADB_USER", "name": "Movie","record_count":5,"user_prompt":"all movies released in 2009"}]'
    );
END;
/
PL/SQL procedure successfully completed.
 
 
-- Query the table to see results
SQL> select * from ADB_USER.Movie;

 MOVIE_ID TITLE                                                     RELEASE_D                            GENRE                                 DIRECTOR_ID	
---------- -------------------------------------------------------- --------- --------------------------------------------------------------- -----------	
         1 The Dark Knight                                           15-JUL-09                              Action                              8	
         2 Inglourious Basterds                                      21-AUG-09                              War                                 3	
         3 Up in the Air                                             04-SEP-09                              Drama                               6	
         4 The Hangover                                              05-JUN-09                              Comedy                              1	
         5 District 9                                                14-AUG-09                              Science Fiction                     10	
	

 
-- Or ask select ai to show the results
SQL> select ai how many actors are there;
 
Number of Actors
----------------
              10
Example: Guide Synthetic Data Generation with Sample Rows

To guide AI service in generating synthetic data, you can randomly select existing records from a table. For instance, by adding {"sample_rows": 5} to the params argument, you can send 5 sample rows from a table to the AI provider. This example generates 10 additional rows based on the sample rows from the Transactions table.

BEGIN
  DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
    profile_name => 'GENAI',
    object_name  => 'Transactions',
    owner_name   => 'ADB_USER',
    record_count => 10,
    params       => '{"sample_rows":5}'
  );
END;
/
Example: Customize Synthetic Data Generation with User Prompts

The user_prompt argument enables you to specify additional rules or requirements for data generation. This can be applied to a single table or as part of the object_list argument for multiple tables. For example, in the following calls to DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA, the prompt instructs the AI to generate synthetic data on movies released in 2009.

-- Definition for the Movie table CREATE TABLE Movie 

CREATE TABLE Movie (
    movie_id        INT PRIMARY KEY,
    title           VARCHAR(100),
    release_date    DATE,
    genre           VARCHAR(50),
    director_id     INT,
    FOREIGN KEY (director_id) REFERENCES Director(director_id)
);
 
 
 
BEGIN
  DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
    profile_name      => 'GENAI',
    object_name       => 'Movie',
    owner_name        => 'ADB_USER',
    record_count      => 10,
    user_prompt       => 'all movies are released in 2009',
    params            => '{"sample_rows":5}'
  );
END;
/
 
BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name => 'GENAI',
        object_list => '[{"owner": "ADB_USER", "name": "Director","record_count":5},
                         {"owner": "ADB_USER", "name": "Movie_Actor","record_count":5},
                         {"owner": "ADB_USER", "name": "Actor","record_count":10},
                         {"owner": "ADB_USER", "name": "Movie","record_count":5,"user_prompt":"all movies are released in 2009"}]'
    );
END;
/
Example: Improve Synthetic Data Quality by Using Table Statistics

If a table has column statistics or is cloned from a database that includes metadata, Select AI can use these statistics to generate data that closely resembles or is consistent with the original data.

For NUMBER columns, the high and low values from the statistics guide the value range. For instance, if the SALARY column in the original EMPLOYEES table ranges from 1000 to 10000, the synthetic data for this column will also fall within this range.

For columns with distinct values, such as a STATE column with values CA, WA, and TX, the synthetic data will use these specific values. You can manage this feature using the {"table_statistics": true/false} parameter. By default, the table statistics are enabled.

BEGIN
  DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
    profile_name      => 'GENAI',
    object_name       => 'Movie',
    owner_name        => 'ADB_USER',
    record_count      => 10,
    user_prompt => 'all movies released in 2009',
    params            => '{"sample_rows":5,"table_statistics":true}'
  );
END;
/
Example: Use Column Comments to Guide Data Generation

If column comments exist, Select AI automatically includes them to provide additional information for the LLM during data generation. For example, a comment on the Status column in a Transaction table might list allowed values such as successful, failed, pending, canceled, and need manual check. You can also add comments to further explain the column, giving AI services more precise instructions or hints for generating accurate data. By default, comments are disabled. See Optional Parameters for more details.

-- Use comment on column
COMMENT ON COLUMN Transaction.status IS 'the value for state should either be ''successful'', ''failed'', ''pending'' or ''canceled''';
/
 
BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name  => 'GENAI',
        object_name   => 'employees',
        owner_name    => 'ADB_USER',
        record_count  => 10
        params        => '{"comments":true}'
 
    );
END;
/
Example: Set Unique Values in Synthetic Data Generation

When generating large amounts of synthetic data with LLMs, duplicate values are likely to occur. To prevent this, set up a unique constraint on the relevant column. This ensures that Select AI ignores rows with duplicate values in the LLM response. Additionally, to restrict values for certain columns, you can use the user_prompt or add comments to specify the allowed values, such as limiting a STATE column to CA, WA, and TX.

-- Use 'user_prompt'
BEGIN
    DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
        profile_name  => 'GENAI',
        object_name   => 'employees',
        owner_name    => 'ADB_USER',
        user_prompt   => 'the value for state should either be CA, WA, or TX',
        record_count  => 10
    );
END;
/
 
 
-- Use comment on column
COMMENT ON COLUMN EMPLOYEES.state IS 'the value for state should either be CA, WA, or TX'
/
Example: Enhance Synthetic Data Generation by Parallel Processing

To reduce runtime, Select AI splits synthetic data generation tasks into smaller chunks for tables without primary keys or with numeric primary keys. These tasks run in parallel, interacting with the AI provider to generate data more efficiently. The Degree of Parallelism (DOP) in your database, influenced by your Autonomous AI Database on Dedicated Exadata Infrastructure service level and ECPU or OCPU settings, determines the number of records each chunk processes. Running tasks in parallel generally improves performance, especially when generating large amounts of data across many tables. To manage the parallel processing of synthetic data generation, set priority as an optional parameter. See Optional Parameters.

Example: Enable or Disable Data Access

This example illustrates how administrators can control data access and prevent Select AI from sending actual schema tables to the LLM.

Disabling Data Access

To restrict access to schema tables, log in as an administrator and run the following procedure.

EXEC DBMS_CLOUD_AI.DISABLE_DATA_ACCESS;
 
PL/SQL procedure successfully completed.

Disabling data access limits Select AI's narrate action and Synthetic Data Generation. The narrate action and synthetic data generation raise an error.

Log in as database user, create and configure your AI profile. Review Use DBMS_CLOUD_AI to Configure AI Profiles to configure your AI profile.

BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
          profile_name =>'DATA_ACCESS',
          attributes   =>'{"provider": "openai",
            "credential_name": "OPENAI_CRED",
            "object_list": [{"owner":"SH"}]
          }');
END;
/

EXEC DBMS_CLOUD_AI.SET_PROFILE('DATA_ACCESS');
 

select ai how many customers;

NUM_CUSTOMERS
55500

select ai narrate what are the top 3 customers in San Francisco;

ORA-20000: Data access is disabled for SELECT AI.
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD", line 2228
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD_AI", line 13157
ORA-06512: at line 1 https://docs.oracle.com/error-help/db/ora-20000/
The stored procedure 'raise_application_error' was called which causes this error to be generated
Error at Line: 1 Column: 6

The following example shows the errors that are triggered when you try to generate synthetic data.

BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'DATA_ACCESS_SDG',
object_name => 'CUSTOMERS_NEW',
owner_name => 'ADB_USER,
record_count => 5
);
END;
/ 

ERROR at line 1:

ORA-20000: Data access is disabled for SELECT AI.
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD", line 2228
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD_AI", line 13401

ORA-06512: at line 2
Enabling Data Access

The following example shows enabling data access. Log in as an administrator and run the following procedure:

EXEC DBMS_CLOUD_AI.ENABLE_DATA_ACCESS;
 
PL/SQL procedure successfully completed.

Log in as database user, create and configure your AI profile. Review Use DBMS_CLOUD_AI to Configure AI Profiles to configure your AI profile.

BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
          profile_name =>'DATA_ACCESS_NEW',
          attributes   =>'{"provider": "openai",
            "credential_name": "OPENAI_CRED",
            "object_list": [{"owner":"SH"}]
          }');
   END;
   /

PL/SQL procedure successfully completed.

EXEC DBMS_CLOUD_AI.SET_PROFILE('DATA_ACCESS_NEW');

PL/SQL procedure successfully completed.


select ai how many customers;

NUM_CUSTOMERS
55500

select ai narrate what are the top 3 customers in San Francisco;

"RESPONSE"
"The top 3 customers in San Francisco are Cody Seto, Lauren Yaskovich, and Ian Mc"

The following example shows successful synthetic data generation after enabling data access.

BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'DATA_ACCESS_SDG',
object_name => 'CUSTOMERS_NEW',
owner_name => 'ADB_USER',
record_count => 5
);
END;
/ 

PL/SQL procedure successfully completed.