Synthetic Data Generation
Generate synthetic data using random generators, algorithms, statistical models, and Large Language Models (LLMs) to simulate real data for developing and testing solutions effectively.
Note:
Support for Synthetic Data Generation is available in Oracle Database 19c starting with version 19.29, and in Oracle AI Database 26ai starting with version 23.26.Synthetic data can be a powerful tool when developing and testing solutions, especially when actual data doesn’t yet exist or isn’t allowed to be used. Synthetic, or artificially generated, data can have many of the characteristics of real data. Synthetic data is typically created using random generators, algorithms, or statistical models to simulate the characteristics and distributions of real data. However, this can be complex to produce or rely on tools with features of varying sophistication. With the availability of Large Language Models (LLMs), more relevant and schema-specific data may be generated that considers characteristics expressed in natural language.
Topics
Related Topics
Benefits of Synthetic Data Generation
Synthetic data generation enables populating database metadata clones, supporting development, testing, and machine learning projects without using sensitive data from original tables.
-
Populating metadata clones with synthetic data: A metadata clone replicates the structure of a database or schema without including actual data. Select AI allows synthetic data generation to populate these clones, protecting sensitive data while enabling development, testing, and creating templates. This approach supports performance and scalability testing.
-
Starting new projects: When starting a new project, actual data may not be available. Synthetic data provides realistic samples to help demonstrate concepts and gain support for project proposals.
-
Validating user experience: Synthetic data aids in testing user interfaces by providing diverse data sets to uncover design flaws, performance, and scalability issues.
-
Supporting AI and machine learning projects: Synthetic data is useful for training AI and machine learning models when real data is unavailable or restricted. LLMs can generate data with specific patterns to facilitate model training and scoring.
Generate Synthetic Data
Use DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA function to
generate synthetic data and query the data using Select AI actions.
To learn more, see GENERATE_SYNTHETIC_DATA Function and Example: Generate Synthetic Data.
Monitor and Troubleshoot Synthetic Data Generation
When generating large amounts of data across many tables,
Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in
parallel. The status of each chunk is tracked in the SYNTHETIC_DATA$<operation_id>_STATUS table.
Synthetic data generation operations are logged in the tables DBA_LOAD_OPERATIONS and USER_LOAD_OPERATIONS. Use these tables to monitor the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA operation. See Track DBMS_CLOUD Load Operations for more details.
USER_LOAD_OPERATION using the
following:SELECT max(id) FROM user_load_operations;
To view the synthetic data operations running in a different session, use the
DBA_LOAD_OPERATIONS view.
View Status of Synthetic Data Generation
The status table of synthetic data generation operation shows the
progress of each table and its corresponding chunk. The
STATUS_TABLE column in USER_LOAD_OPERATIONS or
DBA_LOAD_OPERATIONS shows the status table name. The table name
is SYNTHETIC_DATA$<operation_id>_STATUS and it has following
columns:
| Name | Datatype | Description |
|---|---|---|
|
|
|
Unique identifier of the record. |
|
|
|
Qualified name of the table, such as
"ADB_USER"."EMPLOYEES" |
|
|
|
Desired number of records for this data generation task |
|
|
|
Actual number of records generated. |
|
|
|
Starting value for the primary key during this data generation task. |
|
|
|
Timestamp indicating when the record was last modified. |
|
|
|
Status of the data generation task. The valid values
are:
|
|
|
|
Error code, if the data generation task fails. |
|
|
|
Error message provided if the task fails. |
|
|
|
Timestamp marking the end of the data generation task. |
To check the number of records generated for each table, issue the following:
SELECT name, SUM(rows_loaded) FROM synthetic_data$<operation_id>_status group by name;Query ROWS_LOADED to confirm how many number of rows are
loaded for each chunk, and SUM(ROWS_LOADED) for rows for each
table.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_list => '[{"owner": "ADB_USER", "name": "DIRECTOR","record_count":150},
{"owner": "ADB_USER", "name": "MOVIE_ACTOR","record_count":300},
{"owner": "ADB_USER", "name": "CLASSES", "user_prompt":"all in fall semester","record_count":5},
{"owner": "ADB_USER", "name": "ACTOR","record_count":220},
{"owner": "ADB_USER", "name": "MOVIE","record_count":50}]'
);
END;
/
-- Check loaded rows for each chunk
SQL> SELECT name, rows_loaded FROM synthetic_data$141_status order by name;
NAME ROWS_LOADED
------------------------------------------
"ADB_USER"."ACTOR" 188
"ADB_USER"."ACTOR" 32
"ADB_USER"."CLASSES" 5
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE" 50
"ADB_USER"."MOVIE_ACTOR" 38
"ADB_USER"."MOVIE_ACTOR" 114
"ADB_USER"."MOVIE_ACTOR" 148
-- Check loaded rows for each table
SQL> SELECT name, SUM(rows_loaded) FROM synthetic_data$141_status group by name;
NAME SUM(ROWS_LOADED)
------------------------------------------
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE_ACTOR" 300
"ADB_USER"."CLASSES" 5
"ADB_USER"."ACTOR" 220
"ADB_USER"."MOVIE" 50
Example: Generate Synthetic Data
Note:
Support for Synthetic Data Generation is available in Oracle Database 19c starting with version 19.29, and in Oracle Database 26ai starting with version 23.26.The following example shows how to create a few tables in your schema,
use OCI Generative AI as your AI provider to create an AI profile, synthesize data
into those tables using the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA
function, and query or generate responses to natural language prompts with Select
AI.
--Create tables or use cloned tables
CREATE TABLE ADB_USER.Director (
director_id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE ADB_USER.Movie (
movie_id INT PRIMARY KEY,
title VARCHAR(100),
release_date DATE,
genre VARCHAR(50),
director_id INT,
FOREIGN KEY (director_id) REFERENCES ADB_USER.Director(director_id)
);
CREATE TABLE ADB_USER.Actor (
actor_id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE ADB_USER.Movie_Actor (
movie_id INT,
actor_id INT,
PRIMARY KEY (movie_id, actor_id),
FOREIGN KEY (movie_id) REFERENCES ADB_USER.Movie(movie_id),
FOREIGN KEY (actor_id) REFERENCES ADB_USER.Actor(actor_id)
);
-- Create the GenAI credential
BEGIN
DBMS_CLOUD.create_credential(
credential_name => 'GENAI_CRED',
user_ocid => 'ocid1.user.oc1....',
tenancy_ocid => 'ocid1.tenancy.oc1....',
private_key => 'vZ6cO...',
fingerprint => '86:7d:...'
);
END;
/
-- Create a profile
BEGIN
DBMS_CLOUD_AI.CREATE_PROFILE(
profile_name =>'GENAI',
attributes =>'{"provider": "oci",
"credential_name": "GENAI_CRED",
"object_list": [{"owner": "ADB_USER",
"oci_compartment_id": "ocid1.compartment.oc1...."}]
}');
END;
/
EXEC DBMS_CLOUD_AI.set_profile('GENAI');
-- Run the API for single table
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'Director',
owner_name => 'ADB_USER',
record_count => 5
);
END;
/
PL/SQL procedure successfully completed.
-- Query the table to see results
SQL> SELECT * FROM ADB_USER.Director;
DIRECTOR_ID NAME
----------- ----------------------------------------------------------------------------------------------------
1 John Smith
2 Emily Chen
3 Michael Brown
4 Sarah Taylor
5 David Lee
-- Or ask select ai to show the results
SQL> select ai how many directors are there;
NUMBER_OF_DIRECTORS
-------------------
5
After you create and set your AI provider
profile, use the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA to generate
data for multiple tables. You can query or use Select AI to respond to the natural
language prompts.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_list => '[{"owner": "ADB_USER", "name": "Director","record_count":5},
{"owner": "ADB_USER", "name": "Movie_Actor","record_count":5},
{"owner": "ADB_USER", "name": "Actor","record_count":10},
{"owner": "ADB_USER", "name": "Movie","record_count":5,"user_prompt":"all movies released in 2009"}]'
);
END;
/
PL/SQL procedure successfully completed.
-- Query the table to see results
SQL> select * from ADB_USER.Movie;
MOVIE_ID TITLE RELEASE_D GENRE DIRECTOR_ID
---------- -------------------------------------------------------- --------- --------------------------------------------------------------- -----------
1 The Dark Knight 15-JUL-09 Action 8
2 Inglourious Basterds 21-AUG-09 War 3
3 Up in the Air 04-SEP-09 Drama 6
4 The Hangover 05-JUN-09 Comedy 1
5 District 9 14-AUG-09 Science Fiction 10
-- Or ask select ai to show the results
SQL> select ai how many actors are there;
Number of Actors
----------------
10To guide AI service in generating
synthetic data, you can randomly select existing records from a table. For instance,
by adding {"sample_rows": 5} to the params
argument, you can send 5 sample rows from a table to the AI provider. This example
generates 10 additional rows based on the sample rows from the
Transactions
table.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'Transactions',
owner_name => 'ADB_USER',
record_count => 10,
params => '{"sample_rows":5}'
);
END;
/The user_prompt
argument enables you to specify additional rules or requirements for data
generation. This can be applied to a single table or as part of the
object_list argument for multiple tables. For example, in the
following calls to DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA, the prompt
instructs the AI to generate synthetic data on movies released in
2009.
-- Definition for the Movie table CREATE TABLE Movie
CREATE TABLE Movie (
movie_id INT PRIMARY KEY,
title VARCHAR(100),
release_date DATE,
genre VARCHAR(50),
director_id INT,
FOREIGN KEY (director_id) REFERENCES Director(director_id)
);
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'Movie',
owner_name => 'ADB_USER',
record_count => 10,
user_prompt => 'all movies are released in 2009',
params => '{"sample_rows":5}'
);
END;
/
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_list => '[{"owner": "ADB_USER", "name": "Director","record_count":5},
{"owner": "ADB_USER", "name": "Movie_Actor","record_count":5},
{"owner": "ADB_USER", "name": "Actor","record_count":10},
{"owner": "ADB_USER", "name": "Movie","record_count":5,"user_prompt":"all movies are released in 2009"}]'
);
END;
/If a table has column statistics or is cloned from a database that includes metadata, Select AI can use these statistics to generate data that closely resembles or is consistent with the original data.
For NUMBER columns, the high and
low values from the statistics guide the value range. For instance, if the
SALARY column in the original EMPLOYEES table
ranges from 1000 to 10000, the synthetic data for this column will also fall within
this range.
For columns with distinct values, such as a
STATE column with values CA,
WA, and TX, the
synthetic data will use these specific values. You can manage this feature using the
{"table_statistics": true/false} parameter. By default, the
table statistics are
enabled.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'Movie',
owner_name => 'ADB_USER',
record_count => 10,
user_prompt => 'all movies released in 2009',
params => '{"sample_rows":5,"table_statistics":true}'
);
END;
/If column comments exist, Select AI automatically includes them to provide additional information for the LLM during data generation. For example, a comment on the Status column in a Transaction table might list allowed values such as successful, failed, pending, canceled, and need manual check. You can also add comments to further explain the column, giving AI services more precise instructions or hints for generating accurate data. By default, comments are disabled. See Optional Parameters for more details.
-- Use comment on column
COMMENT ON COLUMN Transaction.status IS 'the value for state should either be ''successful'', ''failed'', ''pending'' or ''canceled''';
/
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'employees',
owner_name => 'ADB_USER',
record_count => 10
params => '{"comments":true}'
);
END;
/When generating large amounts of
synthetic data with LLMs, duplicate values are likely to occur. To prevent this, set
up a unique constraint on the relevant column. This ensures that Select AI ignores
rows with duplicate values in the LLM response. Additionally, to restrict values for
certain columns, you can use the user_prompt or add comments to
specify the allowed values, such as limiting a STATE column to
CA, WA, and
TX.
-- Use 'user_prompt'
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_name => 'employees',
owner_name => 'ADB_USER',
user_prompt => 'the value for state should either be CA, WA, or TX',
record_count => 10
);
END;
/
-- Use comment on column
COMMENT ON COLUMN EMPLOYEES.state IS 'the value for state should either be CA, WA, or TX'
/To reduce runtime, Select AI splits synthetic data generation tasks into smaller chunks for tables without primary keys or with numeric primary keys. These tasks run in parallel, interacting with the AI provider to generate data more efficiently. The Degree of Parallelism (DOP) in your database, influenced by your Autonomous AI Database on Dedicated Exadata Infrastructure service level and ECPU or OCPU settings, determines the number of records each chunk processes. Running tasks in parallel generally improves performance, especially when generating large amounts of data across many tables. To manage the parallel processing of synthetic data generation, set priority as an optional parameter. See Optional Parameters.
Example: Enable or Disable Data Access
This example illustrates how administrators can control data access and prevent Select AI from sending actual schema tables to the LLM.
To restrict access to schema tables, log in as an administrator and run the following procedure.
EXEC DBMS_CLOUD_AI.DISABLE_DATA_ACCESS;
PL/SQL procedure successfully completed.
Disabling data access limits Select AI's narrate action and Synthetic Data Generation. The narrate action and synthetic data generation raise an error.
Log in as database user, create and configure your AI profile. Review Use DBMS_CLOUD_AI to Configure AI Profiles to configure your AI profile.
BEGIN
DBMS_CLOUD_AI.CREATE_PROFILE(
profile_name =>'DATA_ACCESS',
attributes =>'{"provider": "openai",
"credential_name": "OPENAI_CRED",
"object_list": [{"owner":"SH"}]
}');
END;
/
EXEC DBMS_CLOUD_AI.SET_PROFILE('DATA_ACCESS');
select ai how many customers;
NUM_CUSTOMERS
55500
select ai narrate what are the top 3 customers in San Francisco;
ORA-20000: Data access is disabled for SELECT AI.
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD", line 2228
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD_AI", line 13157
ORA-06512: at line 1 https://docs.oracle.com/error-help/db/ora-20000/
The stored procedure 'raise_application_error' was called which causes this error to be generated
Error at Line: 1 Column: 6The following example shows the errors that are triggered when you try to generate synthetic data.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'DATA_ACCESS_SDG',
object_name => 'CUSTOMERS_NEW',
owner_name => 'ADB_USER,
record_count => 5
);
END;
/
ERROR at line 1:
ORA-20000: Data access is disabled for SELECT AI.
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD", line 2228
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD_AI", line 13401
ORA-06512: at line 2The following example shows enabling data access. Log in as an administrator and run the following procedure:
EXEC DBMS_CLOUD_AI.ENABLE_DATA_ACCESS;
PL/SQL procedure successfully completed.Log in as database user, create and configure your AI profile. Review Use DBMS_CLOUD_AI to Configure AI Profiles to configure your AI profile.
BEGIN
DBMS_CLOUD_AI.CREATE_PROFILE(
profile_name =>'DATA_ACCESS_NEW',
attributes =>'{"provider": "openai",
"credential_name": "OPENAI_CRED",
"object_list": [{"owner":"SH"}]
}');
END;
/
PL/SQL procedure successfully completed.
EXEC DBMS_CLOUD_AI.SET_PROFILE('DATA_ACCESS_NEW');
PL/SQL procedure successfully completed.
select ai how many customers;
NUM_CUSTOMERS
55500
select ai narrate what are the top 3 customers in San Francisco;
"RESPONSE"
"The top 3 customers in San Francisco are Cody Seto, Lauren Yaskovich, and Ian Mc"
The following example shows successful synthetic data generation after enabling data access.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'DATA_ACCESS_SDG',
object_name => 'CUSTOMERS_NEW',
owner_name => 'ADB_USER',
record_count => 5
);
END;
/
PL/SQL procedure successfully completed.For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Access to Oracle Support
Oracle customer access to and use of Oracle support services will be pursuant to the terms and conditions specified in their Oracle order for the applicable services.