Training Data in Generative AI
Here are guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI. A custom model can be fine-tuned with only one dataset, which the system automatically splits into 80% training and 20% validation data. The dataset must be a JSONL file containing at least 32 prompt/completion pairs, each line formatted as: {"prompt": "<your prompt>", "completion": "<expected response>"}. Save the file in an OCI Object Storage bucket and reference it when creating the custom model.
Dataset Requirements
Datasets for training custom models have the following requirements:
- A maximum of one fine-tuning dataset is allowed per custom model. This dataset is randomly split in an 80:20 ratio for training and validation.
- Each file must have at least 32 prompt/completion pair examples.
- The file format is JSONL.
- Each line in the JSONL file has the following format: {"prompt": "<a prompt>", "completion": "<expected response given the prompt>"}\n
- The file must be stored in an OCI Object Storage bucket.
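The requirements above can be sketched in Python as a small writer that produces one JSON object per line. This is a minimal illustration, not part of the OCI service or SDK: the function name `write_jsonl` and the constant `MIN_EXAMPLES` are hypothetical, and the only rules it applies are the ones listed above (JSONL format, `prompt` and `completion` keys, at least 32 examples).

```python
import json
from pathlib import Path

MIN_EXAMPLES = 32  # minimum number of prompt/completion pairs listed above


def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs to a UTF-8 JSONL file, one object per line.

    Hypothetical helper for illustration; returns the number of records written.
    """
    pairs = list(pairs)
    if len(pairs) < MIN_EXAMPLES:
        raise ValueError(f"need at least {MIN_EXAMPLES} examples, got {len(pairs)}")
    with Path(path).open("w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes embedded newline characters, so each record
            # stays on a single line; ensure_ascii=False keeps non-ASCII text as-is.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(pairs)
```

A call such as `write_jsonl(pairs, "train.jsonl")`, where `pairs` is a list of at least 32 `(prompt, completion)` tuples, would produce a file that can then be uploaded to an Object Storage bucket.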
JSONL Format
- About JSONL
A JSONL file contains a new JSON value or object on each line. The file isn't evaluated as a whole, like a regular JSON file. Instead, each line is treated as if it is a separate JSON file. This format is ideal for storing a set of inputs in JSON format. The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:
{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
. . .
- JSONL Example
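To illustrate the point that a JSONL file is read line by line rather than as a whole, the following sketch parses a two-line JSONL string with Python's standard `json` module. The helper name `parse_jsonl` and the sample text are assumptions made for this illustration; they are not part of the Generative AI service.

```python
import json

# Two records in the prompt/completion format, one per line.
jsonl_text = (
    '{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}\n'
    '{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}\n'
)


def parse_jsonl(text):
    """Parse each non-empty line as its own JSON document (hypothetical helper)."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


records = parse_jsonl(jsonl_text)

# Parsing the whole text as a single JSON document fails, because two
# top-level objects in a row are not valid JSON.
try:
    json.loads(jsonl_text)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Here `records` holds one dictionary per line, while `whole_file_is_json` ends up `False`, which is the practical difference between JSONL and a regular JSON file.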
Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
- The file is UTF-8 encoded.
- Each line item contains a valid JSON object.
- Each JSON object has two properties: "prompt" and "completion".
- Each JSON object is entered on a new line or followed by a newline character (\n).
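The properties above can be checked locally before uploading. The following is a rough sketch, assuming a hypothetical helper named `validate_jsonl`; it checks UTF-8 decoding, per-line JSON validity, the two expected keys, and the 32-example minimum from the dataset requirements. It is not the validation that the service itself performs, which may apply further rules.

```python
import json

MIN_EXAMPLES = 32                         # minimum from the dataset requirements
REQUIRED_KEYS = {"prompt", "completion"}  # the two properties each object must have


def validate_jsonl(path):
    """Return a list of problems found in the file; an empty list means none were found."""
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read()
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a newline after the last record is expected, not an empty record

    problems = []
    valid = 0
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif set(record) != REQUIRED_KEYS:
            problems.append(f"line {number}: expected exactly the keys 'prompt' and 'completion'")
        else:
            valid += 1

    if valid < MIN_EXAMPLES:
        problems.append(f"only {valid} valid examples, at least {MIN_EXAMPLES} are required")
    return problems
```

Returning a list of messages instead of raising on the first error is a design choice for this sketch: it lets all problem lines be fixed in one pass.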
After you create the JSONL file, add your dataset to an Object Storage bucket.