Training Data Requirements in Generative AI

Understand the guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI.

A custom model accepts a single training dataset file in JSONL (JSON Lines) format. The file must contain at least 32 prompt/completion pairs. The dataset is randomly split in an 80:20 ratio for training and validation. There's no maximum number of examples for the training file, but large datasets take longer to train.
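To make the 80:20 ratio concrete, the following minimal sketch splits the smallest allowed dataset (32 examples). The service performs its own split internally; the function name `split_examples`, the fixed seed, and the rounding rule are assumptions made only for illustration.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it at train_fraction.

    Illustration only: the real split is done by the service, and its
    exact rounding behavior is not documented here.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:cut], shuffled[cut:]

# Placeholder data at the documented minimum of 32 pairs.
examples = [
    {"prompt": f"question {i}", "completion": f"answer {i}"}
    for i in range(32)
]
train, validation = split_examples(examples)
print(len(train), len(validation))
```

With only 32 examples, the validation portion is just a handful of pairs, which is one reason to provide more than the minimum when possible.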

About JSONL

A JSONL file contains one JSON value or object on each line. The file isn't evaluated as a whole, like a regular JSON file. Instead, each line is treated as if it were a separate JSON file. This format is well suited to storing a set of inputs in JSON format.
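The difference from a regular JSON file can be seen in a short sketch that uses Python's standard `json` module. The helper name `parse_jsonl` is hypothetical, and skipping blank lines is a convenience of this sketch, not a documented behavior of the service.

```python
import json

def parse_jsonl(text):
    """Parse every non-empty line as an independent JSON value."""
    # Split on "\n" only, so characters inside JSON strings are left alone.
    return [json.loads(line) for line in text.split("\n") if line.strip()]

sample = (
    '{"prompt": "a", "completion": "b"}\n'
    '{"prompt": "c", "completion": "d"}\n'
)

records = parse_jsonl(sample)  # works: one object per line

# Parsing the same text as a single JSON document fails,
# because two top-level values are not one valid JSON file.
try:
    json.loads(sample)
    parsed_as_whole = True
except json.JSONDecodeError:
    parsed_as_whole = False
```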

The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:

{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
.
.
.

JSONL Example

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}

Note

Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
  • The file is UTF-8 encoded.
  • Each line item contains a valid JSON object.
  • Each JSON object has two properties: "prompt" and "completion".
  • Each JSON object is on its own line, and lines are separated by a newline character (\n).
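The properties above, together with the 32-example minimum, can be checked locally before uploading. The following is a minimal sketch, not an official validator: the function name `validate_jsonl_bytes` is hypothetical, "two properties" is read here as exactly the keys "prompt" and "completion", and blank lines are skipped as a choice of this sketch.

```python
import json

MIN_EXAMPLES = 32
REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl_bytes(data):
    """Return a list of problems; an empty list means all checks passed."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]

    problems = []
    valid_count = 0
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: tolerate blank lines, e.g. after the final \n
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            problems.append(
                f"line {number}: expected an object with exactly "
                "'prompt' and 'completion'"
            )
            continue
        valid_count += 1

    if valid_count < MIN_EXAMPLES:
        problems.append(
            f"only {valid_count} valid examples; "
            f"at least {MIN_EXAMPLES} are required"
        )
    return problems
```

To check a file on disk, read it in binary mode and pass the bytes, for example `validate_jsonl_bytes(open("training_data.jsonl", "rb").read())`, where the file name is a placeholder.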

After you create the JSONL file, add your dataset to an Object Storage bucket.