Hyperparameters for Fine-Tuning a Model in Generative AI

OCI Generative AI uses hyperparameters to fine-tune a base model with your provided training dataset. The fine-tuning hyperparameters are described below; you can set them when you create a model and view them on the custom model's detail page.

Tip

Start training a model with the default hyperparameter values. After the model is created, in the model's detail page, under Model Performance, check the values for accuracy and loss. If you're not happy with the results, create another model with either a larger dataset or different hyperparameters until the performance improves.
Each hyperparameter is listed with its description and its valid range.
Total training epochs
The number of times the training iterates through the entire training dataset. For example, 1 epoch means that the model is trained by using the entire training dataset one time.
Valid range: Enter 1 or a higher integer. The default is 3.

Learning rate
The speed at which the model weights are updated against the error gradient.
Valid range: Enter a number between 0 and 1.0.
For the LoRA training method, the default is 0.0002.
For the T-Few training method, the default is 0.01.
For the Vanilla training method, the default is 0.01 for the cohere.command model and 0.0000006 (6e-7) for the cohere.command-light model.

Training batch size
The number of samples in a mini-batch to process before updating the model's parameters.
Valid range: Enter 8 for the cohere.command model, and an integer between 8 and 16 for the cohere.command-light and meta.llama-3-70b-instruct models. The default is 8 for the cohere.command and meta.llama-3-70b-instruct models.
For the LoRA training method, the default is 8.
For the T-Few and Vanilla training methods, the default is 8 for the cohere.command model and 16 for the cohere.command-light model.

Early stopping patience
The number of evaluation cycles (grace periods) that training continues after the early stopping threshold is triggered. Training stops if the loss metric doesn't improve by more than the early stopping threshold for this many consecutive evaluations (see the early stopping sketch after this list).
Valid range: Enter 1 or a higher integer to add grace periods, or 0 to disable early stopping. For the T-Few and Vanilla training methods, the default is 6. For the LoRA training method, the default is 15.

Early stopping threshold
The minimum improvement in evaluation loss that resets the early stopping counter. Loss improves when it decreases from one evaluation cycle to the next. If loss doesn't improve by at least this amount during the patience period, training stops; otherwise, training continues and the counter resets.
Valid range: Enter 0 or a higher number. For the T-Few and Vanilla training methods, the default is 0.01. For the LoRA training method, the default is 0.0001.

Log model metrics interval in steps
The number of training steps between logging events. Model metrics such as training loss and learning rate are logged at this interval. If the training loss isn't decreasing as expected, review the training data or the learning rate.
Valid range: Enter an integer between 1 and the total number of training steps, or 0 to disable logging. The default is 10.

Number of last layers (for the Vanilla training method only)
The number of last layers to fine-tune with the Vanilla method.
Valid range: For the cohere.command model, enter an integer between 1 and 15; the default is 15. For the cohere.command-light model, enter an integer between 1 and 14; the default is 14.

LoRA r (for the LoRA training method only)
The attention dimension (rank) of the update matrices. A lower rank results in smaller update matrices with fewer trainable parameters.
Valid range: Enter an integer between 1 and 64. The default is 8.

LoRA alpha (for the LoRA training method only)
The alpha parameter for LoRA scaling. The LoRA weight matrices are scaled by dividing LoRA alpha by LoRA r. The LoRA weights are a small set of new weights and are the only weights trained in the model (see the example after this list).
Valid range: Enter an integer between 1 and 128. The default is 8.

LoRA dropout (for the LoRA training method only)
The dropout probability for neurons in the LoRA layers. Dropout prevents overfitting by randomly ignoring (dropping out) neurons within a layer during training. For example, a dropout of 0.1 means that each neuron has a 10% chance of being dropped.
Valid range: Enter a decimal number between 0 and 1. The default is 0.1 (10%).
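
To make the LoRA scaling concrete, here is a minimal sketch of how a low-rank update, scaled by LoRA alpha divided by LoRA r and with dropout applied to the LoRA path, could combine with a frozen weight matrix. The matrix sizes, variable names, and use of NumPy are illustrative assumptions, not OCI service code.

# Illustration only: assumed sizes and NumPy, not OCI service code.
import numpy as np

d = 16                # model dimension (assumed for illustration)
lora_r = 8            # default LoRA r from above
lora_alpha = 8        # default LoRA alpha from above
lora_dropout = 0.1    # default LoRA dropout from above

W = np.zeros((d, d))              # frozen base weight (stand-in values)
A = np.random.randn(lora_r, d)    # trainable LoRA matrix A
B = np.random.randn(d, lora_r)    # trainable LoRA matrix B

scaling = lora_alpha / lora_r     # 8 / 8 = 1.0 with the defaults
x = np.random.randn(d)            # an input vector
keep = np.random.rand(d) >= lora_dropout       # each unit has a 10% chance of being dropped
h = W @ x + scaling * (B @ (A @ (x * keep)))   # base output plus the scaled LoRA update
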
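
The following sketch shows how early stopping patience and early stopping threshold work together, as described above. It is a simplified stand-in for the service's internal logic, with assumed function and variable names.

def should_stop_early(eval_losses, patience, threshold):
    # eval_losses: evaluation loss recorded at each evaluation cycle, oldest first.
    # Returns True once the loss has failed to improve by more than `threshold`
    # for `patience` consecutive evaluation cycles.
    if patience == 0:            # a patience of 0 disables early stopping
        return False
    best_loss = eval_losses[0]
    counter = 0
    for loss in eval_losses[1:]:
        if best_loss - loss > threshold:   # improved enough: reset the counter
            best_loss = loss
            counter = 0
        else:                              # not enough improvement: use up a grace period
            counter += 1
            if counter >= patience:
                return True
    return False

# Example with the T-Few/Vanilla defaults (patience=6, threshold=0.01):
# six consecutive evaluations that fail to lower the best loss by more
# than 0.01 stop the training.
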
The following equation shows how the model calculates the totalTrainingSteps parameter.
totalTrainingSteps = (totalTrainingEpochs * size(trainingDataset)) / trainingBatchSize
This equation omits the rounding that the actual calculation applies.
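
As a quick check of the equation, the snippet below computes the step count for assumed example values; the dataset size of 1,000 is hypothetical, and the floor division is only one possible way to handle the rounding that the equation omits.

total_training_epochs = 3       # default number of epochs
training_dataset_size = 1000    # hypothetical number of examples in the training dataset
training_batch_size = 8         # default batch size for the cohere.command model

total_training_steps = (total_training_epochs * training_dataset_size) // training_batch_size
print(total_training_steps)     # 375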