Concepts for Generative AI

To help you understand OCI Generative AI, review some concepts and terms related to the service.

Generative AI Model

An AI model trained on large amounts of data that takes inputs it hasn't seen before and generates new content.

Retrieval-Augmented Generation (RAG)

A program that retrieves data from given sources and augments large language model (LLM) responses with the given information to generate grounded responses.
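
As an illustration, a minimal RAG flow retrieves relevant passages and prepends them to the prompt before the model is called. The sketch below uses a naive keyword retriever with made-up document text; a real system would use embeddings and a vector database, and the helper names here are hypothetical.

```python
def retrieve(query, documents, k=3):
    # Naive keyword retriever: rank documents by how many query words they share.
    # A real RAG system would use embeddings and a vector database instead.
    words = set(query.lower().split())
    scored = sorted(documents, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def build_grounded_prompt(query, documents):
    # Augment the user's question with retrieved context so the LLM's
    # response is grounded in the given sources.
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = ["The summer solstice is the longest day of the year.",
        "Trees sway in the breeze when wind passes through their branches."]
print(build_grounded_prompt("What is the summer solstice?", docs))
```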

Prompts and Prompt Engineering

Prompts
A string of text in natural language used to instruct or extract information from a large language model. For example,
  • What is the summer solstice?
  • Write a poem about trees swaying in the breeze.
  • Rewrite the previous text in a lighter tone.
Prompt Engineering
The iterative process of crafting specific requests in natural language for extracting optimized responses from a large language model (LLM). Based on the exact language used, the prompt engineer can guide the LLM to provide better or different outputs.

Inference

The ability of a large language model (LLM) to generate a response based on instructions and context provided by the user in the prompt. An LLM can generate new data, make predictions, or draw conclusions based on its learned patterns and relationships in the training data, without having been explicitly programmed.

Inference is a key feature of natural language processing (NLP) tasks such as question answering, summarizing text, and translating. You can use the foundational models in Generative AI for inference.

Streaming

Generation of content by a large language model (LLM) where the user can see the tokens being generated one at a time instead of waiting for the complete response to be generated before it's returned.
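
As a rough illustration, a streaming client consumes tokens from an iterator and renders each one as it arrives. This sketch fakes the stream with a Python generator; a real client would read the events from the model's endpoint.

```python
import time

def fake_token_stream(text):
    # Stand-in for a streamed LLM response: yields one token at a time.
    for token in text.split():
        time.sleep(0.1)  # simulate generation latency
        yield token + " "

# The user sees tokens as they're generated instead of waiting for the full response.
for token in fake_token_stream("Streaming shows partial output immediately."):
    print(token, end="", flush=True)
print()
```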

Embedding

A numerical representation that has the property of preserving the meaning of a piece of text. This text can be a phrase, a sentence, or one or more paragraphs. The Generative AI embedding models transform each phrase, sentence, or paragraph that you input into an array with 384 or 1024 numbers, depending on the embedding model that you choose. You can use these embeddings for finding similarity in phrases that are similar in context or category.

Embeddings are typically stored in a vector database. They're mostly used for semantic searches, where the search function focuses on the meaning of the text that it's searching through rather than finding results based on keywords. To create the embeddings, you can input phrases in English and other languages.
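
For example, semantic search compares embedding vectors with cosine similarity instead of matching keywords. The sketch below uses made-up 4-dimensional vectors to keep the math visible; an actual embedding model would return 384 or 1024 numbers per input.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction (same meaning),
    # values near 0 mean unrelated text.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings; real ones come from an embedding model and a vector database.
query = [0.9, 0.1, 0.0, 0.2]
docs = {"summer solstice": [0.8, 0.2, 0.1, 0.3],
        "zebra habitats":  [0.0, 0.9, 0.7, 0.1]}
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # the semantically closest document
```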

Playground

An interface in the Oracle Cloud Console for exploring the hosted pretrained and custom models without writing a single line of code. Use the playground to test your use cases and refine prompts and parameters. When you're happy with the results, copy the generated code or use the model's endpoint to integrate Generative AI into your applications.

Custom Model

A model that you create by using a pretrained model as a base and using your own dataset to fine-tune that model.

Tokens

A token is a word, part of a word, or a punctuation mark. For example, apple is one token, friendship is two tokens (friend and ship), and don’t is two tokens (don and ’t). When you run a model in the playground, you can set the maximum number of output tokens. Estimate four characters per token.
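
Because exact tokenization varies by model, the four-characters-per-token rule is useful for quickly sizing prompts and output limits. A minimal sketch of that estimate:

```python
def estimate_tokens(text):
    # Rough rule of thumb from above: about four characters per token.
    return max(1, round(len(text) / 4))

prompt = "Write a poem about trees swaying in the breeze."
print(estimate_tokens(prompt))  # roughly 12 tokens for this 47-character prompt
```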

Temperature

The level of randomness used to generate the output text. To generate a similar output for a prompt every time that you run that prompt, use 0. To generate random new text for that prompt, increase the temperature.

Tip

Start with the temperature set to 0 and increase the temperature as you regenerate the prompts to refine the output. High temperatures can introduce hallucinations and factually incorrect information.
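
Conceptually, temperature divides the model's raw scores (logits) before they're converted into probabilities: low values sharpen the distribution toward the most likely token, and high values flatten it. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature 0 is treated as greedy: always pick the most likely token.
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0))    # [1.0, 0.0, 0.0] -> deterministic
print(softmax_with_temperature(logits, 1.0))  # moderately random
print(softmax_with_temperature(logits, 2.0))  # flatter -> more random
```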

Top k

A sampling method in which the model chooses the next token randomly from the top k most likely tokens. A higher value for k generates more random output, which makes the output text sound more natural. The default value for k is 0 for command models and -1 for Llama 2 models, which means that the models should consider all tokens and not use this method.
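
In other words, top k truncates the candidate list to the k most likely tokens, renormalizes their probabilities, and samples among them. A minimal sketch with made-up probabilities:

```python
import random

def top_k_sample(token_probs, k):
    # Keep only the k most likely tokens, renormalize, then sample.
    top = sorted(token_probs.items(), key=lambda kv: -kv[1])[:k]
    total = sum(p for _, p in top)
    tokens, weights = zip(*[(t, p / total) for t, p in top])
    return random.choices(tokens, weights=weights)[0]

probs = {"food": 0.5, "book": 0.3, "song": 0.15, "zebra": 0.05}
print(top_k_sample(probs, k=2))  # samples only from "food" and "book"
```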

Top p

A sampling method that eliminates low-likelihood tokens by using p as a cumulative probability cutoff for the next token. The default value for p is 0.75, which eliminates the bottom 25 percent of likelihood for the next token.

The top p method ensures that only the most likely tokens, whose probabilities sum to p, are considered for generation at each step. A higher value for p introduces more randomness into the output. Set the value to 1.0 to consider all tokens, or set it to 0 to disable this method.

If you're also using top k, then the model considers only the top tokens whose probabilities add up to p percent and ignores the rest of the k tokens. For example, if k is 20 but only the probabilities of the top 10 add up to the value of p, then only the top 10 tokens are chosen.
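
Putting the two methods together, top p keeps the smallest set of most likely tokens whose cumulative probability reaches p, after any top k truncation. A minimal sketch with made-up probabilities:

```python
import random

def top_p_sample(token_probs, p, k=None):
    # Sort tokens by probability, optionally truncating to the top k first.
    ranked = sorted(token_probs.items(), key=lambda kv: -kv[1])
    if k is not None:
        ranked = ranked[:k]
    # Keep tokens until their cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    tokens, weights = zip(*[(t, pr / total) for t, pr in kept])
    return random.choices(tokens, weights=weights)[0]

probs = {"food": 0.5, "book": 0.3, "song": 0.15, "zebra": 0.05}
print(top_p_sample(probs, p=0.75))  # considers "food" and "book" only
```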

Frequency Penalty

A penalty that is assigned to a token when that token appears frequently. High penalties encourage fewer repeated tokens and produce a more random output.

Presence Penalty

A penalty that is assigned to each token when it appears in the output to encourage generating outputs with tokens that haven't been used.
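
Both penalties are commonly implemented by subtracting from a token's score before sampling: the frequency penalty grows with each repetition of a token, while the presence penalty is a flat cost applied once a token has appeared at all. A minimal sketch with made-up scores (the exact formula varies by model):

```python
def apply_penalties(logits, generated_tokens, freq_penalty, presence_penalty):
    # Count how many times each token has already been generated.
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1
    adjusted = {}
    for token, score in logits.items():
        n = counts.get(token, 0)
        # Frequency penalty scales with each repetition; presence penalty
        # is a one-time cost for any token that has appeared at all.
        adjusted[token] = score - freq_penalty * n - presence_penalty * (1 if n else 0)
    return adjusted

logits = {"tree": 2.0, "breeze": 1.8, "sway": 1.5}
print(apply_penalties(logits, ["tree", "tree", "breeze"], 0.5, 0.3))
# "tree" drops the most (two repeats), encouraging less repetitive output.
```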

Likelihood

In the output of a large language model (LLM), how likely it is for a token to follow the current generated token. When an LLM generates a new token for the output text, a likelihood is assigned to all tokens, where tokens with higher likelihoods are more likely to follow the current token. For example, it's more likely that the word favorite is followed by the word food or book rather than the word zebra. Likelihood is defined by a number between -15 and 0, and the more negative the number, the less likely it is that the token follows the current token.
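
The -15 to 0 range behaves like a log probability: 0 corresponds to a probability of 1, and more negative values to ever smaller probabilities. A quick illustration of that reading (an assumption about the scale, not an official formula):

```python
import math

# Reading likelihood values as log probabilities in [-15, 0].
for likelihood in [0.0, -1.0, -5.0, -15.0]:
    print(likelihood, "->", math.exp(likelihood))
# 0 -> 1.0 (certain); -15 -> ~3e-7 (very unlikely to follow the current token)
```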

Model Endpoint

A designated point on a dedicated AI cluster where a large language model (LLM) can accept user requests and send back responses such as the model's generated text.

In OCI Generative AI, you can create endpoints for ready-to-use pretrained models and custom models. Those endpoints are listed in the playground for testing the models. You can also reference those endpoints in applications.
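
For example, an application might send a prompt to a model endpoint over HTTPS. The sketch below is schematic: the URL and payload fields are placeholders rather than the actual OCI Generative AI API, and real requests also need OCI request signing, typically through an OCI SDK.

```python
import json
import urllib.request

# Placeholder endpoint URL and payload shape -- illustrative only, not the
# real OCI Generative AI API. Real requests also require OCI request signing.
ENDPOINT_URL = "https://example-inference-host/endpoints/MY_ENDPOINT_ID/generate"

def call_endpoint(prompt, max_tokens=200, temperature=0):
    body = json.dumps({"prompt": prompt,
                       "maxTokens": max_tokens,
                       "temperature": temperature}).encode()
    req = urllib.request.Request(ENDPOINT_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# response = call_endpoint("What is the summer solstice?")
```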

Content Moderation

A feature that removes biased, toxic, violent, abusive, derogatory, hateful, threatening, insulting, and harassing phrases from generated responses in large language models (LLMs). In OCI Generative AI, content moderation is divided into the following four categories.
  • Hate and harassment, such as identity attacks, insults, threats of violence, and sexual aggression
  • Self-inflicted harm, such as self-harm and eating-disorder promotion
  • Ideological harm, such as extremism, terrorism, organized crime, and misinformation
  • Exploitation, such as scams and sexual abuse
By default, OCI Generative AI's pretrained ready-to-use models don't include this feature. However, pretrained models might have some level of content moderation that filters the output responses. To incorporate content moderation into models, you must enable content moderation when creating an endpoint for a pretrained or a fine-tuned model. Learn more about Creating an Endpoint in Generative AI.

Dedicated AI Clusters

Compute resources that you can use for fine-tuning custom models or for hosting endpoints for pretrained and custom models. The clusters are dedicated to your models and not shared with other customers.