Dedicated AI Cluster Performance Benchmarks in Generative AI

Review the hosting dedicated AI cluster benchmarks in OCI Generative AI.

Note

Performance Benchmark Terms

Concurrency (number)

Number of users that make requests at the same time.

Metric 1: Token-level Inference Speed (tokens/second)

This metric is defined as the number of output tokens generated per unit of end-to-end latency.

For applications that need to match the average human reading speed, focus on scenarios where the inference speed is 5 tokens/second or more, which corresponds to the average human reading speed.

Other scenarios, such as dialog and chatbot use cases, require faster, near real-time token generation, for example 15 tokens/second. In those cases, the number of concurrent users that can be served is lower, and the overall throughput is lower.
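
As an illustration only, token-level inference speed can be estimated from per-request benchmark measurements, as in the following Python sketch. The RequestRecord fields and the per-request averaging are assumptions made for the example; this is not the actual benchmark harness.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical per-request measurements collected during a benchmark run.
    output_tokens: int           # number of tokens generated for this request
    end_to_end_latency_s: float  # seconds from submission to the last generated token

def token_level_inference_speed(records: list[RequestRecord]) -> float:
    """Average number of output tokens generated per second of end-to-end latency."""
    speeds = [r.output_tokens / r.end_to_end_latency_s for r in records]
    return sum(speeds) / len(speeds)

# Example: 300 output tokens generated in 20 seconds is 15 tokens/second,
# the near real-time target mentioned for chatbot-style use cases.
print(token_level_inference_speed([RequestRecord(300, 20.0)]))  # 15.0
```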

Metric 2: Token-level Throughput (tokens/second)

This metric quantifies the average total number of tokens generated by the server per second across all simultaneous user requests. It provides an aggregate measure of the server's capacity and efficiency in serving requests across users.

When inference speed is less critical, such as in offline batch processing tasks, focus on the concurrency levels where throughput peaks and server cost efficiency is therefore highest. High throughput indicates the LLM's capacity to handle a large number of concurrent requests, which is ideal for batch processing or background tasks where an immediate response is not essential.

Note: The token-level throughput benchmark was done using the LLMPerf tool. The throughput computation has a known issue: it includes the time required to encode the generated text when computing token counts.
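
The following sketch shows one way to compute token-level throughput from the same kind of per-request records: total output tokens divided by the wall-clock span of the benchmark window. The field names are assumptions for illustration, and unlike the LLMPerf computation noted above, this sketch does not include text-encoding time.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical per-request measurements (assumed names, not LLMPerf output).
    output_tokens: int   # tokens generated for this request
    start_time_s: float  # submission timestamp, in seconds
    end_time_s: float    # timestamp of the last generated token, in seconds

def token_level_throughput(records: list[RequestRecord]) -> float:
    """Total output tokens across all concurrent requests divided by the
    wall-clock duration of the benchmark window, in tokens/second."""
    total_tokens = sum(r.output_tokens for r in records)
    window_s = max(r.end_time_s for r in records) - min(r.start_time_s for r in records)
    return total_tokens / window_s
```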

Metric 3: Request-level Latency (seconds)

Average time elapsed between request submission and request completion, that is, after the last token of the response has been generated.

Metric 4: Request-level Throughput (requests/minute, RPM)

The number of requests served per unit of time, in this case per minute.
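
Both request-level metrics can be derived from the same submission and completion timestamps. The sketch below is illustrative; the record fields are assumptions and are not part of OCI Generative AI or LLMPerf.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical timestamps, in seconds, for a single benchmark request.
    start_time_s: float  # request submission
    end_time_s: float    # last token of the response generated

def request_level_latency_s(records: list[RequestRecord]) -> float:
    """Average seconds from request submission to request completion."""
    return sum(r.end_time_s - r.start_time_s for r in records) / len(records)

def request_level_throughput_rpm(records: list[RequestRecord]) -> float:
    """Requests served per minute over the benchmark window."""
    window_s = max(r.end_time_s for r in records) - min(r.start_time_s for r in records)
    return len(records) / window_s * 60.0
```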

Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on:

  1. The number of concurrent requests.
  2. The number of tokens in the prompt.
  3. The number of tokens in the response.
  4. The variance of (2) and (3) across requests.

Text Generation Scenarios


Scenario 1: Stochastic Prompt and Response Lengths

This scenario mimics text generation use cases where the size of the prompt and response are unknown ahead of time.

In this scenario, because of the unknown length of the prompt and response, we've used a stochastic approach where both the prompt and response length follow a normal distribution:

  • The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
  • The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
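
For illustration, the request shapes for this scenario could be sampled as in the following sketch. Clamping each sample to at least 1 token is an assumption about how non-positive draws from the normal distributions would be handled; this is not the actual benchmark generator.

```python
import random

def sample_scenario_1(num_requests: int, seed: int = 0) -> list[tuple[int, int]]:
    """Sample (prompt_tokens, response_tokens) pairs for scenario 1:
    prompt ~ N(480, 240), response ~ N(300, 150), clamped to >= 1 token."""
    rng = random.Random(seed)
    shapes = []
    for _ in range(num_requests):
        prompt_tokens = max(1, round(rng.gauss(480, 240)))
        response_tokens = max(1, round(rng.gauss(300, 150)))
        shapes.append((prompt_tokens, response_tokens))
    return shapes

# Example: request shapes for 32 concurrent users.
print(sample_scenario_1(32)[:3])
```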

Scenario 2: Retrieval-augmented Generation (RAG)

The RAG scenario has a very long prompt and a short response. This scenario also mimics summarization use cases.

  • The prompt length is fixed to 2,000 tokens.
  • The response length is fixed to 200 tokens.

Scenario 3: Generation Heavy

This scenario is for generation-heavy and model-response-heavy use cases, for example, a long job description generated from a short bullet list of items. For this case, we set the following token lengths:

  • The prompt length is fixed to 100 tokens.
  • The response length is fixed to 1,000 tokens.

Scenario 4: Chatbot

This scenario covers chatbot / dialog use cases where the prompt and responses are shorter.

  • The prompt length is fixed to 100 tokens.
  • The response length is fixed to 100 tokens.

Text Embedding Scenarios


Scenario 5: Embeddings

Scenario 5 is only applicable to the embedding models. This scenario mimics embedding generation as part of the data ingestion pipeline of a vector database.

In this scenario, all requests are the same size: 96 documents, each with 512 tokens. An example is a collection of large PDF files, each with more than 30,000 words, that a user wants to ingest into a vector database.
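
To make the request shape concrete, the following sketch groups pre-tokenized documents into scenario 5-style embedding requests of 96 documents with 512 tokens each. The function name and the token count assumed for a 30,000+ word file are illustrative only, not part of the OCI Generative AI API.

```python
def build_embedding_requests(token_ids: list[int],
                             tokens_per_doc: int = 512,
                             docs_per_request: int = 96) -> list[list[list[int]]]:
    """Split a token stream into 512-token documents, then group the documents
    into requests of 96 documents each, mirroring the scenario 5 request shape."""
    docs = [token_ids[i:i + tokens_per_doc]
            for i in range(0, len(token_ids), tokens_per_doc)]
    return [docs[i:i + docs_per_request]
            for i in range(0, len(docs), docs_per_request)]

# A 30,000+ word file tokenizing to roughly 40,000 tokens (an assumption)
# yields 79 documents (the last one shorter than 512 tokens), a bit under
# one full 96-document request.
requests = build_embedding_requests(list(range(40_000)))
print(len(requests), [len(r) for r in requests])  # 1 [79]
```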

Scenario 6: Lighter Embeddings Workload

The lighter embeddings scenario is similar to scenario 5, except that the size of each request is reduced to 16 documents, each with 512 tokens. Scenario 6 represents smaller files with fewer words.