Dedicated AI Cluster Performance Benchmarks in Generative AI

Review the hosting dedicated AI cluster benchmarks in OCI Generative AI.

Note

Performance Benchmark Terms

Concurrency (number)

Number of users that make requests at the same time.

Metric 1: Token-level Inference Speed (tokens/second)

This metric is defined as the number of output tokens generated per unit of end-to-end latency.

For applications that need to match the average human reading speed, focus on scenarios where the inference speed is 5 tokens/second or more, which corresponds to the average human reading speed.

Other scenarios, such as dialog and chatbot use cases, require faster, near real-time token generation, for example 15 tokens/second. In those cases, the number of concurrent users that can be served is lower, and the overall throughput is lower.
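
As an illustration only, token-level inference speed can be estimated from per-request benchmark measurements, as in the following Python sketch. The RequestRecord fields and the per-request averaging are assumptions made for the example; this is not the actual benchmark harness.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical per-request measurements collected during a benchmark run.
    output_tokens: int           # number of tokens generated for this request
    end_to_end_latency_s: float  # seconds from submission to the last generated token

def token_level_inference_speed(records: list[RequestRecord]) -> float:
    """Average number of output tokens generated per second of end-to-end latency."""
    speeds = [r.output_tokens / r.end_to_end_latency_s for r in records]
    return sum(speeds) / len(speeds)

# Example: 300 output tokens generated in 20 seconds is 15 tokens/second,
# the near real-time target mentioned for chatbot-style use cases.
print(token_level_inference_speed([RequestRecord(300, 20.0)]))  # 15.0
```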

Metric 2: Token-level Throughput (tokens/second)

This metric quantifies the average total number of tokens generated by the server per second across all simultaneous user requests. It provides an aggregate measure of the server's capacity and efficiency in serving requests across users.

When inference speed is less critical, such as in offline batch processing tasks, focus on the concurrency levels where throughput peaks and server cost efficiency is therefore highest. High throughput indicates the LLM's capacity to handle a large number of concurrent requests, which is ideal for batch processing or background tasks where an immediate response is not essential.

Note: The token-level throughput benchmark was done using the LLMPerf tool. The throughput computation has a known issue: it includes the time required to encode the generated text when computing token counts.
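
The following sketch shows one way to compute token-level throughput from the same kind of per-request records: total output tokens divided by the wall-clock span of the benchmark window. The field names are assumptions for illustration, and unlike the LLMPerf computation noted above, this sketch does not include text-encoding time.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical per-request measurements (assumed names, not LLMPerf output).
    output_tokens: int   # tokens generated for this request
    start_time_s: float  # submission timestamp, in seconds
    end_time_s: float    # timestamp of the last generated token, in seconds

def token_level_throughput(records: list[RequestRecord]) -> float:
    """Total output tokens across all concurrent requests divided by the
    wall-clock duration of the benchmark window, in tokens/second."""
    total_tokens = sum(r.output_tokens for r in records)
    window_s = max(r.end_time_s for r in records) - min(r.start_time_s for r in records)
    return total_tokens / window_s
```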

Metric 3: Request-level Latency (seconds)

Average time elapsed between request submission and request completion, that is, after the last token of the response has been generated.

Metric 4: Request-level Throughput (requests/minute, RPM)

The number of requests served per unit of time, in this case per minute.
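
Both request-level metrics can be derived from the same submission and completion timestamps. The sketch below is illustrative; the record fields are assumptions and are not part of OCI Generative AI or LLMPerf.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical timestamps, in seconds, for a single benchmark request.
    start_time_s: float  # request submission
    end_time_s: float    # last token of the response generated

def request_level_latency_s(records: list[RequestRecord]) -> float:
    """Average seconds from request submission to request completion."""
    return sum(r.end_time_s - r.start_time_s for r in records) / len(records)

def request_level_throughput_rpm(records: list[RequestRecord]) -> float:
    """Requests served per minute over the benchmark window."""
    window_s = max(r.end_time_s for r in records) - min(r.start_time_s for r in records)
    return len(records) / window_s * 60.0
```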

Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on:

  1. The number of concurrent requests.
  2. The number of tokens in the prompt.
  3. The number of tokens in the response.
  4. The variance of (2) and (3) across requests.

Text Generation Scenarios


Scenario 1: Stochastic Prompt and Response Lengths

This scenario mimics text generation use cases where the size of the prompt and response are unknown ahead of time.

In this scenario, because of the unknown length of the prompt and response, we've used a stochastic approach where both the prompt and response length follow a normal distribution:

  • The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
  • The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
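
For illustration, the request shapes for this scenario could be sampled as in the following sketch. Clamping each sample to at least 1 token is an assumption about how non-positive draws from the normal distributions would be handled; this is not the actual benchmark generator.

```python
import random

def sample_scenario_1(num_requests: int, seed: int = 0) -> list[tuple[int, int]]:
    """Sample (prompt_tokens, response_tokens) pairs for scenario 1:
    prompt ~ N(480, 240), response ~ N(300, 150), clamped to >= 1 token."""
    rng = random.Random(seed)
    shapes = []
    for _ in range(num_requests):
        prompt_tokens = max(1, round(rng.gauss(480, 240)))
        response_tokens = max(1, round(rng.gauss(300, 150)))
        shapes.append((prompt_tokens, response_tokens))
    return shapes

# Example: request shapes for 32 concurrent users.
print(sample_scenario_1(32)[:3])
```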

Scenario 2: Retrieval-augmented Generation (RAG)

The RAG scenario has a very long prompt and a short response. This scenario also mimics summarization use cases.

  • The prompt length is fixed to 2,000 tokens.
  • The response length is fixed to 200 tokens.

Scenario 3: Generation Heavy

This scenario is for generation-heavy and model-response-heavy use cases, for example, a long job description generated from a short bullet list of items. For this case, we set the following token lengths:

  • The prompt length is fixed to 100 tokens.
  • The response length is fixed to 1,000 tokens.

Scenario 4: Chatbot

This scenario covers chatbot / dialog use cases where the prompt and responses are shorter.

  • The prompt length is fixed to 100 tokens.
  • The response length is fixed to 100 tokens.

Text Embedding Scenarios


Scenario 5: Embeddings

Scenario 5 is only applicable to the embedding models. This scenario mimics embedding generation as part of the data ingestion pipeline of a vector database.

In this scenario, all requests are the same size: 96 documents, each with 512 tokens. An example is a collection of large PDF files, each with more than 30,000 words, that a user wants to ingest into a vector database.
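
To make the request shape concrete, the following sketch groups pre-tokenized documents into scenario 5-style embedding requests of 96 documents with 512 tokens each. The function name and the token count assumed for a 30,000+ word file are illustrative only, not part of the OCI Generative AI API.

```python
def build_embedding_requests(token_ids: list[int],
                             tokens_per_doc: int = 512,
                             docs_per_request: int = 96) -> list[list[list[int]]]:
    """Split a token stream into 512-token documents, then group the documents
    into requests of 96 documents each, mirroring the scenario 5 request shape."""
    docs = [token_ids[i:i + tokens_per_doc]
            for i in range(0, len(token_ids), tokens_per_doc)]
    return [docs[i:i + docs_per_request]
            for i in range(0, len(docs), docs_per_request)]

# A 30,000+ word file tokenizing to roughly 40,000 tokens (an assumption)
# yields 79 documents (the last one shorter than 512 tokens), a bit under
# one full 96-document request.
requests = build_embedding_requests(list(range(40_000)))
print(len(requests), [len(r) for r in requests])  # 1 [79]
```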

Scenario 6: Lighter Embeddings Workload

The lighter embeddings scenario is similar to scenario 5, except that the size of each request is reduced to 16 documents, each with 512 tokens. Scenario 6 represents smaller files with fewer words.