Scenario 2: Retrieval-Augmented Generation (RAG) Benchmarks in Generative AI

The retrieval-augmented generation (RAG) scenario uses a very long prompt and a short response, so it also mimics summarization use cases.

  • The prompt length is fixed at 2,000 tokens.
  • The response length is fixed at 200 tokens.
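
As a rough illustration of how a request in this scenario might be driven, the sketch below builds a fixed-size request (about 2,000 prompt tokens, up to 200 response tokens) and times it. The call_model function and the word-count approximation of token length are assumptions made for illustration, not the client or tokenizer used to produce these benchmarks.

```python
import time

# Hypothetical placeholder for the real inference client; swap in your SDK call.
def call_model(prompt: str, max_tokens: int) -> str:
    time.sleep(0.5)                      # simulate network + generation time
    return "token " * max_tokens

def build_rag_prompt(target_tokens: int = 2000) -> str:
    # Crude approximation: treat one word as roughly one token.
    # A real RAG benchmark would pad the prompt with retrieved document chunks.
    return " ".join(["context"] * target_tokens)

def timed_request() -> float:
    prompt = build_rag_prompt(2000)      # prompt fixed at ~2,000 tokens
    start = time.perf_counter()
    call_model(prompt, max_tokens=200)   # response capped at 200 tokens
    return time.perf_counter() - start   # request-level latency in seconds

print(f"request latency: {timed_request():.2f} s")
```
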
Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it hosts. A traffic scenario is defined by:

  1. The number of concurrent requests.
  2. The number of tokens in the prompt.
  3. The number of tokens in the response.
  4. The variance of (2) and (3) across requests.
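
These factors interact: as concurrency rises, per-request inference speed falls while aggregate throughput climbs until the cluster saturates. As a rough sanity check on the tables below, the aggregate metrics can be approximated from the per-request ones. The sketch that follows encodes these back-of-the-envelope relationships; they are approximations for reading the tables, not the formulas used to produce the benchmark numbers.

```python
def approx_token_throughput(concurrency: int, inference_speed: float) -> float:
    # Aggregate tokens/second is roughly concurrent requests x per-request speed.
    return concurrency * inference_speed

def approx_rpm(concurrency: int, latency_s: float) -> float:
    # Requests per minute is roughly concurrency x (60 s / seconds per request).
    return concurrency * 60.0 / latency_s

# Example: the Frankfurt Meta Llama 3 row at concurrency 2 reports
# 90.14 tokens/second and 26.42 RPM.
print(approx_token_throughput(2, 45.51))  # ~91.0 tokens/second
print(approx_rpm(2, 4.50))                # ~26.7 RPM
```

The approximations track the measured numbers closely at low concurrency and drift apart as the cluster saturates, which is the degradation these benchmarks are meant to expose.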

Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The retrieval-augmented generation scenario is benchmarked in the following regions.

Germany Central (Frankfurt)

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 47.78 | 47.82 | 4.28 | 14.02 |
| 2 | 45.51 | 90.14 | 4.50 | 26.42 |
| 4 | 42.24 | 164.92 | 4.81 | 48.51 |
| 8 | 37.44 | 289.82 | 5.48 | 85.13 |
| 16 | 28.00 | 421.00 | 7.19 | 123.72 |
| 32 | 18.73 | 542.99 | 10.65 | 159.56 |
| 64 | 11.63 | 668.78 | 16.17 | 196.44 |
| 128 | 6.20 | 700.83 | 32.89 | 205.70 |
| 256 | 3.97 | 756.00 | 54.71 | 222.02 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |
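
One way to read these tables is to fix a request-level latency budget and pick the highest concurrency that stays within it. The sketch below does this for the Frankfurt Meta Llama 3 rows above; the 10-second budget is an arbitrary example chosen for illustration, not a recommendation.

```python
# (concurrency, request-level latency in seconds, request-level throughput in RPM)
# taken from the Frankfurt Meta Llama 3 table above.
LLAMA3_FRANKFURT = [
    (1, 4.28, 14.02), (2, 4.50, 26.42), (4, 4.81, 48.51),
    (8, 5.48, 85.13), (16, 7.19, 123.72), (32, 10.65, 159.56),
    (64, 16.17, 196.44), (128, 32.89, 205.70), (256, 54.71, 222.02),
]

def best_under_budget(rows, latency_budget_s):
    # Highest-throughput row whose request-level latency stays within the budget.
    candidates = [row for row in rows if row[1] <= latency_budget_s]
    return max(candidates, key=lambda row: row[2]) if candidates else None

print(best_under_budget(LLAMA3_FRANKFURT, 10.0))  # -> (16, 7.19, 123.72)
```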

US Midwest (Chicago)

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.84 | 28.82 | 7.11 | 8.44 |
| 2 | 26.52 | 52.69 | 7.66 | 15.51 |
| 4 | 24.23 | 94.86 | 8.38 | 27.92 |
| 8 | 20.01 | 155.97 | 10.21 | 45.76 |
| 16 | 14.34 | 216.26 | 14.12 | 63.43 |
| 32 | 9.33 | 275.28 | 21.30 | 80.89 |
| 64 | 5.68 | 334.46 | 32.55 | 98.11 |
| 128 | 3.13 | 364.18 | 64.59 | 106.94 |
| 256 | 1.59 | 359.21 | 128.67 | 105.44 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 49.33 | 47.66 | 4.14 | 14.24 |
| 2 | 45.65 | 86.90 | 4.50 | 26.04 |
| 4 | 40.32 | 152.10 | 5.09 | 45.51 |
| 8 | 30.69 | 235.78 | 6.57 | 70.43 |
| 16 | 24.60 | 310.44 | 9.74 | 93.07 |
| 32 | 9.95 | 307.32 | 18.21 | 91.81 |
| 64 | 5.43 | 297.06 | 31.41 | 89.08 |
| 128 | 4.44 | 313.47 | 44.90 | 93.89 |
| 256 | 2.36 | 312.97 | 85.35 | 93.53 |

Model: cohere.command (Cohere Command 52 B) model hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 33.13 | 25.28 | 6.68 | 8.62 |
| 8 | 23.24 | 90.64 | 13.29 | 29.84 |
| 32 | 13.03 | 163.48 | 26.56 | 54.21 |
| 128 | 5.60 | 186.31 | 65.30 | 61.32 |

Model: cohere.command-light (Cohere Command Light 6 B) model hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 56.71 | 50.88 | 3.14 | 17.61 |
| 8 | 24.70 | 148.42 | 6.15 | 53.93 |
| 32 | 11.06 | 235.31 | 13.37 | 85.14 |
| 128 | 3.40 | 280.3 | 31.64 | 105.77 |
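
Across all of these tables the same pattern holds: token-level throughput grows sublinearly with concurrency and eventually flattens, while request-level latency keeps rising. A quick way to spot the saturation point is to look at the marginal throughput gained by each doubling of concurrency; the sketch below does that for the Cohere Command R rows above, with a 5% cut-off chosen arbitrarily for illustration.

```python
# (concurrency, token-level throughput in tokens/second)
# taken from the Cohere Command R tables above.
COMMAND_R = [
    (1, 47.66), (2, 86.90), (4, 152.10), (8, 235.78), (16, 310.44),
    (32, 307.32), (64, 297.06), (128, 313.47), (256, 312.97),
]

def saturation_point(rows, min_gain=0.05):
    # First concurrency level where doubling concurrency no longer adds
    # at least `min_gain` (5%) more aggregate token throughput.
    for (c_prev, t_prev), (_, t_next) in zip(rows, rows[1:]):
        if t_next < t_prev * (1 + min_gain):
            return c_prev
    return rows[-1][0]

print(saturation_point(COMMAND_R))  # -> 16: throughput flattens near ~310 tokens/s
```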