Scenario 6: Lighter Embeddings Workload Benchmarks in Generative AI

The lighter embeddings scenario is similar to the text embeddings scenario (scenario 5), except that each request is reduced to 16 documents of 512 tokens each. Scenario 6 is therefore suited to smaller files and documents with fewer words, as illustrated by the request sketch below.
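For concreteness, a single scenario 6 request could be built as follows. This is a minimal sketch using the OCI Python SDK's generative_ai_inference client, assuming a dedicated AI cluster endpoint in the Chicago region; the OCIDs are placeholders, and the 16-document batch is stubbed with filler text rather than real 512-token documents.

```python
# A sketch of one scenario 6 request: 16 documents, roughly 512 tokens each.
# Assumes the OCI Python SDK (pip install oci); OCIDs below are placeholders.
import oci

config = oci.config.from_file()  # reads ~/.oci/config
client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config,
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

# Stand-in documents; a real workload would send 16 distinct ~512-token texts.
documents = ["lorem ipsum " * 256] * 16

details = oci.generative_ai_inference.models.EmbedTextDetails(
    inputs=documents,
    compartment_id="ocid1.compartment.oc1..<placeholder>",
    serving_mode=oci.generative_ai_inference.models.DedicatedServingMode(
        endpoint_id="ocid1.generativeaiendpoint.oc1..<placeholder>"
    ),
)

response = client.embed_text(details)
print(len(response.data.embeddings))  # expect 16 embedding vectors back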

Review the terms used in the dedicated AI cluster hosting benchmarks. For a list of scenarios and their descriptions, see Text Embedding Scenarios. The lighter embeddings scenario is benchmarked in the following region.

US Midwest (Chicago)

Model: cohere.embed-english-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
| Concurrency | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|
| 1 | 1.19 | 54 |
| 8 | 1.41 | 348 |
| 32 | 3.47 | 600 |
| 128 | 12.08 | 558 |
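As a rough sanity check, the latency and throughput columns should be related by throughput ≈ concurrency ÷ latency (Little's law). The sketch below recomputes the expected RPM from the cohere.embed-english-v3.0 rows above; the shortfall at concurrency 128 suggests the cluster unit saturates and requests queue at that level.

```python
# Rough consistency check: expected RPM ≈ concurrency / latency_seconds * 60.
# Figures are the cohere.embed-english-v3.0 rows from the table above.
for concurrency, latency_s, measured_rpm in [(1, 1.19, 54), (8, 1.41, 348),
                                             (32, 3.47, 600), (128, 12.08, 558)]:
    expected_rpm = concurrency / latency_s * 60
    print(f"c={concurrency:>3}: expected ~{expected_rpm:.0f} RPM, measured {measured_rpm}")
```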
Model: cohere.embed-english-light-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
| Concurrency | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|
| 1 | 0.85 | 48 |
| 8 | 1.15 | 354 |
| 32 | 3.15 | 594 |
| 128 | 11.26 | 846 |
Model: cohere.embed-multilingual-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
| Concurrency | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|
| 1 | 1.28 | 42 |
| 8 | 1.38 | 288 |
| 32 | 3.44 | 497 |
| 128 | 11.94 | 702 |
Model: cohere.embed-multilingual-light-v3.0 hosted on one Embed Cohere unit of a dedicated AI cluster
| Concurrency | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|
| 1 | 1.03 | 54 |
| 8 | 1.35 | 300 |
| 32 | 3.11 | 570 |
| 128 | 11.50 | 888 |
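To reproduce measurements like these against your own endpoint, a fixed-concurrency thread-pool harness is one straightforward approach. The following is a sketch, not the harness used for the tables above; send_request is a hypothetical stand-in for the embed_text call shown earlier and must be filled in before the numbers mean anything.

```python
# A sketch of a fixed-concurrency benchmark loop (not the harness used for
# the tables above). send_request() stands in for the embed_text call.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def send_request() -> float:
    """Issue one 16-document embed request and return its latency in seconds."""
    start = time.perf_counter()
    # ... call client.embed_text(details) here ...
    return time.perf_counter() - start

def run_level(concurrency: int, total_requests: int = 200) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: send_request(), range(total_requests)))
    elapsed = time.perf_counter() - start
    rpm = total_requests / elapsed * 60
    print(f"c={concurrency}: mean latency {mean(latencies):.2f}s, {rpm:.0f} RPM")

for level in (1, 8, 32, 128):
    run_level(level)
```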