Scenario 1: Stochastic Length Benchmarks in Generative AI

This scenario mimics text generation use cases in which the sizes of the prompt and the response are unknown ahead of time. Because these lengths are unknown, the benchmark uses a stochastic approach in which both the prompt length and the response length follow a normal distribution:

The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
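
As an illustration of the two distributions above, the following minimal sketch draws prompt and response lengths for a synthetic workload. Only the means and standard deviations come from this page. The function names, the rounding to whole tokens, and the floor of one token (a normal distribution can produce zero or negative values) are assumptions made for the example, not a description of how the benchmark itself was run.

```python
import random

# Parameters stated in the scenario description.
PROMPT_MEAN, PROMPT_STD = 480, 240
RESPONSE_MEAN, RESPONSE_STD = 300, 150


def sample_length(rng, mean, std, minimum=1):
    """Draw one token count from N(mean, std), rounded to an integer.

    Assumption: draws below `minimum` are raised to `minimum`, because a
    token count cannot be zero or negative.
    """
    return max(minimum, round(rng.gauss(mean, std)))


def sample_requests(n, seed=0):
    """Return n (prompt_tokens, response_tokens) pairs."""
    rng = random.Random(seed)
    return [
        (
            sample_length(rng, PROMPT_MEAN, PROMPT_STD),
            sample_length(rng, RESPONSE_MEAN, RESPONSE_STD),
        )
        for _ in range(n)
    ]


requests = sample_requests(10_000)
mean_prompt = sum(p for p, _ in requests) / len(requests)
mean_response = sum(r for _, r in requests) / len(requests)
print(f"mean prompt tokens:   {mean_prompt:.1f}")
print(f"mean response tokens: {mean_response:.1f}")
```

Because of the floor at one token, the sample means land slightly above 480 and 300.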

Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on:

1. The number of concurrent requests.
2. The number of tokens in the prompt.
3. The number of tokens in the response.
4. The variance of (2) and (3) across requests.

Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The stochastic length scenario is performed in the following regions.

Brazil East (Sao Paulo)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	143.82	142.16	3.89	15.07
2	141.16	276.64	4.28	27.37
4	136.15	517.89	4.98	45.85
8	121.71	858.28	4.97	84.62
16	105.84	1,243.61	5.53	122.45
32	88.15	2,126.25	6.53	210.29
64	67.40	3,398.12	8.63	319.28
128	45.86	4,499.76	13.96	427.76
256	24.14	4,784.32	25.79	453.83
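
One way to read a table like the one above is to pick the concurrency that gives the most requests per minute while staying under a latency target. The sketch below does that for the cohere.command-r-08-2024 rows. The 10-second budget and the helper names are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional

# Rows copied from the table above (columns in the same order).
ROWS = """
1 143.82 142.16 3.89 15.07
2 141.16 276.64 4.28 27.37
4 136.15 517.89 4.98 45.85
8 121.71 858.28 4.97 84.62
16 105.84 1,243.61 5.53 122.45
32 88.15 2,126.25 6.53 210.29
64 67.40 3,398.12 8.63 319.28
128 45.86 4,499.76 13.96 427.76
256 24.14 4,784.32 25.79 453.83
"""


class Row(NamedTuple):
    concurrency: int
    inference_speed: float   # token/second
    token_throughput: float  # token/second
    latency_s: float         # second
    rpm: float               # requests per minute


def parse_rows(text: str) -> list:
    """Parse whitespace-separated rows, dropping thousands separators."""
    rows = []
    for line in text.strip().splitlines():
        c, speed, tput, lat, rpm = (f.replace(",", "") for f in line.split())
        rows.append(Row(int(c), float(speed), float(tput), float(lat), float(rpm)))
    return rows


def best_within_latency(rows, max_latency_s: float) -> Optional[Row]:
    """Return the row with the highest RPM whose latency fits the budget."""
    candidates = [r for r in rows if r.latency_s <= max_latency_s]
    return max(candidates, key=lambda r: r.rpm) if candidates else None


rows = parse_rows(ROWS)
choice = best_within_latency(rows, max_latency_s=10.0)  # hypothetical budget
print(choice)
```

The same helpers work for any other table on this page by replacing the contents of ROWS.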

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	119.49	118.18	4.50	13.08
2	115.14	225.40	4.90	23.69
4	109.71	404.66	4.63	48.83
8	95.83	702.76	5.03	85.92
16	81.12	1,029.98	6.07	125.54
32	70.92	1,819.24	7.02	182.65
64	52.10	2,778.58	8.79	313.12
128	35.58	3,566.59	13.80	438.64
256	20.75	4,065.93	24.69	481.11

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	53.62	52.71	5.43	10.97
2	52.65	102.99	5.48	21.65
4	52.06	205.56	5.58	42.61
8	51.06	393.93	5.68	82.31
16	46.755	715.89	6.08	152.11
32	39.55	1,152.97	7.80	228.80
64	31.22	1,663.88	9.36	353.91
128	23.00	2,055.51	13.94	433.91
256	17.44	1,873.44	22.85	427.95

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	48.75	47.98	6.37	9.40
2	47.28	92.89	6.63	18.00
4	45.10	176.53	6.65	35.80
8	42.53	333.45	7.04	67.80
16	38.39	597.84	7.95	119.70
32	29.86	929.18	10.12	187.40
64	30.00	933.09	20.11	187.20
128	30.03	934.30	39.85	186.00
256	30.05	932.61	76.19	187.79
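
In the table above, request-level throughput stays near 187 RPM from a concurrency of 32 upward while latency roughly doubles with each step. A rough way to check that the latency and RPM columns are mutually consistent is Little's Law, which relates the average number of requests in flight to arrival rate and time in the system. The sketch below applies it under the assumption that the benchmark keeps a fixed number of requests in flight at each concurrency level. This page does not describe the methodology, so treat the result as a sanity check rather than a definition of the metrics.

```python
# (concurrency, request-level latency in seconds, requests per minute),
# copied from the table above.
ROWS = [
    (1, 6.37, 9.40),
    (2, 6.63, 18.00),
    (4, 6.65, 35.80),
    (8, 7.04, 67.80),
    (16, 7.95, 119.70),
    (32, 10.12, 187.40),
    (64, 20.11, 187.20),
    (128, 39.85, 186.00),
    (256, 76.19, 187.79),
]


def implied_concurrency(latency_s: float, rpm: float) -> float:
    """Little's Law: requests in flight = (requests per second) * latency."""
    return rpm / 60.0 * latency_s


ratios = {}
for concurrency, latency_s, rpm in ROWS:
    implied = implied_concurrency(latency_s, rpm)
    ratios[concurrency] = implied / concurrency
    print(f"{concurrency:>3}  implied={implied:7.2f}  ratio={ratios[concurrency]:.2f}")
```

For these rows, the implied values stay within about 10% of the configured concurrency.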

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	105.74	104.30	2.75	21.70
2	103.21	204.22	2.82	42.40
4	99.41	393.69	3.10	77.10
8	93.98	745.29	3.26	146.70
16	81.62	1,294.14	3.64	262.60
32	60.55	1,924.74	4.97	384.40
64	60.54	1,928.70	10.03	379.40
128	62.57	1,912.53	19.68	383.09
256	60.00	1,911.45	38.36	386.14

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster

Important

You can host the meta.llama-3.1-405b-instruct model only on a dedicated AI cluster of type Large Generic 2. This type is intended to provide better throughput with less hardware and a lower cost than its predecessor, Large Generic 4.

The following tables provide benchmarks that were performed for the meta.llama-3.1-405b-instruct model hosted on one Large Generic 2 unit and on one Large Generic 4 unit. If your model is currently hosted on a Large Generic 4 unit, compare the following tables to decide whether to host the model on this new unit.


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	27.44	26.84	11.66	5.10
2	26.56	51.93	11.44	10.39
4	25.66	100.31	11.97	19.89
8	24.98	193.34	11.96	39.48
16	20.73	322.99	14.86	63.76
32	18.39	562.55	16.50	114.21
64	15.05	877.61	20.42	180.76
128	10.79	1,210.61	29.53	241.73
256	8.67	1,301.65	47.22	282.78

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	32.66	25.79	10.78	5.56
2	31.36	50.81	10.06	11.68
4	29.86	96.01	10.87	21.52
8	27.89	170.45	10.87	34.09
16	24.74	282.52	13.51	60.35
32	21.51	457.24	16.73	91.42
64	17.68	676.90	18.29	152.47
128	13.06	1,035.08	25.59	222.67
256	7.82	1,302.71	41.88	289.08
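
To compare the Large Generic 2 and Large Generic 4 tables above side by side, a small script can compute the ratio of token-level throughput at each concurrency level. The sketch below uses only the throughput and latency columns from the two tables. The variable and function names are made up for the example.

```python
# concurrency -> (token-level throughput in token/second, latency in seconds),
# copied from the two meta.llama-3.1-405b-instruct tables above.
LARGE_GENERIC_2 = {
    1: (26.84, 11.66), 2: (51.93, 11.44), 4: (100.31, 11.97),
    8: (193.34, 11.96), 16: (322.99, 14.86), 32: (562.55, 16.50),
    64: (877.61, 20.42), 128: (1210.61, 29.53), 256: (1301.65, 47.22),
}
LARGE_GENERIC_4 = {
    1: (25.79, 10.78), 2: (50.81, 10.06), 4: (96.01, 10.87),
    8: (170.45, 10.87), 16: (282.52, 13.51), 32: (457.24, 16.73),
    64: (676.90, 18.29), 128: (1035.08, 25.59), 256: (1302.71, 41.88),
}


def throughput_ratio(new: dict, old: dict) -> dict:
    """Return new/old token-level throughput for each shared concurrency."""
    return {c: new[c][0] / old[c][0] for c in sorted(new) if c in old}


ratios = throughput_ratio(LARGE_GENERIC_2, LARGE_GENERIC_4)
for concurrency, ratio in ratios.items():
    latency_delta = LARGE_GENERIC_2[concurrency][1] - LARGE_GENERIC_4[concurrency][1]
    print(f"{concurrency:>3}  throughput x{ratio:.2f}  latency {latency_delta:+.2f} s")
```

A ratio above 1 means Large Generic 2 delivered more tokens per second in this benchmark at that concurrency. The latency difference is printed alongside because the two metrics do not always move together.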

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	95.50	51.58	6.12	9.78
2	92.25	98.89	6.44	18.53
4	90.51	184.54	7.37	30.67
8	83.38	326.71	7.64	57.06
16	71.45	509.03	8.77	90.02
32	58.48	724.23	10.00	138.82
64	44.74	1,146.92	14.07	206.58
128	27.00	1,434.57	22.48	268.58
256	18.03	1,635.95	41.06	309.97

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	49.76	49.58	6.42	9.33
2	48.04	95.38	6.80	17.53
4	46.09	181.21	6.99	33.60
8	44.19	330.46	7.43	60.67
16	40.56	591.52	8.40	104.42
32	31.35	869.36	9.68	168.46
64	23.87	1,062.52	12.57	201.11
128	16.86	1,452.66	17.64	276.09
256	9.84	1,792.81	30.08	347.26

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	51.30	50.46	4.63	12.75
2	51.06	97.86	5.07	23.14
4	47.52	186.75	5.30	44.48
8	43.55	305.45	5.68	75.18
16	36.49	505.11	6.71	127.88
32	29.02	768.40	8.84	177.03
64	18.57	735.37	14.55	168.00
128	12.59	809.50	21.27	186.76
256	6.54	859.45	38.69	200.42

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	122.46	101.28	4.31	13.21
2	114.38	177.67	5.70	17.78
4	107.48	367.88	5.09	45.22
8	95.32	644.56	7.23	62.61
16	82.42	1,036.84	7.91	62.61
32	66.46	1,529.28	10.12	145.82
64	45.70	1,924.84	12.43	206.26
128	33.96	2,546.35	18.22	272.53
256	23.86	2,914.77	30.75	298.88

Germany Central (Frankfurt)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	143.82	142.16	3.89	15.07
2	141.16	276.64	4.28	27.37
4	136.15	517.89	4.98	45.85
8	121.71	858.28	4.97	84.62
16	105.84	1,243.61	5.53	122.45
32	88.15	2,126.25	6.53	210.29
64	67.40	3,398.12	8.63	319.28
128	45.86	4,499.76	13.96	427.76
256	24.14	4,784.32	25.79	453.83

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	119.49	118.18	4.50	13.08
2	115.14	225.40	4.90	23.69
4	109.71	404.66	4.63	48.83
8	95.83	702.76	5.03	85.92
16	81.12	1,029.98	6.07	125.54
32	70.92	1,819.24	7.02	182.65
64	52.10	2,778.58	8.79	313.12
128	35.58	3,566.59	13.80	438.64
256	20.75	4,065.93	24.69	481.11

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	53.62	52.71	5.43	10.97
2	52.65	102.99	5.48	21.65
4	52.06	205.56	5.58	42.61
8	51.06	393.93	5.68	82.31
16	46.755	715.89	6.08	152.11
32	39.55	1,152.97	7.80	228.80
64	31.22	1,663.88	9.36	353.91
128	23.00	2,055.51	13.94	433.91
256	17.44	1,873.44	22.85	427.95

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	27.44	26.84	11.66	5.10
2	26.56	51.93	11.44	10.39
4	25.66	100.31	11.97	19.89
8	24.98	193.34	11.96	39.48
16	20.73	322.99	14.86	63.76
32	18.39	562.55	16.50	114.21
64	15.05	877.61	20.42	180.76
128	10.79	1,210.61	29.53	241.73
256	8.67	1,301.65	47.22	282.78

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	32.66	25.79	10.78	5.56
2	31.36	50.81	10.06	11.68
4	29.86	96.01	10.87	21.52
8	27.89	170.45	10.87	34.09
16	24.74	282.52	13.51	60.35
32	21.51	457.24	16.73	91.42
64	17.68	676.90	18.29	152.47
128	13.06	1,035.08	25.59	222.67
256	7.82	1,302.71	41.88	289.08

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	95.50	51.58	6.12	9.78
2	92.25	98.89	6.44	18.53
4	90.51	184.54	7.37	30.67
8	83.38	326.71	7.64	57.06
16	71.45	509.03	8.77	90.02
32	58.48	724.23	10.00	138.82
64	44.74	1,146.92	14.07	206.58
128	27.00	1,434.57	22.48	268.58
256	18.03	1,635.95	41.06	309.97

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	49.76	49.58	6.42	9.33
2	48.04	95.38	6.80	17.53
4	46.09	181.21	6.99	33.60
8	44.19	330.46	7.43	60.67
16	40.56	591.52	8.40	104.42
32	31.35	869.36	9.68	168.46
64	23.87	1,062.52	12.57	201.11
128	16.86	1,452.66	17.64	276.09
256	9.84	1,792.81	30.08	347.26

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	51.30	50.46	4.63	12.75
2	51.06	97.86	5.07	23.14
4	47.52	186.75	5.30	44.48
8	43.55	305.45	5.68	75.18
16	36.49	505.11	6.71	127.88
32	29.02	768.40	8.84	177.03
64	18.57	735.37	14.55	168.00
128	12.59	809.50	21.27	186.76
256	6.54	859.45	38.69	200.42

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	122.46	101.28	4.31	13.21
2	114.38	177.67	5.70	17.78
4	107.48	367.88	5.09	45.22
8	95.32	644.56	7.23	62.61
16	82.42	1,036.84	7.91	62.61
32	66.46	1,529.28	10.12	145.82
64	45.70	1,924.84	12.43	206.26
128	33.96	2,546.35	18.22	272.53
256	23.86	2,914.77	30.75	298.88

Japan Central (Osaka)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	143.82	142.16	3.89	15.07
2	141.16	276.64	4.28	27.37
4	136.15	517.89	4.98	45.85
8	121.71	858.28	4.97	84.62
16	105.84	1,243.61	5.53	122.45
32	88.15	2,126.25	6.53	210.29
64	67.40	3,398.12	8.63	319.28
128	45.86	4,499.76	13.96	427.76
256	24.14	4,784.32	25.79	453.83

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	119.49	118.18	4.50	13.08
2	115.14	225.40	4.90	23.69
4	109.71	404.66	4.63	48.83
8	95.83	702.76	5.03	85.92
16	81.12	1,029.98	6.07	125.54
32	70.92	1,819.24	7.02	182.65
64	52.10	2,778.58	8.79	313.12
128	35.58	3,566.59	13.80	438.64
256	20.75	4,065.93	24.69	481.11

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	53.62	52.71	5.43	10.97
2	52.65	102.99	5.48	21.65
4	52.06	205.56	5.58	42.61
8	51.06	393.93	5.68	82.31
16	46.755	715.89	6.08	152.11
32	39.55	1,152.97	7.80	228.80
64	31.22	1,663.88	9.36	353.91
128	23.00	2,055.51	13.94	433.91
256	17.44	1,873.44	22.85	427.95

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	48.75	47.98	6.37	9.40
2	47.28	92.89	6.63	18.00
4	45.10	176.53	6.65	35.80
8	42.53	333.45	7.04	67.80
16	38.39	597.84	7.95	119.70
32	29.86	929.18	10.12	187.40
64	30.00	933.09	20.11	187.20
128	30.03	934.30	39.85	186.00
256	30.05	932.61	76.19	187.79

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	105.74	104.30	2.75	21.70
2	103.21	204.22	2.82	42.40
4	99.41	393.69	3.10	77.10
8	93.98	745.29	3.26	146.70
16	81.62	1,294.14	3.64	262.60
32	60.55	1,924.74	4.97	384.40
64	60.54	1,928.70	10.03	379.40
128	62.57	1,912.53	19.68	383.09
256	60.00	1,911.45	38.36	386.14

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	27.44	26.84	11.66	5.10
2	26.56	51.93	11.44	10.39
4	25.66	100.31	11.97	19.89
8	24.98	193.34	11.96	39.48
16	20.73	322.99	14.86	63.76
32	18.39	562.55	16.50	114.21
64	15.05	877.61	20.42	180.76
128	10.79	1,210.61	29.53	241.73
256	8.67	1,301.65	47.22	282.78

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	32.66	25.79	10.78	5.56
2	31.36	50.81	10.06	11.68
4	29.86	96.01	10.87	21.52
8	27.89	170.45	10.87	34.09
16	24.74	282.52	13.51	60.35
32	21.51	457.24	16.73	91.42
64	17.68	676.90	18.29	152.47
128	13.06	1,035.08	25.59	222.67
256	7.82	1,302.71	41.88	289.08

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	95.50	51.58	6.12	9.78
2	92.25	98.89	6.44	18.53
4	90.51	184.54	7.37	30.67
8	83.38	326.71	7.64	57.06
16	71.45	509.03	8.77	90.02
32	58.48	724.23	10.00	138.82
64	44.74	1,146.92	14.07	206.58
128	27.00	1,434.57	22.48	268.58
256	18.03	1,635.95	41.06	309.97

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	51.30	50.46	4.63	12.75
2	51.06	97.86	5.07	23.14
4	47.52	186.75	5.30	44.48
8	43.55	305.45	5.68	75.18
16	36.49	505.11	6.71	127.88
32	29.02	768.40	8.84	177.03
64	18.57	735.37	14.55	168.00
128	12.59	809.50	21.27	186.76
256	6.54	859.45	38.69	200.42

UAE East (Dubai)

Model: cohere.command-r-08-2024-tp4 (Cohere Command R 08-2024) model hosted on one SMALL_COHERE_4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	81.02	63.73	2.42	24.75
2	78.13	124.94	2.60	46.00
4	73.31	229.25	2.51	93.51
8	60.98	388.68	2.98	154.25
16	49.82	633.95	3.58	252.82
32	40.24	894.11	4.18	379.33
64	25.35	1,137.97	5.72	553.41
128	14.20	721.99	12.99	443.50
256	14.91	1,117.35	16.16	581.66

UK South (London)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	143.82	142.16	3.89	15.07
2	141.16	276.64	4.28	27.37
4	136.15	517.89	4.98	45.85
8	121.71	858.28	4.97	84.62
16	105.84	1,243.61	5.53	122.45
32	88.15	2,126.25	6.53	210.29
64	67.40	3,398.12	8.63	319.28
128	45.86	4,499.76	13.96	427.76
256	24.14	4,784.32	25.79	453.83

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	119.49	118.18	4.50	13.08
2	115.14	225.40	4.90	23.69
4	109.71	404.66	4.63	48.83
8	95.83	702.76	5.03	85.92
16	81.12	1,029.98	6.07	125.54
32	70.92	1,819.24	7.02	182.65
64	52.10	2,778.58	8.79	313.12
128	35.58	3,566.59	13.80	438.64
256	20.75	4,065.93	24.69	481.11

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	53.62	52.71	5.43	10.97
2	52.65	102.99	5.48	21.65
4	52.06	205.56	5.58	42.61
8	51.06	393.93	5.68	82.31
16	46.755	715.89	6.08	152.11
32	39.55	1,152.97	7.80	228.80
64	31.22	1,663.88	9.36	353.91
128	23.00	2,055.51	13.94	433.91
256	17.44	1,873.44	22.85	427.95

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	48.75	47.98	6.37	9.40
2	47.28	92.89	6.63	18.00
4	45.10	176.53	6.65	35.80
8	42.53	333.45	7.04	67.80
16	38.39	597.84	7.95	119.70
32	29.86	929.18	10.12	187.40
64	30.00	933.09	20.11	187.20
128	30.03	934.30	39.85	186.00
256	30.05	932.61	76.19	187.79

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	105.74	104.30	2.75	21.70
2	103.21	204.22	2.82	42.40
4	99.41	393.69	3.10	77.10
8	93.98	745.29	3.26	146.70
16	81.62	1,294.14	3.64	262.60
32	60.55	1,924.74	4.97	384.40
64	60.54	1,928.70	10.03	379.40
128	62.57	1,912.53	19.68	383.09
256	60.00	1,911.45	38.36	386.14

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	27.44	26.84	11.66	5.10
2	26.56	51.93	11.44	10.39
4	25.66	100.31	11.97	19.89
8	24.98	193.34	11.96	39.48
16	20.73	322.99	14.86	63.76
32	18.39	562.55	16.50	114.21
64	15.05	877.61	20.42	180.76
128	10.79	1,210.61	29.53	241.73
256	8.67	1,301.65	47.22	282.78

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	32.66	25.79	10.78	5.56
2	31.36	50.81	10.06	11.68
4	29.86	96.01	10.87	21.52
8	27.89	170.45	10.87	34.09
16	24.74	282.52	13.51	60.35
32	21.51	457.24	16.73	91.42
64	17.68	676.90	18.29	152.47
128	13.06	1,035.08	25.59	222.67
256	7.82	1,302.71	41.88	289.08

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	95.50	51.58	6.12	9.78
2	92.25	98.89	6.44	18.53
4	90.51	184.54	7.37	30.67
8	83.38	326.71	7.64	57.06
16	71.45	509.03	8.77	90.02
32	58.48	724.23	10.00	138.82
64	44.74	1,146.92	14.07	206.58
128	27.00	1,434.57	22.48	268.58
256	18.03	1,635.95	41.06	309.97

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	49.76	49.58	6.42	9.33
2	48.04	95.38	6.80	17.53
4	46.09	181.21	6.99	33.60
8	44.19	330.46	7.43	60.67
16	40.56	591.52	8.40	104.42
32	31.35	869.36	9.68	168.46
64	23.87	1,062.52	12.57	201.11
128	16.86	1,452.66	17.64	276.09
256	9.84	1,792.81	30.08	347.26

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	51.30	50.46	4.63	12.75
2	51.06	97.86	5.07	23.14
4	47.52	186.75	5.30	44.48
8	43.55	305.45	5.68	75.18
16	36.49	505.11	6.71	127.88
32	29.02	768.40	8.84	177.03
64	18.57	735.37	14.55	168.00
128	12.59	809.50	21.27	186.76
256	6.54	859.45	38.69	200.42

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	122.46	101.28	4.31	13.21
2	114.38	177.67	5.70	17.78
4	107.48	367.88	5.09	45.22
8	95.32	644.56	7.23	62.61
16	82.42	1,036.84	7.91	62.61
32	66.46	1,529.28	10.12	145.82
64	45.70	1,924.84	12.43	206.26
128	33.96	2,546.35	18.22	272.53
256	23.86	2,914.77	30.75	298.88

US Midwest (Chicago)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	143.82	142.16	3.89	15.07
2	141.16	276.64	4.28	27.37
4	136.15	517.89	4.98	45.85
8	121.71	858.28	4.97	84.62
16	105.84	1,243.61	5.53	122.45
32	88.15	2,126.25	6.53	210.29
64	67.40	3,398.12	8.63	319.28
128	45.86	4,499.76	13.96	427.76
256	24.14	4,784.32	25.79	453.83

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	119.49	118.18	4.50	13.08
2	115.14	225.40	4.90	23.69
4	109.71	404.66	4.63	48.83
8	95.83	702.76	5.03	85.92
16	81.12	1,029.98	6.07	125.54
32	70.92	1,819.24	7.02	182.65
64	52.10	2,778.58	8.79	313.12
128	35.58	3,566.59	13.80	438.64
256	20.75	4,065.93	24.69	481.11

Model: meta.llama-3.3-70b-instruct (Meta Llama 3.3 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	53.62	52.71	5.43	10.97
2	52.65	102.99	5.48	21.65
4	52.06	205.56	5.58	42.61
8	51.06	393.93	5.68	82.31
16	46.755	715.89	6.08	152.11
32	39.55	1,152.97	7.80	228.80
64	31.22	1,663.88	9.36	353.91
128	23.00	2,055.51	13.94	433.91
256	17.44	1,873.44	22.85	427.95

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	48.75	47.98	6.37	9.40
2	47.28	92.89	6.63	18.00
4	45.10	176.53	6.65	35.80
8	42.53	333.45	7.04	67.80
16	38.39	597.84	7.95	119.70
32	29.86	929.18	10.12	187.40
64	30.00	933.09	20.11	187.20
128	30.03	934.30	39.85	186.00
256	30.05	932.61	76.19	187.79

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	105.74	104.30	2.75	21.70
2	103.21	204.22	2.82	42.40
4	99.41	393.69	3.10	77.10
8	93.98	745.29	3.26	146.70
16	81.62	1,294.14	3.64	262.60
32	60.55	1,924.74	4.97	384.40
64	60.54	1,928.70	10.03	379.40
128	62.57	1,912.53	19.68	383.09
256	60.00	1,911.45	38.36	386.14

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	27.44	26.84	11.66	5.10
2	26.56	51.93	11.44	10.39
4	25.66	100.31	11.97	19.89
8	24.98	193.34	11.96	39.48
16	20.73	322.99	14.86	63.76
32	18.39	562.55	16.50	114.21
64	15.05	877.61	20.42	180.76
128	10.79	1,210.61	29.53	241.73
256	8.67	1,301.65	47.22	282.78

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 (405B)) model hosted on one Large Generic 4 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	32.66	25.79	10.78	5.56
2	31.36	50.81	10.06	11.68
4	29.86	96.01	10.87	21.52
8	27.89	170.45	10.87	34.09
16	24.74	282.52	13.51	60.35
32	21.51	457.24	16.73	91.42
64	17.68	676.90	18.29	152.47
128	13.06	1,035.08	25.59	222.67
256	7.82	1,302.71	41.88	289.08

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 (70B)) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	95.50	51.58	6.12	9.78
2	92.25	98.89	6.44	18.53
4	90.51	184.54	7.37	30.67
8	83.38	326.71	7.64	57.06
16	71.45	509.03	8.77	90.02
32	58.48	724.23	10.00	138.82
64	44.74	1,146.92	14.07	206.58
128	27.00	1,434.57	22.48	268.58
256	18.03	1,635.95	41.06	309.97

Model: meta.llama-3-70b-instruct (Meta Llama 3) model hosted on one Large Generic unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	30.51	30.36	10.47	5.73
2	28.85	57.37	11.09	10.68
4	27.99	108.49	11.13	21.08
8	25.61	196.68	13.27	34.65
16	21.97	318.82	15.36	56.37
32	16.01	428.45	18.55	82.88
64	11.60	563.70	24.31	108.58
128	7.50	650.40	40.64	40.64
256	4.58	927.31	67.42	172.42

Model: cohere.command-r-16k (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	51.30	50.46	4.63	12.75
2	51.06	97.86	5.07	23.14
4	47.52	186.75	5.30	44.48
8	43.55	305.45	5.68	75.18
16	36.49	505.11	6.71	127.88
32	29.02	768.40	8.84	177.03
64	18.57	735.37	14.55	168.00
128	12.59	809.50	21.27	186.76
256	6.54	859.45	38.69	200.42

Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	122.46	101.28	4.31	13.21
2	114.38	177.67	5.70	17.78
4	107.48	367.88	5.09	45.22
8	95.32	644.56	7.23	62.61
16	82.42	1,036.84	7.91	62.61
32	66.46	1,529.28	10.12	145.82
64	45.70	1,924.84	12.43	206.26
128	33.96	2,546.35	18.22	272.53
256	23.86	2,914.77	30.75	298.88

Model: cohere.command (Cohere Command 52 B) model hosted on one Large Cohere unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	36.32	31.29	8.15	7.12
8	30.15	106.03	13.19	23.86
32	23.94	204.41	23.90	45.84
128	14.36	254.54	65.26	56.58

Model: cohere.command-light (Cohere Command Light 6 B) model hosted on one Small Cohere unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	69.17	69.19	3.57	15.69
8	38.75	208.22	6.54	45.08
32	17.98	337.35	13.49	75.50
128	4.01	397.36	37.69	92.17

Model: meta.llama-2-70b-chat (Llama2 70 B) model hosted on one Llama2 70 unit of a dedicated AI cluster


Concurrency	Token-level Inference Speed (token/second)	Token-level Throughput (token/second)	Request-level Latency (second)	Request-level Throughput (Request per minute) (RPM)
1	17.86	17.18	13.60	4.32
8	14.48	68.62	16.63	16.58
32	9.82	174.40	20.78	44.58
128	3.89	319.34	43.87	85.33
