Cohere Command A Reasoning

Released in August 2025, Cohere Command A Reasoning (cohere.command-a-reasoning) is Cohere's flagship 111-billion parameter large language model designed for advanced enterprise reasoning, agentic workflows, and tool use. This model is designed for tasks that demand advanced logical processing, in-depth analysis, and multi-step reasoning, such as comprehensive Q&A, intricate document review, and creating structured arguments. The model supports a 256,000 token context window, making it ideal for large-scale data analysis and agentic workflows.

Regions for this Model

Important

For supported regions, endpoint types (on-demand or dedicated AI clusters), and hosting (OCI Generative AI or external calls) for this model, see the Models by Region page. For details about the regions, see the Generative AI Regions page.

Access this Model

Access this model through the Console, API, and the CLI:

Note

The API endpoints for all supported commercial, sovereign, and government regions are listed in the Management API and Inference API links. You can access each model only through its supported regions.

Key Features

Model Purpose: Unlike general-purpose models, Command A is purpose-built for enterprise scenarios, prioritizing accuracy, reasoning, and security. Primary use cases include:
- Autonomous Agents: Managing complex workflows, acting as a research agent, and interacting with environments.
- Advanced RAG: Deep document analysis, financial report generation, and data extraction with precise citations.
- Multi-turn Chatbots: Maintaining coherence and logical consistency over long, complex conversations.
Context Window: Supports a 256,000-token window with up to 32,000 tokens of output, enabling analysis of extensive documents and maintains context from its long conversation history of 256,000 tokens. For on-demand inferencing, the response length is capped at 4,000 tokens for each run. For the dedicated mode, the response length isn't capped off and the context length is 256,000 tokens.
Agentic Use Cases: Excels at ReAct (Reasoning + Acting) agents, dividing complex, multi-step questions into subgoals, using external tools, taking autonomous actions, and interacting with the environment to solve problems.
Tool Use and RAG: Designed interact with external APIs, and leverage various tools such as search engines, and databases, with built-in support for grounding citations.
Multilingual Support: Can reason natively across 23 languages, including English, Spanish, Chinese, Arabic, and German.
Architecture & Efficiency: Uses a four-layer transformer architecture with hybrid attention (sliding window + global) to handle long context, and can run on one or two GPUs.
Configuration: Users can set reasoning budgets to balance latency, accuracy, and throughput.
Knowledge Cutoff: June 1, 2024

See Cohere's documentation for Command A Reasoning Model and Reasoning Guide.

API Endpoints

Important

The Cohere Command A Reasoning model is compatible only with the version 2 of the OCI Generative AI Chat API for Cohere models. For implementation details, see the CohereChatRequestV2 API and other related endpoints that end in V2. For example, CohereChatResponseV2.

Dedicated AI Cluster for the Model

In the preceding region list, models in regions that aren't marked with (dedicated AI cluster only) have both on-demand and dedicated AI cluster options. For on-demand mode, you don't need clusters and you can reach the model in the Console playground or through the API.

To reach a model through a dedicated AI cluster in any listed region, you must create an endpoint for that model on a dedicated AI cluster. For the cluster unit size that matches this model, see the following table.


Base Model	Fine-Tuning Cluster	Hosting Cluster	Pricing Page Information	Request Cluster Limit Increase
Model Name: Cohere Command A Reasoning OCI Model Name: `cohere.command-a-reasoning`	Not available for fine-tuning	Unit Size: LARGE_COHERE_V2_2 Required Units: 1	Pricing Page Product Name: Large Cohere - Dedicated For Hosting, Multiply the Unit Price: x2	Limit Name: `dedicated-unit-large-cohere-count` For Hosting, Request Limit Increase by: 2
Model Name: Cohere Command A Reasoning (UAE East (Dubai) only) OCI Model Name: `cohere.command-a-reasoning`	Not available for fine-tuning	Unit Size: SMALL_COHERE_4 Required Units: 1	Pricing Page Product Name: Small Cohere - Dedicated For Hosting, Multiply the Unit Price: x4	Limit Name: `dedicated-unit-small-cohere-count` For Hosting, Request Limit Increase by: 4

Tip

If you don't have enough cluster limits in the tenancy for hosting the Cohere Command A Reasoning model on a dedicated AI cluster,
- For the UAE East (Dubai) region, request the dedicated-unit-small-cohere-count limit to increase by 4.
- For all other regions, request the dedicated-unit-large-cohere-count limit to increase by 2.
See Requesting a Service Limit Increase.

Endpoint Rules for Clusters

A dedicated AI cluster can hold up to 50 endpoints.
Use these endpoints to create aliases that all point either to the same base model or to the same version of a custom model, but not both types.
Several endpoints for the same model make it easy to assign them to different users or purposes.


Hosting Cluster Unit Size	Endpoint Rules
LARGE_COHERE_V2_2	Base model: To run the `cohere.command-a-reasoning` model on several endpoints, create as many endpoints as you need on a LARGE_COHERE_V2_2 cluster (unit‑size). Custom model: You can't fine‑tune `cohere.command-a-reasoning`, so you can't create and host custom models built from that base.
SMALL_COHERE_4 (UAE East (Dubai) only)	Base model: To run the `cohere.command-a-reasoning` model on several endpoints in UAE East (Dubai), create as many endpoints as you need on a SMALL_COHERE_4 cluster (unit‑size). Custom model: You can't fine‑tune `cohere.command-a-reasoning`, so you can't create and host custom models built from that base.

Tip

To increase the call volume supported by a hosting cluster, increase its instance count by editing the dedicated AI cluster. See Updating a Dedicated AI Cluster.
For more than 50 endpoints per cluster, request an increase for the limit, endpoint-per-dedicated-unit-count. See Requesting a Service Limit Increase and Service Limits for Generative AI.

OCI Release and Retirement Dates

For release and retirement dates and replacement model options, see the following pages based on the mode (on-demand or dedicated):

Model Parameters

To change the model responses, you can change the values of the following parameters in the playground or the API.

Maximum output tokens

The maximum number of tokens that you want the model to generate for each response. Estimate four characters per token. Because you're prompting a chat model, the response depends on the prompt and each response doesn't necessarily use up the maximum allocated tokens.

Preamble override

An initial context or guiding message for a chat model. When you don't give a preamble to a chat model, the default preamble for that model is used. You can assign a preamble in the Preamble override parameter, for the models. The default preamble for the Cohere family is:

You are Command.
            You are an extremely capable large language model built by Cohere. 
            You are given instructions programmatically via an API
            that you follow to the best of your ability.

Overriding the default preamble is optional. When specified, the preamble override replaces the default Cohere preamble. When adding a preamble, for best results, give the model context, instructions, and a conversation style.

Tip

For chat models without the preamble override parameter, you can include a preamble in the chat conversation and directly ask the model to answer in a certain way.

Safety Mode

Adds a safety instruction for the model to use when generating responses. Options are:

Contextual: (Default) Puts fewer constraints on the output. It maintains core protections by aiming to reject harmful or illegal suggestions, but it allows profanity and some toxic content, sexually explicit and violent content, and content that contains medical, financial, or legal information. Contextual mode is suited for entertainment, creative, or academic use.
Strict: Aims to avoid sensitive topics, such as violent or sexual acts and profanity. This mode aims to provide a safer experience by prohibiting responses or recommendations that it finds inappropriate. Strict mode is suited for corporate use, such as for corporate communications and customer service.
Off: No safety mode is applied.

Temperature

The level of randomness used to generate the output text.

Tip

Start with the temperature set to 0 or less than one, and increase the temperature as you regenerate the prompts for a more creative output. High temperatures can introduce hallucinations and factually incorrect information.

Top p

A sampling method that controls the cumulative probability of the top tokens to consider for the next token. Assign p a decimal number between 0 and 1 for the probability. For example, enter 0.75 for the top 75 percent to be considered. Set p to 1 to consider all tokens.

Top k

A sampling method in which the model chooses the next token randomly from the top k most likely tokens. A high value for k generates more random output, which makes the output text sound more natural. The default value for k is 0 for Cohere Command models and -1 for Meta Llama models, which means that the model should consider all tokens and not use this method.

Frequency penalty

A penalty that's assigned to a token when that token appears often. High penalties encourage fewer repeated tokens and produce a more random output.

For the Meta Llama family models, this penalty can be positive or negative. Positive numbers encourage the model to use new tokens and negative numbers encourage the model to repeat the tokens. Set to 0 to disable.

Presence penalty

A penalty that's assigned to each token when it appears in the output to encourage generating outputs with tokens that haven't been used.

Seed

A parameter that makes a best effort to sample tokens deterministically. When this parameter is assigned a value, the large language model aims to return the same result for repeated requests when you assign the same seed and parameters for the requests.

Allowed values are integers and assigning a large or a small seed value doesn't affect the result. Assigning a number for the seed parameter is similar to tagging the request with a number. The large language model aims to generate the same set of tokens for the same integer in consecutive requests. This feature is especially useful for debugging and testing. The seed parameter has no maximum value for the API, and in the Console, its maximum value is 9999. Leaving the seed value blank in the Console, or null in the API disables this feature.

Warning

The seed parameter might not produce the same result in the long-run, because the model updates in the OCI Generative AI service might invalidate the seed.

API Parameter for Reasoning

thinking

By default the reasoning feature for the cohere.command-a-reasoning model is enabled through the thinking parameter. See CohereThinkingV2

When the thinking parameter is enabled, the model works through complex problems step-by-step, breaking down the problems internally, before providing a final answer. You can control this feature in several ways:

Thinking is enabled by default, but you can disable it. When disabled, the reasoning model functions similar to any other LLM without the internal reasoning step.

token_budget

You can specify a token budget with the token_budget parameter to limit how many thinking tokens the model produces. When the budget is exceeded, the model immediately proceeds with the final response.

When using thinking budgets, Cohere recommends:

Using unlimited thinking when possible
If using a budget, leave at least 1000 tokens for the response
For maximum reasoning, a budget of 31,000 tokens is recommended

See CohereThinkingContentV2 Reference and CohereThinkingV2 Reference in the API documentation and Reasoning Guide in the Cohere documentation.

Oracle Cloud Infrastructure Documentation