Google Gemini 2.5 Flash
The Gemini 2.5 Flash model (`google.gemini-2.5-flash`) is a fast, multimodal reasoning model that offers a balance of price, performance, and a wide range of capabilities, including thinking features. Gemini 2.5 Flash and Gemini 2.5 Flash-Lite are both efficient models: Flash-Lite is optimized for lower cost and faster performance on high-volume, less complex tasks, while Gemini 2.5 Flash offers a balance of speed and intelligence for more complex applications.
Available in These Regions
- US East (Ashburn) (Oracle Interconnect for Google Cloud only; on-demand only)
- US Midwest (Chicago) (on-demand only)
- US West (Phoenix) (on-demand only)
External Calls
The Google Gemini 2.5 models that can be accessed through the OCI Generative AI service are hosted externally by Google. Therefore, a call to a Google Gemini model (through the OCI Generative AI service) results in a call to a Google location.
Key Features
- Model Name in OCI Generative AI: `google.gemini-2.5-flash`
- Available On-Demand: Access this model on-demand, through the Console playground or the API. A basic API call is sketched after the feature table below.
- Multimodal Support: Input text, code, and images and get a text output. Audio and video file inputs are supported through the API only. See Image Understanding, Audio Understanding, and Video Understanding.
- Knowledge: Has deep domain knowledge in science, mathematics, and code.
- Context Length: One million tokens
- Maximum Input Tokens: 1,048,576 (Console and API)
- Maximum Output Tokens: 65,536 (default) (Console and API)
- Excels at These Use Cases: For general-purpose everyday tasks that require a fast, cost-effective model with strong reasoning abilities. For example, for most user-facing applications where a fast, yet intelligent, response is needed.
- Has Reasoning: Yes. Includes text and visual reasoning and image understanding. For reasoning problems, increase the maximum output tokens. See Model Parameters.
- Knowledge Cutoff: January 2025
See the following table for the features supported in the Google Vertex AI Platform for OCI Generative AI, with links to each feature.
Feature | Supported? |
---|---|
Code execution | Yes |
Tuning | No |
System instructions | Yes |
Structured output | Yes |
Batch prediction | No |
Function calling | Yes |
Count Tokens | No |
Thinking | Yes, but turning off the thinking process isn't supported. |
Context caching | Yes, the model can cache the input tokens, but this feature isn't controlled through the API. |
Vertex AI RAG Engine | No |
Chat completions | Yes |
Grounding | No |
For key feature details, see the Google Gemini 2.5 Flash documentation and the Gemini 2.5 Flash model card.
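Because the model is exposed through the standard chat API, a basic on-demand call can be sketched with the OCI Python SDK. This is a minimal sketch, not a definitive implementation: the endpoint region and compartment OCID are placeholders, and the class names (`GenericChatRequest`, `UserMessage`, `TextContent`, `OnDemandServingMode`) should be verified against the current SDK reference.

```python
import oci

# Load the default OCI configuration (~/.oci/config).
config = oci.config.from_file()

# Placeholder endpoint; use the inference endpoint for a listed region, such as Chicago.
client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config,
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

# A single-turn user message in the generic chat format.
chat_request = oci.generative_ai_inference.models.GenericChatRequest(
    messages=[
        oci.generative_ai_inference.models.UserMessage(
            content=[
                oci.generative_ai_inference.models.TextContent(
                    text="Summarize the benefits of a fast multimodal model."
                )
            ]
        )
    ],
)

chat_details = oci.generative_ai_inference.models.ChatDetails(
    compartment_id="ocid1.compartment.oc1..exampleuniqueID",  # placeholder OCID
    serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
        model_id="google.gemini-2.5-flash"
    ),
    chat_request=chat_request,
)

response = client.chat(chat_details)
print(response.data)
```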
Image Understanding
- Image Size
  - Console: Maximum image size: 5 MB
  - API: Maximum images per prompt: 3,000; maximum image size before encoding: 7 MB
- Supported Image Inputs
  - Console: `png` and `jpeg` formats
  - API: In the Chat operation, submit a `base64`-encoded version of an image. For example, a 512 x 512 image typically converts to around 1,610 tokens. Supported MIME types are `image/png`, `image/jpeg`, `image/webp`, `image/heic`, and `image/heif`. For the format, see the ImageContent Reference; a sketch follows this section.
- Technical Details
  - Supports object detection and segmentation. See Image Understanding in the Gemini API documentation.
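As referenced above, here is a minimal sketch of attaching a `base64`-encoded image to a chat message with the OCI Python SDK. The data-URL form passed to `ImageUrl` and the file name `chart.png` are assumptions for illustration; confirm the exact format against the ImageContent Reference.

```python
import base64

import oci

# Read and base64-encode a local image (file name is a placeholder).
with open("chart.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Assumption: the generic chat format accepts the image as a base64 data URL.
image_part = oci.generative_ai_inference.models.ImageContent(
    image_url=oci.generative_ai_inference.models.ImageUrl(
        url=f"data:image/png;base64,{b64_image}"
    )
)
text_part = oci.generative_ai_inference.models.TextContent(
    text="Describe what this image shows."
)

# Combine text and image in one user message; pass it in GenericChatRequest
# exactly as in the earlier chat sketch.
message = oci.generative_ai_inference.models.UserMessage(
    content=[text_part, image_part]
)
```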
Audio Understanding
- Supported Audio Formats
  - Console: not available
  - API: Supported media files are `audio/wav`, `audio/mp3`, `audio/aiff`, `audio/aac`, `audio/ogg`, and `audio/flac`.
- Supported Audio Inputs for the API
  - URL: Convert a supported audio format to a `base64`-encoded version of the audio file.
  - URI: Submit the audio as a Uniform Resource Identifier (URI) so that the model can access the audio without you uploading the file.
- Technical Details
  - Token Conversion: Each second of audio represents 32 tokens, so one minute of audio corresponds to 1,920 tokens (see the quick calculation after this section).
  - Non‑speech Detection: The model can recognize non‑speech components such as bird songs and sirens.
  - Maximum Length: The maximum supported audio length in a single prompt is 9.5 hours. You can submit several files as long as their combined duration stays under 9.5 hours.
  - Downsampling: The model downsamples audio files to a 16 kbps resolution.
  - Channel Merging: If an audio source has several channels, the model merges them into a single channel.
See Audio Understanding in the Gemini API documentation.
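The token-conversion rule above translates directly into a budget check before you submit long recordings. A quick illustrative calculation (the helper name is hypothetical):

```python
TOKENS_PER_SECOND = 32  # audio token-conversion rate stated above

def audio_tokens(duration_seconds: float) -> int:
    """Estimate how many input tokens an audio clip consumes."""
    return int(duration_seconds * TOKENS_PER_SECOND)

print(audio_tokens(60))       # one minute -> 1920 tokens
print(audio_tokens(45 * 60))  # a 45-minute recording -> 86400 tokens
```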
Video Understanding
- Supported Video Formats
  - Console: not available
  - API: Supported media files are `video/mp4`, `video/mpeg`, `video/mov`, `video/avi`, `video/x-flv`, `video/mpg`, `video/webm`, `video/wmv`, and `video/3gpp`.
- Supported Video Inputs for the API
  - URL: Convert a supported video format to a `base64`-encoded version of the video file (see the helper after this section).
  - URI: Submit the video as a Uniform Resource Identifier (URI) so that the model can access the video without you uploading the file.
- Technical Details
  - See Video Understanding in the Gemini API documentation.
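Both the audio and video URL inputs require a `base64`-encoded file. A small standard-library helper for preparing either kind of media (the function name and file name are illustrative):

```python
import base64
import mimetypes

def encode_media(path: str) -> tuple[str, str]:
    """Return (mime_type, base64_payload) for a local media file."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type for {path}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return mime_type, payload

# Example usage with a placeholder file name.
mime, data = encode_media("clip.mp4")  # -> ("video/mp4", "<base64 string>")
```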
Limitations
- Complex prompts
- The Gemini 2.5 Flash model might show limitations in causal understanding, complex logical deduction, and counterfactual reasoning. For complex tasks, we recommend using the Google Gemini 2.5 Pro model.
On-Demand Mode
The Gemini models are available only in the on-demand mode.
Model Name | OCI Model Name | Pricing Page Product Name |
---|---|---|
Gemini 2.5 Flash | `google.gemini-2.5-flash` | Gemini 2.5 Flash |
- You pay as you go for each inference call when you use the models in the playground or when you call the models through the API.
- Low barrier to start using Generative AI.
- Great for experimentation, proof of concept, and model evaluation.
- Available for the pretrained models in regions not listed as (dedicated AI cluster only).
We recommend implementing a back-off strategy, which involves delaying requests after a rejection. Without one, repeated rapid requests can lead to further rejections over time, increased latency, and a potential temporary block of the client by the Generative AI service. A back-off strategy, such as exponential back-off, distributes requests more evenly, reduces load, and improves retry success, following industry best practices and enhancing the overall stability and performance of your integration with the service.
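A minimal sketch of exponential back-off with jitter around the chat call, assuming the `client` and `chat_details` objects from the earlier sketch; the retry condition keys off the HTTP 429 status exposed by `oci.exceptions.ServiceError`:

```python
import random
import time

import oci

def chat_with_backoff(client, chat_details, max_retries=5):
    """Call the chat operation, backing off exponentially when throttled."""
    for attempt in range(max_retries):
        try:
            return client.chat(chat_details)
        except oci.exceptions.ServiceError as e:
            # Retry only on throttling (HTTP 429); re-raise anything else,
            # and give up after the final attempt.
            if e.status != 429 or attempt == max_retries - 1:
                raise
            # Exponential back-off with jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep((2 ** attempt) + random.uniform(0, 1))
```

The OCI SDK also ships built-in retry strategies (for example, `oci.retry.DEFAULT_RETRY_STRATEGY`, passed through the `retry_strategy` argument), which may be preferable to a hand-rolled loop.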
Release Date
Model | Release Date | On-Demand Retirement Date | Dedicated Mode Retirement Date |
---|---|---|---|
`google.gemini-2.5-flash` | 2025-10-01 | Tentative | This model isn't available for the dedicated mode. |
To learn about OCI Generative AI model deprecation and retirement, see Retiring the Models.
Model Parameters
To change the model responses, you can change the values of the following parameters in the playground or the API.
- Maximum output tokens
  - The maximum number of tokens that you want the model to generate for each response. Estimate four characters per token. Because you're prompting a chat model, the response depends on the prompt, and each response doesn't necessarily use up the maximum allocated tokens. The maximum input length is 1,048,576 tokens and the maximum output length is 65,536 tokens for each run.
  - Tip: For large inputs with difficult problems, set a high value for the maximum output tokens parameter.
- Temperature
  - The level of randomness used to generate the output text. Min: 0, Max: 2, Default: 1
  - Tip: Start with the temperature set to 0 or a value less than 1, and increase it as you regenerate the prompts for more creative output. High temperatures can introduce hallucinations and factually incorrect information.
- Top p
  - A sampling method that controls the cumulative probability of the top tokens to consider for the next token. Assign `p` a decimal number between 0 and 1 for the probability. For example, enter 0.75 for the top 75 percent to be considered. Set `p` to 1 to consider all tokens.
- Top k
  - A sampling method in which the model chooses the next token randomly from the top k most likely tokens. In the Gemini 2.5 models, top k has a fixed value of 64, which means that the model considers only the 64 most likely tokens (words or word parts) for each step of generation. The final token is then chosen from this list.
- Number of Generations (API only)
  - The `numGenerations` parameter in the API controls how many different response options the model generates for each prompt.
    - When you send a prompt, the Gemini model generates a set of possible answers. By default, it returns only the response with the highest probability (`numGenerations = 1`).
    - If you set the `numGenerations` parameter to a value from 2 through 8, the model generates 2 to 8 distinct responses.
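As an illustrative mapping of these parameters onto the API, here is a sketch using the OCI Python SDK's generic chat request. The snake_case field names (`max_tokens`, `temperature`, `top_p`, `num_generations`) are assumed SDK counterparts of the parameters above and should be verified against the SDK reference; top k is omitted because it's fixed at 64 for the Gemini 2.5 models.

```python
import oci

# Sketch: setting the model parameters on a generic chat request.
chat_request = oci.generative_ai_inference.models.GenericChatRequest(
    messages=[
        oci.generative_ai_inference.models.UserMessage(
            content=[
                oci.generative_ai_inference.models.TextContent(
                    text="Suggest three names for a weather app."
                )
            ]
        )
    ],
    max_tokens=65536,    # maximum output tokens (the model's default maximum)
    temperature=0.2,     # low randomness; raise toward 2 for more creative output
    top_p=0.75,          # sample from the top 75% of cumulative probability
    num_generations=3,   # request 3 distinct responses (values 2 through 8)
)
```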