xAI Voice (Text to Speech)
Use Text to Speech in OCI Generative AI to convert text into spoken audio with xAI Voice.
You can convert text to speech in two ways:
- OCI OpenAI-compatible Audio Speech API for request-based speech generation.
- WebSocket streaming for streaming text input and audio output.
Use the API option to submit text and receive an audio file. Use WebSocket streaming to send text incrementally and receive audio chunks as they're generated.
Supported Model
The Text to Speech model is available only in on-demand mode.
| Model | Description |
|---|---|
| xai.grok-tts | Text-to-speech model for generating spoken audio from text. |
Regions for this Model
For supported regions, endpoint types (on-demand or dedicated AI clusters), and hosting (OCI Generative AI or external calls) for this model, see the Models by Region page. For details about the regions, see the Generative AI Regions page.
Voices
The following voices are available. Voice names are case-insensitive. For example, ara, Ara, and ARA are accepted.
| Voice | Description |
|---|---|
| ara | Warm and conversational |
| eve | Energetic and upbeat |
| leo | Authoritative and strong |
| rex | Clear and professional |
| sal | Smooth and balanced |
Access Options
You can convert text to speech by using either the OCI OpenAI-compatible Audio Speech API or WebSocket streaming.
| Access option | Endpoint | Parameter style | Use when |
|---|---|---|---|
| OCI OpenAI-compatible Audio Speech API | https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1 | OpenAI-compatible audio speech request format, with xAI-specific options in extra_body | You want to submit text and receive an audio file in a single request. |
| WebSocket streaming | wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts | xAI text-to-speech streaming parameters | You want to stream text input and receive audio chunks as they're generated. |
The OCI OpenAI-compatible Audio Speech API doesn't support real-time streaming. For streaming text-to-speech, use the WebSocket endpoint.
OCI OpenAI-Compatible Audio Speech API
Use the OCI OpenAI-compatible Audio Speech API to generate audio from a single request.
- OCI OpenAI-compatible endpoint: https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1
In the request, call the xai.grok-tts model and use one of the Grok Voice voices listed in this topic. Don’t use OpenAI text-to-speech model names or OpenAI voice names.
Specify these values in the standard OpenAI-compatible audio speech request:
- model: xai.grok-tts
- input: Text to convert to speech
- voice: One of the supported Grok Voice voices, such as ara, eve, leo, rex, or sal
- response_format: Audio response format, such as mp3
Put xAI-specific options, such as language and output_format, in extra_body.
For example, use extra_body for settings such as:
- language
- output_format.sample_rate
- output_format.bit_rate
When using the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible request structure with the OCI endpoint, but use the xai.grok-tts model and supported xAI voices. Don’t use OpenAI-only voices, OpenAI TTS model names, or OpenAI custom voice objects.
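As a minimal sketch of the request shape described above, the standard fields and the xAI-specific extra_body can be built as plain dictionaries. The sample_rate and bit_rate values here are illustrative choices, not requirements:

```python
# Standard OpenAI-compatible audio speech fields for the OCI endpoint.
request = {
    "model": "xai.grok-tts",      # xAI TTS model name; not an OpenAI TTS model
    "input": "Hello from OCI.",   # text to convert to speech
    "voice": "ara",               # one of: ara, eve, leo, rex, sal
    "response_format": "mp3",     # audio response format
}

# xAI-specific options go in extra_body, not in the top-level request.
extra_body = {
    "language": "en",
    "output_format": {"sample_rate": 44100, "bit_rate": 192000},
}
```

These dictionaries map directly onto the keyword arguments of the full example at the end of this topic.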
WebSocket Streaming
Use WebSocket streaming for real-time or interactive text-to-speech workflows. With this option, you send text to the service as messages and receive audio as base64-encoded audio chunks.
- OCI WebSocket endpoint: wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts
Don’t use the xAI endpoint directly.
Set up the WebSocket connection with xAI text-to-speech query parameters such as:
| Parameter | Description |
|---|---|
| voice | Voice to use for speech generation. |
| language | Language code, such as en, or auto for automatic language detection. |
| codec | Audio codec, such as mp3, wav, pcm, mulaw, or alaw. |
| sample_rate | Audio sample rate. |
| bit_rate | MP3 bit rate. Applies only to MP3 output. |
| optimize_streaming_latency | Optimizes for lower time-to-first-audio when enabled. |
| text_normalization | Normalizes written text into spoken form when enabled. |
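The query parameters above can be assembled into a connection URL with the standard library. The region and parameter values in this sketch are illustrative:

```python
from urllib.parse import urlencode

# Example region; substitute your own OCI region.
endpoint = "wss://inference.generativeai.us-chicago-1.oci.oraclecloud.com/xai/v1/tts"

params = {
    "voice": "ara",
    "language": "en",
    "codec": "mp3",
    "sample_rate": 24000,
    "bit_rate": 128000,  # only meaningful for MP3 output
}

# urlencode handles escaping; the result is passed to the WebSocket client.
url = "{}?{}".format(endpoint, urlencode(params))
```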
After opening the WebSocket connection, send text using text.delta messages. Send text.done to indicate the end of the current utterance.
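Both client messages are small JSON objects. A sketch of their shape, matching the message names above:

```python
import json

# One or more text.delta messages carry the text to synthesize...
delta_msg = json.dumps({"type": "text.delta", "delta": "Hello, world."})

# ...and a single text.done marks the end of the current utterance.
done_msg = json.dumps({"type": "text.done"})
```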
The service returns:
| Event | Description |
|---|---|
| audio.delta | Base64-encoded audio chunk. |
| audio.done | Audio generation for the current utterance is complete. |
| error | Error message from the service. |
The WebSocket connection can remain open after audio.done, so you can send another text.delta and text.done sequence without reconnecting.
For WebSocket streaming, use xAI text-to-speech streaming parameters with the OCI WebSocket endpoint.
Parameter Usage
The parameters you use depend on the access option.
For the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible audio speech request format. Set model to xai.grok-tts, use a supported xAI voice, and put xAI-specific settings in extra_body.
For WebSocket streaming, use xAI text-to-speech streaming parameters. Configure voice, language, codec, sample rate, and bit rate as WebSocket query parameters. Send text with text.delta messages and finish each utterance with text.done.
Output Formats
Text to Speech supports common audio formats for different use cases.
| Format | Use case |
|---|---|
| mp3 | General use and broad compatibility |
| wav | Higher-fidelity audio and editing workflows |
| pcm | Raw audio for real-time processing |
| mulaw | Telephony workflows |
| alaw | Telephony workflows |
For the OCI OpenAI-compatible Audio Speech API, set the audio format in response_format and put xAI audio settings, such as sample rate and bit rate, in extra_body.output_format. For WebSocket streaming, set the audio format, sample rate, and bit rate as query parameters when opening the connection.
The bit rate applies to MP3 output.
Languages
Text to Speech supports many languages. Use a supported language code, such as en, or use auto for automatic language detection when supported by the access method.
Limits
Text to Speech limits depend on the access option.
| Limit | OCI OpenAI-compatible Audio Speech API | WebSocket streaming |
|---|---|---|
| Character limit | Up to 15,000 characters per request | Up to 15,000 characters per text.delta message |
| Longer content | Split content into smaller requests and combine the audio output | Split content into multiple text.delta messages or separate utterances |
| Rate limit | 600 requests per minute or 10 requests per second | 600 requests per minute or 10 requests per second |
| Concurrent requests or sessions | Up to 100 concurrent requests | Up to 50 concurrent sessions |
| Session permit TTL | Not applicable | 600 seconds |
For longer text, split the content into logical segments, such as paragraphs or sentences. This helps keep each request or text chunk within the character limit and makes it easier to combine the generated audio in order.
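One way to split long text along paragraph boundaries while keeping every chunk under the 15,000-character limit is a greedy packer. The splitting strategy below is a sketch, not a service requirement; any segmentation that respects the limit works:

```python
def split_text(text, limit=15000):
    """Greedily pack paragraphs into chunks no longer than limit characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # A single paragraph longer than the limit is sliced directly.
        while len(paragraph) > limit:
            chunks.append(paragraph[:limit])
            paragraph = paragraph[limit:]
        candidate = "{}\n\n{}".format(current, paragraph) if current else paragraph
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as one request (OCI OpenAI-compatible Audio Speech API) or one text.delta message (WebSocket streaming), and the resulting audio combined in order.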
For WebSocket streaming, send text by using text.delta messages and send text.done when the current utterance is complete. Each text.delta message must stay within the character limit.
OCI Release and Retirement Dates
For release and retirement dates and replacement model options, see Model Retirement Dates (On-Demand Mode).
External Documentation Links
- Text to Speech on xAI
- Streaming TTS WebSocket on xAI
- Text to Speech on OpenAI
Examples
The following examples show the two supported access options. The OCI OpenAI-compatible example uses the OCI OpenAI-compatible endpoint and the xai.grok-tts model. The WebSocket example uses the OCI xAI WebSocket endpoint and xAI streaming parameters.
- OCI OpenAI-compatible Audio Speech API
-

```python
import httpx
from openai import OpenAI
from oci_openai import OciSessionAuth

client = OpenAI(
    api_key="<not-used>",
    base_url="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/openai/v1",
    http_client=httpx.Client(
        auth=OciSessionAuth(profile_name="<profile>"),
        headers={"CompartmentId": "<compartment_id>"},
    ),
)

speech = client.audio.speech.create(
    model="xai.grok-tts",
    input="hello",
    voice="ara",
    response_format="mp3",
    extra_body={
        "language": "en",
        "output_format": {"sample_rate": 44100, "bit_rate": 192000},
    },
)

audio_file = "output.mp3"
with open(audio_file, "wb") as f:
    f.write(speech.content)
```
- WebSocket Streaming
-

```python
import asyncio
import base64
import inspect
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlencode

import websockets

CONFIG = {
    "endpoint": "wss://inference.generativeai.us-chicago-1.oci.oraclecloud.com/xai/v1/tts",
    "api_key": "<YOUR GENAI API KEY>",
    "text": "Hi, this is an audio sample.",
    "voice": "eve",
    "language": "en",
    "codec": "mp3",
    "sample_rate": 24000,
    "bit_rate": 128000,
    "output_dir": "./",
}


def tts_url():
    params = {
        "language": CONFIG["language"],
        "voice": CONFIG["voice"],
        "codec": CONFIG["codec"],
        "sample_rate": CONFIG["sample_rate"],
    }
    if CONFIG["codec"] == "mp3":
        params["bit_rate"] = CONFIG["bit_rate"]
    return "{}?{}".format(CONFIG["endpoint"], urlencode(params))


def output_file():
    folder = Path(CONFIG["output_dir"]).expanduser()
    folder.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return folder / "grok_tts_{}.{}".format(timestamp, CONFIG["codec"])


async def main():
    if CONFIG["api_key"] == "<YOUR GENAI API KEY>":
        raise ValueError("Set api_key before running this sample.")
    path = output_file()
    headers = {"Authorization": "Bearer {}".format(CONFIG["api_key"])}
    # Newer websockets releases renamed extra_headers to additional_headers.
    header_arg = (
        "additional_headers"
        if "additional_headers" in inspect.signature(websockets.connect).parameters
        else "extra_headers"
    )
    async with websockets.connect(tts_url(), **{header_arg: headers}) as ws:
        await ws.send(json.dumps({"type": "text.delta", "delta": CONFIG["text"]}))
        await ws.send(json.dumps({"type": "text.done"}))
        with open(str(path), "wb") as audio_file:
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "audio.delta":
                    audio_file.write(base64.b64decode(event["delta"]))
                elif event["type"] == "audio.done":
                    print("Saved audio to {}".format(path))
                    break
                elif event["type"] == "error":
                    raise RuntimeError(event.get("message", message))


if __name__ == "__main__":
    asyncio.run(main())
```