xAI Voice (Text to Speech)
Use Text to Speech in OCI Generative AI to convert text into spoken audio with xAI Voice.
You can convert text to speech in two ways:
- OCI OpenAI-compatible Audio Speech API for request-based speech generation.
- WebSocket streaming for streaming text input and audio output.
Use the API option to submit text and receive an audio file. Use WebSocket streaming to send text incrementally and receive audio chunks as they're generated.
Supported Model
The Text to Speech model is available only in on-demand mode.
| Model | Description |
|---|---|
| xai.grok-tts | Text-to-speech model for generating spoken audio from text. |
Regions for this Model
For supported regions, endpoint types (on-demand or dedicated AI clusters), and hosting (OCI Generative AI or external calls) for this model, see the Models by Region page. For details about the regions, see the Generative AI Regions page.
Voices
The following voices are available. Voice names are case-insensitive. For example, ara, Ara, and ARA are accepted.
| Voice | Description |
|---|---|
| ara | Warm and conversational |
| eve | Energetic and upbeat |
| leo | Authoritative and strong |
| rex | Clear and professional |
| sal | Smooth and balanced |
Access Options
You can convert text to speech by using either the OCI OpenAI-compatible Audio Speech API or WebSocket streaming.
| Access option | Endpoint | Parameter style | Use when |
|---|---|---|---|
| OCI OpenAI-compatible Audio Speech API | https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1 | OpenAI-compatible audio speech request format, with xAI-specific options in extra_body | You want to submit text and receive an audio file in a single request. |
| WebSocket streaming | wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts | xAI text-to-speech streaming parameters | You want to stream text input and receive audio chunks as they're generated. |
The OCI OpenAI-compatible Audio Speech API doesn't support real-time streaming. For streaming text-to-speech, use the WebSocket endpoint.
OCI OpenAI-Compatible Audio Speech API
Use the OCI OpenAI-compatible Audio Speech API to generate audio from a single request.
- OCI OpenAI-compatible endpoint: https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1
In the request, call the xai.grok-tts model and use one of the Grok Voice voices listed in this topic. Don’t use OpenAI text-to-speech model names or OpenAI voice names.
Specify these values in the standard OpenAI-compatible audio speech request:
- model: xai.grok-tts
- input: Text to convert to speech
- voice: One of the supported Grok Voice voices, such as ara, eve, leo, rex, or sal
- response_format: Audio response format, such as mp3
Put xAI-specific options, such as language and output_format, in extra_body.
For example, use extra_body for settings such as:
- language
- output_format.sample_rate
- output_format.bit_rate
When using the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible request structure with the OCI endpoint, but use the xai.grok-tts model and supported xAI voices. Don’t use OpenAI-only voices, OpenAI TTS model names, or OpenAI custom voice objects.
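As a minimal sketch of the request shape described above, the standard fields and the xAI-specific extra_body can be built as plain dictionaries. The sample_rate and bit_rate values here are illustrative choices, not requirements:

```python
# Standard OpenAI-compatible audio speech fields for the OCI endpoint.
request = {
    "model": "xai.grok-tts",      # xAI TTS model name; not an OpenAI TTS model
    "input": "Hello from OCI.",   # text to convert to speech
    "voice": "ara",               # one of: ara, eve, leo, rex, sal
    "response_format": "mp3",     # audio response format
}

# xAI-specific options go in extra_body, not in the top-level request.
extra_body = {
    "language": "en",
    "output_format": {"sample_rate": 44100, "bit_rate": 192000},
}
```

These dictionaries map directly onto the keyword arguments of the full example at the end of this topic.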
WebSocket Streaming
Use WebSocket streaming for real-time or interactive text-to-speech workflows. With this option, you send text to the service as messages and receive audio as base64-encoded audio chunks.
- OCI WebSocket endpoint: wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts
Don’t use the xAI endpoint directly.
Set up the WebSocket connection with xAI text-to-speech query parameters such as:
| Parameter | Description |
|---|---|
| voice | Voice to use for speech generation. |
| language | Language code, such as en, or auto for automatic language detection. |
| codec | Audio codec, such as mp3, wav, pcm, mulaw, or alaw. |
| sample_rate | Audio sample rate. |
| bit_rate | MP3 bit rate. Applies only to MP3 output. |
| optimize_streaming_latency | Optimizes for lower time-to-first-audio when enabled. |
| text_normalization | Normalizes written text into spoken form when enabled. |
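The query parameters above can be assembled into a connection URL with the standard library. The region and parameter values in this sketch are illustrative:

```python
from urllib.parse import urlencode

# Example region; substitute your own OCI region.
endpoint = "wss://inference.generativeai.us-chicago-1.oci.oraclecloud.com/xai/v1/tts"

params = {
    "voice": "ara",
    "language": "en",
    "codec": "mp3",
    "sample_rate": 24000,
    "bit_rate": 128000,  # only meaningful for MP3 output
}

# urlencode handles escaping; the result is passed to the WebSocket client.
url = "{}?{}".format(endpoint, urlencode(params))
```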
After opening the WebSocket connection, send text using text.delta messages. Send text.done to indicate the end of the current utterance.
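Both client messages are small JSON objects. A sketch of their shape, matching the message names above:

```python
import json

# One or more text.delta messages carry the text to synthesize...
delta_msg = json.dumps({"type": "text.delta", "delta": "Hello, world."})

# ...and a single text.done marks the end of the current utterance.
done_msg = json.dumps({"type": "text.done"})
```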
The service returns:
| Event | Description |
|---|---|
| audio.delta | Base64-encoded audio chunk. |
| audio.done | Audio generation for the current utterance is complete. |
| error | Error message from the service. |
The WebSocket connection can remain open after audio.done, so you can send another text.delta and text.done sequence without reconnecting.
For WebSocket streaming, use xAI text-to-speech streaming parameters with the OCI WebSocket endpoint.
Parameter Usage
The parameters you use depend on the access option.
For the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible audio speech request format. Set model to xai.grok-tts, use a supported xAI voice, and put xAI-specific settings in extra_body.
For WebSocket streaming, use xAI text-to-speech streaming parameters. Configure voice, language, codec, sample rate, and bit rate as WebSocket query parameters. Send text with text.delta messages and finish each utterance with text.done.
Output Formats
Text to Speech supports common audio formats for different use cases.
| Format | Use case |
|---|---|
| mp3 | General use and broad compatibility |
| wav | Higher-fidelity audio and editing workflows |
| pcm | Raw audio for real-time processing |
| mulaw | Telephony workflows |
| alaw | Telephony workflows |
For the OCI OpenAI-compatible Audio Speech API, set the audio format in response_format and put xAI audio settings, such as sample rate and bit rate, in extra_body.output_format. For WebSocket streaming, set the audio format, sample rate, and bit rate as query parameters when opening the connection.
The bit rate applies to MP3 output.
Languages
Text to Speech supports many languages. Use a supported language code, such as en, or use auto for automatic language detection when supported by the access method.
Limits
Text to Speech limits depend on the access option.
| Limit | OCI OpenAI-compatible Audio Speech API | WebSocket streaming |
|---|---|---|
| Character limit | Up to 15,000 characters per request | Up to 15,000 characters per text.delta message |
| Longer content | Split content into smaller requests and combine the audio output | Split content into multiple text.delta messages or separate utterances |
| Rate limit | 600 requests per minute or 10 requests per second | 600 requests per minute or 10 requests per second |
| Concurrent requests or sessions | Up to 100 concurrent requests | Up to 50 concurrent sessions |
| Session permit TTL | Not applicable | 600 seconds |
For longer text, split the content into logical segments, such as paragraphs or sentences. This helps keep each request or text chunk within the character limit and makes it easier to combine the generated audio in order.
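One way to split long text along paragraph boundaries while keeping every chunk under the 15,000-character limit is a greedy packer. The splitting strategy below is a sketch, not a service requirement; any segmentation that respects the limit works:

```python
def split_text(text, limit=15000):
    """Greedily pack paragraphs into chunks no longer than limit characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # A single paragraph longer than the limit is sliced directly.
        while len(paragraph) > limit:
            chunks.append(paragraph[:limit])
            paragraph = paragraph[limit:]
        candidate = "{}\n\n{}".format(current, paragraph) if current else paragraph
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as one request (OCI OpenAI-compatible Audio Speech API) or one text.delta message (WebSocket streaming), and the resulting audio combined in order.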
For WebSocket streaming, send text by using text.delta messages and send text.done when the current utterance is complete. Each text.delta message must stay within the character limit.
OCI Release and Retirement Dates
For release and retirement dates and replacement model options, see Model Retirement Dates (On-Demand Mode).
External Documentation Links
- Text to Speech on xAI
- Streaming TTS WebSocket on xAI
- Text to Speech on OpenAI
Examples
The following examples show the two supported access options. The OCI OpenAI-compatible example uses the OCI OpenAI-compatible endpoint and the xai.grok-tts model. The WebSocket example uses the OCI xAI WebSocket endpoint and xAI streaming parameters.
- OCI OpenAI-compatible Audio Speech API
-

```python
import httpx
from openai import OpenAI
from oci_openai import OciSessionAuth

client = OpenAI(
    api_key="<not-used>",
    base_url="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/openai/v1",
    http_client=httpx.Client(
        auth=OciSessionAuth(profile_name="<profile>"),
        headers={"CompartmentId": "<compartment_id>"},
    ),
)

speech = client.audio.speech.create(
    model="xai.grok-tts",
    input="hello",
    voice="ara",
    response_format="mp3",
    extra_body={
        "language": "en",
        "output_format": {"sample_rate": 44100, "bit_rate": 192000},
    },
)

audio_file = "output.mp3"
with open(audio_file, "wb") as f:
    f.write(speech.content)
```
- WebSocket Streaming
-

```python
import asyncio
import base64
import inspect
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlencode

import websockets

CONFIG = {
    "endpoint": "wss://inference.generativeai.us-chicago-1.oci.oraclecloud.com/xai/v1/tts",
    "api_key": "<YOUR GENAI API KEY>",
    "text": "Hi, this is an audio sample.",
    "voice": "eve",
    "language": "en",
    "codec": "mp3",
    "sample_rate": 24000,
    "bit_rate": 128000,
    "output_dir": "./",
}


def tts_url():
    params = {
        "language": CONFIG["language"],
        "voice": CONFIG["voice"],
        "codec": CONFIG["codec"],
        "sample_rate": CONFIG["sample_rate"],
    }
    if CONFIG["codec"] == "mp3":
        params["bit_rate"] = CONFIG["bit_rate"]
    return "{}?{}".format(CONFIG["endpoint"], urlencode(params))


def output_file():
    folder = Path(CONFIG["output_dir"]).expanduser()
    folder.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return folder / "grok_tts_{}.{}".format(timestamp, CONFIG["codec"])


async def main():
    if CONFIG["api_key"] == "<YOUR GENAI API KEY>":
        raise ValueError("Set api_key before running this sample.")
    path = output_file()
    headers = {"Authorization": "Bearer {}".format(CONFIG["api_key"])}
    # Newer websockets releases renamed extra_headers to additional_headers.
    header_arg = (
        "additional_headers"
        if "additional_headers" in inspect.signature(websockets.connect).parameters
        else "extra_headers"
    )
    async with websockets.connect(tts_url(), **{header_arg: headers}) as ws:
        await ws.send(json.dumps({"type": "text.delta", "delta": CONFIG["text"]}))
        await ws.send(json.dumps({"type": "text.done"}))
        with open(str(path), "wb") as audio_file:
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "audio.delta":
                    audio_file.write(base64.b64decode(event["delta"]))
                elif event["type"] == "audio.done":
                    print("Saved audio to {}".format(path))
                    break
                elif event["type"] == "error":
                    raise RuntimeError(event.get("message", message))


if __name__ == "__main__":
    asyncio.run(main())
```