xAI Voice (Text to Speech)

Use Text to Speech in OCI Generative AI to convert text into spoken audio with xAI Voice.

You can convert text to speech in two ways:

  • OCI OpenAI-compatible Audio Speech API for request-based speech generation.
  • WebSocket streaming for streaming text input and audio output.

Use the API option to submit text and receive an audio file. Use WebSocket streaming to send text incrementally and receive audio chunks as they're generated.

Supported Model

Note

The Text to Speech model is available only in on-demand mode.

Model          Description
xai.grok-tts   Text-to-speech model for generating spoken audio from text.

Regions for this Model

Important

For supported regions, endpoint types (on-demand or dedicated AI clusters), and hosting (OCI Generative AI or external calls) for this model, see the Models by Region page. For details about the regions, see the Generative AI Regions page.

Voices

The following voices are available. Voice names are case-insensitive. For example, ara, Ara, and ARA are accepted.

Voice   Description
ara     Warm and conversational
eve     Energetic and upbeat
leo     Authoritative and strong
rex     Clear and professional
sal     Smooth and balanced

Access Options

You can convert text to speech by using either the OCI OpenAI-compatible Audio Speech API or WebSocket streaming.

OCI OpenAI-compatible Audio Speech API
  • Endpoint: https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1
  • Parameter style: OpenAI-compatible audio speech request format, with xAI-specific options in extra_body
  • Use when: You want to submit text and receive an audio file in a single request.

WebSocket streaming
  • Endpoint: wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts
  • Parameter style: xAI text-to-speech streaming parameters
  • Use when: You want to stream text input and receive audio chunks as they're generated.

The OCI OpenAI-compatible Audio Speech API doesn’t support real-time streaming. For streaming text-to-speech, use the WebSocket endpoint.

OCI OpenAI-Compatible Audio Speech API

Use the OCI OpenAI-compatible Audio Speech API to generate audio from a single request.

OCI OpenAI-Compatible Endpoint
https://inference.generativeai.{region}.oci.oraclecloud.com/openai/v1

In the request, call the xai.grok-tts model and use one of the supported voices listed in this topic. Don’t use OpenAI text-to-speech model names or OpenAI voice names.

Specify these values in the standard OpenAI-compatible audio speech request:

  • model: xai.grok-tts
  • input: Text to convert to speech
  • voice: One of the supported voices: ara, eve, leo, rex, or sal
  • response_format: Audio response format, such as mp3

Put xAI-specific options, such as language and output_format, in extra_body.

For example, use extra_body for settings such as:

  • language
  • output_format.sample_rate
  • output_format.bit_rate
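Assembled as an OpenAI-compatible request body, these settings might look like the following sketch; the text and audio values shown are illustrative, not required defaults:

```python
# Illustrative Audio Speech API request payload.
# Standard OpenAI-compatible fields sit at the top level;
# xAI-specific settings go under extra_body.
request_kwargs = {
    "model": "xai.grok-tts",
    "input": "Welcome to Text to Speech.",
    "voice": "ara",
    "response_format": "mp3",
    "extra_body": {
        "language": "en",
        "output_format": {
            "sample_rate": 44100,  # Hz
            "bit_rate": 192000,    # bits per second; applies to MP3 output
        },
    },
}
```

You would pass these as keyword arguments to client.audio.speech.create(**request_kwargs), as in the full example later in this topic.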
Note

When using the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible request structure with the OCI endpoint, but use the xai.grok-tts model and supported xAI voices. Don’t use OpenAI-only voices, OpenAI TTS model names, or OpenAI custom voice objects.

WebSocket Streaming

Use WebSocket streaming for real-time or interactive text-to-speech workflows. With this option, you send text to the service as messages and receive audio as base64-encoded audio chunks.

OCI WebSocket endpoint:
wss://inference.generativeai.{region}.oci.oraclecloud.com/xai/v1/tts

Don’t use the xAI endpoint directly.

Set up the WebSocket connection with xAI text-to-speech query parameters such as:

Parameter                   Description
voice                       Voice to use for speech generation.
language                    Language code, such as en, or auto for automatic language detection.
codec                       Audio codec, such as mp3, wav, pcm, mulaw, or alaw.
sample_rate                 Audio sample rate.
bit_rate                    MP3 bit rate. Applies to MP3 output.
optimize_streaming_latency  Optimizes for lower time-to-first-audio when enabled.
text_normalization          Normalizes written text into spoken form when enabled.

After opening the WebSocket connection, send text using text.delta messages. Send text.done to indicate the end of the current utterance.

The service returns:

Event        Description
audio.delta  Base64-encoded audio chunk.
audio.done   Audio generation for the current utterance is complete.
error        Error message from the service.

The WebSocket connection can remain open after audio.done, so you can send another text.delta and text.done sequence without reconnecting.
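Because the connection stays open, each utterance is just a text.delta and text.done pair. As a sketch, a small helper can build the message sequence for each utterance; the helper name is ours, not part of the API:

```python
import json


def utterance_messages(text):
    """Build the WebSocket messages for one utterance:
    a text.delta carrying the text, then a text.done marker."""
    return [
        json.dumps({"type": "text.delta", "delta": text}),
        json.dumps({"type": "text.done"}),
    ]


# Send each list over the same open connection, one utterance at a time.
first = utterance_messages("First sentence.")
second = utterance_messages("Then a second utterance.")
# Send `first`, wait for audio.done, then send `second` without reconnecting.
```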

Note

For WebSocket streaming, use xAI text-to-speech streaming parameters with the OCI WebSocket endpoint.

Parameter Usage

The parameters you use depend on the access option.

For the OCI OpenAI-compatible Audio Speech API, use the OpenAI-compatible audio speech request format. Set model to xai.grok-tts, use a supported xAI voice, and put xAI-specific settings in extra_body.

For WebSocket streaming, use xAI text-to-speech streaming parameters. Configure voice, language, codec, sample rate, and bit rate as WebSocket query parameters. Send text with text.delta messages and finish each utterance with text.done.

Output Formats

Text to Speech supports common audio formats for different use cases.

Format  Use case
mp3     General use and broad compatibility
wav     Higher-fidelity audio and editing workflows
pcm     Raw audio for real-time processing
mulaw   Telephony workflows
alaw    Telephony workflows
Note

For the OCI OpenAI-compatible Audio Speech API, set the audio format in response_format and put xAI audio settings, such as sample rate and bit rate, in extra_body.output_format. For WebSocket streaming, set the audio format, sample rate, and bit rate as query parameters when opening the connection.

Note

The bit rate applies to MP3 output.

Languages

Text to Speech supports many languages. Use a supported language code, such as en, or use auto for automatic language detection when supported by the access method.

Limits

Text to Speech limits depend on the access option.

Character limit
  • Audio Speech API: Up to 15,000 characters per request
  • WebSocket streaming: Up to 15,000 characters per text.delta message

Longer content
  • Audio Speech API: Split content into smaller requests and combine the audio output
  • WebSocket streaming: Split content into multiple text.delta messages or separate utterances

Rate limit
  • Both options: 600 requests per minute or 10 requests per second

Concurrent requests or sessions
  • Audio Speech API: Up to 100 concurrent requests
  • WebSocket streaming: Up to 50 concurrent sessions

Session permit TTL
  • Audio Speech API: Not applicable
  • WebSocket streaming: 600 seconds

For longer text, split the content into logical segments, such as paragraphs or sentences. This helps keep each request or text chunk within the character limit and makes it easier to combine the generated audio in order.

For WebSocket streaming, send text by using text.delta messages and send text.done when the current utterance is complete. Each text.delta message must stay within the character limit.
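One way to keep each request or text.delta message within the limit is to split on sentence boundaries. The following sketch is one such approach; the function name is ours, and it assumes individual sentences fit within the limit:

```python
import re


def split_for_tts(text, max_chars=15000):
    """Split text into chunks no longer than max_chars,
    breaking at sentence-ending punctuation. Assumes no single
    sentence exceeds max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = sentence if not current else current + " " + sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request, or as its own text.delta message, and the returned audio combined in order.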

Examples

The following examples show the two supported access options. The OCI OpenAI-compatible example uses the OCI OpenAI-compatible endpoint and the xai.grok-tts model. The WebSocket example uses the OCI xAI WebSocket endpoint and xAI streaming parameters.

OCI OpenAI-compatible Audio Speech API
from openai import OpenAI
from oci_openai import OciSessionAuth
import httpx

client = OpenAI(
    api_key="<not-used>",
    base_url="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/openai/v1",
    http_client=httpx.Client(
        auth=OciSessionAuth(profile_name="<profile>"),
        headers={"CompartmentId": "<compartment_id>"}
    ),
)

speech = client.audio.speech.create(
    model="xai.grok-tts",
    input="hello",
    voice="ara",
    response_format="mp3",
    extra_body={
        "language": "en",
        "output_format": {
            "sample_rate": 44100,
            "bit_rate": 192000
        }
    }
)

audio_file = "output.mp3"
with open(audio_file, "wb") as f:
    f.write(speech.content)
WebSocket Streaming
import asyncio
import base64
import inspect
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlencode

import websockets


CONFIG = {
    "endpoint": "wss://inference.generativeai.us-chicago-1.oci.oraclecloud.com/xai/v1/tts",
    "api_key": "<YOUR GENAI API KEY>",
    "text": "Hi, this is an audio sample.",
    "voice": "eve",
    "language": "en",
    "codec": "mp3",
    "sample_rate": 24000,
    "bit_rate": 128000,
    "output_dir": "./",
}


def tts_url():
    params = {
        "language": CONFIG["language"],
        "voice": CONFIG["voice"],
        "codec": CONFIG["codec"],
        "sample_rate": CONFIG["sample_rate"],
    }
    if CONFIG["codec"] == "mp3":
        params["bit_rate"] = CONFIG["bit_rate"]
    return "{}?{}".format(CONFIG["endpoint"], urlencode(params))


def output_file():
    folder = Path(CONFIG["output_dir"]).expanduser()
    folder.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return folder / "grok_tts_{}.{}".format(timestamp, CONFIG["codec"])


async def main():
    if CONFIG["api_key"] == "<YOUR GENAI API KEY>":
        raise ValueError("Set api_key before running this sample.")

    path = output_file()
    headers = {"Authorization": "Bearer {}".format(CONFIG["api_key"])}
    header_arg = (
        "additional_headers"
        if "additional_headers" in inspect.signature(websockets.connect).parameters
        else "extra_headers"
    )

    async with websockets.connect(tts_url(), **{header_arg: headers}) as ws:
        await ws.send(json.dumps({"type": "text.delta", "delta": CONFIG["text"]}))
        await ws.send(json.dumps({"type": "text.done"}))

        with open(str(path), "wb") as audio_file:
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "audio.delta":
                    audio_file.write(base64.b64decode(event["delta"]))
                elif event["type"] == "audio.done":
                    print("Saved audio to {}".format(path))
                    break
                elif event["type"] == "error":
                    raise RuntimeError(event.get("message", message))


if __name__ == "__main__":
    asyncio.run(main())