Using Text to Speech

Learn how to use Text to speech.

Text to speech (TTS) transforms written text into spoken words, bridging the gap between the written word and the spoken voice.

TTS tools offer several valuable use cases for businesses, enhancing productivity and user experience:

Audiobook Production: TTS technology can automate the conversion of written content into audiobooks, saving time and resources while catering to a broader audience's preferences for audio content.

Accessibility Compliance: Businesses can ensure their digital content is accessible to individuals with visual impairments by using TTS to convert text into spoken words, making websites and documents compliant with accessibility regulations.

Interactive Voice Response (IVR) Systems: TTS is vital for creating natural-sounding voice prompts in IVR systems, enhancing customer service by providing automated but human-like interactions, such as call routing and information retrieval.

Virtual Assistants and Chatbots: Integrating TTS into virtual assistants and chatbots allows businesses to provide personalized and engaging interactions with users, whether on websites or through messaging apps, enhancing customer engagement and support.

Enhanced Product Demonstrations: Sales teams can use TTS to create audio-enhanced product demonstrations or tutorials. This makes it easier for potential customers to understand product features and benefits, leading to more informed purchase decisions.

Capabilities

Synchronous API: Text to speech supports a synchronous API over HTTPS. You send text input and receive audio in the response.
Multiple Output Formats: Text to speech can generate output in PCM, MP3, OGG, and JSON formats.
Standard and Natural Voices: Text to speech offers male and female standard and natural (human-like) voices.
Chunk Streaming Support: The Text to speech service supports chunked transfer encoding over HTTPS. You send a request with input text and receive the audio output in chunks, which helps reduce latency on the client side.
Speech Synthesis Markup Language (SSML): You can send SSML in your Text to speech request to customize the audio response, for example by specifying pauses and how acronyms, dates, times, and abbreviations are spoken.
Note

SSML is supported only for some English (US) voices and isn't supported for voices in any other language.
Multilingual Support: The Text to speech Natural model supports nine languages:
- English (US)
- English (British)
- Spanish (Spain)
- Portuguese (Brazilian)
- French
- Italian
- Hindi
- Japanese
- Chinese (Mandarin)
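
As an illustration of how a client might benefit from chunk streaming, the following sketch writes audio chunks to a file-like object as they arrive instead of waiting for the full response. It's deliberately independent of any HTTP library: `chunks` is a hypothetical stand-in for whatever iterator of byte chunks your HTTP client exposes, and `write_audio_chunks` is an illustrative name, not part of the Text to speech API.

```python
import io
from typing import BinaryIO, Iterable


def write_audio_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each audio chunk to `sink` as soon as it arrives.

    Returns the total number of bytes written. `chunks` is assumed to be
    an iterator over the chunked HTTPS response body (hypothetical input).
    """
    total = 0
    for chunk in chunks:
        if not chunk:  # ignore empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory data standing in for a streamed response.
buffer = io.BytesIO()
written = write_audio_chunks([b"RIFF", b"", b"data"], buffer)
```

Because each chunk is handled immediately, playback or further processing could start before the last chunk has been received.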

Language and Feature Support

Language Codes


Language	Language Code
English—United States	`en-US`
English—Great Britain	`en-GB`
Spanish—Spain	`es-ES`
Portuguese—Brazil	`pt-BR`
French—France	`fr-FR`
Italian—Italy	`it-IT`
Hindi—India	`hi-IN`
Japanese—Japan	`ja-JP`
Chinese—China Mandarin	`cmn-CN`
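
The table above can be captured as a small lookup, for example to validate user input before building a request. This is a minimal sketch: the names `LANGUAGE_CODES` and `normalize_language_code` are illustrative, and the case-insensitive matching is a convenience assumed here, not documented service behavior.

```python
# Language codes copied from the table above.
LANGUAGE_CODES = {
    "English (United States)": "en-US",
    "English (Great Britain)": "en-GB",
    "Spanish (Spain)": "es-ES",
    "Portuguese (Brazil)": "pt-BR",
    "French (France)": "fr-FR",
    "Italian (Italy)": "it-IT",
    "Hindi (India)": "hi-IN",
    "Japanese (Japan)": "ja-JP",
    "Chinese (China, Mandarin)": "cmn-CN",
}

# Map lower-cased codes to their canonical spelling.
_CANONICAL = {code.lower(): code for code in LANGUAGE_CODES.values()}


def normalize_language_code(code: str) -> str:
    """Return the canonical code (e.g. 'en-US') or raise ValueError."""
    try:
        return _CANONICAL[code.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None
```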

English—United States Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Brian(Male) Annabelle(Female) Bob(Male) Stacy(Female) Phil(Female) Cindy(Female) Brad(Male) Richard(Male)	Yes	Brian(Male) Annabelle(Female) Bob(Male) Stacy(Female) Phil(Female) Cindy(Female)	Yes	Yes	`MP3` `PCM` `OGG` `JSON`
Mary(Female) Amanda(Female) Grace(Female) Laura(Female) Megan(Female) Olivia(Female) Rachel(Female) Stephanie(Female) Teresa(Female) Victoria(Female) Ashley(Female) Adam(Male) Ethan(Male) Henry(Male) Jack(Male) Chris(Male) Mark(Male) Paul(Male) Steve(Male) Kevin(Male)	No			Yes	`MP3` `PCM` `OGG` `JSON`

English—Great Britain Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Charlotte(Female) Emily(Female) Sophie(Female) Isla(Female) Oliver(Male) Harry(Male) Theo(Male) Arthur(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Spanish—Spain Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Carmen(Female) Mateo(Male) Lucas(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Portuguese—Brazil Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Mariana(Female) Felix(Male) Miguel(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

French—France Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Claire(Female)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Italian—Italy Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Giulia(Female) Luca(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Hindi—India Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Asha(Female) Priya(Female) Arjun(Male) Rahul(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Japanese—Japan Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Aiko(Female) Hana(Female) Sakura(Female) Yuki(Female) Satoshi(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`

Chinese—China Mandarin Supported Features


Natural (TTS_2_NATURAL)		Standard (TTS_1_STANDARD)		Chunk Streaming	Output Formats
VoiceId(Gender)	SSML Support?	VoiceId(Gender)	SSML Support?	Chunk Streaming	Output Formats
Jia(Female) Ling(Female) Mei(Female) Xiu(Female) Jun(Male) Hao(Male) Ming(Male) Wang(Male)	Not Supported	Not Supported	Not Supported	Yes	`MP3` `PCM` `OGG` `JSON`
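
Because SSML support varies by language, model, and voice, it can help to check the combination before sending SSML. The sketch below encodes the support tables above as data; `SSML_VOICES` and `supports_ssml` are illustrative names, not part of the service, and the data would need updating if the tables change.

```python
# Voices that accept SSML, keyed by (language code, model), per the tables above.
SSML_VOICES = {
    ("en-US", "TTS_2_NATURAL"): frozenset(
        {"Brian", "Annabelle", "Bob", "Stacy", "Phil", "Cindy", "Brad", "Richard"}
    ),
    ("en-US", "TTS_1_STANDARD"): frozenset(
        {"Brian", "Annabelle", "Bob", "Stacy", "Phil", "Cindy"}
    ),
}


def supports_ssml(language_code: str, model: str, voice_id: str) -> bool:
    """Return True if the tables list this voice as supporting SSML."""
    return voice_id in SSML_VOICES.get((language_code, model), frozenset())
```

For example, `supports_ssml("en-US", "TTS_2_NATURAL", "Mary")` is false because Mary is listed with SSML support "No", and any `en-GB` voice is false because SSML is listed as not supported for that language.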

SSML Tags

Note

For a list of Text to speech languages and voice models that support SSML tags, see Language and Feature Support.

<speak>

The SSML root tag. All SSML-enhanced text must be enclosed within a pair of <speak> tags. Available for both natural and standard voices.

Example:

<speak> This is the root tag for SSML. </speak>
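
A quick client-side check can catch SSML that isn't well-formed or isn't wrapped in `<speak>` before it's sent. The following is a minimal sketch using Python's standard XML parser; `has_speak_root` is an illustrative name, and the check says nothing about whether the service accepts the individual tags inside.

```python
import xml.etree.ElementTree as ET


def has_speak_root(ssml: str) -> bool:
    """Return True if `ssml` is well-formed XML whose root element is <speak>."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False  # not well-formed (e.g. an unclosed tag)
    return root.tag == "speak"
```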

<break>

Adds a pause to your message. Available for both natural and standard voices.

`<break>` Attributes
Attribute	Value	Description
`time`	`[number]s`	The duration of the pause, in seconds.
`time`	`[number]ms`	The duration of the pause, in milliseconds.
`strength`	`none`	No pause. Use `none` to remove a normally occurring pause, such as after a period. Equivalent to "0ms".
	`x-weak`	Has the same strength as `none`, no pause.
	`weak`	Sets a pause of the same duration as the pause after a comma. Equivalent to "150ms".
	`medium`	Has the same strength as `weak`.
	`strong`	Sets a pause of the same duration as the pause after a sentence. Equivalent to "400ms".
	`x-strong`	Sets a pause of the same duration as the pause after a paragraph. Equivalent to "800ms".

Example 1:

<speak>
    Close your eyes, take a deep breath <break time="1s"/>
    and let go of all the stress and worries.
    Feel the gentle breeze <break time="1500ms"/> as
    it caresses your skin, and listen to the
    soothing sounds of nature.
</speak>

Example 2:

<speak> 
    Let me give you a demonstration of the <break strength="x-strong"/> extra-strong pause.
    Now, let's try a <break strength="strong"/> strong pause.
    Finally, we have a <break strength="weak"/> weak pause.
</speak>
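
When SSML is generated programmatically, a small helper can keep `<break>` tags consistent with the attribute table above. This is a sketch only: `break_tag` and `BREAK_STRENGTH_MS` are illustrative names, and the millisecond equivalents are the ones listed in the table.

```python
from typing import Optional

# Named strengths and their documented millisecond equivalents.
BREAK_STRENGTH_MS = {
    "none": 0,
    "x-weak": 0,
    "weak": 150,
    "medium": 150,
    "strong": 400,
    "x-strong": 800,
}


def break_tag(*, time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> tag from either a duration in ms or a named strength."""
    if (time_ms is None) == (strength is None):
        raise ValueError("give exactly one of time_ms or strength")
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must not be negative")
        return f'<break time="{time_ms}ms"/>'
    if strength not in BREAK_STRENGTH_MS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'
```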

<s>

Adds a pause between lines or sentences in the text. It has the same effect as ending a sentence with a period or using <break strength="strong"/>. Available for both natural and standard voices.

<speak>
    <s>This is the first sentence</s>
    <s>This is the second sentence</s>
    This is the last sentence.
</speak>

<p>

Adds a pause at the end of paragraphs in your text. The pause is longer than the one native speakers usually place at commas or at the end of a sentence. Available for both natural and standard voices.

<speak>
    <p>Good morning, ladies and gentlemen. I would like to take this opportunity to welcome you all to our annual conference on artificial intelligence.</p>
    <p>Our keynote speaker for this event is Dr. Samantha Johnson, a renowned expert in machine learning and data analytics.</p>
</speak>
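
Plain text that's already split into paragraphs and sentences can be wrapped in `<p>` and `<s>` tags mechanically. The sketch below assumes the input is a list of paragraphs, each a list of sentences; `build_ssml` is an illustrative name. It escapes `&`, `<`, and `>` so the text can't break the markup.

```python
from typing import Sequence
from xml.sax.saxutils import escape


def build_ssml(paragraphs: Sequence[Sequence[str]]) -> str:
    """Wrap sentences in <s>, paragraphs in <p>, and everything in <speak>."""
    parts = ["<speak>"]
    for sentences in paragraphs:
        body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        parts.append(f"<p>{body}</p>")
    parts.append("</speak>")
    return "".join(parts)
```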

<say-as>

Specifies how to say certain characters, words, and numbers. Available for both natural and standard voices.


Attribute	Value	Description
`interpret-as`	`date`	Interprets the contained text as a Gregorian calendar date. The format of the date must be specified with the `format` attribute. The date separator character can be forward slash (/), dash (-), and period (.). No white space is allowed inside a date string.
	`time`	Interprets the numerical text as a duration, in hours, minutes, and seconds. The text must be in `hour:min` or `hour:min:seconds` format. Optionally, it can be followed by "A.M." or "P.M."; A.M. can also be written as AM, a.m., or am. Setting `detail="1"` instructs the SSML parser to give the output in 24-hour format, and setting `detail="2"` instructs it to give the output in 12-hour format.
	`fraction`	Interprets the numerical text as a fraction. It works for both common and mixed fractions.
	`digits`	Spells out each digit individually, for example, 1234 as 1-2-3-4.
	`cardinal`	Interprets the numerical text as a cardinal number.
	`ordinal`	Interprets the numerical text as an ordinal number. For example, '1' is interpreted as '1st', '2' as '2nd', and so on.
	`spell-out`	Speaks each character of the text enclosed in the `say-as` tag, including punctuation marks, special symbols, and spaces.
	`unit`	Interprets a numerical text as a measurement. The value must be either a number or a fraction followed by a unit with no spaces.

Example:

<speak>
    <p>The say-as tag controls how special types of words are spoken, such as numbers, currencies, units, dates, times, and acronyms.</p>
    For example:
    I can speak acronym <say-as interpret-as="spell-out">IRFC</say-as> for Indian Railway Finance Corporation.
    I can speak India currency <say-as interpret-as="currency">₹5200</say-as>.
    I can speak US currency <say-as interpret-as="currency">$5200</say-as>.
    I can speak dimensions <say-as interpret-as="unit">5cm</say-as> length and <say-as interpret-as="unit">10cm</say-as> width.
    I can speak temperature <say-as interpret-as="unit">25°C</say-as>.
    I can speak fraction values <say-as interpret-as="fraction">3/4</say-as>.
    I can speak ordinals <say-as interpret-as="ordinal">1731</say-as> Rank.
    I can speak digits <say-as interpret-as="digits">1234</say-as> and <say-as interpret-as="digits">5678</say-as>.
    I can speak date <say-as interpret-as="date" format="ymd">2022-11-13</say-as> and time <say-as interpret-as="time">10:00 AM</say-as>.
</speak>
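
A helper like the following could generate `<say-as>` tags and reject values that aren't in the attribute table. It's a sketch under stated assumptions: `say_as` and `INTERPRET_AS` are illustrative names, and `currency` is included only because it appears in the example above, even though the table doesn't list it.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

# Values from the attribute table, plus "currency" as used in the example.
INTERPRET_AS = {
    "date", "time", "fraction", "digits", "cardinal",
    "ordinal", "spell-out", "unit", "currency",
}


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Build a <say-as> tag; dates must carry a format such as "ymd"."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    if interpret_as == "date" and fmt is None:
        raise ValueError("dates require the format attribute")
    attrs = f'interpret-as="{interpret_as}"'
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"
```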

<sub>

Use with the `alias` attribute to substitute a different word (or pronunciation) for selected text, such as an acronym or abbreviation. Available for both natural and standard voices.

Example:

<speak>
    My favorite chemical element is <sub alias="Mercury">Hg</sub>, because it looks so shiny.
</speak>
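
If the same abbreviations recur across many texts, the substitutions can be applied from a glossary instead of by hand. The sketch below wraps whole-word matches in `<sub>` tags; `apply_aliases` is an illustrative name, and the word-boundary matching is an assumption of this example that may not suit abbreviations containing punctuation.

```python
import re
from typing import Mapping
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text: str, aliases: Mapping[str, str]) -> str:
    """Wrap each whole-word occurrence of a glossary key in a <sub alias=...> tag."""
    if not aliases:
        return escape(text)
    # Try longer keys first so "HgO" would win over "Hg" if both were present.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))
        word = match.group(1)
        out.append(f"<sub alias={quoteattr(aliases[word])}>{escape(word)}</sub>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

The result is an SSML fragment that still needs to be enclosed in `<speak>` tags.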

<phoneme>

Replaces the pronunciation of a particular word with the phonemes specified in the `ph` attribute. Available for both natural and standard voices.


Attribute	Value	Description
`alphabet`	`ipa`	Indicates that the International Phonetic Alphabet (IPA) will be used.
`alphabet`	`x-sampa`	Indicates that the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) will be used.
`ph`		Specifies the phonemes for the custom pronunciation.

Example:

<speak>
    Para is short for <phoneme alphabet="ipa" ph='pˈæɹəɡɹˌæf'>para</phoneme>. 
</speak>
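
A `<phoneme>` tag can be built the same way as the other tags, with a check that the alphabet is one of the two listed in the table. This is a minimal sketch; `phoneme` is an illustrative helper name, and it doesn't verify that the phoneme string itself is valid IPA or X-SAMPA.

```python
from xml.sax.saxutils import escape, quoteattr

ALPHABETS = {"ipa", "x-sampa"}


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Build a <phoneme> tag that pronounces `text` as the phonemes in `ph`."""
    if alphabet not in ALPHABETS:
        raise ValueError(f"alphabet must be one of {sorted(ALPHABETS)}")
    # quoteattr picks a quoting style that is safe for the phoneme string.
    return f'<phoneme alphabet="{alphabet}" ph={quoteattr(ph)}>{escape(text)}</phoneme>'
```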

<prosody>

Controls prosody, the patterns of stress and intonation in speech, through the rate, volume, and pitch of the voice. Available for standard voices only.


Attribute	Value	Description
`rate`	`"X%"`	Controls the speed of speech. The percentage value must be less than 100%, and the change is relative to the default speaking rate. X denotes an increase (+X%) or a decrease (-X%) in the rate.
	`default`	Default speaking rate.
	`x-slow`	Very slow speaking rate.
	`slow`	Slow speaking rate.
	`medium`	Medium speaking rate. This is the default speaking rate.
	`fast`	Fast speaking rate.
	`x-fast`	Very fast speaking rate.
`volume`	`"XdB"`	Controls the volume of the speech. This attribute doesn't set a fixed volume; it changes the volume relative to the current level. X can be a positive or a negative number, depending on whether you want to increase or decrease the volume.
	`default`	Default volume.
	`x-soft`	Very low volume. Approximately 12 dB lower than the default.
	`soft`	Low volume. Approximately 6 dB lower than the default.
	`medium`	Medium volume. This is the default volume.
	`loud`	Loud volume. Approximately 6 dB higher than the default.
	`x-loud`	Very loud volume. Approximately 12 dB higher than the default.
`pitch`	`default`	Default pitch.
	`x-low`	Very low pitch.
	`low`	Low pitch.
	`medium`	Medium pitch. This is the default pitch.
	`high`	High pitch.
	`x-high`	Very high pitch.

Example 1:

<speak>
    <prosody rate="0%">This is the default speaking rate.</prosody> 
    <prosody rate="-50%">Decrease the speaking rate by half the default rate.</prosody> 
    <prosody rate="+50%">Increase the speaking rate by fifty percent of the default rate.</prosody>
</speak>

Example 2:

<speak>
    <p>
        <s>Hi, this is a normal sentence.</s>
        <s>
            <prosody volume="+10dB">This is a louder sentence!</prosody>
        </s> 
        <s>
            <prosody volume="-8dB">This is a quieter sentence.</prosody>
        </s>
    </p>
</speak>

Example 3:

<speak>
    <prosody pitch='default'>This is the default pitch.</prosody>
    <prosody pitch='medium'>This is the default pitch.</prosody> 
    <prosody pitch='x-low'>This is the very low pitch.</prosody> 
    <prosody pitch='x-high'>This is the very high pitch.</prosody>
</speak>
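
The constraints in the attribute table (named values, a relative rate below 100%, a relative volume in dB) can be enforced when the tag is generated. The following sketch shows one way to do that; `prosody` and the constant names are illustrative, and integer arguments are an assumption of this example for expressing relative changes.

```python
from typing import Optional, Union
from xml.sax.saxutils import escape

RATE_NAMES = {"default", "x-slow", "slow", "medium", "fast", "x-fast"}
VOLUME_NAMES = {"default", "x-soft", "soft", "medium", "loud", "x-loud"}
PITCH_NAMES = {"default", "x-low", "low", "medium", "high", "x-high"}


def prosody(
    text: str,
    *,
    rate: Optional[Union[int, str]] = None,
    volume: Optional[Union[int, str]] = None,
    pitch: Optional[str] = None,
) -> str:
    """Build a <prosody> tag. Integers mean relative changes (percent or dB)."""
    attrs = []
    if rate is not None:
        if isinstance(rate, int):
            if abs(rate) >= 100:
                raise ValueError("rate percentage must be less than 100")
            rate = f"{rate:+d}%"  # e.g. -50 becomes "-50%"
        elif rate not in RATE_NAMES:
            raise ValueError(f"unknown rate: {rate!r}")
        attrs.append(f'rate="{rate}"')
    if volume is not None:
        if isinstance(volume, int):
            volume = f"{volume:+d}dB"  # e.g. 10 becomes "+10dB"
        elif volume not in VOLUME_NAMES:
            raise ValueError(f"unknown volume: {volume!r}")
        attrs.append(f'volume="{volume}"')
    if pitch is not None:
        if pitch not in PITCH_NAMES:
            raise ValueError(f"unknown pitch: {pitch!r}")
        attrs.append(f'pitch="{pitch}"')
    if not attrs:
        raise ValueError("set at least one of rate, volume, or pitch")
    return f"<prosody {' '.join(attrs)}>{escape(text)}</prosody>"
```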

<voice>

Lets you use multiple voices in a single SSML request. Available for both natural and standard voices.

Example:

<speak>
    <voice name="Bob">Hello Cindy, how are you doing?</voice>
    <voice name="Cindy">Hello Bob, I am good, thank you.</voice>
    <voice name="Bob">Hope you enjoyed your stay with us.</voice>
    <voice name="Cindy">Yes, it was lovely. I enjoyed the food and the services a lot. Thank you for hosting me. I would love to be back sometime soon.</voice>
</speak>
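
A multi-voice script like the one above can be generated from a list of speaker and line pairs. This is a minimal sketch; `dialogue` is an illustrative name, and it doesn't check that the voice names exist or support SSML.

```python
from typing import Iterable, Tuple
from xml.sax.saxutils import escape, quoteattr


def dialogue(turns: Iterable[Tuple[str, str]]) -> str:
    """Build a <speak> document with one <voice> element per (voice, line) pair."""
    body = "".join(
        f"<voice name={quoteattr(name)}>{escape(line)}</voice>"
        for name, line in turns
    )
    return f"<speak>{body}</speak>"
```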

Data Handling

Does Oracle use the input text that I upload to the TTS service, or the audio files the service generates, for other purposes?

No, we don't use the input text that you upload to the TTS service, nor the resulting generated audio files, for any purpose except to provide you with a speech rendition of the input text.

Does Oracle use my input text to train the TTS service?

No, we don't use the input text you provide to train the TTS service.

Is the input text I send to the TTS service, the results, or other information about the request itself stored on Oracle servers?

The input text you send to the TTS service is processed in memory during the audio file generation. We temporarily log some metadata about your requests to improve the service, for billing and metering purposes, and to combat abuse. Examples of such metadata are the time and size of the request.
