Using Text to Speech

Learn how to use Text to Speech.

Text-to-speech (TTS) converts written text into spoken audio.

TTS tools offer several valuable use cases for businesses, enhancing productivity and user experience:

  • Audiobook Production: TTS can automate the conversion of written content into audiobooks, saving time and resources while serving audiences who prefer audio content.
  • Accessibility Compliance: Businesses can make their digital content accessible to individuals with visual impairments by using TTS to convert text into spoken words, helping websites and documents comply with accessibility regulations.
  • Interactive Voice Response (IVR) Systems: TTS is vital for creating natural-sounding voice prompts in IVR systems, enhancing customer service by providing automated but human-like interactions, such as call routing and information retrieval.
  • Virtual Assistants and Chatbots: Integrating TTS into virtual assistants and chatbots allows businesses to provide personalized and engaging interactions with users, whether on websites or through messaging apps.
  • Enhanced Product Demonstrations: Sales teams can use TTS to create audio-enhanced product demonstrations or tutorials. This makes it easier for potential customers to understand product features and benefits, leading to more informed purchase decisions.

Capabilities

  • Synchronous API: Text to Speech supports a synchronous API over HTTPS. You send text input and receive audio in the response.
  • Multiple Output Formats: Text to Speech can generate output in PCM, MP3, OGG, and JSON formats.
  • Standard and Natural Voices: Text to Speech offers male and female voices in both standard and natural (human-like) variants.
  • Chunk Streaming Support: Text to Speech supports chunked transfer encoding over HTTPS. You send a request with input text and receive the audio output in chunks, which reduces latency on the client side.
  • Speech Synthesis Markup Language (SSML): You can send SSML in your Text to Speech request to customize the audio response, for example by specifying pauses and how acronyms, dates, times, and abbreviations are spoken.
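
With chunked transfer encoding, a client can start handling audio before the full response has arrived. The following is a minimal Python sketch of that consumption loop. The helper name consume_audio_chunks is hypothetical, and the in-memory list only stands in for a real HTTPS response body, because the service endpoint and request format are not described in this section.

```python
import io


def consume_audio_chunks(chunks, sink):
    """Write audio chunks to sink as they arrive and return the byte count."""
    total = 0
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks so the sink only receives real audio data.
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a chunked response body (hypothetical data, not real audio).
fake_response = [b"abcd", b"", b"efg"]
buffer = io.BytesIO()
print(consume_audio_chunks(fake_response, buffer))  # 7
```

In a real client the sink could be a file or an audio player buffer, so playback can begin while later chunks are still in transit.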

SSML Tags

<speak>

The SSML root tag. All SSML-enhanced text must be enclosed within a pair of <speak> tags. Both natural and standard voices available.

Example:

<speak> This is the root tag for SSML. </speak>
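
SSML is an XML-based format, so plain text that contains characters such as & or < needs escaping before it is wrapped in <speak>. The following is a minimal Python sketch, assuming the SSML string is built on the client. The helper name to_ssml is hypothetical and not part of the Text to Speech API.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root element."""
    # escape() replaces &, <, and > so the text cannot break the markup.
    return f"<speak>{escape(text)}</speak>"


print(to_ssml("Terms & conditions apply."))
# <speak>Terms &amp; conditions apply.</speak>
```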

<break>

Add a pause in your message. Both natural and standard voices available.

<break> Attributes

Attribute   Value        Description
time        [number]s    The duration of the pause, in seconds.
            [number]ms   The duration of the pause, in milliseconds.
strength    none         No pause. Use none to remove a normally occurring pause, such as after a period. Equivalent to "0ms".
            x-weak       Has the same strength as none, no pause.
            weak         Sets a pause of the same duration as the pause after a comma. Equivalent to "150ms".
            medium       Has the same strength as weak.
            strong       Sets a pause of the same duration as the pause after a sentence. Equivalent to "400ms".
            x-strong     Sets a pause of the same duration as the pause after a paragraph. Equivalent to "800ms".

Example 1:

<speak>
    Close your eyes, take a deep breath <break time="1s"/>
    and let go of all the stress and worries.
    Feel the gentle breeze <break time="1500ms"/> as
    it caresses your skin, and listen to the
    soothing sounds of nature.
</speak>

Example 2:

<speak> 
    Let me give you a demonstration of the <break strength="x-strong"/> extra strong pause.
    Now, let's try a <break strength="strong"/> strong pause.
    Finally, we have a <break strength="weak"/> weak pause.
</speak>
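
The strength values and their equivalent durations from the attribute table can be captured in a small lookup. The following is a minimal Python sketch. The names STRENGTH_MS, break_tag, and pause_ms are hypothetical helpers for building SSML on the client, not part of the service.

```python
# Equivalent pause durations in milliseconds, taken from the strength table.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 0,     # same strength as none
    "weak": 150,
    "medium": 150,   # same strength as weak
    "strong": 400,
    "x-strong": 800,
}


def break_tag(strength=None, time_ms=None):
    """Return a <break/> tag with either a strength or an explicit duration."""
    if (strength is None) == (time_ms is None):
        raise ValueError("pass exactly one of strength or time_ms")
    if strength is not None:
        if strength not in STRENGTH_MS:
            raise ValueError(f"unknown strength: {strength!r}")
        return f'<break strength="{strength}"/>'
    if time_ms < 0:
        raise ValueError("time_ms must not be negative")
    return f'<break time="{time_ms}ms"/>'


def pause_ms(strength):
    """Look up the documented equivalent duration for a strength value."""
    return STRENGTH_MS[strength]


print(break_tag(strength="x-strong"))  # <break strength="x-strong"/>
print(break_tag(time_ms=1500))         # <break time="1500ms"/>
```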

<s>

Adds a pause between lines or sentences in the text. It has the same effect as ending a sentence with a period or inserting <break strength="strong"/>. Both natural and standard voices available.

Example:

<speak>
    <s>This is the first sentence</s>
    <s>This is the second sentence</s>
    This is the last sentence.
</speak>

<p>

Adds a pause at the end of each paragraph in the text. The pause is longer than the one native speakers usually place at a comma or at the end of a sentence. Both natural and standard voices available.

Example:

<speak>
    <p>Good morning, ladies and gentlemen. I would like to take this opportunity to welcome you all to our annual conference on artificial intelligence.</p>
    <p>Our keynote speaker for this event is Dr. Samantha Johnson, a renowned expert in machine learning and data analytics.</p>
</speak>

<say-as>

Specifies how certain characters, words, and numbers are spoken. Both natural and standard voices available.

Attribute      Value       Description
interpret-as   date        Interprets the contained text as a Gregorian calendar date. The format of the date must be specified with the format attribute. The date separator can be a forward slash (/), dash (-), or period (.). No white space is allowed inside a date string.
               time        Interprets the numerical text as a time, in hours, minutes, and seconds. The text must be in hour:min or hour:min:seconds form. Optionally, it can be followed by "A.M." or "P.M."; A.M. can also be written as AM, a.m., or am. Setting detail="1" instructs the SSML parser to output the time in 24-hour format, and detail="2" in 12-hour format.
               fraction    Interprets the numerical text as a fraction. It works for both common and mixed fractions.
               digits      Speaks each digit individually. For example, 1234 is spoken as 1-2-3-4.
               cardinal    Interprets the numerical text as a cardinal number.
               ordinal     Interprets the numerical text as an ordinal number. For example, 1 is interpreted as 1st, 2 as 2nd, and so on.
               spell-out   Speaks each character of the text enclosed in the say-as tag, including punctuation marks, special symbols, and spaces.
               unit        Interprets the numerical text as a measurement. The value must be either a number or a fraction, followed by a unit with no spaces.

Example:

<speak>
    <p>The say-as tag controls how special types of words are spoken, such as numbers, currencies, units, dates, times, and acronyms.</p>
    For example:
    I can speak acronym <say-as interpret-as="spell-out">IRFC</say-as> for Indian Railway Finance Corporation.
    I can speak India currency <say-as interpret-as="currency">₹5200</say-as>.
    I can speak US currency <say-as interpret-as="currency">$5200</say-as>.
    I can speak dimensions <say-as interpret-as="unit">5cm</say-as> length and <say-as interpret-as="unit">10cm</say-as> width.
    I can speak temperature <say-as interpret-as="unit">25°C</say-as>.
    I can speak fraction values <say-as interpret-as="fraction">3/4</say-as>.
    I can speak ordinals <say-as interpret-as="ordinal">1731</say-as> Rank.
    I can speak digits <say-as interpret-as="digits">1234</say-as> and <say-as interpret-as="digits">5678</say-as>.
    I can speak date <say-as interpret-as="date" format="ymd">2022-11-13</say-as> and time <say-as interpret-as="time">10:00 AM</say-as>.
</speak>
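
When say-as tags are generated from data, a small helper can catch an unknown interpret-as value or a date without a format before the request is sent. The following is a minimal Python sketch. The helper name say_as and the INTERPRET_AS set are hypothetical. The set mirrors the attribute table, plus currency, which appears in the example but is not listed in the table.

```python
from xml.sax.saxutils import escape, quoteattr

# Values from the attribute table; "currency" is taken from the example only.
INTERPRET_AS = {
    "date", "time", "fraction", "digits", "cardinal",
    "ordinal", "spell-out", "unit", "currency",
}


def say_as(text, interpret_as, **attrs):
    """Return a <say-as> element for text with the given interpretation."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    if interpret_as == "date" and "format" not in attrs:
        # The table states that dates need an explicit format attribute.
        raise ValueError("date requires a format attribute, such as 'ymd'")
    parts = [f"interpret-as={quoteattr(interpret_as)}"]
    parts += [f"{name}={quoteattr(value)}" for name, value in attrs.items()]
    return f"<say-as {' '.join(parts)}>{escape(text)}</say-as>"


print(say_as("2022-11-13", "date", format="ymd"))
# <say-as interpret-as="date" format="ymd">2022-11-13</say-as>
```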

<sub>

Used with the alias attribute to substitute a different word (or pronunciation) for selected text such as an acronym or abbreviation. Both natural and standard voices available.

Example:

<speak>
    My favorite chemical element is <sub alias="Mercury">Hg</sub>, because it looks so shiny.
</speak>

<phoneme>

Replaces the default pronunciation of a word with the phonemes specified in the ph attribute. Both natural and standard voices available.

Attribute   Value     Description
alphabet    ipa       Indicates that the International Phonetic Alphabet (IPA) will be used.
            x-sampa   Indicates that the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) will be used.
ph                    Specifies the phonemes for custom pronunciation.

Example:

<speak>
    Para is short for <phoneme alphabet="ipa" ph='pˈæɹəɡɹˌæf'>para</phoneme>. 
</speak>

<prosody>

Controls the prosody of the enclosed text, that is, its patterns of stress and intonation, through the rate, volume, and pitch attributes. Only standard voices are available.

Attribute   Value     Description
rate        "X%"      Controls the speed of speech. The percentage must be less than 100%, and the increase or decrease is relative to the default speaking rate. X denotes an increase (+X%) or decrease (-X%) in the rate.
            default   Default speaking rate.
            x-slow    Very slow speaking rate.
            slow      Slow speaking rate.
            medium    Medium speaking rate. This is the default speaking rate.
            fast      Fast speaking rate.
            x-fast    Very fast speaking rate.
volume      "XdB"     Controls the volume of the speech. The value does not set a fixed volume; it changes the volume relative to the current volume. X can be positive or negative, depending on whether you want to increase or decrease the volume.
            default   Default volume.
            x-soft    Very low volume, approximately 12 dB lower than the default.
            soft      Low volume, approximately 6 dB lower than the default.
            medium    Medium volume. This is the default volume.
            loud      Loud volume, approximately 6 dB higher than the default.
            x-loud    Very loud volume, approximately 12 dB higher than the default.
pitch       default   Default pitch.
            x-low     Very low pitch.
            low       Low pitch.
            medium    Medium pitch. This is the default pitch.
            high      High pitch.
            x-high    Very high pitch.

Example 1:

<speak>
    <prosody rate="0%">This is the default speaking rate.</prosody> 
    <prosody rate="-50%">Decrease the speaking rate by half the default rate.</prosody> 
    <prosody rate="+50%">Increase the speaking rate by fifty percent of the default rate.</prosody>
</speak>

Example 2:

<speak>
    <p>
        <s>Hi, this is a normal sentence.</s>
        <s>
            <prosody volume="+10dB">This is a louder sentence!</prosody>
        </s> 
        <s>
            <prosody volume="-8dB">This is a quieter sentence.</prosody>
        </s>
    </p>
</speak>

Example 3:

<speak>
    <prosody pitch='default'>This is the default pitch.</prosody>
    <prosody pitch='medium'>This is the default pitch.</prosody> 
    <prosody pitch='x-low'>This is the very low pitch.</prosody> 
    <prosody pitch='x-high'>This is the very high pitch.</prosody>
</speak>
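
The relative rate and volume values follow a signed format ("+50%", "-8dB") that is easy to get wrong when it is built by hand. The following is a minimal Python sketch of a formatter. The helper name prosody is hypothetical, and reading "less than 100%" as a limit on the size of the change in either direction is an assumption.

```python
from xml.sax.saxutils import escape


def prosody(text, rate_percent=None, volume_db=None):
    """Wrap text in a <prosody> tag with a relative rate and/or volume change."""
    attrs = []
    if rate_percent is not None:
        # Assumption: the 100% limit applies to increases and decreases alike.
        if abs(rate_percent) >= 100:
            raise ValueError("rate change must be smaller than 100%")
        # Zero is written without a sign, matching rate="0%" in Example 1.
        rate = "0%" if rate_percent == 0 else f"{rate_percent:+d}%"
        attrs.append(f'rate="{rate}"')
    if volume_db is not None:
        attrs.append(f'volume="{volume_db:+d}dB"')
    if not attrs:
        raise ValueError("set rate_percent, volume_db, or both")
    return f"<prosody {' '.join(attrs)}>{escape(text)}</prosody>"


print(prosody("This is a quieter sentence.", volume_db=-8))
# <prosody volume="-8dB">This is a quieter sentence.</prosody>
```

The sketch accepts integers only; named values such as x-slow or loud can be written into the attribute directly.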

<voice>

Allows you to use multiple voices in a single SSML request. Both natural and standard voices available.

Example:

<speak>
    <voice name="Bob">Hello Cindy, how are you doing?</voice>
    <voice name="Cindy">Hello Bob, I am good, thank you.</voice>
    <voice name="Bob">Hope you enjoyed your stay with us.</voice>
    <voice name="Cindy">Yes, it was lovely. I enjoyed the food and the services a lot. Thank you for hosting me. I would love to be back sometime soon.</voice>
</speak>
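
A multi-voice request like the one above can be generated from a list of speaker turns. The following is a minimal Python sketch. The helper name dialogue_to_ssml is hypothetical, and the voice names are the ones used in the example, so check which names your service actually offers.

```python
from xml.sax.saxutils import escape, quoteattr


def dialogue_to_ssml(turns):
    """Build one SSML document from (voice_name, text) pairs."""
    lines = ["<speak>"]
    for name, text in turns:
        # quoteattr() quotes the voice name; escape() protects the spoken text.
        lines.append(f"    <voice name={quoteattr(name)}>{escape(text)}</voice>")
    lines.append("</speak>")
    return "\n".join(lines)


print(dialogue_to_ssml([
    ("Bob", "Hello Cindy, how are you doing?"),
    ("Cindy", "Hello Bob, I am good, thank you."),
]))
```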