SSML (Speech Synthesis Markup Language) is an XML-based markup that gives developers precise control over how text is converted to speech. With SSML you can shape tone, pacing, pronunciation, and other delivery aspects so synthesized audio sounds more natural and expressive.
Core SSML capabilities
Speaking styles — set the voice’s tone or emotion (for example: cheerful, excited, empathetic).
Pauses and silence — insert breaks or delays to control pacing and rhythm.
Phonemes — define custom pronunciations for technical terms, names, or nonstandard words.
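As a sketch, the three capabilities above can be combined in a single SSML fragment. The voice name and the IPA transcription below are illustrative values, not prescriptions; the fragment is validated as well-formed XML with the standard library before it would be handed to a synthesizer.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML combining a speaking style, a pause, and a phoneme.
# The voice name and IPA string are example values chosen for this sketch.
ssml = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">Welcome back!</mstts:express-as>
    <break time="500ms"/>
    Today we cover
    <phoneme alphabet="ipa" ph="ˈkuːbərˌnɛtiːz">Kubernetes</phoneme>.
  </voice>
</speak>"""

# Confirm the markup is well-formed XML before sending it to a synthesizer.
root = ET.fromstring(ssml)
print(root.tag)
```

Parsing locally first is a cheap way to catch unclosed tags or unescaped characters before paying for a synthesis call.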
Additional expressive features
Prosody adjustments — change pitch, rate, and volume to create a more dynamic delivery.
Say-as formatting — control how numbers, dates, times, phone numbers, and other tokens are spoken (for example, as a year, ordinal, or telephone number).
Embedded audio — insert pre-recorded audio or background music for branding or effects.
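These expressive features can likewise be sketched in one fragment. The audio URL below is a placeholder and the prosody values are arbitrary examples; as above, the fragment is checked for well-formedness with the standard library.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: prosody tuning, say-as token formatting, and
# embedded audio. The audio src is a placeholder, not a real asset.
ssml = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <voice name="en-US-JennyNeural">
    <prosody rate="-10%" pitch="+5%" volume="+10%">
      Your appointment is on
      <say-as interpret-as="date" format="mdy">10/19/2025</say-as>
      at <say-as interpret-as="time" format="hms12">2:30pm</say-as>.
    </prosody>
    <audio src="https://example.com/chime.wav">chime</audio>
  </voice>
</speak>"""

root = ET.fromstring(ssml)
# Count the say-as elements to confirm the structure parsed as expected.
ns = "{http://www.w3.org/2001/10/synthesis}"
print(len(root.findall(f".//{ns}say-as")))
```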
Authoring and previewing SSML in Speech Studio

You can author and preview SSML directly in the browser with Azure Speech Studio. The UI helps configure voice selection, pronunciation rules, rate, pitch, and volume, then lets you export the resulting SSML for programmatic use.
Speech Studio’s real-time preview functionality is supported in Edge and Chrome. If you use other browsers (for example, Opera), some preview features may not work as expected.
When you export SSML from Speech Studio you may see metadata comments followed by the SSML itself. Example exported SSML with metadata:
```xml
<!--ID=B7267351-473F-409D-9765-754A8EBCDDE05;Version=1|{"VoiceNameToldMapItems":[{"Id":"6c640df5-9977-4a98-b785-6b2f195db0e3c","Name":"Microsoft Server Speech Text to Speech Voice (de-DE, SeraphinaMultilingualNeural)","ShortName":"de-DE-SeraphinaMultilingualNeural","Locale":"de-DE","VoiceType":"StandardVoice"}]}-->
<!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67B0E7D1;Version=1|{"Files":[]}-->
<!--ID=5B95B1CC-2C7B-494F-B746-CF22A0E779B7;Version=1|{"Locales":{"de-DE":{"AutoApplyCustomLexiconFiles":[]}}}-->
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="de-DE">
  <voice name="de-DE-SeraphinaMultilingualNeural"> </voice>
</speak>
```
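If your application only needs the `<speak>` element itself, a minimal sketch for discarding the metadata is to slice the export at the first `<speak` tag. This assumes (as in the export above) that the metadata comments always appear strictly before the `<speak>` element; the sample string here is a shortened stand-in, not a real export.

```python
# Minimal sketch: strip Speech Studio's leading metadata comments from an
# exported document, keeping only the <speak>...</speak> payload.
# Assumes the metadata comments precede the <speak> element.
def extract_speak(exported: str) -> str:
    start = exported.find("<speak")
    if start == -1:
        raise ValueError("no <speak> element found in export")
    return exported[start:]

# Shortened stand-in for an exported document.
exported = '<!--ID=...;Version=1|{"Files":[]}--><speak version="1.0"></speak>'
print(extract_speak(exported))
```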
SSML from code — Python example using the Azure Speech SDK

When synthesizing SSML programmatically with the Azure Speech SDK, call the SSML-specific method (for example, speak_ssml_async) instead of plain-text APIs. The Python example below demonstrates creating a SpeechSynthesizer and synthesizing expressive SSML:
```python
import azure.cognitiveservices.speech as speechsdk

# Replace with your subscription key and service region
speech_key = "YourSubscriptionKey"
service_region = "YourServiceRegion"

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = """<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      <prosody rate="-10%" pitch="+4%">Welcome to the AI-102: Microsoft Certified Azure AI Engineer Associate course, your gateway to building smart apps with Azure AI.</prosody>
    </mstts:express-as>
    <break time="300ms"/>
    <prosody rate="-12%">From computer vision... to chatbots... we'll cover it all.</prosody>
    <break time="400ms"/>
    <mstts:express-as style="excited">
      <prosody rate="-10%">Let's get started - and level up your AI skills.</prosody>
    </mstts:express-as>
  </voice>
</speak>"""

# Speak the SSML content
result = speech_synthesizer.speak_ssml_async(ssml).get()

# Check the result
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized and played through speaker.")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Speech synthesis canceled: {cancellation.reason}")
    if cancellation.reason == speechsdk.CancellationReason.Error:
        print(f"Error details: {cancellation.error_details}")
```
Sample run
```
$ python3 app_ssml.py
Speech synthesized and played through speaker.
```
Implementation notes and best practices
Use express-as (or provider-specific equivalents) to apply emotional or speaking styles (cheerful, excited, empathetic, etc.).
Use prosody to fine-tune rate, pitch, and volume. Negative rate values slow speech; positive values speed it up.
Use break to add pauses for natural pacing.
Use phoneme tags to force correct pronunciations for technical terms and names.
Export SSML from Speech Studio to iterate quickly in the UI, then integrate the SSML into your application code.
Always test SSML playback on target platforms and browsers, as preview features and supported styles may vary.
Tip: When programmatically synthesizing SSML, prefer SSML-specific synthesis methods (for example, speak_ssml_async in the Azure Speech SDK) to ensure the markup is interpreted correctly.
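One practical detail when building SSML dynamically: SSML is XML, so user-supplied text containing characters like `&` or `<` must be escaped before it is embedded in the markup. A small sketch of such a helper is shown below; the function name and defaults are my own, not part of the Azure Speech SDK.

```python
from xml.sax.saxutils import escape

# Hypothetical helper (not part of the Azure Speech SDK): wraps plain text
# in a minimal SSML document, escaping XML-special characters first.
def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               lang: str = "en-US") -> str:
    return (
        f'<speak version="1.0" xml:lang="{lang}" '
        f'xmlns="http://www.w3.org/2001/10/synthesis">'
        f'<voice name="{voice}">{escape(text)}</voice></speak>'
    )

# The "&" is escaped to "&amp;" so the document stays well-formed.
# Pass the result to an SSML-aware method such as speak_ssml_async.
print(build_ssml("Fish & chips"))
```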
Conclusion

SSML enables precise control over voice, timing, pronunciation, and emotion, helping you craft natural, expressive speech for accessibility, conversational interfaces, voice-enabled apps, and branded audio experiences. You can author SSML in code, export it from Azure Speech Studio, or use the SDKs to synthesize SSML directly in your application.
Now that you know how to convert text to speech and enhance it with SSML, begin applying these techniques to build conversational and accessible experiences in your applications.