SSML (Speech Synthesis Markup Language) is an XML-based markup that gives developers precise control over how text is converted to speech. With SSML you can shape tone, pacing, pronunciation, and other delivery aspects so synthesized audio sounds more natural and expressive.
Core SSML capabilities
Speaking styles — set the voice’s tone or emotion (for example: cheerful, excited, empathetic).
Pauses and silence — insert breaks or delays to control pacing and rhythm.
Phonemes — define custom pronunciations for technical terms, names, or nonstandard words.
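As a sketch, the three capabilities above can be combined in a single SSML fragment. The voice name and the IPA transcription below are illustrative values, not prescriptions; the fragment is validated as well-formed XML with the standard library before it would be handed to a synthesizer.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML combining a speaking style, a pause, and a phoneme.
# The voice name and IPA string are example values chosen for this sketch.
ssml = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">Welcome back!</mstts:express-as>
    <break time="500ms"/>
    Today we cover
    <phoneme alphabet="ipa" ph="ˈkuːbərˌnɛtiːz">Kubernetes</phoneme>.
  </voice>
</speak>"""

# Confirm the markup is well-formed XML before sending it to a synthesizer.
root = ET.fromstring(ssml)
print(root.tag)
```

Parsing locally first is a cheap way to catch unclosed tags or unescaped characters before paying for a synthesis call.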
Additional expressive features
Prosody adjustments — change pitch, rate, and volume to create a more dynamic delivery.
Say-as formatting — control how numbers, dates, times, phone numbers, and other tokens are spoken (for example, as a year, ordinal, or telephone number).
Embedded audio — insert pre-recorded audio or background music for branding or effects.
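These expressive features can likewise be sketched in one fragment. The audio URL below is a placeholder and the prosody values are arbitrary examples; as above, the fragment is checked for well-formedness with the standard library.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: prosody tuning, say-as token formatting, and
# embedded audio. The audio src is a placeholder, not a real asset.
ssml = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <voice name="en-US-JennyNeural">
    <prosody rate="-10%" pitch="+5%" volume="+10%">
      Your appointment is on
      <say-as interpret-as="date" format="mdy">10/19/2025</say-as>
      at <say-as interpret-as="time" format="hms12">2:30pm</say-as>.
    </prosody>
    <audio src="https://example.com/chime.wav">chime</audio>
  </voice>
</speak>"""

root = ET.fromstring(ssml)
# Count the say-as elements to confirm the structure parsed as expected.
ns = "{http://www.w3.org/2001/10/synthesis}"
print(len(root.findall(f".//{ns}say-as")))
```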
Authoring and previewing SSML in Speech Studio

You can author and preview SSML directly in the browser with Azure Speech Studio. The UI helps configure voice selection, pronunciation rules, rate, pitch, and volume, then lets you export the resulting SSML for programmatic use.
Speech Studio’s real-time preview functionality is supported in Edge and Chrome. If you use other browsers (for example, Opera), some preview features may not work as expected.
When you export SSML from Speech Studio you may see metadata comments followed by the SSML itself. Example exported SSML with metadata:
```xml
<!--ID=B7267351-473F-409D-9765-754A8EBCDDE05;Version=1|{"VoiceNameToldMapItems":[{"Id":"6c640df5-9977-4a98-b785-6b2f195db0e3c","Name":"Microsoft Server Speech Text to Speech Voice (de-DE, SeraphinaMultilingualNeural)","ShortName":"de-DE-SeraphinaMultilingualNeural","Locale":"de-DE","VoiceType":"StandardVoice"}]}-->
<!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67B0E7D1;Version=1|{"Files":[]}-->
<!--ID=5B95B1CC-2C7B-494F-B746-CF22A0E779B7;Version=1|{"Locales":{"de-DE":{"AutoApplyCustomLexiconFiles":[]}}}-->
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="de-DE">
  <voice name="de-DE-SeraphinaMultilingualNeural"> </voice>
</speak>
```
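If your application only needs the `<speak>` element itself, a minimal sketch for discarding the metadata is to slice the export at the first `<speak` tag. This assumes (as in the export above) that the metadata comments always appear strictly before the `<speak>` element; the sample string here is a shortened stand-in, not a real export.

```python
# Minimal sketch: strip Speech Studio's leading metadata comments from an
# exported document, keeping only the <speak>...</speak> payload.
# Assumes the metadata comments precede the <speak> element.
def extract_speak(exported: str) -> str:
    start = exported.find("<speak")
    if start == -1:
        raise ValueError("no <speak> element found in export")
    return exported[start:]

# Shortened stand-in for an exported document.
exported = '<!--ID=...;Version=1|{"Files":[]}--><speak version="1.0"></speak>'
print(extract_speak(exported))
```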
SSML from code — Python example using the Azure Speech SDK

When synthesizing SSML programmatically with the Azure Speech SDK, call the SSML-specific method (for example, speak_ssml_async) instead of plain-text APIs. The Python example below demonstrates creating a SpeechSynthesizer and synthesizing expressive SSML:
```python
import azure.cognitiveservices.speech as speechsdk

# Replace with your subscription key and service region
speech_key = "YourSubscriptionKey"
service_region = "YourServiceRegion"

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = """<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      <prosody rate="-10%" pitch="+4%">Welcome to the AI-102: Microsoft Certified Azure AI Engineer Associate course, your gateway to building smart apps with Azure AI.</prosody>
    </mstts:express-as>
    <break time="300ms"/>
    <prosody rate="-12%">From computer vision... to chatbots... we'll cover it all.</prosody>
    <break time="400ms"/>
    <mstts:express-as style="excited">
      <prosody rate="-10%">Let's get started - and level up your AI skills.</prosody>
    </mstts:express-as>
  </voice>
</speak>"""

# Speak the SSML content
result = speech_synthesizer.speak_ssml_async(ssml).get()

# Check the result
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized and played through speaker.")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Speech synthesis canceled: {cancellation.reason}")
    if cancellation.reason == speechsdk.CancellationReason.Error:
        print(f"Error details: {cancellation.error_details}")
```
Sample run
```
$ python3 app_ssml.py
Speech synthesized and played through speaker.
```
Implementation notes and best practices
Use express-as (or provider-specific equivalents) to apply emotional or speaking styles (cheerful, excited, empathetic, etc.).
Use prosody to fine-tune rate, pitch, and volume. Negative rate values slow speech; positive values speed it up.
Use break to add pauses for natural pacing.
Use phoneme tags to force correct pronunciations for technical terms and names.
Export SSML from Speech Studio to iterate quickly in the UI, then integrate the SSML into your application code.
Always test SSML playback on target platforms and browsers, as preview features and supported styles may vary.
Tip: When programmatically synthesizing SSML, prefer SSML-specific synthesis methods (for example, speak_ssml_async in the Azure Speech SDK) to ensure the markup is interpreted correctly.
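One practical detail when building SSML dynamically: SSML is XML, so user-supplied text containing characters like `&` or `<` must be escaped before it is embedded in the markup. A small sketch of such a helper is shown below; the function name and defaults are my own, not part of the Azure Speech SDK.

```python
from xml.sax.saxutils import escape

# Hypothetical helper (not part of the Azure Speech SDK): wraps plain text
# in a minimal SSML document, escaping XML-special characters first.
def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               lang: str = "en-US") -> str:
    return (
        f'<speak version="1.0" xml:lang="{lang}" '
        f'xmlns="http://www.w3.org/2001/10/synthesis">'
        f'<voice name="{voice}">{escape(text)}</voice></speak>'
    )

# The "&" is escaped to "&amp;" so the document stays well-formed.
# Pass the result to an SSML-aware method such as speak_ssml_async.
print(build_ssml("Fish & chips"))
```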
Conclusion

SSML enables precise control over voice, timing, pronunciation, and emotion, helping you craft natural, expressive speech for accessibility, conversational interfaces, voice-enabled apps, and branded audio experiences. You can author SSML in code, export it from Azure Speech Studio, or use the SDKs to synthesize SSML directly in your application.
Now that you know how to convert text to speech and enhance it with SSML, begin applying these techniques to build conversational and accessible experiences in your applications.