Speech-to-Text (STT)
- Converts spoken audio into written text.
- Common uses: transcriptions, live captions, voice commands, and conversational logging.
- Supports real-time streaming and batch transcription modes.
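For batch or server-side use without an SDK, a short-audio REST call can be sketched as follows. This is a sketch only: it constructs the request but does not send it, the endpoint shape and header names follow the public speech-to-text REST documentation, and the region and key values are placeholders.

```javascript
// Sketch: build (but do not send) a request to the speech-to-text REST API
// for short audio. Region and subscription key are placeholders; verify the
// endpoint against the current API reference before relying on it.
function buildSttRequest(region, subscriptionKey, language) {
  const url =
    `https://${region}.stt.speech.microsoft.com` +
    `/speech/recognition/conversation/cognitiveservices/v1` +
    `?language=${encodeURIComponent(language)}`;
  return {
    url,
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
      Accept: "application/json",
    },
    // body: the raw WAV audio bytes would go here
  };
}

const sttRequest = buildSttRequest("westus", "<your-key>", "en-US");
console.log(sttRequest.url);
```

Sending the built request (for example with `fetch`) returns a JSON result containing the recognized text when the audio is valid and the key is authorized.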
Text-to-Speech (TTS)
- Synthesizes natural-sounding audio from text.
- Supports customizable voices, speaking styles, and SSML for fine-grained control.
- Useful for accessibility, IVR systems, and spoken responses in assistants.
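The SSML support mentioned above can be illustrated with a small helper that assembles an SSML document. A minimal sketch: the voice name `en-US-JennyNeural` is just one example of a neural voice, and the rate value shows basic prosody control.

```javascript
// Sketch: build an SSML document for speech synthesis. The voice name is an
// example; substitute any voice available in your Speech resource's region.
function buildSsml(voiceName, rate, text) {
  return (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
    `<voice name="${voiceName}">` +
    `<prosody rate="${rate}">${text}</prosody>` +
    "</voice></speak>"
  );
}

const ssml = buildSsml("en-US-JennyNeural", "-10%", "Hello world");
console.log(ssml);
```

The resulting string can be passed to a synthesis call (SDK or REST) in place of plain text to control voice selection and prosody.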
Speech Translation
- Performs real-time translation of spoken language into another language (text or synthesized audio).
- Ideal for multilingual conversations in travel, customer support, and meetings.
Speaker Recognition
- Identifies or verifies an individual by their voice (speaker identification and verification).
- Used in biometric authentication, user personalization, and audit trails.
Intent Recognition
- Extracts user intent and entities from spoken input.
- Often combined with language understanding models such as Conversational Language Understanding (CLU) or LUIS to power voice assistants and conversational agents.
| Capability | Primary Use Cases | Quick Link |
|---|---|---|
| Speech-to-Text | Transcription, captions, voice commands | https://learn.microsoft.com/azure/cognitive-services/speech-service/ |
| Text-to-Speech | Spoken responses, accessibility, IVR | https://learn.microsoft.com/azure/cognitive-services/speech-service/ |
| Speech Translation | Real-time multilingual conversations | https://learn.microsoft.com/azure/cognitive-services/speech-service/ |
| Speaker Recognition | Biometric verification, personalization | https://learn.microsoft.com/azure/cognitive-services/speech-service/ |
| Intent Recognition | Voice-driven conversational agents | https://learn.microsoft.com/azure/cognitive-services/language-service/conversational-language-understanding/overview |
There are two main ways to integrate the service:
- Speech SDKs for platforms such as Windows, macOS, Linux, iOS, Android, and JavaScript (recommended for low-latency, real-time streaming).
- REST APIs for batch processing, server-side integration, or scenarios where an SDK is not available.
Use the Speech SDK for low-latency, real-time scenarios (streaming recognition and synthesis). For batch transcription, file-based workflows, or simple server-side integrations, the REST APIs are often the most convenient choice.
Getting started (high level)
- Create a Speech resource in the Azure portal or obtain an endpoint and API key from an existing Cognitive Services or Speech resource.
- Choose SDK vs REST based on your scenario: SDK for streaming, REST for batch or server-to-server.
- Implement authentication (shared key or Azure AD) and configure region/endpoint.
- Start with a small integration (transcribe a test audio file or synthesize “Hello world”) and iterate to add custom voices, intent models, or translation.
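The "synthesize Hello world" starter step above can be sketched as a text-to-speech REST request that is constructed but not sent. The endpoint and header names follow the public Speech REST documentation; the region, key, voice name, and output format below are placeholder assumptions.

```javascript
// Sketch: construct (but do not send) a text-to-speech REST request.
// Region and key are placeholders; swap in your own resource values.
function buildTtsRequest(region, subscriptionKey, ssml) {
  return {
    url: `https://${region}.tts.speech.microsoft.com/cognitiveservices/v1`,
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": subscriptionKey,
      "Content-Type": "application/ssml+xml",
      // One of the documented audio output formats:
      "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
    },
    body: ssml,
  };
}

const ttsRequest = buildTtsRequest(
  "westus",
  "<your-key>",
  '<speak version="1.0" xml:lang="en-US">' +
    '<voice name="en-US-JennyNeural">Hello world</voice></speak>'
);
console.log(ttsRequest.url);
```

Posting this request with valid credentials returns the synthesized audio bytes in the requested output format.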
Minimal examples
JavaScript (Speech SDK) — Real-time recognition

Best practices
- For real-time interactive apps (voice assistants, live captions), prefer the Speech SDK to minimize latency and benefit from built-in audio management.
- Use SSML to control prosody, pronunciation, and voice selection for higher-quality synthesized speech.
- For sensitive use cases (authentication, verification), use secure key management and consider Azure AD authentication and role-based access.
- Evaluate model costs and latency trade-offs when choosing between streaming and batch transcription.
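The real-time recognition example named under Minimal examples can be sketched with the JavaScript Speech SDK as follows. A one-shot sketch, not a full sample: it assumes the `microsoft-cognitiveservices-speech-sdk` npm package is installed, a default microphone is available, and the key and region values (placeholders below) belong to a real Speech resource.

```javascript
// Sketch: one-shot speech recognition from the default microphone using the
// JavaScript Speech SDK. Requires:
//   npm install microsoft-cognitiveservices-speech-sdk
// and a valid key/region; "<your-key>" and "<your-region>" are placeholders.
const sdk = require("microsoft-cognitiveservices-speech-sdk");

const speechConfig = sdk.SpeechConfig.fromSubscription("<your-key>", "<your-region>");
speechConfig.speechRecognitionLanguage = "en-US";

const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

// Recognize a single utterance. For continuous streaming (live captions,
// assistants), use startContinuousRecognitionAsync and the `recognized`
// event instead.
recognizer.recognizeOnceAsync(
  (result) => {
    if (result.reason === sdk.ResultReason.RecognizedSpeech) {
      console.log(`Recognized: ${result.text}`);
    } else {
      console.log("No speech recognized.");
    }
    recognizer.close();
  },
  (err) => {
    console.error(err);
    recognizer.close();
  }
);
```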
Links and references
- Speech SDK: https://learn.microsoft.com/azure/cognitive-services/speech-service/speech-sdk
- Speech REST APIs: https://learn.microsoft.com/azure/cognitive-services/speech-service/rest-apis
- Conversational Language Understanding (CLU): https://learn.microsoft.com/azure/cognitive-services/language-service/conversational-language-understanding/overview
- LUIS overview: https://learn.microsoft.com/azure/cognitive-services/luis/overview