Speech to Text

Swiftly convert audio to text for natural responsiveness. The Speech to Text API is part of the Speech services.

Speech transcription

Convert spoken audio to text. Call the API to recognise audio coming from the microphone, from other real-time streaming audio sources or from a recorded audio file. As audio is sent to the server, partial recognition results are returned if requested.
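As a rough illustration of how partial results relate to the final result, here is a minimal sketch in Python. It does not call the Speech to Text service: `recognise_stream` and `RecognitionEvent` are hypothetical names invented for this example, and each "chunk" is a word rather than real audio so that the code runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RecognitionEvent:
    kind: str  # "partial" or "final"
    text: str


def recognise_stream(chunks: Iterable[str],
                     want_partials: bool = True) -> Iterator[RecognitionEvent]:
    """Hypothetical stand-in for a streaming recogniser.

    A real client would send audio frames to the server; here each chunk
    is already a word, which keeps the example self-contained.
    """
    words = []
    for chunk in chunks:
        words.append(chunk)
        if want_partials:
            # Partial hypothesis: everything recognised so far.
            yield RecognitionEvent("partial", " ".join(words))
    # Final result, emitted once the stream has ended.
    yield RecognitionEvent("final", " ".join(words))


events = list(recognise_stream(["what", "is", "the", "weather"]))
for event in events:
    print(event.kind, "->", event.text)
```

Passing `want_partials=False` yields only the final event, which mirrors the idea that partial results are returned only if requested.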

You can use the API to build voice-triggered smart apps. Try the demo to see how it works. Select your target language, then click on the microphone and start speaking. Or simply click on one of the sample speech phrases.*

See it in action

To try out the demo with your own voice using a microphone, please change to a different browser with WebRTC support, for example a recent version of Microsoft Edge, Firefox or Chrome.

Custom speech service: Speech Transcription with Custom Model

Overcome speech recognition barriers such as speaking style, vocabulary, and background noise. Our speech recognition technologies combine multiple APIs to produce the text output. Customers can customise the APIs to their needs and available data.

Custom Speech

Create custom language models tailored to users’ speaking styles

Do not let varied vocabularies and speaking styles block understanding. Customise the language model of your app’s speech recognition by tailoring it to your industry’s expressions, to technical, geographic or market terms, and even to speaker style.

Adapt to user environment with custom acoustic models

Make sure your app’s speech recognition can function in all environments. With custom acoustic models, you can account for background noise and match your users’ expected environments.

Use robust speech models from Microsoft

Enable powerful, personalised speech recognition by building your own customised speech recognition models on top of Microsoft’s existing state-of-the-art models.

Explore a speech scenario

Intelligent kiosk

Speech services combined with Language Understanding enable apps and users to interact naturally. Use Speech to Text to capture a user’s question, Language Understanding to parse intent and formulate an appropriate reply, and Text to Speech to synthesise the text into a spoken response. Create conversational interfaces for various scenarios like banking, travel and entertainment.
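The three stages described above can be sketched as a simple pipeline. This is an illustrative outline only: every function below is a hypothetical placeholder rather than a real Speech or Language Understanding call, and the "audio" is plain UTF-8 bytes so the example runs without any service.

```python
def speech_to_text(audio: bytes) -> str:
    # Placeholder: treat the audio bytes as UTF-8 text.
    return audio.decode("utf-8")


def parse_intent(utterance: str) -> str:
    # Placeholder for intent parsing: naive keyword matching.
    text = utterance.lower()
    if "balance" in text:
        return "check_balance"
    if "flight" in text:
        return "flight_status"
    return "unknown"


# Hypothetical canned replies, one per intent.
REPLIES = {
    "check_balance": "Your balance is shown on screen.",
    "flight_status": "Your flight is on time.",
    "unknown": "Sorry, I did not understand that.",
}


def text_to_speech(text: str) -> bytes:
    # Placeholder: return the reply as bytes instead of synthesised audio.
    return text.encode("utf-8")


def handle_turn(audio: bytes) -> bytes:
    """One kiosk turn: capture the question, parse intent, speak the reply."""
    utterance = speech_to_text(audio)
    intent = parse_intent(utterance)
    return text_to_speech(REPLIES[intent])


print(handle_turn(b"What is my balance?").decode("utf-8"))
```

Keeping each stage behind its own function means a stub can later be swapped for a real service call without changing `handle_turn`.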


Commerce chatbot


Together, the Azure Bot Service and Language Understanding service enable developers to create conversational interfaces for various scenarios like banking, travel and entertainment. For example, a hotel’s concierge can use a bot to enhance traditional e-mail and phone call interactions by validating a customer via Azure Active Directory and using Cognitive Services to better contextually process customer requests using text and voice. The Speech recognition service can be added to support voice commands.


  1. Customer uses your mobile app
  2. Using Azure AD B2C, the user authenticates
  3. Using the custom Application Bot, the user requests information
  4. Cognitive Services helps process the natural language request
  5. Response is reviewed by the customer, who can refine the question using natural conversation
  6. Once the user is happy with the results, the Application Bot updates the customer’s reservation
  7. Application Insights gathers runtime telemetry to help development with Bot performance and usage
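The flow above can be sketched as a small session object. This is a simplified model under stated assumptions, not the Azure Bot Service API: `BotSession` and its methods are hypothetical, authentication is reduced to a boolean, and telemetry is just a list of event names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BotSession:
    user: Optional[str] = None
    reservation: Dict[str, bool] = field(default_factory=dict)
    telemetry: List[str] = field(default_factory=list)

    def authenticate(self, user: str, token_valid: bool) -> bool:
        # Step 2: stand-in for an Azure AD B2C sign-in.
        self.telemetry.append("auth")
        if token_valid:
            self.user = user
        return token_valid

    def ask(self, text: str) -> Dict[str, object]:
        # Steps 3-4: the user asks; a keyword check stands in for
        # natural language processing.
        if self.user is None:
            raise PermissionError("authenticate first")
        self.telemetry.append("request")
        wants_late = "late" in text.lower()
        intent = "change_checkout" if wants_late else "unknown"
        return {"intent": intent, "late_checkout": wants_late}

    def confirm(self, proposal: Dict[str, object]) -> None:
        # Step 6: apply the change once the user accepts the proposal.
        self.telemetry.append("update")
        if proposal["intent"] == "change_checkout":
            self.reservation["late_checkout"] = bool(proposal["late_checkout"])


session = BotSession()
session.authenticate("alice", token_valid=True)
proposal = session.ask("Can I get a late checkout?")  # step 5: user reviews this
session.confirm(proposal)
print(session.reservation, session.telemetry)  # step 7: telemetry collected
```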

Explore the Cognitive Services APIs

Computer Vision

Distill actionable information from images


Face

Detect, identify, analyse, organise, and tag faces in photos

Video Indexer

Unlock video insights

Content Moderator

Automated image, text and video moderation

Custom Vision

Easily customise your own state-of-the-art computer vision models for your unique use case

Text Analytics

Easily evaluate sentiment and topics to understand what users want

Translator Text

Easily conduct machine translation with a simple REST API call

Bing Spell Check

Detect and correct spelling mistakes in your app

QnA Maker

Distill information into conversational, easy-to-navigate answers

Language Understanding

Teach your apps to understand commands from your users

Speech to Text

The Speech to Text API is part of Azure Cognitive Services Speech Services

Speaker Recognition (Preview)

Use speech to identify and verify individual speakers

Text to Speech

Convert text to speech to create more natural, accessible interfaces

Speech Translation

Easily integrate real-time speech translation to your app

Anomaly Detector (Preview)

Easily add anomaly detection capabilities to your apps

Use the Speech Devices SDK to build an ambient device and create a custom wake word
