Speech-to-Text

Swiftly convert audio to text for natural responsiveness.

Cognitive Services Speech to Text offers a range of capabilities that you can embed into your apps to support various transcription scenarios, including conversation transcription, speech transcription and custom speech transcription.

Conversation transcription

Enable in-person meeting transcription. Conversation transcription captures speech in real time so that all meeting participants can fully engage in the discussion, identify who said what and when, and quickly follow up on next steps.

Use conversation transcription to:

  • Capture speech from all around the meeting room.
  • Help safeguard data with industry-leading security and compliance certifications.
  • Support meeting and conference setups that use microphones and video cameras, through pairing with the Speech Devices SDK.

Speech transcription

Convert spoken audio to text. Call the API to recognise audio from a microphone, from another real-time streaming source or from a recorded audio file. As audio is sent to the server, partial recognition results are returned on request.
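
As a rough illustration of the recorded-file case, the sketch below builds (but does not send) an HTTP request that carries a WAV file, and reads the text out of a JSON reply. The endpoint path, the header names and the `RecognitionStatus` / `DisplayText` fields are assumptions based on the Speech short-audio REST interface and should be checked against the current API reference. `build_recognition_request` and `extract_text` are hypothetical helper names, not part of any SDK.

```python
import json
import urllib.parse
import urllib.request


def build_recognition_request(wav_bytes, region, key, language="en-GB"):
    """Build (but do not send) a POST request that carries a WAV recording."""
    # Assumed endpoint shape for the short-audio REST interface.
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # assumed authentication header
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")


def extract_text(response_body):
    """Return the recognised text, or None if recognition did not succeed."""
    result = json.loads(response_body)
    # Assumed response fields for the "simple" output format.
    if result.get("RecognitionStatus") != "Success":
        return None
    return result.get("DisplayText")
```

To run it against the service, pass the request to `urllib.request.urlopen` and hand the response body to `extract_text`. A single request/response exchange like this yields only a final result; partial results need a streaming connection, for example through the Speech SDK.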

You can use the API to build voice-triggered smart apps.

Custom speech service: Speech transcription with custom model

Overcome speech recognition barriers such as speaking style, vocabulary and background noise. Our speech recognition technologies combine multiple APIs to produce the text output. Customers can customise the APIs to their needs and available data.

Create custom language models tailored to users’ speaking styles

Don’t let varied vocabularies and speaking styles block understanding. Customise the language model of your app’s speech recognition by tailoring it to your industry’s expressions, to technical, geographic or market terms, and even to speaking style.

Adapt to user environment with custom acoustic models

Make sure that your app’s speech recognition can function in all environments. With custom acoustic models, you can account for background noise and match your users’ expected environments.

Use robust speech models from Microsoft

Enable powerful, personalised speech recognition by building your own customised speech recognition models on top of Microsoft’s existing state-of-the-art models.

Explore a speech scenario

Call centre

Speech services

Overview

With Speech Services, it’s easy to transcribe every call. Index the transcription for full-text search, or apply Text Analytics to detect sentiment, language and key phrases for insights. If your call centre recordings involve specialist terminology, such as product names or IT jargon, create a custom language model to teach Speech Services the vocabulary. A custom acoustic model helps Speech Services understand speakers even with background noise or poor phone connections.

For more information, read how batch transcription works with Speech Services.

Flow

  1. Adapt a model for your domain and deploy that model
  2. Upload your recordings to a blob container
  3. Create a POST request to batch transcription
  4. Speech Services schedules the transcription job
  5. Stereo files are split into two channels
  6. Mono files undergo diarisation to distinguish between speakers
  7. Download the transcription using the transcription ID
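
The POST request in step 3 and the ID lookup in step 7 might look roughly like the sketch below. It only builds the request and does not send it. The endpoint path, the `contentUrls`, `diarizationEnabled` and `model` fields, and the header name are assumptions modelled on a v3.0-style batch transcription REST interface; other API versions use different names. `build_batch_request` and `transcription_id_from` are hypothetical helpers.

```python
import json
import urllib.request


def build_batch_request(region, key, recording_urls, locale="en-GB", model_url=None):
    """Build (but do not send) a POST request that creates a batch transcription."""
    # Assumed endpoint shape; check the API version you are targeting.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
    body = {
        "displayName": "Call centre batch",
        "locale": locale,
        # URLs of recordings already uploaded to a blob container (step 2).
        "contentUrls": list(recording_urls),
        # Ask for speaker separation on mono recordings (step 6); assumed field name.
        "properties": {"diarizationEnabled": True},
    }
    if model_url:
        # Reference to the adapted model deployed in step 1; assumed field shape.
        body["model"] = {"self": model_url}
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # assumed authentication header
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def transcription_id_from(transcription_url):
    """Take the transcription ID from the last path segment of the job URL."""
    return transcription_url.rstrip("/").rsplit("/", 1)[-1]
```

Once the job has finished, the ID returned by `transcription_id_from` identifies which transcription to download.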

Explore the Cognitive Services APIs

Computer Vision

Distill actionable information from images

Face

Detect, identify, analyse, organise and tag faces in photos

Ink Recogniser PREVIEW

An AI service that recognises digital ink content, such as handwriting, shapes and ink document layout

Video Indexer

Unlock video insights

Custom Vision

Easily customise your own state-of-the-art computer vision models for your unique use case

Form Recogniser PREVIEW

The AI-powered document extraction service that understands your forms

Text Analytics

Easily evaluate sentiment and topics to understand what users want

Translator Text

Easily conduct machine translation with a simple REST API call

QnA Maker

Distill information into conversational, easy-to-navigate answers

Language Understanding

Teach your apps to understand commands from your users

Immersive Reader PREVIEW

Empower users of all ages and abilities to read and comprehend text

Speech services

Unified speech services for speech-to-text, text-to-speech and speech translation

Speaker Recognition PREVIEW

Use speech to identify and verify individual speakers

Content moderator

Automated image, text and video moderation

Anomaly detector PREVIEW

Easily add anomaly detection capabilities to your apps.

Personaliser PREVIEW

An AI service that delivers a personalised user experience

Use the Speech Devices SDK to build an ambient device and create a custom wake word