Speech to Text

Swiftly convert audio to text for natural responsiveness.

Cognitive Services Speech to Text offers a range of capabilities you can embed into your apps to support various transcription scenarios, including conversation transcription, speech transcription, and custom speech transcription.

Conversation transcription

Enable in-person meeting transcription. Conversation transcription captures speech in real time so that all meeting participants can fully engage in the discussion, see who said what and when, and quickly follow up on next steps.

Use conversation transcription to:

  • Capture speech from all around the meeting room.
  • Help safeguard data with industry-leading security and compliance certifications.
  • Support meeting and conference setups that use microphones and video cameras, through pairing with the Speech Devices SDK.

Speech transcription

Convert spoken audio to text. Call the API to recognize audio coming from the microphone, from other real-time streaming audio sources, or from a recorded audio file. As audio is sent to the server, partial recognition results are returned if requested.

You can use the API to build voice-triggered smart apps.
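As a rough illustration of calling the service with a recorded audio file, the sketch below builds (but does not send) a request to the Speech to Text REST endpoint for short audio. The endpoint path and header names reflect that REST API as commonly documented and should be checked against the current reference; the helper name `build_recognition_request` and the region, key, and audio values are placeholders invented for this example. The REST endpoint is understood to return only a final result, so partial results would need the Speech SDK instead.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_recognition_request(region, key, wav_bytes, language="en-US"):
    """Build (but do not send) a short-audio recognition request.

    Assumption: the endpoint path and headers follow the Speech to Text
    REST API for short audio; verify them against the current docs.
    """
    query = urlencode({"language": language})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        # Subscription key for the Speech resource (placeholder value in tests).
        "Ocp-Apim-Subscription-Key": key,
        # Assumes 16 kHz PCM audio in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return Request(url, data=wav_bytes, headers=headers, method="POST")
```

To actually send the request, pass the returned object to `urllib.request.urlopen` and parse the JSON body of the response.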

Custom speech service: Speech Transcription with Custom Model

Overcome speech recognition barriers such as speaking style, vocabulary, and background noise. Our speech recognition technologies combine multiple APIs to produce the text output. Customers can customize the APIs to their needs and available data.

Create custom language models tailored to users’ speaking styles

Don’t let varied vocabularies and speaking styles block understanding. Customize the language model behind your app’s speech recognition by tailoring it to your industry’s expressions, to technical, geographic, or market-specific terms, and even to speaking style.

Adapt to user environment with custom acoustic models

Make sure your app’s speech recognition can function in all environments. With custom acoustic models, you can account for background noise and match your users’ expected environments.

Use robust speech models from Microsoft

Enable powerful, personalized speech recognition by building your own customized speech recognition models on top of Microsoft’s existing state-of-the-art models.

Explore a speech scenario

Call center

Speech Services

Overview

With Speech Services, it's easy to transcribe every call. Index the transcription for full-text search, or apply Text Analytics to detect sentiment, language, and key phrases for insights. If your call center recordings involve specialized terminology, such as product names or IT jargon, create a custom language model to teach Speech Services the vocabulary. A custom acoustic model helps Speech Services understand speakers even with background noise or poor phone connections.

For more information, read how batch transcription works with Speech Services.

Flow

  1. Adapt a model for your domain and deploy that model
  2. Upload your recordings to a blob container
  3. Create a POST request to batch transcription
  4. Speech Services schedules the transcription job
  5. Stereo files are split into two channels
  6. Mono files undergo diarization to distinguish between speakers
  7. Download the transcription using the transcription ID
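The POST request in the flow above might be assembled along these lines. This is a minimal sketch, assuming the v3.0 shape of the batch transcription REST API (`contentUrls`, `locale`, `displayName`, `properties.diarizationEnabled`, and an optional `model.self` reference); other API versions may differ, so check the current reference before relying on it. The helper name, display name, and URLs are hypothetical placeholders, and the request is built without being sent.

```python
import json
from urllib.request import Request


def build_batch_transcription_request(region, key, recording_urls,
                                      locale="en-US", diarize=True,
                                      model_url=None):
    """Build (but do not send) a batch transcription job request.

    Assumption: field names follow the v3.0 batch transcription API;
    verify them against the version you target.
    """
    body = {
        "displayName": "Call center batch",      # hypothetical job name
        "locale": locale,
        "contentUrls": list(recording_urls),     # blob URLs of the recordings
        "properties": {"diarizationEnabled": diarize},
    }
    if model_url:
        # Reference to a deployed custom model, if one was adapted.
        body["model"] = {"self": model_url}

    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.0/transcriptions"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return Request(url, data=json.dumps(body).encode("utf-8"),
                   headers=headers, method="POST")
```

Diarization is requested here because the flow applies it to mono recordings; for stereo files, which are split per channel, it could be turned off with `diarize=False`.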

Explore the Cognitive Services APIs

Computer Vision

Distill actionable information from images

Face

Detect, identify, analyze, organize, and tag faces in photos

Ink Recognizer PREVIEW

An AI service that recognizes digital ink content, such as handwriting, shapes, and ink document layout

Video Indexer

Unlock video insights

Custom Vision

Easily customize your own state-of-the-art computer vision models for your unique use case

Form Recognizer PREVIEW

The AI-powered document extraction service that understands your forms

Text Analytics

Easily evaluate sentiment and topics to understand what users want

Translator Text

Easily conduct machine translation with a simple REST API call

Bing Spell Check

Detect and correct spelling mistakes in your app

QnA Maker

Distill information into conversational, easy-to-navigate answers

Language Understanding

Teach your apps to understand commands from your users

Speech Services

Unified speech services for speech-to-text, text-to-speech and speech translation

Speaker Recognition PREVIEW

Use speech to identify and verify individual speakers

Content Moderator

Automated image, text, and video moderation

Anomaly Detector PREVIEW

Easily add anomaly detection capabilities to your apps.

Personalizer PREVIEW

An AI service that delivers a personalized user experience

Use the Speech Devices SDK to build an ambient device and create a custom wake word
