Unified speech services for speech-to-text, text-to-speech and speech translation
The Speech service provides a wide range of speech recognition and generation capabilities, including speech transcription, text-to-speech, speech translation, and speaker recognition.
Explore pricing options
Prices are estimates only and are not intended as actual price quotes. Actual pricing may vary depending on the type of agreement entered with Microsoft, date of purchase, and the currency exchange rate. Prices are calculated based on US dollars and converted using Thomson Reuters benchmark rates refreshed on the first day of each calendar month. Sign in to the Azure pricing calculator to see pricing based on your current program/offer with Microsoft. Contact an Azure sales specialist for more information on pricing or to request a price quote. For more information on Azure pricing see frequently asked questions.
US government entities are eligible to purchase Azure Government services from a licensing solution provider with no upfront financial commitment, or directly through a pay-as-you-go online subscription.
Important—The price in R$ is merely a reference; this is an international transaction and the final price is subject to exchange rates and the inclusion of IOF taxes. An eNF will not be issued.
Free (F0)
Category | Features | Price |
---|---|---|
Speech to Text (per second billing) | Standard | 5 audio hours free per month |
Speech to Text (per second billing) | Custom | 5 audio hours free per month; endpoint hosting: 1 model free per month¹ |
Speech to Text (per second billing) | Conversation Transcription Multichannel Audio (PREVIEW) | 5 audio hours free per month |
Text to Speech (per character billing) | Neural | 0.5 million characters free per month |
Speech Translation (per second billing) | Standard | 5 audio hours free per month |
Speaker Recognition (per transaction billing) | Speaker Verification² | 10,000 transactions free per month |
Speaker Recognition (per transaction billing) | Speaker Identification² | 10,000 transactions free per month |
Speaker Recognition (per transaction billing) | Voice Profile Storage | 10,000 transactions free per month |
Pay as You Go: pay only for what you use.
Category | Features | Price |
---|---|---|
Speech to Text (per second billing) | Standard | $- per audio hour |
Speech to Text (per second billing) | Custom | $- per audio hour; endpoint hosting: $- per model per hour |
Speech to Text (per second billing) | Enhanced add-on features | $- per audio hour per feature |
Speech to Text (per second billing) | Conversation Transcription Multichannel Audio (PREVIEW) | $- per audio hour¹ |
Text to Speech (per character billing) | Neural | Real-time & batch synthesis: $- per 1M characters; long audio creation: $- per 1M characters |
Text to Speech (per character billing) | Custom Neural² | Training: $- per compute hour, up to $- per training; real-time & batch synthesis: $- per 1M characters; endpoint hosting: $- per model per hour; long audio creation: $- per 1M characters |
Speech Translation (per second billing) | Standard | $- per audio hour |
Speaker Recognition (per transaction billing) | Speaker Verification³ | $- per 1,000 transactions |
Speaker Recognition (per transaction billing) | Speaker Identification³ | $- per 1,000 transactions |
Speaker Recognition (per transaction billing) | Voice Profile Storage | $- per 1,000 voice profiles (10,000 free voice profiles per month) |
Commitment Tiers
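Instance | Category | Features | Price (per month) | Overage |
---|---|---|---|---|
Azure - Standard | Speech to Text | Standard | $- for 2,000 hours | $- per hour |
Azure - Standard | Speech to Text | Standard | $- for 10,000 hours | $- per hour |
Azure - Standard | Speech to Text | Standard | $- for 50,000 hours | $- per hour |
Azure - Standard | Speech to Text | Custom | $- for 2,000 hours | $- per hour |
Azure - Standard | Speech to Text | Custom | $- for 10,000 hours | $- per hour |
Azure - Standard | Speech to Text | Custom | $- for 50,000 hours | $- per hour |
Azure - Standard | Text to Speech | Neural¹ | $- for 80M characters | $- per 1M characters |
Azure - Standard | Text to Speech | Neural¹ | $- for 400M characters | $- per 1M characters |
Azure - Standard | Text to Speech | Neural¹ | $- for 2,000M characters | $- per 1M characters |
Connected container - Standard | Speech to Text | Standard | $- for 2,000 hours | $- per hour |
Connected container - Standard | Speech to Text | Standard | $- for 10,000 hours | $- per hour |
Connected container - Standard | Speech to Text | Standard | $- for 50,000 hours | $- per hour |
Connected container - Standard | Speech to Text | Custom | $- for 2,000 hours | $- per hour |
Connected container - Standard | Speech to Text | Custom | $- for 10,000 hours | $- per hour |
Connected container - Standard | Speech to Text | Custom | $- for 50,000 hours | $- per hour |
Connected container - Standard | Text to Speech | Neural¹ | $- for 80M characters | $- per 1M characters |
Connected container - Standard | Text to Speech | Neural¹ | $- for 400M characters | $- per 1M characters |
Connected container - Standard | Text to Speech | Neural¹ | $- for 2,000M characters | $- per 1M characters |
Disconnected container | Speech to Text | Standard | Sign up to get access | |
Disconnected container | Speech to Text | Custom | Sign up to get access | |
Disconnected container | Text to Speech | Neural¹ | Sign up to get access | |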
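The commitment tiers above pair a fixed monthly price for an included quantity with a per-unit overage rate. The following is a minimal sketch of how a monthly bill might be estimated under that model. The prices are hypothetical placeholders (the table shows `$-`), the `CommitmentTier` and `monthly_cost` names are invented for illustration, and the assumption that overage applies only to usage beyond the included quantity is not something this page states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitmentTier:
    included_units: float    # e.g. audio hours, or millions of characters
    monthly_price: float     # hypothetical fixed monthly price, in USD
    overage_per_unit: float  # hypothetical overage rate, in USD per unit


def monthly_cost(tier: CommitmentTier, used_units: float) -> float:
    """Fixed price plus overage on usage beyond the included quantity (assumed)."""
    overage_units = max(0.0, used_units - tier.included_units)
    return tier.monthly_price + overage_units * tier.overage_per_unit


# Hypothetical numbers for illustration only; these are not Azure prices.
tier = CommitmentTier(included_units=2_000, monthly_price=1_000.0, overage_per_unit=0.5)

print(monthly_cost(tier, 1_500))  # usage below the commitment: fixed price only
print(monthly_cost(tier, 2_400))  # 400 units of overage on top of the fixed price
```

Comparing the result for each tier against the expected monthly usage is one way to pick the tier with the lowest estimated cost.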
These features are being deprecated and are available only to existing customers. Check details and learn how to migrate to new features.
Instance | Category | Features | Price |
---|---|---|---|
Free - Web/Container (1 concurrent request) | Text to Speech | Standard | 5 million characters free per month |
Free - Web/Container (1 concurrent request) | Text to Speech | Custom | 5 million characters free per month; endpoint hosting: 1 model free per month |
Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model) | Text to Speech | Standard | $- per 1M characters |
Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model) | Text to Speech | Custom | $- per 1M characters; endpoint hosting: $- per model per hour |
Azure pricing and purchasing options

Connect with us directly
Get a walkthrough of Azure pricing. Understand pricing for your cloud solution, learn about cost optimization, and request a custom proposal.
Talk to a sales specialist
See ways to purchase
Purchase Azure services through the Azure website, a Microsoft representative, or an Azure partner.
Explore your options
Additional resources
Speech Services
Learn more about Speech Services features and capabilities.
Pricing calculator
Estimate your expected monthly costs for using any combination of Azure products.
Documentation
Review technical tutorials, videos, and more Speech Services resources.
Frequently asked questions
-
- For Speech to Text and Speech Translation, usage is billed in one-second increments.
- For Text to Speech, usage is billed per character. Check the definition of character in the pricing note.
- For Speech to Text and Text to Speech, endpoint hosting for custom models is billed per second per model.
- For Custom Commands, billing is tracked as consumption of Speech to Text, Text to Speech, and Language Understanding. Custom Commands does not introduce new billing meters.
- There is no charge for training Speech to Text models. The only costs are endpoint hosting per model once deployed, and then the cost per audio hour of Custom Speech to Text.
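The billing rules above can be sketched as a small cost estimator. This is an illustration under stated assumptions, not an official formula: the function names and all rates are hypothetical, and rounding partial seconds up to the next whole second is an assumption about how "one-second increments" are applied.

```python
import math

SECONDS_PER_HOUR = 3600


def speech_to_text_cost(audio_seconds: float, price_per_audio_hour: float) -> float:
    # Billed in one-second increments; rounding partial seconds up is an assumption.
    billed_seconds = math.ceil(audio_seconds)
    return billed_seconds * price_per_audio_hour / SECONDS_PER_HOUR


def text_to_speech_cost(characters: int, price_per_million_chars: float) -> float:
    # Billed per character, with the rate quoted per 1M characters.
    return characters * price_per_million_chars / 1_000_000


def endpoint_hosting_cost(hosted_seconds: int, models: int, price_per_model_hour: float) -> float:
    # Custom model hosting is billed per second per model, with the rate quoted per hour.
    return hosted_seconds * models * price_per_model_hour / SECONDS_PER_HOUR


# Hypothetical rates for illustration only; these are not Azure prices.
stt = speech_to_text_cost(audio_seconds=90.4, price_per_audio_hour=3.6)
tts = text_to_speech_cost(characters=250_000, price_per_million_chars=16.0)
hosting = endpoint_hosting_cost(hosted_seconds=1_800, models=2, price_per_model_hour=0.5)

print(f"Speech to Text: {stt:.4f}")
print(f"Text to Speech: {tts:.4f}")
print(f"Endpoint hosting: {hosting:.4f}")
```

Training cost is left out for Speech to Text because, as noted above, training custom Speech to Text models is not charged.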
-
The Speech service lets users adapt baseline models with their own acoustic and language data, producing custom speech models that can be used with both Speech to Text and Speech Translation.
-
The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.
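The idea of scoring word sequences by likelihood can be illustrated with a toy bigram model. This is only a sketch: the probabilities below are made up for the example, the `BIGRAM_PROB` table and `log_score` function are hypothetical names, and production recognizers use far larger models.

```python
import math

# Made-up bigram probabilities P(word | previous word); "<s>" marks the sentence start.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.01,
    ("recognize", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}
UNSEEN_PROB = 1e-6  # fallback for bigrams missing from the toy table


def log_score(sentence: str) -> float:
    """Sum of log bigram probabilities; a higher score means a more likely sequence."""
    words = ["<s>"] + sentence.lower().split()
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )


candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=log_score)
print(best)
```

Customizing the language model corresponds, in this toy picture, to raising the probabilities of domain-specific word sequences so that they win over similar-sounding alternatives.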
-
The acoustic model is a classifier that labels short fragments of audio as one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” is composed of four phonemes: “s p iy ch”. These classifications are made on the order of 100 times per second. Customizing the acoustic model can help the system recognize speech more accurately in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.
-
The Speech service offers a wide range of text-to-speech (TTS) voice fonts. In addition, custom neural voice lets you build your own custom voice that suits your needs and your brand. Read the blog for more information.
-
Language identification allows you to identify a switch in spoken language and transcribe speech accordingly. This can be applied in scenarios where the audio language is unknown, or when speaker(s) may speak multiple languages. Single Language Identification is available at no additional cost. Continuous Language Identification is an enhanced add-on feature. Visit docs to learn more.
Get free cloud services and a $200 credit to explore Azure for 30 days.