
Speech Services pricing

Unified speech services for speech-to-text, text-to-speech and speech translation

The Speech service provides a wide range of speech recognition and generation capabilities, including speech transcription, text-to-speech, speech translation and speaker recognition.

Explore pricing options


Prices are estimates only and are not intended as actual price quotes. Actual pricing may vary depending on the type of agreement entered with Microsoft, date of purchase, and the currency exchange rate. Prices are calculated based on US dollars and converted using Thomson Reuters benchmark rates refreshed on the first day of each calendar month. Sign in to the Azure pricing calculator to see pricing based on your current programme/offer with Microsoft. Contact an Azure sales specialist for more information on pricing or to request a price quote. See frequently asked questions about Azure pricing.

Free

Instance: Free - Web/Container (1 concurrent request¹)

Category | Features | Price
Speech-to-Text | Standard² | 5 audio hours free per month
Speech-to-Text | Custom | 5 audio hours free per month
Speech-to-Text | Custom | Endpoint hosting: 1 model free per month³
Speech-to-Text | Conversation transcription multi-channel audio PREVIEW⁴ | 5 audio hours free per month
Text to Speech | Neural | 0.5 million characters free per month
Speech translation | Standard | 5 audio hours free per month
Speaker Recognition | Speaker verification | 10,000 free transactions per month
Speaker Recognition | Speaker identification | 10,000 free transactions per month
Speaker Recognition | Voice storage | 10,000 free transactions per month

See the documentation for additional detailed information on quotas and limits for all pricing tiers.

1 To increase concurrent requests, please see instructions.

2 Speech to Text now includes pronunciation assessment for both the Free instance (5 audio hours free per month) and the Standard instance, which follows Standard pricing of $1 per audio hour.

3 Unused models are automatically decommissioned after 7 days.

4 A circular microphone array device is recommended for multi-channel conversation transcription. For more details, refer to the Microsoft Speech Device SDK.

Pay as You Go: pay only for what you use.

Instance: Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model¹)

Category | Features | Price
Speech-to-Text | Standard² | $- per audio hour
Speech-to-Text | Custom | $- per audio hour
Speech-to-Text | Custom | Endpoint hosting: $- per model per hour
Speech-to-Text | Conversation transcription multi-channel audio PREVIEW | $- per audio hour³
Text to Speech | Neural⁴ | Real-time synthesis: $- per 1M characters⁴
Text to Speech | Neural⁴ | Long audio creation: $- per 1M characters
Text to Speech | Custom Neural⁴ ⁵ | Training: $- per compute hour, up to $- per training session
Text to Speech | Custom Neural⁴ ⁵ | Real-time synthesis: $- per 1M characters
Text to Speech | Custom Neural⁴ ⁵ | Endpoint hosting: $- per model per hour
Text to Speech | Custom Neural⁴ ⁵ | Long audio creation: $- per 1M characters
Speech translation | Standard | $- per audio hour
Speaker Recognition | Speaker verification | $- per 1,000 transactions
Speaker Recognition | Speaker identification | $- per 1,000 transactions
Speaker Recognition | Voice storage | $- per 1,000 voice profiles (10,000 free voice profiles per month)

See the documentation for additional detailed information on quotas and limits for all pricing tiers.

1 To increase concurrent requests, please see instructions.

2 Speech to Text now includes pronunciation assessment for both the Free instance (5 audio hours free per month) and the Standard instance, which follows Standard pricing of $1 per audio hour.

3 This reflects public preview pricing. General availability (GA) pricing will be announced at GA.

4 Text to Speech is billed for each character that's converted to speech, including punctuation. Learn more.

5 Custom Neural Voice (CNV) is a limited access capability available in Pro and Lite versions. With CNV Lite (public preview), customers can record their own voice and create a model for demonstration or evaluation before applying for access to Pro. Check out where CNV is available.
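
Footnote 4 above states that Text to Speech is billed per character converted to speech, punctuation included. The sketch below estimates a charge from a text and a per-million-character rate. The function names and the rate used in the usage note are hypothetical, since this page does not show actual prices. Counting spaces as billable characters and ignoring any markup are assumptions of this sketch; check the pricing note for the exact definition of a character.

```python
from decimal import Decimal


def billable_characters(text: str) -> int:
    """Count every character in the text, punctuation included.

    Assumption: whitespace is counted too, and the text contains no markup.
    """
    return len(text)


def synthesis_cost(text: str, price_per_million: Decimal) -> Decimal:
    """Estimate the charge for synthesising `text`.

    `price_per_million` is the rate per 1M characters, supplied by the caller.
    """
    return Decimal(billable_characters(text)) * price_per_million / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello, world!"
    placeholder_rate = Decimal("10")  # hypothetical rate, not a real price
    print(billable_characters(sample))
    print(synthesis_cost(sample, placeholder_rate))
```

With the placeholder rate, the 13 characters of "Hello, world!" (comma, space and exclamation mark included) are what the estimate is based on. Substitute the rate from your own agreement or the pricing calculator.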

Commitment Tiers

This pricing is limited access. Apply here.

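Instance | Category | Features | Price (per month) | Overage
Azure – Standard | Speech-to-Text | Standard | $- for 2,000 hours | $- per hour
Azure – Standard | Speech-to-Text | Standard | $- for 10,000 hours | $- per hour
Azure – Standard | Speech-to-Text | Standard | $- for 50,000 hours | $- per hour
Azure – Standard | Text to Speech | Neural¹ | $- for 80M characters | $- per 1M characters
Azure – Standard | Text to Speech | Neural¹ | $- for 400M characters | $- per 1M characters
Azure – Standard | Text to Speech | Neural¹ | $- for 2,000M characters | $- per 1M characters
Connected container – Standard | Speech-to-Text | Standard | $- for 2,000 hours | $- per hour
Connected container – Standard | Speech-to-Text | Standard | $- for 10,000 hours | $- per hour
Connected container – Standard | Speech-to-Text | Standard | $- for 50,000 hours | $- per hour
Connected container – Standard | Text to Speech | Neural¹ | $- for 80M characters | $- per 1M characters
Connected container – Standard | Text to Speech | Neural¹ | $- for 400M characters | $- per 1M characters
Connected container – Standard | Text to Speech | Neural¹ | $- for 2,000M characters | $- per 1M characters
Disconnected container | Speech-to-Text | Standard | Sign up to get access |
Disconnected container | Text to Speech | Neural¹ | Sign up to get access |
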
1 Real-time synthesis only. This does not include long audio creation.
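
Each commitment tier above pairs a flat monthly price for an included quantity with an overage rate for usage beyond it. The sketch below shows that arithmetic. It assumes the tier price is charged in full regardless of usage and that overage is billed linearly per unit; the function name and all monetary values in the usage note are hypothetical placeholders, since this page does not show actual prices.

```python
from decimal import Decimal


def monthly_commitment_cost(
    usage: Decimal,
    included: Decimal,
    tier_price: Decimal,
    overage_rate: Decimal,
) -> Decimal:
    """Estimate a month's charge under a commitment tier.

    usage        -- units consumed (audio hours, or millions of characters)
    included     -- units covered by the tier (for example, 2,000 hours)
    tier_price   -- flat monthly price for the tier
    overage_rate -- price per unit beyond the included quantity
    """
    overage_units = max(Decimal(0), usage - included)
    return tier_price + overage_units * overage_rate


if __name__ == "__main__":
    # Placeholder prices for illustration only, not real quotes.
    tier_price = Decimal("1000")
    overage_rate = Decimal("0.60")
    for hours in (Decimal("1500"), Decimal("2500")):
        cost = monthly_commitment_cost(hours, Decimal("2000"), tier_price, overage_rate)
        print(hours, cost)
```

Running the same usage figure against each tier's price and overage rate is one way to see which tier is cheapest for an expected monthly volume.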

These features are being deprecated and are available only to existing customers. Check details and learn how to migrate to new features.

Instance: Free - Web/Container (1 concurrent request)

Category | Features | Price
Text to Speech | Standard | 5 million characters free per month
Text to Speech | Custom | 5 million characters free per month
Text to Speech | Custom | Endpoint hosting: 1 model free per month

Instance: Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model)

Category | Features | Price
Text to Speech | Standard | $- per 1M characters
Text to Speech | Custom | $- per 1M characters
Text to Speech | Custom | Endpoint hosting: $- per model per hour


    • For Speech Translation, Speech to Text and Speech to Text with Custom Speech Model: usage is billed in one-second increments.
    • For Text to Speech with Neural or Custom Neural Voices: usage is billed per character. Check the definition of character in the pricing note.
    • For Custom Speech Model Hosting: usage is billed hourly. For Custom Voice Font Hosting: usage is billed daily.
    • For Custom Commands: billing is tracked as consumption of Speech to Text, Text to Speech and Language Understanding. Custom Commands does not introduce new billing meters.
    • There is no charge for training Speech models. The only costs are for hosting the trained model and the per-hour charge for speech transcription.
  • The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.

  • The language model is a probability distribution over sequences of words. The language model helps the system to decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customising the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented”, “namespace” or “dot net” will appear more frequently than in typical voice applications. Customising the language model will enable the system to learn this.

  • The acoustic model is a classifier that labels short fragments of audio as one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” is composed of four phonemes: “s p iy ch”. These classifications are made on the order of 100 times per second. Customising the acoustic model can enable the system to do a better job of recognising speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customised acoustic model can more accurately recognise speech in the presence of the noises found in these environments.

  • The Speech service offers a wide range of text-to-speech (TTS) voice fonts; however, custom neural voice allows you to build your own custom voice that suits your needs and your brand. Read the blog for more information.

  • There are scenarios in which a speaker or multiple speakers may speak multiple languages over the same audio file or live presentation. Continuous language detection allows you to identify a switch in spoken language and accurately transcribe speech accordingly. This feature will be free for private preview and can be accessed via the Speech SDK. Visit docs to learn more.
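
The language-model description above (choosing between “recognize speech” and “wreck a nice beach”) can be illustrated with a toy bigram model. This is a minimal sketch, not how the Speech service is implemented: the probabilities are invented for illustration, and the names `BIGRAM_LOGPROB`, `score` and `best_hypothesis` are hypothetical.

```python
import math

# Invented bigram probabilities for illustration only; "<s>" marks sentence start.
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.2),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.01),
    ("nice", "beach"): math.log(0.005),
}

# Fallback log-probability for word pairs the toy model has never seen.
UNSEEN_LOGPROB = math.log(1e-6)


def score(sentence: str) -> float:
    """Return the log-probability of a word sequence under the toy bigram model."""
    words = ["<s>"] + sentence.lower().split()
    return sum(
        BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB) for pair in zip(words, words[1:])
    )


def best_hypothesis(hypotheses: list[str]) -> str:
    """Pick the hypothesis the language model considers most likely."""
    return max(hypotheses, key=score)


if __name__ == "__main__":
    candidates = ["wreck a nice beach", "recognize speech"]
    for candidate in candidates:
        print(candidate, round(score(candidate), 2))
    print(best_hypothesis(candidates))
```

In this sketch, customising the language model would correspond to raising the probabilities of domain-specific word pairs, so that hypotheses containing product names or jargon score higher than similar-sounding everyday phrases.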
