Cognitive Services pricing – Speech Services

Use intelligence APIs to enable vision, language and search capabilities

The Speech service provides a wide range of speech recognition and generation capabilities, including speech transcription, text-to-speech, speech translation and speaker recognition.

Pricing Details

| Instance | Category | Features | Price |
|---|---|---|---|
| Free - Web/Container (1 concurrent request¹) | Speech-to-Text | Standard | 5 audio hours free per month |
| | | Custom | 5 audio hours free per month; Endpoint hosting: 1 model free per month² |
| | | Conversation transcription multi-channel audio (PREVIEW)³ | 5 audio hours free per month |
| | Text to Speech | Standard | 5 million characters free per month |
| | | Neural | 0.5 million characters free per month |
| | | Custom | 5 million characters free per month; Endpoint hosting: 1 model free per month |
| | Speech translation | Standard | 5 audio hours free per month |
| | Speaker Recognition⁷ | Speaker verification | 10,000 free transactions per month |
| | | Speaker identification | 10,000 free transactions per month |
| Standard - Web/Container (20 concurrent requests¹) | Speech-to-Text | Standard | $- per audio hour |
| | | Custom | $- per audio hour; Endpoint hosting: $- per model per hour |
| | | Conversation transcription multi-channel audio (PREVIEW)³ | $- per audio hour⁴ |
| | Text to Speech | Standard | $- per 1M characters |
| | | Neural | $- per 1M characters⁵; Long audio creation: $- per 1M characters |
| | | Custom | $- per 1M characters; Endpoint hosting: $- per model per hour |
| | | Custom Neural (PREVIEW)⁶ | Voice building: contact us; Real-time synthesis: $- per 1M characters; Endpoint hosting: $- per model per hour; Long audio creation: $- per 1M characters |
| | Speech translation | Standard | $- per audio hour |
| | Speaker Recognition⁷ | Speaker verification | $- per 1,000 transactions |
| | | Speaker identification | $- per 1,000 transactions |

See the documentation for detailed information on quotas and limits for all pricing tiers.

¹ To increase concurrent requests, please see instructions.

² Unused models will be automatically decommissioned after 7 days.

³ A circular microphone array device is recommended for multi-channel conversation transcription. For more details, refer to the Microsoft Speech Device SDK.

⁴ This reflects public preview pricing. The general availability (GA) price will be announced at GA.

⁵ Check the neural documentation for the regions where Neural Text to Speech is available.

⁶ The Custom Neural Voice capability is in gated preview. Learn more about the gating process.

⁷ Speaker Recognition is currently only available in West US. Please select “West US” as the Region to see pricing for Speaker Recognition.

Support and SLA

  • Free billing and subscription management support are included.
  • We guarantee that Cognitive Services running in the standard tier will be available at least 99.9 per cent of the time. No SLA is provided for the free trial. Read the SLA for details.


    • For Speech Translation, Speech to Text and Speech to Text with Custom Speech Model: usage is billed in one-second increments.
    • For Text to Speech and Text To Speech with Custom Voice Font: usage is billed per character.
    • For Custom Speech Model Hosting: usage is billed hourly. For Custom Voice Font Hosting: usage is billed daily.
    • For Custom Commands: billing is tracked as consumption of Speech to Text, Text to Speech and Language Understanding. Custom Commands does not introduce new billing meters.
    • There is no charge for training Speech models. The only costs are for hosting the trained model and the per-hour cost of speech transcription.
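The billing rules above can be sketched as a small cost estimator. This is only an illustration under stated assumptions: prices appear as "$-" on this page, so every rate is passed in by the caller, and the rates in the usage section are hypothetical placeholders. Rounding a partial second up to the next whole second is also an assumption, and the function names are invented for this sketch.

```python
import math
from decimal import Decimal

SECONDS_PER_HOUR = 3600
CHARS_PER_MILLION = 1_000_000


def billable_seconds(duration_seconds: float) -> int:
    """Audio usage is billed in one-second increments.

    Assumption: a partial second is rounded up to a whole second.
    """
    return math.ceil(duration_seconds)


def audio_cost(duration_seconds: float, price_per_audio_hour: Decimal) -> Decimal:
    """Cost of Speech to Text or Speech Translation for one audio clip."""
    seconds = billable_seconds(duration_seconds)
    return Decimal(seconds) / SECONDS_PER_HOUR * price_per_audio_hour


def character_cost(num_characters: int, price_per_million_chars: Decimal) -> Decimal:
    """Cost of Text to Speech, which is billed per character."""
    return Decimal(num_characters) / CHARS_PER_MILLION * price_per_million_chars


def hosting_cost(model_hours: int, price_per_model_hour: Decimal) -> Decimal:
    """Cost of hosting a custom speech model, which is billed hourly."""
    return Decimal(model_hours) * price_per_model_hour


if __name__ == "__main__":
    # Hypothetical placeholder rates, NOT real prices.
    audio_rate = Decimal("1.00")    # per audio hour
    tts_rate = Decimal("4.00")      # per 1M characters
    hosting_rate = Decimal("0.05")  # per model per hour

    total = (
        audio_cost(1800.4, audio_rate)        # about half an hour of audio
        + character_cost(250_000, tts_rate)   # 250k synthesized characters
        + hosting_cost(24, hosting_rate)      # one model hosted for a day
    )
    print(f"Estimated charge: {total:.2f}")
```

Training is not modelled because it carries no charge. `Decimal` is used instead of floats so that small per-second amounts do not accumulate rounding error.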
  • The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.

  • The language model is a probability distribution over sequences of words. The language model helps the system to decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customising the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented”, “namespace” or “dot net” will appear more frequently than in typical voice applications. Customising the language model will enable the system to learn this.

  • The acoustic model is a classifier that labels short fragments of audio into one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” consists of four phonemes “s p iy ch”. These classifications are made on the order of 100 times per second. Customising the acoustic model can enable the system to learn to do a better job recognising speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customised acoustic model can more accurately recognise speech in the presence of the noises found in these environments.

  • Microsoft Speech Services provide 70+ default voices (also known as voice fonts) in 40+ languages to help you convert your text into audio. With the rise of the virtual assistant and various speech-enabled applications, however, many companies would like to have a unique voice that represents their business and is carefully designed for their own brand identity. For example, if you are developing a chat bot for your customer care service, you can associate it with a unique brand voice of your company to develop customer attachment. Likewise, an in-car navigation software developer can enable Text to Speech in different custom voices to enrich the user experience.

    Voice Studio, the custom voice building portal, makes this easy. Using your own audio data (recorded human voice with its associated scripts), you can generate a custom voice font, which is then deployed to the Microsoft Text to Speech service and can be easily plugged into your applications through an API endpoint for your own use.
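The language-model idea described earlier (scoring “recognize speech” above “wreck a nice beach”, and raising the score of domain terms through customisation) can be illustrated with a toy bigram model. This is a minimal sketch, not the model the Speech service uses: the training sentences, the add-one smoothing, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams over whitespace-tokenised sentences (toy model)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # words that precede another word
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under add-one smoothed bigrams."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total


# Hypothetical "general" training text.
general_text = [
    "please recognize speech",
    "recognize speech quickly",
    "the system can recognize speech",
    "a nice day at the beach",
]
baseline = train_bigram(general_text)

# Two hypotheses that sound alike: the model prefers the likelier word sequence.
print(log_prob("recognize speech", baseline) > log_prob("wreck a nice beach", baseline))

# Customisation: adding hypothetical domain text raises the score of domain terms.
domain_text = ["search dot net namespace", "dot net namespace docs"]
customised = train_bigram(general_text + domain_text)
print(log_prob("dot net namespace", customised) > log_prob("dot net namespace", baseline))
```

Both comparisons print `True` for this toy data. A production recogniser combines such language-model scores with acoustic-model scores, which this sketch leaves out.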


