Cognitive Services pricing – Speech Services
Use intelligence APIs to enable vision, language and search capabilities
- No upfront cost
- No termination fees
- Pay only for what you use
US government entities are eligible to purchase Azure Government services from a licensing solution provider with no upfront financial commitment or directly through a pay-as-you-go online subscription.
Important: The price in R$ is merely a reference; this is an international transaction and the final price is subject to exchange rates and the inclusion of IOF taxes. An eNF will not be issued.
Azure Germany is available to customers and partners who have already purchased it and who do business in the European Union (EU), the European Free Trade Association (EFTA) or the United Kingdom (UK). It provides data residency in Germany with additional levels of control and data protection. You can also sign up for a free Azure trial.
The Speech service provides a wide range of speech recognition and generation capabilities, including speech transcription, text-to-speech, speech translation and speaker recognition.
Pricing Details
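Instance | Category | Features | Price |
---|---|---|---|
Free - Web/Container, 1 concurrent request ¹ | Speech-to-Text | Standard | 5 audio hours free per month |
Free - Web/Container, 1 concurrent request ¹ | Speech-to-Text | Custom | 5 audio hours free per month; Endpoint hosting: 1 model free per month ² |
Free - Web/Container, 1 concurrent request ¹ | Speech-to-Text | Conversation transcription multi-channel audio (PREVIEW) ³ | 5 audio hours free per month |
Free - Web/Container, 1 concurrent request ¹ | Text to Speech | Standard | 5 million characters free per month |
Free - Web/Container, 1 concurrent request ¹ | Text to Speech | Neural | 0.5 million characters free per month |
Free - Web/Container, 1 concurrent request ¹ | Text to Speech | Custom | 5 million characters free per month; Endpoint hosting: 1 model free per month |
Free - Web/Container, 1 concurrent request ¹ | Speech translation | Standard | 5 audio hours free per month |
Free - Web/Container, 1 concurrent request ¹ | Speaker Recognition ⁷ | Speaker verification | 10,000 free transactions per month |
Free - Web/Container, 1 concurrent request ¹ | Speaker Recognition ⁷ | Speaker identification | 10,000 free transactions per month |
Free - Web/Container, 1 concurrent request ¹ | Speaker Recognition ⁷ | Voice Storage | 10,000 free transactions per month |
Standard - Web/Container, 20 concurrent requests ¹ | Speech-to-Text | Standard | $- per audio hour |
Standard - Web/Container, 20 concurrent requests ¹ | Speech-to-Text | Custom | $- per audio hour; Endpoint hosting: $- per model per hour |
Standard - Web/Container, 20 concurrent requests ¹ | Speech-to-Text | Conversation transcription multi-channel audio (PREVIEW) ³ | $- per audio hour ⁴ |
Standard - Web/Container, 20 concurrent requests ¹ | Text to Speech | Standard | $- per 1M characters |
Standard - Web/Container, 20 concurrent requests ¹ | Text to Speech | Neural | $- per 1M characters ⁵; Long audio creation: $- per 1M characters |
Standard - Web/Container, 20 concurrent requests ¹ | Text to Speech | Custom | $- per 1M characters; Endpoint hosting: $- per model per hour |
Standard - Web/Container, 20 concurrent requests ¹ | Text to Speech | Custom Neural ⁶ | Training: $- per compute hour, up to $- per training; Real-time synthesis: $- per 1M characters; Endpoint hosting: $- per model per hour; Long audio creation: $- per 1M characters |
Standard - Web/Container, 20 concurrent requests ¹ | Speech translation | Standard | $- per audio hour |
Standard - Web/Container, 20 concurrent requests ¹ | Speaker Recognition ⁷ | Speaker verification | $- per 1,000 transactions |
Standard - Web/Container, 20 concurrent requests ¹ | Speaker Recognition ⁷ | Speaker identification | $- per 1,000 transactions |
Standard - Web/Container, 20 concurrent requests ¹ | Speaker Recognition ⁷ | Voice Storage | $- per 1,000 transactions |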
Support and SLA
- Free billing and subscription management support is included.
- We guarantee that Cognitive Services running in the standard tier will be available at least 99.9 per cent of the time. No SLA is provided for the free trial. Read the SLA for details.
FAQ
-
- For Speech Translation, Speech to Text and Speech to Text with Custom Speech Model: usage is billed in one-second increments.
- For Text to Speech and Text to Speech with Custom Voice Font: usage is billed per character.
- For Custom Speech Model Hosting: usage is billed hourly. For Custom Voice Font Hosting: usage is billed daily.
- For Custom Commands: billing is tracked as consumption of Speech to Text, Text to Speech and Language Understanding. Custom Commands does not introduce new billing meters.
- There is no charge for training Speech models. The only costs are for hosting the model once trained and then the cost per hour of speech transcription.
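The billing rules above can be turned into a rough cost estimate. The following is a minimal sketch, not an official calculator: every rate in it is a hypothetical placeholder (the pricing table shows "$-" rather than actual prices), and it assumes that partial seconds round up to the next whole second and that each Unicode code point counts as one character. Neither assumption is confirmed on this page.

```python
import math
from decimal import Decimal

# Hypothetical rates for illustration only; substitute the real prices
# for your region and tier.
RATE_PER_AUDIO_HOUR = Decimal("1.00")
RATE_PER_MILLION_CHARS = Decimal("4.00")
RATE_PER_MODEL_HOUR = Decimal("0.05")


def speech_to_text_cost(audio_seconds, rate_per_hour=RATE_PER_AUDIO_HOUR):
    """Estimate cost for audio billed in one-second increments.

    Assumption: a partial second is rounded up to a full second.
    """
    billable_seconds = math.ceil(audio_seconds)
    return Decimal(billable_seconds) / Decimal(3600) * rate_per_hour


def text_to_speech_cost(text, rate_per_million=RATE_PER_MILLION_CHARS):
    """Estimate cost for text billed per character.

    Assumption: one Unicode code point is one billable character.
    """
    return Decimal(len(text)) / Decimal(1_000_000) * rate_per_million


def model_hosting_cost(hours, rate_per_hour=RATE_PER_MODEL_HOUR):
    """Estimate cost for a custom speech model hosted for whole hours."""
    return Decimal(hours) * rate_per_hour


if __name__ == "__main__":
    print("90.2 s of audio:", speech_to_text_cost(90.2))
    print("5 characters of text:", text_to_speech_cost("hello"))
    print("24 h of model hosting:", model_hosting_cost(24))
```

`Decimal` is used instead of floats so that small per-unit amounts do not accumulate binary rounding error when many usage records are summed.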
-
The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.
-
The language model is a probability distribution over sequences of words. The language model helps the system to decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customising the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented”, “namespace” or “dot net” will appear more frequently than in typical voice applications. Customising the language model will enable the system to learn this.
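To make the idea of "a probability distribution over sequences of words" concrete, here is a toy sketch of a bigram language model with add-one smoothing. It is an illustration of the general technique only, not how the Speech service is implemented; the training sentences and function names are made up for the example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count word pairs in a small corpus (toy model, illustrative only)."""
    contexts = Counter()  # how often each word appears as the preceding word
    bigrams = Counter()   # how often each (previous, current) pair appears
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def log_probability(sentence, model):
    """Score a word sequence with add-one smoothed bigram probabilities."""
    contexts, bigrams, vocab = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total += math.log(prob)
    return total


# Made-up general corpus.
general = [
    "recognize speech",
    "recognize speech with a custom model",
    "speech services recognize speech",
    "a nice day at the beach",
]
model = train_bigram_model(general)
likely = log_probability("recognize speech", model)
unlikely = log_probability("wreck a nice beach", model)
print(likely > unlikely)

# "Customising" the model: add made-up domain sentences and score again.
custom_model = train_bigram_model(
    general + ["search for namespace", "search for object-oriented"]
)
before = log_probability("search for namespace", model)
after = log_probability("search for namespace", custom_model)
print(after > before)
```

In this sketch, the sequence made of word pairs seen in the corpus scores higher than its sound-alike, and adding domain sentences raises the score of domain phrases, which mirrors the effect the paragraph describes.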
-
The acoustic model is a classifier that labels short fragments of audio as one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” is composed of four phonemes: “s p iy ch”. These classifications are made on the order of 100 times per second. Customising the acoustic model can enable the system to do a better job of recognising speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customised acoustic model can more accurately recognise speech in the presence of the noises found in these environments.
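As a rough illustration of how per-fragment labels become a phoneme sequence, the sketch below merges runs of identical frame labels into phonemes with durations. It is heavily simplified: a real acoustic model emits probabilities that a decoder combines with the language model, and the frame data, the `sil` silence label and the fixed rate of 100 frames per second are assumptions made for the example.

```python
from itertools import groupby

# Assumed frame rate, taken from "on the order of 100 times per second".
FRAMES_PER_SECOND = 100


def collapse_frames(frame_labels, silence="sil"):
    """Merge runs of identical per-frame labels into (phoneme, seconds) pairs.

    Frames labelled with the (hypothetical) silence marker are dropped.
    """
    segments = []
    for label, run in groupby(frame_labels):
        frame_count = len(list(run))
        if label != silence:
            segments.append((label, frame_count / FRAMES_PER_SECOND))
    return segments


# Made-up classifier output for the word "speech".
frames = (
    ["sil"] * 5 + ["s"] * 12 + ["p"] * 6 + ["iy"] * 15 + ["ch"] * 10 + ["sil"] * 4
)
segments = collapse_frames(frames)
phonemes = [phoneme for phoneme, _ in segments]
print(" ".join(phonemes))
```

With the made-up frames above, the collapsed sequence is the four phonemes of “speech” from the text.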
-
Microsoft Speech Services provide 70+ default voices (aka voice fonts) in 40+ languages to help you convert your text into audio. With the rise of virtual assistants and other speech-enabled applications, however, many companies would like a unique voice that represents their business and is carefully designed for their own brand identity. For example, if you are developing a chat bot for your customer care service, you can give it a unique brand voice to build customer attachment. Likewise, a developer of in-car navigation software can enable Text to Speech in different custom voices to enrich the user experience.
Voice Studio, the custom voice building portal, makes this easy. Using your own audio data (recorded human speech with its associated scripts), you can generate a custom voice font, which is then deployed to the Microsoft Text to Speech service and can be plugged into your applications through an API endpoint for your own use.