Cognitive Services Pricing - Custom Speech Service PREVIEW

Use intelligence APIs to enable vision, speech, language, and knowledge capabilities

The Custom Speech Service lets you create custom speech recognition models and deploy them to a speech-to-text endpoint that’s tailored to your application. With Custom Speech Service, you can customize the language model of the speech recognizer, so it learns the vocabulary of your application and the speaking style of your users. You can also customize the acoustic model of the speech recognizer to better match the application’s expected environment and user population.

Pricing details

Model adaptation is free.

                    Free            S1
Model Deployments   1 model         $-/model/month
Model Adaptation    3 hours/month   Unlimited
Accuracy Tests      2 hours/month   2 hours free, then $-/hour
Scale Out           N/A             $-/unit/day (each unit allows five concurrent requests)
No Trace            N/A             $-/model/month
Request Pricing     2 hours/month   2 hours free, then $-/hour

Support & SLA

  • Free billing and subscription management support is included.
  • Need tech support for preview services? Use our forums.
  • We guarantee that Cognitive Services running in the standard tier will be available at least 99.9% of the time. No SLA is provided for the free tier. Read the SLA.
  • No SLA during preview period. Learn more.

FAQ

Custom Speech Service

  • Tier 1 can process up to four pieces of audio (i.e., four transcriptions) at the same time and still respond in real time. If the user sends more than four concurrent pieces of audio, each subsequent piece is rejected with an error code indicating too many concurrent recognitions. The same applies to Tier 2, where 12 simultaneous transcriptions can be processed. The Free tier offers one concurrent transcription. It is assumed that the audio is uploaded in real time; if audio is uploaded faster, the request is still considered ongoing for concurrency purposes until the duration of the audio has passed (even though the recognition result might be returned earlier).

    Note: If a higher level of concurrency is required, please contact us.

  • The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.

  • The acoustic model is a classifier that labels short fragments of audio as one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” consists of four phonemes: “s p iy ch”. These classifications are made on the order of 100 times per second. Customizing the acoustic model can enable the system to learn to do a better job recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.

  • Short Phrase recognition supports utterances up to 15 seconds long. When used with the Speech Client library, as data is sent to the server, the client will receive multiple partial results and one final result containing multiple N-best choices.

  • Long Dictation recognition supports utterances up to two minutes long. When used with the Speech Client library, as data is sent to the server, the client will receive multiple partial results and multiple final results, based on where the server indicates sentence pauses.

  • For instance, if a customer uses the S1 tier to process one million transcriptions, they will be charged the tier price ($-); the first 100,000 transcriptions are billed at $- per 1,000 transcriptions and the remaining 900,000 transcriptions are billed at $- per 1,000 transcriptions. So, in effect, the customer is billed $- + 100,000 * ($- / 1,000) + 900,000 * ($- / 1,000) = $4,500 (a worked sketch of this calculation appears at the end of this section).

  • Please see the Custom Speech Service information on the Microsoft Cognitive Services webpage and on the Custom Speech Service website, www.cris.ai.

  • Custom model deployment is the process of wrapping a custom model and exposing it as a service. The resulting deployed custom model exposes an endpoint via which it can be accessed. Users can choose to deploy as many models as they require.

  • Custom Speech Service enables users to adapt baseline models based on their own acoustic and language data. We call this process model customization.

  • When a custom model is created, users have the option to upload test data to evaluate the newly created model. Users can test new custom models with as much data as they require, i.e., they can execute unlimited accuracy tests.

  • When a custom model has been deployed, its URI can process one audio request at a time. For scenarios that send more than one audio request simultaneously to that URI, users can opt to scale out in increments of five concurrent requests by purchasing scale units. Each scale unit supports up to five concurrent audio requests at a cost of $200 per scale unit. For example, if a user expects to hit that endpoint with 23 audio requests at the same time, the user would need to purchase five scale units to guarantee up to 25 concurrent requests (see the scale-unit sketch at the end of this section).

  • Log management enables users to switch off logging for their deployed models. Users concerned about privacy can opt to switch off logging for a deployed model at a rate of $20 per month.

  • Request pricing refers to the cost of processing audio requests by the endpoint of a deployed custom model.
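
To make the request-billing example above concrete, here is a minimal Python sketch of the tiered calculation. The actual tier price and per-1,000-transcription rates appear as "$-" on this page, so the constants below are hypothetical placeholders, chosen only so that the total reproduces the $4,500 figure quoted in the example; monthly_bill is not an official billing formula.

    # Hypothetical placeholder prices; the real rates are not shown on this page.
    TIER_BASE_FEE = 300.0    # monthly S1 tier price ("$-" above)
    RATE_FIRST_100K = 6.0    # $ per 1,000 transcriptions for the first 100,000
    RATE_ABOVE_100K = 4.0    # $ per 1,000 transcriptions beyond 100,000

    def monthly_bill(transcriptions: int) -> float:
        """Tier base fee plus tiered per-1,000-transcription charges."""
        first = min(transcriptions, 100_000)
        rest = max(transcriptions - 100_000, 0)
        return (TIER_BASE_FEE
                + (first / 1_000) * RATE_FIRST_100K
                + (rest / 1_000) * RATE_ABOVE_100K)

    # One million transcriptions, as in the example above:
    print(monthly_bill(1_000_000))  # 4500.0 with these placeholder rates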
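
The scale-out rule above reduces to a simple ceiling division, since each scale unit covers up to five concurrent requests. A minimal sketch; the helper name is only illustrative.

    import math

    REQUESTS_PER_SCALE_UNIT = 5  # each scale unit allows up to five concurrent requests

    def scale_units_needed(peak_concurrent_requests: int) -> int:
        """Scale units required to cover a given peak number of concurrent requests."""
        return math.ceil(peak_concurrent_requests / REQUESTS_PER_SCALE_UNIT)

    # The example from the FAQ: 23 simultaneous requests require 5 scale units,
    # which cover up to 25 concurrent requests.
    print(scale_units_needed(23))  # -> 5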

General

  • The Emotion API, Face API, Language Understanding Intelligent Service API, Bing Speech-to-Text API, and Bing Text-to-Speech API are billed per 1,000 API transaction calls; a transaction is counted when a production API call is actively executed. Billing is prorated based on the actual number of production API transaction calls.

    The Bing Long Form Speech API service is billed per hour of speech that is analyzed. The billing is prorated on a per-minute basis.

    The Recommendations API and Text Analytics API can be purchased in units of the standard tiers at a fixed price. Each unit of a tier comes with included quantities of API transactions. If the user exceeds the included quantities, overages are charged at the rate specified in the pricing table above. These overages are prorated, and the service is billed on a monthly basis. The included quantities in a tier are reset each month.

  • Usage is throttled if the transaction limit is reached on the free tier. Customers cannot accrue overages on the free tier.

  • Any annotation to a document counts as a transaction. Batch scoring calls also take into consideration the number of documents that need to be scored in that transaction. For instance, if 1,000 documents are sent for sentiment analysis in a single API call, that counts as 1,000 transactions. If an API supports more than one annotation operation, that is also counted: an API call that performs both sentiment analysis and key-phrase extraction on 1,000 documents counts as 2,000 transactions (2 annotations * 1,000 documents). A counting sketch appears at the end of this section.

  • If the usage on a standard tier is exceeded, the account starts to accrue overages. These overages are billed on a monthly basis, and are calculated at the rate specified for each tier.

  • Any API call (with the exception of batch scoring calls) counts as a transaction. Batch scoring calls will count based on the number of items that need to be scored in that transaction.

  • Usage is throttled if the transaction limit is reached on the free tier. Customers cannot accrue overages on the free tier. Batch scoring is not supported on the free tier.

  • The Recommendations API can be purchased in units of the standard tiers at a fixed price. Each unit of a tier comes with included quantities of API transactions. If the user exceeds the included quantities, overages are charged at the rate specified in the pricing table above. These overages are prorated, and the service is billed on a monthly basis. The included quantities in a tier are reset each month.

  • You may upgrade to a higher tier at any time. Billing rate and included quantities corresponding to the higher tier will begin immediately.

  • The table below lists the available endpoints for each API. The response from the same Bing Web Search API endpoint may vary depending on the tier purchased; refer to the next question for details. A minimal call sketch appears at the end of this section.

    Included APIs                                   Endpoints                                                      Available in Tiers
    Bing Web Search API                             https://api.cognitive.microsoft.com/bing/v7.0/search           S1-S8
    Bing Image Search API                           https://api.cognitive.microsoft.com/bing/v7.0/images/search    S1, S3, S7, S8
    Bing News Search API                            https://api.cognitive.microsoft.com/bing/v7.0/news/search      S1, S5, S8
    Bing Video Search API                           https://api.cognitive.microsoft.com/bing/v7.0/videos/search    S1, S4, S7, S8
    Bing Entity Search API (Preview, EN-US Only)    https://api.cognitive.microsoft.com/bing/v7.0/entities         S1, S6
    Bing Autosuggest API                            https://api.cognitive.microsoft.com/bing/v7.0/autosuggest      S1, S2
    Bing Spell Check API                            https://api.cognitive.microsoft.com/bing/v7.0/spellcheck       S1, S2
  • No. The Bing Web Search API response is curtailed to match the specific offering of each tier. For example, Tier S3 is meant for customers who want to use only web search results and images in their applications. Customers also have the option of calling just a specific endpoint within a tier, and those transactions count against the overall bundle of transactions (for example, in Tier S3 a customer can make 400 transactions against the Image Search API endpoint and 600 transactions against the Web Search API endpoint, and the total is counted as 1,000 transactions).

  • No. The two APIs can return different results even if you are only looking for images. For example, for a certain type of query, the Bing Web Search API may return a combination of web results, videos, and news but no images, while the Bing Image Search API may return images for the same query.

  • Tiers are priced based on the number of transactions. As an example, for Tier S3 the price per 1,000 transactions is $4. If, at the end of the billing period, 12,000 transactions are logged for the Bing Web Search API and 1,000 transactions for the Bing Image Search API, you will be billed $52, calculated as $4 * (13,000 / 1,000).

  • Bing Spell Check and Bing Autosuggest API transactions are billed in increments of 25,000 in Tier S1, whereas the other APIs are billed in increments of 1,000 transactions in Tier S1.

    For example, if you are subscribed to Tier S1 and, at the end of the billing period, 15,000 transactions are logged for the Bing Web Search API, 3,000 transactions for the Bing Video Search API, and 25,000 for the Bing Autosuggest API, the approximate bill would be $133, calculated as $7 * ((15,000 + 3,000) / 1,000) + $7 * (25,000 / 25,000). A sketch of this calculation appears at the end of this section.

    Note: For billing, only the endpoint is considered, not the requested response. For example, calling the Bing Web Search API only for an image response counts toward the Bing Web Search API, not the Bing Image Search API.
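
To make the transaction-counting rule above concrete, here is a minimal Python sketch; transactions_for_call is an illustrative helper, not part of any API, and the operations named in the comments are simply those used in the example.

    def transactions_for_call(num_documents: int, num_annotations: int) -> int:
        """Each annotation operation applied to each document counts as one transaction."""
        return num_documents * num_annotations

    # 1,000 documents, sentiment analysis only -> 1,000 transactions
    print(transactions_for_call(1_000, 1))
    # 1,000 documents, sentiment analysis + key-phrase extraction -> 2,000 transactions
    print(transactions_for_call(1_000, 2))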
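
For reference, here is a minimal sketch of calling one of the endpoints in the table above using Python and the requests library. The subscription key is a placeholder, and the exact shape of the JSON response depends on the query and on the tier purchased.

    import requests

    SUBSCRIPTION_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder, not a real key
    ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/search"  # Bing Web Search API

    # Each call to this endpoint counts as one Bing Web Search API transaction,
    # regardless of which answer types (web pages, images, news, ...) come back.
    response = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        params={"q": "custom speech service"},
    )
    response.raise_for_status()
    print(response.json().get("webPages", {}).get("value", [])[:1])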
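
The Bing tier billing examples above reduce to a simple per-increment calculation. A minimal sketch, using only the $4 (Tier S3) and $7 (Tier S1) rates quoted in those examples; endpoint_charge is an illustration of the arithmetic, not an official billing formula.

    def endpoint_charge(transactions: int, price_per_increment: float, increment: int) -> float:
        """Charge for one endpoint: transactions billed per pricing increment, prorated."""
        return price_per_increment * (transactions / increment)

    # Tier S3 example: 12,000 Web Search + 1,000 Image Search at $4 per 1,000 transactions.
    print(endpoint_charge(12_000 + 1_000, 4.0, 1_000))  # 52.0

    # Tier S1 example: Web + Video Search billed per 1,000; Autosuggest billed per 25,000.
    s1_bill = (endpoint_charge(15_000 + 3_000, 7.0, 1_000)   # $126
               + endpoint_charge(25_000, 7.0, 25_000))       # $7
    print(s1_bill)  # 133.0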

Resources

Estimate your monthly costs for Azure services

Review Azure pricing frequently asked questions

Learn more about Cognitive Services

Review technical tutorials, videos, and more resources
