Pricing - Azure Speech in Foundry Tools

Unified speech services for speech-to-text, text-to-speech, and speech translation

The unified Speech services provide a wide range of speech recognition and generation capabilities including speech transcription, text-to-speech and speech translation.


Prices are estimates only and are not actual price quotes. Actual pricing may vary depending on the type of agreement entered into with Microsoft, the date of purchase, and the currency exchange rate. Prices are calculated in US dollars and converted using London closing spot rates captured two business days before the last business day of the previous month. If the two business days before the end of the month fall on a bank holiday in major markets, the rate-setting day is generally the day immediately preceding those two business days. This rate applies to all transactions during the upcoming month. Sign in to the Azure pricing calculator to see pricing based on your current program/offer with Microsoft. Contact an Azure sales specialist for more information on pricing or to request a price quote. See the frequently asked questions about Azure pricing.
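The rate-setting rule above can be read in more than one way. The following is a minimal sketch of one reading (find the last business day of the previous month, then step back two business days). The function names are hypothetical, and it assumes business days are Monday to Friday with market holidays ignored, so it does not cover the holiday exception described above.

```python
from datetime import date, timedelta


def is_business_day(d: date) -> bool:
    # Assumption: Monday-Friday only; bank holidays are not modeled.
    return d.weekday() < 5


def previous_business_day(d: date) -> date:
    """Return the closest business day strictly before d."""
    d -= timedelta(days=1)
    while not is_business_day(d):
        d -= timedelta(days=1)
    return d


def rate_setting_date(any_day_in_billing_month: date) -> date:
    """One reading of the rule: two business days before the last
    business day of the month preceding the billing month."""
    first_of_month = any_day_in_billing_month.replace(day=1)
    last = first_of_month - timedelta(days=1)  # last calendar day of previous month
    while not is_business_day(last):
        last -= timedelta(days=1)
    return previous_business_day(previous_business_day(last))


if __name__ == "__main__":
    print(rate_setting_date(date(2024, 7, 15)))
```

Under these assumptions, transactions in July 2024 would use the rate captured on the date the function returns for any day in that month.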


Free (F0)

Speech services quotas and limits by tier (Free F0)
Category	Features	Price
Speech to Text (per second billing)	Standard	Real-time Transcription: 5 audio hours free per month²
Speech to Text (per second billing)	Custom	Real-time Transcription: 5 audio hours free per month² Endpoint hosting: 1 model free per month¹
Text to Speech (per character billing)	Neural	0.5 million characters free per month
Speech Translation (per second billing)	Standard	5 audio hours free per month

See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.

¹Unused models will be automatically decommissioned after 7 days.

²Free audio hours for speech to text are shared between Standard and Custom. Batch is not supported.
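As footnote ² notes, Standard and Custom real-time transcription draw on the same 5 free audio hours. A minimal sketch of tracking that shared allowance follows; the function name is hypothetical, and usage is assumed to be tracked in whole seconds.

```python
# 5 audio hours per month, taken from the Free (F0) table above.
FREE_STT_SECONDS_PER_MONTH = 5 * 3600


def remaining_free_stt_seconds(standard_seconds: int, custom_seconds: int) -> int:
    """Seconds of free real-time transcription left this month.

    Standard and Custom usage are summed because the free allowance
    is shared between them.
    """
    used = standard_seconds + custom_seconds
    return max(0, FREE_STT_SECONDS_PER_MONTH - used)


if __name__ == "__main__":
    # 1 hour of Standard plus 2 hours of Custom leaves 2 hours.
    print(remaining_free_stt_seconds(3600, 7200))
```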

Pay as You Go: pay only for what you use

Voice Live Prices

Category	Features	Price
Voice Live Pro (per m token billing)^VL1	Text	Input: $- Cached Input: $- Output: $-
	Audio - Standard	Input: $- Cached Input: $- Output: $-
	Audio - Custom^VL2	Input: $- Cached Input: $- Output: $-
	Native audio with speech-to-speech real-time model	Input: $- Cached Input: $- Output: $-
	Image	Input: $- Cached Input: $-
Voice Live Standard (per m token billing)^VL1	Text	Input: $- Cached Input: $- Output: $-
	Audio - Standard	Input: $- Cached Input: $- Output: $-
	Audio - Custom^VL2	Input: $- Cached Input: $- Output: $-
	Native audio with speech-to-speech real-time model	Input: $- Cached Input: $- Output: $-
Voice Live Lite (per m token billing)^VL1	Text	Input: $- Cached Input: $- Output: $-
	Audio - Standard	Input: $- Cached Input: $- Output: $-
	Audio - Custom^VL2	Cached Input: $- Output: $-
	Native audio with speech-to-speech real-time model	Input: $- Cached Input: $-
Voice Live BYO (per m token billing)^VL1	Audio - Standard	Input: $- Output: $-
Voice Live BYO (per m token billing)^VL1	Audio - Custom^VL2	Input: $- Output: $-
Voice Live Avatar (per minute billing)	Avatar output with Voice Live	Charged through Text to Speech Avatar ‘interactive avatar (real-time)’. See below Text to Speech pricing table for details.

^VL1Voice Live pricing tiers are determined by the LLM models selected by developers. Models within each tier may be updated or retired as new options become available. To learn more about how Voice Live API pricing works, click here.

^VL2You will be charged separately for custom speech and custom voice model training and hosting. Refer to the ‘Speech to Text – Custom Transcription’ and ‘Text to Speech – Custom Voice – Professional’ pricing for details. Custom voice is a limited access feature. Learn more about how to create custom voices. When using MAI-Transcribe, Standard-Audio pricing applies.
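Voice Live is billed per million tokens, with separate rates for input, cached input, and output. The sketch below shows that arithmetic. The `TokenRates` type and `token_cost` function are hypothetical, the rates are placeholders because this page lists no figures, and the three token counts are assumed to be disjoint (cached input tokens are not also counted as input tokens).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TokenRates:
    """Price per 1M tokens for one modality in one tier (placeholder values)."""
    input_per_million: Decimal
    cached_input_per_million: Decimal
    output_per_million: Decimal


def token_cost(rates: TokenRates, input_tokens: int,
               cached_input_tokens: int, output_tokens: int) -> Decimal:
    """Sum the three token classes, each at its own per-million rate."""
    total = (
        Decimal(input_tokens) * rates.input_per_million
        + Decimal(cached_input_tokens) * rates.cached_input_per_million
        + Decimal(output_tokens) * rates.output_per_million
    )
    return total / Decimal(1_000_000)


if __name__ == "__main__":
    # Placeholder rates, not real prices.
    example = TokenRates(Decimal("1"), Decimal("0.5"), Decimal("2"))
    print(token_cost(example, 2_000_000, 1_000_000, 500_000))
```

A session that mixes text and audio would call `token_cost` once per modality with that modality's rates and add the results.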

Speech Model Prices

Category	Features	Price
Speech to Text (per second billing)	Standard Transcription	Real-time Transcription: $- per hour Fast Transcription: $- per hour⁸ Batch Transcription: $- per hour¹
	Custom Transcription	Real-time Transcription: $- per hour Batch Transcription: $- per hour¹ Endpoint hosting: $- per model per hour Custom Speech Training⁵: $- per compute hour
	Enhanced add-on features: Continuous Language identification Diarization Pronunciation Assessment (prosody)	Real-time: $- per hour per feature Batch (Continuous Language identification, Diarization): Included in Standard/Custom (no extra charge)
Speech Translation (per second billing)	Real-time Speech Translation	$- per audio hour³
	Live Interpreter	Input audio: $- per audio hour Output text: $- per 1M characters Output audio (Standard voice): $- per audio hour^LI Output audio (Custom voice): $- per audio hour^LI
	Video Translation	Input video: $- per hour Output video (Standard voice): $- per hour Output video (Personal voice): $- per hour
LLM Speech⁹	Standard Transcription	$- per hour
	Standard Translation	$- per hour
	MAI-transcribe	$- per hour
Text to Speech⁷	Standard Voice	Neural/Neural HD (real-time and batch): $- per 1M characters Neural HD (real-time and batch)⁴: $- per 1M characters
	Custom Voice	Professional Voice: Synthesis (real-time and batch): $- per 1M characters Synthesis (neural HD real-time and batch): $- per 1M characters Voice model training: $- per compute hour, up to $- per training Endpoint hosting: $- per model per hour
	Custom Voice	Personal Voice⁶: Synthesis (real-time and batch): $- per 1M characters Voice creation: Free Voice profile storage: $- per 1,000 voice profiles per month
Text to Speech Add-on feature: Avatar	Standard Avatar	Real-time (Interactive): $- per minute Batch: $- per minute
	Custom Avatar	Photo Avatar: Real-time (Interactive): $- per minute Batch: $- per minute Photo Avatar Creation: $- per avatar
	Custom Avatar	Video Avatar: Real-time (Interactive): $- per minute Batch: $- per minute Interactive 4K avatar (real-time): $- per minute 4K avatar video (batch): $- per minute Avatar model training: $- per compute hour Endpoint hosting: $- per model per hour

See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.

Speech to text hours are measured as the hours of audio sent to the service, billed in second increments.
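To illustrate per-second billing against an hourly price, here is a minimal sketch. The function name and the rate passed in are hypothetical (this page lists no figures), and rounding partial seconds up to the next whole second is an assumption, not something the page states.

```python
import math
from decimal import Decimal


def stt_cost(audio_seconds: float, hourly_rate: Decimal) -> Decimal:
    """Cost of transcribing audio billed in one-second increments."""
    # Assumption: a partial second is billed as a full second.
    billable_seconds = math.ceil(audio_seconds)
    return hourly_rate * Decimal(billable_seconds) / Decimal(3600)


if __name__ == "__main__":
    # Placeholder hourly rate, not a real price.
    print(stt_cost(90.2, Decimal("3.60")))
```

`Decimal` is used rather than `float` so that small per-second amounts do not pick up binary rounding error when summed over many requests.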

¹To take advantage of this new Batch Transcription pricing you need to use Speech to text REST API V3.2 or later versions. See Speech to text REST API for information.

²This reflects public preview pricing.

³This price includes 1 audio input and output and up to 2 text translation languages, using standard or custom Speech to Text and standard Translation. For custom Translation or 3+ translation languages, please reference the Azure Translator in Foundry Tools Text Translation pricing page.

⁴Selected text to speech voices are available via two model variants: Neural and NeuralHD. Learn more here.

⁵Custom Speech Training applies when customizing any base model released on or after October 1, 2023.

⁶Personal Voice is a limited access feature restricted to certain pre-approved use cases, and you must apply for access. To learn more about the service, check the documentation.

⁷Text to Speech: speech synthesis usage is billed per character. Avatar is billed per second. Training and model hosting is billed per second.

⁸To use Fast Transcription you need to use Speech to text REST API 2024-11-15 or later versions. See Speech to text REST API for information.

⁹LLM Speech is currently sharing the same price and SKU ID with Fast Transcription.

^LIThis price includes text output
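Footnote ⁷ states that speech synthesis is billed per character, while the table quotes prices per 1M characters. A minimal sketch of that conversion follows. The function name and rate are hypothetical, and counting characters with `len(text)` is a simplifying assumption; the service's own definition of a billable character may differ.

```python
from decimal import Decimal


def tts_cost(text: str, rate_per_million_chars: Decimal) -> Decimal:
    """Cost of synthesizing text at a per-1M-characters price."""
    # Assumption: every character in the string, including spaces and
    # punctuation, counts as one billable character.
    billable_chars = len(text)
    return rate_per_million_chars * Decimal(billable_chars) / Decimal(1_000_000)


if __name__ == "__main__":
    # Placeholder rate, not a real price.
    print(tts_cost("hello world", Decimal("10")))
```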

Commitment Tiers – Standard

Category	Features	Price (per month)	Overage
Speech to Text	Standard	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
	Custom	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
	Enhanced add-on features:² Continuous Language identification Diarization Pronunciation Assessment (prosody)	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
Text to Speech	Neural¹	$- for 80M characters	$- per 1M characters
		$- for 400M characters	$- per 1M characters
		$- for 2,000M characters	$- per 1M characters

¹This includes both real-time synthesis and batch synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, Custom Neural Voice and Personal Voice are not included.

²Real-time speech to text only. The Continuous Language Identification and Diarization add-on features are included with batch speech to text.
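Each commitment tier pairs a fixed monthly price for an included quantity with an overage rate for usage beyond it. The sketch below shows how a monthly bill could be computed under that model. The `CommitmentTier` type and `monthly_bill` function are hypothetical, the prices are placeholders, and the sketch assumes the fixed price is charged in full even when usage falls short of the included quantity.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class CommitmentTier:
    included_units: Decimal  # hours for Speech to Text, millions of characters for Text to Speech
    monthly_price: Decimal   # fixed price for the included units (placeholder)
    overage_rate: Decimal    # price per unit beyond the included units (placeholder)


def monthly_bill(tier: CommitmentTier, used_units: Decimal) -> Decimal:
    """Fixed tier price plus overage for any usage above the included units."""
    overage_units = max(Decimal(0), used_units - tier.included_units)
    return tier.monthly_price + overage_units * tier.overage_rate


if __name__ == "__main__":
    # Placeholder figures for a 2,000-hour tier, not real prices.
    tier = CommitmentTier(Decimal(2000), Decimal("1000"), Decimal("0.5"))
    print(monthly_bill(tier, Decimal(2100)))  # 100 hours of overage
```

Comparing `monthly_bill` across the three tiers for a forecast usage figure is one way to pick the cheapest tier.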

Commitment Tiers – Connected container

Category	Features	Price (per month)	Overage
Speech to Text²	Standard	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
	Custom	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
	Enhanced add-on features:² Language identification Diarization	$- for 2,000 hours	$- per hour
		$- for 10,000 hours	$- per hour
		$- for 50,000 hours	$- per hour
Text to Speech	Neural¹	$- for 80M characters	$- per 1M characters
		$- for 400M characters	$- per 1M characters
		$- for 2,000M characters	$- per 1M characters

¹This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.

²Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.

See the documentation for information on Commitment tiers.

Commitment Tiers – Disconnected container

Sign up to access speech in disconnected containers, or learn more

Category	Features	Price (per year)	Max usage (per year)	Projected usage (per month)
Speech to Text²	Standard	$-	120,000 hours	10,000 hours
		$-	600,000 hours	50,000 hours
	Custom	$-	120,000 hours	10,000 hours
		$-	600,000 hours	50,000 hours
	Enhanced add-on features: Language identification Diarization	$-	120,000 hours	10,000 hours
		$-	600,000 hours	50,000 hours
Text to Speech	Neural¹	$-	4.8B characters	400M characters
		$-	24B characters	2,000M characters

¹This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.

²Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.

These features are being deprecated and are available only to existing customers. Check details and learn how to migrate to new features.

Instance	Category	Features	Price
Free - Web/Container 1 concurrent request	Text to Speech	Standard	5 million characters free per month
Free - Web/Container 1 concurrent request	Text to Speech	Custom	5 million characters free per month Endpoint hosting: 1 model free per month
Standard - Web/Container 100 concurrent requests for Base model 20 concurrent requests for Custom model	Text to Speech	Standard	$- per 1M characters
	Text to Speech	Custom	$- per 1M characters Endpoint hosting: $- per model per hour


Frequently asked questions

- For Speech to Text and Speech Translation, usage is billed in one-second increments.
- For Text to Speech: usage is billed per character. Check the definition of character in the pricing note.
- For custom neural voice hosting: usage is billed per endpoint per second. Check details in the pricing note.
- For personal voice profile storage: usage is billed per voice profile per day. Check details in the pricing note.
- For Text to Speech Avatar, usage is billed per second.
- For Speech to Text and Text to Speech (including Avatar), endpoint hosting for custom models is billed per second per model.
The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.
The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.
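The idea of scoring word sequences by likelihood can be sketched with a toy bigram model. Everything here is illustrative: the probabilities are made up, the function and table names are hypothetical, and this is not how the Speech service implements or customizes its models. The second half mimics customization by adding a domain-specific bigram to the table.

```python
import math

# Made-up bigram probabilities for illustration only; "<s>" marks sentence start.
BASE_BIGRAM_PROB: dict[tuple[str, str], float] = {
    ("<s>", "recognize"): 0.01,
    ("recognize", "speech"): 0.2,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}
UNSEEN_PROB = 1e-6  # floor for word pairs missing from the table


def log_score(words: list[str],
              bigram_prob: dict[tuple[str, str], float] = BASE_BIGRAM_PROB) -> float:
    """Log-probability of a word sequence under a bigram model."""
    tokens = ["<s>"] + words
    return sum(
        math.log(bigram_prob.get(pair, UNSEEN_PROB))
        for pair in zip(tokens, tokens[1:])
    )


if __name__ == "__main__":
    # The likelier hypothesis gets the higher (less negative) score.
    print(log_score(["recognize", "speech"]))
    print(log_score(["wreck", "a", "nice", "beach"]))

    # "Customizing": a copy of the table with a domain term made more likely.
    customized = {**BASE_BIGRAM_PROB, ("dot", "net"): 0.05}
    print(log_score(["dot", "net"]), log_score(["dot", "net"], customized))
```

With these toy numbers, "recognize speech" outscores "wreck a nice beach", and the customized table raises the score of "dot net" relative to the base table.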
The acoustic model is a classifier that labels short fragments of audio into one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” is comprised of four phonemes “s p iy ch”. These classifications are made on the order of 100 times per second. Customizing the acoustic model can enable the system to learn to do a better job recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.
The Speech service offers a wide range of text-to-speech (TTS) voice fonts; however, custom neural voice lets you build your own custom voice that suits your needs and your brand. Read the blog for more information.
Language identification allows you to identify a switch in spoken language and transcribe speech accordingly. This can be applied in scenarios where the audio language is unknown, or when speaker(s) may speak multiple languages. Single Language Identification is available at no additional cost. Continuous Language Identification is an enhanced add-on feature. Visit docs to learn more.
- Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate pronunciation of multiple speakers in real time. Visit docs to learn more.
- It is charged as standard Speech to Text. For example, for evaluation of 8 seconds of speech, you will be charged around $-.
