Unified Speech services for speech to text, text to speech, and speech translation
The unified Speech services provide a wide range of speech recognition and generation capabilities including speech transcription, text-to-speech and speech translation.
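Explore pricing options
Apply filters to customize pricing options to your needs.
Prices are estimates only and are not intended as actual price quotes. Actual pricing may vary depending on the type of agreement entered with Microsoft, the date of purchase, and the currency exchange rate. Prices are calculated in US dollars and converted using London closing spot rates captured in the two business days prior to the last business day of the previous month end. If the two business days prior to the end of the month fall on a bank holiday in major markets, the rate setting day is generally the day immediately preceding those two business days. This rate applies to all transactions during the upcoming month. Sign in to the Azure pricing calculator to see pricing based on your current program/offer with Microsoft. Contact an Azure sales specialist for more information on pricing or to request a price quote. See frequently asked questions about Azure pricing.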
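US government entities are eligible to purchase Azure Government services from a licensing solution provider with no upfront financial commitment, or directly through a pay-as-you-go online subscription.
Important: Prices shown in R$ are for reference only. Because this is an international transaction, the final price depends on the exchange rate and on whether IOF taxes are included. An eNF will not be issued.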
Free (F0)
| Category | Feature | Price |
|---|---|---|
| Speech to Text (per second billing) | Standard | Real-time Transcription: 5 audio hours free per month [2] |
| | Custom | Real-time Transcription: 5 audio hours free per month [2]; Endpoint hosting: 1 model free per month [1] |
| Text to Speech (per character billing) | Neural | 0.5 million characters free per month |
| Speech Translation (per second billing) | Standard | 5 audio hours free per month |
See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.
[1] Unused models will be automatically decommissioned after 7 days.
[2] Free audio hours for speech to text are shared between Standard and Custom. Batch is not supported.
Pay as You Go: pay only for what you use
| Category | Feature | Price |
|---|---|---|
| Voice Live Pro (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Standard | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Custom [VL2] | Input: $-; Cached Input: $-; Output: $- |
| | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $-; Output: $- |
| Voice Live Standard (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Standard | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Custom [VL2] | Input: $-; Cached Input: $-; Output: $- |
| | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $-; Output: $- |
| Voice Live Lite (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Standard | Input: $-; Cached Input: $-; Output: $- |
| | Audio - Custom [VL2] | Cached Input: $-; Output: $- |
| | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $- |
| Voice Live BYO (per 1M token billing) [VL1] | Audio - Standard | Input: $-; Output: $- |
| | Audio - Custom [VL2] | Input: $-; Output: $- |
| Voice Live Avatar (per minute billing) | Avatar output with Voice Live | Charged through Text to Speech Avatar 'interactive avatar (real-time)'. See the Text to Speech rows of this table for details. |
| Speech to Text (per second billing) | Standard Transcription | Real-time Transcription: $- per hour; Fast Transcription: $- per hour [8]; Batch Transcription: $- per hour [1] |
| | Custom Transcription | Real-time Transcription: $- per hour; Batch Transcription: $- per hour [1]; Endpoint hosting: $- per model per hour; Custom Speech Training [5]: $- per compute hour |
| | Enhanced add-on features | Real-time: $- per hour per feature; Batch (Continuous Language identification, Diarization): Included in Standard/Custom (no extra charge) |
| Speech Translation (per second billing) | Real-time Speech Translation | $- per audio hour [3] |
| | Live Interpreter | Input audio: $- per audio hour; Output text: $- per 1M characters; Output audio (Standard voice): $- per audio hour [LI]; Output audio (Custom voice): $- per audio hour [LI] |
| | Video Translation | Input video: $- per hour; Output video (Standard voice): $- per hour; Output video (Personal voice): $- per hour |
| LLM Speech (Preview) [9] | Standard Transcription | $- per hour |
| | Standard Translation | $- per hour |
| Text to Speech [7] | Standard Voice | Neural (real-time and batch): $- per 1M characters; Neural HD (real-time and batch) [4]: $- per 1M characters |
| | Custom Voice: Professional Voice | Synthesis (real-time and batch): $- per 1M characters; Synthesis (neural HD real-time and batch): $- per 1M characters; Voice model training: $- per compute hour, up to $- per training; Endpoint hosting: $- per model per hour |
| | Custom Voice: Personal Voice [6] | Synthesis (real-time and batch): $- per 1M characters; Voice creation: Free; Voice profile storage: $- per 1,000 voice profiles per month |
| | Enhanced add-on feature: Avatar (Standard) | Interactive avatar (real-time): $- per minute; Interactive 4K avatar (real-time): $- per minute; Avatar video (batch): $- per minute; 4K avatar video (batch): $- per minute |
| | Enhanced add-on feature: Avatar (Custom) | Avatar model training: $- per compute hour; Interactive avatar (real-time): $- per minute; Interactive 4K avatar (real-time): $- per minute; Avatar video (batch): $- per minute; 4K avatar video (batch): $- per minute; Endpoint hosting: $- per model per hour |
See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.
Speech to text hours are measured as the hours of audio sent to the service, billed in second increments.
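As a rough illustration of per-second billing, the sketch below converts an audio duration into a charge. The function name, the hourly rate, and the choice to round partial seconds up are all assumptions made for the example; the page lists prices only as placeholders and does not state a rounding rule.

```python
import math


def speech_to_text_cost(audio_seconds: float, rate_per_hour: float) -> float:
    """Estimate a charge for audio billed in one-second increments."""
    # Assumption: a partial second is billed as a full second.
    billable_seconds = math.ceil(audio_seconds)
    # Hourly rate prorated to the number of billable seconds.
    return billable_seconds * rate_per_hour / 3600


# Hypothetical rate of 1.00 per audio hour (not a real price).
# 90 seconds is 0.025 audio hours.
print(speech_to_text_cost(90, 1.00))
```

The same proration applies to any per-hour rate in the table: divide by 3,600 and multiply by the billable seconds.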
[1] To take advantage of this new Batch Transcription pricing, you need to use Speech to text REST API V3.2 or later. See Speech to text REST API for information.
[2] This reflects public preview pricing.
[3] This price includes 1 audio input and output, and up to 2 text translation languages using standard or custom Speech to Text and standard Translation. For custom Translation or 3+ translation languages, please reference the Azure Translator in Foundry Tools Text Translation pricing page.
[4] Selected text to speech voices are available via two model variants: Neural and NeuralHD. Learn more here.
[5] Custom Speech Training applies when customizing any base model released on or after October 1, 2023.
[6] Personal Voice is a limited access feature restricted to certain pre-approved use cases, and you must apply for access. To learn more about the service, check the documentation.
[7] Text to Speech: speech synthesis usage is billed per character. Avatar is billed per second. Training and model hosting are billed per second.
[8] To use Fast Transcription, you need to use Speech to text REST API 2024-11-15 or later. See Speech to text REST API for information.
[9] LLM Speech (preview) currently shares the same price and SKU ID as Fast Transcription.
[VL1] With Voice Live Pro, developers can choose from larger LLMs such as GPT-Realtime, GPT-4o and GPT-4.1 models. With Voice Live Standard, developers can choose from smaller LLMs such as GPT-4o-Mini-Realtime, GPT-4o Mini and GPT-4.1 Mini models. With Voice Live Lite, developers can choose from SLMs and equivalent models such as GPT-4.1 Nano and Phi models. Models for each tier will be updated or retired as they become available. To learn more about how Voice Live API pricing works, click here.
[VL2] You will be charged separately for custom speech and custom voice model training and hosting. Refer to the 'Speech to Text - Custom Transcription' and 'Text to Speech - Custom Voice - Professional' pricing for details. Custom voice is a limited access feature. Learn more about how to create custom voices.
[LI] This price includes text output.
Commitment Tiers – Standard
| Category | Feature | Price (per month) | Overage |
|---|---|---|---|
| Speech to Text | Standard | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| | Custom | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| | Enhanced add-on features [2] | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| Text to Speech | Neural [1] | $- for 80M characters | $- per 1M characters |
| | | $- for 400M characters | $- per 1M characters |
| | | $- for 2,000M characters | $- per 1M characters |
[1] This includes both real-time synthesis and batch synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, Custom Neural Voice and Personal Voice are not included.
[2] Real-time speech to text only. The Continuous Language Identification and Diarization add-on features are included with batch speech to text.
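A commitment tier pairs a fixed monthly price for an included quantity with a per-unit overage rate. The sketch below shows one plausible way to combine the two. The function name and all numbers are hypothetical, and it assumes overage is charged only on usage beyond the included quantity.

```python
def commitment_tier_cost(
    used_units: float,
    included_units: float,
    tier_price: float,
    overage_rate: float,
) -> float:
    """Monthly cost: fixed tier price plus overage on usage above the tier."""
    # Usage inside the commitment adds nothing beyond the fixed price.
    overage_units = max(0.0, used_units - included_units)
    return tier_price + overage_units * overage_rate


# Hypothetical tier: 2,000 hours for 1000.0, overage at 0.5 per hour.
print(commitment_tier_cost(2300, 2000, 1000.0, 0.5))  # 1150.0
print(commitment_tier_cost(1500, 2000, 1000.0, 0.5))  # 1000.0
```

Comparing this result against pay-as-you-go cost for the same usage is one way to judge whether a tier is worthwhile.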
Commitment Tiers – Connected container
| Category | Feature | Price (per month) | Overage |
|---|---|---|---|
| Speech to Text [2] | Standard | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| | Custom | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| | Enhanced add-on features [2] | $- for 2,000 hours | $- per hour |
| | | $- for 10,000 hours | $- per hour |
| | | $- for 50,000 hours | $- per hour |
| Text to Speech | Neural [1] | $- for 80M characters | $- per 1M characters |
| | | $- for 400M characters | $- per 1M characters |
| | | $- for 2,000M characters | $- per 1M characters |
[1] This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.
[2] Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.
See the documentation for information on Commitment tiers.
Commitment Tiers – Disconnected container
Sign up to access speech in disconnected containers, or learn more
| Category | Feature | Price (per year) | Max usage (per year) | Projected usage (per month) |
|---|---|---|---|---|
| Speech to Text [2] | Standard | $- | 120,000 hours | 10,000 hours |
| | | $- | 600,000 hours | 50,000 hours |
| | Custom | $- | 120,000 hours | 10,000 hours |
| | | $- | 600,000 hours | 50,000 hours |
| | Enhanced add-on features | $- | 120,000 hours | 10,000 hours |
| | | $- | 600,000 hours | 50,000 hours |
| Text to Speech | Neural [1] | $- | 4.8B characters | 400M characters |
| | | $- | 24B characters | 2,000M characters |
[1] This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.
[2] Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.
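The figures in the disconnected container table are consistent with projected monthly usage being the annual maximum spread evenly over twelve months. That relationship is inferred from the numbers and is not stated on the page. A minimal check, with a hypothetical function name:

```python
def projected_monthly_usage(max_usage_per_year: float) -> float:
    """Spread an annual usage cap evenly over 12 months."""
    return max_usage_per_year / 12


# Annual caps copied from the table above (hours, then characters).
for annual_cap in (120_000, 600_000, 4_800_000_000, 24_000_000_000):
    print(annual_cap, projected_monthly_usage(annual_cap))
```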
These features are being deprecated and only available for existing customers to use. Check details and learn how to migrate to new features.
| Instance | Category | Feature | Price |
|---|---|---|---|
| Free - Web/Container, 1 concurrent request | Text to Speech | Standard | 5 million characters free per month |
| | | Custom | 5 million characters free per month; Endpoint hosting: 1 model free per month |
| Standard - Web/Container, 100 concurrent requests for Base model, 20 concurrent requests for Custom model | Text to Speech | Standard | $- per 1M characters |
| | | Custom | $- per 1M characters; Endpoint hosting: $- per model per hour |
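Azure pricing and purchasing options
Connect with Microsoft directly
Get a walkthrough of Azure pricing. Understand pricing for your cloud solution, learn about cost optimization, and request a custom proposal.
Talk to a sales specialist
Additional resources
Azure AI Speech
Learn more about Azure AI Speech features and capabilities.
Pricing calculator
Estimate your expected monthly costs for using any combination of Azure products.
Documentation
Review technical tutorials, videos, and more Azure AI Speech resources.
Frequently asked questions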
-
- For Speech to Text and Speech Translation, usage is billed in one-second increments.
- For Text to Speech: usage is billed per character. Check the definition of character in the pricing note.
- For custom neural voice hosting: usage is billed per endpoint per second. Check details in the pricing note.
- For personal voice profile storage: usage is billed per voice profile per day. Check details in the pricing note.
- For Text to Speech Avatar, usage is billed per second.
- For Speech to Text and Text to Speech (including Avatar), endpoint hosting for custom models is billed per second per model.
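The per-character and per-second hosting units above can be turned into estimates along these lines. This is a sketch under stated assumptions: the function names and every rate are hypothetical, and the page's own definition of a billable character (in the pricing note) is not modeled here.

```python
def text_to_speech_cost(characters: int, rate_per_million_chars: float) -> float:
    """Synthesis charge when usage is billed per character."""
    return characters * rate_per_million_chars / 1_000_000


def endpoint_hosting_cost(
    hosted_seconds: int, models: int, rate_per_model_hour: float
) -> float:
    """Hosting charge when usage is billed per second per model."""
    return hosted_seconds * models * rate_per_model_hour / 3600


# Hypothetical rates (not real prices): 16.0 per 1M characters,
# 0.5 per model per hour.
print(text_to_speech_cost(250_000, 16.0))    # 4.0
print(endpoint_hosting_cost(7200, 2, 0.5))   # 2.0
```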
-
The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.
-
The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.
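To make the idea of scoring word sequences concrete, here is a toy bigram model with add-one smoothing. It is only an illustration: the corpus is invented, and nothing here reflects how the Speech service builds or customizes its language models.

```python
import math
from collections import Counter

# Tiny made-up corpus; real language models are trained on far more text.
corpus = [
    "please recognize speech now",
    "we recognize speech well",
    "systems recognize speech",
    "they wreck a car",
    "a nice beach day",
]

context_counts = Counter()  # how often each word appears as a context
bigram_counts = Counter()   # how often each (previous, next) pair appears
vocabulary = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocabulary.update(words)
    for prev, word in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, word)] += 1


def log_prob(phrase: str) -> float:
    """Add-one smoothed bigram log-probability of a phrase."""
    words = ["<s>"] + phrase.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + len(vocabulary)
        total += math.log(numerator / denominator)
    return total


# The phrase seen more often in the corpus receives the higher score.
print(log_prob("recognize speech") > log_prob("wreck a nice beach"))  # True
```

Adding domain sentences to the corpus shifts these scores, which is the intuition behind customizing a language model with product names or jargon. Note that the longer phrase is also penalized simply for having more words, so this toy comparison overstates the effect.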
-
The acoustic model is a classifier that labels short fragments of audio as one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” consists of four phonemes, “s p iy ch”. These classifications are made on the order of 100 times per second. Customizing the acoustic model can enable the system to do a better job of recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.
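The step from per-frame labels to a phoneme sequence can be pictured with a deliberately simplified sketch. The frame labels below are invented, and merging runs of identical labels is an assumption made for illustration; production recognizers use probabilistic decoding rather than this kind of direct collapsing.

```python
from itertools import groupby

FRAMES_PER_SECOND = 100  # "on the order of 100 times per second"


def collapse_frames(frame_labels):
    """Merge runs of identical per-frame labels into one phoneme each."""
    return [label for label, _run in groupby(frame_labels)]


# Hypothetical classifier output for the word "speech".
frames = ["s"] * 12 + ["p"] * 6 + ["iy"] * 20 + ["ch"] * 14

print(collapse_frames(frames))          # ['s', 'p', 'iy', 'ch']
print(len(frames) / FRAMES_PER_SECOND)  # duration in seconds: 0.52
```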
-
The Speech service offers a wide range of text-to-speech (TTS) voice fonts. Custom neural voice additionally lets you build your own voice that suits your needs and your brand. Read the blog for more information.
-
Language identification allows you to identify a switch in spoken language and transcribe speech accordingly. This can be applied in scenarios where the audio language is unknown, or when speaker(s) may speak multiple languages. Single Language Identification is available at no additional cost. Continuous Language Identification is an enhanced add-on feature. Visit docs to learn more.
-
- Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate pronunciation of multiple speakers in real time. Visit docs to learn more.
- Pronunciation assessment is charged at the standard Speech to Text rate. For example, evaluating 8 seconds of speech costs around $-.
Talk to a sales specialist for a walkthrough of Azure pricing. Understand pricing for your cloud solution.
Try Azure for 30 days with free cloud services and a $200 credit.