Azure AI Speech pricing

Unified Speech services for speech to text, text to speech, and speech translation

The unified Speech services provide a wide range of speech recognition and generation capabilities including speech transcription, text-to-speech and speech translation.

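Explore pricing options

Apply filters to customize pricing options to your needs.

Prices are estimates only and should not be used as actual price quotes. Actual pricing may vary depending on the type of agreement entered into with Microsoft, the date of purchase, and the currency exchange rate. Prices are calculated in US dollars and converted using London closing spot rates captured in the two business days before the last business day of the previous month. If the two business days before the end of the month fall on a bank holiday in major markets, the rate-setting day is generally the day immediately following those two business days. This rate applies to all transactions during the upcoming month. Sign in to the Azure pricing calculator to see pricing based on your current program/offer with Microsoft. For more information on pricing or to request a quote, contact an Azure sales specialist. See the frequently asked questions about Azure pricing.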

Free (F0)

Category | Feature | Price
Speech to Text (per second billing) | Standard | Real-time Transcription: 5 audio hours free per month [2]
Speech to Text (per second billing) | Custom | Real-time Transcription: 5 audio hours free per month [2]; Endpoint hosting: 1 model free per month [1]
Text to Speech (per character billing) | Neural | 0.5 million characters free per month
Speech Translation (per second billing) | Standard | 5 audio hours free per month
Speech services quotas and limits by tier (Free F0)

See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.

[1] Unused models are automatically decommissioned after 7 days.

[2] Free audio hours for speech to text are shared between Standard and Custom. Batch is not supported.

Pay as You Go: pay only for what you use

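Category | Feature | Price
Voice Live Pro (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $-
Voice Live Pro (per 1M token billing) [VL1] | Audio - Standard | Input: $-; Cached Input: $-; Output: $-
Voice Live Pro (per 1M token billing) [VL1] | Audio - Custom [VL2] | Input: $-; Cached Input: $-; Output: $-
Voice Live Pro (per 1M token billing) [VL1] | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $-; Output: $-
Voice Live Standard (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $-
Voice Live Standard (per 1M token billing) [VL1] | Audio - Standard | Input: $-; Cached Input: $-; Output: $-
Voice Live Standard (per 1M token billing) [VL1] | Audio - Custom [VL2] | Input: $-; Cached Input: $-; Output: $-
Voice Live Standard (per 1M token billing) [VL1] | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $-; Output: $-
Voice Live Lite (per 1M token billing) [VL1] | Text | Input: $-; Cached Input: $-; Output: $-
Voice Live Lite (per 1M token billing) [VL1] | Audio - Standard | Input: $-; Cached Input: $-; Output: $-
Voice Live Lite (per 1M token billing) [VL1] | Audio - Custom [VL2] | Cached Input: $-; Output: $-
Voice Live Lite (per 1M token billing) [VL1] | Native audio with speech-to-speech real-time model | Input: $-; Cached Input: $-
Voice Live BYO (per 1M token billing) [VL1] | Audio - Standard | Input: $-; Output: $-
Voice Live BYO (per 1M token billing) [VL1] | Audio - Custom [VL2] | Input: $-; Output: $-
Voice Live Avatar (per minute billing) | Avatar output with Voice Live | Charged through Text to Speech Avatar ‘interactive avatar (real-time)’. See the Text to Speech pricing rows below for details.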
Speech to Text (per second billing) | Standard Transcription | Real-time Transcription: $- per hour; Fast Transcription: $- per hour [8]; Batch Transcription: $- per hour [1]
Speech to Text (per second billing) | Custom Transcription | Real-time Transcription: $- per hour; Batch Transcription: $- per hour [1]; Endpoint hosting: $- per model per hour; Custom Speech Training [5]: $- per compute hour
Speech to Text (per second billing) | Enhanced add-on features (Continuous Language identification, Diarization, Pronunciation Assessment (prosody)) | Real-time: $- per hour per feature; Batch (Continuous Language identification, Diarization): Included in Standard/Custom (no extra charge)
Speech Translation (per second billing) | Real-time Speech Translation | $- per audio hour [3]
Speech Translation (per second billing) | Live Interpreter | Input audio: $- per audio hour; Output text: $- per 1M characters; Output audio (Standard voice): $- per audio hour [LI]; Output audio (Custom voice): $- per audio hour [LI]
Speech Translation (per second billing) | Video Translation | Input video: $- per hour; Output video (Standard voice): $- per hour; Output video (Personal voice): $- per hour
LLM Speech (Preview) [9] | Standard Transcription | $- per hour
LLM Speech (Preview) [9] | Standard Translation | $- per hour
Text to Speech [7] | Standard Voice | Neural (real-time and batch): $- per 1M characters; Neural HD (real-time and batch) [4]: $- per 1M characters
Text to Speech [7] | Custom Voice: Professional Voice | Synthesis (real-time and batch): $- per 1M characters; Synthesis (neural HD real-time and batch): $- per 1M characters; Voice model training: $- per compute hour, up to $- per training; Endpoint hosting: $- per model per hour
Text to Speech [7] | Custom Voice: Personal Voice [6] | Synthesis (real-time and batch): $- per 1M characters; Voice creation: Free; Voice profile storage: $- per 1,000 voice profiles per month
Text to Speech [7] | Enhanced add-on feature: Avatar (Standard) | Interactive avatar (real-time): $- per minute; Interactive 4K avatar (real-time): $- per minute; Avatar video (batch): $- per minute; 4K avatar video (batch): $- per minute
Text to Speech [7] | Enhanced add-on feature: Avatar (Custom) | Avatar model training: $- per compute hour; Interactive avatar (real-time): $- per minute; Interactive 4K avatar (real-time): $- per minute; Avatar video (batch): $- per minute; 4K avatar video (batch): $- per minute; Endpoint hosting: $- per model per hour
Speech-to-Text pricing details and features by tier

See the documentation for information on quotas, limits and instructions on how to increase concurrent requests.

Speech to text hours are measured as the hours of audio sent to the service, billed in second increments.
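
As a rough illustration of per-second billing, the sketch below converts an audio duration into a charge. The hourly rate is a placeholder (this page lists prices as $-), and rounding partial seconds up to the next whole second is an assumption rather than something the page states.

```python
import math


def speech_to_text_cost(audio_seconds: float, price_per_hour: float) -> float:
    """Estimate a charge for audio billed in one-second increments.

    price_per_hour is a hypothetical rate; the page does not list one.
    Rounding partial seconds up is an assumption.
    """
    if audio_seconds < 0 or price_per_hour < 0:
        raise ValueError("duration and rate must be non-negative")
    billed_seconds = math.ceil(audio_seconds)
    # Hourly rate prorated to the number of billed seconds.
    return billed_seconds * price_per_hour / 3600


# 90 seconds of audio at a made-up rate of 1.00 per hour.
example_cost = speech_to_text_cost(90, 1.00)
print(f"{example_cost:.4f}")
```

The same proration would apply to any feature on this page that is quoted per hour but metered per second, as long as the rate passed in is the one that actually applies to your agreement.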

[1] To take advantage of this new Batch Transcription pricing, you need to use Speech to text REST API V3.2 or a later version. See Speech to text REST API for information.

[2] This reflects public preview pricing.

[3] This price includes 1 audio input and output and up to 2 text translation languages, using standard or custom Speech to Text and standard Translation. For custom Translation or 3+ translation languages, please reference the Azure Translator in Foundry Tools Text Translation pricing page.

[4] Selected text to speech voices are available via two model variants: Neural and NeuralHD. Learn more here.

[5] Custom Speech Training applies when customizing any base model released on or after October 1, 2023.

[6] Personal Voice is a limited access feature restricted to certain pre-approved use cases only, and you must apply for access. To learn more about the service, check the documentation.

[7] Text to Speech: speech synthesis usage is billed per character. Avatar is billed per second. Training and model hosting are billed per second.

[8] To use Fast Transcription, you need to use Speech to text REST API 2024-11-15 or a later version. See Speech to text REST API for information.

[9] LLM Speech (preview) currently shares the same price and SKU ID as Fast Transcription.

[VL1] With Voice Live Pro, developers can choose from larger LLMs such as GPT-Realtime, GPT-4o and GPT-4.1 models. With Voice Live Standard, developers can choose from smaller LLMs such as GPT-4o-Mini-Realtime, GPT-4o Mini and GPT-4.1 Mini models. With Voice Live Lite, developers can choose from SLMs and equivalent models such as GPT-4.1 Nano and Phi models. Models for each tier will be updated or retired as they become available. To learn more about how Voice Live API pricing works, click here.

[VL2] You will be charged separately for custom speech and custom voice model training and hosting. Refer to the ‘Speech to Text – Custom Transcription’ and ‘Text to Speech – Custom Voice – Professional’ pricing for details. Custom voice is a limited access feature. Learn more about how to create custom voices.

[LI] This price includes text output.
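
Voice Live tiers are metered per million tokens across three meters (input, cached input, output). A minimal sketch of that arithmetic follows. The rates are invented placeholders because the page shows $-, and treating cached input tokens as a separate count rather than a subset of input tokens is an assumption.

```python
from dataclasses import dataclass

TOKENS_PER_UNIT = 1_000_000  # prices are quoted per one million tokens


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical prices per 1M tokens; the page shows these as $-."""
    input_rate: float
    cached_input_rate: float
    output_rate: float


def token_cost(input_tokens: int, cached_input_tokens: int,
               output_tokens: int, rates: TokenRates) -> float:
    """Sum the three token meters, each priced per 1M tokens.

    Assumes cached input tokens are counted separately from regular
    input tokens.
    """
    total = (
        input_tokens * rates.input_rate
        + cached_input_tokens * rates.cached_input_rate
        + output_tokens * rates.output_rate
    )
    return total / TOKENS_PER_UNIT


# Placeholder rates, not real prices.
placeholder_rates = TokenRates(input_rate=4.0, cached_input_rate=1.0, output_rate=8.0)
session_cost = token_cost(250_000, 100_000, 50_000, placeholder_rates)
print(session_cost)
```

A full Voice Live estimate would repeat this for each modality row (text, audio, native audio) and add any separately billed items such as avatar minutes or custom model hosting.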

Commitment Tiers – Standard

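Category | Feature | Price (per month) | Overage
Speech to Text | Standard | 2,000 hours for $- | $- per hour
Speech to Text | Standard | 10,000 hours for $- | $- per hour
Speech to Text | Standard | 50,000 hours for $- | $- per hour
Speech to Text | Custom | 2,000 hours for $- | $- per hour
Speech to Text | Custom | 10,000 hours for $- | $- per hour
Speech to Text | Custom | 50,000 hours for $- | $- per hour
Speech to Text | Enhanced add-on features [2] (Continuous Language identification, Diarization, Pronunciation Assessment (prosody)) | 2,000 hours for $- | $- per hour
Speech to Text | Enhanced add-on features [2] | 10,000 hours for $- | $- per hour
Speech to Text | Enhanced add-on features [2] | 50,000 hours for $- | $- per hour
Text to Speech | Neural [1] | 80 million characters for $- | $- per 1 million characters
Text to Speech | Neural [1] | 400 million characters for $- | $- per 1 million characters
Text to Speech | Neural [1] | 2,000 million characters for $- | $- per 1 million characters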

[1] This includes both real-time synthesis and batch synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, Custom Neural Voice and Personal Voice are not included.

[2] Applies to real-time speech to text only. The Continuous Language Identification and Diarization add-on features are included with batch speech to text.
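
To show how a commitment tier's monthly price and overage rate combine, here is a small sketch. It assumes the monthly price is a flat fee (no refund for unused hours) and that usage beyond the included amount is charged at the overage rate. The included hours match the table, but every price and overage rate below is made up.

```python
def commitment_cost(usage: float, included: float,
                    monthly_price: float, overage_rate: float) -> float:
    """Flat tier price plus overage on usage beyond the included amount."""
    overage_units = max(0.0, usage - included)
    return monthly_price + overage_units * overage_rate


def cheapest_tier(usage: float, tiers):
    """Return the (included, monthly_price, overage_rate) tuple with the lowest cost."""
    return min(tiers, key=lambda tier: commitment_cost(usage, *tier))


# Included hours come from the table; prices and overage rates are hypothetical.
HYPOTHETICAL_TIERS = [
    (2_000, 1_000.0, 0.50),
    (10_000, 4_000.0, 0.40),
    (50_000, 15_000.0, 0.30),
]

for hours in (1_500, 2_500, 9_000):
    included, price, rate = cheapest_tier(hours, HYPOTHETICAL_TIERS)
    print(hours, included, commitment_cost(hours, included, price, rate))
```

With these placeholder numbers, moderate overage on the smallest tier can still be cheaper than stepping up, which is why the comparison is done on total cost rather than on included hours alone.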

Commitment Tiers – Connected container

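Category | Feature | Price (per month) | Overage
Speech to Text [2] | Standard | 2,000 hours for $- | $- per hour
Speech to Text [2] | Standard | 10,000 hours for $- | $- per hour
Speech to Text [2] | Standard | 50,000 hours for $- | $- per hour
Speech to Text [2] | Custom | 2,000 hours for $- | $- per hour
Speech to Text [2] | Custom | 10,000 hours for $- | $- per hour
Speech to Text [2] | Custom | 50,000 hours for $- | $- per hour
Speech to Text [2] | Enhanced add-on features [2] (Language identification, Diarization) | 2,000 hours for $- | $- per hour
Speech to Text [2] | Enhanced add-on features [2] | 10,000 hours for $- | $- per hour
Speech to Text [2] | Enhanced add-on features [2] | 50,000 hours for $- | $- per hour
Text to Speech | Neural [1] | 80 million characters for $- | $- per 1 million characters
Text to Speech | Neural [1] | 400 million characters for $- | $- per 1 million characters
Text to Speech | Neural [1] | 2,000 million characters for $- | $- per 1 million characters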

[1] This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.

[2] Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.

See the documentation for information on Commitment tiers.

Commitment Tiers – Disconnected container

Sign up to access speech in disconnected containers, or learn more

Category | Feature | Price (per year) | Max usage (per year) | Projected usage (per month)
Speech to Text [2] | Standard | $- | 120,000 hours | 10,000 hours
Speech to Text [2] | Standard | $- | 600,000 hours | 50,000 hours
Speech to Text [2] | Custom | $- | 120,000 hours | 10,000 hours
Speech to Text [2] | Custom | $- | 600,000 hours | 50,000 hours
Speech to Text [2] | Enhanced add-on features (Language identification, Diarization) | $- | 120,000 hours | 10,000 hours
Speech to Text [2] | Enhanced add-on features (Language identification, Diarization) | $- | 600,000 hours | 50,000 hours
Text to Speech | Neural [1] | $- | 4.8B characters | 400M characters
Text to Speech | Neural [1] | $- | 24B characters | 2,000M characters

[1] This includes real-time synthesis with prebuilt non-HD and non-AOAI neural voices. HD voices, AOAI voices, and custom voices (both professional and personal voices) are not included. Batch synthesis is not included.

[2] Pricing applies to real-time and batch use cases. There is no separate batch pricing for containers.
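
The yearly caps in the disconnected container table appear to be twelve times the projected monthly usage. The short check below uses only figures copied from that table; the helper name `annual_cap` is illustrative.

```python
MONTHS_PER_YEAR = 12

# (projected usage per month, max usage per year), copied from the table.
SPEECH_TO_TEXT_HOURS = [(10_000, 120_000), (50_000, 600_000)]
TEXT_TO_SPEECH_CHARACTERS = [
    (400_000_000, 4_800_000_000),
    (2_000_000_000, 24_000_000_000),
]


def annual_cap(projected_monthly_usage: int) -> int:
    """Yearly cap implied by a projected monthly usage figure."""
    return projected_monthly_usage * MONTHS_PER_YEAR


# True when every row's yearly cap equals 12x its monthly projection.
all_rows_consistent = all(
    annual_cap(monthly) == yearly
    for monthly, yearly in SPEECH_TO_TEXT_HOURS + TEXT_TO_SPEECH_CHARACTERS
)
print(all_rows_consistent)
```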

These features are being deprecated and are available only to existing customers. Check details and learn how to migrate to new features.

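Instance | Category | Feature | Price
Free - Web/Container (1 concurrent request) | Text to Speech | Standard | 5 million characters free per month
Free - Web/Container (1 concurrent request) | Text to Speech | Custom | 5 million characters free per month; Endpoint hosting: 1 model free per month
Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model) | Text to Speech | Standard | $- per 1M characters
Standard - Web/Container (100 concurrent requests for Base model, 20 concurrent requests for Custom model) | Text to Speech | Custom | $- per 1M characters; Endpoint hosting: $- per model per hour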

    • For Speech to Text and Speech Translation, usage is billed in one-second increments.
    • For Text to Speech: usage is billed per character. Check the definition of character in the pricing note.
    • For custom neural voice hosting: usage is billed per endpoint per second. Check details in the pricing note.
    • For personal voice profile storage: usage is billed per voice profile per day. Check details in the pricing note.
    • For Text to Speech Avatar, usage is billed per second.
    • For Speech to Text and Text to Speech (including Avatar), endpoint hosting for custom models is billed per second per model.
  • The Speech service enables users to adapt baseline models based on their own acoustic and language data, leading to custom speech models that can be used against both Speech to Text and Speech Translation.

  • The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model. If you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model. For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.

  • The acoustic model is a classifier that labels short fragments of audio into one of several phonemes, or sound units, in each language. These phonemes can then be stitched together to form words. For example, the word “speech” is comprised of four phonemes “s p iy ch”. These classifications are made on the order of 100 times per second. Customizing the acoustic model can enable the system to learn to do a better job recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.
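
To make the language-model idea concrete, here is a toy bigram model with add-one smoothing. It is only a sketch of "a probability distribution over sequences of words" and is not how the Speech service is implemented. The tiny corpus, the function names, and the smoothing choice are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(sentences):
    """Count bigrams and their left-hand contexts over a list of sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words[1:])
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts, len(vocab)


def log_score(sentence, model):
    """Add-one smoothed bigram log-probability of a word sequence."""
    context_counts, bigram_counts, vocab_size = model
    words = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Invented "typical speech" corpus.
BASE_CORPUS = [
    "recognize speech with a custom model",
    "the app can recognize speech",
    "recognize speech in a noisy warehouse",
    "a nice beach",
]
# Invented domain text, standing in for customization data.
DOMAIN_CORPUS = ["search the dot net namespace", "dot net namespace reference"]

base_model = train(BASE_CORPUS)
custom_model = train(BASE_CORPUS + DOMAIN_CORPUS)

likely = log_score("recognize speech", base_model)
unlikely = log_score("wreck a nice beach", base_model)
before = log_score("dot net namespace", base_model)
after = log_score("dot net namespace", custom_model)
print(likely > unlikely, after > before)
```

On this toy data, the model scores "recognize speech" above "wreck a nice beach", and adding the domain sentences raises the score of "dot net namespace", which mirrors the effect that customizing the language model is meant to have.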

  • The Speech service offers a wide range of text-to-speech (TTS) voice fonts. Custom neural voice, however, lets you build your own custom voice that suits your needs and your brand. Read the blog for more information.

  • Language identification allows you to identify a switch in spoken language and transcribe speech accordingly. This can be applied in scenarios where the audio language is unknown, or when speaker(s) may speak multiple languages. Single Language Identification is available at no additional cost. Continuous Language Identification is an enhanced add-on feature. Visit docs to learn more.

    • Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence. Educators can use the capability to evaluate pronunciation of multiple speakers in real time. Visit docs to learn more.
    • It is charged at the standard Speech to Text rate. For example, evaluating 8 seconds of speech costs around $-.
