Text to Speech

Convert text to lifelike speech for more natural interfaces

Speak human, not robot

Use Text to Speech – part of the Speech service – to build apps and services that speak naturally. Bring your solutions to life with dozens of voices in a wide range of languages. Create lifelike voices with the Neural Text to Speech capability built on breakthrough research in speech synthesis technology. Customise models to create a unique voice for your solution and brand.

Lifelike speech

Enable fluid, natural-sounding speech that matches the stress patterns and intonation of human voices.

Global engagement

Reach global audiences with more than 80 voices and 45 languages and variants.

Customised experiences

Build unique, branded voices for your apps, starting from just a few minutes of training data.

Optimised audio

Fine-tune voice output for your scenarios by easily adjusting attributes such as rate, volume and pronunciation.

Produce natural-sounding speech

Give your apps a new voice with natural, humanlike intonation and clear articulation. Using deep neural networks, Text to Speech makes the voices of computers expressive and nearly indistinguishable from natural spoken voice.

English (UK): Jessa

Sample sentences
The third type, a logarithm of the unsigned fold change, is undoubtedly the most tractable.
As the name suggests, the original submarines came from Yugoslavia.
This is easy enough if you have an unfinished attic directly above the bathroom.

English (UK): Guy

Sample sentences
Susan Candiotti reports they've given up their trip.
Carol knows my lifestyle.
The seagrass fiber is tough, durable, and smooth.

Chinese (CN): Xiaoxiao

German (DE): Katja

Sample sentences
Bestimmte Berufsgruppen sind nur noch schwer zu rekrutieren.
Sein Gedicht steckt voller Übertreibungen, die für den Schriftsteller allerdings typisch sind.
Er organisiert eine Unterstützung der schwächeren durch die stärksten Bundesländer.

Italian (IT): Elsa

Sample sentences
Tenete conto di un fattore importante.
Alcuni prodotti in gran parte sono di buona qualità.
Crisi? Vietato rilassarsi, siamo ancora in emergenza.

Want to build this?

Engage global audiences in real time

Convert text to audio in real time, creating fluid conversational experiences. Engage global audiences using more than 80 voices and 45 languages and variants.

Language: Sample text
English (US): An airport spokesman said more than 110 planes were damaged by hail.
Chinese (CN): 广告收入的比例高达90%以上
Japanese (JP): 皆様のご協力のたまものと
German (DE): Der Anstieg der Verbraucherpreise in der Eurozone verlangsamt sich weiter.
Spanish (ES): El alcalde de Santiago convoca a los medios para inaugurar dos semáforos.
Turkish (TR): Tren durduğu sırada vagonun ortasında bir patlama meydana geldi.
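
As a rough illustration of what converting text to audio can look like from code, the Python sketch below builds (but does not send) an HTTP request for the English (US) sample sentence. The endpoint pattern, header names, output format string, region and voice name are assumptions based on the publicly documented Speech REST interface, not details from this page, so check them against the current documentation. The function name build_tts_request and the key placeholder are hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_tts_request(text, region, key, voice="en-US-AriaNeural", lang="en-US"):
    """Build, without sending, a POST request asking the service to speak `text`.

    Assumptions: the regional endpoint pattern, the header names and the
    output format below should be verified against the official docs.
    """
    # Wrap the escaped text in a minimal SSML document naming the voice.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,  # key of your Speech resource
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
            "User-Agent": "tts-sketch",
        },
    )


request = build_tts_request(
    "An airport spokesman said more than 110 planes were damaged by hail.",
    region="westeurope",      # hypothetical region
    key="YOUR_SPEECH_KEY",    # placeholder, not a real key
)
print(request.full_url)

# To actually synthesise audio with a valid key (needs network access):
# audio_bytes = urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes the sketch easy to inspect offline. In a real app, the official Speech SDK is likely the more convenient route.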

Create a unique brand voice

Build your unique voice without a single line of code, starting from just a few minutes of training audio. Develop a highly realistic, humanlike custom voice by using deep neural network models with the Custom Neural Voice capability, which can be used both for real-time scenarios and for synthesising long-form audio content.

Want to start building your own voice model?

Easily tailor audio output

Fine-tune your text-to-audio output in real time by controlling parameters including speed, pronunciation, pitch, volume, intonation and pauses. With neural voices, you can adjust the speaking style to express emotions such as cheerfulness or empathy, or to suit specific scenarios: a casual tone for chatting, or a formal tone for newscasting.
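
Tuning of this kind is typically expressed in Speech Synthesis Markup Language (SSML). The Python sketch below assembles an SSML document that sets rate, pitch and volume, inserts a pause and optionally selects a speaking style. The prosody and break elements come from the W3C SSML standard. The mstts:express-as element, its namespace URI, the style name "cheerful" and the voice name are assumptions drawn from the public Speech documentation and may not apply to every voice. The helper build_ssml is hypothetical, and pronunciation control is not covered.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # assumed vendor extension namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialise SSML elements without a prefix and style elements as mstts:...
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_ssml(text, voice, lang="en-US", style=None,
               rate="medium", pitch="medium", volume="medium", pause_ms=0):
    """Return an SSML string that applies the given tuning to `text`."""
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    parent = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    if style:
        # Speaking style (for example cheerful or newscast); neural voices only.
        parent = ET.SubElement(parent, f"{{{MSTTS_NS}}}express-as",
                               {"style": style})
    if pause_ms:
        # Silence inserted before the spoken text.
        ET.SubElement(parent, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    prosody = ET.SubElement(parent, f"{{{SSML_NS}}}prosody",
                            {"rate": rate, "pitch": pitch, "volume": volume})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Carol knows my lifestyle.", voice="en-US-AriaNeural",
                 style="cheerful", rate="fast", pitch="high", pause_ms=300))
```

Building the markup with an XML library rather than string concatenation keeps special characters in the input text escaped correctly.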

Learn more about voice tuning

Deploy anywhere, from the cloud to the edge

Run Text to Speech in the cloud or on premises with containers for scenarios where data security and low latency are paramount. Speech containers now support both standard and custom voices.

Learn more about Speech in containers

Security for the enterprise

  • Microsoft invests over USD 1 billion annually in cyber security research and development.

  • We employ more than 3,500 security experts who are completely focused on securing your data and privacy.

  • Azure has more certifications than any other cloud provider. View the comprehensive list.

Get the power, control and customisation you need with flexible pricing

Pay only for what you use, with no upfront costs. With Text to Speech, you pay as you go, based on the number of characters you convert to audio.
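
Because billing is per character, a rough budget can be worked out before you synthesise anything. The Python sketch below counts characters and multiplies by a price per million characters. The price used is a made-up placeholder, not a published rate, and the assumption that every character of the input counts (including any markup) should be checked against the pricing page. The function estimate_cost is hypothetical.

```python
from decimal import Decimal


def estimate_cost(texts, price_per_million_chars):
    """Return (total_characters, estimated_cost) for an iterable of strings.

    Assumes each character of each string is billable; the real counting
    rules may differ, so treat the result as an approximation.
    """
    total_chars = sum(len(t) for t in texts)
    cost = (Decimal(total_chars) * Decimal(str(price_per_million_chars))
            / Decimal(1_000_000))
    return total_chars, cost


# Hypothetical rate: 16 currency units per one million characters.
chars, cost = estimate_cost(
    ["An airport spokesman said more than 110 planes were damaged by hail."],
    price_per_million_chars="16",
)
print(f"{chars} characters, estimated cost {cost}")
```

Decimal is used instead of float so that small per-character amounts do not pick up binary rounding errors.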

Guidelines for responsible neural voices

Learn about responsible deployment of synthetic voices

Synthetic voices must be designed so that they earn the trust of the people who hear them. Learn the principles for building synthetic voices that create confidence in your company and services.

Read our responsible deployment guidelines

Obtain consent from voice talent

Help voice talent understand how neural Text to Speech works and how it may be used once they complete the audio recording process.

Read our disclosure guidance for voice talent

Be transparent

Make sure that users understand when they’re hearing a synthetic voice, and that voice talent is aware of how their voice will be used.

See our disclosure guidelines

Learn about our responsible approach

Contact us

The Custom Neural Voice capability is in gated preview. Learn more about the gating process and how to get access here.

Get started with Text to Speech in three steps

  1. Get instant access and a $200 credit by signing up for an Azure free account.

  2. Sign in to the Azure portal and add Speech.

  3. Learn how to embed Text to Speech from the quickstarts and documentation.

Developer resources for Text to Speech

Documentation and tutorial

Get started with Text to Speech.


Take a Pluralsight course that walks you through using Text to Speech.

Take the course

Frequently asked questions about Text to Speech

  • Standard voices are created using statistical parametric synthesis and concatenation synthesis techniques. They are highly intelligible and natural-sounding, and they let your apps speak in more than 45 languages with a wide range of voice options.

    Neural voices use deep neural networks to overcome the limits of traditional text-to-speech systems in matching the patterns of stress and intonation in spoken language and in synthesising units of speech into a computer voice. Standard text-to-speech breaks down prosody into separate steps for linguistic analysis and acoustic prediction that are governed by independent models, which can result in muffled voice synthesis. Our neural capability performs prosody prediction and voice synthesis simultaneously, which results in a more fluid and natural-sounding voice.
  • See the documentation for a full list.
  • Check the regional availability.

Get Started with Speech