• 4 min read

Improve speech-to-text accuracy with Azure Custom Speech

With Microsoft Azure Cognitive Services for Speech, customers can build voice-enabled apps confidently and quickly in more than 140 languages. We make it easy for customers to transcribe speech to text (STT) with high accuracy, produce natural-sounding text-to-speech (TTS) voices, and translate spoken audio.

With Microsoft Azure Cognitive Services for Speech, customers can build voice-enabled apps confidently and quickly in more than 140 languages. We make it easy for customers to transcribe speech to text (STT) with high accuracy, produce natural-sounding text-to-speech (TTS) voices, and translate spoken audio. In the past few years, we are inspired by the ways customers seek our customization features to fine-tune speech recognition to their use cases.

As our speech technology continues to change and evolve, we want to introduce four custom speech-to-text capabilities and their respective customer use cases. With these features, you can evaluate and improve the speech-to-text accuracy for your applications and products. A custom speech model is trained on top of a base model. With a custom model, you can improve recognition of domain-specific vocabulary by providing text data to train the model. You can also improve recognition based on the specific audio conditions of the application by providing audio data with reference transcriptions.

Custom Speech data types and use cases

Our Custom Speech features will let you customize Microsoft's speech-to-text engine. You will be able to customize the language model by tailoring it to the vocabulary of the application and customize the acoustic model to adapt to the speaking style of your users. By uploading text and/or audio data through Custom Speech, you'll be able to create these custom models, combine them with Microsoft's state-of-the-art speech models, and deploy them to a custom speech-to-text endpoint that can be accessed from any device.

Phrase list: A real-time accuracy enhancement feature that does not need model training. For example, in a meeting or podcast scenario, you can add a list of participant names, products, and uncommon jargon using phrase list to boost their recognition.

Plain text: Our simplest custom speech model can be made using just text data. Customers in the media industry use this in use cases such as commentary of sports events. Because each sporting event’s vocabulary differs significantly from others, building a custom model specific to a sport increases accuracy by biasing to the vocabulary of the event.

Structured text: This is text data that boosts patterns of sentences in speech. These patterns could be utterances that differ only by individual words or phrases, for example, “May I speak with name” where name is a list of possible names of individuals. The pattern can link to this list of entities (name in this case), and you can also provide their unique pronunciations.

Audio: You can train a custom speech model using audio data, with or without human-labeled transcripts. With human-labeled transcripts, you can improve recognition accuracy on speaking styles, accents, or specific background noises. For American English, you can now train without needing a labeled transcript to improve acoustic aspects such as slight accents, speaking styles, and background noises.

Research milestones

Microsoft’s speech and dialog research group achieved a milestone in reaching human parity in 2016 on the Switchboard conversational speech recognition task, meaning we had created technology that recognized words in a conversation as well as professional human transcribers. After further experimentation, we then followed up with a 5.1 percent word error rate, exceeding human parity in 2017. A technical report published outlines the details of our system. Today, Custom Speech helps enterprises and developers improve upon the milestones achieved by Microsoft Research.

Customer inspiration

Peloton: In the past, Peloton provided subtitles only for its on-demand classes. But that meant that the signature live experience so valued by members was not accessible to those who are deaf or hard of hearing. While the decision to introduce live subtitles was clear, executing on that vision proved a bit murkier. A primary challenge was determining how automated speech recognition software could facilitate Peloton’s specific vocabulary, including the numerical phrases used for class countdowns and to set resistance and cadence levels. Latency was another issue—subtitles wouldn’t be very useful, after all, if they lagged behind what instructors were saying. Peloton chose Azure Cognitive Services because it was cost-effective and allowed Peloton to customize its own machine learning model for converting speech to text—and was significantly faster than other solutions on the market. Microsoft also provided a team of engineers that worked alongside Peloton throughout the development process.

Speech Services and Responsible AI

We are excited about the future of Azure Speech with human-like, diverse, and delightful quality under the high-level architecture of the XYZ-code AI framework. Our technology advancements are also guided by Microsoft’s Responsible AI process, and our principles of fairness, inclusiveness, reliability and safety, transparency, privacy and security, and accountability. We put these ethical standards into practice through the Office of Responsible AI (ORA)—which sets our rules and governance processes, the AI Ethics and Effects in Engineering and Research (Aether) Committee—which advises our leadership on the challenges and opportunities presented by AI innovations, and Responsible AI Strategy in Engineering (RAISE)—a team that enables the implementation of Microsoft Responsible AI rules across engineering groups.

Get started with Azure Cognitive Services for Speech

You can use Speech Studio to test how custom speech features would help improve recognition for your audio. In addition, start building new customer experiences with Azure Neural TTS and STT. In addition, the Custom Neural Voice capability enables organizations to create a unique brand voice in multiple languages and styles.

Resources