At the beginning of this year the Azure Search team began work on one of the most frequently requested capabilities: support for multiple languages. We’re excited to announce that this work is now complete. As of now, Azure Search customers using the 2015-02-28 API version can index and query documents in 56 languages. Language support is available through the REST API, Azure Search SDK, and Azure Portal. This post describes best practices for working with content in different languages using Azure Search.
Processing natural language
The role of a full-text search engine, in simple terms, is to process and store documents in a way that enables efficient querying and retrieval. At a high level, it all comes down to extracting important words from documents, putting them in an index, and then using the index to find documents that match words of a given query. The process of extracting words from documents and search queries is called lexical analysis. Components that perform lexical analysis are called analyzers.
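As a rough illustration of that pipeline, here is a minimal Python sketch of a toy analyzer feeding an inverted index. It is not how Azure Search or Lucene is implemented; the function names (analyze, build_index, search) and the sample documents are made up for this example, and the analyzer does nothing more than lowercase the text and split it into tokens.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer (illustrative only): lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Build an inverted index: each token maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
documents = {
    1: "She is running a marathon",
    2: "He ran home",
    3: "Run the tests",
}
index = build_index(documents)
```

Because this analyzer matches tokens exactly, search(index, "running") finds only document 1, even though documents 2 and 3 contain other forms of the same verb. The same analyzer is applied to both documents and queries, which is why the choice of analyzer matters on both sides.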
A good analyzer must be able to cope with the challenges of processing natural language, for instance:
- Handling inflected word forms. For example, users often expect searches for “running” to match words like “run” and “ran”
- Word breaking, especially in languages like Chinese, Japanese and Thai, where words are not separated by spaces
- Dealing with compound words
- Text normalization
- Removing irrelevant words and characters
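To make the first challenge concrete, the sketch below compares exact token matching with matching after a normalization step. The lemma table is a tiny hand-written stand-in that I made up for illustration; real analyzers rely on far richer linguistic rules and data, and nothing here reflects the internals of Azure Search.

```python
import re

# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def analyze(text, lemmas=None):
    """Lowercase and tokenize; optionally map each token to its base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if lemmas is None:
        return tokens
    return [lemmas.get(token, token) for token in tokens]

def matches(query, document, lemmas=None):
    """True when every query token also appears among the document tokens."""
    return set(analyze(query, lemmas)) <= set(analyze(document, lemmas))

doc = "He ran to the store"
exact = matches("running", doc)               # no normalization
normalized = matches("running", doc, LEMMAS)  # both sides reduced to "run"
```

With exact matching the query misses the document; once both the query and the document are reduced to base forms, they match. That gain in recall can come at some cost in precision, since more loosely related documents are now returned as well.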
In the Microsoft Virtual Academy presentation “Leveraging Multilanguage Capabilities in Azure Search,” I explain how different approaches to these challenges may affect the important recall/precision balance. In some situations, the solution chosen will determine whether an important document is found, or not.
In Azure Search we provide two sets of language analyzers, described below, so that users can pick what works best for their specific scenario.
Language analyzers in Azure Search
The concept of language analyzers is familiar to users of the popular open-source, full-text search engine, Lucene, which works at the core of Azure Search. We exposed Lucene language analyzers as the first iteration of our vision to provide multi-language support. Since then, we have worked with the Office team, which has been developing Natural Language Processing technology for the past 16 years for products like Word, Windows Desktop Search, SharePoint, and Bing. We were excited to combine the power of open-source with Microsoft’s internal assets to deliver what we call Microsoft language analyzers.
By introducing Microsoft language analyzers, we increased the number of languages supported in Azure Search from 35 to 56! On top of that, we were able to tap into years of research and development in natural language processing at Microsoft. Here are some of the features setting Microsoft analyzers apart from similar solutions on the market:
- Lemmatization to handle inflected word forms: Microsoft language analyzers can reason about the grammar rules of a language. They handle inflection based on voice, mood, tense, person, number, gender, etc.
- Use of statistical language models for accurate word breaking in languages that don’t separate words with spaces or punctuation
- Decompounding (in German, Danish, Dutch, Swedish, Norwegian, Estonian, Finnish, Hungarian, Slovak) so users can search for individual segments of compound words
- Normalization of different date and currency formats
- Handling of URLs and email addresses
Here is a story of a customer who benefited from using Microsoft analyzers.
n2y is a family-run business whose solution lets customers quickly search through a collection of 20,000 picture representations (symbols) of words for non-readers.
Initially we used Lucene to power our search, but our users were frustrated because when searching for even common words such as “ran,” they were not able to find any results, even though our content contained similar words such as “run” or “running.” When we moved to Azure Search, we were able to leverage the Microsoft language analyzers that allowed us to show hundreds of relevant results which was equally as effective across the other 14 languages we support.
- Michael Clark, CTO, n2y
Microsoft language analyzers have a deep understanding of language. This made a huge difference in the case of n2y given the nature of their dataset. For n2y’s customers it was important to find every potentially relevant symbol for a given query. The same expectation is shared by everyone who uses a search box on a website or in any mobile application. Internet users have come to expect that all search engines can guess the intent of their query, like in Google or Bing, rather than just match the exact keywords. By introducing Microsoft language analyzers we’re making it easier to fulfill this expectation.
It’s important to note that nothing comes for free in life, and using more powerful analyzers is no exception. Indexing documents with Microsoft language analyzers is two to three times slower (depending on the language) than with their Lucene counterparts. There is simply more processing that needs to be done to get the benefits outlined in this post. We recommend that Azure Search customers with heavy indexing loads compare the two analyzer types and pick the one that performs best in their scenario. Search requests are typically equally fast regardless of the analyzer type used.
You can find out more about best practices for using language analyzers in Azure Search here.
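One way to set up such a comparison is to create two indexes that differ only in the analyzer assigned to a searchable field, load the same documents into each, and measure indexing time. The Python sketch below only builds the two index definitions as JSON payloads and does not send any request. The index names, field names, and the helper make_index_definition are hypothetical, the field layout is my reading of the REST API's index definition format rather than something stated in this post, and the analyzer names assume the "<language>.lucene" and "<language>.microsoft" naming pattern.

```python
import json

def make_index_definition(index_name, analyzer_name):
    """Build an index definition whose text field uses the given language analyzer.

    Hypothetical helper: the field names and layout are assumptions for illustration.
    """
    return {
        "name": index_name,
        "fields": [
            # Key field; not analyzed for full-text search.
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            # Full-text field; the analyzer is chosen per field.
            {
                "name": "description",
                "type": "Edm.String",
                "searchable": True,
                "analyzer": analyzer_name,
            },
        ],
    }

# Two definitions that are identical except for the analyzer (assumed names).
lucene_index = make_index_definition("symbols-lucene", "en.lucene")
microsoft_index = make_index_definition("symbols-microsoft", "en.microsoft")

# Serialized payloads that could be sent as the body of an index creation request.
lucene_payload = json.dumps(lucene_index, indent=2)
microsoft_payload = json.dumps(microsoft_index, indent=2)
```

Keeping everything except the analyzer identical makes any difference in indexing time or result quality attributable to the analyzer choice alone.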
Lucene language analyzers have been available to Azure Search users since we announced general availability and the first fully supported API version: 2015-02-28. At the same time in March, we released Microsoft analyzers in the 2015-02-28-Preview API version to let early adopters experiment with the new capabilities. Since then, driven by feedback, we have significantly improved the performance of Microsoft analyzers and resolved all identified issues. We want to thank everyone who participated in the process. Keep the feedback coming!
Going forward, we’re excited to work with teams at Microsoft, like Bing and Microsoft Research, to bring more exciting capabilities to Azure Search. Stay tuned!