
Custom analyzers in Azure Search


In Azure Search we believe that developers who build Web and mobile applications shouldn’t have to be search experts in order to add a great search experience to their products. We put simplicity at the core of our value proposition, making the process of creating and configuring search services both fast and easy. In most cases, customers using the Azure portal, Indexers and our SDK are able to integrate search into their applications in less than 30 minutes.

While we value simplicity, we also strive to meet the expectations of customers who have a wide range of requirements. To address more advanced scenarios, we've opened up certain aspects of the service's internal operations. For example, the way documents are ranked can be configured with Scoring Profiles so you can redefine the relative importance of fields in the index based on a number of factors.

Following the same principle, we want to give you more control over lexical analysis by providing the ability to define custom analyzers.

Customizing lexical analysis

I described the concept of lexical analysis in my previous post, Language support in Azure Search, where I explained the complexities of analyzing natural-language content and how the process varies based on the language used. The same challenge extends to any text containing source code, markup, product codes and so forth. In these cases, the lexical analyzer needs to be configured correctly to extract the important terms while filtering out everything else.

An analyzer is a configuration that specifies:

  • A sequence of character filters – filter out or replace characters or symbols in the input text
  • A tokenizer – divides continuous text into independent tokens, for example breaking a sentence into words
  • A sequence of token filters – filter out or modify the tokens generated by the tokenizer, for example a lowercase filter that converts all characters to lowercase (see the illustration after this list)
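
To make the pipeline concrete, here's a minimal illustration of how a snippet of text might flow through an analyzer built from the html_strip char filter, the standard tokenizer and the lowercase token filter (an illustrative trace, not captured from the service):

    input:          <b>Action &amp; Adventure</b>
    char filter:    Action & Adventure          (html_strip removes markup and decodes entities)
    tokenizer:      [Action] [Adventure]        (standard tokenizer splits on word boundaries and drops "&")
    token filters:  [action] [adventure]        (lowercase normalizes the tokens)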

In the 2015-02-28-Preview API version we're exposing a new interface that lets you create custom analyzers based on the predefined tokenizers, token filters and char filters from Lucene. A detailed description of the new API with the list of all supported tokenizers, token filters and char filters can be found here.

In the next section I'll show you how to build an index definition for searching over a catalog of names as an example of custom analyzer configuration.

People search with custom analyzers

In this simplified example, we want to search over a hypothetical catalog of all actors. Here is the index definition:

{
  "name":"names",
  "fields":[
    { "name":"id", "type":"Edm.String", "key":true, "searchable":false },
    { "name":"name", "type":"Edm.String" }
  ]
}
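
If you want to follow along, the index can be created with a single REST call. Here's a sketch of the request, assuming you substitute your own service name and admin api-key; the body is the definition above:

PUT https://[service name].search.windows.net/indexes/names?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]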

For the sake of this example, the index has only one field, “name”, besides the key. The name field is searchable and is therefore analyzed with the default standard analyzer, which uses the standard tokenizer and the lowercase token filter. If you search for the name “stallone”, the document for Sylvester Stallone will be returned. However, if you search for the name “Skarsgard”, the document for “Stellan Skarsgård” will not be returned, as the name in the query is spelled with 'a' instead of 'å'.
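
To see why, consider how the standard analyzer breaks down the indexed name versus the query term (an illustrative breakdown, not actual service output):

    indexed text:  Stellan Skarsgård  →  [stellan] [skarsgård]
    query text:    Skarsgard          →  [skarsgard]

Because “skarsgård” and “skarsgard” are distinct terms in the index, the query finds no match.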

People often skip diacritics when typing. Fortunately, this situation can easily be addressed with the help of the ASCII folding token filter, which converts non-ASCII characters to their closest ASCII equivalents. We can add the ASCII folding filter to the standard analyzer configuration with the following change to the index definition:

{
  "name":"names",
  "fields":[
    { "name":"id", "type":"Edm.String", "key":true, "searchable":false },
    { "name":"name", "type":"Edm.String", "analyzer":"my_standard" }
  ],
  "analyzers":[
    {
      "name":"my_standard",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard",
      "tokenFilters":[ "lowercase", "asciifolding" ]
    }
  ]
}

Notice that the name field now sets its analyzer property to my_standard. The ASCII folding filter will be applied at both indexing time and search time.
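
With this change, a query that omits the diacritic should now match. Here's a sketch of the request, in the same format as the queries later in this post:

https://[service name].search.windows.net/indexes/names/docs?search=Skarsgard&api-version=2015-02-28-Preview

Both the indexed term and the query term are folded to “skarsgard”, so they now refer to the same entry in the index.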

Phonetic search

To further improve the user experience, we can use a phonetic filter to enable “sounds like” queries. For example, if you searched for “Alek Boldwin”, the document for Alec Baldwin would not be returned. To address this issue, you can configure your field to use the phonetic token filter, which indexes names by sound using a phonetic encoding algorithm. We can add this filter by modifying the index definition as follows:

{
  "name":"names",
  "fields":[
    { "name":"id", "type":"Edm.String", "key":true, "searchable":false },
    { "name":"name", "type":"Edm.String", "analyzer":"my_standard" }
  ],
  "analyzers":[
    {
      "name":"my_standard",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard",
      "tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
    }
  ]
}
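
Under the hood, the phonetic filter (which defaults to the Metaphone encoder) indexes each name by an approximation of its sound, so similarly sounding names collapse to the same term. The codes below illustrate the idea; treat them as an assumption rather than verified encoder output:

    alec     →  ALK       alek     →  ALK
    baldwin  →  BLTW      boldwin  →  BLTW

A query for “Alek Boldwin” therefore produces the same phonetic terms as the indexed “Alec Baldwin”, and the document is returned.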

With the ASCII folding and phonetic token filters, the application for searching over the catalog of names will be more tolerant of user mistakes and omissions.

Index and search-specific analyzers

To give you even more flexibility, we've added two new field properties: indexAnalyzer and searchAnalyzer. It's now possible to apply a different analyzer at indexing time than at search time. For instance, if in our previous example we wanted to enable fast prefix matching on the actor name, one way we could do that would be to apply the edgeNGram token filter at indexing time. For that purpose, I'm going to add a dedicated field that indexes name fragments for efficient prefix matching. Let's look at the modified index definition.

{
  "name":"names",
  "fields":[
    { "name":"id", "type":"Edm.String", "key":true, "searchable":false },
    { "name":"name", "type":"Edm.String", "analyzer":"my_standard" },
    { "name":"partialName", "type":"Edm.String", "searchAnalyzer":"standard", "indexAnalyzer":"prefix" }
  ],
  "analyzers":[
    {
      "name":"my_standard",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard",
      "tokenFilters":[ "lowercase", "asciifolding", "phonetic" ]
    },
    {
      "name":"prefix",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard",
      "tokenFilters":[ "lowercase", "my_edgeNGram" ]
    }
  ],
  "tokenFilters":[
    {
      "name":"my_edgeNGram",
      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
      "minGram":2,
      "maxGram":20
    }
  ]
}

With this setup, multiple terms will be indexed for each name. For example, for the name Pacino the prefix analyzer will generate:

    pa
    pac
    paci
    pacin
    pacino

Therefore, search queries like “Paci” against the partialName field will return the expected results. Note that the searchAnalyzer for the partialName field is set to standard, as the query terms are already in the correct prefix form.
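
Keep in mind that partialName is a separate field, so it needs to be populated when documents are uploaded. Here's a minimal sketch of an indexing request, assuming you simply copy the name value into both fields:

POST https://[service name].search.windows.net/indexes/names/docs/index?api-version=2015-02-28-Preview

{
  "value":[
    { "@search.action":"upload", "id":"1", "name":"Al Pacino", "partialName":"Al Pacino" }
  ]
}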

If you're building an index definition similar to the one demonstrated, you could use Scoring Profiles to make sure matches against the exact name field are always scored higher than partial matches.

"scoringProfiles":[
  {
    "name":"exactFirst",
    "text":{
      "weights":{ "name":2, "partialName":1 }
    }
  }
]

Search requests would look like this:

https://[service name].search.windows.net/indexes/names/docs?search=Paci&scoringProfile=exactFirst&api-version=2015-02-28-Preview

Alternatively, you could use the recently exposed Lucene query language and define the relative importance of fields as a part of your query:

https://[service name].search.windows.net/indexes/names/docs?search=name:Paci^2 partialName:Paci&queryType=full&api-version=2015-02-28-Preview
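
If you send this request literally, remember that the space and the boost character in the query need to be URL-encoded, so the encoded form would look something like this:

https://[service name].search.windows.net/indexes/names/docs?search=name:Paci%5E2%20partialName:Paci&queryType=full&api-version=2015-02-28-Preview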

Summary

By allowing you to create custom analyzer configurations we're putting a lot of power in your hands. With power comes responsibility. You will need a basic understanding of how documents and search queries are analyzed in order to achieve the desired results. Hopefully this article will make it easier for you to get started.

For more details please go to our MSDN documentation page for this topic. Feel free to reach out to Azure Search with questions on Stack Overflow and our forum. We'll be happy to help, and as always, keep the feedback coming!