
Searching for text with DocumentDB

A common ask among DocumentDB customers is, “How do I search for documents containing some text value?” In this post, we will explore two different ways of doing this, depending on what you want to accomplish.

1. Tokenizing words

The first method is easy to implement and works well when all you need is relatively simple word matching. Consider a document that looks like the JSON below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "credits": 10
}

How would you search for all documents that have the word ‘database’ in the title field? A simple way is to tokenize the title field and store the resulting words in the document, as below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "titleWords": [ "fundamentals", "database", "design" ],
    "credits":  10
}
 
Note: Consider transforming the words to lowercase and using a regular expression to remove any punctuation. Also, strip out stop words like “to”, “the”, and “of” (https://en.wikipedia.org/wiki/Stop_words).
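
The tokenizing step described above could look something like the following Python sketch. The function name `tokenize_title` and the tiny `STOP_WORDS` set are assumptions made for illustration, not part of DocumentDB; in practice you would use a fuller stop-word list.

```python
import re

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def tokenize_title(title):
    """Lowercase the title, drop punctuation, and remove stop words."""
    # Lowercase first, then extract runs of letters and digits,
    # which discards punctuation and whitespace.
    words = re.findall(r"[a-z0-9]+", title.lower())
    seen = set()
    tokens = []
    for word in words:
        # Skip stop words and repeated words, keeping first-seen order.
        if word not in STOP_WORDS and word not in seen:
            seen.add(word)
            tokens.append(word)
    return tokens

course = {"id": "CDC101", "title": "Fundamentals of database design", "credits": 10}
# Store the tokens alongside the original title before saving the document.
course["titleWords"] = tokenize_title(course["title"])
```

Repeated words are dropped here on purpose: a JOIN over the array produces one result per matching element, so a word stored twice would return the same document twice.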

Now searching for words in the title becomes easy with the following query:

SELECT VALUE r
FROM root r JOIN word IN r.titleWords
WHERE word = "database"

Note: This query uses the VALUE keyword, which returns the JSON value itself rather than wrapping it in a new object. For more on the VALUE keyword, refer to the Query with DocumentDB article.
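
To see what the JOIN in this query is doing, here is a minimal in-memory Python sketch of the same logic. It does not call DocumentDB; `query_by_word` and the second sample document (CDC102) are made up for illustration. The query pairs every document with each element of its titleWords array, then keeps the pairs whose element equals the search term.

```python
def query_by_word(documents, term):
    """Mimic: SELECT VALUE r FROM root r JOIN word IN r.titleWords WHERE word = term."""
    results = []
    for r in documents:                        # FROM root r
        for word in r.get("titleWords", []):   # JOIN word IN r.titleWords
            if word == term:                   # WHERE word = term
                results.append(r)              # SELECT VALUE r
    return results

docs = [
    {"id": "CDC101", "titleWords": ["fundamentals", "database", "design"]},
    # Hypothetical second document, added only to show a non-match.
    {"id": "CDC102", "titleWords": ["introduction", "networking"]},
]
matches = query_by_word(docs, "database")
```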

The same query expressed as LINQ in C# would be:

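// Course is a hypothetical class; its TitleWords property is assumed
// to map to the "titleWords" JSON array.
List<Course> results = client.CreateDocumentQuery<Course>(collection.SelfLink)
    .SelectMany(r => r.TitleWords
        .Where(word => word == "database")
        .Select(word => r)
    ).ToList();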
 

The query above is efficient because DocumentDB indexes each word in the array by default, which allows quick equality matching on a word, as we are doing here. Another advantage of this approach is that it honors the consistency level of your database, meaning that any changes to your tokenized words are available for querying immediately.
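
As a rough intuition for why equality matching on an indexed array element is fast, consider the toy inverted index below. This is only a conceptual Python sketch: it is not how DocumentDB's index is actually implemented, and `build_word_index` and the CDC102 document are hypothetical.

```python
from collections import defaultdict

def build_word_index(documents):
    """Map each word to the set of ids of documents whose titleWords contain it."""
    index = defaultdict(set)
    for doc in documents:
        for word in doc.get("titleWords", []):
            index[word].add(doc["id"])
    return index

docs = [
    {"id": "CDC101", "titleWords": ["fundamentals", "database", "design"]},
    # Hypothetical second document for illustration.
    {"id": "CDC102", "titleWords": ["database", "administration"]},
]
index = build_word_index(docs)
# An equality match is a single dictionary lookup, not a scan of every document.
ids = index.get("database", set())
```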

2. Using Azure Search

If you have many words to tokenize, cannot accommodate the extra storage required for the additional array of words, or need more complex, multi-faceted full-text search across multiple fields, then the above approach will not work for you and you need a more powerful full-text search engine. Luckily, Azure has one that is easy to set up and use: Azure Search. You can set up a data source pointing to your DocumentDB database and have a Search indexer crawl through your data on a predefined schedule. For detailed steps on setting this up, check out Connecting DocumentDB with Azure Search using indexers.

You can also download a sample ASP.NET MVC web application that uses DocumentDB and Search together. To run this sample, create a Search account and a DocumentDB account, then update the web.config with your endpoints and keys (obtainable from the Azure Management Portal). The download includes sample todo items, which you can import into DocumentDB if you want to start with some canned data, or you can open the solution in Visual Studio and run the project to start with a clean slate. Add some todo items, hit the index button to force a manual re-index, and then try some searches.

So, there you have it: searching for text within DocumentDB is easy! To learn more about DocumentDB, visit our service page, and to learn more about DocumentDB query syntax, visit our Query Playground page.