Searching for text with DocumentDB

By Ryan CrawCour, Program Manager, Azure Stream Analytics

Posted on March 9, 2015

A common question from DocumentDB customers is, “How do I search for documents containing some text value?” In this post, we will explore two different ways of doing this, depending on what you want to accomplish.

1. Tokenizing words

The first method is easy to implement and works well when you only need simple word matching. Consider a document that looks like the JSON below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "credits": 10
}

How would I search for all documents that have the word ‘database’ in the title field? A simple way to do this is to tokenize the title field and store the resulting words in the document, as below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "titleWords": [ "fundamentals", "database", "design" ],
    "credits":  10
}

Note: Consider using a regular expression to lowercase the words and remove any punctuation. Also, strip out stop words such as “to”, “the”, and “of” (https://en.wikipedia.org/wiki/Stop_words).
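The tokenizing step described above could be sketched as follows. This is a minimal illustration in Python, not code from the original post: the function names tokenize and add_title_words, the tiny stop-word list, and the choice to drop duplicate words are all assumptions made for the example, and the pattern only handles ASCII letters and digits.

```python
import re

# A deliberately small stop-word list for illustration only (an assumption);
# a real application would likely use a fuller list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip punctuation, drop stop words and duplicates."""
    # Keep runs of ASCII letters and digits; everything else acts as a separator.
    words = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    tokens = []
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        tokens.append(word)
    return tokens


def add_title_words(doc):
    """Return a copy of doc with a titleWords array derived from its title."""
    enriched = dict(doc)
    enriched["titleWords"] = tokenize(doc["title"])
    return enriched


course = {"id": "CDC101", "title": "Fundamentals of database design", "credits": 10}
print(add_title_words(course)["titleWords"])
# ['fundamentals', 'database', 'design']
```

Dropping duplicates is a design choice rather than a requirement: it keeps the array small and means a word that appears twice in a title is stored only once.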

Now searching for words in the title becomes easy with the following query:

SELECT VALUE r
FROM root r JOIN word IN r.titleWords
WHERE word = "database"

Note: This query uses the VALUE keyword, which returns the JSON value itself rather than wrapping it in an enclosing object. For more on the VALUE keyword, refer to the Query with DocumentDB article.
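To make the semantics of the query concrete, here is a small in-memory imitation in Python. It is only a sketch of what the JOIN, WHERE, and VALUE parts mean, not how DocumentDB executes the query; the function name query_title_word and the sample documents CDC102 and CDC103 are hypothetical.

```python
def query_title_word(docs, target, value=True):
    """Imitate: SELECT [VALUE] r FROM root r JOIN word IN r.titleWords WHERE word = target."""
    results = []
    for r in docs:
        # JOIN: pair the document with each element of its own titleWords array.
        for word in r.get("titleWords", []):
            # WHERE: keep only the pairs whose word equals the target.
            if word == target:
                # With VALUE the document itself is returned; without it,
                # the document is wrapped under the alias "r".
                results.append(r if value else {"r": r})
    return results


docs = [
    {"id": "CDC101", "titleWords": ["fundamentals", "database", "design"]},
    {"id": "CDC102", "titleWords": ["advanced", "networking"]},
    # A word stored twice yields the document twice, once per matching element.
    {"id": "CDC103", "titleWords": ["database", "tuning", "database"]},
]

print([d["id"] for d in query_title_word(docs, "database")])
# ['CDC101', 'CDC103', 'CDC103']
```

Because the join produces one result per matching array element, it is worth removing duplicate words when building the array if each document should appear only once in the results.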

If you want the same query expressed in C# using LINQ, it would be:

// Course is an assumed class whose properties mirror the JSON document above.
List<Course> results = client.CreateDocumentQuery<Course>(collection.SelfLink)
    .SelectMany(r => r.titleWords
        .Where(word => word == "database")
        .Select(word => r)
    ).ToList();

The query above is efficient. By default, DocumentDB indexes each word in the array, which allows quick equality matching on a word, as we are doing here. Another advantage of this approach is that it honors the consistency level of your database, meaning that any changes to your tokenized words are available for querying immediately.

2. Using Azure Search

If you have many words to tokenize, cannot accommodate the extra storage required for the additional array of words, or need more complex, multi-faceted full-text search across multiple fields, then the above approach will not work for you and you need a more powerful search engine with full-text capabilities. Fortunately, Azure has one that is easy to set up and use: Azure Search.

You can set up a data source pointing to your DocumentDB database and have a Search indexer crawl through your data on a predefined schedule. For detailed steps on setting this up, check out Connecting DocumentDB with Azure Search using indexers.

You can also download a sample ASP.NET MVC web application that uses DocumentDB and Search together. To run this sample, create a Search account and a DocumentDB account, then update the web.config with your endpoints and keys (obtainable from the Azure Management Portal). The download includes sample todo items which you can import into DocumentDB if you want to start with some canned data, or you can open the solution in Visual Studio and run the project to start with a clean slate. Add some todo items, hit the index button to force a manual re-index, and then try some searches.

So, there you have it: searching for text within DocumentDB is easy! To learn more about DocumentDB, visit our service page, and to learn more about DocumentDB query syntax, please visit our Query Playground page.
