Searching for text with DocumentDB

By Ryan CrawCour, Program Manager, Azure Stream Analytics

Posted on March 9, 2015

A common ask among DocumentDB customers is, “How do I search for documents containing some text value?” In this post, we will explore two different ways of doing this, depending on what you want to accomplish.

1. Tokenizing words

The first method is easy to implement and works well when you only need simple word matching. Consider a document that looks like the JSON below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "credits": 10
}

How would I search for all documents that have the word ‘database’ in the title field? A simple way to do this is to tokenize the title field and store the resulting words in an array alongside it, as in the JSON below:

{
    "id": "CDC101",
    "title": "Fundamentals of database design",
    "titleWords": [ "fundamentals", "database", "design" ],
    "credits":  10
}

Note: lowercase the words and consider using a RegEx to remove any punctuation. Also, strip out stop words like “to”, “the”, and “of” (https://en.wikipedia.org/wiki/Stop_words).
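The tokenizing step itself could be sketched as follows. This is a minimal Python illustration, not code from the original sample: the `tokenize` and `with_title_words` helpers and the tiny stop-word list are assumptions made for the example, and the regular expression only keeps ASCII letters and digits.

```python
import re

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text):
    """Lowercase text, drop punctuation and stop words, and remove duplicates."""
    # Lowercase first, then keep runs of ASCII letters/digits (punctuation is skipped).
    words = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    tokens = []
    for word in words:
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        tokens.append(word)  # preserves first-seen order
    return tokens


def with_title_words(doc):
    """Return a copy of doc with a titleWords array derived from its title."""
    enriched = dict(doc)
    enriched["titleWords"] = tokenize(doc["title"])
    return enriched


course = {"id": "CDC101", "title": "Fundamentals of database design", "credits": 10}
print(with_title_words(course)["titleWords"])
# → ['fundamentals', 'database', 'design']
```

You would run something like this before writing the document to the collection, and again whenever the title changes, so that the array stays in sync with the field it was derived from.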

Now searching for words in the title becomes easy with the following query:

SELECT VALUE r
FROM root r JOIN word IN r.titleWords
WHERE word = "database"

Note: I used the VALUE keyword in this SQL statement. The VALUE keyword provides a way to return the JSON value itself, rather than wrapping it in an object. For more on the VALUE keyword, refer to the Query with DocumentDB article.

If you want the same query expressed in C# with LINQ, it would be:

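// Course is an assumed class whose titleWords property maps to the "titleWords" array above.
List<Course> results = client.CreateDocumentQuery<Course>(collection.SelfLink)
    .SelectMany(r => r.titleWords
        .Where(word => word == "database")
        .Select(word => r)
    ).ToList();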

The query above is efficient because DocumentDB indexes each word in the array by default, which allows quick equality matching on a word, as we are doing here. Another advantage of this approach is that it honors the consistency level of your database, meaning that any changes to your tokenized words will be available for querying immediately.
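To make the shape of the query result concrete, here is a small in-memory Python sketch of what the JOIN expresses. It is only an illustration of the semantics as I understand them, not how DocumentDB executes the query (the service uses its index rather than scanning), and the `find_by_word` helper and the CDC102 and CDC103 documents are hypothetical.

```python
def find_by_word(docs, word):
    """Emulate: SELECT VALUE r FROM root r JOIN word IN r.titleWords WHERE word = <word>."""
    # The JOIN pairs each document with each element of its own titleWords array,
    # so a document is returned once per matching array element.
    return [doc for doc in docs for w in doc.get("titleWords", []) if w == word]


docs = [
    {"id": "CDC101", "titleWords": ["fundamentals", "database", "design"]},
    # Hypothetical document whose token array contains the same word twice.
    {"id": "CDC102", "titleWords": ["database", "tuning", "database"]},
    {"id": "CDC103", "titleWords": ["networking", "basics"]},
]

ids = [doc["id"] for doc in find_by_word(docs, "database")]
print(ids)
# → ['CDC101', 'CDC102', 'CDC102']
```

Because a document comes back once for every matching element, it is worth removing duplicate words when you build the array; otherwise the same document can appear several times in the results.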

2. Using Azure Search

If you have many words to tokenize, can't accommodate the extra storage required for the additional array of words, or need more complex, multi-faceted full-text search across multiple fields, then the above approach will not work for you, and you need a more powerful search engine with full-text capabilities. Luckily, Azure has one that is easy to set up and use: Azure Search. You can set up a data source pointing to your DocumentDB database and have a Search indexer crawl through your data on a predefined schedule. For detailed steps on setting this up, check out Connecting DocumentDB with Azure Search using indexers.

You can also download a sample ASP.NET MVC web application that uses DocumentDB and Search together. To run this sample, create a Search account and a DocumentDB account, then update the web.config with your endpoints and keys (obtainable from the Azure Management Portal). The download includes sample todo items which you can import into DocumentDB if you want to start with some canned data, or you can open the solution in Visual Studio and run the project to start with a clean slate. Add some todo items, hit the index button to force a manual re-index, and then go do some searching.

So, there you have it: searching for text within DocumentDB is easy! To learn more about DocumentDB, visit our service page, and to learn more about DocumentDB query syntax, please visit our Query Playground page.
