Leverage OCR to full text search your images within Azure Search

By Liam Cavanagh Principal Program Manager, Azure Search

Posted on February 18, 2016

With Azure Search, we try to help you build great search applications over your data. Capabilities like the Azure Search Indexer are meant to make it convenient to ingest data from common data sources so that it becomes full text searchable.

Recently we released the Azure Search Indexer for Azure Blob Storage, which allows extraction of text from common file types such as Office, PDF and HTML. One file type we have not yet added support for, but which is a common ask, is images. The idea is that if you have a file such as a JPG, a TIFF or a PDF with embedded images, you might want to extract the text from these images and use it to enhance your search index. Imagine you have medical imagery, faxes or scanned documents and want to search over them. This technique is called Optical Character Recognition (OCR), and I want to show you how it can be used to enhance the content in your Azure Search index.

In talking with customers, I found it is very common to have images embedded within PDF documents, so this is the main focus of the sample: I needed not only to run OCR against the images, but also to extract the images from the PDFs first. You can see how this was accomplished in the following GitHub repository. The general flow of the sample is to extract the images from each PDF, run OCR against those images, and send the resulting text to Azure Search.

The main technologies I used to accomplish this were:

iTextSharp: This is a convenient .NET port of iText, a PDF library that allows you to manipulate content in PDF files. Special thanks go to Jerome Viveiros, who wrote a great sample on how to use iTextSharp in his post; it formed the basis of much of the code in my sample that extracts the images from the PDF file.
Project Oxford Vision API: There are many ways you can extract text from images (such as Tesseract). I chose to use Project Oxford's Vision APIs. I found the code really simple to use and the extracted text was of very high quality. The main downside is that it is currently limited to 5,000 API calls/month, which can be limiting if you have a lot of documents, but I also understand from a Program Manager on that team that this limit can be increased if needed.
Azure Search: This is the search service where the output from the OCR process is sent. Once the text is formatted into a JSON document and sent to Azure Search, it becomes full text searchable from your application.
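
To make the last step more concrete, here is a minimal Python sketch of turning OCR output into a JSON document for Azure Search. It is not the code from the sample, which is built on .NET libraries. It assumes the OCR result is shaped as regions containing lines containing words, each word with a "text" field, and it assumes an index with two hypothetical fields, "id" (the key) and "content". The helper names are made up for illustration, and actually sending the batch to the service is left out.

```python
import base64
import json


def ocr_result_to_text(ocr_result):
    """Join the words of an OCR result into plain text, one row per OCR line.

    Assumption: the result is shaped as regions -> lines -> words, and each
    word carries a "text" field.
    """
    lines = []
    for region in ocr_result.get("regions", []):
        for line in region.get("lines", []):
            words = [word["text"] for word in line.get("words", [])]
            lines.append(" ".join(words))
    return "\n".join(lines)


def make_document_key(file_name):
    """Derive a document key from a file name using URL-safe base64.

    File names often contain characters such as "." or "/" that are not
    accepted in document keys, so the name is encoded first.
    """
    return base64.urlsafe_b64encode(file_name.encode("utf-8")).decode("ascii")


def build_upload_batch(file_name, image_texts):
    """Wrap the text extracted from one file in an upload batch.

    "id" and "content" are hypothetical field names; they have to match the
    fields defined in your own index.
    """
    document = {
        "@search.action": "upload",
        "id": make_document_key(file_name),
        "content": "\n".join(image_texts),
    }
    return {"value": [document]}


# A tiny hand-written OCR result standing in for a real API response.
sample_ocr = {
    "regions": [
        {
            "lines": [
                {"words": [{"text": "Invoice"}, {"text": "1234"}]},
                {"words": [{"text": "Total"}, {"text": "due"}]},
            ]
        }
    ]
}

batch = build_upload_batch("scan.pdf", [ocr_result_to_text(sample_ocr)])
print(json.dumps(batch, indent=2))
```

The printed JSON is the kind of request body you would post to the index's document endpoint. A PDF with several embedded images would simply pass one extracted text per image in `image_texts`.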

A full outline of how to do this can be found in the following GitHub repository. If you would like to see OCR added to the Azure Search Indexer, please cast your vote.

Next steps

This sample is just a starting point, and you will more than likely want to extend it further. Perhaps you will want to add the title of the file, or metadata relating to the file (file size, last updated, etc.), which can then be used for faceting and filtering by your users. You might also want to add a URL reference to the actual image file so that users can open it directly from your application.
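
As a rough illustration of that kind of extension, the Python sketch below adds file metadata to a document before it is sent to Azure Search. The field names ("title", "size", "last_updated", "url"), the helper name and the example URL are all hypothetical; they would need to match fields you define in your own index.

```python
import json
from datetime import datetime, timezone


def add_file_metadata(document, title, size_bytes, last_modified, url):
    """Return a copy of the document with file metadata fields added.

    All field names here are hypothetical. last_modified must be a
    timezone-aware datetime; it is written as an ISO 8601 UTC timestamp.
    """
    enriched = dict(document)  # copy so the caller's document is not modified
    enriched.update(
        {
            "title": title,
            "size": size_bytes,
            "last_updated": last_modified.astimezone(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            ),
            "url": url,
        }
    )
    return enriched


# A document holding only OCR text, as a stand-in for the sample's output.
base_document = {"@search.action": "upload", "id": "c2Nhbi5wZGY=", "content": "Invoice 1234"}

enriched = add_file_metadata(
    base_document,
    title="scan.pdf",
    size_bytes=2048,
    last_modified=datetime(2016, 2, 18, 12, 0, tzinfo=timezone.utc),
    url="https://example.com/scans/scan.pdf",  # placeholder location of the file
)
print(json.dumps(enriched, indent=2))
```

If fields like these are marked as filterable and facetable in the index, a query could then narrow results with a filter expression such as `size lt 1048576`, and the "url" field gives your application something to link to.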

Until we integrate OCR capabilities into the Azure Search Indexer, I hope you will find this helpful for getting the content from your images into Azure Search. If you have any questions or feedback, please let me know in the comments below.
