With Azure Search we try to help you build really great search applications over your data. Through capabilities like the Azure Search Indexer, we have tried to make it convenient to ingest data from common data sources to enable this full text search support.
Recently we released the Azure Search Indexer for Azure Blob Storage, which extracts text from common file types such as Office, PDF and HTML. One file type we have not yet added support for, but which is a common ask, is images. The idea is that if you have a file such as a JPG or TIFF, or a PDF with embedded images, you might want to extract the text from those images and use it to enhance your search index. Imagine you have medical imagery, faxes or scanned documents and want to search over them. Extracting text from images in this way is called Optical Character Recognition (OCR), and I want to show you how it can be used to enhance the content in your Azure Search index.
In talking with customers, I found it is very common to have images embedded within PDF documents, so this is the main focus of the sample: I would need not only to run OCR against each image, but also to extract the images from the PDFs. You can see how this was accomplished in the following GitHub repository. Here is the general flow of what is done in the sample:
The main technologies I used to accomplish this were:
- iTextSharp: This is a really convenient .NET port of iText, a PDF library that allows you to manipulate content in PDF files. Special thanks go to Jerome Viveiros, whose post includes a great sample on how to use iTextSharp. It formed the basis of much of the code in my sample that extracts the images from the PDF file.
- Project Oxford Vision API: There are many ways you can extract text from images (such as Tesseract). I chose to use Project Oxford’s Vision API. I found the code really simple to use and the extracted text was of very high quality. The main downside is that it is currently limited to 5,000 API calls/month, which can be restrictive if you have a lot of documents, but I understand from a Program Manager on that team that this limit can be increased if needed.
- Azure Search: This is the search service where the output from the OCR process is sent. Once the text is formatted into a JSON document and sent to Azure Search, it becomes full text searchable from your application.
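To make the last step more concrete, here is a minimal sketch of how OCR output might be shaped into the JSON batch that the Azure Search REST API accepts for document uploads. The sample in the repository is written in .NET; this Python version is only an illustration. The field names `id`, `fileName` and `content`, and the helper names, are assumptions and would need to match whatever fields your own index defines. The file name is base64 encoded for the key because document keys only allow a limited set of characters.

```python
import base64
import json


def make_document_key(file_name: str) -> str:
    """Encode a file name as URL-safe base64 so the key avoids spaces, dots and slashes."""
    return base64.urlsafe_b64encode(file_name.encode("utf-8")).decode("ascii")


def build_upload_batch(file_name: str, ocr_texts: list) -> dict:
    """Combine the OCR text of every image in one file into a single upload batch.

    The field names ("id", "fileName", "content") are hypothetical and must
    match the fields defined in your own index.
    """
    # Drop images that produced no text and trim stray whitespace.
    content = "\n".join(t.strip() for t in ocr_texts if t.strip())
    document = {
        "@search.action": "upload",  # insert the document, or replace it if the key exists
        "id": make_document_key(file_name),
        "fileName": file_name,
        "content": content,
    }
    # Index batches are sent as {"value": [ ...documents... ]}.
    return {"value": [document]}


if __name__ == "__main__":
    batch = build_upload_batch(
        "scan 01.pdf",
        ["Patient: Jane Doe ", "", "Fax received 3 May"],
    )
    print(json.dumps(batch, indent=2))
```

The resulting JSON would then be posted to the index's document endpoint with your service name, index name and API key, none of which are shown here.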
A full outline of how to do this can be found in the following GitHub repository. If you would like to see OCR added to the Azure Search Indexer, please cast your vote.
This sample is just a starting point. You will more than likely want to extend it further. Perhaps you will want to add the title of the file, or metadata relating to the file (file size, last updated, etc.), which can then be used for faceting and filtering by your users. You might also want to add a URL reference to the actual image file so you can allow users to open it directly from your application.
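As a rough sketch of that kind of extension, the function below gathers a few metadata fields for a file so they can be merged into the document before it is uploaded. The field names `fileName`, `fileSize`, `lastUpdated` and `url` are hypothetical, and your index would need matching fields marked as filterable or facetable for them to be useful in that way.

```python
import os
from datetime import datetime, timezone


def file_metadata_fields(path: str, file_url: str = None) -> dict:
    """Return extra fields describing a file, ready to merge into a search document.

    All field names here are assumptions; rename them to match your index.
    """
    info = os.stat(path)
    last_modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    fields = {
        "fileName": os.path.basename(path),
        "fileSize": info.st_size,  # size in bytes
        # ISO 8601 UTC timestamp, the usual shape for a date/time field
        "lastUpdated": last_modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if file_url is not None:
        # Link back to the original file so the application can open it
        fields["url"] = file_url
    return fields
```

These fields could be added to a document with something like `document.update(file_metadata_fields(path, file_url))` before the batch is sent.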
Until we integrate OCR capabilities into the Azure Search Indexer, I hope you will find this useful for getting your content from images into Azure Search. If you have any questions or feedback on this, please let me know in the comments below.