Azure Search indexer for Azure Blob Storage now in public preview

Published on February 9, 2016

Senior Software Engineer, Azure Search

In Azure Search, we strive to remove the friction from indexing data so you can get to building great search experiences faster. Our indexers for Azure SQL Database and DocumentDB have been a hit with customers, and many of them have asked us to build similar magic for Azure Blob Storage.

Extracting text from blobs can be tricky. Formats like PDF and DOC/XLS are binary and difficult to parse; content type detection and metadata extraction can be non-trivial tasks. Good tools exist, but integrating them into an indexing workflow still takes considerable effort and saddles customers with a bunch of code and infrastructure to maintain. We set out to free our customers from that complexity.

Today, we are excited to announce that the Azure Search indexer for Azure Blob Storage (or Azure Search blob indexer for short) is available in public preview. The Azure Search blob indexer lets you seamlessly extract textual content and blob metadata from documents in the following formats:

  • PDF
  • Microsoft Office: DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG (Outlook emails)
  • HTML, XML, ZIP, EML
  • Plain text files

In addition to extracting text content, the indexer extracts custom blob metadata and content-specific document metadata (for example, title and author metadata for PDFs). For more details on its features, take a look at Indexing Documents in Azure Blob Storage with Azure Search.

Several customers are already using blob indexing. For example, ALS is one of the world’s largest and most diversified providers of analytical testing services (e.g., food testing). Giving clients convenient access to the relevant legislative and regulatory documents is a key component of the ALS application. The documents in question (mostly PDFs and Word docs) are stored in blob storage. The Azure Search blob indexer proved convenient for making them searchable with little development effort. In the words of Nuno Coimbra, senior developer at ALS:

Blob indexer allowed us to follow an almost “shoot and forget” approach to document data extraction and indexing, taking from our hands that kind of plumbing. When it comes to availability and scalability, this is a much easier solution for us to maintain.

Invu is a software and solution vendor providing document management and accounts payable solutions in the UK. According to Stuart Evans, Invu’s CTO:

A key part of our premier business solution is a powerful and cost effective search function. We are currently using Azure Search for this as well as the Azure Search indexer to enable text search over our metadata stored in our Azure DocumentDB. Recently, we’ve been experimenting with the new blob indexer feature. It will allow us to remove a lot of complex code from our application.

Setting up blob indexing

To set up blob indexing, create an Azure blob datasource, a search index (if you don’t have one already), then create an indexer that connects that datasource to the target index. For now, you’ll need to use the REST API; soon, we’ll add support for blob indexing to our .NET SDK and Azure portal.

Create blob datasource

POST https://[service name].search.windows.net/datasources?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]

{
   "name" : "my-blob-datasource",
   "type" : "azureblob",
   "credentials" : { "connectionString" : "<my storage connection string>" },
   "container" : { "name" : "my-container", "query" : "my-folder (can be null)" }
}
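
If you prefer to script this call, here is a minimal Python sketch that builds the same request with the standard library. The function name and its parameters are our own illustration, not part of any SDK, and the comment on "query" reflects our reading of the sample above; the sketch only constructs the request and does not send it.

```python
import json
import urllib.request

API_VERSION = "2015-02-28-Preview"  # preview API version used in this post


def build_datasource_request(service_name, admin_key, connection_string,
                             container, folder=None):
    """Build (but do not send) the POST request that creates a blob datasource.

    The function name and parameters are hypothetical, for illustration only.
    """
    url = "https://{0}.search.windows.net/datasources?api-version={1}".format(
        service_name, API_VERSION)
    body = {
        "name": "my-blob-datasource",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        # Assumption: "query" names a folder inside the container;
        # None is serialized as JSON null, which the sample says is allowed.
        "container": {"name": container, "query": folder},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )
```

To actually create the datasource, pass the returned object to urllib.request.urlopen with your real service name, admin key, and storage connection string.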

Create search index

POST https://[service name].search.windows.net/indexes?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]

{
  "name" : "my-index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true, "searchable": false},
    {"name": "content", "type": "Edm.String", "filterable": false, "searchable": true}
  ]
}

Create indexer

PUT https://[service name].search.windows.net/indexers/blob-indexer?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]

{
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-index",
  "schedule" : { "interval" : "PT2H" }
}
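
The three calls have to happen in order: datasource, then index, then indexer. The following Python sketch shows one way to keep them together so that the names the indexer references always match the ones you created. The helper names (_request, build_setup_requests) are hypothetical and exist only for this example; the code builds the requests without sending them.

```python
import json
import urllib.request

API_VERSION = "2015-02-28-Preview"  # preview API version used in this post


def _request(method, service_name, path, admin_key, body):
    """Build (but do not send) one JSON request against the search service."""
    url = "https://{0}.search.windows.net/{1}?api-version={2}".format(
        service_name, path, API_VERSION)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method=method,
    )


def build_setup_requests(service_name, admin_key, connection_string, container):
    """Return the three setup requests in the order they should be sent."""
    datasource = {
        "name": "my-blob-datasource",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container, "query": None},
    }
    index = {
        "name": "my-index",
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            {"name": "content", "type": "Edm.String",
             "filterable": False, "searchable": True},
        ],
    }
    indexer = {
        # Reuse the names defined above so the references match exactly.
        "dataSourceName": datasource["name"],
        "targetIndexName": index["name"],
        "schedule": {"interval": "PT2H"},
    }
    return [
        _request("POST", service_name, "datasources", admin_key, datasource),
        _request("POST", service_name, "indexes", admin_key, index),
        _request("PUT", service_name, "indexers/blob-indexer", admin_key, indexer),
    ]
```

Sending each request in turn with urllib.request.urlopen stops at the first failure, because urlopen raises an HTTPError on a non-success response.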

That’s all there is to it! Your indexer will run every two hours (you can configure the schedule interval to be anywhere from five minutes to 24 hours) and pick up any new or updated blobs. You can also monitor its execution in the Azure portal, or programmatically using the Get Indexer Status API.
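
The schedule interval is written as an ISO 8601 duration, so "PT2H" means two hours. As a rough sketch, the Python helper below formats an interval and rejects values outside the five-minute to 24-hour range mentioned above, and a second function builds the Get Indexer Status request. Both function names are our own, and we are assuming the service accepts hour-and-minute durations such as "PT1H30M"; check the REST reference for the exact forms it supports.

```python
import urllib.request
from datetime import timedelta

API_VERSION = "2015-02-28-Preview"  # preview API version used in this post
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(hours=24)


def schedule_interval(delta):
    """Format a timedelta as an ISO 8601 duration such as "PT2H".

    Seconds are dropped; only whole hours and minutes are emitted.
    """
    if not MIN_INTERVAL <= delta <= MAX_INTERVAL:
        raise ValueError("interval must be between five minutes and 24 hours")
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    text = "PT"
    if hours:
        text += "{0}H".format(hours)
    if minutes:
        text += "{0}M".format(minutes)
    return text


def build_status_request(service_name, admin_key, indexer_name):
    """Build (but do not send) a Get Indexer Status request."""
    url = "https://{0}.search.windows.net/indexers/{1}/status?api-version={2}".format(
        service_name, indexer_name, API_VERSION)
    return urllib.request.Request(
        url, headers={"api-key": admin_key}, method="GET")
```

For example, schedule_interval(timedelta(hours=2)) produces the "PT2H" value used in the indexer definition above.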

Help us make Azure Search better

We hope that you will find the Azure Search blob indexer useful. If you have any suggestions for improving the blob indexer, or Azure Search in general, please let us know.