Azure Storage Integration with CDN, Search and HDInsight

In this sample, we demonstrate how to host, analyze and search content uploaded to Azure Storage blobs. This code was originally used in the 2017 Azure Storage Build Talk.

If you don't have a Microsoft Azure subscription, you can get a FREE trial account here.

Running this sample

To run this sample:

  1. Create an Azure Storage Account. Select general purpose, not blob storage, for the account type.

  2. Unzip 'clinical-trials.zip'. The files here represent a curated subset of all publicly available clinical trials.

    Note: To use the entire set of clinical trials for this sample, visit clinicaltrials.gov. Make sure to select the following options when downloading.

    You can convert the XML files to text files by running the code in 'convert-clinical-trials.cs', making sure to edit the input and output directories.

  3. Create a container named 'data' in your storage account and upload the text files to a virtual directory named 'clinical-trials'. You can do this using Azure CLI 2.0 with Azure Storage; see the 'az storage blob upload-batch' reference for detailed guidance. You can also use the AzCopy command-line utility, the Azure Storage client libraries, or other Azure Storage client tools.

  4. Make sure to grant anonymous users read permission to the 'data' container.

  5. Use the Azure CDN to access blobs with custom domains over HTTPS. When adding your CDN endpoint make sure to enter '/data' in the 'origin path' field. The clinical trials will be accessible at '<domain-name>/clinical-trials/<file-name>'.

  6. To perform a full-text search on the clinical trials data, integrate Azure Search with blob storage directly from the Azure portal. When creating your data source, select the 'data' container from the storage account you created. Input 'clinical-trials' in the 'blob folder' field. When constructing your index, make sure the 'content' field is 'searchable'. You can start to query your Azure Search index from the Azure portal immediately, even before all of the documents have been indexed.

  7. To convert the clinical trials text files to JSON, create an Apache Spark cluster in Azure HDInsight. When creating your cluster, make sure to select the storage account you created earlier and specify 'data' as the default container. Open Jupyter and create a PySpark notebook; instructions can be found in the previously linked article. Finally, run the code in 'texttojson.py'.

  8. To search the newly generated JSON files, execute the REST requests in 'jsonsearchsetup.txt'. You'll need to fill in the key and name of both your search service and storage account. To better understand the requests in 'jsonsearchsetup.txt', see Indexing JSON blobs with Azure Search blob indexer.

  9. To create a front end for your search service, visit the Azure Search Generator. Input the query key for your search service, name of your search service, and the JSON schema for your index. Executing the second request in 'jsonsearchsetup.txt' to create your index will return the JSON schema. Instead of using the Azure Search Generator, you can also fill in your search service query key and name in 'search-clinical-trials.htm'.

  10. To access your front end, enable CORS on your search index directly from the Azure portal, and upload the .htm file to the 'data' container in your storage account. You can then access it via the custom domain you set up in step 5.
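
The naming used in steps 3 and 5 can be sketched in a few lines of Python. This is only an illustration of how a local file name maps to a blob name and then to a CDN URL when the endpoint's origin path is '/data'; it does not upload anything. The file name 'NCT00000102.txt' and the domain 'cdn.example.com' are hypothetical placeholders, and the helper names are not part of the sample.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

CONTAINER = "data"               # container name from step 3
VIRTUAL_DIR = "clinical-trials"  # virtual directory from step 3


def blob_name(file_name: str) -> str:
    """Blob name inside the container for an uploaded text file."""
    # Only the base name is kept; local sub-directories are dropped.
    return f"{VIRTUAL_DIR}/{PurePosixPath(file_name).name}"


def storage_path(file_name: str) -> str:
    """Path of the blob relative to the storage account root."""
    return f"/{CONTAINER}/{blob_name(file_name)}"


def cdn_url(domain_name: str, file_name: str) -> str:
    """URL of the blob behind a CDN endpoint whose origin path is '/data'.

    The origin path already points at the container, so the container
    name does not appear in the public URL.
    """
    return f"https://{domain_name}/{quote(blob_name(file_name))}"


# Hypothetical file and domain names, for illustration only.
print(storage_path("out/NCT00000102.txt"))
print(cdn_url("cdn.example.com", "NCT00000102.txt"))
```

The same mapping applies to the front-end page in step 10: a file uploaded to the root of the 'data' container is served directly under the custom domain.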

Appendix

A brief example of a cross tabulation on the clinical trials data in PySpark can be found in 'quickstats.py'. Run the code in the PySpark notebook you set up in step 7.
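
For readers unfamiliar with cross tabulation, the following plain-Python sketch shows what such a table computes: a count for every pair of values from two fields. It is not the code in 'quickstats.py', and the field names 'phase' and 'status' as well as the sample records are hypothetical. In a PySpark notebook, the built-in DataFrame.crosstab(col1, col2) method produces the same kind of table on a DataFrame.

```python
from collections import Counter


def crosstab(records, row_key, col_key):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Fill in zeros so every row has an entry for every column.
    return {row: {col: counts.get((row, col), 0) for col in cols} for row in rows}


# Hypothetical records; real trials carry many more fields.
trials = [
    {"phase": "Phase 1", "status": "Completed"},
    {"phase": "Phase 1", "status": "Recruiting"},
    {"phase": "Phase 2", "status": "Completed"},
    {"phase": "Phase 1", "status": "Completed"},
]

table = crosstab(trials, "phase", "status")
for phase, row in table.items():
    print(phase, row)
```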

You can search just the metadata of your storage blobs by selecting 'storage metadata only' in the 'data to extract' field when creating a data source for your search index (as in step 6). The query in 'metadatasearchquery.txt' returns the 3 largest blobs greater than 195 GB in descending size order.
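
A query of this shape can be sketched as follows. This is an assumption about how such a request might look, not a copy of 'metadatasearchquery.txt': it assumes the index exposes the blob indexer's 'metadata_storage_size' field as filterable and sortable, and the service name, index name, API version, and byte threshold below are placeholders. The actual request would also need an 'api-key' header carrying your query key.

```python
from urllib.parse import quote, urlencode


def largest_blobs_query(min_bytes: int, top: int = 3) -> dict:
    """OData parameters: the `top` largest blobs above `min_bytes`, largest first."""
    return {
        "search": "*",
        "$filter": f"metadata_storage_size gt {min_bytes}",
        "$orderby": "metadata_storage_size desc",
        "$top": str(top),
    }


def query_url(service_name: str, index_name: str, params: dict, api_version: str) -> str:
    """Build the GET URL for a search request against an index."""
    base = f"https://{service_name}.search.windows.net/indexes/{index_name}/docs"
    query = urlencode({"api-version": api_version, **params}, quote_via=quote)
    return f"{base}?{query}"


# Placeholder values; reading "195 GB" as 195 * 10**9 bytes is an assumption.
params = largest_blobs_query(min_bytes=195 * 10**9)
url = query_url("my-search-service", "my-index", params, api_version="2017-11-11")
print(url)
```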