Skip to main content
Azure
  • 5 min read

Power your genomic data analysis on Azure with Azure CycleCloud

At Microsoft, we recognize the challenges faced by the genomics community and are striving to build an ecosystem that can facilitate genomics computing work for all. We’ve focused our efforts on three main core areas: research and discovery in genomics data, building out a platform to enable rapid automation and analysis at scale, and optimized and secure pipelines at a clinical level. One of the core Azure services that has enabled us to leverage an HPC environment to perform the genomic analysis is Azure CycleCloud.

Researchers around the world have access to a greater variety and volume of genomics data than ever before. Genomics is now available to a vast majority of researchers, pushing forward the discovery at a tremendous pace and changing people’s lives. This growth is happening because of the perfect storm between genomic testing and technological improvements. In the span of a few decades, the cost of human genome sequencing has gone from millions of dollars to hundreds of dollars.

At Microsoft, we recognize the challenges faced by the genomics community and are striving to build an ecosystem (backed by open source and Microsoft products and services) that can facilitate genomics computing work for all. We’ve focused our efforts on 3 main core areas: research and discovery in genomics data, building out a platform to enable rapid automation and analysis at scale, and optimized and secure pipelines at a clinical level. One of the core Azure services that has enabled us to leverage an HPC environment to perform the genomic analysis is Azure CycleCloud.

Genomic analysis at scale requires hyperscale compute

Sequencing a single individual’s genome, or even that of a small cohort of individuals, generates a significant amount of data and requires a massive amount of computational power to analyze. The computing power required to effectively analyze, share, and disseminate this data has historically been constrained by what could be provided on-premises for research organizations. For most researchers, the scarcity of availability of high performance computing (HPC) clusters has hamstrung their research potential, threatening high upfront infrastructure investments and long-term maintenance costs. Furthermore, to capture an accurate representation of population health, research studies today need to be globalized meaning that genomic data needs to be securely stored, shared, and transmitted across the world—creating a computing demand that is a heavy lead for even the most sophisticated on-premises. As such, the lack of adequate computing technology has delayed and constrained the ability of genomic research communities to easily collaborate and share findings. 

Cloud computing and the broader digital transformation of the health industry have been powerful enablers of modern genomic breakthroughs, unleashing a practically limitless—and more widely and affordably available—ability to meet the computing demands needed by research organizations and medical institutions to advance genomic science.

HPC on Azure and Azure CycleCloud for genomic analysis

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing HPC environments on Azure. With Azure CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale. Through Azure CycleCloud, users can create different types of file systems and mount them to the compute cluster nodes to support HPC workloads. With dynamic scaling of clusters, the business can get the resources it needs at the right time and the right price, with Azure CycleCloud’s automated configuration enabling IT to ultimately focus on providing high-value services to the business users.

Workflow managers (like Cromwell, Galaxy, Nextflow, and Snakemake) are used for accelerating genome analysis by making them more efficient and scalable. For instance, a typical next-generation sequencing machine can sequence anywhere from 12 to 192 samples per run and creates an output file (called Binary Base Call [BCL] are the raw data generated by NGS). This output file is converted into several FastQ files (FastQ file is a text-based format for storing both a nucleotide sequence and its corresponding quality score). Each FastQ must be further converted into BAM Format (a binary format for storing sequence data) and then Variant Call Format (VCF), which specifies the format of a text file used in bioinformatics for storing gene sequence variations. A bioinformatician or a clinical scientist then picks these files up for further analysis. The sequence of conversion steps from BCL to FASTQ files can take between several hours to several days on commodity hardware. One way to drastically reduce this timeline is to use Azure CycleCloud or Azure Batch to configure these steps as a set of jobs that can be run in parallel.

Accelerating germline testing by optimizing secondary analysis using Azure CycleCloud

Belfast Health and Social Care Trust is the largest integrated health and social care trust in the United Kingdom. They deliver integrated health and social care services to approximately 340,000 citizens in Belfast and provide many regional specialist services to all of Northern Ireland. Belfast Trust also comprises the major network of teaching and training hospitals in Northern Ireland.

Within Belfast Trust, the Regional Molecular Diagnostics Service Northern Ireland (RMDS) has been funded to develop and deliver a service that provides molecular testing of germline and somatic disorders through the introduction of a comprehensive portfolio of Next Generation Sequencing (NGS) panels and exomes aimed at improving patient outcomes in Northern Ireland.

The main goal of this strategic initiative is to enhance high-quality patient care by improving the turnaround time for genomic analysis in accordance with the best practice guidelines by The Association for Clinical Genomic Science and delivering an equitable molecular service in line with mainland United Kingdom labs.

Initially, the sheer complexity and size of the genomic data generated were considered a major computational barrier to the delivery of the service. To meet the computational demand, Belfast Trust developed an accredited germline computational pipeline using the Snakemake workflow manager on Azure. The initial pipeline analyzed only targeted panels, which are smaller in size than clinical exomes, whole exomes, or whole genomes. To convert the raw data to the format necessary for genomic analysis, the jobs were configured for sequential execution on a single virtual machine.

Illustration of sequence of BCL to VCF data conversion steps for genomic analysis.

The challenge arose when analyzing clinical exomes, whole exomes, or whole genomes, whose larger size made the analysis more complex and time-consuming. For instance, a 12-sample targeted panel analysis could be completed in two hours. However, it took 48 hours to execute a complete analysis of a 12-sample clinical exome. To solve this problem, Belfast Trust worked with Microsoft Consulting Services to identify ways to reduce the complexity and time for analyzing these larger data sets. Azure CycleCloud was leveraged to parallelize the pipeline execution.

Illustration of running parallel pipeline executions for sample analysis

The resulting solution was executed on several virtual machines in parallel to analyze the sample pipeline jobs, with astounding results. A 12-sample targeted panel analysis was completed in 20 minutes and a 12-sample clinical exome analysis was completed in 4 hours and 30 minutes, a 6 to 10 times improvement in analysis duration. The virtual machines utilized in the solution were also smaller in size compared to what was previously used, which brought down the cost of the entire pipeline by roughly 3 times.

Reference architecture illustrating the solution used by Belfast Trust, using imported samples from Next Generation Sequencer and flowing through Azure CycleCloud, the Snakemake Engine, and the role of parallel VM executions

“We were able to extend our service by overcoming computational barriers for testing of germline disorders by leveraging HPC on Azure. In collaboration with Microsoft Consulting Services, the Belfast Trust has developed an end-to-end Azure Cloud-based solution for data transfer, pipeline analysis, tertiary analysis, and storage solutions for genomic data. Through this collaboration, and leveraging Azure CycleCloud for analysis, we were able to save analysis time by 6 to 10 times and reduce the cost of analysis by roughly 3 times. This will enable us to expand our capacity to undertake more analysis and testing.”—Shirley Heggarty Ph.D. FRCPath, Director, Regional Genetics Laboratory, Belfast City Hospital

This optimized solution will allow Belfast Trust to manage resources and scale up its sequencing runs more efficiently in the future.

Genomics research is playing an increasingly central role in precision medicine—refining diagnoses, prescribing personalized treatments for patients, and helping us have an even deeper understanding of human health. The advancements of on-demand, geographically available, and affordable HPC services will play a critical role for research organizations and technology partners alike to continue to build with purpose—and to chase new breakthroughs in the field of genomics.

Learn more

Learn more about Microsoft Genomics solutions: