Genomics Data Lake

Genomics Collection Data Lake

The Genomics Data Lake provides a variety of public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.
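As an illustration of integrating one of these formats into a workflow, the following minimal sketch reads the eight fixed columns of a VCF file using only the Python standard library. The `parse_vcf` helper and the sample record are hypothetical and made up for this example; they are not taken from any dataset in the Genomics Data Lake. For production work, a dedicated VCF library is likely a better choice.

```python
import io

# The eight fixed, tab-separated columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(handle):
    """Yield one dict per variant record, skipping blank and header lines.

    Hypothetical helper for illustration only.
    """
    for line in handle:
        line = line.rstrip("\n")
        # Meta-information lines start with "##", the column header with "#".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])  # position is a 1-based integer
        yield record


# Made-up sample content, not real data from the Genomics Data Lake.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=10\n"
)

records = list(parse_vcf(io.StringIO(SAMPLE_VCF)))
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
```

The same generator works on a file opened with `open(path)`, so a downloaded VCF file can be streamed record by record without loading it fully into memory.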

The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions. For data locality, we recommend allocating compute resources in one of these two regions.

USE OF DATASETS IS SUBJECT TO TERMS AND CONDITIONS SET BY THE DATASET OWNERS. SEE THE DETAILS PAGE FOR EACH DATASET FOR APPLICABLE TERMS AND CONDITIONS.

Dataset                      Description
Illumina Platinum Genomes    Illumina Platinum Genomes
Human Reference Genomes      Human Reference Genomes
ClinVar Annotations          ClinVar Annotations
Genome in a Bottle           Genome in a Bottle
SnpEff                       SnpEff: Genomic variant annotations and functional effect prediction toolbox
gnomAD                       gnomAD: Genome Aggregation Database