Azure Open Datasets

04/19/2022

Improve the accuracy of your machine learning models with publicly available datasets. Save time on data discovery and preparation by using curated datasets that are ready to use in machine learning projects.

Transportation

Dataset	Description
TartanAir: AirSim Simulation Dataset	AirSim autonomous vehicle data generated to help solve Simultaneous Localization and Mapping (SLAM) problems.
NYC Taxi & Limousine Commission - yellow taxi trip records	The yellow taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
NYC Taxi & Limousine Commission - green taxi trip records	The green taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records	The For-Hire Vehicle trip records include the dispatching base license number and the pick-up date, time, and taxi zone location ID.

Health and genomics

Dataset	Description
COVID-19 Data Lake	The COVID-19 Data Lake is a collection of COVID-19 related datasets from various sources, covering testing and patient outcome tracking data, social distancing policy, hospital capacity, mobility, and more.
COVID-19 Open Research Dataset	A full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.
Genomics Data Lake	The Genomics Data Lake provides various public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.

Labor and economics

Dataset	Description
US Labor Force Statistics	US Labor Force Statistics provides labor force statistics, labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic groups in the United States.
US National Employment Hours and Earnings	The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States.
US State Employment Hours and Earnings	The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States.
US Local Area Unemployment Statistics	The US Local Area Unemployment Statistics dataset provides monthly and annual employment, unemployment, and labor force data for Census regions and divisions, states, counties, metropolitan areas, and many cities in the United States.
US Consumer Price Index	The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services.
US Producer Price Index - Industry	The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their output.
US Producer Price Index - Commodities	The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their commodities.

Population and safety

Dataset	Description
US Population by County	US population by gender and race for each US county, sourced from the 2000 and 2010 Decennial Census. This dataset is sourced from the United States Census Bureau.
US Population by ZIP Code	US population by gender and race for each US ZIP code, sourced from the 2010 Decennial Census. This dataset is sourced from the United States Census Bureau.
Boston Safety Data	Read data about 311 calls reported to the city of Boston. This dataset is stored in Parquet format and is updated daily.
Chicago Safety Data	Read data about 311 calls reported to the city of Chicago. This dataset is stored in Parquet format and is updated daily.
New York City Safety Data	This dataset contains all New York City 311 service requests from 2010 to the present. It's stored in Parquet format and updated daily.
San Francisco Safety Data	Fire department calls for service and 311 cases in San Francisco. This dataset contains historical records accumulated from 2015 to the present.
Seattle Safety Data	Seattle Fire Department 911 dispatches. This dataset is updated daily and contains historical records accumulated from 2010 to the present.

Supplemental and common datasets

Dataset	Description
Diabetes	The Diabetes dataset has 442 samples with 10 features, making it ideal for getting started with machine learning algorithms.
OJ Sales Simulated Data	This dataset is derived from the Dominick’s OJ dataset and includes extra simulated data with the goal of providing a dataset that makes it easy to simultaneously train thousands of models on Azure Machine Learning.
MNIST database of handwritten digits	The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.
Microsoft News recommendation dataset	Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark dataset for news recommendation, and facilitates research in news recommendation and recommender systems.
Public holidays	Worldwide public holiday data sourced from the PyPI holidays package and Wikipedia, covering 38 countries or regions from 1970 to 2099.
Russian open speech to text	Russian Open STT is a large-scale open speech to text dataset for the Russian language.
