COVID-19 Data Lake

Bing COVID-19 Data

Bing COVID-19 data includes confirmed, fatal, and recovered cases from all regions, updated daily.
This data is reflected in the Bing COVID-19 Tracker.

Bing collects data from multiple trusted, reliable sources, including the World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), national and state public health departments, BNO News, 24/7 Wall St., and Wikipedia.

For more information and original source data see this link. For license terms see this link.

Datasets:
Modified datasets are available in CSV, JSON, JSON-Lines, and Parquet.
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet

All modified datasets have ISO 3166 subdivision codes and load times added, and use lower case column names with underscore separators.

Raw data: https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/bing_covid-19_data/latest/Bing-COVID19-Data.csv

Previous versions of modified and raw data: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/

Data Volume
All datasets are updated daily. As of May 11, 2020, they contained 125,576 rows (CSV 16.1 MB, JSON 40.0 MB, JSONL 39.6 MB, Parquet 1.1 MB).

License and Use Rights; Attribution
This data is available strictly for educational and academic purposes, such as medical research, government agencies, and academic institutions, under terms and conditions available here.

Data used or cited in publications should include an attribution to ‘Bing COVID-19 Tracker’ with a link to www.bing.com/covid.

Contact
For any questions or feedback about this or other datasets in the COVID-19 Data Lake, please contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in | When to use
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.
Azure Databricks | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.
Azure Synapse | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

id updated confirmed deaths iso2 iso3 country_region admin_region_1 iso_subdivision admin_region_2 load_time confirmed_change deaths_change
338995 2020-01-21 262 0 null null Worldwide null null null 6/11/2021 12:05:25 AM
338996 2020-01-22 313 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 51 0
338997 2020-01-23 578 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 265 0
338998 2020-01-24 841 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 263 0
338999 2020-01-25 1320 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 479 0
339000 2020-01-26 2014 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 694 0
339001 2020-01-27 2798 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 784 0
339002 2020-01-28 4593 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 1795 0
339003 2020-01-29 6065 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 1472 0
339004 2020-01-30 7818 0 null null Worldwide null null null 6/11/2021 12:05:25 AM 1753 0
Name | Data type | Unique | Values (sample) | Description
admin_region_1 | string | 864 | Texas, Georgia | Region within country_region
admin_region_2 | string | 3,143 | Washington County, Jefferson County | Region within admin_region_1
confirmed | int | 141,361 | 1, 2 | Confirmed case count for the region
confirmed_change | int | 13,494 | 1, 2 | Change of confirmed case count from the previous day
country_region | string | 240 | United States, India | Country/region
deaths | int | 23,665 | 1, 2 | Death case count for the region
deaths_change | smallint | 2,162 | 1, 2 | Change of death count from the previous day
id | int | 2,022,096 | 17858613, 56589564 | Unique identifier
iso_subdivision | string | 484 | US-TX, US-GA | Two-part ISO subdivision code
iso2 | string | 229 | US, IN | 2 letter country code identifier
iso3 | string | 229 | USA, IND | 3 letter country code identifier
latitude | double | 5,676 | 42.28708, 19.59852 | Latitude of the centroid of the region
load_time | timestamp | 1 | 2021-06-11 00:05:25.474000 | The date and time the file was loaded from the Bing source on GitHub
longitude | double | 5,694 | -2.5396, -155.5186 | Longitude of the centroid of the region
recovered | int | 87,731 | 1, 2 | Recovered count for the region
recovered_change | int | 12,039 | 1, 2 | Change of recovered case count from the previous day
updated | date | 504 | 2021-06-02, 2021-06-04 | The as at date for the record
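
The change columns are described as the difference from the previous day's count. A minimal sketch of checking that relationship, using the first five Worldwide rows shown in the preview; the small DataFrame is typed in by hand for illustration rather than loaded from the dataset, and the name recomputed_change is made up for this example:

```python
import pandas as pd

# First five Worldwide rows copied from the preview table.
# The first day has no previous day, so its change is left missing here.
sample = pd.DataFrame({
    "updated": pd.to_datetime(
        ["2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"]
    ),
    "confirmed": [262, 313, 578, 841, 1320],
    "confirmed_change": [None, 51, 265, 263, 479],
})

# Recompute the day-over-day change from the cumulative count.
sample = sample.sort_values("updated")
sample["recomputed_change"] = sample["confirmed"].diff()

# Compare, skipping the first row (nothing to subtract from).
matches = (
    sample["recomputed_change"].iloc[1:] == sample["confirmed_change"].iloc[1:]
).all()
print(matches)
```

On the full dataset the difference would presumably have to be taken separately for each region rather than over the whole table.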

Azure Notebooks

Language: Python

Download the dataset file using pandas' built-in ability to read from an HTTP URL. Pandas has readers for various file formats:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html (use lines=True for json lines)
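
As a small illustration of the lines=True option mentioned above, the sketch below parses two JSON Lines records from an in-memory string. The record layout (one JSON object per line, keyed by the dataset's column names) is an assumption about the hosted .jsonl file, and the variable names are made up for the example:

```python
import io
import pandas as pd

# Two records in JSON Lines form: one JSON object per line.
# Field names follow the dataset's column names; the exact layout
# of the real .jsonl file is an assumption.
jsonl_text = (
    '{"id": 338995, "updated": "2020-01-21", "confirmed": 262, "country_region": "Worldwide"}\n'
    '{"id": 338996, "updated": "2020-01-22", "confirmed": 313, "country_region": "Worldwide"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
df_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)

# Parse the date column explicitly so the result does not depend on
# pandas' automatic date detection.
df_jsonl["updated"] = pd.to_datetime(df_jsonl["updated"])
print(df_jsonl)
```

To read the hosted file instead, the same call should accept the .jsonl URL listed under Datasets in place of the StringIO object.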

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet")
df.head(10)
Out[1]:
id updated confirmed confirmed_change deaths deaths_change recovered recovered_change latitude longitude iso2 iso3 country_region admin_region_1 iso_subdivision admin_region_2 admin_region_2_code load_time
0 338995 2020-01-21 262.0 0.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
1 338996 2020-01-22 313.0 51.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
2 338997 2020-01-23 578.0 265.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
3 338998 2020-01-24 841.0 263.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
4 338999 2020-01-25 1320.0 479.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
5 339000 2020-01-26 2014.0 694.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
6 339001 2020-01-27 2798.0 784.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
7 339002 2020-01-28 4593.0 1795.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
8 339003 2020-01-29 6065.0 1472.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
9 339004 2020-01-30 7818.0 1753.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629

Let's check the data types of the various fields and verify that the updated column is in datetime format.

In [2]:
df.dtypes
Out[2]:
id                              int32
updated                datetime64[ns]
confirmed                     float64
confirmed_change              float64
deaths                        float64
deaths_change                 float64
recovered                     float64
recovered_change              float64
latitude                      float64
longitude                     float64
iso2                           object
iso3                           object
country_region                 object
admin_region_1                 object
iso_subdivision                object
admin_region_2                 object
admin_region_2_code            object
load_time              datetime64[ns]
dtype: object
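
The same check can be done programmatically instead of by reading the dtypes listing. The sketch below uses a hand-built stand-in for the loaded DataFrame (df_check is a hypothetical name). It also shows one possible way to turn the float64 count columns into whole numbers; that the float64 type comes from missing values being stored as NaN is an assumption, not something the dataset documentation states:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hand-built stand-in for the loaded DataFrame (hypothetical sample).
df_check = pd.DataFrame({
    "updated": pd.to_datetime(["2020-01-21", "2020-01-22"]),
    "confirmed": [262.0, 313.0],
    "recovered": [float("nan"), float("nan")],
})

# Programmatic version of "is the updated column a datetime?"
updated_is_datetime = is_datetime64_any_dtype(df_check["updated"])

# The nullable Int64 dtype keeps whole numbers and still allows
# missing values, unlike the plain int64 dtype.
df_check["confirmed"] = df_check["confirmed"].astype("Int64")
df_check["recovered"] = df_check["recovered"].astype("Int64")
print(updated_is_datetime)
print(df_check.dtypes)
```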

We will now look at the Worldwide data and plot some simple charts to visualize it.

In [3]:
df_Worldwide=df[df['country_region']=='Worldwide']
In [4]:
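# Index the numeric columns by country/region and date. The value columns are
# listed explicitly rather than passing the DataFrame itself as `values`.
value_cols = ['confirmed', 'confirmed_change', 'deaths', 'deaths_change', 'id', 'recovered', 'recovered_change']
df_Worldwide_pivot = df_Worldwide.pivot_table(values=value_cols, index=['country_region', 'updated'])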

df_Worldwide_pivot
Out[4]:
confirmed confirmed_change deaths deaths_change id recovered recovered_change
country_region updated
Worldwide 2020-01-21 262.0 0.0 0.0 0.0 338995 NaN NaN
2020-01-22 313.0 51.0 0.0 0.0 338996 NaN NaN
2020-01-23 578.0 265.0 0.0 0.0 338997 NaN NaN
2020-01-24 841.0 263.0 0.0 0.0 338998 NaN NaN
2020-01-25 1320.0 479.0 0.0 0.0 338999 NaN NaN
2020-01-26 2014.0 694.0 0.0 0.0 339000 NaN NaN
2020-01-27 2798.0 784.0 0.0 0.0 339001 NaN NaN
2020-01-28 4593.0 1795.0 0.0 0.0 339002 NaN NaN
2020-01-29 6065.0 1472.0 0.0 0.0 339003 NaN NaN
2020-01-30 7818.0 1753.0 0.0 0.0 339004 NaN NaN
2020-01-31 9826.0 2008.0 0.0 0.0 339005 NaN NaN
2020-02-01 11953.0 2127.0 0.0 0.0 339006 NaN NaN
2020-02-02 14557.0 2604.0 0.0 0.0 339007 NaN NaN
2020-02-03 17386.0 2829.0 362.0 362.0 339008 NaN NaN
2020-02-04 20625.0 3239.0 426.0 64.0 339009 NaN NaN
2020-02-05 24549.0 3924.0 492.0 66.0 339010 NaN NaN
2020-02-06 28256.0 3707.0 565.0 73.0 339011 NaN NaN
2020-02-07 31420.0 3164.0 638.0 73.0 339012 NaN NaN
2020-02-08 34822.0 3402.0 724.0 86.0 339013 NaN NaN
2020-02-09 37494.0 2672.0 813.0 89.0 339014 NaN NaN
2020-02-10 40484.0 2990.0 910.0 97.0 339015 NaN NaN
2020-02-11 42968.0 2484.0 1018.0 108.0 339016 NaN NaN
2020-02-12 44996.0 2028.0 1115.0 97.0 339017 NaN NaN
2020-02-13 46823.0 1827.0 1369.0 254.0 339018 NaN NaN
2020-02-14 64219.0 17396.0 1383.0 14.0 339019 NaN NaN
2020-02-15 66884.0 2665.0 1526.0 143.0 339020 NaN NaN
2020-02-16 68912.0 2028.0 1669.0 143.0 339021 NaN NaN
2020-02-17 70975.0 2063.0 1775.0 106.0 339022 NaN NaN
2020-02-18 72778.0 1803.0 1873.0 98.0 339023 NaN NaN
2020-02-19 75204.0 2426.0 2009.0 136.0 339024 NaN NaN
... ... ... ... ... ... ... ...
2020-04-06 1341907.0 69792.0 74565.0 5256.0 4013596 276259.0 16247.0
2020-04-07 1426096.0 84189.0 81259.0 6694.0 4309179 300054.0 23795.0
2020-04-08 1504971.0 78875.0 87984.0 6725.0 4551545 328661.0 28607.0
2020-04-09 1587209.0 82238.0 94850.0 6866.0 4728946 353291.0 24630.0
2020-04-10 1684833.0 97624.0 102136.0 7286.0 4898837 375499.0 22208.0
2020-04-11 1764622.0 79789.0 107904.0 5768.0 5310451 385999.0 10500.0
2020-04-12 1840093.0 75471.0 113672.0 5768.0 5124486 421372.0 35373.0
2020-04-13 1912923.0 72830.0 118966.0 5294.0 5261547 448053.0 26681.0
2020-04-14 1970879.0 57956.0 125678.0 6712.0 5305989 472948.0 24895.0
2020-04-15 2056055.0 85176.0 133572.0 7894.0 5347997 511019.0 38071.0
2020-04-16 2151199.0 95144.0 143725.0 10153.0 5429132 541501.0 30482.0
2020-04-17 2234109.0 82910.0 153379.0 9654.0 5440228 567695.0 26194.0
2020-04-18 2310572.0 76463.0 158691.0 5312.0 5448942 590682.0 22987.0
2020-04-19 2394291.0 83719.0 164938.0 6247.0 5457903 611880.0 21198.0
2020-04-20 2470410.0 76119.0 169794.0 4856.0 5467891 645335.0 33455.0
2020-04-21 2560504.0 90094.0 176926.0 7132.0 5495212 679793.0 34458.0
2020-04-22 2628894.0 68390.0 182992.0 6066.0 5562397 709050.0 29257.0
2020-04-23 2699338.0 70444.0 188437.0 5445.0 6650426 737735.0 28685.0
2020-04-24 2790986.0 91648.0 195920.0 7483.0 6906714 781382.0 43647.0
2020-04-25 2868539.0 77553.0 201502.0 5582.0 6955177 811660.0 30278.0
2020-04-26 2965363.0 96824.0 206265.0 4763.0 7002306 863464.0 51804.0
2020-04-27 3002303.0 36940.0 208131.0 1866.0 7055144 878813.0 15349.0
2020-04-28 3083467.0 81164.0 213824.0 5693.0 7098114 915988.0 37175.0
2020-04-29 3170335.0 86868.0 224708.0 10884.0 7522181 958353.0 42365.0
2020-04-30 3249022.0 78687.0 230804.0 6096.0 7997577 1006112.0 47759.0
2020-05-01 3303296.0 54274.0 235290.0 4486.0 8478160 1039588.0 33476.0
2020-05-02 3419184.0 115888.0 243355.0 8065.0 8922151 1092644.0 53056.0
2020-05-03 3502126.0 82942.0 247107.0 3752.0 9388541 1124127.0 31483.0
2020-05-04 3578301.0 76175.0 251059.0 3952.0 9863005 1162279.0 38152.0
2020-05-05 3659271.0 80970.0 256736.0 5677.0 10308709 1197340.0 35061.0

106 rows × 7 columns

In [5]:
df_Worldwide.plot(kind='line',x='updated',y="confirmed",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="confirmed_change",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths_change",grid=True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf154a8198>
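
The daily change series is noisy, so a smoothed version can be easier to read in a chart. This is an optional sketch, not part of the original notebook: it computes a 7-day rolling mean over seven consecutive Worldwide rows copied from the pivot output above. The column name confirmed_change_7d is made up for the example:

```python
import pandas as pd

# Seven consecutive Worldwide rows (2020-04-06 to 2020-04-12) copied
# from the pivot output above.
week = pd.DataFrame({
    "updated": pd.date_range("2020-04-06", periods=7, freq="D"),
    "confirmed_change": [69792, 84189, 78875, 82238, 97624, 79789, 75471],
})

# 7-day rolling mean. The first six rows are NaN because the window
# is not yet full.
week["confirmed_change_7d"] = week["confirmed_change"].rolling(window=7).mean()
print(week.tail(1))
```

Applied to df_Worldwide, the smoothed column could be passed to the same plot call as y.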

Azure Databricks

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark. Evaluation is lazy, so no data is loaded yet.
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
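
The wasbs:// path built in the cells above and the HTTPS URL listed under Datasets point at the same blob. The sketch below shows how the two forms relate, using plain Python with no Spark session required; blob_paths is a hypothetical helper written for this illustration, not part of any Azure library:

```python
def blob_paths(account: str, container: str, relative_path: str) -> dict:
    """Return the wasbs:// and https:// forms of a blob location."""
    host = f"{account}.blob.core.windows.net"
    return {
        # Form used by the Spark cells: container@host/path
        "wasbs": f"wasbs://{container}@{host}/{relative_path}",
        # Form used by pandas: host/container/path
        "https": f"https://{host}/{container}/{relative_path}",
    }

paths = blob_paths(
    "pandemicdatalake",
    "public",
    "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet",
)
print(paths["wasbs"])
print(paths["https"])
```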

Azure Synapse

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark. Evaluation is lazy, so no data is loaded yet.
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))