Skip Navigation

COVID-19 Data Lake

European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases

COVID-19 Pandemic Data Lake ECDC

The latest available public data on geographic distribution of COVID-19 cases worldwide from the European Center for Disease Prevention and Control (ECDC). Each row/entry contains the number of new cases reported per day and per country or region.

For more information about this dataset, see here.

Datasets:
Modified versions of the dataset are available in CSV, JSON, JSON-Lines, and Parquet, updated daily:
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet

All modified versions have iso_country_region codes and load times added, and use lower case column names with underscore separators.

Raw data:
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/ecdc_cases/latest/ECDCCases.csv

Previous versions of modified and raw data:
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/ecdc_cases/

Data Volume
As of May 28, 2020 they contained 19,876 rows (CSV 1.5 MB, JSON 4.9 MB, JSONL 4.9 MB, Parquet 54.1 KB).

Data Source
Raw data is ingested daily from the ECDC csv file here. For more information on this dataset, including its origins, see here. ECDC is the original source of this data.

Data Quality
The ECDC does not guarantee the accuracy or timeliness of the data. Please read the disclaimer here.

License and Use Rights; Attribution
This data is made available and may be used as permitted under the ECDC copyright policy here. For any documents where the copyright lies with a third party, permission for reproduction must be obtained from the copyright holder.

ECDC must always be acknowledged as the original source of this data. Such acknowledgement must be included in each copy of the material.

Contact
For any questions or feedback about this or other datasets in the COVID-19 Data Lake, please contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.

Access

Available inWhen to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Synapse

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

date_rep day month year cases deaths countries_and_territories geo_id country_territory_code continent_exp load_date iso_country
2020-12-01 1 12 2020 272 11 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-30 30 11 2020 0 0 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-29 29 11 2020 228 11 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-28 28 11 2020 214 15 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-27 27 11 2020 0 0 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-26 26 11 2020 200 12 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-25 25 11 2020 185 13 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-24 24 11 2020 246 17 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-23 23 11 2020 252 8 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
2020-11-22 22 11 2020 154 12 Afghanistan AF AFG Asia 12/2/2020 12:06:38 AM AF
Name Data type Unique Values (sample) Description
cases smallint 5,156 1
2

Number of reported cases

continent_exp string 6 Europe
Africa

Continent name

countries_and_territories string 214 Belgium
Mexico

Country or territory name

country_territory_code string 213 NOR
ITA

Three letter country or territory code

date_rep date 337 2020-11-18
2020-11-27

Date of the report

day smallint 31 1
29

Day of month

deaths smallint 1,016 1
2

Number of reported deaths

geo_id string 214 TW
CN

Geo identifier

iso_country string 214 AU
FR

ISO 3166 country or region code

load_date timestamp 1 2020-12-02 00:06:38.339000

Date the data was loaded to Azure

month smallint 12 10
7

Month number

year smallint 2 2020
2019

Year

Select your preferred service:

Azure Notebooks

Azure Databricks

Azure Synapse

Azure Notebooks

Package: Language: Python

This notebook documents the URLs and sample code to access the European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases dataset URLs of different dataset file formats hosted on Azure Blob Storage:

CSV: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv

JSON: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.json

JSONL: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl

Parquet: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet

Download the dataset file using the built-in capability download from a http URL in Pandas. Pandas has readers for various file formats:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html (use lines=True for json lines)

Start by loading the dataset file into a pandas dataframe and view some sample rows

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet")
df.head(10)
Out[1]:
date_rep day month year cases deaths countries_and_territories geo_id country_territory_code pop_data_2018 continent_exp load_date iso_country
0 2020-06-19 19 6 2020 658.0 42.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
1 2020-06-18 18 6 2020 564.0 13.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
2 2020-06-17 17 6 2020 783.0 13.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
3 2020-06-16 16 6 2020 761.0 7.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
4 2020-06-15 15 6 2020 664.0 20.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
5 2020-06-14 14 6 2020 556.0 5.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
6 2020-06-13 13 6 2020 656.0 20.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
7 2020-06-12 12 6 2020 747.0 21.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
8 2020-06-11 11 6 2020 684.0 21.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
9 2020-06-10 10 6 2020 542.0 15.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF

Lets check the data types of the various fields and verify that the updated column is datettime format

In [2]:
df.dtypes
Out[2]:
date_rep                             object
day                                   int16
month                                 int16
year                                  int16
cases                               float64
deaths                              float64
countries_and_territories            object
geo_id                               object
country_territory_code               object
pop_data_2018                       float64
continent_exp                        object
load_date                    datetime64[ns]
iso_country                          object
dtype: object

This dataset contains data for different countries around the world. Lets verify which countries we have data for.

We will start by looking at the latest data for each available country:

In [3]:
df.groupby('countries_and_territories').first().filter(['continent_exp','cases', 'deaths','date_rep'])
Out[3]:
continent_exp cases deaths date_rep
countries_and_territories
Afghanistan Asia 658.0 42.0 2020-06-19
Albania Europe 66.0 1.0 2020-06-19
Algeria Africa 117.0 12.0 2020-06-19
Andorra Europe 1.0 0.0 2020-06-19
Angola Africa 0.0 0.0 2020-06-19
Anguilla America 0.0 0.0 2020-06-19
Antigua_and_Barbuda America 0.0 0.0 2020-06-19
Argentina America 0.0 16.0 2020-06-19
Armenia Europe 665.0 7.0 2020-06-19
Aruba America 0.0 0.0 2020-06-19
Australia Oceania 21.0 0.0 2020-06-19
Austria Europe 40.0 1.0 2020-06-19
Azerbaijan Europe 338.0 6.0 2020-06-19
Bahamas America 0.0 0.0 2020-06-19
Bahrain Asia 469.0 6.0 2020-06-19
Bangladesh Asia 3803.0 38.0 2020-06-19
Barbados America 0.0 0.0 2020-06-19
Belarus Europe 625.0 7.0 2020-06-19
Belgium Europe 104.0 8.0 2020-06-19
Belize America 0.0 0.0 2020-06-19
Benin Africa 25.0 2.0 2020-06-19
Bermuda America 0.0 0.0 2020-06-19
Bhutan Asia 0.0 0.0 2020-06-19
Bolivia America 814.0 18.0 2020-06-19
Bonaire, Saint Eustatius and Saba America 0.0 0.0 2020-06-19
Bosnia_and_Herzegovina Europe 33.0 0.0 2020-06-19
Botswana Africa 19.0 0.0 2020-06-19
Brazil America 22765.0 1238.0 2020-06-19
British_Virgin_Islands America 0.0 0.0 2020-06-19
Brunei_Darussalam Asia 0.0 0.0 2020-06-19
... ... ... ... ...
Spain Europe 585.0 0.0 2020-06-18
Sri_Lanka Asia 23.0 0.0 2020-06-19
Sudan Africa 0.0 0.0 2020-06-19
Suriname America 16.0 1.0 2020-06-19
Sweden Europe 1481.0 12.0 2020-06-19
Switzerland Europe 17.0 0.0 2020-06-19
Syria Asia 9.0 0.0 2020-06-19
Taiwan Asia 1.0 0.0 2020-06-19
Tajikistan Asia 58.0 0.0 2020-06-19
Thailand Asia 0.0 0.0 2020-06-19
Timor_Leste Asia 0.0 0.0 2020-06-19
Togo Africa 3.0 0.0 2020-06-19
Trinidad_and_Tobago America 0.0 0.0 2020-06-19
Tunisia Africa 4.0 0.0 2020-06-19
Turkey Asia 1304.0 21.0 2020-06-19
Uganda Africa 9.0 0.0 2020-06-19
Ukraine Europe 829.0 23.0 2020-06-19
United_Arab_Emirates Asia 388.0 3.0 2020-06-19
United_Kingdom Europe 1218.0 135.0 2020-06-19
United_Republic_of_Tanzania Africa 0.0 0.0 2020-06-19
United_States_Virgin_Islands America 0.0 0.0 2020-06-19
United_States_of_America America 27762.0 717.0 2020-06-19
Uruguay America 1.0 0.0 2020-06-19
Uzbekistan Asia 158.0 0.0 2020-06-19
Venezuela America 334.0 1.0 2020-06-19
Vietnam Asia 7.0 0.0 2020-06-19
Western_Sahara Africa 0.0 0.0 2020-06-19
Yemen Asia 4.0 4.0 2020-06-19
Zambia Africa 4.0 0.0 2020-06-19
Zimbabwe Africa 62.0 0.0 2020-06-19

206 rows × 4 columns

Next, we will do some aggregations to make sure columns such as condirmed cases and deaths tally with the latest data. You should see that confirmed and death numbers for latest date in the above table match with the aggregation of cases and deaths.

In [4]:
df.groupby('continent_exp').agg({'countries_and_territories': 'count','cases': 'count','deaths': 'count'})
Out[4]:
countries_and_territories cases deaths
continent_exp
Africa 5191 5191 5191
America 4839 4827 4839
Asia 5861 5861 5861
Europe 7371 7371 7370
Oceania 876 876 876

Now we color code the chart for presentation

In [5]:
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt

df.loc[: , ['countries_and_territories', 'cases', 'deaths']].groupby(['countries_and_territories'
         ]).max().sort_values(by='cases',ascending=False).reset_index()[:15].style.background_gradient(cmap='rainbow')
Out[5]:
countries_and_territories cases deaths
0 United_States_of_America 32425 4928
1 Brazil 32188 1473
2 China 15141 1290
3 India 13586 2003
4 Russia 11656 312
5 Ecuador 11536 410
6 Spain 9181 950
7 Peru 8875 260
8 United_Kingdom 8719 1172
9 France 7578 2004
10 Chile 6938 649
11 Pakistan 6825 136
12 Italy 6557 971
13 Germany 6294 315
14 Mexico 5662 1091

Now we will use a line graph to plot some data for presentation

In [6]:
df_Worldwide=df[df['countries_and_territories']=='United_States_of_America']
In [7]:
df.plot(kind='line',x='date_rep',y="cases",grid=True)
df.plot(kind='line',x='date_rep',y="deaths",grid=True)
#df_Worldwide.plot(kind='line',x='date_rep',y="confirmed_change",grid=True)
#df_Worldwide.plot(kind='line',x='date_rep',y="deaths_change",grid=True)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f40b19d59e8>
In [ ]:
 

Azure Databricks

Package: Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# SPARK read parquet, note that it won't load any data yet by now
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

Azure Synapse

Package: Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# SPARK read parquet, note that it won't load any data yet by now
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))

Analyze COVID data with Synapse SQL serverless endpoint

How Azure Synapse Analytics helps you analyze COVID data with Synapse SQL serverless endpoint

Linear regression analysis on COVID data using SQL endpoint in Azure Synapse Analytics

Linear regression analysis on COVID data using SQL endpoint in Azure Synapse Analytics