
COVID-19 Data Lake

European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases


The latest available public data on the geographic distribution of COVID-19 worldwide from the European Centre for Disease Prevention and Control (ECDC). Each row/entry contains the number of new cases reported per day and per country or territory.

See here for more information about this dataset.

Datasets:
Modified versions of the dataset are available in CSV, JSON, JSON-Lines, and Parquet, and are updated daily:
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet

All modified versions have iso_country_region codes and load times added, and use lower-case column names with underscore separators.

Raw data:
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/ecdc_cases/latest/ECDCCases.csv

Previous versions of the modified and raw data:
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/ecdc_cases/
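
As a minimal sketch (not part of the original page), earlier snapshots could be enumerated with the azure-storage-blob package, assuming anonymous read access to the public container and that historical files sit under the two prefixes above:

# Sketch: list historical ECDC snapshots in the public pandemic data lake container.
# Assumes anonymous (public) read access; requires the azure-storage-blob package.
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://pandemicdatalake.blob.core.windows.net",
    container_name="public",
)

# Enumerate curated (modified) snapshots; use the prefix "raw/covid-19/ecdc_cases/" for raw data.
for blob in container.list_blobs(name_starts_with="curated/covid-19/ecdc_cases/"):
    print(blob.name, blob.size)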

Data volume
As of May 28, 2020, the dataset contained 19,876 rows (CSV 1.5 MB, JSON 4.9 MB, JSONL 4.9 MB, Parquet 54.1 kB).

Data source
Raw data is ingested daily from the ECDC CSV file here. See here for more information about this dataset, including its provenance. ECDC is the original source of this data.

Data quality
ECDC does not guarantee the accuracy or timeliness of the data. Read the disclaimer here.

License, use rights, and attribution
This data is made available and may be used as permitted under ECDC's copyright policy here. For documents where the copyright lies with a third party, permission for reproduction must be obtained from the copyright holder.

ECDC must always be acknowledged as the original source of the data. Such acknowledgement must be included in each copy of the material.

Contact
For any questions or feedback about this or other datasets in the COVID-19 Data Lake, contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in | When to use
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.
Azure Databricks | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.
Azure Synapse | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

date_rep day month year cases deaths countries_and_territories geo_id country_territory_code continent_exp load_date iso_country
2020-12-14 14 12 2020 746 6 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-13 13 12 2020 298 9 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-12 12 12 2020 113 11 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-11 11 12 2020 63 10 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-10 10 12 2020 202 16 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-09 9 12 2020 135 13 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-08 8 12 2020 200 6 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-07 7 12 2020 210 26 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-06 6 12 2020 234 10 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
2020-12-05 5 12 2020 235 18 Afghanistan AF AFG Asia 2/26/2021 12:06:35 AM AF
Name | Data type | Unique | Values (sample) | Description
cases | smallint | 5,515 | 1, 2 | Number of reported cases
continent_exp | string | 6 | Europe, Africa | Name of the continent
countries_and_territories | string | 214 | Iceland, Belgium | Name of the country or territory
country_territory_code | string | 213 | ITA, NOR | Three-letter code for the country or territory
date_rep | date | 350 | 2020-12-03, 2020-11-30 | Date of the report
day | smallint | 31 | 14, 13 | Day of the month
deaths | smallint | 1,049 | 1, 2 | Number of reported deaths
geo_id | string | 214 | SE, JP | Identifier of the geographic area
iso_country | string | 214 | DK, NL | ISO 3166 country or territory code
load_date | timestamp | 1 | 2021-02-26 00:06:35.856000 | Date the data was loaded into Azure
month | smallint | 12 | 10, 7 | Month number
year | smallint | 2 | 2020, 2019 | Year
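
As a quick illustration (not part of the original page), the schema above can guide a direct read of the CSV with pandas; the column names and types below follow the table, while the use of nullable integer types is an assumption to tolerate missing values:

import pandas as pd

# Sketch: read the curated CSV using column types based on the schema table above.
csv_url = "https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv"
df_csv = pd.read_csv(
    csv_url,
    parse_dates=["date_rep", "load_date"],  # date and timestamp columns
    dtype={
        "cases": "Int64",      # nullable integers, in case of missing values
        "deaths": "Int64",
        "day": "Int64",
        "month": "Int64",
        "year": "Int64",
        "countries_and_territories": "string",
        "geo_id": "string",
        "country_territory_code": "string",
        "continent_exp": "string",
        "iso_country": "string",
    },
)
print(df_csv.dtypes)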


Azure Notebooks

Language: Python

This notebook documents the URLs and sample code to access the European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases dataset. URLs of the different dataset file formats hosted on Azure Blob Storage:

CSV: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv

JSON: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.json

JSONL: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl

Parquet: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet

Download the dataset file using pandas' built-in support for reading from an HTTP URL. Pandas has readers for the various file formats (a short sketch follows the list below):

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html (use lines=True for json lines)
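
For example, a minimal sketch (not part of the original notebook) of reading the JSON Lines variant from the public URL listed above:

import pandas as pd

# Sketch: read the JSON Lines file; lines=True treats each line as a separate record.
jsonl_url = "https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl"
df_jsonl = pd.read_json(jsonl_url, lines=True)
print(df_jsonl.shape)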

Start by loading the dataset file into a pandas DataFrame and viewing some sample rows.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Read the curated Parquet file directly from Azure Blob Storage into a DataFrame
df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet")
df.head(10)
Out[1]:
date_rep day month year cases deaths countries_and_territories geo_id country_territory_code pop_data_2018 continent_exp load_date iso_country
0 2020-06-19 19 6 2020 658.0 42.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
1 2020-06-18 18 6 2020 564.0 13.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
2 2020-06-17 17 6 2020 783.0 13.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
3 2020-06-16 16 6 2020 761.0 7.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
4 2020-06-15 15 6 2020 664.0 20.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
5 2020-06-14 14 6 2020 556.0 5.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
6 2020-06-13 13 6 2020 656.0 20.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
7 2020-06-12 12 6 2020 747.0 21.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
8 2020-06-11 11 6 2020 684.0 21.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF
9 2020-06-10 10 6 2020 542.0 15.0 Afghanistan AF AFG NaN Asia 2020-06-20 00:05:37.480 AF

Let's check the data types of the various fields and verify that the load_date column is in datetime format.

In [2]:
df.dtypes
Out[2]:
date_rep                             object
day                                   int16
month                                 int16
year                                  int16
cases                               float64
deaths                              float64
countries_and_territories            object
geo_id                               object
country_territory_code               object
pop_data_2018                       float64
continent_exp                        object
load_date                    datetime64[ns]
iso_country                          object
dtype: object

This dataset contains data for different countries around the world. Let's verify which countries we have data for.

We will start by looking at the latest data for each available country (the rows are ordered with the most recent report first, so first() returns the latest entry per country):

In [3]:
df.groupby('countries_and_territories').first().filter(['continent_exp','cases', 'deaths','date_rep'])
Out[3]:
continent_exp cases deaths date_rep
countries_and_territories
Afghanistan Asia 658.0 42.0 2020-06-19
Albania Europe 66.0 1.0 2020-06-19
Algeria Africa 117.0 12.0 2020-06-19
Andorra Europe 1.0 0.0 2020-06-19
Angola Africa 0.0 0.0 2020-06-19
Anguilla America 0.0 0.0 2020-06-19
Antigua_and_Barbuda America 0.0 0.0 2020-06-19
Argentina America 0.0 16.0 2020-06-19
Armenia Europe 665.0 7.0 2020-06-19
Aruba America 0.0 0.0 2020-06-19
Australia Oceania 21.0 0.0 2020-06-19
Austria Europe 40.0 1.0 2020-06-19
Azerbaijan Europe 338.0 6.0 2020-06-19
Bahamas America 0.0 0.0 2020-06-19
Bahrain Asia 469.0 6.0 2020-06-19
Bangladesh Asia 3803.0 38.0 2020-06-19
Barbados America 0.0 0.0 2020-06-19
Belarus Europe 625.0 7.0 2020-06-19
Belgium Europe 104.0 8.0 2020-06-19
Belize America 0.0 0.0 2020-06-19
Benin Africa 25.0 2.0 2020-06-19
Bermuda America 0.0 0.0 2020-06-19
Bhutan Asia 0.0 0.0 2020-06-19
Bolivia America 814.0 18.0 2020-06-19
Bonaire, Saint Eustatius and Saba America 0.0 0.0 2020-06-19
Bosnia_and_Herzegovina Europe 33.0 0.0 2020-06-19
Botswana Africa 19.0 0.0 2020-06-19
Brazil America 22765.0 1238.0 2020-06-19
British_Virgin_Islands America 0.0 0.0 2020-06-19
Brunei_Darussalam Asia 0.0 0.0 2020-06-19
... ... ... ... ...
Spain Europe 585.0 0.0 2020-06-18
Sri_Lanka Asia 23.0 0.0 2020-06-19
Sudan Africa 0.0 0.0 2020-06-19
Suriname America 16.0 1.0 2020-06-19
Sweden Europe 1481.0 12.0 2020-06-19
Switzerland Europe 17.0 0.0 2020-06-19
Syria Asia 9.0 0.0 2020-06-19
Taiwan Asia 1.0 0.0 2020-06-19
Tajikistan Asia 58.0 0.0 2020-06-19
Thailand Asia 0.0 0.0 2020-06-19
Timor_Leste Asia 0.0 0.0 2020-06-19
Togo Africa 3.0 0.0 2020-06-19
Trinidad_and_Tobago America 0.0 0.0 2020-06-19
Tunisia Africa 4.0 0.0 2020-06-19
Turkey Asia 1304.0 21.0 2020-06-19
Uganda Africa 9.0 0.0 2020-06-19
Ukraine Europe 829.0 23.0 2020-06-19
United_Arab_Emirates Asia 388.0 3.0 2020-06-19
United_Kingdom Europe 1218.0 135.0 2020-06-19
United_Republic_of_Tanzania Africa 0.0 0.0 2020-06-19
United_States_Virgin_Islands America 0.0 0.0 2020-06-19
United_States_of_America America 27762.0 717.0 2020-06-19
Uruguay America 1.0 0.0 2020-06-19
Uzbekistan Asia 158.0 0.0 2020-06-19
Venezuela America 334.0 1.0 2020-06-19
Vietnam Asia 7.0 0.0 2020-06-19
Western_Sahara Africa 0.0 0.0 2020-06-19
Yemen Asia 4.0 4.0 2020-06-19
Zambia Africa 4.0 0.0 2020-06-19
Zimbabwe Africa 62.0 0.0 2020-06-19

206 rows × 4 columns

Next, we will do a quick aggregation per continent to check that the cases and deaths columns are populated. For each continent, the counts of cases and deaths should match the count of countries_and_territories rows; a lower count indicates missing values (a short null check follows the aggregation output below).

In [4]:
df.groupby('continent_exp').agg({'countries_and_territories': 'count','cases': 'count','deaths': 'count'})
Out[4]:
countries_and_territories cases deaths
continent_exp
Africa 5191 5191 5191
America 4839 4827 4839
Asia 5861 5861 5861
Europe 7371 7371 7370
Oceania 876 876 876
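
As a follow-up (a sketch, not part of the original notebook), the missing values behind those gaps can be counted and inspected directly:

# Sketch: count missing values per column and inspect the affected rows.
print(df[['cases', 'deaths']].isna().sum())
df[df['cases'].isna() | df['deaths'].isna()].head()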

Now we color-code a table of the 15 countries with the highest single-day case counts for presentation.

In [5]:
# Take each country's maximum single-day cases and deaths, keep the top 15 by cases,
# and apply a color gradient to the styled table.
df.loc[:, ['countries_and_territories', 'cases', 'deaths']].groupby(['countries_and_territories'
         ]).max().sort_values(by='cases', ascending=False).reset_index()[:15].style.background_gradient(cmap='rainbow')
Out[5]:
countries_and_territories cases deaths
0 United_States_of_America 32425 4928
1 Brazil 32188 1473
2 China 15141 1290
3 India 13586 2003
4 Russia 11656 312
5 Ecuador 11536 410
6 Spain 9181 950
7 Peru 8875 260
8 United_Kingdom 8719 1172
9 France 7578 2004
10 Chile 6938 649
11 Pakistan 6825 136
12 Italy 6557 971
13 Germany 6294 315
14 Mexico 5662 1091

Now we will use line graphs to plot the daily cases and deaths for a single country.

In [6]:
# Filter to a single country (the United States) so the time series is not interleaved across countries
df_us = df[df['countries_and_territories'] == 'United_States_of_America']
In [7]:
df_us.plot(kind='line', x='date_rep', y='cases', grid=True)
df_us.plot(kind='line', x='date_rep', y='deaths', grid=True)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f40b19d59e8>
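
To look at the worldwide trend instead (a sketch, not part of the original notebook), the daily figures can be summed across all countries before plotting:

# Sketch: aggregate daily cases and deaths across all countries and plot the worldwide totals.
df_world = df.groupby('date_rep')[['cases', 'deaths']].sum().sort_index()
df_world.plot(kind='line', y='cases', grid=True)
df_world.plot(kind='line', y='deaths', grid=True)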

Azure Databricks

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark; due to lazy evaluation, no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
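
From here (a sketch, not part of the original notebook), the registered temporary view can be queried with Spark SQL, for example to total cases and deaths per continent:

# Sketch: total cases and deaths per continent from the "source" temporary view.
display(spark.sql('''
    SELECT continent_exp,
           SUM(cases)  AS total_cases,
           SUM(deaths) AS total_deaths
    FROM source
    GROUP BY continent_exp
    ORDER BY total_cases DESC
'''))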

Azure Synapse

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark; due to lazy evaluation, no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
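
If further analysis in pandas is preferred (a sketch, not part of the original notebook), a filtered slice of the view can be converted to a pandas DataFrame once it is small enough to collect to the driver:

# Sketch: filter in Spark, then convert the small result to pandas for local analysis.
# geo_id 'NO' (Norway) is used here purely as an illustrative filter value.
pdf = spark.sql("SELECT date_rep, cases, deaths FROM source WHERE geo_id = 'NO'").toPandas()
print(pdf.head())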

Analyze COVID data with Synapse SQL serverless endpoint

How Azure Synapse Analytics helps you analyze COVID data with Synapse SQL serverless endpoint

Linear regression analysis on COVID data using SQL endpoint in Azure Synapse Analytics