Bing COVID-19 Data

Bing COVID-19 data includes confirmed, fatal, and recovered cases from all regions, updated daily.
This data is reflected in the Bing COVID-19 Tracker.

Bing collects data from multiple trusted sources, including the World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), national and state public health departments, BNO News, 24/7 Wall St., and Wikipedia.

For more information and the original source data, see this link. See this link for the license terms.

Datasets:
Modified datasets are available in CSV, JSON, JSON Lines, and Parquet.
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet
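
The four curated URLs above differ only in their file extension, so a small helper can build the URL and pick a matching pandas reader. This is a minimal sketch: `curated_url` and `load_curated` are hypothetical helper names, not part of any Azure SDK. `load_curated` assumes pandas is installed (plus a Parquet engine such as pyarrow for the Parquet file) and that the JSON file is laid out in a way `read_json` accepts by default.

```python
# Base path shared by the four curated files listed above.
BASE_URL = (
    "https://pandemicdatalake.blob.core.windows.net/public/curated/"
    "covid-19/bing_covid-19_data/latest/bing_covid-19_data"
)

SUPPORTED_FORMATS = ("csv", "json", "jsonl", "parquet")


def curated_url(fmt: str) -> str:
    """Return the download URL for one of the curated file formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}.{fmt}"


def load_curated(fmt: str = "parquet"):
    """Download one curated file into a pandas DataFrame (needs network access)."""
    import pandas as pd  # imported lazily so curated_url works without pandas

    url = curated_url(fmt)
    if fmt == "csv":
        return pd.read_csv(url)
    if fmt == "json":
        return pd.read_json(url)
    if fmt == "jsonl":
        return pd.read_json(url, lines=True)  # one JSON object per line
    return pd.read_parquet(url)
```

Calling `load_curated("parquet")` would fetch the smallest of the four files; the other formats carry the same rows.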

All modified datasets have ISO 3166 subdivision codes and load times added, and use lowercase column names with underscore separators.

Raw data: https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/bing_covid-19_data/latest/Bing-COVID19-Data.csv

Previous versions of modified and raw data: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/

Data volume
All datasets are updated daily. As of May 11, 2020, they contained 125,576 rows (CSV 16.1 MB, JSON 40.0 MB, JSONL 39.6 MB, Parquet 1.1 MB).
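
The sizes quoted above hint at why the notebook examples further down read the Parquet file. A quick back-of-the-envelope calculation, using only the figures from the May 11, 2020 snapshot (the variable names are illustrative):

```python
# File sizes in MB as quoted for the May 11, 2020 snapshot.
sizes_mb = {"csv": 16.1, "json": 40.0, "jsonl": 39.6, "parquet": 1.1}
rows = 125_576

# How many times larger each format is than the Parquet file.
ratio_to_parquet = {fmt: size / sizes_mb["parquet"] for fmt, size in sizes_mb.items()}

# Approximate bytes per row for each format (1 MB taken as 10**6 bytes).
bytes_per_row = {fmt: size * 1_000_000 / rows for fmt, size in sizes_mb.items()}

for fmt in sizes_mb:
    print(f"{fmt:8s} {ratio_to_parquet[fmt]:5.1f}x Parquet, {bytes_per_row[fmt]:6.1f} bytes/row")
```

On these figures the CSV is roughly 15 times, and the JSON files roughly 36 times, the size of the Parquet file.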

License and Use Rights Attribution
This data is available strictly for educational and academic purposes, such as medical research, government agencies, and academic institutions, under the terms found here.

Data used or cited in publications should include an attribution to "Bing COVID-19 Tracker" with a link to www.bing.com/covid.

Contact
For any questions or feedback about this or other datasets in the COVID-19 Data Lake, please contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, OR PUNITIVE DAMAGES, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in | When to use
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.
Azure Databricks | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.
Azure Synapse | Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

id | updated | confirmed | deaths | iso2 | iso3 | country_region | admin_region_1 | iso_subdivision | admin_region_2 | load_time | confirmed_change | deaths_change
338995 | 2020-01-21 | 262 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | |
338996 | 2020-01-22 | 313 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 51 | 0
338997 | 2020-01-23 | 578 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 265 | 0
338998 | 2020-01-24 | 841 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 263 | 0
338999 | 2020-01-25 | 1320 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 479 | 0
339000 | 2020-01-26 | 2014 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 694 | 0
339001 | 2020-01-27 | 2798 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 784 | 0
339002 | 2020-01-28 | 4593 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 1795 | 0
339003 | 2020-01-29 | 6065 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 1472 | 0
339004 | 2020-01-30 | 7818 | 0 | null | null | Worldwide | null | null | null | 2/26/2021 12:06:48 AM | 1753 | 0
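
In the preview rows, each `confirmed_change` value equals that day's `confirmed` count minus the previous day's. The sketch below checks this against the ten Worldwide rows shown above, using plain Python; the variable names are illustrative.

```python
# Cumulative confirmed counts for Worldwide, 2020-01-21 through 2020-01-30 (from the preview).
confirmed = [262, 313, 578, 841, 1320, 2014, 2798, 4593, 6065, 7818]

# confirmed_change values from the preview; the first row shows no value.
confirmed_change = [None, 51, 265, 263, 479, 694, 784, 1795, 1472, 1753]

# Day-over-day difference of the cumulative series.
computed = [today - yesterday for yesterday, today in zip(confirmed, confirmed[1:])]

# Every published change matches the recomputed difference.
assert computed == confirmed_change[1:]
print(computed)
```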
Name | Data type | Unique | Values (sample) | Description
admin_region_1 | string | 864 | Texas, Georgia | Region within country_region
admin_region_2 | string | 3,142 | Washington County, Jefferson County | Region within admin_region_1
confirmed | int | 98,923 | 1, 2 | Confirmed case count for the region
confirmed_change | int | 10,766 | 1, 2 | Change in confirmed case count from the previous day
country_region | string | 236 | United States, India | Country/region
deaths | int | 17,557 | 1, 2 | Death count for the region
deaths_change | smallint | 1,775 | 1, 2 | Change in death count from the previous day
id | int | 1,535,512 | 18285380, 93516479 | Unique identifier
iso_subdivision | string | 484 | US-TX, US-GA | Two-part ISO subdivision code
iso2 | string | 226 | US, IN | Two-letter country code identifier
iso3 | string | 226 | USA, IND | Three-letter country code identifier
latitude | double | 5,668 | 42.28708, 19.59852 | Latitude of the centroid of the region
load_time | timestamp | 1 | 2021-02-26 00:06:48.988000 | Date and time the file was loaded from the Bing source on GitHub
longitude | double | 5,686 | -2.5396, -155.5186 | Longitude of the centroid of the region
recovered | int | 60,071 | 1, 2 | Recovered count for the region
recovered_change | int | 9,116 | 1, 2 | Change in recovered count from the previous day
updated | date | 401 | 2021-02-04, 2021-02-07 | The as-at date for the record
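
The `iso_subdivision` samples (`US-TX`, `US-GA`) follow the ISO 3166-2 pattern: a two-letter country code, a hyphen, and a subdivision part of up to three alphanumeric characters. A minimal sketch of splitting and validating such a code follows; `split_iso_subdivision` is a hypothetical helper, not something the dataset ships.

```python
import re
from typing import Tuple

# ISO 3166-2: two-letter country code, hyphen, 1-3 alphanumeric characters.
_ISO_3166_2 = re.compile(r"([A-Z]{2})-([A-Z0-9]{1,3})")


def split_iso_subdivision(code: str) -> Tuple[str, str]:
    """Split a code such as 'US-TX' into (country, subdivision)."""
    match = _ISO_3166_2.fullmatch(code)
    if match is None:
        raise ValueError(f"not an ISO 3166-2 style code: {code!r}")
    return match.group(1), match.group(2)


country, subdivision = split_iso_subdivision("US-TX")
print(country, subdivision)  # prints: US TX
```

For rows in this dataset, the country part would presumably match the `iso2` column, which makes the helper usable as a simple consistency check.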

Azure Notebooks

Language: Python

Download the dataset file using pandas' built-in ability to read directly from an HTTP URL. pandas has readers for various file formats:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html (use lines=True for JSON Lines)
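
The readers accept file-like objects as well as URLs, which makes it easy to try their options on a tiny sample before downloading the full files. The sketch below uses two hand-typed in-memory snippets shaped like the first preview rows (only three of the columns are included); it assumes pandas is installed.

```python
import io

import pandas as pd

# Tiny in-memory samples with the same column names as the dataset.
csv_text = "id,updated,confirmed\n338995,2020-01-21,262\n338996,2020-01-22,313\n"
jsonl_text = (
    '{"id": 338995, "updated": "2020-01-21", "confirmed": 262}\n'
    '{"id": 338996, "updated": "2020-01-22", "confirmed": 313}\n'
)

# read_csv leaves dates as strings unless parse_dates is given.
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["updated"])

# lines=True tells read_json to parse one JSON object per line (JSON Lines).
df_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(df_csv.dtypes)
print(df_jsonl)
```

The same keyword arguments apply when the first argument is one of the dataset URLs instead of a `StringIO` buffer.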

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet")
df.head(10)
Out[1]:
id updated confirmed confirmed_change deaths deaths_change recovered recovered_change latitude longitude iso2 iso3 country_region admin_region_1 iso_subdivision admin_region_2 admin_region_2_code load_time
0 338995 2020-01-21 262.0 0.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
1 338996 2020-01-22 313.0 51.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
2 338997 2020-01-23 578.0 265.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
3 338998 2020-01-24 841.0 263.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
4 338999 2020-01-25 1320.0 479.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
5 339000 2020-01-26 2014.0 694.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
6 339001 2020-01-27 2798.0 784.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
7 339002 2020-01-28 4593.0 1795.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
8 339003 2020-01-29 6065.0 1472.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
9 339004 2020-01-30 7818.0 1753.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629

Let's check the data types of the various fields and verify that the updated column is in datetime format.

In [2]:
df.dtypes
Out[2]:
id                              int32
updated                datetime64[ns]
confirmed                     float64
confirmed_change              float64
deaths                        float64
deaths_change                 float64
recovered                     float64
recovered_change              float64
latitude                      float64
longitude                     float64
iso2                           object
iso3                           object
country_region                 object
admin_region_1                 object
iso_subdivision                object
admin_region_2                 object
admin_region_2_code            object
load_time              datetime64[ns]
dtype: object
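
The schema lists `confirmed`, `deaths`, and the other counts as integers, yet pandas reports `float64` for them. This is most likely the usual pandas behavior when an integer column contains missing values, since NumPy integer arrays cannot hold `NaN`. The sketch below shows the effect and pandas' nullable `Int64` dtype as one possible remedy, using illustrative values and assuming pandas 1.0 or later.

```python
import pandas as pd

# A count column with a missing value is stored as float64.
recovered = pd.Series([276259, None, 300054], name="recovered")
print(recovered.dtype)  # float64

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>.
recovered_int = recovered.astype("Int64")
print(recovered_int.dtype)  # Int64
print(recovered_int.isna().sum())  # number of missing entries
```

Whether the conversion is worthwhile depends on the analysis; for plotting, the `float64` columns work as they are.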

We will now look at the Worldwide data and plot some simple charts to visualize it.

In [3]:
df_Worldwide=df[df['country_region']=='Worldwide']
In [4]:
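# Average the numeric columns for each (country_region, updated) pair.
value_cols = ['confirmed', 'confirmed_change', 'deaths', 'deaths_change', 'id', 'recovered', 'recovered_change']
df_Worldwide_pivot = df_Worldwide.pivot_table(values=value_cols, index=['country_region', 'updated'])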

df_Worldwide_pivot
Out[4]:
confirmed confirmed_change deaths deaths_change id recovered recovered_change
country_region updated
Worldwide 2020-01-21 262.0 0.0 0.0 0.0 338995 NaN NaN
2020-01-22 313.0 51.0 0.0 0.0 338996 NaN NaN
2020-01-23 578.0 265.0 0.0 0.0 338997 NaN NaN
2020-01-24 841.0 263.0 0.0 0.0 338998 NaN NaN
2020-01-25 1320.0 479.0 0.0 0.0 338999 NaN NaN
2020-01-26 2014.0 694.0 0.0 0.0 339000 NaN NaN
2020-01-27 2798.0 784.0 0.0 0.0 339001 NaN NaN
2020-01-28 4593.0 1795.0 0.0 0.0 339002 NaN NaN
2020-01-29 6065.0 1472.0 0.0 0.0 339003 NaN NaN
2020-01-30 7818.0 1753.0 0.0 0.0 339004 NaN NaN
2020-01-31 9826.0 2008.0 0.0 0.0 339005 NaN NaN
2020-02-01 11953.0 2127.0 0.0 0.0 339006 NaN NaN
2020-02-02 14557.0 2604.0 0.0 0.0 339007 NaN NaN
2020-02-03 17386.0 2829.0 362.0 362.0 339008 NaN NaN
2020-02-04 20625.0 3239.0 426.0 64.0 339009 NaN NaN
2020-02-05 24549.0 3924.0 492.0 66.0 339010 NaN NaN
2020-02-06 28256.0 3707.0 565.0 73.0 339011 NaN NaN
2020-02-07 31420.0 3164.0 638.0 73.0 339012 NaN NaN
2020-02-08 34822.0 3402.0 724.0 86.0 339013 NaN NaN
2020-02-09 37494.0 2672.0 813.0 89.0 339014 NaN NaN
2020-02-10 40484.0 2990.0 910.0 97.0 339015 NaN NaN
2020-02-11 42968.0 2484.0 1018.0 108.0 339016 NaN NaN
2020-02-12 44996.0 2028.0 1115.0 97.0 339017 NaN NaN
2020-02-13 46823.0 1827.0 1369.0 254.0 339018 NaN NaN
2020-02-14 64219.0 17396.0 1383.0 14.0 339019 NaN NaN
2020-02-15 66884.0 2665.0 1526.0 143.0 339020 NaN NaN
2020-02-16 68912.0 2028.0 1669.0 143.0 339021 NaN NaN
2020-02-17 70975.0 2063.0 1775.0 106.0 339022 NaN NaN
2020-02-18 72778.0 1803.0 1873.0 98.0 339023 NaN NaN
2020-02-19 75204.0 2426.0 2009.0 136.0 339024 NaN NaN
... ... ... ... ... ... ... ...
2020-04-06 1341907.0 69792.0 74565.0 5256.0 4013596 276259.0 16247.0
2020-04-07 1426096.0 84189.0 81259.0 6694.0 4309179 300054.0 23795.0
2020-04-08 1504971.0 78875.0 87984.0 6725.0 4551545 328661.0 28607.0
2020-04-09 1587209.0 82238.0 94850.0 6866.0 4728946 353291.0 24630.0
2020-04-10 1684833.0 97624.0 102136.0 7286.0 4898837 375499.0 22208.0
2020-04-11 1764622.0 79789.0 107904.0 5768.0 5310451 385999.0 10500.0
2020-04-12 1840093.0 75471.0 113672.0 5768.0 5124486 421372.0 35373.0
2020-04-13 1912923.0 72830.0 118966.0 5294.0 5261547 448053.0 26681.0
2020-04-14 1970879.0 57956.0 125678.0 6712.0 5305989 472948.0 24895.0
2020-04-15 2056055.0 85176.0 133572.0 7894.0 5347997 511019.0 38071.0
2020-04-16 2151199.0 95144.0 143725.0 10153.0 5429132 541501.0 30482.0
2020-04-17 2234109.0 82910.0 153379.0 9654.0 5440228 567695.0 26194.0
2020-04-18 2310572.0 76463.0 158691.0 5312.0 5448942 590682.0 22987.0
2020-04-19 2394291.0 83719.0 164938.0 6247.0 5457903 611880.0 21198.0
2020-04-20 2470410.0 76119.0 169794.0 4856.0 5467891 645335.0 33455.0
2020-04-21 2560504.0 90094.0 176926.0 7132.0 5495212 679793.0 34458.0
2020-04-22 2628894.0 68390.0 182992.0 6066.0 5562397 709050.0 29257.0
2020-04-23 2699338.0 70444.0 188437.0 5445.0 6650426 737735.0 28685.0
2020-04-24 2790986.0 91648.0 195920.0 7483.0 6906714 781382.0 43647.0
2020-04-25 2868539.0 77553.0 201502.0 5582.0 6955177 811660.0 30278.0
2020-04-26 2965363.0 96824.0 206265.0 4763.0 7002306 863464.0 51804.0
2020-04-27 3002303.0 36940.0 208131.0 1866.0 7055144 878813.0 15349.0
2020-04-28 3083467.0 81164.0 213824.0 5693.0 7098114 915988.0 37175.0
2020-04-29 3170335.0 86868.0 224708.0 10884.0 7522181 958353.0 42365.0
2020-04-30 3249022.0 78687.0 230804.0 6096.0 7997577 1006112.0 47759.0
2020-05-01 3303296.0 54274.0 235290.0 4486.0 8478160 1039588.0 33476.0
2020-05-02 3419184.0 115888.0 243355.0 8065.0 8922151 1092644.0 53056.0
2020-05-03 3502126.0 82942.0 247107.0 3752.0 9388541 1124127.0 31483.0
2020-05-04 3578301.0 76175.0 251059.0 3952.0 9863005 1162279.0 38152.0
2020-05-05 3659271.0 80970.0 256736.0 5677.0 10308709 1197340.0 35061.0

106 rows × 7 columns
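
`pivot_table` aggregates with the mean by default, so the call above returns one row per (`country_region`, `updated`) pair holding the average of each value column. On a tiny hand-built frame (illustrative numbers, with one duplicated date to make the averaging visible), the result matches an explicit `groupby(...).mean()`. The sketch assumes pandas is installed.

```python
import pandas as pd

df_small = pd.DataFrame({
    "country_region": ["Worldwide", "Worldwide", "Worldwide"],
    "updated": pd.to_datetime(["2020-01-21", "2020-01-22", "2020-01-22"]),
    "confirmed": [262, 313, 315],
    "deaths": [0, 0, 2],
})

# Default aggfunc is the mean of each value column within every index group.
pivot = df_small.pivot_table(values=["confirmed", "deaths"],
                             index=["country_region", "updated"])

# The same aggregation written as a groupby.
grouped = df_small.groupby(["country_region", "updated"])[["confirmed", "deaths"]].mean()

print(pivot)
```

If each pair occurs only once, as one would expect for a single region and date, the mean simply returns the original value.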

In [5]:
df_Worldwide.plot(kind='line',x='updated',y="confirmed",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="confirmed_change",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths_change",grid=True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf154a8198>

Azure Databricks

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Spark reads Parquet lazily; no data is loaded at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
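
The `%` formatting in cell 2 produces two strings: the `wasbs://` path that Spark reads and the configuration key under which the SAS token is stored. The sketch below rebuilds both with f-strings in plain Python, without a Spark session, to show what they look like. The token in the cells above is empty, presumably because the container is publicly readable.

```python
# Same storage coordinates as in the notebook cells above.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"

host = f"{blob_account_name}.blob.core.windows.net"

# wasbs://<container>@<account>.blob.core.windows.net/<path>
wasbs_path = f"wasbs://{blob_container_name}@{host}/{blob_relative_path}"

# Spark configuration key that holds the SAS token for this container.
sas_conf_key = f"fs.azure.sas.{blob_container_name}.{host}"

# The same blob addressed over HTTPS, as in the curated file list.
https_url = f"https://{host}/{blob_container_name}/{blob_relative_path}"

print(wasbs_path)
print(sas_conf_key)
print(https_url)
```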

Azure Synapse

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Spark reads Parquet lazily; no data is loaded at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))