
Bing COVID-19 Data

Bing COVID-19 data includes confirmed, fatal, and recovered cases from all regions, updated daily.
This data is reflected in the Bing COVID-19 Tracker.

Bing collects data from multiple trusted, reliable sources, including the WHO (World Health Organization), the CDC (Centers for Disease Control and Prevention), national and state health departments in the United States, BNO News, 24/7 Wall St., and Wikipedia.

For more information and the source data, see this link. For the license terms, see this link.

Datasets:
Modified versions of the dataset are available in CSV, JSON, JSON-Lines, and Parquet:
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet

All modified datasets have ISO 3166 subdivision codes and load times added, and use lowercase column names with underscore separators.

Raw data: https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/bing_covid-19_data/latest/Bing-COVID19-Data.csv

Previous versions of the modified and raw data: https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/
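
The dataset files are served from a publicly readable blob container, so earlier snapshots can also be enumerated programmatically. Below is a minimal sketch, not part of the original page, assuming the azure-storage-blob package is installed and the public container remains anonymously listable:

# Minimal sketch: enumerate archived versions of the Bing COVID-19 dataset.
# Assumes anonymous access (credential=None) to the public container and that
# azure-storage-blob is installed (pip install azure-storage-blob).
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://pandemicdatalake.blob.core.windows.net",
    container_name="public",
    credential=None,  # anonymous access to a public container
)

prefix = "curated/covid-19/bing_covid-19_data/"
for blob in container.list_blobs(name_starts_with=prefix):
    print(blob.name, blob.size)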

Data volume
All datasets are updated daily. As of May 11, 2020, they contained 125,576 rows (CSV 16.1 MB, JSON 40 MB, JSONL 39.6 MB, Parquet 1.1 MB).

License and use rights; attribution
This data is available only for educational and academic purposes, such as medical research by government agencies and academic institutions, under the terms and conditions available here.

Data used or cited in publications must include an attribution to the "Bing COVID-19 Tracker" with a link to www.bing.com/covid.

Contact
For questions or feedback about this or any other dataset in the COVID-19 Data Lake, contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES OR GUARANTEES, EXPRESS OR IMPLIED, WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL, OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in        When to use
Azure Notebooks     Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.
Azure Databricks    Use this when you need the scale of an Azure managed Spark cluster to process the dataset.
Azure Synapse       Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

id updated confirmed deaths iso2 iso3 country_region admin_region_1 iso_subdivision admin_region_2 load_time confirmed_change deaths_change
338995 2020-01-21 262 0 null null Worldwide null null null 6/18/2021 12:05:16 AM
338996 2020-01-22 313 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 51 0
338997 2020-01-23 578 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 265 0
338998 2020-01-24 841 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 263 0
338999 2020-01-25 1320 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 479 0
339000 2020-01-26 2014 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 694 0
339001 2020-01-27 2798 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 784 0
339002 2020-01-28 4593 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 1795 0
339003 2020-01-29 6065 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 1472 0
339004 2020-01-30 7818 0 null null Worldwide null null null 6/18/2021 12:05:16 AM 1753 0
Name              Data type  Unique     Values (sample)                       Description
admin_region_1    string     864        Texas; Georgia                        Region within country_region
admin_region_2    string     3,143      Washington County; Jefferson County   Region within admin_region_1
confirmed         int        141,361    1; 2                                  Confirmed case count for the region
confirmed_change  int        13,494     1; 2                                  Change in confirmed case count from the previous day
country_region    string     240        United States; India                  Country/region
deaths            int        23,665     1; 2                                  Death count for the region
deaths_change     smallint   2,162      1; 2                                  Change in death count from the previous day
id                int        2,022,096  133786008; 742546                     Unique identifier
iso_subdivision   string     484        US-TX; US-GA                          Two-part ISO subdivision code
iso2              string     229        US; IN                                Two-letter country code identifier
iso3              string     229        USA; IND                              Three-letter country code identifier
latitude          double     5,676      42.28708; 19.59852                    Latitude of the centroid of the region
load_time         timestamp  1          2021-06-18 00:05:16.436000            The date and time the file was loaded from the Bing source on GitHub
longitude         double     5,694      -2.5396; -155.5186                    Longitude of the centroid of the region
recovered         int        87,731     1; 2                                  Recovered case count for the region
recovered_change  int        12,039     1; 2                                  Change in recovered case count from the previous day
updated           date       504        2021-06-12; 2021-06-02                The as-of date for the record
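
The iso_subdivision and admin_region columns above make it easy to slice the data by geography. The following is an illustrative sketch only: the column names come from the schema above, but the filter that treats rows with admin_region_2 unset as state-level totals is an assumption about how the data is organized.

# Illustrative sketch: daily figures for one US state, selected via iso_subdivision.
import pandas as pd

url = ("https://pandemicdatalake.blob.core.windows.net/public/curated/"
       "covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet")
df = pd.read_parquet(url)

# Assumption: state-level totals are the rows where admin_region_2 is unset.
texas = df[(df["iso_subdivision"] == "US-TX") & (df["admin_region_2"].isna())]
print(texas[["updated", "confirmed", "confirmed_change", "deaths"]].tail())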


Azure Notebooks

Language: Python

Download the dataset file using the built-in capability of pandas to read from an HTTP URL. pandas has readers for various file formats:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html (use lines=True for json lines)
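
For example, the CSV and JSON-Lines files listed above can be read in the same way; a brief sketch (note lines=True for the JSON-Lines variant):

import pandas as pd

base = ("https://pandemicdatalake.blob.core.windows.net/public/curated/"
        "covid-19/bing_covid-19_data/latest/bing_covid-19_data")

df_csv = pd.read_csv(base + ".csv", parse_dates=["updated", "load_time"])
df_jsonl = pd.read_json(base + ".jsonl", lines=True)  # JSON Lines requires lines=True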

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet")
df.head(10)
Out[1]:
id updated confirmed confirmed_change deaths deaths_change recovered recovered_change latitude longitude iso2 iso3 country_region admin_region_1 iso_subdivision admin_region_2 admin_region_2_code load_time
0 338995 2020-01-21 262.0 0.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
1 338996 2020-01-22 313.0 51.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
2 338997 2020-01-23 578.0 265.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
3 338998 2020-01-24 841.0 263.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
4 338999 2020-01-25 1320.0 479.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
5 339000 2020-01-26 2014.0 694.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
6 339001 2020-01-27 2798.0 784.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
7 339002 2020-01-28 4593.0 1795.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
8 339003 2020-01-29 6065.0 1472.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629
9 339004 2020-01-30 7818.0 1753.0 0.0 0.0 NaN NaN NaN NaN None None Worldwide None None None None 2020-05-08 00:05:35.629

Let's check the data types of the various fields and verify that the updated column is in datetime format.

In [2]:
df.dtypes
Out[2]:
id                              int32
updated                datetime64[ns]
confirmed                     float64
confirmed_change              float64
deaths                        float64
deaths_change                 float64
recovered                     float64
recovered_change              float64
latitude                      float64
longitude                     float64
iso2                           object
iso3                           object
country_region                 object
admin_region_1                 object
iso_subdivision                object
admin_region_2                 object
admin_region_2_code            object
load_time              datetime64[ns]
dtype: object

We will now look at the Worldwide data and plot some simple charts to visualize it.

In [3]:
df_Worldwide=df[df['country_region']=='Worldwide']
In [4]:
# Pivot so each (country_region, updated) pair indexes the numeric columns
df_Worldwide_pivot = df_Worldwide.pivot_table(index=['country_region', 'updated'])

df_Worldwide_pivot
Out[4]:
confirmed confirmed_change deaths deaths_change id recovered recovered_change
country_region updated
Worldwide 2020-01-21 262.0 0.0 0.0 0.0 338995 NaN NaN
2020-01-22 313.0 51.0 0.0 0.0 338996 NaN NaN
2020-01-23 578.0 265.0 0.0 0.0 338997 NaN NaN
2020-01-24 841.0 263.0 0.0 0.0 338998 NaN NaN
2020-01-25 1320.0 479.0 0.0 0.0 338999 NaN NaN
2020-01-26 2014.0 694.0 0.0 0.0 339000 NaN NaN
2020-01-27 2798.0 784.0 0.0 0.0 339001 NaN NaN
2020-01-28 4593.0 1795.0 0.0 0.0 339002 NaN NaN
2020-01-29 6065.0 1472.0 0.0 0.0 339003 NaN NaN
2020-01-30 7818.0 1753.0 0.0 0.0 339004 NaN NaN
2020-01-31 9826.0 2008.0 0.0 0.0 339005 NaN NaN
2020-02-01 11953.0 2127.0 0.0 0.0 339006 NaN NaN
2020-02-02 14557.0 2604.0 0.0 0.0 339007 NaN NaN
2020-02-03 17386.0 2829.0 362.0 362.0 339008 NaN NaN
2020-02-04 20625.0 3239.0 426.0 64.0 339009 NaN NaN
2020-02-05 24549.0 3924.0 492.0 66.0 339010 NaN NaN
2020-02-06 28256.0 3707.0 565.0 73.0 339011 NaN NaN
2020-02-07 31420.0 3164.0 638.0 73.0 339012 NaN NaN
2020-02-08 34822.0 3402.0 724.0 86.0 339013 NaN NaN
2020-02-09 37494.0 2672.0 813.0 89.0 339014 NaN NaN
2020-02-10 40484.0 2990.0 910.0 97.0 339015 NaN NaN
2020-02-11 42968.0 2484.0 1018.0 108.0 339016 NaN NaN
2020-02-12 44996.0 2028.0 1115.0 97.0 339017 NaN NaN
2020-02-13 46823.0 1827.0 1369.0 254.0 339018 NaN NaN
2020-02-14 64219.0 17396.0 1383.0 14.0 339019 NaN NaN
2020-02-15 66884.0 2665.0 1526.0 143.0 339020 NaN NaN
2020-02-16 68912.0 2028.0 1669.0 143.0 339021 NaN NaN
2020-02-17 70975.0 2063.0 1775.0 106.0 339022 NaN NaN
2020-02-18 72778.0 1803.0 1873.0 98.0 339023 NaN NaN
2020-02-19 75204.0 2426.0 2009.0 136.0 339024 NaN NaN
... ... ... ... ... ... ... ...
2020-04-06 1341907.0 69792.0 74565.0 5256.0 4013596 276259.0 16247.0
2020-04-07 1426096.0 84189.0 81259.0 6694.0 4309179 300054.0 23795.0
2020-04-08 1504971.0 78875.0 87984.0 6725.0 4551545 328661.0 28607.0
2020-04-09 1587209.0 82238.0 94850.0 6866.0 4728946 353291.0 24630.0
2020-04-10 1684833.0 97624.0 102136.0 7286.0 4898837 375499.0 22208.0
2020-04-11 1764622.0 79789.0 107904.0 5768.0 5310451 385999.0 10500.0
2020-04-12 1840093.0 75471.0 113672.0 5768.0 5124486 421372.0 35373.0
2020-04-13 1912923.0 72830.0 118966.0 5294.0 5261547 448053.0 26681.0
2020-04-14 1970879.0 57956.0 125678.0 6712.0 5305989 472948.0 24895.0
2020-04-15 2056055.0 85176.0 133572.0 7894.0 5347997 511019.0 38071.0
2020-04-16 2151199.0 95144.0 143725.0 10153.0 5429132 541501.0 30482.0
2020-04-17 2234109.0 82910.0 153379.0 9654.0 5440228 567695.0 26194.0
2020-04-18 2310572.0 76463.0 158691.0 5312.0 5448942 590682.0 22987.0
2020-04-19 2394291.0 83719.0 164938.0 6247.0 5457903 611880.0 21198.0
2020-04-20 2470410.0 76119.0 169794.0 4856.0 5467891 645335.0 33455.0
2020-04-21 2560504.0 90094.0 176926.0 7132.0 5495212 679793.0 34458.0
2020-04-22 2628894.0 68390.0 182992.0 6066.0 5562397 709050.0 29257.0
2020-04-23 2699338.0 70444.0 188437.0 5445.0 6650426 737735.0 28685.0
2020-04-24 2790986.0 91648.0 195920.0 7483.0 6906714 781382.0 43647.0
2020-04-25 2868539.0 77553.0 201502.0 5582.0 6955177 811660.0 30278.0
2020-04-26 2965363.0 96824.0 206265.0 4763.0 7002306 863464.0 51804.0
2020-04-27 3002303.0 36940.0 208131.0 1866.0 7055144 878813.0 15349.0
2020-04-28 3083467.0 81164.0 213824.0 5693.0 7098114 915988.0 37175.0
2020-04-29 3170335.0 86868.0 224708.0 10884.0 7522181 958353.0 42365.0
2020-04-30 3249022.0 78687.0 230804.0 6096.0 7997577 1006112.0 47759.0
2020-05-01 3303296.0 54274.0 235290.0 4486.0 8478160 1039588.0 33476.0
2020-05-02 3419184.0 115888.0 243355.0 8065.0 8922151 1092644.0 53056.0
2020-05-03 3502126.0 82942.0 247107.0 3752.0 9388541 1124127.0 31483.0
2020-05-04 3578301.0 76175.0 251059.0 3952.0 9863005 1162279.0 38152.0
2020-05-05 3659271.0 80970.0 256736.0 5677.0 10308709 1197340.0 35061.0

106 rows × 7 columns

In [5]:
df_Worldwide.plot(kind='line',x='updated',y="confirmed",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="confirmed_change",grid=True)
df_Worldwide.plot(kind='line',x='updated',y="deaths_change",grid=True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf154a8198>
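
The daily change series is noisy, so a 7-day rolling mean can make the trend easier to read. This is a follow-up sketch building on the df_Worldwide frame above, not part of the original notebook:

# Smooth the worldwide daily new-case series with a 7-day rolling mean.
df_smooth = df_Worldwide.sort_values("updated").set_index("updated")
df_smooth["confirmed_change_7d"] = df_smooth["confirmed_change"].rolling(7).mean()
df_smooth["confirmed_change_7d"].plot(grid=True, title="Worldwide confirmed_change (7-day rolling mean)")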

Azure Databricks

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the parquet file with Spark; loading is lazy, so no data is read yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
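
With the DataFrame registered as the source temporary view, Spark SQL can aggregate across the full dataset. The query below is an illustrative sketch, not part of the original notebook; it assumes country-level rows are those with admin_region_1 unset:

# Illustrative sketch: highest cumulative confirmed counts per country,
# assuming country-level rows have admin_region_1 set to NULL.
top_countries = spark.sql("""
    SELECT country_region, MAX(confirmed) AS total_confirmed
    FROM source
    WHERE country_region != 'Worldwide' AND admin_region_1 IS NULL
    GROUP BY country_region
    ORDER BY total_confirmed DESC
    LIMIT 10
""")
display(top_countries)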

Azure Synapse

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the parquet file with Spark; loading is lazy, so no data is read yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
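
The same data can also be queried with the DataFrame API instead of SQL; a small sketch using the Worldwide rows shown in the preview above (not part of the original notebook):

from pyspark.sql import functions as F

# Sketch: most recent worldwide totals via the DataFrame API.
worldwide = (df.filter(F.col("country_region") == "Worldwide")
               .select("updated", "confirmed", "deaths", "recovered")
               .orderBy(F.col("updated").desc()))
display(worldwide.limit(10))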