
COVID Tracking Project

COVID-19 Pandemic Data Lake

The COVID Tracking Project dataset provides the latest numbers on tests, confirmed cases, hospitalizations, and patient outcomes from every US state and territory.

See here for more information about this dataset.

Datasets:
Modified versions of the dataset are available in CSV, JSON, JSON-Lines, and Parquet (see the loading sketch after the URL list below).
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/latest/covid_tracking.csv
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/latest/covid_tracking.json
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/latest/covid_tracking.jsonl
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/latest/covid_tracking.parquet
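
As a quick sketch of working with the other formats (the Azure Notebooks example further down reads the Parquet file), the CSV and JSON-Lines variants can be read directly over HTTPS with pandas; the parse_dates list below is an assumption about which CSV columns hold timestamps.

import pandas as pd

base = ("https://pandemicdatalake.blob.core.windows.net/public/curated/"
        "covid-19/covid_tracking/latest/covid_tracking")

# JSON-Lines: one record per line, so lines=True is required
df_jsonl = pd.read_json(base + ".jsonl", lines=True)

# CSV: parse_dates is an assumption about which columns are timestamps
df_csv = pd.read_csv(base + ".csv", parse_dates=["date", "load_time"])

print(df_jsonl.shape, df_csv.shape)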

All modified versions have ISO 3166 subdivision codes and load times added, and use lower-case column names with underscores.

Raw data:
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/covid_tracking/latest/daily.json

Previous versions of modified and raw data (a listing sketch follows the URLs below):
https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/
https://pandemicdatalake.blob.core.windows.net/public/raw/covid-19/covid_tracking/
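
The latest/ folders always hold the newest snapshot, while dated snapshots live alongside them in the same public container. A minimal sketch for enumerating them, assuming the azure-storage-blob package and anonymous read access to the public container:

from azure.storage.blob import ContainerClient

# The container allows anonymous read access, so no credential is needed
container = ContainerClient(
    account_url="https://pandemicdatalake.blob.core.windows.net",
    container_name="public",
)

# List curated snapshots; the same pattern works for the raw/ prefix above
for blob in container.list_blobs(name_starts_with="curated/covid-19/covid_tracking/"):
    print(blob.name, blob.size)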

Data volume
All datasets are updated daily. As of May 13, 2020, they contained 4,100 rows (CSV 574 KB, JSON 1.8 MB, JSONL 1.8 MB, Parquet 334 KB).
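
Because the files grow daily, the figures above are a point-in-time reference only. A quick way to check the current size, reusing the curated Parquet URL above (treat this as a sketch):

import pandas as pd

df = pd.read_parquet(
    "https://pandemicdatalake.blob.core.windows.net/public/curated/"
    "covid-19/covid_tracking/latest/covid_tracking.parquet"
)

# Current row count and approximate in-memory footprint
print(len(df), "rows")
print(round(df.memory_usage(deep=True).sum() / 1e6, 1), "MB in memory")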

Data source
This data was originally published by the COVID Tracking Project at The Atlantic. The raw data is ingested from the COVID Tracking GitHub repository, using the States_daily_4p_et.csv file here. For more information about this dataset, including its origin in the COVID Tracking Project API, see here.
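
For comparison with the curated files, the raw snapshot linked under "Raw data" above can be loaded as-is. A minimal sketch, assuming daily.json is a JSON array of per-state records carrying the upstream API's camelCase field names:

import pandas as pd

raw_url = ("https://pandemicdatalake.blob.core.windows.net/public/raw/"
           "covid-19/covid_tracking/latest/daily.json")

# The raw file keeps the upstream field names, unlike the curated files,
# which use lower_case_with_underscores column names.
raw = pd.read_json(raw_url)
print(len(raw), "rows")
print(raw.columns.tolist()[:10])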

Data quality
The COVID Tracking Project grades the data quality for each state and provides further information about its assessment of data quality here. Data in the GitHub repository may lag the API by an hour; using the API is required to access the most recent data.
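
The curated files expose this grade in the data_quality_grade column (the preview below shows it can be null). A quick distribution check is one way to see how much graded data is available; a sketch, assuming a DataFrame df loaded from the curated Parquet file as in the Azure Notebooks example further down:

# Distribution of data-quality grades, including ungraded (null) rows
print(df["data_quality_grade"].value_counts(dropna=False))

# Latest graded row per state, where any grade exists
graded = df.dropna(subset=["data_quality_grade"])
print(graded.sort_values("date").groupby("state").tail(1)[["state", "date", "data_quality_grade"]])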

License and use rights attribution
This data is licensed under the terms and conditions of the Apache License 2.0, as described here.

Any use of the data must retain all copyright, patent, trademark, and attribution notices.

Contact
If you have questions or feedback about this or any other dataset in the COVID-19 Data Lake, contact askcovid19dl@microsoft.com.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in / When to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Synapse

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

date state positive hospitalized_currently hospitalized_cumulative on_ventilator_currently data_quality_grade last_update_et hash date_checked death hospitalized total total_test_results pos_neg fips death_increase hospitalized_increase negative_increase positive_increase total_test_results_increase fips_code iso_subdivision load_time iso_country negative in_icu_cumulative on_ventilator_cumulative recovered in_icu_currently
2021-02-24 AK 55736 46 1260 4 null 2/24/2021 3:59:00 AM 67b3b6ca1627ea40d08871803b2659b08b55daae 2/24/2021 3:59:00 AM 290 1260 55736 1662156 55736 2 0 0 0 176 8731 2 US-AK 2/26/2021 12:07:50 AM US
2021-02-24 AL 490220 773 45250 null 2/24/2021 11:00:00 AM 676bdea053983e254017ca1a5c4545ebe6b40100 2/24/2021 11:00:00 AM 9744 45250 2375371 2269033 2375371 1 84 0 2971 1247 3947 1 US-AL 2/26/2021 12:07:50 AM US 1885151 2641 1500 275245
2021-02-24 AR 317396 545 14649 99 null 2/24/2021 12:00:00 AM 2db69acfbfe82aa932fd048848b370f4d670e601 2/24/2021 12:00:00 AM 5387 14649 2685347 2618676 2685347 5 10 32 8380 803 8839 5 US-AR 2/26/2021 12:07:50 AM US 2367951 1509 307306 204
2021-02-24 AS 0 null 12/1/2020 12:00:00 AM f43db694c3c66828b057fcd5303d23ff2014fad3 12/1/2020 12:00:00 AM 0 2140 2140 2140 60 0 0 0 0 0 60 US-AS 2/26/2021 12:07:50 AM US 2140
2021-02-24 AZ 811968 1449 57156 253 null 2/24/2021 12:00:00 AM 66eb7b9f8629ac10b33a2ddb54fadd311466dd44 2/24/2021 12:00:00 AM 15693 57156 3772165 7512395 3772165 4 43 84 6987 1310 34072 4 US-AZ 2/26/2021 12:07:50 AM US 2960197 430
2021-02-24 CA 3455361 6764 null 2/24/2021 2:59:00 AM 73cbf2d1ea6a9377fc80bad073db69cfc87adb38 2/24/2021 2:59:00 AM 3455361 47652172 3455361 6 314 0 0 5303 138805 6 US-CA 2/26/2021 12:07:50 AM US 1842
2021-02-24 CO 423558 427 23349 null 2/24/2021 1:59:00 AM fed32c2407fd9bb049293894590d501160cdf06c 2/24/2021 1:59:00 AM 5917 23349 2573475 6080273 2573475 8 10 56 6888 1168 36690 8 US-CO 2/26/2021 12:07:50 AM US 2149917
2021-02-24 CT 278184 495 12257 null 2/23/2021 11:59:00 PM 4f5151c89fba8c04fff802fafb839ed51d90fde1 2/23/2021 11:59:00 PM 7595 12257 278184 6227431 278184 9 23 0 0 1493 28724 9 US-CT 2/26/2021 12:07:50 AM US
2021-02-24 DC 39943 211 31 null 2/23/2021 12:00:00 AM ffa9847c58964ef84776090f728ca4890320369b 2/23/2021 12:00:00 AM 1001 39943 1204605 39943 11 3 0 0 99 3000 11 US-DC 2/26/2021 12:07:50 AM US 28532 57
2021-02-24 DE 85506 182 null 2/23/2021 6:00:00 PM fc43ea23c4303a5eaaedc86de2f02d3ed7defd03 2/23/2021 6:00:00 PM 1402 619410 1368734 619410 10 23 0 1231 278 5059 10 US-DE 2/26/2021 12:07:50 AM US 533904 27
Name Data type Unique Values (sample) Description
date date 409 2021-01-10
2020-12-14

Date on which the daily totals were collected.

date_checked string 9,222 2020-12-01T00:00:00Z
2020-09-01T00:00:00Z

Deprecated

death smallint 7,082 2
5

Total number of people who have died as a result of COVID-19 so far.

death_increase smallint 419 1
2

Deprecated

fips smallint 56 26
55

Census FIPS code for the state.

fips_code string 60 53
25

Census FIPS code for the state.

hash string 20,164 a2e6b70aa4ad18fbb792510418b5462b6130687f
d9642fd54080446d9ab3509df0eaacd5c516ea91

A hash for the record

hospitalized int 7,368 89995
4

Deprecated

hospitalized_cumulative int 7,368 89995
4

Total number of people who have been hospitalized for COVID-19 so far, including those who have since recovered or died.

hospitalized_currently smallint 3,848 8
13

Number of people in hospital for COVID-19 on this day.

hospitalized_increase smallint 612 1
2

Deprecated

in_icu_cumulative smallint 2,221 990
220

Total number of people who have been admitted to the ICU for COVID-19 so far, including those who have since recovered or died.

in_icu_currently smallint 1,583 2
8

Number of people in the ICU for COVID-19 on this day.

iso_country string 1 US

ISO 3166 country or region code

iso_subdivision string 57 US-UM
US-WA

ISO 3166 subdivision code

last_update_et timestamp 9,222 2020-12-01 00:00:00
2020-09-01 00:00:00

Last time the data for the day was updated

load_time timestamp 1 2021-02-26 00:07:50.712000

Date and time the data was loaded into Azure from the source

negative int 13,110 305972
2140

Total number of people who have tested negative for COVID-19 so far.

negative_increase int 8,905 6
19

Deprecated

on_ventilator_cumulative smallint 657 411
412

Total number of people who have been on a ventilator for COVID-19 so far, including those who have since recovered or died.

on_ventilator_currently smallint 833 4
10

Number of people on a ventilator for COVID-19 on this day.

pending smallint 925 2
17

Number of tests whose results are not yet determined.

pos_neg int 17,926 2140
2

Deprecated

positive int 16,293 2
1

Total number of people who have tested positive for COVID-19 so far.

positive_increase smallint 4,700 1
2

Deprecated

recovered int 8,010 29
19

Total number of people who have recovered from COVID-19 so far.

state string 56 MI
PA

Two-letter code for the state.

total int 17,940 2140
2

Deprecated

total_test_results int 18,106 2140
3

Total number of test results provided by the state

total_test_results_increase int 13,019 1
2

Deprecated
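
Several columns above are marked deprecated; some of them appear to duplicate other columns (for example hospitalized versus hospitalized_cumulative), which a small consistency check can confirm before you drop them. A sketch, assuming a DataFrame df loaded from the curated Parquet file as in the Azure Notebooks example below:

# Deprecated columns should mirror their replacements where both are populated
both = df.dropna(subset=["hospitalized", "hospitalized_cumulative"])
print((both["hospitalized"] != both["hospitalized_cumulative"]).sum(),
      "rows where hospitalized differs from hospitalized_cumulative")

# pos_neg looks like positive + negative (an assumption based on the name);
# verify it where both inputs are present
pn = df.dropna(subset=["positive", "negative"])
print((pn["pos_neg"] != pn["positive"] + pn["negative"]).sum(),
      "rows where pos_neg != positive + negative")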


Azure Notebooks

Language: Python

Start by loading the dataset file into a pandas DataFrame and viewing some sample rows.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_parquet("https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/covid_tracking/latest/covid_tracking.parquet")
df.head(10)
Out[1]:
date state positive negative pending hospitalized_currently hospitalized_cumulative in_icu_currently in_icu_cumulative on_ventilator_currently ... fips death_increase hospitalized_increase negative_increase positive_increase total_test_results_increase fips_code iso_subdivision load_time iso_country
0 2020-05-28 AK 425.0 47545.0 NaN 10.0 NaN NaN NaN NaN ... 2 0.0 0.0 1594.0 13.0 1607.0 2 US-AK 2020-05-29 07:54:18.597 US
1 2020-05-28 AL 16310.0 184171.0 NaN NaN 1765.0 NaN 566.0 NaN ... 1 9.0 46.0 4220.0 467.0 4687.0 1 US-AL 2020-05-29 07:54:18.597 US
2 2020-05-28 AR 6538.0 112364.0 NaN 104.0 640.0 NaN NaN 27.0 ... 5 0.0 13.0 3045.0 261.0 3306.0 5 US-AR 2020-05-29 07:54:18.597 US
3 2020-05-28 AS 0.0 174.0 NaN NaN NaN NaN NaN NaN ... 60 0.0 0.0 0.0 0.0 0.0 60 US-AS 2020-05-29 07:54:18.597 US
4 2020-05-28 AZ 17763.0 185151.0 NaN 945.0 2848.0 374.0 NaN 222.0 ... 4 26.0 817.0 6147.0 501.0 6648.0 4 US-AZ 2020-05-29 07:54:18.597 US
5 2020-05-28 CA 101697.0 1688862.0 NaN 4529.0 NaN 1325.0 NaN NaN ... 6 89.0 0.0 50948.0 2717.0 53665.0 6 US-CA 2020-05-29 07:54:18.597 US
6 2020-05-28 CO 24767.0 138356.0 NaN 464.0 4196.0 NaN NaN NaN ... 8 40.0 36.0 3501.0 202.0 3703.0 8 US-CO 2020-05-29 07:54:18.597 US
7 2020-05-28 CT 41559.0 193966.0 NaN 648.0 12538.0 NaN NaN NaN ... 9 57.0 0.0 9907.0 256.0 10163.0 9 US-CT 2020-05-29 07:54:18.597 US
8 2020-05-28 DC 8492.0 35022.0 NaN 343.0 NaN 108.0 NaN 65.0 ... 11 8.0 0.0 731.0 86.0 817.0 11 US-DC 2020-05-29 07:54:18.597 US
9 2020-05-28 DE 9171.0 47630.0 NaN 192.0 NaN NaN NaN NaN ... 10 1.0 0.0 1305.0 75.0 1380.0 10 US-DE 2020-05-29 07:54:18.597 US

10 rows × 31 columns

Let's check the data types of the various fields and verify that the last_update_et and load_time columns are in datetime format.

In [2]:
df.dtypes
Out[2]:
date                           datetime64[ns]
state                                  object
positive                              float64
negative                              float64
pending                               float64
hospitalized_currently                float64
hospitalized_cumulative               float64
in_icu_currently                      float64
in_icu_cumulative                     float64
on_ventilator_currently               float64
on_ventilator_cumulative              float64
recovered                             float64
data_quality_grade                     object
last_update_et                 datetime64[ns]
hash                                   object
date_checked                           object
death                                 float64
hospitalized                          float64
total                                   int32
total_test_results                      int32
pos_neg                                 int32
fips                                    int16
death_increase                        float64
hospitalized_increase                 float64
negative_increase                     float64
positive_increase                     float64
total_test_results_increase           float64
fips_code                              object
iso_subdivision                        object
load_time                      datetime64[ns]
iso_country                            object
dtype: object

This dataset contains data for the United States. Let's verify that we have data for all of the US states.

We will start by looking at the latest data for each state:

In [3]:
df.groupby('state').first().filter(['date','positive', 'death'])
Out[3]:
date positive death
state
AK 2020-05-28 425.0 10.0
AL 2020-05-28 16310.0 590.0
AR 2020-05-28 6538.0 120.0
AS 2020-05-28 0.0 0.0
AZ 2020-05-28 17763.0 857.0
CA 2020-05-28 101697.0 3973.0
CO 2020-05-28 24767.0 1392.0
CT 2020-05-28 41559.0 3826.0
DC 2020-05-28 8492.0 453.0
DE 2020-05-28 9171.0 345.0
FL 2020-05-28 53285.0 2446.0
GA 2020-05-28 45070.0 1962.0
GU 2020-05-28 172.0 5.0
HI 2020-05-28 644.0 17.0
IA 2020-05-28 18573.0 504.0
ID 2020-05-28 2731.0 82.0
IL 2020-05-28 115833.0 5186.0
IN 2020-05-28 33068.0 2068.0
KS 2020-05-28 9337.0 205.0
KY 2020-05-28 9077.0 400.0
LA 2020-05-28 38802.0 2740.0
MA 2020-05-28 94895.0 6640.0
MD 2020-05-28 49709.0 2428.0
ME 2020-05-28 2189.0 84.0
MI 2020-05-28 56014.0 5372.0
MN 2020-05-28 22947.0 977.0
MO 2020-05-28 12673.0 707.0
MP 2020-05-28 22.0 2.0
MS 2020-05-28 14372.0 693.0
MT 2020-05-28 485.0 17.0
NC 2020-05-28 25412.0 827.0
ND 2020-05-28 2481.0 57.0
NE 2020-05-28 12976.0 163.0
NH 2020-05-28 4286.0 223.0
NJ 2020-05-28 157815.0 11401.0
NM 2020-05-28 7252.0 329.0
NV 2020-05-28 8208.0 406.0
NY 2020-05-28 366733.0 23722.0
OH 2020-05-28 33915.0 2098.0
OK 2020-05-28 6270.0 326.0
OR 2020-05-28 4086.0 151.0
PA 2020-05-28 70042.0 5373.0
PR 2020-05-28 3486.0 131.0
RI 2020-05-28 14494.0 677.0
SC 2020-05-28 10788.0 470.0
SD 2020-05-28 4793.0 54.0
TN 2020-05-28 21679.0 356.0
TX 2020-05-28 59776.0 1601.0
UT 2020-05-28 8921.0 106.0
VA 2020-05-28 41401.0 1338.0
VI 2020-05-28 69.0 6.0
VT 2020-05-28 974.0 55.0
WA 2020-05-28 20406.0 1095.0
WI 2020-05-28 16974.0 550.0
WV 2020-05-28 1906.0 74.0
WY 2020-05-28 874.0 15.0

Next, we will do some aggregations to make sure that columns such as positive_increase and death_increase tally with the latest cumulative data. The positive and death numbers for the latest date in the table above should match the per-state sums of positive_increase and death_increase.

In [4]:
df.groupby(df.state).agg({'state': 'count','positive_increase': 'sum','death_increase': 'sum'})
Out[4]:
state positive_increase death_increase
state
AK 84 425.0 10.0
AL 83 16310.0 590.0
AR 84 6538.0 120.0
AS 74 0.0 0.0
AZ 86 17761.0 857.0
CA 86 101644.0 3973.0
CO 85 24767.0 1392.0
CT 83 41559.0 3826.0
DC 85 8492.0 453.0
DE 84 9171.0 345.0
FL 86 53283.0 2446.0
GA 86 45068.0 1962.0
GU 74 169.0 5.0
HI 166 1286.0 34.0
IA 84 18573.0 504.0
ID 83 2731.0 82.0
IL 86 115829.0 5186.0
IN 84 33067.0 2068.0
KS 84 9337.0 205.0
KY 84 9077.0 400.0
LA 83 38802.0 2740.0
MA 78 94887.0 6640.0
MD 85 49709.0 2428.0
ME 83 2189.0 84.0
MI 178 112010.0 10744.0
MN 84 22947.0 977.0
MO 83 12673.0 707.0
MP 74 22.0 2.0
MS 83 14372.0 693.0
MT 83 485.0 17.0
NC 86 25411.0 827.0
ND 83 2481.0 57.0
NE 85 12976.0 163.0
NH 86 4284.0 223.0
NJ 85 157814.0 11401.0
NM 84 7252.0 329.0
NV 85 8207.0 406.0
NY 86 366727.0 23722.0
OH 85 33915.0 2098.0
OK 83 6269.0 326.0
OR 86 4083.0 151.0
PA 168 140080.0 10746.0
PR 74 3481.0 131.0
RI 89 14493.0 677.0
SC 86 10788.0 470.0
SD 83 4793.0 54.0
TN 85 21678.0 356.0
TX 86 59775.0 1601.0
UT 83 8920.0 106.0
VA 85 41401.0 1338.0
VI 74 68.0 6.0
VT 84 974.0 55.0
WA 128 20405.0 1095.0
WI 172 33946.0 1100.0
WV 84 1906.0 74.0
WY 83 874.0 15.0

Let's do some basic visualizations for New York.

In [5]:
df_NY=df[df['state'] == 'NY']
df_NY.plot(kind='line',x='date',y="positive",grid=True)
df_NY.plot(kind='line',x='date',y="positive_increase",grid=True)
df_NY.plot(kind='line',x='date',y="death",grid=True)
df_NY.plot(kind='line',x='date',y="death_increase",grid=True)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f043c78eba8>

Let's aggregate the data for all of the US and do some visualizations.

In [6]:
df_US=df.groupby(df.date).agg({'positive': 'sum','positive_increase': 'sum','death':'sum','death_increase': 'sum'}).reset_index()

df_US.plot(kind='line',x='date',y="positive",grid=True)
df_US.plot(kind='line',x='date',y="positive_increase",grid=True)
df_US.plot(kind='line',x='date',y="death",grid=True)
df_US.plot(kind='line',x='date',y="death_increase",grid=True)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f043c8560f0>

Azure Databricks

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/covid_tracking/latest/covid_tracking.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark; this is lazy, so no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
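
From here the registered source view can be queried directly with Spark SQL. A short follow-up sketch (this aggregation is illustrative and not part of the original sample); the maximum of a cumulative column per state approximates its latest value.

# Illustrative follow-up query against the temporary view registered above
display(spark.sql("""
    SELECT state, MAX(positive) AS positive, MAX(death) AS death
    FROM source
    GROUP BY state
    ORDER BY positive DESC
    LIMIT 10
"""))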

Azure Synapse

Language: Python
In [1]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/covid_tracking/latest/covid_tracking.parquet"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet file with Spark; this is lazy, so no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
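
The same source view can also be narrowed down and pulled into pandas on the Synapse driver for light-weight inspection or plotting. A sketch of one way to do that (illustrative, not part of the original sample):

# Filter to the most recent date in the dataset and convert to pandas
latest_pdf = spark.sql("""
    SELECT date, state, positive, death
    FROM source
    WHERE date = (SELECT MAX(date) FROM source)
""").toPandas()

print(latest_pdf.sort_values("positive", ascending=False).head(10))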