Full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles, optimized for machine readability and made available for use by the global research community.
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19). This dataset is a free collection of more than 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses, for use by the global research community.
This dataset is intended to enable researchers to apply recent advances in natural language processing to generate new insights in the fight against this infectious disease.
The corpus may be updated as new research is published in peer-reviewed journals and archive services such as bioRxiv, medRxiv, and others.
License Terms
This dataset is made available by the Allen Institute for AI and Semantic Scholar. By accessing, downloading, or otherwise using any content provided in the CORD-19 Dataset, you agree to the Dataset License governing the use of this dataset. Specific licensing information for individual articles in the dataset is available in the metadata file. Additional licensing information is available on the PMC website, the medRxiv website, and the bioRxiv website.
Volume and Retention
This dataset is stored in JSON format, and the latest release contains over 36,000 full-text articles. Each paper is represented as a single JSON object. The schema is available here.
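For orientation, each article is one JSON object whose top-level shape looks roughly like the sketch below. The 'body_text' list of paragraph objects is what the examples later on this page consume; the other field names follow the published CORD-19 schema and should be checked against the schema linked above.
# approximate shape of one article object (a sketch, not the authoritative schema)
example_article = {
    "paper_id": "...",                # sha of the source PDF
    "metadata": {"title": "...", "authors": []},
    "abstract": [{"text": "..."}],    # list of paragraph objects
    "body_text": [{"text": "..."}],   # list of paragraph objects; used in the examples below
    "bib_entries": {},                # bibliography entries
    "ref_entries": {},                # figure and table references
    "back_matter": [],
}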
Storage Location
This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.
Citation
When including CORD-19 data in a publication or redistribution, please cite the dataset as follows:
In bibliography:
COVID-19 Open Research Dataset (CORD-19). 2020. Version YYYY-MM-DD. Retrieved from COVID-19 Open Research Dataset (CORD-19). Accessed YYYY-MM-DD. doi:10.5281/zenodo.3715505
In text: (CORD-19, 2020)
Contact
For questions about this dataset, contact partnerships@allenai.org.
Notices
MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL, OR PUNITIVE DAMAGES, RESULTING FROM YOUR USE OF THE DATASETS.
Access
Available in | When to use |
---|---|
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine. |
Select your preferred service:
Azure Notebooks
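The walkthrough below uses the legacy v2 azure-storage-blob SDK, which is where BlockBlobService lives (the class was removed in v12 of the SDK). For a fresh environment, an install sketch (the pinned version is an assumption):
# legacy v2 blob SDK (provides BlockBlobService), plus pandas and nltk used below
!pip install azure-storage-blob==2.1.0 pandas nltk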
from azure.storage.blob import BlockBlobService
# storage account details
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"
# create a blob service
blob_service = BlockBlobService(
    account_name=azure_storage_account_name,
    sas_token=azure_storage_sas_token,
)
# container housing CORD-19 data
container_name = "covid19temp"
# download metadata.csv
metadata_filename = 'metadata.csv'
blob_service.get_blob_to_path(
    container_name=container_name,
    blob_name=metadata_filename,
    file_path=metadata_filename
)
import pandas as pd
# read metadata.csv into a dataframe
metadata_filename = 'metadata.csv'
metadata = pd.read_csv(metadata_filename)
metadata.head(3)
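# metadata.csv mixes numeric-looking and string columns, so pandas may emit a
# DtypeWarning about mixed types (whether it appears depends on the release
# loaded); if it does, re-reading with low_memory=False lets pandas infer each
# column's type from the whole file -- a sketch:
metadata = pd.read_csv(metadata_filename, low_memory=False)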
simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']
def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

def preview(text):
    '''Show only a preview of the text data.'''
    return text[:30] + '...'
format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}
metadata[simple_schema].head().style.format(format_)
# let's take a quick look around
num_entries = len(metadata)
print("There are {} many entries in this dataset:".format(num_entries))
metadata_with_text = metadata[metadata['full_text_file'].isna() == False]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))
with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))
with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))
with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))
# choose a random example with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]
# construct path to blob containing full text
blob_name = '{0}/pdf_json/{1}.json'.format(example_entry['full_text_file'], example_entry['sha']) # note the repetition in the path
print("Full text blob for this entry:")
print(blob_name)
import json
blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)
# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))
import nltk
nltk.download('punkt')  # sent_tokenize needs the punkt tokenizer models
from nltk.tokenize import sent_tokenize
# the text itself lives under 'body_text'
text = data['body_text']
# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))
print("An example sentence:", sentences[0])
# choose a random example with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]
# construct path to blob containing full text
blob_name = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid']) # note the repetition in the path
print("Full text blob for this entry:")
print(blob_name)
blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)
# the text itself lives under 'body_text'
text = data['body_text']
# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))
print("An example sentence:", sentences[0])
# get and sort list of available blobs
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
# we can now iterate directly through the blobs
count = 0
for blob in sorted_blobs:
    if blob.name.endswith(".json"):
        count += 1
print("There are {} json files".format(count))
metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]
print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))
metadata_multiple_shas.head(3)
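# entries with multiple shas pack several hashes into the single 'sha' field,
# separated by '; ' in the CORD-19 metadata (worth verifying against the
# release you use) -- a minimal sketch to expand them into lists:
sha_lists = metadata_multiple_shas['sha'].str.split('; ')
print(sha_lists.head(3))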
container_name = "covid19temp"
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
import re
dirs = {}
pattern = r'([\w]+)\/([\w]+)\/([\w.]+)\.json'
for blob in sorted_blobs:
    m = re.match(pattern, blob.name)
    if m:
        dir_ = m[1] + '/' + m[2]
        if dir_ in dirs:
            dirs[dir_] += 1
        else:
            dirs[dir_] = 1
dirs
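The second approach, below, mounts the dataset as a file system through the Azure Machine Learning SDK instead of fetching individual blobs. If azureml.core is not already available, an install sketch (the azureml-core package provides it):
# Azure ML SDK, which provides Dataset and the mount context used below
!pip install azureml-core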
import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)
from azureml.core import Dataset
cord19_dataset = Dataset.File.from_files('https://azureopendatastorage.blob.core.windows.net/covid19temp')
mount = cord19_dataset.mount()
import os
COVID_DIR = '/covid19temp'
path = mount.mount_point + COVID_DIR
with mount:
    print(os.listdir(path))
import pandas as pd
# create mount context
mount.start()
# specify path to metadata.csv
COVID_DIR = 'covid19temp'
metadata_filename = '{}/{}/{}'.format(mount.mount_point, COVID_DIR, 'metadata.csv')
# read metadata
metadata = pd.read_csv(metadata_filename)
metadata.head(3)
simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']
def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

def preview(text):
    '''Show only a preview of the text data.'''
    return text[:30] + '...'
format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}
metadata[simple_schema].head().style.format(format_)
# let's take a quick look around
num_entries = len(metadata)
print("There are {} many entries in this dataset:".format(num_entries))
metadata_with_text = metadata[metadata['full_text_file'].isna() == False]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))
with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))
with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))
with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))
# choose a random example with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]
# construct path to blob containing full text
filepath = '{0}/{1}/pdf_json/{2}.json'.format(path, example_entry['full_text_file'], example_entry['sha'])
print("Full text filepath:")
print(filepath)
import json
try:
    with open(filepath, 'r') as f:
        data = json.load(f)
except FileNotFoundError as e:
    # in case the mount context has been closed
    mount.start()
    with open(filepath, 'r') as f:
        data = json.load(f)
# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))
import nltk
nltk.download('punkt')  # sent_tokenize needs the punkt tokenizer models
from nltk.tokenize import sent_tokenize
# the text itself lives under 'body_text'
text = data['body_text']
# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))
print("An example sentence:", sentences[0])
# choose a random example with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]
# construct path to blob containing full text
filename = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid']) # note the repetition in the path
print("Path to file: {}\n".format(filename))
with open(mount.mount_point + '/' + COVID_DIR + '/' + filename, 'r') as f:
    data = json.load(f)
# the text itself lives under 'body_text'
text = data['body_text']
# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))
print("An example sentence:", sentences[0])
metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]
print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))
metadata_multiple_shas.head(3)
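When you are finished with the mounted dataset, release the mount context that the mount.start() call above left open:
# release the mount context created earlier
mount.stop()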