
COVID-19 Open Research Dataset

Tags: CORD-19, COVID-19, coronavirus, SARS, MERS

A full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles, optimized for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute CORD-19 (the COVID-19 Open Research Dataset), a free resource of over 47,000 scholarly articles about COVID-19 and the coronavirus family of viruses, including over 36,000 with full text, for use by the global research community.

The goal of this dataset is to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.

The corpus may be updated as new research is published in peer-reviewed publications and archival services such as bioRxiv and medRxiv.

Terms of use

This dataset is provided by the Allen Institute for AI and Semantic Scholar. By accessing, downloading, or otherwise using any content provided in the CORD-19 dataset, you agree to the Dataset License governing its use. Specific licensing information for individual articles in the dataset is available in the metadata file. Additional licensing information is available on the PMC website, the medRxiv website, and the bioRxiv website.

Volume and retention

This dataset is stored in JSON format, and the latest release contains over 36,000 full-text articles. Each paper is represented as a single JSON object. The schema is available here.

Storage location

This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.

Citation

When including CORD-19 data in a publication or redistribution, please cite the dataset as follows:

In bibliography:

COVID-19 Open Research Dataset (CORD-19). 2020. Version YYYY-MM-DD. Retrieved from COVID-19 Open Research Dataset (CORD-19). Accessed YYYY-MM-DD. doi:10.5281/zenodo.3715505

In text: (CORD-19, 2020)

Contact

For any questions about this dataset, contact partnerships@allenai.org.

Notices

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

Access

Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Package: Language: Python

The CORD-19 Dataset

CORD-19 is a collection of over 50,000 scholarly articles - including over 40,000 with full text - about COVID-19, SARS-CoV-2, and related coronaviruses. This dataset has been made freely available to help research communities combat the COVID-19 pandemic.

The goal of this notebook is two-fold:

  1. Demonstrate how to access the CORD-19 dataset on Azure: We connect to the Azure blob storage account housing the CORD-19 dataset.
  2. Walk through the structure of the dataset: Articles in the dataset are stored as json files. We provide examples showing:

    • How to find the articles (navigating the container)
    • How to read the articles (navigating the json schema)

Dependencies: This notebook requires the following libraries:

  • Azure storage (e.g. pip install azure-storage)
  • NLTK (docs)
  • Pandas (e.g. pip install pandas)

Getting the CORD-19 data from Azure

The CORD-19 data has been uploaded as an Azure Open Dataset here. We create a blob service linked to this CORD-19 open dataset.

In [1]:
from azure.storage.blob import BlockBlobService

# storage account details
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = "sv=2019-02-02&ss=bfqt&srt=sco&sp=rlcup&se=2025-04-14T00:21:16Z&st=2020-04-13T16:21:16Z&spr=https&sig=JgwLYbdGruHxRYTpr5dxfJqobKbhGap8WUtKFadcivQ%3D"

# create a blob service
blob_service = BlockBlobService(
    account_name=azure_storage_account_name,
    sas_token=azure_storage_sas_token,
)

We can use this blob service as a handle on the data, navigating the dataset with the BlockBlobService APIs. See here for more details.

The CORD-19 data is stored in the covid19temp container. This is the file structure within the container together with an example file.

metadata.csv
custom_license/
    pdf_json/
        0001418189999fea7f7cbe3e82703d71c85a6fe5.json        # filename is sha-hash
        ...
    pmc_json/
        PMC1065028.xml.json                                  # filename is the PMC ID
        ...
noncomm_use_subset/
    pdf_json/
        0036b28fddf7e93da0970303672934ea2f9944e7.json
        ...
    pmc_json/
        PMC1616946.xml.json
        ...
comm_use_subset/
    pdf_json/
        000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
        ...
    pmc_json/
        PMC1054884.xml.json
        ...
biorxiv_medrxiv/                                             # note: there is no pmc_json subdir
    pdf_json/
        0015023cc06b5362d332b3baf348d11567ca2fbb.json
        ...

Each .json file corresponds to an individual article in the dataset. This is where the title, authors, abstract and (where available) the full text data is stored.
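The naming conventions above (a 40-character sha hash for pdf_json files, a PMC ID for pmc_json files) can be checked with a small sketch; classify_filename is a hypothetical helper for illustration, not part of the dataset tooling:

```python
import re

def classify_filename(name):
    """Classify a CORD-19 json filename as a pdf parse (sha) or a pmc parse (PMC ID)."""
    if re.fullmatch(r'[0-9a-f]{40}\.json', name):
        return 'pdf_json'   # filename is the sha hash of the source pdf
    if re.fullmatch(r'PMC\d+\.xml\.json', name):
        return 'pmc_json'   # filename is the PMC ID
    return 'unknown'

print(classify_filename('0001418189999fea7f7cbe3e82703d71c85a6fe5.json'))  # pdf_json
print(classify_filename('PMC1065028.xml.json'))                            # pmc_json
```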

Using metadata.csv

The CORD-19 dataset comes with metadata.csv - a single file that records basic information on all the papers available in the CORD-19 dataset. This is a good place to start exploring!

In [2]:
# container housing CORD-19 data
container_name = "covid19temp"

# download metadata.csv
metadata_filename = 'metadata.csv'
blob_service.get_blob_to_path(
    container_name=container_name,
    blob_name=metadata_filename,
    file_path=metadata_filename
)
Out[2]:
<azure.storage.blob.models.Blob at 0x298a57d4a58>
In [3]:
import pandas as pd

# read metadata.csv into a dataframe
metadata_filename = 'metadata.csv'
metadata = pd.read_csv(metadata_filename)
In [4]:
metadata.head(3)
Out[4]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_pdf_parse has_pmc_xml_parse full_text_file url
0 xqhn0vbp 1e1286db212100993d03cc22374b624f7caee956 PMC Airborne rhinovirus detection and effect of ul... 10.1186/1471-2458-3-5 PMC140314 12525263.0 no-cc BACKGROUND: Rhinovirus, the most common cause ... 2003-01-13 Myatt, Theodore A; Johnston, Sebastian L; Rudn... BMC Public Health NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1 gi6uaa83 8ae137c8da1607b3a8e4c946c07ca8bda67f88ac PMC Discovering human history from stomach bacteria 10.1186/gb-2003-4-5-213 PMC156578 12734001.0 no-cc Recent analyses of human pathogens have reveal... 2003-04-28 Disotell, Todd R Genome Biol NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2 le0ogx1s NaN PMC A new recruit for the army of the men of death 10.1186/gb-2003-4-7-113 PMC193621 12844350.0 no-cc The army of the men of death, in John Bunyan's... 2003-06-27 Petsko, Gregory A Genome Biol NaN NaN False True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...

That's a lot to take in at first glance, so let's apply a little polish.

In [5]:
simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']

def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

def preview(text):
    '''Show only a preview of the text data.'''
    return text[:30] + '...'

format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}

metadata[simple_schema].head().style.format(format_)
Out[5]:
cord_uid source_x title abstract authors full_text_file url
0 xqhn0vbp PMC Airborne rhinovirus detection ... BACKGROUND: Rhinovirus, the mo... Myatt, Theodore A; Johnston, S... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC140314/
1 gi6uaa83 PMC Discovering human history from... Recent analyses of human patho... Disotell, Todd R... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC156578/
2 le0ogx1s PMC A new recruit for the army of ... The army of the men of death, ... Petsko, Gregory A... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC193621/
3 fy4w7xz8 PMC Association of HLA class I wit... BACKGROUND: The human leukocyt... Lin, Marie; Tseng, Hsiang-Kuan... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC212558/
4 0qaoam29 PMC A double epidemic model for th... BACKGROUND: An epidemic of a S... Ng, Tuen Wai; Turinici, Gabrie... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC222908/
In [6]:
# let's take a quick look around
num_entries = len(metadata)
print("There are {} entries in this dataset:".format(num_entries))

metadata_with_text = metadata[metadata['full_text_file'].isna() == False]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))

with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))

with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))

with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))
There are 51078 entries in this dataset:
-- 42511 have full text entries
-- 47741 have DOIs
-- 41082 have PubMed Central (PMC) ids
-- 964 have Microsoft Academic paper ids

Example: Read full text

Notice that metadata.csv does not contain the full text itself. Let's see an example of how to read it. We will locate and unpack the full-text json and convert it to a list of sentences.

In [7]:
# choose a random example with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]

# construct path to blob containing full text
blob_name = '{0}/pdf_json/{1}.json'.format(example_entry['full_text_file'], example_entry['sha'])  # note the repetition in the path
print("Full text blob for this entry:")
print(blob_name)
Full text blob for this entry:
custom_license/pdf_json/f1d1b9694aa43c837d9b758cb2d45d8a24d293e3.json

We can now read the json content associated with this blob as follows.

In [8]:
import json
blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)

# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))
Keys within data: paper_id, metadata, abstract, body_text, bib_entries, ref_entries, back_matter

For the purposes of this example we are interested in the body_text, which stores the text data as follows:

"body_text": [                      # list of paragraphs in full body
    {
        "text": <str>,
        "cite_spans": [             # list of character indices of inline citations
                                    # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                    #      linked to bibliography entry BIBREF3
            {
                "start": 151,
                "end": 154,
                "text": "[7]",
                "ref_id": "BIBREF3"
            },
            ...
        ],
        "ref_spans": <list of dicts similar to cite_spans>,     # e.g. inline reference to "Table 1"
        "section": "Abstract"
    },
    ...
]

The full json schema is available here.
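As a sketch of how the cite_spans fields can be used, the following resolves each inline citation in a paragraph to its bibliography ref_id; the paragraph here is hand-made to match the schema above, not taken from the dataset:

```python
def inline_citations(paragraph):
    """Return (citation text, ref_id) pairs for the inline citations in a body_text paragraph."""
    return [(span['text'], span['ref_id']) for span in paragraph.get('cite_spans', [])]

# illustrative paragraph following the body_text schema above
paragraph = {
    'text': 'Prior work [7] reported similar findings.',
    'cite_spans': [{'start': 11, 'end': 14, 'text': '[7]', 'ref_id': 'BIBREF3'}],
    'ref_spans': [],
    'section': 'Introduction',
}

for text, ref_id in inline_citations(paragraph):
    print(text, '->', ref_id)  # [7] -> BIBREF3
```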

In [9]:
from nltk.tokenize import sent_tokenize

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
An example sentence: It is now widely admitted that actual genomes have a common ancestor (LUCA, Last Universal Common Ancestor).

PDF vs PMC XML Parse

In the above example we looked at a case with has_pdf_parse == True. In that case the blob file path was of the form:

'<full_text_file>/pdf_json/<sha>.json'

Alternatively, for cases with has_pmc_xml_parse == True use the following format:

'<full_text_file>/pmc_json/<pmcid>.xml.json'

For example:

In [10]:
# choose a random example with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]

# construct path to blob containing full text
blob_name = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid'])  # note the repetition in the path
print("Full text blob for this entry:")
print(blob_name)

blob_as_json_string = blob_service.get_blob_to_text(container_name=container_name, blob_name=blob_name)
data = json.loads(blob_as_json_string.content)

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
Full text blob for this entry:
custom_license/pmc_json/PMC546170.xml.json
An example sentence: Double-stranded small interfering RNA (siRNA) molecules have drawn much attention since it was unambiguously shown that they mediate potent gene knock-down in a variety of mammalian cells (1).
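The two path conventions can be folded into one helper that prefers the pmc xml parse when both are available; full_text_blob_name is a hypothetical convenience function sketched from the formats above, and the example row below is assembled from values shown earlier in this notebook:

```python
def full_text_blob_name(entry):
    """Construct the full-text blob name for a metadata row (dict or pandas Series),
    preferring the pmc xml parse over the pdf parse when both exist."""
    if entry['has_pmc_xml_parse']:
        return '{0}/pmc_json/{1}.xml.json'.format(entry['full_text_file'], entry['pmcid'])
    if entry['has_pdf_parse']:
        return '{0}/pdf_json/{1}.json'.format(entry['full_text_file'], entry['sha'])
    return None  # no full-text parse available for this entry

entry = {'has_pmc_xml_parse': True, 'has_pdf_parse': True,
         'full_text_file': 'custom_license', 'pmcid': 'PMC546170',
         'sha': 'f1d1b9694aa43c837d9b758cb2d45d8a24d293e3'}
print(full_text_blob_name(entry))  # custom_license/pmc_json/PMC546170.xml.json
```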

Iterate through blobs directly

In the above examples we used the metadata.csv file to navigate the data, construct the blob file path, and read data from the blob. An alternative is to iterate through the blobs themselves.

In [11]:
# get and sort list of available blobs
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

Now we can iterate through the blobs directly. For example, let's count the number of json files available.

In [12]:
# we can now iterate directly though the blobs
count = 0
for blob in sorted_blobs:
    if blob.name[-5:] == ".json":
        count += 1
print("There are {} json files".format(count))
There are 59784 json files

Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together rather hastily! Here are some data quality issues we've observed.

Multiple shas

We observe that in some cases there are multiple shas for a given entry.

In [13]:
metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]

print("There are {} entries with multiple shas".format(len(metadata_multiple_shas)))

metadata_multiple_shas.head(3)
There are 1999 entries with multiple shas
Out[13]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_pdf_parse has_pmc_xml_parse full_text_file url
20 fpj5urao e9c78584c08ba79d735e150eff98297eb57f12dd; cdb2... PMC Moderate mutation rate in the SARS coronavirus... 10.1186/1471-2148-4-21 PMC446188 15222897.0 no-cc BACKGROUND: The outbreak of severe acute respi... 2004-06-28 Zhao, Zhongming; Li, Haipeng; Wu, Xiaozhuang; ... BMC Evol Biol NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
93 2vlvz5o9 bd92cbae7179f07d59d1ce4d7ca96e37ebb40ec9; 7526... PMC Design of Wide-Spectrum Inhibitors Targeting C... 10.1371/journal.pbio.0030324 PMC1197287 16128623.0 cc-by The genus Coronavirus contains about 25 specie... 2005-09-06 Yang, Haitao; Xie, Weiqing; Xue, Xiaoyu; Yang,... PLoS Biol NaN NaN True True comm_use_subset https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
181 2g4m0dy7 2bd6e33d92632dfcba4056a2d7355ced5b7ab1fd; 7648... PMC Reducing the Impact of the Next Influenza Pand... 10.1371/journal.pmed.0030361 PMC1526768 16881729.0 cc-by BACKGROUND: The outbreak of highly pathogenic ... 2006-08-08 Wu, Joseph T; Riley, Steven; Fraser, Christoph... PLoS Med NaN NaN True True comm_use_subset https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
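In the rows above, the shas in a multi-sha entry appear joined with a semicolon and a space. A small sketch for splitting them, assuming that separator (an observed convention, not a documented guarantee); the example value reuses full sha hashes shown elsewhere in this notebook:

```python
def split_shas(sha_field):
    """Split a metadata 'sha' field into a list of individual sha hashes.
    Entries with multiple pdf parses join their hashes with '; '."""
    if not isinstance(sha_field, str):
        return []  # NaN: no pdf parse for this entry
    return [s.strip() for s in sha_field.split(';')]

shas = split_shas('e9c78584c08ba79d735e150eff98297eb57f12dd; '
                  '1e1286db212100993d03cc22374b624f7caee956')
print(len(shas))  # 2
```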

Layout of the container

Here we use a simple regex to explore the file structure of the container in case this is updated in the future.

In [14]:
container_name = "covid19temp"
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
In [15]:
import re
dirs = {}

pattern = r'(\w+)/(\w+)/([\w.]+)\.json'
for blob in sorted_blobs:
    
    m = re.match(pattern, blob.name)
    
    if m:
        dir_ = m[1] + '/' + m[2]
        
        if dir_ in dirs:
            dirs[dir_] += 1
        else:
            dirs[dir_] = 1
        
dirs
Out[15]:
{'noncomm_use_subset/pmc_json': 2221,
 'noncomm_use_subset/pdf_json': 2507,
 'custom_license/pmc_json': 7854,
 'custom_license/pdf_json': 26799,
 'comm_use_subset/pmc_json': 9169,
 'comm_use_subset/pdf_json': 9539,
 'biorxiv_medrxiv/pdf_json': 1695}
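As a quick sanity check, the per-directory counts above sum to the 59784 json files counted earlier in this notebook:

```python
# per-directory json counts reported above
dirs = {
    'noncomm_use_subset/pmc_json': 2221,
    'noncomm_use_subset/pdf_json': 2507,
    'custom_license/pmc_json': 7854,
    'custom_license/pdf_json': 26799,
    'comm_use_subset/pmc_json': 9169,
    'comm_use_subset/pdf_json': 9539,
    'biorxiv_medrxiv/pdf_json': 1695,
}

print(sum(dirs.values()))  # 59784
```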

The CORD-19 Dataset

CORD-19 is a collection of over 50,000 scholarly articles - including over 40,000 with full text - about COVID-19, SARS-CoV-2, and related coronaviruses. This dataset has been made freely available to help research communities combat the COVID-19 pandemic.

The goal of this notebook is two-fold:

  1. Demonstrate how to access the CORD-19 dataset on Azure: We use AzureML Dataset to provide a context for the CORD-19 data.
  2. Walk through the structure of the dataset: Articles in the dataset are stored as json files. We provide examples showing:

    • How to find the articles (navigating the directory structure)
    • How to read the articles (navigating the json schema)

Dependencies: This notebook requires the following libraries:

  • AzureML Python SDK (e.g. pip install --upgrade azureml-sdk)
  • Pandas (e.g. pip install pandas)
  • NLTK (docs) (e.g. pip install nltk)

Note: if your NLTK installation does not have the punkt package, you will need to run:

import nltk
nltk.download('punkt')

Getting the CORD-19 data from Azure

The CORD-19 data has been uploaded as an Azure Open Dataset here. In this notebook we use AzureML Dataset to reference the CORD-19 open dataset.

In [1]:
import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)
Azure ML SDK Version:  1.2.0
In [2]:
from azureml.core import  Dataset
cord19_dataset = Dataset.File.from_files('https://azureopendatastorage.blob.core.windows.net/covid19temp')
mount = cord19_dataset.mount()

The mount() method creates a context manager for mounting file system streams defined by the dataset as local files.

Use mount.start() and mount.stop(), or alternatively use with mount: to manage the context.

Note: mount is only supported on Unix or Unix-like operating systems, and libfuse must be present. If you are running inside a docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse. For more details, see the docs.

In [3]:
import os

COVID_DIR = '/covid19temp'
path = mount.mount_point + COVID_DIR

with mount:
    print(os.listdir(path))
['antiviral_with_properties_compressed.sdf', 'biorxiv_medrxiv', 'biorxiv_medrxiv_compressed.tar.gz', 'comm_use_subset', 'comm_use_subset_compressed.tar.gz', 'custom_license', 'custom_license_compressed.tar.gz', 'metadata.csv', 'noncomm_use_subset', 'noncomm_use_subset_compressed.tar.gz']

This is the file structure within the CORD-19 dataset together with an example file.

metadata.csv
custom_license/
    pdf_json/
        0001418189999fea7f7cbe3e82703d71c85a6fe5.json        # filename is sha-hash
        ...
    pmc_json/
        PMC1065028.xml.json                                  # filename is the PMC ID
        ...
noncomm_use_subset/
    pdf_json/
        0036b28fddf7e93da0970303672934ea2f9944e7.json
        ...
    pmc_json/
        PMC1616946.xml.json
        ...
comm_use_subset/
    pdf_json/
        000b7d1517ceebb34e1e3e817695b6de03e2fa78.json
        ...
    pmc_json/
        PMC1054884.xml.json
        ...
biorxiv_medrxiv/                                             # note: there is no pmc_json subdir
    pdf_json/
        0015023cc06b5362d332b3baf348d11567ca2fbb.json
        ...

Each .json file corresponds to an individual article in the dataset. This is where the title, authors, abstract and (where available) the full text data is stored.

Using metadata.csv

The CORD-19 dataset comes with metadata.csv - a single file that records basic information on all the papers available in the CORD-19 dataset. This is a good place to start exploring!

In [4]:
import pandas as pd

# create mount context
mount.start()

# specify path to metadata.csv
COVID_DIR = 'covid19temp'
metadata_filename = '{}/{}/{}'.format(mount.mount_point, COVID_DIR, 'metadata.csv')

# read metadata
metadata = pd.read_csv(metadata_filename)
metadata.head(3)
Out[4]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_pdf_parse has_pmc_xml_parse full_text_file url
0 xqhn0vbp 1e1286db212100993d03cc22374b624f7caee956 PMC Airborne rhinovirus detection and effect of ul... 10.1186/1471-2458-3-5 PMC140314 12525263.0 no-cc BACKGROUND: Rhinovirus, the most common cause ... 2003-01-13 Myatt, Theodore A; Johnston, Sebastian L; Rudn... BMC Public Health NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1 gi6uaa83 8ae137c8da1607b3a8e4c946c07ca8bda67f88ac PMC Discovering human history from stomach bacteria 10.1186/gb-2003-4-5-213 PMC156578 12734001.0 no-cc Recent analyses of human pathogens have reveal... 2003-04-28 Disotell, Todd R Genome Biol NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2 le0ogx1s NaN PMC A new recruit for the army of the men of death 10.1186/gb-2003-4-7-113 PMC193621 12844350.0 no-cc The army of the men of death, in John Bunyan's... 2003-06-27 Petsko, Gregory A Genome Biol NaN NaN False True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
In [5]:
simple_schema = ['cord_uid', 'source_x', 'title', 'abstract', 'authors', 'full_text_file', 'url']

def make_clickable(address):
    '''Make the url clickable'''
    return '<a href="{0}">{0}</a>'.format(address)

def preview(text):
    '''Show only a preview of the text data.'''
    return text[:30] + '...'

format_ = {'title': preview, 'abstract': preview, 'authors': preview, 'url': make_clickable}

metadata[simple_schema].head().style.format(format_)
Out[5]:
cord_uid source_x title abstract authors full_text_file url
0 xqhn0vbp PMC Airborne rhinovirus detection ... BACKGROUND: Rhinovirus, the mo... Myatt, Theodore A; Johnston, S... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC140314/
1 gi6uaa83 PMC Discovering human history from... Recent analyses of human patho... Disotell, Todd R... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC156578/
2 le0ogx1s PMC A new recruit for the army of ... The army of the men of death, ... Petsko, Gregory A... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC193621/
3 fy4w7xz8 PMC Association of HLA class I wit... BACKGROUND: The human leukocyt... Lin, Marie; Tseng, Hsiang-Kuan... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC212558/
4 0qaoam29 PMC A double epidemic model for th... BACKGROUND: An epidemic of a S... Ng, Tuen Wai; Turinici, Gabrie... custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC222908/
In [6]:
# let's take a quick look around
num_entries = len(metadata)
print("There are {} entries in this dataset:".format(num_entries))

metadata_with_text = metadata[metadata['full_text_file'].isna() == False]
with_full_text = len(metadata_with_text)
print("-- {} have full text entries".format(with_full_text))

with_doi = metadata['doi'].count()
print("-- {} have DOIs".format(with_doi))

with_pmcid = metadata['pmcid'].count()
print("-- {} have PubMed Central (PMC) ids".format(with_pmcid))

with_microsoft_id = metadata['Microsoft Academic Paper ID'].count()
print("-- {} have Microsoft Academic paper ids".format(with_microsoft_id))
There are 52398 entries in this dataset:
-- 43794 have full text entries
-- 49058 have DOIs
-- 43652 have PubMed Central (PMC) ids
-- 964 have Microsoft Academic paper ids

Example: Read full text

Notice that metadata.csv does not contain the full text itself. Let's see an example of how to read it. We will locate and unpack the full-text json and convert it to a list of sentences.

In [7]:
# choose a random example with pdf parse available
metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
example_entry = metadata_with_pdf_parse.iloc[42]

# construct path to blob containing full text
filepath = '{0}/{1}/pdf_json/{2}.json'.format(path, example_entry['full_text_file'], example_entry['sha'])
print("Full text filepath:")
print(filepath)
Full text filepath:
/tmp/tmp4azhkde4/covid19temp/custom_license/pdf_json/f1d1b9694aa43c837d9b758cb2d45d8a24d293e3.json

We can now read the json content associated with this file as follows.

In [8]:
import json

try:
    with open(filepath, 'r') as f:
        data = json.load(f)
except FileNotFoundError as e:
    # in case the mount context has been closed
    mount.start()
    with open(filepath, 'r') as f:
        data = json.load(f)
        
# in addition to the body text, the metadata is also stored within the individual json files
print("Keys within data:", ', '.join(data.keys()))
Keys within data: paper_id, metadata, abstract, body_text, bib_entries, ref_entries, back_matter

For the purposes of this example we are interested in the body_text, which stores the text data as follows:

"body_text": [                      # list of paragraphs in full body
    {
        "text": <str>,
        "cite_spans": [             # list of character indices of inline citations
                                    # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                    #      linked to bibliography entry BIBREF3
            {
                "start": 151,
                "end": 154,
                "text": "[7]",
                "ref_id": "BIBREF3"
            },
            ...
        ],
        "ref_spans": <list of dicts similar to cite_spans>,     # e.g. inline reference to "Table 1"
        "section": "Abstract"
    },
    ...
]

The full json schema is available here.

In [9]:
from nltk.tokenize import sent_tokenize
# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
An example sentence: It is now widely admitted that actual genomes have a common ancestor (LUCA, Last Universal Common Ancestor).

PDF vs PMC XML Parse

In the above example we looked at a case with has_pdf_parse == True. In that case the file path was of the form:

'<full_text_file>/pdf_json/<sha>.json'

Alternatively, for cases with has_pmc_xml_parse == True use the following format:

'<full_text_file>/pmc_json/<pmcid>.xml.json'

For example:

In [10]:
# choose a random example with pmc parse available
metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
example_entry = metadata_with_pmc_parse.iloc[42]

# construct path to blob containing full text
filename = '{0}/pmc_json/{1}.xml.json'.format(example_entry['full_text_file'], example_entry['pmcid'])  # note the repetition in the path
print("Path to file: {}\n".format(filename))

with open(mount.mount_point + '/' + COVID_DIR + '/' + filename, 'r') as f:
    data = json.load(f)

# the text itself lives under 'body_text'
text = data['body_text']

# many NLP tasks play nicely with a list of sentences
sentences = []
for paragraph in text:
    sentences.extend(sent_tokenize(paragraph['text']))

print("An example sentence:", sentences[0])
Path to file: custom_license/pmc_json/PMC546170.xml.json

An example sentence: Double-stranded small interfering RNA (siRNA) molecules have drawn much attention since it was unambiguously shown that they mediate potent gene knock-down in a variety of mammalian cells (1).

Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together rather hastily! Here are some data quality issues we've observed.

In [11]:
metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]

print("There are {} entries with multiple shas".format(len(metadata_multiple_shas)))

metadata_multiple_shas.head(3)
There are 2047 entries with multiple shas
Out[11]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_pdf_parse has_pmc_xml_parse full_text_file url
20 fpj5urao e9c78584c08ba79d735e150eff98297eb57f12dd; cdb2... PMC Moderate mutation rate in the SARS coronavirus... 10.1186/1471-2148-4-21 PMC446188 15222897.0 no-cc BACKGROUND: The outbreak of severe acute respi... 2004-06-28 Zhao, Zhongming; Li, Haipeng; Wu, Xiaozhuang; ... BMC Evol Biol NaN NaN True True custom_license https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
93 2vlvz5o9 bd92cbae7179f07d59d1ce4d7ca96e37ebb40ec9; 7526... PMC Design of Wide-Spectrum Inhibitors Targeting C... 10.1371/journal.pbio.0030324 PMC1197287 16128623.0 cc-by The genus Coronavirus contains about 25 specie... 2005-09-06 Yang, Haitao; Xie, Weiqing; Xue, Xiaoyu; Yang,... PLoS Biol NaN NaN True True comm_use_subset https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
181 2g4m0dy7 2bd6e33d92632dfcba4056a2d7355ced5b7ab1fd; 7648... PMC Reducing the Impact of the Next Influenza Pand... 10.1371/journal.pmed.0030361 PMC1526768 16881729.0 cc-by BACKGROUND: The outbreak of highly pathogenic ... 2006-08-08 Wu, Joseph T; Riley, Steven; Fraser, Christoph... PLoS Med NaN NaN True True comm_use_subset https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...