
CoronAI: Deep Information Extraction for COVID-19-related Articles

An unsupervised approach to mine the CORD-19 dataset of coronavirus articles

Shayan Fazeli
Towards Data Science
Mar 21, 2020


Introduction

Our world has seen many viruses and outbreaks, but few of them come close to COVID-19 in terms of global impact. Nevertheless, I am optimistic that measures such as California’s “safer at home” order and social distancing will help eradicate this virus entirely. We have lost many great people to this disease, and it shows no sign of making it any easier for us to stop it.

In response to the White House’s call to action regarding the use of AI in COVID-19 research, the COVID-19 Open Research Dataset (CORD-19) has been released by AllenAI, Microsoft Research, the Chan Zuckerberg Initiative, NIH, and the White House. The dataset includes around 44,000 research articles covering various topics related to coronaviruses (such as SARS, MERS, etc.).

Since the dataset comes with no labels, the main tasks defined on it are generic questions such as “what information do we have regarding vaccinations?”. There is also great work on organizing the papers by topic, such as this.

In this post, we introduce a small tool that applies natural language processing to the unsupervised clustering of topics in these articles, so that we can get a better sense of what is going on in them.

Text Segments: Atoms of This Dataset

Every document is composed of different sections, and the section titles are among the fields we do have in the released dataset. Above the level of individual words, the smallest units that preserve semantics are text segments and sentences. This post shows how these can be pooled into one bucket to help with unsupervised clustering of the focus points in these articles. Later on, given the different sections and the inter-section or inter-paper affinity (perhaps even using graph convolutional networks, graph attention networks, etc.), or the topic-organization work mentioned above, better clustering techniques can be applied to obtain finer results.

We use the Natural Language Toolkit’s (NLTK) sentence tokenizer to break every article down into these atomic segments, and then generate state-of-the-art representations for them to be used in the later stages of our pipeline.
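
As a minimal sketch of this segmentation step (the sample paragraph is illustrative):

    import nltk
    from nltk.tokenize import sent_tokenize

    # The Punkt model backs NLTK's sentence tokenizer.
    nltk.download("punkt")

    paragraph = (
        "Coronaviruses are a large family of viruses. "
        "Some of them cause respiratory illness in humans."
    )

    # Break the paragraph down into sentence-level segments.
    segments = sent_tokenize(paragraph)
    print(segments)  # ['Coronaviruses are ...', 'Some of them ...']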

Text Representation: A BERT-based Approach

Using our word_embedding_warehouse package, you can easily fetch and process the pre-trained BERT-based representation weights for use in PyTorch or TensorFlow (the documentation is available here). The code in it is easy to read and should be straightforward to follow.

To use it, obtain the package from GitHub using:
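
(The repository URL below is an assumption based on the package name.)

    git clone https://github.com/shayanfazeli/word_embedding_warehouse.git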

Then change into its directory and install it:
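
(Assuming a standard setuptools layout.)

    cd word_embedding_warehouse
    pip install -e .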

Now you can run it and fetch the data. Make sure you choose a folder with plenty of free space, and run the following:
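
(The entry point and flag below are assumptions; check the repository’s README for the exact interface.)

    # hypothetical entry point and flag name
    python -m word_embedding_warehouse --output_folder=/path/to/warehouse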

The representations will be created in that folder. One thing to note: heed the “dropout” values in the configurations; you might need to manually set them to zero so that inference is deterministic.
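
A minimal sketch of that adjustment, assuming a Google-style BERT checkpoint whose configuration lives in a bert_config.json file:

    import json

    # Path to the configuration inside the fetched weights folder (adjust as needed).
    path = "NCBI_BERT_pubmed_mimic_uncased_L/bert_config.json"

    with open(path) as handle:
        config = json.load(handle)

    # Zero both dropout probabilities so that encoding a segment twice
    # yields identical representations at inference time.
    config["hidden_dropout_prob"] = 0.0
    config["attention_probs_dropout_prob"] = 0.0

    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)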

Pre-processing, Data Preparation, and Generating Representations

Our CoronAI package and its applications can be used to generate the representations. Currently, I have only used the large version of NCBI-BERT (since renamed to BlueBERT) for this purpose. These weights are pre-trained on the PubMed archive and the MIMIC-III public repository of electronic health records; it is therefore plausible to assume that the model is fairly well equipped to deal with the medical domain.

These weights are generated as described in the previous section. Now, proceed to get the CoronAI package with the following command:
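
(As before, the repository URL is assumed from the package name.)

    git clone https://github.com/shayanfazeli/coronai.git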

Then change into its folder and install the package:
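
(Again assuming a standard setuptools layout.)

    cd coronai
    pip install -e .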

The documentation for the different sections of this repository is available at this link.

At this point, you can use the applications to generate the representations and perform unsupervised clustering.

Generating BERT-based Representations

You can generate the BERT-based representations with the following commands (for more information, please visit the GitHub repository for CoronAI).
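
(The application name and flags below are assumptions meant to illustrate the shape of the call; the repository’s README has the exact interface.)

    # hypothetical script and flag names
    python coronai/applications/segment2vector.py \
        --path_to_input_csv=cord19_segments.csv \
        --path_to_bert_weights=/path/to/NCBI_BERT_pubmed_mimic_uncased_L \
        --path_to_output_pkl=segment_vectors.pkl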

Note that the `NCBI_BERT_pubmed_mimic_uncased_L` representations can be downloaded from CoronAI’s shared drive (link at the bottom of this article).

Unsupervised Segment Group Discovery

For this purpose, another CoronAI application can be used. The motivation for this section is that the current tasks designed on this dataset are generic ones such as “what do we know about its vaccination?”. Using this approach, however, we can get a better sense of what is actually included in the dataset. We can start with a small number of clusters, then increase it while watching for the point at which the distance-based loss starts to plateau.

Impact of the number of clusters on the distance-based loss
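
As a minimal sketch of this elbow search, here is what it looks like with scikit-learn’s MiniBatchKMeans (the input file name and cluster range are illustrative, and this is not necessarily the exact procedure the CoronAI application implements):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # One BERT-based vector per text segment,
    # e.g. a (num_segments, 1024) array for the large model.
    segment_vectors = np.load("segment_vectors.npy")

    cluster_counts = range(50, 501, 50)
    losses = []
    for k in cluster_counts:
        model = MiniBatchKMeans(n_clusters=k, random_state=0)
        model.fit(segment_vectors)
        # inertia_ is the within-cluster sum of squared distances:
        # the distance-based loss whose plateau we are looking for.
        losses.append(model.inertia_)

    for k, loss in zip(cluster_counts, losses):
        print(k, loss)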

Groups: Interpretation

As an example of interpreting the groups, for the 296-cluster case a word cloud was created for every cluster. I then asked a friend of mine who is a medical practitioner to take a look at them. Here is a sample cluster and a possible explanation:

Cluster 35: the semantics of this group revolve mainly around the concept of vaccination
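
For reference, rendering such a cloud takes only a few lines with the wordcloud package (the sample segments below are made up):

    from wordcloud import WordCloud

    # Stand-in for the text segments that the clustering step assigned to one group.
    segments_in_cluster = [
        "the candidate vaccine induced neutralizing antibodies in mice",
        "a single immunization protected against lethal challenge",
    ]

    # Concatenate the cluster's segments and render a word cloud image.
    text = " ".join(segments_in_cluster)
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    cloud.to_file("cluster_wordcloud.png")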

A larger list of examples is available on this page.

CoronAI’s Shared Drive

Please view here.

P.S. The code and this post will be edited as time goes by, since they were written in haste. Any contribution is welcome as well. Thank you.

References

[1] Gardner, Matt, et al. “AllenNLP: A deep semantic natural language processing platform.” arXiv preprint arXiv:1803.07640 (2018).

[2] Lee, Jinhyuk, et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics 36.4 (2020): 1234–1240.

[3] Peng, Yifan, Shankai Yan, and Zhiyong Lu. “Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets.” arXiv preprint arXiv:1906.05474 (2019).

[4] COVID-19 Open Research Dataset and Challenge: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/kernels (2020).
