The latest threat to global health is the COVID-19 outbreak. Although large datasets of chest X-rays (CXR) and computed tomography (CT) scans exist, few COVID-19 image collections are currently available because of patient privacy concerns. At the same time, the number of COVID-19-relevant articles in the biomedical literature is growing rapidly. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images that are automatically extracted from COVID-19-relevant articles in the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, their associated captions, and the relevant figure descriptions in each article, and separated compound figures into subfigures. We also designed a deep-learning (DL) model to distinguish CXR and CT images from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, contributes to improved DL performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza and trained a DL baseline to distinguish among COVID-19, influenza, and normal or other types of diseases on CT. (3) We trained an unsupervised one-class classifier on non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared the clinical symptoms and clinical findings of COVID-19 with those of influenza to demonstrate how the two diseases differ in the scientific literature. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.
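The one-class anomaly-detection idea in case study (3) can be illustrated with a minimal sketch. This is not the model used in the work: it swaps in a simple standardized-distance detector and runs on synthetic feature vectors, purely to show the pattern of fitting on the normal class only and flagging samples that fall far from it. All function names, the 8-dimensional features, and the 0.95 quantile threshold are assumptions made for illustration.

```python
import math
import random


def fit_one_class(normal_vectors):
    """Estimate per-feature mean and standard deviation from normal-only data."""
    n = len(normal_vectors)
    dim = len(normal_vectors[0])
    means = [sum(v[j] for v in normal_vectors) / n for j in range(dim)]
    stds = []
    for j in range(dim):
        var = sum((v[j] - means[j]) ** 2 for v in normal_vectors) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    return means, stds


def anomaly_score(vector, means, stds):
    """Standardized Euclidean distance from the centre of the normal class."""
    return math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(vector, means, stds))
    )


def pick_threshold(normal_vectors, means, stds, quantile=0.95):
    """Use a quantile of the training scores as the decision threshold (assumed choice)."""
    scores = sorted(anomaly_score(v, means, stds) for v in normal_vectors)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def is_anomalous(vector, means, stds, threshold):
    """Flag a sample whose score exceeds the threshold learned from normal data."""
    return anomaly_score(vector, means, stds) > threshold


# Synthetic stand-ins for image feature vectors (hypothetical, not real CXR data).
rng = random.Random(0)
normal_train = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

means, stds = fit_one_class(normal_train)
threshold = pick_threshold(normal_train, means, stds)

shifted_sample = [6.0] * 8    # far from the normal class in every feature
typical_sample = list(means)  # sits exactly at the centre of the normal class

print(is_anomalous(shifted_sample, means, stds, threshold))
print(is_anomalous(typical_sample, means, stds, threshold))
```

In a real setting, the feature vectors would come from an image encoder, and the threshold would be tuned on held-out data rather than fixed at a training quantile.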