Accelerating COVID-19 research with graph mining and transformer-based learning

In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," in which artificial intelligence experts were asked to collect data and develop text mining techniques that can help the scientific community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present two automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. Both are based on graph mining and the transformer model. The systems are extensively validated using retrospective information rediscovery and proactive analysis with human-in-the-loop expert review. Both systems achieve high-quality predictions across domains (up to 0.97 ROC AUC in some domains) with fast computation times, and are released to the broad scientific community to accelerate biomedical research. In addition, through a study curated by domain experts, we show that the systems can discover ongoing research findings, such as the relationship between COVID-19 and the hormone oxytocin.
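To make the ROC AUC figure concrete, here is a minimal sketch of how such a score can be computed for a ranking of candidate connections. It assumes a retrospective setup in which connections that were later published are labeled 1 and sampled unpublished pairs are labeled 0. The function name `roc_auc`, the labels, and the scores are all illustrative and are not taken from AGATHA-C or AGATHA-GP.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = connection later confirmed in the literature,
# 0 = sampled pair with no published connection (assumed labeling scheme).
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical plausibility scores produced by a hypothesis generation model.
scores = [0.92, 0.81, 0.40, 0.55, 0.30, 0.10]

auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

A value of 1.0 means every confirmed connection is ranked above every unconfirmed pair, and 0.5 corresponds to a random ranking. The score is therefore a number between 0 and 1 rather than a percentage.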
