Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, many open questions remain about tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). In a secondary use application, we explored the prediction of COVID-19 test results using structured patient data (e.g., vital signs and laboratory results) and automatically extracted symptom information. The automatically extracted symptoms improve prediction performance beyond that achieved with structured data alone.