Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations

Evidence-based medicine, the practice in which healthcare professionals refer to the best available evidence when making decisions, forms the foundation of modern healthcare. However, it relies on labour-intensive systematic reviews, in which domain specialists must aggregate and extract information from thousands of publications, primarily reports of randomised controlled trial (RCT) results, into evidence tables. This paper investigates automating evidence table generation by decomposing the problem into two language processing tasks: \textit{named entity recognition}, which identifies key entities within text, such as drug names, and \textit{relation extraction}, which maps the relationships between those entities so that they can be separated into ordered tuples. We focus on the automatic tabulation of sentences from published RCT abstracts that report the results of the study outcomes. Two deep neural network models were developed as part of a joint extraction pipeline, using the principles of transfer learning and transformer-based language representations. To train and test these models, a new gold-standard corpus was developed, comprising almost 600 result sentences from six disease areas. The system performed well across multiple natural language processing tasks and disease areas, and generalised to disease domains unseen during training. Furthermore, we show that these results were achievable by training our models on as few as 200 example sentences. The final system is a proof of concept that the generation of evidence tables can be semi-automated, representing a step towards fully automating systematic reviews.
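
To make the decomposition concrete, the following minimal sketch shows how the outputs of the two tasks might be joined into evidence-table rows. It is an illustration only, not the system described here: the entity labels (`OUTCOME`, `ARM`, `MEASURE`), the `Entity` and `Relation` types, the `tabulate` helper, and the example sentence are all assumptions, and the entities and relations are written by hand where the trained models would supply them.

```python
from dataclasses import dataclass

# Hypothetical label set; the real annotation scheme is not given in the abstract.
@dataclass(frozen=True)
class Entity:
    text: str
    label: str   # "OUTCOME", "ARM" or "MEASURE" in this sketch
    start: int   # character offset, so repeated values stay distinct

@dataclass(frozen=True)
class Relation:
    head: Entity  # in this sketch, always a MEASURE
    tail: Entity  # the ARM or OUTCOME that the measure belongs to

def tabulate(relations):
    """Join binary relations on their shared measure into ordered tuples."""
    arm_of, outcome_of = {}, {}
    for rel in relations:
        if rel.head.label != "MEASURE":
            continue
        if rel.tail.label == "ARM":
            arm_of[rel.head] = rel.tail
        elif rel.tail.label == "OUTCOME":
            outcome_of[rel.head] = rel.tail
    rows = []
    for measure, arm in arm_of.items():
        outcome = outcome_of.get(measure)
        if outcome is not None:  # keep only measures linked to both
            rows.append((outcome.text, arm.text, measure.text))
    return sorted(rows)

# Made-up result sentence, annotated by hand for illustration.
sentence = "Mean pain score was 3.1 with drug A and 4.6 with placebo."
outcome = Entity("Mean pain score", "OUTCOME", 0)
m1, arm1 = Entity("3.1", "MEASURE", 20), Entity("drug A", "ARM", 29)
m2, arm2 = Entity("4.6", "MEASURE", 40), Entity("placebo", "ARM", 49)

relations = [
    Relation(m1, arm1), Relation(m1, outcome),
    Relation(m2, arm2), Relation(m2, outcome),
]

rows = tabulate(relations)
for row in rows:
    print(" | ".join(row))
```

Each printed line corresponds to one row of an evidence table, here in the assumed column order outcome, trial arm, reported value. Joining on the measure entity is one simple design choice among several; a relation scheme with different arity or direction would need a different join.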
