Background: Extractive question-answering (EQA) is a useful natural language processing (NLP) application for answering patient-specific questions by locating answers in their clinical notes. Realistic clinical EQA questions can have multiple answers and can contain multiple focus points, but existing datasets for developing artificial intelligence solutions lack such cases. Objective: To create a dataset for developing and evaluating clinical EQA systems that can handle natural multi-answer and multi-focus questions. Methods: We leveraged the annotated relations from the 2018 National NLP Clinical Challenges (n2c2) corpus to generate an EQA dataset. Specifically, the 1-to-N, M-to-1, and M-to-N drug-reason relations were included to form the multi-answer and multi-focus QA entries, which represent more complex and natural challenges in addition to the basic one-drug-one-reason cases. A baseline solution was developed and tested on the dataset. Results: The derived RxWhyQA dataset contains 96,939 QA entries. Among the answerable questions, 25% require multiple answers, and 2% ask about multiple drugs within one question. Cues are frequently observed around the answers in the text, and 90% of the drug and reason terms occur within the same or an adjacent sentence. The baseline EQA solution achieved a best F1-measure of 0.72 on the entire dataset. On specific subsets, the F1-measure was 0.93 on the unanswerable questions, 0.48 on single-drug questions versus 0.60 on multi-drug questions, and 0.54 on single-answer questions versus 0.43 on multi-answer questions. Discussion: The RxWhyQA dataset can be used to train and evaluate systems that need to handle multi-answer and multi-focus questions. In particular, multi-answer EQA appears to be challenging and therefore warrants more investment in research.
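The relation types named in the Methods (1-to-1, 1-to-N, M-to-1, M-to-N) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the annotated relations of one note are available as plain `(drug, reason)` string pairs, groups them into connected components, and emits one QA entry per group. The function names, the entry fields, the sample pairs, and the question template are all hypothetical.

```python
from collections import defaultdict


def group_relations(pairs):
    """Group (drug, reason) pairs into connected components of the
    bipartite drug-reason graph, using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for drug, reason in pairs:
        # Tag nodes so a drug and a reason with the same text stay distinct.
        parent[find(("drug", drug))] = find(("reason", reason))

    groups = defaultdict(lambda: {"drugs": set(), "reasons": set()})
    for drug, reason in pairs:
        root = find(("drug", drug))
        groups[root]["drugs"].add(drug)
        groups[root]["reasons"].add(reason)
    return list(groups.values())


def to_qa_entry(group):
    """Turn one group into a QA entry labeled with its relation type."""
    drugs = sorted(group["drugs"])
    reasons = sorted(group["reasons"])
    left = "M" if len(drugs) > 1 else "1"
    right = "N" if len(reasons) > 1 else "1"
    # Hypothetical question template, used only for illustration.
    question = "Why was the patient prescribed " + " and ".join(drugs) + "?"
    return {
        "question": question,
        "answers": reasons,  # more than one item means multi-answer
        "relation_type": f"{left}-to-{right}",
    }


# Made-up example relations for a single note.
pairs = [
    ("aspirin", "chest pain"),
    ("aspirin", "headache"),
    ("lasix", "edema"),
    ("metoprolol", "hypertension"),
    ("lisinopril", "hypertension"),
]
entries = [to_qa_entry(g) for g in group_relations(pairs)]
```

In this sketch a group with several reasons yields a multi-answer entry and a group with several drugs yields a multi-focus question. Unanswerable questions and answer character offsets in the note text are not covered.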