Clinical language processing has received considerable attention in recent years, yielding new models and methods for disease phenotyping, mortality prediction, and other tasks. Unfortunately, many of these approaches are evaluated under different experimental settings (e.g., data sources, training and testing splits, metrics, and evaluation criteria), making it difficult to compare approaches and determine the state of the art. To address these issues and facilitate reproducibility and comparison, we present the Clinical Language Understanding Evaluation (CLUE) benchmark, comprising four clinical language understanding tasks, standard training, development, validation, and testing sets derived from MIMIC data, and a software toolkit. It is our hope that these resources will enable direct comparison between approaches, improve reproducibility, and reduce the barrier to entry for developing novel models and methods for these clinical language understanding tasks.