Retrieving answer passages from long documents is a complex task requiring semantic understanding of both discourse and document context. We approach this challenge in a clinical scenario, where doctors retrieve cohorts of patients based on diagnoses and other latent medical aspects. We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching. In addition, we contribute a novel retrieval dataset based on clinical notes to simulate this scenario on a large corpus of clinical notes. We apply our objective to four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders. From our extensive evaluation on MIMIC-III and three other healthcare datasets, we report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and generalizes effectively across rule-based and human-labeled passages. This makes the model especially powerful in zero-shot scenarios where only limited training data is available.