While Transformer language models (LMs) are state-of-the-art for information extraction, long text introduces computational challenges requiring suboptimal preprocessing steps or alternative model architectures. Sparse-attention LMs can represent longer sequences, overcoming these performance hurdles. However, it remains unclear how to explain predictions from these models, as not all tokens attend to each other in the self-attention layers, and long sequences pose computational challenges for explainability algorithms whose runtime depends on document length. These challenges are especially severe in the medical context, where documents can be very long and machine learning (ML) models must be auditable and trustworthy. We introduce a novel Masked Sampling Procedure (MSP) to identify the text blocks that contribute to a prediction, apply MSP in the context of predicting diagnoses from medical text, and validate our approach with a blind review by two clinicians. Our method identifies about 1.7x more clinically informative text blocks than the previous state-of-the-art, runs up to 100x faster, and is tractable for generating important phrase pairs. MSP is particularly well-suited to long LMs but can be applied to any text classifier. We provide a general implementation of MSP.
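The abstract describes MSP only at a high level: random blocks of text are masked, the classifier is re-run, and blocks whose removal most degrades the prediction are deemed important. The sketch below is a minimal, hypothetical rendering of such a block-masking attribution loop consistent with that description, not the authors' reference implementation. The function name `masked_sampling_procedure`, the black-box `predict_fn` callback, and the parameters `block_size`, `mask_prob`, `num_samples`, and `mask_token` are all illustrative assumptions.

```python
import random
from typing import Callable, List

import numpy as np


def masked_sampling_procedure(
    predict_fn: Callable[[List[str]], np.ndarray],
    tokens: List[str],
    block_size: int = 10,
    mask_prob: float = 0.1,
    num_samples: int = 100,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> np.ndarray:
    """Estimate per-block importance scores for a black-box text classifier.

    Repeatedly masks random fixed-size blocks of ``tokens`` and measures how
    the predicted label probabilities drop relative to the unmasked input.
    Returns an array of shape (num_blocks, num_labels): the average drop in
    each label's probability when the block is masked.
    """
    rng = random.Random(seed)
    blocks = [tokens[i : i + block_size] for i in range(0, len(tokens), block_size)]
    num_blocks = len(blocks)

    # Label probabilities for the full, unmasked document.
    baseline = predict_fn(tokens)
    num_labels = baseline.shape[0]

    drop_sums = np.zeros((num_blocks, num_labels))
    mask_counts = np.zeros(num_blocks)

    for _ in range(num_samples):
        # Each block is masked independently with probability mask_prob,
        # so a single forward pass perturbs several blocks at once.
        masked = [rng.random() < mask_prob for _ in range(num_blocks)]
        if not any(masked):
            continue  # nothing masked this draw; no information gained
        perturbed: List[str] = []
        for block, is_masked in zip(blocks, masked):
            perturbed.extend([mask_token] * len(block) if is_masked else block)
        probs = predict_fn(perturbed)
        for b, is_masked in enumerate(masked):
            if is_masked:
                drop_sums[b] += baseline - probs
                mask_counts[b] += 1

    # Average probability drop per block; blocks never masked score zero.
    counts = np.maximum(mask_counts, 1)[:, None]
    return drop_sums / counts
```

Under this reading, the sample budget `num_samples`, not the document length, drives the number of forward passes, which is one plausible source of the runtime advantage over per-token perturbation baselines; and because each draw masks several blocks jointly, extending the bookkeeping to co-masked block pairs (not shown here) would be one way to score important phrase pairs.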