We show that large language models, such as GPT-3, perform well at zero-shot information extraction from clinical text despite not being trained specifically for the clinical domain. We present several examples showing how to use these models as tools for the diverse tasks of (i) concept disambiguation, (ii) evidence extraction, (iii) coreference resolution, and (iv) concept extraction, all on clinical text. The key to good performance is the use of simple task-specific programs that map from the language model outputs to the label space of the task. We refer to these programs as resolvers, a generalization of the verbalizer, which defines a mapping between output tokens and a discrete label space. We show in our examples that good resolvers share common components (e.g., "safety checks" that ensure the language model outputs faithfully match the input data), and that the common patterns across tasks make resolvers lightweight and easy to create. To better evaluate these systems, we also introduce two new datasets for benchmarking zero-shot clinical information extraction based on manual relabeling of the CASI dataset (Moon et al., 2014) with labels for new tasks. On the clinical extraction tasks we studied, the GPT-3 + resolver systems significantly outperform existing zero- and few-shot baselines.
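To make the resolver idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It shows two hypothetical resolvers: `resolve_label` maps free-text model output onto a discrete label set (in the spirit of a verbalizer, but over whole strings rather than single tokens), and `resolve_spans` applies a "safety check" that keeps only extracted spans appearing in the input text. All function names, the `;` separator, and the normalization rules are assumptions made for illustration.

```python
import re
from typing import List, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so matching is lenient."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def resolve_label(output: str, labels: Sequence[str]) -> Optional[str]:
    """Map free-text model output to exactly one candidate label.

    Returns None when no label, or more than one label, is mentioned,
    so that ambiguous outputs are never silently mapped to a class.
    """
    padded = f" {normalize(output)} "
    hits = [lab for lab in labels if f" {normalize(lab)} " in padded]
    return hits[0] if len(hits) == 1 else None


def resolve_spans(output: str, source: str, sep: str = ";") -> List[str]:
    """Split model output into spans and keep only those found in the source.

    This is the hypothetical "safety check": any span the model produced
    that does not occur (case-insensitively) in the input is discarded.
    """
    source_lower = source.lower()
    kept: List[str] = []
    for candidate in output.split(sep):
        candidate = candidate.strip().strip(".\"'")
        if candidate and candidate.lower() in source_lower and candidate not in kept:
            kept.append(candidate)
    return kept
```

For example, `resolve_label` could back a concept disambiguation task by passing the candidate expansions of an abbreviation as `labels`, while `resolve_spans` could back concept extraction by passing the clinical note as `source`.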