Relation extraction in the biomedical domain is a challenging task due to a lack of labeled data and a long-tail distribution of fact triples. Many works leverage distant supervision, which automatically generates labeled data by pairing a knowledge graph with raw textual data. Distant supervision produces noisy labels and requires additional techniques, such as multi-instance learning (MIL), to denoise the training signal. However, MIL requires multiple instances per entity pair and therefore struggles with very long-tail datasets such as those found in the biomedical domain. In this work, we propose a novel reformulation of MIL for biomedical relation extraction that abstractifies biomedical entities into their corresponding semantic types. By grouping entities by type, we are better able to take advantage of the benefits of MIL and further denoise the training signal. We show that this reformulation, which we refer to as abstractified multi-instance learning (AMIL), improves performance in biomedical relation extraction. We also propose a novel relationship embedding architecture that further improves model performance.
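The grouping idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy sentences, the `semantic_type` lookup, the type labels, and both function names are hypothetical, and bagging by the pair of head and tail semantic types is only one plausible reading of "grouping entities by types".

```python
from collections import defaultdict

# Hypothetical toy data, not from the paper: (head entity, tail entity, sentence).
instances = [
    ("aspirin", "headache", "Aspirin relieved the headache."),
    ("ibuprofen", "fever", "Ibuprofen reduced the fever."),
    ("aspirin", "headache", "Patients took aspirin for headache."),
    ("metformin", "diabetes", "Metformin is prescribed for diabetes."),
]

# Hypothetical entity -> semantic type lookup; the labels are illustrative only.
semantic_type = {
    "aspirin": "Pharmacologic Substance",
    "ibuprofen": "Pharmacologic Substance",
    "metformin": "Pharmacologic Substance",
    "headache": "Sign or Symptom",
    "fever": "Sign or Symptom",
    "diabetes": "Disease or Syndrome",
}


def bags_by_entity_pair(instances):
    """Conventional MIL bagging: one bag per (head, tail) entity pair."""
    bags = defaultdict(list)
    for head, tail, sentence in instances:
        bags[(head, tail)].append(sentence)
    return dict(bags)


def bags_by_type_pair(instances, semantic_type):
    """Abstractified bagging (assumed form): one bag per pair of semantic types."""
    bags = defaultdict(list)
    for head, tail, sentence in instances:
        key = (semantic_type[head], semantic_type[tail])
        bags[key].append(sentence)
    return dict(bags)


entity_bags = bags_by_entity_pair(instances)
type_bags = bags_by_type_pair(instances, semantic_type)

# Compare how many sentences each bag holds under the two schemes.
for key, sentences in entity_bags.items():
    print("entity bag", key, len(sentences))
for key, sentences in type_bags.items():
    print("type bag", key, len(sentences))
```

In this toy setting, rare entity pairs that would each form a single-sentence bag are pooled with other pairs of the same types, which is the situation where MIL-style aggregation over a bag has more than one instance to work with.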