Extracting biomedical relations from large corpora of scientific documents is a challenging natural language processing task. Existing approaches usually focus on identifying a relation either in a single sentence (mention-level) or across an entire corpus (pair-level). In both cases, recent methods have achieved strong results by learning a point estimate to represent the relation; this is then used as the input to a relation classifier. However, the relation expressed in text between a pair of biomedical entities is often more complex than can be captured by a point estimate. To address this issue, we propose a latent variable model with an arbitrarily flexible distribution to represent the relation between an entity pair. Additionally, our model provides a unified architecture for both mention-level and pair-level relation extraction. We demonstrate that our model achieves results competitive with strong baselines for both tasks while having fewer parameters and being significantly faster to train. We make our code publicly available.