Most work in relation extraction forms a prediction by looking at a short span of text within a single sentence containing a single entity pair mention. This approach often does not consider interactions across mentions, requires redundant computation for each mention pair, and ignores relationships expressed across sentence boundaries. These problems are exacerbated by the document- (rather than sentence-) level annotation common in biological text. In response, we propose a model which simultaneously predicts relationships between all mention pairs in a document. We form pairwise predictions over entire paper abstracts using an efficient self-attention encoder. All-pairs mention scores allow us to perform multi-instance learning by aggregating over mentions to form entity pair representations. We further adapt to settings without mention-level annotation by jointly training to predict named entities and by adding a corpus of weakly labeled data. In experiments on two BioCreative benchmark datasets, we achieve state-of-the-art performance on the BioCreative V Chemical Disease Relation dataset for models without external KB resources. We also introduce a new dataset that is an order of magnitude larger than existing human-annotated biological information extraction datasets and more accurate than distantly supervised alternatives.
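The multi-instance aggregation step described above (pooling all-pairs mention scores into a single entity-pair score) can be sketched as follows. This is a minimal illustration, assuming log-sum-exp pooling as the aggregation function; the function name and toy scores are hypothetical, not taken from the paper.

```python
import math

def aggregate_entity_pair_score(mention_pair_scores):
    """Pool per-mention-pair relation scores into one entity-pair score.

    Log-sum-exp acts as a smooth maximum: the pooled score is dominated
    by the most confident mention pair, while every mention pair still
    receives gradient during training (the multi-instance setting, where
    only document-level entity-pair labels are available).
    """
    m = max(mention_pair_scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in mention_pair_scores))

# Toy example: three mentions of the same (chemical, disease) entity
# pair in one abstract, each scored for one relation by the encoder.
scores = [0.2, 2.5, -1.0]
pooled = aggregate_entity_pair_score(scores)
```

The pooled value always lies between the maximum mention-pair score and that maximum plus log of the number of mention pairs, so a single strong mention suffices to produce a confident entity-pair prediction.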