Relation extraction in the biomedical domain is challenging because labeled data is scarce and annotation is costly, requiring domain experts. Distant supervision is commonly used to address this scarcity by automatically aligning knowledge graph relationships with raw text. Such a pipeline is prone to noise and becomes harder to scale when it must cover a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships, ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and the benchmarks that have no train-test leakage focus on interactions between narrower entity types. This work presents MedDistant19, a more accurate benchmark for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings; it is obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Because thorough evaluation with domain-specific language models has been lacking, we also conduct experiments that validate findings from general-domain relation extraction on biomedical relation extraction.
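As a rough illustration of the two ideas in the abstract, the sketch below pairs knowledge graph triples with a sentence by naive string matching and measures how many test triples also occur in training. It is a minimal sketch under stated assumptions, not the benchmark's construction procedure: the function names, the substring-matching heuristic, the toy triples, and the exact definition of overlap are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of a knowledge graph fact: (head, relation, tail).
Triple = Tuple[str, str, str]


def distant_label(sentence: str, triples: Iterable[Triple]) -> List[Triple]:
    """Return every triple whose head and tail both occur in the sentence.

    This is the naive distant supervision assumption: if both entities are
    mentioned together, the sentence is taken to express the relation.
    Case-insensitive substring matching is an illustrative simplification
    and is one source of the label noise the abstract mentions.
    """
    text = sentence.lower()
    return [t for t in triples if t[0].lower() in text and t[2].lower() in text]


def triple_overlap(train: Iterable[Triple], test: Iterable[Triple]) -> float:
    """Fraction of unique test triples that also appear among the training triples."""
    train_set = set(train)
    test_set = set(test)
    if not test_set:
        return 0.0
    return len(test_set & train_set) / len(test_set)


# Toy data, invented for this example only.
train_triples = [
    ("aspirin", "treats", "headache"),
    ("insulin", "treats", "diabetes"),
]
test_triples = [
    ("aspirin", "treats", "headache"),
    ("metformin", "treats", "diabetes"),
]

labels = distant_label("Aspirin is often given for headache.", train_triples)
leakage = triple_overlap(train_triples, test_triples)
```

Under this definition, a nonzero `leakage` value means a model could score well on the test set by memorizing training triples instead of reading the text, which is the concern raised about the existing benchmarks.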