Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmark datasets for biomedical RE focus only on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. We then present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease, chemical-chemical) annotated at the document level on a set of 600 PubMed abstracts. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement on the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can facilitate the development of more accurate, efficient, and robust RE systems for biomedicine. The BioRED dataset and annotation guideline are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.