BioRED: A Rich Biomedical Relation Extraction Dataset

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE focus only on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then we present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient, and robust RE systems for biomedicine. The BioRED dataset and annotation guideline are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
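The F-scores quoted above are the harmonic mean of precision and recall. As a minimal sketch of how such a score could be computed for document-level relations, the Python below compares a set of predicted relation tuples against a gold set. The tuple layout (document id, two entity ids, relation type), the identifiers, and the labels are all hypothetical placeholders, not the BioRED file format or its official evaluation script.

```python
from typing import Set, Tuple

# Hypothetical schema: (document id, entity 1 id, entity 2 id, relation type).
Relation = Tuple[str, str, str, str]


def f_score(gold: Set[Relation], predicted: Set[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) using exact tuple matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold annotations for two documents.
gold = {
    ("doc1", "GENE:1", "DISEASE:1", "associated"),
    ("doc1", "CHEM:1", "CHEM:2", "associated"),
    ("doc2", "GENE:2", "DISEASE:2", "increases"),
    ("doc2", "CHEM:3", "DISEASE:2", "decreases"),
}

# Made-up system output: two correct relations and one spurious one.
predicted = {
    ("doc1", "GENE:1", "DISEASE:1", "associated"),
    ("doc2", "GENE:2", "DISEASE:2", "increases"),
    ("doc2", "CHEM:3", "GENE:2", "associated"),
}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching on the full tuple means a prediction with the right entity pair but the wrong relation type counts as an error; a more lenient setting could be approximated by dropping the type field before comparing.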

