Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g., gene or disease). We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to improve generalization. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO scheme. We evaluate AIONER on 14 BioNER benchmark tasks and show that it is effective and robust, and that it compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER on three independent tasks that involve recognizing entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g., the entire PubMed data).