Tools for exploring the scientific literature are essential for scientists, especially in biomedicine, where about a million new papers are published every year. Many such tools let users search for specific entities (e.g., proteins, diseases) by tracking their mentions in papers. PubMed, the best-known database of biomedical papers, relies on human curators to add these annotations. This can take several weeks for new papers, and not all papers get tagged. Machine learning models have been developed to facilitate the semantic indexing of scientific papers. However, their performance on the more comprehensive ontologies of biomedical concepts does not reach the levels of typical entity recognition problems studied in NLP. In large part, this is because the setting is low-resource: the ontologies are large, most entities lack descriptive text defining them, and labeled data covers only a small portion of the ontology. In this paper, we develop a new model that overcomes these challenges by (1) generalizing to entities unseen at training time, and (2) incorporating linking predictions into the mention segmentation decisions. Our approach achieves new state-of-the-art results for the UMLS ontology in both traditional recognition/linking (+8 F1 pts) and semantic indexing-based evaluation (+10 F1 pts).