Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles

We describe a rule-based approach for the automatic acquisition of scientific entities from scholarly article titles. Two observations motivated the approach: (i) an article's contribution information is concentrated in its title; and (ii) regularities in how that information is expressed can be captured by a system of rules, which reduces the human annotation effort otherwise spent creating gold standards one instance at a time. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently, and that generally indicate the type of scientific entity of interest in the scholarly contribution. A subset of the acquisition algorithm is implemented for article titles in the Computational Linguistics (CL) scholarly domain. The tool, called ORKG-Title-Parser, in its first release identifies six concept types of scientific terminology in CL paper titles: research problem, solution, resource, language, tool, and method. It has been empirically evaluated on a collection of 50,237 titles that cover nearly all articles in the ACL Anthology. It extracted 19,799 research problems; 18,111 solutions; 20,033 resources; 1,059 languages; 6,878 tools; and 21,687 methods at an average extraction precision of 75%. The code and related data resources are publicly available at https://gitlab.com/TIBHannover/orkg/orkg-title-parser. Finally, we discuss extensions and applications to areas such as scholarly knowledge graph (SKG) creation.
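To make the idea of lexico-syntactic title patterns concrete, here is a minimal sketch of how a rule of this kind might be written. The three rules, the connector words "for" and "using", and the function name `parse_title` are hypothetical choices made for illustration; they are not taken from ORKG-Title-Parser, whose actual rule set is in the repository linked above.

```python
import re
from typing import Dict

# Hypothetical rules for illustration only; NOT the ORKG-Title-Parser rules.
# Each rule is a regex whose named groups are the entity types it assigns.
# Rules are tried in order, so the most specific pattern comes first.
RULES = [
    # "<solution> for <research problem> using <method>"
    re.compile(
        r"^(?P<solution>.+?) for (?P<research_problem>.+?) using (?P<method>.+)$",
        re.IGNORECASE,
    ),
    # "<solution> for <research problem>"
    re.compile(
        r"^(?P<solution>.+?) for (?P<research_problem>.+)$",
        re.IGNORECASE,
    ),
    # "<research problem> using <method>"
    re.compile(
        r"^(?P<research_problem>.+?) using (?P<method>.+)$",
        re.IGNORECASE,
    ),
]


def parse_title(title: str) -> Dict[str, str]:
    """Return {entity_type: phrase} from the first matching rule, else {}."""
    cleaned = " ".join(title.split())  # collapse runs of whitespace
    for rule in RULES:
        match = rule.match(cleaned)
        if match:
            return {name: text.strip() for name, text in match.groupdict().items()}
    return {}


if __name__ == "__main__":
    for example in (
        "A Neural Model for Machine Translation using Attention",
        "Word Sense Disambiguation using Wikipedia",
    ):
        print(example, "->", parse_title(example))
```

A title that matches none of the rules yields an empty dictionary. A realistic rule set would need many more connectors and would have to handle titles with colons, nested phrases, and ambiguous prepositions, which this sketch ignores.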
