This paper studies methods for information extraction (entity recognition and relation classification) from scientific texts on information technology. Scientific publications provide valuable insight into cutting-edge scientific advances, but efficiently processing the growing volume of publications is a time-consuming task. We propose several modifications of these methods for the Russian language and report the results of experiments comparing a keyword extraction method, a vocabulary-based method, and several methods based on neural networks. Text collections for these tasks exist for English and are actively used by the scientific community, but at present no such datasets in Russian are publicly available. In this paper, we present RuSERRC, a corpus of scientific texts in Russian. The dataset consists of 1600 unlabeled documents and 80 documents labeled with entities and semantic relations (6 relation types were considered). The dataset and models are available at https://github.com/iis-research-team. We hope they will be useful for research purposes and for the development of information extraction systems.