Evaluation of Embedding Models for Automatic Extraction and Classification of Acknowledged Entities in Scientific Documents

Acknowledgments in scientific papers may give insight into aspects of the scientific community such as reward systems, collaboration patterns, and hidden research trends. The aim of this paper is to evaluate the performance of different embedding models for the task of automatically extracting and classifying acknowledged entities from the acknowledgment text of scientific papers. We trained and implemented a named entity recognition (NER) task using the Flair NLP framework. Training was conducted using three default Flair NER models with two corpora of different sizes. The Flair Embeddings model trained on the larger corpus showed the best accuracy, 0.77. Our model is able to recognize six entity types: funding agency, grant number, individuals, university, corporation, and miscellaneous. The model is more precise for some entity types than for others: individuals and grant numbers achieved very good F1-scores of over 0.9. Most previous work on acknowledgment analysis relied on manual evaluation of the data and was therefore limited in the amount of data it could process. This model can be applied to the comprehensive analysis of acknowledgment texts and may make a substantial contribution to the field of automated acknowledgment analysis.
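
To make the per-entity-type comparison concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed separately for each of the six entity types from exact-match spans. It is not the authors' evaluation code: the short tag names (`FUND`, `GRNB`, `IND`, `UNI`, `COR`, `MISC`), the function `per_type_f1`, and the toy spans are all illustrative assumptions.

```python
from collections import defaultdict

# The six entity types named in the abstract.
# The short tag names are hypothetical, chosen only for this sketch.
ENTITY_TYPES = ["FUND", "GRNB", "IND", "UNI", "COR", "MISC"]


def per_type_f1(gold, predicted):
    """Return precision, recall and F1 per entity type.

    `gold` and `predicted` are sets of (start, end, type) tuples.
    A prediction counts as correct only if boundaries and type both match.
    """
    tp = defaultdict(int)  # true positives per type
    fp = defaultdict(int)  # false positives per type
    fn = defaultdict(int)  # false negatives per type

    for span in predicted:
        if span in gold:
            tp[span[2]] += 1
        else:
            fp[span[2]] += 1
    for span in gold:
        if span not in predicted:
            fn[span[2]] += 1

    scores = {}
    for label in ENTITY_TYPES:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy token-offset spans (made up): the FUND prediction has a wrong boundary.
gold = {(0, 2, "IND"), (5, 8, "FUND"), (9, 10, "GRNB")}
predicted = {(0, 2, "IND"), (5, 7, "FUND"), (9, 10, "GRNB")}

scores = per_type_f1(gold, predicted)
for label in ENTITY_TYPES:
    print(label, scores[label])
```

Exact-match scoring is only one possible convention; a partial-overlap criterion would give different numbers, and the abstract does not state which one was used.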
