The academic literature of the social sciences records human civilization and studies human social problems. As this literature grows at scale, quickly finding existing research on relevant issues has become an urgent need for researchers. Previous studies, such as SciBERT, have shown that pre-training on domain-specific text improves performance on natural language processing tasks. However, no pre-trained language model for the social sciences has been available so far. In light of this, the present research proposes a pre-trained model based on abstracts published in Social Sciences Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/S-T-Full-Text-Knowledge-Mining/SSCI-BERT), show excellent performance on discipline classification, abstract structure-function recognition, and named entity recognition tasks on social sciences literature.
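Since the released checkpoints follow the standard BERT format, they can be fine-tuned for downstream tasks such as discipline classification with the Hugging Face `transformers` library. The following is a minimal sketch under that assumption; the local checkpoint path `./SSCI-BERT` and the label count are placeholders for illustration, not values from the paper.

```python
# Minimal fine-tuning setup for discipline classification with an
# SSCI-BERT checkpoint, using the standard Hugging Face transformers API.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical local path: substitute the checkpoint directory you
# downloaded from the GitHub repository.
checkpoint = "./SSCI-BERT"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=10,  # placeholder: one label per discipline in your dataset
)

# Encode one SSCI-style abstract and predict its discipline label.
abstract = "This study examines the relationship between social capital and ..."
inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the predicted discipline
```

The classification head added by `AutoModelForSequenceClassification` is randomly initialized, so this setup is intended as a starting point for fine-tuning on labeled data rather than for direct zero-shot prediction.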