Academic knowledge services have substantially facilitated the development of the science enterprise by providing a plenitude of efficient research tools. However, many applications depend heavily on ad-hoc models and expensive human labeling to understand scientific content, hindering deployment in real products. To build a unified backbone language model for different knowledge-intensive academic applications, we pre-train an academic language model, OAG-BERT, that integrates both the heterogeneous entity knowledge and the scientific corpora in the Open Academic Graph (OAG) -- the largest public academic graph to date. In OAG-BERT, we develop strategies for pre-training on text and entity data along with zero-shot inference techniques. Its zero-shot capability furthers the path toward mitigating the need for expensive annotations. OAG-BERT has been deployed in real-world applications, such as the reviewer recommendation function for the National Natural Science Foundation of China (NSFC) -- one of the largest funding agencies in China -- and paper tagging in AMiner. All code and pre-trained models are available via the CogDL toolkit.
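For readers who want to try the released models, below is a minimal sketch of loading OAG-BERT through CogDL and encoding a paper title. The entry point and model identifier (here `oagbert("oagbert-v1")`) follow the CogDL documentation at the time of writing and may differ across CogDL releases, so treat this as an illustration rather than the definitive API.

```python
# Minimal sketch: loading released OAG-BERT weights via the CogDL toolkit.
# Assumes `pip install cogdl`; the entry point below may vary by version.
import torch
from cogdl import oagbert

# Downloads (on first use) and loads the tokenizer and pre-trained model.
tokenizer, model = oagbert("oagbert-v1")
model.eval()

# Encode a paper title and obtain contextual token representations.
sequence = "Graph neural networks for academic knowledge services"
tokens = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**tokens)

# outputs[0] holds the token-level hidden states with shape
# [batch, seq_len, hidden]; exact output structure may vary by version.
print(outputs[0].shape)
```

A pooled sentence-level embedding for downstream tasks (e.g., paper tagging or reviewer matching) can then be derived from these hidden states, for instance by mean-pooling over tokens.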