Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches depend heavily on the completeness of such data. One particular shortcoming of scholarly data today is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications in different languages (cross-lingual citations) have been studied only to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as the impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations going primarily to local non-English languages, and consistency in citation intent between cross-lingual and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.