Measuring Research Interest Similarity with Transition Probabilities

We introduce a family of paper and author similarity measures based on the concept that papers are more similar if they are more likely to be retrieved during a literature search following backward and forward citations. Since this browsing process resembles a walk in a citation network, we operationalize the concept using the transition probability (TP) of random walkers. The proposed measures are continuous, symmetric, and can be implemented on any citation network. We conduct validation tests of the TP concept and other extant alternatives to gauge which metric can classify papers and predict future co-authors most consistently across different scales of analysis (co-authorships, journals, and disciplines). Our results show that the proposed basic TP measure outperforms alternative metrics such as personalized PageRank and the Node2vec machine-learning technique in classification tasks at various scales. Additionally, we discuss how publication-level data can be leveraged to approximate the research interest similarity of individual scientists. This paper is accompanied by a Python package that implements all the tested metrics.
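
To make the random-walk idea concrete, here is a minimal sketch of how a transition-probability similarity could be computed on a toy citation network. It is not the accompanying package's API or the paper's exact definition: the function names `transition_matrix` and `tp_similarity`, the fixed number of steps, and the symmetrization by averaging the two directions are all assumptions made for illustration.

```python
import numpy as np


def transition_matrix(adj):
    """Row-normalize an adjacency matrix into one-step transition probabilities."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    # Rows with no links are left as all zeros instead of dividing by zero.
    return np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)


def tp_similarity(citations, n_papers, steps=2):
    """Hypothetical TP-style similarity between all pairs of papers.

    citations: iterable of (citing, cited) index pairs.
    steps: walk length (an assumption; the abstract does not fix one).
    """
    adj = np.zeros((n_papers, n_papers))
    for citing, cited in citations:
        # The walker may follow a citation backward or forward,
        # so each citation becomes an undirected link.
        adj[citing, cited] = 1.0
        adj[cited, citing] = 1.0
    one_step = transition_matrix(adj)
    k_step = np.linalg.matrix_power(one_step, steps)
    # Assumption: average both directions to obtain a symmetric measure.
    return (k_step + k_step.T) / 2.0


# Toy network: papers 0 and 1 both cite paper 2; papers 1 and 4 both cite paper 3.
citations = [(0, 2), (1, 2), (1, 3), (4, 3)]
sim = tp_similarity(citations, n_papers=5, steps=2)
print(np.round(sim, 3))
```

In this toy network, papers 0 and 1 share a reference and so receive a positive score, while papers 0 and 4 have no two-step path between them and score zero. The resulting values are continuous and symmetric, matching the properties the abstract claims for the proposed measures.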
