A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps: first, identify lexical and semantic changes using contextual embeddings and word frequencies; second, aggregate information about these changes into per-document influence scores by estimating a high-dimensional Hawkes process with a low-rank parameter matrix. We show that this measure of linguistic influence is predictive of $\textit{future}$ citations: the estimate of linguistic influence from the two years after a paper's publication is correlated with and predictive of its citation count in the following three years. This is demonstrated using an online evaluation with incremental temporal training/test splits, in comparison with a strong baseline that includes predictors for initial citation counts, topics, and lexical features.
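The second step can be illustrated with a minimal sketch of a multivariate Hawkes intensity whose excitation matrix is low-rank. This is not the authors' implementation: the exponential kernel, the factorization A = U V^T, the column-sum influence score, and every name below (`excitation`, `intensity`, `influence_scores`, `decay`) are assumptions made for illustration. Each dimension stands for a document, and an event `(time, source)` stands for an observed use of a linguistic change in that document.

```python
import math


def excitation(U, V, target, source):
    """Entry A[target][source] of the low-rank matrix A = U V^T (assumed form)."""
    return sum(u * v for u, v in zip(U[target], V[source]))


def intensity(t, target, events, mu, U, V, decay=1.0):
    """Conditional intensity of dimension `target` at time t.

    lambda_target(t) = mu[target]
        + sum over past events (s, src) of A[target][src] * decay * exp(-decay * (t - s))

    The exponential kernel is an illustrative choice, not taken from the paper.
    """
    rate = mu[target]
    for s, src in events:
        if s < t:  # only strictly earlier events can excite the process
            rate += excitation(U, V, target, src) * decay * math.exp(-decay * (t - s))
    return rate


def influence_scores(U, V):
    """Hypothetical per-document score: total excitation a source exerts on all targets.

    This is the column sum of A, computed without materializing A separately.
    """
    n = len(U)
    return [sum(excitation(U, V, i, j) for i in range(n)) for j in range(n)]


# Toy example: two documents, rank-1 factors, one past event from document 1.
U = [[1.0], [2.0]]
V = [[0.5], [1.5]]
mu = [0.1, 0.2]
events = [(0.0, 1)]

rate_doc0 = intensity(1.0, 0, events, mu, U, V)
scores = influence_scores(U, V)
```

With rank r much smaller than the number of documents n, the factors hold 2nr parameters instead of the n^2 entries of a full matrix, which is what makes a high-dimensional process of this kind tractable to estimate.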