Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider the aspects in which two documents are similar. This limits the granularity of applications, such as recommender systems, that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity on research papers. Paper citations indicate the aspect-based similarity, i.e., the title of the section in which a citation occurs acts as a label for the pair of citing and cited papers. We apply a series of Transformer models, namely RoBERTa, ELECTRA, XLNet, and BERT variants, and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and the CORD-19 corpus. Our results show that SciBERT is the best-performing system. A qualitative examination validates our quantitative results. Our findings motivate future research on aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
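To illustrate the labeling idea, the following minimal sketch shows how citation records could be turned into labeled document pairs. It is not the authors' implementation: the record layout, the identifiers, the function name `build_labeled_pairs`, the lowercasing of section titles, and the choice to let one pair carry several labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical citation records: (citing_id, cited_id, section_title).
# The IDs and titles are made up; they are not taken from the datasets.
citations = [
    ("paperA", "paperB", "Introduction"),
    ("paperA", "paperB", "Related Work"),
    ("paperA", "paperC", "Experiments"),
]


def build_labeled_pairs(records):
    """Map each (citing, cited) pair to the sorted section titles citing it.

    Assumption: a pair cited in several sections receives several labels,
    and titles are normalized by stripping whitespace and lowercasing.
    """
    labels = defaultdict(set)
    for citing, cited, section in records:
        labels[(citing, cited)].add(section.strip().lower())
    return {pair: sorted(titles) for pair, titles in labels.items()}


pairs = build_labeled_pairs(citations)
for pair, titles in pairs.items():
    print(pair, titles)
```

Each resulting pair, together with its labels, would then serve as one training example for a pairwise classifier of the kind described above.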