We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally occurring source of supervision: sentences in the full text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. We use this novel form of textual supervision to learn to match aspects across papers. We develop multi-vector representations in which vectors correspond to sentence-level aspects of documents, and present two methods for aspect matching: (1) a fast method that matches only single aspects, and (2) a method that makes sparse multiple matches with an Optimal Transport mechanism that computes an Earth Mover's Distance between aspects. Our approach improves performance on document similarity tasks across four datasets. Further, our fast single-match method achieves competitive results, paving the way for applying fine-grained similarity to large scientific corpora. Code, data, and models are available at: https://github.com/allenai/aspire
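The two aspect-matching strategies named above can be illustrated with a minimal sketch. It is not the released implementation. It assumes each document is already encoded as a matrix of sentence vectors, uses Euclidean distance as the ground cost, and gives every sentence equal weight. The function names, the `reg` and `n_iters` parameters, and the use of entropy-regularized Sinkhorn iterations in place of an exact Earth Mover's Distance solver are all assumptions made for illustration. Unlike exact optimal transport, the Sinkhorn plan is dense, and it only approaches a sparse plan as `reg` shrinks.

```python
import numpy as np


def pairwise_l2(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def single_match_distance(a, b):
    """Single-aspect matching: distance of the closest sentence pair."""
    return float(pairwise_l2(a, b).min())


def sinkhorn_distance(a, b, reg=0.1, n_iters=200):
    """Multi-aspect matching via entropy-regularized optimal transport.

    Approximates an Earth Mover's Distance between two sets of sentence
    vectors under uniform sentence weights (an assumption of this sketch).
    Returns the transport cost and the (n, m) transport plan.
    """
    cost = pairwise_l2(a, b)
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)  # uniform mass over sentences of document a
    nu = np.full(m, 1.0 / m)  # uniform mass over sentences of document b
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale to match the column and row marginals.
        v = nu / (kernel.T @ u)
        u = mu / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


# Toy sentence vectors standing in for encoder outputs (hypothetical data).
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

single = single_match_distance(doc_a, doc_b)
multi, plan = sinkhorn_distance(doc_a, doc_b)
```

The single-match score needs only one minimum over the pairwise distance matrix, which is what makes it cheap enough for large corpora. The transport-based score spreads mass over several sentence pairs, so it is never smaller than the single-match score.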