Recent advances in deep learning have transformed the field of semantic text matching. However, most state-of-the-art models are designed for short documents such as tweets, user reviews, and comments, and they have fundamental limitations when applied to long-form documents such as scientific papers, legal documents, and patents. Long documents pose three primary challenges: (i) the same word appears in different contexts throughout the document; (ii) two documents may share small sections of contextually similar text while the remaining parts are dissimilar, which defies the basic understanding of "similarity"; and (iii) a single global similarity measure is too coarse to capture the heterogeneity of the document content. In this paper, we describe CoLDE (Contrastive Long Document Encoder), a transformer-based framework that addresses these challenges and allows for interpretable comparisons of long documents. CoLDE uses unique positional embeddings and a multi-headed chunkwise attention layer in conjunction with a contrastive learning framework to capture similarity at three different levels: (i) high-level similarity scores between a pair of documents, (ii) similarity scores between different sections within and across documents, and (iii) similarity scores between different chunks within the same document and across documents. These fine-grained similarity scores improve interpretability. We evaluate CoLDE on three long-document datasets: ACL Anthology publications, Wikipedia articles, and USPTO patents. Besides outperforming state-of-the-art methods on the document comparison task, CoLDE also proves interpretable and robust to changes in document length and to text perturbations.
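To make the three levels of similarity concrete, the following is a minimal illustrative sketch, not CoLDE's actual implementation. It assumes that chunk embeddings are already available as plain vectors, and it substitutes mean pooling and cosine similarity for the learned chunkwise attention and contrastive training described above. The function names `cosine`, `mean_pool`, and `three_level_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors.

    Assumption: a stand-in for a learned aggregation such as attention.
    """
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def three_level_similarity(doc_a, doc_b):
    """Compare two documents at the document, section, and chunk level.

    Each document is a list of sections; each section is a list of
    chunk embedding vectors (hypothetical input format).
    Returns (doc_sim, section_sim, chunk_sim), where the last two are
    matrices indexed [item in doc_a][item in doc_b].
    """
    # Chunk level: every chunk of doc_a against every chunk of doc_b.
    chunks_a = [chunk for section in doc_a for chunk in section]
    chunks_b = [chunk for section in doc_b for chunk in section]
    chunk_sim = [[cosine(ca, cb) for cb in chunks_b] for ca in chunks_a]

    # Section level: pool the chunks of each section, then compare sections.
    sections_a = [mean_pool(section) for section in doc_a]
    sections_b = [mean_pool(section) for section in doc_b]
    section_sim = [[cosine(sa, sb) for sb in sections_b] for sa in sections_a]

    # Document level: pool the section vectors into one vector per document.
    doc_sim = cosine(mean_pool(sections_a), mean_pool(sections_b))
    return doc_sim, section_sim, chunk_sim


if __name__ == "__main__":
    # Two toy documents that share their first section but differ in the second.
    doc_a = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]]
    doc_b = [[[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]]
    doc_sim, section_sim, chunk_sim = three_level_similarity(doc_a, doc_b)
    print("document:", doc_sim)
    print("sections:", section_sim)
    print("chunks:", chunk_sim)
```

In the toy example, the document-level score comes out at roughly 0.5, while the section-level matrix shows that the first sections match exactly and the second sections do not match at all. This is the situation described in challenges (ii) and (iii): a single global score hides where the two documents actually agree.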