Document retrieval enables users to find the documents they need accurately and quickly. To meet the requirement of retrieval efficiency, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-storing document representations offline. However, this paradigm consumes vast local storage space, especially when documents are stored as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based Model for document retrieval. Following the representation-based matching paradigm, TGTR stores document representations offline to ensure retrieval efficiency, while significantly reducing storage requirements by using novel topic-grained representations rather than traditional word-grained ones. Experimental results demonstrate that, compared to word-grained baselines, TGTR is consistently competitive on TREC CAR and MS MARCO in terms of retrieval accuracy, yet it requires less than one tenth of the storage space they need. Moreover, TGTR overwhelmingly surpasses global-grained baselines in terms of retrieval accuracy.
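For intuition, the following is a minimal sketch (not TGTR's actual architecture) of the representation-based matching paradigm described above, contrasting the offline storage footprint of word-grained and coarser topic-grained document representations. The embedding dimension, document length, topic count, chunked mean-pooling scheme, and late-interaction scoring function are all illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of representation-based matching (assumed values throughout).
import numpy as np

EMBED_DIM = 128    # assumed embedding dimension
DOC_LEN = 200      # assumed average number of tokens per document
NUM_TOPICS = 16    # assumed number of topic vectors stored per document


def word_grained_repr(doc_token_embeddings: np.ndarray) -> np.ndarray:
    """Word-grained paradigm: store one vector per token offline."""
    return doc_token_embeddings  # shape: (DOC_LEN, EMBED_DIM)


def topic_grained_repr(doc_token_embeddings: np.ndarray,
                       num_topics: int = NUM_TOPICS) -> np.ndarray:
    """Toy stand-in for a topic-grained representation: pool token vectors
    into a small number of coarser vectors (simple chunked mean-pooling here)."""
    chunks = np.array_split(doc_token_embeddings, num_topics, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])  # (num_topics, EMBED_DIM)


def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Online matching against pre-stored document vectors: for each query
    vector take its max similarity over the document vectors, then sum."""
    sim = query_vecs @ doc_vecs.T  # (query_len, num_doc_vecs)
    return float(sim.max(axis=1).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_tokens = rng.normal(size=(DOC_LEN, EMBED_DIM)).astype(np.float32)
    query = rng.normal(size=(8, EMBED_DIM)).astype(np.float32)

    word_repr = word_grained_repr(doc_tokens)
    topic_repr = topic_grained_repr(doc_tokens)

    # Offline storage cost per document (bytes): the gap scales with DOC_LEN / NUM_TOPICS.
    print("word-grained bytes :", word_repr.nbytes)   # 200 * 128 * 4
    print("topic-grained bytes:", topic_repr.nbytes)  # 16 * 128 * 4
    print("score (topic-grained):", late_interaction_score(query, topic_repr))
```

Because only the pre-stored document vectors are needed at query time, the online matching cost and the local storage cost both shrink when each document is represented by a handful of topic vectors instead of one vector per word.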