Lexical and semantic matching capture different successful approaches to text retrieval, and the fusion of their results has proven more effective and robust than either alone. Prior work performs hybrid retrieval by conducting lexical and semantic text matching using different systems (e.g., Lucene and Faiss, respectively) and then fusing their model outputs. In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs). Our experiments show that DLRs can effectively approximate the original lexical representations, preserving effectiveness while improving query latency. Furthermore, we can combine dense lexical and semantic representations to generate dense hybrid representations (DHRs) that are more flexible and yield faster retrieval than existing hybrid techniques. Finally, we explore jointly training lexical and semantic representations in a single model and empirically show that the resulting DHRs are able to combine the advantages of each individual component. Our best DHR model is competitive with state-of-the-art single-vector and multi-vector dense retrievers in both in-domain and zero-shot evaluation settings. In addition, our model is faster and requires smaller indexes, making our dense representation framework an attractive approach to text retrieval. Our code is available at https://github.com/castorini/dhr.
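The abstract does not spell out how a lexical vector is densified, so the following is only a minimal sketch of one plausible scheme, assuming slice-wise max pooling: the vocabulary-sized vector is cut into contiguous slices, and each slice keeps its largest weight together with the position of that weight. Scoring then multiplies query and document weights only in slices where both kept the same position. The function names `densify`, `gated_inner_product`, and `hybrid_score`, the contiguous slicing, and the unweighted sum used for the hybrid score are all illustrative assumptions, not the API of the released code.

```python
import numpy as np


def densify(lexical_vec, num_slices):
    """Assumed scheme: per-slice max pooling over a vocabulary-sized vector.

    Returns (values, indices), each of length num_slices. The vector length
    must be divisible by num_slices.
    """
    slices = np.asarray(lexical_vec, dtype=float).reshape(num_slices, -1)
    indices = slices.argmax(axis=1)  # position of the strongest term per slice
    values = slices.max(axis=1)      # weight of that term
    return values, indices


def gated_inner_product(query_dlr, doc_dlr):
    """Sum value products only where both sides kept the same term position."""
    q_val, q_idx = query_dlr
    d_val, d_idx = doc_dlr
    gate = q_idx == d_idx
    return float(np.sum(q_val * d_val * gate))


def hybrid_score(query_dlr, doc_dlr, query_sem, doc_sem):
    """Assumed fusion: lexical (gated) score plus semantic dot product."""
    semantic = float(np.dot(query_sem, doc_sem))
    return gated_inner_product(query_dlr, doc_dlr) + semantic


# Toy vocabulary of 12 terms, densified into 3 slices of 4 terms each.
query = np.zeros(12)
query[1], query[6] = 2.0, 1.0
doc = np.zeros(12)
doc[1], doc[5], doc[6], doc[10] = 3.0, 4.0, 0.5, 1.0

exact = float(np.dot(query, doc))
approx = gated_inner_product(densify(query, 3), densify(doc, 3))
print(exact, approx)  # prints 6.5 6.0
```

In this toy case the densified score (6.0) falls short of the exact sparse score (6.5) because term 6 is dropped from the document's second slice in favor of the heavier term 5. This is the kind of approximation error that a densified representation trades for a fixed, small vector size.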