Lexical and semantic matching are two distinct, successful approaches to text retrieval, and fusing their results has proven more effective and robust than using either alone. Prior work performs hybrid retrieval by conducting lexical and semantic matching in different systems (e.g., Lucene and Faiss, respectively) and then fusing their outputs. In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs). Our experiments show that DLRs can effectively approximate the original lexical representations, preserving effectiveness while improving query latency. Furthermore, we can combine dense lexical and semantic representations to generate dense hybrid representations (DHRs) that are more flexible and yield faster retrieval than existing hybrid techniques. In addition, we explore jointly training lexical and semantic representations in a single model and empirically show that the resulting DHRs combine the advantages of the individual components. Our best DHR model is competitive with state-of-the-art single-vector and multi-vector dense retrievers in both in-domain and zero-shot evaluation settings. Our model is also faster and requires smaller indexes, making our dense representation framework an attractive approach to text retrieval. Our code is available at https://github.com/castorini/dhr.
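To make the idea of densification concrete, the following is a minimal sketch rather than the released implementation. The abstract does not specify how lexical vectors are compressed, so the sketch assumes one plausible scheme: the vocabulary-sized vector is cut into contiguous slices, each slice keeps only its largest weight and the position of that weight, and two vectors are scored by multiplying weights only where the stored positions agree. It also assumes non-negative term weights. The names `densify`, `gated_inner_product`, and `hybrid_score` are hypothetical and chosen for illustration.

```python
import numpy as np


def densify(lexical_vec, num_slices):
    """Compress a high-dimensional lexical vector (assumed scheme).

    The vector is split into `num_slices` contiguous slices; each slice
    keeps its maximum weight and the position of that maximum.
    Assumes non-negative weights, so zero padding never wins a slice.
    """
    vec = np.asarray(lexical_vec, dtype=np.float32)
    pad = (-vec.size) % num_slices          # pad so the length divides evenly
    vec = np.pad(vec, (0, pad))
    slices = vec.reshape(num_slices, -1)
    return slices.max(axis=1), slices.argmax(axis=1)


def gated_inner_product(query, passage):
    """Score two densified vectors, counting a slice only when both
    sides kept the same position (i.e., the same term) in that slice."""
    q_val, q_idx = query
    p_val, p_idx = passage
    return float(np.sum(q_val * p_val * (q_idx == p_idx)))


def hybrid_score(q_lex, p_lex, q_sem, p_sem):
    """Hybrid score: gated lexical score plus a semantic dot product."""
    return gated_inner_product(q_lex, p_lex) + float(np.dot(q_sem, p_sem))


# Toy 8-term vocabulary compressed to 4 slices of 2 terms each.
query_lex = [0, 2, 0, 0, 1, 0, 0, 0]
passage_lex = [0, 3, 0, 0, 0, 4, 0, 0]
q_dlr = densify(query_lex, num_slices=4)
p_dlr = densify(passage_lex, num_slices=4)
lexical_score = gated_inner_product(q_dlr, p_dlr)
exact_score = float(np.dot(query_lex, passage_lex))
```

In this toy case the only term shared by query and passage survives the pooling on both sides, so the gated score equals the exact inner product. In general the densified score is an approximation: whenever a shared term is not the maximum of its slice on either side, its contribution is dropped, and using fewer slices makes such losses more likely.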