Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting distributions over vocabulary tokens are intuitive and contain rich semantic information. We find that this view can explain some of the failure cases of dense retrievers. For example, the inability of models to handle tail entities can be explained via a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight and propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance compared to the original model in out-of-domain settings.
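To make the projection idea concrete, here is a minimal sketch of how a dense representation can be mapped to a distribution over vocabulary tokens. It assumes a BERT-style dual encoder whose [CLS] vector lives in the same space as the checkpoint's masked-language-modeling head; the checkpoint name, the function, and the use of `bert-base-uncased` are illustrative assumptions, not the paper's released code.

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Assumption: a BERT checkpoint stands in for the dual encoder and shares its
# vocabulary and MLM head; a trained retriever would be loaded here instead.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

def vocabulary_projection(text: str, top_k: int = 10):
    """Project the single-vector [CLS] representation into the vocabulary
    space via the MLM head and return the highest-probability tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Encoder hidden states; [CLS] serves as the dense representation
        # a dual encoder would produce for this text.
        hidden = model.bert(**inputs).last_hidden_state   # (1, seq_len, hidden)
        cls_vec = hidden[:, 0]                            # (1, hidden)
        # The MLM head maps a hidden vector to logits over the vocabulary.
        logits = model.cls(cls_vec)                       # (1, vocab_size)
        probs = torch.softmax(logits, dim=-1).squeeze(0)
    top = torch.topk(probs, top_k)
    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    return list(zip(tokens, top.values.tolist()))

# Example: inspect which tokens dominate the projected distribution for a query.
print(vocabulary_projection("What is the capital of France?"))
```

Inspecting the top tokens of such a distribution is what allows the failure analysis described above, e.g. checking whether tokens of a tail entity are missing from a query's projected distribution.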