The advent of contextualised language models has brought gains in search effectiveness, not just when applied to re-rank the output of classical weighting models such as BM25, but also when used directly for passage indexing and retrieval, a technique known as dense retrieval. In the existing neural ranking literature, two dense retrieval families have become apparent: single representation, where an entire passage is represented by a single embedding (usually derived from BERT's [CLS] token, as exemplified by the recent ANCE approach), and multiple representations, where each token in a passage is represented by its own embedding (as exemplified by the recent ColBERT approach). These two families have not been directly compared. However, given the likely importance of dense retrieval moving forward, a clear understanding of their advantages and disadvantages is paramount. To this end, this paper contributes a direct study of their comparative effectiveness, noting situations where each method under- or over-performs with respect to the other, and with respect to a BM25 baseline. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically significantly more effective than single representations in terms of MAP and MRR@10. We also show that multiple representations obtain larger improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries and those with complex information needs.
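To make the contrast between the two families concrete, the following minimal sketch illustrates the two scoring regimes; it is not the authors' implementation, and the function names and tensor shapes are illustrative assumptions. Single-representation models such as ANCE score a query-passage pair as the dot product of two embeddings, while multiple-representation models such as ColBERT match each query token embedding to its most similar passage token embedding (MaxSim) and sum the maxima.

import torch

def single_rep_score(q_cls: torch.Tensor, p_cls: torch.Tensor) -> torch.Tensor:
    """Single-representation (ANCE-style) scoring: one embedding per
    query/passage; relevance is their dot product."""
    # q_cls: (dim,), p_cls: (dim,)
    return q_cls @ p_cls

def multi_rep_score(q_embs: torch.Tensor, p_embs: torch.Tensor) -> torch.Tensor:
    """Multiple-representation (ColBERT-style) scoring: one embedding per
    token; each query token contributes the similarity of its best-matching
    passage token, and these maxima are summed."""
    # q_embs: (num_query_tokens, dim), p_embs: (num_passage_tokens, dim)
    sim = q_embs @ p_embs.T             # (num_query_tokens, num_passage_tokens)
    return sim.max(dim=1).values.sum()  # sum of per-query-token MaxSim values

The sketch also hints at the efficiency trade-off noted above: the single-representation index stores one vector per passage, whereas the multiple-representation index stores one vector per token, increasing memory usage and scoring cost.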