Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation produced by a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful for understanding "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
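The core idea of "moving in the right direction in latent space" can be illustrated with a minimal sketch: interpolate between a query embedding and the embedding of the target paragraph, then decode the resulting latent vector back into text. All names, embeddings, and the nearest-neighbour "decoder" below are hypothetical stand-ins; the paper trains a generative decoder, which this toy lookup only approximates.

```python
import numpy as np

def interpolate(q_emb, d_emb, alpha):
    """Move from the query embedding toward the document embedding.
    alpha=0 stays at the query; alpha=1 reaches the document."""
    return (1.0 - alpha) * q_emb + alpha * d_emb

def decode_query(z, candidates):
    """Toy 'query decoder': return the candidate query whose embedding is
    closest (by cosine similarity) to the latent vector z. A trained
    generative decoder would produce free-form text instead."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda item: cos(item[1], z))[0]

# Hypothetical toy embeddings (2-d for illustration only).
q_emb = np.array([1.0, 0.0])   # original query
d_emb = np.array([0.0, 1.0])   # relevant paragraph
candidates = [
    ("original query", np.array([1.0, 0.1])),
    ("reformulated query", np.array([0.2, 1.0])),
]

# Moving most of the way toward the paragraph decodes the reformulation.
z = interpolate(q_emb, d_emb, 0.8)
print(decode_query(z, candidates))
```

Under this sketch, small interpolation steps decode back to the original query, while larger steps toward the paragraph embedding decode to a reformulation that would retrieve it, mirroring the "what should have been asked" use case.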