Ad-hoc search calls for selecting appropriate answers from a massive-scale corpus. Embedding-based retrieval (EBR) has become a promising solution, in which deep-learning-based document representations and approximate nearest neighbor (ANN) search are combined to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, and heavyweight dense embeddings are hosted on disk for fine-grained post-verification. To maximize retrieval accuracy, we design a Progressive Optimization framework. The sparse embeddings are learned first for high-quality candidate search. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are subsequently learned to better discriminate the ground truth from the shortlisted candidates. In addition, two techniques, contrastive quantization and locality-centric sampling, are introduced for learning the sparse and dense embeddings, and they contribute substantially to performance. Thanks to these features, our method effectively handles massive-scale EBR with strong accuracy advantages: up to +4.3% recall gain on a million-scale corpus and up to +17.5% recall gain on a billion-scale corpus. Our method has also been applied to a major sponsored search platform, with substantial gains in revenue (+1.95%), recall (+1.01%), and CTR (+0.49%).
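The two-stage retrieval flow described above (in-memory coarse search, then on-disk verification) can be illustrated with a minimal sketch. It is not the method's implementation: a random projection stands in for the learned lightweight embeddings, the contrastive quantization and locality-centric sampling steps are not reproduced, and the names `build_index` and `search`, the sizes, and the file name `dense.f32` are all hypothetical.

```python
import os
import tempfile

import numpy as np


def normalize(x):
    """Scale vectors to unit L2 norm along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_index(dense, light_dim, path, rng):
    """Build a small in-memory index and a disk-backed store of dense vectors."""
    n, d = dense.shape
    # Assumption: a random projection replaces the learned lightweight embeddings.
    proj = rng.standard_normal((d, light_dim)).astype(np.float32)
    light = normalize(dense @ proj)  # compact vectors, kept in memory

    # Write the heavyweight dense vectors to disk, then reopen them read-only.
    on_disk = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, d))
    on_disk[:] = dense
    on_disk.flush()
    heavy = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    return proj, light, heavy


def search(query, proj, light, heavy, n_candidates=50, k=5):
    """Coarse candidate search in memory, then fine re-ranking from disk."""
    # Stage 1: score every document with the lightweight embeddings.
    light_q = normalize(query @ proj)
    coarse = light @ light_q
    cand = np.argpartition(-coarse, n_candidates - 1)[:n_candidates]

    # Stage 2: load only the shortlisted dense vectors and re-rank them.
    fine = np.asarray(heavy[cand]) @ query
    order = np.argsort(-fine)[:k]
    return cand[order]


rng = np.random.default_rng(0)
dense = normalize(rng.standard_normal((2000, 64))).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "dense.f32")
proj, light, heavy = build_index(dense, 16, path, rng)

# A query that is a slightly perturbed copy of document 42.
query = normalize(dense[42] + 0.01 * rng.standard_normal(64)).astype(np.float32)
top = search(query, proj, light, heavy)
print(top)
```

Only the shortlisted rows are read from the memory-mapped file in the second stage, which is what keeps the memory footprint tied to the lightweight index and not to the full dense corpus.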