Composite Code Sparse Autoencoders for First Stage Retrieval

We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection, and the second stage re-ranks the candidate set by relying on more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement the traditional bag-of-words models. However, indexing and searching a large document collection requires efficient similarity search on dense vectors, which is why ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Second, CCSA can be used as a binary quantization method, and we propose to combine it with recent graph-based ANN techniques. Our experiments on the MSMARCO dataset reveal that CCSA outperforms IVF with product quantization. Furthermore, CCSA binary quantization reduces the index size and memory usage of the graph-based HNSW method while maintaining a good level of recall and MRR. Third, we compare with recent supervised quantization methods for image retrieval and find that CCSA is able to outperform them.
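As a rough illustration of the parallel inverted index idea, the sketch below assumes each document has already been assigned a composite code, that is, one entry index per codebook. The abstract does not describe the encoder, so the codes here are hand-written toy values rather than the output of a learned CCSA model, and the function names `build_inverted_indexes` and `retrieve` are hypothetical. Candidates are ranked by how many codebooks agree with the query code, which is one simple scoring choice and not necessarily the one used in the paper.

```python
from collections import defaultdict


def build_inverted_indexes(doc_codes):
    """Build one inverted index per codebook.

    doc_codes maps doc_id -> tuple of C integers, one entry index per codebook.
    (Toy input: in CCSA these codes would come from a learned encoder.)
    """
    num_codebooks = len(next(iter(doc_codes.values())))
    indexes = [defaultdict(list) for _ in range(num_codebooks)]
    for doc_id, code in doc_codes.items():
        for c, k in enumerate(code):
            # Posting list of codebook c, entry k
            indexes[c][k].append(doc_id)
    return indexes


def retrieve(indexes, query_code, top_n):
    """Rank documents by the number of codebooks whose entry matches the query."""
    votes = defaultdict(int)
    for c, k in enumerate(query_code):
        for doc_id in indexes[c].get(k, []):
            votes[doc_id] += 1
    # Sort by descending vote count, then by doc_id for a stable order
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hand-written composite codes with C = 3 codebooks (assumed values)
doc_codes = {
    "d1": (0, 2, 1),
    "d2": (0, 2, 3),
    "d3": (1, 0, 3),
    "d4": (2, 1, 0),
}
indexes = build_inverted_indexes(doc_codes)
results = retrieve(indexes, (0, 2, 3), top_n=3)
print(results)
```

Only the posting lists selected by the query code are visited, so lookup cost depends on posting-list length. This is presumably why a uniformity regularizer helps: if documents spread evenly over the entries of each codebook, no single posting list becomes disproportionately long.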


