Modern approaches for fast retrieval of similar vectors on billion-scale datasets rely on compressed-domain methods such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or another objective function tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions on the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates into better performance in nearest neighbor search. Our method significantly improves over binary hashing methods and product quantization on popular benchmarks.
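The auto-encoder view mentioned above can be illustrated with a toy product quantizer: the encoder maps each sub-vector to the index of its nearest centroid, and the conventional "decoder" is simply a centroid lookup. This is a minimal sketch, not the paper's method; the codebooks here are random stand-ins for k-means codebooks, and all names (`pq_encode`, `pq_decode`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy product quantizer: split d-dim vectors into M sub-vectors and
# quantize each with its own codebook of K centroids.
d, M, K = 8, 2, 4
dsub = d // M
codebooks = rng.standard_normal((M, K, dsub))  # stand-in for trained codebooks

def pq_encode(x):
    """Encoder: map each sub-vector to the index of its nearest centroid."""
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        sub = x[m * dsub:(m + 1) * dsub]
        dists = np.sum((codebooks[m] - sub) ** 2, axis=1)
        codes[m] = np.argmin(dists)
    return codes

def pq_decode(codes):
    """Implicit (fixed) decoder: concatenate the selected centroids."""
    return np.concatenate([codebooks[m][codes[m]] for m in range(M)])

x = rng.standard_normal(d)
codes = pq_encode(x)
x_hat = pq_decode(codes)
# The mean squared error of this reconstruction is the loss that
# product quantization implicitly minimizes at training time.
mse = float(np.mean((x - x_hat) ** 2))
```

Because the decoder is a fixed table lookup, the codes alone determine the reconstruction; a backward-compatible improvement, as described in the abstract, would replace only `pq_decode` with a better estimator while keeping `pq_encode` and the stored codes unchanged.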